Filesystem and path manipulation

Navigate, read, write, and discover files with pathlib.Path — the modern alternative to os.path.

Python's older os.path module handles paths as plain strings, concatenating, splitting, and testing them through a collection of loosely related functions. The pathlib module, added in Python 3.4 and idiomatic since 3.6 (when open() and much of the standard library began accepting path objects directly), wraps paths in an object that carries its own operations. The result is code that reads naturally and applies the correct separator and path rules for the platform it runs on.
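
To make the contrast concrete, here is a small side-by-side sketch. The sample path is illustrative, and PurePosixPath is used (rather than Path) only so the output is the same on every operating system:

```python
import os.path
from pathlib import PurePosixPath

raw = "/home/user/projects/data/records.csv"  # illustrative path, never opened

# os.path: separate functions that each take and return plain strings
directory = os.path.dirname(raw)
stem, ext = os.path.splitext(os.path.basename(raw))

# pathlib: the same information as attributes of a single object
p = PurePosixPath(raw)

print(directory, stem, ext)        # /home/user/projects/data records .csv
print(p.parent, p.stem, p.suffix)  # /home/user/projects/data records .csv
```

Both approaches yield the same pieces; the pathlib version simply keeps them attached to the path they came from.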

Creating and joining paths

Path accepts a string (or path segments) and returns a path object. The / operator joins segments — no manual os.sep required:

from pathlib import Path

# absolute path
root = Path("/home/user/projects")

# relative path
config = Path("config") / "settings.toml"

# join segments
data_file = root / "data" / "records.csv"
print(data_file)   # /home/user/projects/data/records.csv

# current directory
here = Path(".")
print(here.resolve())  # absolute, resolves symlinks

On Windows, Path automatically uses backslashes; on Unix, forward slashes. Your code stays the same.
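
One way to see the separator difference without switching machines is to use the pure path classes, which pathlib provides for exactly this kind of platform-independent manipulation. This is a minimal sketch; the segment names are reused from the example above:

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same join expression, evaluated under each platform's rules.
# Pure paths never touch the filesystem, so this runs on any OS.
posix_config = PurePosixPath("config") / "settings.toml"
windows_config = PureWindowsPath("config") / "settings.toml"

print(posix_config)    # config/settings.toml
print(windows_config)  # config\settings.toml
```

Path itself picks the matching flavor for the running system, which is why ordinary code rarely needs to name these classes.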

Inspecting path components

Every Path object exposes its anatomy as attributes — no os.path.splitext gymnastics:

p = Path("/home/user/report.csv")

print(p.name)      # 'report.csv'    — filename with extension
print(p.stem)      # 'report'        — filename without extension
print(p.suffix)    # '.csv'          — extension including the dot
print(p.parent)    # /home/user      (a Path object; its repr on Unix is PosixPath('/home/user'))
print(p.parts)     # ('/', 'home', 'user', 'report.csv')

# testing existence and type
print(p.exists())   # True / False
print(p.is_file())  # True if it's a regular file
print(p.is_dir())   # True if it's a directory

Exercise: Decompose a file path

Write get_parts(filepath) that returns a tuple of (stem, suffix, parent_name) for a given path string. parent_name is just the final directory component (the name of the parent directory, not the full path).

get_parts("/home/user/report.csv") → ("report", ".csv", "user")
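
If you want to check your work, one possible solution might look like the sketch below. It relies only on the attributes introduced above and assumes the input always contains a parent directory; for a bare filename such as "report.csv", parent_name would come back as an empty string.

```python
from pathlib import Path

def get_parts(filepath):
    """Return (stem, suffix, parent_name) for a path string."""
    p = Path(filepath)
    # p.parent is the containing directory; .name keeps only its last component
    return (p.stem, p.suffix, p.parent.name)

parts = get_parts("/home/user/report.csv")
print(parts)  # ('report', '.csv', 'user')
```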

Discovering files with glob and rglob

glob() finds files matching a shell-style pattern in a directory; rglob() recurses into subdirectories:

project = Path("src")

# all Python files in src/ (one level)
for f in project.glob("*.py"):
    print(f.name)

# all Python files anywhere under src/ (recursive)
for f in project.rglob("*.py"):
    print(f.relative_to(project))

# collect into a sorted list
py_files = sorted(project.rglob("*.py"))

Common patterns: "*.txt", "test_*.py", and "**/*.json", where the leading "**/" makes glob() descend into subdirectories. rglob("*.py") is equivalent to glob("**/*.py").
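
The following sketch exercises these patterns against a throwaway directory tree so the results are reproducible. The file and directory names are invented for the demonstration:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "pkg").mkdir()
    # Hypothetical project layout, created only for this demo
    (project / "main.py").touch()
    (project / "test_main.py").touch()
    (project / "pkg" / "util.py").touch()
    (project / "pkg" / "data.json").touch()

    def names(paths):
        # Paths relative to the project root, with "/" on every platform
        return sorted(f.relative_to(project).as_posix() for f in paths)

    top_level = names(project.glob("*.py"))      # one level only
    recursive = names(project.rglob("*.py"))     # whole tree
    via_glob = names(project.glob("**/*.py"))    # same as rglob("*.py")
    tests = names(project.glob("test_*.py"))
    json_files = names(project.glob("**/*.json"))

print(top_level)   # ['main.py', 'test_main.py']
print(recursive)   # ['main.py', 'pkg/util.py', 'test_main.py']
print(tests)       # ['test_main.py']
print(json_files)  # ['pkg/data.json']
```

Both glob() and rglob() return lazy iterators, which is why the results are collected (here with sorted) before the temporary directory is removed.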

Reading and writing with Path methods

For text files, read_text and write_text are a one-liner alternative to open/read/close:

p = Path("notes.txt")

p.write_text("first line\nsecond line\n", encoding="utf-8")
content = p.read_text(encoding="utf-8")
print(content)

# binary files
img = Path("photo.jpg")
raw = img.read_bytes()       # returns bytes
img.write_bytes(raw)         # overwrites with bytes

For large files or streaming reads, the regular open() context manager is still the right tool — read_text loads the whole file into memory.
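
As a rough illustration of the streaming approach, the sketch below iterates over a file line by line through Path.open(), so only one line is held in memory at a time. The helper name count_nonblank_lines is made up for this example:

```python
import tempfile
from pathlib import Path

def count_nonblank_lines(path):
    """Count non-blank lines without loading the whole file into memory."""
    count = 0
    # Path.open() mirrors the built-in open() and works as a context manager
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:          # the file object yields one line at a time
            if line.strip():
                count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"
    notes.write_text("first line\n\nsecond line\n", encoding="utf-8")
    total = count_nonblank_lines(notes)

print(total)  # 2
```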

Exercise: List files by extension

Write list_files(directory, ext) that returns a sorted list of filenames (not full paths, just the name) matching the given extension (e.g. ".py") in the given directory. Do not recurse into subdirectories.

list_files("/src", ".py") → ["bar.py", "foo.py"]
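
One possible solution, offered as a sketch rather than the only answer, uses iterdir() to stay in a single directory and filters on suffix. The demo builds a temporary directory, since the /src path in the example is only illustrative:

```python
import tempfile
from pathlib import Path

def list_files(directory, ext):
    """Return sorted filenames in `directory` whose suffix equals `ext`."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()   # direct children only
        if entry.is_file() and entry.suffix == ext
    )

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    for name in ("foo.py", "bar.py", "notes.txt"):
        (src / name).touch()
    (src / "sub").mkdir()
    (src / "sub" / "deep.py").touch()   # in a subdirectory, so not listed
    result = list_files(src, ".py")

print(result)  # ['bar.py', 'foo.py']
```

Using glob("*" + ext) would also work; the is_file() check matters in either version because a directory can have a name that ends in ".py".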

Creating and removing directories; shutil for copying

Path handles directory operations directly:

from pathlib import Path
import shutil

build = Path("build/output")
build.mkdir(parents=True, exist_ok=True)   # creates intermediate dirs; no error if exists
build.rmdir()                               # removes build/output only (must be empty); build/ remains

# shutil for anything that moves data between locations
shutil.copy("src/data.csv", "build/data.csv")        # copy a file
shutil.copytree("src/assets", "build/assets")        # copy a whole directory tree
shutil.move("build/assets", "dist/assets")           # rename/move (works across filesystems)
shutil.rmtree("build")                               # remove a directory tree

shutil.rmtree is permanent and has no recycle bin. Guard destructive calls with an existence check and, in scripts, a --dry-run flag so you can see what would happen before committing.
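
A guarded removal along those lines could be sketched as follows. The helper remove_tree and its status strings are hypothetical; in a real script the dry_run argument would typically be wired to a --dry-run command-line flag, for example via argparse:

```python
import shutil
import tempfile
from pathlib import Path

def remove_tree(target, dry_run=True):
    """Remove a directory tree, defaulting to a harmless dry run."""
    target = Path(target)
    if not target.is_dir():          # existence check before anything destructive
        return "skipped"
    if dry_run:
        print(f"dry run: would remove {target}")
        return "would-remove"
    shutil.rmtree(target)            # permanent: nothing goes to a recycle bin
    return "removed"

with tempfile.TemporaryDirectory() as tmp:
    build = Path(tmp) / "build"
    (build / "output").mkdir(parents=True)

    first = remove_tree(build)                  # dry run by default
    still_there = build.exists()
    second = remove_tree(build, dry_run=False)  # actually deletes
    gone = not build.exists()
    third = remove_tree(build, dry_run=False)   # nothing left to delete

print(first, second, third)  # would-remove removed skipped
```

Making the dry run the default means the destructive path has to be requested explicitly, which is a common safety choice for cleanup scripts.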

Where to go next

That completes the theory of the Data Engineering module. The lab pulls everything together: pathlib for file discovery, boolean validation, and sqlite3 storage — end to end. Next: Lab: database-backed data pipeline.
