Filesystem and path manipulation
Navigate, read, write, and discover files with pathlib.Path — the modern alternative to os.path.
- Construct and join paths with the / operator
- Inspect path components — stem, suffix, parent, name
- Discover files with glob() and rglob()
- Read and write text files through Path methods
- Create and remove directories; copy and move files with shutil
Python's old os.path module handles paths as plain strings — concatenating,
splitting, and testing them via a collection of loosely related functions. The
pathlib module, added in Python 3.4 and idiomatic since 3.6 (when open() and the
os functions began accepting path objects), wraps paths in an object that carries
its own operations. The result is code that reads more naturally and handles
platform-specific separators automatically.
Creating and joining paths
Path accepts a string (or path segments) and returns a path object. The /
operator joins segments — no manual os.sep required:
from pathlib import Path
# absolute path
root = Path("/home/user/projects")
# relative path
config = Path("config") / "settings.toml"
# join segments
data_file = root / "data" / "records.csv"
print(data_file) # /home/user/projects/data/records.csv
# current directory
here = Path(".")
print(here.resolve()) # absolute, resolves symlinks
On Windows, Path automatically uses backslashes; on Unix, forward slashes.
Your code stays the same.
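To see the separator behaviour without switching machines, here is a minimal sketch using pathlib's pure path classes, PurePosixPath and PureWindowsPath, which manipulate paths without touching the filesystem. The directory and file names are illustrative only.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate path strings, so both flavours work on any OS.
segments = ("projects", "data", "records.csv")  # illustrative names

posix_path = PurePosixPath("/home/user").joinpath(*segments)
windows_path = PureWindowsPath("C:/Users/user").joinpath(*segments)

print(posix_path)    # /home/user/projects/data/records.csv
print(windows_path)  # C:\Users\user\projects\data\records.csv
```

Plain Path picks the flavour that matches the running platform, which is why the same joining code works on both.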
Inspecting path components
Every Path object exposes its anatomy as attributes — no os.path.splitext
gymnastics:
p = Path("/home/user/report.csv")
print(p.name) # 'report.csv' — filename with extension
print(p.stem) # 'report' — filename without extension
print(p.suffix) # '.csv' — extension including the dot
print(p.parent) # PosixPath('/home/user')
print(p.parts) # ('/', 'home', 'user', 'report.csv')
# testing existence and type
print(p.exists()) # True / False
print(p.is_file()) # True if it's a regular file
print(p.is_dir()) # True if it's a directory
Write get_parts(filepath) that returns a tuple of (stem, suffix, parent_name) for a given path string. parent_name is just the final directory component (the name of the parent directory, not the full path).
get_parts("/home/user/report.csv") → ("report", ".csv", "user")
Discovering files with glob and rglob
glob() finds files matching a shell-style pattern in a directory; rglob()
recurses into subdirectories:
project = Path("src")
# all Python files in src/ (one level)
for f in project.glob("*.py"):
    print(f.name)
# all Python files anywhere under src/ (recursive)
for f in project.rglob("*.py"):
    print(f.relative_to(project))
# collect into a sorted list
py_files = sorted(project.rglob("*.py"))
Common patterns: "*.txt", "test_*.py", and "**/*.json" (a recursive match written for glob()).
rglob("*.py") is equivalent to glob("**/*.py").
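The difference between the two methods is easiest to see on a small tree. The sketch below builds a throwaway directory with tempfile and compares a one-level match against a recursive one; find_python_files and the file names are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path

def find_python_files(root):
    """Return (top_level, recursive) lists of .py paths relative to root."""
    root = Path(root)
    top_level = sorted(p.relative_to(root).as_posix() for p in root.glob("*.py"))
    recursive = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))
    return top_level, recursive

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "pkg").mkdir(parents=True)
    # Two Python files at different depths, plus one file that should not match.
    for rel in ("main.py", "pkg/util.py", "pkg/notes.txt"):
        (src / rel).write_text("", encoding="utf-8")
    top_level, recursive = find_python_files(src)

print(top_level)   # ['main.py']
print(recursive)   # ['main.py', 'pkg/util.py']
```

Both methods return generators, so wrapping them in sorted() gives a stable order as well as a concrete list.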
Reading and writing with Path methods
For text files, read_text and write_text are a one-liner alternative to
open/read/close:
p = Path("notes.txt")
p.write_text("first line\nsecond line\n", encoding="utf-8")
content = p.read_text(encoding="utf-8")
print(content)
# binary files
img = Path("photo.jpg")
raw = img.read_bytes() # returns bytes
img.write_bytes(raw) # overwrites with bytes
For large files or streaming reads, the regular open() context manager is still
the right tool — read_text loads the whole file into memory.
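As a rough sketch of the streaming approach, Path.open() returns the same file object as the built-in open(), so a file can be processed one line at a time without holding it all in memory. The helper count_nonblank_lines and the sample file are hypothetical.

```python
import tempfile
from pathlib import Path

def count_nonblank_lines(path):
    """Count lines with visible content, reading one line at a time."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:  # the file object yields lines lazily
            if line.strip():
                count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("first line\n\nsecond line\n", encoding="utf-8")
    total = count_nonblank_lines(sample)

print(total)  # 2
```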
Write list_files(directory, ext) that returns a sorted list of filenames (not full paths, just the name) matching the given extension (e.g. ".py") in the given directory. Do not recurse into subdirectories.
list_files("/src", ".py") → ["bar.py", "foo.py"]
Creating and removing directories; shutil for copying
Path handles directory operations directly:
from pathlib import Path
import shutil
build = Path("build/output")
build.mkdir(parents=True, exist_ok=True) # creates intermediate dirs; no error if exists
build.rmdir() # removes (must be empty)
# shutil for anything that moves data between locations
shutil.copy("src/data.csv", "build/data.csv") # copy a file
shutil.copytree("src/assets", "build/assets") # copy a whole directory tree
shutil.move("build/assets", "dist/assets") # rename/move (works across filesystems)
shutil.rmtree("build") # remove a directory tree
shutil.rmtree is permanent and has no recycle bin. Guard destructive calls
with an existence check and, in scripts, a --dry-run flag so you can see
what would happen before committing.
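One way such a guard might look is sketched below. The function remove_tree and its dry_run parameter are hypothetical names, not part of shutil; the demo runs inside a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path

def remove_tree(target, dry_run=True):
    """Delete a directory tree, or only report it. Returns True if deleted."""
    target = Path(target)
    if not target.is_dir():
        print(f"skip: {target} is not a directory")
        return False
    if dry_run:
        print(f"dry run: would remove {target}")
        return False
    shutil.rmtree(target)
    return True

with tempfile.TemporaryDirectory() as tmp:
    build = Path(tmp) / "build"
    (build / "output").mkdir(parents=True)
    first = remove_tree(build)                  # dry run, nothing is deleted
    still_there = build.exists()
    second = remove_tree(build, dry_run=False)  # actually deletes the tree
    gone = not build.exists()
    third = remove_tree(build, dry_run=False)   # already gone, so skipped
```

Defaulting dry_run to True means the destructive path has to be requested explicitly; in a script, the argument could be wired to an argparse flag such as --dry-run.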
Where to go next
That completes the theory of the Data Engineering module. The lab pulls everything together: pathlib for file discovery, boolean validation, and sqlite3 storage — end to end. Next: Lab: database-backed data pipeline.