String manipulation and pattern matching

Extract, validate, and transform text using the re module and regular expressions.

Most real data hides structure inside strings — prices embedded in sentences, dates mixed into log lines, identifiers buried in text. Python's re module gives you regular expressions: a small language for describing text patterns, powerful enough to handle these cases without writing fragile string-slicing code.

Raw strings and why they matter

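A regex pattern like \d+ contains a backslash. In a normal string, Python processes backslash escapes before the regex engine ever sees the pattern: "\b" becomes a backspace character and "\n" a newline, while unrecognized sequences such as "\d" pass through but trigger a warning in recent Python versions (a SyntaxWarning from 3.12). A raw string (r"...") turns off Python's escape processing so every backslash reaches the regex engine unchanged: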

import re

# Without a raw string, Python handles backslash escapes first: "\b" would become a backspace.
# With raw string, the pattern reaches the regex engine intact.
pattern = r"\d+"   # one or more digits

Make it a habit: regex patterns are almost always raw strings.
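
To see the difference concretely, here is a minimal sketch that writes the same word-boundary pattern both ways. The sample text is made up for illustration.

```python
import re

text = "cat concatenate cat"

# In a normal string, "\b" is the backspace character (ASCII 8),
# so the regex engine searches for literal backspaces around "cat".
plain_hits = re.findall("\bcat\b", text)

# In a raw string the backslash survives, and the regex engine
# sees \b, its word-boundary assertion.
raw_hits = re.findall(r"\bcat\b", text)

print(plain_hits)             # []
print(raw_hits)               # ['cat', 'cat']
print(len("\b"), len(r"\b"))  # 1 2
```

The non-raw version fails silently: no error, just no matches, which is why the raw-string habit is worth building early.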

Searching for a match

re.search(pattern, text) scans the whole string and returns a match object for the first place the pattern matches, or None if it matches nowhere. Always check before accessing the result:

m = re.search(r"\d+", "Order #4291 placed")
if m:
    print(m.group())   # "4291"

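re.match(pattern, text) only matches at the start of the string, which makes it less useful for general search. It does not require the match to reach the end of the string, so for validating a whole string re.fullmatch(pattern, text) is usually the better fit.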
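
The sketch below runs the same digit pattern through re.search and re.match side by side, with re.fullmatch included for comparison. The variable names are only for illustration.

```python
import re

text = "Order #4291 placed"

# search scans the whole string and stops at the first match.
found = re.search(r"\d+", text)
print(found.group())        # 4291

# match only tries position 0, where "O" is not a digit.
at_start = re.match(r"\d+", text)
print(at_start)             # None

# match anchors the start of the string but not the end.
prefix_ok = re.match(r"\d+", "4291 placed")
print(prefix_ok.group())    # 4291

# fullmatch succeeds only if the entire string fits the pattern.
whole = re.fullmatch(r"\d+", "4291 placed")
print(whole)                # None
```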

Finding all occurrences

re.findall(pattern, text) returns a list of every non-overlapping match:

prices = "Items: $4.99, $12.00, $0.50"
nums = re.findall(r"\d+\.\d+", prices)
# ["4.99", "12.00", "0.50"]

When the pattern has one capture group ((...)), findall returns the contents of that group rather than the full match, which is useful for extracting just the part you care about. With two or more groups, it returns a list of tuples, one tuple per match.
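
As a small illustration of how capture groups change what findall returns, the sketch below reuses the prices string with zero, one, and two groups.

```python
import re

prices = "Items: $4.99, $12.00, $0.50"

# No group: each item is the full match, dollar sign included.
full = re.findall(r"\$\d+\.\d+", prices)
print(full)      # ['$4.99', '$12.00', '$0.50']

# One group: each item is just the group's contents.
amounts = re.findall(r"\$(\d+\.\d+)", prices)
print(amounts)   # ['4.99', '12.00', '0.50']

# Two groups: each item is a tuple with one entry per group.
parts = re.findall(r"\$(\d+)\.(\d+)", prices)
print(parts)     # [('4', '99'), ('12', '00'), ('0', '50')]
```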

Replacing text with re.sub

re.sub(pattern, replacement, text) replaces every match:

clean = re.sub(r"\s+", " ", "too   many    spaces")
# "too many spaces"

The replacement can be a string (with \1, \2 back-references to groups, best written as a raw string too) or a function that receives each match object and returns the replacement string.
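
Both replacement styles can be sketched briefly. The date format, the sample strings, and the double_price helper are assumptions made up for this example.

```python
import re

# String replacement with back-references: reorder YYYY-MM-DD as DD/MM/YYYY.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "Shipped 2024-03-15")
print(reordered)   # Shipped 15/03/2024

# Function replacement: called once per match, returns the new text.
def double_price(m):
    amount = float(m.group(1))
    return f"${amount * 2:.2f}"

doubled = re.sub(r"\$(\d+\.\d+)", double_price, "Items: $4.99, $12.00")
print(doubled)     # Items: $9.98, $24.00
```

A function is the better choice whenever the replacement depends on a computation, since a replacement string can only rearrange text that was already captured.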

Writing patterns: quick reference

Syntax	Meaning
`.`	Any character except newline
`\d`	Digit (0–9; in `str` patterns, other Unicode digits too unless `re.ASCII` is set)
`\w`	Word character (letter, digit, `_`)
`\s`	Whitespace
`[abc]`	Any of a, b, c
`[^abc]`	Any character except a, b, c
`+`	One or more
`*`	Zero or more
`?`	Zero or one (also makes quantifiers non-greedy)
`{n,m}`	Between n and m repetitions
`(...)`	Capture group
`^` / `$`	Start / end of string (`$` also matches just before a trailing newline; with `re.MULTILINE`, both apply at each line)
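
A few of these pieces combined, as a rough sketch: the first pair contrasts a greedy quantifier with its non-greedy form, and the last line mixes a character class with {n,m}. The sample strings and the ticket-code format are invented for illustration.

```python
import re

quote_text = 'say "hi" then "bye"'

# Greedy: .+ grabs as much as it can, running to the last quote.
greedy = re.findall(r'".+"', quote_text)
print(greedy)   # ['"hi" then "bye"']

# Non-greedy: .+? stops at the first closing quote it can.
lazy = re.findall(r'".+?"', quote_text)
print(lazy)     # ['"hi"', '"bye"']

# Character class plus {n,m}: 2-3 capital letters, a hyphen, 1-4 digits.
codes = re.findall(r"[A-Z]{2,3}-\d{1,4}", "Tickets AB-12, XYZ-1234, q-7")
print(codes)    # ['AB-12', 'XYZ-1234']
```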

Compiling patterns for reuse

If you use the same pattern many times, re.compile() pre-compiles it once and returns a reusable object:

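email_pat = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Sample input for illustration; in practice, lines might come from a file.
lines = ["contact: user@example.com", "no address here"]

for line in lines:
    if email_pat.search(line):
        print("found email in:", line)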

Compiled patterns have the same search, match, findall, and sub methods — the only difference is you call them on the compiled object instead of passing the pattern each time.
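
For instance, a single compiled pattern can serve all four calls. This is a minimal sketch with a made-up sample string.

```python
import re

digits = re.compile(r"\d+")
text = "Order #4291 shipped in 3 boxes"

first = digits.search(text)            # first run of digits anywhere
print(first.group())                   # 4291

every = digits.findall(text)           # every run of digits
print(every)                           # ['4291', '3']

masked = digits.sub("N", text)         # replace each run of digits
print(masked)                          # Order #N shipped in N boxes

at_start = digits.match(text)          # the string starts with "O"
print(at_start)                        # None
```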

Practical example: validating an email-like string

A full RFC-compliant email regex is famously complex. For most practical purposes a simple pattern is enough — and being explicit about what you're actually checking is more honest than pretending to validate perfectly:

import re

def looks_like_email(s: str) -> bool:
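    # fullmatch requires the whole string to fit, so no ^ or $ anchors are needed
    # (with re.match, "$" would still accept a trailing newline).
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return bool(re.fullmatch(pattern, s))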

print(looks_like_email("user@example.com"))   # True
print(looks_like_email("not-an-email"))        # False

Regular expressions reward restraint. A pattern that's hard to read is hard to maintain and hard to get right. For very complex text parsing — multi-line formats, nested structures — consider a dedicated parser or break the problem into simpler regex steps. The right tool for structured formats like JSON or CSV is a proper parser, not a regex.
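
To make the point about structured formats concrete, the sketch below splits one CSV line with a regex and then with the standard library's csv module. The sample line is invented for illustration.

```python
import csv
import io
import re

line = 'widget,"Smith, Jane",4.99'

# A regex split on commas knows nothing about quoting,
# so the quoted name is cut in half.
naive = re.split(r",", line)
print(naive)    # ['widget', '"Smith', ' Jane"', '4.99']

# The csv module understands quoted fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)   # ['widget', 'Smith, Jane', '4.99']
```

A regex could be extended to cope with quotes, but each extra rule (escaped quotes, embedded newlines) makes the pattern harder to read, which is the cost the paragraph above warns about.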

Where to go next

With reading, writing, transforming, and pattern-matching covered, you're ready for the Lab: data pipeline — where you'll connect all four skills into a small ETL workflow.
