Code of the Day

String manipulation and pattern matching

Extract, validate, and transform text using the re module and regular expressions.

Python | Intermediate | 9 min read
By the end of this lesson you will be able to:
  • Search and match text with re.search(), re.match(), and re.findall()
  • Replace text with re.sub()
  • Write simple patterns using character classes, quantifiers, and groups
  • Compile patterns with re.compile() for reuse
  • Use raw strings to write patterns without escape noise
  • Apply regex to practical tasks like extracting numbers or validating email-like strings

Most real data hides structure inside strings: prices embedded in sentences, dates mixed into log lines, identifiers buried in text. Python's re module gives you regular expressions, a small language for describing text patterns that is powerful enough to handle these cases without fragile string-slicing code.

Raw strings and why they matter

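A regex pattern like \d+ contains a backslash. In a normal string, Python processes backslash escapes before the regex engine ever sees the pattern: \b becomes a backspace character and \n becomes a newline. Unrecognized escapes such as \d pass through unchanged, but recent Python versions warn about them. A raw string (r"...") turns off Python's escape processing so every backslash reaches the regex engine as written:

import re

# In a normal string, Python handles backslash escapes first: "\b" turns into
# a backspace, and "\d" only survives because Python does not recognize it.
# With a raw string, the pattern reaches the regex engine intact.
pattern = r"\d+"   # one or more digits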

Make it a habit: regex patterns are almost always raw strings.
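To see the difference concretely, it can help to compare the two spellings of the same pattern. This is a minimal sketch; the variable names and the sample text are made up for illustration.

```python
import re

# "\b" in a normal string is the backspace character (one character).
# r"\b" is a backslash followed by "b" (two characters), which the
# regex engine reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5
print(len(raw))    # 7

text = "the cat sat"  # hypothetical sample text
print(re.search(plain, text))        # None
print(re.search(raw, text).group())  # cat
```

The normal-string version looks for literal backspace characters around "cat", which the text does not contain, so it silently finds nothing.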

Searching for a match

re.search(pattern, text) scans the whole string and returns a match object if the pattern appears anywhere, or None if it doesn't. Always check before accessing the result:

m = re.search(r"\d+", "Order #4291 placed")
if m:
    print(m.group())   # "4291"

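re.match(pattern, text) only matches at the start of the string, which makes it useful for validation and less useful for general search. It does not require the match to reach the end of the string, so validating a whole string also needs an end anchor or re.fullmatch().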
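The following sketch contrasts the two functions on the same kind of input. It also uses re.fullmatch(), which this lesson does not otherwise cover; it is included here only as a point of comparison.

```python
import re

text = "Order #4291 placed"

# search finds the digits anywhere in the string.
print(re.search(r"\d+", text).group())           # 4291

# match only tries at position 0, where "O" is not a digit.
print(re.match(r"\d+", text))                    # None

# match succeeds when the string starts with the pattern,
# even if other text follows.
print(re.match(r"\d+", "4291 placed").group())   # 4291

# fullmatch requires the whole string to fit the pattern.
print(re.fullmatch(r"\d+", "4291 placed"))       # None
print(re.fullmatch(r"\d+", "4291").group())      # 4291
```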

Finding all occurrences

re.findall(pattern, text) returns a list of every non-overlapping match:

prices = "Items: $4.99, $12.00, $0.50"
nums = re.findall(r"\d+\.\d+", prices)
# ["4.99", "12.00", "0.50"]

When the pattern has a capture group ((...)), findall returns the contents of the group rather than the full match, which is useful for extracting just the part you care about. With more than one group, each result is a tuple containing one entry per group.
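The effect of groups on findall is easiest to see side by side. The sketch below reuses the prices string from above; the patterns themselves are illustrative choices.

```python
import re

prices = "Items: $4.99, $12.00, $0.50"

# No group: each item is the full match.
print(re.findall(r"\$\d+\.\d+", prices))
# ['$4.99', '$12.00', '$0.50']

# One group: each item is just the group's contents.
print(re.findall(r"\$(\d+\.\d+)", prices))
# ['4.99', '12.00', '0.50']

# Two groups: each item is a tuple with one entry per group.
print(re.findall(r"\$(\d+)\.(\d+)", prices))
# [('4', '99'), ('12', '00'), ('0', '50')]
```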

Replacing text with re.sub

re.sub(pattern, replacement, text) replaces every match:

clean = re.sub(r"\s+", " ", "too   many    spaces")
# "too many spaces"

The replacement can be a string (with \1, \2 back-references to groups) or a function that receives each match object and returns the replacement string.
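Both kinds of replacement can be sketched briefly. The input strings and the helper name double are hypothetical, chosen only to illustrate the two forms.

```python
import re

# Back-references: swap "last, first" into "first last".
name = "Lovelace, Ada"  # hypothetical input
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", name)
print(swapped)  # Ada Lovelace

# Function replacement: double every integer in the text.
def double(m):
    # m is the match object for one occurrence of the pattern.
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```

A function is worth the extra lines when the replacement depends on the matched text in a way a fixed template cannot express, such as arithmetic or a dictionary lookup.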


Writing patterns: quick reference

Syntax    Meaning
.         Any character except newline
\d        Digit (0–9)
\w        Word character (letter, digit, _)
\s        Whitespace
[abc]     Any of a, b, c
[^abc]    Any character except a, b, c
+         One or more
*         Zero or more
?         Zero or one (also makes quantifiers non-greedy)
{n,m}     Between n and m repetitions
(...)     Capture group
^ / $     Start / end of string
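A few of these pieces combined, as a rough illustration. The sample strings are invented for this sketch.

```python
import re

text = "id=A17, id=B204, id=C3"  # hypothetical sample

# Character class plus a bounded quantifier: a capital letter
# followed by two or three digits.
print(re.findall(r"[A-Z]\d{2,3}", text))    # ['A17', 'B204']

# Greedy vs. non-greedy: .+ takes as much as it can, .+? as little.
quote = 'say "hi" and "bye"'
print(re.findall(r'"(.+)"', quote))         # ['hi" and "bye']
print(re.findall(r'"(.+?)"', quote))        # ['hi', 'bye']

# Negated class: runs of characters that are not commas or spaces.
print(re.findall(r"[^, ]+", "a, b,c"))      # ['a', 'b', 'c']
```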

Compiling patterns for reuse

If you use the same pattern many times, re.compile() pre-compiles it once and returns a reusable object:

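email_pat = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Sample input (hypothetical); in practice these might be lines read from a file.
lines = ["contact: user@example.com", "no address here"]

for line in lines:
    if email_pat.search(line):
        print("found email in:", line)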

Compiled patterns have the same search, match, findall, and sub methods — the only difference is you call them on the compiled object instead of passing the pattern each time.
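As a short sketch of that equivalence, the compiled object below is called with each of the four methods. The name num_pat and the sample text are hypothetical.

```python
import re

num_pat = re.compile(r"\d+")  # hypothetical pattern name
text = "3 apples, 10 pears"

print(num_pat.search(text).group())   # 3
print(num_pat.match(text).group())    # 3
print(num_pat.findall(text))          # ['3', '10']
print(num_pat.sub("N", text))         # N apples, N pears

# The module-level call gives the same result as the method.
print(re.findall(r"\d+", text) == num_pat.findall(text))  # True
```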

Practical example: validating an email-like string

A full RFC-compliant email regex is famously complex. For most practical purposes a simple pattern is enough — and being explicit about what you're actually checking is more honest than pretending to validate perfectly:

import re

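def looks_like_email(s: str) -> bool:
    # fullmatch requires the entire string to fit the pattern; with re.match
    # and a trailing $, a string ending in a newline would still pass.
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return bool(re.fullmatch(pattern, s))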

print(looks_like_email("user@example.com"))   # True
print(looks_like_email("not-an-email"))        # False

Regular expressions reward restraint. A pattern that's hard to read is hard to maintain and hard to get right. For very complex text parsing — multi-line formats, nested structures — consider a dedicated parser or break the problem into simpler regex steps. The right tool for structured formats like JSON or CSV is a proper parser, not a regex.
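One way to break a problem into simpler regex steps is sketched below. The log line format is invented for this example, and the code assumes the line is well formed; real input would need a check for a failed match.

```python
import re

line = "2024-05-01 ERROR disk=91% cpu=40%"  # hypothetical log line

# Step 1: split off the fixed-position fields at the start.
head = re.match(r"(\d{4}-\d{2}-\d{2}) (\w+) (.*)", line)
date, level, rest = head.groups()

# Step 2: pull key=value pairs out of the remainder.
metrics = dict(re.findall(r"(\w+)=(\d+)%", rest))

print(date, level)  # 2024-05-01 ERROR
print(metrics)      # {'disk': '91', 'cpu': '40'}
```

Each pattern stays short enough to read at a glance, which is usually easier to maintain than a single expression that tries to capture every field at once.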

Where to go next

With reading, writing, transforming, and pattern-matching covered, you're ready for the Lab: data pipeline — where you'll connect all four skills into a small ETL workflow.
