String manipulation and pattern matching
Extract, validate, and transform text using the re module and regular expressions.
- Search and match text with re.search(), re.match(), and re.findall()
- Replace text with re.sub()
- Write simple patterns using character classes, quantifiers, and groups
- Compile patterns with re.compile() for reuse
- Use raw strings to write patterns without escape noise
- Apply regex to practical tasks like extracting numbers or validating email-like strings
Most real data hides structure inside strings — prices embedded in sentences,
dates mixed into log lines, identifiers buried in text. Python's re module
gives you regular expressions: a small language for describing text patterns,
powerful enough to handle these cases without writing fragile string-slicing code.
Raw strings and why they matter
A regex pattern like \d+ contains a backslash. In a normal string, Python
processes backslash escapes before the regex engine ever sees the pattern:
"\b" becomes a backspace character instead of the regex word boundary, and an
unrecognized escape such as "\d" triggers a warning in recent Python versions.
A raw string (r"...") turns off Python's escaping so the backslash passes
through unchanged:

```python
import re

# In a normal string, Python handles backslash escapes first ("\b" becomes a backspace).
# In a raw string, the pattern reaches the regex engine intact.
pattern = r"\d+"  # one or more digits
```

Make it a habit: regex patterns are almost always raw strings.
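To see the difference concretely, here is a small sketch comparing a normal string and a raw string that both contain \b. The word-boundary escape \b is not covered elsewhere in this section; it is used here only because it is an escape that Python itself interprets.

```python
import re

normal = "\bcat\b"   # Python converts each \b into a single backspace character
raw = r"\bcat\b"     # backslashes are kept: the regex engine sees \b, a word boundary

print(len(normal), len(raw))                 # 5 7
print(re.search(normal, "a cat sat"))        # None (the text contains no backspaces)
print(re.search(raw, "a cat sat").group())   # cat
```

The two literals look almost identical in source code, but they are different strings and describe different patterns.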
Searching for a match
re.search(pattern, text) scans the whole string and returns a match object
if the pattern appears anywhere, or None if it doesn't. Always check before
accessing the result:
```python
m = re.search(r"\d+", "Order #4291 placed")
if m:
    print(m.group())  # prints 4291
```

re.match(pattern, text) only matches at the start of the string, which is useful
for validation but less useful for general search. To require that the entire
string match, use re.fullmatch().
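The following sketch puts the three functions side by side on the same kind of input, so the difference in where each one looks is visible. The variable names are illustrative only.

```python
import re

text = "Order #4291 placed"

found = re.search(r"\d+", text)               # scans the whole string
at_start = re.match(r"\d+", text)             # None: the string starts with "O"
prefix = re.match(r"\d+", "4291 placed")      # matches: the digits are at the start
whole = re.fullmatch(r"\d+", "4291 placed")   # None: text remains after the digits

print(found.group(), found.span())  # 4291 (7, 11)
print(at_start, whole)              # None None
print(prefix.group())               # 4291
```

The match object also carries position information: span() returns the start and end indexes of the match within the text.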
Finding all occurrences
re.findall(pattern, text) returns a list of every non-overlapping match:
```python
prices = "Items: $4.99, $12.00, $0.50"
nums = re.findall(r"\d+\.\d+", prices)
# ["4.99", "12.00", "0.50"]
```

When the pattern has a capture group ((...)), findall returns the
contents of the group rather than the full match, which is useful for extracting
just the part you care about. With several groups, it returns a list of tuples,
one tuple per match.
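As a quick illustration of how groups change the result, the sketch below runs findall on the same price string with one group and then with two. The variable names dollars and parts are made up for this example.

```python
import re

prices = "Items: $4.99, $12.00, $0.50"

# One group: findall returns only what the group captured (the dollar part).
dollars = re.findall(r"\$(\d+)\.\d+", prices)
print(dollars)  # ['4', '12', '0']

# Two groups: findall returns one tuple per match.
parts = re.findall(r"\$(\d+)\.(\d+)", prices)
print(parts)    # [('4', '99'), ('12', '00'), ('0', '50')]
```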
Replacing text with re.sub
re.sub(pattern, replacement, text) replaces every match:
```python
clean = re.sub(r"\s+", " ", "too   many    spaces")
# "too many spaces"
```

The replacement can be a string (with \1, \2 back-references to groups) or a
function that receives each match object and returns the replacement string.
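Both replacement styles can be sketched in a few lines. The sample strings and the helper name double are invented for illustration.

```python
import re

# String replacement with back-references: swap "last, first" into "first last".
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")
print(swapped)  # Ada Lovelace

# Function replacement: called once per match, returns the replacement text.
def double(m: re.Match) -> str:
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")
print(doubled)  # 6 apples and 20 pears
```

Note that the replacement string is written as a raw string too, so that \1 and \2 reach re.sub unchanged.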
Writing patterns: quick reference
| Syntax | Meaning |
|---|---|
| `.` | Any character except newline |
| `\d` | Digit (0–9; in str patterns, also other Unicode decimal digits) |
| `\w` | Word character (letter, digit, _) |
| `\s` | Whitespace |
| `[abc]` | Any of a, b, c |
| `[^abc]` | Any character except a, b, c |
| `+` | One or more |
| `*` | Zero or more |
| `?` | Zero or one; placed after another quantifier, makes it non-greedy |
| `{n,m}` | Between n and m repetitions |
| `(...)` | Capture group |
| `^` / `$` | Start / end of string |
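A few rows of the table in action, as a rough sketch. The sample strings are invented; the first pair shows the difference between a greedy quantifier and its non-greedy form.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)    # + takes as much text as it can
lazy = re.findall(r"<.+?>", html)     # +? stops at the first closing >
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

vowels = re.findall(r"[aeiou]", "regex")            # character class
years = re.findall(r"\d{2,4}", "7 42 2024")         # 2 to 4 digits
is_all_digits = bool(re.search(r"^\d+$", "4291"))   # anchored at both ends
print(vowels, years, is_all_digits)  # ['e', 'e'] ['42', '2024'] True
```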
Compiling patterns for reuse
If you use the same pattern many times, re.compile() pre-compiles it once and
returns a reusable object:
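```python
import re

email_pat = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Sample input for illustration; in practice this could be the lines of a file.
lines = ["contact: user@example.com", "no address here"]
for line in lines:
    if email_pat.search(line):
        print("found email in:", line)
```

Compiled patterns have the same search, match, findall, and sub methods.
The only difference is that you call them on the compiled object instead of
passing the pattern each time.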
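For instance, one compiled pattern can serve several operations. This is a small sketch with an invented pattern name and sample text:

```python
import re

num_pat = re.compile(r"\d+")
text = "3 apples and 10 pears"

first = num_pat.search(text).group()   # first match anywhere in the text
every = num_pat.findall(text)          # all matches
masked = num_pat.sub("N", text)        # replace each match

print(first)   # 3
print(every)   # ['3', '10']
print(masked)  # N apples and N pears
```

Giving the pattern a name also documents what it is for, which helps when the same expression appears in several places.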
Practical example: validating an email-like string
A full RFC-compliant email regex is famously complex. For most practical purposes a simple pattern is enough — and being explicit about what you're actually checking is more honest than pretending to validate perfectly:
```python
import re

def looks_like_email(s: str) -> bool:
    # fullmatch requires the whole string to match, so no ^ or $ anchors are
    # needed, and a trailing newline is not silently accepted.
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return bool(re.fullmatch(pattern, s))

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("not-an-email"))      # False
```

Regular expressions reward restraint. A pattern that's hard to read is hard to maintain and hard to get right. For very complex text parsing, such as multi-line formats or nested structures, consider a dedicated parser or break the problem into simpler regex steps. The right tool for structured formats like JSON or CSV is a proper parser, not a regex.
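The other practical task named at the top of this section, extracting numbers, might be sketched as follows. The function name extract_numbers and the sample sentence are hypothetical, and the pattern uses a non-capturing group, (?:...), which is not in the quick reference above: it groups without changing what findall returns. The pattern is deliberately simple and treats any hyphen directly before a digit as a minus sign, so it would misread a date like 2020-05.

```python
import re

# Optional minus sign, digits, then an optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_numbers(text: str) -> list[float]:
    """Return every number found in text, converted to float."""
    return [float(n) for n in NUMBER.findall(text)]

values = extract_numbers("Total 4.99, change -1.5, qty 3")
print(values)  # [4.99, -1.5, 3.0]
```

The regex only finds the text; the conversion to float happens in a separate, plain Python step, which keeps each part easy to read and test.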
Where to go next
With reading, writing, transforming, and pattern-matching covered, you're ready for the Lab: data pipeline — where you'll connect all four skills into a small ETL workflow.