Cut, Sort, and Uniq

Extract columns with cut, sort data numerically and by key with sort, count duplicates with uniq -c, and compose multi-step pipelines.

Bash · Intermediate · 10 min read

Recommended first: Awk Basics

By the end of this lesson you will be able to:

Extract specific columns from delimited files with cut
Sort lines alphabetically, numerically, by column, and in reverse
Count and deduplicate lines with uniq and uniq -c
Compose cut, sort, and uniq into multi-step analysis pipelines

cut, sort, and uniq each do one small job — and in combination, they solve a large class of text analysis problems. They complement grep, sed, and awk by being simpler and faster for the cases they're designed for: extracting fixed columns, ordering lines, and counting duplicates. Mastering the combination is the difference between a script that builds a result line by line and one that expresses the entire computation as a single, readable pipeline.

cut: extract columns

cut extracts columns from structured text. The two most common uses are cutting by character position and cutting by delimiter:

# By delimiter: -d specifies the delimiter, -f the field number(s)
cut -d: -f1 /etc/passwd                 # first field (username)
cut -d: -f1,7 /etc/passwd               # first and seventh fields
cut -d, -f2-4 data.csv                  # fields 2 through 4 from a CSV

# By character position
cut -c1-10 file.txt                     # first 10 characters of each line
cut -c5- file.txt                       # from character 5 to end of line

For simple column extraction, cut is typically faster than awk, and its syntax makes the intent clear at a glance.

cut only supports single-character delimiters. If your data uses a multi-character delimiter (such as a comma followed by a space, or ::), use awk instead, for example awk -F'::'. Also, cut does not reorder fields: it always outputs them in their original left-to-right order, regardless of how you list them in -f. Use awk for reordering.
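
Both limitations are easy to see on a single inline record. This is a minimal sketch: the record and the variable names are made up for illustration and nothing is read from a real file.

```shell
# Hypothetical sample record in /etc/passwd style.
record='alice:x:1001:1001:Alice:/home/alice:/bin/bash'

# cut ignores the order given to -f: this still prints field 1, then field 7.
cut_out=$(printf '%s\n' "$record" | cut -d: -f7,1)
echo "$cut_out"        # alice:/bin/bash

# awk prints fields in whatever order you name them.
awk_out=$(printf '%s\n' "$record" | awk -F: '{print $7 ":" $1}')
echo "$awk_out"        # /bin/bash:alice

# A two-character delimiter such as "::" needs awk; cut -d takes one character.
multi_out=$(printf 'alpha::beta::gamma\n' | awk -F'::' '{print $2}')
echo "$multi_out"      # beta
```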

sort: order lines

By default, sort compares whole lines as text (lexicographically), using the collation order of the current locale:

sort names.txt                          # ascending alphabetical
sort -r names.txt                       # descending
sort -u names.txt                       # sort and remove duplicates (unique)
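
What counts as "alphabetical" depends on the locale in effect, so the same file can sort differently on two machines. The sketch below uses inline sample data and sets LC_ALL=C, which forces plain byte order and makes the result predictable; the variable names are illustrative only.

```shell
# LC_ALL=C compares bytes, so every uppercase ASCII letter
# sorts before every lowercase one.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
echo "$c_sorted"
# A
# B
# a
# b

# sort -u keeps one copy of each distinct line.
c_unique=$(printf 'b\na\nb\na\n' | LC_ALL=C sort -u)
echo "$c_unique"
# a
# b
```

Without LC_ALL=C, many locales interleave upper and lower case instead, so pin the locale when a script depends on a specific order.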

For numeric and structured data, the flags matter:

sort -n numbers.txt                     # numeric sort (9 before 10; a text sort puts "10" before "9")
sort -rn numbers.txt                    # numeric, descending

# Sort by a specific column: -k START[,END], where each position is field[.char]
sort -k2 data.txt                       # key runs from the second whitespace-delimited field to end of line
sort -k2,2n data.txt                    # sort by second field numerically
sort -t: -k3,3n /etc/passwd             # colon-delimited, sort by uid (field 3)

The -k flag takes a start and end position: -k2,2n means "field 2, start of field to end of field, numeric". Without the ,2 end specifier, sort treats everything from field 2 to end of line as the sort key.
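
The effect of the end specifier shows up as soon as two lines share the same second field. The following is a small sketch on inline sample data (the data and variable names are invented for the example), with LC_ALL=C set so the text comparison is plain byte order.

```shell
# Inline sample data: both lines have the same second field.
data='a 1 z
b 1 y'

# -k2: the key runs from field 2 to the end of the line, so the third
# field ("y" vs "z") decides the order.
open_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2)
echo "$open_key"
# b 1 y
# a 1 z

# -k2,2: the key is field 2 only. The keys tie, and sort falls back to
# comparing the whole line, so the line starting with "a" comes first.
closed_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2)
echo "$closed_key"
# a 1 z
# b 1 y

# -n compares by numeric value rather than character by character.
nums=$(printf '10\n9\n2\n' | sort -n)
echo "$nums"
# 2
# 9
# 10
```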

uniq: count and deduplicate

uniq removes or counts consecutive duplicate lines. Because it only looks at adjacent lines, it almost always follows sort:

sort names.txt | uniq                   # remove duplicates
sort names.txt | uniq -c                # count occurrences
sort names.txt | uniq -d                # print one copy of each line that occurs more than once
sort names.txt | uniq -u                # print only unique (non-duplicate) lines
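
To see why the sort step matters, compare uniq with and without it on input where the repeated line is not adjacent to its twin. This is a sketch on inline sample data; the variable names are illustrative.

```shell
# Inline sample data: "apple" appears twice, but not on neighbouring lines.
lines='apple
banana
apple'

# Without sort, the two "apple" lines are not neighbours, so uniq keeps both.
unsorted_count=$(printf '%s\n' "$lines" | uniq | wc -l | tr -d ' ')
echo "$unsorted_count"   # 3

# After sorting, the duplicates sit next to each other and collapse.
sorted_count=$(printf '%s\n' "$lines" | sort | uniq | wc -l | tr -d ' ')
echo "$sorted_count"     # 2

# -d prints one copy of each repeated line; -u prints lines that occur once.
dupes=$(printf '%s\n' "$lines" | sort | uniq -d)
singles=$(printf '%s\n' "$lines" | sort | uniq -u)
echo "$dupes $singles"   # apple banana
```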

The most common pattern is sort | uniq -c | sort -rn — count occurrences and rank by frequency:

# Most common HTTP status codes in an access log
awk '{print $9}' access.log | sort | uniq -c | sort -rn | head -10

# Most common words in a file
tr '[:upper:]' '[:lower:]' < essay.txt | tr -cs '[:alpha:]' '\n' | \
  sort | uniq -c | sort -rn | head -20
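
uniq -c right-aligns the count with leading spaces, and the amount of padding differs between implementations. When the counts feed another program, one option is to let awk re-split the line and print clean columns. The sketch below assumes made-up status codes as inline data.

```shell
# Inline stand-in for a column of HTTP status codes (invented values).
codes='200
404
200
500
200
404'

# Count, rank by frequency, then have awk drop the padding and
# print "value count" separated by a single space.
ranked=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rn | awk '{print $2, $1}')
echo "$ranked"
# 200 3
# 404 2
# 500 1
```

Note that awk's default field splitting breaks values containing spaces, so this form suits single-token values such as status codes or IP addresses.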

Composing multi-step pipelines

These three tools combine into a standard idiom for data exploration:

# Top 5 users by number of processes (NR > 1 skips the ps header line)
ps aux | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn | head -5

# Files modified in the last 7 days, sorted by size in blocks (largest first)
# -exec ... {} + handles filenames with spaces and runs nothing when no file matches
find . -mtime -7 -type f -exec ls -s {} + 2>/dev/null | sort -rn | head -10

# Unique IP addresses accessing a web server
cut -d' ' -f1 access.log | sort -u

# Distribution of file extensions under src/ (files without an extension are skipped)
find src/ -type f | sed -n 's/.*\.\([^./]*\)$/\1/p' | sort | uniq -c | sort -rn

The standard pipeline structure for analysis: extract field → sort → uniq -c → sort -rn → head -N. Once you internalize this, many data questions become a matter of plugging in the right extraction step at the front.
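
One way to make the "plug in the extraction step" idea concrete is to wrap the counting stages in a shell function. This is only a sketch: top_values is a hypothetical helper name, not a standard command, and the input is inline sample data.

```shell
# top_values is a hypothetical helper, not a standard command.
# Usage: some_extraction_step | top_values [N]
top_values() {
  # sort groups identical lines, uniq -c counts each group,
  # sort -rn ranks by count, head keeps the first N (default 10).
  sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example with inline data standing in for an extracted field.
top=$(printf 'sh\nbash\nzsh\nbash\nbash\nsh\n' | top_values 2 | awk '{print $2, $1}')
echo "$top"
# bash 3
# sh 2
```

With the function defined, a question such as "which login shells are most common" reduces to an extraction step followed by the helper, for example cut -d: -f7 /etc/passwd | top_values 5.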

Do it yourself

# Extract unique shells from /etc/passwd and count each
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

# Simulate a log and find top "IPs"
printf "10.0.0.1\n10.0.0.2\n10.0.0.1\n10.0.0.3\n10.0.0.1\n10.0.0.2\n" | \
  sort | uniq -c | sort -rn

# Sort /etc/passwd by UID numerically (field 3)
sort -t: -k3,3n /etc/passwd | cut -d: -f1,3 | head -10

Where to go next

You've completed the Text processing module — grep, sed, awk, cut, sort, and uniq. The lab is next for hands-on reinforcement, then the Advanced tier opens up: arrays, parameter expansion, heredocs, traps, and automation tools that put everything together in production-grade scripts.
