Awk Basics

Use awk to split fields, apply per-record logic with BEGIN/END blocks, use NR/NF, accumulate sums, and build small text programs.

By the end of this lesson you will be able to:

Print specific fields with $1, $2, $NF
Set FS and OFS to handle delimited data
Write BEGIN and END blocks for setup and summary logic
Filter records with pattern conditions
Accumulate sums and counts across records

grep filters lines. sed transforms them. awk is the tool when you need to compute — extract structured columns from delimited data, sum a column of numbers, reformat a CSV, or print a report. An awk program is a series of pattern { action } rules; for each input record (line), every matching pattern's action runs.

The basic model: fields and records

By default, awk splits each line on whitespace into numbered fields: $1 is the first field, $2 the second, and so on. $0 is the entire line. $NF is the last field regardless of how many there are:

echo "Alice 30 engineer" | awk '{ print $1, $3 }'
# Alice engineer

ls -la | awk '{ print $NF }'     # print last field (the filename, unless it contains spaces)
ls -la | awk '{ print $5, $NF }' # print size and last field

NR is the current record (line) number; NF is the number of fields on the current line.
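
As a small illustration of NR and NF together, the sketch below labels each line with its record number and field count. The sample text and the helper name show_nr_nf are made up for this example.

```shell
# Hypothetical sample input: three lines with differing field counts.
sample='alpha beta gamma
one two
solo'

# show_nr_nf is an illustrative helper name, not a standard command.
# It prints the record number, the field count, and the last field.
show_nr_nf() {
  awk '{ print NR ": " NF " fields, last=" $NF }'
}

printf '%s\n' "$sample" | show_nr_nf
# 1: 3 fields, last=gamma
# 2: 2 fields, last=two
# 3: 1 fields, last=solo
```

Because NF changes per line, $NF always points at whatever the last field happens to be on that record.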

FS and OFS: custom delimiters

Set FS (Field Separator) to split on something other than whitespace. OFS (Output Field Separator) controls how fields are joined when awk rebuilds $0 after you modify a field, and what print puts between comma-separated arguments:

# Parse /etc/passwd (colon-delimited)
awk -F: '{ print $1, $7 }' /etc/passwd      # username and shell
awk 'BEGIN { FS=":" } { print $1 }' /etc/passwd

# Simple CSV with comma separator (assumes no quoted fields containing commas)
awk -F, '{ print $2 }' data.csv             # second column

# Reformat: change delimiter from comma to pipe
awk -F, 'BEGIN { OFS="|" } { $1=$1; print }' data.csv

The $1=$1 trick (assigning a field to itself) forces awk to rebuild $0 using OFS, which is how you change the output delimiter.
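
To see when OFS actually applies, the following sketch runs the same input through three variants. The input line is invented for the example.

```shell
# Hypothetical one-line input.
line='a,b,c'

# No field is modified, so print emits $0 unchanged and OFS is ignored.
printf '%s\n' "$line" | awk -F, 'BEGIN { OFS="|" } { print }'
# a,b,c

# Assigning a field (even to itself) makes awk rebuild $0 using OFS.
printf '%s\n' "$line" | awk -F, 'BEGIN { OFS="|" } { $1 = $1; print }'
# a|b|c

# print with commas between arguments also inserts OFS.
printf '%s\n' "$line" | awk -F, 'BEGIN { OFS="|" } { print $1, $3 }'
# a|c
```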

Pattern conditions

A pattern before the action block filters which records it runs on:

# Lines where field 3 is greater than 100
awk '$3 > 100 { print $1, $3 }' data.txt

# Lines matching a regex
awk '/ERROR/ { print }' app.log

# A range of lines (start to end pattern, inclusive)
awk '/BEGIN_SECTION/,/END_SECTION/ { print }' report.txt

# Combining conditions
awk '$1 == "Alice" && $3 > 50 { print }' data.txt

Patterns without an action block default to { print }, so awk '/ERROR/' is equivalent to grep ERROR.
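
A quick way to try pattern-only rules is to feed awk a few lines from a shell variable. The log text below is made up for the example.

```shell
# Hypothetical log excerpt.
log='INFO start
ERROR disk full
INFO retry
ERROR timeout'

# A pattern with no action prints the matching records.
printf '%s\n' "$log" | awk '/ERROR/'
# ERROR disk full
# ERROR timeout

# Patterns are ordinary expressions, so NR can select lines by number.
printf '%s\n' "$log" | awk 'NR >= 2 && NR <= 3'
# ERROR disk full
# INFO retry
```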

BEGIN and END blocks

BEGIN runs once before any input is read. END runs once after all input has been processed. Use them for setup and summaries:

awk -F: 'BEGIN { print "Username\tShell" }
         { print $1 "\t" $7 }
         END { print "Total:", NR, "users" }' /etc/passwd

awk -F: 'END { print NR, "lines in /etc/passwd" }' /etc/passwd

Accumulating sums

One of the most common awk idioms is to accumulate values across all records, then print the total in END:

# Sum the 5th column (file sizes from ls -la)
ls -la | awk 'NR > 1 { sum += $5 } END { print "Total bytes:", sum }'

# Count occurrences of each value in column 1
awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' data.txt

# Average of column 2
awk '{ sum += $2; n++ } END { if (n > 0) print "Average:", sum/n }' data.txt

awk arrays are associative (like hash maps). The count[$1]++ idiom creates and increments a counter keyed on the first field without any initialisation needed. This pattern — grouping and counting by key — covers a large fraction of practical log-analysis tasks.
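
The same array idiom extends from counting to summing per key. The sketch below is one way to do it, using invented data and a hypothetical helper name, sum_by_key. It pipes through sort because for (k in array) visits keys in an unspecified order.

```shell
# Hypothetical data: a user name and a byte count per line.
usage='alice 120
bob 40
alice 30
carol 75
bob 10'

# sum_by_key is an illustrative helper name.
# It adds column 2 into a bucket keyed on column 1, then prints each bucket.
sum_by_key() {
  awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' | sort
}

printf '%s\n' "$usage" | sum_by_key
# alice 150
# bob 50
# carol 75
```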

Field separator timing: the -F option on the command line (for example -F,) sets FS before any input is read, so it applies to every record. Setting FS inside a BEGIN block has the same effect. Setting it in a plain action block (not BEGIN) takes effect only from the next record, because the current record has already been split. This is a common gotcha.
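
The timing difference is easy to reproduce with two lines of input. The sketch below assumes a POSIX-conforming awk such as gawk or mawk; some older implementations split fields lazily and may behave differently in the first case. The data is invented for the example.

```shell
# Hypothetical colon-delimited input.
data='a:b
c:d'

# FS set in a plain action: record 1 was already split on whitespace,
# so its $1 is the whole line. (Assumes POSIX behaviour, e.g. gawk or mawk.)
printf '%s\n' "$data" | awk '{ FS = ":" } { print $1 }'
# a:b
# c

# FS set in BEGIN: in effect before the first record is read.
printf '%s\n' "$data" | awk 'BEGIN { FS = ":" } { print $1 }'
# a
# c
```

Using -F: on the command line gives the same result as the BEGIN version.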

Do it yourself

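# Parse /etc/passwd: print the first 8 usernames and shells, then the total count
# (filtering with NR inside awk keeps the END summary, which a trailing head would cut off)
awk -F: 'BEGIN { print "User\tShell" }
         NR <= 8 { print $1 "\t" $7 }
         END { print "---\nTotal:", NR }' /etc/passwd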

# Sum file sizes in /usr/bin (column 5 of ls -la)
ls -la /usr/bin | awk 'NR > 1 && $5+0 > 0 { sum += $5 } END { print "Total:", sum, "bytes" }'

# Count lines per unique first word
printf 'apple 1\nbanana 2\napple 3\ncherry 1\n' | \
  awk '{ count[$1]++ } END { for (k in count) print k, count[k] }'

Where to go next

You can now extract, filter, and aggregate structured text with awk. The final lesson in this module — cut, sort, and uniq — covers three focused tools that compose naturally into compact pipelines for counting, ranking, and deduplicating data.
