Data shapes
Rows are observations, columns are variables — and whether your table is wide or long determines which analyses are easy.
- Explain tabular data in terms of rows (observations) and columns (variables)
- Distinguish wide format from long format with a concrete example
- State the tidy data principle and why it matters
Most data you'll work with in practice is tabular: it lives in a grid of rows and columns. Understanding what rows and columns represent is not a technicality — it determines which questions are easy and which require reshaping the data first.
Rows and columns
The convention is consistent across most tools in the data ecosystem:
- Each row is one observation — one event, one measurement, one record.
- Each column is one variable — one attribute of every observation.
Consider a table of daily temperature readings for three cities:
| date | london | paris | berlin |
|---|---|---|---|
| 2024-06-01 | 18 | 22 | 19 |
| 2024-06-02 | 17 | 21 | 18 |
| 2024-06-03 | 20 | 24 | 21 |
Each row is a single date, and each column after date is a city. This is called wide format: values of the same kind (temperature) are spread across multiple columns, one per city.
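As a minimal sketch, assuming pandas is installed, the wide table above could be built like this. The variable name `wide` is chosen here for illustration only.

```python
import pandas as pd

# Wide format: one row per date, one column per city.
wide = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# Three dates; one date column plus three city columns.
print(wide.shape)  # (3, 4)
```

The city names live in the column headers here, not in the data itself, which is the defining trait of the wide shape.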
Wide vs long format
Wide format is convenient for humans reading a spreadsheet — you can scan a row and compare the cities at a glance. But most analysis tools prefer long format, where each row is one measurement:
| date | city | temperature |
|---|---|---|
| 2024-06-01 | london | 18 |
| 2024-06-01 | paris | 22 |
| 2024-06-01 | berlin | 19 |
| 2024-06-02 | london | 17 |
| 2024-06-02 | paris | 21 |
| … | … | … |
The same data, different shape. In long format, adding a fourth city is trivial: add more rows. In wide format, you'd need a new column, and any code that lists the city columns by name or refers to them by position would have to change.
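To illustrate the point about adding a city, here is a small sketch using one day of the long table. The fourth city (`madrid`) and its reading of 25 are made up for this example and are not part of the dataset above.

```python
import pandas as pd

# One day of the long table: one row per (date, city) measurement.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-01", "2024-06-01"],
        "city": ["london", "paris", "berlin"],
        "temperature": [18, 22, 19],
    }
)

# Hypothetical reading for a fourth city; the value is invented for illustration.
madrid = pd.DataFrame(
    {"date": ["2024-06-01"], "city": ["madrid"], "temperature": [25]}
)

# Adding a city means appending rows; the column layout does not change.
extended = pd.concat([long_df, madrid], ignore_index=True)

print(list(extended.columns))  # ['date', 'city', 'temperature']
print(len(extended))           # 4
```

Any code written against the `date`, `city`, and `temperature` columns keeps working unchanged after the new rows arrive.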
Long format also makes grouping natural. "Average temperature by city" is a
one-liner: group by city, take the mean of temperature. In wide format you'd
have to enumerate the city columns manually.
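The "average temperature by city" one-liner might look like the following in pandas. This is a sketch: the long table is filled in from the three dates of the wide table shown earlier, and the variable names are illustrative.

```python
import pandas as pd

# Long format: every row is one temperature reading for one city on one date.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01"] * 3 + ["2024-06-02"] * 3 + ["2024-06-03"] * 3,
        "city": ["london", "paris", "berlin"] * 3,
        "temperature": [18, 22, 19, 17, 21, 18, 20, 24, 21],
    }
)

# Group by the city column, then average the temperature column.
mean_by_city = long_df.groupby("city")["temperature"].mean()

# Rounded: berlin is about 19.3, london about 18.3, paris about 22.3.
print(mean_by_city.round(1))
```

Nothing in the grouping line mentions a specific city, so it would work the same way with three cities or three hundred.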
pandas provides DataFrame.melt() to go from wide to long, and DataFrame.pivot()
to go from long to wide. Knowing which shape you need before you reshape saves a
lot of head-scratching.
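A round trip between the two shapes could be sketched as follows, using the temperature table from earlier. The column names passed to `var_name` and `value_name` are choices made for this example.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# Wide -> long: keep date as the identifier and stack the city columns into rows.
long_df = wide.melt(id_vars=["date"], var_name="city", value_name="temperature")

# Long -> wide: one row per date, one column per city.
wide_again = long_df.pivot(index="date", columns="city", values="temperature")

print(long_df.shape)     # (9, 3)
print(wide_again.shape)  # (3, 3): date is now the index, not a column
```

Two details are worth knowing: `melt()` returns the rows grouped by city rather than by date, so the order differs from the long table shown above, and `pivot()` moves `date` into the index, so calling `reset_index()` afterwards is one way to get it back as an ordinary column.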
Tidy data
The long format above follows the tidy data principle, coined by statistician Hadley Wickham:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms its own table.
"Tidy" is not about cleanliness in the sense of missing values or typos — it is about structure. Tidy data is not inherently better than wide data for every purpose, but it is the standard input shape for most statistical and visualisation tools. When you are confused about why a plot or aggregation is not working, check whether your data is tidy first.
Check your understanding
1. In a well-structured tabular dataset, what does each row represent?
2. True or false: long-format data is generally easier to group and aggregate than wide-format data.
3. In tidy data, what does each column represent?
Where to go next
Next: inspecting with pandas — loading a dataset into a DataFrame and using
.head(), .shape, .dtypes, and .describe() to understand what you're
working with before you touch anything.