Data shapes

Rows are observations, columns are variables — and whether your table is wide or long determines which analyses are easy.

Data Science · Beginner · 6 min read
By the end of this lesson you will be able to:
  • Explain tabular data in terms of rows (observations) and columns (variables)
  • Distinguish wide format from long format with a concrete example
  • State the tidy data principle and why it matters

Most data you'll work with in practice is tabular: it lives in a grid of rows and columns. Understanding what rows and columns represent is not a technicality — it determines which questions are easy and which require reshaping the data first.

Rows and columns

This convention holds across most tools in the data ecosystem:

  • Each row is one observation — one event, one measurement, one record.
  • Each column is one variable — one attribute of every observation.

Consider a table of daily temperature readings for three cities:

date        london  paris  berlin
2024-06-01  18      22     19
2024-06-02  17      21     18
2024-06-03  20      24     21

Each row is a single date, and each column after date is a city. This is called wide format: multiple values of the same kind of thing (temperature) are spread across multiple columns.
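As a minimal sketch, assuming pandas is installed, the wide table above could be built as a DataFrame like this. The variable names wide_df and city_columns are illustrative choices, not part of the lesson.

```python
import pandas as pd

# Wide format: one row per date, one column per city.
wide_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# Three dates; one date column plus three city columns.
print(wide_df.shape)

# The city names live in the column labels, not in a column of their own.
city_columns = [col for col in wide_df.columns if col != "date"]
print(city_columns)
```

Notice that "city" is not a value anywhere in the table. It exists only as a set of column labels, which is the defining trait of wide format.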

Wide vs long format

Wide format is convenient for humans reading a spreadsheet — you can scan a row and compare the cities at a glance. But most analysis tools prefer long format, where each row is one measurement:

date        city    temperature
2024-06-01  london  18
2024-06-01  paris   22
2024-06-01  berlin  19
2024-06-02  london  17
2024-06-02  paris   21

The same data, different shape. In long format, adding a fourth city is trivial: add more rows. In wide format, you'd need a new column, and any code that refers to columns by position could break.
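A short sketch of that point, assuming pandas. The five rows are the ones shown in the long table. The madrid readings are invented purely for illustration and are not part of the lesson's data.

```python
import pandas as pd

# The five long-format rows shown in the table above.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-01", "2024-06-01", "2024-06-02", "2024-06-02"],
        "city": ["london", "paris", "berlin", "london", "paris"],
        "temperature": [18, 22, 19, 17, 21],
    }
)

# Hypothetical readings for a fourth city (values made up for this example).
madrid_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02"],
        "city": ["madrid", "madrid"],
        "temperature": [27, 28],
    }
)

# Adding a city means appending rows; the column layout stays the same.
extended_df = pd.concat([long_df, madrid_df], ignore_index=True)
print(list(extended_df.columns))
print(len(extended_df))
```

Code written against the date, city, and temperature columns keeps working, because those columns are unchanged.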

Long format also makes grouping natural. "Average temperature by city" is a one-liner: group by city, take the mean of temperature. In wide format you'd have to enumerate the city columns manually.
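To make the comparison concrete, here is a sketch assuming pandas, using the nine readings from the wide table earlier in the lesson. The variable names are illustrative.

```python
import pandas as pd

# The nine readings from the wide table, written in long format.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01"] * 3 + ["2024-06-02"] * 3 + ["2024-06-03"] * 3,
        "city": ["london", "paris", "berlin"] * 3,
        "temperature": [18, 22, 19, 17, 21, 18, 20, 24, 21],
    }
)

# Long format: the grouping variable is a column, so one expression
# covers however many cities the data contains.
by_city = long_df.groupby("city")["temperature"].mean()
print(by_city)

# Wide format: the same numbers, but the city columns are listed by name.
wide_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)
wide_means = wide_df[["london", "paris", "berlin"]].mean()
print(wide_means)
```

Both approaches produce the same averages. The difference is that the wide version has to change whenever a city is added, while the long version does not.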

pandas provides DataFrame.melt() to go from wide to long, and DataFrame.pivot() to go from long to wide. Knowing which shape you need before you reshape saves a lot of head-scratching.
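A minimal round trip with the temperature data might look like the following, assuming pandas. The column names city and temperature passed to melt() are choices made here to match the long table above.

```python
import pandas as pd

wide_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# Wide to long: keep date as the identifier and stack the city columns
# into a (city, temperature) pair of columns.
long_df = wide_df.melt(id_vars="date", var_name="city", value_name="temperature")
print(long_df.shape)  # 3 dates x 3 cities = 9 rows, 3 columns

# Long to wide: one row per date, one column per distinct city value.
back_df = long_df.pivot(index="date", columns="city", values="temperature")
print(back_df)
```

After pivot(), date becomes the index rather than an ordinary column. Calling reset_index() on the result turns it back into a regular column if that is the shape you need.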

Tidy data

The long format above follows the tidy data principle, coined by statistician Hadley Wickham:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms its own table.
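The third rule is the easiest to overlook. As a hedged sketch, assuming pandas: temperature readings and facts about the cities themselves are two different kinds of observational unit, so they sit in separate tables and are joined only when needed. The cities_df table and its country column are illustrative additions, not part of the lesson's data.

```python
import pandas as pd

# One table per observational unit: readings (one row per date and city)...
readings_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-01", "2024-06-01", "2024-06-02", "2024-06-02"],
        "city": ["london", "paris", "berlin", "london", "paris"],
        "temperature": [18, 22, 19, 17, 21],
    }
)

# ...and cities (one row per city). This table is a hypothetical example.
cities_df = pd.DataFrame(
    {
        "city": ["london", "paris", "berlin"],
        "country": ["UK", "France", "Germany"],
    }
)

# Join on the shared key only when an analysis needs both kinds of information.
merged_df = readings_df.merge(cities_df, on="city")
print(merged_df)
```

Keeping the country in its own table means it is stored once per city instead of being repeated on every reading.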

"Tidy" is not about cleanliness in the sense of missing values or typos — it is about structure. Tidy data is not inherently better than wide data for every purpose, but it is the standard input shape for most statistical and visualisation tools. When you are confused about why a plot or aggregation is not working, check whether your data is tidy first.

Check your understanding

Knowledge check

  1. In a well-structured tabular dataset, what does each row represent?
  2. True or false: long-format data is generally easier to group and aggregate than wide-format data.
  3. In tidy data, what does each column represent?

Where to go next

Next: inspecting with pandas — loading a dataset into a DataFrame and using .head(), .shape, .dtypes, and .describe() to understand what you're working with before you touch anything.
