Data shapes

Rows are observations, columns are variables — and whether your table is wide or long determines which analyses are easy.

Data Science · Beginner · 6 min read

Recommended first: What is data?

By the end of this lesson you will be able to:

Explain tabular data in terms of rows (observations) and columns (variables)
Distinguish wide format from long format with a concrete example
State the tidy data principle and why it matters

Most data you'll work with in practice is tabular: it lives in a grid of rows and columns. Understanding what rows and columns represent is not a technicality — it determines which questions are easy and which require reshaping the data first.

Rows and columns

The convention is shared by most tools in the data ecosystem, from spreadsheets to SQL databases to pandas:

Each row is one observation — one event, one measurement, one record.
Each column is one variable — one attribute of every observation.

Consider a table of daily temperature readings for three cities:

date	london	paris	berlin
2024-06-01	18	22	19
2024-06-02	17	21	18
2024-06-03	20	24	21

Each row is a single date. Each column after date is a city. This is called wide format: multiple values of the same kind of thing (temperature) are spread across multiple columns.
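To make the wide table concrete, here is a minimal sketch that builds it as a pandas DataFrame. The variable name wide is just a label chosen for this example; the values are the ones from the table above.

```python
import pandas as pd

# The wide table from above: one row per date, one column per city.
wide = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# .shape is (rows, columns): 3 dates, and 4 columns (date plus 3 cities).
n_rows, n_cols = wide.shape
print(wide)
print(n_rows, n_cols)
```

Notice that the city names live in the column headers, not in the data itself. That detail is what makes this table wide.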

Wide vs long format

Wide format is convenient for humans reading a spreadsheet — you can scan a row and compare the cities at a glance. But most analysis tools prefer long format, where each row is one measurement:

date	city	temperature
2024-06-01	london	18
2024-06-01	paris	22
2024-06-01	berlin	19
2024-06-02	london	17
2024-06-02	paris	21
…	…	…

The same data, different shape. In long format, adding a fourth city is trivial: add more rows. In wide format, you'd need a new column, and any code that lists the city columns by name or position would have to change.
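A small sketch of what "add more rows" looks like in pandas. The fourth city (madrid) and its reading of 27 are invented purely for illustration and are not part of the dataset above; only the first date is included to keep the example short.

```python
import pandas as pd

# Long format: one row per (date, city) measurement. First date only.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-01", "2024-06-01"],
        "city": ["london", "paris", "berlin"],
        "temperature": [18, 22, 19],
    }
)

# Hypothetical new reading: the city and the value are made up for this example.
new_rows = pd.DataFrame(
    {"date": ["2024-06-01"], "city": ["madrid"], "temperature": [27]}
)

# Adding a city means appending rows; the set of columns does not change.
extended = pd.concat([long_df, new_rows], ignore_index=True)
print(list(extended.columns))
print(len(extended))
```

The columns are still date, city, and temperature, so code written against those three names keeps working no matter how many cities you add.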

Long format also makes grouping natural. "Average temperature by city" is a one-liner: group by city, take the mean of temperature. In wide format you'd have to enumerate the city columns manually.
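Here is roughly what that one-liner could look like in pandas, assuming the long table holds all nine readings (three dates for each of the three cities, taken from the wide table earlier in the lesson).

```python
import pandas as pd

# All nine readings in long format: one row per (date, city) measurement.
long_df = pd.DataFrame(
    {
        "date": ["2024-06-01"] * 3 + ["2024-06-02"] * 3 + ["2024-06-03"] * 3,
        "city": ["london", "paris", "berlin"] * 3,
        "temperature": [18, 22, 19, 17, 21, 18, 20, 24, 21],
    }
)

# Group by city, then take the mean of temperature within each group.
mean_by_city = long_df.groupby("city")["temperature"].mean()
print(mean_by_city)
```

The same line works unchanged if more cities are added later, because the city names are data rather than column headers.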

pandas provides DataFrame.melt() to go from wide to long, and DataFrame.pivot() to go from long to wide. Knowing which shape you need before you reshape saves a lot of head-scratching.
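The round trip can be sketched like this, using the temperature table from earlier. The variable names (wide, long_df, back_to_wide) are just labels chosen for this example.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "date": ["2024-06-01", "2024-06-02", "2024-06-03"],
        "london": [18, 17, 20],
        "paris": [22, 21, 24],
        "berlin": [19, 18, 21],
    }
)

# Wide to long: keep date as the identifier and stack the city columns
# into a city column (the old header) and a temperature column (the value).
long_df = wide.melt(id_vars=["date"], var_name="city", value_name="temperature")

# Long to wide: one row per date, one column per city.
# pivot() makes date the index and labels the column axis "city", so
# reset_index() and rename_axis() turn the result back into a plain table.
back_to_wide = (
    long_df.pivot(index="date", columns="city", values="temperature")
    .reset_index()
    .rename_axis(columns=None)
)

print(long_df)
print(back_to_wide)
```

Two things to expect when you run it: melt() stacks the rows one city at a time rather than one date at a time, and pivot() returns the city columns in alphabetical order. pivot() also raises an error if the same date and city pair appears more than once; in that case DataFrame.pivot_table(), which aggregates duplicates, is the usual alternative.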

Tidy data

The long format above follows the tidy data principle, formulated by statistician Hadley Wickham:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms its own table.

"Tidy" is not about cleanliness in the sense of missing values or typos — it is about structure. Tidy data is not inherently better than wide data for every purpose, but it is the standard input shape for most statistical and visualisation tools. When you are confused about why a plot or aggregation is not working, check whether your data is tidy first.

Where to go next

Next: inspecting with pandas — loading a dataset into a DataFrame and using .head(), .shape, .dtypes, and .describe() to understand what you're working with before you touch anything.
