Joining data

Inner, left, right, and outer joins — how each type decides which rows survive, and what can go wrong with duplicate keys.

Most real datasets arrive in pieces. A customer table lives in one place, their orders in another, product details in a third. Before you can answer any cross-table question you have to join those pieces together — combining rows from two tables on a shared key.

Understanding which rows survive a join, and why, prevents a whole class of subtle bugs that don't raise errors but silently corrupt your analysis.

The four join types

Think of two tables, orders and customers, sharing a customer_id column.

Inner join — the default. A row appears in the result only when the key exists in both tables. Customers with no orders are dropped. Orders with an unrecognised customer id are also dropped. The result is the intersection.

Left (orders)	Right (customers)	Inner result
order 1, cid=A	cid=A, Alice	cid=A row
order 2, cid=B	cid=C, Carol	(dropped — cid=B not in customers)

Left join — keeps every row from the left table regardless of whether a match exists. Unmatched rows get NaN for all right-table columns. Use a left join when the left table is your primary dataset and the right table is supplementary enrichment.

Right join — the mirror image. Every row from the right table survives; unmatched left rows are dropped. In practice, people swap the table order and use a left join instead — the outcome is identical.

Outer join (full outer) — the union. Rows from either side survive, with NaN filling in wherever a match is absent. Useful for auditing: outer join two tables and filter for rows where one side is entirely null — those are the gaps.

The duplicate-key problem

Here is the most dangerous join mistake. Suppose the orders table has two rows for the same customer_id (two orders from the same customer) and the customers table also has two rows for that id (perhaps a data quality issue). A join on that key does not pick one — it pairs every left row with every right row that matches. Two rows on each side becomes four rows in the result.

This is called a many-to-many join. The row count expands silently. You will not get an error. Your aggregations will be inflated. The only way to detect it is to check the output shape: if joining a 100-row table with a 50-row lookup table produces 200 rows, something has multiplied.

Always check len(result) after a join. If it is larger than the larger of your two inputs, you almost certainly have duplicate keys on one or both sides. Deduplicate with .drop_duplicates(subset=["key_column"]) before joining.

Choosing the right join

Ask two questions. First: which table is primary? If you want to keep all rows from one side no matter what, that side goes on the left and you use a left join. Second: should rows without a match be kept or discarded? If discarded, use inner. If kept, use left (or outer if both sides matter).

A useful pattern for auditing data completeness: inner join first to see what matches cleanly, then left join and filter for null right-side columns to see what does not. The gap between the two results tells you exactly how much data falls outside the overlap.

Where to go next

Next: merging DataFrames — translating these concepts into pd.merge() and df.join() calls, and verifying that the result has the shape you expect.

Finished reading? Mark it complete to track your progress.

The four join types

The duplicate-key problem

Choosing the right join

Where to go next

On this page