Merging DataFrames
Use pd.merge() and df.join() to combine pandas DataFrames on shared keys, then verify the result shape.
- Use pd.merge() with how=, on=, left_on=, and right_on= arguments
- Use df.join() for index-based joins
- Verify merge results with shape and null checks
pd.merge() is the workhorse for combining DataFrames. It maps directly onto the
four join types from the previous lesson, but the syntax gives you a few extra
controls: which columns to match on, what to do when the key column has different
names in each table, and whether to join on the index instead.
Run the code and notice: the inner join drops order 105 (customer_id=4, which
does not exist in the customers table). The left join keeps it, but name and
city are NaN for that row. Checking .isnull().sum() after a left join
immediately tells you how many rows failed to match.
Joining on differently-named columns
When the key column has different names in each table, use left_on= and
right_on= instead of on=:
pd.merge(orders, customers,
left_on="customer_id",
right_on="cust_id",
how="inner")pandas keeps both key columns in the result. If they are redundant, drop one
with .drop(columns=["cust_id"]) afterwards.
Index-based joins with df.join()
df.join() is a convenience wrapper that joins on the index by default. Set
the shared key as the index on both DataFrames first:
Use pd.merge() when the key is in a column (the common case). Use df.join()
when both DataFrames already have a meaningful index and you want to avoid
resetting it. The two are functionally equivalent; merge() is more explicit.
Always check your shape
A good habit: print result.shape and compare to your expectations. If a
100-row left table produces a 120-row merged result, you have duplicate keys
somewhere. result[result.duplicated(subset=["key"])] shows the offending rows.
Where to go next
Next: reshaping — how to convert between wide and long data formats, and when each shape makes analysis easier.