Code of the Day
IntermediateReshaping and Merging

Merging DataFrames

Use pd.merge() and df.join() to combine pandas DataFrames on shared keys, then verify the result shape.

Data ScienceIntermediate10 min read
By the end of this lesson you will be able to:
  • Use pd.merge() with how=, on=, left_on=, and right_on= arguments
  • Use df.join() for index-based joins
  • Verify merge results with shape and null checks

pd.merge() is the workhorse for combining DataFrames. It maps directly onto the four join types from the previous lesson, but the syntax gives you a few extra controls: which columns to match on, what to do when the key column has different names in each table, and whether to join on the index instead.

Python — editable, runs in your browser

Run the code and notice: the inner join drops order 105 (customer_id=4, which does not exist in the customers table). The left join keeps it, but name and city are NaN for that row. Checking .isnull().sum() after a left join immediately tells you how many rows failed to match.

Joining on differently-named columns

When the key column has different names in each table, use left_on= and right_on= instead of on=:

pd.merge(orders, customers,
         left_on="customer_id",
         right_on="cust_id",
         how="inner")

pandas keeps both key columns in the result. If they are redundant, drop one with .drop(columns=["cust_id"]) afterwards.

Index-based joins with df.join()

df.join() is a convenience wrapper that joins on the index by default. Set the shared key as the index on both DataFrames first:

Python — editable, runs in your browser

Use pd.merge() when the key is in a column (the common case). Use df.join() when both DataFrames already have a meaningful index and you want to avoid resetting it. The two are functionally equivalent; merge() is more explicit.

Always check your shape

A good habit: print result.shape and compare to your expectations. If a 100-row left table produces a 120-row merged result, you have duplicate keys somewhere. result[result.duplicated(subset=["key"])] shows the offending rows.

Where to go next

Next: reshaping — how to convert between wide and long data formats, and when each shape makes analysis easier.

Finished reading? Mark it complete to track your progress.

On this page