Lab: mini analysis

Load, clean, summarise, and group a dataset end-to-end — a guided exploratory analysis from scratch.

This is an optional lab. No new syntax — just a realistic mini-analysis that strings together everything from both modules. Work through each section, run the code, and read the output carefully. The goal is to build the habit of thinking about what the numbers mean, not just producing them.

The dataset is a month of sales at a small online bookshop: 12 orders across three categories, with a few quality problems baked in. Your job is to answer the question: which category generates the most revenue from completed orders?

Step 1 — load and inspect

Always start here. Look at the data before touching anything.

What to notice: price is object (string) even though it should be numeric. There is one null in price (order 7). Two orders (5 and 9) are "refunded" and should be excluded from revenue totals.
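A minimal sketch of this inspection step, assuming the same `orders` dictionary that appears in Step 3. The particular calls (`head`, `dtypes`, `isna`, `value_counts`) are one reasonable way to surface the three observations above, not necessarily the ones the lab originally used.

```python
import pandas as pd

# Same data as the `orders` dictionary in Step 3
orders = {
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "category": ["fiction", "fiction", "non-fiction", "science", "fiction",
                 "science", "non-fiction", "fiction", "science", "fiction",
                 "non-fiction", "science"],
    "price": ["12.99", "8.50", "24.00", "18.75", "12.99", "22.50",
              None, "9.99", "18.75", "15.00", "19.50", "22.50"],
    "quantity": [2, 1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 3],
    "status": ["complete", "complete", "complete", "complete", "refunded",
               "complete", "complete", "complete", "refunded", "complete",
               "complete", "complete"],
}
df = pd.DataFrame(orders)

print(df.head())                     # first few rows, to see the shape of the data
print(df.dtypes)                     # price is listed as a string column, not float
print(df.isna().sum())               # missing values per column
print(df["status"].value_counts())   # how many orders are complete vs refunded
```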

Step 2 — clean

Fix the two data problems: drop the row with the missing price, then convert price to float. The refunded orders are handled by the filter in Step 3.
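A minimal sketch of the cleaning step, again assuming the `orders` dictionary from Step 3. It mirrors the `dropna` and `astype` calls that Step 3 uses to prepare the data, split into two statements so each fix is visible.

```python
import pandas as pd

# Same data as the `orders` dictionary in Step 3
orders = {
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "category": ["fiction", "fiction", "non-fiction", "science", "fiction",
                 "science", "non-fiction", "fiction", "science", "fiction",
                 "non-fiction", "science"],
    "price": ["12.99", "8.50", "24.00", "18.75", "12.99", "22.50",
              None, "9.99", "18.75", "15.00", "19.50", "22.50"],
    "quantity": [2, 1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 3],
    "status": ["complete", "complete", "complete", "complete", "refunded",
               "complete", "complete", "complete", "refunded", "complete",
               "complete", "complete"],
}
df = pd.DataFrame(orders)

# 1. Drop the row whose price is missing (order 7)
df = df.dropna(subset=["price"]).copy()

# 2. Convert the price strings to floats so arithmetic works
df["price"] = df["price"].astype(float)

print(df.dtypes)
print(len(df), "rows remain")
```

Dropping the row is the simplest choice for a single missing price; with more missing values you would want to ask why they are missing before discarding them.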

Step 3 — add a revenue column and filter

Revenue per order is price * quantity. Add it as a new column, then keep only completed orders.

import pandas as pd

orders = {
  "order_id":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  "category":  ["fiction", "fiction", "non-fiction", "science", "fiction",
                "science", "non-fiction", "fiction", "science", "fiction",
                "non-fiction", "science"],
  "price":     ["12.99", "8.50", "24.00", "18.75", "12.99", "22.50",
                None, "9.99", "18.75", "15.00", "19.50", "22.50"],
  "quantity":  [2, 1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 3],
  "status":    ["complete", "complete", "complete", "complete", "refunded",
                "complete", "complete", "complete", "refunded", "complete",
                "complete", "complete"],
}
df = pd.DataFrame(orders).dropna(subset=["price"]).copy()
df["price"] = df["price"].astype(float)

# Add a revenue column
df["revenue"] = df["price"] * df["quantity"]

# Filter to complete orders only
completed = df[df["status"] == "complete"]

print("Completed orders:")
print(completed[["order_id", "category", "revenue"]])
print("\nTotal revenue (all complete orders):", round(completed["revenue"].sum(), 2))

Step 4 — group and answer the question

Now split by category and compute total revenue, mean revenue, and order count per group.

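import pandas as pd

# Continues from Step 3: `completed` must already be defined.
# Named aggregation: each keyword becomes an output column.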
summary = completed.groupby("category")["revenue"].agg(
  total_revenue="sum",
  mean_revenue="mean",
  order_count="count",
)
print(summary.sort_values("total_revenue", ascending=False))

The named aggregation syntax — agg(total_revenue="sum", ...) — gives your result columns descriptive names instead of the default "sum", "mean", etc. Worth using whenever the output will be read by others (or by you in three weeks).
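To see the difference in column names side by side, here is a small sketch. The three-row `completed` frame is a hypothetical stand-in with made-up values, not the lab data.

```python
import pandas as pd

# Tiny stand-in for the `completed` frame (hypothetical values, not the lab data)
completed = pd.DataFrame({
    "category": ["fiction", "fiction", "science"],
    "revenue": [10.0, 20.0, 50.0],
})

# List form: result columns are named after the aggregation functions
default_names = completed.groupby("category")["revenue"].agg(["sum", "mean"])
print(list(default_names.columns))  # ['sum', 'mean']

# Named aggregation: keyword is the output column name, value is the function
named = completed.groupby("category")["revenue"].agg(
    total_revenue="sum",
    mean_revenue="mean",
)
print(list(named.columns))  # ['total_revenue', 'mean_revenue']
```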

Interpret the result

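Science has the highest total revenue (172.50) from only three completed orders, fewer than fiction's four, because science books are more expensive and orders contain more copies. Fiction has the most orders but the lowest revenue per order (about 17.37). Non-fiction sits in between on revenue per order (31.50), although its total (63.00) is the lowest of the three because only two of its orders survive the cleaning and filtering steps.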

That is a real insight: total order count is not the same as total revenue. You can only see this by computing revenue explicitly and grouping.
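One way to check this claim directly is to rank the categories by order count and by revenue and compare the winners. The sketch below rebuilds `completed` exactly as in Step 3; the names `by_count` and `by_revenue` are introduced here for illustration.

```python
import pandas as pd

# Same data and preparation as Step 3
orders = {
    "order_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "category": ["fiction", "fiction", "non-fiction", "science", "fiction",
                 "science", "non-fiction", "fiction", "science", "fiction",
                 "non-fiction", "science"],
    "price": ["12.99", "8.50", "24.00", "18.75", "12.99", "22.50",
              None, "9.99", "18.75", "15.00", "19.50", "22.50"],
    "quantity": [2, 1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 3],
    "status": ["complete", "complete", "complete", "complete", "refunded",
               "complete", "complete", "complete", "refunded", "complete",
               "complete", "complete"],
}
df = pd.DataFrame(orders).dropna(subset=["price"]).copy()
df["price"] = df["price"].astype(float)
df["revenue"] = df["price"] * df["quantity"]
completed = df[df["status"] == "complete"]

# Two different rankings of the same three categories
by_count = completed["category"].value_counts()
by_revenue = completed.groupby("category")["revenue"].sum()

print("Most orders: ", by_count.idxmax())    # fiction
print("Most revenue:", by_revenue.idxmax())  # science
```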

Done?

You just ran a complete mini-analysis pipeline: load, inspect, clean, engineer a feature, filter, group, and interpret. Every real data project is a longer version of this same sequence. The tools scale; the pattern does not change.
