Lab: prepare a dataset

Take a raw mixed-type dataset end-to-end through cleaning, encoding, train/test splitting, and scaling — producing a model-ready train/test pair.

Lab · optional · Data Science · Intermediate · 30 min
By the end of this lesson you will be able to:
  • Clean missing values and fix types in a raw dataset
  • Apply one-hot and ordinal encoding to categorical columns
  • Split the data 80/20 into train and test sets
  • Fit a scaler on the training set and apply it to the test set
  • Inspect final shapes and verify there is no data leakage

This is an optional lab. No new concepts — just practice running the full feature engineering pipeline on a dataset you have not seen before. Work through each step, verify the shapes, and confirm your output is model-ready before moving on.

The dataset is a small property listing table with mixed types, missing values, and two categorical columns that need encoding. Your goal is to produce a clean, scaled train/test pair suitable for a regression model predicting price.

Step 1 — inspect the raw data

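The lab's own dataset is not reproduced here, so the sketch below builds a hypothetical stand-in using the column names the lab mentions (id, bedrooms, area_m2, condition, location, price). The values, the row count, and the location labels are invented for illustration; only the column names and the two missing bedrooms values come from the lab text. It shows one way to inspect shape, dtypes, and missing-value counts with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lab's property listing table.
# Column names follow the lab text; every value here is made up.
raw = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "bedrooms": [3, 2, np.nan, 4, 1, 3, np.nan, 2, 5, 3],
    "area_m2": [85.0, 60.0, 72.0, 120.0, 40.0, 95.0, 70.0, 55.0, 150.0, 88.0],
    "condition": ["good", "fair", "excellent", "good", "fair",
                  "excellent", "good", "fair", "excellent", "good"],
    "location": ["city", "suburb", "rural", "city", "suburb",
                 "city", "rural", "suburb", "city", "rural"],
    "price": [250000, 180000, 210000, 340000, 120000,
              310000, 190000, 160000, 450000, 230000],
})

print(raw.shape)           # number of rows and columns
print(raw.dtypes)          # which columns are numeric and which hold strings
missing = raw.isna().sum() # missing-value count per column
print(missing)
```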

Before cleaning, note that bedrooms has 2 missing values and that condition and location are stored as strings. The id column is an identifier, not a feature, so it should be dropped.

Step 2 — clean missing values and drop non-features

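A minimal sketch of this step, again on made-up rows. The lab text does not say how the missing bedrooms values should be filled, so the median fill below is an assumption; dropping the affected rows would be another reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; values are made up for illustration.
raw = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "bedrooms": [3, np.nan, 2, np.nan, 4],
    "area_m2": [85.0, 60.0, 72.0, 120.0, 95.0],
    "condition": ["good", "fair", "excellent", "good", "fair"],
    "location": ["city", "suburb", "rural", "city", "suburb"],
    "price": [250000, 180000, 210000, 340000, 310000],
})

# The id column identifies a row; it carries no predictive signal.
clean = raw.drop(columns=["id"])

# Assumption: fill missing bedrooms with the column median.
bedrooms_median = clean["bedrooms"].median()
clean["bedrooms"] = clean["bedrooms"].fillna(bedrooms_median)

# A column containing NaN is stored as float; restore an integer type
# now that every value is present.
clean["bedrooms"] = clean["bedrooms"].round().astype("int64")

print(clean.isna().sum())
print(clean.dtypes)
```

A median computed before the split draws on rows that later land in the test set. On a table this small the effect is minor, but a stricter variant would compute the fill value from the training rows only.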

Step 3 — encode categorical columns

condition is ordinal (fair < good < excellent), so map it to integers that preserve that order. location is nominal, with no inherent order, so one-hot encode it.

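One way to express both encodings with pandas alone is sketched below. The 1 to 3 mapping matches the ordinal range the lab mentions; the location labels and all row values are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned rows; values are made up for illustration.
clean = pd.DataFrame({
    "bedrooms": [3, 2, 4, 1],
    "area_m2": [85.0, 60.0, 120.0, 40.0],
    "condition": ["good", "fair", "excellent", "good"],
    "location": ["city", "suburb", "rural", "city"],
    "price": [250000, 180000, 340000, 120000],
})

# Ordinal encoding: the integers preserve fair < good < excellent.
condition_order = {"fair": 1, "good": 2, "excellent": 3}
encoded = clean.copy()
encoded["condition"] = encoded["condition"].map(condition_order)

# One-hot encoding: one 0/1 column per location value, no implied order.
encoded = pd.get_dummies(encoded, columns=["location"], dtype=int)

print(encoded)
```

An explicit dictionary keeps the category order visible in the code. Any condition value missing from the dictionary would become NaN after map, which makes unexpected categories easy to spot.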

Step 4 — split into train and test

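Separate the target (price) from the features, then split 80/20 before fitting any scaler, so that the scaling statistics come from training rows only. The id column was already dropped in Step 2.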

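The split could look like the sketch below, which uses scikit-learn's train_test_split. The frame is a hypothetical encoded table with 10 made-up rows (the one-hot location columns are left out to keep it short), and random_state=42 is an arbitrary choice that only makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded frame with 10 rows; values are made up.
encoded = pd.DataFrame({
    "bedrooms": [3, 2, 3, 4, 1, 3, 3, 2, 5, 3],
    "area_m2": [85.0, 60.0, 72.0, 120.0, 40.0, 95.0, 70.0, 55.0, 150.0, 88.0],
    "condition": [2, 1, 3, 2, 1, 3, 2, 1, 3, 2],
    "price": [250000, 180000, 210000, 340000, 120000,
              310000, 190000, 160000, 450000, 230000],
})

X = encoded.drop(columns=["price"])  # features
y = encoded["price"]                 # regression target

# test_size=0.2 holds out 20% of the rows; nothing has been scaled yet.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```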

Step 5 — scale numeric features

Scale bedrooms and area_m2. The encoded columns (0/1 binary and ordinal 1–3) are on reasonable scales already, but numeric columns with large ranges should be standardised.

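A sketch of the scaling step with scikit-learn's StandardScaler follows. The train and test frames are hypothetical stand-ins for the output of Step 4, and the _scaled suffix on the result names is only there to keep the unscaled frames available for comparison.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test features after the split; values are made up.
X_train = pd.DataFrame({
    "bedrooms": [3, 2, 4, 1, 3, 2, 5, 3],
    "area_m2": [85.0, 60.0, 120.0, 40.0, 95.0, 55.0, 150.0, 88.0],
    "condition": [2, 1, 2, 1, 3, 1, 3, 2],
})
X_test = pd.DataFrame({
    "bedrooms": [3, 3],
    "area_m2": [72.0, 70.0],
    "condition": [3, 2],
})

numeric_cols = ["bedrooms", "area_m2"]

scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# fit_transform learns mean and std from the training rows, then scales them.
X_train_scaled[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# transform reuses the training mean and std; nothing is learned from test rows.
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train_scaled.shape, X_test_scaled.shape)
print(X_train_scaled[numeric_cols].mean().round(6))  # close to 0 on train
```

The test columns will generally not have a mean of exactly 0 after scaling. That is expected, because they were shifted by the training mean, not their own.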

The scaler is fitted on X_train[numeric_cols] only. Applying scaler.transform(X_test[numeric_cols]) reuses the mean and std from training — no information from the test set was used. The train/test boundary has not been crossed.
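The leakage claim can also be checked mechanically. The helper below, scaler_uses_train_only, is a hypothetical name introduced for this sketch: it compares a fitted scaler's stored statistics against the training rows alone. The data is made up, and the second scaler is deliberately fitted on train plus test to show what a failing check looks like.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scaler_uses_train_only(scaler, X_train_raw, numeric_cols):
    """Return True if the scaler's mean and std match the training rows alone."""
    train_mean = X_train_raw[numeric_cols].mean().to_numpy()
    # StandardScaler uses the population standard deviation (ddof=0).
    train_std = X_train_raw[numeric_cols].std(ddof=0).to_numpy()
    return bool(
        np.allclose(scaler.mean_, train_mean)
        and np.allclose(scaler.scale_, train_std)
    )


# Hypothetical unscaled splits; values are made up.
X_train_raw = pd.DataFrame(
    {"bedrooms": [3, 2, 4, 1], "area_m2": [85.0, 60.0, 120.0, 40.0]}
)
X_test_raw = pd.DataFrame({"bedrooms": [5], "area_m2": [150.0]}, index=[4])
numeric_cols = ["bedrooms", "area_m2"]

good = StandardScaler().fit(X_train_raw[numeric_cols])
leaky = StandardScaler().fit(pd.concat([X_train_raw, X_test_raw])[numeric_cols])

print(scaler_uses_train_only(good, X_train_raw, numeric_cols))   # fitted on train only
print(scaler_uses_train_only(leaky, X_train_raw, numeric_cols))  # fitted on train + test

# No row should appear in both splits.
print(X_train_raw.index.intersection(X_test_raw.index).empty)
```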

Done?

You have taken a raw dataset with missing values, string categories, and mixed scales through a complete feature engineering pipeline: clean missing values, encode categoricals (ordinal and one-hot), split 80/20, and scale numerics on the training set only. The result is a pair of DataFrames ready to pass into a scikit-learn estimator. The next tier — Advanced — builds on this foundation to cover model training, evaluation, and iteration.
