
Encoding categoricals

Convert string category columns to numbers with pd.get_dummies() for nominal data and manual ordinal maps for ordered categories.

Data Science · Intermediate · 10 min read
By the end of this lesson you will be able to:
  • Apply one-hot encoding with pd.get_dummies()
  • Apply ordinal encoding using a Python dict map
  • Explain when each encoding is appropriate

Two kinds of categorical data require different encoding strategies. Nominal categories have no natural order: shirt colours, country names, product types. Ordinal categories have a meaningful order: small/medium/large, low/mid/high satisfaction, education level. Using the wrong encoding can mislead a model into assuming order where there is none, or discarding order where it matters.

One-hot encoding — for nominal categories

One-hot encoding creates one binary column per category value. If colour has three values — red, blue, green — one-hot produces three columns: colour_red, colour_blue, colour_green. Each row has a 1 in the column for its colour and 0 in the others.
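
A minimal sketch of the colour example, assuming a small made-up DataFrame (the `price` column and the row values are illustrative, not from the lesson's dataset). `dtype=int` is passed so the output shows 0/1 integers; pandas 2.x returns booleans by default.

```python
import pandas as pd

# Hypothetical example data: one nominal column with three distinct values.
df = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue"],
    "price": [10, 12, 9, 11],
})

# One binary column per distinct value; prefix= names them colour_<value>.
# dtype=int requests 0/1 integers instead of True/False.
encoded = pd.get_dummies(df, columns=["colour"], prefix="colour", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The original `colour` column is replaced by the three indicator columns, while `price` passes through unchanged.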


pd.get_dummies() handles any number of distinct values automatically. prefix= gives the new columns a consistent name prefix, which makes them easy to identify later.

One practical concern: if the column has k distinct values, you get k columns, and they are perfectly collinear (the one-hot columns for a row always sum to 1). Tree-based models generally cope with this; an unregularised linear model with an intercept cannot identify all k coefficients and needs only k-1 columns. Passing drop_first=True drops the first category's column, which then acts as the baseline.
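
To make the collinearity concrete, here is a short sketch comparing the full encoding with `drop_first=True`. The Series contents are made up for illustration.

```python
import pandas as pd

# Hypothetical nominal column with k = 3 distinct values.
colour = pd.Series(["red", "blue", "green", "blue"], name="colour")

full = pd.get_dummies(colour, prefix="colour", dtype=int)
reduced = pd.get_dummies(colour, prefix="colour", dtype=int, drop_first=True)

# Every row of the full encoding sums to 1: the k columns are redundant.
print(full.sum(axis=1).tolist())

# drop_first=True keeps k-1 columns; the dropped category is the first
# one in sorted order ("blue" here), encoded as all zeros.
print(reduced.columns.tolist())
```

A row that is all zeros in `reduced` is a blue row, so no information is lost by dropping the column.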

Ordinal encoding — for ordered categories

When the categories have a natural order, encode them as integers that preserve that order. A dict map, applied with Series.map(), is a clean way to do it.
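
A minimal sketch of a dict map for a size column. The data, the `weight_g` column and the choice of codes 1, 2, 3 are assumptions made for illustration. Because `Series.map()` returns NaN for any value missing from the dict, the sketch checks for unmapped values explicitly.

```python
import pandas as pd

# Hypothetical example data: an ordered category and a numeric column.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "weight_g": [150, 310, 220, 160],
})

# The dict spells out the order explicitly: small < medium < large.
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_code"] = df["size"].map(size_order)

# Values absent from the dict become NaN, so fail loudly on typos.
unmapped = df.loc[df["size_code"].isna(), "size"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped size values: {list(unmapped)}")

print(df)
```

Writing the order out by hand matters: sorting the strings alphabetically would put "large" before "medium" and "small", which is not the real-world order.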


A model that sees size_code as 1, 2, 3 can learn that larger sizes tend to weigh more — because the numeric order matches the real-world order. One-hot encoding for size would have discarded that ordering.

A third technique, target encoding, replaces each category with the mean of the target variable for that category. It can be powerful for high-cardinality columns (hundreds of unique values), but it requires care: compute the per-category means on the training set only, then apply those same means to the test set. Computing them on the full dataset leaks target information into the features.
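
The train-only rule can be sketched as follows. The `city` and `target` columns and all values are made up, and falling back to the global training mean for unseen categories is one common choice, not the only one. Real pipelines often add smoothing or out-of-fold means for the training rows; this sketch leaves both out.

```python
import pandas as pd

# Hypothetical training and test sets; "D" appears only in the test set.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"city": ["A", "B", "D"]})

# Learn the per-category means from the training rows only.
category_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

# Apply the learned means to both sets. Categories never seen in
# training fall back to the global training mean (an assumption).
train["city_te"] = train["city"].map(category_means)
test["city_te"] = test["city"].map(category_means).fillna(global_mean)

print(test)
```

The test set never contributes to `category_means`, which is what keeps target information from leaking into the evaluation.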

Choosing the encoding

Category type | Encoding | Why
Nominal (unordered) | One-hot | No numeric order should be implied
Ordinal (ordered) | Integer map | Preserves the rank relationship
High-cardinality nominal | Target encoding (advanced) | One-hot with 500 columns is impractical

Where to go next

Next: scaling and normalisation, covering why numeric features need to be on comparable scales for some models and how min-max scaling differs from z-score standardisation.
