
What is a feature?

A feature is a numeric input to a model — and most raw data isn't numeric yet. Feature engineering bridges that gap.

Data Science · Intermediate · 6 min read
By the end of this lesson you will be able to:
  • Define "feature" in the machine learning sense
  • Distinguish raw data from an engineered feature
  • Give examples of features a model can and cannot use directly

Machine learning models are mathematical functions. They receive numbers as input, perform arithmetic on them, and return a number (such as a predicted value or a probability). That single fact has a large consequence: every piece of information you want a model to use must eventually become a number. A feature is one such numeric input: one column in the matrix of inputs, holding one value for each row.
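To make the "arithmetic on numbers" point concrete, here is a minimal sketch of a linear model applied to a small feature matrix. The feature values, weights, and bias are made up for illustration; a real model would learn its weights from data.

```python
import numpy as np

# Hypothetical feature matrix: one row per house, one column per feature.
# Columns: bedrooms, floor_area (illustrative values, not real data).
X = np.array([
    [2.0, 60.0],
    [3.0, 85.0],
    [4.0, 120.0],
])

# Made-up weights and bias; a trained model would learn these.
weights = np.array([10_000.0, 3_000.0])
bias = 50_000.0

# The "model" is just arithmetic on numbers: one prediction per row.
predictions = X @ weights + bias
print(predictions)
```

Every entry of X has to be a number for the matrix product to work, which is why non-numeric columns need converting first.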

Raw data vs features

Raw data rarely arrives in a model-ready form. Consider a housing dataset with these columns:

Column | Type | Model-ready?
price | float | Yes (already numeric)
bedrooms | int | Yes (already numeric)
neighbourhood | string | No (categories need encoding)
sale_date | date string | No (dates need extraction)
description | free text | No (text needs numerical representation)

The raw neighbourhood column is a string like "Hackney" or "Islington". A model cannot compute "Hackney" * 0.3. You must convert it. The raw sale_date is "2023-04-15". A model can use the year, month, or day-of-week as numbers — but not the date string directly.
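The two cases above can be checked with the standard library alone. This is a small sketch: it shows that multiplying a string by a float fails, and that a date string yields usable numbers once parsed. The variable names are illustrative.

```python
from datetime import date

neighbourhood = "Hackney"
try:
    neighbourhood * 0.3  # a string cannot be multiplied by a float
    multiplied = True
except TypeError:
    multiplied = False

# Parse the date string, then pull out numeric parts a model can use.
sale_date = date.fromisoformat("2023-04-15")
year = sale_date.year
month = sale_date.month
day_of_week = sale_date.weekday()  # Monday is 0, Sunday is 6

print(multiplied, year, month, day_of_week)
```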

Feature engineering

Feature engineering is the process of constructing model-ready numeric columns from raw data. It is often the step that makes the biggest difference to model performance — more so than the choice of algorithm.

Some common transformations:

From a date column: extract year, month, day_of_week, is_weekend, days_since_reference. Each becomes a separate feature.
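A minimal pandas sketch of these date features follows. The sample dates and the reference date are assumptions chosen for illustration; in practice the reference might be the start of the dataset or some domain-specific event.

```python
import pandas as pd

# Illustrative sale dates, stored as strings like the raw column.
df = pd.DataFrame({"sale_date": ["2023-04-15", "2023-04-17", "2022-12-31"]})
dates = pd.to_datetime(df["sale_date"])

df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day_of_week"] = dates.dt.dayofweek  # Monday is 0, Sunday is 6
df["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)

# Assumption: an arbitrary reference date, picked only for this example.
reference = pd.Timestamp("2022-01-01")
df["days_since_reference"] = (dates - reference).dt.days

print(df)
```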

From a string column: compute word_count, char_length, or whether a keyword appears (has_keyword). For category columns, apply one-hot encoding or ordinal encoding (covered in the next lesson).
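The string features could be computed along these lines. The descriptions and the keyword "garden" are hypothetical, and the keyword check is a plain case-insensitive substring match rather than anything language-aware.

```python
import pandas as pd

# Illustrative free-text descriptions (not real listings).
df = pd.DataFrame({
    "description": [
        "Bright flat with garden",
        "Studio near station",
        "Large house, private garden and garage",
    ]
})

df["char_length"] = df["description"].str.len()
df["word_count"] = df["description"].str.split().str.len()

# Hypothetical keyword; case-insensitive literal substring match.
df["has_keyword"] = (
    df["description"].str.contains("garden", case=False, regex=False).astype(int)
)

print(df[["char_length", "word_count", "has_keyword"]])
```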

From numeric columns: compute interactions (age * income), polynomial terms (age ** 2), bins (pd.cut(age, bins=5)), or ratios (price / floor_area).
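Here is a short sketch of the four numeric transformations on one small frame. The values are invented, and combining age, income, price, and floor_area in a single table is only a convenience for the example.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({
    "age": [20, 35, 50, 65],
    "income": [20_000, 40_000, 60_000, 30_000],
    "price": [200_000, 300_000, 450_000, 250_000],
    "floor_area": [50.0, 75.0, 90.0, 50.0],
})

# Interaction: product of two columns.
df["age_x_income"] = df["age"] * df["income"]

# Polynomial term: in Python, ^ is bitwise XOR, so squaring uses **.
df["age_squared"] = df["age"] ** 2

# Bins: five equal-width bins; labels=False returns the bin index (0 to 4).
df["age_bin"] = pd.cut(df["age"], bins=5, labels=False)

# Ratio: price per unit of floor area.
df["price_per_area"] = df["price"] / df["floor_area"]

print(df)
```

Passing labels=False keeps the binned column numeric; without it, pd.cut returns interval categories that would themselves need encoding.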

From a birth year: compute age = current_year - birth_year. Age is a linear transformation of birth year, so it carries the same information for the model, but it is easier to interpret.
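As a quick sketch, the age feature and its relationship to birth year can be checked directly. The birth years are invented, and current_year is fixed by hand here (an assumption) so the result is reproducible.

```python
import pandas as pd

# Illustrative birth years.
df = pd.DataFrame({"birth_year": [1990, 1975, 2001]})

# Assumption: a hard-coded year keeps the example reproducible.
current_year = 2024
df["age"] = current_year - df["birth_year"]

# Age is a linear function of birth year, so the two are perfectly
# (negatively) correlated: they carry the same information.
correlation = df["age"].corr(df["birth_year"])

print(df)
print(correlation)
```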

Domain knowledge drives feature engineering. A model does not know that "day_of_week" matters for a retail sales dataset — you have to hypothesise it, create the feature, and then evaluate whether it helps. This is what makes feature engineering a creative as well as a technical process.
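The hypothesise, create, evaluate loop might look like the sketch below. Everything here is synthetic: the sales series is generated with a built-in weekend effect, the model is plain least squares, and holdout_mae is a hypothetical helper written for this example. A real evaluation would use your own data and a proper validation scheme.

```python
import numpy as np

# Synthetic data: four weeks of daily sales with a weekend uplift baked in.
rng = np.random.default_rng(0)
day_of_week = np.arange(28) % 7  # Monday is 0, Sunday is 6
is_weekend = (day_of_week >= 5).astype(float)
sales = 100 + 40 * is_weekend + rng.normal(0, 5, size=28)

# First three weeks for fitting, last week held out for evaluation.
train, test = slice(0, 21), slice(21, 28)

def holdout_mae(features):
    """Fit least squares on the training rows; return mean absolute error on held-out rows."""
    A = np.column_stack([np.ones(28), *features])  # intercept plus any features
    coef, *_ = np.linalg.lstsq(A[train], sales[train], rcond=None)
    return float(np.mean(np.abs(A[test] @ coef - sales[test])))

mae_without = holdout_mae([])           # baseline: intercept only
mae_with = holdout_mae([is_weekend])    # baseline plus the hypothesised feature

print(mae_without, mae_with)
```

If the held-out error drops when the feature is added, the hypothesis has some support; if it does not, the feature can be dropped.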

What models cannot use directly

  • Strings (including category labels)
  • Dates and datetimes (as objects)
  • Lists or nested structures
  • Columns with inconsistent types (e.g. some rows are None, some are "N/A", some are a float — the model needs one consistent type)

Any of these must be cleaned and transformed before a model sees them.
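For the inconsistent-types case, one possible first step is to coerce the column to a single numeric type. This sketch uses pd.to_numeric with errors="coerce", which turns anything unparseable into NaN; the sample values are invented.

```python
import pandas as pd

# A column with mixed content: None, a placeholder string, a float, a numeric string.
raw = pd.Series([None, "N/A", 72.5, "85"], dtype="object")

# Coerce to one consistent numeric type; unparseable entries become NaN.
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.isna().sum(), "missing values")
```

The column is now a single float type, but the missing values remain and still have to be handled before most models will accept it.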

Where to go next

Next: encoding categoricals — converting string category columns into numbers using one-hot encoding and ordinal encoding in pandas.
