Code of the Day

Data quality

Missing values, duplicates, wrong types, and outliers contaminate almost every real dataset — learn to recognise them before they corrupt your analysis.

Data Science · Beginner · 6 min read
By the end of this lesson you will be able to:
  • Recognise the four main data quality problems in a dataset
  • Explain why "garbage in, garbage out" makes data cleaning non-optional
  • Describe a concrete example of each problem

Real data is never clean. It comes from humans clicking checkboxes carelessly, sensors that lose signal, databases that allow optional fields, and systems that were never designed to talk to each other. A data scientist's first job — before any modelling or visualisation — is to find and fix the problems. Skip this step and every result downstream is suspect.

The principle is blunt: garbage in, garbage out. A perfectly written analysis applied to bad data produces confidently wrong answers.

There are four problems you will encounter in almost every dataset.

1. Missing values

A field that should have a value does not. pandas represents these as NaN (Not a Number) for numeric columns and None for objects. They appear for many reasons: a survey respondent skipped a question, a sensor failed for one reading, a join to another table found no match.

Why they matter: arithmetic on NaN propagates — 5 + NaN is NaN. A mean calculated over a column with missing values silently excludes them (which may or may not be what you want). A machine learning model will often refuse to run if any input contains NaN.

Example: a customer table where age is blank for users who signed up via a third-party OAuth and never completed their profile.
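
To make this concrete, here is a minimal sketch of how missing values behave in pandas. The customers table, its column names, and its values are invented for illustration; only the pandas and NumPy calls are real.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; names and values are invented for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34.0, np.nan, 27.0, np.nan],  # NaN = profile never completed
})

# Arithmetic with NaN propagates the missing value.
total = 5 + np.nan
print(total)  # nan

# Count missing values per column before doing anything else.
missing_per_column = customers.isna().sum()
print(missing_per_column)

# mean() skips NaN by default (skipna=True);
# skipna=False makes the missing values propagate instead.
mean_skipping = customers["age"].mean()
mean_strict = customers["age"].mean(skipna=False)
print(mean_skipping, mean_strict)  # 30.5 nan
```

The default mean of 30.5 is computed from only two of the four customers, which is easy to miss unless you count the missing values first.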

2. Duplicates

The same observation appears more than once. This happens when data is merged from multiple sources, when a bug causes events to be logged twice, or when a user submits a form twice.

Why they matter: counts and sums are inflated, and averages are skewed toward whichever rows were repeated. If you count orders and five orders appear twice, your count is off by five. The error is often invisible: the numbers look plausible, just wrong.

Example: a sales log where a network timeout caused the payment system to retry, inserting the same transaction twice.
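
A small sketch of how a retried transaction inflates a total, and how it might be spotted. The sales table and its transaction IDs are hypothetical, and the check assumes a retry produces a fully identical row.

```python
import pandas as pd

# Hypothetical sales log; transaction T2 was inserted twice by a retry.
sales = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T2", "T3"],
    "amount": [20.0, 50.0, 50.0, 30.0],
})

# duplicated() marks each repeat of an earlier row; the first copy is not marked.
duplicate_mask = sales.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Compare the same aggregate with and without the repeated rows.
inflated_total = sales["amount"].sum()
deduplicated_total = sales.drop_duplicates()["amount"].sum()

print(n_duplicates, inflated_total, deduplicated_total)  # 1 150.0 100.0
```

Both totals look plausible on their own; only the comparison reveals that one of them is wrong.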

3. Wrong types

A column that should be numeric contains strings. A date column contains the text "N/A" for unknown dates, forcing the whole column to be stored as strings.

Why they matter: you cannot do arithmetic on a string. pandas will happily let you store "120" in an object column, but in recent pandas versions df["amount"].mean() raises a TypeError rather than giving you the average. The column looks fine until you try to use it.

Example: a CSV exported from a legacy system where the price column occasionally contains "-" for items that were free, making pandas read the whole column as text.
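
The legacy-export scenario can be sketched as follows. The CSV content and column names are made up for this example; pd.to_numeric with errors="coerce" is one common way to locate the offending entries, not the only one.

```python
import io
import pandas as pd

# Hypothetical CSV export; "-" stands in for a free item.
csv_text = "item,price\nmug,12.50\nsticker,-\nposter,8.00\n"
products = pd.read_csv(io.StringIO(csv_text))

# The stray "-" stops pandas from reading the column as numbers.
is_numeric_before = pd.api.types.is_numeric_dtype(products["price"])

# errors="coerce" turns entries that cannot be parsed into NaN,
# so they can be found and counted.
parsed = pd.to_numeric(products["price"], errors="coerce")
unparseable = products.loc[parsed.isna(), "price"].tolist()

print(is_numeric_before)  # False
print(unparseable)        # ['-']
```

Listing the unparseable entries before converting shows what would be lost, so you can decide whether "-" should become 0, a missing value, or something else.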

4. Outliers

Values that are technically present and correctly typed, but implausibly extreme. An age of 300. A transaction amount of -50 000. A temperature reading of 9999 (a common sentinel value for sensor error).

Why they matter: outliers distort statistics, especially the mean and standard deviation. A single data entry error of 10x the correct value can shift a mean substantially. More dangerously, outliers can look like interesting findings rather than errors — and you may not notice without checking.

Not every outlier is an error. A genuine best-seller may have 100x the sales of a typical product. The discipline is to flag extreme values, investigate them, and make a deliberate decision — not to automatically delete anything that looks unusual.
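
As a rough illustration of "flag, then investigate", the sketch below shows one bad age distorting the mean and a simple range check that flags it. The ages and the 0 to 120 bounds are assumptions chosen for this example; the right bounds depend on the domain.

```python
import pandas as pd

# Hypothetical ages; 300 is a data entry error.
ages = pd.Series([23, 31, 28, 35, 300], name="age")

# One extreme value pulls the mean far from the typical value;
# the median is much less affected.
mean_age = ages.mean()
median_age = ages.median()

# Flag values outside a plausible range instead of deleting them.
# The 0-120 bounds are an assumption made for this example.
is_implausible = ~ages.between(0, 120)
flagged = ages[is_implausible]

print(mean_age, median_age)  # 83.4 31.0
print(flagged.tolist())      # [300]
```

The flagged rows are kept, not dropped: a range check like this only tells you where to look, and a genuine extreme value would need a different decision than a typo.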

Check your understanding

Knowledge check

  1. Which of these are data quality problems you should check for?
  2. True or false: every outlier in a dataset is a data entry error and should be deleted.
  3. What does 5 + NaN evaluate to in pandas?

Where to go next

Next: cleaning data — the hands-on counterpart to this lesson. You will use pandas to drop missing values, remove duplicates, and fix column types in a real (if small) dirty DataFrame.
