Inspecting with pandas

Create a DataFrame from a dict and use .head(), .shape, .dtypes, and .describe() to understand your data before touching it.

Data Science · Beginner · 10 min read
By the end of this lesson you will be able to:
  • Create a pandas DataFrame from a Python dict
  • Use .head() to preview the first rows
  • Use .shape and .dtypes to understand the structure
  • Use .describe() to get a statistical summary

pandas is the central library for tabular data in Python. Its core data structure is the DataFrame: a two-dimensional table with labelled rows and columns, backed by efficient NumPy arrays. The mental model is: a DataFrame is a dict of columns, where every column has the same length.

Before you transform any data, the first move is always to look at it. pandas provides four tools (two methods, .head() and .describe(), and two attributes, .shape and .dtypes) that together answer "what do I have?"
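As a minimal sketch of all four inspections in one place, the example below builds a small DataFrame from a dict. The column names and values are made up for illustration; any dict of equal-length lists would work the same way.

```python
import pandas as pd

# Made-up sample data: each key becomes a column name,
# and each list holds that column's values (all the same length).
data = {
    "name": ["Ada", "Ben", "Cleo", "Dev"],
    "age": [34, 27, 45, 31],
    "height_m": [1.68, 1.80, 1.75, 1.62],
    "member": [True, False, True, True],
}

df = pd.DataFrame(data)

print(df.head())      # first rows (up to 5 by default)
print(df.shape)       # (rows, columns); an attribute, so no parentheses
print(df.dtypes)      # inferred type of each column; also an attribute
print(df.describe())  # summary statistics for the numeric columns
```

Here .shape reports 4 rows and 4 columns, and .describe() summarises only age and height_m, because name and member are not numeric.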


What each method tells you

.head(n=5) returns the first n rows. It is always your first call — a sanity check that the data loaded the way you expected. Did the columns come through? Are the values in the right fields? Does anything look obviously wrong?

.shape is a tuple (rows, columns). Check this early. If you expected 10 000 rows and you have 10, something went wrong upstream. If you have 50 columns when you expected 5, you may have loaded the wrong file.
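One way to turn that check into a habit is to unpack the tuple and compare it against what you expected. In this sketch the DataFrame and the expected column count are hypothetical stand-ins for your own data.

```python
import pandas as pd

# Hypothetical data standing in for a freshly loaded file.
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})

# .shape is a plain tuple, so it unpacks into two integers.
n_rows, n_cols = df.shape

expected_cols = 2  # assumption: what you believe the file should contain
if n_cols != expected_cols:
    print(f"Expected {expected_cols} columns, found {n_cols}")

print(f"{n_rows} rows x {n_cols} columns")
```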

.dtypes lists each column and its inferred type. Common types:

  • int64 — whole numbers
  • float64 — decimals
  • object — usually strings (or a mixed column)
  • bool — True/False

If a numeric column shows as object, that is a red flag: numbers have been read as text, and arithmetic on them will either fail or silently give the wrong result (adding the strings "10" and "20" gives "1020", not 30). You will fix this in the next lesson.
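A small sketch of how you might spot that red flag programmatically: list the columns you expect to be numeric and check each one with pd.api.types.is_numeric_dtype. The data and the expected_numeric list are assumptions for illustration, and the example only detects the problem rather than fixing it.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data: "price" holds numbers that were read in as text.
df = pd.DataFrame({
    "price": ["10", "20", "30"],
    "qty": [1, 2, 3],
})

print(df.dtypes)

# Assumption: these are the columns you believe should be numeric.
expected_numeric = ["price", "qty"]

# Keep only the expected-numeric columns that pandas did not infer as numeric.
flagged = [col for col in expected_numeric if not is_numeric_dtype(df[col])]
print(flagged)
```

Only price is flagged here, because qty was built from integers and inferred as a numeric type.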

.describe() gives count, mean, standard deviation, min, quartiles, and max for every numeric column. Glance at the min and max — extreme values often signal data entry errors or unit mismatches (a temperature in Celsius alongside a temperature in Fahrenheit, for example).
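To see how a unit mismatch shows up, the sketch below uses made-up temperature readings where one value was plausibly recorded in Fahrenheit. The column name and numbers are illustrative only.

```python
import pandas as pd

# Made-up readings: most look like Celsius, one looks like Fahrenheit.
df = pd.DataFrame({"temp": [21.5, 19.8, 22.1, 71.6, 20.4]})

summary = df.describe()
print(summary)

# Pull out just the extremes for a quick glance.
print(summary.loc[["min", "max"], "temp"])
```

The max of 71.6 sits far above the other readings, which is the kind of gap worth investigating before any analysis.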

.describe() only covers numeric columns by default. Pass include="all" to also get the count, the number of unique values, and the most frequent value for string columns. This is useful when checking a categorical column for unexpected values.
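As an illustration, the sketch below describes a frame with one string column and one numeric column. The data is made up, including a deliberately inconsistent lowercase "m" in the size column.

```python
import pandas as pd

# Made-up data with a stray lowercase "m" among the sizes.
df = pd.DataFrame({
    "size": ["S", "M", "M", "L", "m"],
    "price": [9.0, 12.0, 12.0, 15.0, 12.0],
})

full = df.describe(include="all")
print(full)

# For string columns, look at the unique / top / freq rows.
print(full.loc[["unique", "top", "freq"], "size"])
```

If you expected three sizes, a unique count of 4 is the hint that an unexpected value slipped in. Rows that do not apply to a column (such as mean for size) show up as NaN.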

Where to go next

Now you can see what you have. Next: data quality — the four problems (missing values, duplicates, wrong types, outliers) that contaminate almost every real dataset.
