Inspecting with pandas
Create a DataFrame from a dict and use .head(), .shape, .dtypes, and .describe() to understand your data before touching it.
- Create a pandas DataFrame from a Python dict
- Use .head() to preview the first rows
- Use .shape and .dtypes to understand the structure
- Use .describe() to get a statistical summary
pandas is the central library for tabular data in Python. Its core data
structure is the DataFrame — a two-dimensional table with labelled rows and
columns, backed by efficient NumPy arrays. The mental model is: a DataFrame is
a dict of columns, where every column is a list of the same length.
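The dict-of-columns model can be sketched as follows. The column names and values here are hypothetical sample data, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column name,
# and each list holds that column's values (all the same length).
data = {
    "city": ["Oslo", "Lima", "Cairo"],
    "temp_c": [4.5, 19.0, 28.3],
    "visits": [12, 7, 30],
}

df = pd.DataFrame(data)
print(df)

# Selecting by key returns one column, much like indexing the dict.
temps = df["temp_c"]
print(temps.tolist())
```

If the lists have different lengths, pandas raises a ValueError rather than padding the shorter ones.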
Before you transform any data, the first move is always to look at it. pandas provides two methods (.head() and .describe()) and two attributes (.shape and .dtypes) that together answer "what do I have?"
What each method tells you
.head(n=5) returns the first n rows. It is always your first call — a
sanity check that the data loaded the way you expected. Did the columns come
through? Are the values in the right fields? Does anything look obviously wrong?
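A minimal sketch of that first look, using a made-up orders table (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical table with 100 rows.
df = pd.DataFrame({
    "order_id": list(range(1, 101)),
    "amount": [round(i * 1.5, 2) for i in range(1, 101)],
})

first_five = df.head()    # default: the first 5 rows
first_three = df.head(3)  # pass n to change how many rows you get

print(first_five)
print(first_three)
```

.head() returns a new, smaller DataFrame and leaves df itself unchanged.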
.shape is a tuple (rows, columns). Check this early. If you expected
10 000 rows and you have 10, something went wrong upstream. If you have 50 columns
when you expected 5, you may have loaded the wrong file.
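Because .shape is a plain tuple, it can be unpacked and compared against what you expected. In this sketch the data and the expected row count are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "b", "c", "d"],
    "reading": [0.1, 0.4, 0.3, 0.9],
})

# .shape is an attribute, not a method: no parentheses.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Hypothetical expectation about the source file.
expected_rows = 4
if n_rows != expected_rows:
    print("Row count differs from what was expected; check the load step.")
```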
.dtypes lists each column and its inferred type. Common types:
- int64: whole numbers
- float64: decimals
- object: usually strings (or a mixed column)
- bool: True/False
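If a numeric column shows as object, that is a red flag: numbers have been read
as text, and arithmetic on them will either raise an error or silently do the
wrong thing (adding two text columns concatenates them instead of summing).
You will fix this in the next lesson.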
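The sketch below builds a small hypothetical table in which one column that should be numeric was entered as text, then uses .dtypes to spot it. Depending on the pandas version installed, the text column may be reported as object or as a dedicated string type, so the check relies on pd.api.types.is_numeric_dtype rather than on the exact type name.

```python
import pandas as pd

# Hypothetical data: "discount" should be numeric but arrived as text.
df = pd.DataFrame({
    "units": [3, 5, 2],
    "price": [9.99, 4.50, 12.00],
    "in_stock": [True, False, True],
    "discount": ["10", "15", "n/a"],
})

# One entry per column: the column name and its inferred type.
print(df.dtypes)

# List the columns pandas does not treat as numeric.
# (bool columns count as numeric for this check.)
non_numeric = [
    col for col in df.columns
    if not pd.api.types.is_numeric_dtype(df[col])
]
print(non_numeric)
```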
.describe() gives count, mean, standard deviation, min, quartiles, and max
for every numeric column. Glance at the min and max — extreme values often signal
data entry errors or unit mismatches (a temperature in Celsius alongside a
temperature in Fahrenheit, for example).
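As an illustration of the unit-mismatch case, this sketch uses made-up temperature readings in which one value appears to be in Fahrenheit. The summary is itself a DataFrame, so individual statistics can be read by label.

```python
import pandas as pd

# Hypothetical readings: mostly Celsius, with one suspicious value.
df = pd.DataFrame({"temp": [21.0, 19.5, 22.0, 70.0, 20.5]})

summary = df.describe()
print(summary)

# Rows of the summary are labelled by statistic name.
lowest = summary.loc["min", "temp"]
highest = summary.loc["max", "temp"]
print(f"min={lowest}, max={highest}")
```

A maximum far from the quartiles, as here, is the kind of value worth tracing back to its source before any analysis.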
.describe() only covers numeric columns by default. Pass include="all" to
also get the count, number of distinct values, and most frequent value for
string columns. This is useful when checking a categorical column for
unexpected values.
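A short sketch of include="all" on hypothetical order data, where an inconsistent capitalisation creates an unexpected extra category:

```python
import pandas as pd

# Hypothetical data: "Shipped" is a stray variant of "shipped".
df = pd.DataFrame({
    "status": ["shipped", "shipped", "pending", "Shipped"],
    "amount": [10.0, 12.5, 8.0, 11.0],
})

full = df.describe(include="all")
print(full)

# For text columns the summary adds unique, top, and freq rows.
n_distinct = full.loc["unique", "status"]
most_common = full.loc["top", "status"]
print(n_distinct, most_common)
```

Statistics that do not apply to a column (such as mean for status) appear as NaN in the combined table.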
Where to go next
Now you can see what you have. Next: data quality — the four problems (missing values, duplicates, wrong types, outliers) that contaminate almost every real dataset.