Pandas: Tips to clean and describe data

Confession: I have a love-hate relationship with Pandas (the library, not the furry mammal). In my limited experience, analysing data with Pandas tends to involve about 85% Googling, 13% Pythoning and 2% recall. Pandas is so unintuitive, and I use it at such long intervals, that my working memory of the code is usually poor. Even though I’ve blogged about it before! (see parts 1 and 2)

For my future self, and for you the reader, here are some code snippets to combine datasets, clean and check for data entry errors, and export data.

Datasets for this exercise

I adapted the open-source hsb2 dataset to create 4 separate datasets. The hsb2 dataset contains demographic information and standardised test scores of US high school students. Notes on modified datasets:

  • The original hsb2 dataset was split into male and female cohorts, stored in 2 CSV files
  • Dummy age data for each student id were created and stored in 2 separate CSV files
  • In hsb2_male.csv, the read score for id=7 was set to missing, and the read scores for race=2 were set to 0

The aim is to combine the datasets into a single large dataset, keeping only students who meet certain conditions, then clean and check the dataframe for missing or implausible values, and export summary data to CSV and TXT.

The 4 CSV files are available for download as pandas_clean_data.zip from here.

Variables and data codes

num  variable  label                  codes
1    id        ID number              3-digit number
2    female    Female                 0=male, 1=female
3    race      Race or Ethnicity      1=hispanic, 2=asian, 3=african-american, 4=white
4    ses       Socioeconomic status   1=low, 2=medium, 3=high
5    schtyp    School type            1=public, 2=private
6    prog      High school program    1=general, 2=academic preparatory, 3=vocational/technical
7    read      Standardized reading score
8    write     Standardized writing score
9    math      Standardized math score
10   science   Standardized science score
11   socst     Standardized social studies score

Python code to analyse data

Read in male and female datasets

"""
Code runs on Python v3.8, Pandas v1.2.1
"""
import pandas as pd
import numpy as np

df_m = pd.read_csv("hsb2_male.csv")
df_f = pd.read_csv("hsb2_female.csv")

df_m_ages = pd.read_csv("hsb2_male_ages.csv")
df_f_ages = pd.read_csv("hsb2_female_ages.csv")
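
A quick sanity check I like to do straight after loading, before combining anything, is to print the shape and column dtypes of each frame. A small sketch using the four frames loaded above:

# print shape and column dtypes of each input frame
for name, frame in [("male", df_m), ("female", df_f),
                    ("male ages", df_m_ages), ("female ages", df_f_ages)]:
    print(name, frame.shape)
    print(frame.dtypes)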

Combine male and female dataframes, for scores and ages

# Concatenate male and female cohorts
df_all = pd.concat([df_m, df_f], axis=0, ignore_index=True)
df_ages = pd.concat([df_m_ages, df_f_ages], axis=0, ignore_index=True)
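
One thing worth checking after a concat is that the row counts add up and that ids are not duplicated across the two cohorts. A quick sketch, assuming id should be unique across students:

# rows in the combined frames should equal the sum of the parts
assert len(df_all) == len(df_m) + len(df_f)
assert len(df_ages) == len(df_m_ages) + len(df_f_ages)

# ids should appear only once across both cohorts
print(df_all["id"].duplicated().sum(), "duplicate ids in df_all")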

Merge the combined dataframes into a single dataframe, linked by id

# Based on id, merge the scores and ages datasets, keeping only rows where schtyp == 1 and ses == 2
# `outer` uses union of keys from both frames and assigns new index
df = pd.merge(
    left=df_all[(df_all["schtyp"] == 1) & (df_all["ses"] == 2)],
    right=df_ages,
    how="outer",
    on="id",
)

df = df.rename(columns={"female": "sex"})  # 0 = male, 1 = female

# drop rows of students who were not included in the merge
df = df.dropna(axis=0, subset=["schtyp", "ses"])
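
As an aside: because the left frame is already filtered on schtyp and ses, the outer merge followed by dropna() ends up keeping exactly the rows of the filtered scores frame. A left merge is a more direct way to get the same result (a sketch, not what the recipe above uses):

# `left` keeps every row of the filtered scores frame and attaches ages where available
df_left = pd.merge(
    left=df_all[(df_all["schtyp"] == 1) & (df_all["ses"] == 2)],
    right=df_ages,
    how="left",
    on="id",
).rename(columns={"female": "sex"})

print(len(df_left), len(df))  # should print the same number twice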

Check by columns and rows for missing or implausible values, and strings that can’t be converted into numbers

df.info()

# convert data from str to int where possible, and flag columns that cannot be converted
# note, `df.apply(pd.to_numeric, errors='coerce').info()` sometimes silently
# does not convert str to numeric
for col in df.columns:
    print(col)
    try:
        df[col] = df[col].astype(int)
    except ValueError:
        print("  > str or NaN data present")
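
# if you want to see exactly which raw values would fail conversion, compare the
# column before and after coercion (a sketch; run this before the conversion below)
raw_read = df["read"].copy()
coerced = pd.to_numeric(raw_read, errors="coerce")
print(raw_read[coerced.isna() & raw_read.notna()])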

# convert `df['read']` to numeric
# with `coerce`, all non-convertible strings are stored as NaNs
df["read"] = pd.to_numeric(df["read"], errors="coerce")

# group by sex, describe all variables, export to CSV, then describe read scores only
df.groupby(df["sex"]).describe().rename(index={0: "male", 1: "female"}).to_csv(
    "summarise.csv"
)
df["read"].groupby(df["sex"]).describe().rename(index={0: "male", 1: "female"})

# note, `df['read']` has min = 0: this is unusual and probably a data entry error
# check which rows have read == 0 or NaN using `.loc[]`
# (`.loc[]` selects rows by label or boolean condition; `.iloc[]` selects by integer position)
print(df.loc[df["read"] == 0])
print(df.loc[df["read"].isnull()])

# the read == 0 scores are the male (sex = 0) students with race = 2: set these to NaN
# note, read is already NaN for id = 7
# finally, check which rows have read == NaN
df.loc[(df["sex"] == 0) & (df["race"] == 2), "read"] = np.nan
print(df.loc[df["read"].isnull()])

Summarise (describe) data and write to TXT as a log

file = "summarise.txt"
open(file, "w").close()
with open(file, "a") as file:
    # Sort by sex, count missing values for any variable
    file.write("\n>>> Describe read scores, sorted by sex\n")
    file.write(
        df["read"]
        .groupby(df["sex"])
        .describe()
        .rename(index={0: "male", 1: "female"})
        .to_string()
    )
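
The same pattern works for any other summary you want in the log. For example, appending a missing-value count per column to the same file (a sketch extending the snippet above):

with open(file, "a") as f:
    f.write("\n\n>>> Missing values per column\n")
    f.write(df.isnull().sum().to_string())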

Summary

These snippets show how to concatenate and then merge dataframes, check for missing values or stray strings, summarise data by group, and export summary data to CSV or TXT.

The code here is by no means comprehensive (and I’ll still be Googling around). But hopefully it helps you figure out how to convert between data types (e.g. with to_numeric() or astype()) and select data conditionally (e.g. with groupby or .loc[]). Remember to wrap each condition in its own parentheses () when filtering on more than 1 variable.
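
For example, the parentheses matter because & binds more tightly than == in Python:

# correct: each condition wrapped in its own parentheses
df[(df["sex"] == 0) & (df["race"] == 2)]

# incorrect: without parentheses this raises an error about ambiguous truth values
# df[df["sex"] == 0 & df["race"] == 2]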
