Manipulating data with Pandas – Part 2

Previously, we used Pandas to read and write CSV files and to reshape data from wide to long format, and we used the Seaborn package to plot paired data. Here, we will summarise some data and write the results to a CSV file.
The CSV file data_pandas.csv (available here) contains simulated data of age, height and score from 20 subjects. To begin, read in and store the CSV data in the dataframe df. For now, set the keyword argument na_filter to False. Create a header using names, since the CSV data do not have a header. How has Pandas recognised the data? Print the top 5 lines in the dataframe, and identify how the score data are stored.
Output is shown after IPython-style Out[1]: prompts. Notice that subject 1 has a missing value for score, and that the data are stored as strings. By default, na_filter is True, so that numbers are stored as floating-point or integer values, and Pandas recognises missing data formatted in different ways (e.g. NA, NaN, etc.). Setting na_filter to False may be handy if you need to analyse the data as text rather than as numbers.
import numpy as np
import pandas as pd
# Read in dataset, create header names
df = pd.read_csv('data_pandas.csv', na_filter=False,
                 names=['id', 'age', 'height', 'score'])
print(df.head())
print('score data type:', type(df.score.values[0]))
Out[1]:
id age height score
0 0 55 1.62 1.8879217058
1 1 56 1.67 NA
2 2 72 1.68 3.2614944103
3 3 68 1.69 1.8063051011
4 4 58 1.63 2.8550428059
score data type: <class 'str'>
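If you do keep the data as text (na_filter=False), a column can still be converted to numbers afterwards with pd.to_numeric. A minimal sketch, using a toy Series in place of the real score column (the values below are illustrative, not taken from data_pandas.csv):

```python
import pandas as pd

# scores read as strings, with 'NA' marking a missing value
scores = pd.Series(['1.8879217058', 'NA', '3.2614944103'])

# errors='coerce' turns any unparseable string into NaN
numeric = pd.to_numeric(scores, errors='coerce')
print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 1
```

This gives the same end result as re-reading the file with na_filter left at its default.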
Now, read in the CSV data again to overwrite df. This time, detect missing values and check that the data are stored as numbers. Create and assign subjects a group id such that the first 10 subjects are in group 0 and the next 10 are in group 1. Some statistical functions can't handle missing values, so we will drop subject 1, who had a missing score. Print and check the whole dataframe.
This time, Pandas recognises the missing value for subject 1, and the score data are stored as numbers. On printing the dataframe, subjects are now assigned a group of 0 or 1, and subject 1 has been dropped.
# Read in dataset, detect NA
df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])
print(df.head())
print('score data type:', type(df.score.values[0])) # na_filter=True, by default
# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(len(df) // 2),
                              np.ones(len(df) // 2)))
df.group = df.group.astype(int)
# drop NA
df = df.dropna()
print(df)
Out[2]:
id age height score
0 0 55 1.62 1.887922
1 1 56 1.67 NaN
2 2 72 1.68 3.261494
3 3 68 1.69 1.806305
4 4 58 1.63 2.855043
score data type: <class 'numpy.float64'>
id age height score group
0 0 55 1.62 1.887922 0
2 2 72 1.68 3.261494 0
3 3 68 1.69 1.806305 0
4 4 58 1.63 2.855043 0
5 5 55 1.67 3.189182 0
6 6 68 1.64 0.631574 0
7 7 61 1.65 3.451023 0
8 8 75 1.79 3.238747 0
9 9 61 1.66 1.769695 0
10 10 57 1.68 3.816165 1
11 11 67 1.60 1.782659 1
12 12 66 1.62 3.763942 1
13 13 62 1.70 4.406709 1
14 14 68 1.75 0.058346 1
15 15 63 1.75 2.490545 1
16 16 66 1.67 0.368960 1
17 17 67 1.79 3.934757 1
18 18 66 1.70 0.320337 1
19 19 75 1.74 1.776552 1
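As an aside, the 0/0/…/1/1 group labels can also be built with np.repeat, which some readers may find easier to read than concatenating arrays of zeros and ones. A minimal sketch on a toy dataframe standing in for our 20 subjects:

```python
import numpy as np
import pandas as pd

# toy frame standing in for the 20-subject dataframe
df = pd.DataFrame({'id': range(20)})

# repeat each label len(df)//2 times: ten 0s followed by ten 1s
df['group'] = np.repeat([0, 1], len(df) // 2)
print(df.group.tolist()[:3], df.group.tolist()[-3:])  # [0, 0, 0] [1, 1, 1]
```

Because np.repeat takes integer labels directly, no astype(int) step is needed afterwards.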
Now, summarise the data for each group using the groupby and describe functions. Print the summary, and also write it to the file describe.csv.
# summarise data from each group
summary = df.groupby(df.group).describe()
print(summary)
summary.to_csv('describe.csv')
Out[3]:
age height id score
group
0 count 9.000000 9.000000 9.000000 9.000000
mean 63.666667 1.670000 4.888889 2.454554
std 7.348469 0.050498 2.934469 0.967776
min 55.000000 1.620000 0.000000 0.631574
25% 58.000000 1.640000 3.000000 1.806305
50% 61.000000 1.660000 5.000000 2.855043
75% 68.000000 1.680000 7.000000 3.238747
max 75.000000 1.790000 9.000000 3.451023
1 count 10.000000 10.000000 10.000000 10.000000
mean 65.700000 1.700000 14.500000 2.271897
std 4.620005 0.060000 3.027650 1.659839
min 57.000000 1.600000 10.000000 0.058346
25% 63.750000 1.672500 12.250000 0.720858
50% 66.000000 1.700000 14.500000 2.136602
75% 67.000000 1.747500 16.750000 3.803109
max 75.000000 1.790000 19.000000 4.406709
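Note that describe.csv has two header rows, because describe() on a grouped dataframe produces a column MultiIndex of (variable, statistic) pairs. To read such a file back, pass header=[0, 1] to read_csv. A minimal sketch, using a toy dataframe and an in-memory buffer in place of the real file:

```python
import io

import pandas as pd

# toy data standing in for df: two groups, one measure
df = pd.DataFrame({'group': [0, 0, 1, 1],
                   'score': [1.0, 3.0, 2.0, 4.0]})
summary = df.groupby('group').describe()

# round-trip through CSV: the two header rows become a column MultiIndex
buf = io.StringIO()
summary.to_csv(buf)
buf.seek(0)
back = pd.read_csv(buf, header=[0, 1], index_col=0)
print(back[('score', 'mean')].tolist())  # [2.0, 3.0]
```

Columns of the restored dataframe are addressed with (variable, statistic) tuples, e.g. back[('score', 'mean')].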
The complete code to generate these outputs is:
import numpy as np
import pandas as pd
# Read in dataset, create header names
df = pd.read_csv('data_pandas.csv', na_filter=False,
                 names=['id', 'age', 'height', 'score'])
print(df.head())
print('score data type:', type(df.score.values[0]))
# Read in dataset, detect NA
df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])
print(df.head())
print('score data type:', type(df.score.values[0])) # na_filter=True, by default
# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(len(df) // 2),
                              np.ones(len(df) // 2)))
df.group = df.group.astype(int)
# drop NA
df = df.dropna()
print(df)
# summarise data from each group
summary = df.groupby(df.group).describe()
print(summary)
summary.to_csv('describe.csv')
Summary
We read in CSV data and detected missing values, assigned subjects to different groups, and summarised data in each group. Finally, we wrote the summarised data to a CSV file.
In the next post, we will learn how to locate data using labels and tabulate categorical data.