Manipulating data with Pandas – Part 2

Previously, we used Pandas to read and write data to CSV files, reshape data from wide to long format, and used the Seaborn package to plot paired data. Here, we will summarise some data and write the results to a CSV file.

The CSV file data_pandas.csv (available here) contains simulated data of age, height and score from 20 subjects. To begin, read in and store the CSV data in the dataframe df. For now, set the keyword argument na_filter to False. Create a header using names since the CSV data do not have a header. How has Pandas recognised the data? Print the top 5 lines in the dataframe, and identify how score data are stored.

Output are shown after IPython style Out[1]: prompts. Notice that subject 1 has a missing value for score, where data are stored as strings. By default, na_filter is True so that numbers are stored as floating point or integer numbers, and Pandas recognises missing data formatted differently (e.g. NA, NaN, etc.). This may be handy if you need to analyse the data as text, rather than numbers.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Read in dataset, create header names

df = pd.read_csv('data_pandas.csv', na_filter=False,
                 names=['id', 'age', 'height', 'score'])

print(df.head())
print('score data type:', type(df.score.values[0]))

Out[1]: 
   id  age  height         score
0   0   55    1.62  1.8879217058
1   1   56    1.67            NA
2   2   72    1.68  3.2614944103
3   3   68    1.69  1.8063051011
4   4   58    1.63  2.8550428059
score data type: <class 'str'>

Now, read in the CSV data again to overwrite df. This time, detect missing values and check that data are stored as numbers. Create and assign subjects a group id such that the first 10 subjects are in group 0 and the next 10 subjects in group 1. Some statistical functions can’t handle missing values, so we will drop subject 1 who had missing score data. Print and check the whole dataframe.

This time, Pandas recognises the missing value for subject 1 and score data are stored as numbers. On printing the dataframe, subjects are now assigned a group of 0 or 1, and subject 1 has been dropped.

# Read in dataset, detect NA

df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])

print(df.head())
print('score data type:', type(df.score.values[0])) # na_filter=True, by default

# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)

# drop NA
df = df.dropna()
print(df)

Out[2]: 
   id  age  height     score
0   0   55    1.62  1.887922
1   1   56    1.67       NaN
2   2   72    1.68  3.261494
3   3   68    1.69  1.806305
4   4   58    1.63  2.855043
score data type: <class 'numpy.float64'>

    id  age  height     score  group
0    0   55    1.62  1.887922      0
2    2   72    1.68  3.261494      0
3    3   68    1.69  1.806305      0
4    4   58    1.63  2.855043      0
5    5   55    1.67  3.189182      0
6    6   68    1.64  0.631574      0
7    7   61    1.65  3.451023      0
8    8   75    1.79  3.238747      0
9    9   61    1.66  1.769695      0
10  10   57    1.68  3.816165      1
11  11   67    1.60  1.782659      1
12  12   66    1.62  3.763942      1
13  13   62    1.70  4.406709      1
14  14   68    1.75  0.058346      1
15  15   63    1.75  2.490545      1
16  16   66    1.67  0.368960      1
17  17   67    1.79  3.934757      1
18  18   66    1.70  0.320337      1
19  19   75    1.74  1.776552      1

Now, summarise the data for each group using the groupby and describe functions. Print the summary, and also write it to the file describe.csv.

# summarise data from each group
print(df.groupby(df.group).describe())

df.groupby(df.group).describe().to_csv('describe.csv')

Out[3]:
                   age     height         id      score
group                                                  
0     count   9.000000   9.000000   9.000000   9.000000
      mean   63.666667   1.670000   4.888889   2.454554
      std     7.348469   0.050498   2.934469   0.967776
      min    55.000000   1.620000   0.000000   0.631574
      25%    58.000000   1.640000   3.000000   1.806305
      50%    61.000000   1.660000   5.000000   2.855043
      75%    68.000000   1.680000   7.000000   3.238747
      max    75.000000   1.790000   9.000000   3.451023
1     count  10.000000  10.000000  10.000000  10.000000
      mean   65.700000   1.700000  14.500000   2.271897
      std     4.620005   0.060000   3.027650   1.659839
      min    57.000000   1.600000  10.000000   0.058346
      25%    63.750000   1.672500  12.250000   0.720858
      50%    66.000000   1.700000  14.500000   2.136602
      75%    67.000000   1.747500  16.750000   3.803109
      max    75.000000   1.790000  19.000000   4.406709

The complete code to generate these output is:

import numpy as np
import pandas as pd

# Read in dataset, create header names

df = pd.read_csv('data_pandas.csv', na_filter=False,
                 names=['id', 'age', 'height', 'score'])

print(df.head())
print('score data type:', type(df.score.values[0]))

# Read in dataset, detect NA

df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])

print(df.head())
print('score data type:', type(df.score.values[0])) # na_filter=True, by default

# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)

# drop NA
df = df.dropna()
print(df)

# summarise data from each group
print(df.groupby(df.group).describe())

df.groupby(df.group).describe().to_csv('describe.csv')

Summary

We read in CSV data and detected missing values, assigned subjects to different groups, and summarised data in each group. Finally, we wrote the summarised data to a CSV file.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s