Manipulating data with Pandas – Part 1

Pandas (i.e. panel data) is a Python library designed to manipulate data in tables and time series. Pandas uses many Numpy library functions to manipulate data stored in dataframes, analogous to a spreadsheet or table. Let’s look at some basic Pandas functions to manipulate data, and plot the data using the Seaborn plotting package.

To begin, import libraries and simulate id, height (m) and weight (kg) for 10 subjects. Create a Pandas dataframe object df containing the simulated data, and write it to a CSV file in the working directory. Next, read the CSV file data back into memory (overwriting the df object), and print the top 5 lines in the dataframe using the head() function.

The output (shown after IPython style Out[1]: prompts) shows the top 5 lines from the dataframe df, which contains data read from data.csv. When dealing with many variables or large datasets, it is sometimes useful to store processed data in a CSV file which can be quickly read back in later.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns

# Create data and dataframe, write to and read from CSV

n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)

df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})

df.to_csv('data.csv')

df = pd.read_csv('data.csv')

print(df.head())

Out[1]: 
   Unnamed: 0  height  id  weight
0           0    1.62   0      73
1           1    1.73   1      59
2           2    1.75   2      56
3           3    1.75   3      74
4           4    1.70   4      56

Now, for each subject, create some data of repeated measures (score1 and score2, labeling them as scoreA and scoreB respectively) based on normal distributions centered around means of 5 and 8, each with standard deviation 2. Add these variables to the datafame df, write the new dataframe to the CSV file, and print the top 5 lines.

Note: the height and weight values in your own CSV file and those printed to your screen will be slightly different because the random values used to simulate these data are different.

We see that the variables score1 and score2 are now in the dataframe, and can be indexed later.

# Create new data, add to data frame

score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)

df['scoreA'] = score1
df['scoreB'] = score2

df.to_csv('data.csv')

print(df.head())

Out[2]: 
   Unnamed: 0  height  id  weight    scoreA     scoreB
0           0    1.62   0      73  5.358997   9.434003
1           1    1.73   1      59  3.107703   5.945404
2           2    1.75   2      56  7.596088   5.768823
3           3    1.75   3      74  3.111168  10.025494
4           4    1.70   4      56  8.053365   8.616078

The data are currently stored in wide format (i.e. one row of data for each subject). We would like to plot the repeated measures scoreA and scoreB, but Seaborn requires these data to be in long format (i.e. multiple rows of data for each subject, where data are linked to the subject by a unique identifier).

Let’s reshape the data from wide to long format, inspect how Pandas indexes multiple levels of the data, and use Seaborn to plot the repeated measures. To rehsape the data into long, we specify the dataframe df, a stubname which is the common prefix of the names of all repeated measures variables, an i subject identifier id, and a new j variable trial which is the identifier for the repeated measure.

We overwrite the dataframe df to store the data in long format, and print the top and bottom 5 lines of data. Pandas now indexes subjects by id, as well as trials A and B for each subject.

# Reshape wide to long, to plot paired data

df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial')

print(df.head())
print(df.tail())

Out[3]:
          Unnamed: 0  height  weight     score
id trial                                      
0  A               0    1.62      73  5.358997
1  A               1    1.73      59  3.107703
2  A               2    1.75      56  7.596088
3  A               3    1.75      74  3.111168
4  A               4    1.70      56  8.053365
          Unnamed: 0  height  weight     score
id trial                                      
5  B               5    1.62      71  9.338265
6  B               6    1.77      72  9.500084
7  B               7    1.66      63  5.969731
8  B               8    1.79      59  6.572579
9  B               9    1.65      56  8.408526

How does Pandas reference the multiple levels (id and trial) of the data? Let’s show the multi-index levels.

We see that multiple levels of the data are stored as multi-index objects. The multiple labels of data are included as well; notice that trial values for score1 and score2 are dummy coded as 0 and 1.

# show multi-index levels
print(df.index)

Out[4]:
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['A', 'B']],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
           names=['id', 'trial'])

Pandas recognises the multiple levels but we can only index the data using these levels if we store them as an object. (Notice, the id and trial labels in the dataframe output sit 1 level lower than the height, weight, score labels.) Now, we index the trial values and store them in the dataframe, then plot the paired data from trials A and B.

# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)

# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)

Figure 1:

The complete code to generate these output is:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns

# Create data and dataframe, write to and read from CSV

n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)

df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})

df.to_csv('data.csv')

df = pd.read_csv('data.csv')

print(df.head())

# Create new data, add to data frame

score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)

df['scoreA'] = score1
df['scoreB'] = score2

df.to_csv('data.csv')

print(df.head())

# Reshape wide to long, to plot paired data

df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial')

print(df.head())
print(df.tail())

# show multi-index levels
print(df.index)

# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)

# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)

Summary

We created a Pandas dataframe, exported the data to a CSV file, imported the data from a CSV file, and did some brief checks.

Next, we reshaped data from wide to long, indexed values from the multi-index levels, and used the Seaborn package to plot repeated measures data.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s