Manipulating data with Pandas – Part 1

Pandas (i.e. panel data) is a Python library designed to manipulate data in tables and time series. Pandas uses many Numpy library functions to manipulate data stored in dataframes, analogous to a spreadsheet or table. Let’s look at some basic Pandas functions to manipulate data, and plot the data using the Seaborn plotting package.
To begin, import libraries and simulate id, height (m) and weight (kg) for 10 subjects. Create a Pandas dataframe object df
containing the simulated data, and write it to a CSV file in the working directory. Next, read the CSV file data back into memory (overwriting the df
object), and print the top 5 lines in the dataframe using the head()
function.
The output (shown after IPython style Out[1]:
prompts) shows the top 5 lines from the dataframe df
, which contains data read from data.csv
. When dealing with many variables or large datasets, it is sometimes useful to store processed data in a CSV file which can be quickly read back in later.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns
# Create data and dataframe, write to and read from CSV
n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)
df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})
df.to_csv('data.csv')
df = pd.read_csv('data.csv')
print(df.head())
Out[1]:
Unnamed: 0 height id weight
0 0 1.62 0 73
1 1 1.73 1 59
2 2 1.75 2 56
3 3 1.75 3 74
4 4 1.70 4 56
Now, for each subject, create some data of repeated measures (score1
and score2
, labeling them as scoreA
and scoreB
respectively) based on normal distributions centered around means of 5 and 8, each with standard deviation 2. Add these variables to the datafame df
, write the new dataframe to the CSV file, and print the top 5 lines.
Note: the height and weight values in your own CSV file and those printed to your screen will be slightly different because the random values used to simulate these data are different.
We see that the variables score1
and score2
are now in the dataframe, and can be indexed later.
# Create new data, add to data frame
score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)
df['scoreA'] = score1
df['scoreB'] = score2
df.to_csv('data.csv')
print(df.head())
Out[2]:
Unnamed: 0 height id weight scoreA scoreB
0 0 1.62 0 73 5.358997 9.434003
1 1 1.73 1 59 3.107703 5.945404
2 2 1.75 2 56 7.596088 5.768823
3 3 1.75 3 74 3.111168 10.025494
4 4 1.70 4 56 8.053365 8.616078
The data are currently stored in wide format (i.e. one row of data for each subject). We would like to plot the repeated measures scoreA
and scoreB
, but Seaborn requires these data to be in long format (i.e. multiple rows of data for each subject, where data are linked to the subject by a unique identifier).
Let’s reshape the data from wide to long format, inspect how Pandas indexes multiple levels of the data, and use Seaborn to plot the repeated measures. To rehsape the data into long, we specify the dataframe df
, a stubname
which is the common prefix of the names of all repeated measures variables, an i
subject identifier id
, and a new j
variable trial
which is the identifier for the repeated measure.
We overwrite the dataframe df
to store the data in long format, and print the top and bottom 5 lines of data. Pandas now indexes subjects by id
, as well as trials A
and B
for each subject.
# Reshape wide to long, to plot paired data
df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial', suffix='\w')
print(df.head())
print(df.tail())
Out[3]:
Unnamed: 0 height weight score
id trial
0 A 0 1.62 73 5.358997
1 A 1 1.73 59 3.107703
2 A 2 1.75 56 7.596088
3 A 3 1.75 74 3.111168
4 A 4 1.70 56 8.053365
Unnamed: 0 height weight score
id trial
5 B 5 1.62 71 9.338265
6 B 6 1.77 72 9.500084
7 B 7 1.66 63 5.969731
8 B 8 1.79 59 6.572579
9 B 9 1.65 56 8.408526
How does Pandas reference the multiple levels (id
and trial
) of the data? Let’s show the multi-index levels.
We see that multiple levels of the data are stored as multi-index objects. The multiple labels of data are included as well; notice that trial values for score1
and score2
are dummy coded as 0 and 1.
# show multi-index levels
print(df.index)
Out[4]:
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['A', 'B']],
labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
names=['id', 'trial'])
Pandas recognises the multiple levels but we can only index the data using these levels if we store them as an object. (Notice, the id
and trial
labels in the dataframe output sit 1 level lower than the height
, weight
, score
labels.) Now, we index the trial values and store them in the dataframe, then plot the paired data from trials A and B.
# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)
# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)
Figure 1:
The complete code to generate these output is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns
# Create data and dataframe, write to and read from CSV
n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)
df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})
df.to_csv('data.csv')
df = pd.read_csv('data.csv')
print(df.head())
# Create new data, add to data frame
score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)
df['scoreA'] = score1
df['scoreB'] = score2
df.to_csv('data.csv')
print(df.head())
# Reshape wide to long, to plot paired data
df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial', suffix='\w')
print(df.head())
print(df.tail())
# show multi-index levels
print(df.index)
# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)
# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)
Summary
We created a Pandas dataframe, exported the data to a CSV file, imported the data from a CSV file, and did some brief checks.
Next, we reshaped data from wide to long, indexed values from the multi-index levels, and used the Seaborn package to plot repeated measures data.
In the next post, we will summarise some data and write the results to a CSV file.
Thanks for this quick overview. I’m curious, what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest? Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?
LikeLike
Thanks for this quick overview. I’m curious, what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest? Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?
LikeLike
Hi there, thanks for the comment.
…what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest?
This is easier. You simply need separate identifiers for subject and trial, and the dependent variable will have different values in all these. This 2nd Pandas post may give some pointers.
Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?
If you mean running an ANOVA then post-hoc tests, in theory, yes that is possible. Remember to quantify the precision of the effect (e.g. mean difference) with 95% CI so readers can infer how precisely the effect was estimated.
LikeLike