Manipulating data with Pandas – Part 1

Posted on May 10, 2018 by Joanna Diong 3 comments

Pandas (i.e. panel data) is a Python library designed to manipulate data in tables and time series. Pandas uses many Numpy library functions to manipulate data stored in dataframes, analogous to a spreadsheet or table. Let’s look at some basic Pandas functions to manipulate data, and plot the data using the Seaborn plotting package.

To begin, import libraries and simulate id, height (m) and weight (kg) for 10 subjects. Create a Pandas dataframe object df containing the simulated data, and write it to a CSV file in the working directory. Next, read the CSV file data back into memory (overwriting the df object), and print the top 5 lines in the dataframe using the head() function.

The output (shown after IPython style Out[1]: prompts) shows the top 5 lines from the dataframe df, which contains data read from data.csv. When dealing with many variables or large datasets, it is sometimes useful to store processed data in a CSV file which can be quickly read back in later.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns

# Create data and dataframe, write to and read from CSV

n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)

df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})

df.to_csv('data.csv')

df = pd.read_csv('data.csv')

print(df.head())

Out[1]: 
   Unnamed: 0  height  id  weight
0           0    1.62   0      73
1           1    1.73   1      59
2           2    1.75   2      56
3           3    1.75   3      74
4           4    1.70   4      56

Now, for each subject, create some data of repeated measures (score1 and score2, labeling them as scoreA and scoreB respectively) based on normal distributions centered around means of 5 and 8, each with standard deviation 2. Add these variables to the datafame df, write the new dataframe to the CSV file, and print the top 5 lines.

Note: the height and weight values in your own CSV file and those printed to your screen will be slightly different because the random values used to simulate these data are different.

We see that the variables score1 and score2 are now in the dataframe, and can be indexed later.

# Create new data, add to data frame

score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)

df['scoreA'] = score1
df['scoreB'] = score2

df.to_csv('data.csv')

print(df.head())

Out[2]: 
   Unnamed: 0  height  id  weight    scoreA     scoreB
0           0    1.62   0      73  5.358997   9.434003
1           1    1.73   1      59  3.107703   5.945404
2           2    1.75   2      56  7.596088   5.768823
3           3    1.75   3      74  3.111168  10.025494
4           4    1.70   4      56  8.053365   8.616078

The data are currently stored in wide format (i.e. one row of data for each subject). We would like to plot the repeated measures scoreA and scoreB, but Seaborn requires these data to be in long format (i.e. multiple rows of data for each subject, where data are linked to the subject by a unique identifier).

Let’s reshape the data from wide to long format, inspect how Pandas indexes multiple levels of the data, and use Seaborn to plot the repeated measures. To rehsape the data into long, we specify the dataframe df, a stubname which is the common prefix of the names of all repeated measures variables, an i subject identifier id, and a new j variable trial which is the identifier for the repeated measure.

We overwrite the dataframe df to store the data in long format, and print the top and bottom 5 lines of data. Pandas now indexes subjects by id, as well as trials A and B for each subject.

# Reshape wide to long, to plot paired data

df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial', suffix='\w')

print(df.head())
print(df.tail())

Out[3]:
          Unnamed: 0  height  weight     score
id trial                                      
0  A               0    1.62      73  5.358997
1  A               1    1.73      59  3.107703
2  A               2    1.75      56  7.596088
3  A               3    1.75      74  3.111168
4  A               4    1.70      56  8.053365
          Unnamed: 0  height  weight     score
id trial                                      
5  B               5    1.62      71  9.338265
6  B               6    1.77      72  9.500084
7  B               7    1.66      63  5.969731
8  B               8    1.79      59  6.572579
9  B               9    1.65      56  8.408526

How does Pandas reference the multiple levels (id and trial) of the data? Let’s show the multi-index levels.

We see that multiple levels of the data are stored as multi-index objects. The multiple labels of data are included as well; notice that trial values for score1 and score2 are dummy coded as 0 and 1.

# show multi-index levels
print(df.index)

Out[4]:
MultiIndex(levels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['A', 'B']],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
           names=['id', 'trial'])

Pandas recognises the multiple levels but we can only index the data using these levels if we store them as an object. (Notice, the id and trial labels in the dataframe output sit 1 level lower than the height, weight, score labels.) Now, we index the trial values and store them in the dataframe, then plot the paired data from trials A and B.

# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)

# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)

Figure 1:

The complete code to generate these output is:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn.apionly as sns

# Create data and dataframe, write to and read from CSV

n = 10
id = list(range(n))
height = np.random.randint(160, 180+1, n) / 100
weight = np.random.randint(55, 75+1, n)

df = pd.DataFrame({'id': id, 'height': height, 'weight': weight})

df.to_csv('data.csv')

df = pd.read_csv('data.csv')

print(df.head())

# Create new data, add to data frame

score1 = np.random.normal(loc=5, scale=2, size=n)
score2 = np.random.normal(loc=8, scale=2, size=n)

df['scoreA'] = score1
df['scoreB'] = score2

df.to_csv('data.csv')

print(df.head())

# Reshape wide to long, to plot paired data

df = pd.wide_to_long(df, stubnames=['score'], i='id', j='trial', suffix='\w')

print(df.head())
print(df.tail())

# show multi-index levels
print(df.index)

# use the multi-index levels to index the trial values
df['trial'] = df.index.get_level_values(level=1)

# plot paired data from trials A and B
sns.stripplot(x='trial', y='score', data=df, jitter=True)
sns.pointplot(x='trial', y='score', data=df, estimator=np.mean, join=False, color='k')
plt.ylabel('Score (a.u.)')
plt.xlabel('Trial')
plt.tight_layout()
plt.savefig('paired.png', dpi=600)

Summary

We created a Pandas dataframe, exported the data to a CSV file, imported the data from a CSV file, and did some brief checks.

Next, we reshaped data from wide to long, indexed values from the multi-index levels, and used the Seaborn package to plot repeated measures data.

In the next post, we will summarise some data and write the results to a CSV file.

Posts in series

tagged with pandas, Python, reshape

3 comments

staypuffinpc
October 9, 2018 5:53 pm

Thanks for this quick overview. I’m curious, what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest? Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?

LikeLike

Peter
October 9, 2018 5:53 pm

Thanks for this quick overview. I’m curious, what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest? Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?

LikeLike

- Joanna Diong
  October 9, 2018 10:01 pm
  
  Hi there, thanks for the comment.
  
  …what if my data are already in a long format? How do I let pandas know which trial is which? How do I ID what are the variables of interest?
  
  This is easier. You simply need separate identifiers for subject and trial, and the dependent variable will have different values in all these. This 2nd Pandas post may give some pointers.
  
  Also, is it possible to actually check the differences on the dependent variables between trials and then run a post-hoc test to check for significance?
  
  If you mean running an ANOVA then post-hoc tests, in theory, yes that is possible. Remember to quantify the precision of the effect (e.g. mean difference) with 95% CI so readers can infer how precisely the effect was estimated.
  
  LikeLike

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31