Manipulating data with Pandas – Part 3

Previously, we read in CSV data and summarised it. Here, we will learn how to locate data using labels and tabulate categorical data.
The CSV file data_pandas.csv
(available here) contains simulated data of age, height and score from 20 subjects. Read in and store the CSV data in the dataframe df
, using the header names
. Also create and assign subjects a group id such that the first 10 subjects are in group 0 and the next 10 subjects in group 1.
Let’s create age subgroups by decade. Create the variable subgroup
and assign all values as string 50s
first. Next, use the loc
method and replace subgroup
values as 60s
or 70s
if age falls within the decades. Print and check the whole dataframe. In addition, use loc
to print score
values only for subjects in Group 0.
Output are shown after IPython style Out[1]:
prompts. We see the new group
and subgroup
variables, and the score values for those in Group 0.
import numpy as np
import pandas as pd
# Read in dataset, detect NA
df = pd.read_csv('data_pandas.csv',
names=['id', 'age', 'height', 'score'])
# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)
# create age subgroups
df['subgroup'] = '50s'
df.loc[(df['age'] >= 60) & (df['age'] < 70), 'subgroup'] = '60s'
df.loc[(df['age'] >= 70) & (df['age'] < 80), 'subgroup'] = '70s'
print(df)
# print individual scores for group 0
print(df.loc[df.group==0, 'score'])
Out[1]:
id age height score group subgroup
0 0 55 1.62 1.887922 0 50s
1 1 56 1.67 NaN 0 50s
2 2 72 1.68 3.261494 0 70s
3 3 68 1.69 1.806305 0 60s
4 4 58 1.63 2.855043 0 50s
5 5 55 1.67 3.189182 0 50s
6 6 68 1.64 0.631574 0 60s
7 7 61 1.65 3.451023 0 60s
8 8 75 1.79 3.238747 0 70s
9 9 61 1.66 1.769695 0 60s
10 10 57 1.68 3.816165 1 50s
11 11 67 1.60 1.782659 1 60s
12 12 66 1.62 3.763942 1 60s
13 13 62 1.70 4.406709 1 60s
14 14 68 1.75 0.058346 1 60s
15 15 63 1.75 2.490545 1 60s
16 16 66 1.67 0.368960 1 60s
17 17 67 1.79 3.934757 1 60s
18 18 66 1.70 0.320337 1 60s
19 19 75 1.74 1.776552 1 70s
0 1.887922
1 NaN
2 3.261494
3 1.806305
4 2.855043
5 3.189182
6 0.631574
7 3.451023
8 3.238747
9 1.769695
Name: score, dtype: float64
Now, tabulate the data by setting the index as group
and columns as subgroup
, and report the sums of the values in the margins. This produces counts of the data and the marginal total counts. Also calculate the percentages of the group
and subgroup
values and tabulate these separately. Write the data to table.txt
# Cross-tabulate data and print to file
table = pd.crosstab(index=df.group, columns=df.subgroup, margins=True)
table.index = ['Group 0', 'Group 1', 'All']
table_div = table.div(table['All'], axis=0) * 100
table_div.index = ['Group 0', 'Group 1', 'All']
print('\nCounts:\n', table)
print('\nProportions:\n', table_div)
file = 'tables.txt'
open(file, 'w').close()
with open(file, 'a') as file:
file.write('\n--------------------')
file.write('\nCross-tabulation of data')
file.write('\n--------------------')
file.write('\n')
file.write('\nCounts:\n')
file.write('\n' + str(table) + '\n')
file.write('\nProportions:\n')
file.write('\n' + str(table_div) + '\n')
Out[2]:
Counts:
subgroup 50s 60s 70s All
Group 0 4 4 2 10
Group 1 1 8 1 10
All 5 12 3 20
Proportions:
subgroup 50s 60s 70s All
Group 0 40.0 40.0 20.0 100.0
Group 1 10.0 80.0 10.0 100.0
All 25.0 60.0 15.0 100.0
We observe that the data are tabulated as counts and proportions, and you should now have the new file table.txt
in your folder.
The complete code to generate these output is:
import numpy as np
import pandas as pd
# Read in dataset, detect NA
df = pd.read_csv('data_pandas.csv',
names=['id', 'age', 'height', 'score'])
# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)
# create age subgroups
df['subgroup'] = '50s'
df.loc[(df['age'] >= 60) & (df['age'] < 70), 'subgroup'] = '60s'
df.loc[(df['age'] >= 70) & (df['age'] < 80), 'subgroup'] = '70s'
print(df)
# print individual scores for group 0
print(df.loc[df.group==0, 'score'])
# Cross-tabulate data and print to file
table = pd.crosstab(index=df.group, columns=df.subgroup, margins=True)
table.index = ['Group 0', 'Group 1', 'All']
table_div = table.div(table['All'], axis=0) * 100
table_div.index = ['Group 0', 'Group 1', 'All']
print('\nCounts:\n', table)
print('\nProportions:\n', table_div)
file = 'tables.txt'
open(file, 'w').close()
with open(file, 'a') as file:
file.write('\n--------------------')
file.write('\nCross-tabulation of data')
file.write('\n--------------------')
file.write('\n')
file.write('\nCounts:\n')
file.write('\n' + str(table) + '\n')
file.write('\nProportions:\n')
file.write('\n' + str(table_div) + '\n')
Summary
We read in CSV data, assigned subjects to different subgroups, and indexed subject data by locating other data.
We tabulated counts and proportions of data and wrote the tables to a text file.