Manipulating data with Pandas – Part 3

Posted on May 31, 2018 by Joanna Diong Leave a comment

Previously, we read in CSV data and summarised it. Here, we will learn how to locate data using labels and tabulate categorical data.

The CSV file data_pandas.csv (available here) contains simulated data of age, height and score from 20 subjects. Read in and store the CSV data in the dataframe df, using the header names. Also create and assign subjects a group id such that the first 10 subjects are in group 0 and the next 10 subjects in group 1.

Let’s create age subgroups by decade. Create the variable subgroup and assign all values as string 50s first. Next, use the loc method and replace subgroup values as 60s or 70s if age falls within the decades. Print and check the whole dataframe. In addition, use loc to print score values only for subjects in Group 0.

Output are shown after IPython style Out[1]: prompts. We see the new group and subgroup variables, and the score values for those in Group 0.

import numpy as np
import pandas as pd

# Read in dataset, detect NA

df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])

# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)

# create age subgroups
df['subgroup'] = '50s'
df.loc[(df['age'] >= 60) & (df['age'] < 70), 'subgroup'] = '60s'
df.loc[(df['age'] >= 70) & (df['age'] < 80), 'subgroup'] = '70s'
print(df)

# print individual scores for group 0
print(df.loc[df.group==0, 'score'])

Out[1]: 
    id  age  height     score  group subgroup
0    0   55    1.62  1.887922      0      50s
1    1   56    1.67       NaN      0      50s
2    2   72    1.68  3.261494      0      70s
3    3   68    1.69  1.806305      0      60s
4    4   58    1.63  2.855043      0      50s
5    5   55    1.67  3.189182      0      50s
6    6   68    1.64  0.631574      0      60s
7    7   61    1.65  3.451023      0      60s
8    8   75    1.79  3.238747      0      70s
9    9   61    1.66  1.769695      0      60s
10  10   57    1.68  3.816165      1      50s
11  11   67    1.60  1.782659      1      60s
12  12   66    1.62  3.763942      1      60s
13  13   62    1.70  4.406709      1      60s
14  14   68    1.75  0.058346      1      60s
15  15   63    1.75  2.490545      1      60s
16  16   66    1.67  0.368960      1      60s
17  17   67    1.79  3.934757      1      60s
18  18   66    1.70  0.320337      1      60s
19  19   75    1.74  1.776552      1      70s

0    1.887922
1         NaN
2    3.261494
3    1.806305
4    2.855043
5    3.189182
6    0.631574
7    3.451023
8    3.238747
9    1.769695
Name: score, dtype: float64

Now, tabulate the data by setting the index as group and columns as subgroup, and report the sums of the values in the margins. This produces counts of the data and the marginal total counts. Also calculate the percentages of the group and subgroup values and tabulate these separately. Write the data to table.txt

# Cross-tabulate data and print to file

table = pd.crosstab(index=df.group, columns=df.subgroup, margins=True)
table.index = ['Group 0', 'Group 1', 'All']
table_div = table.div(table['All'], axis=0) * 100
table_div.index = ['Group 0', 'Group 1', 'All']
print('\nCounts:\n', table)
print('\nProportions:\n', table_div)

file = 'tables.txt'
open(file, 'w').close()
with open(file, 'a') as file:
    file.write('\n--------------------')
    file.write('\nCross-tabulation of data')
    file.write('\n--------------------')
    file.write('\n')
    file.write('\nCounts:\n')
    file.write('\n' + str(table) + '\n')
    file.write('\nProportions:\n')
    file.write('\n' + str(table_div) + '\n')

Out[2]:
Counts:
 subgroup  50s  60s  70s  All
Group 0     4    4    2   10
Group 1     1    8    1   10
All         5   12    3   20

Proportions:
 subgroup   50s   60s   70s    All
Group 0   40.0  40.0  20.0  100.0
Group 1   10.0  80.0  10.0  100.0
All       25.0  60.0  15.0  100.0

We observe that the data are tabulated as counts and proportions, and you should now have the new file table.txt in your folder.

The complete code to generate these output is:

import numpy as np
import pandas as pd

# Read in dataset, detect NA

df = pd.read_csv('data_pandas.csv',
                 names=['id', 'age', 'height', 'score'])

# create identifier for 2 independent groups
df['group'] = np.concatenate((np.zeros(int(len(df)/2)), np.ones(int(len(df)/2))), axis=0)
df.group = df.group.astype(int)

# create age subgroups
df['subgroup'] = '50s'
df.loc[(df['age'] >= 60) & (df['age'] < 70), 'subgroup'] = '60s'
df.loc[(df['age'] >= 70) & (df['age'] < 80), 'subgroup'] = '70s'
print(df)

# print individual scores for group 0
print(df.loc[df.group==0, 'score'])

# Cross-tabulate data and print to file

table = pd.crosstab(index=df.group, columns=df.subgroup, margins=True)
table.index = ['Group 0', 'Group 1', 'All']
table_div = table.div(table['All'], axis=0) * 100
table_div.index = ['Group 0', 'Group 1', 'All']
print('\nCounts:\n', table)
print('\nProportions:\n', table_div)

file = 'tables.txt'
open(file, 'w').close()
with open(file, 'a') as file:
    file.write('\n--------------------')
    file.write('\nCross-tabulation of data')
    file.write('\n--------------------')
    file.write('\n')
    file.write('\nCounts:\n')
    file.write('\n' + str(table) + '\n')
    file.write('\nProportions:\n')
    file.write('\n' + str(table_div) + '\n')

Summary

We read in CSV data, assigned subjects to different subgroups, and indexed subject data by locating other data.

We tabulated counts and proportions of data and wrote the tables to a text file.

Posts in series

tagged with dataframe, Python, text files