Don’t repeat yourself: Python modules

Posted on August 9, 2018 by Martin Héroux

We previously learned to create our own Python functions to reduce how much we repeat ourselves in our code. In this post we see another example of the DRY principle (don’t repeat yourself) and learn how to avoid repeating ourselves across the different programs we write.

A typical (bad) script to process data

Below is an example of a script we might use to process data from five subjects. The data consist of subject initials, weight, height, age, and systolic and diastolic blood pressure. We want to calculate each subject’s BMI, their predicted maximum heart rate and their blood pressure status, and then print out the results.

# subject =  [initials, weight, height, age, systolic, diastolic]
subject1 = ['GA', 80, 1.62, 70, 120, 80]
subject2 = ['KT', 69, 1.53, 65, 136, 75]
subject3 = ['MN', 80, 1.66, 89, 113, 72]
subject4 = ['PW', 80, 1.79, 55, 141, 96]
subject5 = ['HJ', 72, 1.60, 61, 121, 78]

# process data for subject1
initials, weight, height, age, systolic, diastolic = subject1 

# Calculate BMI
bmi = int(weight / height**2)
# Calculate predicted maximum heart rate
max_HR = 208 - 0.7 * age
# Calculate blood pressure risk
if systolic >= 120 and systolic < 130 and diastolic < 80:
    bprisk = 'elevated BP'
elif (systolic >= 130 and systolic < 140) or (diastolic >= 80 and diastolic < 90):
    bprisk = 'stage 1 hypertension'
elif systolic >= 140 or diastolic >= 90:
    bprisk = 'stage 2 hypertension'
else:
    bprisk = 'normal BP'
# Print summary
print("\n\t" + initials)
print("\tweight = {}kg".format(weight))
print("\theight = {}m".format(height))
print("\tage = {} years old".format(age))
print("\tblood pressure = {}/{}".format(systolic, diastolic))
print("\n\tbmi = {}".format(bmi))
print("\tpredicted maximal heart rate = {} bpm".format(max_HR))
print("\tblood pressure = " + bprisk)
print("\n")

# process data for subject2
initials, weight, height, age, systolic, diastolic = subject2

# [copy and paste code from above]
...

As you can see, we would need to cut-and-paste most of the above code another four times to process the data for each subject. If we later discovered a mistake in one of our formulas, we would have to fix the mistake in no fewer than five locations in our code. Things get even more complicated if the code has been copied-and-pasted into another program.

Using functions to avoid repeating ourselves

As we saw in the previous post, we can wrap each calculation in its own function and then loop over the subjects:

def bmi_calc(weight_kg, height_m):
    """Calculate BMI from weight in kg and height in meters"""
    bmi = int(weight_kg / height_m**2)
    return bmi
    
def predict_max_HR(age):
    """Age predicted maximal heart rate"""
    max_HR = 208 - 0.7 * age
    return max_HR
    
def bp_risk(systolic, diastolic):
    """Categorises whether blood pressure is elevated,
    stage 1 hypertension or stage 2 hypertension"""
    if systolic >= 120 and systolic < 130 and diastolic < 80:
        bprisk = 'elevated BP'
    elif (systolic >= 130 and systolic < 140) or (diastolic >= 80 and diastolic < 90):
        bprisk = 'stage 1 hypertension'
    elif systolic >= 140 or diastolic >= 90:
        bprisk = 'stage 2 hypertension'
    else:
        bprisk = 'normal BP'
    return bprisk
    
def print_results(initials, weight, height, age, systolic, diastolic):
    bmi = bmi_calc(weight, height)
    max_HR = predict_max_HR(age)
    bprisk = bp_risk(systolic, diastolic)
    print("\n\t" + initials)
    print("\tweight = {}kg".format(weight))
    print("\theight = {}m".format(height))
    print("\tage = {} years old".format(age))
    print("\tblood pressure = {}/{}".format(systolic, diastolic))
    print("\n\tbmi = {}".format(bmi))
    print("\tpredicted maximal heart rate = {} bpm".format(max_HR))
    print("\tblood pressure = " + bprisk)
    print("\n")
    
# subject =  [initials, weight, height, age, systolic, diastolic]
subject1 = ['GA', 80, 1.6, 70, 120, 80]
subject2 = ['KT', 69, 1.5, 65, 136, 75]
subject3 = ['MN', 80, 1.6, 89, 113, 75]
subject4 = ['PW', 80, 1.7, 55, 141, 96]

subjects = [subject1, subject2, subject3, subject4]

for sub in subjects:
    initials, weight, height, age, systolic, diastolic = sub
    print_results(initials, weight, height, age, systolic, diastolic)

The output for the first subject looks like this:

GA
    weight = 80kg
    height = 1.6m
    age = 70 years old
    blood pressure = 120/80

    bmi = 31
    predicted maximal heart rate = 159.0 bpm
    blood pressure = stage 1 hypertension
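
As a quick check of these numbers, the arithmetic for subject GA can be reproduced by hand. This is only a sketch that repeats the formulas from the functions above with GA's values; the variable names are chosen for illustration.

```python
# Values for subject GA, taken from the script above
weight_kg, height_m, age = 80, 1.6, 70

raw_bmi = weight_kg / height_m**2   # about 31.25 before truncation
bmi = int(raw_bmi)                  # int() drops the decimals, it does not round
max_hr = 208 - 0.7 * age            # about 159.0

print(bmi, round(max_hr, 1))
```

Note that int() truncates: a BMI of 31.9 would also be reported as 31. If rounding is preferred, round() could be used instead, but that would change the behaviour of the original function.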

This is a big improvement over the previous version of our code. However, this is still a processing script: code that we copy-and-paste into a Python command line or run as a program from the command line. It has a single purpose, which is to process the data from the subjects we entered manually.

What if we had a few studies that required us to calculate and print these outcomes? Should we copy-and-paste the code to other scripts? No! Don’t repeat yourself. The best thing to do is create a Python module.

Creating a Python module to reuse code

Creating a Python module is simple. We put all of our functions (just the functions, nothing else) in a file and save it with a .py file extension.

For our current example, we can put all of our functions into a file called fitness.py.

def bmi_calc(weight_kg, height_m):
    """Calculate BMI from weight in kg and height in meters"""
    bmi = int(weight_kg / height_m**2)
    return bmi
    
def predict_max_HR(age):
    """Age predicted maximal heart rate"""
    max_HR = 208 - 0.7 * age
    return max_HR
    
def bp_risk(systolic, diastolic):
    """Categorises whether blood pressure is elevated,
    stage 1 hypertension or stage 2 hypertension"""
    if systolic >= 120 and systolic < 130 and diastolic < 80:
        bprisk = 'elevated BP'
    elif (systolic >= 130 and systolic < 140) or (diastolic >= 80 and diastolic < 90):
        bprisk = 'stage 1 hypertension'
    elif systolic >= 140 or diastolic >= 90:
        bprisk = 'stage 2 hypertension'
    else:
        bprisk = 'normal BP'
    return bprisk
    
def print_results(initials, weight, height, age, systolic, diastolic):
    bmi = bmi_calc(weight, height)
    max_HR = predict_max_HR(age)
    bprisk = bp_risk(systolic, diastolic)
    print("\n\t" + initials)
    print("\tweight = {}kg".format(weight))
    print("\theight = {}m".format(height))
    print("\tage = {} years old".format(age))
    print("\tblood pressure = {}/{}".format(systolic, diastolic))
    print("\n\tbmi = {}".format(bmi))
    print("\tpredicted maximal heart rate = {} bpm".format(max_HR))
    print("\tblood pressure = " + bprisk)
    print("\n")
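
To see that a module really is just a file of functions, the sketch below writes a tiny stand-in module to a temporary folder and imports it. The file name fitness_demo.py and the temporary folder are assumptions made for this illustration; in practice we would simply save fitness.py next to our scripts.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for fitness.py, containing a single function
module_source = '''
def bmi_calc(weight_kg, height_m):
    """Calculate BMI from weight in kg and height in meters"""
    return int(weight_kg / height_m**2)
'''

# Save the source to a file with a .py extension
folder = tempfile.mkdtemp()
Path(folder, "fitness_demo.py").write_text(module_source)

# Python searches the folders listed in sys.path when importing
sys.path.insert(0, folder)
importlib.invalidate_caches()
import fitness_demo

demo_bmi = fitness_demo.bmi_calc(80, 1.6)
print(demo_bmi)
```

A script saved in the same folder as the module does not need the sys.path step, because Python already searches the folder of the script being run.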

Using a module

We have created a module called fitness.py that contains four functions. We can now use these functions in any project. Importantly, if we later find a bug in our code, we only have to fix it in one location.

There are a few ways to access (or import) the functions we placed in our module.

import. The simplest is to import our module by its name and access its functions using dot-notation. This approach is very transparent because someone reading our code will immediately see that the function comes from a specific module.

import fitness

bmi = fitness.bmi_calc(80, 1.6)  # weight (kg), height (m)
max_HR = fitness.predict_max_HR(76)
bprisk = fitness.bp_risk(143, 91)

from x import y. If we only want to use one or two of the functions from our module, we can import them specifically. This allows us to use the functions without the dot-notation.

from fitness import bmi_calc, predict_max_HR

bmi = bmi_calc(80, 1.6)  # weight (kg), height (m)
max_HR = predict_max_HR(76)

import x as w. It is also possible to import a module and give it an alias. This is often done to reduce the amount of typing. This type of import is common with NumPy (numerical Python), import numpy as np, and with pandas (panel data; dataframes similar to those in R), import pandas as pd. For our current example:

import fitness as fit

bmi = fit.bmi_calc(80, 1.6)  # weight (kg), height (m)
max_HR = fit.predict_max_HR(76)
bprisk = fit.bp_risk(143, 91)
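
The same aliasing pattern works for any module. As a sketch using the standard library (the choice of the statistics module and the alias stats are just for illustration), an alias is simply a second name for the same module object:

```python
import statistics
import statistics as stats

# Both names point at the same module object
print(stats is statistics)

# Systolic pressures of the four subjects above
systolic_values = [120, 136, 113, 141]
mean_systolic = stats.mean(systolic_values)
print(mean_systolic)
```

Because the alias changes only the name we type, fixing a bug in the module still fixes it for every script, whatever alias each script uses.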

import x.y as z. It is also possible to give an alias to a sub-module. This approach is often used when importing matplotlib for plotting: import matplotlib.pyplot as plt. This is the same as from matplotlib import pyplot as plt. Both provide access to pyplot using the alias plt. Note that the import x.y form only works when y is a sub-module. Our fitness.py is a single file and predict_max_HR is a function, so import fitness.predict_max_HR as HRmax would raise an error. To give one of our functions an alias, we use the from x import y as z form instead:

from fitness import predict_max_HR as HRmax

max_HR = HRmax(76)
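
The difference between aliasing a sub-module and aliasing a function can be tried out with the standard library. In this sketch, os.path stands in for a sub-module and math.sqrt for a function; the aliases osp, osp2 and root are made up for the example.

```python
import importlib
import os.path as osp
from os import path as osp2

# Both forms bind the same sub-module object
same_object = osp is osp2
print(same_object)

# A function is not a sub-module, so the dotted form fails for it.
# importlib.import_module("math.sqrt") behaves like "import math.sqrt".
try:
    importlib.import_module("math.sqrt")
    dotted_import_failed = False
except ImportError:
    dotted_import_failed = True
print(dotted_import_failed)

# The from-import form is the one that works for functions
from math import sqrt as root
print(root(16))
```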

Putting it all together

We now have a module called fitness.py that contains our four functions. We can now import and use these functions to process subject data from any study.

Returning to our original example, we can now write a short processing script that imports our function and processes the data from our subjects:

from fitness import print_results

# subject =  [initials, weight, height, age, systolic, diastolic]
subject1 = ['GA', 80, 1.6, 70, 120, 80]
subject2 = ['KT', 69, 1.5, 65, 136, 75]
subject3 = ['MN', 80, 1.6, 89, 113, 75]
subject4 = ['PW', 80, 1.7, 55, 141, 96]

subjects = [subject1, subject2, subject3, subject4]

for sub in subjects:
    initials, weight, height, age, systolic, diastolic = sub
    print_results(initials, weight, height, age, systolic, diastolic)
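
Because each subject list is in the same order as the function's parameters, the unpacking step in the loop can also be written with Python's * operator. The sketch below uses a hypothetical stand-in function, summarise, in place of print_results so that it runs on its own:

```python
def summarise(initials, weight, height, age, systolic, diastolic):
    """Hypothetical stand-in for print_results; returns a one-line summary."""
    return "{}: {}kg, {}m, {} years, {}/{}".format(
        initials, weight, height, age, systolic, diastolic)

subject1 = ['GA', 80, 1.6, 70, 120, 80]
subject2 = ['KT', 69, 1.5, 65, 136, 75]

# *sub passes the six list items as six separate arguments
summaries = [summarise(*sub) for sub in [subject1, subject2]]
for line in summaries:
    print(line)
```

The explicit unpacking used in the post is longer but makes the order of the values visible, which helps catch a weight and height that have been swapped.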

Summary

We have learned how to use Python functions and modules to avoid repeating ourselves in the code we write. In addition to conforming to the DRY principle (don’t repeat yourself), using functions and modules helps us write easy-to-read code. Consider our last example. It is clear what the code is doing. The details of how the fitness module and the print_results function work are hidden away from the user in a separate file (i.e., fitness.py). Once we have debugged the functions in our fitness module and ensured they are correct, we don’t have to see the code each time we use it.

In our next post we will learn more about modules and how we can turn them into stand-alone programs.

tagged with DRY, functions, modules, Python

4 comments

OS
January 31, 2020 3:34 pm

Thanks for this great blog and for this post in particular.
I followed the recommendation and made a module of generic functions and it works. However I get errors whenever I try to make modules out of parts of a main script analysing data.
For example, I get EMG and force data for different tests. All the data are imported and set into a large dataframe and I access each test separately and process them in different blocks of code in my current main script. I want to move these blocks out of the main script but structuring modules for them (made of a large function with nested functions) and to use the data imported in the main script is difficult… Can you point at a good resource that shows an example for this type of data analysis?

- Martin Héroux
  February 2, 2020 10:26 pm
  
  Hi there,
  
  Thanks for your interest and positive feedback.
  
  It is sometimes hard to find the right resource for a given problem, especially when it comes to data analysis. Personally, I found the book by Hans Petter Langtangen entitled ‘A Primer on Scientific Programming with Python’ to be a good resource. Having said that, it may not address your particular problem.
  
  Another thing I have learned is to create the simplest working version of your new code and slowly add functionality, testing each step. When an error arises, address it (or google it, then address it). It is sometimes hard to work with a large piece of code (like your script).
  
  It is hard to provide assistance without more information and code. If you want to put your code and sample data on github, I would be happy to have a look and give you some pointers. For this to work, the code needs to be fairly clear and well documented. Let me know if you decide to take me up on this offer…
  
  - OS
    February 9, 2020 8:05 am
    
    Thank you for the kind offer and for your advice. This reminds me that I should use Github at some point. I may take you up on this another time but I have found the cause of the problem this time. I simply had the wrong indentation for the main function of my modules… Again, this blog is a great resource, not least because it relates to the physiology/biomechanics fields, which is hard to come by at the moment.
    
Omer
February 2, 2019 1:49 pm

Good summary
