Python: Working with text files & PubMed references part 6

I am assuming you have followed the previous tutorials in this short series on how to manipulate PubMed references using Python (1, 2, 3, 4, 5).
We have cleaned the references and had a user select the references that are to be kept. These references are now stored in refs_keep.txt. The last thing we have to do is extract the PMIDs and e-mails from these references.

Reading in and organizing references

The code below reads in the references the user decided to keep and organizes them so that the 6 lines of each reference are the 6 items of a list; each reference list is then itself stored in a list so that we can iterate over all the references. Given that this is the last of these tutorials, you should be able to figure out what the code is doing by reading the in-line comments. Because we want to search for e-mail addresses, we will be using the re (regular expression) module. And because we will be working with a formatted, comma-separated PMID-email file, we also import the csv (comma-separated values) module.

# Import required modules
import re
import csv

# Read in all lines and remove line-breaks using .strip().
refs = []
f_keep = 'refs_keep.txt'
with open(f_keep, 'r') as keep:
    for line in keep:
        refs.append(line.strip())

# Create a list where each element contains the information for a single reference.
# Each element (e.g., refs_list[0]) is itself a list of the lines of one reference,
# where refs_list[n][4] is the author info section and refs_list[n][6] is
# the PMID.
refs_list = []  # list to hold references (grows as needed)
cur_ref = ['','','','','','','']
cr_index = 0  # Counter for current reference
for line in refs:
    cur_ref[cr_index] = line
    cr_index += 1
    # Check if we are at the last item of the ref (i.e., PMID)
    # If yes, add reference to 'refs_list' and
    # re-initialize variables
    if line.split(' ')[0] == 'PMID:':
        refs_list.append(cur_ref)
        cur_ref = ['','','','','','','']
        cr_index = 0
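
The same grouping can also be written as a small function that does not need fixed-size lists. What follows is only a sketch of one possible approach, not the code from the script: the name group_references and the sample lines are made up for illustration, and it assumes, as above, that every reference ends with a line starting with 'PMID:'.

```python
def group_references(lines):
    """Group stripped lines into one list per reference.

    Assumes each reference ends with a line starting with 'PMID:'.
    """
    refs_list = []
    cur_ref = []
    for line in lines:
        cur_ref.append(line)
        # A PMID line closes the current reference
        if line.split(' ')[0] == 'PMID:':
            refs_list.append(cur_ref)
            cur_ref = []
    return refs_list


# Made-up sample standing in for the stripped lines of refs_keep.txt
sample_lines = [
    '1. Some Journal. 2014;1(1):1-10.',
    'A made-up title.',
    'Author A, Author B.',
    '',
    'Author information: Some University. author.a@example.org.',
    '',
    'PMID: 12345678',
    '2. Another Journal. 2013;2(3):20-30.',
    'Another made-up title.',
    'Author C.',
    '',
    'Author information: Another Institute.',
    '',
    'PMID: 87654321',
]

grouped = group_references(sample_lines)
print(len(grouped))      # number of references found
print(grouped[0][4])     # author info line of the first reference
print(grouped[1][6])     # PMID line of the second reference
```

Because the lists grow with append, this version copes with any number of references and with references of different lengths, although indices such as [4] and [6] are then only meaningful if every reference has the same layout.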

Extracting PMIDs and e-mails

Now we want to create two files, each of which lists references by PMID. For references that contain one or more e-mail addresses, the addresses are placed after the relevant PMID, separated by commas. We also save the PMIDs of references without e-mails because it is easy to cut and paste a series of PMIDs into PubMed to see all the references, follow their web-links to locate a PDF version of each paper and extract the e-mail address of the corresponding author. Note that I did not know how to extract e-mails from text, so I searched on Google and found someone who had done it using regular expressions: re.findall(<regular expression>, <text to search>).
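
To get a feel for what re.findall returns, here is a small stand-alone sketch. The author information string and the addresses in it are made up for illustration; the pattern is the same one used in the code for this step.

```python
import re

# Made-up author information line; the addresses are placeholders
au_info = ('Department of Examples, Some University. a.author@example.org; '
           'Other Institute. Electronic address: b_author@mail.example.com.')

# One or more word characters, dots or hyphens on each side of an '@'
pattern = r'[\w\.-]+@[\w\.-]+'
found = re.findall(pattern, au_info)
print(found)

# The pattern also swallows the full stop that ends a sentence,
# so trailing periods are removed afterwards
cleaned = [email.rstrip('.') for email in found]
print(cleaned)
```

re.findall returns a plain list of matching strings (an empty list when nothing matches), which is why the length of the list can be used to decide which file a PMID belongs in. This simple pattern is convenient rather than strict: it will accept some strings that are not valid addresses.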

# Cycle through each reference and determine if there are e-mails in the author
# information section. If yes, store them along with the PMID. If not,
# simply store the PMID.

# Open files to store PMIDs and e-mails
# ('a' appends, so re-running the script adds to any existing files)
pmid_emails_file = 'pmid_email.txt'
pmid_alone_file = 'pmid.txt'
pmid_emails = open(pmid_emails_file, 'a')
pmid_alone = open(pmid_alone_file, 'a')
# Loop through each reference
for ref in refs_list:
    if ref is not None:
        # Look for e-mails in author information section
        au_info = ref[4]
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', au_info)
        # Loop through the items of the reference and extract the PMID
        for item in ref:
            if item.split(' ')[0] == 'PMID:':
                pmid = item.split(' ')[1]
        # Write PMID (with or without e-mails) to the appropriate file.
        # This is done once per reference, after the PMID has been found.
        if len(emails) > 0:
            # Remove trailing period from e-mails
            for i, email in enumerate(emails):
                emails[i] = email.rstrip('.')
            # Insert PMID before e-mails
            emails.insert(0, pmid)
            # Add some blank values to have a total of 20 items,
            # just in case we locate many e-mails
            for _ in range(20 - len(emails)):
                emails.append(' ')
            # Join items of list using a comma
            data_to_save = ', '.join(emails)
            # Write PMID and e-mails to file
            pmid_emails.write(data_to_save)
            pmid_emails.write('\n')
        else:
            # Write PMID to file
            pmid_alone.write(pmid)
            pmid_alone.write('\n')
pmid_emails.close()
pmid_alone.close()

Amazing! We now have two files: one containing PMIDs and their associated e-mails, and the other containing the PMIDs of references for which we still need to locate e-mails.
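
The csv module imported at the top can also take care of the comma-separated formatting. The sketch below shows one possible way to write and read back such rows with csv.writer and csv.reader; it is an alternative under stated assumptions, not the method used in the script. The rows are made up, and an in-memory buffer stands in for the real file so that the example runs on its own.

```python
import csv
import io

# Made-up rows: a PMID followed by the e-mails found for that reference
rows = [
    ['12345678', 'a.author@example.org'],
    ['87654321', 'b_author@mail.example.com', 'c.author@example.org'],
]

# An in-memory buffer stands in for a file opened with
# open('pmid_email.txt', 'a', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
for row in rows:
    writer.writerow(row)

# Read the rows back, as one would when using the file later
buffer.seek(0)
for row in csv.reader(buffer):
    pmid, emails = row[0], row[1:]
    print(pmid, emails)
```

With csv.writer each row can hold a different number of e-mails, so padding every row to a fixed number of items is optional and only matters if the program that later opens the file expects a fixed number of columns.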

Summary

We are finally done, and we have learned a lot of new Python skills along the way! Being able to work with text (i.e., string) variables is a basic programming skill. While I am sure there are nicer and faster ways to clean and sort references and then extract email addresses, I hope this series of tutorials gave you an appreciation of what can be done with Python.

Although I presented these tutorials as a series of cut and paste code examples, the actual code was written as a Python script (Pubmed.py) and the various tasks were broken down into individual functions. The script can be viewed and downloaded here.

 
