Python: Working with text files & PubMed references part 6
I am assuming you have followed the previous tutorials in this short series on how to manipulate PubMed references using Python (1, 2, 3, 4, 5).
We have cleaned the references and had a user select the references that are to be kept. These references are now stored in
refs_keep.txt. The last thing that we have to do is extract PMIDs and e-mails from these references.
Reading in and organizing references
The code below reads in the references the user decided to keep and organizes them so that the 6 lines of each reference are the 6 items of a list; each reference list is then itself stored in a list so that we can iterate over all the references. Given that this is the last of these tutorials, you should be able to figure out what the code is doing by reading the in-line comments. Because we want to search for e-mail addresses, we will be using the
re (regular expression) module. And because we will be reading our formatted PMID-email file, we will be using the
csv (comma-separated values) module.
    # Import required modules
    import re
    import csv

    # Read in all lines and remove line-breaks using .strip().
    refs = []
    f_keep = 'refs_keep.txt'
    with open(f_keep, 'r') as keep:
        for line in keep:
            refs.append(line.strip())

    # Create a list where each element contains the information for a
    # single reference. Each element (e.g., refs_list[n]) is itself a list
    # of the lines of that reference, including the author information
    # section and, as its last item, the PMID line.
    refs_list = []   # list to hold all references
    cur_ref = []     # lines of the reference currently being read
    for line in refs:
        cur_ref.append(line)
        # Check if we are at the last item of the ref (i.e., PMID).
        # If yes, add the reference to 'refs_list' and start a new one.
        if line.split(' ')[0] == 'PMID:':
            refs_list.append(cur_ref)
            cur_ref = []
Extracting PMIDs and e-mails
Now we want to create two files, each containing PMIDs. For references that contain one or more e-mail addresses, the addresses are placed after the relevant PMID, separated by commas. The PMIDs of references without e-mails are saved to a separate file, because it is easy to cut and paste a series of PMIDs into PubMed to see all the references, follow their web-links to locate a PDF version of each paper, and extract the e-mail address of the corresponding author. Note that I did not know how to extract e-mails from text, so I searched on Google and found someone who had done it using regular expressions:
re.findall(<regular expression>, <text to search>).
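To get a feel for what this call returns before running it on real references, here is a minimal sketch applied to a single made-up author information line. The sample text and the names au_info, EMAIL_PATTERN and raw_emails are assumptions for illustration only; the pattern is the one used in this tutorial.

```python
import re

# Hypothetical author information line, made up for illustration.
au_info = ("Department of Physiology, Example University. "
           "jane.doe@example.edu, j_smith@lab.example.org.")

# Same pattern as in this tutorial: runs of word characters, dots and
# hyphens on either side of an '@'.
EMAIL_PATTERN = r'[\w\.-]+@[\w\.-]+'

# re.findall returns a list of every non-overlapping match.
raw_emails = re.findall(EMAIL_PATTERN, au_info)

# The pattern allows dots, so a sentence-ending period is captured too;
# strip it from each address.
emails = [email.strip('.') for email in raw_emails]

print(raw_emails)
print(emails)
```

Because the pattern accepts dots anywhere after the '@', an address that ends a sentence comes back with the final period attached, which is why the period is stripped afterwards.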
    # Cycle through each reference and determine if there are e-mails in the
    # author information section. If yes, store them along with the PMID.
    # If not, simply store the PMID.

    # Open files to store PMIDs and e-mails
    pmid_emails_file = 'pmid_email.txt'
    pmid_alone_file = 'pmid.txt'
    pmid_emails = open(pmid_emails_file, 'a')
    pmid_alone = open(pmid_alone_file, 'a')

    # Loop through each reference
    for ref in refs_list:
        if ref is None:
            continue
        # Look for e-mails. Every line of the reference is searched, on the
        # assumption that addresses only occur in the author information section.
        au_info = ' '.join(ref)
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', au_info)
        # Remove trailing period from e-mails
        emails = [email.strip('.') for email in emails]

        # Loop through, extract PMID and write (with or without e-mails)
        # to appropriate file
        for item in ref:
            parts = item.split(' ')
            if parts[0] == 'PMID:':
                pmid = parts[1]
                if len(emails) > 0:
                    # Insert PMID before e-mails
                    row = [pmid] + emails
                    # Add some blank values to have a total of 20 items,
                    # just in case we locate many e-mails
                    row += [' '] * (20 - len(row))
                    # Join items of list using a comma
                    data_to_save = ', '.join(row)
                    # Write PMID and e-mails to file
                    pmid_emails.write(data_to_save + '\n')
                else:
                    # Write PMID to file
                    pmid_alone.write(pmid + '\n')

    pmid_emails.close()
    pmid_alone.close()
Amazing! We now have two files: one containing PMIDs and their associated e-mails, and the other containing the PMIDs of references for which we still need to locate e-mails.
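The csv module was imported so that the PMID-email file can be read back later. The following is a minimal sketch of how that might look, assuming the layout described above (a PMID, then e-mails, then blank padding, all separated by a comma and a space). The function name read_pmid_emails and the sample values are hypothetical, and the sample file is written to a temporary folder so that the example runs on its own.

```python
import csv
import os
import tempfile


def read_pmid_emails(path):
    """Return a dict mapping each PMID to its list of e-mail addresses."""
    pmid_to_emails = {}
    with open(path, newline='') as f:
        for row in csv.reader(f):
            # Strip the space that follows each comma.
            fields = [field.strip() for field in row]
            if not fields or not fields[0]:
                continue  # skip empty lines
            # Blank padding items become empty strings; drop them.
            pmid_to_emails[fields[0]] = [field for field in fields[1:] if field]
    return pmid_to_emails


# Build a small sample file in the same layout (all values are made up).
sample_rows = [
    ['11111111', 'jane.doe@example.edu'],
    ['22222222', 'a.b@example.org', 'c-d@example.com'],
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'pmid_email.txt')
    with open(sample_path, 'w') as f:
        for row in sample_rows:
            padded = row + [' '] * (20 - len(row))
            f.write(', '.join(padded) + '\n')
    result = read_pmid_emails(sample_path)

print(result)
```

Stripping every field and discarding the empty ones means the reader does not depend on exactly how many padding items were written.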
We are finally done, and we have learned a lot of new Python skills along the way! Being able to work with text (i.e., string) variables is a basic programming skill. While I am sure there are nicer and faster ways to clean and sort references and then extract email addresses, I hope this series of tutorials gave you an appreciation of what can be done with Python.
Although I presented these tutorials as a series of cut and paste code examples, the actual code was written as a Python script (Pubmed.py) and the various tasks were broken down into individual functions. The script can be viewed and downloaded here.
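To give an idea of what breaking tasks into functions can look like, here is a minimal sketch of the grouping and PMID-extraction steps written as two small functions. This is not the code from Pubmed.py; the function names group_references and extract_pmid and the sample lines are assumptions made for illustration.

```python
def group_references(lines):
    """Group stripped lines into references, ending each at its 'PMID:' line."""
    references = []
    current = []
    for line in lines:
        current.append(line)
        if line.startswith('PMID:'):
            references.append(current)
            current = []
    return references


def extract_pmid(reference):
    """Return the PMID from the reference's 'PMID:' line, or None if absent."""
    for line in reference:
        parts = line.split()
        if len(parts) > 1 and parts[0] == 'PMID:':
            return parts[1]
    return None


# Made-up lines standing in for the contents of refs_keep.txt.
lines = [
    'Doe J, Smith A.', 'A made-up title.', 'PMID: 11111111',
    'Roe R.', 'Another made-up title.', 'PMID: 22222222',
]
references = group_references(lines)
pmids = [extract_pmid(ref) for ref in references]
print(len(references), pmids)
```

Small functions like these can be tested on a handful of made-up lines before being run on a full reference file.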