Python: Working with text files, an example using PubMed references
My colleagues and I recently needed to identify all the PubMed references on a given topic and locate email addresses of the corresponding authors. The good news is that the
Author information section of PubMed references contains one or more email addresses approximately half of the time. This meant that I could automate the extraction of these email addresses by searching through the exported references.
Why automate? Automating the process of extracting email addresses serves a few purposes. As humans, we are not very good at doing repetitive tasks without making mistakes. Computers, however, were designed for these sorts of things, and they can churn through 10, 100 or even 10,000 references with the same accuracy. Automating tasks using Python also means we have a record of how each task was done, so we can reap the benefits of our work if we ever need to do a similar task in the future. It also means that others can benefit if we choose to make our code available. In this way, our research becomes more transparent and reproducible.
We also wanted to make sure that we only included references that were relevant to our study. Therefore, before extracting email addresses, we needed to review each reference and determine whether it should be included.
I am sure there are more clever ways of filtering references and extracting emails. However, I figured out a way to do it and I learned a lot about manipulating text files along the way. By explaining how I did it, maybe you can learn something too. This series of tutorials will explain how to do the following in Python:
- open, manipulate and save text files;
- search and manipulate text;
- present text to users;
- get user input.
Getting the PubMed References
For this tutorial, I went to the PubMed website and conducted the following search:
synapse [Title] AND Journal of Neuroscience [Journal] AND 2015 [Date - Publication]. This search identified 14 references. I saved these references using
Send to -> file -> Abstract(txt) to a file called
pubmed_results.txt, which can be obtained here (copy and paste the 606 lines into a text editor and save the file).
You can see that the file contains the 14 references, some of which contain email addresses in the
Author Information section.
Start cleaning the Reference File
The first thing we want to do is clean the file to make it easier to work with. By cleaning I mean removing information that is not required, empty lines, end-of-line characters, etc. Also, the
pubmed_results.txt file currently contains related text that spans several lines; it would make things easier if related text was grouped together.
refs = []
pubMedFile = open('pubmed_results.txt', 'r')
for line in pubMedFile:
    if line == '\n':
        refs.append(line)
    else:
        refs.append(line.strip())
pubMedFile.close()
Line 1. Initialize the refs variable as an empty list to hold the text from the references that we will read in.
Line 2. Open pubmed_results.txt for reading (using the 'r' flag) in a variable called pubMedFile.
Line 3. Use a for loop to process each line of the reference file.
Line 4. Check whether the current line is an empty line (i.e., it contains only an end-of-line character, \n. You don’t see these characters when you open the file in a text editor, but they are what tells the text editor that it has reached the end of a line).
Line 5. If the current line is only an end-of-line character, append the line to the refs list unchanged.
Line 6. Otherwise, if the current line is more than just an end-of-line character…
Line 7. Remove, or strip, the end-of-line character from the current line of text (strip() also removes any other leading or trailing whitespace) and append the remainder to the refs list.
Line 8. Close the link to the pubmed_results.txt file.
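To see why the test on Line 4 works, it can help to try the same logic on a few strings by hand. The sketch below uses made-up sample lines (they are not taken from pubmed_results.txt) and repr() to make the normally invisible end-of-line characters visible. The names sample_lines and cleaned are only for illustration.

```python
# Hypothetical sample lines, standing in for lines read from a text file.
sample_lines = ['Author information:\n', '\n', '  indented text  \n']

cleaned = []
for line in sample_lines:
    if line == '\n':
        # An empty line consists of the end-of-line character only; keep it as is.
        cleaned.append(line)
    else:
        # strip() removes leading and trailing whitespace, including '\n'.
        cleaned.append(line.strip())

# repr() shows the otherwise invisible '\n' characters.
for before, after in zip(sample_lines, cleaned):
    print(repr(before), '->', repr(after))
```

Note that the third sample line loses its leading and trailing spaces as well as its end-of-line character, because strip() with no arguments removes all surrounding whitespace.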
Great, we read in our reference file and removed the end-of-line characters from the lines that contained text. Have a look at your
refs variable to get a better idea of what the code actually did. Let’s print the first 53 items, which contain all the text for the first reference. Because Python uses zero-based indexing (i.e., it counts 0, 1, 2, 3, … rather than 1, 2, 3, 4, …) and shows values up to, but not including, the last index in a range, we will ask Python to print refs[0:53].
Here is what the first 3 items of the output should look like:
['1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.', '\n', 'Persistent Associative Plasticity at an Identified Synapse Underlying Classical']
We know we are viewing the items of a list because they are contained within
[ ]. Each item in the
refs list is a string, so each one is enclosed in single quotation marks. Note that the second item is an end-of-line character and that the third item contains only the first part of the title; the rest of the title is located in the fourth item.
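If zero-based indexing and ranges are new to you, the following minimal sketch may help. It uses a short stand-in list called items (a hypothetical name, not part of the tutorial code) rather than the much longer refs list.

```python
# A short stand-in list; the real refs list is much longer.
items = ['a', 'b', 'c', 'd', 'e']

first_three = items[0:3]  # indices 0, 1 and 2; index 3 is not included
second_item = items[1]    # zero-based: index 1 is the second item

print(first_three)
print(second_item)
print(len(first_three))
```

The range 0:3 returns exactly three items, in the same way that a range from 0 to 53 returns the first 53 items of a longer list.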
Summary of Part 1
We have learned how to open a text file for reading using
open(<filename>,'r'), and inspect each line of a text file using a for loop. We have also used the
<string variable>.strip() method to remove end-of-line characters from string variables. However, our
refs variable is still messy. In Part 2 you will finish cleaning up the references and save the result in a new text file.
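As a recap, the reading step from Part 1 can also be sketched with a with block, which closes the file automatically so that no separate close() call is needed. This is an alternative way of writing the same logic, not the code used in this tutorial. The function name read_reference_lines and the small sample file are assumptions made so that the sketch runs without pubmed_results.txt.

```python
import os
import tempfile


def read_reference_lines(path):
    """Hypothetical helper: read a text file into a list, as in Part 1."""
    refs = []
    # The with block closes the file automatically, even if an error occurs.
    with open(path, 'r') as pub_med_file:
        for line in pub_med_file:
            if line == '\n':
                refs.append(line)
            else:
                refs.append(line.strip())
    return refs


# Small made-up file so the sketch runs without pubmed_results.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_refs.txt')
    with open(sample_path, 'w') as sample_file:
        sample_file.write('1. Example Journal. 2015.\n\nAn example title\n')
    sample_refs = read_reference_lines(sample_path)

print(sample_refs)
```

To use it on the real data, you would call read_reference_lines('pubmed_results.txt') from the folder that contains the file.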