Python: Working with text files & PubMed references part 2
In the previous post, I explained how I recently had to sort through PubMed references to extract email addresses. As part of the tutorial, I had you download a sample text file containing 14 references, and we read the content of this file in Python. We are now going to continue cleaning the references to have them in a standardized format. In this tutorial we will loop over each line of the imported text and make decisions depending on its content.
To make it easier to work with the references, let's give them all a similar format. For this series of tutorials, we will use the following format:
- [0] = '-----'
- [1] = journal details
- [2] = paper title
- [3] = authors
- [4] = authors details
- [5] = abstract
- [6] = PMID
The above format means that each new reference will start with a separator ('-----'), followed by 6 other relevant items. If all items are present, the reference will span 7 items in the list containing our cleaned references. For example, if the above items were part of a list called refs_format, we could list all authors of the first reference by typing refs_format[3]. Similarly, refs_format[6] would tell you the PMID of the first reference.
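As a minimal sketch of this layout, the list below holds one reference in the target format. Every value is a made-up placeholder rather than data from the sample file; only the positions matter.

```python
# Hypothetical cleaned reference: all values are placeholders.
refs_format = [
    '-----',                          # [0] separator
    '1. J Example. 2015;1(1):1-10.',  # [1] journal details
    'A made-up paper title.',         # [2] paper title
    'Author A, Author B.',            # [3] authors
    'Made-up author details.',        # [4] authors details
    'A made-up abstract.',            # [5] abstract
    'PMID: 00000000',                 # [6] PMID
]

print(refs_format[3])  # authors of the first reference
print(refs_format[6])  # PMID of the first reference
```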
Once cleaned, we will be able to write our references to a new text file where each of these items will be located on a new line. Gone will be empty lines, as well as related text spanning more than one line.
Continuing our reference clean up
I am assuming you have followed the previous tutorial. You should have a list variable called refs in your Python console.
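If refs is no longer in your console, something like the sketch below should recreate it. It assumes the previous tutorial read the file with readlines(), which keeps the '\n' at the end of each line (the code in this post relies on that). The helper name load_refs is made up for this example.

```python
from pathlib import Path


def load_refs(path):
    # Hypothetical helper: return every line of the file as a list item,
    # keeping the newline character at the end of each line.
    with open(path) as f:
        return f.readlines()


sample = Path('pubmed_results.txt')  # sample file name used in this series
if sample.exists():
    refs = load_refs(sample)
    print(len(refs), 'lines read')
```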
Let’s make a first pass at cleaning our references:
new_line = '-----'  # Line 1
refs2 = []  # Line 2
for i, line in enumerate(refs):  # Line 3
    if i == 0:  # Line 4
        new_line = line  # Line 5
    else:  # Line 6
        if line.split(' ')[0] == 'PMID:':  # Line 7
            refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])  # Line 8
            refs2.append('-----')  # Line 9
        elif line == '\n' and refs[i-1] != '\n' and refs[i-1].split(' ')[0] != 'PMID:':  # Line 10
            refs2.append(new_line)  # Line 11
        elif line == '\n' and refs[i-1] == '\n':  # Line 12
            pass  # Line 13
        elif line != '\n' and refs[i-1] == '\n':  # Line 14
            new_line = line  # Line 15
        else:  # Line 16
            new_line = new_line + ' ' + line  # Line 17
Line 1. Initialize the new_line variable with our separator.
Line 2. Initialize the refs2 variable as a list. This is where we will store the next iteration of our references.
Line 3. Use a for loop with the enumerate function. We will process each element of the refs variable and use a variable i to keep track of the current loop number for us. The first time through the loop, i will have a value of 0 (zero) and line will have a value of refs[0]. The second time through the loop, i will have a value of 1 and line will have a value of refs[1]. This will continue until Python has enumerated over all items in refs.
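To see what enumerate hands to the loop, here is a small sketch that uses a three-item stand-in list instead of refs (the list and the variable seen are made up for this example).

```python
letters = ['a', 'b', 'c']  # stand-in for refs

seen = []
for i, line in enumerate(letters):
    # i is the loop counter, line is letters[i]
    print(i, line)
    seen.append((i, line))
```

Each pass through the loop prints the counter next to the item it belongs to, starting at 0.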
Line 4. Later in the loop, we will be using the current value of line (i.e., refs[i]) as well as the previous value (i.e., refs[i-1]). Indexing refs this way does not work on the first pass through the loop because refs[0-1] is equal to refs[-1], which is Python shorthand for indexing from the end of a list. For example, refs[-2] would be the second-last item in the list. So we need to force the behaviour of the first iteration through the loop by writing a special if statement.
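The wrap-around is easy to check in the console. This sketch uses a made-up three-item list in place of refs.

```python
items = ['first', 'second', 'third']  # stand-in for refs

i = 0
previous = items[i - 1]   # items[-1]: wraps around to the last item
second_last = items[-2]   # counts back two places from the end

print(previous)
print(second_last)
```

On the first iteration the "previous" item is really the last item in the list, which is why the loop treats i == 0 separately.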
Line 5. When i is zero, we will assign the current value of line to our new_line variable. In our case, we will assign '1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'.
Line 6. When i is not equal to zero, we will execute the following code.
We want to do different things depending on the current and previous value of line. We will control which of these things is executed using an if elif else structure. Python evaluates each of these logic statements in order. If one of them is True, Python will execute the code indented below the corresponding statement. Once this indented code is completed, Python will skip the rest of the logic tests, go back to the top of the loop, assign the next available value to line, and increment i by one. If none of these statements is True, Python will run the code indented below else.
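The "first True wins" behaviour can be sketched with a toy example. The function describe and its thresholds are invented purely for illustration.

```python
def describe(n):
    # Tests are checked from top to bottom; only the first True branch runs.
    if n < 0:
        return 'negative'
    elif n < 10:
        return 'small'
    elif n < 100:
        return 'medium'
    else:
        # Reached only when every test above was False.
        return 'large'


labels = [describe(n) for n in [-5, 3, 50, 500]]
print(labels)
```

Note that 3 is smaller than both 10 and 100, yet it is only ever labelled 'small': once a test passes, the remaining tests are skipped.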
Line 7. First split the content of line using a single space as the separator (i.e., ' '). This is done using split(), which is a method that belongs to string variable types.
As an example, pretend we had a variable x = 'Pubmed is great'. If we applied the split() method to this string variable and specified a single space as the separator, x.split(' '), Python would return ['Pubmed', 'is', 'great'], a list containing 3 strings. We could also index items from this list. For example, x.split(' ')[0] would return the first item in the outputted list, which in this case would be 'Pubmed'.
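You can try this example directly. The second string below is an extra made-up case showing that two spaces in a row produce an empty string in the result, which is worth knowing when splitting real text.

```python
x = 'Pubmed is great'
words = x.split(' ')
print(words)     # the three words as a list
print(words[0])  # the first word

# Two consecutive spaces leave an empty string between the words.
gappy = 'a  b'.split(' ')
print(gappy)
```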
Thus, our code is assessing whether the first word on the current line is equal to 'PMID:'. If it is equal to this value, the next two lines of code will be executed; if it is not equal to this value, Python will move on and verify the next logic statement.
Line 8. Given that the logic test found on Line 7 was true, we extract the first and second value of line.split(' '), concatenate them separated by a space, and append the resulting string to our refs2 list.
But why not simply append line to refs2? This is because PubMed references contain additional information on this line that we are not interested in (e.g., [PubMed - in process], [PubMed - indexed for MEDLINE]).
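Here is a small sketch of Lines 7 and 8 applied to a single line. The PMID number is made up, and the line assumes one space between the number and the bracketed note.

```python
# Made-up PMID line; the real file has one such line per reference.
line = 'PMID: 12345678 [PubMed - in process]\n'

parts = line.split(' ')
if parts[0] == 'PMID:':
    # Keep only the label and the number, dropping the bracketed note.
    pmid = parts[0] + ' ' + parts[1]
    print(pmid)
```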
Line 9. Because the PMID is the last item of each reference (check your pubmed_results.txt if you want to see for yourself!), we will mark the end of the reference by appending our chosen separator, '-----'.
Line 10. Verify whether line is an end-of-line character, and whether the previous value of line (i.e., refs[i-1]) was not (!=) an end-of-line character and not the line starting with 'PMID:'.
Line 11. If all three of the above logic tests evaluated to True, append the current value of new_line to refs2.
Line 12. Verify whether line and the previous value of line are both end-of-line characters.
Line 13. In the case of two blank lines following each other, we tell Python to do nothing (pass) and skip to the top of the for loop.
Line 14. Verify whether line is not an end-of-line character, but the previous value of line was an end-of-line character.
Line 15. In this case, we assign the value of line to our new_line variable. This writes over the previous content of the new_line variable.
Line 16. If all the previous logic tests contained in the if and elif statements evaluated to False, execute the following indented code.
Line 17. Make the current value of new_line equal to the previous value of new_line plus the current value of line; separate these two items with a space. This is the line that will ensure that the text associated with a single category, say the Abstract, is kept together.
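To watch the whole loop work without the sample file, the sketch below wraps the same logic in a function and runs it on a tiny made-up reference. The function name clean_refs and the contents of refs_sample are invented for this example; only the loop body comes from the code above.

```python
def clean_refs(refs):
    # Same logic as the 17-line loop, wrapped in a function for testing.
    new_line = '-----'
    refs2 = []
    for i, line in enumerate(refs):
        if i == 0:
            new_line = line
        else:
            if line.split(' ')[0] == 'PMID:':
                refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])
                refs2.append('-----')
            elif (line == '\n' and refs[i-1] != '\n'
                  and refs[i-1].split(' ')[0] != 'PMID:'):
                refs2.append(new_line)
            elif line == '\n' and refs[i-1] == '\n':
                pass
            elif line != '\n' and refs[i-1] == '\n':
                new_line = line
            else:
                new_line = new_line + ' ' + line
    return refs2


# Made-up reference in roughly the layout of the sample file.
refs_sample = [
    '1. J Example. 2015 Dec 9;1(1):1-10.\n',
    '\n',
    'A made-up title that spans\n',
    'two lines.\n',
    '\n',
    'Author A, Author B.\n',
    '\n',
    '\n',
    'PMID: 12345678 [PubMed - in process]\n',
    '\n',
]

cleaned = clean_refs(refs_sample)
for item in cleaned:
    print(repr(item))
```

The two title lines end up in a single list item, the double blank line is ignored, and the PMID line is trimmed and followed by the separator. The items still contain their '\n' characters at this stage.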
Understanding the code
If you are still confused about what the above code is doing, a useful learning tool is to add print() statements to the code in order to visualize what is going on. For the current code, you could add the following three lines of code between lines 3 and 4. Below is a revised version of the code that cycles through the content associated with the first reference. For each value of line, try to figure out which indented lines of code of the if else and the if elif else will be executed.
new_line = '-----'
refs2 = []
for i, line in enumerate(refs[0:51]):
    print('i = ', i)
    print('line = ', line)
    input('Press Enter to continue...')  # used to pause the program
    if i == 0:
        new_line = line
    else:
        if line.split(' ')[0] == 'PMID:':
            refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])
            refs2.append('-----')
        elif line == '\n' and refs[i-1] != '\n' and refs[i-1].split(' ')[0] != 'PMID:':
            refs2.append(new_line)
        elif line == '\n' and refs[i-1] == '\n':
            pass
        elif line != '\n' and refs[i-1] == '\n':
            new_line = line
        else:
            new_line = new_line + ' ' + line
What have we done thus far?
Take a look at the contents of your refs2 list by typing print(refs2) in your Python terminal. You will note that related text is now together as part of one item in the list. For example, the whole title of the article is now a single item in refs2, whereas in refs this title was split into two list items.
This tutorial has explained in detail how you might change what action is executed based on the value of a string variable (i.e., line). Working with this type of flow control can be daunting at first. If you are having trouble understanding all the steps, consider writing a new, simpler example and manually stepping through each loop iteration using print() statements. Very soon you will understand how this type of flow control can be used to process text.
In the next tutorial we will apply some of the same principles to further clean up our references.