Python: Working with text files & PubMed references part 2

Posted on May 3, 2016 by Martin Héroux

In the previous post, I explained how I recently had to sort through PubMed references to extract email addresses. As part of the tutorial, I had you download a sample text file containing 14 references, and we read the content of this file in Python. We are now going to continue cleaning the references to have them in a standardized format. In this tutorial we will loop over each line of the imported text and make decisions depending on its content.

Desired format

To make it easier to work with the references, let's give them all the same format. For this series of tutorials, we will use the following format:

[0] = '-----'
[1] = journal details
[2] = paper title
[3] = authors
[4] = authors details
[5] = abstract
[6] = PMID

The above format means that each new reference will start with a separator ('-----'), followed by 6 other relevant items. If all items are present, the reference will span 7 items in the list containing our cleaned references. For example, if the above items were part of a list called refs_format, we could list all authors of the first reference by typing refs_format[3]. Similarly, refs_format[6] would tell you the PMID of the first reference.
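
As a minimal sketch of this layout, the list below holds one reference in the desired format. The name refs_format comes from the paragraph above; every value in the list is placeholder text, not real data.

```python
# One reference in the desired 7-item format.
# All values are made-up placeholders for illustration.
refs_format = [
    '-----',            # [0] separator
    'journal details',  # [1]
    'paper title',      # [2]
    'authors',          # [3]
    'authors details',  # [4]
    'abstract',         # [5]
    'PMID: 00000000',   # [6] placeholder PMID
]

authors = refs_format[3]
pmid = refs_format[6]

print(authors)           # authors
print(pmid)              # PMID: 00000000
print(len(refs_format))  # 7
```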

Once cleaned, we will be able to write our references to a new text file where each of these items will be located on a new line. Gone will be empty lines, as well as related text spanning more than one line.

Continuing our reference clean up

I am assuming you have followed the previous tutorial. You should have a list variable called refs in your Python console.

Let’s make a first pass at cleaning our references:

new_line = '-----'
refs2 = []
for i, line in enumerate(refs):
    if i == 0:
        new_line = line
    else:
        if line.split(' ')[0] == 'PMID:':
            refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])
            refs2.append('-----')
        elif line == '\n' and refs[i-1] != '\n' and refs[i-1].split(' ')[0] != 'PMID:':
            refs2.append(new_line)
        elif line == '\n' and refs[i-1] == '\n':
            pass
        elif line != '\n' and refs[i-1] == '\n':
            new_line = line
        else:
            new_line = new_line + ' ' + line

Line 1. Initialize the new_line variable with our separator.

Line 2. Initialize the refs2 variable as a list. This is where we will store the next iteration of our references.

Line 3. Use a for loop with the enumerate function. We will process each element of the refs variable and use a variable i to keep track of the current loop number for us. The first time through the for loop, i will have a value of 0 (zero) and line will have a value of refs[0]. The second time through the loop, i will have a value of 1 and line will have a value of refs[1]. This will continue until Python has enumerated over all items in refs.
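
To see what enumerate hands to the loop on each pass, here is a small sketch. The list called sample is a made-up stand-in for refs, not the content of the real file.

```python
# Made-up stand-in for refs: three lines as they might be read from a file.
sample = ['first line\n', '\n', 'second line\n']

pairs = []
for i, line in enumerate(sample):
    # i is the loop counter (starting at 0); line is sample[i]
    pairs.append((i, line))
    print(i, repr(line))

# 0 'first line\n'
# 1 '\n'
# 2 'second line\n'
```

Using repr() when printing makes the end-of-line character visible, which helps when blank lines matter.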

Line 4. Later in the loop, we will use the current value of line (i.e., refs[i]) as well as the previous value (i.e., refs[i-1]) to decide what to do. Indexing refs this way does not work on the first pass through the loop because refs[0-1] is equal to refs[-1], which is Python shorthand for indexing from the end of a list: refs[-1] is the last item, and refs[-2] would be the second-to-last item. So we need to handle the first iteration through the loop separately by writing a special if statement.
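
The short sketch below shows why the first iteration needs special treatment. The list is made up for illustration.

```python
# Made-up list to illustrate negative indexing.
sample = ['a', 'b', 'c', 'd']

i = 0
previous = sample[i - 1]   # same as sample[-1]: wraps around to the LAST item
second_last = sample[-2]

print(previous)      # d
print(second_last)   # c
```

No error is raised when i is 0, so without the special if statement the loop would quietly compare the first line against the last line of the list.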

Line 5. When i is zero, we will assign the current value of line to our new_line variable. In our case, this is the first line of the first reference: '1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'

Line 6. When i is not equal to zero, we will execute the following code.

We want to do different things depending on the current and previous value of line. We will control which of these things is executed using an if elif else structure. Python evaluates each of these logic statements in order. As soon as one of them is True, Python executes the code indented below that statement and skips the remaining logic tests. It then goes back to the top of the for loop, assigns the next available value to line, and increments i by one. If none of the statements is True, Python runs the code indented below else.
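
Here is a stripped-down sketch of an if elif else structure. The function describe is a hypothetical helper written for this illustration; it is not part of the tutorial code, and the sample lines are made up.

```python
def describe(line):
    """Label a line of text (hypothetical helper, for illustration only)."""
    if line == '\n':
        # First test is True: the tests below are never evaluated
        return 'blank line'
    elif line.split(' ')[0] == 'PMID:':
        return 'PMID line'
    else:
        # Reached only when every test above was False
        return 'other text'

labels = []
for line in ['\n', 'PMID: 12345678\n', 'Some title\n']:
    labels.append(describe(line))
    print(repr(line), '->', describe(line))
```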

Line 7. First split the content of line using an empty space as the separator (i.e., ' '). This is done using split(), which is a method that belongs to string variable types.

As an example, pretend we had a variable x = 'Pubmed is great'. If we applied the split() method to this string variable and specified an empty space as the separator, x.split(' '), Python would return ['Pubmed', 'is', 'great'], a list containing 3 strings. We could also index items from this list. For example, x.split(' ')[0] would return the first item in the returned list, which in this case would be 'Pubmed'.

Thus, our code is assessing whether the first word on the current line is equal to 'PMID:'. If it is, the next two lines of code will be executed; if it is not, Python will move on and verify the next logic statement.
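
The example can be tried directly in the console. The second half of the sketch shows one detail worth knowing: when the separator is given explicitly as ' ', two spaces in a row produce an empty string in the output. The string y is made up for illustration.

```python
x = 'Pubmed is great'
words = x.split(' ')
first = x.split(' ')[0]

print(words)   # ['Pubmed', 'is', 'great']
print(first)   # Pubmed

# Made-up string with two spaces between the words
y = 'a  b'
with_separator = y.split(' ')   # explicit single-space separator
without_separator = y.split()   # no argument: splits on any run of whitespace

print(with_separator)      # ['a', '', 'b']
print(without_separator)   # ['a', 'b']
```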

Line 8. Given that the logic test found on Line 7 was true, we extract the first and second value of line.split(' '), concatenate them separated by a space, and append the resulting string to our refs2 variable.

But why not simply append line to refs2? This is because PubMed references contain additional information on this line that we are not interested in (e.g., [PubMed - in process], [PubMed - indexed for MEDLINE]).

Line 9. Because the PMID is the last item of each reference (check your pubmed_results.txt if you want to see for yourself!), we will mark the end of the reference by appending our chosen separator, '-----'.
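
A quick sketch of this trimming step is shown below. The PMID number is made up, and the exact spacing of the line is an assumption about how such a line looks in the downloaded file.

```python
# Made-up PMID line; the spacing is assumed, not copied from the real file.
line = 'PMID: 12345678  [PubMed - in process]\n'

parts = line.split(' ')
trimmed = parts[0] + ' ' + parts[1]   # keep only 'PMID:' and the number

print(parts)
print(trimmed)   # PMID: 12345678
```

Keeping only the first two pieces also drops the end-of-line character, which sits at the end of the last piece.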

Line 10. Verify whether line is an end-of-line character on its own (i.e., a blank line), and whether the previous value of line (i.e., refs[i-1]) was not (!=) an end-of-line character and not the line starting with PMID:.

Line 11. If all three of the above logic tests evaluated to True, append the current value of new_line to refs2.

Line 12. Verify whether line and the previous value of line are end-of-line characters.

Line 13. In the case of two blank lines following each other, we tell Python to do nothing (pass) and skip to the top of the for loop.

Line 14. Verify whether line is not an end-of-line character, but the previous value of line was an end-of-line character.

Line 15. In this case, we assign the value of line to our new_line variable. This writes over the previous content of the new_line variable.

Line 16. If all the previous logic tests contained in the if and elif statements evaluated to False, execute the following indented code.

Line 17. Make the current value of new_line equal to the previous value of new_line plus the current value of line; separate these two items with a space. This is the line that will ensure that the text associated with a single category, say the Abstract, is kept together.
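
To see all of these steps working together, the sketch below runs the same loop on a tiny made-up list instead of refs. The journal line, title and PMID are invented for illustration. The only change to the loop is the name of the list.

```python
# Made-up stand-in for refs: a journal line, a title split over
# two lines, and a PMID line, separated by blank lines.
sample = [
    '1. J Made Up. 2015;1(1):1-2.\n',
    '\n',
    'A title that spans\n',
    'two lines.\n',
    '\n',
    'PMID: 12345678  [PubMed - in process]\n',
    '\n',
]

new_line = '-----'
refs2 = []
for i, line in enumerate(sample):
    if i == 0:
        new_line = line
    else:
        if line.split(' ')[0] == 'PMID:':
            refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])
            refs2.append('-----')
        elif line == '\n' and sample[i-1] != '\n' and sample[i-1].split(' ')[0] != 'PMID:':
            refs2.append(new_line)
        elif line == '\n' and sample[i-1] == '\n':
            pass
        elif line != '\n' and sample[i-1] == '\n':
            new_line = line
        else:
            new_line = new_line + ' ' + line

for item in refs2:
    print(repr(item))
```

The two title lines end up in a single list item. Note that at this stage the end-of-line characters are still embedded in the joined text.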

Understanding the code

If you are still confused about what the above code is doing, a useful learning tool is to add print() statements to the code in order to visualize what is going on. For the current code, you could add three lines of code between lines 3 and 4, as in the revised version below, which cycles through only the content associated with the first reference. For each value of i and line, try to figure out which indented lines of code of the if else and the if elif else will be executed.

new_line = '-----'
refs2 = []
for i, line in enumerate(refs[0:51]):
    print('i = ', i)
    print('line = ',line)
    input('Press Enter to continue...') # used to pause the program
    if i == 0:
        new_line = line
    else:
        if line.split(' ')[0] == 'PMID:':
            refs2.append(line.split(' ')[0] + ' ' + line.split(' ')[1])
            refs2.append('-----')
        elif line == '\n' and refs[i-1] != '\n' and refs[i-1].split(' ')[0] != 'PMID:':
            refs2.append(new_line)
        elif line == '\n' and refs[i-1] == '\n':
            pass
        elif line != '\n' and refs[i-1] == '\n':
            new_line = line
        else:
            new_line = new_line + ' ' + line
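
If you would rather not press Enter on every pass, another option is to print which branch is selected for each line. The sketch below is one way to do this; which_branch is a hypothetical helper that repeats the logic tests from the loop, and sample is a made-up stand-in for refs.

```python
def which_branch(line, previous):
    """Name the branch the loop would take (hypothetical helper)."""
    if line.split(' ')[0] == 'PMID:':
        return 'PMID: append PMID and separator'
    elif line == '\n' and previous != '\n' and previous.split(' ')[0] != 'PMID:':
        return 'blank after text: append new_line'
    elif line == '\n' and previous == '\n':
        return 'second blank: pass'
    elif line != '\n' and previous == '\n':
        return 'text after blank: start new_line'
    else:
        return 'else: extend new_line'

# Made-up stand-in for refs
sample = [
    '1. J Made Up. 2015;1(1):1-2.\n',
    '\n',
    'A title that spans\n',
    'two lines.\n',
    '\n',
    'PMID: 12345678  [PubMed - in process]\n',
    '\n',
]

trace = []
for i, line in enumerate(sample):
    if i == 0:
        label = 'first line: start new_line'
    else:
        label = which_branch(line, sample[i - 1])
    trace.append(label)
    print(i, repr(line), '->', label)
```

One thing this makes visible: the blank line that follows a PMID line falls through to else, because none of the earlier tests match it.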

What have we done thus far?

Take a look at the contents of your refs2 list by typing print(refs2) in your Python terminal. You will note that related text is now together as part of one item in the list. For example, refs2[2] contains the whole title of the article, whereas in refs this title was split into two list items.

This tutorial has explained in detail how you might change what action is executed based on the value of a string variable (i.e., line). Working with this type of flow control can be daunting at first. If you are having trouble understanding all the steps, consider writing a new, simpler example and manually step through each loop iteration using print() statements. Very soon you will understand how this type of flow control can be used to process text.

In the next tutorial we will apply some of the same principles to further clean up our references.

tagged with lists, Python, string variables, text files
