Python: Working with text files & PubMed references part 3

I am assuming you have followed the first and second tutorials in this short series.

You should now have a list variable called refs2 in your Python console. If you inspect the first item in the refs2 list, you will notice that it specifies the journal, volume and page information for the first reference.

'1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'

Notice that this line also includes some doi information, which is not useful to us. So the first thing we will do is go through our references and extract only the first part of the journal information:

refs3 = []
for i, line in enumerate(refs2):
    if refs2[i-1] == '-----':
        ref = line.split('.')
        line = ref[1][1:] + ref[2]
    refs3.append(line)

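Line 2. We are once again using a for loop to iterate over each item in a list, this time refs2. Using enumerate() gives us both the position of the item (i) and the item itself (line) on each pass through the loop.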

Lines 3-6. Because we know that the item before each journal information item is our chosen separator, '-----', we can use that knowledge to find all the journal information items. If the previous item is the separator, we split the content of line on the period character ('.'). This also works for the very first item: when i is 0, refs2[i-1] is refs2[-1], the last item in the list, which is '-----' because the last line of our file is the separator. We then reconstruct line using only the journal name and the key reference information (date, volume and pages).

Finally, we will add line, whether it is our modified journal information or any other item, to our new list refs3.
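To make the split step more concrete, here is a small sketch that applies it to the single journal information line shown earlier. The names new_line and toy are only used for this illustration and are not part of the tutorial code.

```python
# The journal information line shown earlier in this post.
line = '1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'

ref = line.split('.')
print(ref[0])  # the reference number
print(ref[1])  # the journal name, with a leading space
print(ref[2])  # date, volume and pages, with a leading space

# Drop the leading space from the journal name and join the two parts.
new_line = ref[1][1:] + ref[2]
print(new_line)

# Hypothetical three-item list, only to show what index 0-1 does.
toy = ['first', 'second', '-----']
print(toy[0 - 1])  # index -1 is the last item
```

Note that this approach relies on the journal name itself containing no periods. If it did, ref[1] would hold only part of the name.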

Have a look at the content of your refs3 list. You will see that the journal information items have lost their leading identification number as well as any trailing information, such as a doi.
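If you do not have refs2 at hand, the following sketch runs the same loop on a tiny stand-in list. The entries in toy_refs2 are made up for illustration and are not real references. The list only mimics the layout we assume refs2 has: a journal information item, other items, then the '-----' separator.

```python
# Toy stand-in for refs2 (made-up entries, not real references).
toy_refs2 = [
    '1. J Example. 2015 Dec 9;35(49):100-10. doi: 10.0000/example.1.',
    'A made-up title.',
    '-----',
    '2. Ex Brain Res. 2014 Jan;232(1):1-9. doi: 10.0000/example.2.',
    'Another made-up title.',
    '-----',
]

toy_refs3 = []
for i, line in enumerate(toy_refs2):
    # The item after a separator is a journal information item.
    if toy_refs2[i-1] == '-----':
        ref = line.split('.')
        line = ref[1][1:] + ref[2]
    toy_refs3.append(line)

for item in toy_refs3:
    print(item)
```

The two journal information items come out shortened, while the titles and separators are copied across unchanged.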

If you want a better understanding of what the above lines of code are doing, you can add a few print() statements to see what is going on. Similar to what we did in the previous tutorial, you could add these lines inside the indented if block, below line 5 (the line that rebuilds line):

print('CURRENT REF JOURNAL INFO:')
print(refs2[i])
print("JOURNAL INFO SPLIT BASED ON '.' ")
print(ref)
print('NEW JOURNAL INFO LINE IS:')
print(line)
input('PRESS ENTER TO CONTINUE...')
print('-----')

Finalizing our reference clean up

If you look through refs3, you will notice that most references have a Copyright item. Since we are not interested in this information, let's remove these items from the references. This can be done by simply locating the items that start with the word Copyright and skipping them.

refs4 = []
for line in refs3:
    # Keep the item unless its first word is 'Copyright'
    if line.split(' ')[0] != 'Copyright':
        refs4.append(line)
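As an aside, the same kind of filter can be written more compactly with str.startswith() and a list comprehension. This is only an alternative sketch, not the approach used in this series, and the entries in toy_refs3 are made up for illustration.

```python
# Toy stand-in for refs3 (made-up entries, not real references).
toy_refs3 = [
    'J Example 2015 Dec 9;35(49):100-10',
    'A made-up title.',
    'Copyright 2015 the authors.',
    '-----',
]

# Keep every item that does not begin with 'Copyright'.
toy_refs4 = [line for line in toy_refs3 if not line.startswith('Copyright')]
print(toy_refs4)
```

One small difference: startswith('Copyright') also drops an item whose first word merely begins with those letters (for example 'Copyrighted'), whereas comparing the first word after split(' ') requires an exact match.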

Summary

This short post applied some of the techniques we used in our previous (slightly complicated) post. We once again cycled through the items of a list, which in our case are all text (i.e., string) variables, and we identified, modified and skipped certain items.

We are now ready to save these cleaned references and then cycle through them to have a user decide whether they are truly relevant. This is what we will do in our next post.

 
