Python: Working with text files & PubMed references part 3
I am assuming you have followed the first and second tutorials in this short series.
You should now have a list variable called refs2 in your Python console. If you inspect the first item in the refs2 list, you will notice that it specifies the journal, volume and page information for the first reference.
'1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'
Notice that this line also includes some
doi information, which is not useful to us. So the first thing we will do is go through our references and extract only the first part of the journal information:
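    refs3 = []
    for i, line in enumerate(refs2):
        if refs2[i-1] == '-----':               # previous item is the separator
            ref = line.split('.')               # split the journal information on periods
            line = ref[1][1:] + ref[2]          # journal name (leading space removed) + date, volume, pages
        refs3.append(line)                      # keep every item, modified or not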
Line 2. We are once again using a for loop to iterate over each item in a list, this time refs2.
Lines 3-6. Because we know that the item before every journal information item is our chosen separator, '-----', we can use that knowledge to find all the journal information items. If the previous item is our chosen separator (this also works for the first item, where refs2[i-1] is refs2[0-1], that is refs2[-1], because the last line of our file is '-----'), we split the content of line on the period separator ('.'). We then reconstruct line using only the journal name and key reference information.
Finally, we add line, whether it is our modified journal information or any other item, to our new list, refs3.
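To make the split step concrete, here is a small sketch that applies it to the example line shown at the start of this post. The names example, parts and cleaned are only for illustration, and the sketch assumes that the journal name is the second piece of the split and the date, volume and pages are the third.

```python
# Example journal information line, copied from the start of this post
example = '1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.'

# Splitting on '.' gives one piece per period-delimited chunk
parts = example.split('.')
print(parts[0])  # the leading identification number
print(parts[1])  # the journal name, with a leading space
print(parts[2])  # the date, volume and pages, with a leading space

# Drop the leading space from the journal name, then append the
# date, volume and pages (assumption: these are pieces 1 and 2)
cleaned = parts[1][1:] + parts[2]
print(cleaned)
```

Everything after the third piece, including the doi, is simply left out of the rebuilt line.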
Have a look at the content of your refs3 list. You will see that the journal information items have lost their leading identification number as well as any trailing information, such as a doi.
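If you would like to check this behaviour without your full reference file, the sketch below runs the same logic on a tiny list. The function name clean_journal_lines, the list mini_refs, and the second reference and both title lines are all invented for illustration; only the first journal line comes from this post.

```python
def clean_journal_lines(items, separator='-----'):
    """Return a copy of items with journal information lines shortened.

    Assumption: an item is a journal information line when the item
    before it is the separator. For the first item, items[0 - 1] is
    items[-1], the last item, which is also the separator.
    """
    cleaned = []
    for i, line in enumerate(items):
        if items[i - 1] == separator:
            pieces = line.split('.')
            # journal name without its leading space + date, volume, pages
            line = pieces[1][1:] + pieces[2]
        cleaned.append(line)
    return cleaned


# Miniature stand-in for refs2 (second reference and titles are made up)
mini_refs = [
    '1. J Neurosci. 2015 Dec 9;35(49):16159-70. doi: 10.1523/JNEUROSCI.2034-15.2015.',
    'A made-up title line.',
    '-----',
    '2. Fake Journal. 2000 Jan 1;1(1):1-2. doi: placeholder.',
    'Another made-up title line.',
    '-----',
]

mini_refs3 = clean_journal_lines(mini_refs)
for item in mini_refs3:
    print(item)
```

Only the two journal information lines change; the title lines and separators are copied across untouched.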
If you want to have a better understanding of what the above lines of code are doing, you can add a few
print() statements to see what is going on. Similar to what we did in the previous tutorial, you could add these lines in the indented code, below line 5:
    print('CURRENT REF JOURNAL INFO:')
    print(refs2[i])
    print("JOURNAL INFO SPLIT BASED ON '.' ")
    print(ref)
    print('NEW JOURNAL INFO LINE IS:')
    print(line)
    input('PRESS ENTER TO CONTINUE...')
    print('-----')
Finalizing our reference clean up
If you look through refs3, you will notice that most references have a Copyright item. Since we are not interested in this information, let's remove these items from the references. This can be done by simply locating the items that start with the word Copyright and skipping them.
    refs4 = []
    for i, line in enumerate(refs3):
        if line.split(' ')[0] != 'Copyright':   # first word of the item
            refs4.append(line)
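As an aside, the same filter can also be written with the string method startswith(), which some readers may find easier to read. This is only a sketch of an alternative, not the approach used in this series; the list sample and its Copyright text are invented for illustration.

```python
# Invented sample of cleaned items, standing in for refs3
sample = [
    'J Neurosci 2015 Dec 9;35(49):16159-70',
    'Copyright (c) 2015 the authors.',
    '',
    '-----',
]

kept = []
for line in sample:
    # startswith() checks the beginning of the string directly,
    # so there is no need to split the line into words first
    if not line.startswith('Copyright'):
        kept.append(line)

print(kept)
```

One difference to keep in mind: startswith('Copyright') also matches items that begin with 'Copyright:' or 'Copyrighted', whereas comparing the first word requires it to be exactly 'Copyright'.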
This short post applied some of the techniques we used in our previous (slightly complicated) post. We once again cycled through the items of a list, which in our case are all text (i.e., string) variables, and we identified, modified and skipped certain items.
We are now ready to save these cleaned references and then cycle through them to have a user decide whether they are truly relevant. This is what we will do in our next post.