Python: Working with text files & PubMed references part 5

Posted on August 2, 2016 by Martin Héroux

I am assuming you have followed the previous tutorials in this short series on how to manipulate PubMed references using Python (1, 2, 3, 4).
We have cleaned up the references by removing unwanted fields and ensuring sections like the abstract are not split over multiple lines. The cleaned references are saved in a text file, cleaned_pubmed_references.txt. We are now ready to cycle through each reference and ask a user whether they want to keep, reject, or query it (i.e., have a second user double check the reference).

Preparing files and reading references to be sorted

The code below sets up the framework for us to loop over each reference and ask the user for a response. Because we want to copy and delete files, we will import the copyfile function from the shutil module (shell utilities) and the remove function from the os module (operating system).

# Import required commands from modules
from shutil import copyfile
from os import remove
# Make copy of cleaned_pubmed_references.txt
copyfile('cleaned_pubmed_references.txt', 'sorting_refs.txt')
f_sort   = 'sorting_refs.txt'
f_keep   = 'refs_keep.txt'
f_reject = 'refs_reject.txt'
f_query  = 'refs_query.txt'
# Read in all lines and remove line-breaks.
refs = []
sort = open(f_sort, 'r')
for line in sort:
    refs.append(line.strip())
sort.close()
# Delete current version of 'sorting_refs.txt'
remove(f_sort)
# Open the keep, reject and query files ready to add references.
keep = open(f_keep,'a')
reject = open(f_reject,'a')
query = open(f_query,'a')
sort = open(f_sort,'w')
# Initialize some variables
refs_left = 1
ref_count = 0
cur_ref = ['-----']

Line 5-9. Make a copy of cleaned_pubmed_references.txt to keep a pristine version of the reference file. The copyfile call on line 5 should only be run once, immediately after you have finished cleaning the references. If the user processes the references over several sessions, it should not be run again; otherwise sorting_refs.txt will be overwritten with the full list of references each time. We also create names for the four text files we will work with.
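One possible way to make the copy step safe to rerun across sessions is to copy only when the working file does not exist yet. The sketch below is not part of the original script; the helper name make_working_copy is hypothetical, and it assumes sorting_refs.txt is left in place between sessions.

```python
from os.path import exists
from shutil import copyfile


def make_working_copy(source, working):
    """Copy source to working only if working does not exist yet.

    Returns True when a copy was made, False when an existing
    working file was left untouched. (Hypothetical helper.)
    """
    if exists(working):
        return False
    copyfile(source, working)
    return True


# Usage with the file names from this tutorial:
# make_working_copy('cleaned_pubmed_references.txt', 'sorting_refs.txt')
```

With a guard like this, the whole script could be rerun at the start of every session without resetting the list of references still to be sorted.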

Line 17. When we sort many references, we will likely process them over several sessions. Thus, at the end of each session we will write an updated version of f_sort with the references that still need to be sorted. Therefore, we delete the file to ensure we write the remaining references to a new file.

At this point we have read in all the references that need to be sorted (contained in sorting_refs.txt), and have opened four text files ready to receive the references we process: appending our selections to f_keep, f_reject, and f_query and writing the unprocessed references to f_sort.
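The read step can also be written with a with statement, which closes the file automatically even if an error occurs. This is only a sketch of an alternative to the open/close pair used above; the function name read_stripped_lines is hypothetical.

```python
def read_stripped_lines(filename):
    """Return every line of a text file with surrounding whitespace removed."""
    with open(filename, 'r') as f:
        # Iterating over the file yields one line at a time
        return [line.strip() for line in f]


# Usage with the file name from this tutorial:
# refs = read_stripped_lines('sorting_refs.txt')
```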

Cycling through references and sorting

Now the fun part! We are going to cycle through each reference, print its title and abstract to the terminal, and ask the user what to do with it. We will use the input function to get a response from the user and execute code based on this response. The loop ends when there are no more references to process, or when the user responds with s, in which case the remaining references are written to sorting_refs.txt. Although the code is the longest we have seen thus far, it is not very complicated. The key is to break it down into sections and understand what is going on.

# Loop through each reference and ask user whether
# to keep, reject or query
for i, line in enumerate(refs):
    if i > 0:
        if line == '-----':  # Look for line that separates each reference
            ref_count += 1   # Keep count of processed references
            print('\n')      # Ensure terminal is on a new line
            # Print the title and abstract of the current reference
            print(cur_ref[2],'\n\n',cur_ref[-2],'\n')
            # Ask user what to do
            ans = input('Processing reference number {} for this session.\n\n'
                        'Do you want to (k)eep, (r)eject or (q)uery reference?\n'
                        'Or do you want to (s)top processing references?\n\n'.format(ref_count))
            # Execute code based on user input
            if ans == 'k':
                for ref_line in cur_ref:
                    keep.write(ref_line)
                    keep.write('\n')
            elif ans == 'r':
                for ref_line in cur_ref:
                    reject.write(ref_line)
                    reject.write('\n')
            elif ans == 'q':
                for ref_line in cur_ref:
                    query.write(ref_line)
                    query.write('\n')
            else:
                for ref_line in cur_ref:
                    sort.write(ref_line)
                    sort.write('\n')
                for j in range(i,len(refs)):
                    if refs[j] == '-----':
                        refs_left += 1
                    sort.write(refs[j])
                    sort.write('\n')
                print('\n')
                print('Thank you for your efforts. There are now', refs_left, 'references left to process.')
                keep.close()
                reject.close()
                query.close()
                sort.close()
                break
            cur_ref = ['-----']
        else:
            cur_ref.append(line)

Line 5. This if statement checks whether the current line is our designated separator between references. If the current line is not '-----', the program skips down to the else statement on Line 44 and executes cur_ref.append(line), which allows us to accumulate each line of the current reference. If the current line is '-----', we instead execute the indented code.
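The accumulate-until-separator idea can also be isolated in a small function, which makes it easier to test on its own. This is a minimal sketch, not the code used in this post: the name split_references is hypothetical, the example lines are made up, and it assumes every reference starts with a '-----' line.

```python
def split_references(lines, separator='-----'):
    """Group a flat list of lines into references.

    Assumes each reference starts with a separator line; every
    returned reference keeps that leading separator.
    """
    references = []
    current = None
    for line in lines:
        if line == separator:
            # A separator closes the previous reference, if it has content
            if current is not None and len(current) > 1:
                references.append(current)
            current = [separator]
        elif current is not None:
            current.append(line)
    # Keep the final reference even when no separator follows it
    if current is not None and len(current) > 1:
        references.append(current)
    return references


# Made-up lines standing in for two cleaned references
example = ['-----', 'first reference, line 1', 'first reference, line 2',
           '-----', 'second reference, line 1']
for ref in split_references(example):
    print(ref)
```

Unlike the loop above, which only processes a reference once it reaches the next separator, this sketch also returns a final reference that has no separator after it.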

Line 9-13. We print the current reference title and abstract to the terminal and use input to ask the user what we should do.

Line 15-26. If the user inputs ‘k’, ‘r’, or ‘q’, the current reference is written to the appropriate file.

Line 27-42. If the user inputs ‘s’, the code under else is executed. Note that, as written, any response other than ‘k’, ‘r’, or ‘q’ also ends up here. First we write the current reference as well as the remaining references to sorting_refs.txt so that we can continue sorting them later. We then thank the user and tell them how many references are left to sort. Finally, we close all our text files and get out of the for loop using the break command.
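Because the else branch catches every response that is not k, r or q, a mistyped letter stops the session. One possible variation is to keep asking until a recognised letter is entered, and to share the write loop between the three destination files. The names ask_choice and write_reference are hypothetical, and the ask parameter exists only so the function can be tried without a keyboard.

```python
def ask_choice(prompt, valid=('k', 'r', 'q', 's'), ask=input):
    """Keep asking until the user enters one of the valid letters."""
    while True:
        answer = ask(prompt).strip().lower()
        if answer in valid:
            return answer
        print('Please answer with one of:', ', '.join(valid))


def write_reference(ref_lines, handle):
    """Write each line of one reference followed by a line break."""
    for ref_line in ref_lines:
        handle.write(ref_line + '\n')


# Possible use inside the loop (keep, reject and query are the open files):
# destinations = {'k': keep, 'r': reject, 'q': query}
# ans = ask_choice('(k)eep, (r)eject, (q)uery or (s)top? ')
# if ans in destinations:
#     write_reference(cur_ref, destinations[ans])
```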

Summary

After processing all the references, we will have 4 files: sorting_refs.txt, refs_keep.txt, refs_reject.txt, refs_query.txt. At this point, sorting_refs.txt should not contain any references and the other files should contain references based on the user’s selection. The user can now ask a friend to manually review the references in refs_query.txt and decide to either keep or reject them. After that, you are ready to extract the e-mail addresses from the references in refs_keep.txt, which will be the topic of the final post in this series.
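A quick way to check the outcome is to count the separator lines in each file. The sketch below assumes one '-----' line per reference; count_references is a hypothetical helper, not part of the original script.

```python
def count_references(filename, separator='-----'):
    """Count separator lines in a file; return 0 if the file is missing."""
    try:
        with open(filename, 'r') as f:
            return sum(1 for line in f if line.strip() == separator)
    except FileNotFoundError:
        return 0


# Report how many references ended up in each file
for name in ('sorting_refs.txt', 'refs_keep.txt',
             'refs_reject.txt', 'refs_query.txt'):
    print(name, count_references(name))
```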

