I'm fairly new to Python and I'm writing a script which should take two fairly big text files (~10 MB) and create a new file for each, following the rules specified below.

File A has tabulator-separated values on each line, file B contains ID on one line and data on the next line. IDs from file B are also present in file A, but not all IDs from file A are in file B and vice versa.

Neither file has its IDs in alphanumerical order, and the two files are ordered differently. I don't need to sort them alphanumerically; I just need the output files to be in the same order as each other and to contain only the items with a common ID.

Here's how file A looks: File A

Here's how file B looks: File B

As you can see, column B of file A provides the identifiers, which may or may not be present in file B.

Here's a simple script I wrote. For every line from file A, it goes through the whole file B until it either finds a matching ID or reaches the end.

The script works fine, but since it contains a nested loop, it's roughly O(n^2) (strictly O(m*n) for m the size of file A and n the size of file B, but they are usually similar in size), which might become a problem once I use it on real data (hundreds of MB or a few GB).

import re
import os

def spectrosingle(inputline):
    if (len(inputline) > 0) and (not inputline[0] == "\t"):
        resline = re.findall(r'\d\t(.+?)\t\t\t\t|$', inputline)[0] # ID in spectro file is always followed by 3 empty columns, which is the only such occurrence in the whole line
        return resline
    return None

try:
    fastafile = open('fastaseq.fasta', "r")
except IOError:
    print("FASTA file corrupted or not found!\n")

try:
    spectrometry = open('spectro.txt', "r")
except IOError:
    print("Spectro file corrupted or not found!\n")

missingarr = [] # array for IDs that are in spectro file but aren't present in fasta file 
misnum = 0 # counter for those IDs

with open('MAIN.fasta', mode='w') as output_handle:
    """Going through a nested cycle - for every sorted sequence in the spectrometry file,"
    "we are searching the unsorted FASTA until we find the corresponding file. If there's any sequence in the spectrometry file"
    "that is not anywhere in the fasta, it's marked so that it doesn't get copied into the final spectrometry file."""
    for line in spectrometry:
        fastaline1 = 'temp' # a temporary initialization for fastaline, so we can enter the While loop that checks if there are still lines left in the file
        missbool = True # a flag for IDs that are missing from fasta file
        speccheck = spectrosingle(line) # filters the ID from spectro file.
        if not speccheck:
            continue #spectrosingle function returns Nonetype if it gets a line without any sequence. This skips such lines.
        while fastaline1:
            fastaline1 = fastafile.readline()
            fastaline1 = fastaline1.partition(">")[2]
            fastaline1 = fastaline1.partition("\n")[0] #shave the header and newline symbols from the ID
            fastaline2 = fastafile.readline()
            if fastaline1 == speccheck:  #check if the sequence in FASTA file matches the one in the spectro file
                print("Sorted sequence ID %s." % (fastaline1))
                output_handle.write('>'+fastaline1+'\n') #write the header
                output_handle.write(fastaline2) #write the sequence
                missbool = False
                fastafile.seek(0) #return to the start of the file for the next spectro line
                break #without the break, the loop would find the same ID again and never terminate
        if missbool: #this fires only when the whole fastafile has been searched and no matching sequence to the one from the spectro file has been found.
            fastafile.seek(0) #rewind here too, otherwise the next search would start at the end of the file
            misnum = misnum + 1 # count the number of discarded sequence
            missingarr.append(speccheck) #append the discarded sequence to the array, so we later know which sequences not to include in the new spectro file

print("Sorting finished!\n")

num = 0
if misnum != 0: #check if there are any sequences marked for deletion
    blackbool = True
    blackword = missingarr[num]
else:
    blackbool = False # no marked sequences available

with open('spectro.txt', "r") as spectrometry, open(os.path.splitext(finpath)[0]+'\\' + prepid + 'FINAL_spectrometry.txt', mode='w') as final_output: #writing the final spectrometry file without the sequences which would cause a mismatch during the final merger of data
    fullspec = spectrometry.readlines() #might be memory-heavy, but still probably the most efficient way to do this
    if not blackbool: #no marked sequences, so the whole file is copied
        for line in fullspec:
            final_output.write(line)
    else:
        for line in fullspec:
            if (re.search(blackword, line)) is None: #the ID is not marked, so it is transferred to the new file
                final_output.write(line)
            else:
                num = num + 1 #count the discarded line
                if num < misnum: #advance to the next marked ID, if any are left
                    blackword = missingarr[num]

print("There were %i redundant sequences in the spectro file, which have been filtered out.\n" % (num))

Is there a more efficient way to do this? I have a suspicion that the way I am doing it is not very Pythonic, but I can't quite put my finger on what is wrong with it.

2 Answers

tobias_k (Best Solution)

Your code would indeed not be very efficient. Instead, I'd suggest using a dictionary to store the data from file B for each ID. To get the data line, you can just call next on the same iterator that's reading the file (provided that there is an even number of lines). Something like this (not tested):

data = {}
with open("fileb") as fb:
    for line_id in fb:
        the_id = line_id.strip()[1:] # remove newline and ">"
        line_data = next(fb)  # get next line from file
        data[the_id] = line_data.strip()

Then, when you read from file A, you can just look up the data for the current ID in that dictionary without having to iterate over the entire file B again and again.

Also, and less importantly: instead of using a rather complicated regular expression to get the ID from file A, you could either just split("\t") the line, or use the csv module. Something like this (not tested either):

with open("filea") as fa:
    for line in fa:
        num, the_id, more, stuff, dont, know, what = line.split("\t")
        if the_id in data:
            the_data = data.get(the_id)
            ... do stuff with data ...

Instead of enumerating all the columns, you could also use *_ to capture any remaining fields:

        num, the_id, *other_stuff_we_do_not_care_about = line.split("\t")
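The csv-module route mentioned above would look roughly like this; it's a sketch where the in-memory sample rows, the dict contents, and the column position of the ID are assumptions based on the question:

```python
import csv
import io

# stand-in for file A: tab-separated rows with the ID in the second column
filea = io.StringIO("1\tProt1\t\t\t\tdata\n2\tProt2\t\t\t\tdata\n")
data = {"Prot1": "ACGT"}  # dict built from file B, as in the answer above

matches = {}
for row in csv.reader(filea, delimiter="\t"):
    if len(row) < 2:
        continue                       # skip blank or malformed lines
    the_id = row[1]
    if the_id in data:
        matches[the_id] = data[the_id]  # O(1) lookup instead of rescanning file B

print(matches)  # {'Prot1': 'ACGT'}
```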
Community

As tobias_k said:

Read the first file with the csv module; for the second file, use for idline in file: dataline = next(file), do stuff with idline and dataline, then put them in a dict mapping IDs to data.
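Put together, the approach from both answers can be sketched end to end. The in-memory sample data and the column layout are made-up assumptions standing in for the real files:

```python
import csv
import io

# stand-ins for the two files (contents are invented examples):
# file B: ">ID" line followed by a data line
fasta = io.StringIO(">Prot1\nACGT\n>Prot3\nTTTT\n")
# file A: tab-separated values, ID in the second column
spectro = io.StringIO("1\tProt1\t\t\t\tx\n2\tProt2\t\t\t\ty\n")

# pass 1: one linear scan of file B, building an ID -> data dict
seqs = {}
for idline in fasta:
    seqs[idline.strip().lstrip(">")] = next(fasta).strip()

# pass 2: one linear scan of file A; emit matched pairs in file A's order
out_fasta, out_spectro = [], []
for row in csv.reader(spectro, delimiter="\t"):
    if len(row) > 1 and row[1] in seqs:
        out_fasta.append(">%s\n%s\n" % (row[1], seqs[row[1]]))
        out_spectro.append("\t".join(row) + "\n")

print(out_fasta)    # ['>Prot1\nACGT\n']
print(out_spectro)  # ['1\tProt1\t\t\t\tx\n']
```

Both output lists stay in file A's order and contain only the common IDs, and the whole thing is O(m + n) instead of O(m*n).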