Comparing two pipe-delimited files in Python when matching rows are not in the same order


I have two text files, each with a very large number of pipe-delimited records. I need to check that both files contain the same data. File1 and File2 should hold the same records, but a given record is not necessarily on the same row in both files: record1 might be on row 10 in File1 and on any other row in File2. So I need to take row 1 of File1, search all of File2 for a match, and then repeat that for every row in File1. I care more about matching File1 rows against File2 than File2 rows against File1, because File2 might contain several redundant records.

I looked into doing this with a Python script. I came across the code snippet below; however, it compares both files row by row and does not take into account that the rows might be in a different order.

Could someone please advise how to achieve this?

CodeLink : https://gist.github.com/insachin/c960cfeb1fef6454a8132a07cb9ebd5a

# Ask the user to enter the names of files to compare
fname1 = input("Enter the first filename: ")
fname2 = input("Enter the second filename: ")

# Open file for reading in text mode (default mode)
f1 = open(fname1)
f2 = open(fname2)

# Print confirmation
print("-----------------------------------")
print("Comparing files ", " > " + fname1, " < " + fname2, sep='\n')
print("-----------------------------------")

# Read the first line from the files
f1_line = f1.readline()
f2_line = f2.readline()

# Initialize counter for line number
line_no = 1

# Loop if either file1 or file2 has not reached EOF
while f1_line != '' or f2_line != '':

    # Strip trailing whitespace (including the newline)
    f1_line = f1_line.rstrip()
    f2_line = f2_line.rstrip()

    # Compare the lines from both file
    if f1_line != f2_line:

        # If a line does not exist on file2 then mark the output with + sign
        if f2_line == '' and f1_line != '':
            print(">+", "Line-%d" % line_no, f1_line)
        # otherwise output the line on file1 and mark it with > sign
        elif f1_line != '':
            print(">", "Line-%d" % line_no, f1_line)

        # If a line does not exist on file1 then mark the output with + sign
        if f1_line == '' and f2_line != '':
            print("<+", "Line-%d" % line_no, f2_line)
        # otherwise output the line on file2 and mark it with < sign
        elif f2_line != '':
            print("<", "Line-%d" % line_no, f2_line)

        # Print a blank line
        print()

    # Read the next line from the file
    f1_line = f1.readline()
    f2_line = f2.readline()

    # Increment line counter
    line_no += 1

f1.close()
f2.close()
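
For the order-independent matching described in the question, one possible approach is to load every row of File2 into a set and then test each row of File1 for membership. The sketch below is only an illustration under several assumptions: a record counts as matched only if the whole line is identical (apart from the line ending), File2 fits in memory, and the files are UTF-8 encoded. The function name `find_unmatched_rows` and its parameters are hypothetical and do not come from the linked gist.

```python
def find_unmatched_rows(path1, path2, encoding="utf-8"):
    """Return (line_no, row) pairs from path1 that appear nowhere in path2.

    Assumption: two records match only if their full text is identical
    once the trailing line ending is removed.
    """
    # Load every row of the second file into a set, so each lookup is a
    # hash test instead of a scan over the whole file. Duplicate rows in
    # the second file collapse into a single set entry.
    with open(path2, encoding=encoding) as f2:
        rows_in_file2 = {line.rstrip("\r\n") for line in f2}

    unmatched = []
    with open(path1, encoding=encoding) as f1:
        # Stream the first file row by row; the row's position in the
        # second file does not matter.
        for line_no, line in enumerate(f1, start=1):
            row = line.rstrip("\r\n")
            if row not in rows_in_file2:
                unmatched.append((line_no, row))
    return unmatched
```

Calling `find_unmatched_rows(fname1, fname2)` would return the File1 rows, with their line numbers, that have no counterpart anywhere in File2; an empty list means every File1 row was found. Because a set ignores duplicates, redundant records in File2 do not affect the result. If the fields may differ only in surrounding whitespace, each row could be split on `|` and the fields stripped before comparison, but that is a further assumption about the data.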