Python generating third CSV after comparing CSVfile 1 and CSVfile 2 column values


I have two csv files that contain:

CSVFile1:

Data A  Temp at City A  Temp at City B
87.900002   275.151367  273.20108
88.300003   275.213867  273.32608

CSVFile2:

Data A  Temp at City A  Temp at City B
79.266687   299.566367  213.20766
97.300003   306.213867  271.47999

I want to make a new CSV file that takes the difference of column values. The result should be what changed between CSVFile 1 and CSVFile 2 and I want to see this difference in a new csv.

I have tried:

import numpy as np    

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
  fileone = t1.readlines()
  filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
  for line in filetwo:
    if line not in fileone:
        outFile.write(line)

np.savetxt(f, output,fmt="%f",delimiter=',')
f.close()

There are 2 answers

13
AudioBubble On

Based on what I think your code is trying to do (output all the lines in filetwo that aren't in fileone), you could use the list.count() method.

file.readlines() returns a list, so both fileone and filetwo can be used like normal lists. If a line isn't in fileone, the count of that line will be 0, for example:

x = ["bob", "sandra", "david", "ralph"]
y = ["bob", "david"]

for name in x:
    # count() returns 0 when the name does not appear in y
    if y.count(name) == 0:
        print(name)

Will output:

sandra
ralph
So, in your program, you could replace:

for line in filetwo:
    if line not in fileone:
        outFile.write(line)

With:

for line in filetwo:
    if(fileone.count(line) == 0):
        outFile.write(line)

EDIT:

File handling in Python is done through the open() function, which takes the file path and the mode to open it with: 'w' for writing (which overwrites the file completely) or 'r' for reading. So for example:

data = open("data.csv", "w")

Would open the file, which can then be written to using data.write(). When you are finished with the file, close it using data.close(). All this put together gives us:

difference = open("difference.csv", "w")

for line in filetwo:
    if(fileone.count(line) == 0):
        difference.write(line)

difference.close()
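
As a hedged alternative, here is a minimal sketch of the same line filter using with blocks, which close the files automatically. The function names lines_only_in_second and write_new_lines are hypothetical and not from the question. Note that `line not in fileone` and `fileone.count(line) == 0` select the same lines; the set here only speeds up the lookup for large files.

```python
def lines_only_in_second(first_lines, second_lines):
    """Return the lines of second_lines that never occur in first_lines."""
    seen = set(first_lines)  # set membership is fast for large files
    return [line for line in second_lines if line not in seen]


def write_new_lines(old_path, new_path, out_path):
    """Write every line of new_path that is absent from old_path to out_path."""
    # 'with' closes each file automatically, even if an error occurs
    with open(old_path, "r") as old_file, open(new_path, "r") as new_file:
        fileone = old_file.readlines()
        filetwo = new_file.readlines()
    with open(out_path, "w") as out_file:
        out_file.writelines(lines_only_in_second(fileone, filetwo))


# Small in-memory check that needs no files on disk
print(lines_only_in_second(["a,1\n", "b,2\n"], ["a,1\n", "c,3\n"]))
# prints ['c,3\n']
```

This compares whole lines as text, so it reports changed rows rather than numerical differences.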
0
AudioBubble On

This is another answer following a clarification of the problem. Solving this problem (finding the difference in value between the two files) is rather complicated and needs to be broken down into several steps:

  1. Open both files and read them into Python variables
  2. Convert the raw CSV data into a Python list of floats
  3. Find the difference between them and put the results into a new list
  4. Convert this results list into raw CSV text and save it to a file

Step 1:

Opening a file in Python is done using the open() function, which takes both the file's location and a mode ('r' for reading or 'w' for writing, along with others). After the file has been opened, we can then use the file.readlines() method to get all the lines in the file, returned as a list with each item being one line. Using this:

# Open files
file1 = open("data1.csv", "r")
file2 = open("data2.csv", "r")

# Read all file lines
data1 = file1.readlines()
data2 = file2.readlines()

# Close files
file1.close()
file2.close()

If, for example, we have the files data1.csv:

2,1,5
7,2,4
1,5,1

And data2.csv:

1,2,4
3,2,6
6,3,1

Then, at the end of this segment, data1 equals ['2,1,5\n', '7,2,4\n', '1,5,1'] and data2 equals ['1,2,4\n', '3,2,6\n', '6,3,1']

Step 2:

Step 2 is split into two stages: obtaining the individual data points and then converting each text string to a number (a float in this case). In stage one, I use the str.split() method to separate an entire row of data into its individual data points. For example:

x = "67,45,23"
print(x.split(","))

Would output:

['67', '45', '23']

Notice, however, that the numbers are still strings, which is why we need the second stage, where I iterate over each separate data point and convert it to a float (so when you create your data file, you should remove any column headers, because float() raises a ValueError on non-numeric text). I placed all this into two separate functions (one for getting the data, one for converting it to floats), which I then called on both datasets.

# Extract an array from the rows of CSV data
def getDataFromCSV(data):
    extract = []
    # Go through each row in the data
    for row in data:
        # Remove the trailing newline from the row
        row = row.strip()
        # Separate row into individual columns
        row = row.split(",")
        # Add to the final data
        extract.append(row)
    # Return the extracted data
    return extract

final1 = getDataFromCSV(data1)
final2 = getDataFromCSV(data2)

# Convert all data in an array to a float
def convToFloat(data):
    newData = []
    # Iterate through each row
    for row in data:
        newRow = []
        # Go through each column
        for column in row:
            # Convert the text to a float
            newRow.append(float(column))
        # Append new row to newData
        newData.append(newRow)
    # Return the converted data
    return newData

# Run function on both datasets
final1 = convToFloat(final1)
final2 = convToFloat(final2)

After both segments are called, final1 is [[2.0, 1.0, 5.0], [7.0, 2.0, 4.0], [1.0, 5.0, 1.0]] and final2 is [[1.0, 2.0, 4.0], [3.0, 2.0, 6.0], [6.0, 3.0, 1.0]] if we continue to use the same files from above.
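
If removing the header row by hand is inconvenient, the parsing step can skip it instead. The following is only a sketch under assumptions: it uses the standard csv module rather than str.split(), the function name parse_csv_with_header is hypothetical, and the sample text assumes the question's files are comma-separated, which the question does not state.

```python
import csv
import io


def parse_csv_with_header(text):
    """Split CSV text into (header, rows), converting every data cell to float."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first row holds the column names
    # Skip blank lines and convert each remaining cell to a float
    rows = [[float(cell) for cell in row] for row in reader if row]
    return header, rows


# Hypothetical sample shaped like the question's files, assumed comma-separated
sample = (
    "Data A,Temp at City A,Temp at City B\n"
    "87.900002,275.151367,273.20108\n"
    "88.300003,275.213867,273.32608\n"
)
header, rows = parse_csv_with_header(sample)
print(header)  # prints ['Data A', 'Temp at City A', 'Temp at City B']
print(len(rows))  # prints 2
```

Keeping the header around means it can be written back as the first line of the result file.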

Step 3:

In step 3, we find the numerical difference between the two lists. First, I create a list that will hold the differences between the datasets. Then, the number of rows and the number of columns in the dataset are determined using the len() function (it goes without saying that to compare the two datasets, they both have to have the same number of rows and columns).

Then, I go through each row in both datasets, creating a new temporary row that will then be appended to the difference list. Before that, I go through each column and subtract the number in the first file from the number in the second file to find the change between them. The result is also converted to a string in the same line, which is important for later.

# Create difference array
difference = []

# Get the amount of rows in the dataset
rows = len(final1)

# Get the amount of columns in the dataset
columns = len(final1[0])

# Go through this for each row
for row in range(rows):
    # Create a new row to put data in
    newRow = []
    # For each column in the row
    for column in range(columns):
        # Get the difference in the dataset and convert it to a string
        diff = str(final2[row][column] - final1[row][column])
        # Append it to the new row
        newRow.append(diff)
    # Add the new row to the final difference array
    difference.append(newRow)

After this, the difference array is [['-1.0', '1.0', '-1.0'], ['-4.0', '0.0', '2.0'], ['5.0', '-2.0', '0.0']].
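
The same subtraction can also be sketched with zip(), which pairs up rows and columns without index arithmetic. This is an illustrative variant, not the code above: the function name subtract_tables is hypothetical, it returns floats rather than strings, and it adds an explicit check that both datasets have the same shape.

```python
def subtract_tables(first, second):
    """Return second - first, element by element, as a list of float rows."""
    if len(first) != len(second):
        raise ValueError("datasets have different numbers of rows")
    result = []
    for row1, row2 in zip(first, second):
        if len(row1) != len(row2):
            raise ValueError("rows have different numbers of columns")
        # Subtract the first file's value from the second file's value
        result.append([b - a for a, b in zip(row1, row2)])
    return result


final1 = [[2.0, 1.0, 5.0], [7.0, 2.0, 4.0], [1.0, 5.0, 1.0]]
final2 = [[1.0, 2.0, 4.0], [3.0, 2.0, 6.0], [6.0, 3.0, 1.0]]
print(subtract_tables(final1, final2))
# prints [[-1.0, 1.0, -1.0], [-4.0, 0.0, 2.0], [5.0, -2.0, 0.0]]
```

Raising an error on mismatched shapes avoids silently comparing the wrong cells when one file has extra rows or columns.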

Step 4:

Finally, the difference needs to be converted to raw CSV text and saved to disk. To do this, I use the str.join() method, which joins the items of a list into one string, placing a chosen separator string between them. This only works on strings though, which is why we had to make the conversion before. For example:

y = ["The", "small", "dog"]
print(" - ".join(y))

Outputs:

The - small - dog
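
To illustrate why the items must be strings, here is a small sketch (the variable names values and line are only for illustration) showing that joining floats directly raises a TypeError, while converting each item with str() first works:

```python
values = [-1.0, 1.0, -1.0]

try:
    ",".join(values)  # str.join does not accept floats
except TypeError as error:
    print("join failed:", type(error).__name__)  # prints: join failed: TypeError

# Converting each item with str() first makes the join work
line = ",".join(str(value) for value in values)
print(line)  # prints: -1.0,1.0,-1.0
```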

I create an output string to hold the contents of the output file, then I go through each row, join its data points together with a comma, and add a newline character \n to the end. Then I write to a file. This is as easy as reading from a file: I simply use the open() function again, this time with the 'w' mode (warning: this will overwrite any file that already exists there). Then a simple call to the file.write() function and the program is done. Like this:

# Create an output text file
output = ""

# Loop through the results
for row in difference:
    # Append a csv-formatted list to the file
    output += ",".join(row)
    output += "\n"

# Open a file to output to
outputFile = open("output.csv", "w")

# Write output
outputFile.write(output)

# Close the file so the data is flushed to disk
outputFile.close()

The contents of the output.csv file are:

-1.0,1.0,-1.0
-4.0,0.0,2.0
5.0,-2.0,0.0
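
As a possible alternative to building the output string by hand, the standard csv module can write the rows directly and accepts floats without a manual str() conversion. This is a sketch, not the method used above; the temporary directory and the names out_path and written are assumptions made so the example does not overwrite a real file.

```python
import csv
import os
import tempfile

difference = [[-1.0, 1.0, -1.0], [-4.0, 0.0, 2.0], [5.0, -2.0, 0.0]]

# Hypothetical output location inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# newline="" lets the csv module control the line endings itself
with open(out_path, "w", newline="") as handle:
    csv.writer(handle).writerows(difference)

# Read the file back to inspect what was written
with open(out_path) as handle:
    written = handle.read()
print(written)
```

The with block also closes the file automatically, so no explicit close() call is needed.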

Conclusion:

If you have any questions or want any clarifications, feel free to leave a comment.