Dividing elements in a list to normalize data in Python


I am trying to write a Python script that normalizes a dataset by dividing every value element by the maximum value element.

This is the script that I have come up with so far:

#!/usr/bin/python

with open("infile") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

    data = []
    with open('infile') as f2:
        for line in f2:                  
            items = line.split() # parse the columns
            tClass, feats, values = items[:3] # parse the columns
            #print items      
            normalizedData = float(values)/float(maxVal)
            #print normalizedData

            with open('outfile', 'wb') as f3:
                output = "\t".join([tClass +"\t"+ feats, str(normalizedData)])
                f3.write(output + "\n")

The goal is to take an input file (three tab-separated columns), such as:

lfr about-kind-of+n+n-the-info-n    3.743562
lfr about+n-a-j+n-a-dream-n 2.544614
lfr about+n-a-j+n-a-film-n  1.290925
lfr about+n-a-j+n-a-j-series-n  2.134124
  1. Look for the maxVal in the third column: in this case it would be 3.743562
  2. Divide all values in the 3rd column by maxVal
  3. Output the following desired result:
lfr   about-kind-of+n+n-the-info-n    1
lfr   about+n-a-j+n-a-dream-n 0.67973
lfr   about+n-a-j+n-a-film-n  0.34483
lfr   about+n-a-j+n-a-j-series-n  0.57007
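
Each normalized value is just the row's third-column value divided by maxVal. As a quick check of the sample numbers above (plain Python, independent of the script):

maxVal = 3.743562
print(3.743562 / maxVal)  # 1.0
print(2.544614 / maxVal)  # 0.67973..., matching the second row of the desired output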

However, what is currently being output is only a single value, which I am assuming is the first value in the input data divided by the max. Any insight into what is going wrong in my code and why the output contains only one line? Any possible solutions? Thank you in advance.


There are 3 answers

shad0w_wa1k3r (best answer):

As far as I understood your intentions, the following does the job (with minor program-flow corrections).

Also, instead of writing continuously to the file, I originally chose to store what to write and then dump everything to the output file at once.

Update: it turns out that building the list takes about the same time as the extra with statement, so I got rid of it completely. The script now writes continuously to the file, without closing it every time.

with open("in.txt") as f:
    cols = [float(row.split()[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

f3 = open('out.txt', 'w')
with open('in.txt') as f2:
    for line in f2:
        items = line.split() # parse the columns
        tClass, feats, values = items[:3] # parse the columns
        #print items
        normalizedData = float(values)/float(maxVal)
        #print normalizedData

        f3.write("\t".join([tClass, feats, str(normalizedData)]) + "\n")
f3.close()
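
For reference, the explicit close() can also be avoided by letting with statements manage both files; a minimal sketch of the same logic, using the same file names as above:

# first pass: find the maximum of column 3
with open("in.txt") as f:
    maxVal = max(float(row.split()[2]) for row in f)

# second pass: write each normalized row; both files are closed automatically
with open("in.txt") as f2, open("out.txt", "w") as f3:
    for line in f2:
        tClass, feats, values = line.split()[:3]
        f3.write("\t".join([tClass, feats, str(float(values) / maxVal)]) + "\n")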
Martijn Pieters:

You'll need to open the output file once and keep writing to it as you process input lines. It'd also be much easier if you used the csv module to handle input and output:

import csv

with open("infile", 'rb') as inf:
    reader = csv.reader(inf, delimiter='\t')
    maxVal = max(float(row[2]) for row in reader)

with open('infile', 'rb') as inf, open('outfile', 'wb') as outf:
    reader = csv.reader(inf, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for row in reader:
        tClass, feats, values = row[:3]

        normalizedData = float(values) / maxVal

        writer.writerow([tClass, feats, normalizedData])
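
Note that opening files in 'rb'/'wb' for csv is a Python 2 convention; under Python 3 the csv module expects text-mode files opened with newline=''. A minimal adaptation of the same approach for Python 3:

import csv

with open("infile", newline="") as inf:
    maxVal = max(float(row[2]) for row in csv.reader(inf, delimiter="\t"))

with open("infile", newline="") as inf, open("outfile", "w", newline="") as outf:
    reader = csv.reader(inf, delimiter="\t")
    writer = csv.writer(outf, delimiter="\t")
    for row in reader:
        tClass, feats, values = row[:3]
        # csv converts the float to its string form when writing
        writer.writerow([tClass, feats, float(values) / maxVal])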
owwoow14:
#!/usr/bin/python

with open("lfr") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

    output1 = ''
    with open('lfr') as f2:
        for line in f2:                  
            items = line.split() # parse the columns
            tClass, feats, values = items[:3] # parse the columns
            #print items      
            normalizedData = float(values)/float(maxVal)
            output1 += tClass + "\t" + feats + "\t" + str(normalizedData) + "\n"

            with open('outfile', 'wb') as f3:
                output = output1
                f3.write(output + "\n")

I have been working on it too; it seems the problem was that I was not building an output variable by appending the results of each cycle. However, it is a bit slow (2 seconds to process a 4MB file). Can this possibly be optimized?
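
The slowdown is the with open('outfile', 'wb') inside the loop: every iteration reopens the file and rewrites the whole accumulated string, so the total work grows roughly quadratically with input size. One fix is to write each line as it is produced (as the answers above do); another is to keep the accumulate-and-dump idea but collect the rows in a list and write once, after the loop. A minimal sketch of the latter:

with open("lfr") as f:
    maxVal = max(float(row.split("\t")[2]) for row in f)

rows = []
with open("lfr") as f2:
    for line in f2:
        tClass, feats, values = line.split()[:3]
        rows.append(tClass + "\t" + feats + "\t" + str(float(values) / maxVal))

# single write at the end instead of rewriting the file on every iteration
with open("outfile", "w") as f3:
    f3.write("\n".join(rows) + "\n")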