I am trying to write a Python script that normalizes a dataset by dividing every value by the maximum value.
This is the script I have come up with so far:
#!/usr/bin/python

with open("infile") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
maxVal = max(cols)
#print maxVal

data = []

with open('infile') as f2:
    for line in f2:
        items = line.split() # parse the columns
        tClass, feats, values = items[:3] # parse the columns
        #print items
        normalizedData = float(values)/float(maxVal)
        #print normalizedData
        with open('outfile', 'wb') as f3:
            output = "\t".join([tClass +"\t"+ feats, str(normalizedData)])
            f3.write(output + "\n")
The goal is to take a tab-separated input file with three columns, such as:
lfr about-kind-of+n+n-the-info-n 3.743562
lfr about+n-a-j+n-a-dream-n 2.544614
lfr about+n-a-j+n-a-film-n 1.290925
lfr about+n-a-j+n-a-j-series-n 2.134124
- Look for the maxVal in the third column: in this case it would be 3.743562
- Divide all values in the third column by maxVal
- Output the following desired results:
lfr about-kind-of+n+n-the-info-n 1
lfr about+n-a-j+n-a-dream-n 0.67973
lfr about+n-a-j+n-a-film-n 0.34483
lfr about+n-a-j+n-a-j-series-n 0.57007
However, the script currently outputs only a single value, which I assume is the first value in the input data divided by the max. Any insight into what is going wrong in my code, i.e. why only one line ends up in the output, and how to fix it? Thank you in advance.
As far as I understand your intentions, the following does the job (with minor program-flow corrections).
Also, instead of writing continuously to the file, I chose to store what to write and then dump everything to the output file at once.
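A minimal sketch of this store-then-dump idea, assuming Python 3 and the three-column tab-separated layout from the question. The names `normalize_lines` and `write_normalized_buffered` are made up for illustration and are not taken from the original answer:

```python
def normalize_lines(lines):
    """Return output lines with the third column divided by its maximum."""
    # Parse each non-empty line into (class, feature, value).
    rows = []
    for line in lines:
        if not line.strip():
            continue
        t_class, feats, value = line.split()[:3]
        rows.append((t_class, feats, float(value)))

    # Assumption: there is at least one row and the maximum is non-zero.
    max_val = max(value for _, _, value in rows)

    # Build every output line in memory before anything is written.
    return ["\t".join([t_class, feats, str(value / max_val)])
            for t_class, feats, value in rows]


def write_normalized_buffered(in_path, out_path):
    """Read in_path, normalize it, and dump all lines to out_path at once."""
    with open(in_path) as f:
        out_lines = normalize_lines(f)
    # The output file is opened exactly once, after all lines are ready.
    with open(out_path, "w") as out:
        out.write("\n".join(out_lines) + "\n")
```

Calling `write_normalized_buffered("infile", "outfile")` would process the files named in the question. Values are written with Python's default float formatting, so the maximum appears as `1.0`; a format string could be used instead if a fixed number of decimals is wanted.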
Update: It turns out that creating the `list` takes about as long as the excess `with` statement use, so I got rid of the list completely. The script now writes continuously to the file, without closing it every time.
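The version without the list can be sketched as follows, again assuming Python 3 and the input layout from the question; `write_normalized_streaming` is a hypothetical name, not the answerer's own code. The key point is that the output file is opened once, outside the loop: opening it in write mode inside the loop truncates it on every iteration, which is why only one line survives.

```python
def write_normalized_streaming(in_path, out_path):
    """Divide column 3 by its maximum, writing each line as it is computed."""
    # First pass: find the maximum of the third column.
    # Assumption: the file has at least one data line and a non-zero maximum.
    with open(in_path) as f:
        max_val = max(float(line.split()[2]) for line in f if line.strip())

    # Second pass: open the output file ONCE, outside the loop, and keep it
    # open while every normalized line is written.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            t_class, feats, value = line.split()[:3]
            normalized = float(value) / max_val
            dst.write("\t".join([t_class, feats, str(normalized)]) + "\n")
```

This reads the input twice but never holds more than one line in memory, which may matter for large files.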