I have a CSV file in which each line is supposed to end with a geographical coordinate (so a number). Somehow, stray line breaks have crept into some lines, and I would like to remove them.

Since the other lines are fine, the plan is to replace the line break at the end of a line with a space every time that line does not end with a number or with "None" (the value we use when we could not get the coordinate).

Instead of:

www.audiar.org,www.epfbretagne.fr,Agence
d'urbanisme,-1.68186449144,48.1119791219,-1.68186449144,48.1119791219
www.audiar.org,www.fnau.org,Agence
d'urbanisme,-1.68186449144,48.1119791219,None,None

I need to get this:

www.audiar.org,www.epfbretagne.fr,Agence d'urbanisme,-1.68186449144,48.1119791219,-1.68186449144,48.1119791219
www.audiar.org,www.fnau.org,Agence d'urbanisme,-1.68186449144,48.1119791219,None,None

But I must admit I have no idea how to achieve that. I checked some other posts close to my problem. The solutions seem to use sed, but I don't have Linux here (and I'm not sure I understand the syntax), and I'm a poor, poor Python user...

1 Answer

ForceBru On

Suppose you have two consecutive lines:

>>> line1 = 'www.audiar.org,www.epfbretagne.fr,Agence'
>>> line2 = "d'urbanisme,-1.68186449144,48.1119791219,-1.68186449144,48.1119791219"

Attempt to interpret the last part of the first line as a number or None. If it fails, concatenate the next line to it:

import ast

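# Text after the last comma; [-1] also works when the line has no comma at all
last_part = line1.rsplit(',', 1)[-1]  # == 'Agence'

try:
    data = ast.literal_eval(last_part)
except (ValueError, SyntaxError):
    # not a Python literal, so this line was cut by a stray line break
    output = line1 + ' ' + line2
else:
    # accept int as well, in case a coordinate has no decimal part
    if isinstance(data, (int, float)) or data is None:
        output = line1  # everything is OK
    else:
        raise ValueError("Malformed data!")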

# `output` is one processed line

Then continue with the following lines. If line1 was already complete, move line2 to line1 and read a new line into line2. If the two lines were concatenated, line2 has been consumed and the joined line needs to be checked again, because the "error" (a line not ending with a float or None) may continue past line2. Rinse, repeat.
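Putting the whole loop together could look roughly like the sketch below. It is only one possible way to do it, assuming that a record is complete as soon as its last comma-separated field parses as a number or None. The names `ends_with_coordinate`, `merge_broken_lines` and `fix_file` are made up for this example, and so are the file names in the usage note.

```python
import ast


def ends_with_coordinate(line):
    """Return True if the last comma-separated field is a number or None."""
    last_part = line.rsplit(',', 1)[-1].strip()
    try:
        value = ast.literal_eval(last_part)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. 'Agence' or an empty field)
        return False
    if isinstance(value, bool):
        # bool is a subclass of int, so reject True/False explicitly
        return False
    return value is None or isinstance(value, (int, float))


def merge_broken_lines(lines):
    """Yield complete records, joining broken pieces with a single space."""
    pending = ''
    for raw in lines:
        line = raw.rstrip('\r\n')
        if not line and not pending:
            continue  # skip blank lines between records
        pending = line if not pending else pending + ' ' + line
        if ends_with_coordinate(pending):
            yield pending
            pending = ''
    if pending:
        # Leftover piece that never ended with a coordinate; keep it as is
        yield pending


def fix_file(src_path, dst_path):
    """Read src_path, write the repaired lines to dst_path."""
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for record in merge_broken_lines(src):
            dst.write(record + '\n')
```

For example, `fix_file('input.csv', 'output.csv')` would write the repaired copy to a new file and leave the original untouched. One limitation to keep in mind: if a stray line break happens to fall right after a numeric field in the middle of a record, this check cannot tell that the record is incomplete, so counting the fields would be a safer test in that case.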