Python pandas read_fwf strips white space

1.4k views Asked by At

I am facing an issue using the read_fwf command from the Python library pandas, same as described in this unresolved question

I want to read an ascii file containing results of numeric computations. The file contains repetitive header blocks with a variable number of lines between each of them and with a roughly described format but rather arbitrary string content. I do not want to allow any loss of information and therefore need to be able to process everything as it was originally.

Look at this examplary line:

$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS 1181

Using a

with pd.read_fwf(self.file_path, header=None,
                         chunksize=chunk_size,  # no of lines read
                         colspecs=[(0, 23), (23, 41), (41, 59), (59, 72)],
                         engine='c', dtype={0: str}) as df_gen:

context manager to read the file, white spaces between columns will be stripped despite the seamlessly neighbouring column lengths. So it does not act as a fixed-width reader.

Cutting out the middle part of the line above with

"".join(s.dropna()).split("=")[-1].strip(" ")

with s being the row of the DataFrame containing that line, I will get a string

48-ELEMENT CANTILEVER BEAM ONTHE X-AXIS instead of

48-ELEMENT CANTILEVER BEAM ON THE X-AXIS.

Does anybody know an elegant solution to this issue? I am working on an approach where I will initially not cut into columns and do it for the numeric blocks after i extracted the header lines but it might increase computation time significantly.

Is there maybe a keyword in newer pandas versions that will disable the stripping?

1

There are 1 answers

1
BeRT2me On BEST ANSWER

Take bigger cuts:

s = "$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181"

pd.read_fwf(StringIO(s), header=None, colspecs=[(0, 11),(11, 51),(51,80)]

Output:

            0                                         1     2
0  $SUBTITLE=  48-ELEMENT CANTILEVER BEAM ON THE X-AXIS  1181

Or you can just read in line by line:

pd.read_table(StringIO(s), header=None)

...
                                                                                  0
0  $SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181

A little regex magic and you can consistently extract the inner line:

re.findall('.*= (.+?)\s\s', s)

...

['48-ELEMENT CANTILEVER BEAM ON THE X-AXIS']