Reading a misaligned .txt file into a pandas DataFrame


I'm trying to read numerical data from a .txt file into a pandas DataFrame, but it needs some wrangling. Some rows are misaligned (I think because of extra tabs).

Snippet of data (pasting the table actually makes it appear aligned): [image: .txt dataset with mixed alignment]

What has worked so far is simply dropping the misaligned rows, but it's a small dataset and I'd like to keep every row. Code:

df = pd.read_table('path/file.txt', on_bad_lines='skip', header=None)
df

Output:

    0
0   15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1
1   14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.9...
2   14.29\t14.09\t0.905\t5.291\t3.337\t2.699\t4.82...

Using read_table without skipping bad lines returns: 'ParserError: Error tokenizing data. C error: Expected 8 fields in line 8, saw 10'

I've tried rewriting the .txt to replace tabs with a single space (or a comma) and trying to read the new file in with the specific delimiter, but that brings me back to the ParserError (strategy inspired by Replace Tab with space in entire text file python).

# Copy the file, replacing every tab with a comma
with open('path/file.txt', 'r') as inputFile, open('path/file_v1.txt', 'w') as exportFile:
    for line in inputFile:
        new_line = line.replace('\t', ',')
        exportFile.write(new_line)

(PS. Python beginner, and first StackOverflow problem. Thanks and sorry in advance if I missed some posting convention)

There is 1 answer

Corralien

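You can use the sep=r'\s+' parameter to specify how to split your data. It is a regular expression that matches one or more whitespace characters (spaces or tabs), so a run of consecutive tabs is treated as a single delimiter instead of producing extra empty fields. That is also why replacing each tab with a comma did not help: doubled tabs simply become doubled commas.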

Try:

df = pd.read_table('path/file.txt', header=None, sep=r'\s+')  # or sep=r'\t+'
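
To see the effect without touching your file, here is a minimal sketch on an in-memory sample. The sample text is hypothetical: it assumes the misalignment comes from doubled tabs in some rows, which may not be exactly what your file contains.

```python
import io

import pandas as pd

# Hypothetical sample: rows 2 and 3 contain a doubled tab, which is
# assumed (not confirmed) to be the cause of the misalignment.
raw = (
    "15.26\t14.84\t0.871\t1\n"
    "14.88\t\t14.57\t0.8811\t1\n"
    "14.29\t14.09\t\t0.905\t1\n"
)

# With the default tab separator, the doubled tab counts as an extra
# empty field, so the parser rejects the longer row.
try:
    pd.read_table(io.StringIO(raw), header=None)
    strict_failed = False
except pd.errors.ParserError:
    strict_failed = True

# With sep=r'\s+', any run of whitespace is a single delimiter,
# so every row yields the same number of columns.
df = pd.read_table(io.StringIO(raw), header=None, sep=r"\s+")
print(df)
```

One caveat: because runs of whitespace are collapsed, a genuinely empty field would shift the remaining values left instead of becoming NaN, so this only fits files where every row has a value in every column.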