I have a text file that contains many tables, and I would like to capture these tables in dataframes. The problem is that even though they look like tables, they are structurally just text; they only look like tables because of the spacing. Like this:
Three Months Ended Nine Months Ended
------------------ -----------------
November 30, November 30,
1996 1995 1996 1995
------- ------- -------- --------
<S> <C> <C> <C> <C>
DEPARTMENTA Team 1 73,003 $52,729 $235,753 $169,532
DEPARTMENTA Team 2 51,129 37,770 162,884 119,006
I would like the resulting dataframe to look as follows, described in words. It has 5 columns:
- Department/team
- Three Months Ended November 30, 1996
- Three Months Ended November 30, 1995
- Nine Months Ended November 30, 1996
- Nine Months Ended November 30, 1995
plus headers and 20 rows of data
I tried to identify any markup that would help me parse it, but there doesn't seem to be any. It is not HTML, XML, or Excel, just text.
Thank you for any help.
Case 1: More than one space as a field separator
If two or more spaces occur strictly between columns (and never inside a field), then you can use pd.read_table or pd.read_csv with two or more spaces as the separator and with the header lines skipped. The rest is data cleaning.
But the more precise approach still depends on the context.
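As a minimal sketch of this case: the file text is inlined in a string, the spacing is an assumed reconstruction (the original layout was not preserved), and the column names are typed by hand rather than parsed from the header lines.

```python
import io

import pandas as pd

# Assumed layout: fields are separated by two or more spaces.
raw = """\
                    Three Months Ended      Nine Months Ended
                    ------------------      -----------------
                       November 30,            November 30,
                        1996        1995        1996        1995
                     -------     -------    --------    --------
<S>                 <C>         <C>         <C>         <C>
DEPARTMENTA Team 1      73,003     $52,729    $235,753    $169,532
DEPARTMENTA Team 2      51,129      37,770     162,884     119,006
"""

# Column names written out by hand (an assumption, not parsed from the file).
columns = [
    "Department/team",
    "Three Months Ended November 30, 1996",
    "Three Months Ended November 30, 1995",
    "Nine Months Ended November 30, 1996",
    "Nine Months Ended November 30, 1995",
]

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s{2,}",    # two or more whitespace characters
    engine="python",  # regex separators need the Python parser
    skiprows=6,       # the six header lines above the data
    header=None,
    names=columns,
    dtype=str,
)

# Cleaning: drop "$" and thousands separators, then convert to integers.
value_cols = columns[1:]
df[value_cols] = df[value_cols].apply(
    lambda s: s.str.replace(r"[$,]", "", regex=True).astype(int)
)
print(df)
```

With a real file, the io.StringIO(raw) argument would be replaced by the file path, and skiprows adjusted to the actual number of header lines.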
Case 2: A single space as a field separator
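A minimal sketch for this case, using the sample from the question with only single spaces between fields. The helper name split_rows is hypothetical, and the period labels are assumed to cover two year columns each. Every data line is split from the right once per year column, so the spaces inside the team label are left alone:

```python
import pandas as pd

raw = """\
Three Months Ended Nine Months Ended
------------------ -----------------
November 30, November 30,
1996 1995 1996 1995
------- ------- -------- --------
<S> <C> <C> <C> <C>
DEPARTMENTA Team 1 73,003 $52,729 $235,753 $169,532
DEPARTMENTA Team 2 51,129 37,770 162,884 119,006
"""


def split_rows(lines, n_values):
    """Yield each non-empty line split from the right n_values times."""
    for line in lines:
        line = line.strip()
        if line:
            yield line.rsplit(maxsplit=n_values)


lines = raw.splitlines()
years = lines[3].split()  # the header line that lists the years
n_values = len(years)     # number of value columns to split off

# Assumption: the first two years belong to the three-month period,
# the last two to the nine-month period.
periods = ["Three Months Ended"] * 2 + ["Nine Months Ended"] * 2
columns = ["Department/team"] + [
    f"{period} November 30, {year}" for period, year in zip(periods, years)
]

# Data starts after the six header lines.
df = pd.DataFrame(split_rows(lines[6:], n_values), columns=columns)
print(df)
```

The values are still strings here ("$52,729" and so on), so the same kind of cleaning as in Case 1 would follow.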
What if we can't rely on two or more spaces between columns? Then we can pass pd.DataFrame a generator that yields each line of the file split from the right a fixed number of times; that number can be calculated as the number of years in the header.
Case 3: Fixed-width formatted lines
When the starting position of each column is determined by the "<" character in the last header line, we can use pandas.read_fwf to read the fixed-width columns.
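A minimal sketch of the fixed-width approach, assuming a reconstructed layout in which every value sits between the "<" of its own column and the "<" of the next one. The spacing and the hand-written column names are assumptions for illustration:

```python
import io

import pandas as pd

# Assumed layout: each "<" in the last header line marks a column start.
raw = """\
                    Three Months Ended      Nine Months Ended
                    ------------------      -----------------
                       November 30,            November 30,
                        1996        1995        1996        1995
                     -------     -------    --------    --------
<S>                 <C>         <C>         <C>         <C>
DEPARTMENTA Team 1      73,003     $52,729    $235,753    $169,532
DEPARTMENTA Team 2      51,129      37,770     162,884     119,006
"""

columns = [
    "Department/team",
    "Three Months Ended November 30, 1996",
    "Three Months Ended November 30, 1995",
    "Nine Months Ended November 30, 1996",
    "Nine Months Ended November 30, 1995",
]

lines = raw.splitlines()
tag_line = lines[5]  # the last header line: <S> <C> <C> <C> <C>

# Column starts are the positions of "<"; each column ends where the next starts.
starts = [i for i, ch in enumerate(tag_line) if ch == "<"]
width = max(len(line) for line in lines)
colspecs = list(zip(starts, starts[1:] + [width]))  # half-open [start, end)

df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    skiprows=6,   # the six header lines above the data
    header=None,
    names=columns,
    dtype=str,
)
print(df)
```

read_fwf strips the padding around each field, so right-aligned numbers come out without leading spaces; converting them to integers is the same cleaning step as in Case 1.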