Free Text to pandas dataframe in python

113 views Asked by At

I have a text file that contains many tables. I would like to capture these tables into dataframes. The problem is that even though these tables look like tables , they are structurally just text. The reason they look like tables is the use of spacing. Like this:

                       Three Months Ended    Nine Months Ended
                      ------------------    -----------------
                       November 30,          November 30,
                       1996       1995       1996       1995
                      -------    -------   --------    --------
<S>                    <C>        <C>       <C>         <C> 
DEPARTMENTA Team 1     73,003    $52,729   $235,753    $169,532
DEPARTMENTA Team 2     51,129     37,770    162,884     119,006

My hopes for the dataframe to look like this: enter image description here

to describe the dataframe in words: 5 columns:

  1. Department/team
  2. Three Months Ended November 30, 1996
  3. Three Months Ended November 30, 1995
  4. Nine Months Ended November 30, 1995
  5. Nine Months Ended November 30, 1996

plus headers and 20 rows of data

I tried to identify any markup to help me parse it but it looks like there isn't any. it is not html xml or excel..just text

Thank you any help

1

There are 1 answers

6
Vitalizzare On

Case 1: More then 1 spaces as a field separator

If two or more spaces occur strictly between columns, then you can use pd.read_table or pd.read_csv with more then 1 spaces as a separator and skipped header lines, sort of:

headerhight = 6
fieldsep = r'\t+|\s{2,}'
names = ['your', 'names', 'for', 'each', 'column']
pd.read_table(
    file, skiprows=headerhight, sep=fieldsep, 
    header=None, names=names, engine='python', dtype=str)

The rest is cleaning data, something like:

for col in df.columns[1:]:
    df[col] = (
        df[col]
        .str.lstrip('$')
        .str.replace(',', '.')
        .astype(float)
    )

But the more precise approach still depends on the context.

Case 2: A single space as a field separator

What if we can't rely on the 2+ spaces between columns? Then we can pass to pd.DataFrame a generator that returns the file lines splitted from the right a specified number of times, which can be calculated as the number of years in the header. For example:

import pandas as pd
from io import StringIO
from itertools import chain, tee
import re

data = '''\
                     1 Month Ended     3 Months Ended    9 Months Ended
                     ---------------   ---------------   -----------------
                     November 30,      November 30,      November 30,
                     1996     1995     1996     1995     1996     1995
                     -------- -------- -------- -------- -------- --------
<S>                  <C>      <C>      <C>      <C>      <C>      <C> 
DEPARTMENT A  Team 1   73,003  $52,729 $235,753 $169,532 $253,112  154,143
DEPARTMENT A  Team 2   51,129   37,770  162,884  119,006   12,005 $789,432
'''

with StringIO(data) as file:
    
    # read headers
    level_1 = re.findall(r'[^\s].*?\s+ended', file.readline(), flags=re.IGNORECASE)
    file.readline()     # skip ---
    level_2 = re.findall(r'[a-z]+?\s+\d+,', file.readline(), flags=re.IGNORECASE)
    level_3 = file.readline().split()
    file.readline()     # skip ---
    file.readline()     # skip <type> specs
    
    # create header names (a frivolous assumption was made here 
    # about some possible symmetry in a header structure)
    ncol = len(level_3)
    names = pd.MultiIndex.from_arrays([
        [*chain(*zip(*tee(level_1, ncol // len(level_1))))],
        [*chain(*zip(*tee(level_2, ncol // len(level_2))))],
        level_3
    ])
    
    # read the rest of the data
    rows = (line.rsplit(maxsplit=ncol) for line in file)
    df = (
        pd.DataFrame(rows)
        .set_index(0).rename_axis('Department/team')
        .set_axis(names, axis='columns')
        .reset_index()
    )

Case 3: Fixed-width formatted lines

When starting position of each column is determined by the "<" character in the last header line, then we can use pandas.read_fwf to read fixed-width columns. For example:

data = '''\
                       Three Months Ended   Nine Months Ended
                      ------------------    -----------------
                       November 30,          November 30,
                       1996       1995       1996       1995
                      -------    -------   --------    ------
<S>                    <C>        <C>       <C>         <C> 
DEPARTMENTA Team 1     73,003     52,729    235,753     169,532
DEPARTMENTA Team 2     51,129     37,770    162,884     119,006
'''

class FileStructureError(Exception): pass

with StringIO(data) as file:
    # find the line with <type> specs
    for line in file:
        if line.startswith('<'):
            break
    else:
        raise FileStructureError(file)

    # collect starting positions marked by '<'
    pos = [*(i for i, ch in enumerate(line) if ch == '<'), None]
    colspecs = [*zip(pos[:-1], pos[1:])]

    # read fixed-width formatted data
    df = pd.read_fwf(file, colspecs=colspecs, header=None)