Reformat unstructured Pandas DataFrame

52 views Asked by At

I have to read multiple PDF files via tabula-py. This works good, it returns a dataframe, or a list of dataframes in case of multiple ranges of interests are set.

The problem is, that the underlying pdf has no structured format:

index 0
0 name
1 Mr. John Doe
2 Address
3 123 Main Street
4 Anytown
5 Germany
6 Date
7 01.01.2010

How could I reformat a Pandas Dataframe to set "name","address" and "date" to resp. columns and set residual entries correctly as values?

index name address_street address_city address_state date
0 Mr. John Doe 123 Main Street Anytown Germany 01.01.2010
1

There are 1 answers

1
Jacob H On

Just to make sure I'm understanding the problem correctly: You have a bunch of individual dataframes that look like your first table. That is, a single column (with label 0), of alternating (key, value) pairs, and therefore always an even number of rows. And you want to combine them into a single table, with a row for each of the original dataframe.

First, some sample data:

import pandas as pd

x = pd.DataFrame({0: ['name:', 'Mr. John Doe', 'Address:', '123 Main Street', 'Date', '01.01.2010']})
y = pd.DataFrame({0: ['name:', 'Mr. Freddy Kreuger', 'Address:', '1428 Elm Street', 'Date', '11.09.1984']})

We can convert one of these dataframes into a dictionary of key:value pairs using dictionary comprehension:

{x[0].values[2*i]: x[0].values[2*i+1] for i in range(int(len(x)/2))}

And that results in

{'name:': 'Mr. John Doe', 'Address:': '123 Main Street', 'Date': '01.01.2010'}

To break down the dictionary comprehension: The general form of this is {key: value for index in iterable}. The iterable we're using is the integers up to half the length of x. In this case range(int(len(x)/2)) will be [0, 1, 2]. We slice the values of the 0 column of the dataframe x with x[0].values[]. We use 2*i for the key and 2*i + 1 for the value, so that the index pairs we use are (0, 1), (2, 3) and (4, 5).

Once we have our data in a bunch of dictionaries like that, we can combine them into a dataframe as follows:

import pandas as pd

x = pd.DataFrame({0: ['name:', 'Mr. John Doe', 'Address:', '123 Main Street', 'Date', '01.01.2010']})
y = pd.DataFrame({0: ['name:', 'Mr. Freddy Kreuger', 'Address:', '1428 Elm Street', 'Date', '11.09.1984']})

raw_dfs = [x, y]

list_of_dicts = [{df[0].values[2*i]: df[0].values[2*i+1] for i in range(int(len(df)/2))}
                 for df in raw_dfs]

final_df = pd.DataFrame(list_of_dicts)

Here I used a list comprehension to collect all those dictionaries into a single list. But if you're iterating through PDF filenames to generate the raw dataframes, maybe you're building this list as you go. The end result is:

>>> print(final_df)
                name:         Address:        Date
0        Mr. John Doe  123 Main Street  01.01.2010
1  Mr. Freddy Kreuger  1428 Elm Street  11.09.1984