Reformat unstructured Pandas DataFrame

Question

Reformat unstructured Pandas DataFrame

71 views Asked by cekar At 29 December 2024 at 12:34

I have to read multiple PDF files via tabula-py. This works good, it returns a dataframe, or a list of dataframes in case of multiple ranges of interests are set.

The problem is, that the underlying pdf has no structured format:

index	0
0	name
1	Mr. John Doe
2	Address
3	123 Main Street
4	Anytown
5	Germany
6	Date
7	01.01.2010

How could I reformat a Pandas Dataframe to set "name","address" and "date" to resp. columns and set residual entries correctly as values?

index	name	address_street	address_city	address_state	date
0	Mr. John Doe	123 Main Street	Anytown	Germany	01.01.2010

Original Q&A

There are 1 answers

**Jacob H** · Answer 1 · 2023-10-23T15:11:01+00:00

Just to make sure I'm understanding the problem correctly: You have a bunch of individual dataframes that look like your first table. That is, a single column (with label 0), of alternating (key, value) pairs, and therefore always an even number of rows. And you want to combine them into a single table, with a row for each of the original dataframe.

First, some sample data:

import pandas as pd

x = pd.DataFrame({0: ['name:', 'Mr. John Doe', 'Address:', '123 Main Street', 'Date', '01.01.2010']})
y = pd.DataFrame({0: ['name:', 'Mr. Freddy Kreuger', 'Address:', '1428 Elm Street', 'Date', '11.09.1984']})

We can convert one of these dataframes into a dictionary of key:value pairs using dictionary comprehension:

{x[0].values[2*i]: x[0].values[2*i+1] for i in range(int(len(x)/2))}

And that results in

{'name:': 'Mr. John Doe', 'Address:': '123 Main Street', 'Date': '01.01.2010'}

To break down the dictionary comprehension: The general form of this is {key: value for index in iterable}. The iterable we're using is the integers up to half the length of x. In this case range(int(len(x)/2)) will be [0, 1, 2]. We slice the values of the 0 column of the dataframe x with x[0].values[]. We use 2*i for the key and 2*i + 1 for the value, so that the index pairs we use are (0, 1), (2, 3) and (4, 5).

Once we have our data in a bunch of dictionaries like that, we can combine them into a dataframe as follows:

import pandas as pd

x = pd.DataFrame({0: ['name:', 'Mr. John Doe', 'Address:', '123 Main Street', 'Date', '01.01.2010']})
y = pd.DataFrame({0: ['name:', 'Mr. Freddy Kreuger', 'Address:', '1428 Elm Street', 'Date', '11.09.1984']})

raw_dfs = [x, y]

list_of_dicts = [{df[0].values[2*i]: df[0].values[2*i+1] for i in range(int(len(df)/2))}
                 for df in raw_dfs]

final_df = pd.DataFrame(list_of_dicts)

Here I used a list comprehension to collect all those dictionaries into a single list. But if you're iterating through PDF filenames to generate the raw dataframes, maybe you're building this list as you go. The end result is:

>>> print(final_df)
                name:         Address:        Date
0        Mr. John Doe  123 Main Street  01.01.2010
1  Mr. Freddy Kreuger  1428 Elm Street  11.09.1984

TechQA.

Reformat unstructured Pandas DataFrame

There are 1 answers

Related Questions in PYTHON-3.X

Related Questions in PANDAS

Related Questions in DATAFRAME

Related Questions in REFORMATTING

Related Questions in UNSTRUCTURED-DATA

Popular Questions

Popular Tags

Trending Questions