I have to read multiple PDF files via tabula-py. This works well: it returns a dataframe, or a list of dataframes if multiple areas of interest are set.
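The reading step looks roughly like this (the file name, page, and area coordinates below are placeholders, not my real values):

```python
import tabula

# Placeholder path and coordinates; with several areas of interest,
# read_pdf returns a list of dataframes, one per area.
dfs = tabula.read_pdf(
    "example.pdf",
    pages=1,
    area=[[50, 30, 300, 550], [320, 30, 500, 550]],
    multiple_tables=True,
)
```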
The problem is that the underlying PDF has no structured format:
index | 0 |
---|---|
0 | name |
1 | Mr. John Doe |
2 | Address |
3 | 123 Main Street |
4 | Anytown |
5 | Germany |
6 | Date |
7 | 01.01.2010 |
How can I reshape the Pandas DataFrame so that "name", "address" and "date" become columns and the remaining entries are set as the corresponding values?
index | name | address_street | address_city | address_state | date |
---|---|---|---|---|---|
0 | Mr. John Doe | 123 Main Street | Anytown | Germany | 01.01.2010 |
Just to make sure I'm understanding the problem correctly: you have a bunch of individual dataframes that look like your first table. That is, a single column (with label `0`) of alternating (key, value) pairs, and therefore always an even number of rows. And you want to combine them into a single table, with a row for each of the original dataframes.

First, some sample data:
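(I'm making up two small dataframes here to stand in for the raw tabula-py output; note that I've collapsed the whole address into one cell so that each dataframe really is strictly alternating key/value rows.)

```python
import pandas as pd

# Two stand-in dataframes, each a single column (labelled 0) of
# alternating key/value rows, like the raw tabula-py output.
x = pd.DataFrame({0: ["name", "Mr. John Doe",
                      "Address", "123 Main Street, Anytown, Germany",
                      "Date", "01.01.2010"]})
y = pd.DataFrame({0: ["name", "Ms. Jane Roe",
                      "Address", "456 Side Street, Otherville, Germany",
                      "Date", "02.02.2012"]})
```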
We can convert one of these dataframes into a dictionary of key:value pairs using dictionary comprehension:
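```python
# Pair row 2*i (the key) with row 2*i + 1 (the value);
# `d` is just a throwaway name for the resulting dict.
d = {x[0].values[2*i]: x[0].values[2*i + 1] for i in range(int(len(x)/2))}
```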
And, with the sample data above, that results in:
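```
{'name': 'Mr. John Doe',
 'Address': '123 Main Street, Anytown, Germany',
 'Date': '01.01.2010'}
```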
To break down the dictionary comprehension: the general form of this is `{key: value for index in iterable}`. The iterable we're using is the integers up to half the length of `x`. In this case `range(int(len(x)/2))` will be `[0, 1, 2]`. We slice the values of the `0` column of the dataframe `x` with `x[0].values[...]`. We use `2*i` for the key and `2*i + 1` for the value, so that the index pairs we use are `(0, 1)`, `(2, 3)` and `(4, 5)`.

Once we have our data in a bunch of dictionaries like that, we can combine them into a dataframe as follows:
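(`records` and `result` are just names I've picked here; swap in however you collect your raw dataframes in place of `[x, y]`.)

```python
# One dict per raw dataframe, then let the DataFrame constructor
# turn the dict keys into columns.
records = [{df[0].values[2*i]: df[0].values[2*i + 1] for i in range(int(len(df)/2))}
           for df in [x, y]]
result = pd.DataFrame(records)
```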
Here I used a list comprehension to collect all those dictionaries into a single list. But if you're iterating through PDF filenames to generate the raw dataframes, maybe you're building this list as you go. With the sample data above, the end result is something like:
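```
           name                               Address        Date
0  Mr. John Doe     123 Main Street, Anytown, Germany  01.01.2010
1  Ms. Jane Roe  456 Side Street, Otherville, Germany  02.02.2012
```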