I have to read multiple PDF files via tabula-py. This works well: it returns a dataframe, or a list of dataframes when multiple regions of interest are set.
The problem is that the underlying PDF has no structured format:
| index | 0 |
|---|---|
| 0 | name |
| 1 | Mr. John Doe |
| 2 | Address |
| 3 | 123 Main Street |
| 4 | Anytown |
| 5 | Germany |
| 6 | Date |
| 7 | 01.01.2010 |
How can I reshape a pandas DataFrame so that "name", "address" and "date" become the respective columns and the remaining entries are assigned to them as values?
| index | name | address_street | address_city | address_state | date |
|---|---|---|---|---|---|
| 0 | Mr. John Doe | 123 Main Street | Anytown | Germany | 01.01.2010 |
Just to make sure I'm understanding the problem correctly: you have a bunch of individual dataframes that look like your first table. That is, a single column (with label `0`) of alternating (key, value) pairs, and therefore always an even number of rows. And you want to combine them into a single table, with a row for each of the original dataframes.

First, some sample data:
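A minimal sketch of such sample data, assuming three single-column dataframes (column label `0`) that each hold three alternating key/value pairs. The variable name `dfs` and all the values are made up for illustration:

```python
import pandas as pd

# Hypothetical raw frames: one column labelled 0, rows alternating
# between a key ("name", "address", "date") and its value.
dfs = [
    pd.DataFrame({0: ["name", "Mr. John Doe",
                      "address", "123 Main Street",
                      "date", "01.01.2010"]}),
    pd.DataFrame({0: ["name", "Ms. Jane Roe",
                      "address", "45 Side Road",
                      "date", "15.06.2012"]}),
    pd.DataFrame({0: ["name", "Dr. Max Mustermann",
                      "address", "7 Hill Lane",
                      "date", "30.11.2015"]}),
]

print(dfs[0])
```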
We can convert one of these dataframes into a dictionary of key:value pairs using a dictionary comprehension, which results in one dictionary entry per (key, value) pair:
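A minimal sketch of that conversion, assuming a single hypothetical dataframe `x` with made-up values and the result stored in a variable called `record`:

```python
import pandas as pd

# Hypothetical single-column frame (label 0) of alternating key/value rows.
x = pd.DataFrame({0: ["name", "Mr. John Doe",
                      "address", "123 Main Street",
                      "date", "01.01.2010"]})

# Pair each even-indexed row (the key) with the odd-indexed row
# directly after it (the value).
record = {x[0].values[2 * i]: x[0].values[2 * i + 1]
          for i in range(int(len(x) / 2))}

print(record)
# {'name': 'Mr. John Doe', 'address': '123 Main Street', 'date': '01.01.2010'}
```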
To break down the dictionary comprehension: the general form is `{key: value for index in iterable}`. The iterable we're using is the integers up to half the length of `x`. In this case `range(int(len(x)/2))` will produce `0`, `1`, `2`. We index into the values of the `0` column of the dataframe `x` with `x[0].values[...]`, using `2*i` for the key and `2*i + 1` for the value, so that the index pairs we use are `(0, 1)`, `(2, 3)` and `(4, 5)`.

Once we have our data in a bunch of dictionaries like that, we can combine them into a dataframe as follows:
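A minimal sketch of the combining step, assuming two hypothetical raw frames with made-up values. The helper name `to_record` is introduced here only for readability; it wraps the same dictionary comprehension:

```python
import pandas as pd

# Hypothetical raw frames with alternating key/value rows in column 0.
dfs = [
    pd.DataFrame({0: ["name", "Mr. John Doe",
                      "address", "123 Main Street",
                      "date", "01.01.2010"]}),
    pd.DataFrame({0: ["name", "Ms. Jane Roe",
                      "address", "45 Side Road",
                      "date", "15.06.2012"]}),
]


def to_record(x):
    """Turn the alternating key/value rows of column 0 into a dict."""
    values = x[0].values
    return {values[2 * i]: values[2 * i + 1] for i in range(len(x) // 2)}


# One dictionary per raw frame; pandas turns each dictionary into a row
# and the dictionary keys into column labels.
combined = pd.DataFrame([to_record(x) for x in dfs])

print(combined)
```

Passing a list of dictionaries to `pd.DataFrame` aligns the rows by key, so a key missing from one of the dictionaries would show up as `NaN` in that row rather than raising an error.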
Here I used a list comprehension to collect all those dictionaries into a single list. But if you're iterating through PDF filenames to generate the raw dataframes, maybe you're building this list as you go. The end result is a single dataframe with one column per key and one row per original dataframe.
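For the build-as-you-go variant, a rough sketch might look like the following. The filenames, the `raw_frames` mapping and the `to_record` helper are all hypothetical; the mapping stands in for whatever each PDF yields when read with tabula-py:

```python
import pandas as pd


def to_record(x):
    """Turn the alternating key/value rows of column 0 into a dict."""
    values = x[0].values
    return {values[2 * i]: values[2 * i + 1] for i in range(len(x) // 2)}


# Stand-in for reading the PDFs: hypothetical filenames mapped to the raw
# frame each one would produce. In practice each frame would come from
# tabula-py instead of being written out by hand.
raw_frames = {
    "doe.pdf": pd.DataFrame({0: ["name", "Mr. John Doe",
                                 "date", "01.01.2010"]}),
    "roe.pdf": pd.DataFrame({0: ["name", "Ms. Jane Roe",
                                 "date", "15.06.2012"]}),
}

records = []
for filename, x in raw_frames.items():
    # Append one dictionary per file as it is processed.
    records.append(to_record(x))

result = pd.DataFrame(records)
print(result)
```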