How to get rid of '\r' when extracting and printing a table from a PDF file?

Question

Asked by Torsten_Z90 on 28 April 2022 at 09:02 (427 views)

The objective is to extract a table from a given PDF file and convert the whole table to a pandas DataFrame for further operations. The table contains only strings.

The code itself works, but after the extracted table is converted to a DataFrame, every string that originally had a line break inside its table cell appears with "\r" between the words.

Example: Original appearance in cell: "Neues Wh..." (with a line break between the two words)

Should look like: "Neues Wh..."

Result after converting to a DataFrame: "Neues\rWh..."

See my code below:

import io
import os

import pandas as pd
from tabula import read_pdf
from tabulate import tabulate

pdf_template_path = os.path.join(r'H:\folder\ pdf-file')
pdf_template_path1 = pdf_template_path + '.pdf'

# Extract all tables from the PDF; with multiple_tables=True this returns a list of DataFrames
pdf_table = read_pdf(pdf_template_path1,
                     pages='all',
                     multiple_tables=True,
                     lattice=True,
                     pandas_options={'header': None})

# Transform the result into a string table format
table = tabulate(pdf_table)

# Transform the table into a dataframe
df = pd.read_fwf(io.StringIO(table))

# Build the column mapping only after df exists
mapping = {df.columns[0]: 'x1',
           df.columns[1]: 'x2',
           df.columns[2]: 'x3',
           df.columns[3]: 'x4?',
           df.columns[4]: 'x5',
           df.columns[5]: 'x6',
           df.columns[6]: 'x7',
           df.columns[7]: 'x8'}

df.rename(columns=mapping, inplace=True)
# 'Beschreibung' must be one of the renamed column names
df.style.set_properties(subset=['Beschreibung'], **{'width': '300px'})

# display() is available in Jupyter/IPython
display(df.head())
df.shape

This gives the following result (image "result", not reproduced here).

As the picture shows, the carriage-return sequence "\r" sometimes appears between the words, e.g. 'Neues\rWh..', whereas the result should look like this: 'Neues Wh..'.

I tried methods like replace():

df = df.replace('\r', '', regex= True)

EDIT: This did not work; the strings in the DataFrame remain unchanged (see the image "result after df_replace").

I'm thankful for any advice.

There is 1 answer

**Torsten_Z90** · Accepted Answer · 2022-04-28T12:26:15+00:00

Solved. The solution here is:

df = df.replace(r'\\r', ' ', regex= True)

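In the raw string r'\\r', the first backslash escapes the second, so the regular expression matches a literal backslash followed by the letter r. Since this pattern works, the cells evidently contain the two-character text \r rather than a real carriage-return character, which is why the earlier pattern '\r' left the strings unchanged.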
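The difference between the two patterns can be seen in a minimal sketch. The sample DataFrame and the column name `text` are made up for illustration and are not taken from the original table; the sketch assumes that a cell may hold either the literal two-character text `\r` or a real carriage return.

```python
import pandas as pd

# Hypothetical sample data (not from the original PDF):
# row 0 holds a literal backslash followed by "r" (two characters),
# row 1 holds a real carriage-return character (one character).
df = pd.DataFrame({"text": ["Neues\\rWh", "Neues\rWh"]})

# The pattern "\r" is a real carriage return, so only row 1 changes.
only_cr = df.replace("\r", " ", regex=True)

# The pattern r"\\r" is a regex for backslash + "r", so only row 0 changes.
only_literal = df.replace(r"\\r", " ", regex=True)

# An alternation handles both forms in one pass.
both = df.replace(r"\\r|\r", " ", regex=True)

print(both["text"].tolist())
```

If it is unclear which form a cell contains, printing `repr(df.iloc[0, 0])` shows the escapes explicitly: a literal backslash appears as `\\r`, a real carriage return as `\r`.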
