Tabula broke text into unnamed columns


I'm writing a Python script that reads a PDF with tabula and saves the resulting DataFrame as a CSV. I've tried both convert_into and read_pdf. Both return all the pages I request, but some columns get split: part of the text lands in new, nameless columns (Unnamed: 0, Unnamed: 1, ...) that hold the missing fragments of the original column.

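import datetime
import pandas as pd
import tabula as t

# path, paginas and destino are defined elsewhere in the script
page = t.read_pdf(path, stream=True, encoding="UTF-8", pages=paginas, multiple_tables=True, guess=True, format="pdf")
df = pd.DataFrame([])
for i in range(len(page)):
    # with multiple_tables=True, read_pdf returns a list of DataFrames
    page[i] = pd.DataFrame(page[i])
    df = pd.concat([df, page[i]])
dataAtual = datetime.datetime.now()
saida = destino + "/ErrNFE" + str(dataAtual) + ".csv"
df.to_csv(saida, encoding="UTF-8", sep=";")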

Snippet of the output on LibreOffice Calc

I then passed the columns argument to the DataFrame constructor to keep only the expected columns. That removes the Unnamed columns, but it also discards the text they contain.

page = t.read_pdf(path, stream=True, encoding="UTF-8", pages=paginas, multiple_tables=True, guess=True, format="pdf")
# only the column names matter here, so a plain list is enough
colunas = ['Campo-Seq', 'Modelo', 'Regra de Validação', 'Aplic.', 'Msg', 'Efeito', 'Descrição Erro']
df = pd.DataFrame([])
for i in range(len(page)):
    # keeps only the listed columns; everything else (including Unnamed: N) is dropped
    page[i] = pd.DataFrame(page[i], columns=colunas)
    df = pd.concat([df, page[i]])
dataAtual = datetime.datetime.now()
saida = destino + "/ErrNFE" + str(dataAtual) + ".csv"
df.to_csv(saida, encoding="UTF-8", sep=";")

The result looks like this:

Snippet of the output on Calc after restricting the columns
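To make the desired result concrete, here is a minimal pandas-only sketch of one possible post-processing step. It assumes that each Unnamed: N column holds a fragment belonging to the nearest named column on its left, which may not be true for every table tabula extracts. The helper name merge_unnamed_columns and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd


def merge_unnamed_columns(df: pd.DataFrame, sep: str = " ") -> pd.DataFrame:
    """Fold every 'Unnamed: N' column into the nearest named column on its left.

    Assumption: a fragment always belongs to the column immediately before it.
    Column names are expected to be unique.
    """
    merged = {}
    target = None  # last named column seen so far
    for col in df.columns:
        # Missing cells become empty strings so concatenation never yields "nan"
        values = df[col].astype(object).fillna("").astype(str)
        if str(col).startswith("Unnamed") and target is not None:
            merged[target] = (merged[target] + sep + values).str.strip()
        else:
            # A leading Unnamed column has nothing to merge into, so it is kept
            target = col
            merged[col] = values
    return pd.DataFrame(merged, index=df.index)


# Made-up rows that mimic the split described above
sample = pd.DataFrame({
    "Msg": ["225", "226"],
    "Descrição Erro": ["start of text", "complete text"],
    "Unnamed: 0": ["rest of text", None],
})
fixed = merge_unnamed_columns(sample)
print(fixed)
```

Unlike restricting the columns, this keeps the fragments by appending them to the named column. Note that it converts every cell to text, so numeric columns would need to be converted back afterwards.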
