Pandas read_csv works but pyarrow doesn't

41 views Asked by abinitio At 18 March 2024 at 16:05

I have a CSV file that is tab-separated. The following code:

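import numpy as np
import sys
import pyarrow.csv as pa_csv
import pandas as pd

# pandas: read every column as a plain Python object (no type inference)
df = pd.read_csv(sys.argv[1], sep='\t', header=0, dtype='object')

# pyarrow: same file, tab as the delimiter, default read options
parse_options = pa_csv.ParseOptions(delimiter='\t')
data = pa_csv.read_csv(sys.argv[1], parse_options=parse_options)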

fails on the pyarrow read:

Having looked at the data I have been given, it seems the number of columns varies:

awk '{print NF}' data.csv:

200651
200651
200651
200653
200651
200651
200651

How does pandas handle this case, and why doesn't pyarrow do the same?

Can pyarrow be forced to behave in the same way?

EDIT

The number of columns doesn't vary. I hadn't told awk to use the tab as the delimiter, so it was splitting on any whitespace.

awk -F'\t' '{print NF}'
200669
200669
200669
200669
200669
200669
200669
200669
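
The same per-line field count can be reproduced in Python, which avoids awk's default of splitting on any whitespace. This is only a minimal sketch: the function name field_count_histogram is made up for illustration, and it splits on raw tab bytes without handling quoted fields, so it may disagree with a real CSV parser if the file contains quoted tabs.

```python
from collections import Counter


def field_count_histogram(path, delimiter=b"\t"):
    """Map each distinct per-line field count to the number of lines having it.

    Hypothetical helper: splits on the raw delimiter byte and ignores quoting.
    """
    counts = Counter()
    with open(path, "rb") as fh:
        for line in fh:
            # Strip the line terminator, then count delimiter-separated fields.
            fields = line.rstrip(b"\r\n").split(delimiter)
            counts[len(fields)] += 1
    return counts
```

If every row has the same number of fields, the returned Counter has a single key, e.g. field_count_histogram("data.csv") for the file above.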

So what is causing the error?

Update

Adding

read_options=pa_csv.ReadOptions(block_size=1e9)

to the pa_csv.read_csv call solved the issue. I guess it is down to the number of columns being large: with roughly 200,000 fields per line, each row is very long.
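
If the guess above is right, the block size has to be at least as large as the longest row. A rough sketch for deriving it from the file, rather than hard-coding 1e9, is shown below. The helper names, the headroom factor, and the 1 MiB floor are assumptions chosen for illustration, not pyarrow settings. Only ReadOptions(block_size=...), ParseOptions(delimiter=...) and read_csv come from pyarrow.

```python
def longest_row_bytes(path):
    """Return the size in bytes of the longest line in the file."""
    longest = 0
    with open(path, "rb") as fh:
        for line in fh:
            longest = max(longest, len(line))
    return longest


def suggested_block_size(path, headroom=2, floor=1 << 20):
    """Pick a block size comfortably larger than the longest row.

    `headroom` and `floor` are illustrative choices (assumptions), not
    values taken from pyarrow.
    """
    return max(floor, headroom * longest_row_bytes(path))


def read_wide_tsv(path):
    """Read a tab-separated file using a block size derived from its rows."""
    # Imported here so the two helpers above need only the standard library.
    import pyarrow.csv as pa_csv

    read_options = pa_csv.ReadOptions(block_size=suggested_block_size(path))
    parse_options = pa_csv.ParseOptions(delimiter="\t")
    return pa_csv.read_csv(
        path, read_options=read_options, parse_options=parse_options
    )
```

Scanning the file once costs an extra pass over the data, but it keeps the block size proportional to the actual row length instead of relying on a fixed large value.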
