Pandas read_csv works but pyarrow doesn't


I have a csv file, which is tab separated. The following code:

import sys
import pyarrow.csv as pa_csv
import pandas as pd

# pandas: read every column as a plain Python object (string)
df = pd.read_csv(sys.argv[1], sep='\t', header=0, dtype='object')

# pyarrow: same file, same tab delimiter
parse_options = pa_csv.ParseOptions(delimiter='\t')
data = pa_csv.read_csv(sys.argv[1], parse_options=parse_options)

fails on the pyarrow read:

Having looked at the data I have been given, it seems the number of columns varies:

awk '{print NF}' data.csv:

200651
200651
200651
200653
200651
200651
200651

How does pandas handle this case, and why doesn't pyarrow do the same?

Can pyarrow be forced to behave in the same way?

EDIT

The number of columns doesn't vary. I hadn't told awk to use the tab as the delimiter, so it split on all whitespace.

awk -F'\t' '{print NF}'
200669
200669
200669
200669
200669
200669
200669
200669
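
To illustrate why the two awk runs disagree, here is a minimal Python sketch. The sample rows are made up for illustration and are not from the real file; the helper name field_counts is hypothetical. Splitting on any whitespace (awk's default) breaks a value that contains a space into several fields, while splitting on tabs only does not.

```python
def field_counts(lines, delimiter=None):
    """Count fields per line; delimiter=None splits on any run of whitespace."""
    return [len(line.rstrip("\n").split(delimiter)) for line in lines]


# Made-up sample: some values contain spaces, one row does not.
sample = [
    "id\tfull name\tcity\n",
    "1\tAda Lovelace\tLondon\n",
    "2\tGrace\tParis\n",
]

whitespace_counts = field_counts(sample)   # roughly what awk '{print NF}' sees
tab_counts = field_counts(sample, "\t")    # roughly what awk -F'\t' '{print NF}' sees

print(whitespace_counts)
print(tab_counts)
```

With whitespace splitting the counts differ from row to row, while with tab splitting every row has the same number of fields.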

So what is causing the error?

Update

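Adding

read_options=pa_csv.ReadOptions(block_size=1e9)

to the pa_csv.read_csv call solved the issue. I guess it is down to the number of columns being large: with about 200,000 fields, a single row presumably does not fit in pyarrow's default block size.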
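
For anyone hitting the same thing, here is a minimal sketch of picking the block size from the file instead of hard-coding it. It assumes the failure really was caused by rows longer than the block size, which I have not confirmed beyond the fix above. The helper names (longest_line_bytes, suggested_block_size, read_wide_tsv) and the "twice the longest row" rule are my own choices, not anything pyarrow prescribes, and the 1 MiB floor is only my understanding of the default.

```python
def longest_line_bytes(path):
    """Return the length in bytes of the longest line in the file."""
    longest = 0
    with open(path, "rb") as fh:
        for line in fh:
            longest = max(longest, len(line))
    return longest


def suggested_block_size(path, floor=1 << 20):
    """Heuristic (an assumption): at least twice the longest row, never below floor."""
    return max(floor, 2 * longest_line_bytes(path))


def read_wide_tsv(path):
    """Read a tab-separated file with pyarrow using an enlarged block size."""
    # Imported here so the two helpers above can be used without pyarrow installed.
    import pyarrow.csv as pa_csv

    read_options = pa_csv.ReadOptions(block_size=suggested_block_size(path))
    parse_options = pa_csv.ParseOptions(delimiter="\t")
    return pa_csv.read_csv(
        path, read_options=read_options, parse_options=parse_options
    )
```

Usage would be data = read_wide_tsv(sys.argv[1]). Passing an integer for block_size avoids relying on a float such as 1e9 being accepted. This scans the file once before reading it, which may be too slow for very large inputs.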
