CuDF KeyError: 'Field "None" does not exist in schema'


I am reading a single 1.4 GB file with cuDF. I was originally using RAPIDS AI's pandas implementation, but since I got the same error there, I tried cuDF directly. The resulting shape is (847942, 4) and dtypes shows:

@timestamp         object
message            object
syslog_program     object
syslog_hostname    object
dtype: object

from typing import List

import cudf

# Read every column as a string
DTYPES: dict = {
    '@timestamp': str,
    'syslog_program': str,
    'syslog_hostname': str,
    'message': str,
}

COLUMN_ORDER: List[str] = [
    '@timestamp',
    'syslog_program',
    'syslog_hostname',
    'message',
]

# matching_files is defined elsewhere; only the first file is read
gdf = cudf.read_csv(matching_files[0],
                    dtype=DTYPES,
                    usecols=COLUMN_ORDER,
                    delimiter=",",
                   )
print(gdf.shape)

However, when I call gdf.head(5), I get the following error:

return libcudf.interop.to_arrow([self], [("None", self.dtype)])["None"].chunk(0)

File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1525, in pyarrow.lib._Tabular.getitem()

File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1611, in pyarrow.lib._Tabular.column()

File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1547, in pyarrow.lib._Tabular._ensure_integer_index()

KeyError: 'Field "None" does not exist in schema'
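The traceback shows pyarrow being loaded from ~/.local/lib/python3.9/site-packages, so one thing worth ruling out is a pyarrow version that does not match what the installed cuDF expects. That is only an assumption, not a confirmed diagnosis. A minimal sketch for printing the installed versions, where the helper name installed_version and the list of package names are illustrative (cuDF may be installed under a CUDA-suffixed name, or through conda):

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# The package names below are assumptions: pip wheels for cuDF are often
# published with a CUDA suffix, while other installs use plain "cudf".
for name in ("cudf", "cudf-cu11", "cudf-cu12", "pyarrow"):
    print(f"{name}: {installed_version(name)}")
```

Comparing the reported pyarrow version against the range listed for the installed cuDF release would show whether the two are compatible.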

How can I overcome this issue, given that I do not hit it when reading the same file with Dask or Polars?
