I am reading a single 1.4 GB CSV file with cuDF. I was originally using the RAPIDS AI pandas implementation, but since I got the same error there, I tried cuDF directly. The read succeeds: the shape is (847942, 4) and dtypes shows:
@timestamp object
message object
syslog_program object
syslog_hostname object
dtype: object
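import cudf
from typing import List

# Read every column as a string.
DTYPES: dict = {
    '@timestamp': str,
    'syslog_program': str,
    'syslog_hostname': str,
    'message': str,
}
COLUMN_ORDER: List[str] = [
    '@timestamp',
    'syslog_program',
    'syslog_hostname',
    'message',
]
# matching_files is assumed to be a non-empty list of CSV paths built earlier (not shown).
gdf = cudf.read_csv(
    matching_files[0],
    dtype=DTYPES,
    usecols=COLUMN_ORDER,
    delimiter=",",
)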
print(gdf.shape)
But when I call gdf.head(5), I get the following error:
return libcudf.interop.to_arrow([self], [("None", self.dtype)])["None"].chunk(0)
File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1525, in pyarrow.lib._Tabular.__getitem__()
File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1611, in pyarrow.lib._Tabular.column()
File ~/.local/lib/python3.9/site-packages/pyarrow/table.pxi:1547, in pyarrow.lib._Tabular._ensure_integer_index()
KeyError: 'Field "None" does not exist in schema'
How can I overcome this issue? I do not run into it when I read the same file with dask or polars.
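In case it helps narrow this down: the traceback goes through a pyarrow installed under ~/.local, so one thing I can report is which versions of the relevant packages are actually installed. This is only a minimal sketch using the standard library, and it assumes nothing about the cause. The helper name installed_versions and the distribution names in the list are my own assumptions (RAPIDS wheels are often published with a CUDA suffix such as cudf-cu12), so the list may need adjusting.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Assumed distribution names; adjust to match how cuDF was installed.
report = installed_versions(["cudf", "cudf-cu11", "cudf-cu12", "pyarrow", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in the same interpreter that raises the KeyError shows whether the pyarrow picked up from ~/.local is the one I expect to be paired with cuDF.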