Assume that I am unable to change how the Parquet file is written, i.e. it is immutable, so I need a way to read it as-is despite the complexities described below.
In:
import pandas as pd
pd.read_parquet("file_name.parquet", engine="pyarrow")
Out:
OSError: Malformed levels. min: 102 max: 162 out of range. Max Level: 1
I have tried fastparquet instead of pyarrow as the engine but get a similar error: ValueError: buffer is smaller than requested size.
And I get the same OSError when I do:
import pyarrow.parquet as pq
pq.read_table("file_name.parquet")
I have inspected the schema of my Parquet file:
In:
pq.read_schema("file_name.parquet")
Out:
list: list<element: struct<column_1: string, column_2: string>>
  child 0, element: struct<column_1: string, column_2: string>
      child 0, column_1: string
      child 1, column_2: string
column_3: int64
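Walking the Arrow schema programmatically confirms the same nesting (just a sketch using pyarrow's schema accessors; the field names are the ones shown above):

import pyarrow.parquet as pq

schema = pq.read_schema("file_name.parquet")

# "list" is a list<struct<...>>, so column_1 and column_2 are leaves of the
# struct inside it rather than top-level columns.
struct_type = schema.field("list").type.value_type
print([struct_type.field(i).name for i in range(struct_type.num_fields)])
# expected, going by the schema above: ['column_1', 'column_2']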
I have narrowed down the problem to column_1 and column_2 by doing:
In:
pd.read_parquet("file_name.parquet", engine="pyarrow", columns=["column_1", "column_2"])
Out:
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(column_1) in list: list<element: struct<column_1: string, column_2: string>>
This is to be expected since column_1 and column_2 are nested within list, so instead I do:
In:
pd.read_parquet("file_name.parquet", engine="pyarrow", columns=["list"])
Out:
OSError: Malformed levels. min: 102 max: 162 out of range. Max Level: 1
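For reference, the narrowing-down above amounts to checking that the file reads fine when the nested columns are left out, e.g.:

import pandas as pd

pd.read_parquet("file_name.parquet", engine="pyarrow", columns=["column_3"])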
I then tried fastparquet, which is at least able to read list, but it fails to interpret the nested structure of list:
In:
pd.read_parquet("file_name.parquet", engine="fastparquet", columns=["list"])
Out:
list
0 None
1 None
2 None
3 None
4 None
.. ...
455 None
456 None
457 None
458 None
459 None
[460 rows x 1 columns]
For further context, I used the parquet-tools CLI to inspect my file (below I have only shown the output for column_1; the output for column_2 is the same):
In:
parquet-tools inspect file_name.parquet
Out:
############ Column(column_1) ############
name: column_1
path: list.list.element.column_1
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 85%)
max_definition_level: 4 indicates that there are four levels of nesting for column_1.
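For anyone without parquet-tools, the same per-column level information can be read from Python (a sketch using pyarrow's ParquetFile schema accessors; the print formatting is my own):

import pyarrow.parquet as pq

pf = pq.ParquetFile("file_name.parquet")

# One entry per leaf column, equivalent to the parquet-tools output above.
for i in range(pf.metadata.num_columns):
    col = pf.schema.column(i)
    print(col.path, col.physical_type,
          "max_definition_level:", col.max_definition_level,
          "max_repetition_level:", col.max_repetition_level)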
So essentially I need a way to unpack column_1 and column_2 when reading the file.
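To make the goal concrete: if the list column could be read at all (e.g. as a list of {"column_1": ..., "column_2": ...} dicts per row, matching the schema above), the unpacking itself would just be standard pandas. A sketch with made-up values, assuming that shape:

import pandas as pd

# Made-up stand-in for what a successfully read row might look like:
# one list of {"column_1": ..., "column_2": ...} dicts per record.
df = pd.DataFrame({
    "list": [[{"column_1": "q1", "column_2": "a1"},
              {"column_1": "q2", "column_2": "a2"}]],
    "column_3": [7],
})

exploded = df.explode("list").reset_index(drop=True)
unpacked = pd.concat(
    [exploded.drop(columns=["list"]), pd.json_normalize(exploded["list"].tolist())],
    axis=1,
)
print(unpacked)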
My question is similar to: https://stackoverflow.com/questions/76706305/parquet-pyarrow-malformed-levels