Unable to read Parquet file with PyArrow: Malformed levels


Assume that I cannot change how the Parquet file is written: it is immutable, so I need a way to read it as it is, given the following complications.

In:
import pandas as pd
pd.read_parquet("file_name.parquet", engine="pyarrow")

Out:
OSError: Malformed levels. min: 102 max: 162 out of range.  Max Level: 1

I have tried fastparquet instead of pyarrow as the engine but get a similar error: ValueError: buffer is smaller than requested size.

And I get the same OSError when I do:

import pyarrow.parquet as pq
pq.read_table("file_name.parquet")

I have inspected the schema of my Parquet file:

In:
pq.read_schema("file_name.parquet")

Out:
list: list<element: struct<column_1: string, column_2: string>>
  child 0, element: struct<column_1: string, column_2: string>
      child 0, column_1: string
      child 1, column_2: string
column_3: int64

I have narrowed down the problem to column_1 and column_2 by doing:

In:
pd.read_parquet("file_name.parquet", engine="pyarrow", columns=["column_1", "column_2"])

Out:
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(column_1) in list: list<element: struct<column_1: string, column_2: string>>

This is to be expected since column_1 and column_2 are nested within list, so instead I do:

In:
pd.read_parquet("file_name.parquet", engine="pyarrow", columns=["list"])

Out:
OSError: Malformed levels. min: 102 max: 162 out of range.  Max Level: 1
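
A possible next step is to check whether every row group fails or only some of them. The sketch below is only an illustration under stated assumptions: the helper names find_failing_row_groups and scan_parquet_column are hypothetical, and it assumes the file has more than one row group. It relies on pyarrow's ParquetFile.num_row_groups and ParquetFile.read_row_group, and it catches OSError and ValueError because those are the exception types seen above.

```python
from typing import Callable, List, Tuple


def find_failing_row_groups(
    num_row_groups: int,
    read_one: Callable[[int], object],
) -> List[Tuple[int, str]]:
    """Try each row group in turn; return (index, error message) for each failure."""
    failures = []
    for i in range(num_row_groups):
        try:
            read_one(i)
        except (OSError, ValueError) as exc:
            # "Malformed levels" surfaces as OSError; ArrowInvalid is a ValueError.
            failures.append((i, str(exc)))
    return failures


def scan_parquet_column(path: str, column: str) -> List[Tuple[int, str]]:
    """Hypothetical wrapper: run the scan on one top-level column via PyArrow."""
    # Imported lazily so the generic helper above has no third-party dependency.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    return find_failing_row_groups(
        pf.num_row_groups,
        lambda i: pf.read_row_group(i, columns=[column]),
    )
```

Calling scan_parquet_column("file_name.parquet", "list") would list the row groups that cannot be decoded. If only some fail, the remaining ones could still be read individually with read_row_group.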

I then tried fastparquet, which can at least read list without raising an error, but it fails to interpret the nested structure and returns None for every row:

In:
pd.read_parquet("file_name.parquet", engine="fastparquet", columns=["list"])

Out:
     list
0    None
1    None
2    None
3    None
4    None
..    ...
455  None
456  None
457  None
458  None
459  None

[460 rows x 1 columns]

For further context, I used the parquet-tools CLI to inspect my file (below I have just shown the output for column_1, the output for column_2 is the same):

In:
parquet-tools inspect file_name.parquet

Out:
############ Column(column_1) ############
name: column_1
path: list.list.element.column_1
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY (space_saved: 85%)

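max_definition_level: 4 indicates that four of the fields on the path list.list.element.column_1 are optional or repeated, which here matches the four levels of nesting for column_1. Note that the error message reports Max Level: 1, which matches max_repetition_level rather than max_definition_level.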
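
To make the level arithmetic concrete, here is a small worked example. It uses the Parquet rule that the maximum definition level counts the optional and repeated fields on a column's path, and the maximum repetition level counts the repeated ones. The repetition types in PATH are an assumption (the usual three-level list layout with nullable fields), since parquet-tools does not print them above, and max_levels is a hypothetical helper name.

```python
from typing import List, Tuple

# (field name, repetition type) from the root down to the leaf column.
# ASSUMPTION: standard three-level list layout with every field nullable;
# the actual repetition types are not shown in the parquet-tools output.
PATH: List[Tuple[str, str]] = [
    ("list", "optional"),
    ("list", "repeated"),
    ("element", "optional"),
    ("column_1", "optional"),
]


def max_levels(path: List[Tuple[str, str]]) -> Tuple[int, int]:
    """Return (max_definition_level, max_repetition_level) for a column path."""
    # Definition level: one per optional or repeated field on the path.
    max_def = sum(1 for _, rep in path if rep in ("optional", "repeated"))
    # Repetition level: one per repeated field on the path.
    max_rep = sum(1 for _, rep in path if rep == "repeated")
    return max_def, max_rep


print(max_levels(PATH))  # (4, 1)
```

Under that assumption the result agrees with the reported max_definition_level: 4 and max_repetition_level: 1, so any decoded level such as 102 or 162 is far outside the valid range.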

So essentially I need a way to unpack column_1 and column_2 when reading the file.

My question is similar to: https://stackoverflow.com/questions/76706305/parquet-pyarrow-malformed-levels
