No data for map column of a parquet file created from pyarrow and pandas

I followed the question "pyarrow data types for columns that have lists of dictionaries?" to create an Arrow table that includes a column of MapType.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(f'PyArrow Version = {pa.__version__}')
print(f'Pandas Version = {pd.__version__}')

df = pd.DataFrame({
    'col1': pd.Series([
        [('id', 'something'), ('value2', 'else')],
        [('id', 'something2'), ('value', 'else2')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema)
pq.write_table(table, './test_map.parquet')

The above code runs without errors on my development machine:

PyArrow Version = 1.0.1
Pandas Version = 1.1.2

It generated the test_map.parquet file successfully.

I then used parquet-tools (1.11.1) to read the file, but got the following output:

col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

The keys and values are missing. How can I get the map data written correctly?

There are 2 answers

0x26res On

I tried to reproduce this, but I get the following error:

pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null

As the error says, lists of structs (and therefore maps) are not well supported when it comes to reading from Parquet.

I'd recommend using a simpler schema for your data, like this one:

df = pd.DataFrame({
    'col1': pd.Series([
        {'id': 'something', 'value': 'else'},
        {'id': 'somethings', 'value': 'elses'},
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

udt = pa.struct([pa.field('id', pa.string()), pa.field('value', pa.string())])
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema)

which outputs:

+----------------------------------------+--------+
| col1                                   | col2   |
|----------------------------------------+--------|
| {'id': 'something', 'value': 'else'}   | foo    |
| {'id': 'somethings', 'value': 'elses'} | bar    |
+----------------------------------------+--------+
acan On

We submitted a JIRA issue to Apache Arrow on Sep 30, 2020: https://issues.apache.org/jira/browse/ARROW-10140

The issue was resolved in PyArrow 2.0.0, which was released on Oct 20, 2020.

So if you hit the same problem with the map type, please upgrade to PyArrow 2.0.0 or later.