I'm writing in Python and would like to use PyArrow to generate Parquet files.
As I understand it, and according to the Implementation Status page, the C++ (and therefore Python) library already implements the MAP type. The Data Types documentation also lists the type map_(key_type, item_type[, keys_sorted]).
So I tried several different approaches in Python/PyArrow, but all of them failed.
For example:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

FILE_NAME = 'rand_gen_test_map.parquet'  # same file that is read back below

df = pd.DataFrame({
    'col1': pd.Series([
        [('key', 'aaaa'), ('value', '1111')],
        [('key', 'bbbb'), ('value', '2222')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])
table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, FILE_NAME)
When I read the file with parquet-tools cat rand_gen_test_map.parquet, I got:
col1:
.key_value:
.key_value:
col2 = foo
col1:
.key_value:
.key_value:
col2 = bar
It seems that the map values are not written correctly (or are missing), even though the schema is correct:
message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional binary col2 (UTF8);
}
All in all, I have two questions (both about Python):
1. What is the best way to generate Parquet files with the MAP data type? (It would be great if an example could be attached.)
2. I understand that we can use a STRUCT to mimic a map structure, but since Parquet provides the MAP type, we would still like to use it. If the MAP data type can't be generated, what is the reason for providing a MAP type?
There was a bug in writing map types. This should be fixed in pyarrow 2.0, which also supports reading map types natively.