Writing the Parquet MAP data type with PyArrow


I'm writing in Python and would like to use PyArrow to generate Parquet files.

As I understand it from the Implementation Status page, the C++ (Python) library already implements the MAP type. The Data Types documentation also lists the type map_(key_type, item_type[, keys_sorted]).

So I tried several different approaches in Python/PyArrow, but all of them failed.

E.g.:

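import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# File name taken from the parquet-tools command shown further down
FILE_NAME = 'rand_gen_test_map.parquet'

df = pd.DataFrame({
    'col1': pd.Series([
        [('key', 'aaaa'), ('value', '1111')],
        [('key', 'bbbb'), ('value', '2222')],
    ]),
    'col2': pd.Series(['foo', 'bar']),
})

# Map column with string keys and string values
udt = pa.map_(pa.string(), pa.string())
schema = pa.schema([pa.field('col1', udt), pa.field('col2', pa.string())])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, FILE_NAME)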

When I read the file with parquet-tools cat rand_gen_test_map.parquet, I got:

col1:
.key_value:
.key_value:
col2 = foo

col1:
.key_value:
.key_value:
col2 = bar

It seems that the map values are not written correctly (or are missing), although the schema is correct:

message schema {
  optional group col1 (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
  optional binary col2 (UTF8);
}

All in all, I have two questions (all in Python):

  1. What is the best way to generate Parquet files with the MAP data type? (It would be great if an example could be attached.)

  2. I understand that we can use a STRUCT to mimic a map structure. But since Parquet provides the MAP type, we would still like to use it. If the MAP data type can't be generated, what is the reason for providing a MAP type?


There is 1 answer

Micah Kornfield

There was a bug in writing map types. It should be fixed in pyarrow 2.0, which also supports reading map columns natively.