Outputting pyspark dataframe in capnp (cap'n proto) format


I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?

I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.

1 Answer

Answered by Mariusz:

The easiest way is to convert the DataFrame to an RDD, then use mapPartitions to serialize each partition into a byte array and join the results with collect(), or, if the DataFrame is large, use toLocalIterator and write the partitions to disk as they arrive. See this example pseudocode:

# `create` builds a capnp message object for one row (e.g. a constructor
# generated from your schema by pycapnp)
create = your_serialization_method

# Serialize every row in a partition and concatenate the framed messages
# into a single bytes object, yielding a one-element partition
def serialize_partition(partition):
    return [b''.join(create(row).to_bytes() for row in partition)]

output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
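Since framed messages can be concatenated back to back and split apart again, the pattern above can be sketched without a Spark cluster or a compiled schema. In this hedged, self-contained sketch, plain lists stand in for RDD partitions, and a hypothetical length-prefixed `serialize_row` stands in for `create(row).to_bytes()`:

```python
import struct

def serialize_row(row: bytes) -> bytes:
    # Stand-in for create(row).to_bytes(): a 4-byte big-endian length
    # prefix followed by the payload (real capnp framing differs)
    return struct.pack(">I", len(row)) + row

def serialize_partition(partition):
    # One-element partition, mirroring the mapPartitions step above
    return [b"".join(serialize_row(row) for row in partition)]

def deserialize(blob: bytes):
    # Walk the concatenated stream record by record
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        yield blob[offset:offset + length]
        offset += length

# Plain lists stand in for RDD partitions; on Spark this would be
# df.rdd.mapPartitions(serialize_partition).collect()
partitions = [[b"alice", b"bob"], [b"carol"]]
output = b"".join(chunk for part in partitions
                  for chunk in serialize_partition(part))

print(list(deserialize(output)))  # → [b'alice', b'bob', b'carol']
```

For real capnp output, pycapnp can read a concatenated stream of messages back (see its `read_multiple` family of functions), so writing each partition's joined bytes to one file keeps the result consumable by other capnp tools.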