I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?
I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.
The easiest way is to go to RDD, the use
mapPartitions
to collect partitions as serialized byte arrays and join them incollect()
or usetoLocalIterator
with saving to disk, if dataframe is big. See example pseudocode: