How can I generate a histogram over a large bounded dataset with Apache Beam?


I'm writing an Apache Beam pipeline that transforms a raw dataset into a canonical schema defined with Google's Protocol Buffers. I then compute some metrics for each data instance and save them on the proto object as well.

Now, for each computed metric, I want to extract a histogram that describes the distribution of that metric across the dataset. How can I do that in Beam?

I see that there's a histogram metric implementation in the Python SDK, but it is for internal use only and is not supported by the runners. Is there a workaround for this?
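
To make the goal concrete, one possible direction is to treat the histogram as an ordinary aggregation rather than a runner metric, using a custom combiner. The following is only a minimal sketch under assumptions that are not part of my actual pipeline: the class name `FixedBinHistogramFn` is hypothetical, the bins are equal-width, and the value range `[low, high)` is assumed to be known up front (values outside it are clamped into the first or last bin). The method names follow the `CombineFn` interface of the Beam Python SDK; the import fallback is there only so the sketch also runs without Beam installed.

```python
try:
    import apache_beam as beam
    _CombineFnBase = beam.CombineFn
except ImportError:  # fallback so the sketch can be tried without Beam
    _CombineFnBase = object


class FixedBinHistogramFn(_CombineFnBase):
    """Hypothetical combiner: counts values into equal-width bins.

    Assumes the range [low, high) is known in advance; out-of-range
    values are clamped into the first or last bin.
    """

    def __init__(self, low, high, num_bins):
        super().__init__()
        self.low = low
        self.high = high
        self.num_bins = num_bins

    def create_accumulator(self):
        # One counter per bin.
        return [0] * self.num_bins

    def add_input(self, counts, value):
        width = (self.high - self.low) / self.num_bins
        index = int((value - self.low) // width)
        # Clamp so that out-of-range values land in an edge bin.
        index = min(max(index, 0), self.num_bins - 1)
        counts[index] += 1
        return counts

    def merge_accumulators(self, accumulators):
        # Partial histograms from different workers add element-wise.
        merged = self.create_accumulator()
        for counts in accumulators:
            for i, count in enumerate(counts):
                merged[i] += count
        return merged

    def extract_output(self, counts):
        return counts
```

If the metrics were emitted as `(metric_name, value)` pairs, something like `pairs | beam.CombinePerKey(FixedBinHistogramFn(0.0, 1.0, 20))` would presumably yield one list of bin counts per metric, and `beam.CombineGlobally` would do the same for a single metric. Because the accumulator is a fixed-size list, the memory per key stays constant regardless of dataset size, which is why this shape seems suited to a large bounded dataset.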


There are 0 answers