Is there an Apache Beam function to gather a fixed number of elements?

126 views Asked by At

I am fairly new to Apache Beam and Dataflow, and I want to gather or "batch together" k number of elements from a PCollection that has n elements. In this case n is not a fixed number and n > k. For example, suppose my PCollection has 1000 elements in it, I want to create batches of up to 50 random elements as the API that I am hitting has a limit of 50 elements per call. I understand that the batching process would be in parallel, and I can make let's say easily 20 parallel calls to my API as well.

Is there a built-in component/function that allows me to do so or can you suggest me a custom DoFn that allows me to do so?

I believe this can be solved by a combination of Combine, Partion or GroupByKey but I don't know how to put it all together. I am looking for a solution using Python.  

A little background on what I want to achieve by using Dataflow by creating a component for each step (if it helps):

  1. Read a CSV file from a GCS bucket. The CSV file contains paths to numerous text files that are also on GCS.
  2. Read each text file from the GCS bucket
  3. Break down the raw text file into chunks of 100 characters
  4. Collect 50 chunks together and call an API using the collected 50 chunks as a single request (this is the step where I need help)
  5. Save the result of the API onto a database.
1

There are 1 answers

0
Dev Yns On