I am fairly new to Apache Beam and Dataflow, and I want to gather, or "batch together", `k` elements from a `PCollection` that has `n` elements, where `n` is not a fixed number and `n > k`. For example, suppose my `PCollection` has 1000 elements; I want to create batches of up to 50 arbitrary elements, as the API I am hitting has a limit of 50 elements per call. I understand that the batching process would run in parallel, and I can easily make, say, 20 parallel calls to my API as well.
Is there a built-in component/transform that allows me to do this, or can you suggest a custom `DoFn` that would?
I believe this can be solved with some combination of `Combine`, `Partition`, or `GroupByKey`, but I don't know how to put it all together. I am looking for a solution in Python.
A little background on what I want to achieve with Dataflow, building one component per step (in case it helps):
- Read a CSV file from a GCS bucket. The CSV file contains paths to numerous text files that are also on GCS.
- Read each text file from the GCS bucket.
- Break each raw text file into chunks of 100 characters.
- Collect 50 chunks together and call an API with those 50 chunks as a single request (this is the step where I need help; see the sketch after this list).
- Save the results of the API calls to a database.
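Here is a rough sketch of the pipeline skeleton I have in mind, just to make the steps concrete. The bucket path is a placeholder and I am assuming the CSV holds one GCS file path per line; step 4 is the part that is missing:

```python
import apache_beam as beam
from apache_beam.io import ReadFromText, fileio

def chunk_text(text, size=100):
    # Step 3: break the raw text into fixed-size character chunks.
    for i in range(0, len(text), size):
        yield text[i:i + size]

with beam.Pipeline() as p:
    chunks = (
        p
        # Step 1: placeholder path; assumes one GCS file path per CSV line.
        | 'ReadCsv' >> ReadFromText('gs://my-bucket/paths.csv')
        # Step 2: resolve and read each referenced text file.
        | 'MatchFiles' >> fileio.MatchAll()
        | 'OpenFiles' >> fileio.ReadMatches()
        | 'ReadContents' >> beam.Map(lambda f: f.read_utf8())
        | 'Chunk' >> beam.FlatMap(chunk_text)
        # Step 4: collect 50 chunks per API request (this is what I'm missing).
    )
```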
Check out the `GroupIntoBatches` transform: https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/
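A minimal sketch of how it can be used, assuming your elements are text chunks and `call_my_api` stands in for your real client. Note that `GroupIntoBatches` operates on key-value pairs, so a synthetic key is assigned first; spreading elements over about 20 distinct keys lets up to 20 batches be processed in parallel, matching your parallel-call budget:

```python
import zlib

import apache_beam as beam

def call_my_api(batch):
    # Placeholder for the real API client; accepts up to 50 chunks per call.
    return len(batch)

with beam.Pipeline() as p:
    results = (
        p
        | 'Chunks' >> beam.Create([f'chunk-{i}' for i in range(1000)])
        # GroupIntoBatches needs keyed input, so assign a synthetic key.
        # A deterministic hash over ~20 keys allows up to 20 parallel batches.
        | 'AddKey' >> beam.Map(
            lambda chunk: (zlib.crc32(chunk.encode('utf-8')) % 20, chunk))
        | 'Batch' >> beam.GroupIntoBatches(50)
        # Drop the key and send each batch of at most 50 chunks to the API.
        | 'CallApi' >> beam.MapTuple(lambda _key, batch: call_my_api(list(batch)))
    )
```

Keep in mind that the final batch for each key may hold fewer than 50 elements, so the API call should tolerate short batches.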