I am using Google Cloud Dataflow with the Python SDK.
I would like to:
- Get a list of unique dates out of a master PCollection
- Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.
How can I get that list? After the following combine transform, I created a ListPCollectionView object, but I cannot iterate over it:
import apache_beam as beam


class ToUniqueList(beam.CombineFn):
    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Each accumulator is a list; flatten them and deduplicate.
        merged = set()
        for accumulator in accumulators:
            merged.update(accumulator)
        return list(merged)

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):
    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))
Am I doing it all wrong? What is the best way to do this?
Thanks.
It is not possible to get the contents of a PCollection directly: an Apache Beam or Dataflow pipeline is more like a query plan describing what processing should be done, with a PCollection being a logical intermediate node in that plan rather than a container of data. The main program assembles the plan (the pipeline) and kicks it off.

However, ultimately you are trying to write data to BigQuery tables sharded by date. That use case is currently supported only in the Java SDK, and only for streaming pipelines.
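If the dates are only needed inside the pipeline, the output of the CombineGlobally can be consumed by another transform as a side input (for example via beam.pvalue.AsSingleton or beam.pvalue.AsIter) instead of being iterated in the main program. If you really need the list in the main program so you can build one branch per date, a common workaround is to run a first pipeline that materializes the distinct dates, read them back in the driver, and then construct a second pipeline with one filter-and-write branch per date. Below is a rough sketch under assumptions of my own (the GCS paths, the extract_date helper, and plain text sinks standing in for whatever destination you ultimately use); it is not the SDK's built-in way of doing this:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # assumption: configure your runner/project here


def extract_date(line):
    # Placeholder helper: assume each input line starts with a YYYY-MM-DD date.
    return line[:10]


# --- Pipeline 1: materialize the distinct dates to a temp location ---
p = beam.Pipeline(options=options)
(p
 | 'read master' >> beam.io.ReadFromText('gs://my-bucket/input/*')          # assumed source
 | 'extract date' >> beam.Map(extract_date)
 | 'unique dates' >> beam.CombineGlobally(ToUniqueList())
 | 'flatten list' >> beam.FlatMap(lambda dates: dates)
 | 'write dates' >> beam.io.WriteToText('gs://my-bucket/tmp/dates'))
p.run().wait_until_finish()

# Read the materialized dates back in the driver program.
dates = []
for metadata in FileSystems.match(['gs://my-bucket/tmp/dates*'])[0].metadata_list:
    with FileSystems.open(metadata.path) as f:
        dates.extend(line for line in f.read().decode('utf-8').splitlines() if line)

# --- Pipeline 2: one filter-and-write branch per date, built from a plain Python list ---
p2 = beam.Pipeline(options=options)
master = p2 | 'read master again' >> beam.io.ReadFromText('gs://my-bucket/input/*')
for date in dates:
    (master
     | 'filter %s' % date >> beam.Filter(lambda row, d=date: extract_date(row) == d)
     | 'write %s' % date >> beam.io.WriteToText('gs://my-bucket/output/%s' % date))
p2.run().wait_until_finish()

Note that the second run re-reads the master data, and the graph of the second pipeline grows with the number of distinct dates, so this is only practical when that number is modest.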
For a more general treatment of writing data to multiple destinations depending on the data, follow BEAM-92.
See also Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow