I am using Google Cloud Dataflow with the Python SDK.
I would like to:
- Get a list of unique dates out of a master PCollection
- Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.
How can I get that list? After the following combine transform, I created a ListPCollectionView object, but I cannot iterate over it:
import apache_beam as beam


class ToUniqueList(beam.CombineFn):
    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Each accumulator is a list; flatten them and deduplicate.
        merged = set()
        for accumulator in accumulators:
            merged.update(accumulator)
        return list(merged)

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):
    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))
Am I doing it all wrong? What is the best way to do this?
Thanks.
It is not possible to get the contents of a PCollection directly: an Apache Beam or Dataflow pipeline is more like a query plan describing what processing should be done, with a PCollection being a logical intermediate node in that plan rather than a container of data. The main program assembles the plan (the pipeline) and kicks it off.

However, ultimately you are trying to write data to BigQuery tables sharded by date. That use case is currently supported only in the Java SDK, and only for streaming pipelines.
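If the dates are only needed inside the pipeline, the output of the CombineGlobally can be consumed by another transform as a side input (for example via beam.pvalue.AsSingleton or beam.pvalue.AsIter) instead of being iterated in the main program. If you really need the list in the main program so you can build one branch per date, a common workaround is to run a first pipeline that materializes the distinct dates, read them back in the driver, and then construct a second pipeline with one filter-and-write branch per date. Below is a rough sketch under assumptions of my own (the GCS paths, the extract_date helper, and plain text sinks standing in for whatever destination you ultimately use); it is not the SDK's built-in way of doing this:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # assumption: configure your runner/project here


def extract_date(line):
    # Placeholder helper: assume each input line starts with a YYYY-MM-DD date.
    return line[:10]


# --- Pipeline 1: materialize the distinct dates to a temp location ---
p = beam.Pipeline(options=options)
(p
 | 'read master' >> beam.io.ReadFromText('gs://my-bucket/input/*')          # assumed source
 | 'extract date' >> beam.Map(extract_date)
 | 'unique dates' >> beam.CombineGlobally(ToUniqueList())
 | 'flatten list' >> beam.FlatMap(lambda dates: dates)
 | 'write dates' >> beam.io.WriteToText('gs://my-bucket/tmp/dates'))
p.run().wait_until_finish()

# Read the materialized dates back in the driver program.
dates = []
for metadata in FileSystems.match(['gs://my-bucket/tmp/dates*'])[0].metadata_list:
    with FileSystems.open(metadata.path) as f:
        dates.extend(line for line in f.read().decode('utf-8').splitlines() if line)

# --- Pipeline 2: one filter-and-write branch per date, built from a plain Python list ---
p2 = beam.Pipeline(options=options)
master = p2 | 'read master again' >> beam.io.ReadFromText('gs://my-bucket/input/*')
for date in dates:
    (master
     | 'filter %s' % date >> beam.Filter(lambda row, d=date: extract_date(row) == d)
     | 'write %s' % date >> beam.io.WriteToText('gs://my-bucket/output/%s' % date))
p2.run().wait_until_finish()

Note that the second run re-reads the master data, and the graph of the second pipeline grows with the number of distinct dates, so this is only practical when that number is modest.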
For a more general treatment of writing data to multiple destinations depending on the data, follow BEAM-92.
See also Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow