I am creating a Google Dataflow pipeline using the Apache Beam Java SDK. I have a few transforms there, and finally create a collection of entities (PCollection<Entity>). I need to write this into Google Cloud Datastore and then perform another transform AFTER all entities have been written (such as broadcasting the IDs of the saved objects through a Pub/Sub message to multiple subscribers).
Now, the way to store a PCollection is: entities.apply(DatastoreIO.v1().write().withProjectId("abc"));
This returns a PDone object, and I am not sure how I can chain another transform to occur after this write has completed. Since the DatastoreIO.write() call does not return a PCollection, I cannot extend the pipeline. I have two questions:
1. How can I get the IDs of the objects written to Datastore?
2. How can I attach another transform that will act after all entities are saved?
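For context, here is a minimal sketch of what I have so far. The Create.of(...) input, the key, and the project ID are placeholders standing in for my real transforms:

    import static com.google.datastore.v1.client.DatastoreHelper.makeKey;

    import com.google.datastore.v1.Entity;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
    import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PDone;

    public class WriteEntitiesExample {
      public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Placeholder for the transforms that actually produce my entities.
        PCollection<Entity> entities = pipeline
            .apply(Create.of(
                Entity.newBuilder().setKey(makeKey("Task", "task1")).build()))
            .setCoder(ProtoCoder.of(Entity.class));

        // The write terminates this branch of the pipeline: it returns PDone,
        // not a PCollection, so nothing further can be .apply()'d to it.
        PDone done = entities.apply(DatastoreIO.v1().write().withProjectId("abc"));

        pipeline.run();
      }
    }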
We don't have a good way to do either of these things (returning IDs of written Datastore entities, or waiting until entities have been written), though this is far from the first similar request (people have asked for this for BigQuery, for example) and we're thinking about it.
Right now your only option is to wait until the entire pipeline finishes, e.g. via pipeline.run().waitUntilFinish(), and then do what you wanted in your main program (e.g. run another pipeline).
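A rough sketch of that workaround, assuming the write pipeline from the question. Since the write does not report the generated IDs back, the Pub/Sub message here is just a completion signal, and the topic name is a placeholder:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class WriteThenNotify {
      public static void main(String[] args) {
        // First pipeline: build the PCollection<Entity> and apply
        // DatastoreIO.v1().write() exactly as in the question.
        Pipeline writePipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        // ... entities.apply(DatastoreIO.v1().write().withProjectId("abc")); ...

        // Block until the whole pipeline (and hence every write) has finished.
        writePipeline.run().waitUntilFinish();

        // Second pipeline, constructed only after the write is known to be done:
        // publish a notification to Pub/Sub for downstream subscribers.
        Pipeline notifyPipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        notifyPipeline
            .apply(Create.of("datastore-write-complete"))
            .apply(PubsubIO.writeStrings().to("projects/abc/topics/entities-written"));
        notifyPipeline.run();
      }
    }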