GCP Dataflow Computation Graph and Job Execution


Hi everyone. I have tried hard to understand what happens when I create a custom template in Google Cloud Dataflow, but I still do not understand it, even after going through the GCP documentation. Below is what I am trying to achieve.

  1. Read data from a Google Cloud bucket
  2. Pre-process it
  3. Load deep learning models (1 GB each) and get the predictions
  4. Dump the results into BigQuery

I successfully created the template and I am able to execute the job, but I have the following questions.

  1. When I execute the job, are the models (5 models of 1 GB each) downloaded every time during execution, or are they loaded once, stored in the template (execution graph), and reused during execution?
  2. If the models are loaded only during job execution, doesn't that impact the execution time, since GBs of model files have to be loaded every time the job is triggered?
  3. Can multiple users trigger the same template at the same time? I want to productionize it, and I am not sure how it will handle multiple simultaneous requests.

Can anyone please share some information on it?

Sources I referred to without finding the answer:

  1. https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
  2. http://alumni.media.mit.edu/~wad/magiceight/isa/node3.html
  3. https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options#configuring-pipelineoptions-for-local-execution
  4. https://beam.apache.org/documentation/basics/
  5. https://beam.apache.org/documentation/runtime/model/
  6. https://mehmandarov.com/apache-beam-pipeline-graph/


There is 1 answer

robertwb On

This depends on where the models are being loaded from. If they're loaded in the DoFns (most likely), then it will happen in the workers (during job execution).
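As a rough illustration of "loaded in the DoFn", here is a minimal Python sketch, assuming the Apache Beam Python SDK. The names `PredictFn` and `load_model` and the bucket path are hypothetical, and the loader is a stand-in for whatever actually downloads and deserializes a model. The DoFn keeps only the model path as state and loads the model in `setup()`, which Beam calls on the worker once per DoFn instance rather than once per element. Under that assumption the template would carry the path, not the model, so the load cost is paid at execution time on each worker.

```python
try:
    import apache_beam as beam
    _DoFn = beam.DoFn
except ImportError:
    _DoFn = object  # fallback so the sketch can be run without Beam installed


def load_model(path):
    """Hypothetical loader: stands in for fetching and deserializing a ~1 GB model."""
    print(f"loading model from {path}")
    # Dummy "model": a callable that returns a fixed score for any input.
    return lambda features: {"input": features, "score": 0.0}


class PredictFn(_DoFn):
    """Loads the model on the worker instead of at graph-construction time."""

    def __init__(self, model_path):
        super().__init__()
        # Only this small string is kept as state when the DoFn is serialized.
        self._model_path = model_path
        self._model = None

    def setup(self):
        # Runs on the worker before any element is processed by this instance.
        self._model = load_model(self._model_path)

    def process(self, element):
        # Reuses the already-loaded model for every element.
        yield self._model(element)
```

In a pipeline this would be applied with something like `beam.ParDo(PredictFn("gs://hypothetical-bucket/model"))`. If the load time becomes a concern, it is paid per DoFn instance, not per element, so it is amortized over all the elements that instance processes.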

As for your other question, there should be no issues with multiple users triggering a template job simultaneously.