Vertex workbench - how to run BigQueryExampleGen in Jupyter notebook


Problem

Tried to run BigQueryExampleGen in a Jupyter notebook on Vertex AI Workbench and got the error below.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Steps

Set up the GCP project credentials and the interactive TFX context.

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "path_to_credential_file"


from tfx.v1.extensions.google_cloud_big_query import BigQueryExampleGen
from tfx.v1.components import (
    StatisticsGen,
    SchemaGen,
)
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip
context = InteractiveContext(pipeline_root='./data/artifacts')
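Before running the pipeline, it can help to sanity-check the credential setup, since a missing or wrong GOOGLE_APPLICATION_CREDENTIALS path also surfaces as confusing downstream errors. A minimal sketch (the helper name is hypothetical, not part of TFX):

```python
import os

def check_gcp_setup(credential_env="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return a list of problems found with the local GCP setup (empty if OK)."""
    problems = []
    path = os.environ.get(credential_env)
    if not path:
        problems.append(f"{credential_env} is not set")
    elif not os.path.exists(path):
        problems.append(f"credential file not found: {path}")
    return problems
```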

Run the BigQueryExampleGen.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query)
)

Got the error.

InvalidUserInputError: Request missing required parameter projectId [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

Data

See mlops-with-vertex-ai/01-dataset-management.ipynb to set up the BigQuery dataset for the Chicago Taxi Trips data.

1 Answer

Project ID

To run against GCP, the project ID needs to be provided via the beam_pipeline_args argument.

I have proposed #888 to make this work. With that change, you would be able to do:

context.run(..., beam_pipeline_args=['--project', 'my-project'])
query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
    ]
)
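Why --project alone is not enough can be seen by mimicking how Beam reads these flags. A self-contained sketch, using argparse in place of Beam's actual PipelineOptions machinery (parse_beam_args is a hypothetical helper, not a Beam or TFX API):

```python
import argparse

def parse_beam_args(beam_pipeline_args):
    """Illustrative only: Beam itself parses these flags via PipelineOptions;
    argparse stands in here so the sketch is self-contained."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--project")
    parser.add_argument("--temp_location")
    known, _unknown = parser.parse_known_args(beam_pipeline_args)
    return known

opts = parse_beam_args(["--project", "my-project"])
# opts.project is set, but opts.temp_location is still None, which is
# why ReadFromBigQuery goes on to complain about a missing GCS location.
```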

However, it still fails with another error.

ValueError: ReadFromBigQuery requires a GCS location to be provided. Neither gcs_location in the constructor nor the fallback option --temp_location is set. [while running 'InputToRecord/QueryTable/ReadFromBigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']

GCS Bucket

It looks like inside GCP the interactive context runs the BigQueryExampleGen via Dataflow, so a GCS bucket URL also needs to be provided via the beam_pipeline_args argument.

When running your Dataflow pipeline, pass the argument --temp_location gs://bucket/subfolder/.

query = """
SELECT 
    * EXCEPT (trip_start_timestamp, ML_use)
FROM 
    {PROJECT_ID}.public_dataset.chicago_taxitrips_prep
""".format(PROJECT_ID=PROJECT_ID)

example_gen = context.run(
    BigQueryExampleGen(query=query),
    beam_pipeline_args=[
        '--project', PROJECT_ID,
        '--temp_location', BUCKET
    ]
)
statistics_gen = context.run(
    StatisticsGen(examples=example_gen.component.outputs['examples'])
)
context.show(statistics_gen.component.outputs['statistics'])

schema_gen = SchemaGen(
    statistics=statistics_gen.component.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])


Documentation

This notebook-based tutorial will use Google Cloud BigQuery as a data source to train an ML model. The ML pipeline will be constructed using TFX and run on Google Cloud Vertex Pipelines. In this tutorial, we will use the BigQueryExampleGen component, which reads data from BigQuery into TFX pipelines.

We also need to pass beam_pipeline_args for the BigQueryExampleGen. It includes configs like the name of the GCP project and the temporary storage for the BigQuery execution.
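Putting the two configs together, a small helper can assemble and validate the flags before calling context.run. This is a sketch under the question's assumptions (the function name is mine, not part of TFX or Beam):

```python
def build_beam_pipeline_args(project_id, temp_location):
    """Assemble the Beam flags BigQueryExampleGen needs."""
    if not project_id:
        raise ValueError("project_id is required")
    # temp_location must be a GCS URL; ReadFromBigQuery rejects anything else.
    if not temp_location.startswith("gs://"):
        raise ValueError(f"temp_location must be a gs:// URL, got: {temp_location}")
    return ["--project", project_id, "--temp_location", temp_location]

# Usage with the interactive context (PROJECT_ID/BUCKET as in the question):
# context.run(BigQueryExampleGen(query=query),
#             beam_pipeline_args=build_beam_pipeline_args(PROJECT_ID, BUCKET))
```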