Apache Beam Streaming Write/Read BigQuery

I'm running a streaming pipeline where I write to BigQuery and then read from it. Is there a way to be sure that what I've just written is there before I read it? I'm using Python:

write_result_bigquery = (
            read_pubsub
            | f'Deserialize {TABLE_NAME.capitalize()} JSON' >> beam.Map(json.loads)
            | 'Add Timestamp Field' >> beam.ParDo(AddTimestamp())
            | 'Log writing' >> beam.ParDo(LogResult())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                f'{PROJECT_ID}:{DATASET}.{TABLE_NAME}',
                schema=table_schema,
                custom_gcs_temp_location=f'gs://{GCS_BUCKET}/tmpWrite/tmp',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR
                )
        )
_ = write_result_bigquery

# 2. READ DATA > LAST PROCESSED TIMESTAMP
read_bigquery_timestamp = (
            read_pubsub
            | 'Get data' >> beam.Map(read_timestamp_file, GCS_BUCKET, source, timestamp_key_cons_mobile)
            | 'Get timestamp' >> beam.Map(get_timestamp, key=timestamp_key_cons_mobile)
            | 'Create query input' >> beam.Map(generate_query_timestamp, query_ts, PROJECT_ID, DATASET, [TABLE_NAME], [id_col])
            | 'Read BigQuery by timestamp' >> beam.ParDo(ReadFromBigQueryRetryDoFn(), schema=schema_output)
        )

For now I've implemented a custom method that throws an exception when the query returns empty results, but it's not efficient and it's causing other issues: it just forces the Dataflow streaming pipeline to retry until it finds something.
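
For context, a simplified sketch of that workaround (not my exact code; the BigQuery client usage and the names here are just illustrative):

import apache_beam as beam
from google.cloud import bigquery


class ReadFromBigQueryRetryDoFn(beam.DoFn):
    """Runs the query and raises when it returns nothing, so the streaming
    runner keeps retrying the bundle until the written data becomes visible."""

    def setup(self):
        # One BigQuery client per worker instance.
        self._client = bigquery.Client()

    def process(self, query, schema=None):
        rows = list(self._client.query(query).result())
        if not rows:
            # Forces a bundle retry - simple, but wasteful under load.
            raise ValueError(f'No rows returned yet for: {query}')
        for row in rows:
            yield dict(row.items())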

Any ideas?

1 Answer

Joe Moore

For a streaming pipeline, the default method that Apache Beam's built-in BigQuery I/O uses is the legacy streaming inserts API, which can withhold access to newly written data for up to 2 minutes.

To be able to query the data more immediately, use the newer Storage Write API instead:

| "Write to BigQuery"
    >> beam.io.WriteToBigQuery(
        f"{PROJECT_ID}:{DATASET}.{TABLE_NAME}",
        schema=table_schema,
        custom_gcs_temp_location=f"gs://{GCS_BUCKET}/tmpWrite/tmp",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API
    )

The Storage Write API buffers the newly written rows and commits them to the table in batches, so you can read new data from the buffer almost immediately. This is the recommended way of writing to BigQuery from Dataflow and should work for you.
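
If you want to double-check visibility before wiring up the read branch, a quick standalone query against the table shows the committed rows almost immediately. This is only an illustrative snippet: it reuses the PROJECT_ID, DATASET and TABLE_NAME values from your pipeline, and the timestamp column name is an assumption based on your 'Add Timestamp Field' step.

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
query = (
    f'SELECT COUNT(*) AS n FROM `{PROJECT_ID}.{DATASET}.{TABLE_NAME}` '
    'WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 MINUTE)'
)
# Count rows that became visible in the last minute.
row = next(iter(client.query(query).result()))
print(f'Rows visible from the last minute: {row.n}')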