Google Dataflow pipeline for varying schema

I have a product to define and configure business workflows. A part of this product is a form builder which enables users to set up different forms.

All of this form data is stored in MongoDB in the following structure:

- form_schemas

{
  "_id" : "",
  "name" : "",
  "account_id" : "",
  "fields" : [
    {
      "label" : "Enter Email",
      "name" : "email",
      "type" : "text",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    },
    {
      "label" : "Select DOB",
      "name" : "dob",
      "type" : "date",
      "required" : "true",
      "hidden" : "false",
      "additional_config" : { }
    }
    ...
  ]
}

- form_datas

{
  "workflow_template_id" : "",
  "account_id" : ""
  "data" : {
    "email" : "[email protected]",
    "dob" : "2001-04-05"
  },
  "created_at" : "",
  "updated_at" : ""

}


As seen above, the forms can belong to various different businesses. I am now looking at a data pipeline to transport this data to Google BigQuery at periodic intervals for analysis.

On the BigQuery side, I maintain a separate table for each workflow.

I have a current working solution written entirely with Google Cloud Functions. A Cloud Scheduler job runs at periodic intervals, invoking the different cloud functions. At a high level, the cloud functions do the following:

  • Iterate over each schema
  • Read the data from MongoDB for every schema since the last run (as a cursor)
  • For each row of data, run the custom transformation logic (this includes transforming various nested data types like grids/lookups etc.)
  • Stream each row of transformed data as NDJSON to Google Cloud Storage (sketched below)
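In pseudo-code form, each run does roughly the following (a simplified sketch; the connection string, bucket name, transform_row and the last_run_at bookkeeping are placeholders for my actual setup, and updated_at is assumed to be a real timestamp rather than a string):

import json
from datetime import datetime, timedelta

from google.cloud import storage
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")      # placeholder connection string
db = mongo["forms"]                                   # placeholder database name
bucket = storage.Client().bucket("my-export-bucket")  # placeholder bucket

def transform_row(schema, row):
    # Placeholder for the custom transformation logic (grids, lookups, etc.)
    return row["data"]

def export_schema(schema, last_run_at):
    # Cursor over only the documents changed since the previous run
    cursor = db["form_datas"].find({
        "workflow_template_id": schema["_id"],
        "updated_at": {"$gt": last_run_at},
    })
    ndjson = "\n".join(json.dumps(transform_row(schema, row)) for row in cursor)
    blob = bucket.blob(f"exports/{schema['_id']}.ndjson")
    blob.upload_from_string(ndjson, content_type="application/x-ndjson")

for schema in db["form_schemas"].find():
    # last_run_at is tracked elsewhere; a fixed window is used here only as an example
    export_schema(schema, last_run_at=datetime.utcnow() - timedelta(hours=1))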

The above solution provides me with:

  • Complete control on transformation
  • Simple deployment

However, since it all runs on Cloud Functions, I am bound by the limit of 9 minutes per run. This essentially imposes a lot of pagination requirements, especially if there is a need to regenerate the complete data from the beginning of time.

While the above solution works fine for now, I was looking at other serverless options like Google Dataflow. Since I am just starting with Dataflow/Apache Beam, I was wondering:

If I were to write a pipeline on Beam, should I go with the same approach of

  1. Extract (row by row) -> Transform -> Load (GCS) -> Load (BQ)

or

  1. Extract (entire data as JSON) -> Load to GCS -> Transform (Beam) -> Load to GCS -> Load to BQ
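To make option 1 concrete, I am imagining something along these lines (a rough sketch using the Python SDK's ReadFromMongoDB, WriteToText and WriteToBigQuery connectors; the URIs, table names and transform_row are placeholders):

import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery
from apache_beam.io.mongodbio import ReadFromMongoDB

def transform_row(row):
    # Placeholder for the custom transformation logic (grids, lookups, etc.)
    return dict(row["data"])

with beam.Pipeline() as p:
    rows = (
        p
        | "ReadMongo" >> ReadFromMongoDB(
            uri="mongodb://host:27017",  # placeholder
            db="forms",                  # placeholder
            coll="form_datas",           # placeholder
        )
        | "Transform" >> beam.Map(transform_row)
    )

    # Keep an NDJSON copy on GCS so the data can be replayed later
    _ = (
        rows
        | "ToJson" >> beam.Map(json.dumps)
        | "WriteGCS" >> beam.io.WriteToText(
            "gs://my-bucket/exports/form_data", file_name_suffix=".ndjson")
    )

    # Load the same rows into the per-workflow BigQuery table
    _ = (
        rows
        | "WriteBQ" >> WriteToBigQuery(
            "my-project:forms.workflow_x",  # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table exists already
        )
    )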

Let me know if there is any better option for the entire data processing.

There is 1 answer.

Steven Ensslen:

Typically, this sort of process writes raw data to GCS and then transforms it into BigQuery. This is done so that when you discover defects in the transform (which are inevitable) or the requirements change (also inevitable), you can replay the data with the new code.
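For example, once the raw files are on GCS, a replay into BigQuery is just a load job over the GCS URIs. A minimal sketch with the Python client (the bucket, dataset and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                                             # or supply an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # rebuild the table on replay
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/form_data-*.ndjson",  # placeholder GCS path
    "my-project.forms.workflow_x",                # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish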

Ideally, the steps prior to the transform are automated by a Change Data Capture (CDC) tool. There are plenty of CDC tools, but Debezium is taking over as it is reliable and free. There is a Debezium connector to get data from MongoDB, and there are examples of how to put Debezium CDC into BigQuery.

If you are going to write the code that puts data into GCS, I would recommend considering Apache Parquet rather than NDJSON as a format. Performance and cost will be better, and I find a format with data types easier to work with.
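For instance, with pyarrow the change is small (a sketch; the rows and output path are placeholders), and Beam also has a WriteToParquet transform if you go the Dataflow route:

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder rows standing in for the transformed form data
rows = [
    {"email": "a@example.com", "dob": "2001-04-05"},
    {"email": "b@example.com", "dob": "1999-12-31"},
]

table = pa.Table.from_pylist(rows)          # columns get concrete Arrow types
pq.write_table(table, "form_data.parquet")  # then upload the file to GCS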