BigQuery Create External table from multiples files all files share same 18 columns, except newer files have one extra column

357 views Asked by At

I have a situation I would like to know if possible to combat. I currently have a BigQueryUpsertTableOperator that goes to a google cloud storage and creates one external table from all the csv files in there. All these files share the exact same 18 columns, they are just different data, a new files comes in once a day. Recently the newer files have add a extra 19th column. Is it possible to still allow all these files to be created into the same external table? Some way to have these values just be null in the new files?

 create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "allow_quoted_newlines": True,
            "allow_jagged_rows":True,
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)

Would I need to manually go into these older files and add a comma at the end of each record, and the new column name in the first row for these to be ingested side by side the new files or any way to set an option in the task for this?

1

There are 1 answers

0
Nestor On

I have replicated your concern, it looks like you will have to edit your previous data to be suitable with the new files you are receiving.

Getting error when querying with csv upserted with new column:

enter image description here

Be reminded though that you cannot edit directly through GCS files in your bucket:

An object is an immutable piece of data consisting of a file of any format

You can add new column on your csv using awk command in bash or use of dataframes in python and I believe many other ways .