I have a situation I would like to know whether it is possible to work around. I currently have a BigQueryUpsertTableOperator that points at a Google Cloud Storage bucket and creates one external table from all the CSV files in it. All of these files share the exact same 18 columns; they just contain different data, and a new file arrives once a day. Recently the newer files have added an extra 19th column. Is it still possible to have all of these files feed the same external table, with some way for the missing values to simply be null?
create_external_table = BigQueryUpsertTableOperator(
    task_id=f"create_external_{TABLE}_table",
    dataset_id=DATASET,
    project_id=INGESTION_PROJECT_ID,
    table_resource={
        "tableReference": {"tableId": f"{TABLE}_external"},
        "externalDataConfiguration": {
            "sourceFormat": "CSV",
            "autodetect": True,
            "sourceUris": [f"gs://{ARCHIVE_BUCKET}/{DATASET}_data/*.csv"],
            # CSV-specific options live in the nested csvOptions block of the
            # table resource and use the API's camelCase field names.
            "csvOptions": {
                "allowQuotedNewlines": True,
                "allowJaggedRows": True,
            },
        },
        "labels": labeler.get_labels_bigquery_table_v2(
            target_project=INGESTION_PROJECT_ID,
            target_dataset=DATASET,
            target_table=f"{TABLE}_external",
        ),
    },
)
Would I need to manually go into the older files and add a comma at the end of each record, plus the new column name in the header row, for them to be ingested side by side with the new files, or is there an option I can set on the task for this?
I have replicated your scenario, and it looks like you will have to edit your previous files so that they match the schema of the new files you are receiving.
Querying the external table after the CSVs with the new column have been upserted returns an error.
Keep in mind, though, that you cannot edit the files directly in your GCS bucket: Cloud Storage objects are immutable, so you have to download a copy, modify it, and upload it back over the original object.
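As a rough sketch of that round trip with the google-cloud-storage client (the project, bucket, and object names below are placeholders, not values from your DAG):

```python
from google.cloud import storage

# Placeholder names, for illustration only.
client = storage.Client(project="my-ingestion-project")
bucket = client.bucket("my-archive-bucket")
blob = bucket.blob("my_dataset_data/2023-01-01.csv")

# GCS objects are immutable: download a copy, fix it locally,
# then overwrite the object with the corrected file.
blob.download_to_filename("/tmp/2023-01-01.csv")

# ... add the 19th column to the local copy (see the pandas sketch below) ...

blob.upload_from_filename("/tmp/2023-01-01.csv")
```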
You can add the new column to your existing CSVs with the awk command in bash, with dataframes in Python, or in many other ways.
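For example, a minimal pandas sketch, assuming a hypothetical column name new_col for the extra 19th column:

```python
import pandas as pd

NEW_COLUMN = "new_col"  # hypothetical name of the extra 19th column


def add_missing_column(local_path: str) -> None:
    """Append the new column, left empty, to an old-format CSV in place."""
    df = pd.read_csv(local_path)
    if NEW_COLUMN not in df.columns:
        # Old rows get an empty value in the new column, which BigQuery
        # can treat as NULL when the files are read through the external table.
        df[NEW_COLUMN] = pd.NA
        df.to_csv(local_path, index=False)


add_missing_column("/tmp/2023-01-01.csv")
```

Once the rewritten files are uploaded back to the bucket, all of the CSVs share the same 19-column header and the external table should be able to read the old and new files side by side.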