Pyspark saveAsTable to update Glue schema


I have a PySpark dataframe which I am writing to the Glue catalog as below:

df.write.format("parquet").mode("append").saveAsTable('db.table')

This works fine if the input dataframe and the Glue catalog table have the same columns. But when new columns are added to the dataframe, the Glue job fails.

Is there a way to update the schema in Glue catalog if new columns/schema changes are detected in incoming spark dataframe?

I tried using different modes (append/overwrite, etc.), but that removes the existing data. I also tried converting the Spark DataFrame to a DynamicFrame and then updating the schema, but that did not work as expected either.

1 Answer

Answer by Oleksandr Lykhonosov:

This is covered in the AWS Glue documentation under "Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs".

If you want to overwrite the Data Catalog table's schema, you can do one of the following:

  • When the job finishes, rerun the crawler, making sure the crawler is configured to update the table definition as well. When the crawler finishes, view the new partitions along with any schema updates on the console. For more information, see Configuring a Crawler Using the API. (A boto3 sketch of this crawler configuration is shown at the end of this answer.)

  • When the job finishes, view the modified schema on the console right away, without having to rerun the crawler. You can enable this feature by adding a few lines of code to your ETL script, as shown in the following examples. The code sets enableUpdateCatalog to true and updateBehavior to UPDATE_IN_DATABASE, which tells the job to overwrite the schema and add new partitions in the Data Catalog during the job run.

additionalOptions = {
    "enableUpdateCatalog": True, 
    "updateBehavior": "UPDATE_IN_DATABASE"
}

glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database=<dst_db_name>,
    table_name=<dst_tbl_name>,
    transformation_ctx="write_sink",
    additional_options=additionalOptions
)
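
Since the question starts from a plain Spark DataFrame rather than a DynamicFrame, one way to apply this is to convert the DataFrame first. The following is a minimal sketch, not the exact code from the answer: it assumes the db.table destination from the question already exists in the Data Catalog, and that the GlueContext setup and the transformation_ctx name are placeholders you would adapt to your job.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Convert the Spark DataFrame from the question into a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

# Write through the Data Catalog so new columns update the table schema
glueContext.write_dynamic_frame_from_catalog(
    frame=dyf,
    database="db",        # database name taken from the question's db.table
    table_name="table",   # table name taken from the question's db.table
    transformation_ctx="write_sink",
    additional_options={
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE"
    }
)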

You can also set the updateBehavior value to LOG if you want to prevent your table schema from being overwritten, but still want to add the new partitions. The default value of updateBehavior is UPDATE_IN_DATABASE, so if you don’t explicitly define it, then the table schema will be overwritten.

If enableUpdateCatalog is not set to true, the ETL job will not update the table in the Data Catalog, regardless of which updateBehavior option is selected.

When the updateBehavior is set to LOG, new partitions will be added only if the DynamicFrame schema is equivalent to or contains a subset of the columns defined in the Data Catalog table's schema.
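
For the first option (rerunning the crawler), the crawler's schema change policy must allow table updates. The answer does not show how to configure this, but as a rough sketch it could be done with boto3; the crawler name below is hypothetical and the policy values are the documented SchemaChangePolicy options.

import boto3

glue = boto3.client("glue")

# Hypothetical crawler name; replace with the crawler that targets db.table
crawler_name = "my-crawler"

# Configure the crawler to update table definitions when the schema changes
glue.update_crawler(
    Name=crawler_name,
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply schema changes to the catalog table
        "DeleteBehavior": "LOG"                  # only log removed objects, do not delete them
    }
)

# Rerun the crawler after the Glue job finishes
glue.start_crawler(Name=crawler_name)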