PySpark: casting a missing struct in an optional array for a Delta table

We have some data stored in multiple Delta tables (one table for each day) without specifying a schema. Since the data includes an optional array, the Delta tables now have incompatible schemas.

To unify the Delta tables, we want to apply a target schema to all of them.

The correct (shortened) schema looks like this:

from pyspark.sql.types import (
    ArrayType,
    DecimalType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# target schema
schema = StructType(
    [
        ...,
        StructField(
            "extracostlist",
            ArrayType(
                StructType(
                    [
                        StructField("linenumber", IntegerType(), True),
                        StructField("costtype", StringType(), True),
                        StructField("netamount", DecimalType(38, 10), True),
                    ]
                ),
                True,
            ),
            True,
        ),
        ...
    ]
)

while some of the tables now have a schema like:

# missing structs in array in delta table
root
|-- ...
|-- extracostlist: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- ...

The original data is delivered as JSON, containing either an empty list or data matching the schema described above:

{
    ...,
    "extraCostList": [],
    ...,
}
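The mismatch presumably comes from schema inference: when every record in a day's delivery has an empty list, there is no element to infer the struct from, and as far as I can tell Spark then types the empty array as array<string>. A minimal, hypothetical reproduction (session setup, field names, and values are illustrative):

import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A day's delivery where every record has an empty extracostlist.
records = [json.dumps({"id": 1, "extracostlist": []})]
df = spark.read.json(spark.sparkContext.parallelize(records))

df.printSchema()
# root
#  |-- extracostlist: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- id: long (nullable = true)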

So far I have tried to:

  • create a new DataFrame with the data of the delta table and specified schema
  • create an empty DataFrame and union it with the delta table data
  • cast the data to the target type (see the sketch after this list)

each without success.
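A minimal sketch of the cast attempt (the table path is illustrative; the target type is taken from the target schema above). If I understand the error correctly, Spark rejects the cast because string elements cannot be cast to structs:

from pyspark.sql import functions as F

target_type = schema["extracostlist"].dataType  # ArrayType(StructType(...))

df = spark.read.format("delta").load("/path/to/day_table")  # illustrative path

# Fails with an AnalysisException: cannot cast array<string>
# to array<struct<linenumber:int,costtype:string,netamount:decimal(38,10)>>
df = df.withColumn("extracostlist", F.col("extracostlist").cast(target_type))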

Since there are multiple cases like this, a "dynamic" fix using a target schema would be appreciated.
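For illustration, the kind of dynamic fix I have in mind would look something like the sketch below: walk the target schema and reparse each mismatched complex column through a to_json/from_json round trip, so that an empty array<string> serialises to "[]" and parses back as an empty array of structs. This is an untested idea, and conform_to_schema is my own helper name:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StructType

def conform_to_schema(df, target_schema):
    """Sketch: make df match target_schema, reparsing mismatched
    complex columns through a to_json/from_json round trip."""
    for field in target_schema.fields:
        if field.name not in df.columns:
            # Column missing entirely: fill with typed nulls.
            df = df.withColumn(field.name, F.lit(None).cast(field.dataType))
        elif df.schema[field.name].dataType != field.dataType:
            if isinstance(field.dataType, (ArrayType, MapType, StructType)):
                # Serialize the column to JSON and parse it back with the
                # target type; "[]" becomes an empty array of structs.
                df = df.withColumn(
                    field.name,
                    F.from_json(F.to_json(F.col(field.name)), field.dataType),
                )
            else:
                # Plain scalar mismatch: an ordinary cast should do.
                df = df.withColumn(field.name, F.col(field.name).cast(field.dataType))
    return df.select([f.name for f in target_schema.fields])

df = conform_to_schema(spark.read.format("delta").load("/path/to/day_table"), schema)

If that works, each table could presumably be rewritten in place with mode("overwrite") and Delta's overwriteSchema option.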
