We have some data stored in multiple Delta tables (one table per day) that were written without specifying a schema. Since the data includes an optional array, the Delta tables now have incompatible schemas.
To unify them, we want to apply a single schema to all of the Delta tables.
The correct (shortened) schema looks like this:
# target schema
from pyspark.sql.types import (
    ArrayType, DecimalType, IntegerType, StringType, StructField, StructType,
)

schema = StructType(
    [
        ...,
        StructField(
            "extracostlist",
            ArrayType(
                StructType(
                    [
                        StructField("linenumber", IntegerType(), True),
                        StructField("costtype", StringType(), True),
                        StructField("netamount", DecimalType(38, 10), True),
                    ]
                ),
                True,
            ),
            True,
        ),
        ...
    ]
)
while some of the tables now have a schema like:
# missing structs in array in delta table
root
 |-- ...
 |-- extracostlist: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ...
The original data is delivered as JSON, in which `extraCostList` is either an empty list or contains data according to the schema described above.
{
    ...,
    "extraCostList": [],
    ...,
}
So far I have tried to:
- create a new DataFrame with the data of the delta table and specified schema
- create an empty DataFrame and union it with the delta table data
- cast the data
each without success.
Since there are multiple cases like this, a "dynamic" fix driven by a target schema would be appreciated.