Given a deeply nested Parquet struct like so:
|-- bet: struct (nullable = true)
| |-- sides: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- side: string (nullable = true)
| | | |-- betID: string (nullable = true)
| | | |-- secondarybetID: string (nullable = true)
| | | |-- parties: struct (nullable = true)
| | | | |-- partyIDs: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- partyID: string (nullable = true)
| | | | | | |-- partyRole: integer (nullable = true)
| | | | | | |-- partySubGrp: struct (nullable = true)
| | | | | | | |-- partySubIDs: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- partySubID: string (nullable = true)
| | | | | | | | | |-- partySubIDType: integer (nullable = true)
Given that a bet has several sides, and that we are interested only in the first side of the sides array, how can I find the parties involved in that side whose partyRole is 10?
In PrestoSQL I can do something like:
SELECT
filter(bet.sides[1].parties.partyids, x -> x.partyrole=10)[1] as party10
FROM
parquetbets
WHERE
cardinality(filter(bet.sides[1].parties.partyids, x -> x.partyrole=10))>0
How do I go about doing the same in Spark 2 SQL?
SELECT bet.sides[1] FROM parquetbets
In Spark 2 SQL the above returns an array with no scope for further pruning of the nested structure, i.e.
SELECT bet.sides[1].parties from parquetbets
returns null. I have tried a few combinations, but the results come back as WrappedArray elements, which do not provide a mechanism to query the nested data. In PrestoSQL the returned results carry the field names, so it is easy to keep probing deeper into the structure.
Can someone point me to how Spark 2 SQL can support this? And if Spark 2 SQL cannot, how could Spark DataFrames do it?
Silly question: have you considered using the Dataset API with encoders? It provides a typed, functional API to reason about your problem (which is much easier to solve functionally).
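For instance, here is a minimal sketch of the Dataset approach. The case classes mirror the schema shown in the question, and the input path /path/to/parquetbets is a placeholder; adjust both to your actual data:

import org.apache.spark.sql.SparkSession

// Case classes mirroring the Parquet schema from the question.
case class PartySubID(partySubID: Option[String], partySubIDType: Option[Int])
case class PartySubGrp(partySubIDs: Option[Seq[PartySubID]])
case class Party(partyID: Option[String], partyRole: Option[Int],
                 partySubGrp: Option[PartySubGrp])
case class Parties(partyIDs: Option[Seq[Party]])
case class Side(side: Option[String], betID: Option[String],
                secondarybetID: Option[String], parties: Option[Parties])
case class Bet(sides: Option[Seq[Side]])
case class BetRow(bet: Option[Bet])

val spark = SparkSession.builder.appName("bets").getOrCreate()
import spark.implicits._

val bets = spark.read.parquet("/path/to/parquetbets").as[BetRow]

// For each bet, look only at the first side and keep the first party
// whose partyRole is 10 (mirroring filter(...)[1] in the Presto query).
val party10 = bets.flatMap { row =>
  val matches = for {
    bet     <- row.bet.toSeq
    sides   <- bet.sides.toSeq
    first   <- sides.headOption.toSeq
    parties <- first.parties.toSeq
    ids     <- parties.partyIDs.toSeq
    party   <- ids if party.partyRole.contains(10)
  } yield party
  matches.headOption.toSeq
}

party10.show(truncate = false)

Rows with no matching party simply produce an empty sequence, so they vanish from the result, which plays the role of the cardinality(...) > 0 check in the Presto query.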
Otherwise, consider exploding your arrays so that you can reason about flattened data (see org.apache.spark.sql.functions.explode).
Example in Scala:
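A sketch of the explode approach, reusing the spark session from the sketch above (the input path is again a placeholder). Note that Spark array indexing is 0-based, so bet.sides[0] is the first side, where Presto uses bet.sides[1]:

import org.apache.spark.sql.functions.{col, explode}

val df = spark.read.parquet("/path/to/parquetbets")

// Take the first side, explode its partyIDs array into one row per party,
// then keep only the parties whose partyRole is 10.
val party10 = df
  .select(explode(col("bet.sides").getItem(0)
    .getField("parties").getField("partyIDs")).as("party"))
  .where(col("party.partyRole") === 10)

party10.show(truncate = false)

The same idea works in plain Spark SQL, assuming the table is registered as parquetbets:

SELECT party
FROM parquetbets
LATERAL VIEW explode(bet.sides[0].parties.partyIDs) t AS party
WHERE party.partyRole = 10

LATERAL VIEW explode drops rows whose array is null or empty, so here too no separate cardinality check is needed.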