Performant way to handle arrays in Athena/Quicksight

3.7k views Asked by At

I currently have a large set of json data that I'd like to import into Amazon Athena for visualization in Amazon Quicksight. In each json, there are two fields: one is a comma separated string of ids (orderlist), and the other field is an array of strings(locations). Because Quicksight doesn't support array searching, I'm currently resorting to creating a view where I generate crossjoins across the two string arrays:

select id,
 try_CAST(orderid AS bigint) orderid_targeting,
 location
from advertising_json 
CROSS JOIN UNNEST(split(orderlist, ',')) as x(orderid)
CROSS JOIN UNNEST(locations) t (location)

With two cross joins, this can explode out the data to 20x-30x the original size.

If I were working on individual queries on Athena, I could use Presto array functions to search through the arrays. Is there a better way to make these fields accessible for filtering on Quicksight?

1

There are 1 answers

0
Theo On

You have two options: keep doing what you're doing or implement an ETL workflow where you periodically materialise the view, for example using CTAS. The latter has the added benefit that you can produce Parquet files, which could help speed up your queries.

On the other hand it's not as simple as it sounds. If you're in luck you can use INSERT INTO to transform partitions from your current table into an optimised table after a point in time when they will not change – but in my experience most of the time your most recent data gets updated during some window of time, but you still want to be able to query it during that window. In that situation the ETL process becomes much more complicated since you need to remove data from the optimised table to avoid ending up with duplicate data. It's not hard, it's just a lot of code and juggling S3 and Glue Data Catalog operations so that you never have tables that have duplicate data nor too little data.

Unless you feel like your current setup with the view is too slow, don't go implementing something big and complicated. Remember that you pay for bytes scanned in Athena, not the amount of time Athena spends crunching your query. You get quite a lot of compute power running your queries and in my experience there's rarely any point in micro-optimisation of queries, the gains you make are orders of magnitude lower than minimising the amount of data you process, either through clever partitioning or moving to columnar file formats. Most of the time the gains from small optimisations are not measurable because the error bars caused by Athena's query queue and waiting for S3 operations. You may get your query to run 50ms faster, but sometimes it gets queued for 500ms, and spends another 2000ms doing list operations on S3 so how can you tell?

If you decide to go down the materialisation route, first do it once using CTAS and run your QuickSight visualisation against the results. Don't implement the whole ETL workflow before you've checked that you get something that is significantly more performant.

If all you are worried about is that it's less performant to apply filters after the unnesting of your arrays than using array functions, write the two versions of the query and benchmark them against each other. I suspect array functions are going to be slightly faster – but for the same reasons I mentioned above, the gains may drown in the error bars caused by Athena's queuing and other operations.

Make sure to benchmark at different points during the day, and be especially conscious of the fact that top-of-the-hour behaviour in Athena is extremely different from other times (run queries at 10:00 and then at 10:10 – your total execution times will be very different because everyone's cron jobs run at the top of the hour).