I am trying to query a large dataset from Athena using AWS Data Wrangler, and the query fails when the dataset is large. I am setting up a Data Wrangler pipeline through the UI in AWS Studio and trying to add an Athena source.
Some observations:
- Small Athena queries work.
- The same dataset is read successfully from S3 after it has been queried with Athena.
- First I get a warning in the UI saying the query is taking longer than usual, and then a failure message with no specific reason. There is no useful message in the CloudFormation logs either.
- The same query completes directly in Athena in around 30 minutes.
Has anyone encountered a similar problem? Are there any timeout settings for Data Wrangler?
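For anyone trying to reproduce the S3 route from the observations above: the query can be started outside the UI and waited on until Athena has written its result to S3. This is only a minimal sketch, assuming a boto3 Athena client is passed in. The function name `run_athena_query`, the poll interval and the one-hour default timeout are illustrative choices of mine, not Data Wrangler settings.

```python
import time


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=30, timeout_seconds=3600):
    """Start an Athena query and block until it finishes.

    `client` is expected to behave like boto3.client("athena").
    Returns the S3 location of the result file on success.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    deadline = time.monotonic() + timeout_seconds

    while True:
        execution = client.get_query_execution(
            QueryExecutionId=query_id
        )["QueryExecution"]
        state = execution["Status"]["State"]

        if state == "SUCCEEDED":
            # Athena reports where it wrote the result file.
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get(
                "StateChangeReason", "no reason given"
            )
            raise RuntimeError(
                f"Athena query {query_id} ended as {state}: {reason}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Athena query {query_id} still {state} "
                f"after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)
```

The returned S3 location can then be added to the flow as an S3 source instead of an Athena source. Unlike the UI, a failed query here surfaces the `StateChangeReason` reported by Athena, when there is one.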
I had the same issue with Snowflake as a source. I created a support ticket, and according to support they are working on improving performance for large datasets.
As a workaround, export the flow to a SageMaker pipeline and run it as a Processing Job on multiple instances, where it runs in a distributed environment using Spark.
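To give a rough idea of what the multi-instance job looks like, here is a sketch that builds the request for the SageMaker `create_processing_job` API. The exported notebook generates the real values for you. Everything here is an assumption for illustration: the helper name `build_processing_job_request`, the instance type, the volume size, and the `flow` input name and local paths (which follow what I have seen in exported notebooks). The image URI, role and S3 locations must come from your own export.

```python
def build_processing_job_request(job_name, role_arn, image_uri,
                                 flow_s3_uri, output_s3_uri, output_name,
                                 instance_count=2,
                                 instance_type="ml.m5.4xlarge",
                                 volume_size_gb=30):
    """Build keyword arguments for create_processing_job.

    All default values are illustrative assumptions, not recommendations.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")

    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        # Data Wrangler container image, taken from the exported notebook.
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingResources": {
            "ClusterConfig": {
                # More than one instance spreads the work across the cluster.
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": volume_size_gb,
            }
        },
        "ProcessingInputs": [
            {
                # The .flow file exported from the Data Wrangler UI.
                "InputName": "flow",
                "S3Input": {
                    "S3Uri": flow_s3_uri,
                    "LocalPath": "/opt/ml/processing/flow",
                    "S3DataType": "S3Prefix",
                    "S3InputMode": "File",
                },
            }
        ],
        "ProcessingOutputConfig": {
            "Outputs": [
                {
                    "OutputName": output_name,
                    "S3Output": {
                        "S3Uri": output_s3_uri,
                        "LocalPath": "/opt/ml/processing/output",
                        "S3UploadMode": "EndOfJob",
                    },
                }
            ]
        },
    }
```

With real values filled in, the request would be submitted with something like `boto3.client("sagemaker").create_processing_job(**request)`. Raising `instance_count` is the part that matters for large datasets.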