Amazon SageMaker Studio Data Wrangler: Athena query failing for large datasets

I'm trying to query a large dataset from Athena in SageMaker Data Wrangler, and the query fails when the dataset is large. I'm building a Data Wrangler flow through the UI in SageMaker Studio and adding Athena as a source.

Some observations:

  1. Small Athena queries work.
  2. The same dataset is read successfully from S3 after running the query in Athena.
  3. First the UI shows a warning that the query is taking longer than usual, then a failure message with no specific reason. There is no useful message in the CloudFormation logs either.
  4. The same query, run directly in Athena, completed in around 30 minutes.
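Observations 2 and 4 suggest one way around the failure: run the long query in Athena yourself, wait for it to finish, then import the result file from S3. Below is a minimal sketch of that, assuming a boto3 Athena client is passed in. The function name `run_athena_query`, the polling interval, and the one-hour timeout are illustrative choices, not Data Wrangler settings.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Start an Athena query, wait for it to finish, and return the S3 URI of its result file.

    `athena` is expected to behave like boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    waited = 0
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            # Athena writes the result file under the configured output location.
            return execution["ResultConfiguration"]["OutputLocation"]
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {query_id} {state}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Athena query {query_id} still {state} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Example call (database, bucket, and table names are placeholders):
#   import boto3
#   uri = run_athena_query(boto3.client("athena"), "SELECT * FROM my_table",
#                          "my_database", "s3://my-bucket/athena-results/")
```

The returned URI can then be added to the flow as an S3 source instead of an Athena source, which is the path that already works in observation 2.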

Has anyone encountered a similar problem? Are there any timeout settings for Data Wrangler?

There is 1 answer

Luk3rson

I had the same issue with Snowflake as a source. I opened a support ticket, and according to support they are working on improving performance for large datasets.

As a workaround, export the flow to a SageMaker pipeline and run it as a Processing Job on multiple instances; the job runs in a distributed environment using Spark.
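The Data Wrangler export generates the job definition for you, so the following is only a rough sketch of where the instance count sits in a boto3 `create_processing_job` request. The helper name `build_processing_job_request` and the default instance type, volume size, and runtime limit are assumptions for illustration; the image URI, role, inputs, and outputs should be taken from the exported notebook rather than written by hand.

```python
def build_processing_job_request(job_name, image_uri, role_arn,
                                 processing_inputs, processing_output_config,
                                 instance_count=2, instance_type="ml.m5.4xlarge",
                                 volume_gb=30, max_runtime_seconds=86400):
    """Build keyword arguments for boto3.client("sagemaker").create_processing_job.

    processing_inputs and processing_output_config are passed through unchanged;
    copy them from the notebook that Data Wrangler exports for the flow.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "ProcessingJobName": job_name,
        "AppSpecification": {"ImageUri": image_uri},
        "RoleArn": role_arn,
        "ProcessingInputs": processing_inputs,
        "ProcessingOutputConfig": processing_output_config,
        # More than one instance lets the job spread the work across a cluster.
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": volume_gb,
            }
        },
        # Upper bound on job runtime, so a long import is not cut short by a small default.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


# Example call (all values are placeholders):
#   import boto3
#   request = build_processing_job_request("my-flow-job", image_uri, role_arn,
#                                          inputs, output_config, instance_count=4)
#   boto3.client("sagemaker").create_processing_job(**request)
```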