I was using Amazon Athena to query the index of the Common Crawl archives successfully until a couple of weeks ago, when it started to return "Service: Amazon S3; Status Code: 503; Error Code: SlowDown". I followed this approach https://skeptric.com/common-crawl-index-athena/ and it was working quickly and as expected. When it works, Athena takes less than 10 seconds to scan a bucket of 300 parquet files and return a result, but now it runs for about 1 minute and then fails while opening a random parquet file, returning the aforementioned error code.
A SQL statement in Athena looks like this:
SELECT url_host_registered_domain As domain, url_path, warc_filename, warc_record_offset, warc_record_length
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2023-06' AND subset = 'warc' AND url_host_registered_domain IN ('ica.se', 'hemkop.se', 'spar.no', 'obs.no', 'obsbygg.no', 'rarecoin.store')
The error code I get now every time is:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-06/subset=warc/part-00275-b5ddf469-bf28-43c4-9c36-5b5ccc3b2bf1.c000.gz.parquet (offset=0, length=67108864): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown...
I've set up an exponential backoff algorithm to retry, and it works once in a while, but I am not happy to scan (and pay for) gigabytes of data all the time without getting any results out of it :/
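For reference, a retry loop of this kind might look roughly like the sketch below. It is not the exact code used here: `run_query` is a hypothetical callable that submits the Athena query and returns its result, and `SlowDownError` is a made-up exception standing in for however the 503 SlowDown failure is surfaced in your client. The delay doubles on each attempt and is randomized (full jitter) so that concurrent clients do not retry in lockstep.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for a throttling failure such as S3's 503 SlowDown."""


def retry_with_backoff(run_query, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call run_query(), retrying on SlowDownError with exponential backoff.

    run_query is an assumed zero-argument callable; sleep is injectable
    so the waiting behaviour can be tested without real delays.
    """
    for attempt in range(max_attempts):
        try:
            return run_query()
        except SlowDownError:
            # Out of attempts: let the caller see the last failure.
            if attempt == max_attempts - 1:
                raise
            # Upper bound doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the bound.
            sleep(random.uniform(0, delay))
```

Capping the number of attempts matters here, since every retry issues a fresh query against the same throttled bucket.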
Is that an issue on my side, or is Amazon simply not providing enough resources? Has anyone experienced the same issue, or can anyone suggest an alternative way to retrieve the index results?
Any help highly appreciated! Thanks.
I see you solved your own question by making a mirror of the parquet files, but the underlying issue on our end is no longer happening. We're not sure if the person sending us millions of requests per second stopped, or if Amazon finally figured out a signature for dropping those requests, but things have been much better for the past 12 hours.
In the future, we'd recommend checking out our new status webpage to see what's going on. Also, our blog sometimes has some interesting posts. The recent performance blog post contained the workaround you used, for example.
Thank you for using Common Crawl!
New status webpage: https://status.commoncrawl.org/
Recent blog post about our performance issues: https://commoncrawl.org/blog/oct-nov-2023-performance-issues