Python Scrapy - brotli.brotli.Error: Decompression error: incomplete compressed stream

I'm scraping restaurant reviews from Yelp by accessing each restaurant's API. I'm currently scraping 4-star reviews. For example, this restaurant page has this corresponding API.

This is the block of code that sends an HTTP request to the API when the crawler is on the restaurant page:

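# Read the business ID from the page's meta tag, then build the review-feed URL.
bizId = response.xpath("//meta[@name='yelp-biz-id']/@content").extract_first()
api_url = 'https://www.yelp.it/biz/' + bizId + '/review_feed?rr=' + str(n_star_filter)
# Queue the API request; the response is handled by parse_yelp_restaurant_api.
yield response.follow(url=api_url, callback=self.parse_yelp_restaurant_api)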

Sometimes the API is accessed correctly and I'm able to scrape it. Most of the time, however, I get this error:

2023-10-27 15:57:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.yelp.it/biz/78t73jTxdUw5C-v44lj4Iw/review_feed?rr=4>
Traceback (most recent call last):
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 64, in process_response
    method(request=request, response=response, spider=spider)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 63, in process_response
    decoded_body = self._decode(response.body, encoding.lower())
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 102, in _decode
    body = brotli.decompress(body)
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 90, in decompress
    d.finish()
  File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 464, in finish
    raise Error("Decompression error: incomplete compressed stream.")
brotli.brotli.Error: Decompression error: incomplete compressed stream.

I can't figure out what this means, and it's strange that some API requests succeed while others produce this error, even though they appear to be no different from each other.

There is 1 answer

oleksii (best answer)

This is likely a violation of Yelp's policy; such websites don't like it when people scrape data in this fashion. For example, this policy prohibits users from the following:

Use any robot, spider, Service search/retrieval application, or other automated device, process or means to access, retrieve, copy, scrape, or index any portion of the Service or any Service Content, except as expressly permitted by Yelp (for example, as described at www.yelp.com/robots.txt);

Based on the code and the behaviour, the server most likely detects the automated scraping and cuts the response off partway through, so the decompressor receives a truncated body. The compression itself is not the problem. You may want to look at official Yelp API access at https://www.yelp.com/developers instead.
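
To illustrate what "incomplete compressed stream" means, here is a minimal sketch of how a body that is cut off mid-transfer fails at decompression time. It uses the standard library's zlib as a stand-in for brotli (an assumption made so the example needs no extra packages), and the helper name decompress_strict and the sample payload are hypothetical. The check at the end plays the same role as the d.finish() call in the traceback above.

```python
import zlib


def decompress_strict(data: bytes) -> bytes:
    """Decompress a zlib stream, raising if the stream ends before its final block."""
    d = zlib.decompressobj()
    out = d.decompress(data)
    out += d.flush()
    # d.eof is only True once the end-of-stream marker has been seen.
    if not d.eof:
        raise ValueError("incomplete compressed stream")
    return out


# Hypothetical payload standing in for an API response body.
payload = b'{"reviews": []}' * 200
complete = zlib.compress(payload)

# Simulate a server that closes the connection halfway through the body.
truncated = complete[: len(complete) // 2]

print(decompress_strict(complete) == payload)

try:
    decompress_strict(truncated)
except ValueError as exc:
    print("truncated body rejected:", exc)
```

The compressed bytes that did arrive are valid; the decoder simply never sees the end of the stream. That is consistent with the error appearing only on some requests: it depends on whether the server delivers the whole body, not on anything in the spider's decoding step.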