I'm scraping restaurant reviews from Yelp by requesting the API endpoint behind each restaurant page. I'm currently scraping 4-star reviews; for example, this restaurant page has this corresponding API.
This is the block of code that sends an HTTP request to the API when the crawler is on a restaurant page:
bizId = response.xpath("//meta[@name='yelp-biz-id']/@content").extract_first()
api_url = 'https://www.yelp.it/biz/' + bizId + '/review_feed?rr=' + str(n_star_filter)
yield response.follow(url=api_url, callback=self.parse_yelp_restaurant_api)
Sometimes the API is accessed correctly and I'm able to scrape it. Most of the time, however, I get this error:
2023-10-27 15:57:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.yelp.it/biz/78t73jTxdUw5C-v44lj4Iw/review_feed?rr=4>
Traceback (most recent call last):
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 64, in process_response
method(request=request, response=response, spider=spider)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 63, in process_response
decoded_body = self._decode(response.body, encoding.lower())
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 102, in _decode
body = brotli.decompress(body)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 90, in decompress
d.finish()
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 464, in finish
raise Error("Decompression error: incomplete compressed stream.")
brotli.brotli.Error: Decompression error: incomplete compressed stream.
I can't figure out what this means. It's strange that some API responses download fine while others produce this error, even though the requests look no different from each other.
This is likely a violation of Yelp's policy; such websites don't like it when people scrape data in this fashion. For example, this policy says
Based on the code and the behaviour, it's likely that the server detects automated scraping and cuts the response off partway through, so the brotli decoder receives an incomplete stream. This is not a problem with the compression itself. Consider using the official Yelp API instead: https://www.yelp.com/developers.
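To illustrate why a response that is cut off mid-transfer surfaces as a decompression error, here is a minimal sketch. It uses zlib from the standard library as a stand-in for brotli (an assumption made only so the example runs without extra packages), and the helper name is_truncated and the sample payload are hypothetical. The point is that the compressed bytes are valid as far as they go; the stream just ends before its end-of-stream marker.

```python
import zlib


def is_truncated(compressed: bytes) -> bool:
    """Return True if the stream ends before its end-of-stream marker.

    Hypothetical helper; zlib stands in for brotli here.
    """
    decompressor = zlib.decompressobj()
    # Incremental decompression accepts partial input without raising.
    decompressor.decompress(compressed)
    # eof is only set once the end-of-stream marker has been consumed.
    return not decompressor.eof


# Hypothetical payload standing in for a review feed response body.
payload = zlib.compress(b'{"reviews": []}' * 200)
# Simulate a server that closes the connection partway through.
half = payload[: len(payload) // 2]

print(is_truncated(payload))
print(is_truncated(half))

# One-shot decompression of the truncated body raises an error,
# analogous to brotli's "incomplete compressed stream".
try:
    zlib.decompress(half)
except zlib.error as exc:
    print("decompression failed:", exc)
```

If this is what is happening, changing the decoder will probably not help, because the missing bytes were never sent.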