As the title suggests: calling requests.get() gives me a different image src link from the one I see when browsing the site manually.
I'm trying to scrape a site for products and want to store the images, but the src I get from the response points to a very low-quality, blurry image. I compared it with the src shown on the site and the two are different. Do I need to pass something in the request to "force" a screen size?
My code is below:
from requests import get
from lxml import html

def demo():
    params = {'page': 0}
    response = get('https://www.checkers.co.za/c-2256/All-Departments', params=params)
    tree = html.fromstring(response.content)
    images = tree.xpath('//a[@class="product-listening-click"]/img[@src]')
    images = ['https://www.checkers.co.za' + image.attrib['src'] for image in images]
    print(images)
Here is how the src of the first link in the list differs from the src of the original image on the site.
site src:
https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA
retrieved src:
https://www.checkers.co.za/medias/lqi-10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxNDkwfGltYWdlL3BuZ3xpbWFnZXMvaDY1L2gwNC85MDgxNzU4NDgyNDYyLnBuZ3w0MmY3ZmMzNzJmYTU0MGIzNDk0ZjdmOTkyODYwMGI3N2I5YWJhZDRkOTljNzViYjIxMWQ3OWU2NDVjZGZhZTdm
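One way to see what differs between the two links is to decode the context query parameter, which is Base64 text. This is only a sketch: decode_context is a hypothetical helper name, and reading the third field as a file size in bytes is my assumption, not something the site documents.

```python
import base64
from urllib.parse import urlparse, parse_qs

SITE_SRC = (
    "https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H"
    "?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3"
    "ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMy"
    "NDhlNDE0MDhjYWQ0MjQxNmQ3NA"
)
RETRIEVED_SRC = (
    "https://www.checkers.co.za/medias/lqi-10136669EA-20190726-Media-checkers300Wx300H"
    "?context=bWFzdGVyfGltYWdlc3wxNDkwfGltYWdlL3BuZ3xpbWFnZXMvaDY1L2gwNC85MDgxNzU4"
    "NDgyNDYyLnBuZ3w0MmY3ZmMzNzJmYTU0MGIzNDk0ZjdmOTkyODYwMGI3N2I5YWJhZDRkOTljNzVi"
    "YjIxMWQ3OWU2NDVjZGZhZTdm"
)


def decode_context(url):
    """Return the '|'-separated fields stored in the URL's context parameter."""
    token = parse_qs(urlparse(url).query)["context"][0]
    token += "=" * (-len(token) % 4)  # restore the Base64 padding the URL omits
    return base64.urlsafe_b64decode(token).decode("ascii").split("|")


site_fields = decode_context(SITE_SRC)
retrieved_fields = decode_context(RETRIEVED_SRC)

# Assumption: the third field looks like the file size in bytes.
print(site_fields[:5])
print(retrieved_fields[:5])
```

Decoded this way, the two links name different PNG files, and the "lqi-" one carries a much smaller number in the third field, which fits the blurry placeholder I'm seeing.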
EDIT 1:
I tried adding a User-Agent header using the fake-useragent package and looped through all the available ones. The src results did not change.
EDIT 2:
It seems that parsing the page with lxml.html rather than bs4 gives a different value for the image's data-original-src attribute. I'm not sure why, but thank you @AmineBTG for helping me notice this.
Note:
Accessing data-original-src in lxml.html with //a[@class="product-listening-click"]/img/@data-original-src instead of //a[@class="product-listening-click"]/img[@data-original-src] did the trick. The first expression returns the attribute values themselves, whereas the second only selects the img elements that have the attribute. From my testing, no extra header appears to be required.
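To illustrate the difference between the two XPath forms, here is a minimal sketch run against a made-up HTML snippet. The markup and the /medias/... paths in SAMPLE are placeholders I invented for the example, not the real page.

```python
from lxml import html

# Hypothetical markup mimicking one product tile; the real page may differ.
SAMPLE = """
<div>
  <a class="product-listening-click">
    <img src="/medias/lqi-demo.png" data-original-src="/medias/demo.png">
  </a>
</div>
"""

tree = html.fromstring(SAMPLE)

# Predicate form: selects <img> elements that HAVE the attribute.
elements = tree.xpath('//a[@class="product-listening-click"]/img[@data-original-src]')

# Attribute step: returns the attribute VALUES as strings.
urls = tree.xpath('//a[@class="product-listening-click"]/img/@data-original-src')

print(elements[0].tag, elements[0].get('src'))
print(urls)
```

With the predicate form you still have to read the attribute from each element yourself, and reading src there gives the low-quality link. The attribute step returns the full-quality paths directly, which can then be prefixed with the site's base URL.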
It works when you pass the appropriate headers and retrieve the "data-original-src" attribute rather than the "src" attribute. Please see the code below (slightly modified).
Output: