Why does Python's requests.get() retrieve a different image src compared to browsing the site?

As the title suggests: calling the requests.get() method gives me a different image src link from the one I see when browsing the site manually.

I'm trying to scrape a site for products and want to store the images, but the src I get from the site points to a very low-quality, blurry image. I compared that src to the one shown on the site and it's different. Not sure if I need to pass something in the request to "force" a screen size?

My code below:

from requests import get
from lxml import html

def demo():
    params = {'page': 0}
    response = get('https://www.checkers.co.za/c-2256/All-Departments', params=params)
    tree = html.fromstring(response.content)
    images = tree.xpath('//a[@class="product-listening-click"]/img[@src]')
    images = ['https://www.checkers.co.za' + image.attrib['src'] for image in images]
    print(images)

The difference between the first src in the list and the src of the original image on the site:

site src:
https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA

retrieved src:
https://www.checkers.co.za/medias/lqi-10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxNDkwfGltYWdlL3BuZ3xpbWFnZXMvaDY1L2gwNC85MDgxNzU4NDgyNDYyLnBuZ3w0MmY3ZmMzNzJmYTU0MGIzNDk0ZjdmOTkyODYwMGI3N2I5YWJhZDRkOTljNzViYjIxMWQ3OWU2NDVjZGZhZTdm

EDIT 1:

I tried adding a User-Agent header using the fake-useragent package and looped through all the available ones. The src results did not change.

EDIT 2:

It seems that parsing the page with lxml.html as opposed to bs4 gives different output for the image's data-original-src. Not sure why, but thank you @AmineBTG for helping me notice this issue.

Note:
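Accessing data-original-src with lxml.html using the XPath //a[@class="product-listening-click"]/img/@data-original-src instead of //a[@class="product-listening-click"]/img[@data-original-src] did the trick. The first expression returns the attribute values themselves, whereas the second only selects the img elements that have the attribute. By the looks of it, a custom header was not required when testing.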
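
To illustrate the difference between the two XPath forms, here is a minimal sketch that runs against a small inline HTML snippet rather than the live site. The markup and the /medias/... paths are made up for the example and only imitate the structure described above; the real page may differ.

```python
from lxml import html

# Hypothetical markup imitating one product tile: src is a low-quality
# placeholder and data-original-src points to the full-quality image.
SAMPLE = """
<div>
  <a class="product-listening-click" href="/p/1">
    <img src="/medias/lqi-demo" data-original-src="/medias/demo">
  </a>
</div>
"""

tree = html.fromstring(SAMPLE)
link = '//a[@class="product-listening-click"]'

# A predicate [@attr] only filters elements: the result is a list of <img> elements.
elements = tree.xpath(link + '/img[@data-original-src]')

# A trailing /@attr step selects the attribute: the result is a list of strings.
values = tree.xpath(link + '/img/@data-original-src')

# Reading src from the matched elements still yields the placeholder ...
placeholder_urls = [img.get('src') for img in elements]
# ... while reading data-original-src from them matches the /@attr result.
original_urls = [img.get('data-original-src') for img in elements]

print(placeholder_urls)
print(values)
```

So with the predicate form, the result depends on which attribute is read from each element afterwards; reading attrib['src'] returns the placeholder even though the elements were filtered on data-original-src.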

There are 2 answers

AmineBTG (best answer)

It works when passing the appropriate headers and retrieving the "data-original-src" attribute rather than the "src" attribute. Please see the code below (slightly modified):

import requests
from bs4 import BeautifulSoup

def demo():
    # Browser-like request headers (captured from Chrome 87).
    header = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
        "cache-control": "max-age=0",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \" Not;A Brand\";v=\"99\", \"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    }

    params = {'page': 0}

    r = requests.get('https://www.checkers.co.za/c-2256/All-Departments', headers=header, params=params)
    s = BeautifulSoup(r.content, "html.parser")

    # Each product image sits inside a div with this class.
    products = s.find_all("div", {"class": "item-product__image"})
    # data-original-src holds the full-quality image path; src is the low-quality one.
    images = ['https://www.checkers.co.za' + prod.find("img").attrs.get("data-original-src") for prod in products]

    return images

print(demo())

Output:

['https://www.checkers.co.za/medias/10136669EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wzNTA4MHxpbWFnZS9wbmd8aW1hZ2VzL2g1My9oZmYvODg1NzQ3ODYyNzM1OC5wbmd8YTM4YjE3YmMxYzJjMzI4MmIzMTQ0ZWU1MjlkYjBmNWZjZGFhYzYxYzAyZGMyNDhlNDE0MDhjYWQ0MjQxNmQ3NA', 'https://www.checkers.co.za/medias/10136301EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1ODA4OHxpbWFnZS9wbmd8aW1hZ2VzL2g2ZS9oMGIvOTA5NjUxNTA5MjUxMC5wbmd8ZTViYzUzY2FiOWIyNWNmYmY0OGQ0ZGY0ZmY2ZDQwMGI3Nzk4ODMwOGYzMWRhNjIxOGZmM2Y1ZTExNDgxZWZjMg', 'https://www.checkers.co.za/medias/10151456EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w4NDE0NHxpbWFnZS9wbmd8aW1hZ2VzL2gzNC9oMjIvODg1NzgxMjU5ODgxNC5wbmd8OWMxYjE4Nzc0MjNkZTU2ZGI5ZDZmN2Q2M2FhMTdhZmM3Yjc4NDgwMzEwMjg1NmNiZTM1YWNjZjkxOTUyMzhmNQ', 'https://www.checkers.co.za/medias/10151458EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1MTc0MnxpbWFnZS9wbmd8aW1hZ2VzL2g2MS9oYmEvODg1Nzg5Nzk5MjIyMi5wbmd8ZDFhNTlkMmJiNGY3ZDA3YjU5NjkwZGMxMjY3ZTgzMDVjYzFkMDkxNzI4NzlmN2U2MjhjZGJmYjE1NDg5ZDU2ZA', 'https://www.checkers.co.za/medias/10143000EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w0NjYyN3xpbWFnZS9wbmd8aW1hZ2VzL2g2MC9oZDIvODg1NzY3NDA4ODQ3OC5wbmd8ZDAzYzk5ZGFkY2Y1NjBiOTllZGJjOTVkZTUwNTg3NjBhYTM4NTk0OGFiYzk4OWRlNDQxZTdkNjQzNWM5YmU1NA', 'https://www.checkers.co.za/medias/10136298EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1NzYzNXxpbWFnZS9wbmd8aW1hZ2VzL2g3MC9oZWEvOTA5NTI5NTY2NDE1OC5wbmd8NDVkZmY4YWU4NWY3MDliYjYyNTk4MWM1NzIyOTNlMjYwMzEwOGFiNGNiZTEzNGRhMmVkZjNiZTU0ZjNiYThiMw', 'https://www.checkers.co.za/medias/10151462EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MzY1MHxpbWFnZS9wbmd8aW1hZ2VzL2g4Zi9oMjcvODg1NzgxNDU2NDg5NC5wbmd8ZTE5ZmViZTdjNDNkOTVjMjZkODYwMjA4YTczNTgwNjU5ZmViMmE4OTQ4YzUwYjgwMjI0ZGJkNzJkNTI5OGU0Mg', 
'https://www.checkers.co.za/medias/10241929EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MjMwOXxpbWFnZS9wbmd8aW1hZ2VzL2g0NS9oY2QvODg1ODQ2ODM4NDc5OC5wbmd8YzE1MGZkOWI2MjAyOWVlNzQ2YmRkMWM2MDNhZTk3ZGFkYWY4ZWMxNzA5Njc5OTMxNzY3OWEyNzg5MzczZmM1ZA', 'https://www.checkers.co.za/medias/10165121EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2NTUzMXxpbWFnZS9wbmd8aW1hZ2VzL2gyZC9oYjcvODg2NDkyODM2NjYyMi5wbmd8NzEyODllNDlmZjE1NjJmMzEyMmU4MTU4NWQ4ZjRjYmM1Nzc2NWNjM2Y2YmFmZGQ1N2Y5ZjFmOTY0ZjBkMGE1OQ', 'https://www.checkers.co.za/medias/10145817EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wyMzU0MHxpbWFnZS9wbmd8aW1hZ2VzL2gwMy9oZDIvODg1Nzc0ODc5OTUxOC5wbmd8NGQxZjA0OWNkZTVjY2JmNTI2ZTdlOGIwZjFiMmE5MGFhYjQ2NjZhMDBiYWMyNzVhYjMxNTI4YTZjYWU3OWZhZA', 'https://www.checkers.co.za/medias/10151065EAv2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MDkyNHxpbWFnZS9wbmd8aW1hZ2VzL2g2Ny9oMDEvOTI3ODc2MDE4OTk4Mi5wbmd8YWQ1N2E4Y2ZmNTQ3YzA1ZDdmODcyZDlmZTg4ZGUwZGJhOWQ3ZGNiNWI5ZmE4OWFmOTVkZDgyYjEzOTUyZjlhYg', 'https://www.checkers.co.za/medias/10126789EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3wxMDYzMDh8aW1hZ2UvcG5nfGltYWdlcy9oNzcvaDAzLzg4NjA4NzYzMDg1MTAucG5nfDVhMTE1MmE5YmMxNjE0OGZmM2IwOTcxMWQzYWIyY2IxOTU2MmY1M2M2N2MzZjc5ZDE2YWFmNGFiZjdiOTI1YzY', 'https://www.checkers.co.za/medias/10241933EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2NDQ4NXxpbWFnZS9wbmd8aW1hZ2VzL2gzNy9oMTkvODg1ODQwMjg4MTU2Ni5wbmd8NDA5MDRlMDZkM2U3M2JiODUwNWVmYmE5YzM3NjQ3NDkxYTMzZmI0ZjY3OWFkNDZiODU2YjQ2NTRjOTQyNjI2MQ', 'https://www.checkers.co.za/medias/10147193EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w2MDk2M3xpbWFnZS9wbmd8aW1hZ2VzL2g4Zi9oNTUvODk1NjcwODIyNTA1NC5wbmd8YmFkMTgzZDFkNGRiY2ViMzU4MjNhMGY0ZmM1OTgyY2U4NTY5MGZiZWI5ZTMzNjE1ZTNkY2Q1YzAwY2JhZTgyYw', 
'https://www.checkers.co.za/medias/10164636EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w0MDM4NnxpbWFnZS9wbmd8aW1hZ2VzL2gxMS9oYWIvODg1ODA5NTU4MzI2Mi5wbmd8YTYxY2ZkNjAzOTg2NGVmNGMxODVjNmRkNTAyYmYzOWM2ZDU0MzgyYzk3YjM2YWUzOTRkMGEwOWE0NmVjMDQ0NA', 'https://www.checkers.co.za/medias/10136574EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1MzU0NXxpbWFnZS9wbmd8aW1hZ2VzL2hjMi9oNjEvODg1NzQzMjk4MTUzNC5wbmd8YTJmZGU0ZGVjNDU1NTIyMzU0NGM1ZTQzOTQ4OTUwNmEwN2I3MDc5MjliOWNkNmRlNTgzZDMxYjdkMmNjNTIyYw', 'https://www.checkers.co.za/medias/10604301EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w1OTUxN3xpbWFnZS9wbmd8aW1hZ2VzL2hiZi9oZDQvODg1OTg2MzY0NjIzOC5wbmd8MTE5ZTQ4NzJkMzhjMzczMDk0MTE4YTZhZTllMTFlYjBiMTUyY2IzNjIyMmM4NzFlODA0MTU1Yzg3ZWNkMjMyNQ', 'https://www.checkers.co.za/medias/10145422EA-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w3OTc0MHxpbWFnZS9wbmd8aW1hZ2VzL2gxYy9oNTIvODk2MjM0ODIyMDQ0Ni5wbmd8MDc5NjY1YzY2NGE0NDFiNWRiN2NkNWZkMWJlODg5MDhlOWUyZWNhNDEzMWJiZTQ3MjM5MDYyZjgzZWYyYWM2Mg', 'https://www.checkers.co.za/medias/10136291EA-20190726-Media-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w5NjM3NnxpbWFnZS9wbmd8aW1hZ2VzL2hjNi9oMzUvODg1NzQzMDU1NjcwMi5wbmd8ZDQwZjEzMmU5Y2JkMDhkNGM2MGQ4ZTc1MWY0Y2Q5YTJhZWI2YmM2YmY5YjNiYWEyZjQ0YWQ5ZDgyMmE3ZWE2YQ', 'https://www.checkers.co.za/medias/10148833EAV2-checkers300Wx300H?context=bWFzdGVyfGltYWdlc3w3NTU3NnxpbWFnZS9wbmd8aW1hZ2VzL2g1OC9oNzIvODk1NjY5NjQ2MTM0Mi5wbmd8N2YyZmMxMjA5ZjkxZjkzZWExN2E2MGE1ZTZiZjI0M2FkMDcxZTVlMzY0ZjAzOTRjMjAzNzRjYWQ5Yzk4NjZkNQ'] 
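
One caveat with the list comprehension above: it raises a TypeError if an img has no data-original-src attribute, because None cannot be concatenated to a string. Below is a minimal sketch of a more defensive variant that falls back to src. To keep it runnable without third-party packages it uses the standard-library html.parser in place of BeautifulSoup, and the sample markup, the ImageCollector class and the collect_image_urls function are hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = 'https://www.checkers.co.za'


class ImageCollector(HTMLParser):
    """Collect one URL per <img>, preferring data-original-src over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attributes = dict(attrs)
        # Fall back to src when the data-original-src attribute is absent.
        path = attributes.get('data-original-src') or attributes.get('src')
        if path:
            # urljoin copes with both relative and absolute paths.
            self.urls.append(urljoin(BASE_URL, path))


def collect_image_urls(page_source):
    collector = ImageCollector()
    collector.feed(page_source)
    return collector.urls


# Hypothetical markup; the real page structure may differ.
SAMPLE = (
    '<div class="item-product__image">'
    '<img src="/medias/lqi-a" data-original-src="/medias/a"></div>'
    '<div class="item-product__image"><img src="/medias/b"></div>'
    '<div class="item-product__image"><img alt="no image"></div>'
)

image_urls = collect_image_urls(SAMPLE)
print(image_urls)
```

With real data, page_source would be the response text from the request above; an img with neither attribute is simply skipped.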
Awesomepotato29

When a web browser sends an HTTP request, it includes a lot of information about itself in the headers, allowing the website to return a version of itself that is best suited for display in that particular browser. When you make a request via the requests module, the website gets far less of this information (by default requests sends only a few generic headers, including a python-requests User-Agent), so it may send a version of the site that is slightly different from what you would get in a browser.

This is why you are getting two different image sources depending on how you request the website. The browser is getting a higher-quality image because the website has enough info about the way the image is going to be used to send the best possible version of the image, while the script request gets a lower-quality image because the website sends a much smaller version of the image to reduce traffic.
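
To check what a script actually sends, the headers can be inspected without touching the network. This is a minimal sketch, assuming the requests package is installed; the browser_like values are illustrative only and are not known to be what this particular site looks at. Note that in the asker's own testing, changing the headers did not change the src.

```python
import requests

# Headers that requests attaches when none are supplied.
default_headers = requests.utils.default_headers()
print(dict(default_headers))

# Hypothetical browser-like overrides; any real browser values could be used.
browser_like = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Build the request without sending it, so the final header set can be inspected.
session = requests.Session()
request = requests.Request(
    'GET',
    'https://www.checkers.co.za/c-2256/All-Departments',
    headers=browser_like,
    params={'page': 0},
)
prepared = session.prepare_request(request)
print(prepared.url)
print(dict(prepared.headers))
```

Comparing the two printed dictionaries with the request headers shown in the browser's developer tools makes it easier to tell whether a header is really responsible for a difference in the returned HTML.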