I want to use Python requests with the Splash browser (https://splash.readthedocs.io/en/stable/) and custom headers to crawl some data from a website. Before starting the crawl itself, I decided to check which headers I am actually sending by requesting http://xhaus.com/headers. It turns out that I am not sending the headers I want to send.
import random

import requests


def build_headers():
    # Start from the requests defaults and override only the User-Agent.
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': random_user_agent()
    })
    return headers


def random_user_agent():
    # Pick one user agent at random from a file with one entry per line.
    with open('user-agents.txt', 'r') as f:
        user_agents = [line.rstrip('\n') for line in f]
    return random.choice(user_agents)


splash = 'http://localhost:8050/render.html'
headers = build_headers()
url_h = 'http://xhaus.com/headers'
# The target URL is passed to Splash as a query parameter.
page = requests.get(splash, params={'url': url_h}, headers=headers)
After I run this code, the headers dict contains the user agent I picked:
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
However, when I check the page returned by the website I mentioned, it reports a different user agent:
from bs4 import BeautifulSoup  # assuming the bs4 package

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
...
<td class="even">
User-Agent
</td>
<td class="even">
<b>
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) splash Safari/538.1
</b>
</td>
...
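A likely cause, going by the Splash HTTP API documentation: the headers= argument of requests.get only sets headers on the request from my script to Splash, and Splash then fetches the target page with its own user agent. The docs describe a headers argument for render.html that is only supported in application/json POST requests. Below is a minimal sketch under that assumption; build_splash_payload and render_with_headers are hypothetical helper names, not part of requests or Splash.

```python
def build_splash_payload(target_url, extra_headers):
    """Build the JSON body for a POST to Splash's render.html endpoint.

    Assumption: Splash forwards the entries under 'headers' to target_url.
    """
    # Copy the headers so later changes to the caller's dict do not leak in.
    return {'url': target_url, 'headers': dict(extra_headers)}


def render_with_headers(splash_endpoint, target_url, extra_headers):
    """POST the payload to Splash and return the requests response."""
    # Imported here so build_splash_payload can be used without requests.
    import requests

    payload = build_splash_payload(target_url, extra_headers)
    # json= makes requests send the body as application/json.
    return requests.post(splash_endpoint, json=payload)
```

With Splash running on localhost:8050, calling render_with_headers('http://localhost:8050/render.html', 'http://xhaus.com/headers', {'User-Agent': random_user_agent()}) and parsing the response text as before should show whether the chosen user agent now reaches the target site.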