Thank you for reading my post. I have a list of URLs for PDF files.
for eachurl in url_list:
    print(eachurl)
Below are the links to my PDFs:
https://www.sec.gov/Archives/edgar/data/1005757/999999999715000035/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1037760/999999999715000162/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1038133/999999999715000169/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1009626/999999999715000483/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1017491/999999999715000518/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000557/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000795/filename1.pdf
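For reference, here are the same links spelled out as the url_list variable used in the code in this post, in case anyone wants to reproduce the problem directly:

url_list = [
    'https://www.sec.gov/Archives/edgar/data/1005757/999999999715000035/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1037760/999999999715000162/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1038133/999999999715000169/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1009626/999999999715000483/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1017491/999999999715000518/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1020214/999999999715000557/filename1.pdf',
    'https://www.sec.gov/Archives/edgar/data/1020214/999999999715000795/filename1.pdf',
]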
These seven links work perfectly if I manually click on them and download the PDF files. However, when I use Python code to download them, random errors happen: sometimes the first PDF is damaged and cannot be opened, sometimes it is the second, or the third, and so on.
from pathlib import Path
import requests

n_files = 0
# Pretend to be a regular browser so the request is not rejected
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169'}

for eachurl in url_list:
    n_files += 1
    # Download each file and save it as 1.pdf, 2.pdf, ...
    response = requests.get(eachurl, headers=headers)
    filename = Path(str(n_files) + '.pdf')
    filename.write_bytes(response.content)
Could you help me understand why this happens?
Update: I uploaded these files to Google Drive and finally found out that it happens because SEC identifies me as a robot. I have already added the headers. Any idea how to bypass this?
There is nothing wrong with your code. The website you are downloading the PDF documents from detects that you are using an automated tool and, instead of serving the PDF as it normally would, returns an HTML page telling you so.
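You can verify this yourself: a real PDF always starts with the bytes %PDF, while the block page is HTML, so checking the Content-Type header or the first few bytes of each response shows exactly which requests were refused. A minimal diagnostic sketch, reusing your url_list and headers:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169'}

for eachurl in url_list:
    response = requests.get(eachurl, headers=headers)
    # A genuine PDF starts with the magic bytes b'%PDF'; the bot-detection page is HTML.
    looks_like_pdf = response.content.startswith(b'%PDF')
    print(eachurl, response.headers.get('Content-Type', ''), 'OK' if looks_like_pdf else 'NOT a PDF')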
SOLUTION
Remove the headers; it seems to work fine after that.
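For completeness, here is roughly what the download loop looks like with the custom headers removed. The %PDF check and the one-second pause between requests are my own additions rather than part of the fix; they just make a refused download obvious and keep the request rate low:

from pathlib import Path
import time
import requests

n_files = 0
for eachurl in url_list:
    n_files += 1
    # No custom User-Agent header; requests sends its own default one.
    response = requests.get(eachurl)
    # Added check: only save the response if it really is a PDF.
    if not response.content.startswith(b'%PDF'):
        print('Skipped (not a PDF):', eachurl)
        continue
    Path(str(n_files) + '.pdf').write_bytes(response.content)
    # Added pause: keep the request rate low between downloads.
    time.sleep(1)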