Randomly damaged pdf files when using requests.get() with Python to download pdf

Thank you for reading my post. I have a list of urls for pdf files.

for eachurl in url_list:
    print(eachurl)

Below are the links for my pdfs:

https://www.sec.gov/Archives/edgar/data/1005757/999999999715000035/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1037760/999999999715000162/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1038133/999999999715000169/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1009626/999999999715000483/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1017491/999999999715000518/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000557/filename1.pdf
https://www.sec.gov/Archives/edgar/data/1020214/999999999715000795/filename1.pdf

These seven links work perfectly if I manually click on them and download the PDF files. However, if I use Python code to download them, errors occur at random: sometimes the first PDF is damaged and cannot be opened, sometimes it is the second or the third, and so on.

from pathlib import Path
import requests
n_files = 0
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169'}
for eachurl in url_list:
    n_files += 1
    response = requests.get(eachurl, headers=headers)
    filename = Path(str(n_files) + '.pdf')
    filename.write_bytes(response.content)

Could you help me understand why this happens?


Update: I uploaded these files to Google Drive and finally found out that the problem is that the SEC identifies me as a robot. I have already added the headers. Any idea how to get around this?

There is 1 answer

Sujal Singh

There is nothing wrong with your code. The website you are downloading the PDF documents from detects that you are using an automated tool and, instead of serving the PDF as it normally would, returns an HTML page saying so. Because your script saves whatever it receives under a .pdf name, that HTML page is what shows up as a "damaged" PDF. The text of that page is quoted below.
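One way to confirm which of the saved files are really HTML pages is to look at their first bytes. The following is a minimal sketch, not part of the original code: the helper names `is_pdf_file` and `find_damaged` are made up for illustration, and it assumes the `%PDF-` signature sits at the very start of the file, which holds for typical PDFs.

```python
from pathlib import Path

# A PDF file normally begins with this signature; an HTML error page does not.
PDF_SIGNATURE = b"%PDF-"


def is_pdf_file(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as handle:
        return handle.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE


def find_damaged(folder="."):
    """Return the names of .pdf files in `folder` that lack the PDF signature."""
    return sorted(
        p.name for p in Path(folder).glob("*.pdf") if not is_pdf_file(p)
    )
```

Calling `find_damaged()` in the folder where `1.pdf`, `2.pdf`, and so on were written lists the files that should be opened in a text editor or browser instead of a PDF viewer.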

Your Request Originates from an Undeclared Automated Tool

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.

Please declare your traffic by updating your user agent to include company specific information.

For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit sec.gov/developer. You can also sign up for email updates on the SEC open data program, including best practices that make it more efficient to download data, and SEC.gov enhancements that may impact scripted downloading processes. For more information, contact [email protected].

For more information, please see the SEC’s Web Site Privacy and Security Policy. Thank you for your interest in the U.S. Securities and Exchange Commission.

Reference ID: 0.2420b07b.1629818487.2ac196c

More Information

Internet Security Policy

By using this site, you are agreeing to security monitoring and auditing. For security purposes, and to ensure that the public service remains available to users, this government computer system employs programs to monitor network traffic to identify unauthorized attempts to upload or change information or to otherwise cause damage, including attempts to deny service to users.

Unauthorized attempts to upload information and/or change information on any portion of this site are strictly prohibited and are subject to prosecution under the Computer Fraud and Abuse Act of 1986 and the National Information Infrastructure Protection Act of 1996 (see Title 18 U.S.C. §§ 1001 and 1030).

To ensure our website performs well for all users, the SEC monitors the frequency of requests for SEC.gov content to ensure automated searches do not impact the ability of others to access SEC.gov content. We reserve the right to block IP addresses that submit excessive requests. Current guidelines limit users to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests.

If a user or application submits more than 10 requests per second, further requests from the IP address(es) may be limited for a brief period. Once the rate of requests has dropped below the threshold for 10 minutes, the user may resume accessing content on SEC.gov. This SEC practice is designed to limit excessive automated searches on SEC.gov and is not intended or expected to impact individuals browsing the SEC.gov website.

Note that this policy may change as the SEC manages SEC.gov to ensure that the website performs efficiently and remains available to all users.


Note: We do not offer technical support for developing or debugging scripted downloading processes.
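The notice above mentions a limit of 10 requests per second. Seven downloads are unlikely to hit it, but for a longer URL list a small pause between requests keeps a script under that limit. The sketch below is only an illustration: the `Throttle` class is a hypothetical helper, and the rate of 5 requests per second used in the usage note is an arbitrary value chosen to stay below the quoted limit.

```python
import time


class Throttle:
    """Space successive calls to wait() at least 1 / max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Block just long enough to respect the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the download loop this would be used by creating `throttle = Throttle(5)` once and calling `throttle.wait()` right before each `requests.get(...)`.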

SOLUTION

Remove the custom headers argument from requests.get(); the downloads seem to work fine after that.
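Put together, the loop might look like the sketch below. It is an illustration under assumptions, not a guaranteed fix: `download_pdfs` and its `fetch` parameter are names invented here, the 30-second timeout is arbitrary, and the check on the `%PDF-` prefix is an extra safeguard so that a blocked response is reported instead of being saved as a broken file. With no `headers` argument, `requests` sends its own default user agent.

```python
from pathlib import Path


def download_pdfs(url_list, out_dir=".", fetch=None):
    """Save each URL as 1.pdf, 2.pdf, ... and return the URLs that were rejected.

    `fetch` is a hypothetical hook that takes a URL and returns an object with
    `status_code` and `content`; by default it calls requests.get without
    custom headers.
    """
    if fetch is None:
        import requests  # third-party: pip install requests

        def fetch(url):
            return requests.get(url, timeout=30)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rejected = []
    for n_files, eachurl in enumerate(url_list, start=1):
        response = fetch(eachurl)
        # Only write the file when the body really looks like a PDF.
        if response.status_code == 200 and response.content.startswith(b"%PDF-"):
            (out_dir / f"{n_files}.pdf").write_bytes(response.content)
        else:
            rejected.append(eachurl)
    return rejected
```

Calling `download_pdfs(url_list)` returns an empty list when every file came back as a PDF; any URL in the returned list can be retried later.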