Edit: after much fiddling, it seems urlgrabber succeeds where urllib2 fails, even when telling it to close the connection after each file. It looks like there is something wrong either with the way urllib2 handles proxies, or with the way I use it! Anyway, here is the simplest possible code to retrieve files in a loop:
import urlgrabber

for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    urlgrabber.urlgrab(url,
                       proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                       keepalive=1, close_connection=1, throttle=0)
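If each file needs to end up at a specific path, as in my original script below, urlgrab also accepts the destination filename as its second argument — a quick sketch reusing the same proxy string, not tested beyond the loop above:

import urlgrabber

proxies = {'http': 'http://<user>:<password>@<proxy url>:<proxy port>'}
for i in range(1, 100):
    url = "http://www.iana.org/domains/example/"
    # second positional argument is the local file to write the download to
    urlgrabber.urlgrab(url, "e:/tmp/images/tst{}.htm".format(i),
                       proxies=proxies, keepalive=1, close_connection=1, throttle=0)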
Hello all!
I am trying to write a very simple python script to grab a bunch of files via urllib2.
This script needs to work through the proxy at work (my issue does not exist if grabbing files on the intranet, i.e. without the proxy).
Said script fails after a couple of requests with "HTTPError: HTTP Error 401: basic auth failed". Any idea why that might be? It seems the proxy is rejecting my authentication, but why? The first couple of urlopen requests went through correctly!
Edit: Adding a 10-second sleep between requests, to rule out any throttling the proxy might be doing, did not change the results.
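For reference, the delay was simply added inside the loop — a minimal sketch of that attempt, assuming the proxy opener from the script below is already installed:

import time
import urllib2

for i in range(100):
    f = urllib2.urlopen("http://www.iana.org/domains/example/")
    # pause between requests to rule out proxy-side throttling; made no difference
    time.sleep(10)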
Here is a simplified version of my script (with identifying information stripped, obviously):
import urllib2

# Register the proxy credentials with a password manager
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

# Route HTTP requests through the proxy and install the opener globally
proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})
opener = urllib2.build_opener(authinfo, proxy_support)
urllib2.install_opener(opener)

# Fetch the same page repeatedly; fails with HTTP 401 after a few iterations
for i in range(100):
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        f = urllib2.urlopen("http://www.iana.org/domains/example/")
        outfile.write(f.read())
Thanks in advance!
You can minimize the number of connections by using the keepalive handler from the urlgrabber module.
I am unsure that this will work correctly with your proxy setup. You may have to hack the keepalive module.
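Something along these lines — just a sketch, assuming urlgrabber's keepalive.HTTPHandler can be chained with the proxy and auth handlers you already build (I have not verified this against an authenticating proxy):

import urllib2
from urlgrabber.keepalive import HTTPHandler  # keepalive handler shipped with urlgrabber

# Same proxy/auth setup as in the question, plus the keepalive handler
passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
authinfo = urllib2.ProxyBasicAuthHandler(passmgr)
proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})

keepalive_handler = HTTPHandler()  # reuses the underlying HTTP connection between requests
opener = urllib2.build_opener(keepalive_handler, authinfo, proxy_support)
urllib2.install_opener(opener)

for i in range(100):
    f = urllib2.urlopen("http://www.iana.org/domains/example/")
    with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
        outfile.write(f.read())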