Problem: An unexplained ValueError("No tables found") is being raised intermittently when using pandas read_html in conjunction with a proxy-configuration to parse data from multiple webpages (Python 3.x).
Background: To access each webpage, http_url is used as the target address. After each iteration of the loop, the {team} parameter in http_url is updated to access the next webpage (32 total pages – same host domain):
for team in teams:
    http_url = f"http://www.footballguys.com/stats/game-logs-against/teams?team={team}&year={season_year}"
Problem Description: The target data from each webpage (http_url) is retrieved/parsed into a list of pandas DataFrames using the read_html method in one of two ways:
- Without Using a Proxy – The HTML is parsed directly from each webpage:
dataframe_list = pd.read_html(http_url)
Successful: This method always returns the list of DataFrames from each webpage; the loop completes after returning data from all 32 webpages.
- Using a Proxy: The HTML is parsed from the text of the returned GET response, wrapped in a file-like object using io.StringIO:
proxies = {
"http": "http://{}:{}@{}:{}".format(proxy_user, proxy_pass, proxy_host, proxy_port)
}
The proxies dict is built by formatting proxy_user, proxy_pass, proxy_host, and proxy_port (input strings for the proxy parameters of the same names) into a single proxy URL, and it is passed to the proxies= argument in each GET request.
source = requests.get(http_url, proxies=proxies, verify=False).text
dataframe_list = pd.read_html(io.StringIO(source))
Frequently Unsuccessful: This method frequently, and without explanation, raises ValueError("No tables found"). The error may be raised after the 2nd GET request or after the 29th; there is seemingly no pattern.
Additional Details: The results of five consecutive run-tests using the Option 2 - Proxy Method including the inspected response details from any returned failed requests:
| # Successful Requests Before ValueError | Details of Failed Response |
|---|---|
| 2 | MAX_THREADS_REACHED |
| 21 | Request failed from proxy-provider:Request failed. You will not be charged for this request... |
| 13 | http.client.RemoteDisconnected:... During handling of the above exception, another exception occurred: urllib3.exceptions.MaxRetryError:... During handling of the above exception, another exception occurred: requests.exceptions.ProxyError:... Note: I have abbreviated this traceback because it is not a ValueError, but I can post it in full if that would help. It looks like a blocked request, although being blocked would be strange since the residential proxy-provider uses custom headers. |
| 14 | MAX_THREADS_REACHED |
| All requests successful | - |
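
Two of the failed bodies listed above (MAX_THREADS_REACHED and the proxy-provider message) are plain text rather than HTML, and read_html raises ValueError("No tables found") whenever its input contains no table element. Below is a minimal sketch of how each response could be checked and retried before it is parsed. The helper name fetch_html_with_retry, its parameters, and the backoff values are hypothetical and not part of my original code, and I am assuming that retrying a failed request is acceptable to the proxy provider.

```python
import time
from typing import Callable


def fetch_html_with_retry(
    fetch: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> str:
    """Call fetch() until the returned body appears to contain an HTML table.

    fetch is any zero-argument callable that returns the response text.
    This helper is hypothetical; it is not part of requests or pandas.
    """
    last_problem = "no attempt made"
    for attempt in range(1, max_attempts + 1):
        try:
            body = fetch()
        except Exception as exc:  # e.g. requests.exceptions.ProxyError
            last_problem = f"{type(exc).__name__}: {exc}"
        else:
            if "<table" in body.lower():
                return body
            # Keep a short preview so bodies like MAX_THREADS_REACHED stay visible.
            last_problem = f"no <table> in body: {body[:200]!r}"
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"gave up after {max_attempts} attempts; last problem: {last_problem}"
    )


# Usage inside the loop from this question (requires requests, pandas, and io):
# source = fetch_html_with_retry(
#     lambda: requests.get(http_url, proxies=proxies, timeout=30).text
# )
# dataframe_list = pd.read_html(io.StringIO(source))
```

Passing the request as a callable keeps the retry logic independent of requests, so the same check can be exercised with canned strings. If every attempt fails, the RuntimeError carries the last body preview or exception instead of the less informative ValueError.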
Edit - I found the MAX_THREADS_REACHED response to be the most perplexing, so I reached out to my residential proxy-provider. Unfortunately, they confirmed that this response was not being returned from their API. I had hoped they might provide some insight, as I have not been able to find any documentation with those specific response details, and I am stumped as to what could be causing that error. For reference, my program is not multi-threaded and the proxy allows up to 5 concurrent requests, which I am not exceeding.