I'm trying to separate the different parts of a URL with Python's urlparse, but I seem to be getting the wrong values in the results.
import urlparse

baseline = runSql(conn, "SELECT url FROM malware_traffic WHERE tag = 'baseline';")
for i in baseline:
    print i[0]
    print urlparse.urlparse(i[0])
The runSql function just returns a list of URLs. I loop through them and try to split each URL from the baseline variable into its components, but the way Python parses them seems to be incorrect:
172.217.9.174:443/c2dm/register3
ParseResult(scheme='172.217.9.174', netloc='', path='443/c2dm/register3', params='', query='', fragment='')
connectivitycheck.gstatic.com:80/generate_204
ParseResult(scheme='connectivitycheck.gstatic.com', netloc='', path='80/generate_204', params='', query='', fragment='')
www.google.com:80/gen_204
ParseResult(scheme='www.google.com', netloc='', path='80/gen_204', params='', query='', fragment='')
172.217.9.174:443/auth/devicekey
ParseResult(scheme='172.217.9.174', netloc='', path='443/auth/devicekey', params='', query='', fragment='')
In the results you can clearly see that it is mixing up scheme and netloc as well as including the port in path.
For instance, the first result should be this:
ParseResult(scheme='', netloc='172.217.9.174:443', path='/c2dm/register3', params='', query='', fragment='')
I'm not sure why it's getting messed up.
I'm using practically the same thing as one of the examples in the documentation here: https://docs.python.org/2/library/urlparse.html
So what am I doing wrong or is it a bug?
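The problem is that your URLs don't have a scheme (the http:// part), so Python treats everything before the first colon (172.217.9.174) as the scheme. urlparse only recognizes a netloc when it is introduced by //. With http:// prefixed, everything works as expected: the first URL parses to scheme='http', netloc='172.217.9.174:443', path='/c2dm/register3'.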
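Here is a minimal sketch of that fix. The helper name parse_with_default_scheme and its default_scheme parameter are made up for illustration and are not part of urlparse. It assumes that any URL without "://" is missing its scheme, which is a simplification: a scheme-less URL whose query string happens to contain "://" would be misdetected. The import is wrapped so the same snippet runs on Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def parse_with_default_scheme(url, default_scheme="http"):
    """Hypothetical helper: prepend a scheme when the URL has none.

    Assumption: a URL without "://" is treated as scheme-less.
    """
    if "://" not in url:
        url = default_scheme + "://" + url
    return urlparse(url)


result = parse_with_default_scheme("172.217.9.174:443/c2dm/register3")
print(result.scheme)    # http
print(result.netloc)    # 172.217.9.174:443
print(result.path)      # /c2dm/register3
print(result.hostname)  # 172.217.9.174
print(result.port)      # 443

# If you want an empty scheme, as in the expected output in the question,
# prefixing just "//" is enough for urlparse to recognize the netloc.
no_scheme = urlparse("//" + "172.217.9.174:443/c2dm/register3")
print(no_scheme.scheme == "")  # True
print(no_scheme.netloc)        # 172.217.9.174:443
```

Which prefix to use depends on what you do with the result afterwards: "http://" gives you a complete URL you can pass on to other tools, while "//" keeps the scheme empty if you only care about host, port and path.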