Python - urlparse a link


I want to add a scheme to URLs that do not already have one.

import urlparse

p = urlparse.urlparse(url)
print p
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
scheme = p.scheme or 'http'
p = urlparse.ParseResult(scheme, netloc, path, *p[3:])
url = p.geturl()
print url

The code above works well when the URL has no port number. When a port number is present, it produces unexpected output. For example:

Input:  go.com:8000/3/
Output: go.com://8000/3/

The same happens with localhost. What approach should I follow in this case?


There are 2 answers

Ghostranger On

If the URL has a port number but no scheme, it must start with //. urlparse recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

Compare the following three samples and observe the difference.

1) In this first sample I added // so that the parser identifies what follows as the netloc rather than the scheme, with the path coming after it.

urlparse.urlparse('//go.com:8000/3/')
ParseResult(scheme='', netloc='go.com:8000', path='/3/', params='', query='', fragment='')

2) In this sample there is no scheme, no leading //, and no port number, so the entire URL is treated as the path.

urlparse.urlparse('go.com/3/')
ParseResult(scheme='', netloc='', path='go.com/3/', params='', query='', fragment='')

3) In this sample I did specify the port, but not the scheme or the leading //. Because a scheme is normally followed by ://, the parser treats everything before the : as the scheme and everything after it as the path.

urlparse.urlparse('go.com:8000/3/')
ParseResult(scheme='go.com', netloc='', path='8000/3/', params='', query='', fragment='')

This is how urlparse parses the URL. To get the scheme handling to work, check whether the URL contains ://; if it does not, explicitly prepend // to the URL before parsing it.
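The check described above could be sketched roughly as follows. The function name add_default_scheme and the default of 'http' are my own choices for illustration, not something from the question, and the import is wrapped so the snippet runs on Python 2 (urlparse) as well as Python 3 (urllib.parse). It assumes every input is meant to start with a host, so it is not suitable for genuinely relative paths.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used in the question


def add_default_scheme(url, default_scheme='http'):
    """Hypothetical helper: return url with a scheme, adding one if missing."""
    if '://' not in url:
        # Prepend '//' so the parser reads host[:port] as the netloc
        # instead of mistaking the host for a scheme.
        url = '//' + url.lstrip('/')
    # The scheme argument is only used when the URL itself has none.
    parsed = urlparse(url, scheme=default_scheme)
    return parsed.geturl()


print(add_default_scheme('go.com:8000/3/'))
print(add_default_scheme('localhost:8000'))
print(add_default_scheme('https://go.com/3/'))
```

Note that a plain '://' test can be fooled by a URL that has no scheme but carries another URL inside its query string, so treat this as a starting point rather than a complete solution.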

For more detail, see the docs: https://docs.python.org/2/library/urlparse.html

Szabolcs On

According to the docs, the netloc has to be properly introduced for it to be parsed correctly. So try adding // at the beginning of the URL if it is not already an absolute URL, like so:

urlparse.urlparse('//go.com:8000/3')
ParseResult(scheme='', netloc='go.com:8000', path='/3', params='', query='', fragment='')

This way it correctly identifies each part of the url. Also please see the docs: https://docs.python.org/2/library/urlparse.html#urlparse.urlparse
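Once the netloc is parsed correctly, filling in a missing scheme is a small step. The following is only a sketch: the 'http' fallback is an assumption taken from the question's code, and the import is wrapped so it runs on both Python 2 and Python 3. It relies on ParseResult being a namedtuple, which is why _replace is available.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used in the question

parts = urlparse('//go.com:8000/3')

# _replace returns a copy of the namedtuple with the given field changed;
# keep an existing scheme and fall back to 'http' only when it is empty.
with_scheme = parts._replace(scheme=parts.scheme or 'http')
full_url = with_scheme.geturl()

print(full_url)
# hostname and port are derived from the netloc.
print(with_scheme.hostname)
print(with_scheme.port)
```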