How to fetch final destination URL after redirects?


I am trying to validate the URLs stored in my model using a Django custom management command. I have the following model, and I need to find out whether any of the URLs are inactive (HTTP 404 error).

class Association(models.Model):
    name = models.CharField(max_length=25, blank=True, null=False)
    publication_doi_url = models.TextField(blank=True)

Some URLs go through multiple redirects, so I wrote a function to fetch the final URL. It works for most URLs, but not all. For example, for https://doi.org/10.1603/EC11207 the function reports https://academic.oup.com/jee/article-lookup/doi/10.1603/EC11207 as the final URL. However, requesting that URL returns HTTP status code 302, which means there is one more redirect. How can I get the final URL? I assume the journal grants access based on IP address; the site doesn't require a username or password. Any pointers would be helpful.


def return_final_url(url_link):
    response = requests.get(url_link)
    finalurl = ''
    if response.history:
        for resp in response.history:
            pass
        finalurl = response.url
    return finalurl


class Command(BaseCommand):
    help = 'Prints inactive urls (HTTP 404 error)'

    def handle(self, *args, **kwargs):
        for item in Association.objects.all():
            base_url = "https://doi.org/"
            url = base_url + item.publication
            finalurl = return_final_url(url)
            print("finalurl", finalurl)
            response = requests.get(finalurl)
            try:
                response.raise_for_status()
            except requests.exceptions.HTTPError:
                print("HTTPError")


There is 1 answer

Panos Angelopoulos

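First, have a look at the "Redirection and History" section of the requests quickstart, https://docs.python-requests.org/en/master/user/quickstart/#redirection-and-history, which explains how redirects are handled. By default, requests.get follows redirects: r.url holds the final URL, and r.history lists the intermediate responses.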

>>> import requests
>>> r = requests.get('http://original.url/')

>>> r.url
'https://redirected.url/'

>>> r.history
[<Response [301]>]
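
To see every hop rather than only the last one, you can walk r.history in order. Here is a minimal sketch of that idea; redirect_chain is a hypothetical helper name (it is not part of requests), and it only reads the history, url, and status_code attributes of a response you already have.

```python
import requests


def redirect_chain(response):
    """Return a list of (status_code, url) pairs, one per hop.

    `redirect_chain` is an illustrative helper, not a requests API.
    """
    # response.history holds the intermediate redirect responses, oldest first.
    hops = [(r.status_code, r.url) for r in response.history]
    # The response itself is the last hop that requests reached.
    hops.append((response.status_code, response.url))
    return hops


# Possible usage (needs network access; output depends on the live site):
#   resp = requests.get("https://doi.org/10.1603/EC11207", timeout=10)
#   for status, url in redirect_chain(resp):
#       print(status, url)
```

If the last pair still shows a 3xx status, the chain stopped before reaching a non-redirect response, and the printed hops show where.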

Also note the following sentence from the same documentation page:

If a request exceeds the configured number of maximum redirections, a TooManyRedirects exception is raised.

By default, that limit is 30 redirects.
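
The redirect limit can be changed per session through Session.max_redirects, and the exception can be caught so that one bad URL does not stop the whole run. Below is a rough sketch under those assumptions; resolve_final_url, its parameters, and the (None, None) return value are my own illustrative choices, not something taken from the question or from requests itself.

```python
import requests


def resolve_final_url(url, session=None, max_redirects=30, timeout=10):
    """Follow redirects and return (final_url, status_code).

    Returns (None, None) when the redirect limit is exceeded.
    Hypothetical helper for illustration only.
    """
    if session is None:
        session = requests.Session()
    # Session.max_redirects controls how many redirects are followed
    # before requests raises TooManyRedirects (the default is 30).
    session.max_redirects = max_redirects
    try:
        response = session.get(url, allow_redirects=True, timeout=timeout)
    except requests.exceptions.TooManyRedirects:
        return None, None
    return response.url, response.status_code
```

In the management command you could then call resolve_final_url once per URL and report the ones whose status code is 404, instead of issuing a second requests.get on the final URL.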