I am trying to check the URLs stored in my model using a Django custom management command. I have the following model, and I need to find out whether any of the URLs are inactive (HTTP 404 error).
from django.db import models

class Association(models.Model):
    name = models.CharField(max_length=25, blank=True, null=False)
    publication_doi_url = models.TextField(blank=True)
Some URLs have multiple redirects, so I wrote a function to fetch the final URL. It works for most URLs, with a few exceptions. For example, for the URL https://doi.org/10.1603/EC11207 the function reports https://academic.oup.com/jee/article-lookup/doi/10.1603/EC11207 as the final URL. However, requesting that URL returns HTTP status code 302, so there is one more redirect. How can I get the final URL? I assume the journal allows access based on IP address. The site doesn't require a username or password. Any pointers will be helpful.
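import requests

def return_final_url(url_link):
    # requests follows redirects by default; response.url is the last URL reached
    response = requests.get(url_link)
    finalurl = ''
    if response.history:
        # response.history holds the intermediate redirect responses;
        # nothing is done with them here
        for resp in response.history:
            pass
        finalurl = response.url
    return finalurl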
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Prints inactive urls (HTTP 404 error)'

    def handle(self, *args, **kwargs):
        for item in Association.objects.all():
            base_url = "https://doi.org/"
            # The model above defines publication_doi_url (there is no
            # "publication" field); this assumes it stores the bare DOI.
            url = base_url + item.publication_doi_url
            finalurl = return_final_url(url)
            print("finalurl", finalurl)
            response = requests.get(finalurl)
            try:
                response.raise_for_status()
            except requests.exceptions.HTTPError:
                print("HTTPError")
First, take a look at the "Redirection and History" section of the Requests quickstart, https://docs.python-requests.org/en/master/user/quickstart/#redirection-and-history, which explains how redirects are handled.
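Please also take a look at the paragraph on exceptions in the same guide: if a request exceeds the configured maximum number of redirects, Requests raises a TooManyRedirects exception. By default, that limit is 30 redirects.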
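To make the redirect logic from that chapter concrete, here is a minimal sketch, not a definitive fix for the OUP case. The helper names (redirect_chain, is_unresolved_redirect, check_url) and the timeout value are my own assumptions; response.history, response.url, Session.max_redirects and requests.exceptions.TooManyRedirects are part of Requests. It lists every hop and flags the case where the last response is still a 3xx.

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def is_unresolved_redirect(response):
    """True when the response is itself a 3xx, i.e. a redirect that was not followed."""
    return 300 <= response.status_code < 400


def check_url(url, max_redirects=30, timeout=10):
    """Hypothetical helper: fetch url and return its redirect chain,
    or None if the redirect limit is exceeded."""
    # Third-party dependency, imported here so the two helpers above
    # can be used on any response-like object.
    import requests

    with requests.Session() as session:
        session.max_redirects = max_redirects  # Requests' default is 30
        try:
            response = session.get(url, timeout=timeout)
        except requests.exceptions.TooManyRedirects:
            return None
    return redirect_chain(response)
```

Calling check_url("https://doi.org/10.1603/EC11207") should show each intermediate status code and URL, which makes it easier to see where the chain stops. If the last entry is still a 302, the server either sent a redirect that Requests could not follow or expects something else (for example cookies or browser-like headers), which is a guess on my part and would need checking against the actual responses.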