We are exploring the sitemaps of several websites, starting from their robots.txt files. We often find that the sitemap does not cover the whole website. In many cases only a few pages are listed, say for some years, while other years are missing. However, if the URL patterns of the listed pages are repeated for the missing years, those pages turn out to exist.
For example: https://www.republicworld.com/sitemap.xml. Here there is a sitemap link for every page from the last 2 months, but the links before that are missing. However, replacing 2023 with 2021 in any of the listed URLs shows that the data for 2021 exists as well; it is just not linked from the main sitemap page. One can keep doing that back to 2018, before which the website seems to have no more data, at least in sitemaps of that pattern.
Or this one: https://zeenews.india.com/robots.txt. It contains only one sitemap link, for the year 2019, but replacing the year in that link with another one such as 2020 or 2018 also works fine.
The patterns can vary endlessly. Sometimes the sitemap only goes back to 2020-Jan, and the earlier entries can be found only by substituting 2019-Dec, 2019-Nov, and so on.
Is there a standard way to extrapolate such patterns and check whether the pages exist before calling it a day for the sitemap of a website? Perhaps some NLP extrapolation tool (the patterns can really be anything), or a sitemap tool?
I am not aware of a "standard" way.
One possible approach would be "date range extrapolation":
The `url_pattern` is defined based on the assumed URL structure of the website.
The `start_date` and `end_date` are defined to specify the date range for which URLs will be generated.
A URL is constructed for each generated date by substituting the year and month values into the `url_pattern`, with the `rrule` function from `dateutil` used to generate a date for each month within the specified range. The `requests.head` method is used to send an HTTP HEAD request to each generated URL to check whether it exists without downloading the entire page content. If the HTTP status code is 200, the URL exists; otherwise, it is assumed that the URL does not exist.
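The steps described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: to stay free of extra packages it swaps `rrule` from `dateutil` and `requests.head` for standard-library equivalents (a plain month loop and `urllib.request`). The function names `month_range`, `candidate_urls`, `url_exists` and `probe`, the `{year}`/`{month}` placeholders, and the `sitemap-probe` User-Agent string are all assumptions made for this example.

```python
from datetime import date
from urllib.request import Request, urlopen


def month_range(start, end):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def candidate_urls(url_pattern, start, end):
    """Fill the {year} and {month} placeholders for each month in the range."""
    return [
        url_pattern.format(year=d.year, month=d.month)
        for d in month_range(start, end)
    ]


def url_exists(url, timeout=10):
    """Send a HEAD request and report whether the final status code is 200."""
    request = Request(url, method="HEAD", headers={"User-Agent": "sitemap-probe"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTPError (4xx/5xx), URLError and socket timeouts all derive from
        # OSError; any of them is treated as "does not exist".
        return False


def probe(url_pattern, start, end, check=url_exists):
    """Map every candidate URL to True/False according to the check function."""
    return {url: check(url) for url in candidate_urls(url_pattern, start, end)}
```

To use it, pass a pattern copied from a real sitemap entry with the date parts replaced by placeholders, for instance the hypothetical `probe("https://www.example.com/sitemap-{year}-{month:02d}.xml", date(2018, 1, 1), date(2023, 12, 1))`. Some servers reject HEAD requests or answer 200 for missing pages, so a hit is a hint worth verifying rather than proof, and it is polite to throttle the requests.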