Extrapolate missing sitemap links


We are exploring the sitemaps of several websites, discovered via their robots.txt files. We are finding that the sitemap often does not contain a complete map of the website: in many cases only a few pages are listed for certain years and the other years are missing, yet if the URL pattern of the listed pages is repeated for those missing years, the pages turn out to actually exist.

Take this one: https://www.republicworld.com/sitemap.xml. It has a sitemap link for every page from the last 2 months, but the links before that are missing. However, replacing 2023 with 2021 in any of the URLs present shows that the data for 2021 exists as well; it is simply not listed on the main sitemap page. One can keep doing that back to 2018, before which the website seems to have no more data, at least in sitemaps of that pattern.

Or this one: https://zeenews.india.com/robots.txt. It contains only one sitemap link, for the year 2019, but substituting other years such as 2020 or 2018 works fine as well.

The patterns can vary endlessly: sometimes a sitemap may only go back to 2020-Jan, and only by substituting 2019-Dec, 2019-Nov, and so on can one find the older pages.
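
For example, the substitution we are doing by hand amounts to something like this (a rough sketch; the URL and year range are only placeholders):

import requests

# Take one URL that is listed in the sitemap and swap out the year component
# to probe for older pages that the sitemap does not list (placeholder URL)
known_url = "https://www.example.com/sitemap-2023-10.xml"
for year in range(2018, 2023):
    candidate = known_url.replace("2023", str(year))
    response = requests.head(candidate, timeout=10)
    print(candidate, response.status_code)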

Is there a standard way to extrapolate such patterns and check whether the extra pages exist before calling it a day for a website's sitemap? Some NLP-based extrapolation tool (the patterns can really be anything), or a sitemap tool?


There are 2 answers

VonC (BEST ANSWER)

Is there some standard way one can extrapolate such patterns to check if they exist before calling it a day for the sitemap of a website?

I am not aware of a "standard" way.

One possible approach would be a "date range extrapolation":

  • If the pattern is date-based, as in your examples, a script could be written to generate URLs for all possible dates within a certain range.
  • Libraries such as Python's dateutil can help generate those date ranges, as in the sketch below.

import requests
from dateutil.rrule import rrule, MONTHLY
from datetime import datetime
import time
from reppy.robots import Robots

# Define the base URL pattern
url_pattern = "https://www.example.com/archive/{year}-{month}.html"

# Define the date range
start_date = datetime(2018, 1, 1)
end_date = datetime(2023, 10, 1)

# Parse the robots.txt file
robots = Robots.fetch('https://www.example.com/robots.txt')

def check_url(url):
    """Check if a URL exists, with retry logic."""
    retries = 3  # Number of retries
    delay = 5    # Delay between retries in seconds
    for attempt in range(retries):
        try:
            # HEAD avoids downloading the body; note that some servers answer
            # HEAD with 405, in which case a GET fallback may be needed
            response = requests.head(url, timeout=10)  # 10-second timeout
            if response.status_code == 200:
                print(f"URL exists: {url}")
            else:
                print(f"URL does not exist: {url} (status {response.status_code})")
            return  # Exit the function once a response was received
        except requests.RequestException as e:
            print(f"An error occurred: {e}. Retrying in {delay} seconds...")
            time.sleep(delay)  # Wait before retrying
    print(f"Giving up on: {url}")  # All retries failed

# Generate URLs for each month within the date range
for date in rrule(MONTHLY, dtstart=start_date, until=end_date):
    year, month = date.year, date.strftime('%m')  # Zero-padded month
    url = url_pattern.format(year=year, month=month)
    
    # Check robots.txt for allowance
    if robots.allowed(url, '*'):  # Assuming a generic user agent
        check_url(url)
    else:
        print(f"URL disallowed by robots.txt: {url}")

    time.sleep(1)  # Respectful crawling by waiting between requests

The url_pattern is defined based on the assumed URL structure of the website.
The start_date and end_date are defined to specify the date range for which URLs will be generated.

A URL is constructed for each generated date by substituting the year and month values into the url_pattern, with the rrule function from dateutil used to generate a date for each month within the specified range. The requests.head method sends an HTTP HEAD request to each generated URL to check whether it exists without downloading the entire page content. If the HTTP status code is 200, the URL is taken to exist; otherwise, it is assumed that the URL does not exist.
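
If you also want to start from the sitemap URLs declared in robots.txt (as in your zeenews example) rather than from a hand-written pattern, a minimal sketch could look like this; the helper name, the year regex, and the cutoff year are assumptions, not part of any library:

import re
import requests

# Hypothetical helper: read the Sitemap: lines from a robots.txt file and
# propose candidate sitemap URLs by substituting earlier years into any URL
# that contains a four-digit year (the cutoff year is an assumption).
def candidate_sitemaps(robots_url, back_to_year=2018):
    text = requests.get(robots_url, timeout=10).text
    declared = re.findall(r"(?im)^sitemap:\s*(\S+)", text)
    candidates = set(declared)
    for url in declared:
        match = re.search(r"20\d{2}", url)
        if match:
            latest = int(match.group(0))
            for year in range(back_to_year, latest):
                candidates.add(url.replace(match.group(0), str(year)))
    return sorted(candidates)

# Each candidate could then be verified with check_url() as above, for example:
# for url in candidate_sitemaps("https://zeenews.india.com/robots.txt"):
#     check_url(url)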

Zebulun Crenshaw

You could crawl the pages of the site itself and collect the links that point back to the same website. Here is a basic example, sketched in Python with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

hostname = "https://www.example.com"  # site being explored (placeholder)

# Fetch the home page and queue every link that points back to the same site
page = requests.get(hostname + "/", timeout=10).text
soup = BeautifulSoup(page, "html.parser")
tosearch = []
for link in soup.find_all("a", href=True):
    if hostname in link["href"]:
        tosearch.append(link["href"])