Is there a way to get the site map of a domain?


As part of an assignment, I am trying to get all the links and subdomains of a website. For example, "www.stanford.edu" would return a hash containing "www.stanford.edu/admissions", "www.stanford.edu/academics", "cs.stanford.edu", etc.

I found ways to do this with the Mechanize and Spidr gems, as exemplified in "Create dynamic sitemap from URL with Ruby on Rails" and "How can I get all links of a website using the Mechanize gem?".

However, with these gems I can only build a site map by following all the links on the page, visiting those pages, and then following the links on the child pages as well, until I have the full map. This is very inefficient and slow, and most pages also contain links, such as ads, that are not part of the domain, so these unrelated pages end up in the site map array/hash as well.
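For context, here is roughly the Spidr approach I am using. Spidr.site only follows links on the start URL's host, which at least keeps off-site ad links out of the result; the commented hosts: option for also walking subdomains is my reading of Spidr's filtering options, so treat this as a minimal sketch:

```ruby
require 'spidr'

urls = []

# Spidr.site restricts the crawl to the start URL's host, so
# off-site links such as ads are never followed.
Spidr.site('https://www.stanford.edu/') do |spider|
  spider.every_url { |url| urls << url.to_s }
end

# Assumption: to also walk subdomains (cs.stanford.edu etc.), Spidr's
# hosts: filter should work with Spidr.start_at, e.g.:
# Spidr.start_at('https://www.stanford.edu/',
#                hosts: [/(\A|\.)stanford\.edu\z/]) { |s| ... }

puts urls
```

Even so, this still visits every page one by one, which is what I am hoping to avoid.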

Is there a way to get the site map of a domain directly, without crawling every page? I am open to non-Ruby solutions as well.


1 Answer

Answered by sawa:

I don't think it is possible other than by following the links (although that can be automated using Mechanize). A server can create a dynamic page and serve it under an arbitrary subdomain, and you cannot obtain that information other than by asking the server. In fact, you cannot discover all subdomains even by following all the links.
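To illustrate the automated link-following, here is a minimal sketch with Mechanize; the site_map helper, the limit: cap, and the end_with? domain check are my own illustrative choices, not part of Mechanize's API:

```ruby
require 'mechanize'
require 'set'
require 'uri'

# Breadth-first crawl that only queues links whose host ends with
# the target domain, so ads and other off-site links are dropped.
def site_map(start_url, domain, limit: 200)
  agent = Mechanize.new
  seen  = Set.new([start_url])
  queue = [start_url]

  until queue.empty? || seen.size >= limit
    page = begin
      agent.get(queue.shift)
    rescue StandardError
      next # skip URLs that fail to load
    end
    next unless page.is_a?(Mechanize::Page) # skip images, PDFs, etc.

    page.links.each do |link|
      next if link.href.to_s.empty?
      begin
        uri = URI.join(page.uri, link.href)
      rescue URI::Error
        next # skip malformed hrefs
      end
      next unless uri.is_a?(URI::HTTP) # http or https only
      next unless uri.host.to_s.end_with?(domain)
      uri.fragment = nil
      url = uri.to_s
      unless seen.include?(url)
        seen << url
        queue << url
      end
    end
  end

  seen
end

puts site_map('https://www.stanford.edu/', 'stanford.edu').to_a
```

In practice you would also want to respect robots.txt and rate-limit the requests, but the underlying point stands: the only general way to enumerate a site's pages is to ask the server for them, page by page.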