How can I mirror the results of MOSS plagiarism detection?


MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the comparison results. As you can see, the structure is very simple: an index HTML file listing the pairs of suspected files, with links to separate HTML files showing each comparison.

The results are kept on the MOSS website for two weeks. I would like to download all the results to my computer, so that I can view them later. I use this command on Linux:

wget -mkEpnp http://moss.stanford.edu/results/5/7683916027631/index.html
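
For reference, -mkEpnp is shorthand for --mirror --convert-links --html-extension --page-requisites --no-parent.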

What I get is only the index.html file. The other files that are linked from index.html, e.g. match0.html and match1.html, are not downloaded.

I tried to mirror the same website with a different tool, WebHTTrack, but got exactly the same result: only the index file is mirrored, not the match files.

The HTML looks very simple, so I cannot figure out why the mirroring does not work. What can I do to correctly mirror the results?

P.S. In case it is relevant, the robots.txt file contains the following:

User-agent: *
Disallow: /
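
You can check this yourself with wget, e.g.:

wget -qO- http://moss.stanford.edu/robots.txt

where -q suppresses progress output and -O - writes the file to standard output.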

There are 2 answers

Perdikopanis Nikos (BEST ANSWER)

You need to tell wget to ignore the robots.txt file, e.g.:

wget -r -l 1 -e robots=off http://moss.stanford.edu/results/1/XXXXXXXXXX/
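
The -l 1 option limits recursion to one level, which is enough here because every match page is linked directly from index.html. If you also want the links in the saved pages to work offline, you can add wget's --convert-links and --page-requisites options (see the answer below).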

Mastergalen

Here is a command that correctly scrapes all the .html pages from the MOSS results:

wget --recursive --no-clobber --page-requisites \
  --html-extension --convert-links \
  --restrict-file-names=windows \
  --domains moss.stanford.edu \
  --no-parent \
  -e robots=off \
  http://moss.stanford.edu/results/1/XXXXXXXXXX/

What the options mean:

  • --recursive: download the entire web site.
  • --domains moss.stanford.edu: don't follow links outside moss.stanford.edu.
  • --no-parent: don't follow links outside the directory hierarchy.
  • --page-requisites: get all assets (images, CSS, etc.) needed to display the page offline.
  • --html-extension: save files with the .html extension.
  • --convert-links: convert links so that they work locally, offline.
  • --restrict-file-names=windows: modify filenames so that they also work on Windows.
  • --no-clobber: don't overwrite any existing files (useful in case the download is interrupted and resumed).
  • -e robots=off: ignore the robots.txt file, allowing the pages to be scraped.
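
If you run this regularly, a minimal wrapper script that takes the results URL as its argument might look like the sketch below (the name mirror-moss.sh is just an illustration):

#!/bin/sh
# Mirror a MOSS results page for offline viewing.
# Usage: ./mirror-moss.sh http://moss.stanford.edu/results/1/XXXXXXXXXX/
set -eu

URL="${1:?usage: $0 <moss-results-url>}"

wget --recursive --no-clobber --page-requisites \
     --html-extension --convert-links \
     --restrict-file-names=windows \
     --domains moss.stanford.edu \
     --no-parent \
     -e robots=off \
     "$URL"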