How can I mirror the results of MOSS plagiarism detection?


MOSS is a well-known server for checking software plagiarism. It allows teachers to send homework submissions, calculates the similarity between different submissions, and colors code blocks that are very similar. Here is an example of the comparison results. As you can see, the structure is very simple: an index HTML file listing the pairs of suspected files, with links to separate HTML files showing each comparison.

The results are kept on the MOSS website for two weeks. I would like to download all the results to my computer, so that I can view them later. I use this command on Linux:

wget -mkEpnp http://moss.stanford.edu/results/5/7683916027631/index.html
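
For reference, -mkEpnp is shorthand for --mirror --convert-links --html-extension --page-requisites --no-parent.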

What I get is only the index.html file. The other files that are linked from index.html, e.g. match0.html and match1.html, are not downloaded.

I tried to mirror the same website with a different tool, WebHTTrack, but got exactly the same result: only the index file is mirrored, not the match files.

The HTML looks very simple, so I cannot figure out why the mirroring does not work. What can I do to correctly mirror the results?

P.S. In case it is relevant, the robots.txt file contains the following:

User-agent: *
Disallow: /
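
You can check this yourself with wget, e.g.:

wget -qO- http://moss.stanford.edu/robots.txt

where -q suppresses progress output and -O - writes the file to standard output.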

There are 2 answers

Perdikopanis Nikos (BEST ANSWER)

You need to tell wget to ignore the robots.txt file, e.g.:

wget -r -l 1 -e robots=off http://moss.stanford.edu/results/1/XXXXXXXXXX/
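
The -l 1 option limits recursion to one level, which is enough here because every match page is linked directly from index.html. If you also want the links in the saved pages to work offline, you can add wget's --convert-links and --page-requisites options (see the answer below).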

Mastergalen

Here is a command that correctly scrapes all the .html pages from the MOSS results:

wget --recursive --no-clobber --page-requisites \
  --html-extension --convert-links \
  --restrict-file-names=windows \
  --domains moss.stanford.edu \
  --no-parent \
  -e robots=off \
  http://moss.stanford.edu/results/1/XXXXXXXXXX/

What the options mean:

  • --recursive: download the entire web site.
  • --domains moss.stanford.edu: don't follow links outside moss.stanford.edu.
  • --no-parent: don't follow links outside the directory hierarchy.
  • --page-requisites: get all assets (images, CSS, etc.) needed to display the page offline.
  • --html-extension: save files with the .html extension.
  • --convert-links: convert links so that they work locally, offline.
  • --restrict-file-names=windows: modify filenames so that they also work on Windows.
  • --no-clobber: don't overwrite any existing files (useful in case the download is interrupted and resumed).
  • -e robots=off: ignore the robots.txt file, allowing the pages to be scraped.
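
If you run this regularly, a minimal wrapper script that takes the results URL as its argument might look like the sketch below (the name mirror-moss.sh is just an illustration):

#!/bin/sh
# Mirror a MOSS results page for offline viewing.
# Usage: ./mirror-moss.sh http://moss.stanford.edu/results/1/XXXXXXXXXX/
set -eu

URL="${1:?usage: $0 <moss-results-url>}"

wget --recursive --no-clobber --page-requisites \
     --html-extension --convert-links \
     --restrict-file-names=windows \
     --domains moss.stanford.edu \
     --no-parent \
     -e robots=off \
     "$URL"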