Can Git detect if two source files are essentially copies of each other?


Sorry if this is off-topic, but here is your chance to reduce the amount of "homework" questions on this site :-)

I'm teaching a class of C programming where the students work on a small library of numeric routines in C. This year, the source files from several groups of students had significant amounts of code duplication in them.

(Down to identically misspelled printf debug statements. I mean, how dumb can you be.)

I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.

Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable/function names.

Is there a way I can use Git to detect significant and literal code duplication, a.k.a. plagiarism? Or is there some other tool you could recommend for that?


There are 5 answers

0
Mankarse On

Why use git at all? A simple but effective technique would be to compare the sizes of the diffs between all of the different submissions, and then to manually inspect and compare those with the smallest differences.
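As a rough sketch of that idea, the snippet below ranks every pair of submissions by the number of lines that differ between them, smallest first. It uses Python's standard difflib rather than the diff command, and the function names and sample file contents are made up for illustration.

```python
import difflib
from itertools import combinations


def diff_size(a_lines, b_lines):
    """Count the lines that would be removed or added in a line-based diff."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    size = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            size += (i2 - i1) + (j2 - j1)
    return size


def rank_pairs(submissions):
    """Take a dict of name -> source text; return (size, name_a, name_b) tuples, smallest diff first."""
    results = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(submissions.items()), 2):
        size = diff_size(text_a.splitlines(), text_b.splitlines())
        results.append((size, name_a, name_b))
    return sorted(results)


# Hypothetical submissions, inlined so the example is self-contained.
submissions = {
    "alice.c": "int add(int a, int b)\n{\n    return a + b;\n}\n",
    "bob.c": "int add(int a, int b)\n{\n    printf(\"debgu\\n\");\n    return a + b;\n}\n",
    "carol.c": "double mean(double *v, int n)\n{\n    double s = 0;\n"
               "    for (int i = 0; i < n; i++) s += v[i];\n    return s / n;\n}\n",
}

ranking = rank_pairs(submissions)
for size, name_a, name_b in ranking:
    print(size, name_a, name_b)
```

The pairs at the top of the list are the ones worth inspecting by hand. Note that very short files produce small diffs even when unrelated, so the raw size is only a hint.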

2
Blender On

You could use diff and check whether the two files seem similar:

diff -iEZbwB -U 0 file1.cpp file2.cpp

Those options tell diff to ignore case, whitespace changes, and blank lines, and to produce a unified (git-like) diff with no context lines. Try it out on two samples.

0
Ravi On

Moss is a tool that was developed by a Stanford CS prof. I think they use it there as well. It's like diff for source code.

0
Brooks Moses On

Adding to the other answers, you could use diff, but I don't think the results will be that useful by themselves. What you want is the number of lines that match, not counting blank lines, and to get that automatically you need to do a fair bit of magic with wc -l and grep: compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff included as matching. And even then you'll miss some cases where diff decided that identical lines didn't match because of different things inserted before them.
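For what it's worth, the same count can be sketched without wc and grep. The snippet below is only an illustration using Python's standard difflib (not the diff command). It drops blank lines, collapses whitespace, and counts the lines that an in-order comparison considers matching. The helper names and the two sample files are hypothetical.

```python
import difflib


def normalize(text):
    """Collapse whitespace runs on each line and drop lines that end up empty."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def matching_lines(text_a, text_b):
    """Number of non-blank lines that match, in order, between the two texts."""
    a, b = normalize(text_a), normalize(text_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def match_ratio(text_a, text_b):
    """Matching lines as a fraction of the shorter file's non-blank line count."""
    shorter = min(len(normalize(text_a)), len(normalize(text_b)))
    return matching_lines(text_a, text_b) / shorter if shorter else 0.0


# Two hypothetical files that differ only in blank lines and spacing.
file_a = "int f(void)\n{\n\n    return 1;\n}\n"
file_b = "int f(void)\n{\n  return   1;\n\n\n}\n"

print(matching_lines(file_a, file_b), match_ratio(file_a, file_b))
```

Because the comparison is still order-based, it shares the limitation mentioned above: identical lines that were moved around may not be counted as matching.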

A much better option is one of the suggestions listed in https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or in https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, though the answers seem to duplicate).

0
Sylvain Leroux On

Using diff is absolutely not a good idea unless you want to venture into the realm of combinatorial hell, because every submission has to be compared with every other one:

  • If you have 2 submissions, you have to perform 1 diff to check for plagiarism,
  • If you have 3 submissions, you have to perform 3 diffs to check for plagiarism,
  • If you have 4 submissions, you have to perform 6 diffs to check for plagiarism,
  • ...
  • If you have n submissions, you have to perform n(n-1)/2 diffs!

On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams (substrings of length k) of each document. The fingerprint is in fact a hash used to classify documents, and possible plagiarism is detected when two documents end up being sorted in the same bucket.
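To make the fingerprint idea more concrete, here is a toy sketch. It is not Moss's actual implementation. The values of k and the window size, the use of CRC32 as the hash, the "keep the smallest hash in each window" selection rule, and the set-overlap score are all assumptions chosen for illustration.

```python
import zlib


def fingerprints(text, k=5, window=4):
    """Hash every k-gram of the whitespace-stripped text and keep the minimum of each window."""
    cleaned = "".join(text.lower().split())  # ignore case and all whitespace
    hashes = [
        zlib.crc32(cleaned[i:i + k].encode("utf-8"))
        for i in range(len(cleaned) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(text_a, text_b, k=5, window=4):
    """Share of selected fingerprints that the two documents have in common (0.0 to 1.0)."""
    fa = fingerprints(text_a, k, window)
    fb = fingerprints(text_b, k, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


# Hypothetical snippets: the second is the first one reformatted, the third is unrelated.
original = "int add(int a, int b) { return a + b; }"
reformatted = "int add(int a,int b)\n{\n    return a+b;\n}\n"
unrelated = "double mean(double *v, int n) { double s = 0; return s / n; }"

print(similarity(original, reformatted), similarity(original, unrelated))
```

Each document is fingerprinted once, and the fingerprints can then be stored in an index keyed by hash, so candidate pairs show up as hash collisions instead of requiring a full diff of every pair.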