Find plagiarism in bulk articles


I have a collection of 20,000 master articles, and I receive about 400,000 articles of one or two pages each every day. I want to check whether each of these 400k articles is a copy or a modified version of any article in my master collection (a plagiarism threshold above 60% is fine for me). What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner? Thanks.


There is 1 answer

Elliptical view On BEST ANSWER

Fingerprint the articles (i.e. hash them intelligently, for example based on word frequency), then look for a statistical similarity between the fingerprints. Where the fingerprints hint at a match for some pairs of articles, run a brute-force search for matching strings on just those pairs.
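A minimal sketch of that two-stage idea in Python, under some assumptions of mine rather than anything stated in the answer: the fingerprint here is a MinHash signature over 3-word shingles (a common stand-in for "intelligent hashing", not necessarily the word-frequency hash the answer has in mind), and the brute-force step is an exact shingle-overlap check. The names `minhash`, `containment`, `check_article`, and the parameter values (64 hashes, 0.3 candidate cutoff, 0.6 confirmation cutoff) are all hypothetical.

```python
import hashlib
import re

NUM_HASHES = 64    # assumed signature length; longer is more accurate but slower
SHINGLE_SIZE = 3   # assumed number of words per shingle


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of overlapping k-word shingles in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text, num_hashes=NUM_HASHES):
    """Fingerprint: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    if not sh:
        return None
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def containment(article, master):
    """Exact check: share of the article's shingles that occur in the master."""
    a, m = shingles(article), shingles(master)
    return len(a & m) / len(a) if a else 0.0


def check_article(article, masters, master_sigs,
                  candidate_threshold=0.3, confirm_threshold=0.6):
    """Stage 1: cheap fingerprint comparison. Stage 2: exact check on hits."""
    sig = minhash(article)
    if sig is None:
        return []
    hits = []
    for master_id, master_sig in master_sigs.items():
        if estimated_similarity(sig, master_sig) >= candidate_threshold:
            score = containment(article, masters[master_id])
            if score >= confirm_threshold:
                hits.append((master_id, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

The master signatures would be computed once, e.g. `master_sigs = {mid: minhash(text) for mid, text in masters.items()}`, and reused for every incoming article. Note that the linear scan in `check_article` still means 400,000 × 20,000 = 8 billion signature comparisons per day, so at this scale you would probably want to bucket the signatures (locality-sensitive hashing) so that each article is only compared against a handful of masters.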