Detect repetition in text string / copied text


I have an input form where users can upload a test report with a minimum length of 100 words. Some users write fewer words than this and simply copy what they wrote until the 100-word threshold is met.

I would like to test (ideally via PHP) whether a text string contains repeated text, i.e. whether subsets of the string have been copied. I was thinking of running a Fourier analysis on the text, which could reveal repetitions inside the string. Does a PHP class or regex example exist for this purpose?

Some sample text:

blabla bla. this is some text now I am getting bored. this is some text now I am getting bored. this is some text now I am getting bored. this is some text now I am getting bored. this is some text now I am getting bored. some stuff in the end.
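
For verbatim, back-to-back copies like the sample above, a regular expression with a backreference is one possible check. The following is only a minimal sketch in Python (the same pattern syntax is available in PCRE, so it should carry over to PHP's preg functions); the function name `find_repeated_block` and the 20-character minimum are assumptions made for illustration, not part of the question.

```python
import re

def find_repeated_block(text, min_len=20):
    """Return the first block of at least min_len characters that is
    immediately followed by a copy of itself, or None if there is none."""
    # (.{min_len,}?) lazily captures a candidate block; \1+ then requires one
    # or more verbatim copies directly after it. DOTALL lets '.' match newlines.
    pattern = re.compile(r"(.{%d,}?)\1+" % min_len, re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else None

# The sample text from the question, rebuilt from its parts.
sample = ("blabla bla. "
          + "this is some text now I am getting bored. " * 5
          + "some stuff in the end.")

print(find_repeated_block(sample))
```

This only catches copies that follow each other directly and character for character. Copies separated by other text, or copies with small edits, are not matched, and the pattern can become slow on very long inputs.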

Update: My proposal for solving this is as follows:

1) Map the string to an array of integers, i.e. find a numeric representation for every character. So the sample above would become

numerics = array ( 2, 5, 1, 2, 5, 1, ...);

2) Apply a Fourier transform to this array to get the "character frequency spectrum"

FT = fft (numerics);

This detects regular patterns in the character space. For example, one could use this class to compute the FFT.

3) Detect peaks of the function FT. Measure the relative height of the peaks, compared to the noise in the background.

4) Set a threshold for the peaks. If any peak is above this threshold, report that the text contains regular patterns. For example, a sentence repeated several times should clearly produce a high peak at a certain frequency.
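
The four steps above can be sketched roughly as follows. This is a Python illustration under several assumptions: characters are mapped to integers with `ord`, a naive O(N²) discrete Fourier transform stands in for a real FFT routine, and the "background noise" is taken to be the median magnitude. The names `magnitude_spectrum` and `strongest_peak` are hypothetical, and any threshold on the ratio would have to be calibrated on real reports.

```python
import cmath
import statistics

def magnitude_spectrum(text):
    """Steps 1-2: map characters to integers and return the DFT magnitudes
    for frequency bins 1 .. N//2 (bin 0, the mean, is skipped).
    Assumes the text has at least two characters."""
    values = [ord(ch) for ch in text]
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]  # remove the constant offset
    n = len(centered)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff))
    return spectrum

def strongest_peak(text):
    """Steps 3-4: return (bin, ratio), where ratio compares the highest
    peak with the median magnitude of the spectrum."""
    spectrum = magnitude_spectrum(text)
    peak = max(spectrum)
    background = statistics.median(spectrum)
    ratio = peak / background if background > 0 else float("inf")
    return spectrum.index(peak) + 1, ratio

# A perfectly periodic toy input: period 4, length 32.
peak_bin, ratio = strongest_peak("abcd" * 8)
print(peak_bin, len("abcd" * 8) / peak_bin)
```

A peak at bin k corresponds to a period of roughly len(text) / k characters. For real text the repeated part rarely fills the whole string or divides its length evenly, so the peaks are smeared and the ratio is much less clear-cut than in the toy input.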

As this approach would be quite straightforward in data analytics, I suspect it has been coded before. That is why I am asking here: does anybody know whether such an algorithm already exists as open source?

Of course, alternative solutions or proposals for solving this problem would be appreciated.

There is 1 answer

Daniel

There is no existing function or library that detects repeating strings in the way you would like. You can break the problem down into an algorithm that starts with one word, then two words, etc., but this will be a lot of work.
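
A rough sketch of that word-by-word idea, written in Python for brevity, could count how many word n-grams occur more than once. The function name `repeated_ngram_share` and the default of 5 words per n-gram are assumptions for illustration only; to follow the "one word, then two words" progression, the function could be called with increasing values of n.

```python
from collections import Counter

def repeated_ngram_share(text, n=5):
    """Return the fraction of word n-grams in text that occur more than once."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text is shorter than n words
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

sample = ("blabla bla. "
          + "this is some text now I am getting bored. " * 5
          + "some stuff in the end.")

print(repeated_ngram_share(sample))
```

A share close to 1 means that almost every run of n words appears elsewhere in the text. The cut-off above which a report is rejected is a judgment call and would need testing against genuine reports.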

Your customers will then start copying non-repeating sentences, and you'll have another problem that you cannot solve.

You have to manage your testers, for example by having options to penalize them for invalid entries.