Matching Differences between two documents

167 views Asked by At

i have a set of strings along with their co-ordinates and rectangular bounds int two similar pages. these strings are different in three possible ways. (i) a string can be moved to a new location on a page. (ii) a string is in the same location but it is modified. say ( abc --> abd or ABC) (iii) a combination of (i) and (ii). (iv) a string might be missing.

i tried to use locality sensitive hashing but couldn't find a good hash function for this. Can anyone please suggest me a good hash function or another algorithm to solve this problem. thanks in advance

1

There are 1 answers

5
Andy Jones On

So we have a list of source strings S and a list of target strings T of size at most |S|. We want find a way to assign each string in T to a distinct string in S such that the total number of mismatched characters is minimized

(Note that because we're looking for a way to match T to S, the case where a string in S is missing is dealt with implicitly)

If this is an accurate interpretation of your problem @programer8, I believe this is an assignment problem and can be solved by the Hungarian algorithm in cubic time: the "workers" referred to in the wiki article are your target strings, the "tasks" are the source strings, and the number of mismatched characters between a source and a target string is the cost of a worker performing a task.

The only hiccup is you have fewer workers/target strings than tasks/source strings, but you can remedy that by adding dummy workers.