i have a set of strings along with their co-ordinates and rectangular bounds int two similar pages. these strings are different in three possible ways. (i) a string can be moved to a new location on a page. (ii) a string is in the same location but it is modified. say ( abc --> abd or ABC) (iii) a combination of (i) and (ii). (iv) a string might be missing.
i tried to use locality sensitive hashing but couldn't find a good hash function for this. Can anyone please suggest me a good hash function or another algorithm to solve this problem. thanks in advance
So we have a list of source strings
S
and a list of target stringsT
of size at most|S|
. We want find a way to assign each string inT
to a distinct string inS
such that the total number of mismatched characters is minimized(Note that because we're looking for a way to match
T
toS
, the case where a string inS
is missing is dealt with implicitly)If this is an accurate interpretation of your problem @programer8, I believe this is an assignment problem and can be solved by the Hungarian algorithm in cubic time: the "workers" referred to in the wiki article are your target strings, the "tasks" are the source strings, and the number of mismatched characters between a source and a target string is the cost of a worker performing a task.
The only hiccup is you have fewer workers/target strings than tasks/source strings, but you can remedy that by adding dummy workers.