At a very high level this is similar to the nearest neighbor search problem.
From Wikipedia: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".
But there are some significant differences. Specifics:
- Each point is described by k variables.
- The variables are not all numerical. The data types are mixed: string, int, etc.
- The full set of possible values for each variable is not known, but the values come from reasonably small sets.
- In the data set to search, there will be multiple points with the same values for all k variables.
- Another way to look at this is there will be many duplicate points.
- For each point, let's call the number of duplicates its frequency.
- Given a query point q, I need to find the nearest neighbor p such that the frequency of p is at least 15.
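
To make the requirement concrete, here is a minimal brute-force sketch of what I am after. It assumes a simple mismatch count (how many of the k variables differ) as the distance, which is only a placeholder since I have not settled on a metric for mixed data. The function names and the sample data are made up for illustration.

```python
from collections import Counter

MIN_FREQUENCY = 15  # threshold from the requirement above


def mismatch_distance(a, b):
    """Count the variables on which two points differ (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def nearest_frequent_neighbor(points, q, min_frequency=MIN_FREQUENCY):
    """Return (point, frequency) for the closest point with enough duplicates.

    points: iterable of k-tuples (mixed types allowed, must be hashable)
    q: query k-tuple
    Returns None when no point reaches min_frequency.
    """
    frequency = Counter(points)  # collapse duplicates into counts
    candidates = [p for p, n in frequency.items() if n >= min_frequency]
    if not candidates:
        return None
    best = min(candidates, key=lambda p: mismatch_distance(p, q))
    return best, frequency[best]


# Made-up data: three distinct points with frequencies 20, 3 and 15.
data = [("red", 1, "x")] * 20 + [("red", 2, "x")] * 3 + [("blue", 2, "y")] * 15
print(nearest_frequent_neighbor(data, ("red", 2, "x")))
# prints (('red', 1, 'x'), 20): the exact match has frequency 3, so it is skipped
```

This scans every distinct point per query, which is only workable while the number of distinct points stays small. It is meant to pin down the expected behavior, not to be the algorithm I end up using.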
There seems to be a wide range of algorithms around NNS, statistical classification, and best bin match.
I am getting a little lost in all the variations. Is there already a standard algorithm I can use, or would I need to modify one?