Searching for closest statistically significant match in k-dimensional set

20 views Asked by ajit parthan At 22 September 2018 at 13:54

At a very high level this is similar to the nearest neighbor search problem.

From Wikipedia: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".

But there are some significant differences. Specifics:

Each point is described by k variables.
The variables are not all numerical. The data types are mixed: string, int, etc.
The possible values of each variable are not all known in advance, but they come from reasonably small sets.
In the data set to search, there will be multiple points with the same values for all k variables.
Another way to look at this is that there will be many duplicate points.
For each point, let's call the number of duplicates its frequency.
Given a query point q, I need to find the nearest neighbor p such that the frequency of p is at least 15.
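
To make the requirement concrete, a brute-force version of what I am after might look like the sketch below. It is only an illustration under assumptions of mine: the distance is a plain count of mismatching variables (Hamming distance), since I have not settled on a metric for mixed data, and the names `mismatch_distance` and `nearest_frequent_point` are hypothetical.

```python
from collections import Counter

MIN_FREQUENCY = 15  # threshold from the requirement above


def mismatch_distance(p, q):
    """Count the variables on which two points differ (Hamming distance).

    Assumption: no metric is specified for the mixed data types, so every
    variable is treated as categorical and compared for equality only.
    """
    return sum(a != b for a, b in zip(p, q))


def nearest_frequent_point(points, q, min_frequency=MIN_FREQUENCY):
    """Return (point, frequency, distance) for the point closest to q whose
    duplicate count is at least min_frequency, or None if there is none."""
    frequency = Counter(points)  # points must be hashable, e.g. tuples
    candidates = [(p, n) for p, n in frequency.items() if n >= min_frequency]
    if not candidates:
        return None
    p, n = min(candidates, key=lambda item: mismatch_distance(item[0], q))
    return p, n, mismatch_distance(p, q)


# Toy data: the exact match ("red", 2, "M") occurs only 3 times, so it is skipped.
points = (
    [("red", 1, "S")] * 20
    + [("red", 2, "M")] * 3
    + [("blue", 7, "L")] * 15
)
print(nearest_frequent_point(points, ("red", 2, "M")))
# → (('red', 1, 'S'), 20, 2)
```

This scans every distinct point on each query, which may be acceptable while the number of distinct points is small. Ties are broken arbitrarily by order of first appearance.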

There seems to be a wide range of algorithms around NNS, statistical classification, and best bin match.

I am getting a little lost in all the variations. Is there already a standard algorithm I can use, or would I need to modify one?
