Searching for closest statistically significant match in k-dimensional set

At a very high level this is similar to the nearest neighbor search problem.

From wiki: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".

There are some significant differences, though. Specifics:

  • Each point is described by k variables.
  • The variables are not all numerical; the data types are mixed (string, int, etc.).
  • Not all possible values of every variable are known in advance, but they come from reasonably small sets.
  • The data set to search will contain multiple points with the same values for all k variables.
  • Put another way, there will be many duplicate points.
  • For each point, call the number of duplicates its frequency.
  • Given a query point q, I need to find the nearest neighbor p such that the frequency of p is at least 15.
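
To make the requirement concrete, here is a minimal brute-force sketch in Python. It is only a baseline under stated assumptions, not an established algorithm for this problem: points are assumed to be hashable tuples of k values, and the distance is assumed to be a simple mismatch count (the number of variables on which two points differ), since the question does not define a metric for mixed data types. The names `mismatch_distance` and `nearest_frequent_point` are hypothetical.

```python
from collections import Counter

MIN_FREQUENCY = 15  # threshold taken from the requirement above


def mismatch_distance(a, b):
    """Assumed metric: number of variables on which two points differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest_frequent_point(points, query, min_frequency=MIN_FREQUENCY,
                           distance=mismatch_distance):
    """Return (point, frequency, distance) for the closest distinct point
    that occurs at least min_frequency times, or None if none qualifies."""
    # Collapse duplicate points into a frequency table.
    frequencies = Counter(points)
    # Keep only the distinct points that are frequent enough.
    candidates = [(p, n) for p, n in frequencies.items() if n >= min_frequency]
    if not candidates:
        return None
    # Linear scan over the surviving distinct points.
    best, count = min(candidates, key=lambda item: distance(item[0], query))
    return best, count, distance(best, query)


# Hypothetical data: the exact match ("red", 2, "x") occurs only 3 times,
# so it is skipped in favor of a point with frequency >= 15.
data = [("red", 1, "x")] * 20 + [("red", 2, "x")] * 3 + [("blue", 2, "y")] * 16
result = nearest_frequent_point(data, ("red", 2, "x"))
```

Because the variables come from reasonably small sets, the number of distinct points after collapsing duplicates may be small enough that this linear scan is acceptable. A different metric can be passed through the `distance` parameter if mismatch counting is too coarse.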

There seems to be a wide range of algorithms around NNS, statistical classification, and best bin match.

I am getting a little lost in all the variations. Is there already a standard algorithm I can use, or would I need to modify one?
