Finding a vector that is approximately equally distant from all vectors in a set


I have a set of 3 million vectors (300 dimensions each), and I'm looking for a new point in this 300-dimensional space that is approximately equally distant from all the points (vectors) in the set.

What I could do is initialize a random vector v and run an optimization over v with the objective

$\min_v \sum_{x < y} \left(d_{vx} - d_{vy}\right)^2$

where $d_{xy}$ is the distance between vector x and vector y, but this would be very computationally expensive.

I'm looking for an approximate solution vector for this problem that can be found quickly over very large sets of vectors. (Or any libraries that will do something like this for me- any language)


There are 2 answers

josliber (best answer)

I agree that in general this is a pretty tough optimization problem, especially at the scale you're describing. Each objective function evaluation requires O(nm + n^2) work for n points of dimension m -- O(nm) to compute distances from each point to the new point and O(n^2) to compute the objective given the distances. This is pretty scary when m=300 and n=3M. Thus even one function evaluation is probably intractable, not to mention solving the full optimization problem.
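
Concretely (my arithmetic, for scale): with n = 3e6 and m = 300, a single evaluation means roughly nm = 9e8 distance computations and n(n-1)/2 ≈ 4.5e12 pairwise terms in the objective sum.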

One approach that has been mentioned in the other answer is to take the centroid of the points, which can be computed efficiently -- O(nm). A downside of this approach is that it could do terribly at the proposed objective. For instance, consider a situation in 1-dimensional space with 3 million points with value 1 and 1 point with value 0. By inspection, the optimal solution is v=0.5 with objective value 0 (it's equidistant from every point), but the centroid will select v=1 (well, a tiny bit smaller than that) with objective value 3 million.
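
A quick numerical check of this example at a reduced scale (1,000 points instead of 3 million; objective_1d is my own helper, not code from this answer):

def objective_1d(v, pts):
    # Sum over all pairs of points of the squared difference
    # between their distances to the candidate value v.
    dists = [abs(v - p) for p in pts]
    return sum((dists[i] - dists[j]) ** 2
               for i in range(len(dists))
               for j in range(i + 1, len(dists)))

pts = [1.0] * 1000 + [0.0]
print(objective_1d(0.5, pts))                  # 0.0: equidistant from every point
print(objective_1d(sum(pts) / len(pts), pts))  # ~996: one bad pair per point at 1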

An approach that I think will do better than the centroid is to optimize each dimension separately (ignoring the existence of the other dimensions). While the objective function is still expensive to compute in this case, a bit of algebra shows that the derivative of the objective is quite easy to compute. It is the sum, over all pairs (i, j) where i < v and j > v, of the value 4*((v-i)+(v-j)). Remember we're optimizing a single dimension, so the points i and j are 1-dimensional, as is v. For each dimension we can therefore sort the data (O(n lg n)) and then compute the derivative for a value v in O(n) time using a binary search and basic algebra. We can then use scipy.optimize.newton to find the zero of the derivative, which will be the optimal value for that dimension. Iterating over all dimensions, we'll have an approximate solution to our problem.
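
The algebra behind that derivative (my own filling-in of the step being glossed over): in one dimension the objective is $\sum_{i<j} (|v-i| - |v-j|)^2$. A pair with both points on the same side of v contributes the constant $(j-i)^2$, which has zero derivative. A straddling pair with $i < v < j$ contributes

$(|v-i| - |v-j|)^2 = ((v-i) - (j-v))^2 = ((v-i) + (v-j))^2,$

and differentiating with respect to v gives $2((v-i)+(v-j)) \cdot 2 = 4((v-i)+(v-j))$, which is exactly the summand above.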

First consider the proposed approach versus the centroid method in a simple setting, with 1-dimensional data points {0, 3, 3}:

import bisect
import scipy.optimize

def fulldist(x, data):
    # Full objective: Euclidean distance from x to every point, then the
    # sum over all pairs of points of the squared difference in distances.
    dists = [sum((x[i] - d[i]) ** 2 for i in range(len(x))) ** 0.5 for d in data]
    obj = 0.0
    for i in range(len(data) - 1):
        for j in range(i + 1, len(data)):
            obj += (dists[i] - dists[j]) * (dists[i] - dists[j])
    return obj

def f1p(x, d):
    # Derivative of the 1-dimensional objective at x, for sorted data d.
    # Only pairs straddling x contribute; binary search locates the split.
    lownum = bisect.bisect_left(d, x)  # number of points below x
    highnum = len(d) - lownum          # number of points at or above x
    lowsum = highnum * (x * lownum - sum(d[i] for i in range(lownum)))
    highsum = lownum * (x * highnum - sum(d[i] for i in range(lownum, len(d))))
    return 4.0 * (lowsum + highsum)

data = [(0.0,), (3.0,), (3.0,)]
opt = []
centroid = []
for dim in range(len(data[0])):
    thisdim = [x[dim] for x in data]
    meanval = sum(thisdim) / len(thisdim)
    centroid.append(meanval)
    thisdim.sort()
    # Zero of the derivative = optimum for this dimension; start from the mean.
    opt.append(scipy.optimize.newton(f1p, meanval, args=(thisdim,)))
print("Proposed", opt, "objective", fulldist(opt, data))
# Proposed [1.5] objective 0.0
print("Centroid", centroid, "objective", fulldist(centroid, data))
# Centroid [2.0] objective 2.0

The proposed approach finds the exact optimal solution, while the centroid method misses by a bit.

Consider a slightly larger example with 1,000 points of dimension 300, where each coordinate is drawn from a Gaussian mixture: with probability 0.1 it is normally distributed with mean 0 and variance 1, and with probability 0.9 it is normally distributed with mean 100 and variance 1:

import random

# 1,000 points, 300 dimensions; each coordinate comes from the
# two-component Gaussian mixture described above.
data = []
for n in range(1000):
    d = []
    for m in range(300):
        if random.random() <= 0.1:
            d.append(random.normalvariate(0.0, 1.0))
        else:
            d.append(random.normalvariate(100.0, 1.0))
    data.append(d)

The resulting objective values were 1.1e6 for the proposed approach and 1.6e9 for the centroid approach, meaning the proposed approach decreased the objective by more than 99.9%. Obviously the differences in the objective value are heavily affected by the distribution of the points.

Finally, to test the scaling (removing the final objective value calculations, since they're in general intractable), I get the following scaling with m=300: 0.9 seconds for 1,000 points, 7.1 seconds for 10,000 points, and 122.3 seconds for 100,000 points. Therefore I expect this should take about 1-2 hours for your full dataset with 3 million points.
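
A sketch of the sort of timing harness that produces numbers like these (my own, not the answer's; it assumes f1p and its imports from the snippet above are in scope, and uses random placeholder data, so exact timings will differ):

import random
import time
import scipy.optimize

n, m = 10000, 300
data = [[random.random() for _ in range(m)] for _ in range(n)]

start = time.time()
opt = []
for dim in range(m):
    thisdim = sorted(x[dim] for x in data)  # O(n lg n) per dimension
    meanval = sum(thisdim) / n              # centroid as starting guess
    opt.append(scipy.optimize.newton(f1p, meanval, args=(thisdim,)))
print(time.time() - start, "seconds for", n, "points")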

EelkeSpaak

From this question on Math StackExchange:

There is no point that is equidistant from 4 or more points in general position in the plane, or n+2 points in n dimensions.

Criteria for representing a collection of points by one point are considered in statistics, machine learning, and computer science. The centroid is the optimal choice in the least-squares sense, but there are many other possibilities.

The centroid is the point C in the plane for which the sum of squared distances $\sum |CP_i|^2$ is minimum. One could also optimize a different measure of centrality, or insist that the representative be one of the points (such as a graph-theoretic center of a weighted spanning tree), or assign weights to the points in some fashion and take the centroid of those.

Note, specifically, "the centroid is the optimal choice in the least-squares sense", so the optimal solution to your cost function (which is a least-squares cost) is simply to average all the coordinates of your points (which will give you the centroid).
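
If you take this route, the centroid is cheap to compute even at the scale in the question. A minimal sketch with NumPy (the array shape and placeholder data are my assumptions):

import numpy as np

# n points of dimension m stored as an (n, m) array; a smaller random
# placeholder here, but the question's 3M x 300 set works the same way.
points = np.random.rand(100000, 300)

centroid = points.mean(axis=0)  # coordinate-wise mean, shape (300,)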