In the paper "When Is 'Nearest Neighbor' Meaningful?" we read that, "We show that under certain broad conditions (in terms of data and query distributions, or workload), as dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor. In other words, the contrast in distances to different data points becomes nonexistent. The conditions we have identified in which this happens are much broader than the independent and identically distributed (IID) dimensions assumption that other work assumes."
My question is how I should generate a dataset that resembles this effect? I have created three points each with 1000 dimensions with random numbers ranging from 0-255 for each dimension but points create different distances and do not reproduce what is mentioned above. It seems changing dimensions (e.g. 10 or 100 or 1000 dimensions) and ranges (e.g. [0,1]) do not change anything. I still get different distances which should not be any problem for e.g. clustering algorithms!
I think the paper is right. First, your test: One problem with your test may be that you are using too few points. I used 10000 point and below are my results (evenly distributed point in [0.0 ... 1.0] in all dimensions). For DIM=2, min/max differ almost by a factor of 1000, for DIM=1000, they only differ by a factor of 1.6, for DIM=10000 by 1.248 . So I'd say these results confirm the paper's hypothesis.
I guess the explanation is as follows: Lets take three randomly generated vectors, A, B and C. The total distance is based on the sum of the distances of each individual row of these vectors. The more dimensions the vectors have, the more the total sum of differences will approach a common average. In other word, it is highly unlikely that a vector C has in all elements a larger distance to A than another vector B has to A. With increasing dimensions, C and B will have increasingly similar distance to A (and to each other).
My test dataset was created as follow. The dataset is essentially a cube ranging from 0.0 to 1.0 in every dimension. The coordinates were created with uniform distribution in every dimension between 0.0 and 1.0. Example code (N=10000, DIM=[2..10000]):
Following the equation given at the bottom of the accepted answer here, we get:
d=2 -> 98460
d=10 -> 142.3
d=100 -> 1.84
d=1,000 -> 0.618
d=10,000 -> 0.247
d=100,000 -> 0.0506 (using N=1000)