Uniform Distribution: Bug or Paradox

186 views Asked by At

Imagine 10 cars randomly, uniformly distributed on a round track of length 1. If the positions are represented by a C double in the range [0,1> then they can be sorted and the gaps between the cars should be the position of the car in front minus the position of the car behind. The last gap needs 1 added to account for the discontinuity.

In the program output, the last column has very different statistics and distribution from the others. The rows correctly add to 1. What's going on?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int compare (const void * a, const void * b)
{
  if (*(double*)a > *(double*)b) return 1;
  else if (*(double*)a < *(double*)b) return -1;
  else return 0;
}

double grand_f_0_1(){
  static FILE * fp = NULL;
  uint64_t bits;

  if(fp == NULL) fp = fopen("/dev/urandom", "r");
  fread(&bits, sizeof(bits), 1, fp);
  return (double)bits * 5.421010862427522170037264004349e-020; // https://stackoverflow.com/a/26867455
}

int main()
{
  const int n = 10;
  double values[n];
  double diffs[n];
  int i, j;
  
  for(j=0; j<10000; j++) {
  
    for(i=0; i<n; i++) values[i] = grand_f_0_1();

    qsort(values, n, sizeof(double), compare);
    
    for(i=0; i<(n-1); i++) diffs[i] = values[i+1] - values[i];
    diffs[n-1] = 1. + values[0] - values[n-1];

    for(i=0; i<n; i++) printf("%.5f%s", diffs[i], i<(n-1)?"\t":"\n");

  }
  
  return(0);
}

Here is a sample of the output. The first column represents the gap between the first and second car. The last column represents the gap between 10th car and the first car, across the start/finish line. Large numbers like .33 and .51 are much more common in the last column and very small numbers are relatively rare.

0.13906 0.14241 0.24139 0.29450 0.01387 0.07906 0.02905 0.03160 0.00945 0.01962
0.01826 0.36875 0.04377 0.05016 0.05939 0.02388 0.10363 0.04640 0.03538 0.25037
0.04496 0.05036 0.00536 0.03645 0.13741 0.00538 0.24632 0.04452 0.07750 0.35176
0.00271 0.15540 0.03399 0.05654 0.00815 0.01700 0.24275 0.25494 0.00206 0.22647
0.34420 0.03226 0.01573 0.08597 0.05616 0.00450 0.05940 0.09492 0.05545 0.25141
0.18968 0.34749 0.07375 0.01481 0.01027 0.00669 0.04306 0.00279 0.08349 0.22796
0.16135 0.02824 0.07965 0.11255 0.05570 0.05550 0.05575 0.05586 0.07156 0.32385
0.12799 0.18870 0.04153 0.16590 0.02079 0.06612 0.08455 0.14696 0.13088 0.02659
0.00810 0.06335 0.13014 0.06803 0.01878 0.10119 0.00199 0.06656 0.20922 0.33263
0.00715 0.03261 0.05779 0.47221 0.13998 0.11044 0.06397 0.00238 0.04157 0.07190
0.33703 0.02945 0.06164 0.01555 0.03444 0.14547 0.02342 0.03804 0.16088 0.15407
0.10912 0.14419 0.04340 0.09204 0.23033 0.09240 0.14530 0.00960 0.03412 0.09950
0.20165 0.09222 0.04268 0.17820 0.19159 0.02074 0.05634 0.00237 0.09559 0.11863
0.09296 0.01148 0.20442 0.07070 0.05221 0.04591 0.08455 0.25799 0.01417 0.16561
0.08846 0.07075 0.03732 0.11721 0.03095 0.24329 0.06630 0.06655 0.08060 0.19857
0.06225 0.10971 0.10978 0.01369 0.13479 0.17539 0.17540 0.02690 0.00464 0.18744
0.09431 0.10851 0.05079 0.07846 0.00162 0.00463 0.06533 0.18752 0.30896 0.09986
0.23214 0.11937 0.10215 0.04040 0.02876 0.00979 0.02443 0.21859 0.15627 0.06811
0.04522 0.07920 0.02432 0.01949 0.03837 0.10967 0.11123 0.01490 0.03846 0.51915
0.13486 0.02961 0.00818 0.11947 0.17204 0.08967 0.09767 0.03349 0.08077 0.23426
3

There are 3 answers

7
tstanisl On

Your code is ok. The mean value of the last difference it two times larger than the others.

The paradox comes from the fact is that rather than selecting 10 points on a unit interval one actually tries to divide it into 11 sub-intervals with 10 cuts.Therefore the expected length of each sub-interval is 1/11.

The difference between consecutive points is approaching 1/11 except the last pair because it contains the last sub-interval (between the last point and point 1) and the first one (between point 0 and the first point).

Thus the mean of the last difference is 2/11.

2
wojand On

"There is no special points on the circle"

The thing is that, on a circle, one car looks always the same, and so there is no need to relate to zero: you can just relate to the first car. This means that you can fix the first car at zero, and treat the random positions of the other cars as related to it (measured from it).

And so, the convenient solution is to fix the first car at zero and think of the 9 numbers you still generate as positions related to the first one.

Hope it's a satisfying answer :-)

IDENTITY (or Which diff is first?)

If 10 cars with labels ("1","2" and so on) are placed randomly on a circle, the difference from the "1" to the next will average 1/10. While sorting, the first diff "loses it's identity", what it is changes: it's similar to that if you chose the 1st diff to be the longest one, it would average more. Choosing it based on relation of cars to zero skews (or, in nicer terms: changes) things in a similar manner.

The first difference (2nd, 3rd etc.) just becomes something different – defining it as a difference from a given car is more intuitive, and gives an option to use it as a reference (playing nicely with circle symmetry); the distribution of the rest of cars with respect to it is a uniform one. Dealing with the smallest of random points is not that simple.

Summary: define what you're calculating, know your definitions and probability is non-intuitive

0
APA On

After 3 months of puzzling over this, I have an explanation that is intuitive, at least to me. This is cumulative to the answers provided by @wojand and @tstanisl.

My original code is correct: it uniformly distributes points on the interval, and the forward differences of all points have the same statistical distribution. The paradox is that the forward difference of the highest-value point, the one that crosses the 0-1 discontinuity, is on average twice the others and it's distribution has a different shape.

The reason this forward difference has a different distribution is that it contains the value 0. Larger forward differences (gaps) are more likely to contain any fixed value, simply because they are larger.

We could search for the gap that contains 1/pi, for example, and it too would have the same atypical distribution.