I wanted to program bucket sort for a normal distribution (instead of the uniform distributed generally assumed by bucket sort). When I looked into the math, I came up with the following:
def bucket_sort(numbers):
mean = sum(numbers) / len(numbers)
variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
std_deviation = math.sqrt(variance)
num_buckets = int( math.sqrt(len(numbers)) )
buckets = [[] for _ in range(num_buckets)]
for num in numbers:
index = int( ( (num - mean) / std_deviation) * num_buckets)
buckets[index].append(num)
for bucket in buckets:
bucket.sort()
sorted_numbers = [num for bucket in buckets for num in bucket]
return sorted_numbers
For simplicity I assume, that numbers only consists of positive integers. Is this adaption of bucket sort correct? I was not able to find good literature for that. Additionally I assume, that this version can be speed up, by using a smaller sample to compute mean and variance. Is this assumption also correct?
I got an answer. The problem is, that the original attempt kind of normalises the values, without doing anything further.
On a mathematical / stochastic level we need the cumulative distribution function CDF. The gaussian distribution defines a probability density function PDF and we can convert that to a uniform distribution using the CDF, which is the integral over the PDF and creates the uniform distributed output, because of a phenomena called probability integral function. Here is the resulting code:
Note: The CDF for the normal distribution is related to the error function, which can be approximated, or as I do, be computed by a math package. The CDF outputs values between 0 and 1 (because it is the integral over a probability distribution). We therefore need a rescale to bucket indices, which is done using the size of the buckets.