Bucket Sort for normal distribution

195 views Asked by At

I wanted to program bucket sort for a normal distribution (instead of the uniform distributed generally assumed by bucket sort). When I looked into the math, I came up with the following:

def bucket_sort(numbers):
    mean = sum(numbers) / len(numbers)
    variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
    std_deviation = math.sqrt(variance)

    num_buckets = int( math.sqrt(len(numbers)) )
    buckets = [[] for _ in range(num_buckets)]

    for num in numbers:
        index = int( ( (num - mean) / std_deviation) * num_buckets)
        buckets[index].append(num)

    for bucket in buckets:
        bucket.sort()

    sorted_numbers = [num for bucket in buckets for num in bucket]

    return sorted_numbers

For simplicity I assume, that numbers only consists of positive integers. Is this adaption of bucket sort correct? I was not able to find good literature for that. Additionally I assume, that this version can be speed up, by using a smaller sample to compute mean and variance. Is this assumption also correct?

1

There are 1 answers

0
Titanlord On

I got an answer. The problem is, that the original attempt kind of normalises the values, without doing anything further.

On a mathematical / stochastic level we need the cumulative distribution function CDF. The gaussian distribution defines a probability density function PDF and we can convert that to a uniform distribution using the CDF, which is the integral over the PDF and creates the uniform distributed output, because of a phenomena called probability integral function. Here is the resulting code:

import math
import scipy.stats

def bucket_sort(numbers, mean, deviation):
    # Just recompute here to see what happens
    mean = sum(numbers) / len(numbers)
    variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
    deviation = math.sqrt(variance)

    num_buckets = int( math.sqrt(len(numbers)) ) 
    buckets = [[] for _ in range(num_buckets)]

    for num in numbers:
        # index = scipy.stats.norm.cdf(num, mean, deviation)
        tmp = (num - mean)/(deviation*math.sqrt(2))
        index = 0.5 *(1+math.erf(tmp))
        index = int(index * num_buckets)

        # Append rescaled value to bucket
        buckets[index].append(num)

    for bucket in buckets:
        bucket.sort()

    sorted_numbers = [num for bucket in buckets for num in bucket]

    return sorted_numbers

# example call
mean = 50
std = 20
input = scipy.stats.norm.rvs(50,20,100)
input = [int(i) for i in input]
print(bucket_sort(input,mean, std))

Note: The CDF for the normal distribution is related to the error function, which can be approximated, or as I do, be computed by a math package. The CDF outputs values between 0 and 1 (because it is the integral over a probability distribution). We therefore need a rescale to bucket indices, which is done using the size of the buckets.