Efficient sampling from a 'partial' binomial distribution

I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I need to sample a value from the binomial distribution given that it lies in [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = (nCx)*(p)^x*(1-p)^(n-x) as

total = 0
for i in range(a, b + 1):
    total += bin(i)

f(x) = bin(x) / total

One way to sample from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as the pmf calculation would quickly become very time-consuming.

The values of n, x, a, b are quite large in my case, and computing the pmf this way and then inverting it with a uniform random variable seems extremely inefficient due to the factorial terms in nCx.

What's a nice/efficient way to achieve this?

There are 2 answers

mathfux (BEST ANSWER)

This is a way to collect all the values of bin in a pretty short time:

from scipy.special import comb
import numpy as np

def distribution(n, p=0.5):
    # vectorized binomial pmf for x = 0..n
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)

This runs in about a quarter of a microsecond for n=1000.

Sample run:

>>> distribution(4)
array([0.0625, 0.25  , 0.375 , 0.25  , 0.0625])

You can sum specific parts of this array like so:

>>> np.sum(distribution(4)[2:4])
0.625
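To actually draw samples from the truncated distribution, one option (my sketch, not part of the answer; the function name and parameters are illustrative) is to renormalize the slice of the pmf over [a, b] and feed it to `np.random.Generator.choice`:

```python
import numpy as np
from scipy.stats import binom

def sample_truncated_binom(n, p, a, b, size=1, rng=None):
    """Sample from B(n, p) conditioned on the value lying in [a, b]."""
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(a, b + 1)      # the allowed outcomes
    pmf = binom.pmf(support, n, p)     # unnormalized weights on [a, b]
    pmf /= pmf.sum()                   # renormalize so the weights sum to 1
    return rng.choice(support, size=size, p=pmf)

samples = sample_truncated_binom(100, 0.5, 60, 100, size=5)
```

This is exact, but it materializes the whole pmf on [a, b], so it is only practical when b - a is not too large.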

Remark: for n > 1000, the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.

Bugfix

You can use scipy.stats.binom equivalently:

import numpy as np
from scipy.stats import binom

def distribution(n, p):
    # numerically stable binomial pmf for x = 0..n
    return binom.pmf(np.arange(n + 1), n, p)

This does the same as the above method quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which computes the cumulative sum of binom.pmf. Subtracting the a-th entry of this array from the b-th then gives an output very close to what you expect.
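For illustration, here is a sketch of that CDF-difference idea (the parameter values a, b are hypothetical examples, not from the answer). Note the a-1 in the lower CDF term, so that a itself remains included in P(a <= X <= b):

```python
import numpy as np
from scipy.stats import binom

n, p, a, b = 1000, 0.5, 480, 520       # example parameters
# P(a <= X <= b) = F(b) - F(a-1)
norm = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)
# truncated pmf on [a, b], which sums to 1 by construction
x = np.arange(a, b + 1)
f = binom.pmf(x, n, p) / norm
```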

0
Sam Mason On

Another way would be to use the CDF and its inverse, something like:

from scipy import stats

dist = stats.binom(100, 0.5)

# limit ourselves to [60, 100]; use cdf(59) so that 60 itself stays reachable,
# since ppf(q) returns the smallest k with cdf(k) >= q
lo, hi = dist.cdf([59, 100])

# draw a sample by inverting the CDF on a uniform draw from (lo, hi]
x = dist.ppf(stats.uniform(lo, hi - lo).rvs())

This should give values in the requested range. Note that due to floating-point precision, this might occasionally produce values just outside the range; the problem gets worse above the mean of the distribution.
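A simple guard against that precision issue, assuming the same [60, 100] bounds, is to clip the inverted values back into range:

```python
import numpy as np
from scipy import stats

dist = stats.binom(100, 0.5)
lo, hi = dist.cdf([59, 100])   # cdf(59) so that 60 itself stays reachable

# vectorized: many draws at once, clipped into [60, 100] and cast to int
u = stats.uniform(lo, hi - lo).rvs(1000)
x = np.clip(dist.ppf(u), 60, 100).astype(int)
```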

Note that for large values you might as well use the normal approximation.
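That normal-approximation route could look like this (my sketch, not spelled out in the answer): approximate B(n, p) by N(np, np(1-p)), sample from the corresponding truncated normal via scipy.stats.truncnorm, and round to integers. The function name and parameter values are illustrative:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_binom_normal(n, p, a, b, size=1):
    """Approximate samples from B(n, p) conditioned on [a, b] (large n)."""
    mu = n * p
    sigma = np.sqrt(n * p * (1 - p))
    # truncnorm takes its bounds in standard-normal units;
    # widen by 0.5 on each side (continuity correction) before rounding
    lo = (a - 0.5 - mu) / sigma
    hi = (b + 0.5 - mu) / sigma
    z = truncnorm.rvs(lo, hi, loc=mu, scale=sigma, size=size)
    return np.clip(np.rint(z).astype(int), a, b)

samples = sample_truncated_binom_normal(10**6, 0.5, 500_100, 500_500, size=5)
```

This avoids evaluating the binomial pmf entirely, at the cost of the usual normal-approximation error.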