Efficient sampling from a 'partial' binomial distribution

I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I need to sample a value from the binomial distribution given that it lies in [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = (nCx)*(p)^x*(1-p)^(n-x) as

total = 0
for i in range(a, b + 1):
    total += bin(i)

f(x) = bin(x) / total

One way to sample from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as the pmf calculation would quickly become very time-consuming.

The values of n, x, a, b are quite large in my case, and computing the pmf this way and then inverting it with a uniform random variable seems extremely inefficient due to the factorial terms in nCx.

What's a nice/efficient way to achieve this?

There are 2 answers

mathfux (BEST ANSWER)

This is a way to collect all the values of bin in a pretty short time:

from scipy.special import comb
import numpy as np

def distribution(n, p=0.5):
    # vectorized binomial pmf for x = 0..n
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)

This runs in about a quarter of a microsecond for n=1000.

Sample run:

>>> distribution(4)
array([0.0625, 0.25  , 0.375 , 0.25  , 0.0625])

You can sum specific parts of this array like so:

>>> np.sum(distribution(4)[2:4])
0.625
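To actually draw samples from the truncated distribution, one option (my sketch, not part of the answer; the function name and parameters are illustrative) is to renormalize the slice of the pmf over [a, b] and feed it to `np.random.Generator.choice`:

```python
import numpy as np
from scipy.stats import binom

def sample_truncated_binom(n, p, a, b, size=1, rng=None):
    """Sample from B(n, p) conditioned on the value lying in [a, b]."""
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(a, b + 1)      # the allowed outcomes
    pmf = binom.pmf(support, n, p)     # unnormalized weights on [a, b]
    pmf /= pmf.sum()                   # renormalize so the weights sum to 1
    return rng.choice(support, size=size, p=pmf)

samples = sample_truncated_binom(100, 0.5, 60, 100, size=5)
```

This is exact, but it materializes the whole pmf on [a, b], so it is only practical when b - a is not too large.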

Remark: for n > 1000, the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.

Bugfix

You can use scipy.stats.binom equivalently:

import numpy as np
from scipy.stats import binom

def distribution(n, p):
    # numerically stable binomial pmf for x = 0..n
    return binom.pmf(np.arange(n + 1), n, p)

This does the same as the above method quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which computes the cumulative sum of binom.pmf. Subtracting the a-th entry of this array from the b-th then gives an output very close to what you expect.
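For illustration, here is a sketch of that CDF-difference idea (the parameter values a, b are hypothetical examples, not from the answer). Note the a-1 in the lower CDF term, so that a itself remains included in P(a <= X <= b):

```python
import numpy as np
from scipy.stats import binom

n, p, a, b = 1000, 0.5, 480, 520       # example parameters
# P(a <= X <= b) = F(b) - F(a-1)
norm = binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)
# truncated pmf on [a, b], which sums to 1 by construction
x = np.arange(a, b + 1)
f = binom.pmf(x, n, p) / norm
```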

0
Sam Mason On

Another way would be to use the CDF and its inverse, something like:

from scipy import stats

dist = stats.binom(100, 0.5)

# limit ourselves to [60, 100]; use cdf(59) so that 60 itself stays reachable,
# since ppf(q) returns the smallest k with cdf(k) >= q
lo, hi = dist.cdf([59, 100])

# draw a sample by inverting the CDF on a uniform draw from (lo, hi]
x = dist.ppf(stats.uniform(lo, hi - lo).rvs())

This should give values in the requested range. Note that due to floating-point precision, this might occasionally produce values just outside the range; the problem gets worse above the mean of the distribution.
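A simple guard against that precision issue, assuming the same [60, 100] bounds, is to clip the inverted values back into range:

```python
import numpy as np
from scipy import stats

dist = stats.binom(100, 0.5)
lo, hi = dist.cdf([59, 100])   # cdf(59) so that 60 itself stays reachable

# vectorized: many draws at once, clipped into [60, 100] and cast to int
u = stats.uniform(lo, hi - lo).rvs(1000)
x = np.clip(dist.ppf(u), 60, 100).astype(int)
```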

Note that for large values you might as well use the normal approximation.
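That normal-approximation route could look like this (my sketch, not spelled out in the answer): approximate B(n, p) by N(np, np(1-p)), sample from the corresponding truncated normal via scipy.stats.truncnorm, and round to integers. The function name and parameter values are illustrative:

```python
import numpy as np
from scipy.stats import truncnorm

def sample_truncated_binom_normal(n, p, a, b, size=1):
    """Approximate samples from B(n, p) conditioned on [a, b] (large n)."""
    mu = n * p
    sigma = np.sqrt(n * p * (1 - p))
    # truncnorm takes its bounds in standard-normal units;
    # widen by 0.5 on each side (continuity correction) before rounding
    lo = (a - 0.5 - mu) / sigma
    hi = (b + 0.5 - mu) / sigma
    z = truncnorm.rvs(lo, hi, loc=mu, scale=sigma, size=size)
    return np.clip(np.rint(z).astype(int), a, b)

samples = sample_truncated_binom_normal(10**6, 0.5, 500_100, 500_500, size=5)
```

This avoids evaluating the binomial pmf entirely, at the cost of the usual normal-approximation error.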