I'm trying to use PyMC to determine the distribution of ad click-through rates (CTRs). Let's say we have 1000 ads, and I have measurements of clicks and views for each ad. I assume that the underlying distribution of the ad CTRs is a Beta distribution, and I would like to use PyMC to estimate the parameters of this distribution. In the following snippets I will call these parameters unknown_alpha and unknown_beta.
To show my example code, here is how one could generate an example test set:
    from scipy.stats import beta
    from scipy.stats import geom
    from scipy.stats import binom

    def generate_example_data(data_size=1000, unknown_alpha=30, unknown_beta=100):
        ctrs = beta.rvs(a=unknown_alpha, b=unknown_beta, size=data_size)
        data_views = geom.rvs(0.001, size=data_size)
        data_clicks = []
        for ctr, views in zip(ctrs, data_views):
            data_clicks.append(binom.rvs(p=ctr, n=views))
        return data_views, data_clicks
And here is how I tried to use PyMC:
    import pymc

    def model(data_views, data_clicks):
        ctr_prior = pymc.Beta('ctr_prior', alpha=1.0, beta=1.0)
        views = pymc.Geometric('views', 0.01, observed=True, value=data_views)
        clicks = pymc.Binomial('clicks', n=views, p=ctr_prior, observed=True, value=data_clicks)
        model = pymc.Model([ctr_prior, views, clicks])
        mc = pymc.MCMC(model)
        mc.sample(iter=5000, burn=5000)
        return mc.trace('ctr_prior')[:]

    views, clicks = generate_example_data()
    model(views, clicks)
Output:
array([ 0.])
I know that the model is not yet finished for inferring unknown_alpha and unknown_beta, but I don't know why I just get array([ 0.]). I expected to get a trace with 5k elements.
Can anybody explain where I went wrong?
Cheers!
My guess would be the mc.sample(iter=5000, burn=5000) line. You draw 5000 samples and throw away the first 5000 as burn-in, so nothing is left in the trace. To keep 5000, you want mc.sample(iter=10000, burn=5000).
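To make the arithmetic concrete, here is a rough sketch of how many draws survive burn-in. Note that retained_samples is a hypothetical helper I made up for illustration, not part of PyMC, and it assumes no thinning is applied:

```python
def retained_samples(n_iter, burn):
    """Draws left after discarding the first `burn` of `n_iter` (assumes no thinning)."""
    # Hypothetical helper for illustration only; not a PyMC function.
    return max(n_iter - burn, 0)

# The call from the question: every draw is discarded as burn-in.
print(retained_samples(5000, 5000))   # 0

# The suggested fix: 10000 draws, the first 5000 discarded.
print(retained_samples(10000, 5000))  # 5000
```

So if you want a trace of 5k elements after a burn-in of 5000, iter has to be the sum of the two.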