I am trying to do Bayesian analysis using the PyMC3 and Bambi libraries.

My data frame has shape 136x5, and the top 5 rows are as follows:

   AGE GENDER  AVAR  BVAR   OUTVAR
0   60      F   0.9     0   8260.0
1   56      F   5.4     1  15888.0
2   55      F   1.2     1  19734.4
3   52      F   1.7     1  15904.2
4   49      F   1.6     0  14848.0

where OUTVAR is the target variable and the others are predictors.

I used the following Python 3 code:

from bambi import Model

model = Model(bdf)  # bdf is the pandas DataFrame shown above
results = model.fit('OUTVAR ~ AGE + GENDER + AVAR + BVAR', samples=5000, chains=2)
print(results[1000:].summary())  # discard the first 1000 draws as burn-in

and got the following output:

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average Loss = 1,476.2:  21%|██████████████▊                                                       | 10601/50000 [00:01<00:07, 5542.87it/s]
Convergence achieved at 10700
Interrupted at 10,699 [21%]: Average Loss = 1,520.6
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [OUTVAR_sd, BVAR, AVAR, AGE, GENDER, Intercept]
Sampling 2 chains: 100%|██████████████████████████████████████████████████████████████████████████| 11000/11000 [11:40<00:00, 15.71draws/s]
There were 154 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 241 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The number of effective samples is smaller than 10% for some parameters.

                     mean           sd  hpd0.95_lower  hpd0.95_upper  effective_n  gelman_rubin
AGE              6.936836    81.741918    -158.176468     165.050530          589      1.005245
AVAR           -78.356403   410.942267    -896.068374     718.650196         2414      1.000314
BVAR          2639.993063  2262.101985   -1528.841953    7297.760056          553      1.000544
GENDER[T.M]   -615.092080  1659.905226   -3855.789966    2756.704710         1392      1.003126
Intercept    11739.843222  3591.680335    5239.872556   19053.145349          179      1.007447
OUTVAR_sd     8936.351700   283.640474    8402.347757    9318.495791         6027      1.000028

How do I analyze this table, and what inferences can I draw from these values?

Edit: To be specific, I want to confirm whether mean gives the coefficient value for each variable. However, in ordinary linear regression there is no coefficient for the outcome variable, yet here one (OUTVAR_sd) is listed. Also, I am not clear on the meaning of the effective_n value for each variable. Thanks for your help.

1 Answer

Tal Yarkoni

In this context, mean is the mean of the posterior distribution for each parameter. For a simple case like this, the posterior means will usually be close to the point estimates you would get from traditional least-squares regression.
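As a sanity check, you can compare Bambi's `mean` column against an ordinary least-squares fit of the same formula. Here is a minimal NumPy-only sketch on simulated data (the variable names and true coefficients are made up for illustration): with effectively flat priors, the posterior means should land near the least-squares estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # true intercept 3, slope 2

X = np.column_stack([np.ones(n), x])               # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares point estimates

print(beta)  # intercept near 3.0, slope near 2.0
```

On real data you would fit `OUTVAR ~ AGE + GENDER + AVAR + BVAR` with e.g. statsmodels and compare coefficient by coefficient.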

The OUTVAR_sd row is not a coefficient for the outcome variable; it represents the model error. Think of the true OUTVAR score as the sum of the model-predicted OUTVAR and a normally distributed error term. OUTVAR_sd is then the posterior estimate of that error's standard deviation.

effective_n is a heuristic estimate of the number of effectively independent samples in the chain for each parameter. It is almost invariably smaller than the nominal number of draws, because consecutive MCMC samples are correlated. See this blog post for a brief explanation.
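To see why correlated draws count for less, here is a NumPy-only sketch using an AR(1) series as a stand-in for an autocorrelated MCMC chain. A common estimator is ESS = N / (1 + 2 * sum of autocorrelations); for an AR(1) chain with coefficient phi, this works out to roughly N * (1 - phi) / (1 + phi), so 5000 strongly correlated draws can be worth only a few hundred independent ones.

```python
import numpy as np

rng = np.random.default_rng(2)
n, phi = 5000, 0.9
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):                               # AR(1): each draw depends on the previous one
    chain[t] = phi * chain[t - 1] + rng.normal()

ess_theory = n * (1 - phi) / (1 + phi)              # about 263 for phi = 0.9

# empirical ESS: N / (1 + 2 * sum of positive-lag autocorrelations)
c = chain - chain.mean()
acf = np.correlate(c, c, mode="full")[n - 1:] / np.arange(n, 0, -1) / c.var()
rho_sum = 0.0
for k in range(1, n):
    if acf[k] < 0:                                  # simple truncation at first negative lag
        break
    rho_sum += acf[k]
ess_empirical = n / (1 + 2 * rho_sum)

print(ess_theory, ess_empirical)
```

This is why an effective_n of 179 for the Intercept out of 10000 nominal draws is a warning sign: the chain is moving slowly through that parameter's posterior, consistent with the divergence and tree-depth warnings in your sampling log.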