I am trying to do some statistical analysis of different A/B tests to see which alternative is better and have found conflicting information about this.
First, I am interested in a couple different things:
- Tests that measure success by counting events, such as conversions or emails sent
- Tests that measure success by counting revenue
- Tests that have only two alternatives (control and new)
- Tests that have multiple alternatives (control and multiple new)
I was hoping to find a simple set of formulae or rules for doing this analysis but have found more questions than answers.
This site says that you can't compare multi-alternative tests; you can only do pairwise comparisons and do a chi-squared analysis to see if the whole test is statistically significant or not.
This site suggests a way to do A/B/C/D testing (starts on slide 74), analysing the results using the G-test (which it says is related to chi-squared), but isn't clear on the details of using a fudge factor. It also suggests that you can only use the A/B/C/D approach to eliminate alternatives until you end up with a clear winner in an A/B comparison.
This site gives an example of an A/B/C/D test (including control) and shows how to compare the conversion rates to determine a winner. Unlike this approach, it does not recommend eliminating alternatives but rather picks a winner right off the bat (assuming statistically significant results).
Perhaps I'm naive but I would think that by now a stats analysis library would exist to deal with this very problem. I would also appreciate more information about what algorithms/equations are needed to solve these problems. It's been a long time since my university Stats class.
For the event-counting comparison, you could approach this using Beta distributions. Each alternative has some unobserved p, the probability of producing an event. If you observe X positive events out of N trials and start from a uniform prior on p, then your uncertainty about p can be modeled by Beta(X+1, N-X+1).
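As a minimal sketch of that update, assuming a uniform Beta(1, 1) prior, the posterior parameters and their closed-form mean and variance can be computed directly. The function names here are hypothetical and chosen only for illustration.

```python
def beta_posterior(successes, trials):
    """Return (alpha, beta) of the Beta posterior, assuming a uniform Beta(1, 1) prior."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return successes + 1, trials - successes + 1


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Example: 30 conversions out of 100 visitors.
alpha, beta = beta_posterior(30, 100)
print(alpha, beta, beta_mean(alpha, beta), beta_variance(alpha, beta))
```

The posterior mean (X+1)/(N+2) sits slightly closer to 0.5 than the raw rate X/N, which is the effect of the uniform prior; with large N the two are nearly identical.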
You can compare two alternatives by looking at P(pA > pB), where pA and pB are the event probabilities of the two alternatives, each described by its own Beta distribution. Methods for computing that inequality probability can be found in this paper.
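One simple way to approximate P(pA > pB) is Monte Carlo sampling rather than an exact method: draw from each Beta distribution many times and count how often the draw for A exceeds the draw for B. The sketch below assumes the Beta(X+1, N-X+1) form with a uniform prior; the function name, the default number of draws, and the example counts are illustrative assumptions.

```python
import random


def prob_a_beats_b(x_a, n_a, x_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(pA > pB) for two Beta(X+1, N-X+1) distributions."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(x_a + 1, n_a - x_a + 1)
        p_b = rng.betavariate(x_b + 1, n_b - x_b + 1)
        if p_a > p_b:
            wins += 1
    return wins / draws


# Example (made-up counts): A converts 120 of 1000, B converts 100 of 1000.
print(prob_a_beats_b(120, 1000, 100, 1000))
```

The estimate is only as precise as the number of draws allows; its standard error is roughly sqrt(p(1-p)/draws), so 100,000 draws gives about two to three reliable decimal places.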
You can also compute E[pA - pB], the expected effect size, or compute interval bounds on that difference.
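The effect size and its bounds can be sketched by sampling the difference directly. This assumes the same Beta(X+1, N-X+1) form with a uniform prior, and reports an equal-tailed interval taken from the sorted samples; the function name, defaults, and example counts are hypothetical.

```python
import random


def difference_summary(x_a, n_a, x_b, n_b, draws=100_000, level=0.95, seed=0):
    """Return (mean, lower, upper) for pA - pB, estimated from sampled differences."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    diffs = sorted(
        rng.betavariate(x_a + 1, n_a - x_a + 1)
        - rng.betavariate(x_b + 1, n_b - x_b + 1)
        for _ in range(draws)
    )
    mean = sum(diffs) / draws
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]            # approximate lower quantile
    upper = diffs[int((1 - tail) * draws) - 1]  # approximate upper quantile
    return mean, lower, upper


# Example (made-up counts): A converts 120 of 1000, B converts 100 of 1000.
mean, lower, upper = difference_summary(120, 1000, 100, 1000)
print(mean, lower, upper)
```

The mean also has a closed form by linearity of expectation, (XA+1)/(NA+2) - (XB+1)/(NB+2), which is a useful check on the sampled value. If the interval includes 0, the data do not clearly separate the two alternatives at the chosen level.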