Most of us have either heard of or used A/B testing, a technique widely employed across industries from medicine to software development to measure one version of a product against another. It’s a method to validate whether your beliefs about a new feature or improvement are actually true and, if so, to what extent.
The reason I’m writing this is two-fold: to collect information about the statistics and math behind the process in one place, and to solidify my own understanding of these concepts. I am by no means a statistics, probability, or math expert, so I’m happy to hear feedback or corrections.
This post won’t go beyond the basics of the numerical and statistical aspects of setting up and analyzing your A/B tests (or multi-way tests), but it will show how you can apply this technique and explain the relevant concepts in some depth.
The process of setting up an A/B test
Step 1 — Hypothesis
Perhaps the most important step is to start with a belief, and potentially supporting data, about how the current version of your product can be improved and why a proposed change would improve it.
Once you have this belief, aka your hypothesis, you can start brainstorming ideas for a better version. In most cases you should be careful about going too wild with these changes: if you test multiple different things at once, you won’t know what worked and what didn’t.
Step 2 — Experiment design
Define what you want to test, what the audience is, what is the primary/secondary goal for your test, what is the expected impact etc.
Now, after you’ve defined your experiment by identifying what you are testing, how you are measuring results etc. you take the first step.
Step 3 — Sample size calculation
This step tells you how many samples to collect to confidently understand the impact of an experiment on your key metrics. For example: do you need 1,000 visitors to your site to tell you which headline graphic works, or do you need 10,000?
Here’s what you need to know to complete this step:
- Alpha, or significance level, is the probability of detecting an effect even though no real difference exists. In short, it is the probability of your decision being a false positive. You want this to be low; a typical value is 5%.
- Statistical power (1 − beta) is the probability of detecting an effect when there really is a difference — in other words, of avoiding a false negative. It is the likelihood with which you want to detect the effect of the experimental experience if a difference truly exists. You want this to be high; a typical value is 80%.
Note that alpha and power shouldn’t have to change for every experiment you run.
Now two more things you need are
- Your metric’s baseline conversion rate — this is the metric you are looking to improve (increase or decrease) and hence the reason you are running the experiment. You should know this number; if not, collect it before you even think about an experiment.
- Your expected relative improvement — It’s ok if this is just a guess. The math on the analysis side will help out regardless.
Plug these values into a sample size calculator. Voila! You are set to begin your experiment. Be careful not to peek at the results while the experiment runs, so as not to bias your decision. Be patient!
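If you’re curious what such a calculator does, here is a rough sketch of the standard per-variant sample size formula for a two-sided, two-proportion z-test. The function name and defaults are mine, not from any particular tool; it takes the baseline conversion rate and the expected relative improvement discussed above:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate samples needed per variant for a two-proportion z-test.

    baseline_rate: current conversion rate (e.g. 0.10 for 10%)
    relative_lift: expected relative improvement (e.g. 0.20 for +20%)
    alpha: significance level (false-positive probability)
    power: 1 - beta (probability of detecting a real effect)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

For a 10% baseline rate and a hoped-for 20% relative lift (10% → 12%), this lands at roughly 3,800 visitors per variant at the default alpha of 5% and power of 80% — which is why knowing your baseline number first matters so much.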
Step 4 — Implementation and data collection
Now that you have the hypothesis, the numbers, the variants designed, the duration and size of the experiment population it’s time to actually implement/run the experiment.
Step 5 — The analysis phase
So you’ve been patient, and now you’re curious to see how well your belief in the new feature or improvement held up. It’s time to find out.
For analysis, simply use a Bayesian-based testing calculator. There are detailed reasons for this, but simply put: the results you see from your experiment, whether positive in favor of the new feature or negative, only tell you how likely that outcome is if you were to repeat the experiment. Obviously, the only real way of measuring the impact of the improvement is to go live with the new feature. But running the experiment tells you, with some likelihood, what you can expect to happen in the real world.
So a few details now:
Bayesian calculators rely on specifying a prior. The prior is simply a probability distribution of the metric before the experiment began. For example, if you receive 1,000 visitors and 100 sign up, your prior is a probability distribution curve peaking at 0.1.
By running the experiment with the different variations, you are essentially asking the calculator to adjust that prior belief given the data you collected during the experiment. This is called Bayesian inference and is based on one of the simplest results in probability — Bayes’ theorem.
Bayes’ theorem lets you find the probability of an event given another event, so using Bayesian inference the calculator adjusts the probabilities for all your experimental variations.
Of course, the more data samples you collect, the higher your confidence and the narrower the range of estimates you’ll have.
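To make this concrete, here is a minimal sketch of what a Bayesian calculator does under the hood for a binary conversion metric. It assumes a uniform Beta(1, 1) prior for each variant (real calculators let you supply an informed prior, as described above) and uses Monte Carlo sampling from the two Beta posteriors to estimate the probability that variant B truly beats variant A; the function name is mine:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=42):
    """Estimate P(rate_B > rate_A) given conversions and visitors per variant.

    Each variant's conversion rate gets a Beta posterior:
    Beta(1 + conversions, 1 + non-conversions), i.e. a uniform
    Beta(1, 1) prior updated with the observed data (Bayes' theorem).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if b > a:
            wins += 1
    return wins / draws
```

With 100 conversions out of 1,000 visitors on A versus 130 out of 1,000 on B, this reports B ahead with well over 90% probability; with identical data for both variants it hovers around 50%, reflecting genuine uncertainty. And, as noted above, feeding in the same rates with more visitors narrows the posteriors and pushes the probability further from 50%.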
You can clearly see that in Figure 2 the calculator has a higher confidence in the experiment because we have more samples.
When can I use A/B testing?
It’s actually a super useful tool in cases
- when you don’t have enough data to know a new version is better
- when you have data but don’t feel confident to make a change
- when you want to be very sure of your change’s impact and to quantify that impact
A/B testing can seem like rocket science, but it really isn’t. Once you understand some of the basic concepts, you can masterfully apply this technique to make improvements.