All the Math you need to conduct an A/B test
If you’re an innovator, you are by definition in uncharted waters. There is no textbook on how to build your startup, because if the route to success were already figured out, the world wouldn’t need your product.
This means that you can’t google a lot of the questions you may have about your users, the market, and how they might react to changes in your website, product, or advertising. As such, one of the only ways to find out this information is through field experiments: That’s where A/B testing comes in.
A/B testing is a method of comparing two variations of something in order to see which performs better according to some metric. The most common way this is used is to see how two different versions of a website affect the rate at which users sign up.
The first step is to identify what your success metric is. What are you trying to increase? The amount of time a user spends on your platform? The rate at which they sign up? The click-through rate on an advertisement? Figure out what you are trying to optimize and make sure you have a method of accurately tracking it.
Next is to identify what you want to vary. Is it the color of an advertisement? Is it the location of your sign-up button? Is it the adjective you use to describe your product? The variations in your test should be identical except for this one difference.
Additionally, identify which variation is your control (how something is now) and which is your experiment (the change you are considering making).
The next step is by far the hardest. You need the tracking infrastructure to measure how your success metric varies across your control and experimental variations. If you are using user tracking like Mixpanel, Fullstory, or Logrocket, then this is pretty easy. Otherwise, you may need to natively track this information on your website.
Additionally, you need the infrastructure to randomly assign users to either the control or experimental variation. The assignment should be truly random, so that nothing about a user influences which version they see, and it should split users roughly evenly between the two versions.
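One common way to get this kind of assignment is to hash a user identifier, so that each user lands in an effectively random bucket but always sees the same version on return visits. The sketch below is only an illustration: the function name `assign_variation`, the experiment name, and the `user-…` IDs are all hypothetical, and hashing is a stand-in for whatever randomization your own stack provides.

```python
import hashlib


def assign_variation(user_id: str, experiment: str = "signup-button-test") -> str:
    """Deterministically map a user to 'control' or 'experiment'.

    Hashing the (hypothetical) experiment name together with the user ID
    gives a stable, effectively random bucket per user, so a returning
    user always sees the same version.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 2  # 0 or 1, roughly a 50/50 split
    return "control" if bucket == 0 else "experiment"


if __name__ == "__main__":
    # Sanity check on made-up user IDs: the split should be close to even.
    counts = {"control": 0, "experiment": 0}
    for i in range(10_000):
        counts[assign_variation(f"user-{i}")] += 1
    print(counts)
```

Including the experiment name in the hash means a user who lands in the control group of one test is not automatically in the control group of the next one.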
Finally, you need to decide ahead of time the duration of your test. Are you going to run this for a week? Are you going to run this until you have a sample size of 1000 users? Deciding this beforehand can prevent you from stopping the test at a time that biases the results.
Random Sampling and Biases
The backbone of any statistical experiment is random sampling. The idea behind random sampling is that it would be incredibly difficult to find out how every single possible customer might react to the change you are considering making. The next best thing, therefore, is to randomly pick a smaller group of possible customers, and ask them.
If our sample is truly randomly selected from the general population of possible customers, then the results we see in our A/B test should be an accurate estimate of how our customers might react to the change.
Let’s say we are testing how much a person in New York City is willing to spend on a hotdog. A random sample might go all across the city and ask random people. An experiment with sampling bias would just go to Wall Street and ask people there.
If you just go to Wall Street, however, you are going to find that people are willing to spend more money, and you will thus overestimate the average that New Yorkers are willing to spend. Our sample was not an accurate representation of the general population of New Yorkers, so our estimate is not reliable.
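To make the hotdog example concrete, here is a small simulation. Every number in it is invented purely for illustration (the share of people on Wall Street, the average prices, the sample size); it only shows the mechanism by which a biased sample overestimates the city-wide average.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def build_population(size=100_000, wall_street_share=0.05):
    """Build a made-up city: (works_on_wall_street, max_hotdog_price)."""
    population = []
    for _ in range(size):
        on_wall_street = rng.random() < wall_street_share
        mean_price = 8.0 if on_wall_street else 4.0  # invented averages
        price = max(0.0, rng.gauss(mean_price, 1.0))
        population.append((on_wall_street, price))
    return population


population = build_population()
true_mean = statistics.mean(price for _, price in population)

# Random sample: anyone in the city can be picked.
random_sample = rng.sample(population, 500)
random_estimate = statistics.mean(price for _, price in random_sample)

# Biased sample: we only ever ask people on Wall Street.
wall_street_only = [person for person in population if person[0]]
biased_sample = rng.sample(wall_street_only, 500)
biased_estimate = statistics.mean(price for _, price in biased_sample)

print(f"true average:    {true_mean:.2f}")
print(f"random sample:   {random_estimate:.2f}")
print(f"biased sample:   {biased_estimate:.2f}")
```

The random sample should land close to the true average, while the Wall Street sample lands far above it, even though both samples are the same size.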
That being said, completely removing sampling bias is an unrealistic goal. What is actually feasible and what is incredibly important is to be able to describe what your sampling bias is.
Where does your sample come from? Are they from the portion of doctors that use Twitter, and hence were able to click on your Twitter ad? Are they from the portion of chefs that live in San Francisco and saw your billboard?
So let’s say that we finished conducting our test, and we got results like this:
In the above example, our A/B test was seeing how a change in our website affected the rate at which people sign up.
As you can see above, with the same number of visitors, our variant produced 4 more signups. So we should make the change to our variant, right?
Not exactly. Once you have the results of your A/B test, you need to check for statistical significance.
A good way to grasp the idea of statistical significance is to take the example of flipping a coin. If we were to flip a fair coin 1,000 times, we would expect to get heads about 500 times, and tails about 500 times.
But let’s say we want to test whether a coin is fair by seeing if we get an even number of heads and tails. We flip the coin 4 times, and get 3 heads and 1 tail. If we drew a conclusion directly from this information, we might say that the coin has a 75% chance of landing heads each time.
But we only tested the coin 4 times! The idea behind statistical significance is that when we have a small sample size, there is a certain amount of variation that is purely a coincidence.
In the coin example, the fact that we got 3 heads was likely just a coincidence. In our website A/B test example, the fact that there were 4 more signups might also have just been a coincidence.
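We can put a number on how unsurprising 3 heads in 4 flips is. The sketch below computes the exact chance that a perfectly fair coin produces at least a given number of heads; the helper name `prob_at_least` is just an illustrative choice.

```python
from math import comb


def prob_at_least(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Probability of getting `heads` or more heads in `flips` flips."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# A fair coin gives 3 or more heads in 4 flips almost a third of the time.
print(prob_at_least(3, 4))  # 0.3125

# The same 75% heads rate over 1,000 flips is a vanishingly unlikely
# outcome for a fair coin.
print(prob_at_least(750, 1000))
```

The same 75% heads rate is weak evidence with 4 flips and overwhelming evidence with 1,000, which is why sample size matters so much when reading A/B test results.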
So how do we figure out if the results of our test are significant, or just purely a coincidence? That’s where hypothesis testing comes into play. Specifically, with A/B tests for conversion rates, we tend to use a Chi-Squared Test.
The basic idea behind the Chi-Squared test is to see whether the differences between two options are actually significant, or purely a result of chance.
The idea is to identify what we call a null hypothesis. In our example:
Null hypothesis: The variation we tested on our website would not cause a different number of people to sign up
In other words, the null hypothesis states that our variation and control would result in the same outcome.
In the above example, we observed an average conversion rate of about 15%. So if the variations were truly equivalent, we would expect each version of the website to produce about 38 signups (0.15 × 250 = 37.5) from its 250 visitors.
A Chi-Squared test then compares the evidence we collected with the values we would expect if the null hypothesis were true, and asks whether we should reject the null hypothesis. In other words, is the observed difference so large that the null hypothesis is unlikely to be true?
The Chi-Squared test returns a p-value, which is the probability of observing a difference at least this large, given that the null hypothesis is true. If that probability is low enough, conventionally below 0.05, we might choose to reject the null hypothesis and conclude that our variation does indeed change the success metric.
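Here is that expected-value arithmetic as a short sketch. The signup counts of 36 and 40 are an assumption made for illustration; they were chosen only because they match the description in the text (250 visitors per version, 4 more signups for the variant, a conversion rate of about 15%).

```python
visitors = {"control": 250, "variant": 250}
signups = {"control": 36, "variant": 40}  # ASSUMED counts, for illustration only

# Under the null hypothesis both versions share one conversion rate,
# so we pool all visitors and all signups to estimate it.
pooled_rate = sum(signups.values()) / sum(visitors.values())

# Expected signups per version if the null hypothesis were true.
expected_signups = {name: pooled_rate * n for name, n in visitors.items()}

print(f"pooled conversion rate: {pooled_rate:.3f}")
for name, expected in expected_signups.items():
    print(f"{name}: observed {signups[name]}, expected {expected:.1f}")
```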
How you actually calculate the p-value is going to depend on the workflow you use, but here are tutorials for Excel and Google Sheets.
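If you would rather compute it in code, the sketch below runs a plain Pearson Chi-Squared test on a 2×2 table using only the Python standard library. The counts (36 and 40 signups out of 250 visitors each) are assumed for illustration, chosen to match the description of the example, and the function name `chi_squared_2x2` is hypothetical. It applies no continuity correction, so some tools that apply one by default may report slightly different numbers.

```python
from math import erfc, sqrt


def chi_squared_2x2(signups_a, visitors_a, signups_b, visitors_b):
    """Pearson Chi-Squared test for two conversion rates (no correction)."""
    observed = [
        [signups_a, visitors_a - signups_a],  # version A: signed up / did not
        [signups_b, visitors_b - signups_b],  # version B: signed up / did not
    ]
    total = visitors_a + visitors_b
    row_totals = [visitors_a, visitors_b]
    col_totals = [signups_a + signups_b, total - signups_a - signups_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell under the null hypothesis.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # A 2x2 table has 1 degree of freedom, and for 1 degree of freedom the
    # Chi-Squared tail probability equals erfc(sqrt(statistic / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# ASSUMED counts: 36 vs. 40 signups, 250 visitors per version.
statistic, p_value = chi_squared_2x2(36, 250, 40, 250)
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.3f}")
```

With these assumed counts the p-value comes out well above 0.05, so a difference of 4 signups on 250 visitors per version would not be enough to reject the null hypothesis.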
How to Learn More
This was an extremely high-level overview of statistical hypothesis testing. Unless you are looking to delve deeper into statistics and data science, I would recommend not focusing on the math that allows hypothesis testing to work, and instead devoting most of your time to understanding biases.
Data is only as good as the way it is collected. There are very few environments where data collection can be close to perfect, and for startup founders, I can guarantee that you are not in that environment.
Understanding how conclusions from data and experiments can be wrong is just as important as being able to use them in the first place.