When a test disproves your hypothesis, it can be disappointing. While you learn something about your customers, you don’t get the thrill of the win. Still, losing is not nearly as disappointing as a test that never ends. So why do some tests reach statistical significance quickly and some never get valid results?
To understand why some tests never collect enough data, we have to understand the criterion we use to decide when a test wraps up: statistical significance. It ensures that we make our decisions based on relevant data rather than on chance and false positives. At Blue Acorn, we wait until we have 95% statistical significance, which means there is only a 5% chance that the difference we’re seeing is a fluke of random variation rather than a real effect.
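To make the idea concrete, here is a minimal sketch of the classic two-proportion z-test that underlies this kind of significance check. This is not Optimizely’s actual statistics engine (which uses its own sequential methods), and the visitor and conversion numbers below are made up for illustration.

```python
from statistics import NormalDist

def significance(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided two-proportion z-test: returns the confidence level
    that the difference between A and B is not just chance."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled conversion rate under the "no real difference" assumption
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = (p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = abs(p_a - p_b) / se
    return 1 - 2 * (1 - NormalDist().cdf(z))  # 1 minus the two-sided p-value

# 3.0% vs. 3.6% conversion at 10,000 visitors each clears the 95% bar...
print(significance(300, 10_000, 360, 10_000) > 0.95)  # → True
# ...but the same rates at only 1,000 visitors each do not
print(significance(30, 1_000, 36, 1_000) > 0.95)      # → False
```

Notice that the same conversion rates pass or fail the 95% threshold depending only on how many visitors saw the test — which is exactly why low-traffic pages produce tests that never seem to end.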
So why do some tests never seem to reach 95% statistical significance? Theoretically, every test will conclude eventually, although when that time exceeds a year, it may as well be never. After a while, test results lose their relevance. So the real question is: why would a test take that long? The answer boils down to two reasons: the difference between the variations is very small, and/or there isn’t enough traffic for the test you’re running.
Calculating Sample Size
How do you determine the number of visitors needed to reach a statistically significant result, and how do you know when that many visitors have come through? Optimizely makes answering both of these questions simple.
To calculate how many visitors must go through a test, you can use Optimizely’s Sample Size Calculator. All you have to do is input two simple variables.
- Baseline Conversion Rate: This is the current conversion rate. You can find this number in your analytics.
- Minimum Detectable Effect: The minimum relative change in conversion rate you would like to be able to detect. Big changes take less traffic/time to detect. Small changes are more subtle and could be the result of random fluctuations in the data, so the test needs more traffic/time to verify that the lift is real.
You can also adjust two more variables, statistical power and statistical significance, although it is not necessary. Increasing the statistical power reduces the risk of missing out on a winner. Increasing statistical significance reduces the risk of accidentally picking a winner when one doesn’t exist. However, increasing these numbers will increase the time it takes to gather a statistically significant result. For most tests, 80% for statistical power and 95% for statistical significance will work just fine.
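These four inputs — baseline conversion rate, minimum detectable effect, significance, and power — can be combined into a rough back-of-the-envelope estimate using the standard two-proportion sample size formula. This is a stand-in sketch, not the Sample Size Calculator’s exact math, and the 3%/20% inputs are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, mde, significance=0.95, power=0.80):
    """Approximate visitors needed per variation, via the standard
    two-proportion z-test formula. A rough stand-in for Optimizely's
    calculator, which uses its own statistics."""
    p1 = baseline
    p2 = baseline * (1 + mde)  # minimum detectable effect is a relative lift
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # ≈ 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                        # ≈ 0.84 at 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline conversion rate, 20% minimum detectable (relative) lift
print(sample_size_per_variation(0.03, 0.20))  # → roughly 14,000 per variation
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why hunting for small lifts on low-traffic pages is usually a losing proposition.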
Once you have the sample size, go into your analytics and see how long it would take for that much traffic to flow through the page you’re testing. Keep in mind that the calculator gives a sample size per variation, so the total traffic required is that number multiplied by the number of variations, including the original. If the test would run longer than most of your others, consider a test that will finish sooner, unless the potential revenue gain is worth the wait. Otherwise, come back to it when you’re low on testing ideas.
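If your analytics report average daily visitors to the test page, that wait translates directly into days. A quick sketch, assuming traffic is split evenly across variations (the numbers are illustrative):

```python
import math

def days_to_run(sample_per_variation, num_variations, daily_visitors):
    """Rough test duration, assuming traffic is split evenly across
    all variations (counting the original as one of them)."""
    total_needed = sample_per_variation * num_variations
    return math.ceil(total_needed / daily_visitors)

# ~14,000 visitors per variation, original plus one challenger,
# 1,500 visitors per day to the test page
print(days_to_run(14_000, 2, 1_500))  # → 19 days
```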
For a more detailed explanation of the Sample Size Calculator, check out Optimizely’s article “How Long to Run a Test.”
Enough Visitors and Still No Difference
Sometimes, even after enough visitors have gone through the test to reach statistical significance, you don’t find a winner. That means the difference between the original and the variation is smaller than your minimum detectable effect. If you set your minimum detectable effect to a high percentage, you may want to recalculate with a smaller one. However, if you needed a big improvement in your KPI to justify the cost of the variation, you may want to stop the test and move on to the next one.
Of course, the best way to get a statistically significant result is to develop a solid hypothesis. Hypotheses based on assumptions, pulled out of thin air, or simply copied from another site have a much higher risk of failing. Make sure you have substantial research that points to a possible improvement before testing. You’ll save yourself the trouble of a test that never ends.