Conversion rate tests are very useful, but without proper knowledge of research and statistics, the risk of interpreting your results incorrectly is large.
1. Research design is beautiful but NOT flawless
As I started to investigate the test designs behind conversion rate testing, I was astonished and delighted by the beauty of the design of A/B-testing. That does not mean, unfortunately, that testing and interpreting results is easily done.
An experimental design
A/B-testing uses what is called an experimental design. For the purposes of this post, you only need to know that an experimental design requires two groups: one group is exposed to a stimulus, while the other is not. All (literally all!) other conditions have to be identical. Changes between groups can then be attributed to the stimulus alone.
In A/B-testing it is thus of the utmost importance that both groups are identical.
This can be achieved most easily through randomization. As far as we know, randomization is used by most providers of conversion testing. The visitor sees your website in either version A or version B; which version is served is based on pure chance. The idea is that randomization will ensure that both groups are alike. So far, the research design is very strong: the groups are identical. In theory, this is an experimental goldmine!
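Randomization is easy to sketch in code. Below is a minimal illustration (my own sketch in Python, with hypothetical visitor IDs, not any particular tool's implementation): hash-based bucketing behaves like a coin flip across visitors while keeping a returning visitor in the same group.

```python
import hashlib

def assign_variant(visitor_id: str) -> str:
    """Deterministically assign a visitor to group A or B.

    Hashing the visitor ID splits traffic roughly 50/50, and a returning
    visitor always lands in the same group across sessions.
    """
    digest = hashlib.md5(visitor_id.encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# With enough visitors, the split approaches 50/50:
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1
print(counts)
```

The same idea underlies the "pure chance" assignment described above; the stable hash is just one common way to implement it.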
Period effects mess with your results
However… randomization will only ensure identical groups if you have enough visitors. Sites with small numbers of visitors could simply choose to run their tests for a longer period of time. But then all kinds of changes in your population could occur, due to blog posts, links, or the news in the world. Randomization will still take care of differences between groups. However, there will be differences within your population due to these period effects, and those differences could interact with the results of your A/B-tests. Let me explain that last point:
Imagine you have a site for nerds and you sell plugins. You're running an A/B-test on your checkout page. Then you write a phenomenal blog post about your stunning wife, and a whole new (and very trendy) population visits your website. This new population could respond differently to the changes in your checkout page than the old nerdy population: knowing less about the web, they may be more influenced by usability changes. In that case, your test results would show an increase in sales driven by this new population. If the sudden influx of trendy people on your website lasts only a short time, you will draw the wrong conclusions.
Running tests for longer periods of time will only work if you keep a diary in which you write down all possible external explanations. You should interpret your results carefully, and always in light of relevant changes in your website and your population.
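The scenario above can be simulated. The sketch below (Python, with made-up traffic numbers and conversion rates) randomizes visitors into A and B every day, but lets a two-day influx of visitors who respond strongly to variant B pollute the overall result:

```python
import random

random.seed(42)

def simulate_day(n_visitors, p_a, p_b):
    """Randomly split one day's visitors into A and B and count conversions."""
    n_a = conv_a = n_b = conv_b = 0
    for _ in range(n_visitors):
        if random.random() < 0.5:              # the randomization step
            n_a += 1
            conv_a += random.random() < p_a
        else:
            n_b += 1
            conv_b += random.random() < p_b
    return n_a, conv_a, n_b, conv_b

# Regular population: variant B makes no real difference (2% either way).
# Temporary influx (say, after a viral post): B genuinely works on them.
regular = (0.02, 0.02)
influx = (0.02, 0.05)

totals = [0, 0, 0, 0]
for day in range(14):
    p_a, p_b = influx if day in (3, 4) else regular   # a two-day traffic spike
    for i, value in enumerate(simulate_day(2_000, p_a, p_b)):
        totals[i] += value

n_a, conv_a, n_b, conv_b = totals
print(f"A: {conv_a}/{n_a} = {conv_a / n_a:.2%}")
print(f"B: {conv_b}/{n_b} = {conv_b / n_b:.2%}")
# If B comes out ahead over the whole two weeks, that lift came from a
# short-lived population shift, not from the design change itself.
```

Nothing in the aggregated totals reveals the spike; only a diary of external events would.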
2. Test-statistic Z is somewhat unreliable
Statistical analysis of skewed data requires a different approach than analysis of 'normal' data (with a roughly 50/50 distribution).
Conversion data are certainly skewed: a conversion rate of 5% would be really high for most sites. Studying the assumptions of the z-statistic used in most conversion rate tests confirmed my suspicions. The z-statistic is not designed for such skewed datasets. It becomes unreliable when conversion rates are below 5% (some statistical handbooks even say 10%!). Due to the skewed distribution, the chance of making a type I error (concluding that there is a significant difference when in reality there is not) rises. That does not mean the z-statistic is useless.
Not at all; I do not have a better alternative either. It does mean, however, that the interpretation becomes more complicated and needs more nuance. With very large amounts of data the statistic regains reliability. But especially on sites with small numbers of visitors (and thus very few conversions), one should be very careful interpreting the significance. I think you should have at least 30 conversions a week to do proper testing. Note: that is my opinion, not a statistical law! Stopping a test immediately after the result becomes significant is dangerous; the statistic just is not that reliable.
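For reference, the pooled two-proportion z-statistic that most conversion tools compute looks roughly like this (a sketch with hypothetical conversion counts, not any specific tool's implementation):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-statistic.

    This relies on a normal approximation that gets shaky when the expected
    number of conversions (n * p) is small -- exactly the low-conversion-rate
    situation described above.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical numbers: 2.0% vs 2.4% conversion on 10,000 visitors each.
z = two_proportion_z(200, 10_000, 240, 10_000)
print(round(z, 2))  # ~1.93: just below the usual 1.96 cut-off for p < 0.05
```

Note how close the result sits to the significance threshold even with 10,000 visitors per group; a handful of conversions either way would flip the verdict.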
In my opinion, not the significance but the relevance should be leading in deciding whether changes in your design lead to an increase in conversions.
Is there a meaningful difference (even if it is not significant) after running a test for a week? Yes? Then you are on to something… No? Then you are probably not on to something…
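One way to put relevance next to significance is to report the size of the difference with a confidence interval, rather than a bare significant/not-significant verdict. A sketch (again with hypothetical conversion counts):

```python
import math

def lift_with_ci(conv_a, n_a, conv_b, n_b, z=1.96):
    """Absolute difference in conversion rate with an approximate 95% CI.

    The width of the interval shows how much (or how little) the data
    actually pin down, which says more about relevance than a yes/no verdict.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se

diff, lo, hi = lift_with_ci(200, 10_000, 240, 10_000)
print(f"lift = {diff:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```

An interval that barely straddles zero tells you the observed lift may well be real but is not yet proven; a bare "not significant" would hide that nuance.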
3. Interpretation must remain narrow
It is important to realize that the conclusions you can draw from the test results never outgrow the test environment. Thus, if you are comparing the conversion of a green 'buy now' button with that of a red version, you can only say something about that button, on that site, in those colors. You cannot say anything beyond that. The mechanisms causing an increase in sales with a green button (e.g. red makes people aggressive, green is a more social color) remain outside the scope of your test.
Test and stay aware
Conversion tools are aptly called ‘tools’.
Perhaps packages and programs designed for conversion testing should help people make their interpretations.
Moreover, I would advise people to test in full weeks (but not much longer, if you do not want to pollute your results with period effects). In addition, people should keep a diary of possible period effects, and these effects should always be taken into account while interpreting test results. I would also strongly advise running tests only if a website has sufficient visitors. Finally, I would advise you to take the significance with a grain of salt: it is only one test statistic (and probably not a very reliable one), and the difference between significant and non-significant is small. You should interpret test results taking into account both relevance (is there a meaningful difference in conversion?) and significance.