In a previous post, Thijs made quite a fuss about how many conversion-testers do not know their business. He stated that both the execution and the interpretation of testing showed serious flaws. His major point was that the way we deal with conversion-testing is not scientific. At all. Time to define scientific. Time to explain the theory behind the tests we use to optimise our conversion.
Joost asked me to look into these conversion rate tests because of my expertise in statistics and research design. In a previous life, I was a criminologist studying criminal behaviour in very large-scale datasets. I learned a lot about (crappy) research designs and complicated statistics. My overall opinion of conversion rate tests is that they are amazing, beautiful and very useful. But… without some proper knowledge of research and statistics, the risk of interpreting your results incorrectly is large. In this article, I attempt to explain my opinion in detail, formulating 3 major arguments.
1. Research design is beautiful but NOT flawless
As I started to investigate the test designs of conversion rate testing, I was astonished and delighted by the beauty of the design of A/B-testing. Most of my own scientific studies did not have a research design that strong and sophisticated. Unfortunately, that does not mean that testing and interpreting results is easily done. Let me explain!
An experimental design
A/B-testing uses what is called an experimental design. Experimental designs are used to investigate causal relations. A causal relationship implies that one thing (e.g. an improved interface) will lead to another thing (e.g. more sales). There are variations in experimental designs (e.g. the true experimental design, the quasi-experimental design), but I would like to leave their explanation for another post.
For the understanding of this post you only need to know that an experimental design needs two groups. One group is exposed to a stimulus, while the other is not. All (literally all!) other conditions have to be identical. Differences between the groups can then be attributed to the stimulus only.
In A/B-testing it is thus of the utmost importance that both groups are identical to each other.
This can be achieved most easily through randomization. As far as we know, most providers of conversion-testing use randomization. The visitor sees your website either in version A or in version B. Which version is shown is based on pure chance. The idea is that randomization will ensure that both groups are alike. So far, the research design is very strong: the groups are identical. In theory, this is an experimental goldmine!
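To make the idea of randomization a bit more tangible, here is a minimal sketch in Python. It is not how any particular testing tool works; the function name, the fixed seed and the "returning visitor" trait are all made up for illustration. Each visitor gets a coin flip, and with enough visitors a trait that has nothing to do with the test ends up about equally common in both groups.

```python
import random

def assign_variant(rng: random.Random) -> str:
    """Fair coin flip per visitor: 'A' or 'B' with equal probability."""
    return "A" if rng.random() < 0.5 else "B"

rng = random.Random(42)  # fixed seed only so the sketch is reproducible

# Hypothetical trait: assume 30% of visitors are returning visitors.
groups = {"A": [], "B": []}
for _ in range(20_000):
    is_returning = rng.random() < 0.3
    groups[assign_variant(rng)].append(is_returning)

# Share of returning visitors within each group.
shares = {name: sum(traits) / len(traits) for name, traits in groups.items()}
for name, share in shares.items():
    print(f"Group {name}: {len(groups[name])} visitors, {share:.1%} returning")
```

With only a few dozen visitors instead of 20,000, the two shares can differ quite a lot purely by chance.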
Period effects mess with your results
However… randomization will only ensure identical groups if you have enough visitors. Sites with few visitors could simply choose to run their tests for a longer period of time. But then… all kinds of changes in the population could occur, due to blog posts, links or news in the world. Randomization will still take care of differences between groups. However, there will be differences within your population due to all kinds of period effects. These differences within your population could interact with the results of your A/B-tests. Let me explain this last one:
Imagine you have a site for nerds and you try to sell plugins. You’re doing an A/B-test on your checkout page. Then you write a phenomenal blog post about your stunning wife and a whole new (and very trendy) population visits your website. It could be that this new population responds differently to the changes in your checkout page than the old nerdy population. It could be that the new population (knowing less about the web) is more influenced by usability changes than the old nerdy population. In that case, your test results would show an increase in sales based on this new population. If the sudden increase in trendy people on your website only lasts for a short period of time, you will draw the wrong conclusions.
Running tests for longer periods of time will only work if you keep a diary in which you write down all possible external explanations. You should interpret your results carefully and always in light of relevant changes in your website and your population.
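The nerds-versus-trendy example can be put into numbers. The sketch below uses conversion rates that are entirely made up (they are assumptions for illustration, not results from a real test): the nerds convert equally well on both checkout pages, while the trendy visitors convert twice as well on version B. The lift you measure then depends on who happens to visit during the test.

```python
def blended_rate(share_new: float, rate_old: float, rate_new: float) -> float:
    """Overall conversion rate for a population that is part old, part new visitors."""
    return (1 - share_new) * rate_old + share_new * rate_new

# Hypothetical conversion rates per audience and checkout version.
NERDS = {"A": 0.04, "B": 0.04}   # assumed: the nerds do not care about the change
TRENDY = {"A": 0.01, "B": 0.02}  # assumed: the new audience reacts strongly

def measured_lift(share_trendy: float) -> float:
    """Relative lift of B over A that a test would measure for a given audience mix."""
    rate_a = blended_rate(share_trendy, NERDS["A"], TRENDY["A"])
    rate_b = blended_rate(share_trendy, NERDS["B"], TRENDY["B"])
    return rate_b / rate_a - 1

for share in (0.0, 0.2, 0.5):
    print(f"{share:.0%} trendy visitors: measured lift {measured_lift(share):+.1%}")
```

With these made-up numbers, there is no lift at all while only nerds visit, and a 20% lift when half of the visitors are trendy. Once the trendy crowd leaves again, that lift disappears with them.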
2. Test-statistic Z is somewhat unreliable
Working on my PhD I had to do all kinds of analyses with skewed data. In my case, 95% of my data consisted of law-abiding citizens (thank god), while only 5% had committed a crime. Doing statistical analyses with such skewed data required a different statistical approach than analyses with ‘normal’ data (with a 50/50 distribution). My gut feeling told me that conversion rate testing actually faces the same statistical challenges. Surely, conversions are very skewed. A conversion rate of 5% would be really high for most sites. Studying the assumptions of the z-statistic used in most conversion rate tests confirmed my suspicions. The z-statistic is not designed for such skewed datasets. It becomes unreliable if conversion rates are below 5% (some statistical handbooks even state 10%!). Due to skewed distributions, the chance of making a type I error (concluding that there’s a significant difference, while in reality there is not) rises.
That does not mean that the z-statistic is useless. Not at all. I do not have a better alternative either. It does mean, however, that the interpretation becomes more complicated and needs more nuance. With very large amounts of data the statistic regains reliability. But… especially on sites with small numbers of visitors (and thus very few conversions) one should be very careful when interpreting the significance. I think you should have at least 30 conversions a week to do proper testing. Note: that is my opinion, not a statistical law!
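For readers who want to see what such a z-statistic computes, here is a minimal sketch of a pooled two-proportion z-test. This is a common textbook form; that a given testing tool uses exactly this formula is an assumption. The function name and the visitor numbers are made up for illustration.

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test: returns (z, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no conversions (or only conversions): nothing to test
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical test: 10,000 visitors per version, 200 vs. 250 conversions.
z, p = two_proportion_z(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The p-value relies on the normal approximation, which is exactly the part that gets shaky when conversions are rare and visitor numbers are small.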
Stopping a test immediately after the result is significant is a bit dangerous. The statistic just is not that reliable.
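One well-known reason why stopping early is dangerous, separate from the skewness argument above, is that checking the result every day gives chance more opportunities to produce a "significant" difference. The simulation below is a rough sketch under assumed numbers (a 5% conversion rate, 200 visitors per version per day, 14 days, a pooled z-test with a 1.96 threshold). Both versions are identical, so every significant result is a false alarm.

```python
import math
import random

def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z-statistic (0.0 if it cannot be computed)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if std_err == 0 else (conv_b / n_b - conv_a / n_a) / std_err

def simulate(n_tests=300, days=14, visitors_per_day=200, rate=0.05, seed=1):
    """Run A/A tests; return (share ever significant, share significant at the end)."""
    rng = random.Random(seed)
    ever_count = 0
    end_count = 0
    for _ in range(n_tests):
        conv_a = conv_b = visitors = 0
        ever_significant = False
        significant = False
        for _ in range(days):
            visitors += visitors_per_day
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            # Daily check, as someone peeking at the dashboard would do.
            significant = abs(z_score(conv_a, visitors, conv_b, visitors)) > 1.96
            ever_significant = ever_significant or significant
        ever_count += ever_significant
        end_count += significant  # result of the check on the final day only
    return ever_count / n_tests, end_count / n_tests

peeking_rate, fixed_rate = simulate()
print(f"False alarms when stopping at first significance: {peeking_rate:.1%}")
print(f"False alarms when only looking after 14 days:     {fixed_rate:.1%}")
```

Because the final day is also one of the daily checks, the first number can never be lower than the second; how much higher it is depends on the assumed numbers.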
In my opinion, relevance rather than significance should lead the decision on whether changes in your design increase conversions. Is there a meaningful difference (even if it is not significant) after running a test for a week? Yes? Then you are on to something… No? Then you are probably not on to something…
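Looking at relevance simply means looking at the size of the difference itself. A minimal sketch, with made-up numbers and a made-up threshold for what counts as "meaningful" (what is meaningful for your site is a business decision, not a statistical one):

```python
def effect_size(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute difference in conversion rate, relative lift of B over A)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    absolute = rate_b - rate_a
    # The relative lift is undefined when version A has no conversions at all.
    relative = absolute / rate_a if rate_a > 0 else None
    return absolute, relative

MIN_RELEVANT_LIFT = 0.10  # hypothetical: a lift below 10% is not worth the effort

# Hypothetical week of testing: 20 vs. 30 conversions on 1,000 visitors each.
absolute, relative = effect_size(20, 1_000, 30, 1_000)
print(f"Absolute difference: {absolute * 100:.1f} percentage points")
print(f"Relative lift: {relative:.0%}")
print("Meaningful?", relative is not None and relative >= MIN_RELEVANT_LIFT)
```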
3. Interpretation must remain narrow
It is important to realize that the conclusions you can draw from the test results never outgrow the test environment. Thus, if you are comparing the conversion using a green ‘buy now’ button with the conversion using a red version of the button, you can only say something about that button, on that site, in that color. You cannot say anything beyond that. Mechanisms that might explain an increase in sales because of a green button (e.g. red makes people aggressive, green is a more social colour) remain outside the scope of your test.
Test and stay aware
Conversion tools are aptly called ‘tools’. You can compare them to a hammer; you’ll use the hammer to get some nails in a piece of wood, but you won’t actually have the hammer do all the work for you, right? You still want to be in control, to be sure the nails are driven as deep as you want, and in the spot that you want. It’s the same with conversion tools; they’re tools you can use to reach a desired outcome, but you shouldn’t let yourself be led by them. It is of great importance that you’re always aware of what you are testing and that you nuance your results in the light of period effects and relevance. That, actually, is the scientific way to do conversion rate optimization.
Perhaps packages and programs designed for conversion testing should help people with their interpretations. Moreover, I would advise people to test in full weeks (but not much longer if you do not want to pollute your results with period effects). Next to that, people should keep a diary of possible period effects. These effects should always be taken into account while interpreting test results. Also, I would strongly advise running tests only if a website has sufficient visitors. Finally, I would advise you to take the significance with a grain of salt. It is only one test statistic (probably not a very reliable one) and the difference between significant and non-significant is small. You should interpret test results taking into account both relevance (is there a meaningful difference in conversion?) and significance.