I heard you like A/B testing
While doing split testing for Candy Japan I started wondering whether there could be a better way to decide which page variation to show for each visit. The way A/B testing is most commonly done is to run a test for a while: show one variation to half of the users and another variation to the rest, don't look at the results while the test is running, stop the test after some pre-decided time and then always use the winning variation from that point on.
After I started buying advertising and actually paying for each visit, I began to wonder whether this is really the best way to do things. The way I started thinking about it was to imagine being at the races with two horses, only one of which I could enter in each race, with my only decision being which one to pick for the next round and a lot of money at stake. In that situation I would think very carefully each time about which one to use. I wouldn't just arbitrarily give each a chance at a thousand races and then pick the better one.
This part is about horses
In this imaginary horse racing each horse has a predetermined winning ratio, but I can only learn it by experimenting. In the beginning I would have no reason to prefer one horse over the other, so I would let both race once and might then have the following situation.
Horse A: Win
Horse B: Win
Hmm... they both won and have a 100% winning ratio. I didn't really learn anything there. Let's let both race once more and add that result to the end.
Horse A: Win, Lose (50% wins)
Horse B: Win, Win (100% wins)
Now even though I don't have a lot of data, surely this means there is a slightly better chance that Horse B would actually be the better choice for all my future horsing needs. I should still give Horse A the benefit of the doubt, though; it might have just happened to be unlucky. To reflect that, perhaps from now on I will still use Horse A as well, but will start using Horse B a bit more often.
If I did start using only Horse B from now on, what would need to happen for me to change my mind? Well, if Horse B lost twice in a row, then its winning ratio would also drop to 50% and I would no longer prefer it over Horse A. Now comes the meat of this post.
What is the chance that Horse B would lose twice in a row? If I knew that, then I could use it to decide how often I should still use Horse A. To work it out I would need to know the true winning ratios of these horses, which of course at this point I can't really know, but I can guess based on the horses I have owned in the past. They have all been superstars and have won 50% of the time, so I will take 50% as a good guess for a winning ratio.
For Horse B the chance that it would lose once is then 50% and the chance that it would lose twice is 50% * 50% = 25%. So there is a quarter chance that just with random fluctuation Horse A would catch up. My wild guess: maybe it would make sense to start using Horse A just 25% of the time from now on to reflect the chance it has for catching up? I could then continue adjusting that as more wins and losses come in, always using the losing horse as often as its chance of catching up indicates.
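Here is a minimal Python sketch of that catch-up rule, using the numbers from the example above (the function name and the hard-coded 50% guess are just my own way of writing it down):

```python
def catch_up_probability(leader_wins, leader_races,
                         trailer_wins, trailer_races,
                         lose_chance=0.5):
    """Chance that the leader loses enough races in a row for its
    winning ratio to drop to the trailer's ratio or below."""
    trailer_ratio = trailer_wins / trailer_races
    losses_needed = 0
    # Count the straight losses the leader needs (capped so a trailer
    # with zero wins can't make this loop run forever).
    while (leader_wins / (leader_races + losses_needed) > trailer_ratio
           and losses_needed < 50):
        losses_needed += 1
    return lose_chance ** losses_needed

# Horse A: 1 win out of 2 races, Horse B: 2 wins out of 2 races.
# B has to lose twice in a row for A to catch up: 0.5 * 0.5 = 0.25.
print(catch_up_probability(2, 2, 1, 2))  # 0.25
```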
Would this work? Not wanting to think too hard, I did what any programmer would do. I wrote a simulator.
Split testing split testing engines
At this point I ran out of my allotment for the word "horse" and returned to reality. My simulator isn't about horses, it's about an imaginary website for a startup company. The website gets visits and has a roughly known conversion ratio. The website owner has thought of some split tests, but has to decide when to turn each on or off. For example the site owner could show customer testimonials on the site, which would boost or hurt the conversion ratio by x%.
In my simulation the initial conversion ratio is 5% and the value of each conversion is $39.90. The time span for my simulation is always 10000 visits, so you might expect the total revenue to be around $19950.
The simulator has modules for each split testing algorithm, which are used to decide which split tests to use for each visit. They don't have access to any information that I wouldn't have in reality; they just get a visit and return which split tests to use. The main loop of the sim then decides whether that visit happened to convert or not.
At first I wanted to just try the simplest possible case as a sanity check: not using any of the split tests at all. I ran the sim 5000 times and took the average revenue over all of the runs. The result was $19939.70, so things seemed to be working as expected. I also tried serving each visit with split tests randomly turned on or off, and that also performed similarly.
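To make the setup concrete, here is a stripped-down sketch of how I think of the main loop and the sanity check. The names, the strategy interface and the example test effects are made up for illustration; the linked code at the end is the real thing.

```python
import random

BASE_CONVERSION = 0.05      # 5% baseline conversion ratio
CONVERSION_VALUE = 39.90    # dollars earned per conversion
VISITS = 10000

# Hypothetical split tests and how much each multiplies the
# conversion ratio when turned on (values invented for this sketch).
SPLIT_TEST_EFFECTS = {"testimonials": 1.10, "bigger_button": 0.95}

def run_sim(strategy, visits=VISITS):
    """One simulated run: ask the strategy which tests to show on each
    visit, roll the dice for a conversion, and report back to it."""
    revenue = 0.0
    for visit in range(visits):
        active = strategy.choose(visit)
        ratio = BASE_CONVERSION
        for name in active:
            ratio *= SPLIT_TEST_EFFECTS[name]
        converted = random.random() < ratio
        if converted:
            revenue += CONVERSION_VALUE
        strategy.record(active, converted)
    return revenue

class NoTests:
    """Sanity check strategy: never turn any split test on."""
    def choose(self, visit):
        return set()
    def record(self, active, converted):
        pass

# Average over many runs; should come out close to $19950.
runs = 1000
print(sum(run_sim(NoTests()) for _ in range(runs)) / runs)
```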
Next I wanted to try the algorithm I had been using up to this point: run the split test for a while, then decide the winner and only use that one in the future. Running the split tests for 500 visits and then choosing the better one resulted in avg. revenue of $23358.48 (I also tried running the tests for 1000 or 2000 visits before deciding, but those did not perform as well).
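Written against the same made-up interface as the sketch above, that strategy looks roughly like this (one object per split test; again just my sketch, not the real module):

```python
import random

class FixedHorizonTest:
    """Explore for the first `horizon` visits by flipping a coin for
    the variation, then lock in whichever converted better."""

    def __init__(self, name, horizon=500):
        self.name = name
        self.horizon = horizon
        self.stats = {True: [0, 0], False: [0, 0]}  # shown -> [conversions, visits]
        self.winner = None
        self.last_shown = False

    def choose(self, visit):
        if self.winner is None and visit >= self.horizon:
            # Decide once: compare conversion ratios with the test on vs. off.
            on_conv, on_visits = self.stats[True]
            off_conv, off_visits = self.stats[False]
            self.winner = on_conv / max(on_visits, 1) > off_conv / max(off_visits, 1)
        if self.winner is None:
            show = random.random() < 0.5   # still exploring
        else:
            show = self.winner             # exploit the chosen variation
        self.last_shown = show
        return {self.name} if show else set()

    def record(self, active, converted):
        conversions_and_visits = self.stats[self.last_shown]
        conversions_and_visits[0] += int(converted)
        conversions_and_visits[1] += 1
```

In the sketch world this would be wired up as something like run_sim(FixedHorizonTest("testimonials")).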
Horse.
I saved the most interesting test for last. How would my horse algo do? The average after 5000 runs was... $24383.06. Because the bad variations were abandoned sooner and the good ones got more traffic earlier, in this little simulated world this way of running tests made an avg. of $1024.58 more revenue.
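For the curious, here is how I would sketch the horse rule as one of those strategy modules, reusing the catch-up probability from the horse example. I'm glossing over how the 50% coin-flip guess maps onto a 5% conversion ratio, so treat this as an illustration of the idea rather than the code that produced the number above; the linked sim has the real details.

```python
import random

class CatchUpTest:
    """Horse-style allocation for one split test: the trailing variation
    ('on' or 'off') is shown with probability equal to its chance of
    catching up, using the same 50% coin-flip guess as the horse example."""

    def __init__(self, name):
        self.name = name
        self.stats = {True: [0, 0], False: [0, 0]}  # shown -> [conversions, visits]
        self.last_shown = False

    def _ratio(self, shown):
        conversions, visits = self.stats[shown]
        return conversions / visits if visits else 0.0

    def choose(self, visit):
        # Until both variations have been tried, just flip a coin.
        if self.stats[True][1] == 0 or self.stats[False][1] == 0:
            show = random.random() < 0.5
        else:
            leader = self._ratio(True) >= self._ratio(False)
            trailer_ratio = self._ratio(not leader)
            conversions, visits = self.stats[leader]
            # How many conversion-less visits in a row would drop the
            # leader to the trailer's ratio? (capped to avoid looping forever)
            misses = 0
            while conversions / (visits + misses) > trailer_ratio and misses < 50:
                misses += 1
            catch_up = 0.5 ** misses
            # Use the trailing variation as often as its catch-up chance says.
            show = (not leader) if random.random() < catch_up else leader
        self.last_shown = show
        return {self.name} if show else set()

    def record(self, active, converted):
        conversions_and_visits = self.stats[self.last_shown]
        conversions_and_visits[0] += int(converted)
        conversions_and_visits[1] += 1
```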
If you would like to try running it yourself, you can find the Python code for the sim here.