Tag: abtest

Data Science vs. Data Alchemy – by Lucas Vermeer

How do scurvy, astronomy, alchemy and data science relate to each other?

In this GOTO conference presentation, Lucas Vermeer — Director of Experimentation at Booking.com — uses some amazing storytelling to demonstrate that the value of data (science) is largely determined by an organization’s capability to gather the right data — the data they actually need.

It’s definitely worth watching for data scientists and data science leaders out there.

Here are the slides, and they contain some great one-liners:

@lucasvermeer
Determine optimal sample sizes for business value in A/B testing, by Chris Said

A/B testing is a method of comparing two versions of something against each other to determine which one is better. A/B tests are often mentioned in e-commerce contexts, where the things being compared are web pages.

via optimizely.com/nl/optimization-glossary/ab-testing/

Business leaders and data scientists alike face a difficult trade-off when running A/B tests: how big should the A/B test be? In other words, after collecting how many data points, or after running for how many days, should we decide whether A or B is the best way to go?

This is a trade-off because the sample size of an A/B test determines its statistical power: in simple terms, the probability that the test will detect an effect if there actually is one. In general, the more data you collect, the higher the odds of finding the real effect and making the right decision.

By default, researchers often aim for 80% power with a 5% significance cutoff. But is this general guideline really optimal for the trade-off between costs and benefits in your specific business context? Chris thinks not.
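
For reference, here is a minimal sketch of what that default calculation looks like in Python, using statsmodels. The baseline and expected conversion rates are made-up numbers for illustration:

# Sketch of a default sample-size calculation: 80% power, 5% significance.
# The conversion rates below are hypothetical.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # assumed current conversion rate of version A
expected = 0.11   # conversion rate we hope version B achieves

effect_size = proportion_effectsize(expected, baseline)  # Cohen's h
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,             # 5% significance cutoff
    power=0.80,             # 80% power
    ratio=1.0,              # equal traffic split between A and B
    alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_variant:,.0f}")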

Chris Said wrote a great three-part blog series in which he explains how you can mathematically determine the optimal duration of an A/B test in your own company setting:

Part I: General Overview. Starts with a mostly non-technical overview and ends with a section called “Three lessons for practitioners”.

Part II: Expected lift. A more technical section that quantifies the benefits of experimentation as a function of sample size.

Part III: Aggregate time-discounted lift. A more technical section that quantifies the costs of experimentation as a function of sample size. It then combines costs and benefits into a closed-form expression that can be optimized. Ends with an FAQ.

Chris Said (via)

Moreover, Chris provides three pieces of practical advice that underline that 80% statistical power is not always the best option:

  1. You should run “underpowered” experiments if you have a very high discount rate
  2. You should run “underpowered” experiments if you have a small user base
  3. Nevertheless, it’s far better to run your experiment too long than too short
Simulations show that for Chris’ hypothetical company and A/B test, 38 days would be the optimal period of time to gather data
via chris-said.io/2020/01/10/optimizing-sample-sizes-in-ab-testing-part-I/

Chris ran all his simulations in Python and shared the notebooks.
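
His notebooks are the place to go for the real analysis, but the rough sketch below illustrates the core idea behind Part II: simulate many experiments under an assumed prior over true lifts and watch the expected realized lift grow, with diminishing returns, as the per-variant sample size increases. All numbers are made up and the decision rule is deliberately naive:

# Rough sketch (not Chris' code): expected realized lift as a function of
# sample size, assuming true lifts follow a normal prior. Numbers are made up.
import numpy as np

rng = np.random.default_rng(42)

baseline = 0.10      # assumed baseline conversion rate of version A
prior_sd = 0.01      # assumed spread of true lifts across candidate B's
n_sims = 20_000      # simulated experiments per sample size

for n in [1_000, 5_000, 20_000, 80_000]:
    true_lift = rng.normal(0.0, prior_sd, n_sims)
    # Observed conversion rates include binomial sampling noise.
    conv_a = rng.binomial(n, baseline, n_sims) / n
    conv_b = rng.binomial(n, np.clip(baseline + true_lift, 0, 1), n_sims) / n
    # Naive decision rule: ship B whenever it looks better than A.
    shipped = (conv_b - conv_a) > 0
    realized_lift = np.where(shipped, true_lift, 0.0).mean()
    print(f"n per variant = {n:>6,}: expected realized lift = {realized_lift:.5f}")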

E-Book: Probabilistic Programming & Bayesian Methods for Hackers

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. Nevertheless, mathematical analysis is only one way to “think Bayes”. With cheap computing power, we can now afford to take an alternate route via probabilistic programming.

Cam Davidson-Pilon wrote the book Bayesian Methods for Hackers as an introduction to Bayesian inference from a computational, understanding-first, mathematics-second point of view.

The book is available via Amazon, but you can access an online e-book for free. There’s also an associated GitHub repo.

The book explains Bayesian principles with code and visuals. For instance:

%matplotlib inline
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats

figsize(11, 9)

dist = stats.beta
n_trials = [0, 1, 2, 3, 4, 5, 8, 15, 50, 500]
data = stats.bernoulli.rvs(0.5, size=n_trials[-1])  # simulate 500 fair coin tosses
x = np.linspace(0, 1, 100)

for k, N in enumerate(n_trials):
    # Integer division: subplot() expects integer grid dimensions
    sx = plt.subplot(len(n_trials) // 2, 2, k + 1)
    if k in [0, len(n_trials) - 1]:
        plt.xlabel("$p$, probability of heads")
    plt.setp(sx.get_yticklabels(), visible=False)
    heads = data[:N].sum()
    # Posterior after N tosses: Beta(1 + heads, 1 + tails), starting from a uniform prior
    y = dist.pdf(x, 1 + heads, 1 + N - heads)
    plt.plot(x, y, label="observe %d tosses,\n %d heads" % (N, heads))
    plt.fill_between(x, 0, y, color="#348ABD", alpha=0.4)
    plt.vlines(0.5, 0, 4, color="k", linestyles="--", lw=1)

    leg = plt.legend()
    leg.get_frame().set_alpha(0.4)
    plt.autoscale(tight=True)

plt.suptitle("Bayesian updating of posterior probabilities",
             y=1.02,
             fontsize=14)
plt.tight_layout()

I highly recommend you start with the online version of Bayesian Methods for Hackers, but note that the print version helps sponsor the author and includes some additional features:

  • An additional chapter on Bayesian A/B testing
  • Updated examples
  • Answers to the end-of-chapter questions
  • Additional explanations and rewritten sections to aid the reader

If you’re interested in learning more about Bayesian analysis, I recommend these other books:

Helpful resources for A/B testing

Brandon Rohrer — (former) data scientist at Microsoft, iRobot, and Facebook — asked his network on Twitter and LinkedIn to share their favorite resources on A/B testing. It produced a nice list, which I summarized below.

The order is somewhat arbitrary, and somewhat based on my personal appreciation of the resources.

Cover image via Optimizely

A/B Testing a New Look

This WordPress blogger I came across — let’s call him “John” for now — has a very peculiar way of testing out his looks. Using dating apps like Tinder, John conducted A/B tests to find out whether people would prefer him romantically with or without a beard.

Via a proper experimental setup, John found out that bearded John receives much more attention in the form of Tinder matches. However, this did not hold for girls whom John characterized as Asian; that group seemed to prefer clean-shaven John.

While the sample sizes were not that large (500 profiles shown per condition) and the number of responses even lower (64 matches for bearded John versus 30 for shaven John), this seems like a fun way to make your look more data-driven!
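
If you want to check the numbers yourself, a quick two-proportion z-test on the reported match counts (not part of John’s original write-up) would look something like this:

# Two-proportion z-test on the reported Tinder match counts.
from statsmodels.stats.proportion import proportions_ztest

matches = [64, 30]     # matches for bearded vs. shaven John, as reported
profiles = [500, 500]  # profiles shown per condition, as reported

z_stat, p_value = proportions_ztest(count=matches, nobs=profiles)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")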

Read more on “John’s” original blog below:

https://appsciencing.wordpress.com/2018/11/19/beard-studies/

12 Guidelines for Effective A/B Testing

I wrote about Emily Robinson and her A/B testing activities at Etsy before, but now she’s back with a great new blog full of practical advice: Emily provides 12 guidelines for A/B testing that help you set up effective experiments and avoid data-driven but erroneous conclusions:

  1. Have one key metric for your experiment.
  2. Use that key metric to do a power calculation.
  3. Run your experiment for the length you’ve planned on.
  4. Pay more attention to confidence intervals than p-values.
  5. Don’t run tons of variants.
  6. Don’t try to look for differences for every possible segment.
  7. Check that there’s no bucketing skew (a quick check is sketched after this list).
  8. Don’t overcomplicate your methods.
  9. Be careful of launching things because they “don’t hurt”.
  10. Have a data scientist/analyst involved in the whole process.
  11. Only include people in your analysis who could have been affected by the change.
  12. Focus on smaller, incremental tests that change one thing at a time.
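
As a small illustration of guideline 7, a bucketing-skew (sample ratio mismatch) check can be as simple as a chi-square test of the observed assignment counts against the intended traffic split. The counts below are hypothetical:

# Hypothetical bucketing-skew (sample ratio mismatch) check for a 50/50 split.
from scipy.stats import chisquare

observed = [50_421, 49_198]          # made-up user counts assigned to A and B
expected = [sum(observed) / 2] * 2   # what a perfect 50/50 split would give

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Possible bucketing skew (p = {p_value:.2g}); investigate before trusting results.")
else:
    print(f"No evidence of skew (p = {p_value:.2g}).")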

You can read more details regarding each guideline in Emily’s original blog post.

In her blog, Emily also refers to a great article by Stephen Holiday discussing five online experiments that had (almost) gone wrong and a presentation by Dan McKinley on continuous experimentation.