Running an A/B test is easy. Screwing up an A/B test is even easier. Are you sure your new feature is working as expected? Is your randomization strategy correlated with something it shouldn’t be? Do you have a proper control group for comparisons?
At a minimum, it’s important to have a platform that makes it easy to run valid experiments. For some companies, that means using a third party service, like Optimizely, Dynamic Yield, or Split. For others, it’s using a vetted open source system such as Facebook’s Planout or Intuit’s Wasabi. For us, it meant building a custom platform in-house that could support our sometimes esoteric needs.
Future blog posts will go into the details of the system, but for now there are three aspects of our new experimentation platform that really get me excited - #oneway, centralization, and leverage through automated analytics. These dovetail to create a democratization of data-driven decision making across the company (say that 10 times over).
We like to support one, and only one, way of doing things. This makes it simpler for our data scientists to learn how to get things done, it makes it easier for us to maintain systems, and it enables us to iterate faster on new ideas. While it is the domain of the data scientists to design and run experiments1, it is not their job to come up with a general platform to make it easier. The experimentation platform, run by the Algorithms Data Platform team, is a big step in that direction for us. Historically, we’ve run A/B tests using a variety of third party services and in-house tools. The front-end engineering teams used one system, backend engineering used another, and data science had a slew of options (one of which used to include data scientists manually writing to, and deleting from, the monolithic prod DB 😳). Simply finding out what experiments were running at any given time was #annoying if not impossible. Imagine trying to quantify the impact of your testing program. We needed a better way.
Centralized A/B tests
Consistent with #oneway, it was important for us to centralize our efforts and build out a single, unified home for experiments. This meant one system that would work for UI, engineering, and data science experiments. It needed to handle a slew of randomization strategies, including vanilla Bernoulli randomization of clients / non-logged-in visitors / stylists / etc, as well as within-subjects designs and bandits of all shapes and flavors.
Centralizing experiments into one place has a number of surface-level benefits:
consistent metadata across the entire company’s experiments (you can now know who’s running that personalization algorithm test that just broke in the middle of the night)
- it only takes a quick glance at the dashboard to see what’s being tested at a given time and what’s being planned - this is critical for managing overlapping experiments
- you can focus engineering efforts on solutions that benefit all of your internal stakeholders
- the platform affords a common place to build all the peripheral features that such a system inevitably needs: a place to look up allocations, power calculators, pre-experimental predictions, etc.
There are also a host of more subtle, but often more impactful, advantages to a centralized system. One of the biggest is lowering the cost to run an experiment - this is a force multiplier that leads to more experiments, more iteration, and ultimately more innovation. Additionally, when people from one team spend time and resources to make their A/B test-running experience better, everyone benefits. When other teams build off that work, the impact is compounded. It also unlocks things like historical meta-analyses of metrics and policies. When the data for myriad experiments is in one place, you can look at how metrics have performed over time. You can also find metrics that are better correlated with North Stars and nudge metric definitions to maximize sensitivity to the factors that drive your business. This is really powerful.
Leverage Through Automated Analytics
Having #oneway to run experiments in a centralized fashion is great and all, but its utility is limited if every data scientist is analyzing their A/B tests in ad hoc notebooks, each with not-so-subtle variations (how many ways can you define churn?). By providing at least minimal metrics, computed in a standardized and reproducible way, you increase the usefulness of the platform substantially. With careful presentation and visualization, it democratizes the often mysterious what is that test doing question. Now you have to be careful here - A/B tests can be devilishly tricky to interpret and the final determination still usually needs to come from a dedicated analysis. How you approach this depends very heavily on how statistically-savvy your business partners are.
When we set out to build our inference pipeline, we drew a few lines in the sand based on prior experience. The metric definitions themselves needed to be centralized, accessible, editable, and owned by data science teams. The data to compute the metric values needed to come from a centralized, auditable, modifiable ETL system, with a Bring Your Own Data option for some specialty metrics where the work/reward ratio is not in our favor (not preferred but sometimes necessary). Finally, the metrics system had to support various statistical methodologies. t-tests are still our bread and butter, but our world is complex. Sometimes we want to employ variance reduction techniques. Other times we want to account for complex correlational structure (looking at you mixed models and Generalized Estimating Equations).
We’ll share more details on what we built in future blog posts!
↩ We don’t have an “inferential data scientist” position - someone who focuses solely on A/B tests. Instead, data scientists are responsible for designing and running experiments either for their own algorithms or for their business partners.