Debunking Narrative Fallacies with Empirically-Justified Explanations

When we experience volatility in business metrics, we tend to grasp for explanations. We fall for availability bias, and the more visceral or intuitive the explanation, the quicker we latch on: ‘The cool weather is dissuading customers’, ‘customers are happier on Fridays because the weekend is coming’, ‘people are concerned with the economic downturn’, ‘competitor xyz is making a lot of noise in the market, which is diluting our messaging’, and so on.

Of all our many talents – bipedalism, opposable thumbs, etc. – one of humanity’s most remarkable traits is our tendency to infer meaning from what happens around us. We understand the world through stories, and this is such a fundamental part of our nature that it is almost impossible to stop ourselves from inventing very reasonable-sounding explanations for what we see. A lot of these stories are intuitive and a lot of them might be right (seasonality is real in many businesses!), but we’re not good at knowing when our stories are trustworthy and when they aren’t.

So how do we deal with these issues at Stitch Fix?

Modeling a Business Metric

At Stitch Fix, we try to delight our clients with the fashion choices we send them. Human stylists, aided by an algorithmic recommendation system, select assortments of 5 articles of clothing or accessories that are personally chosen for each client. Clients keep what they want and ship back what they don’t want, only paying for what they keep.

We have various metrics of our overall business performance and health, which we monitor on a weekly basis. This blog post describes how we have tried to understand the weekly fluctuations of one of these metrics, but the story told here is applicable to any effort to understand business metrics.

Some weeks, this metric (which we’ll call \(M\)) is a little higher; some weeks it’s a little lower. Even if nothing substantive changed about our business, our clients, or our inventory, we wouldn’t expect \(M\) to be exactly identical from one week to the next, because of statistical fluctuations. For the same reason, if you flip a fair coin 100 times, you wouldn’t be surprised if you didn’t get exactly 50 heads. But it turns out that our typical weekly fluctuation in \(M\) is about 5 times the \(\sigma\) scatter that would be expected under the null hypothesis that everything is identical from one week to the next.
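To make the coin-flip comparison concrete, here is a minimal sketch of such a noise-floor check. It assumes, purely for illustration, that the metric behaves like a simple proportion (successes out of a number of independent trials). The post does not say how \(M\) is defined, and the function names and weekly numbers below are hypothetical.

```python
import math
import statistics

def binomial_sigma(p, n):
    """Standard deviation of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

def excess_scatter(weekly_values, trials_per_week):
    """Ratio of observed week-to-week scatter to pure sampling-noise scatter."""
    p = statistics.mean(weekly_values)
    observed = statistics.stdev(weekly_values)
    expected = binomial_sigma(p, trials_per_week)
    return observed / expected

# Coin-flip check: 100 flips of a fair coin give a sigma of 0.05 on the
# fraction of heads, i.e. about 5 heads either side of 50.
sigma_heads = binomial_sigma(0.5, 100)

# Made-up weekly values of a proportion-like metric, 10,000 trials per week.
weekly = [0.52, 0.49, 0.55, 0.47, 0.51, 0.54]
ratio = excess_scatter(weekly, 10_000)
```

A ratio close to 1 would mean the weekly movement looks like sampling noise alone; a ratio well above 1 suggests that something real is driving the metric.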

This means that there are real drivers underlying most of the fluctuations we see.

But how can we tell accurate stories about our changing metrics, and not fall victim to “narrative fallacy” by using explanations that sound good but might not be empirically grounded?¹

To address this issue, we decided to build a Metric Explainer – a model that predicts \(M\)’s value as a function of other things we know about our business. What do we know about our business that might be relevant to \(M\)?

Each week:

Different groups of clients receive shipments and some clients tend to consistently choose to keep more items than other clients.
We have different assortments of inventory and some pieces of clothing in our inventory tend to do better than others.
Different stylists work, and some stylists might be better at matching clothing to clients than others.
We have different numbers of new clients signing up, and new clients have different purchasing patterns on average than existing clients have.
The distribution of client tenures changes from week to week. Clients have different purchasing patterns throughout their tenure as members.
We have some operational issues that affect our business metrics, related to how many days per week our warehouses are open and processing returns.
Different promotions are going on. Promotions can affect our distribution of clients and can affect client incentives, which can affect our metrics.
There are seasonal variations.
And so on.

We can quantify what we know about the business with a list of features \(x_1, x_2, ..., x_n\), describing the factors listed above and more. For instance, \(x_1\) might quantify something about how well we have managed to delight a particular week’s client cohort in the past; \(x_2\) might be a measure of the quality of that week’s inventory; …; \(x_k\) might be the fraction of clients who are new; and so on. Using a specified set of features, we can try to build a model that will map these known quantities to a predicted value of \(M\).

Selecting a Model Type

What sort of model is a good one to build? We want accuracy, but not only accuracy. The goal is to use this model to diagnose problems and usefully direct actions within the business. If we had a black box that perfectly predicted \(M\) every week, it wouldn’t necessarily give us any insight. We want accuracy and interpretability. And we want the interpretation to actually be related to the mechanics of the business.

The first two types of models that came to mind with a view toward interpretability were linear regression models and decision tree regression models. Linear models are simple and can be fairly accurate, but one can often get more accuracy with nonlinear models. On the other hand, decision trees can be both accurate and easily interpretable by non-data-scientist partners, but they are prone to overfitting and don’t lead to smooth relationships between input and output variables.

Interpretable Random Forest Regression

In search of a better way, we considered variations on random forest regression models, which are basically black-box collections of lots of decision trees. They can be more robust and accurate than single trees, but in general they are significantly less interpretable. We’d like to be able to explain the variations in metrics and say things like, “This week, \(M\) went up by \(\Delta M\). Of this \(\Delta\), \(\delta_1\) is attributable to the change in feature 1, \(\delta_2\) to the change in feature 2, …” etc., where “feature 1” and so on are replaced by English descriptions of the thing that each feature quantifies. With a linear model, once the coefficients are learned, the way to allocate portions of the total \(\Delta\) to each feature is trivial, but with a random forest regression model it’s less clear.
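To spell out why the linear case is trivial, suppose the fitted linear model has the form below, where the intercept \(\beta_0\), the coefficients \(\beta_k\), and the label \(M^{lin}\) are notation introduced here for illustration:

\[M^{lin}[x_1, x_2, ..., x_n] = \beta_0 + \sum_{k=1}^{n} \beta_k x_k\]

If the features take values \(x_k^i\) in one week and \(x_k^j\) in another, each feature's share of the change is its coefficient times its own change, and the shares add up exactly to the model's total predicted change:

\[\delta_k = \beta_k \left(x_k^j - x_k^i\right), \qquad \sum_{k=1}^{n} \delta_k = M^{lin}[x_1^j, ..., x_n^j] - M^{lin}[x_1^i, ..., x_n^i]\]

Note that the sum matches the model's predicted change, which need not equal the observed change in \(M\).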

However, we realized that it’s possible to maintain interpretability with a random-forest regression model, too, as long as the weekly excursions in feature space aren’t too large. The procedure works as follows.

Suppose that the random forest regressor learns a function \(M^{RF}[x_1, x_2, ..., x_n]\), and we want to explain the change in \(M\) from week \(i\), when the features have values \((x_1^i, x_2^i, ..., x_n^i)\) to week \(j\), when the features have values \((x_1^j, x_2^j, ..., x_n^j)\).
We can assign a portion of the total change to feature 1 as follows (note that the two lists of arguments below are identical, except the superscript of \(x_1\) changes from \(j\) to \(i\)):

\[\delta_1 = M^{RF}[x_1^j, x_2^i, ..., x_n^i] - M^{RF}[x_1^i, x_2^i, ..., x_n^i]\]

We’re essentially calculating the partial difference of \(M^{RF}\) from week \(i\) to \(j\) with respect to \(x_1\). We can also calculate \(\delta\)’s that are attributable to each of the other features. The sum of the \(\delta\)’s does not in general exactly equal the total \(\Delta M\) in this procedure, but if the jump in feature space from week \(i\) to week \(j\) is not too extreme, this discrepancy is not large. With the above procedure in mind, therefore, we dabbled in “interpretable random forest models”.
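The partial-difference procedure above can be sketched in a few lines. This is an illustrative version that works with any prediction function, not the code we ran: `partial_deltas` and `toy_model` are hypothetical names, and the toy model merely stands in for a fitted random forest.

```python
def partial_deltas(predict, x_i, x_j):
    """Attribute a model's change between two weeks to individual features.

    predict: any callable mapping a list of feature values to a prediction
             (for example, a wrapper around a fitted random forest).
    x_i, x_j: feature values for week i and week j, in the same order.
    Returns (deltas, total_change, unexplained_remainder).
    """
    baseline = predict(list(x_i))
    deltas = []
    for k in range(len(x_i)):
        swapped = list(x_i)
        swapped[k] = x_j[k]          # move only feature k to its week-j value
        deltas.append(predict(swapped) - baseline)
    total = predict(list(x_j)) - baseline
    remainder = total - sum(deltas)  # nonzero when features interact
    return deltas, total, remainder

# Hypothetical stand-in for a fitted nonlinear model with an interaction term.
def toy_model(x):
    return 2.0 * x[0] + 0.5 * x[1] + 0.3 * x[0] * x[1]

deltas, total, remainder = partial_deltas(toy_model, [1.0, 2.0], [1.1, 2.2])
```

In this toy case the remainder comes entirely from the interaction term, and it shrinks as the step between the two weeks gets smaller, which is the sense in which the attribution is trustworthy for modest excursions in feature space.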

Bayesian Linear Modeling

Ultimately, the simple interpretability of linear models was attractive enough that we decided to return to them despite the potential slight improvement in accuracy from nonlinear models. Note that, with a linear model, we aren’t completely sacrificing the ability to capture nonlinear behavior or feature interactions – we can add terms that are linear in nonlinear functions of features or in feature interactions (for example, \(x_k^2\), \(\cos[x_l]\), and \(x_k x_l\)).
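As a small illustration of that point, the sketch below expands two raw features into the kinds of terms mentioned above and then predicts with an ordinary weighted sum. The function names and the coefficient values are hypothetical.

```python
import math

def expand_features(x_k, x_l):
    """Build a design-matrix row containing nonlinear and interaction terms."""
    return [x_k, x_l, x_k ** 2, math.cos(x_l), x_k * x_l]

def predict(coefficients, intercept, row):
    """The model is still a plain weighted sum of the expanded terms."""
    return intercept + sum(c * v for c, v in zip(coefficients, row))

row = expand_features(2.0, 0.0)
# Hypothetical coefficients, one per expanded term.
value = predict([1.0, 0.5, 0.25, 2.0, -0.1], 3.0, row)
```

The model stays linear in its coefficients, so each expanded term still gets its own coefficient and its own share of any week-to-week change.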

Another benefit to linear models is that it’s easy to apply priors. We have some prior notions about how the model output ought to depend on the features, and it’s easy to fold these into a linear model (and a lot harder to do so with a random forest model). For instance, in weeks when a higher proportion of the clients are those who generally like to keep more items, this ought to lead the model, all else being equal, to predict that \(M\) will be higher. We can incorporate this expectation as a strong prior that \(x_1\) should have a positive coefficient. Similarly, our algorithms and stylists have more information to learn about the preferences of repeat customers than about those of new clients, so the coefficient for the fraction of clients who are new ought to be negative, and we can enforce this expectation with a prior as well.

After setting up sensible priors, we then sample from the posterior distribution of coefficients by making use of the affine-invariant MCMC sampler emcee. Selecting the maximum a posteriori (MAP) values of the coefficients produces a reasonable model, and one that is highly interpretable.² Also, for a large class of priors on linear-model coefficients (any that produce a concave log-posterior), it’s not necessary to do any sampling to get a MAP estimate, since we can just solve a convex optimization problem.
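As one concrete instance of the no-sampling case, the sketch below assumes Gaussian noise and independent Gaussian priors on the coefficients. Under those assumptions (which are ours for illustration, not necessarily the priors used in production), the log-posterior is a concave quadratic and the MAP estimate solves a small linear system. A Gaussian prior centered on a positive value is only a soft stand-in for a strong "this coefficient should be positive" prior. All names and numbers are hypothetical, and the intercept is omitted for brevity.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

def map_coefficients(rows, targets, prior_means, prior_sds, noise_sd):
    """MAP estimate for a linear model with independent Gaussian priors."""
    n = len(prior_means)
    lhs = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    # Likelihood contribution: X^T X / sigma^2 and X^T y / sigma^2.
    for row, y in zip(rows, targets):
        for a in range(n):
            rhs[a] += row[a] * y / noise_sd ** 2
            for b in range(n):
                lhs[a][b] += row[a] * row[b] / noise_sd ** 2
    # Prior contribution: precision on the diagonal, precision * mean on the right.
    for a in range(n):
        lhs[a][a] += 1.0 / prior_sds[a] ** 2
        rhs[a] += prior_means[a] / prior_sds[a] ** 2
    return solve_linear_system(lhs, rhs)

# Made-up data that is fit exactly by coefficients [1.0, -0.5].
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, -0.5, 0.5]

weak = map_coefficients(rows, targets, [0.0, 0.0], [1e3, 1e3], noise_sd=0.1)
tight = map_coefficients(rows, targets, [2.0, -1.0], [1e-3, 1e-3], noise_sd=0.1)
```

With very wide priors the estimate follows the data, and with very narrow priors it stays near the prior means; realistic priors sit between those extremes.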

Feature Selection

If we start by thinking of \(n\) features, should our final model have exactly the \(n\) features we thought of? Not necessarily. Some features might not be predictive, and having too many features might lead to overfitting. How can we figure out which features to preserve in the model? For a given set of features, we used \(k\)-fold cross-validation (with \(k \sim 10\)) to estimate the mean and variance of the mean-squared error on data that the model was not trained on. A set of \(n\) features has \(2^n - 1\) non-empty subsets, so if \(n\) isn’t too large it’s possible to evaluate the model on all subsets. For instance, if there are \(n = 10\) features, then it’s tractable to evaluate the model’s performance on all 1023 possible groups of features (although perhaps not advisable because of the hazards of multiple comparisons). If the feature set contains 20-30 features or more, brute-force checking of all possible subsets is no longer tractable, and we used greedy forward feature selection or greedy backward feature selection to find reasonable, if not globally optimal, subsets of features.³
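The two search strategies can be sketched as follows. This is a minimal illustration rather than our production code: `cv_error` is assumed to be any function that returns the cross-validated mean-squared error for a given list of features, and `toy_cv_error` and the feature names are hypothetical stand-ins.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of the feature list (2**n - 1 of them)."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield list(subset)

def greedy_forward_selection(features, cv_error):
    """Add one feature at a time, keeping the addition that most lowers cv_error.

    cv_error: callable taking a list of feature names and returning the
              cross-validated mean-squared error for a model using them.
    Stops when no remaining feature improves the error.
    """
    selected = []
    best_error = float("inf")
    remaining = list(features)
    while remaining:
        trial_errors = {f: cv_error(selected + [f]) for f in remaining}
        candidate = min(trial_errors, key=trial_errors.get)
        if trial_errors[candidate] >= best_error:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_error = trial_errors[candidate]
    return selected, best_error

# Hypothetical stand-in for a real cross-validation routine: two features help,
# and every extra feature adds a small overfitting penalty.
def toy_cv_error(subset):
    gain = {"client_mix": 0.5, "inventory_quality": 0.3}
    return 1.0 - sum(gain.get(f, 0.0) for f in subset) + 0.01 * len(subset)

names = ["client_mix", "inventory_quality", "weather", "day_of_week"]
chosen, error = greedy_forward_selection(names, toy_cv_error)
subset_count = sum(1 for _ in all_subsets(names))
```

Greedy forward selection evaluates at most on the order of \(n^2\) candidate models instead of \(2^n - 1\), at the cost of possibly missing a subset whose features are only useful together.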

“Textification”

In the end, we have a linear model with sensible coefficients that has good accuracy when applied to data it was not trained on. So what do we do with this model? The final step was to “textify” the output of the model.⁴

This part was fun. We wrote a Python wrapper for the model that finds the \(\delta\)s associated with each feature, and writes an English sentence for each one, presenting the \(\delta\)s in a useful order (say, largest \(|\delta|\) to smallest). The textification function compares the total \(\Delta M\) to the model’s predicted \(\Delta M\) and tells us the internal dynamics that led to the model’s \(\Delta M\). For instance, it writes sentences like, “Last week, metric \(M\) increased by \(\Delta M\) from the previous week. The model predicted an increase of \(\Delta M^{model}\), which is explained by the following changes in features: …”
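A stripped-down sketch of such a textification function might look like the following. It is not our actual wrapper: the function name, the sentence templates, and the example numbers are hypothetical, and it assumes the model's predicted change is the sum of the per-feature \(\delta\)s, which holds exactly for a linear model.

```python
def textify(metric_name, actual_change, deltas):
    """Render per-feature deltas as English sentences, largest |delta| first.

    deltas: dict mapping an English feature description to its delta.
    """
    def direction(value):
        return "increased" if value >= 0 else "decreased"

    def kind(value):
        return "an increase" if value >= 0 else "a decrease"

    # Assumption: the model's predicted change is the sum of the deltas.
    model_change = sum(deltas.values())
    lines = [
        f"Last week, metric {metric_name} {direction(actual_change)} by "
        f"{abs(actual_change):.3f} from the previous week.",
        f"The model predicted {kind(model_change)} of {abs(model_change):.3f}, "
        "which is explained by the following changes in features:",
    ]
    ranked = sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)
    for description, delta in ranked:
        lines.append(f"  {description}: {delta:+.3f}")
    return "\n".join(lines)

report = textify(
    "M",
    0.012,
    {"fraction of new clients": -0.004,
     "inventory quality": 0.011,
     "client mix": 0.003},
)
```

Printing `report` gives a short paragraph followed by one line per feature, ordered from the largest contribution in absolute value to the smallest.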

If the model’s \(\Delta M^{model}\) is close to the actual change in \(M\), this gives us some confidence that the model’s attribution of the change to individual features is accurate. The textified model output is displayed on a dashboard where people in the company can easily access it.

Learning from “The Explainer”

Sometimes we see interesting dynamics that we might have entirely missed without the model. One such example is that sometimes the metric’s value is nearly constant from one week to the next but there are large feature-related \(\delta\)s that happen to almost entirely cancel out. For instance, maybe the quality of inventory is much higher one week than the previous, which would tend to predict a higher value of \(M\), but we also had a lot more new clients in the second week, which might tend to predict a lower value of \(M\). Absent our Explainer model, we might have simply glanced at the overall values of \(M\) both weeks and said, “Looks like nothing much changed…”, and missed that there were actually large changes that happened to cancel each other out.

Another key benefit of the Explainer model is that it keeps us honest and empirically-grounded when the metric moves upward more than we might naively have expected. Maybe the key drivers of a positive fluctuation are things we can put resources behind and reproduce, but maybe they’re chance alignments of factors that aren’t directly under our control. The Explainer helps us to understand which is the case, and to avoid facile or self-congratulatory explanations for good news.

Although the Explainer model often presents useful insights and explanations about why the business metric has varied, there are occasionally times when the Explainer’s prediction for \(M\) doesn’t closely match the actual value of \(M\). When this happens, it means that unmodeled factors have played a large role in determining the outcome variable. In other words, these situations present opportunities for us to learn more about what drives our business, and ultimately to refine our model and get smarter.

Statistical models such as those described here do a better job of capturing the relationships that exist between business metrics than people can do by simply telling sensible-sounding narratives. But there is still a core issue of differentiating correlation from causation in the business world. Our automated Explainer model helps keep us honest about what statistical relationships exist in our data, but we haven’t automated away the need for human modelers who build and interpret the models, using the careful reasoning and skeptical perspective of a scientist. Much like the styling part of our business, our business-metric-explaining efforts are ultimately a blend of algorithmic processing and human judgment.

By facilitating a better understanding of what drives variations in business metrics, an interpretable model and an algorithmic textification of the model’s output can help allocate finite resources to improve business performance and client delight. If you’d like to be a part of a company that approaches understanding our business this way, please get in touch!

¹ We borrow the phrase “Narrative Fallacy” from Nassim Taleb’s The Black Swan.

² MAP values do have the downside that they aren’t invariant under reparameterization, but it’s a lot easier to find the MAP value than a reparameterization-invariant quantity like the median in a high-dimensional space, and in practice the difference between the two isn’t large.

³ Though PCA is often used for reducing dimensionality, it didn’t seem ideal for our purposes because if we used PCA we’d end up with explanations like “\(M\) went up by 7% because principal component 1 went down by 3%”, which isn’t very interpretable. Nevertheless, the same interpretability procedure that renders random forest regression interpretable could also make PCA regression interpretable.

⁴ Companies such as Narrative Science have taken computer-generated prose to a high level. Our goal here was not top-notch writing, but rather statistically significant explanations for business-metric values, rendered in sentence form.