# Learning from the experience of others with mixed effects models

###### July 14, 2015 - San Francisco, CA

At Stitch Fix we have many problems that boil down to finding the best match between items in two sets. Our recommendation algorithms match inventory to clients with the help of the expert human judgment of our stylists. We also match these stylists to clients. This blog post is about the remarkably useful application of some classical statistical models to these and similar problems that feature repeated measurements.

## Measuring the success of an interaction

A general strategy for matching can get complicated. In this post, we’ll focus on the greatly simplified but related problem of predicting the quality of a proposed match based on past observations. Specifically, we’ll consider the problem of predicting how well stylist $$s$$ will do with a client $$c$$. There is very little in what follows that is specific to this problem - in other domains similar models could be used to explore, for example, interactions between visitors and ads or users and experimental treatments. A similar approach could also be used to match types of clients with types of stylists.

Again for simplicity, let’s suppose that we don’t know anything about our stylists or clients except what we’ve observed in the past. Of course, if we have extra information then there’s a lot more that we can do. Indeed, there are many interesting approaches to pairing clients and stylists that use richer data, but this will have to wait for a future post.

To start, pretend that we have observed many triples representing past pairings of clients and stylists,

\begin{equation}(s_i, c_i, y_i) \qquad 1 \leq i \leq N\end{equation}

where $$s_i$$ is the index of a stylist, $$c_i$$ is an index of a client, and $$y_i$$ is the outcome of pairing stylist $$s_i$$ and client $$c_i$$. The outcome $$y_i$$ could be many things, but let’s assume it is a simple binary indicator of success.

What we’d like to predict is whether stylist $$s$$ will be successful with client $$c$$. To do this, we can model the probability of success

\begin{equation}p_i = \mathbb{P}(y_i = 1)\end{equation}

Mixed effects models, a classical statistical tool, are very useful for this problem. And whether placing ads or pairing stylists, this problem is remarkably relevant to many data scientists today.

## Mixed effects models

One of the earliest and most famous applications of statistics was Ronald Fisher’s study of agricultural experiments. He developed tools like the analysis of variance, now a work-horse of applied science, to tease apart the experimental treatments that were most effective.

We need to do something similar. As we gather observations we want to learn about the empirical success of stylist and client pairings. But we need to be wary of small sample sizes - if a client has been paired with the same stylist twice and was happy each time this is evidence of a good match. But are two data points evidence of a perfect match? Probably not. It seems likely to have happened, at least in part, due to chance.

Mixed effects models are an extension to generalized linear models that offer an attractive, direct approach to learning about the interaction of specific clients and stylists while balancing the uncertainty inherent in our predictions. They use an empirical Bayes approach to shrinkage in the spirit of what Brad Efron eloquently describes as “learning from the experience of others”.

We’ll soon see how mixed effects models are useful for our problem. But first, some background.

## The general model

Since we’re predicting a binary outcome, logistic regression offers a convenient modeling framework. Traditional logistic regression seeks to find a set of coefficients, $$\beta$$, that predict a response $$Y$$ with a design matrix $$X$$,

\begin{equation}\log \frac{p}{1-p} = \alpha + X\beta\end{equation} where \begin{equation}\mathbb{P}(Y=1) = p.\end{equation}

The frequentist perspective does not consider $$\beta$$ to be random - it is considered an unknown but fixed effect. A random effects model treats the coefficients (we will call them $$b$$ when they are random) as a random vector generated from a prior distribution $$\mathcal{F}$$,

\begin{equation}\log \frac{p}{1-p} = \alpha + Zb, \qquad b \sim \mathcal{F}\end{equation}

Here $$Z$$ is a fixed design matrix for the random coefficients. We do not assume that we know the prior distribution, but in many applications it is assumed to be a normal distribution with mean zero and unknown variance because of the convenience afforded by this choice,

\begin{equation}b \sim N(0, \Sigma).\end{equation}

This randomness means that the effects are generated from a common population distribution. The parameters of this distribution represent, for example, how much variation there is in the effects across the client population. By assuming that we know the prior distribution we can weigh the experience of an individual against what we know about the general population. This is a common feature of Bayesian models.

A mixed effects model is one where some coefficients are fixed, and some are random,

\begin{equation}\label{full} \log \frac{p}{1-p} = \alpha + X\beta + Zb, \qquad b \sim N(0, \Sigma)\end{equation}

This is often called a GLMM for generalized linear mixed model. The Bayesian reader may suggest that it is sensible to treat all coefficients as random, all of the time. And indeed, this is would lead to a very useful and compellingly consistent Bayesian view of regression. The advantages of the empirical Bayes approach used in mixed effects modeling, requiring no sampling or manual selection of hyperparameters, will be made clear below.

## What about matching? Designs with indicators

The mixed effects model in \eqref{full} is surprisingly general. To better understand it, it is useful to return to our simple example. Recall \begin{equation}p_i = \mathbb{P}(y_i = 1).\end{equation}

We can then consider a simple logistic model for $$y_i$$, \begin{equation}\log \frac{p_i}{1-p_i} = \underbrace{\alpha}_{\text{global constant}} + \underbrace{\beta_{c_i}}_{\text{client effect}} + \underbrace{\gamma_{s_j}}_{\text{stylist effect}}\end{equation}

This model has three terms

• $$\alpha$$, a global offset

• $$\beta_k \sim N(0, \sigma_\beta^2)$$, a random effect for each client

• and $$\gamma_m \sim N(0, \sigma_\gamma^2)$$, a random effect for each stylist

This model is thus flexible enough to capture that some stylists, and likewise some clients, will be inherently better” than others. The prior distributions are zero-centered because the effects are parameterized as deviations from the global constant. Note this is a special case of \eqref{full} where the design matrices are indicator variables and the prior covariance matrix is diagonal. As we will see in the next section, the shrinkage that results from assuming that the effects are random is important to avoid overfitting a model with such a large number of indicator variables.

We can make the model even more interesting, and more useful for matching, by adding another term. For each pair of client and stylist we can add an interaction term $$\delta_{c_i,s_i}$$,

\begin{equation}\label{intercept} \log \frac{p_i}{1-p_i} = \alpha + \beta_{c_i} + \gamma_{s_i} + \underbrace{\delta_{c_i,s_i}}_{\text{interaction effect}}.\end{equation}

Once again, the interactions are parameterized as a deviation from both the global offset $$\alpha$$, but also from the client and stylist specific effects $$\beta$$ and $$\gamma$$. As we observe the interaction between clients and stylists we can attribute the success or failure of a match to the latent success of the stylist, the client and, finally, to their specific interaction. Having the main effects for clients and stylists “controls” for marginal differences across their populations and allows us to separate the interaction: the remaining affinity that’s not explained by just being a “good” or “bad”“client (or stylist).

The prior variances in this model are often interesting in their own right. For example, if $$\sigma_\gamma^2$$ was much larger than $$\sigma_\beta^2$$ this would tell us that there was more variation across stylists than there was across clients. In the same way, if the prior variance for the interaction term, $$\sigma_\delta^2$$, was very small or zero this would mean that there was not an important interaction between the two variables, or at least not one that was important for predicting $$y_i$$. Being able to decompose the sources of variance in a model is often helpful for understanding what’s really going on.

## Mixed effects with known variances: shrinkage

To gain intuition it is helpful to pretend momentarily that we know the parameters of the prior distribution. If we did, we could predict the effects using MAP estimation - roughly akin to maximum likelihood with extra terms for the priors. To fit this model, we would need to maximize the joint log-likelihood. In our case, the log-likelihood has two parts: (1) the likelihood from the assumed logistic model, and (2) the log-likelihood from the prior distribution of the random effects.

Since it is messy and not particularly important here, we can refer to the logistic component of the log-likelihood as $$\mathbf{l}$$ and write ($$-2$$ times) the full log-likelihood as

\begin{equation}\label{loglik} -2\log L \propto \underbrace{\sum_i \mathbf{l}(\alpha, \beta_{c_i}, \gamma_{s_i}, \delta_{c_i,s_i})}_{\text{logistic component}} + \underbrace{\frac{1}{\sigma_\beta^2}\sum_k \beta_k^2}_{\text{client prior}} + \underbrace{\frac{1}{\sigma_\gamma^2}\sum_m \gamma_m^2}_{\text{stylist prior}} + \underbrace{\frac{1}{\sigma_\delta^2}\sum_{k, m} \delta_{k,m}^2}_{\text{interaction prior}}\end{equation}

There are several important things to notice about the log-likelihood in \eqref{loglik}. The first is that when we write down the log-likelihood, it turns out to be a simple $$\ell_2$$ regularization problem, like ridge regression with the flexibility of different penalties on different coefficients. In this case we chose penalties based on the prior variances from our model, but we could easily have arrived at the same optimization problem by adding quadratic penalties like

\begin{equation}\lambda_\beta \sum_k \beta_k^2\end{equation}

to the logistic log-likelihood. This optimization problem is convex, and there are a number of practical ways to solve it even for very large datasets, and even when combined with other types of regularization like $$\ell_1$$ penalties. So if we knew the variances this would be an easy problem.

### Structured shrinkage and partial pooling

The second thing to notice about \eqref{loglik} is that were shrinking the random effects to zero by regularizing them. This is a key feature of the model. By shrinking them, we can automatically and “correctly” adjust the estimated effects for sample sizes to avoid the extreme overfitting that would likely result from including so many indicator variables in the model.

In addition, having the effects parameterized as zero-centered deltas gives the shrinkage a hierarchical structure. The effects for specific clients are pulled toward zero. This means that for a new client we will start by assuming that they have no delta from the offset - in other words, that they are average.

Something similar is true for interactions. We often tend to have more data on a client than we do on their interaction with any particular stylist. This makes it easier to estimate the effect associated with a client than to estimate their interaction effects. Specific interaction effects are pulled toward zero, but the model still includes the offset and the combined marginal client and stylist effects. So even if we cant detect an interaction we can fall back on the main effects.

We can think about the regularization in \eqref{loglik} as striking a balance between two models of very different complexities. The first is a constant model where there are no differences across clients or stylists or interactions,

\begin{equation}\log \frac{p_i}{1-p_i} = \alpha.\end{equation}

This is clearly uninteresting. At the other end of the spectrum, we could have complex model with completely unregularized estimates for every client-stylist pair,

\begin{equation}\log \frac{p_i}{1-p_i} = \delta_{s_i,c_i}.\end{equation}

This model might work if we had unlimited data, but in practice it will likely overfit for interactions for pairs of clients and stylists where the sample size is small. The regularized model in \eqref{loglik} is somewhere in between these two extremes. The model has the flexibility to capture individual effects and interactions as we gather sufficient evidence but shrinks toward the experience of others. This remarkable property, where effects are pulled toward the overall or “pooled” estimates, is sometimes called “partial pooling.”

## Choosing the variances

In practice, the question is often how much regularization to apply. As noted above, we could rewrite \eqref{loglik} using regularization terms like

\begin{equation}\lambda_\beta \sum_k \beta_k^2\end{equation}

but this would mean needing to choose $$\lambda_\beta$$. This could be done using cross-validation and searching over the space of possible parameters. However this can quickly get out of hand. At another extreme, you could also take a full-blown Bayesian approach and sample the posterior after having chosen a suitable prior.

Happily, a mixed effects model will answer the question of regularization for you. Indeed, often the whole point of fitting the model is to learn the variances. This is done by treating the variances as unknown parameters and estimating them by a form of maximum likelihood estimation. This is an empirical Bayes approach: using a model with a prior, but then estimating the hyperparameters from the data.

The resulting optimization problem is complicated, but excellent software, such as the lme4 package, exists in R for fitting problems of moderate size. lme4 uses a very convenient formula notation to specify models that will be familiar to users of lm and glm in R. For example, we could fit the model from \eqref{intercept} using

m <- glmer(y ~ (1|client) + (1|stylist) + (1|client:stylist), family=binomial, data=df)


The notation (1|client) specifies a random effect for each level of client drawn from a zero-centered normal prior distribution. The interested reader should refer to introductions to lme4 (found here, here and here, for example) which are beyond the scope of this post.

It should be noted that it can be challenging to scale models to larger datasets with this software. There are alternative approaches to fitting these models but that must wait as the topic of a future post.

## Summary

Mixed effects models are an extremely useful extension of generalized linear models in applied statistics. We’ve really only scratched the surface. The formulation in \eqref{full} can include many richer models with random slopes, non-indicator features and basis expansions. All the usual tricks from linear modeling can be applied.

If your problem relates to repeated measurements of clients, items, or anything in general then you should consider mixed models as a practical approach to pooling information and learning from the experience of others. Happy shrinking! 