When Bits go Bump in the Night

Pete Dudek
- Pittsburgh, PA

I think it’s safe to say that on-call week is never something we software engineers eagerly anticipate. I know from past experience that my stress levels tick up, my personal productivity expectations decline, and my eyes stay trained on my email, hoping the dreaded PagerDuty alert will not jump from the shadows. But I have to ask myself: what’s driving these negative emotions? And more importantly, are they pointing to improvements we as software engineers should implement to make on-call duties less of a burden? At Stitch Fix, we’ve acknowledged these concerns and recently implemented strategies to help make on-call duties less stressful and more productive.

Background

I’m part of the warehouse operations engineering team at Stitch Fix. We currently have four warehouses, encompassing all mainland US time zones. All of our warehouse systems, from inventory tag printing and shipment picking to quality control and shipping label printing, were built and are maintained in-house, although they do rely on our platform provider and some third-party services for things like shipping label generation. So when something goes wrong at any of these points of failure, it is the responsibility of the on-call engineer to acknowledge, investigate, escalate if needed, and ultimately solve or help solve the problem quickly. Speed is the key word here, because when operations at the warehouse can’t function due to system failure or third-party platform issues, the business, and our clients, will suffer.

Problem: Readiness without Control

Nothing to be anxious about, right? Well, yes and no. Anxiety often comes with a heightened sense of attentiveness. I used to be terrified of flying. So when in the air, I was at the ready, but it was misguided because I had no power to keep the plane from falling out of the sky. This led to anxiety, because I wanted to be able to control the situation, and did not trust the pilot, machine, or physics to keep me alive. My readiness was useless. But when you are in a situation where you can affect the outcome, whether it be a sporting event, driving in heavy traffic, or keeping your child from falling off the monkey bars, you do have control. In those cases, heightened attentiveness is good. It should not cause stress, unless, that is, you start feeling a loss of control.

Solution: Empowerment

When on call, you are the first line of defense for most problems that come up. If you don’t have a clear strategy on how to manage this duty, you’re going to get stressed out. We’ve built an engineering wiki that provides solutions or a playbook for most problems that emerge. This is a huge help since it gives the on-call engineer the power to solve the problem. But when the wiki doesn’t offer a solution, it doesn’t mean we disappear into a hole and try to figure out the problem in isolation while operations suffer. We have a great team full of experienced people, so if it’s not a fix we immediately know the answer to, there’s no need to clutch the armrest and have a panic attack. The key is to communicate to the business partners that engineering is aware of the problem and investigating, then pull in another engineer or two to help evaluate. Or, if the problem is with a third-party service, such as shipping label generation or the platform provider, we open a ticket with them. These clear guidelines allow us to achieve readiness without anxiety.
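
The guidelines above amount to a small decision procedure. As a rough sketch only (the `Incident` fields, the single `PLAYBOOK` entry, and the action strings are all hypothetical illustrations, not taken from our real wiki or tooling), it might look something like this:

```python
from dataclasses import dataclass

# Hypothetical playbook: maps a known problem to its documented fix.
PLAYBOOK = {
    "label_printer_offline": "Restart the print service (see wiki).",
}


@dataclass
class Incident:
    problem: str
    third_party: bool = False  # True if the fault lies with an outside service


def next_steps(incident: Incident) -> list[str]:
    """Return the ordered actions for the on-call engineer."""
    fix = PLAYBOOK.get(incident.problem)
    if fix is not None:
        # The wiki already covers this problem, so the engineer can act alone.
        return [f"Apply playbook fix: {fix}"]

    # No documented fix: communicate first, then get help.
    steps = ["Tell business partners engineering is aware and investigating"]
    if incident.third_party:
        steps.append("Open a ticket with the third-party provider")
    else:
        steps.append("Pull in another engineer or two to help evaluate")
    return steps
```

The point of the sketch is that every branch ends in a concrete action, so the on-call engineer is never left without a next move.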

Problem: Time Management

Not all on-call activities are fire drills. The vast majority of support tickets will be lower priority. They usually come in two varieties: system alerts and human-generated tickets. With system alerts, the key is to first evaluate the severity of the alert. We emphasize logging at Stitch Fix, and because of that, it’s fairly easy to determine whether a notice or alert indicates a deeper problem via tools like New Relic, Librato, and Papertrail. Human-generated tickets are typically more varied, but most of the time, the turnaround doesn’t need to be immediate, although initial responses should be swift and offer some sense of an ETA.
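
To make the two varieties concrete, here is a minimal triage sketch. The `Ticket` type, the `severe` flag (imagined as set by hand after checking logs and dashboards), and the returned action strings are assumptions made for illustration, not a description of our actual ticketing setup:

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SYSTEM = "system"  # automated alert
    HUMAN = "human"    # ticket filed by a person


@dataclass
class Ticket:
    source: Source
    summary: str
    severe: bool = False  # hypothetical flag, set after reviewing logs


def triage(ticket: Ticket) -> str:
    """Suggest a first action for a lower-priority ticket."""
    if ticket.source is Source.SYSTEM:
        # System alerts: evaluate severity before anything else.
        if ticket.severe:
            return "investigate now"
        return "queue as lower priority"
    # Human-generated tickets: respond quickly, resolve on a stated timeline.
    return "reply promptly with an ETA, then schedule the investigation"
```

For example, `triage(Ticket(Source.HUMAN, "Tag reprint request"))` suggests replying with an ETA rather than dropping everything.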

These lower-priority tickets can cause anxiety in a different way than the urgency of a system or service failure. Sometimes, particularly with the human-generated tickets, more than a few minutes of investigation is required. If there are enough of these tickets, this takes time, precious time that most engineers would rather spend on roadmap projects. So when on-call week comes around, it can be frustrating knowing that, depending on the volume of tickets, your roadmap productivity is going to take a hit.

Solution: Focus on Single Task

The warehouse operations engineering team acknowledged that this is a real problem, but only if the expectation is that engineers keep up with roadmap work while also providing prompt responses and solutions to tickets. We solved this problem through a joint decision between engineering and the warehouse operations business partners that on-call engineers will focus solely on on-call activities.

We’ve found this to be a very successful strategy. Although it takes some additional planning and coordination if a project has an expected completion date, it eliminates the concern around diminished roadmap productivity. It also improves both the response time and the investigative thoroughness we can offer each ticket. This investigation process helps us achieve a more holistic understanding of our systems, and further equips us to find, and sometimes fix, root causes of issues during on-call week. For those issues we don’t have time to fix, we can clearly pass details to the next on-call engineer to implement. Over time, more bugs will be fixed this way than if roadmap work made thorough detective work impossible.

Accepting Reality

We are scaling up fast. And while we are solving interesting problems, and helping our business succeed through the software solutions we provide, it would be naive to think that we could achieve a ticketless world devoid of any system or third-party service problems. That’s why we’ve molded our on-call process to offer our engineers the best possible environment to troubleshoot issues. We have a thorough wiki to help us troubleshoot. We realize that being on call doesn’t mean we can’t pull in other minds when the need arises. And it is expected that our on-call time will be spent solving problems and managing tickets promptly, while at the same time shoring up our software to make life that much easier for the next on-call engineer and our business partners.
