Aggressively Helpful Platform Teams

Elijah ben Izzy
- San Francisco, California

Aggressively Helpful

On Thursday afternoons, the Algorithms team has an all-hands in which we rotate through a set of speakers. Over the course of a few months, everyone in the department gets a chance to share something with the team in a quick presentation. The format is open-ended; my colleagues discuss their most recent projects, deliver miniature mathematics lectures, introduce themselves, or present on something they just learned.

During a recent iteration, my team got a remarkable callout. Chris, a data scientist, described us as aggressively helpful. While that epithet may have landed ambiguously with the rest of the department (can “aggressive” ever be a good thing, even as an adverb?), we considered it the highest of honors.

In this blog post, I will cover what it means to be aggressively helpful, convince you that earning such a moniker is essential for a disruptive platform team, and show you how to be aggressively helpful as well (with some actual code!).

Before we get started, let me give you a quick overview of how my team fits into the broader organization. Within the Algorithms department[1], the Data Platform team is tasked with empowering data scientists to rapidly iterate on and productionize their algorithmic implementations. My team, the Model Lifecycle team, is a subset of Data Platform. We are responsible for building out Stitch Fix’s next generation machine learning platform in an effort to streamline the process of getting models into production. When I use the term “platform team” in this blog post, I’ll largely be talking about teams that build platforms for data scientists, through the lens of my team’s experience. The take-homes, however, should apply to any team that builds software for internal use.

Platform Teams

The Field of Dreams approach to running a platform team is destined to fail.

[Image: Field of Dreams, Universal Pictures, 1989]

Every engineer dreams of building a shiny new product — a system that will solve some hitherto intractable problem, meet every one of their customers’[2] requirements and gain instant adoption, carrying the company to new productive heights. Fortunately, realizing this dream is an essential part of the job. Unfortunately, that is the easy part. Being really, really good at building it is only a piece of the puzzle. To understand why, let’s dig into the (heavily simplified) technical aspects of a data scientist’s job at Stitch Fix:

  1. Wrestle complex, scalable infrastructure into submission
  2. Discover and implement cutting-edge algorithmic techniques to drive value

At Stitch Fix, data scientists are empowered to operate on the full stack. Their primary contributions to the organization, however, are the cutting-edge algorithms they develop. Thus, in order to be productive, they quickly establish a series of infrastructure patterns, using a mixture of tried-and-true platform tools and team-level inherited knowledge. The discovery and refinement of such practices is an essential component of a data scientist’s rite of passage. Each scientist develops a set of nuanced approaches, the goal of which is to allow for maximum innovation and creativity in the second part of their role—building algorithms.

Enter the Data Platform team. Our aim is to improve the data scientist’s workflow—to re-envision part 1 above so they can spend as much time and creative energy as possible on part 2. To do this, we have to have a vision that cuts across different teams and workflows and strives towards a globally optimal product. This is the “if you build it” part of our job.

For better or worse, this is where the work gets tough. As data scientists have already developed a set of practices to manage infrastructure, it is not enough to build a great new product. A good platform team has to ensure people actually use their product. For a data scientist, switching over can be an uphill battle. Consider the following barriers to adoption:

  • New products are not perfect: they are often full of bugs and stale assumptions that can only be ironed out through real-life use-cases and product growth.
  • Every minute a data scientist spends switching over to or learning a new technology is a minute they spend not developing an algorithm.
  • Confidence in a platform team is not a given; it has to be earned. This is especially true for teams with disruptive and avant-garde technology.

While the value of using the latest and greatest is likely obvious to the platform team, the calculus is not the same for a data scientist. Thus, we have to sweeten the deal—provide the activation energy necessary to help them improve their workflow. This is what being aggressively helpful is: doing virtually everything in your power to ensure that none of the costs above are prohibitive to adoption. As a platform team, you can’t always make a perfect product, you have no control over where data scientists spend their time, and you cannot instantly bolster their confidence in your platform. You can, however, go above and beyond in supporting your partners’ use of your new product. The rest of this blog post will discuss the strategies our team employs to do just that and present a few of the most important tools in our toolbox, including a rough set of technical instructions in case you want to follow suit.

Before we dive in, however, let us note that the following tactics assume a base-level of good faith and due diligence in product management. While I glossed over the building it part of the job, it is essential that a platform team does not peddle products that don’t actively help data scientists. Snake oil, vaporware, and poor market-fit can be a massive detriment to an organization. While you may debate the merits of being aggressively helpful, we can all agree that being aggressively unhelpful is a guaranteed way to repel your partners and damage an organization.

Reliable Documentation

Let’s start off with something basic. Before a data scientist tries out your software, they will likely read over the documentation. For a platform product, good documentation has two properties:

  1. It presents a compelling narrative for product adoption.
  2. It accurately and comprehensively displays your product’s capabilities.

Effective documentation will do both of these simultaneously, telling your potential users both why and how to get started with your product. The examples it contains tell a story—at each point addressing a use-case that will draw your users in with the allure of an elegant, powerful API. Unfortunately, internal documentation is often organically grown, and riddled with snippets of code that are not relevant or no longer work (and perhaps never did). To up the stakes, quality documentation is critical for widespread adoption; customers regularly make decisions on whether or not to use what you’ve built before even trying out the code. Documentation is the foyer for your product, and it should be warm and welcoming if you want your users to make themselves at home.

Okay, enough with the pontification. Why have I brought this up? Well, we found that if you combine the above (crafting a narrative with provably correct examples), the documentation becomes far more inviting. To do this, we built a tool that combines system-level testing and documentation in one. Our goal was to ensure that the contract we presented to data scientists was obeyed in full by our system.

We use continuous integration to generate documentation with a set of examples and execute those examples on every pull request. These examples manifest as python functions decorated with an @example decorator, which takes, as arguments, a unique key and a set of assertions on the output (stdout) of these functions. Using python’s inspect functionality as well as the pandoc/panflute libraries, we grab the code from the decorated function and insert it into specially marked code-blocks (referencing the unique key) in the documentation’s Markdown files, which are organized using Sphinx. The modified documentation is then committed to a static file repository on S3, which routes to an internal page using a hosting tool built by a partner platform team. The net effect of this architecture is a set of use-case-specifying unit tests with simple and readable results. To clarify, this means that every snippet of example code in our documentation is guaranteed to work correctly.
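
To make this concrete, here is a minimal sketch of the idea in python—not our actual implementation, which leans on pandoc/panflute and Sphinx. The marker syntax, helper names, and assertion style below are assumptions for illustration only.

import contextlib
import inspect
import io
import re
from typing import Callable, Dict, List, Tuple

# Registry of examples: unique key -> (example function, stdout assertions)
_EXAMPLES: Dict[str, Tuple[Callable, List[Callable[[str], bool]]]] = {}


def example(key: str, *stdout_assertions: Callable[[str], bool]):
    """Registers an example under a unique key with assertions on its stdout."""
    def decorator(fn: Callable) -> Callable:
        _EXAMPLES[key] = (fn, list(stdout_assertions))
        return fn
    return decorator


def run_examples():
    """Executes every registered example, asserting on its captured stdout."""
    for key, (fn, assertions) in _EXAMPLES.items():
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            fn()
        output = buffer.getvalue()
        for assertion in assertions:
            assert assertion(output), f'Example {key!r} produced unexpected output: {output!r}'


def inject_into_markdown(markdown: str) -> str:
    """Replaces `<!-- example: key -->` markers with the example's source code."""
    def replace(match):
        fn, _ = _EXAMPLES[match.group(1)]
        source = inspect.getsource(fn)  # in practice you'd strip the decorator line
        return f'```python\n{source}```'
    return re.sub(r'<!-- example: (\S+) -->', replace, markdown)


@example('hello-world', lambda out: 'hello' in out)
def hello_world_example():
    print('hello, world')


if __name__ == '__main__':
    run_examples()  # fails the CI build if any documented example is broken
    print(inject_into_markdown('Getting started:\n<!-- example: hello-world -->'))

Running something like this in CI on every pull request ties the fate of the documentation to the fate of the code: if an example stops working, the build breaks before the docs do.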

The architecture described above was built out of expedience; we often found that our documentation did not match reality, and that was turning away new users. There are a dozen reasonable ways to instill a contract-driven narrative in your documentation. Each one can be right for its own circumstances. The value of building such a system, however, is immense. Having thorough, correct documentation is critical in getting data scientists (and customers in general) to adopt what you build. Furthermore, documentation should be a badge of honor—a cohesive representation of your work that lives on your team’s resume and presents your best self. Sleeping becomes much easier when you know that all your examples will work when a data scientist executes them.

Proactive Monitoring

My grandfather had a plaque on his desk that read: “To err is human, to really screw things up requires a computer.” I interpreted that as “all code will break,” a mantra that my younger self would soon grow to appreciate. As a platform team, you are likely responsible for the integrity of a significant swath of horizontal infrastructure. In our case, we develop and maintain a suite of high-level tools that abstract away the inference and training components of the data scientist’s machine learning workflow. These include:

  • Services that execute their models
  • Deployment jobs to update those services
  • Batch jobs to run their models over large datasets
  • ETLs to extract features for training their models

As the nonagenarian’s prescient plaque foresaw, all of these fail with regularity. Left unchecked, these failures can be catastrophic. Not only will they cause data scientists to breach their SLAs, but they will also carry the pernicious secondary effect of reducing overall confidence in the systems you, as a platform team, have built. All this is to say that your code will break, and you will be responsible.

Your work, however, is not a lost cause. Each of these errors is an opportunity to demonstrate how strong of a partner your team can be. Specifically, we make an effort to alert data scientists of their[3] failure (perhaps with a recommended fix) before they even know about it. In essence, every failure is a race condition between your ability to demonstrate confidence in your product and your customer’s natural inclination to blame the new technology. This is not an easy race to win, but doing so is essential to enabling product adoption. Luckily, with the power of Slack and low-latency queuing systems, this gets a lot easier!

To ensure we win this race condition, we send an alert to a central channel that the on-call member of the team is responsible for monitoring. As we present failures from a multitude of sources, the implementations are not all the same, but the basic structure is this:

try:
    do_some_high_level_ds_thing()  # deploy a service, execute a job, etc...
except Exception as e:
    # context carries the team, user, and any other relevant metadata
    notify_slack(context, e, '#team-alerting-channel')

We extend this pattern in a slightly different manner to the live services we manage, for which we rely on a combination of CloudWatch, PagerDuty, and Lightstep to do the above. Nonetheless, the result is a stream of actionable failure alerts that specify the team, user, and any other necessary metadata (in our case, a UUID to identify the related user-published model).
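
For reference, a notify_slack helper along these lines can be quite small. The sketch below assumes the slack_sdk package, a bot token in the environment, and illustrative context fields—it is not our actual implementation.

import os
import traceback

from slack_sdk import WebClient


def notify_slack(context: dict, error: Exception, channel: str) -> None:
    """Posts an actionable failure message (team, user, model UUID) to a Slack channel."""
    stack = ''.join(traceback.format_exception(type(error), error, error.__traceback__))
    message = (
        f":rotating_light: *{context.get('job_name', 'unknown job')}* failed\n"
        f"team: {context.get('team')} | user: {context.get('user')} | "
        f"model: {context.get('model_uuid')}\n"
        f"```{stack}```"
    )
    WebClient(token=os.environ['SLACK_BOT_TOKEN']).chat_postMessage(channel=channel, text=message)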

After we’re alerted of the failure, we claim it (by replying to the Slack post in a thread), and reach out to the data scientist in question:

Platform Engineer: Hey — looks like your batch deployment failed! It seems that the Spark executors didn't have enough memory. I've rerun/modified the job. Also, we're working on a feature in which we can set this in an automated fashion. See this GitHub issue for more details.

Data Scientist: Wow, I hadn't even noticed. How... creepy helpful of you! Thanks for fixing, LMK if anything else comes up.

What can I say except... "you're welcome" (in the form of a Maui from Moana "you're welcome" emoji)

Not only does being proactive about failures enable our customers to meet their SLAs and build confidence in the team, but it also gives us headroom to iterate faster. As long as we detect them before our customers do, we can break things as much as we want.

Note that we alert on errors regardless of where the fault lies (both 4xxs and 5xxs). Handling client-side errors allows us to be helpful and better understand the usability of the system. The more client-side errors we get, the more we understand the ways in which our API assumptions and our users’ mental models are misaligned. We regularly discuss and aggregate these as a team to improve the products we build.

Tracking Users

In addition to managing batch jobs and microservices, we maintain a python client that allows data scientists to work with our model storage and retrieval infrastructure. The client is the first way users interact with our platform, and that experience is often a make-or-break moment for decisions on adoption. Unfortunately, due to the fundamental nature of humans writing code, the client will break. Every data scientist approaches this differently. Some love fiddling with new tooling, while others abandon hope at the first sign of trouble. In order to improve the workflow for scientists everywhere on this spectrum, we track every invocation of any function in our client by anyone ever. This may seem heavy-handed, but in the world of big data, this data isn’t particularly big (around ten thousand invocations per day). It is far better to collect this data than to ignore it.

We implement this with a simple python decorator. Eliding a detail or two, it looks like this:

import functools
from typing import Callable


class track:
    def __init__(self, kafka_topic: str, allow_tracking_failures: bool = True):
        """Tracks usage of a function. Lowercase name as it's a decorator."""
        self.kafka_topic = kafka_topic
        self.allow_tracking_failures = allow_tracking_failures

    def __call__(self, fn: Callable):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            function_error = None
            result = None
            try:
                result = fn(*args, **kwargs)
            except Exception as original_error:
                # Catching all exceptions is generally a bad idea but
                # in this case we're going to be raising them later so it's OK
                function_error = original_error
            payload = {
                'args'     : self._serialize(args),
                'kwargs'   : self._serialize(kwargs),
                'result'   : self._serialize(result),
                'function' : fn.__name__,
                'success'  : function_error is None,
                'error'    : self._serialize(function_error),
                'context'  : self._derive_context()
            }
            try:
                self._log_to_kafka_asynchronously(payload, self.kafka_topic)
            except Exception as tracking_error:
                # Tracking should never break the decorated function unless
                # silent tracking failures have been explicitly disallowed
                if not self.allow_tracking_failures:
                    raise tracking_error
            if function_error is not None:
                raise function_error  # Maintains the function's original behavior
            return result
        return wrapper

This code is quite simple (sure, I left out our methodology for serialization as an exercise for the reader), but rather powerful. We can decorate any function that a data scientist might call, and automatically start gathering data. It has a negligible performance impact (the _serialize function abstracts a lot), and can be directed to fail silently, so the function’s callers never know it exists.
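
For a sense of the ergonomics, usage looks something like this—the function and topic names here are made up for illustration:

@track(kafka_topic='model-client-usage')
def publish_model(model, model_name: str, version: str):
    ...  # push the model artifact to storage, register its metadata, etc.

Callers of publish_model see no difference in behavior; the only change is that every invocation (and every failure) now lands in Kafka.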

Once we’ve logged the payloads to Kafka, we then sync them to an Elasticsearch cluster and drill into the data in Kibana. This data is invaluable—with it, we can easily answer the following questions:

  • Who are the new users of our client?
  • Have they succeeded in what they were trying to do?
  • Are they using our product in the way we intended?

And so on. This is especially helpful to support data scientists who might be hesitant to get started (facing the activation energy we described above). If we see a new user, we proactively reach out to them to gauge their initial impressions and offer our help. While this big-brother-esque approach can appear creepy, we have found that data scientists appreciate it unanimously. In our experience, it has been key in making our customers feel like they’re wandering into a friendly ecosystem.
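
To give a flavor of how we pull answers like these out of the tracking data, here is the kind of query one could run against the index. The host, index, and field names below are assumptions for illustration rather than our actual schema.

import requests

# Failed client invocations over the last day, grouped by user
query = {
    'query': {
        'bool': {
            'must': [
                {'term': {'success': False}},
                {'range': {'@timestamp': {'gte': 'now-1d'}}},
            ]
        }
    },
    'aggs': {'by_user': {'terms': {'field': 'context.user.keyword'}}},
    'size': 0,
}
response = requests.post(
    'http://elasticsearch.internal:9200/client-tracking/_search',  # hypothetical host and index
    json=query,
)
for bucket in response.json()['aggregations']['by_user']['buckets']:
    print(f"{bucket['key']}: {bucket['doc_count']} failed invocations in the last day")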

Note that the ultimate approach to this would be to combine function tracking with the downstream monitoring and alerting toolset we described above, so that we reach out to data scientists whenever any function they run within our python client fails. This is the gold standard, but, alas, we are not quite there yet.

Final Thoughts

The strategies we just discussed by no means form an exhaustive set of tactics to help data scientists get over the initial difficulty inherent in adopting a new product. Some other approaches we have found useful include:

  • Following up with customers on a regular basis to ensure that they are not blocked on anything
  • Pair programming (these days through screen sharing[4]) with new users who might have trouble getting started or are otherwise busy
  • Utilizing GitHub issue templates to guide the conversation when a user opens a bug report or submits a feature request
  • Tracking aggregate metrics to better understand our user base
  • Combing through data science Slack channels looking for use-cases our product can solve, or problems that our customers might not have brought to our attention
  • Showing up at data science team meetings to offer our support and present exciting new features

On a personal level, I’ve found this kind of user engagement to be incredibly rewarding. In the remote-first culture of the current workplace, there is nothing more de-isolating than developing strong relationships through copious quantities of unsolicited help and the mutually respectful exchange of feedback and product vision.

For your team, the benefit of this approach will extend far beyond getting a single data scientist onboarded. Here in the Algorithms department at Stitch Fix, people talk. They regularly present what they’re working on and discuss the merits of new technologies within the team. Aggressive helpfulness compounds: once you’ve demonstrated value to enough people, the dynamic shifts. They start implicitly trusting your judgement, and begin coming to you for advice, both on your product’s best practices and their team’s architecture in general. Through the open nature of this communication, you will come to better understand the workflows and pain points of your data scientists, naturally pick up on their latent needs, and deliver a superior product.

Being aggressively helpful will turn justifiably skeptical data scientists into powerful advocates for your product. Not only do they spread the gospel, but they also become resources for future adopters, ensuring that you don’t get bogged down in customer support as you eye new horizons. The network effect takes over, and you’ll finally be able to realize your vision through the tech stack of the entire department, improving work experience across the board. And hey, you might even have a colleague stand up at an all-hands and praise your team, branding you with an epithet so flattering it inspires you to write a blog post!
