Scaling Data Science:
Slides from #DDTX17

Stefan Krawczyk
- San Francisco, CA

For those who attended my talk at Data Day Texas in Austin last weekend, you heard me talk about how Stitch Fix has reduced contention on:

  • Access to data
  • Access to ad-hoc compute resources

to help scale Data Science. As attendees requested, I have posted the slides; you can find a link to them at the bottom of this post.

For those who weren’t at my talk, here’s a brief background to the slides; they should be relatively self-explanatory after reading it.

Background

At Stitch Fix we have a lot of Data Scientists, around eighty at last count. One reason we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they have end-to-end responsibility for their work: they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor, and debug everything and anything required to produce the desired output (see Engineers Shouldn’t Write ETL). They’re full data-stack Data Scientists!

The teams in the organization perform a wide variety of tasks.

Our Data Scientists are quite prolific at what they do: we’re approaching 4,500 job definitions at last count. So you might be wondering, how have we enabled them to get their jobs done without getting in each other’s way?

This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of Data Scientists, the Data Platform team provides abstractions and infrastructure to help them. The relationship is a collaborative partnership, in which both sides are incentivized to help each other through feedback: Data Platform needs to understand the Data Scientists’ pain points, while Data Scientists won’t use a tool that doesn’t work for them. The end result is hopefully a well-designed tool that appeals to and is adopted by the Data Scientists.

In regard to scaling Data Science, the Data Platform team has helped establish patterns and infrastructure that alleviate contention on:

  • Access to Data
  • Access to Compute Resources:
    • Ad-hoc compute
  • where Data Scientists prototype and iterate (their workspace)
    • Production compute
      • where things are executed once they’re needed regularly

For the talk (and this post) I focused only on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.

Slides

Data Day Texas 2017: Scaling Data Science at Stitch Fix from Stefan Krawczyk

TL;DR

For those who just want the gist, the TL;DR version is:

  • S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
  • Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
  • Docker is used to provide a consistent environment for Data Scientists to use.
  • Docker + ECS enables a self-service ad-hoc platform for Data Scientists.
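To make the second bullet a bit more concrete, here is a minimal sketch of the general pattern behind such an API: Data Scientists address tables by logical name, while a metastore resolves names to storage locations behind the scenes. Everything below is a hypothetical illustration, not Stitch Fix’s actual API; a plain dict stands in for the Hive Metastore and the local filesystem stands in for S3.

```python
import csv
import os
import tempfile

class TableClient:
    """Hypothetical sketch: read/write tables by name, hiding the storage backend."""

    def __init__(self, warehouse_root):
        self.warehouse_root = warehouse_root
        # Stand-in for the Hive Metastore: logical table name -> storage location.
        self._metastore = {}

    def write(self, table, rows, fieldnames):
        # Stand-in for S3: write a CSV file under the warehouse root.
        path = os.path.join(self.warehouse_root, table + ".csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        # Register the location, like creating a metastore entry.
        self._metastore[table] = path

    def read(self, table):
        # Callers never see paths; the "metastore" resolves the name.
        path = self._metastore[table]
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

client = TableClient(tempfile.mkdtemp())
client.write("dev.stefan.clients", [{"id": "1", "state": "CA"}], ["id", "state"])
rows = client.read("dev.stefan.clients")
```

The point of the abstraction is that callers only ever deal in table names; swapping the backing store (local disk, S3, something else) doesn’t change Data Scientists’ code.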
