If you attended my talk at Data Day Texas in Austin last weekend, you heard how Stitch Fix has reduced contention on:
- Access to data
- Access to ad-hoc compute resources
to help scale Data Science. As attendees requested, I have posted my slides here; the link is at the bottom of this post.
For those who weren’t at my talk, here’s a brief background to the slides; they should be relatively self-explanatory after reading it.
Background
At Stitch Fix we have a lot of Data Scientists, around eighty at last count. One reason we have so many is that we do things differently. To get their work done, Data Scientists have access to whatever resources they need (within reason), because they have end-to-end responsibility for their work: they collaborate with their business partners on objectives and then prototype, iterate, productionize, monitor, and debug everything and anything required to get the desired output (see Engineers shouldn’t write ETL). They’re full data-stack Data Scientists!
The teams in the organization do a variety of different tasks:
- Clothing recommendations for clients.
- Clothing recommendations for designing clothes.
- Time series analysis & forecasting of inventory, client segments, etc.
- Warehouse worker path routing.
- NLP on requests and feedback.
- … and more!
Our Data Scientists are quite prolific at what they do – we were approaching 4,500 job definitions at last count. So you might be wondering: how have we enabled them to get their jobs done without getting in each other’s way?
This is where the Data Platform team comes into play. With the goal of lowering the cognitive overhead and engineering effort required on the part of the Data Scientists, the Data Platform team provides abstractions and infrastructure to help them. The relationship is a collaborative partnership in which both sides are incentivized to help each other through feedback: Data Platform needs to understand the Data Scientists’ pain points, while Data Scientists won’t use a tool that doesn’t work for them. The end result is hopefully a well-designed tool that appeals to and is adopted by the Data Scientists.
In regard to scaling Data Science, the Data Platform team has helped establish patterns and infrastructure that alleviate contention on:
- Access to Data
- Access to Compute Resources:
  - Ad-hoc compute
    - prototype, iterate, workspace
  - Production compute
    - where things are executed once they’re needed regularly
For the talk (and this post), I focused only on how we reduced contention on Access to Data and Access to Ad-hoc Compute to enable Data Science to scale at Stitch Fix. With that, I invite you to take a look through the slides.
Slides
TL;DR
For those who want the gist, the TL;DR version is:
- S3 + Hive Metastore is Stitch Fix’s very scalable data warehouse.
- Internally built APIs make S3 + Hive Metastore easier to use for Data Scientists.
- Docker is used to provide a consistent environment for Data Scientists to use.
- Docker + ECS enables a self-service ad-hoc platform for Data Scientists.