Multithreaded in the Wild

Stefan Krawczyk
- San Francisco, CA

Hello Stitch Fix followers! Check out where our fellow Stitch Fixers are speaking in April.

Virtual

Stefan Krawczyk will be presenting a talk titled “Hamilton: a Micro Framework for Creating Dataframes” at apply(), the ML data engineering conference, on April 21st.

Description:

At Stitch Fix we have 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand team, was in a bind. Their data generation process was causing them iteration & operational frustrations in delivering time-series forecasts for the business. In this talk I’ll present Hamilton, a novel Python micro framework that solved their pain points by changing their working paradigm.

Specifically, Hamilton enables a simpler paradigm for a Data Science team to create, maintain, and execute code for generating wide dataframes, especially when there are lots of intercolumn dependencies. Hamilton does this by building a DAG of dependencies directly from Python functions defined in a special manner, which also makes unit testing and documentation easy; tune in to the talk to find out how. I’ll also cover our experience migrating to it and using it in production for over a year, along with possible future directions.
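To make the idea of building a DAG from functions a bit more concrete, here is a minimal sketch in plain Python. It is not Hamilton's actual API or implementation. It assumes one common convention for "functions defined in a special manner": each function's name is the column it produces, and its parameter names are the columns it depends on. The `execute` helper, the column names, and the use of plain lists in place of dataframe columns are all illustrative assumptions.

```python
import inspect

# Assumed convention (illustrative only): the function name is the output
# column, and the parameter names are the columns it depends on.
def spend_mean(spend):
    return sum(spend) / len(spend)

def spend_zero_mean(spend, spend_mean):
    return [s - spend_mean for s in spend]

def spend_per_signup(spend, signups):
    return [s / n for s, n in zip(spend, signups)]

def execute(functions, inputs, outputs):
    """Hypothetical resolver: treats parameter names as DAG edges and
    computes only what the requested outputs need."""
    nodes = {f.__name__: f for f in functions}
    computed = dict(inputs)

    def resolve(name, in_progress=()):
        if name in computed:
            return computed[name]
        if name not in nodes:
            raise KeyError(f"no input or function named {name!r}")
        if name in in_progress:
            raise ValueError(f"cycle detected at {name!r}")
        deps = inspect.signature(nodes[name]).parameters
        kwargs = {d: resolve(d, in_progress + (name,)) for d in deps}
        computed[name] = nodes[name](**kwargs)
        return computed[name]

    return {name: resolve(name) for name in outputs}

result = execute(
    [spend_per_signup, spend_zero_mean, spend_mean],
    inputs={"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]},
    outputs=["spend_per_signup", "spend_zero_mean"],
)
```

Because every column is an ordinary function, it can be unit tested on its own (for example, calling `spend_mean([1.0, 3.0])` directly), and its docstring is a natural place to document the column.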

Stefan Krawczyk will also be presenting a talk titled “Deployment for free: removing the need to write model deployment code at Stitch Fix” at MLCONF ONLINE 2021 – AI/ML OPS on April 29th.

Description:

At Stitch Fix we have a dedicated Data Science organization called Algorithms. It has 130+ Full Stack Data Scientists who build & own a variety of models. These models range from your classic prediction & classification models to time-series forecasts, simulations, and optimizations. Rather than hand off models for productionization to someone else, Data Scientists own and are on-call for that process; we love for our Data Scientists to have autonomy. That said, Data Scientists aren’t without engineering support, as there’s a Data Platform team dedicated to building tooling, services, and abstractions to increase their workflow velocity.

One data science task that we have been speeding up is getting models to production and increasing their usability and stability. This is a necessary task that can take a considerable chunk of a Data Scientist’s time, whether in developing or in debugging issues; historically everyone largely carved their own path in this endeavor, which meant many different approaches and implementations, and little to leverage across teams. In this talk I’ll cover how the Model Lifecycle team on Data Platform built a system dubbed the “Model Envelope” to enable “deployment for free”. That is, no code needs to be written by a data scientist to deploy any Python model to production, where production means either a micro-service or a batch Python/Spark job. With our approach we can remove the need for data scientists to worry about Python dependencies or instrumenting model monitoring, since we can take care of it for them, in addition to other MLOps concerns.
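As a rough illustration of what “deployment for free” could look like, the sketch below wraps a model in a small envelope object and lets generic platform code turn it into either a request handler or a batch job. This is not the actual Model Envelope design; `ModelEnvelope`, `as_service_handler`, `run_batch`, the call counter standing in for monitoring, and the dependency string are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

@dataclass
class ModelEnvelope:
    """Hypothetical bundle: a model plus the metadata a platform would need."""
    name: str
    predict: Callable[[Dict[str, Any]], Any]
    python_dependencies: List[str] = field(default_factory=list)
    call_count: int = 0  # stand-in for platform-provided monitoring

    def __call__(self, features: Dict[str, Any]) -> Any:
        self.call_count += 1
        return self.predict(features)

def as_service_handler(envelope: ModelEnvelope) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Generic request handler, written once by the platform for all models."""
    def handle(request: Dict[str, Any]) -> Dict[str, Any]:
        return {"model": envelope.name, "prediction": envelope(request["features"])}
    return handle

def run_batch(envelope: ModelEnvelope, rows: Iterable[Dict[str, Any]]) -> List[Any]:
    """Generic batch job over an iterable of feature rows."""
    return [envelope(row) for row in rows]

# The data scientist supplies only the model; here, a toy linear scorer.
def toy_model(features: Dict[str, Any]) -> float:
    return 2.0 * features["x"] + 1.0

envelope = ModelEnvelope("toy", toy_model, ["example-package==1.0"])
handler = as_service_handler(envelope)
response = handler({"features": {"x": 3.0}})
batch = run_batch(envelope, [{"x": 0.0}, {"x": 1.0}])
```

The point of the sketch is the division of labor: the model author writes `toy_model` and nothing else, while the service and batch paths, dependency metadata, and monitoring hooks live in shared code.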

Be sure to catch us at these events :)

