Blog posts by the Algorithms team

Multithreaded in the Wild

See who's out in the wild for the month of October

Multithreaded in the Wild

See who's out in the wild for the month of September

Putting the Power of Kafka into the Hands of Data Scientists

How Stitch Fix designed and built a scalable, centralized and self-service data integration platform tailored to the needs of their Data Scientists.

Synesthesia: The Sound of Style

If we could assign sounds to items of clothing, what would a Fix sound like?

Multithreaded in the Wild

See who's out in the wild for the month of August

Understanding Latent Style

This post explores the use of matrix factorization not just for recommendations, but for understanding style preference more broadly.

Add Constrained Optimization To Your Toolbelt

This post is an introduction to constrained optimization aimed at data scientists and developers fluent in Python, but without any background in operations research or applied math. We'll demonstrate how optimization modeling can be applied to real problems at Stitch Fix. At the end of this article, you should be able to start modeling your own business problems.

Multithreaded in the Wild

See who's out in the wild for the month of June

Two things about power

Experimenter beware: Running tests with low power risks much more than missing the detection of true effects.

Multithreaded in the Wild

See who's out in the wild for May

Multithreaded in the Wild

There's a large group out and about for the month of April. Come see where you might find them.

Lumpers and Splitters: Tensions in Taxonomies

As data scientists tasked with segmenting clients and products, we find ourselves in the same boat with species taxonomists, straddling the line between lumping individuals into broad groups and splitting into small segments. The approach for drawing the boundaries needs to take into account signals from the data while maintaining sharp focus on the project needs. A balance between lumping and splitting allows us to make the best data-driven decisions we can with the resources we have...

Multithreaded in the Wild

Come see who's out and about in the wild for March.

What Do Data Scientists Need to Know about Containerization? As Little as Possible.

Data scientists are not always equipped with the requisite engineering skills to deploy robust code to a production job execution and scheduling system. Yet, forcing reliance on data platform engineers will impede the scientists' autonomy. If only there were another way. So today, we're excited to introduce Flotilla, our latest open source project...

Fast Company's 2018 World's Most Innovative Companies List

Wow! We are so honored to be ranked #13 on Fast Company’s Most Innovative Companies List. And, we’re thrilled to be ranked #1 on Fast Company’s Data Science List.

It's really gratifying to see Data Science becoming a primary means of strategic differentiation ...

Multithreaded in the Wild

See who's out in the wild for February

Multithreaded in the Wild

See who's out in the wild for January

What the SATs Taught Us about Finding the Perfect Fit

On the Stitch Fix Algorithms team, we’ve always been in awe of what professional stylists are able to do, especially when it comes to knowing a customer’s size on sight. It’s a magical experience to walk into a suit shop, have the professional shopping assistant look you over and without taking a measurement say, “you’re probably a 38, let’s try this one,” and pull out a perfect-fitting jacket. While this sort of experience has been impossible with traditional eCommerce, at Stitch Fix we’re making it a reality.

Multithreaded in the Wild

See who's out in the wild for December

Multithreaded in the Wild

See who's out in the wild for November

Word Tensors

Counting and tensor decompositions are elegant and straightforward techniques. But these methods are grossly underrepresented in business contexts. In this post we factorized an example made up of word skipgrams occurring within documents to arrive at word and document vectors simultaneously. This kind of analysis is effective, simple, and yields powerful concepts.

Stop Using word2vec

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome but you don't need a neural network -- and definitely don't need deep learning -- to find them. So if you're using word vectors and aren't gunning for state of the art or a paper publication then stop using word2vec.

NBA Season Kickoff

Today is the start of the 2017-2018 NBA Season. Basketball statistics have become a rich and intriguing domain of study, bringing new insights and advantages to the teams that embrace such empiricism. Of course, the framing and analytic techniques used to study basketball are generalizations - they also give intuition to problems in business or other domains (and vice versa). So, for all the basketball statistics enthusiasts out there, as well as those who are looking for inspiration for their own analytic challenges, we thought we’d share a compendium of our past basketball-related posts.

Multithreaded in the Wild

See who's out in the wild for October

Time Dependent Classification

In this post we’ll take a look at how we can model classification prediction with non-constant, time-varying coefficients. There are many ways to deal with time dependence, including Bayesian dynamic models (aka "state space" models), and random effects models. Each type of model captures the time dependence from a different angle; we will keep things simple and look at a time-varying logistic regression that is defined within a regularization framework. We found it quite intuitive and easy to implement, and observed good performance using this model.

Multithreaded in the Wild

See who's out in the wild for September

The curious connection between warehouse maps, movie recommendations, and structural biology

Here at Stitch Fix, we work on many fun and interesting areas of Data Science. One of the more unusual ones is drawing maps, specifically internal layouts of warehouses. These maps are extremely useful for simulating and optimising operational processes. In this post, we'll explore how we are combining ideas from recommender systems and structural biology to automatically draw layouts and track when they change.

Data Science Interns 2017

This summer our community included four interns, all graduate students who are passionate about applying their academic expertise to help us leverage our rich data to better understand our clients, their preferences, and new trends in the industry. In this blog post you’ll meet the interns, who will tell you a bit about the problems they worked on and the strategies they used to solve them.

Multithreaded in the Wild

See who's out in the wild for this last part of August, and vote for our SXSW proposals.

Genie in a Box: Making Spark Easy for Stitch Fix Data Scientists

Stitch Fix is a Data Science company that aspires to help you to find the style that you love. Data Science helps us make most of our business and strategic decisions.

Diamond Part II

Announcing Diamond, an open-source project for solving mixed-effects models

Diamond Part I

Solving mixed-effects models efficiently: the math behind Diamond

Nodebook

Analysis should be reproducible. This isn’t controversial, and yet irreproducible analysis is everywhere. I’ve certainly created plenty of it. Why does this happen, despite good intentions? Because, in the short term, it is easier and more expedient not to worry about reproducibility. But this isn’t a moral failing so much as a failing of our tools. Tools can, and should, help make reproducible analysis the natural thing to do. As a step towards encouraging reproducibility, this post introduces Nodebook, an extension to Jupyter notebook.

Inventory Time Machine

As a proudly data-driven company dealing with physical goods, Stitch Fix has put lots of effort into inventory management. Tracer, the inventory history service, is a new project we have been building to enable more fine-grained analytics by providing precise inventory state at any given point of time...

This one weird trick will simplify your ETL workflow

In this post aimed at SQL practitioners who would rather spend their time writing Python, we'll show how a web development tool can help your ETL stay DRY.

Be smarter. Be seetd.

How to organize an office so everyone working there can be comfortable and productive is the topic of much discussion. A common strategy is to seat people by their team or sub-team membership. Another strategy, which we have been employing, is to simply allocate people randomly. Building upon these experiences, we've developed a new seating allocation tool, "seetd", that allows us to frame this as an optimization problem. We're now free to combine these and other approaches objectively.

R in pRoduction: theRe be dRagons!

R is an awesome tool for doing data science interactively, but has some defaults that make us worry about using it in production pipelines.

The Blissful Ignorance of the Narrative Fallacy

We have an innate and uncontrollable urge to explain things - even when there is nothing to explain. This post explores why we are prone to narrative fallacies. We start at an epic moment in sports history, Steph Curry breaking the record for most 3-pointers in a game, and draw conclusions for better decision making in business.

Multithreaded in the Wild

See who's out in the wild for the month of June.

Building a Data Exploration Tool with React

Dora helps data scientists at Stitch Fix visually explore their data. Powered by React and Elasticsearch, it provides an intuitive UI for data scientists to take advantage of Elasticsearch's powerful functionality.

The Making of the Tour, Part 3: Micro-Animations

In this last installment of our Making of the Tour series, we look at some of the fun and random.

Multithreaded in the Wild

See who's out in the wild for the month of May.

The Making of the Tour, Part 2: Simulations

In this post, we'll talk about some simulation-powered animations, provide some cleaned up code that you can use, and discuss these animations' genesis and utility for visualizing abstract systems and algorithms or for visualizing real historical data and projected futures.

Multithreaded in the Wild

See who's out in the wild for the month of April.

The Making of the Tour, Part 1: Process and Structure

Earlier this month, we released an interactive animation describing how data science is woven into the fabric of Stitch Fix: our Algorithms Tour. It was a lot of fun to make and even more fun to see people’s responses to it. For those interested in how we did it, we thought we’d give a quick tour of what lies under that Tour.

Ruminations on Data-Driven Fashion Design

Last summer, we wrote about Stitch Fix’s early experiments in data-driven fashion design. Since then, we’ve been studying, developing, and testing new ways to create clothes that delight our clients. Some of this work was featured yesterday in an article in The Wall Street Journal. As a companion to that piece, we wanted to highlight a few avenues that we have explored recently.

A Tour of our Algorithms Allegories

How data science is woven into the fabric of Stitch Fix. In this interactive tour we share ten “stories” of how data science is integral to our operations and product.

Multithreaded in the Wild

Come see who's going to be presenting or speaking out in the wild for the month of March!

The intimate relationship between exoplanets and fashion trends

At first sight the difference between planets outside our solar system (exoplanets from now on) and fashion trends seems enormous, but all of us math lovers know that entirely different phenomena can have an almost identical mathematical description. In this very peculiar case, exoplanetary systems and certain fashion trends can be characterized as having a periodic nature, with certain magnitudes repeating cyclically, and this will allow us to use very similar techniques to study them.

What's Wrong With My Time Series

Time series modeling sits at the core of critical business operations such as supply and demand forecasting and quick-response algorithms like fraud and anomaly detection. Small errors can be costly, so it’s important to know what to expect of different error sources. In this post I’ll go through alternative strategies for understanding the sources and magnitude of error in time series.

Multithreaded in the Wild

With January complete, see who's out in the wild for the month of February.

Scaling Data Science: Slides from #DDTX17

If you attended my talk at Data Day Texas in Austin last weekend, you heard me describe how Stitch Fix has reduced contention on access to data and access to ad-hoc compute resources to help scale Data Science. As attendees requested, I have posted my slides here, which you can find a link to...

Multithreaded in the Wild

Happy New Year! As always, members of our algorithms and engineering teams are out and about this January!

Update: Be Wrong the Right Number of Times

The outcome of the presidential election clearly indicated that the model used by FiveThirtyEight was closer to the truth than that of the Princeton Election Consortium in terms of the level of uncertainty in the predictions---but not by as much as you might think. I consider the question quantitatively: what are the odds that one or the other model is right given the state-by-state results?

I'd Rather Predict Basketball Games Than Elections: Elastic NBA Rankings

When Donald Trump won the 2016 presidential election, both sides of the political spectrum were surprised. The prediction models didn't see it coming and the Nate Silvers of the world took some heat for that (although Nate Silver himself got pretty close). After this, a lot of people would probably agree that the world doesn't need another statistical prediction model.

So, should we turn our backs on forecasting models? No, we just need to revise our expectations. George Box once reminded us that statistical models are, at best, useful *approximations* of the real world. With the recent hype around data science and "money balling" this point is often overlooked.

Happy 80th Birthday to the Turing Machine!

On this day in 1936, Alan Turing stood before the London Mathematical Society and delivered a paper entitled "On Computable Numbers, with an Application to the Entscheidungsproblem", wherein he described an abstract mathematical device that he called a "universal computing engine" and which would later become known as a Turing machine. As a Stitch Fix tribute, we’ve melded a Turing machine and a 1936 Singer sewing machine.

Be Wrong the Right Number of Times

Update, Dec 12, 2016: There is a follow up post discussing the outcome of all of this after the election results were known.

Trend Report I: White after labor day

Plaid is for fall; red on Valentine’s Day; no white after Labor Day. These are fashion adages we’ve all heard before -- that even John Oliver promotes. But how true are they? The Stitch Fix Algorithms team is in a unique position to quantitatively answer these questions for the first time. Given the season, we’ve decided to first take a look at the “No white after Labor Day” claim. How real is it?

Multithreaded in the Wild

As always, members of our algorithms and engineering teams are out and about this month!

Photo Based Clothing Measurements

Data science begins not with data but with questions. And sometimes getting the data necessary to answer those questions requires some ingenuity.

Embracing Immutable Server Pattern Deployment on AWS

On the algorithms team at Stitch Fix, we aim to give everyone enough autonomy to own and deploy all the code that they write, when and how they want to. This is challenging because the breadth of who is writing micro-services, and for what, covers a wide spectrum of use cases: from writing services to integrate with engineering applications (e.g. serving styling recommendations), to writing dashboards that consume and display data, to writing internal services to help make all of this function. After looking at the many deployment pipeline options out there, we settled on implementing the immutable server pattern.