Thoughtfully writing a blog post

Blog posts by the Algorithms team

Beware the data science pin factory: The power of the full-stack data science generalist and the perils of division of labor through function

This post discusses the benefits of full-stack data science generalists over narrow functional specialists. The latter will help you execute and bring process efficiencies; the former will help you learn, innovate, and deliver a step change in value. Such is the role of data science.

Putting the Power of Kafka into the Hands of Data Scientists

How Stitch Fix designed and built a scalable, centralized, self-service data integration platform tailored to the needs of its Data Scientists.

Stop Using word2vec

When I started playing with word2vec four years ago, I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries. Word vectors are awesome, but you don't need a neural network -- and definitely don't need deep learning -- to find them. So if you're using word vectors and aren't gunning for state of the art or a paper publication, then stop using word2vec.
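To make that concrete, here is a minimal sketch of the count-based route the post points to: build a word co-occurrence matrix, weight it with positive pointwise mutual information (PPMI), and factor it with an SVD. The toy corpus, the sentence-level co-occurrence window, and the number of dimensions are illustrative choices, not the post's exact recipe.

```python
# Minimal sketch: count-based word vectors via PPMI + SVD.
# No neural network involved; only standard numerical libraries.
import numpy as np
from itertools import combinations

# Toy corpus; in practice this would be millions of sentences.
corpus = [
    "she is in her third trimester".split(),
    "she used to wear scrubs to work".split(),
    "taking a trip and needs vacation clothing".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within each sentence (a crude "window").
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for w1, w2 in combinations(sent, 2):
        counts[idx[w1], idx[w2]] += 1
        counts[idx[w2], idx[w1]] += 1

# Positive pointwise mutual information.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log((counts * total) / (row * col))
ppmi = np.maximum(pmi, 0)  # log(0) = -inf becomes 0 here

# Truncated SVD gives dense word vectors (keep the top k dimensions).
k = 2
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)  # (vocabulary size, k)
print(dict(zip(vocab, np.round(word_vectors, 2))))
```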

Sorry ARIMA, but I’m Going Bayesian

When people think of “data science” they probably think of algorithms that scan large datasets to predict a customer’s next move or interpret unstructured text. But what about models that utilize small, time-stamped datasets to forecast dry metrics such as demand and sales? Yes, I’m talking about good old time series analysis, an ancient discipline that hasn’t received the cool “data science” rebranding enjoyed by many other areas of analytics.

Engineers Shouldn’t Write ETL: A Guide to Building a High Functioning Data Science Department

“What is the relationship like between your team and the data scientists?” This is, without a doubt, the question I’m most frequently asked when conducting interviews for data platform engineers. It’s a fine question – one that, given the state of engineering jobs in the data space, is essential to ask as part of doing due diligence in evaluating new opportunities. I’m always happy to answer. But I wish I didn’t have to, because this is a question that is motivated by skepticism and fear.

Making Causal Impact Analysis Easy

If you have ever spent time in the field of marketing analytics, chances are that you have analyzed whether a new local TV campaign, a major PR event, or the emergence of a new local competitor had a causal impact. From an analytical standpoint, these types of events all have one thing in common: the impact cannot be tracked at the individual customer level, and hence we have to analyze it from a bird's-eye view using time series analysis at the market level. Data science may be changing at a fast pace, but this is an old-school use case that is still very relevant no matter what industry you're in.
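As a deliberately simplified illustration of that bird's-eye view (not necessarily the post's method, which is more sophisticated), the sketch below fits a test market against control markets in the pre-period, projects a counterfactual into the post-period, and reads the impact off the gap. All data here is synthetic.

```python
# Illustrative sketch of market-level impact analysis with synthetic data:
# predict the test market from control markets using only pre-period data,
# then compare post-period actuals to that counterfactual.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_pre, n_post = 80, 20

# Two control markets and one test market that tracks them, plus a lift
# of +5 units per day after the (hypothetical) TV campaign starts.
controls = rng.normal(100, 5, size=(n_pre + n_post, 2))
test = controls @ np.array([0.6, 0.4]) + rng.normal(0, 2, n_pre + n_post)
test[n_pre:] += 5.0

# Fit on the pre-period only, then forecast the "no campaign" counterfactual.
model = LinearRegression().fit(controls[:n_pre], test[:n_pre])
counterfactual = model.predict(controls[n_pre:])

lift = test[n_pre:] - counterfactual
print(f"Estimated average daily lift: {lift.mean():.2f}")
print(f"Estimated cumulative lift:    {lift.sum():.2f}")
```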

Multiple Hypothesis Testing

In recent years, there has been a lot of attention on hypothesis testing and so-called “p-hacking”, or misusing statistical methods to obtain more “significant” results. Rightly so: for example, we spend millions of dollars on medical research, and we don’t want to waste our time and money pursuing false leads caused by flaky statistics. But even if all of our assumptions are met and our data collection is flawless, it’s not always easy to get the statistics right; there are still quite a few subtleties that we need to be aware of.
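One of those subtleties is correcting for the number of hypotheses you test. As an illustration (not code from the post), here is the Benjamini-Hochberg procedure, which controls the false discovery rate across a batch of p-values:

```python
# Illustrative Benjamini-Hochberg procedure: controls the false discovery
# rate when testing many hypotheses at once.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which hypotheses are rejected."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices of sorted p-values
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])         # largest i with p_(i) <= (i/m) * alpha
        rejected[order[: k + 1]] = True
    return rejected

# Ten tests: a few genuinely small p-values among noise.
p_values = [0.001, 0.008, 0.012, 0.041, 0.049, 0.2, 0.34, 0.57, 0.74, 0.95]
print(benjamini_hochberg(p_values))
# A naive per-test alpha = 0.05 would reject the first five;
# BH rejects only the first three once the whole batch is taken into account.
```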

Data Exploration with Weight of Evidence and Information Value in R

Binary classification models are perhaps the most common use-case in predictive analytics. The reason is that many key client actions across a wide range of industries are binary in nature, such as defaulting on a loan, clicking on an ad, or terminating a subscription.
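For readers new to the technique, weight of evidence (WoE) and information value (IV) can be sketched in a few lines. The post itself works in R; the sketch below is an illustrative Python translation with synthetic data and hypothetical column names.

```python
# Illustrative weight-of-evidence / information-value calculation for one
# binned predictor of a binary outcome (synthetic data, hypothetical names).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"tenure_months": rng.integers(1, 60, 5000)})
# Churn is more likely for short tenures (synthetic ground truth).
df["churned"] = (rng.random(5000) < np.where(df["tenure_months"] < 12, 0.4, 0.1)).astype(int)

# Bin the predictor, then compute WoE and IV per bin.
df["bin"] = pd.qcut(df["tenure_months"], q=5, duplicates="drop")
grouped = df.groupby("bin", observed=True)["churned"].agg(events="sum", total="count")
grouped["non_events"] = grouped["total"] - grouped["events"]

pct_events = grouped["events"] / grouped["events"].sum()              # share of "bads" per bin
pct_non_events = grouped["non_events"] / grouped["non_events"].sum()  # share of "goods" per bin

grouped["woe"] = np.log(pct_non_events / pct_events)
grouped["iv"] = (pct_non_events - pct_events) * grouped["woe"]

print(grouped[["events", "non_events", "woe", "iv"]])
print(f"Information value for tenure_months: {grouped['iv'].sum():.3f}")
```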

So, You Need a Statistically Significant Sample?

Although it is a commonly used phrase, there is no such thing as a “statistically significant sample” – it’s the result that can be statistically significant, not the sample. Word-mincing aside, for any study that requires sampling – e.g. surveys and A/B tests – making sure we have enough data to ensure confidence in the results is absolutely critical.
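To make “enough data” concrete, here is an illustrative power calculation for a two-proportion A/B test; the baseline rate, lift, and power target are hypothetical, not taken from the post.

```python
# Illustrative sample-size calculation for comparing two conversion rates
# (two-sided test, alpha = 0.05, power = 0.80). Rates here are hypothetical.
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group to detect p1 vs p2 with the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)

# Detecting a lift from a 10% to a 12% conversion rate:
print(sample_size_per_group(0.10, 0.12))   # roughly 3,800 per group
```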

A Word is Worth a Thousand Vectors

Standard natural language processing (NLP) is a messy and difficult affair. It requires teaching a computer about English-specific word ambiguities as well as the hierarchical, sparse nature of words in sentences. At Stitch Fix, word vectors help computers learn from the raw text in customer notes. Our systems, composed of machines and human experts, need to recommend the maternity line when a client says she’s in her ‘third trimester’, identify a medical professional when she writes that she ‘used to wear scrubs to work’, and distill ‘taking a trip’ into a Fix for vacation clothing.
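A rough sketch of the signal word vectors provide: once each word maps to a dense vector, ‘third trimester’ can land near ‘maternity’ by simple linear algebra. The tiny hand-made vectors below are placeholders, not Stitch Fix’s actual model.

```python
# Illustrative cosine similarity between word vectors. These tiny hand-made
# vectors stand in for a real pretrained model (e.g. word2vec or GloVe).
import numpy as np

vectors = {
    "trimester": np.array([0.9, 0.1, 0.0]),
    "maternity": np.array([0.8, 0.2, 0.1]),
    "scrubs":    np.array([0.1, 0.9, 0.2]),
    "nurse":     np.array([0.2, 0.8, 0.1]),
    "vacation":  np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: values near 1.0 mean the words point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["trimester"], vectors["maternity"]))  # high: related concepts
print(cosine(vectors["trimester"], vectors["vacation"]))   # low: unrelated concepts
```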