(This is the second post in a series on Internal Software. The first was Why Build Internal Software?)
I haven’t picked up an Erlenmeyer flask since the late 20th century, so I am not qualified to talk about science, but as a software engineer I do know about data. In the popular imagination Data Science is conducted on (with? in? at?) huge public or semi-public data sets like Google search terms, the Twitter firehose or health records from insurance companies. Books such as Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight and Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are are cranked out on a regular basis to explain the rewards of analyzing the unprecedented volume of data that humanity now produces. Apparently “by the end of an average day in the early twenty-first century, human beings searching the internet will amass eight trillion gigabytes of data” (whatever that means).
At Stitch Fix we certainly have enough data that it qualifies as Big (although not quite the petabytes mentioned in the above books). But whereas Big Data is typically a thin gruel that requires heroic levels of consumption to provide any nourishment, we gather only the best ingredients for a healthier stew: ours is best described as Rich Data.
The raw data at Stitch Fix describes two different domains: clients and clothes. At a very basic level our styling recommendations are generated by combining data about clients (including past purchases) and data about clothes. The client data comes from our customers themselves when they fill out their style profile on our website or in our iOS app. The data about the clothes is captured entirely within our internal software. Since the data we use comes solely from within Stitch Fix (we don’t use third-party aggregated retail industry data, for example), and there isn’t much of it (relative to Big Data), it needs to be very rich indeed to be useful.
For this internally generated data to be Rich Data it has to be valid and relevant. Validity is vital. It might be acceptable to manually clean up data sets about regional agricultural output levels from the Tang Dynasty, but data collected today inside a high-tech retailer in San Francisco should be absolutely trustworthy. If the data can’t be trusted, then neither can any of the insights or algorithms based on it.
The data about clothes at Stitch Fix is entered by the Merchandising team, so our software engineers have to create tools that make it very hard to enter invalid data. There are plenty of ways we can ensure data integrity. Our Director of Engineering, Dave Copeland, discusses two of them, Rails validations and Postgres check constraints, in his blog post from a couple of years ago. Good software with carefully designed validations and constraints generates valid data.
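As a rough sketch of that belt-and-braces approach (the model, table, and attribute names below are invented for illustration; they are not our actual schema), a Rails validation can be paired with a matching Postgres check constraint so the same rule is enforced at both the application and database layers:

```ruby
# Hypothetical model for an item of clothing. The validation rejects
# bad data at the application layer and gives the person entering it
# an immediate, friendly error message.
class Item < ApplicationRecord
  validates :wholesale_price_cents, presence: true,
            numericality: { only_integer: true, greater_than: 0 }
end

# The matching Postgres check constraint, added in a migration, guards
# against anything that bypasses the model entirely (a console session,
# a bulk import, a future service written in another language).
class AddPriceCheckToItems < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      ALTER TABLE items
        ADD CONSTRAINT wholesale_price_cents_positive
        CHECK (wholesale_price_cents > 0)
    SQL
  end

  def down
    execute <<~SQL
      ALTER TABLE items
        DROP CONSTRAINT wholesale_price_cents_positive
    SQL
  end
end
```

The validation catches mistakes where a human can fix them; the constraint guarantees that invalid rows never reach the table at all.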
Of course, you could capture highly valid data with just a single drop-down field titled ‘trendiness’, with options from 1 to 10. The data would be valid but still useless, because it would be irrelevant. To ensure the relevance of the data we capture, we lean on another practice that is important to software engineers at Stitch Fix: product design. By working closely with our business partners and doing our best to understand the actual problems we are trying to solve (in this case, capturing the data points that enable us to generate the best recommendations for the clothes we send to our clients), we can be reasonably sure we are capturing data that is both valid and relevant.
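To make the distinction concrete, here is a deliberately contrived sketch (again, none of these fields come from our real schema): both models capture perfectly valid data, but only the second captures data that is relevant to a styling recommendation.

```ruby
# Valid but irrelevant: a single opaque score tells an algorithm
# almost nothing about why an item would suit a particular client.
class OpaqueItem < ApplicationRecord
  validates :trendiness, inclusion: { in: 1..10 }
end

# Valid and relevant: hypothetical attributes chosen by sitting with
# the people who make styling decisions and asking what actually
# drives them. Each field maps onto a real question about the clothes.
class DescriptiveItem < ApplicationRecord
  validates :silhouette,    inclusion: { in: %w[fitted straight relaxed] }
  validates :sleeve_length, inclusion: { in: %w[sleeveless short three_quarter long] }
  validates :fabric_weight, inclusion: { in: %w[light medium heavy] }
end
```

The second model is no harder to validate than the first; the difference is that its fields were designed around the problem.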
A final problem for capturing Rich Data, which should really be the first problem, is that of existence. Your data can be as valid and relevant as you like, but if it doesn’t exist inside your software it won’t generate much business value. It seems as though this shouldn’t be a problem for internal software: even though user acquisition costs are absurdly high (recruiting costs plus salary), your conversion rate should be 100% (that’s what the salary is paying for). In reality that is not the case. If your software is difficult to use, or doesn’t seem to add any value for its users, they can always find another way to store their data, one that is useful to them but much harder to do science on (e.g., Excel).
Data science relies on good data. If your data comes from inside your institution and is going to be Rich rather than Big then it needs to be valid, relevant and extant. To meet all three of those criteria you need good internal software.