Stitch Fix is the third startup I’ve worked at, and the second where I joined at a fairly early stage. So far, we’ve been able to avoid creating a single monolithic application, and have been consistently delivering and deploying solutions to our users. I believe this is because we’ve developed a set of “super powers” which have been extremely helpful, and I believe these powers will keep our team and technology healthy for years to come.
These “Super Powers” are skills that we’ve cultivated, activities that we haven’t shied away from, or complex processes that we’ve automated away and no longer think about. These powers are easily gathered by a small development team early on, and can be passed on to newer members (whereas they are hard to develop later, when the team is larger). They are:
- Provision Applications With Ease
- Continuous Deployment
- Publish Shared Libraries
- Proper Data Modeling Using a Powerful Database
- Databases as a Service
- Background Processing
- Monitoring & Logging
Which leads to the best Super Power of them all:
- Trust from the Rest of the Company
Let’s dig a little deeper on each one.
Provision Applications With Ease
Nothing contributes to monolithic do-everything applications more than the inability to provision new applications to share the load. We decided early on to stop putting any more features at www.stitchfix.com/admin and put them in special-purpose apps. Because we use a cloud-based hosting service (Heroku), that decision is an easy one: we just provision and deploy a new application.
When discussing new applications, we tend to discuss what the application should do and not worry about how long it will take to create. This skill is not easy to develop later in a company’s lifecycle, and often requires a dedicated team to build the necessary infrastructure and migrate the applications, typically at a time when such activities carry higher risk and affect more customers.
Continuous Deployment
When a change to the system is ready, it goes up. When we merge code to our master branch in GitHub, it is bound for production. Our Continuous Integration server is a gatekeeper, of course, but we don’t have release cycles, versions, or timed releases. We may hold a feature until the next day if it represents a radical change in an internal process, but generally we aren’t trying to sync up a bunch of features to do a weekly release.
The technical aspects of achieving this aren’t nearly as interesting as the cultural impact it has. Because we don’t have a formalized QA process, we all know that the only thing between our code and production is the test suite, so that test suite needs to be reliable. Whenever there is a production bug, our first question is “Why didn’t the tests catch this?”.
Publish Shared Libraries
Because we have several applications, those applications often need to share business logic or other code. While dedicated services might someday be a way we do that, for now we share logic and code via libraries (in our case, RubyGems). Just as with provisioning new applications, we can create and publish a new library in a matter of minutes.
To keep our library codebases consistent, we initially had a wiki page of instructions; this has since been reduced to a command-line app: running y my_new_gem creates an empty code library the “Stitch Fix Way” that can be deployed to our hosted private code server instantly.
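As a sketch of what consuming one of these shared libraries looks like from an application’s side, here is a hypothetical Gemfile pulling from a private gem server; the gem name and source URL are invented for illustration:

```ruby
# Gemfile (sketch) -- the private source URL and gem name are illustrative
source "https://rubygems.org"

# Shared business logic comes in as gems hosted on a private gem server
source "https://gems.example.com" do
  gem "stitchfix-shipments" # hypothetical shared library
end

gem "rails"
gem "pg"
```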
Proper Data Modeling Using a Powerful Database
When you have a single monolithic application as the only accessor to your database, you can rely on your code to keep the data clean. However, in a setup where many applications are sharing a database, you run the risk that one application will put bad data in, causing another application to break. The solution to this has been a part of many popular relational databases for years: constraints.
It’s unfortunate that MySQL, which is quite popular and usable, lacks real constraints. We use Postgres, which provides not just “not null” constraints and foreign key constraints (both part of MySQL as well), but also check constraints. A check constraint allows us to put sophisticated requirements on our data to ensure it is always correct.
For example, we have a table that tracks events on a shipment (a shipment being what we send to our customers in hopes of a purchase).
Column | Type | Modifiers
---------------+--------------------------+---------------
id | integer | not null
shipment_id | integer | not null
status_id | integer | not null
event_name | character varying(255) | not null
admin_user_id | integer |
client_id | integer |
created_at | timestamp with time zone | default now()
We have two types of users in our system: our clients (aka our customers), and our internal users (which we call “admin users”). In the table above, you can see we have a reference to both a client ID and an admin user ID. We want exactly one of those fields to have a value, so that we can attribute the event to an actual person. A check constraint allows us to do that.
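A sketch of what such a constraint might look like, expressed as a Rails migration; the table and constraint names here are invented for illustration:

```ruby
# Sketch of a migration adding the check constraint described above.
# The table name (shipment_events) and constraint name are illustrative.
class AddUserReferenceCheckToShipmentEvents < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE shipment_events
        ADD CONSTRAINT exactly_one_user_reference CHECK (
          (admin_user_id IS NOT NULL AND client_id IS NULL) OR
          (admin_user_id IS NULL AND client_id IS NOT NULL)
        )
    SQL
  end

  def down
    execute "ALTER TABLE shipment_events DROP CONSTRAINT exactly_one_user_reference"
  end
end
```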
If any application (or even someone at a SQL prompt) tries to insert a row into this table that doesn’t have exactly one of admin_user_id or client_id set, the database will refuse the insert. Which means that we can write code that simply trusts this invariant.
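A hypothetical sketch of what that kind of code might look like, with invented model and method names:

```ruby
# Sketch: because the database guarantees exactly one of these is set,
# we can attribute an event to a person without defensive nil-checking.
class ShipmentEvent < ActiveRecord::Base
  belongs_to :admin_user
  belongs_to :client

  # Always returns the person responsible for this event
  def responsible_person
    admin_user || client
  end
end
```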
Moreover, this means that an errant or buggy application cannot cause trouble for other applications by putting bad data in the database. And this means that we can use the database as an integration point much more reliably than if we had only weak assurances of the data inside. Which allows us to reap the benefits of multiple applications and shared code without having to immediately start standing up RESTful services.
Databases as a Service
On the subject of databases, our databases work just like our applications: we can create new ones, set up replication, access backups, roll back to previous versions, and download production dumps with a simple command. This is because, like our applications, our databases are set up via a cloud hosting service (Heroku Postgres) that provides all of that.
We recently found that three back-end reporting systems were competing for resources on our single replication slave database. So we spun up two more, and now they each have their own. We never had to discuss the difficulty of physically setting up the systems, just whether or not it made sense architecturally.
Background Processing
When building internal tools, or interfacing with third party systems, the easiest way to improve performance and reliability is to run long-running processes in the background. Setting this up the first time can be painful, but once it’s there, you can start to background everything that’s not strictly needed in the user-facing request cycle.
When we started seeing long delays from our shipping label provider, Jon took the extra effort to not just background the external call, but also to set up an easy way for the front-end to use AJAX to wait for the result. The user still has to wait, but our application no longer times out. When we later needed to add a second call to another third party, Joel was able to easily build on Jon’s work to do that. Up-front investment can often pay off.
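As a rough sketch of the pattern with Resque (the background queue we use), backgrounding a slow third-party call might look something like this; the class, queue, attribute, and provider names are invented for illustration:

```ruby
# Sketch of a Resque job for a slow external call; names are illustrative.
class CreateShippingLabelJob
  @queue = :shipping_labels

  def self.perform(shipment_id)
    shipment = Shipment.find(shipment_id)
    # The slow call to the shipping label provider happens off the request cycle
    label = ShippingLabelProvider.create_label(shipment)
    shipment.update_attributes!(label_url: label.url)
  end
end

# In the controller, enqueue the job instead of calling the provider inline;
# the front-end then polls via AJAX until the label is ready.
Resque.enqueue(CreateShippingLabelJob, shipment.id)
```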
Monitoring & Logging
With all of this infrastructure, you need to know what’s going on with it. In addition to basic ping checks of our applications, we also monitor the size of our background queues, get alerted when any job fails, get notified when our applications crash or timeout, and can even be alerted when certain SQL statements that should never return a value do, in fact, return a value.
This means that we know about problems before our users do. Having this knowledge is so important because users have been trained over the years to accept flaky software as normal; they won’t report problems right away, or at all.
Because of our monitoring setup, I can email an internal user to let them know I’m aware of their problem, and point out a workaround they can try to get their work done, often before they realize what’s going on. It might sound creepy, but it helps: I can get real-time information from them, and they know that we’re on top of it.
We can currently add alerts for any log message our applications produce, meaning it’s dead simple to stay on top of our system.
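One way to wire up the “SQL that should never return a value” checks is a scheduled task that runs the query and logs a distinctive line for the log-based alerting to match. Here is a minimal sketch of that idea; the task, table, and log-line names are illustrative, not necessarily how we actually implement it:

```ruby
# Sketch: a rake task run on a schedule that checks a data invariant.
# An alert is configured to fire on the "DATA_INTEGRITY_VIOLATION" log line.
namespace :checks do
  desc "Alert if any shipment event lacks exactly one user reference"
  task orphaned_events: :environment do
    ids = ActiveRecord::Base.connection.select_values(<<-SQL)
      SELECT id FROM shipment_events
      WHERE (admin_user_id IS NULL) = (client_id IS NULL)
    SQL
    Rails.logger.error("DATA_INTEGRITY_VIOLATION shipment_events: #{ids.join(',')}") if ids.any?
  end
end
```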
Trust from the Rest of the Company
All of the previous superpowers mean that we can deliver solutions for our clients and internal users quickly without acquiring a lot of technical debt. Because we aren’t messing around with server configuration, debugging odd data, or figuring out why something’s slow or unresponsive, we spend most of our time delivering the business value we’re paid to deliver.
This means that when we discuss potential solutions, we’re taken seriously. When we give estimates for how long work will take, we are believed (and when an estimate is perceived as “too long”, the solutions proposed from the non-tech side are almost always around cutting scope and delivering a smaller solution earlier). Finally, when we do need some time to do some technical things behind the scenes (such as upgrade to Rails 4), it’s understood that we wouldn’t be asking if there weren’t an upside to the company in doing so.
We’ll see what happens over the next year, but it’s my belief that our team will continue to function efficiently and effectively as a partner to the business and not the dreaded “IT Department”. All because we’ve cultivated some technical super powers.
Specifics
For the curious, here are the specific tools and services we use:
- Ruby on Rails
- Postgres
- GitHub (source code management)
- Heroku (hosts our apps and Postgres databases)
- Resque (background processing)
- Heroku Scheduler (cron-like)
- TDDium (Continuous Integration)
- Hubot (Chat-based assistance for deployment, issue tracking, and questionable images)
- GemFury (Private Gem Hosting)
- PaperTrail (Logging)
- PagerDuty (alerting rotation)
- Pingdom (application monitoring)
- NewRelic (performance monitoring)
- Google Apps (knowledge sharing and management)