At Stitch Fix, we are currently tackling a problem common among fast-growing startups: our applications are overdependent on a shared database. In order to uncouple the various engineering teams from one another and to grow our applications to the next level, we need to unshare it. This blog post describes the problems we are trying to solve and the stepwise approach we are taking to solve them.
How We Got Here
We have set ourselves up for success in many ways:
- We have organized ourselves into small, independent teams strongly aligned with different business areas
- Each team is responsible for its own particular set of applications and services
- We practice TDD and Continuous Delivery, and release software multiple times per day
- We embrace a DevOps culture where the team that builds an application is responsible for running and maintaining it
Like other companies that follow this model (Google, Amazon, Netflix, etc.), we do this to decouple teams from one another and enable them to move forward quickly and independently. We also believe that giving autonomy and agency to individual teams motivates every engineer to bring their best.
Getting to this point has been an evolutionary process, though, where we started small and have grown over time. Just as with biological evolution, some of the vestiges of the earliest phases of our development are still visible. And one of the biggest vestiges is the fact that most of our applications read and write to a shared database.
The Shared Database
For an early- and mid-stage startup, a monolithic database is absolutely the appropriate architectural choice. With a small team and a small company, a single shared database made it simple to get started. Moving fast meant being able to make rapid changes across the entire system. A shared database made it very easy to join data between different tables, and it made transactions across multiple tables possible. These are real conveniences.
As we have gotten larger, those benefits have become liabilities. It has become a single point of failure, where issues with the shared database can bring down nearly all of our applications. It has become a performance bottleneck, where long-running operations from one application can slow down others. Finally, and most importantly, the shared database has become a coupling point between teams, slowing down our ability to make changes.
Enter Services
Our approach to unsharing the shared database is to move individual pieces of the shared database into private databases owned by individual services. Instead of having applications communicate by reading and writing common tables in a shared database, we encapsulate that functionality in a service which owns and manages the storage, and applications use the service interface.
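To make that shift concrete, here is a minimal sketch in Python. The shipments table and ShipmentService here are hypothetical stand-ins, not our actual schema or stack. Before, applications query the shared table directly; after, they call an interface, and the storage behind it becomes a private detail:

```python
import sqlite3

# Stand-in for the shared database (hypothetical schema).
shared_db = sqlite3.connect(":memory:")
shared_db.execute("CREATE TABLE shipments (order_id INTEGER, tracking_number TEXT)")
shared_db.execute("INSERT INTO shipments VALUES (42, '1Z999AA10123456784')")

# Before: every application queries the shared table directly,
# coupling itself to the schema.
def tracking_numbers_before(order_id):
    rows = shared_db.execute(
        "SELECT tracking_number FROM shipments WHERE order_id = ?", (order_id,)
    )
    return [r[0] for r in rows]

# After: one service owns the data; applications see only its interface.
class ShipmentService:
    def __init__(self, db):
        self._db = db  # storage becomes a private implementation detail

    def tracking_numbers(self, order_id):
        rows = self._db.execute(
            "SELECT tracking_number FROM shipments WHERE order_id = ?", (order_id,)
        )
        return [r[0] for r in rows]

service = ShipmentService(shared_db)
assert service.tracking_numbers(42) == tracking_numbers_before(42)
```

Once callers depend only on the interface, the service is free to change how and where the data is stored.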
As our industry has learned through painful experience, a service-oriented architecture with shared storage is an anti-pattern. In order to be effective, services need to be completely isolated from one another, and the only way to read and write a service’s data should be through its supported interface.
Sharing a database underneath encourages backdoor violations of the interface contract, because its data can be read and written directly. It bypasses business logic, and makes it impossible to enforce invariants inside the service. Worse, because the internal implementation details of the service are implicitly exposed, it makes it impossible to safely change them. The service cannot, for example, safely introduce a cache or use an alternate storage mechanism.
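As a concrete illustration of the invariant problem, suppose the service enforces a rule that canceled orders can never be shipped. This small Python sketch (the schema and rule are invented for illustration) shows how a backdoor write to the shared table skips that check entirely:

```python
import sqlite3

# Hypothetical schema for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE shipments (order_id INTEGER, tracking_number TEXT)")
db.execute("INSERT INTO orders (id, status) VALUES (1, 'canceled')")

def record_shipment(order_id, tracking_number):
    # The service's write path enforces the business rule:
    # canceled orders can never be shipped.
    (status,) = db.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    if status == "canceled":
        raise ValueError("cannot ship a canceled order")
    db.execute(
        "INSERT INTO shipments (order_id, tracking_number) VALUES (?, ?)",
        (order_id, tracking_number),
    )

# Through the interface, the invariant holds:
try:
    record_shipment(1, "1Z999AA1")
except ValueError as err:
    print(err)  # cannot ship a canceled order

# A backdoor write to the shared table bypasses it silently:
db.execute(
    "INSERT INTO shipments (order_id, tracking_number) VALUES (?, ?)",
    (1, "1Z999AA1"),
)
print(db.execute("SELECT COUNT(*) FROM shipments").fetchone()[0])  # 1
```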
eBay’s first foray into SOA, circa 2008, ended up failing for exactly this reason. A lot of careful and concerted effort went into designing and implementing reasonable service interfaces, and those interfaces were nicely orthogonal and decoupled. But all the underlying databases were still shared. This meant that the services were not actually isolated from one another, so one service’s operations could have unexpected side effects on another’s data. It also meant that an application did not have to use the service interface to get its job done; it could continue reading and writing directly to the underlying tables as it had always done.
After all, a service’s interface is not just the API we usually think of. The interface comprises all the mechanisms someone could use to influence its behavior or read and write its data. This includes both the obvious and the non-obvious:
- Synchronous API (e.g., a REST/JSON request-response)
- Messages consumed and produced
- Logging and monitoring data produced
- Any direct access to its storage
With the benefit of 20/20 hindsight, it is perhaps a bit surprising that it has taken our industry as long as it has to come to this realization. After all, we would never be so cavalier about violating the encapsulation of a class or component. Imagine a class whose interface boundary you could bypass by reading or writing its private memory locations! In almost all modern languages, this is not even possible. Just as we would never make the internal implementation details of a class accessible to the outside world, a properly isolated service should never expose its storage either.
Getting from Here to There
As with everything we do at Stitch Fix, we ask ourselves how we can approach this unsharing problem incrementally.
Here are our steps:
- First, we introduce a service interface whose implementation simply accesses the table(s) in the shared database directly. We make sure that the service interface contains operations that are semantically meaningful to the caller, rather than being just a CRUD API. (The sketch after this list illustrates this step and the remote-service step together.)
- Next, we refactor code that uses the table(s) so that all access runs through that interface. For tables that are joined in other queries, this involves some non-trivial effort sorting out what to do with those queries. Now all access to the table(s) goes through the service interface, but that interface still lives inline in the applications.
- We now write a remote service (using our open-sourced stitches service framework), and replace the implementation of the service interface with a call to the remote service. Now all access to the table(s) is via the remote service, but the table(s) still live in the shared database.
- Last, we migrate the table(s) from the shared database into a private database for the service, and update the service implementation to read and write to its private database. Now the data is encapsulated inside the service, and we are done.
- Rinse and repeat
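To show how the steps fit together, here is a minimal sketch, assuming a hypothetical ShipmentService over a shipments table. It uses Python and a generic HTTP client rather than our actual Ruby/stitches stack, and all names, endpoints, and schema details are illustrative. The key property is that callers depend only on the interface, so its implementation can be swapped from inline table access (step one) to a remote call (step three) without touching them:

```python
import sqlite3

import requests  # generic HTTP client; stands in for a real service client


# The service interface: operations that are semantically meaningful
# to the caller, not a generic CRUD API.
class ShipmentService:
    def record_shipment(self, order_id, tracking_number):
        raise NotImplementedError

    def shipments_for_order(self, order_id):
        raise NotImplementedError


# Step one: the interface is implemented inline, still reading and
# writing the shared database directly. Callers no longer know that.
class InlineShipmentService(ShipmentService):
    def __init__(self, connection):
        self._connection = connection

    def record_shipment(self, order_id, tracking_number):
        with self._connection:  # commits the transaction on success
            self._connection.execute(
                "INSERT INTO shipments (order_id, tracking_number) VALUES (?, ?)",
                (order_id, tracking_number),
            )

    def shipments_for_order(self, order_id):
        rows = self._connection.execute(
            "SELECT tracking_number FROM shipments WHERE order_id = ?",
            (order_id,),
        )
        return [r[0] for r in rows]


# Step three: the same interface now delegates to a remote service.
# Because callers depend only on ShipmentService, nothing else changes,
# and the table is free to migrate into the service's private database.
class RemoteShipmentService(ShipmentService):
    def __init__(self, base_url):
        self._base_url = base_url

    def record_shipment(self, order_id, tracking_number):
        response = requests.post(
            f"{self._base_url}/orders/{order_id}/shipments",
            json={"tracking_number": tracking_number},
        )
        response.raise_for_status()

    def shipments_for_order(self, order_id):
        response = requests.get(f"{self._base_url}/orders/{order_id}/shipments")
        response.raise_for_status()
        return [s["tracking_number"] for s in response.json()["shipments"]]
```

Application code calling shipments_for_order is identical before and after each swap; only the wiring behind the interface changes, which is what makes the final migration of the table into a private database safe.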
Why Didn’t We Just Start There?
Why did we even do this? More specifically, why didn’t we partition the database from the very beginning? The honest answer is that doing so would have been absolutely the wrong thing to do. In the early stages, we should (and did) prioritize validating our business model and finding product-market fit. Even as we were growing, we prioritized delighting our early customers and rapidly iterating on our product offering over engineering for a future scale that might never arrive. The opportunity cost of over-engineering early on is too high. Far better to spend our scarce time and resources on solving near-term problems than on problems that might only occur in a year or two.
Now that we have achieved a scale where the shared database is a pain point, it makes sense to invest in unsharing it.
It is worth pointing out that rearchitecting or reengineering a system is almost always a sign of success, not a sign of failure. It is not so much that you have to rearchitect as that you get to rearchitect. It means that you have reached a scale where your earlier choices no longer work, and that people actually care about the product you provide. Or, as I like to say, if you don’t end up regretting your early technology decisions, you probably over-engineered!