Preboot is a feature provided by Heroku that can help you achieve zero-downtime deploys. That means that when you push a new version of your code, your users don’t experience your app being down for even a split second.
It does this by loading the new version of your web workers before it shuts down the old ones. It does not do this for your background workers. They load normally.
It’s a very useful feature and is sometimes worth its weight in gold. I don’t know how much computer code weighs, but whatever…
Sanity
In software, sanity is a continuous scale. Everything we do is somewhere between complete chaos and provably correct by math. Preboot leans a bit towards the former. It doesn’t mean that it should never be used, but it creates nuance.
Simple vs Easy
I think Rich Hickey would say that preboot is very easy and not at all simple. You just hit a switch and now Heroku will be sure your new code is ready before it shuts down the old code. Couldn’t be easier. However, achieving that means creating some extra complexity and sweeping it under the rug.
New complexity
In a large system made up of many Rails applications, preboot creates complexity in a few ways.
- You might have two different versions of your code running at once.
  - One way this can happen is if your background workers come up very quickly and your web workers are still being preloaded.
  - This means that you need to be able to think about what happens when two versions of your code are running at once, which is hard to do and very easy to get wrong. It can be particularly problematic when side effects are involved, like writing to a database or posting to an HTTP service.
  - When are you going to run your database migrations? Having both versions of the code running at once adds an extra degree of complexity to this decision. There’s also some timing variability in exactly when the old workers are going to go away.
  - Normally, we don’t think about two versions of code running at the same time. On most deploys it will all be fine, but every once in a while it will not be. You need to think it through for every push to master if you want to be safe.
- Connections to shared resources
  - Preboot creates a period of time when you will use (up to) double the number of connections to a shared resource.
  - This adds a degree of unpredictability for systems that have constraints on connection counts. At Stitch Fix, our system is made up of over a dozen Rails apps, and many of them share a Postgres database. Doubling our number of connections to that database could spell trouble. We’re currently working to unwind that dependency, but for now it’s something we have to keep an eye on. Could your shared resources handle it if all the apps were prebooted simultaneously? What would happen if one of the shared resources couldn’t? It’s hard to think through correctly, and it brings us to the last type of new complexity.
- Failure cases
  - Preboot makes it harder to think through all the potential failure cases.
  - What happens if the background workers come online with the new code for 15 seconds, but then the web workers fail to load? Are any corruptions possible?
  - Diagnosing a failure can be more complicated when you have to keep preboot in mind.
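The question about shared resources in the list above comes down to arithmetic, so here is a rough Ruby sketch of it. Everything in it is made up for illustration: the app names, dyno counts, pool sizes, and the 250-connection limit are not real Stitch Fix or Heroku numbers. It assumes every dyno can use its full connection pool, and that during preboot an app’s web connections can double while its background workers restart normally.

```ruby
# Hypothetical model of one app's connections to a shared database.
App = Struct.new(:name, :web_dynos, :worker_dynos, :pool) do
  # Steady state: every dyno may use a full connection pool.
  def steady_connections
    (web_dynos + worker_dynos) * pool
  end

  # Worst case during preboot: old and new web dynos overlap,
  # so web connections double. Background workers do not overlap.
  def preboot_peak_connections
    (2 * web_dynos + worker_dynos) * pool
  end
end

# Worst case across the system if every app is prebooting at once.
def peak_if_all_preboot(apps)
  apps.sum(&:preboot_peak_connections)
end

# Made-up apps and numbers, purely for illustration.
apps = [
  App.new("storefront", 10, 4, 5),
  App.new("admin", 4, 2, 5),
  App.new("warehouse", 6, 6, 5)
]

MAX_CONNECTIONS = 250 # assumed limit on the shared database

steady = apps.sum(&:steady_connections)
peak = peak_if_all_preboot(apps)

puts "steady: #{steady}, worst-case peak: #{peak}, limit: #{MAX_CONNECTIONS}"
puts "over the limit during simultaneous preboot" if peak > MAX_CONNECTIONS
```

With these invented numbers the steady state is 160 connections, comfortably under the limit, while the worst case of every app prebooting at once is 260, which is over it. The point is not the specific figures but that a system can look fine day to day and still have no headroom for overlapping deploys.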
System Design
Another subtle point comes up around depending on preboot. This one is a clearer problem. If one of your user-facing applications goes completely down when a microservice that it depends on doesn’t have preboot, then the integration between your application and microservice is poorly designed. Thinking that preboot removes the need to make your system resilient to downtime of individual components is risky business. Preboot should act as a small optimization, not as a critical part of your system.
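As a rough illustration of an integration that tolerates a dependency being briefly down, here is a minimal Ruby sketch. It is one possible approach under stated assumptions, not a prescription: `ServiceUnavailable`, `with_fallback`, and the fake recommendations service are all hypothetical names invented for this example, and a real app would pick retry counts, waits, and fallbacks that make sense for each call.

```ruby
require "timeout"

# Hypothetical error a client might raise when the dependency is unreachable.
class ServiceUnavailable < StandardError; end

# Runs the block, retrying a few times, and returns `fallback` if the
# dependency stays down. The caller degrades instead of going down with it.
def with_fallback(fallback:, attempts: 3, wait: 0.1)
  tries = 0
  begin
    tries += 1
    yield
  rescue ServiceUnavailable, Timeout::Error
    if tries < attempts
      sleep(wait)
      retry
    end
    fallback
  end
end

# Simulate a microservice that fails its first two calls, for example
# while it restarts during a deploy without preboot.
calls = 0
flaky_recommendations = lambda do
  calls += 1
  raise ServiceUnavailable, "restarting" if calls <= 2
  ["blue shirt", "grey scarf"]
end

result = with_fallback(fallback: [], wait: 0.01) { flaky_recommendations.call }
puts result.inspect
```

Here the third attempt succeeds, so the caller gets the real list. If the service had stayed down, the caller would have received the empty list and could render the page without recommendations rather than failing entirely.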
Conclusion
Preboot is a nice option to have. There are plenty of situations when you probably want it turned on. However, it’s kind of a crazy thing. You need to be aware of its craziness and plan accordingly. Lastly, preboot should never make you feel like you can treat an app or service in your system as though it will never be down. That leads to bad design.