Rather than putting all domain logic into a single application monolith, modern software architectures tend to split functionality into multiple applications and services. There are pros and cons to either design, and it is not the goal of this article to go into which is better and why. However, it is generally agreed that one of the drawbacks of using services is the added complexity they bring to the table. Specifically, there are many more points of failure, and building a robust system means expecting that any node may fail at any given time. These failures cannot be ignored, and you cannot simply wait until all of your nodes are green to operate normally. To address some of the complexity that services introduce, I've compiled a checklist of important elements to consider:
- Isolate as much domain knowledge in the service as possible. Don't send the consumer of your service any more information than it absolutely needs. An overly chatty response creates a system that is harder to maintain and leaks domain knowledge into an inappropriate scope.
- Use status codes correctly. The most important piece of information returned from most service calls is the HTTP status code. Make sure 200-level codes indicate success, 400-level codes indicate bad requests from the consumer (usually fixable by the consumer), and 500-level codes indicate errors internal to the service itself.
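As a concrete sketch, a consumer-side helper (the name and return values here are hypothetical) might branch on the status class when deciding whether a retry or fallback could help:

```ruby
# Hypothetical consumer-side helper: classify an HTTP status code so the
# caller knows whether to fix the request (4xx) or retry/fall back (5xx).
def classify_status(code)
  case code
  when 200..299 then :success       # the call worked
  when 400..499 then :client_error  # our request was bad; retrying won't help
  when 500..599 then :server_error  # the service failed; retry or fall back
  else :unknown
  end
end
```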
- Have a fallback plan. Know that any endpoint may fail for a myriad of reasons: software issues, hardware issues, network issues, etc. As the architect of the service, these failures may or may not be within your control; as a consumer of the endpoint, you need to expect that they will happen on occasion. Building a resilient system is hard, and there are many different techniques and patterns to consider. Keep in mind that your fallback plan will vary depending on the type of request you are supporting. For example, at Stitch Fix, we call a service to find the available dates to display on our reservations calendar. The dates are personalized for every client to make sure we can match them to the perfect stylist and inventory, and can deliver their package on time. If the calendar service were down, we couldn't accept new reservations. Not only would that be a poor customer experience, but we'd lose business and revenue in the process. Our solution is to cache a default version of the calendar to show when the service is down. The optimal calendar is more accurate and finely tuned to our customers' needs and business process, but the default calendar at least lets us give a conservative view of our availability and keeps us operational in the short term.
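A minimal sketch of that fallback idea (the names and default payload below are hypothetical, not our actual implementation): rescue the failed call and serve a cached, conservative default instead.

```ruby
# Hypothetical conservative default, cached ahead of time.
DEFAULT_CALENDAR = { source: :cached_default, dates: [] }

# `fetcher` stands in for the real service call; any failure falls back
# to the default calendar instead of taking reservations down entirely.
def available_dates(fetcher, fallback: DEFAULT_CALENDAR)
  fetcher.call
rescue StandardError
  fallback
end
```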
- Cache data that doesn't need to go over the wire for every request. Network calls can be expensive and can slow down your entire system; think of bandwidth and latency as precious commodities. If you need to get data from an external service, ask yourself what the lifespan of that data is, and cache it for that length of time before making a new request. Caching can be done on the server, on the client, or somewhere in between, and each situation calls for different caching requirements. If done client-side, cached data can serve as an excellent fallback when the service call fails. But beware: caching done wrong can add bloat to your system and serve stale data.
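To make the lifespan idea concrete, here is a minimal in-memory sketch (the class name is hypothetical); a real system might use Rails' built-in cache store or memcached instead:

```ruby
# Minimal in-memory cache with a time-to-live, keyed by request.
class TtlCache
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @store = {}
  end

  # Returns the cached value if it is still fresh; otherwise runs the
  # block (e.g. the network call) and caches its result.
  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && (Time.now - entry[:at]) < @ttl
    value = yield
    @store[key] = { value: value, at: Time.now }
    value
  end
end
```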
- Think about the pros and cons of making requests synchronously versus asynchronously. I would favor asynchronous requests when possible to avoid lengthy delays or timeouts on the consumer side. Disclaimer: this topic really warrants its own article to explain the intricacies of doing it right and when you would (and would not) want to implement this solution, so that's all I'll say about it here; but do your research on this topic up-front. The type of request you support can change your entire design.
- Set up consumer-driven contracts. Again, this topic really warrants its own article, so I'll refer you to this article. But essentially, what this buys you is the assurance that changes to a service won't unwittingly break the existing consumers of that service. It's pretty important, and it frees you from the chains of full integration testing.
- Use common gems or libraries to do the dumb stuff. There is a lot of setup and teardown code associated with making HTTP requests and parsing HTTP responses. Put it in a common place so you don't have to duplicate it everywhere. I would suggest checking out Stitches, and creating shared libraries to handle the gory details of request/response wrangling.
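As an illustration of the kind of "dumb stuff" worth centralizing (this is a hypothetical helper, not the Stitches API), a shared library might hide status checking and JSON parsing behind one call:

```ruby
require "json"

# Hypothetical shared helper: every consumer gets the same status check
# and JSON parsing instead of re-implementing them at each call site.
ServiceError = Class.new(StandardError)
Response = Struct.new(:status, :body)

def parse_response(response)
  unless (200..299).cover?(response.status)
    raise ServiceError, "service returned HTTP #{response.status}"
  end
  JSON.parse(response.body)
end
```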
- Force timeouts on all external requests. This is pretty simple: make sure Faraday, HTTParty (or whatever library you use) has a well-thought-out timeout constraint on every request. Each request will likely have very different consequences if it takes too long to complete. Consider both the consequences of a long request on the client and the expected real-time execution on the server. You may need to adjust this value over time to get it right.
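For example, with Ruby's standard-library Net::HTTP (Faraday and HTTParty expose equivalent options); the host name below is a placeholder:

```ruby
require "net/http"

# Placeholder host; no request is made until http.request is called.
http = Net::HTTP.new("service.example.com", 443)
http.use_ssl = true
http.open_timeout = 2  # seconds to wait when establishing the connection
http.read_timeout = 5  # seconds to wait for data once connected
```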
- Set up alerting for timeouts, 500s, and other service failures. If something bad happens to your service that makes it unreachable, or makes its responses unparsable by the consumer, have a plan in place that sends alerts to individuals via SMS, email, and/or chat so the problem gets resolved as soon as possible. Some helpful tools we use are PagerDuty, Librato, New Relic, and Bugsnag.
- Use detailed error handling and print metadata to a debug logger. When writing a service, you should have an idea of what types of errors can occur. Make sure they are classified as such, and log error information somewhere easily searchable; this makes debugging production issues much easier. Consider using a log aggregator if you don't already.
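A sketch of what that classification plus searchable metadata might look like (the error class names and log format here are hypothetical):

```ruby
require "logger"
require "stringio"

# Hypothetical error taxonomy for a service client.
class ServiceTimeoutError  < StandardError; end
class ServiceResponseError < StandardError; end

# Log one searchable key=value line per failure, so a log aggregator
# can filter by error class or request id.
def log_service_failure(logger, error, request_id:, endpoint:)
  logger.error(
    "service_call_failed class=#{error.class} endpoint=#{endpoint} " \
    "request_id=#{request_id} message=#{error.message}"
  )
end

output = StringIO.new
log_service_failure(Logger.new(output),
                    ServiceTimeoutError.new("timed out after 5s"),
                    request_id: "abc-123", endpoint: "/calendar")
```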
- Track requests with an idempotency key. You should make your API specification idempotent when possible (maybe without even needing a key). But there are times you want to ensure a double submit won't do harm; for example, when charging a customer money. When you have a service that may make several calls based on one originating request (think async), it can be helpful to trace the route of the original request with an idempotency key. This prevents the service from doing the same work more than once.
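A minimal sketch of the key check (the class and in-memory storage are hypothetical; in production the seen-keys store would live in a shared database or cache):

```ruby
# Hypothetical charger that remembers which idempotency keys it has
# already processed, so a double submit never charges twice.
class IdempotentCharger
  attr_reader :charges_performed

  def initialize
    @results = {}
    @charges_performed = 0
  end

  def charge(idempotency_key, amount_cents)
    # Replay the stored result instead of repeating the work.
    return @results[idempotency_key] if @results.key?(idempotency_key)
    @charges_performed += 1  # the real payment call would happen here
    @results[idempotency_key] = { charged_cents: amount_cents }
  end
end
```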
- Provide accurate and up-to-date documentation. Let's be honest: documentation can be a drag. It's a pain to write, and it can very easily become out-of-date; inaccurate documentation is worse than no documentation. The good news is that there are tools to help you do this without much pain. At Stitch Fix, we use a gem called rspec_api_documentation, which hooks into our acceptance tests to automatically generate documentation for us. As long as the tests pass, the documentation should be accurate.
I could go into much further detail on each of these items, but I wanted to highlight the main ideas. This is certainly not an exhaustive list of things to consider, but hopefully you'll find it helpful the next time you create a service.