It seems like all software is a service these days. Even operating systems, health equipment, and cars are continually connected and updated online. Services call services, which call other services. Magically, it works. Those of us designing, implementing, operating, and maintaining this magic struggle to keep the systems functioning through constant failures, bugs, security issues, and changing requirements. Many engineers suffer from burnout and strained relationships. Life is too unpredictable and unstable, which is unsustainable.
There’s no magic solution that makes service problems disappear, but there are techniques that provide greater predictability to workload, stability to systems, and sustainability to work. I documented some of these approaches in Switching to DevOps, but there’s a straightforward step I neglected to mention: service level agreements (SLAs). Unfortunately, that step sounds so severe, formal, and bureaucratic that many teams skip or avoid it. Certainly, SLAs can be tedious cudgels, but when written and used properly, they are natural, short, and easy, and they improve your services, customer satisfaction, and lives. If you agree to hear me out, I’ll describe a simple tool that reframes your entire view of creating and running services.
Eric Aside
For more on maintaining a sustainable workload, read Better learn life balance.
Nothing new under the sun
Service level agreements (SLAs) have been around since the 1980s. If you’re already familiar with writing and using SLAs effectively, feel free to stop reading this. If you know about SLAs but see them as oppressive and burdensome, I understand. It’s easy for process fanatics to load SLAs with so much overhead that they become anchors instead of wings. (Fanatics do the same to planning, reviews, and design.)
However, SLAs solve several problems simply and elegantly when used properly.
- SLAs right-size services from the start, saving money and effort.
- SLAs clarify expectations, avoiding conflicts and misunderstandings.
- SLAs keep inevitable failures contained and manageable.
- SLAs define metrics and alerts in advance for you and your partners.
- SLAs can be adjusted for specific situations, allowing you to live a better life.
Let’s discuss how to write and use SLAs effectively with the least amount of overhead.
Eric Aside
For more on not letting tools and processes overwhelm your productivity, read Don’t be a tool.
You want to take it from the top?
Every service has a provider and a consumer. An SLA is an agreement between the provider and the consumer about how much service is provided, how quickly it is provided, and what happens when things go wrong. That’s it. You can add all kinds of extra information to an SLA, but the big benefits you get are from specifying just those three things.
It’s tricky sometimes for a consumer to know how much service they need and how soon they need it. That’s because services are frequently chained together. Folks often guess conservatively about their needs, causing unnecessary work and expenditure. That’s unfortunate and avoidable. Instead, start at the top, typically at the website or app, and determine what it needs through user research (aka usability testing). Then, cascade those requirements down the stack. To prevent a landslide, continue to use real data, as I describe next.
Eric Aside
For more on using data to design better products, read Data-driven decisions.
You get what you need
It’s tempting for consumers to take shortcuts to determine what they need and how quickly they need it. They’ll estimate what they need by doubling or tripling their current load. They’ll estimate how soon they need it by subtracting an estimate of their own compute time from their required response time and then dividing by the number of calls they need to make. Those conservative estimates are barely related to reality and often unworkable for providers. As a result, providers push back on the unworkable estimates and just do the best they can within the deadlines they have. Sometimes that approach turns out to be good enough and sometimes it leads to a crisis, but always the SLA exercise ends up being a time-sucking, ludicrous pile of rotting trash.
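To put numbers on the response-time shortcut, here is a minimal sketch. The function names and the figures (200 ms target, 50 ms of local compute, 5 downstream calls) are hypothetical, chosen only for illustration. It assumes the simplest possible contrast: the shortcut treats every call as sequential, while a consumer that issues all calls concurrently could give each call the whole remaining budget.

```python
def sequential_budget_ms(required_ms: float, own_compute_ms: float, calls: int) -> float:
    """Per-call budget under the shortcut: subtract local compute, divide by call count."""
    return (required_ms - own_compute_ms) / calls


def concurrent_budget_ms(required_ms: float, own_compute_ms: float) -> float:
    """Per-call budget if every call is issued at once and awaited together (an idealization)."""
    return required_ms - own_compute_ms


# Hypothetical consumer: 200 ms end-to-end target, 50 ms of its own work, 5 downstream calls.
shortcut = sequential_budget_ms(200.0, 50.0, 5)
concurrent = concurrent_budget_ms(200.0, 50.0)

print(f"shortcut budget per call:   {shortcut:.0f} ms")
print(f"concurrent budget per call: {concurrent:.0f} ms")
```

Real usage patterns usually fall somewhere between those two extremes, which is why a guess in either direction is a poor substitute for measurement.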
Instead, consumers should use their current average sustained and max loads to determine their needs. To determine how fast they need a response, consumers should prototype their usage pattern using the same parallelism and asynchronous calls they’ll use in production and measure the maximum response time they can tolerate from each call. These realistic values are far more helpful to providers when they design their services and have a far higher likelihood of a successful launch.
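As a rough sketch of measuring current load, the snippet below derives an average and a peak requests-per-second figure from a list of request timestamps. The `measure_load` helper and the sample data are assumptions for illustration; in practice the timestamps would come from your own logs or telemetry over a representative period.

```python
from collections import Counter


def measure_load(request_times: list[float]) -> tuple[float, int]:
    """Return (average requests per second, busiest one-second count).

    request_times are non-negative timestamps in seconds, e.g. parsed from an access log.
    """
    if not request_times:
        return 0.0, 0
    # Bucket requests by the whole second in which they arrived.
    per_second = Counter(int(t) for t in request_times)
    window_seconds = max(per_second) - min(per_second) + 1
    average = len(request_times) / window_seconds
    peak = max(per_second.values())
    return average, peak


# Hypothetical sample: six requests across a three-second window, with a burst in second 11.
sample = [10.1, 10.7, 11.0, 11.2, 11.5, 12.9]
average_tps, peak_tps = measure_load(sample)
print(f"average: {average_tps} tps, peak: {peak_tps} tps")
```

The average stands in for sustained load and the busiest second for max load. Those are the two numbers to bring to the provider.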
Because service failures are inevitable (and occasionally planned), the last piece to agree upon in an SLA is what happens when things go wrong. Are there notifications, timeouts, throttling, retries, exceptions, and/or error codes? The right answer depends on the service and the consumer’s intended use of that service. By discussing the mechanism in advance, both the provider and consumer can save themselves substantial future grief and deploy a robust and resilient solution. If the service isn’t available all the time (24x7x365 at 99.9% uptime, the de facto standard for cloud services), then the provider and consumer should also agree on expected availability.
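Here is one way such an agreement might look on the consumer side, as a sketch only. It assumes the provider and consumer settled on timeouts plus a small number of retries, followed by an exception. The names `call_with_retries` and `ServiceUnavailable`, the retry count, and the backoff are all hypothetical; your agreement may call for something quite different.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Raised to the consumer once the agreed retry budget is exhausted."""


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    backoff_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TimeoutError with a linearly growing pause."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TimeoutError as error:
            last_error = error
            if attempt < max_attempts:
                sleep(backoff_seconds * attempt)
    raise ServiceUnavailable(f"gave up after {max_attempts} attempts") from last_error


# Simulated dependency that times out twice before answering.
attempts = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated slow dependency")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, "after", len(attempts), "attempts")
```

The point isn't this particular policy. It's that both sides know in advance which signal the consumer will see and how it will react.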
An effective SLA helps the provider and consumer design and operate a reliable service with the least effort and expense. It sets clear and reasonable expectations in advance that work for both sides. SLAs are not competitions or weapons. They are agreements that help consumers and providers succeed.
Eric Aside
External SLAs between companies often include penalties and/or compensation for failures, but even then, consumers aren’t rooting for failure. The purpose of an effective SLA remains to help consumers and providers succeed.
For more on shared success between dependent teams, read You can depend on me, Winning among friends, and We’re on the same team.
Stay alert
We’ve covered how SLAs help providers right-size their services and keep failures contained and manageable. They are also perfect for defining health checks and alerts. When a service meets its SLA, it’s healthy. Any time a service’s throughput or response time threatens to miss its SLA, the service should trigger an alert. Sure, other general health checks and alerts remain useful, but the SLA provides the most direct, actionable, and measurable means of determining a service’s state.
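A minimal sketch of an SLA-based health check follows. The `SlaLimits` type, the `health_status` function, and the 80% warning threshold are assumptions for illustration; pick whatever margin gives your team enough time to react.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaLimits:
    peak_tps: float         # highest load the provider agreed to serve
    max_response_ms: float  # slowest response the consumer can tolerate


def health_status(
    limits: SlaLimits,
    observed_tps: float,
    observed_response_ms: float,
    warn_fraction: float = 0.8,  # assumed margin: alert at 80% of either limit
) -> str:
    """Classify current measurements as 'healthy', 'alert', or 'missed'."""
    if observed_tps > limits.peak_tps or observed_response_ms > limits.max_response_ms:
        return "missed"  # measurements fall outside the agreed parameters
    if (
        observed_tps >= warn_fraction * limits.peak_tps
        or observed_response_ms >= warn_fraction * limits.max_response_ms
    ):
        return "alert"  # still within the SLA, but threatening to miss it
    return "healthy"


limits = SlaLimits(peak_tps=10_000, max_response_ms=10)
for tps, response_ms in [(1_000, 4), (1_000, 9), (1_000, 12)]:
    print(tps, response_ms, health_status(limits, tps, response_ms))
```

Because the check depends only on the agreed limits and a couple of measurements, a provider could hand the same logic to its consumers.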
SLA-based health checks and alerts are also useful for consumers. They help consumers identify if an issue is theirs or due to a dependency. Great providers not only keep their service healthy but also share their health checks and alerts with consumers.
Eric Aside
For more on great alerting systems, read Bogeyman buddy—DevOps.
We’re here to help
Simple, constructive SLAs based on real usage data are adaptable to many situations, not just day-to-day cloud service operations. Consumers and providers may wish to ramp up or down SLAs during holidays or special events. Doing so creates a stronger partnership, helps with staffing and allocation, and saves resources and stress.
SLAs can also be used for human services (e.g., help desks) and hybrid services (e.g., build and deployment systems). All the same guidance and benefits apply.
Any time a consumer or provider wishes to adjust an SLA, be sure to notify everyone impacted in advance. Doing so avoids misunderstandings and issues and also builds trust. Remember, these agreements exist to improve our lives and our products.
Eric Aside
For more on communications, read Communication breakdown.
Get a grip
Service level agreements sound imposing and heavyweight, but they shouldn’t be. They should be short, simple, and constructive. For example, “The service will support a sustained 1,000 transactions per second (tps), a peak of 10,000 tps, a max response time of 10 milliseconds, and a client-side ‘Service Unavailable’ exception if the service falls outside these parameters, all with 99.9% availability.” That’s it. Now the provider can size its service to these specifications (not bigger or smaller) and use the specific measures for its health checks and alerts. The consumer can call the service with confidence and use the same health checks and alerts to root-cause issues quickly.
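An SLA that short fits in a few lines of data. The sketch below captures the example's numbers in a hypothetical `ServiceLevelAgreement` record and works out what 99.9% availability allows in downtime. The field names and the 30-day month are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelAgreement:
    sustained_tps: int
    peak_tps: int
    max_response_ms: float
    availability_percent: float
    failure_signal: str

    def downtime_budget_minutes(self, days: int = 30) -> float:
        """Minutes of unavailability the agreement tolerates over `days` days."""
        unavailable_fraction = (100.0 - self.availability_percent) / 100.0
        return unavailable_fraction * days * 24 * 60


example = ServiceLevelAgreement(
    sustained_tps=1_000,
    peak_tps=10_000,
    max_response_ms=10.0,
    availability_percent=99.9,
    failure_signal="client-side Service Unavailable exception",
)
budget = example.downtime_budget_minutes()
print(f"downtime budget over 30 days: {budget:.1f} minutes")
```

At 99.9% availability, that works out to roughly 43 minutes of downtime in a 30-day month, a concrete number both sides can plan around.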
Simple SLAs based on real data avoid misunderstandings and use people and resources responsibly and sustainably. They can be applied to human and hybrid services as well as typical cloud services. They help teams collaborate effectively and win together. Yes, they can be abused using fake data and cumbersome, punitive processes. But a simple, clear SLA can quickly get teams on the same page working together toward shared success.
Special thanks to James Waletzky and Jason Zions for reviewing the first draft of this month’s column.
Want personalized coaching on this topic or any other challenge? Schedule a free, confidential call. I provide one-on-one career coaching with an emphasis on underrepresented, midcareer software professionals. Find out more at Ally for Onlys in Tech.