When something goes wrong with a live production service, we call it an incident and track it in an incident management system. (At Microsoft, our internal system is called “IcM.”) We track the incident and include a running archived chat session, so that when others need to join the “call” (aka “bridge”), they quickly can understand what’s been discussed and done so far to mitigate and/or resolve the problem.
Once the incident is mitigated, the archived incident data is used to better understand what happened and why. The incident data is also aggregated to spot trends and measure responsiveness. The focus of this analysis is to reduce the future recurrence and impact of similar problems. Ideally, incidents are avoided altogether or mitigated so quickly that customers don’t notice them. Because technology is constantly changing, that ideal remains beyond our grasp, but actively pursuing it keeps current and future incidents from overwhelming our teams and services.
Our greatest tool to reduce the recurrence and impact of future incidents is an incident root cause analysis (RCA). As I discussed briefly in Bring out your dead: Postmortems, incident RCAs use the five whys to uncover the deep-rooted issues. Although you could stop looking when you find the expired certificate, botched deployment, or faulty code at the root of the problem, an effective RCA keeps asking why to determine why the certificate was allowed to expire, why the deployment was botchable, and why that bad code passed our validation. A complete RCA doesn’t even stop there. It asks why about every aspect of the incident, including why we didn’t catch the problem sooner, why it took so long for the right people to get on the call, why the trouble impacted other services, and why it took so long to inform our customers. That’s more than five whys. Let’s break this down.
Some service teams do an incident RCA for every incident, no matter how small the impact may be. Other service teams only do RCAs for incidents that have a substantial customer impact, though they review aggregated data for all incidents to spot concerning trends.
What’s the matter here?
One of your services stopped working. Why? In this example, you discover that the service couldn’t authenticate. Why? Because the certificate used to connect to the authentication service expired. At this point (after two whys), you know what broke and should immediately fix it by renewing the certificate. Now the incident has been mitigated, but we still don’t know the root of the problem—that’s the goal of the incident RCA.
The weekly incident RCA meeting is typically attended by the service manager(s), the person on call the previous week, the person on call the current week, and anyone who helped with incidents during the past week. (Some teams meet daily.) You bring up each incident that’s occurred since the last meeting and continue asking why.
For the certificate incident, why did the certificate have an expiration date? We set expiration dates on certificates in case they’re stolen or compromised, and this certificate expired before it was renewed. Why? Because the person assigned to renew it was on vacation. Why didn’t someone else renew it or configure it to be auto-renewed? Hey, good question.
Once you get to the root of the problem, you create work items to repair the problem (aka “repair items”). In this case, you’ve got two clear repair items: Set up all certificates (not just the one that expired) to auto-renew in advance, and have certificate scanners send notifications to the entire team if any certificate fails to auto-renew. After you complete these two repair items, the incident is resolved, and no incident like it should ever recur—you’ve fixed this class of problem forever! Congratulations—but you’re far from done.
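Those two repair items can be sketched in a few lines. Here's a minimal Python sketch of a scanner that flags certificates approaching expiration and alerts the whole team; `Certificate`, the 30-day window, and the `notify` hook are illustrative stand-ins, not a real certificate API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RENEWAL_WINDOW = timedelta(days=30)  # renew well before expiration

@dataclass
class Certificate:
    name: str
    expires: datetime

def certs_needing_renewal(certs, now):
    """Return every certificate that expires within the renewal window."""
    return [c for c in certs if c.expires - now <= RENEWAL_WINDOW]

def scan_and_notify(certs, now, notify):
    """Alert the entire team about each at-risk certificate."""
    at_risk = certs_needing_renewal(certs, now)
    for cert in at_risk:
        notify(f"Certificate {cert.name} expires {cert.expires:%Y-%m-%d}; renew it now")
    return at_risk

# Example scan: one certificate is inside the 30-day window, one is not.
now = datetime(2024, 1, 1)
certs = [Certificate("auth-service", now + timedelta(days=10)),
         Certificate("billing-api", now + timedelta(days=90))]
at_risk = scan_and_notify(certs, now, notify=print)
```

In a real deployment, `notify` would page the team rather than print, and the scan would run on a schedule so a failed auto-renewal can't slip by silently.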
While many repair items take just a few days or weeks to complete, some take months or even years. Resolve the incident after the short-term repair items are completed, but keep tracking those long-term items and prioritize them in your backlog.
In the beginning
Let’s start from the beginning with this example. The certificate used to connect to the authentication service expired. It probably expired at midnight. That’s the start of the incident.
When did anyone or anything notice the problem? Since we already know that nothing was checking for certificate expiration, and given the late hour, it’s quite possible that the first person to notice was a customer trying to authenticate—bad news, since we should always know about problems before customers do. Say the customer was an early riser, logging on at 5:00 a.m. (granted, an overseas customer would have detected it earlier). That means the time to detect (TTD) was five hours. That’s awful. Time to ask why again.
Why didn’t we detect the expired certificate? Okay, that’s already a repair item. Why didn’t we detect the broken authentication? We should add a third repair item to monitor our connection to the authentication service (and every other service we depend upon) and automatically create an incident if the monitor reports a sustained broken connection. Doing so would drop TTD from hours to minutes.
Now it’s 5:00 a.m. Customer support receives the customer complaint and raises the issue as an incident. A call goes out to your team’s on-call engineer, who sleeps through it. The on-call backup responds to her notification at 5:15 a.m. Time to acknowledge (TTA) is 15 minutes—not horrible, but could be better.
Why did the on-call engineer sleep through the initial notification? He left his phone downstairs instead of by his bed—lesson learned. Why not have automation respond to the alert, with sub-second TTA and auto-resolution? The early auto-renewal repair item is a better solution in this case since it avoids the problem altogether, but for many issues, using automation is a fantastic way to troubleshoot and resolve problems without waking people up.
Cause I need to know
The on-call backup logs into her system and starts debugging the authentication issue, even though that’s not her area. The authentication service appears to be working for other sites. The network connection is there, and the authentication service is reachable. She checks recent changes to her team’s service—a few are related to authentication. She’s going to need to call for help.
In the meantime, more support calls are triggering more alerts. After 45 minutes of working the problem (6:00 a.m.), the on-call backup declares an outage and notifies customers. That means the time to notify (TTN) is six hours from the start of the incident. That’s awful!
Why did it take six hours to notify customers? Because we weren’t monitoring the authentication service (our third repair item), and our support and on-call people didn’t declare an outage as soon as customers were impacted. We should add a fourth repair item to train all support and on-call folks to create an outage and notify customers as soon as the issue impacts them. The repair item should also include updating our associated documentation. Note that for many repair items, no code change is needed.
I’ll give you what you need
The on-call backup contacts the latest team member who made a change related to authentication. That person gets online within 15 minutes, can’t imagine why her change would have caused the problem, checks the logs, notices the expired certificate, and fixes it 15 minutes after joining the call. Time to mitigate (TTM) is 6½ hours from the start of the incident. (The first two repair items address the need for the mitigation by auto-renewing certificates and notifying everyone if renewal fails.)
Notice how quickly the problem was fixed once the right engineer was engaged—just 15 minutes out of a 6½ hour incident. That’s why we also track time to engage (TTE): how long it takes for a person capable of fixing the problem to engage, which in this case was 6¼ hours from the start of the incident.
Why wasn’t the authentication area owner called sooner? The third repair item would have raised the incident immediately, but we’d still have gone to the on-call engineer first. As I discuss in Bogeyman buddy—DevOps, sophisticated teams divide their services into modules, define owners and backups for each module, and then raise incidents directly to the module owners (a long-term fifth repair item). These teams no longer need on-call rotations, their TTE and TTM drop substantially, and most importantly, engineers who write robust and resilient services are rewarded with undisturbed rest.
Another way to minimize the on-call engineer’s burden is to have a dedicated operations team (vendors or full-timers). However, modern services have moved instead to DevOps, in part because operations teams insulate dev teams from the pain of incidents. As a result, dev teams don’t create and complete the repair items, and their services build up technical debt. That debt makes the services fragile and costly to maintain. DevOps keeps the engineering team accountable for quality and reliability.
I didn’t mean to hurt you
We’ve tackled TTD, TTA, TTN, TTE, and TTM. We’ve created five repair items that should help keep current and future incidents from overwhelming our team and service. We’re done, right? Almost. There’s one more question to ask.
Why does our service depend so strongly on the authentication service? The authentication service we use could have failed for any number of reasons. Some parts of our service require authentication, but not all of them. Do we really want our service to fail every time authentication fails? Our sixth repair item is to make sure that the parts of our service that require authentication report a customer-friendly error message when authentication repeatedly fails and that the rest of our service keeps working. We should repeat this exercise for all our subsystems, ensuring our service is robust to outside issues.
All services should continue to operate, perhaps in a diminished capacity, whenever services they depend upon fail. Chaos engineering is dedicated to rooting out these issues, as I discuss in Get real.
Things go wrong in life, and cloud services are no different. When some aspect of your service fails, you need to create, track, and mitigate the incident. Once that’s done, it’s time for an incident RCA.
Ask why your service failed, and keep asking why until you get down to the root cause of the issue. At that point, create a handful of repair items designed to detect and avoid any recurrence of the issue or others like it. Next, ask why it took so long to detect the issue, to notify customers about the impact, and to have the right engineer engage on the issue. Add more repair items designed to ensure faster detection, notification, and engagement. Finally, ask why the failure caused any customer impact at all. Add more repair items designed to make your service robust to the failure of any subsystem.
Our services will never be perfect. However, incident RCAs and the repair items they generate can help us avoid the recurrence of whole classes of issues and minimize their impact on customers. These steps also reduce the impact of incidents on our teams. Spending less time on incidents creates more time for customers to enjoy our services and more time for us to deliver even greater value and positive impact.