When I recently wrote about the frightening yet fantastic world of DevOps, I discussed how escalations reach the dev team, but I skipped over when the dev team does the escalating. As you move from shipping annually to shipping weekly and daily, you depend on engineering systems and services to work 24×7. However, engineering systems often don’t get the love they deserve, and may not be resilient. As a result, escalations become part of your life.
Ideally, we would transform our engineering systems into production-quality services (at least three nines, meaning inoperative for fewer than nine total hours per year). By engineering systems, I mean source control, build, test automation, networking, code signing, release, ingestion, configuration, deployment, and monitoring. While these systems are constantly improving, and are a growing focus at Microsoft, many are far from three nines. That means the right solution is years away, so knowing how to handle escalations well is a skill well worth having.
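The downtime budget behind a "nines" target is simple arithmetic, so here is a minimal sketch of it. The function name and the choice of a 365-day year are my own assumptions for illustration, not anything a particular service team publishes.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 means 99.9%)."""
    unavailability = 10 ** -nines  # 3 nines -> 0.001 of the year may be down
    return HOURS_PER_YEAR * unavailability


# Compare a few availability targets side by side.
for n in (2, 3, 4):
    print(f"{n} nines: {allowed_downtime_hours(n):.2f} hours of downtime per year")
```

Three nines comes out to a little under nine hours a year, while two nines allows more than three and a half days, which is roughly the gap between a production-quality service and one that generates regular escalations.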
How do you handle escalations well? You could run around and yell at everyone, but as much as that makes you feel better, it’s juvenile and counterproductive. A more effective approach is to establish relationships with your service teams, know the right things to say and the right data to collect, escalate to the right people, help them resolve the issue, and then constructively drive long-term fixes. Need more details? I’m at your service.
Love thy neighbor
DevOps teams need to know the folks who run all the systems and services they depend upon, including the production systems. If you wait until there’s a crisis, you won’t know who to call, and they won’t know who is calling.
While it’s wonderful for all members of your team and the service teams to be acquainted, at a minimum the leaders should know each other. That way, should an issue get escalated, there is already a basis for trust, cooperation, and understanding.
To help with the “getting to know you” process, DevOps teams should make a table of services and their associated escalation contacts. However, even when you do know who to call, you don’t always know what to say.
The secret knock
Many service teams, like those in Microsoft IT, have a specific escalation procedure. You email their escalation alias or fill out an online form. Then you receive an email from the service team with a ticket number that uniquely identifies your issue and information about priority, response time, and how to escalate further.
Typically, your priority starts fairly low, like priority 3 with a response time of 24 hours. For a serious blocking issue, you’d like to escalate further and get a faster response, but replying, “Are you kidding me? 24 hours? You stink!” doesn’t work. Instead, you need to say the secret words, “work stoppage” and/or “customer-facing release.”
Priority 1 with a 30-minute response time is usually reserved for issues with major business impact. The two issues that qualify most often are work stoppages of entire groups and release blockers for customer releases. If you really are dealing with a work stoppage or blocked release (lying only works once), reply to the ticket-number response saying, “This is a priority 1 issue. My entire organization is suffering a work stoppage,” and/or “Our customer-facing release is blocked. Please escalate this issue to priority 1 immediately.”
Sometimes you can also change the priority directly on the ticketing site (instructions are in the service’s initial response email). Sometimes you need to reach out to the service leader. For each team in your service table, note its escalation procedure. Also indicate what information the service team might need, like job numbers, server names, and IP addresses. The better you know how to escalate and what data to provide, the faster your issue will be resolved.
Sometimes it’s your team that makes the mistake. Check out “I messed up” for what to do.
Wait for the Wolf
Once you’ve provided the right data and escalated to the right priority, you’ll start getting communications from the service team’s experts. Depending on the situation, they may start a Lync call, tracking the issue in real time.
The initial people on the call or mail thread will be broken customers, like yourself, and tier 2 support, who know enough to ask the right questions, but not always enough to fix the problem. You want someone who can fix the problem. Ask, “Who are we getting who can fix the problem? Have they been called? Are they on their way?”
You can tell when someone capable of fixing the problem has arrived by how they speak. People who can’t fix the problem say, “It could be this or that. I’m not sure.” People who can fix the problem say, “It could be this or that. Try this. If it doesn’t work, we know it’s that.” They are confident and take control. These people are precious. Follow their instructions, let them work, and send them your thanks when the problem is resolved.
Stay focused at all times
You’ll often have additional people join the call or mail thread while you’re waiting for a resolution. These new folks will mean well, but often will distract people by speculating on causes or asking questions that were answered earlier.
If you want a fast resolution, have a summary ready to copy/paste when each new person arrives, reiterate that you’re in a work stoppage and/or blocked release, and keep the whole crowd focused on getting and keeping a capable person working the problem. That’s the only way these issues get resolved.
Do the right thing
Once the problem is fixed, you want it to never come back. Contact the service lead, and ask to receive the root cause analysis (RCA). If the RCA is incomplete or faulty, constructively point out the weaknesses and push for a more comprehensive RCA.
With a strong RCA in hand, look over the initiatives laid out to resolve the long-term issues. If there are none, suggest some. If the ones listed aren’t sufficient, constructively suggest alternatives. Keep in mind that service teams, and dependencies in general, often have constraints on what they can and can’t fix and what level of service they can provide (another great topic for your team and the service team leaders to discuss).
If money or resources are an issue, seriously consider providing them. If you doubt it’s worth it, calculate the people-hours and money you lost during the incident and reprioritize. Chances are that putting the long-term solution in place is well worth the cost.
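The people-hours calculation is quick to run. This is a rough sketch under stated assumptions: the headcount, duration, hourly rate, incident frequency, and fix cost below are all made-up numbers, and the one-year payback window is my own simplification.

```python
def incident_cost(people_blocked: int, hours_blocked: float, hourly_cost: float) -> float:
    """People-hours lost, converted to money at a loaded hourly rate."""
    people_hours = people_blocked * hours_blocked
    return people_hours * hourly_cost


def fix_is_worth_it(cost_per_incident: float, incidents_per_year: float, fix_cost: float) -> bool:
    """True when one year of expected incident losses meets or exceeds the fix cost."""
    return cost_per_incident * incidents_per_year >= fix_cost


# Hypothetical numbers: 50 engineers blocked for 6 hours at $100 per hour.
cost = incident_cost(people_blocked=50, hours_blocked=6, hourly_cost=100.0)
print(f"One incident: ${cost:,.0f}")
print("Fund the fix?", fix_is_worth_it(cost, incidents_per_year=4, fix_cost=80_000))
```

Plug in your own figures from the last incident. Even a conservative estimate usually makes the conversation with the service team, and with your own management, far more concrete.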
If you believe it’s necessary, but don’t feel comfortable committing money or resources to fix a service’s issues, then ask your dev manager to do it. If your dev manager doesn’t feel comfortable, he should go back to being a lead. It’s a dev manager’s job to understand what his business needs and what it costs to run. (More on this in “On budget.”) Either a working service is worth the commitment, the issue is truly rare, or your team should drop the service. If you think it’s someone else’s problem to fix, remember whose problem it really was.
If you offer money and resources to a service to fix one of its problems, that’s typically enough for its members to successfully argue that their own management should provide that support. You shouldn’t make the offer without being committed to follow through, but it’s often that very commitment which makes a long-term solution seem worthwhile.
Bad things happen to good people
We’d love to live in a world where everything worked all the time. However, that’s not reality. You will have problems with every service at some point. Come up with a list of services your team uses and create a table. Include columns for the service’s team leader, its escalation contact, its procedure for raising priority, and the data that it will need. Meet with service leaders, help them understand your business and constraints, work to understand theirs, and then fill out your service table with their help.
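A service table can live in a spreadsheet or wiki, but a small structured version makes the columns explicit. The following is only a sketch: the service names, contacts, and procedures are placeholders I invented, and the `ServiceEntry` and `escalation_checklist` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One row of the service table, with the columns described above."""
    service: str
    team_leader: str
    escalation_contact: str
    raise_priority_procedure: str
    data_needed: list[str] = field(default_factory=list)


# Placeholder rows; fill these in with each service leader's help.
SERVICE_TABLE = {
    entry.service: entry
    for entry in [
        ServiceEntry(
            service="build",
            team_leader="Build Lead (placeholder)",
            escalation_contact="build-escalation@example.com",
            raise_priority_procedure="Reply to the ticket email and state the work stoppage",
            data_needed=["job number", "server name"],
        ),
        ServiceEntry(
            service="deployment",
            team_leader="Deployment Lead (placeholder)",
            escalation_contact="deploy-escalation@example.com",
            raise_priority_procedure="Change the priority on the ticketing site",
            data_needed=["server name", "IP address"],
        ),
    ]
}


def escalation_checklist(service: str) -> str:
    """Everything to have in hand before escalating an issue with a service."""
    entry = SERVICE_TABLE[service]
    lines = [
        f"Service: {entry.service}",
        f"Team leader: {entry.team_leader}",
        f"Escalation contact: {entry.escalation_contact}",
        f"To raise priority: {entry.raise_priority_procedure}",
        "Data to include: " + ", ".join(entry.data_needed),
    ]
    return "\n".join(lines)


print(escalation_checklist("build"))
```

The format matters far less than having every column filled in before the crisis, while the service leaders still have time to talk.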
When a serious issue occurs, use your service table, the right words, and the right data to escalate quickly. Push to get a capable person engaged, and then help that person succeed, while preventing newcomers from derailing the process. Once the issue is resolved, push for a long-term solution that truly addresses the root cause.
Yes, escalation is part of life in DevOps. However, you can minimize the impact, maximize your uptime, and reduce future incidents by handling issues with grace and confidence. Don’t wallow in the problem. Be part of the solution.