Tell average naïve developers that their team is embracing DevOps, and panic will fill their eyes. Their hearts will race, their muscles will tense, and their resumes will reinvigorate. DevOps is the bogeyman to unfamiliar developers. The thought of being on call 24 hours a day, 365.25 days a year, to support the crappy code they wrote is enough to give developers the shakes.
I’ve been there. When my team first switched to DevOps, I was filled with visions of my son’s championship baseball game being interrupted, of being woken up at all hours of the night, and of sacrificing all the great food, gaming, sports, hobbies, and social events that life has to offer. I was terrified. I was also an ignorant imbecile.
Like a sappy movie bogeyman, DevOps actually turns out to be your best friend. Sure, at first it’s scary and mysterious. Then it’s obnoxious and messes with your friends. But in the end, you figure each other out, come to terms, and become best buddies. Soon, you wonder how you ever worked any other way. Think I’m delusional? I think you’re an ignorant imbecile. Read on, and decide for yourself.
Who’s the new guy?
On a DevOps team, development team members work directly with operations. When customer-impacting failures occur that operations can’t resolve, the development team is responsible for fixing its code, day or night, every day of the year. That’s right, developers like you and me.
The first people to see customer-impacting failures are typically tier 1 operations. These folks generally work in shifts, watching for alerts of failures, entering them into a tracking system, and escalating them to tier 2 operations if the failures appear to be persistent and serious.
For each tracked issue, tier 2 operations follow the troubleshooting guide written by the development team. The guide provides step-by-step instructions to identify and resolve common, correctable failures (everything from “try rebooting” to “examine the expiration date on the certificate and follow the following procedure if it’s expired”). If the troubleshooting guide doesn’t identify and resolve the issue, tier 2 escalates the issue to tier 3. That’s you (and the folks responsible for dragging you out of bed).
If the alerting system is smart enough to ignore transient failures and spot patterns, then tiers 1 and 2 can be combined. If the alerting system also has automated recovery, then tier 2 doesn’t require as many folks. If your code is robust and your troubleshooting guide is simple and comprehensive, then very few issues make it to tier 3, which means you get to sleep.
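To make the escalation path concrete, here is a minimal sketch of how an alert might flow through the tiers. Everything in it is an assumption made for illustration: the `Alert` type, the `triage` function, the `TRANSIENT_THRESHOLD` value, and the error codes are hypothetical, not part of any real alerting product.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical threshold: failures seen in the current window before an
# alert counts as persistent rather than transient.
TRANSIENT_THRESHOLD = 3


@dataclass
class Alert:
    error_code: str
    occurrences: int  # times the failure fired in the current window


def triage(alert: Alert,
           auto_recovery: Dict[str, Callable[[], bool]],
           guide: Dict[str, str]) -> str:
    """Return a description of which tier ends up handling the alert."""
    # Tier 1's job: ignore transient failures, pass on persistent ones.
    if alert.occurrences < TRANSIENT_THRESHOLD:
        return "ignored: transient"
    # Automated recovery, when it works, keeps the issue away from people.
    recover = auto_recovery.get(alert.error_code)
    if recover is not None and recover():
        return "resolved: automated recovery"
    # Tier 2 follows the troubleshooting guide written by the dev team.
    step = guide.get(alert.error_code)
    if step is not None:
        return f"tier 2: {step}"
    # Nothing else worked, so the developer gets the call.
    return "tier 3: page the developer"
```

The more entries the guide and the recovery table hold, the fewer alerts fall through to the last line, which is the one that wakes you up.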
This might sting a bit
Shortly after your team switches to DevOps, you realize that service issues are constantly randomizing your team. (I’m assuming a typical dev team of 5 to 50 people.) A little soul searching leads to the following insights:
- Most of the team’s time is spent figuring out what the problem is. So, your team adds more instrumentation—you know, the instrumentation you intended to put there all along, but didn’t because you were too busy.
- After the team fixes an initial barrage of bugs that should have been caught long ago, issues settle down to just a handful each week. Service problems are still randomizing, but there isn’t enough volume to keep the whole team busy.
- Anytime there is an issue, the root cause is addressed quickly, because no one likes to be called in, ever. Then the troubleshooting guide is updated, just in case the issue recurs.
- Soon you choose one person per week to be on call for issues. You set up a schedule to spread out the burden, and everyone does his time on duty. Life becomes manageable again.
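The weekly rotation in the last item can be as simple as a modulo over the team list. This is only a sketch under assumed names (`on_call_for`, `rotation_start`); a real schedule would also need swaps, vacations, and holidays.

```python
from datetime import date
from typing import List


def on_call_for(day: date, team: List[str], rotation_start: date) -> str:
    """Return who is on call on `day`, rotating one person per 7-day week."""
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]
```

For a three-person team whose rotation starts on a Monday, the first person covers that week, the second person covers the next, and the cycle repeats every three weeks.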
Notice that DevOps may mess up your life and personal relationships initially, but only because it forces you to do work you should have done in the first place—tough love.
If management expectations of speed and shortcuts kept you from doing the right things before, DevOps puts that shortsightedness up against the harsh reality of unhappy customers. There’s no shortcut to quality. Once the old technical debt is paid off, life returns to normal, only your development team is now working the right way.
I write more about paying off technical debt in Debt and investment.
Getting so much better all the time
Some teams stay with the weekly on-call schedule for years, but sophisticated teams take their insights a level deeper.
- On-call folks realize the source of problems is often the base services that their code depends upon. Modelling their service structure in a System Center Operations Manager (SCOM) management pack encodes dependencies into the alerting system. Now when a base service goes down, the base service team’s on-call person gets called, not yours.
- Even when the issue is with his team’s code, the on-call person often has to locate the problem module and wake up its guilty developer. By hooking health checks for modules into the SCOM management pack, the team makes alerts go straight to the guilty party. On-call people are no longer necessary. Only the folks who write buggy code, shoddy instrumentation, and poor troubleshooting instructions get called frequently.
- Every developer can now choose between being woken up or writing solid, self-healing code, with great instrumentation, troubleshooting, and monitoring. Soon, not only is the on-call schedule gone, but everyone gets back their lives with only rare interruption.
Again, notice that DevOps drives the development team to behave the way it should—writing great code with high availability, comprehensive instrumentation and troubleshooting, and reliable monitoring. Life is good for your team and your customers!
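The dependency-aware routing described in the list above can be sketched in a few lines. To be clear, this is not the SCOM management pack format or API; it is a generic illustration with hypothetical names (`find_root_cause`, `owner_to_page`) that assumes you have a dependency map, a set of components currently failing their health checks, and an owner for each component.

```python
from typing import Dict, List, Set


def find_root_cause(service: str,
                    depends_on: Dict[str, List[str]],
                    unhealthy: Set[str]) -> str:
    """Follow unhealthy dependencies down from the alerting service.

    Stops at a component none of whose (unvisited) dependencies are unhealthy.
    """
    seen: Set[str] = set()
    current = service
    while True:
        seen.add(current)  # guards against dependency cycles
        failing = [d for d in depends_on.get(current, [])
                   if d in unhealthy and d not in seen]
        if not failing:
            return current
        current = failing[0]


def owner_to_page(service: str,
                  depends_on: Dict[str, List[str]],
                  unhealthy: Set[str],
                  owners: Dict[str, str]) -> str:
    """Return the owner of the likely root cause, not of the symptom."""
    return owners[find_root_cause(service, depends_on, unhealthy)]
```

The same idea works at module granularity: if the keys are modules rather than services, the page goes to the developer who owns the failing module instead of a general on-call person.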
Instrumentation and troubleshooting instructions go together like hand and glove. The instrumentation provides the error codes and context that operations uses to search the troubleshooting guide. Inadequate instrumentation makes the guide useless. An unclear guide leaves the issue unresolved and soon escalated to you, the developer.
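Here is one way the pairing might look in practice: the code reports failures as structured events with an error code and context, and the guide is keyed by that same code. The names (`report_failure`, `lookup_guide`, `GUIDE`) and the `CERT_EXPIRED` code are assumptions made for this sketch; only Python's standard `logging` and `json` modules are real.

```python
import json
import logging
from typing import Dict, Optional

logger = logging.getLogger("service")

# Hypothetical troubleshooting guide, keyed by the error codes the code emits.
GUIDE: Dict[str, str] = {
    "CERT_EXPIRED": "Examine the certificate expiration date and follow the renewal procedure.",
}


def report_failure(error_code: str, **context: str) -> str:
    """Log a structured failure event and return it as a JSON string."""
    event = json.dumps({"error_code": error_code, "context": context},
                       sort_keys=True)
    logger.error(event)
    return event


def lookup_guide(event: str, guide: Dict[str, str]) -> Optional[str]:
    """Find the guide entry for an event, or None if the guide has a gap."""
    return guide.get(json.loads(event)["error_code"])
```

A `None` from `lookup_guide` is exactly the case that gets escalated to the developer, so each one is a hint about which guide entry to write next.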
It ain’t me
If your team doesn’t ship services, you might think DevOps is someone else’s problem. Think again. Not only will everyone soon be moving to a services model (even apps), but making development responsible for the quality of the main branch is completely analogous to DevOps.
Reverse integrating your team’s branch to the main branch is just like releasing an app or service to production. The main branch build runs at night, just like production. If your changes break the build or product functionality, you make hordes of people angry, just like a service break. The way to determine what’s wrong quickly is through great instrumentation. The way to alert yourself and your fellow team members to issues early is through build monitoring, more commonly known as unit and acceptance tests. Owning the quality of the main branch is a great introduction to DevOps.
So happy together
We’re all headed toward a DevOps world, and it’s a wonderful place to be.
In a DevOps world, no one gets away with writing crappy, untested, poorly instrumented code. Lazy developers who take shortcuts must constantly stay up late at night, suffering the consequences until they wise up.
In a DevOps world, strong developers enjoy their free time with family and friends. They know issues with their code will be rare, because their code is well-tested and designed to be resilient to failure. In the unusual event that a problem does arise, they know the time to fix it will be short due to their comprehensive instrumentation and troubleshooting instructions (automated or otherwise).
In a DevOps world, customers are delighted with high-quality software that rarely fails—not because the engineers got any smarter, but because they finally started working the right way.
It’s time to do what we should have been doing all along—writing great code, great instrumentation, and great tests. It’s time to embrace DevOps.
The title is catchy … but you lost me somewhere. The relation between DevOps and development, for example. Is this a classic case of not following Agile principles?
Yet another pie-in-the-sky-with-flying-unicorns Management Fad. The VAST majority (and I mean enough to keep a single rotating developer working FULL-TIME on our team) of the "incidents" in our product that our Dev-Ops developer-on-call must handle are editorially-related (misspellings in an article, incorrect formatting, accidentally-deleted article, etc.); partner-related (partner-served image failed to load or a brief 3rd-party service interruption); or, in a significant number of cases, even TEST-related (sometimes the tests are just "incorrect", or they fail to take into account timezone differences between the team writing the test and the monitoring tools, or they don't robustly handle network hiccups, etc.). It is INCREDIBLY RARE that any of our customer-impacting issues are directly due to code bugs. The myth that "crappy code" causes the majority of website issues, and golly-gee, all would be right with the world if only Devs felt the pain, is just that: A MYTH. How much money are we wasting paying highly skilled DEVELOPERS to track down trivial junk like this? Our management is getting rid of the "tier-1/tier 2" levels you mention and going straight to dev for all issues. Mind-bogglingly stupid.
Dear Eric, your posts are a great encouragement for joining your team. You seem to live in a different, better Microsoft.
You mean, in your team do service alerts really go to Ops (or someone working shifts) to try something? They don't go straight to developers' phones for immediate investigation, with no one else looking or trying anything?
Also, is your team mostly supporting what your team wrote? From what I've seen, those "folks who write buggy code, shoddy instrumentation, and poor troubleshooting instructions" tend to be geniuses at shifting responsibilities and getting someone else to support what they wrote.
I guess "How is on-call handled in your team?" will become a necessary question to ask when interviewing for a position.
And compensation adjustments for increased responsibility are?
Industry associates responsibilities with pay. Adding ops to the plate should warrant an increase. Don't forget combined engineering. PM is next; it's called growth hacking. Hmm, now if I could only incorporate LCA or HR or finance, even Mr. Burns would be happy!
Take ownership and change your culture instead of being the victim. If this seems like an attack, chill, it's not… Your people skills need some softening… Inspire through action and conversation. You can't change douchebags, so why bother, but you can drive towards mutually beneficial outcomes… After all, it is about the customer. Fix your alerts and more importantly design for failure…