I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it.
From a technical perspective, there is some sense to the strategy of allowing the crash to complete and get reported. It’s like the logic behind asserts—the moment you realize you are in a bad state, capture that state and abort. That way, when you are debugging later you’ll be as close as possible to the cause of the problem. If you don’t abort immediately, it’s often impossible to reconstruct the state and identify what went wrong. That’s why asserts are good, right? So, crashing is sensible, right?
Oh please. Asserts and crashing are so 1990s. If you’re still thinking that way, you need to shut off your Walkman and join the twenty-first century, unless you write software just for yourself and your old school buddies. These days, software isn’t expected to run only until its programmer gets tired. It’s expected to run and keep running. Period.
Struggle against reality
Hold on, an old school developer, I’ll call him “Axl Rose,” wants to inject “reality” into the discussion. “Look,” says Axl, “you can’t just wish bad machine states away, and you can’t fix every bug no matter how late you party.” You’re right, Axl. While we need to design, test, and code our products and services as error-free as possible, there will always be bugs. What we in the new century have realized is that for many issues it’s not the bugs that are the problem—it’s how we respond to those bugs that matters.
Axl Rose responds to bugs by capturing data about them in hopes of identifying the cause. Enlightened engineers respond to bugs by expecting them, logging them, and making their software resilient to failure. Sure, we still want to fix the bugs we log because failures are costly to performance and impact the customer experience. However, cars, TVs, and networking fail all the time. They are just designed to be resilient to those failures so that crashes are rare.
Perhaps be less assertive
“But asserts are still good, right? Everyone says so,” says Axl. No. Asserts as they are implemented today are evil. They are evil. I mean it, evil. They cause programs to be fragile instead of resilient. They perpetuate the mindset that you respond to failure by giving up instead of rolling back and starting over.
We need to change how asserts act. Instead of aborting, asserts should log problems and then trigger a recovery. I repeat—keep the asserts, but change how they act. You still want asserts to detect failures early. What’s even more important is how you respond to those failures, including the ones that slip through.
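To make that concrete, here is a minimal Python sketch of an assert that logs and unwinds to a recovery point instead of aborting. The names `check`, `RecoverableFailure`, and `run_action` are hypothetical, made up for this illustration. Treat it as one possible shape under those assumptions, not a prescribed implementation.

```python
import logging

log = logging.getLogger("resilience")


class RecoverableFailure(Exception):
    """Raised by check() in place of aborting the process (hypothetical name)."""


def check(condition, message):
    """Assert-like check: detect the bad state early, log it, then unwind."""
    if not condition:
        log.error("check failed: %s", message)
        raise RecoverableFailure(message)


def run_action(action, recover):
    """Run one action; if a check fails inside it, roll back and start over once."""
    try:
        return action()
    except RecoverableFailure:
        recover()        # roll back whatever the action left half-done
        return action()  # start over from a clean state
```

The check still fires at the moment the bad state is detected, so the log entry lands as close to the cause as an abort would. The difference is that the caller decides what happens next.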
If at first you don’t succeed
So, how do you respond appropriately to failure? Well, how do you? I mean, in real life, how do you respond to failure? Do you give up and walk away? I doubt you made it through the Microsoft interview process if that was your attitude.
When you experience failure, you start over and try again. Ideally, you take notes about what went wrong and analyze them to improve, but usually that comes later. In the moment, you simply dust yourself off and give it another go.
For Web services, the approach is called the five Rs—retry, restart, reboot, reimage, and replace. Let’s break them down:
§ Retry. First off, you try the failed action again. Often something just goofed the first time and it will work the second time.
§ Restart. If retrying doesn’t work, often restarting does. For services, this often means rolling back and restarting a transaction; or unloading a DLL, reloading it, and performing the action again the way Internet Information Services (IIS) does.
§ Reboot. If restarting doesn’t work, do what a user would do, and reboot the machine.
§ Reimage. If rebooting doesn’t work, do what support would do, and reimage the application or entire box.
§ Replace. If reimaging doesn’t do the trick, it’s time to get a new device.
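The ladder above can be sketched as a loop that tries the cheapest remedy first and escalates only when the action keeps failing. This is a rough Python illustration. The function `run_with_five_rs` and the shape of the `remedies` list are assumptions for the example, and the real restart, reboot, and reimage steps would be whatever your service or application provides.

```python
import logging

log = logging.getLogger("resilience")


def run_with_five_rs(action, remedies):
    """Try an action; on failure apply each remedy in order and try again.

    `remedies` is an ordered list of (name, callable) pairs, cheapest first,
    for example retry, restart, reboot, reimage. Running out means "replace".
    """
    try:
        return action()
    except Exception as first_error:
        log.warning("action failed: %r", first_error)
        last_error = first_error

    for name, remedy in remedies:
        log.warning("escalating to %s", name)
        try:
            remedy()
            return action()
        except Exception as error:
            log.warning("%s did not help: %r", name, error)
            last_error = error

    # Nothing worked: surface the last failure so the caller can replace the unit.
    raise last_error
```

A plain retry is just a remedy that does nothing before the action runs again, for example `("retry", lambda: None)`.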
Welcome to the jungle
Much of our software doesn’t run as a service in a datacenter, and contrary to what Google might have you believe, customers don’t want all software to depend on a service. For client software, the five Rs might seem irrelevant to you. Ah, to be so naïve and dismissive.
The five Rs apply just as well to client and application software on a PC and a phone. The key most engineers miss is defining the action, the scope of what gets retried or restarted.
On the Web it’s easier to identify—the action is usually a transaction to a database or a GET or POST to a page. For client and application software, you need to think more about what action the user or subsystem is attempting.
Well-designed software will have custom error handling at the end of each action, just like I talked about in my column “A tragedy of error handling.” Having custom error handling after actions makes applying the five Rs much simpler.
Unfortunately, lots of throwback engineers, like Axl Rose, use a Routine for Error Central Handling (RECH) instead, as I described in the same column. If your code looks like Axl’s, you’ve got some work to do to separate out the actions, but it’s worth it if a few actions harbor most crashes and you aren’t able to fix the root cause.
Just like starting over
Let’s check out some examples of applying the five Rs to client and application software:
§ Retry. PCs and devices are a bit more predictable than Web services, so failed operations will likely fail again. However, retrying works for issues that fail sporadically, like network connectivity or data contention. So, when saving a file, rather than blocking for what seems like an eternity and then failing, try blocking for a short timeout and then try again—a better result for the same time or less. Doing so asynchronously unblocks the user entirely and is even better, but it might be tricky.
§ Restart. What can you restart at the client level? How about device drivers, database connections, OLE objects, DLL loads, network connections, worker threads, dialogs, services, and resource handles? Of course, blindly restarting the components you depend upon is silly. You have to consider the kind of failure, and you need to restart the full action to ensure that you don’t confuse state. Yes, it’s not trivial. What kills me is that as a sophisticated user, restarting components is exactly what I do to fix half the problems I encounter. Why can’t the code do the same? Why is the code so inept? Wait for it, the answer will come to you.
§ Reboot. If restarting components doesn’t work or isn’t possible due to a serious failure, you need to restart the client or application itself—a reboot. Most of the Office applications do this automatically now. They even recover most of their state as a bonus. There are some phone and game applications that purposefully freeze the screen and reboot the application or device in order to recover (works only for fast reboots).
§ Reimage. If rebooting the application doesn’t work, what does product support tell you to do? Reinstall the software. Yes, this is an extreme measure, but these days installs and repairs are entirely programmable for most applications, often at a component level. You’ll likely need to involve the user and might even have to check online for a fix. But if you’re expecting the user to do it, then you should do it.
§ Replace. This is where we lose. If our software fails to correct the problem, the customer has few choices left. These days, with competitors aching to steal away our business, let’s hope we’ve tried all the other options first.
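For the Retry case, the file-save idea might look something like the Python sketch below: several short, bounded attempts instead of one long block. The `save` callable and its `timeout` argument are hypothetical stand-ins for whatever your save routine really is, and the default values are arbitrary.

```python
import time


def save_with_retry(save, attempts=3, timeout=2.0, pause=0.1):
    """Call save(timeout) up to `attempts` times, retrying sporadic failures.

    `save` is a hypothetical callable that blocks for at most `timeout`
    seconds and raises OSError (TimeoutError is a subclass) when it fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return save(timeout)
        except OSError:
            if attempt == attempts:
                raise          # out of retries: let the next R take over
            time.sleep(pause)  # brief pause before trying again
```

Only sporadic failure types are caught here. A failure that will certainly repeat, such as a bad path, is better handed straight to the next step than retried.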
Let’s not be hasty
Mr. Rose has another question, “Wait, we can’t just unilaterally take these actions. Customers must be alerted and give permission, right?” Well Axl, that depends.
Certainly, there are cases where the customer must provide increased privileges to restart certain subsystems or repair installs. There are also cases when an action could be time consuming or have undesirable side effects. However, most actions are clear, quick, and solve the problem without requiring user intervention. Regardless, the key word here is “action.”
There’s no point in alerting the user about anything unless it’s actionable. That goes for all messages. What’s the point of telling me an operation failed if there’s no action I can take to fix it or prevent it from happening again? Why not just tell me to put an axe through the screen? If there is a constructive action I can take, why doesn’t the code just take it? And we have the audacity at times to think the customer is dumb? Unbelievable.
It’s always the same
“Fine, this is extra work though,” complains Axl, “and who says the software won’t just be retrying, restarting, rebooting, and reimaging all the time? After all, if the bug happened once, it will happen again.” Actually Axl, bugs come in two flavors—repeatable and random. Some people call these Bohrbugs and Heisenbugs, respectively.
Using the five Rs will resolve random bugs, rendering them almost harmless. However, repeatable bugs will repeat, which is why logging these issues is so important. Even if the program or service doesn’t crash, we still want the failure reported so we can recognize and repair the repeatable bugs, and perhaps even pin down the random bugs. The good news is that the nastiest bugs in this model, the repeatable ones, are by far the easiest to fix.
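As a rough illustration of why the log matters even when recovery succeeds, the Python sketch below tallies failures by a simple signature and flags the ones that keep coming back. Everything in it is an assumption for the example: the names `record_failure` and `likely_repeatable`, the signature of action name plus exception type, and the threshold of three. A frequent random bug would also trip this heuristic, so treat the result as a list of suspects, not a verdict.

```python
from collections import Counter

# Tally of failures seen so far, keyed by (action name, exception type).
failure_counts = Counter()


def record_failure(action_name, error):
    """Record one failure, even if the action later recovered."""
    signature = (action_name, type(error).__name__)
    failure_counts[signature] += 1
    return signature


def likely_repeatable(threshold=3):
    """Signatures seen at least `threshold` times: candidates for a real fix."""
    return [sig for sig, count in failure_counts.items() if count >= threshold]
```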
By putting in some extra work, we can make our software resilient to failure even if it isn’t bug-free. It will just appear to be bug-free and continue operating that way indefinitely. All it takes is a change in how we think about errors—expecting, logging, and handling them instead of catching them. Isn’t that worth the kudos (and free time) you’ll get from family and friends when our software just works? Welcome to the new world.