A tragedy of error handling

If there is any single aspect of our production code that has been traditionally and uniformly lame, it’s error handling. Office made some great strides in this area for Office 2000 with its LAME registry setting for error dialogs. Windows 2000 also improved things quite a bit by working hard at providing meaningful messages with actionable directions. Both Office XP and Windows XP now automatically report severe errors back to us to evaluate and track. However, these efforts only bring us from kicking our users when they’re down to apologizing for slamming them in the first place and then perhaps handing them a cane to push themselves upright.

Eric Aside: The internal Office 2000 LAME registry setting added a “lame” button to every Office 2000 error dialog. If you didn’t like the dialog, you could hit the “lame” button and your “vote” would be recorded. These days it’s been replaced by the “Was this information helpful” link all users can access at the bottom of Office error dialogs.

So why doesn’t the code fix the error itself? From my years of staring at our code, I see two main reasons. First, the code has no idea what went wrong in the first place. Second, the error-handling code is not in a position to fix the problem, even if it knew what it was. These two problems are related and pervasive. Let’s talk a bit about the typical situation.

The horror, the horror

A dev writes a bunch of code. Another dev adds a bunch more. Then they add code from some other group. Then they add more. Then they realize that they need to handle errors, but they don’t want to go back and put error handling everywhere, so they write a Routine for Error Central Handling, or RECH, and propagate errors to it. Then they write the next version that adds more code, maybe written by completely different people. Some folks return meaningful errors, some return a simple pass/fail. Some folks like exceptions; some folks like error codes. The RECH only works with exceptions or error codes, not both.

Eric Aside: I made up the RECH acronym for this column. I have no idea what the actual name is, if any, for the unfortunate practice it references.

If the RECH works with exceptions, then sections of code that return error codes are wrapped with a throw on failure. If the RECH works with error codes, then sections of code that throw exceptions are wrapped with a generic catch that returns an error code. If a section of code returns pass/fail, then it gets wrapped with a generic exception or error code, which might be converted later to an error code or exception. Even if you start with a function that returns descriptive error codes, like much of Win32 or OLE, these calls are often wrapped by a function that either reduces the codes to pass/fail or ignores them completely. All along the way, information is lost…forever.
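To make the loss concrete, here is a minimal C++ sketch. Every name in it (`OpenStatus`, `open_config`, and the two wrappers) is hypothetical and made up for illustration. It is not a real Win32 or OLE API, just a stand-in for a call that returns descriptive codes.

```cpp
#include <string>

// Hypothetical descriptive error codes, standing in for what a Win32 or OLE call returns.
enum class OpenStatus { Ok, NotFound, AccessDenied };

// Hypothetical low-level call that reports *why* it failed.
OpenStatus open_config(const std::string& path) {
    if (path.empty())      return OpenStatus::NotFound;
    if (path == "/locked") return OpenStatus::AccessDenied;
    return OpenStatus::Ok;
}

// The typical wrapper: reduces the code to pass/fail.
// Every caller above this point can no longer tell the failures apart.
bool open_config_lossy(const std::string& path) {
    return open_config(path) == OpenStatus::Ok;
}

// A lossless wrapper passes the original code through unchanged.
OpenStatus open_config_lossless(const std::string& path) {
    return open_config(path);
}
```

With the lossy wrapper, a missing file and a locked file look identical to the caller, so any recovery logic higher up has nothing to work with.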

Taking exception

To add pain to agony, the mixing of exceptions with error codes is a disaster. You can’t use exceptions if you don’t unwind your stack objects properly. This usually means that anything requiring non-trivial destruction must be an object. That’s easy in C#, but it’s hard in C and C++.

For example, you must use smart pointers exclusively in code that throws exceptions. However, if some parts of the code use exceptions and others don’t, you often get a situation in which an exception-throwing function is called from an error-code-returning function and the catch doesn’t happen till you’re three levels up. Thus an error generates more errors and corrective action is compromised. It gets even worse in multithreaded apps where exceptions must be handled per thread.
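A small C++ sketch of that situation, under stated assumptions: `Buffer`, `parse`, `load_raw`, and `load_safe` are hypothetical names, and the instance counter exists only to make the leak visible.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical resource that counts live instances so a leak is observable.
struct Buffer {
    static int live;
    Buffer()  { ++live; }
    ~Buffer() { --live; }
};
int Buffer::live = 0;

// Hypothetical helper written in the exception style.
void parse(bool ok) {
    if (!ok) throw std::runtime_error("parse failed");
}

// Error-code style function that was not written with exceptions in mind.
// If parse() throws, the delete is skipped and the Buffer leaks.
int load_raw(bool ok) {
    Buffer* b = new Buffer;
    parse(ok);   // the exception passes straight through this frame
    delete b;
    return 0;    // 0 means success in this sketch
}

// Same function with a smart pointer: unwinding destroys the Buffer.
int load_safe(bool ok) {
    auto b = std::make_unique<Buffer>();
    parse(ok);
    return 0;
}
```

Neither function catches anything, so the exception still surfaces somewhere further up. The difference is that `load_safe` leaves nothing behind when it does.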

Don’t lose it, use it!

You want to use the error information you have in the best possible way. So what do you do? First, pick your poison and go with it.

If you want to use exceptions, fine, but use them everywhere and make everything non-trivial an object (the .NET Framework model).
If you want to use error codes, fine, but propagate them in a lossless fashion or handle them at the source.
If you really must mix metaphors, wrap every exception call inside a return code function with your exception-handling routine, and wrap every return code function call inside an exception function with your return-code-handling routine. Now at least you are not losing data.

Eric Aside: Please note, this doesn’t mean you should wrap all exceptions for the sake of wrapping them. That’s inefficient and quite strange. But say you are going to hide an exception-raising call within an outer function that isn’t supposed to raise exceptions. To uphold the outer function’s contract, you need to catch exceptions from the inner function and convert them into an appropriate error code. Ideally, you aren’t using a mixed model in the first place.
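If you are stuck with a mixed model, the boundary wrappers might look roughly like this in C++. This is a sketch, not a prescription: `Status`, `parse_port`, `try_parse_port`, `StatusError`, and `parse_port_or_throw` are all hypothetical names. The only real library call is `std::stoi`, which throws `std::invalid_argument` or `std::out_of_range`.

```cpp
#include <new>
#include <stdexcept>
#include <string>

// Hypothetical error codes for the outer, non-throwing API.
enum class Status { Ok, InvalidArgument, OutOfRange, OutOfMemory, Unknown };

// Hypothetical inner function that reports failure by throwing.
int parse_port(const std::string& text) {
    int port = std::stoi(text);  // throws std::invalid_argument or std::out_of_range
    if (port < 1 || port > 65535) throw std::out_of_range("port out of range");
    return port;
}

// Outer function whose contract is "never throws, returns a Status".
// Each exception type maps to its own code, so the reason survives the boundary.
Status try_parse_port(const std::string& text, int& port) noexcept {
    try {
        port = parse_port(text);
        return Status::Ok;
    } catch (const std::invalid_argument&) {
        return Status::InvalidArgument;
    } catch (const std::out_of_range&) {
        return Status::OutOfRange;
    } catch (const std::bad_alloc&) {
        return Status::OutOfMemory;
    } catch (...) {
        return Status::Unknown;  // last resort: at least the failure is not swallowed
    }
}

// The other direction: an exception type that carries the original code.
struct StatusError : std::runtime_error {
    Status status;
    explicit StatusError(Status s) : std::runtime_error("status error"), status(s) {}
};

// Exception-style function wrapping a return-code call without dropping the code.
int parse_port_or_throw(const std::string& text) {
    int port = 0;
    Status s = try_parse_port(text, port);
    if (s != Status::Ok) throw StatusError(s);
    return port;
}
```

The generic `catch (...)` is still there, but only as the fallback after the specific cases, which is the opposite of wrapping everything in one generic catch.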

Next, you need to plan for action when an error occurs. Figure out the highest level at which you still know what to do if a section of code fails. That level is rarely the top, so stop RECHing.

At that highest actionable level, add your error handling. This will add more than just one error-handling function to your code, but it probably won’t add a thousand. The key is to go as high as you can in the stack, but no higher. When using error codes, you may need to add buffers to your application object to hold key information like file paths or flags so that your error handling can fix the problem.
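Here is one way that could look with error codes, as a hedged C++ sketch. The `App` object, `write_bytes`, the paths, and the fallback directory are all hypothetical, and the file write is simulated so the example stays self-contained.

```cpp
#include <string>

// Hypothetical error codes, propagated unchanged from the lowest level.
enum class SaveStatus { Ok, DiskFull, AccessDenied };

// Hypothetical low-level write (simulated): it knows what failed,
// but has no idea what the user was trying to accomplish.
SaveStatus write_bytes(const std::string& path) {
    if (path.rfind("/readonly/", 0) == 0) return SaveStatus::AccessDenied;
    if (path.rfind("/full/", 0) == 0)     return SaveStatus::DiskFull;
    return SaveStatus::Ok;
}

// Hypothetical application object holding the context needed for recovery.
struct App {
    std::string fallback_dir = "/home/user/recovered/";
    std::string last_saved_path;  // where the document actually ended up

    // Highest level that still knows what to do: it has the file name and a fallback.
    SaveStatus save_document(const std::string& dir, const std::string& name) {
        SaveStatus s = write_bytes(dir + name);
        if (s == SaveStatus::AccessDenied) {
            // Fixable here: retry in a location we expect to be writable.
            s = write_bytes(fallback_dir + name);
            if (s == SaveStatus::Ok) last_saved_path = fallback_dir + name;
            return s;
        }
        if (s == SaveStatus::Ok) last_saved_path = dir + name;
        return s;  // e.g. DiskFull: not fixable here, so pass it up unchanged
    }
};
```

The access-denied case is handled where the file name and fallback directory are both known. The disk-full case is not fixable at this level, so it travels up intact to whoever can do something about it.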

Only as a last resort—or as an act of bravado—report the error to the user. The net result is a system that always seems to work or at least always seems to care. Our customers will love us for it.
