This month, Interface focuses on the lessons that we learned from security pushes around the company. What about the lessons that we haven’t learned? What about the dumb things we are still doing?
Eric Aside Interface is the name of the Microsoft monthly internal webzine in which my opinion columns first appeared. The webzine published its last issue in February 2003.
The security fire drill exposed more than security holes in our software. It further exposed the shoddiness of our work and left many folks wondering what the next fire drill will be. Guesses include: privacy, availability, supportability. How about quality? Anyone heard of quality? What the heck happened to quality???
Check-in time on most dev teams is like amateur hour. The kind of garbage that passes as a first cut is pathetic, and fixing it later is like signing our own death certificate. Why? Because we won’t fix problems before release that are too superficial, too complex, or too obscure. We won’t fix bugs before release that are found too late or simply not discovered until after we ship.
So what, right? We’ve been shipping this way for years. What is true this year that wasn’t before? Oh my, where to begin…
Eric Aside Much of what I wrote about here nine years ago has changed. That doesn’t mean we are yet where we want to be. However, we’ve radically increased the amount of unit testing, automated testing, code review, and code analysis we perform both before code gets checked into the main source tree and before we ship. These days we can actually use weekly builds for mission-critical internal tasks and day-to-day work, and daily builds to make incremental improvements to production services.
Things have changed
First of all, today we are trying to sell into markets that require turnkey solutions—that is, you turn the key and it works. These markets require turnkey solutions because the associated customers do not have the expertise to work around the problems. So if it doesn’t work right away, we have to fix it right away.
We have entered two major turnkey markets: consumer products and the enterprise. If you’re smart, you’re wondering how our competitors have succeeded in these markets.
For the consumer market, our competitors have kept their products small and simple. That way there aren’t many failure modes, and if the product does fail, it can quickly restart and recover. We are selling products with far more complexity and functionality. However, this means we have more failure modes. To stay competitive, our products need to be better, with fewer failures, and they need to restart and recover faster.
For the enterprise market, our competitors have supplied armies of support personnel and consultants. For many competitors, this is the biggest part of their business—they actually make money on their complexity and failures. When their products collapse, our competitors immediately dispatch their own squadron of people to fix the problems and keep the enterprise up and running.
We don’t follow this business model. We sell less expensive products in high volume and provide minimal support in an effort to maximize profits. However, this means that we can’t afford to break as often and that we must quickly fix or recover from the few failures that we do have.
Eric Aside Our “minimal” support has expanded significantly as the Internet provides new models for support, but Microsoft is still a volume software and services provider.
Good enough isn’t
The second way things have changed for us as a company is that our key products are now good enough. Actually, feature-wise our key products—Office and Windows—have been good enough for years.
Being good enough means that we’ve got all the features that our customers need, or at least those that they think they need. This hurts us in two ways:
- People stop upgrading to the next version. After all, the current version has everything that they think they need and upgrading is painful and expensive.
- Any software copycat can create a viably competitive product by just referring to our widely distributed, online, fully automated specifications (the products themselves). If the copycat does a better job, making the software more reliable, smaller, and cheaper (say like Honda did to Chrysler), then we’ve got a big problem.
Think it can’t happen? It already has. (Does Linux ring a bell?) Linux didn’t copy Windows; it just ensured that it had all the good-enough features of a Windows server. Right now there are developers working on Windows-like shells and Office-like applications for Linux. Even if they fail, you can bet someone will eventually succeed in developing a superior product—as long as we leave the quality door open.
We can’t afford to play catch-up with our would-be competition. Detroit has been fighting that losing battle for years. We must step up and make our products great before others catch us.
The good news is that we have some time. Other commercial software companies big enough to copy Office or Windows are poorly run and are way behind us in the PM and test disciplines. The open-source folks lack the strategic focus and coordination that we have; they rely on a shotgun approach hoping to eventually hit their target. We can beat all competitors if we raise our quality bar—ensuring fewer failures, faster restart, and faster recovery—and if we focus on our key customer issues.
But as anyone can tell you, nothing comes for free. If we focus more on quality, something else has to give. At a high level, the only variables that we control are quality, dates, and features. For projects with fixed dates, quality means fewer features. For projects with fixed features, quality means adding time to the schedule.
Eric Aside Actually, I don’t completely believe this anymore. I’ve seen great efficiency gained by removing waste from the system (as you can see in Lean: More than good pastrami) and fixing problems early. While it might not be enough to give the company summers off, I believe it is enough for high quality not to cost us features or time. Yes, it’s not as quick as the early days when the quality bar was low, but compared to our recent long stabilization periods, doing it right the first time is as fast if not faster.
Before you balk at this thought process, BillG has already made our choices clear in his article about trustworthy computing:
In the past, we’ve made our software and services more compelling for users by adding new features and functionality and by making our platform richly extensible. We’ve done a terrific job at that, but all those great features won’t matter unless customers trust our software.
The only question is: Are you going to follow through?
There are three principal areas to focus on to improve the quality of our products:
- Better design and code
- Better instrumentation and test
- Better supportability and recovery
Let’s break them down one at a time.
Time enough at last
Few developers wouldn’t love more time to think through their code and get it right the first time. The trouble is finding the time and having the self-discipline to use that time wisely. So, what would you do if you had more time? As a manager, I would spend more time with my people discussing design decisions and reviewing code.
Two key design issues I’d emphasize are simplicity and proper factoring:
- Simplicity Keeping the design simple and focused is key to reducing unintended results and complex failures.
- Proper factoring This helps keep each piece of the design simple and separable from the others. It also makes it easier to enforce a sole authority for data and operations, and to maintain and upgrade code.
Eric Aside Test-Driven Development (TDD) accomplishes both these results for implementation design. You can take a similar approach to TDD for component design as well, though the tests are sometimes no more than thought experiments.
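To make the TDD rhythm concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration (the helper, its tests, the names); the point is only the order of operations: the tests exist first and drive the design.

```python
# Hypothetical TDD example (all names invented for illustration):
# the assertions at the bottom were written first, and the helper
# was then written to make them pass.

def wrap_line(text: str, width: int) -> list[str]:
    """Break text into lines no longer than width, splitting on spaces."""
    lines: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= width:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

# Step 1: write failing tests that pin down the intended behavior.
assert wrap_line("hello world", 20) == ["hello world"]
assert wrap_line("one two three", 7) == ["one two", "three"]
# Step 2: write just enough code (above) to pass, then refactor safely.
```

The same discipline scales up to component design; the "tests" there may just be the thought experiments mentioned above.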
I’d also give devs extra time by pairing them to work on each feature task. This serves to
- Double the time that each dev has to do the work because you schedule the same task length as if one dev were assigned.
- Allow for peer reviews of designs and code.
- Provide each feature with a backup dev in case the primary dev becomes unavailable.
To help my devs apply self-discipline, I’d
- Schedule completion dates for dev specs (also known as design docs and architecture docs).
- Make each backup dev as responsible for feature quality as the primary developer is.
- Measure regression rates and build-check failures to use as feedback on quality at check-in. (Sure, these measures are imperfect, but what did you want? Bugs per line of code?)
Eric Aside These days I’d use churn and complexity measures instead of regression rates. See October 1, 2006: “Bold predictions of quality” for more.
Checking it twice
It’s never enough to think that you have the code right; you’ve got to know. Check it from the inside with instrumentation and from the outside with unit tests. The more you do this, both in terms of coverage and depth, the better you can assure quality.
And, NO, this is not a tester’s job. Test’s job is to protect the customer from whatever you miss—despite your best efforts. The testing team is not a crutch on which you get to balance your two tons of lousy code. You should be embarrassed if test finds a bug. Every bug that they find is a bug that you missed, and it better not be because you were lazy, apathetic, or incompetent.
So how do you prevent bugs from ever reaching testers and customers?
- Instrumentation Asserts, Watson, test harness hooks, logging, tracing, and data validation can all be invaluable (even instrumental—sorry) in identifying and pinpointing problems before and after check-in.
- Unit tests Testing can often make the biggest difference between good and exceptional, between barely functional and solid. There are lots of different kinds of unit tests; you, your backup, and your peer in test should pick those that are most appropriate for your feature:
o Positive unit tests exercise the code as intended and check for the right result.
o Negative unit tests intentionally misuse the code and check for robustness and appropriate error handling.
o Stress tests push the code to its limits hoping to crack open and expose subtle resource, timing, or reentrancy errors.
o Fault injection tests expose error-handling anomalies.
The more you verify, the more code paths you cover, the less likely it is that a customer will find fault in your work.
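To show the distinction between these kinds of tests, here’s a hedged Python sketch. The tiny parser and every name in it are invented for illustration; what matters is how differently the positive, negative, and fault-injection tests treat the same routine.

```python
# Hypothetical illustration of the unit-test kinds above, applied to a
# tiny config parser (the parser and all names are invented).

def parse_port(value: str) -> int:
    """Parse a TCP port number, validating range."""
    port = int(value)                # raises ValueError on garbage input
    if not 1 <= port <= 65535:
        raise ValueError(f"port {port} out of range 1-65535")
    return port

def read_port(opener) -> int:
    """Read a port from some source; opener is a dependency we can fail."""
    with opener() as f:
        return parse_port(f.read().strip())

# Positive test: exercise the code as intended, check the right result.
assert parse_port("8080") == 8080

# Negative tests: intentionally misuse the code, check error handling.
for bad in ("0", "70000", "not-a-number"):
    try:
        parse_port(bad)
        raise AssertionError(f"expected failure for {bad!r}")
    except ValueError:
        pass  # correct: bad input was rejected, not silently accepted

# Fault-injection test: force a dependency to fail and verify the
# caller surfaces the error instead of masking it.
def failing_opener():
    raise OSError("disk failure (injected)")

try:
    read_port(failing_opener)
    raise AssertionError("expected injected OSError to propagate")
except OSError:
    pass
```

Stress tests follow the same pattern but run these calls at volume and concurrently, which doesn’t fit in a sketch this size.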
Physician, heal thyself
Even if you design and code your feature well, and even if you instrument and test it well, there will still be bugs. Usually those bugs involve complex interactions with other software. Sometimes that software isn’t ours; sometimes it’s old; and sometimes it isn’t well designed or tested.
But just because bugs will always occur doesn’t excuse you from making bugs rare, nor does it excuse you from taking responsibility when they do occur. You still have to help customers fix or recover from these problems, ideally without them ever noticing.
A wonderful example of recovery is the IIS process recycling technology. Any time an IIS server component dies, the process automatically and immediately restarts. The customer, at worst, only sees a temporary network hiccup repaired by a simple refresh.
Office XP also has a recovery system, though the user is made more aware. When an Office app fails, it backs up the data, reports the problem, and automatically restarts and recovers the data upon request. These solutions are not terribly complicated, but they offer a huge benefit to customers and save tons of money in support costs.
Eric Aside Read Crash dummies: Resilience for a far more in-depth discussion of resiliency.
If you can’t get your product to automatically recover and restart, you should at least capture enough information to identify and reproduce the problem. That way, the support engineer can easily and quickly understand the issue and provide a fix if one already exists. If the issue does not have a known fix, the failure information can be captured and sent to your team so that you can reproduce it and design a fix for the customer right away.
Capturing enough information to identify and reproduce problems is not as hard as it sounds:
- Watson currently does a great job of identifying problems, and future versions of Watson will make it easier to reproduce those problems on campus.
- SQL Server does a great job of capturing a wide variety of customer data, which allows everything from precisely reproducing all the changes to a database to simply dumping the relevant state when a failure occurs.
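As a hedged sketch of the same principle (this is not Watson, and all names are invented): an unhandled-exception hook that snapshots the stack, the error, and basic environment details is often enough for a support engineer, or your team, to identify and reproduce the failure.

```python
import json
import platform
import sys
import traceback

# A minimal, hypothetical crash-capture hook (not Watson itself):
# on any unhandled exception, snapshot enough state to identify and
# reproduce the problem later.

def build_crash_report(exc: BaseException) -> dict:
    """Collect the failure details a support engineer would need."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "python": platform.python_version(),
        "platform": platform.platform(),
    }

def install_crash_hook(path: str = "crash_report.json") -> None:
    """Write a crash report on any unhandled exception, then re-raise."""
    def hook(exc_type, exc, tb):
        with open(path, "w") as f:
            json.dump(build_crash_report(exc), f, indent=2)
        sys.__excepthook__(exc_type, exc, tb)  # still surface the failure
    sys.excepthook = hook
```

The report file is what gets sent home; the SQL Server approach goes further by also capturing the sequence of changes needed to replay the failure.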
Step by step
Okay, you follow all these suggestions. Was it worth it? Is the code really better? Did you miss something? If these security pushes have taught us anything, it’s that problems aren’t always obvious and that ad hoc methods and automated tools don’t find all of our issues.
The pushes also reminded us that a couple of missed exploits can cost Microsoft and our customers millions of dollars and lead to a double-digit drop in our customers’ perceptions of product quality.
What techniques can we borrow from the security folks? Two immediately come to mind:
- Decompose your product into components like you would for a threat model, and look for quality issues within each piece. Ask: How should we design each component? How can we make each component instrumented, testable, supportable, and recoverable? Applying more structured engineering methods to improving quality will yield more reliable and comprehensive results.
- Reduce the failure surface area of your product; that is, reduce the number of ways that your product can be misused. Cut options that allow customers to invent procedures that you didn’t intend and that are included just because you thought someone somewhere might desire that option. Simplify each feature so that it performs the one task that you designed it for and that’s all. Remember, unnecessary complexity just hides bugs.
Too much to ask?
By now you are surely thinking that I’m insane. There’s no way your PM, test, and ops teams—let alone your managers—are going to give you the time to take on everything that I’m suggesting. Actually, it’s not as bad as you might think. Many of these practices are tied to each other:
- Proper design welcomes testability and supportability in your products.
- Instrumentation helps identify and reproduce problems.
- Testing shows weaknesses where you need to recover.
To buy more time to do this right, make improving quality a win for the whole team. Define supportability and recovery as features. Get PM, test, and ops into the act.
When we build our products right and customers regain confidence in our work, we will leave our competitors in the dust. They can’t match our features; they can’t match our innovation and forward thinking; and if we show what great engineers we can be, they won’t be able to compete with our quality.
Eric Aside “Where’s the beef” is one of my favorite columns. Nine years later I still get pumped reading it. Quality is a pursuit, not a destination. Even though we still have far to go, I’m proud of how far we’ve come.