Nothing infuriates me more than wasted time and wasted effort. I’m not talking about training, reorgs, moves, morale events, or vacations. Those at least have the potential to be valuable in your life. I’m talking about build time, integration time, unused specs, incomplete features, blocking issues, excessive and persistent bugs, and over-engineered code and processes. You know—hours and days you’ll never get back.
I broke down all this real waste years ago in my column Lean: More than good pastrami. While I provided examples and suggested solutions for individual areas, I didn’t really map the path to a better life for your particular team. Every situation is different. You need a path that discovers your most harmful waste, drives your team to resolve it, and rewards your team with a visceral sense of relief.
In manufacturing, the secret path to success is reducing inventory. Inventory hides your manufacturing issues. As you reduce inventory, problems appear. You fix the issues, and then reduce inventory further. Gradually, your waste is eradicated, and your efficiency soars. In software development, the secret path to success is reducing cycle time. The shorter you make the time between concept and completion, the more roadblocks you face that have little to do with actual engineering. Fixing those problems unleashes productivity. Let me show you the way to free your engineering soul.
What’s done is done
The first step to shortening your development cycle time is determining how long your cycle takes. For services, it’s the time between releases. However, for packaged products, shortening time between releases often can’t be supported by the market. Thus, a better definition of a cycle is the time between starting detailed specification of a feature and having that feature completed. What does it mean to complete a feature? There’s the rub.
You’ve got to define “done” for your features and for release of your services. Below are the definitions my team uses. We insist the first four are done for every feature and the second four are done for every release. We release the Xbox.com sites every four weeks—and we LOVE IT!
“Done” for every feature:
1. All updated designs and code are reviewed
2. All automated tests written and passed
3. No ship-stopping bugs
4. All monitoring and health checks in place (feedback tools for packaged products)
“Done” for every release:
1. All localization and world readiness completed
2. Full test pass completed successfully
3. All quality areas signed off and partners signed off (including LCA)
4. All necessary release documentation completed
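As a minimal sketch, and not the Xbox.com team's actual tooling, the feature-level checklist and the cycle definition above could be tracked like this in Python. The `Feature` class, the criterion strings, and the example dates are all hypothetical; the only idea taken from the column is that a feature's cycle runs from the start of detailed specification until every "done" criterion is met.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional

# The four per-feature "done" criteria, paraphrased from the list above.
FEATURE_DONE = (
    "designs and code reviewed",
    "automated tests written and passed",
    "no ship-stopping bugs",
    "monitoring and health checks in place",
)


@dataclass
class Feature:
    """Hypothetical record of one feature and when each criterion was met."""
    name: str
    spec_started: date
    satisfied: Dict[str, date] = field(default_factory=dict)

    def completed_on(self) -> Optional[date]:
        """Return the date the last criterion was met, or None if any is open."""
        if any(criterion not in self.satisfied for criterion in FEATURE_DONE):
            return None
        return max(self.satisfied[criterion] for criterion in FEATURE_DONE)

    def cycle_time_days(self) -> Optional[int]:
        """Days from the start of detailed specification to 'done'."""
        done = self.completed_on()
        return None if done is None else (done - self.spec_started).days


# Usage with made-up dates: the feature is only "done" when the last box is ticked.
feature = Feature("hypothetical-feature", spec_started=date(2024, 3, 4))
feature.satisfied["designs and code reviewed"] = date(2024, 3, 15)
feature.satisfied["automated tests written and passed"] = date(2024, 3, 20)
feature.satisfied["no ship-stopping bugs"] = date(2024, 3, 22)
print(feature.cycle_time_days())  # None: monitoring is still missing
feature.satisfied["monitoring and health checks in place"] = date(2024, 3, 25)
print(feature.cycle_time_days())  # 21
```

Measuring to the last criterion rather than to "code complete" is the point: the slowest box is where the waste shows up.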
As you attempt to shorten the time between starting and finishing work, it’s these eight “done” criteria that expose issues. Let’s briefly discuss the common problems that arise and how to respond.
If you build it
This first requirement for “done,” reviewing updated designs and code, only saves time, so let’s talk about automated tests—unit tests, component tests, stress tests, acceptance tests, system tests, fault injection tests, and so on. Developers and testers should share in writing these tests. Who writes which tests varies by team. As they attempt to shorten cycle time, most teams struggle with their test harnesses and the time it takes to run the tests.
When it comes to reducing cycle time, you’ve got to distinguish between tests that run quickly and often and tests that run slowly and infrequently. Any tests that fall in the middle need to pick a side and be rewritten or refactored.
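One way to picture that split is a small Python sketch that sorts tests by measured runtime. The function name, the bucket names, and both thresholds (1 second and 60 seconds) are assumptions chosen for illustration; pick limits that suit your own check-in budget.

```python
def classify_tests(durations, fast_limit=1.0, slow_limit=60.0):
    """Bucket tests by runtime in seconds.

    durations maps a test name to its measured runtime. Tests at or under
    fast_limit go to the check-in suite, tests at or over slow_limit go to
    the full test pass, and everything in between is flagged for rework.
    """
    buckets = {"check_in": [], "full_pass": [], "pick_a_side": []}
    for name, seconds in sorted(durations.items()):
        if seconds <= fast_limit:
            buckets["check_in"].append(name)
        elif seconds >= slow_limit:
            buckets["full_pass"].append(name)
        else:
            buckets["pick_a_side"].append(name)
    return buckets


# Usage with made-up test names and timings.
timings = {
    "test_parse_gamertag": 0.02,
    "test_login_flow": 12.0,
    "test_stress_checkout": 900.0,
}
for bucket, names in classify_tests(timings).items():
    print(bucket, names)
```

The middle bucket is the to-do list: each test in it either gets trimmed until it is fast enough for check-in or moves to the full pass.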
While it’s nice to have one test harness, you can get away with two—one for fast and reliable check-in tests and one for full test passes. If you’ve got such a big team that even quick check-in tests take more than 10 to 20 minutes, then you’ve probably got a large enough team to invest in test prioritization and parallelization technology.
Likewise, if you’ve got such a large codebase that it takes more than 10 to 20 minutes to rebuild, then you’ve probably got a large enough team to invest in a highly parallelized build lab and build dependency logic. Remember, build, test, and check-in form software development’s inner loop. Anything done to speed up your development inner loop creates a huge multiplier to overall productivity.
As for code branches, you never want to be more than one branch deep from the main branch. Integration is expensive, and each branch level adds another integration layer. Think about it. Say you were building personalized laptops. Having the distinctive components go through customs would crush your delivery schedule. Every branch level is like another customs station between your fixes and features and the main branch.
Roaches check in, but they don’t check out!
Cleaning up a large bug backlog before release can really slow down cycle time. Bugs take progressively longer to fix the longer you wait. Design and code reviews plus automated testing will help (as will refactoring spaghetti code and switching to test-driven development). Regardless, what really matters is finding and fixing bugs early. Short cycle times demand immediate bug fixing.
No matter what you do, you’ll still have bugs—we are human. Some of those bugs will be very difficult to find and fix, which will slow you down. The good news is that an architecture that is resilient to failure can alleviate the need to fix the toughest bugs—the intermittent ones. Resilient architectures allow you to collect data on these stubborn, sporadic slipups and fix them once they are finally understood.
I wrote more about resiliency in one of my more controversial columns, Crash dummies: Resilience.
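To make the idea concrete, here is a minimal Python sketch of one resilience tactic: retry an operation that fails intermittently, and record every failure so the bug can be studied later. This is an assumed illustration, not the architecture described in the column; `call_with_retry`, its parameters, and the `flaky` example are hypothetical.

```python
import logging
import time

log = logging.getLogger("resilience")


def call_with_retry(operation, attempts=3, delay_seconds=0.0, failures=None):
    """Run operation(), retrying on any exception.

    Each failure is logged and, if a list is supplied, appended to
    `failures` so the sporadic bug leaves data behind for later diagnosis.
    The last exception is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:  # deliberately broad for this sketch
            last_error = error
            if failures is not None:
                failures.append(
                    {"attempt": attempt, "error": repr(error), "time": time.time()}
                )
            log.warning("attempt %d failed: %r", attempt, error)
            if attempt < attempts and delay_seconds:
                time.sleep(delay_seconds)
    raise last_error


# Usage: a made-up operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("intermittent failure")
    return "ok"


recorded = []
print(call_with_retry(flaky, attempts=3, failures=recorded))  # ok
print(len(recorded))  # 2
```

The customer sees a working feature, and the team still ends up with a record of each sporadic failure to investigate when time allows.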
How am I doing?
Monitoring and health checks are often treated as afterthoughts. This amateur-hour action increases the time needed to track down customer issues, which lengthens cycle time. This is just as true with designing and implementing customer feedback tools for packaged products.
Monitoring and health checks need to become forethoughts, designed in from the start. Consider why you are building your feature, and ask how you’ll know if it is performing as envisioned. That will tell you exactly how to monitor its use and inquire about its health.
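A minimal Python sketch of that question turned into code might look like the following. The `HealthCheck` class, the metric names, and the thresholds are hypothetical; the idea it illustrates is writing down, at design time, the numbers that would tell you the feature is performing as envisioned.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class HealthCheck:
    """One expectation about a feature: a metric and its lowest healthy value."""
    name: str
    metric: str
    minimum: float

    def evaluate(self, metrics: Dict[str, float]) -> dict:
        # A missing metric counts as unhealthy: no data is itself a warning sign.
        value = metrics.get(self.metric)
        healthy = value is not None and value >= self.minimum
        return {"check": self.name, "value": value, "healthy": healthy}


def run_checks(checks: List[HealthCheck], metrics: Dict[str, float]) -> Tuple[bool, list]:
    """Evaluate every check and report overall health plus per-check detail."""
    results = [check.evaluate(metrics) for check in checks]
    return all(result["healthy"] for result in results), results


# Usage for a made-up feature: is it being used, and is it succeeding?
checks = [
    HealthCheck("feature is being used", "requests_per_hour", minimum=1),
    HealthCheck("requests succeed", "success_rate", minimum=0.99),
]
ok, detail = run_checks(checks, {"requests_per_hour": 420, "success_rate": 0.97})
print(ok)  # False
for result in detail:
    print(result)
```

Writing the checks alongside the spec forces the "how will we know?" conversation before the code exists rather than after the first customer complaint.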
All this data and your quick cycle times enable fast feedback loops and constant improvement. Be sure to spend time during every cycle reflecting on what you can do better in your product and your engineering team. My feature teams and leads do this twice every release (every two weeks).
Making monitoring and health checks a forethought is my team’s most recent addition to the “done” list. Poor monitoring and insufficient health checks caused us to stumble in the fall, while we watched a partner team of ours shine in the same area.
Sign me up
We already talked about automation for full test passes, and localization processes are quite refined and fast at Microsoft, so the next area that typically causes trouble is sign-off. Quality areas (security, privacy, and so on) should be addressed and bugs fixed by all engineers as part of normal feature work. However, sign-off on these areas, as well as partner sign-off, can really slow down cycle time.
Even though quality is the responsibility of every engineer, sign-off works best if one engineer is assigned to shepherd each quality area through its process. Those engineers become the team specialists in their areas, a nice career opportunity for them that provides cross-group scope.
Since team specialists deal with their quality areas all the time, sign-off requirements and activities are far faster and easier for them than for other team members. In addition, team specialists develop relationships across the team and with corporate specialists in their area, which also speeds the process and provides growth for the entire team.
How about a few more details
Using many of the techniques I’ve described, my team has managed to shorten our production release cycle from a few releases a year to one every four weeks. It’s fantastic! Before I get into all the advantages we are seeing, let me cover two topics people often question.
How do you develop features or architecture changes that take longer than four weeks? There are two basic approaches: horizontal and vertical.
§ The horizontal approach is to work on the large change a layer of the stack at a time. For example, first make the schema change, then ship the new service, then write the new model, then the new controller, and finally the new view. Each layer can ship within a four-week cycle.
§ The vertical approach is to break the large change into smaller slices of functionality. You then complete each vertical slice end-to-end within a four-week cycle. If the slices lead to a disjointed user experience, you hide the slices from users until enough of them are complete.
People often use a combination of horizontal and vertical techniques. Unfortunately, the horizontal approach often leads to over-engineering of layers and hampers iterative feedback. I much prefer the vertical technique, using the horizontal approach only as a last resort.
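Hiding vertical slices until enough of them are complete is commonly done with a feature flag. Here is a minimal Python sketch under that assumption; the `SliceFlags` class and the "wishlist" feature with its three slices are invented for illustration and are not taken from the Xbox.com codebase.

```python
class SliceFlags:
    """Tracks which vertical slices of a feature have shipped."""

    def __init__(self):
        self._shipped = {}  # feature name -> {slice name: shipped?}

    def register(self, feature, slices):
        """Declare a feature and the slices it is broken into."""
        self._shipped[feature] = {name: False for name in slices}

    def ship(self, feature, slice_name):
        """Mark one slice as complete and released (but possibly hidden)."""
        if slice_name not in self._shipped[feature]:
            raise KeyError(f"unknown slice: {slice_name}")
        self._shipped[feature][slice_name] = True

    def visible(self, feature, required):
        """Show the feature only when every required slice has shipped."""
        shipped = self._shipped.get(feature, {})
        return all(shipped.get(name, False) for name in required)


# Usage: three slices, each small enough to finish within one release cycle.
flags = SliceFlags()
flags.register("wishlist", ["add item", "view list", "share list"])
minimum_experience = ["add item", "view list"]

flags.ship("wishlist", "add item")
print(flags.visible("wishlist", minimum_experience))  # False: still disjointed

flags.ship("wishlist", "view list")
print(flags.visible("wishlist", minimum_experience))  # True: coherent enough to show
```

Each slice still goes through the full "done" list and ships on schedule; the flag only controls when users first see the result.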
How do you handle sustained engineering? Sustained engineering fixes usually take about a month from identification through testing and release. Since we ship every four weeks, sustained engineering is just part of our regular work. There are no sustained engineering releases except in the most unusual of cases, and more importantly, there is no special sustained engineering team. We are in it together—we all feel the pain of our mistakes and the joy of our advances.
Life is good
Now that we’ve been releasing every four weeks for the last six months, we’re really feeling the benefits.
§ Much of the overhead that engineers complain about is gone—we had to remove it to succeed.
§ Slipping is manageable—if a feature misses a release by a week or two, it still goes out within a month.
§ Releases aren’t scary or crazy anymore—we do them all the time, and you can cause only so much trouble in four weeks.
§ Our team gets more done in less time—we’re faster due to the streamlining that frequent releases demand.
§ We serve our customers well, and they notice—our dramatically improved response times to issues and feedback are greatly appreciated by our customers.
While there is pain involved in any change, shortening cycle time provides the immediate gratification of less overhead and quicker results. The team loves it, and I love it.
In the future, we want to be able to ship in one or two days, like some of our competitors (who are probably laughing at my team’s long, four-week cycles). We don’t plan to release that quickly all the time, but being able to do so will mean being even more streamlined. Once you start down this path, you get hooked on having so little between you and your customers, and that is a great place for everyone.
Why did we push the Xbox.com team to four-week cycles? Because several members of the leadership, including me, knew it was the best way to improve our team’s productivity and customer quality. Trimming cycle time and work-in-progress is an old technique from lean manufacturing, which dates back to the 1930s.
(Putting on my best Chris Farley voice): Well, La-Ti-Frickin-Da, Eric! 🙂
The concepts here are sound, but are they applicable to a majority of the company? While you're off in quick-release service la-la land, did you forget that a majority of the company does platforms and major applications that release every year at best?
I wish other groups would embrace the philosophy here, but I'm still rafting down the waterfalls, in my 14th year. I'd love to see you provide examples of how these concepts work in larger teams, because the xbox.com example is just too much of a toy example for anyone to take it seriously.
How applicable are the approaches mentioned in your article (cycle time, shipping every 2-4 weeks, no dedicated sustained engineering team) when your entire Dev and Test team is provided by a vendor company overseas? To get a definitive cost and schedule estimate from a vendor the process is by default very waterfall-ish with long cycle times (3-6 months to release an update to a service) since you need complete func specs with exact UI mock-ups, the SOW goes up the management chain for approval, etc.
One option is to pay the vendor by time and expenses (T&E). However, management certainly won't go for that since it's an open check-book. Also, the vendor company will be less inclined to meet dates.
Another option is to spec ~6 months of work and get definitive schedule and cost estimates from the vendor, get one SOW, etc., then release every 2-4 weeks. However, this builds a huge inventory of specs that are likely to get stale and are subject to changes in the direction of the business as time passes. This doesn't fit with Lean (BTW, I've read the Pastrami article and Poppendieck's related work).
The ideal world is to have a team of full-timers, on-site, sitting in close proximity to each other so that the cycle time can be reduced.
Randy Farley wants an example outside the Xbox.com 50+ person toy organization. Fair enough. I'll skip Amazon.com and Facebook.com, since they are non-MS and they are services. Instead, let's talk about Office feature crews.
Office used to be classic waterfall. Then they switched to feature crews–a cross-discipline team works on a single feature from conception to completion following clear done definitions. Office continues to refine this process and shorten the cycle time, which now varies from 2 – 5 weeks.
Does Office release every 2 – 5 weeks? No, of course not. It's a critical business application and enterprise companies don't want updates to critical business applications that often. But as I mention in the column, shortening cycle time for packaged products is about shortening the time from conception to completion of a feature. When you do, you remove wasted time and its associated frustration from your engineering system.
BTW, the second comment reminds me of another thing we did in Xbox.com to reduce cycle time. We co-located feature teams (including CSGs and vendors where applicable). That gave us a 20+% increase in productivity (as measured by the size and number of features completed over a release).
— I.M. Wright.
Could you provide examples of this vertical slicing? In our case some features can take several months and it would be great to know how people deal with it…