A recent flood of build breaks triggered a wave of tool suggestions to plug the cracks in our code. Some argued for faster builds. Some argued for deeper branching. Some argued for a “gauntlet” service that simulates official builds and blocks problem code submissions. All of these suggestions are awash in the seeping sewage of the flood—none of them address the root cause, and many would only pressurize the leak until it truly exploded in a tsunami of stifled stench.
Build breaks and other quality issues aren’t created or resolved by tools any more than the code is. Problems are created by people, and they are resolved by people. Misguided, well-meaning nitwits protest, “Of course people create and resolve problems, but tools can have a huge impact.” Heck yeah! Tools can have an enormous impact—they can make checkins slower, costlier, and less frequent; they can frustrate engineers to the point of leaving the project and company; and they can remove all creativity, agility, and pride from development until our code is a mindless mush molded to match the meaningless mechanisms of our monochromatic, masochistic machine.
Tools serve us, not the other way around. Before you suggest a tool, before you jump to a solution, before you make mayhem with mechanism, start with the human problem. What are people trying to accomplish? What’s getting in the way of success? What alternatives are available for the range of situations? Once you understand the true goals, then you can ask how tools might help. Don’t start with a tool. Don’t be a tool.
What are you trying to do?
The problem in this case is a bunch of build breaks. Actually, the problem is that you need your product to build or you can't deploy it or sell it, so you build it all the time (a best practice). When someone checks in code that breaks the build, no one on the team can retrieve the current code and build it. That slows the pace at which value, in the form of code enhancements, is added to your product.
What are you really trying to accomplish? You’re trying to maintain the pace of value being added to your built product. Bad code checkins slow that pace for the individual and the team.
He looks to me to make things right
What alternatives are available to avoid bad code checkins and maintain the pace of value being added to your built product?
- Block checkins that don’t build. This is known at Microsoft as a “gauntlet” system. Every checkin is built and tested by the system. Checkins are only submitted if they pass.
Pros: Build never breaks when the system is working properly.
Cons: Can significantly slow the pace of value being added to the product, and these systems often don't work properly, so build breaks still occur.
Gauntlet systems only succeed when their results match official build and test results. Of course, the gauntlet system can't be identical to the official build system (different queuing mechanism, different code signing, different build machines and environment, different publishing, and different performance optimizations). Maintaining identical results across separate systems isn't feasible, so gauntlet systems often don't work properly, in addition to adding hours to checkins to perform their validation.
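Stripped to its essence, a gauntlet gate is a simple predicate: build and test the changeset on the service, and submit only if both pass. Here's a minimal sketch in shell; `clean_build` and `run_tests` are stubs standing in for the real (and much slower) official-build simulation, and all names are illustrative, not an actual tool:

```shell
# Gauntlet-style gate: a checkin is submitted only if a clean build and
# the test suite both pass. The two functions below are stubs standing
# in for the real build and test steps of the gauntlet service.
clean_build() { return 0; }   # stub: pretend every changeset compiles
run_tests()  { case "$1" in *risky*) return 1 ;; *) return 0 ;; esac }

gauntlet_check() {
    changeset="$1"
    if clean_build "$changeset" && run_tests "$changeset"; then
        echo "SUBMIT $changeset"
    else
        echo "REJECT $changeset"
    fi
}
```

The hard part, as noted above, is keeping `clean_build` and `run_tests` faithful to the official build; every difference between the two systems is a chance for a false pass or a false reject.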
- Work in a private branch. Checkins are not submitted to the main branch. Instead, small groups of engineers work in private branches off the main branch. Sets of changes are integrated into the main branch only when they build properly on the private branch.
Pros: Bad checkins only slow down a small group of engineers.
Cons: Slows value being added because product changes must travel from the private branch to the main branch to reach the built product. The pain and disruption of build breaks becomes the pain and disruption of integrations. When many teams integrate their private branches during the same week (a common occurrence), conflicts and breaks are common.
- Work against the last known good (LKG) build. When a build succeeds, it is labeled as an LKG along with its source code. Instead of using the latest checkins, engineers sync to the LKG, and thus their builds are never broken.
Pros: The team can continue adding value to the product even when someone submits a bad checkin.
Cons: The longer the time between LKG builds, the more the LKG gets out of sync with the current code. This leads to all the problems you have with private branches, without the convenience of isolating them to a private branch. You can alleviate the problem somewhat by continuously creating LKGs with a rolling build, but when that rolling build breaks, you’re almost back to where you started.
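The sync rule behind the LKG approach fits in a few lines: engineers sync to the last labeled good build rather than the tip. A toy decision function, assuming the build system records a status for the tip and a revision ID for the LKG (statuses and revision IDs here are made up):

```shell
# pick_sync_point: choose what an engineer should sync to. If the tip
# of main built cleanly, the tip is (or is about to become) the LKG;
# if the tip is broken, fall back to the last known good revision.
pick_sync_point() {
    tip_status="$1"; tip_rev="$2"; lkg_rev="$3"
    if [ "$tip_status" = "good" ]; then
        echo "$tip_rev"
    else
        echo "$lkg_rev"
    fi
}
```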
- Validate work before checkin. Engineers perform their own validation, using build and test tools to check their work before checkin. Often this means having two enlistments—one for development and one for running clean builds. You can also build on a buddy’s machine or have some form of shared buddy build service.
Pros: Easy and catches issues at the source before they become team problems.
Cons: Trusts engineers to be diligent. Forces engineers to use their own machines for validation or a shared buddy build service, which can be slow and seriously deplete storage and CPU resources on work machines. Breaks inevitably get through.
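The two-enlistment habit in that last alternative can be scripted: mirror your changes into a clean enlistment and build there, so stale objects in your development enlistment can't hide a break. A sketch, assuming the build is driven by a `build.sh` at the enlistment root (the script name and layout are assumptions):

```shell
# precheckin: copy the working enlistment into a clean one and build
# there; the exit status is the clean build's verdict on the change.
precheckin() {
    src="$1"; clean="$2"
    cp -R "$src/." "$clean"          # mirror the change into the clean tree
    ( cd "$clean" && sh build.sh )   # build.sh stands in for your build entry
}
```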
Each alternative has its strengths and weaknesses. Which one is right? Ah, we haven’t considered the range of situations.
The complexity of your build system itself could be causing breaks. Many of these systems evolved slowly over time and haven’t received the engineering rigor we take for granted with modern production software. Investing in build systems isn’t sexy work, but it has a huge force multiplier when every engineer gains an hour or more of productivity per day.
You got something for me?
There are three general categories of changes that can cause build breaks. Each has a different risk profile.
- Trivial code or resource change. The checkin alters minor logic or resources, like a change to an existing string or icon. (Adding or deleting a resource is more significant.) While these changes can lead to build breaks, the occurrence is quite rare. Yet often these changes are valuable to the business or customer—you want them added quickly. A simple build should do. Waiting for a buddy build, let alone a gauntlet pass or an integration period, seems disproportionate to the risk. People hate disproportionate risk responses—engineers especially.
- Isolated change confined to a known area. The checkin alters code within a relatively small and well-defined scope, like refactoring a single class while leaving its interface the same. These changes can break builds, but rarely beyond their defined scope. Local builds and testing should be sufficient to catch a break.
- Interface or behavioral change to shared components. The checkin impacts written or unwritten contracts between components. These changes often lead to build breaks, impacting the entire project. Great care must be taken to perform clean builds on buddy machines for all build configurations followed by a broad spectrum of build verification tests.
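The three categories above can be approximated mechanically from the paths a checkin touches. A rough sketch follows; the path patterns are purely illustrative (your tree's conventions will differ), and the real judgment call still belongs to the engineer:

```shell
# classify_change: map a checkin's changed files to the lightest
# validation level that covers its risk. Patterns are examples only.
classify_change() {
    level="simple-build"                     # trivial until proven otherwise
    for f in "$@"; do
        case "$f" in
            */public/*|*.idl|*.h)            # shared interface: highest risk
                level="buddy-build"; break ;;
            *.rc|*.resx|*.png)               # resource-only change: stays trivial
                ;;
            *)                               # ordinary code: isolated change
                level="local-build-and-test" ;;
        esac
    done
    echo "$level"
}
```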
A gauntlet system can protect you from all three categories, but it’s only worth it for the last category, and it’s prone to failure. A private branch helps the last two categories, but it’s overkill for the first category and just moves the pain around. LKG builds and trusting engineers to be diligent work for all three categories, but will let build breaks through.
Wow, what do you do? Oh wait, that’s right—it’s a people problem, not a tool problem.
Power to the people
Tools can help, but you start with people. Too much reliance on tools quickly makes them a crutch, causing people to shut off their brains and hit the button instead of applying discretion.
How do you best avoid bad code checkins and maintain the pace of value being added to your built product? Allow and expect people to use engineering discretion.
- For a trivial change, let them check code in after a simple build check.
- For an isolated change, give them solid local build and testing tools to run, and then let them check in.
- For changes to shared components, give them a buddy build service with full build verification checks to run before checkin.
- For a nice combination of protection and speed, use an LKG system on a private branch with a fast rolling build that integrates every single good build into the main branch.
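The rolling-build half of that last combination is just a loop: build each new changeset, and advance the LKG marker only on success. A minimal sketch, where `build_rev` is a stub standing in for your real build and `.lkg` is an assumed marker file:

```shell
# Minimal rolling-build step: after processing revision $1, advance
# the recorded last-known-good only if the build succeeded.
build_rev() {                 # stub: revisions ending in "bad" fail to build
    case "$1" in *bad) return 1 ;; *) return 0 ;; esac
}

update_lkg() {
    rev="$1"
    if build_rev "$rev"; then
        echo "$rev" > .lkg    # this revision is the new LKG
        echo "LKG advanced to $rev"
    else
        echo "build broke at $rev; LKG stays at $(cat .lkg)"
    fi
}
```

Notice that a broken build never moves the marker, so the team keeps working against the previous good revision while the break is fixed.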
How do you keep people from breaking the LKG? You don’t. As long as breaks are rare, the pace of value being added to your built product will be high. So you trust engineers to be professionals and to apply the appropriate level of verification for their changes.
When you make the impact and choices clear to people, and you publicly expect and trust them to make the right decisions, they feel empowered and responsible for their work. The right example is set and folks fall in line—for each other as teammates. The risk is minimized, build breaks are rare, and value to customers is maximized. That’s what happens when people are the solution.
Many teams highlight the trust placed with engineers by conferring a lighthearted, embarrassing token to folks who fall short of earning that trust. My old team used a big stuffed bear called "Buster, the Build Break Bear." The last person to leave the build broken for an hour had to keep Buster on display in his office. As long as it's not mean-spirited, this kind of token is an effective reminder.
For enormous teams, the sheer number of engineers causes even rare events to become common. To protect the LKG, you’ll need separate private branches and LKGs for each large subgroup. You’ll want rolling build integrations in both directions to keep the private branches in sync.
How large should subgroups be? You want as few private branches as possible because each extra branch slows down code movement. So you want subgroups as large as you can make them and still have only a few LKG breaks a week (100 to 500 people each has been my experience).
The whole tone of this post is about trusting engineers to be professionals, yet you completely undermine this with "lighthearted embarrassment". Do surgeons give each other stuffed toys when an operation results in an avoidable complication? Do lawyers make their peers wear badges when they lose cases they should have won? No, of course not. Those professionals learn from their mistakes and move on, no embarrassment required.
My thought, Andrew, is that people are not all motivated by the same thing, and every place is different. Some, more than others, will be motivated by not wanting to own "Buster". When I worked at a remote EDS account 20 years ago, there was a "Golden Screw-up Award" that belonged to the last person who had screwed up. You couldn't wait to get that off your desk. For some places, that person buys donuts. One division where I work has the person make a DQ run. We're all human. My guess is that team members breaking the build multiple times in a short time get a good "talking to". Some professions are different than others. A screw-up in the operating room is different than a code screw-up. We're not talking life-or-death or innocent vs. guilty here.
techvet2: It depends what your code runs. If you put a bug into a piece of software that controls anything in an industry that answers to a regulatory body (casino, healthcare, military, mining, energy, etc.), then you may be endangering your job, and possibly other people's jobs or lives. Having worked in such an environment, I can tell you it's actually a big deal to write terrible code. Failure to ship a product on time brings heavy fines, or your company simply doesn't get paid until it works perfectly.
From my experience there is often a deeper architectural cause behind build problems and the complexity of build/checkin processes: poor modularization. We tend to have too much interdependency across components.
Often you will find portions of the product build tree where the results from "external partners" are placed. In many cases it should be possible to follow a similar model with "internal partners" that build sufficiently independent modules. We could get the advantages of working in a private branch (breaks affect a small number of engineers) with a smaller impact on agility (avoiding grueling integrations).
There will always be some people working on infrastructure and frameworks that others depend on, and you may still need to pay the "tax" of a wttcheck/gauntlet system for their portions of the tree, but the impact is lower when paid by a small number of people. Also, the ability to write a good framework used by many other team members is typically rewarded with a higher seniority level, so there are counterbalances available to alleviate the morale impact of the "extra tax".
Better modularization would also help in maintenance and support. Patches would have a better probability of affecting a single file (lower probability of requiring reboots, lower patch time, smaller downloads, etc.).
Note that after modularizing components, the build processes need to be adjusted accordingly. It is sad how in some situations teams need to follow unnecessarily complicated build processes because "that's how it is done at Microsoft." (Don't get me wrong, we do have many good processes; "the Microsoft way" is often a good way, evolved from the thinking of many highly intelligent individuals. But that doesn't mean following processes blindly, without questioning whether they apply to the current situation or how to improve them.)
In my opinion we need more investment in modularizing our component architectures. Depending on the team's current solution, it would pay off in different ways: the supportability advantages mentioned above, greater agility, less downtime from breaks, and so on.