How can you tell if you’re a smart engineer? What separates people who go through the motions from those who really get it? Second-order effects. Anyone taking an introductory software class can learn what changing a line of code does to a function, but engineers who really understand programming will see the cascading impact of that change up and down the stack. It’s true in all fields: first-order effects are easily surmised and explained with simple linear relationships, but second-order effects are subtle, nonlinear, and far more difficult to master and control. That’s why I ask job applicants to answer only coding questions that require nested loops or multiple call levels to solve.
Second-order effects are at the heart of Microsoft’s current challenges. We want to increase our agility and responsiveness to customers and the market, yet many of our current tools, techniques, and teams are built around shipping every year, not every day. I’ve written about teams “Staying small” and using fast iterations to make “Data-driven decisions.” While numerous teams, particularly those providing online services, are meeting or exceeding the pace of the market, a large portion of Microsoft still trudges along.
What’s holding us back? In part, a legacy attitude and approach sustained by legacy leadership. But at the heart of the problem is a second-order effect that forces our tools, techniques, and teams to always operate at enormous scale, which slows us down and inhibits our ability to compete. That second-order effect is diamond dependency, and until every team tackles it directly, our agility will be anchored to an anachronism.
At one time, customer-centered design was at the heart of Microsoft’s challenges, but we’ve gotten far better in that regard. It’s still a work in progress, but our end-to-end experiences often surpass those of our competitors and surprise our critics.
What’s this now?
A diamond dependency refers to a software component (D) that depends on two components (B and C), which in turn each depend on the same fourth component (A). Changes to component A create a second-order effect on component D through interactions with components B and C. This second-order effect is subtle. Some engineers don’t realize when a diamond dependency exists, even when they are the ones who inadvertently created it.
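Because the diamond is easy to create without noticing, it can help to look for the shape mechanically. The following is a minimal sketch, assuming a hypothetical `DEPENDS_ON` map that lists each component's direct dependencies; the map and function names are illustrative and do not come from any real build tool.

```python
from itertools import combinations

# Hypothetical dependency map: each component lists what it depends on directly.
DEPENDS_ON = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
}

def transitive_deps(component, graph):
    """Return every component reachable from `component` (excluding itself)."""
    seen = set()
    stack = list(graph[component])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph[dep])
    return seen

def find_diamonds(graph):
    """Return (top, left, right, shared) tuples where `top` depends directly on
    `left` and `right`, and both of those reach the same `shared` component."""
    diamonds = []
    for top, direct in graph.items():
        for left, right in combinations(sorted(direct), 2):
            shared = transitive_deps(left, graph) & transitive_deps(right, graph)
            for bottom in sorted(shared):
                diamonds.append((top, left, right, bottom))
    return diamonds

print(find_diamonds(DEPENDS_ON))  # [('D', 'B', 'C', 'A')]
```

The check looks at transitive dependencies, so it also flags diamonds where the shared component sits several layers below B and C.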
However, diamond dependencies dramatically decelerate developer dynamism. If the developers of component D want to use a new version of B that depends on a new version of A, they must also wait for a new version of C that works with the new A. The diamond dependency indirectly couples components B and C, creating a bottleneck that slows everyone down.
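The bottleneck can be made concrete with a small sketch. It assumes a hypothetical table, `REQUIRES_A`, recording which version of A each release of B and C needs, and it assumes only one version of A can exist in a build. The version numbers are invented for illustration.

```python
# Hypothetical releases: (component, version) -> version of A it requires.
REQUIRES_A = {
    ("B", 1): 1,
    ("B", 2): 2,   # the new B needs the new A
    ("C", 1): 1,   # C has not yet shipped a release that works with A v2
}

def buildable_combinations(requires_a):
    """List (b_version, c_version, a_version) choices D can build against
    when only one version of A may exist in the build."""
    combos = []
    for (comp_b, b_ver), a_for_b in sorted(requires_a.items()):
        if comp_b != "B":
            continue
        for (comp_c, c_ver), a_for_c in sorted(requires_a.items()):
            if comp_c == "C" and a_for_b == a_for_c:
                combos.append((b_ver, c_ver, a_for_b))
    return combos

print(buildable_combinations(REQUIRES_A))  # [(1, 1, 1)]

# Once C ships a release that works with A v2, D can finally take B v2.
REQUIRES_A[("C", 2)] = 2
print(buildable_combinations(REQUIRES_A))  # [(1, 1, 1), (2, 2, 2)]
```

Until the C team ships, the team that owns D is stuck on the old B, even though nothing in D or B is actually broken.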
Then again, diamond dependencies are very common and often desirable. They result directly from shared code and shared libraries. Unfortunately, to keep dynamic diamond dependencies in sync, teams must use enormous builds supported by huge source control repositories. Without those big builds and repositories, changes to components A, B, C, and D would inevitably break each other and render the overall codebase useless. Enormous builds and huge repositories might slow our agility, but that’s the only way for us to have reliable builds, right?
Welcome to the 21st century
To maintain agility and isolation of individual components in the face of diamond dependencies, modern web services have multiple versions of each component service running in production at the same time. On a PC, this would be like multiple versions of the same DLL being loaded and active simultaneously.
Running multiple versions of components simultaneously breaks the indirect coupling of components B and C. Component D can call into a new version of B that uses a new version of A, while still calling an old version of C that uses a simultaneously running old version of A. With this approach, modern web services gain the agility of isolated components with fast builds and small repositories—the problem is seemingly solved.
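As a rough illustration of side-by-side versions, here is an in-process sketch in which a dictionary keyed by (name, version) stands in for a service registry. The `register` and `call` helpers and the behavior of each version are assumptions made up for the example; a real service would route over the network.

```python
# Hypothetical in-process stand-in for a service registry keyed by (name, version).
SERVICES = {}

def register(name, version):
    """Decorator that publishes a function as one version of a component."""
    def wrap(fn):
        SERVICES[(name, version)] = fn
        return fn
    return wrap

def call(name, version, *args):
    """Invoke a specific version of a component."""
    return SERVICES[(name, version)](*args)

@register("A", 1)
def a_v1(text):
    return text.upper()

@register("A", 2)
def a_v2(text):
    return text.upper().strip()   # new behavior: also trims whitespace

@register("B", 2)
def b_v2(text):
    return "B2(" + call("A", 2, text) + ")"   # new B pinned to new A

@register("C", 1)
def c_v1(text):
    return "C1(" + call("A", 1, text) + ")"   # old C pinned to old A

@register("D", 1)
def d_v1(text):
    return call("B", 2, text) + " " + call("C", 1, text)

print(call("D", 1, " hi "))  # B2(HI) C1( HI )
```

Both versions of A answer within the same request, so D picks up the new B without waiting on C.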
Unfortunately, the two versions of component A running simultaneously share resources (such as memory, network connections, and storage). This is a serious problem in the world of web services. To deal with it, each service stays backward compatible in how it uses those resources across all currently active versions. Breaking changes to things like database schemas are managed over multiple service releases to ensure no service disruption or data corruption. (Details in “Production is a mixed blessing.”)
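One common way to spread a breaking schema change over several releases is to have the new code read both shapes and write both shapes until the old versions are gone. The sketch below assumes a hypothetical record whose single `name` field is being split into `first_name` and `last_name`; the field names and functions are illustrative only.

```python
def read_display_name(record):
    """Tolerant reader: accept both the old and the new record shape."""
    if "first_name" in record:                      # new shape
        return f"{record['first_name']} {record['last_name']}".strip()
    return record["name"]                           # old shape

def write_transitional(first_name, last_name):
    """Writer used while old and new versions are both live: emits both shapes."""
    return {
        "name": f"{first_name} {last_name}",        # still readable by old code
        "first_name": first_name,
        "last_name": last_name,
    }

def old_reader(record):
    """What a still-running older version does: it only knows `name`."""
    return record["name"]

old_record = {"name": "Ada Lovelace"}
new_record = write_transitional("Ada", "Lovelace")

# New code handles both shapes, and old code can still read what new code wrote.
print(read_display_name(old_record))
print(read_display_name(new_record))
print(old_reader(new_record))
```

Only after every old version has been retired would a later release stop writing the `name` field.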
The agility you get working on modern web services, with multiple versions of components running simultaneously, is worth the backward compatibility you must maintain. As a bonus, the ability to run multiple versions simultaneously allows you to update services without rebooting or disrupting active usage.
Some major companies that run modern web services, like Google and Facebook, still choose to use huge repositories. With one big repository, it’s easier to enlist, refactor large codebases, and search shared code.
My reality and yours
At Microsoft, we’ve maintained backward compatibility for decades to support our customers. But would we really do that at the DLL level within the Windows kernel? Having different versions of components running simultaneously on the same PC would cause all kinds of resource contention, as is the case with web services, but with far greater frequency and variety. On the other hand, if we don’t allow multiple versions of components to run simultaneously, we’re stuck keeping all components in lockstep with each other, meaning enormous builds and reduced agility.
The key is choosing the right size for components. Allowing every kernel DLL to run multiple versions at the same time is nuts, but there are various ways to have multiple kernels appear to run simultaneously. (We’ve used virtualization, shimming, emulation, and sandboxing to have middleware and apps run on seemingly different kernels on the same machine.) Kernels by themselves build much faster than entire operating systems, and likewise for middleware and apps. This doesn’t provide isolation and agility down to the feature team level, but we still get dramatic improvement.
Web services typically limit the number of active versions of any particular service to only a few, track which clients are using old versions, and even enforce a deadline for when those clients must upgrade. It’s possible to do something similar for kernels, middleware, and apps, though missing the deadline may only mean running a bit slower due to greater emulation.
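The bookkeeping described above (a cap on active versions, a record of who is calling what, and an upgrade deadline) might look something like this sketch. The `VersionPolicy` class, its method names, and the dates are all hypothetical; real services would enforce this in their routing and telemetry layers.

```python
from datetime import date

class VersionPolicy:
    """Hypothetical bookkeeping for one service's active versions."""

    def __init__(self, max_active=2):
        self.max_active = max_active
        self.retire_by = {}   # version -> deadline date, or None for the current version
        self.clients = {}     # client name -> version it last called

    def release(self, version, retire_previous_by=None):
        """Make `version` current and give the previous one a deadline."""
        if len(self.retire_by) >= self.max_active:
            raise RuntimeError("too many active versions; retire one first")
        for old in list(self.retire_by):
            if self.retire_by[old] is None:
                self.retire_by[old] = retire_previous_by
        self.retire_by[version] = None

    def retire(self, version):
        """Stop serving `version` entirely."""
        del self.retire_by[version]

    def record_call(self, client, version):
        self.clients[client] = version

    def overdue_clients(self, today):
        """Clients on a retired version or on one whose deadline has passed."""
        overdue = []
        for client, version in self.clients.items():
            if version not in self.retire_by:
                overdue.append(client)
            else:
                deadline = self.retire_by[version]
                if deadline is not None and deadline < today:
                    overdue.append(client)
        return sorted(overdue)

policy = VersionPolicy(max_active=2)
policy.release(1)
policy.release(2, retire_previous_by=date(2024, 1, 31))
policy.record_call("mail", 1)
policy.record_call("calendar", 2)

print(policy.overdue_clients(date(2024, 1, 15)))  # []
print(policy.overdue_clients(date(2024, 2, 1)))   # ['mail']
```

What happens to an overdue client is a policy choice. For kernels, middleware, and apps, the consequence suggested above is heavier emulation rather than an outright failure.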
Breaking our large codebases into kernels, middleware, and apps, and allowing different versions of them to run simultaneously will require a great deal of work—it won’t happen overnight. Yes, it’s easier than doing this at the individual DLL level, but it’s still a huge effort. Even getting these pieces to build separately is an enormous task.
However, if we ever hope to increase our agility and responsiveness to customers, and compete effectively within the modern marketplace, we must take action today. Fortunately, people working on Windows and Office have already made substantial progress separating out major components and moving toward greater independence and reuse. Even so, they can’t defeat diamond dependencies alone. You must do your part.
To successfully separate our kernels, middleware, and apps from each other, every engineer must live and breathe backward compatibility and boundaries. Don’t take arbitrary dependencies across boundaries, don’t invent new APIs for expediency (aka laziness), and don’t break file formats or communication protocols on a whim. Instead, stick within your established boundaries, invoke or enhance supported interfaces, and take great care in altering how shared resources are used between versions. With all of us working together, we can show the world how smart we are and how quickly we can regain dominance in the marketplace.