As much as I love Microsoft®, and as many advantages as we have as a company in the intelligence of our people, the breadth of our products, and the boldness of our vision, there are times when people here are frigging clueless. It’s not everyone—Microsoft is a wildly diverse company. But there’s just enough ignorance to drive you insane.
A great example of nerve-racking naiveté surrounds our service environments. My current team has separate environments for development, check-in testing, scenario testing, stress testing, cross-division integration, partner integration, certification, and production. That’s eight different environments—and we’re planning to build out a preproduction environment next year. Here’s the punch line of this pathetic joke: even with all these environments, there are a ton of issues and scenarios that can only be exposed in production.
Why are we being so witlessly wasteful? Because we can afford it (good situation but bad reason), and because there are so many old-school enterprise engineers who don’t understand the most basic truth about services: there’s no place like production. These engineers conjure requirements for testing and integration environments based on hard-won lessons from business software, yet they fail to fathom their folly. Close your eyes, tap Ctrl-Alt-Delete together three times, and think to yourself, “There’s no place like production. There’s no place like production. There’s no place like production.”
How did I get here?
What kind of fools build out and maintain useless environments? The kind who got burned building enterprise software.
Large businesses rely on enterprise software—it’s got to work or they won’t buy it. Once they buy it, they own it. You don’t get to fix enterprise software anytime you want. That’s right, not even with security patches.
Remember, enterprise paychecks depend on having the software run smoothly. Software changes represent risk to an enterprise business. If the software doesn’t work, work well, and continue working well, enterprise businesses aren’t buying it. And they’ll tell you when they are darn well ready to accept a patch.
An entire generation of Microsoft engineers learned the hard way that you can’t release software until the code is fully tested. There are no “retries” in enterprise software.
Enterprise engineers heave at the thought of releasing code that hasn’t been fully vetted into production environments. They’d burst into convulsions if they understood the real truth about services.
Surely, you can’t be serious?
What is the real truth about software services? There’s no place like production.
Let’s break down these myths about testing and integration environments one at a time.
§ If your check-in tests pass in one environment, they’ll pass in all environments. Okay, that one obviously is wrong, but here’s what’s worse. It’s not difficult to write critical check-in tests that pass in production, but fail everywhere else (like tests of broad fan-out or database mirroring). Instead of kidding yourself, write a small set of automated sanity checks that developers can run quickly in their development environment before they check in.
§ You need a separate environment to test scenarios before integrating code with partners. There are two reasons people believe this—they don’t want unstable code to break their partners, and they don’t want their partners’ unstable code to block testing. The first reason is perfectly rational—you need a test environment to do preliminary acceptance and stress testing, especially for critical components. The second reason is laughable—like your partners are actually going to maintain your test environment in some working state. They won’t. They can’t. (More below.)
§ You can’t use production for stress testing. Why not? Are you worried production will fall over? Wouldn’t you want to know? Isn’t that the whole point? Wouldn’t it be great to watch that happen in a controlled way and back off as needed? Hello?
§ You need integration environments to check cross-division scenarios prior to release and provide preproduction access to external partners. Assume cross-division scenarios worked perfectly prior to production. Assume external partners signed off in a separate environment before release. Do you now have quality assurance? No. None. Scenarios don’t work the same in production, where there are more machines, different load conditions, different routing and load balancing, different configurations, different settings, different data and certificates, different OS setups and patches, different networking, and different hardware. You’ll catch some integration issues, but not enough to make this enormous expense worthwhile.
Does a virtual cloud environment, like Azure®, take care of these issues? No, it only resolves the different OS setups and patches and different hardware. It helps a bit with the other issues, but only production is production.
§ You need a protected certification environment. Why do you certify products in advance? Because you want to ensure they’ll work in production. Oh wait.
Let’s recap. There’s no place like production. You need a development environment to run a small set of automated check-in tests, a test environment to run preliminary acceptance and stress testing to help avoid catastrophic failures, and production. Anything more is superfluous.
It’s nice for your partners to provide “one-boxes” for you to use with your dev and test environments. One-boxes are preconfigured virtual machines that run the services you depend upon in a compressed image. Of course, one-boxes are nothing like production.
Then it’s hopeless
“Wait a minute! We can’t throw untested code at customers. They’ll plotz! And don’t get me started about exposing prerelease, uncertified, partner code. Have you lost your mind?!” Shut up and grow up. There’s no place like production. The problem becomes configuring production to permit the testing and certifying of prerelease code.
The solution is called “continuous deployment.” The concept is simple: deploy multiple builds to production, and use custom routing to direct traffic as desired. It’s like a source control system for regulating services instead of source files. That it isn’t built into Azure and other cloud systems is inconceivable.
There are a variety of approaches to continuous deployment, which differ mainly in the sophistication of the deployment system and the custom routing. However, continuous deployment can be quite simple to achieve.
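To make the idea concrete, here is a minimal Python sketch of the routing half of continuous deployment. It is an illustration under stated assumptions, not any team’s actual system: the class `ExposureControl`, its methods, the build names, and the notion of “caller groups” are all hypothetical. It shows only that several builds can sit in production at once while routing rules decide who sees which.

```python
from dataclasses import dataclass, field


@dataclass
class ExposureControl:
    """Toy exposure control: several builds are live; routing picks one per request."""

    release: str                                       # build ordinary traffic sees
    builds: set = field(default_factory=set)           # every build deployed to production
    group_routes: dict = field(default_factory=dict)   # hypothetical: caller group -> build

    def __post_init__(self):
        self.builds.add(self.release)

    def deploy(self, build):
        # A freshly deployed build gets no traffic until a route points at it.
        self.builds.add(build)

    def route_group(self, group, build):
        if build not in self.builds:
            raise ValueError(f"{build} is not deployed")
        self.group_routes[group] = build

    def pick_build(self, caller_groups):
        # First matching group wins; everyone else sees the current release.
        for group in caller_groups:
            if group in self.group_routes:
                return self.group_routes[group]
        return self.release


control = ExposureControl(release="build-41")
control.deploy("build-42")
control.route_group("engineering", "build-42")

engineer_build = control.pick_build(["engineering"])   # "build-42"
customer_build = control.pick_build(["customer"])      # "build-41"
```

A real system would also have to deploy the bits, drain connections, and persist the routing table. The sketch leaves all of that out on purpose.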
The toughest part is dealing with data, which must function properly across multiple builds. However, if a service is designed to handle at least one rollback after a new build is deployed, even if that new build introduces new data, then that service will function well in a continuous deployment environment.
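One common way to survive a rollback is to have every build tolerate fields it doesn’t know and supply defaults for fields that are missing. The Python sketch below assumes a JSON profile record with a `name` field and a newer build that adds `locale`. The record shape, the field names, and the function names are invented for illustration; the essay doesn’t prescribe any particular data format.

```python
import json


def load_profile_v1(raw):
    """Older build: understands only 'name', but keeps fields it does not know."""
    record = json.loads(raw)
    extra = {key: value for key, value in record.items() if key != "name"}
    return {"name": record["name"], "_extra": extra}


def save_profile_v1(profile):
    # Write unknown fields back untouched so a newer build's data survives a rollback.
    return json.dumps({"name": profile["name"], **profile["_extra"]}, sort_keys=True)


def load_profile_v2(raw):
    """Newer build: adds 'locale', defaulting it for records written before it existed."""
    record = json.loads(raw)
    return {"name": record["name"], "locale": record.get("locale", "en-US")}


# The new build writes a record, then traffic is rolled back to the old build.
written_by_v2 = json.dumps({"name": "Ada", "locale": "fr-FR"})
profile = load_profile_v1(written_by_v2)
profile["name"] = "Ada L."
rewritten_by_v1 = save_profile_v1(profile)

# Traffic moves forward again; the new build still finds its data intact.
after_roll_forward = load_profile_v2(rewritten_by_v1)
```

The old build never interprets `locale`. It just carries it along, which is what lets the two builds share data while traffic moves back and forth between them.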
You also need to worry about variations of settings across builds. This is a little tricky, but not too bad. Hopefully, your settings aren’t changing all the time.
If new builds depend on new versions of the .NET® framework or operating system, those have to be hosted on new machines—just as you’d have to do without continuous deployment.
How do I work this?
How can you use continuous deployment for integration testing, partner testing, stress testing, and certification? Let’s run through those.
§ Integration testing. You deploy your new build to production, but set the custom routing to direct traffic only from your engineering team to the build. (The default is no routing at all.) The rest of the world continues to see your last release. This technique is called “exposure control.” Now, your team can test against real production with real production data and real production load using a build not exposed to customers.
You’ll need good diagnostics to analyze any failures you see in production. That’s true with or without continuous deployment.
§ Partner testing. Partners deploy their new builds to production, but set the custom routing to direct traffic only from their engineering teams to their builds. The rest of the world sees no change. Now, partners can test against your production services without anyone seeing their new work, including their competitors.
§ Stress testing. You deploy your new build to production and test it out. Once verified, you use exposure control to increase the live traffic to your new build by increments—first one percent, then three percent, then 10 percent, then 30 percent, then 100 percent. You monitor service health throughout the process. If your services ever show signs of trouble, you capture the data and route traffic back to your last release (instant rollback).
§ Certification. Partners deploy their new builds to production and test them. Once verified, they use exposure control to direct the certification team to their new builds. The certification team certifies their builds in production, before customers or competitors see their new work. Once certified, partners can choose when to direct live traffic to their new builds.
§ Beta bonus! You deploy a beta build. Once verified, you use exposure control to direct beta users to the beta build.
§ Experimentation bonus! You deploy a variation of your current build. Once verified, you use exposure control to direct half the live traffic to your current build and half to the new variation. You utilize the differences you see in usage patterns to improve your services.
§ Auto-rollback bonus! After you direct all live traffic to your new build, you leave the previous release in place. You connect your health checks to the exposure control. Now if your health checks ever indicate trouble, your exposure control automatically and near instantly redirects traffic back to your previous release—day or night.
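The stress testing and auto-rollback scenarios above can be sketched in a few lines of Python. This is a rough illustration, not a production design. The hooks `set_exposure` and `is_healthy` are hypothetical stand-ins for whatever your routing layer and health checks provide, and hashing a user ID into a bucket is just one assumed way to split traffic by percentage. A real ramp would also wait and watch between steps, and would capture diagnostics before rolling back.

```python
import hashlib

RAMP_PERCENTAGES = [1, 3, 10, 30, 100]  # the increments named in the text


def in_new_build(user_id, percent):
    """Stable bucketing: a given user lands in the same bucket on every request."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def ramp_up(set_exposure, is_healthy, steps=RAMP_PERCENTAGES):
    """Raise live traffic to the new build step by step; roll back on trouble.

    Returns the percentage of live traffic left on the new build.
    """
    for percent in steps:
        set_exposure(percent)
        if not is_healthy():
            set_exposure(0)  # instant rollback: all traffic returns to the last release
            return 0
    return steps[-1]


# Simulated run: health checks pass at 1% and 3%, then fail at 10%.
history = []
health_results = iter([True, True, False])
final_percent = ramp_up(history.append, lambda: next(health_results))
# history is now [1, 3, 10, 0] and final_percent is 0
```

Calling `in_new_build(user_id, 50)` gives the half-and-half split from the experimentation bonus. Wiring `is_healthy` to live monitoring after the ramp reaches 100 percent gives the auto-rollback bonus.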
We’re not in Kansas anymore
Microsoft engineers learned a great deal from our move into enterprise software a decade ago. Unfortunately, those lessons have misdirected our recent service efforts, driving us to build out extraneous environments in the name of service quality.
Maintaining extraneous environments drains our bandwidth, power, and hardware budgets, and dramatically burdens our engineers, without providing real quality assurance. This needs to stop, and thankfully it is stopping as teams adopt continuous deployment.
With continuous deployment, you get service quality without the added costs. You also bag a bunch of bonus benefits to help you improve your services and better serve your internal and external partners.
There was a time when software development was done without source control systems. Now such a notion is not only laughable, it’s unconscionable. Continuous deployment provides a similar capability for services. Hopefully, we’ll soon look back and wonder how anyone ever worked without it.
Currently, Bing and the Ads Platform have the only production implementations of continuous deployment I’m aware of at Microsoft. Amazon has one of the best-known systems in the industry.
My team is currently building a very simple form of continuous deployment. It uses an on-machine IIS proxy to provide exposure control to multiple versions of the same roles on the same machine.
From the perspective of the engineering team, we still deploy the same roles to the same machines as we always have. The difference is that those machines now host multiple versions of the roles, with exposure control directing the traffic we want to the version we want. Sweet!