Broken Windows: How Bad Software Releases Happen to Good Teams

One possible explanation for how bad releases can happen even amongst effective software development teams

One of my primary responsibilities with the Akka.NET project is release manager - I put together the release notes, press the big green button when we’re ready to deploy, and make sure that each contributor signs off on the release.

The thing I take most seriously about my job is quality control - trying to ensure that no release ever does any of the following:

  1. Introduces a breaking change to a public interface;
  2. Introduces a game-changing bug that forces users to roll back to a previous version;
  3. Causes a major degradation in performance or stability; or
  4. Significantly alters the behavior of a component in a way that departs from its previous behavior without giving users sufficient advance notice.

Unfortunately, within the past few months we’ve had all of the above happen at least once each. Akka.NET is an open source project that has seen a rapid increase in adoption, contribution, and deployment over the past six months in particular (since we released 1.0), so these issues aren’t unexpected; they’re growing pains.

However, I’ve observed what the root cause of this particular set of growing pains appears to be: broken windows theory at work.

Broken windows theory

The broken windows theory is a criminological theory that essentially amounts to this: if you tolerate lesser crimes such as vandalism and public drinking, you create a vicious cycle that results in social decay and an increase in more serious crimes such as robbery and theft.

Applied to software development, broken windows theory amounts to neglect that accelerates code rot within a codebase. If you tolerate some failing unit tests, how long do you go before someone puts a massive bug in that area of the codebase? Not nearly as long as you would if you refused to tolerate gaps in code coverage and test failures.

To my delight, I discovered when researching this post that Jeff Atwood covered the application of broken window theory to software ten years ago:

Programming is insanely detail oriented, and perhaps this is why: if you’re not on top of the details, the perception is that things are out of control, and it’s only a matter of time before your project spins out of control.

In the case of Akka.NET, our failure to achieve our release goals on multiple releases came about as the sum of the following root causes:

  1. No automation, standardization, or historical reference for the project’s performance over time;
  2. Asynchronous code is inherently more difficult to test than synchronous code; there were a small number of Heisenbugs which showed up as random test failures only on our build servers (crappy Azure boxes) and never on our high-end development machines. As a result we got conditioned to ignore some of those tests and dismiss the results as related to the CPU-sharing going on inside Azure. As it turns out, these tests revealed real faults in our code that eventually would show up under higher loads.
  3. No automation for testing the binary compatibility of Akka.NET (breaking .DLL changes).

None of these issues is that significant in and of itself, until we see what they led to:

  1. Overlooking potential issues in tests that failed;
  2. Relying on manually-run benchmarks and examples to gauge the performance of our software between releases; and
  3. Inadvertently accepting contributions from well-intentioned contributors who made hard-breaking changes to APIs in ways which were not obvious (to us), such as adding an overload to an extension method.
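
The third item is the kind of mistake a tool can catch more reliably than a reviewer. As a rough illustration (written in Python for brevity, not the tooling Akka.NET uses), here is a minimal sketch of an API diff. It assumes each release’s public surface has already been dumped to a set of signature strings; that snapshot format and every member name below are hypothetical. Removed signatures are reported as breaking, and new overloads of an existing member are flagged for human review, since those are the additions that look harmless but can change which method a caller binds to.

```python
def diff_api(old_signatures, new_signatures):
    """Compare two snapshots of a public API surface.

    Each snapshot is an iterable of strings shaped like
    'Type.Member(ParamType, ParamType)'. This format is an
    assumption made for the sketch, not a real tool's output.
    """
    old, new = set(old_signatures), set(new_signatures)

    # Anything present before and missing now breaks existing callers.
    removed = sorted(old - new)

    # Member names (text before the parameter list) that already existed.
    old_names = {sig.split("(", 1)[0] for sig in old}

    # New signatures that reuse an existing name are new overloads.
    # They are additive on paper but deserve a manual look.
    new_overloads = sorted(
        sig for sig in (new - old) if sig.split("(", 1)[0] in old_names
    )
    return {"breaking": removed, "review": new_overloads}


# Hypothetical snapshots of two releases.
previous_release = {
    "Mailer.Connect(string)",
    "MailerExtensions.Send(IMailer, string)",
}
next_release = {
    "Mailer.Connect(string, int)",
    "MailerExtensions.Send(IMailer, string)",
    "MailerExtensions.Send(IMailer, string, int)",
    "Mailer.Close()",
}

report = diff_api(previous_release, next_release)
```

Run as part of a release checklist, a non-empty `breaking` list would block the release and a non-empty `review` list would prompt a closer look at the pull request that introduced it.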

These were all small details early on in the project, when we had bigger concerns like actually shipping a stable version of our core modules. But as the project grew and these broken windows were left unattended, the project rotted in core areas that had a real impact on our users.

However, the users and contributors of Akka.NET are on top of their game so we didn’t let things stay this way for long.

We added NBench, an NUnit-style performance testing framework for .NET, to bring stress testing and performance testing to all of Akka.NET’s builds. This eliminates many of these issues: it spots performance regressions and makes it easier to reproduce Heisenbugs. We’re working on adding an API diffing tool to our release process to help prevent issues related to breaking changes. And we’re experimenting with changes to our review process to make it easier to catch potential bugs and issues sooner.
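
To make the "historical reference" idea concrete, here is a minimal sketch of a performance gate, again in Python for brevity. It is not NBench’s API; the function name, the benchmark names, the throughput numbers, and the 10% tolerance are all assumptions chosen for illustration. The idea is simply to compare each benchmark’s current throughput against the median of past releases and flag any drop beyond the tolerance.

```python
from statistics import median


def find_regressions(history, current, tolerance=0.10):
    """Flag benchmarks whose throughput dropped versus their history.

    history: {benchmark name: [ops/sec recorded for past releases]}
    current: {benchmark name: ops/sec measured for this build}
    tolerance: fractional drop allowed before flagging (assumed 10%)
    Returns {benchmark name: fractional drop} for flagged benchmarks.
    """
    regressions = {}
    for name, past_runs in history.items():
        if name not in current or not past_runs:
            continue  # nothing to compare against
        baseline = median(past_runs)
        if baseline <= 0:
            continue  # avoid dividing by a meaningless baseline
        drop = (baseline - current[name]) / baseline
        if drop > tolerance:
            regressions[name] = round(drop, 3)
    return regressions


# Hypothetical numbers, not real Akka.NET measurements.
history = {
    "mailbox_throughput": [1000.0, 1100.0, 900.0],
    "actor_spawn": [500.0, 500.0],
}
current = {"mailbox_throughput": 800.0, "actor_spawn": 490.0}

flagged = find_regressions(history, current)
```

Using the median rather than the most recent run keeps one noisy build on a shared server from resetting the baseline, which matters when the build machines are as variable as ours were.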

As Jeff put it, software is insanely detail-oriented - and some of the growing pains Akka.NET experienced over the course of 2015 came down to us having to begin paying attention to new types of details we never worried about before, such as performance and backwards-compatibility. As our projects and products grow, we need to sweat the small stuff.

I'm the CTO and founder of Petabridge, where I'm making distributed programming for .NET developers easy by working on Akka.NET, Phobos, and more.