Writing Better Tests Than Humans Can Part 1: FsCheck Property Tests in C#

This is the first post in a 3-part series on property-and-model based testing in FsCheck in C#.

  1. Writing Better Tests Than Humans Can Part 1: FsCheck Property Tests in C#
  2. Writing Better Tests Than Humans Can Part 2: Model-based Tests with FsCheck in C#

Subscribe to get the next as they’re published!

During my work on developing Akka.NET 1.1 I took a long look at the history of various bug reports over the lifespan of Akka.Remote, Akka.Cluster, and Helios over the past two years or so. The thing that I wanted to make sure we absolutely nailed in this release was ensuring the stability of the underlying transport and so-called “endpoint management” system.

If these sub-systems aren’t robust then Akka.Cluster can’t fulfill any of its behavior guarantees to end users.

For instance, I found that in the previous stable version of Helios we were using (1.4.1) there was an easily reproducible race condition that could result in a socket failing to open. In addition to that we (long ago) discovered that Helios didn’t treat message order correctly as a result of a fundamental design flaw with its concurrent message processing model. We had to patch that via a configuration change that severely limited Helios’ throughput inside Akka.Remote.

How is Akka.Cluster supposed to behave reliably during a network partition if its own internal software can easily create one?

This is the crux of the problem I tackled intensely over the course of three months.

Thus we eliminated these kinds of problems and graduated Akka.Cluster from beta (which it has been since August 2014) to a full, production-ready release. One of the tools that was essential to making this happen was FsCheck - a property and model-based testing framework for .NET applications.

Where Unit Tests and Traditional Testing Practices Fall Short

In modern software development it’s standard practice for developers to cover their own code using individual unit tests like this:

public class MyModuleSpec {
	[Fact]
	public void MyModule_should_add_item_to_collection(){
			var myModule = new MyModule();
		myModule.Add(1);
		Assert.True(myModule.Count == 1);
		Assert.True(myModule.First() == 1);
	}
}

These types of tests serve three purposes:

  1. Validating that your code works as you write it, especially if you’re working from the bottom-up.
  2. Protect against future regressions on this code. This is the most important function of tests - to scream, loudly, if the underlying behavior of a system inadvertently changes as the result of a software change.
  3. Defines a working specification for how code behaves. The tests also act as self-contained contracts for defining how each of the covered modules under test are supposed to behave during a variety of conditions.

Unit tests are ideally meant to target very small, isolated units of code. This makes them easier to write; easier to read (by another developer;) and makes the test more “pure.”

Where the unit testing model really starts to fall apart however, is that this model is really designed to test the behavior of individual features. What happens if you need to test how multiple features in a given application interact with each other?

Unit Tests Don’t Scale

Let’s suppose a feature of our application is a custom IList<T> implementation called FixedSizedList<T> that stays at a fixed size and won’t add any more items if the collection once its limit is reached.

var fixedSize = new FixedSizeList<int>(1);
bool addedFirst = fixedSize.Add(1); // should be true
bool addedSecond = fixedSize.Add(2); // should be false

An experienced developer would manually write some unit tests that cover the following scenarios:

  1. Adding an item to an empty list;
  2. Adding an item to a full list;
  3. Removing an item from a full list and then adding a new item; and
  4. Removing an item from an empty list.

That covers all of the basics, maybe, for this feature. Now here’s where things get fun - imagine if we use this FixedSizeList<T> inside another feature: a BackoffThrottler.

public class BackoffThrottler{	
	private readonly double _throttleRate;
	private readonly FixedSizedList<IMessage> _messageBuffer;
	private IChannel _downstreamConsumer;

	private double observedBytes;
	private long nextReleaseTimeTicks;

	public BackoffThrottler(double throttleRate, FixedSizeList<IMessage> buf, IChannel consumer){
		// assignments...
	} 

	public bool Pass(IMessage next){
		if(next.Size + observedBytes >= _throttleRate){
			if(!_messageBuffer.Add(next)){
				throw new InvalidOperationException("buffer full!");
			}
			return false;
		}
		else{
			observedBytes += next.Size;
			_downstreamConsumer.Send(next);
			return true;
		}
	}

	public bool DrainBuffer(long nextReleaseTime){
		// drain all buffered messages
		// calculate new release time
		// reset observed bytes to zero
		// etc...
	}
}

Manually writing unit tests for just the Pass method of this BackoffThrottler class is going to be complicated. We have to determine if the Pass method will return true, false, or if it will throw an InvalidOperationException based on:

  1. The number and size of all messages observed previously;
  2. The size of the current message;
  3. Whether or not any DrainBuffer operations occurred previously; and
  4. The possible configuration settings for both the BackoffThrottler and the FixedSizeList<IMessage> objects.

How many tests would you have to write to exhaustively prove that these two classes both work correctly under all supported configurations? “More than a human can feasibly write” is the correct answer. Now imagine having to do this for three, four, or dozens of features all working together simultaneously. You simply can’t write manual tests to cover all of the scenarios that might occur in the real world. It’s not feasible.

Model and Property-Based Testing

Folks in the Haskell community developed a solution to this problem in the late 1990s, and it was called QuickCheck. This is the library that pioneered the concept of property and model-based testing. FsCheck, which we will be using, is an F# implementation of QuickCheck - although I’ll be using it with C# in this example.

The idea is simple: rather than try to test every possible combination of inputs with a hand-written test, describe an abstract “model” for the feature(s) under test and generate random inputs for each possible value and verify that the test still holds true for all of them (assuming the inputs satisfy from pre-conditions, which we’ll cover.)

What this gives you is a single test written by a developer that can cover thousands of randomly generated scenarios and verify that the code-under-test behaves correctly under all of them. This is much more powerful, efficient, and secure than hand-written tests.

Testing Properties of Features

Let me give you a real-world example from one of my open source libraries, Faker. Faker uses reflection to create a plan for randomly generating fully constructed POCO objects and can even handle rather complex message graphs. But one part of our system for generating accurate fakes includes randomizing array / list types - and so I wrote a shuffle function based on the well-known Fisher-Yates shuffle algorithm.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace Faker.Helpers
{
    /// <summary>
    ///     Extension methods for working with arrays
    /// </summary>
    public static class ArrayHelpers
    {
        /// <summary>
        /// Produces the exact same content as the input array, but in different orders.
        /// 
        /// IMMUTABLE.
        /// </summary>
        /// <typeparam name="T">The type of entity in the array</typeparam>
        /// <param name="array">The target input</param>
        /// <returns>A randomized, shuffled copy of the original array</returns>
        public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> array)
        {
            var original = array.ToList();

            /*
             * https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
             */
            var newList = new T[original.Count];

            for (var i = 0; i < original.Count;i++)
            {
                var j = Faker.Generators.Numbers.Int(0, i);
                if (j != i)
                    newList[i] = newList[j];
                newList[j] = original[i];
            }
            return newList;
        }
    }
}

For a specific feature inside Faker, I have to be able to guarantee that for any IEnumerable<T> this shuffle function must:

  1. Never modify the original collection;
  2. Never return a collection with a different number of elements;
  3. Never return a collection with different items in its contents; and
  4. Must never return the original collection in its original order.

These are all properties of how this feature of my application works. And I can verify that all four of these properties hold true using property-based tests with FsCheck.

using System.Linq;
using Faker.Helpers;
using FsCheck;
using NUnit.Framework;

namespace Faker.Models.Tests
{
    [TestFixture(Description = "Validates our extension methods for working with arrays")]
    public class ArrayHelperSpecs
    {
        [Test(Description = "Ensure that our shuffle function works over a range of intervals")]
        public void Shuffled_lists_should_never_match_original()
        {
            Prop.ForAll<int[]>(original =>
            {
                var shuffle = original.Shuffle().ToArray();
                return (!original.SequenceEqual(shuffle))
                .Label($"Expected shuffle({string.Join(",", shuffle)}) to be " +
                       $"different than original({string.Join(",", original)})")
                .And(original.All(x => shuffle.Contains(x))
                .Label($"Expected shuffle({string.Join(",", shuffle)}) to contain" +
                       $" same items as original({string.Join(",", original)})"));
            }).QuickCheckThrowOnFailure();
        }
    }
}

In the sample above I’m calling FsCheck from inside NUnit, and I’m going to specify that for any random array of integers (Prop.ForAll<int[]>) my Func<int[], bool> will hold true. Given that int is a primitive data type, FsCheck has a built in randomizer that will generate incrementally more complex values and incrementally larger arrays.

The real piece of code doing the work here though are my two explicit property checks - the the two sets contain the same items but not in the same order. This covers all four of my properties from earlier.

Let’s see what happens when I run this sample as-is.

System.Exception : Falsifiable, after 1 test (1 shrink) (StdGen (1428853014,296185311)):
Label of failing property: Expected shuffle() to be different than original()
Original:
[|0|]
Shrunk:
[||]

Ouch, fails on the very first try - when we attempt to shuffle a zero-length collections. Well you can’t technically shuffle a list with no items, so let’s add a precondition to this property test that states that these properties are only true if there’s at least 1 item.

Prop.ForAll<int[]>(original =>
{
    var shuffle = original.Shuffle().ToArray();
    return (!original.SequenceEqual(shuffle))
    .When(original.Length > 0) // pre-condition to filter out zero-length []s
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to be " +
           $"different than original({string.Join(",", original)})")
    
    .And(original.All(x => shuffle.Contains(x))
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to contain" +
           $" same items as original({string.Join(",", original)})"));
}).QuickCheckThrowOnFailure();

And the results are….

System.Exception : Falsifiable, after 1 test (2 shrinks) (StdGen (1049246400,296185312)):
Label of failing property: Expected shuffle(0) to be different than original(0)
Original:
[|-1|]
Shrunk:
[|0|]

Doh! The spec fails if we have an array with a single entry in it too! I guess that’s because you also can’t shuffle an item with a single entry.

Automatically Finding the Smallest Reproduction Case Using FsCheck

Now notice this comment produced by FsCheck: Falsifiable, after 1 test (2 shrinks) in the last scenario. This process, called “shrinking,” allows FsCheck to work backwards through the series of random inputs it generated to determine what the minimal reproduction steps are for producing a test failure. This will come in handy once we start testing more complicated models.

I’ve had FsCheck shrink a test scenario it produced with 1000+ steps in it down to four steps before - it found a combination of operations that put a particular class into an illegal state. I would have never found that using manual testing.

Now back to our example - looks like we need to modify our precondition to specify that the length of the input collection must be greater than 1 in order to pass.

Prop.ForAll<int[]>(original =>
{
    var shuffle = original.Shuffle().ToArray();
    return (!original.SequenceEqual(shuffle))
    .When(original.Length > 1) // changed precondition
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to be " +
           $"different than original({string.Join(",", original)})")
    
    .And(original.All(x => shuffle.Contains(x))
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to contain" +
           $" same items as original({string.Join(",", original)})"));
}).QuickCheckThrowOnFailure();

So we’ll try running FsCheck again…

System.Exception : Falsifiable, after 2 tests (0 shrinks) (StdGen (424319174,296185314)):
Label of failing property: Expected shuffle(1,1) to be different than original(1,1)
Original:
[|1; 1|]

Ugh! It still fails - this time because it inserted an identical item into multiple places inside the same collection. The output of our shuffle function for a case where all of the inputs have identical elements will be the same as the input each time. So let’s add another precondition for that as well.

Prop.ForAll<int[]>(original =>
{
    var shuffle = original.Shuffle().ToArray();
    return (!original.SequenceEqual(shuffle))
    .When(original.Length > 1 && original.Distinct().Count() > 1)
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to be " +
           $"different than original({string.Join(",", original)})")
    
    .And(original.All(x => shuffle.Contains(x))
    .Label($"Expected shuffle({string.Join(",", shuffle)}) to contain" +
           $" same items as original({string.Join(",", original)})"));
}).QuickCheckThrowOnFailure();

And finally, we should have a passing test.

Ok, passed 100 tests.

Yay! It passed! We can now rest assured that our shuffle function works for the 100 randomly generated scenarios FsCheck produced. You can increase the number of tests if you need it.

Read the next post: Writing Better Tests Than Humans Can Part 2: Model-based Tests with FsCheck in C#.

Discussion, links, and tweets

I'm the CTO and founder of Petabridge, where I'm making distributed programming for .NET developers easy by working on Akka.NET, Phobos, and more..