Tradeoffs in High Performance Software

I’ve spent down the past week tracking down an absolutely brutal bug inside Akka.NET. Sometimes the CPU utilization of the system will randomly jump from 10% to 100% and stay pegged like that until the process is recycled. No exceptions were thrown and memory / network / disk usage all remained constant (until I added monitoring.)

I couldn’t reproduce this CPU spike at all locally, nor could I determine the root cause. No one else had reported this issue on the Akka.NET issue list yet, probably because the product is still young and MarkedUp In-app Marketing is the only major commercial deployment of it that I know of (hopefully that will change!)

I had to hook up StatsD to MarkedUp’s production Akka.NET deployments to figure out what was going on ultimately[footnote: I open sourced our Akka.NET + StatsD integration into a NuGet package – Akka.Monitoring] – using that plus the combination of a DebugDiag dump I was able to isolate this down to a deadlock.

I didn’t have a meaningful stack trace for it because the deadlock occurred inside native CLR code, so I had to actually look at the public source for .NET 4.5 to realize that the issue was caused by an edge-case where a garbage collected Helios IFiber handle could accidentally make a BlockingCollection.CompleteAdding() call on the Fiber’s underlying TaskScheduler (possibly leaked to other Fibers) via the IFiber,Dispose() method which would gum up the works for any network connection using that Fiber to dispatch inbound or outbound requests.

Took a week to find that bug and a few hours to fix it. What a bitch.

Before I pushed any of those bits to master, however, I decided to run Visual Studio’s performance analysis toolset and test the impact of my changes – a mix of CPU performance and memory consumption testing. This revealed some interesting lessons…

Tradeoff #1: Outsourcing work to the OS vs. doing it yourself

First, the reason we had this performance issue to begin with: Helios is C# socket networking middleware designed for reacting quickly to large volumes of data. Inside a Helios networking client or a reactor developers have the opportunity to specify the level of concurrency used for processing requests for that particular connection or connection group – you can run all of your IO with maximum concurrency on top of the built-in thread pool OR you can use a group of dedicated threads for your IO. It was this latter type of Fiber that caused the problems we observed in production.

Why would you ever want to build and manage your own thread pool versus using the built-in one? I.E. manage the work yourself instead of outsourcing it to the OS?

In this case it’s so you can prevent the Helios IConnection object from saturating the general-purpose thread pool with thousands of parallel requests, possibly starving other parts of your application. Having a handful of dedicated threads working with the socket decreases the system’s overall multi-threading overhead (context-switching et al) and it also improves throughput by decreasing the amount of contention around the real bottleneck of the system: the I/O overhead of buffering messages onto or from the network.

In most of applications, outsourcing your work to the OS is the right thing to do. OS and built-in methods are performance-optimized, thoroughly tested, and usually more reliable than anything you’ll write yourself.

However framework and OS-designers are not all-seeing and don’t offer developers a built-in tool for every job. In the case of Helios, the .NET CLR’s general purpose thread-scheduling algorithm works by servicing Tasks in a FIFO queue and it doesn’t discriminate by request source. If a Helios reactor produces 1000x more parallel requests per second than the rest of your application, then your application is going to be waiting a long time for queued Helios requests (which can include network I/O) to finish. This might compromise the responsiveness of the system and create a bad experience for the developer.

You can try working around this problem by changing the priority of the threads scheduled from your application, but at that point you’ve already broken the rule of outsourcing your work to the OS – you’re now in control of scheduling decisions the moment you do that.

Taking the ball back from the OS isn’t a bad idea in cases where the built-in solution wasn’t designed for your use case, which you determine by looking under the hood and studying the OS implementation.

Another good example of “doing the work yourself” – buffer management, such as the IBufferManager implementation found in Jon Skeet’s MiscUtil library or Netty’s ByteBuffer allocators. The idea in this case: if you have an application that creates lots of byte arrays for stream or buffer I/O then you might be better off intelligently reusing existing buffers versus constantly allocating new ones.. The primary goal of this technique is to reduce pressure on the garbage collector (CPU-bound problem), which can become an issue particularly if your byte arrays are large.

Tradeoff #2: Asynchronous vs. Synchronous (Async != faster)

I have a love/hate relationship with the async keyword in C#; it is immensely useful in many situations and eliminates gobs of boilerplate callback spam.

However, I also hate it because async leads droves of .NET developers to believe that you can inherently make your application faster by wrapping a block of code inside a Task and slap an await / async keyword in front of it.[footnote: I complained about this habit before in “10 Reasons Why You’re Failing to Realize Your Potential as a Developer”] Asynchronous code and increasing parallelism inside your application doesn’t inherently produce better results, and in fact it can often decrease the throughput of your application.

Parallelism and concurrency are tools that can improve the performance of your applications in specific contexts, not as a general rule of thumb. Have some work that you can perform while waiting for the result of an I/O-bound operation? Make the I/O call an asynchronous operation. Have some CPU-intensive work that you can distribute across multiple physical cores? Use N threads per core (where N is specific to your application.)

Using async / parallelism comes at a price in the form of system overhead, increased complexity (synchronization, inter-thread communication, coordination, interrupts, etc…) and lots of new classes of bugs and exceptions not found in vanilla applications. These tradeoffs are totally worth it in the right contexts.

In an age where parallelism is more readily accessible to developers it also becomes more frequently abused. And as a result: one of the optimizations worth considering is making your code fully synchronous in performance-critical sections. 

Here’s a simple example that I addressed today inside my fork of NStatsD which resulted in a 40% increase in speed.

This code pushes a UDP datagram to a StatsD server via a simple C# socket – the original code looked like this:

foreach (var stat in sampledData.Keys)
   var stringToSend = string.Format("{0}{1}:{2}", prefix, stat, sampledData[stat]);
   var sendData =  encoding.GetBytes(stringToSend);
   client.BeginSend(sendData, sendData.Length, callback, null);

The code here uses one of the UdpClient’s asynchronous methods for sending a UDP datagram over the wire – it takes this code about 2.7 seconds to transmit 100,000 messages to a StatsD server hosted on a VM on my development machine. Having done a lot of low-level .NET socket work with Helios, I’m suspicious when I see an asynchronous send operation on a socket. So I changed the call from BeginSend to just Send.

foreach (var stat in sampledData.Keys)
   var stringToSend = string.Format("{0}{1}:{2}", prefix, stat, sampledData[stat]);
   var sendData = Encoding.ASCII.GetBytes(stringToSend);
   client.Send(sendData, sendData.Length);

This code, using a fully synchronous send method, pushed 100,000 messages to the same StatsD server in 1.6 seconds – an improvement of 40%. The reason why? BeginSend operations force the OS to maintain some state, make a callback when the operation is finished, and release those resources after the callback has been made.

The Send operation doesn’t have any of this overhead – sure, it blocks the caller until all of the bytes are buffered onto the network, but there’s no context switching or caching objects to be included in the async state. You can validate this by looking at the .NET source code for sockets and compare the SendTo vs. BeginSendTo methods.

The point being: when you have thousands of write requests to a socket in a short period of time the real bottleneck is the socket itself. Concurrency isn’t going to magically make the socket write messages to the network more quickly.

Tradeoff #3: Memory vs. Throughput

I rewrote Helios’ DedicatedThreadFiber to use a BlockingCollection to dispatch pending work to a pool of dedicated worker threads instead of using a custom TaskFactory – a functionally equivalent but a mechanically significant change in how this Fiber’s asynchronous processing works. After making a significant change like this I test Helios under a heavy load using the TimeService example that ships with the Helios source.

Each TimeServiceClient instance can generate up to 3,000 requests per second and usually I’ll have 5-15 of them pound a single TimeServiceServer instance. On my first test run I noticed that the first TimeServiceClient instance’s memory usage shot up from about 8mb to 1.1GB in a span of 45 seconds. “Holy shit, how did I create such an awful memory leak?” I wondered.

Turns out that the issue wasn’t a memory leak at all – the TimeServiceClient could produce requests 2-3 orders of magnitude faster than its Fiber could process them (because, again, it was waiting on a socket), so the BlockingCollection backing the Fiber would grow out of control.

I decided to test something – if I capped the BlockingCollection’s maximum size and have it block the caller until space frees up, what would the impact on memory and performance be?

I capped the Fiber initially to 10,000 queued operations and I was surprised with the results – throughput of the system was the same as it was before but memory usage was only 5mb as opposed to 1.1GB. Sure, some of the callers would block while waiting for room to free up in the BlockingCollection but the system was still operating at effectively the same speed it was before: the maximum speed at which the outbound socket can push messages onto the network.

I made a choice to limit memory consumption at the expense of “on-paper” throughput, but it wasn’t much of a tradeoff at all.

Hopefully a theme is starting to emerge here: your system is only as fast as your slowest component, whether it’s the file system, the network, or whatever. Consuming all of your system’s memory in order to queue pending I/O socket requests faster doesn’t improve the speed of that slow component.

You’re better off blocking for a picosecond and conserving memory until that slow resource becomes available.

However, imagine if I had taken a different approach and decided to use more memory to group related socket messages in batches before we sent them over the socket. That might result in an improvement in total message throughput at the expense of increased memory consumption – I’d have to write some code to ensure that the receiver on the other end of the socket could break up the batched messages again, but there’s potential upside there.

Tradeoff #4: Heavyweight Resources vs. Lightweight Resources

Lightweight resources are primitives, structs, objects, and so forth. Heavyweight objects are threads, processes, synchronization mechanisms, files, sockets, etc… anything that is usually allocated with a resource handle or implements the IDisposable interface is a good candidate for a “heavyweight” object.

After I fixed the issue I described in tradeoff #3 – we had another memory-related performance problem inside our Executor object, used for executing operations inside a Fiber. When we terminate an Executor, we do it gracefully to give it some time to wrap up existing operations before we shutdown.

Helios offers two tools for “timed operations” – a lightweight Deadline class and a heavyweight ScheduledValue class. Both of these classes have the same goal: force some value to change after a fixed amount of time.

The Deadline class accomplishes this by comparing its “due time” with DateTime.UtcNow – if DateTime.UtcNow is greater than due time then the Deadline is considered “overdue” and starts evaluating to false.

The ScheduledValue is a generic class backed by a built-in .NET Timer object – you tell it to change its value from A to B after some amount of time, which the Timer does automatically once it elapses.

We use Deadlines most frequently inside Helios because they’re simple, immutable, and don’t pose any risk of a resource leak. Deadline is what we were using inside the Executor object’s graceful shutdown mechanism.

Every time a Fiber dequeues an operation for execution it checks the Executor.AcceptingJobs property (which is determined by the Executor’s internal DeadLine) – if this expression evaluates to false, the Fiber stops processing its queue and releases its threads. In other words, this property is evaluated hundreds of thousands of times per second – and as my performance profiling in Visual Studio revealed, generated a TON of GC pressure via all of its DateTime.UtcNow calls, which allocates a new DateTime structure every time.

I tried, in vain, to rewrite the Deadline to use a Stopwatch instead but wasn’t able to get reliable shutdown times with it. So I decided to use our heavyweight ScheduledValue class in lieu of Deadline – it uses a Timer under the hood but doesn’t allocate any memory or use any synchronization mechanisms when its value is polled.

This resulted in a significant drop in memory usage and garbage collection pressure for high-volume Fibers, and the ScheduledValue cleans up after itself nicely.

Even though the ScheduledValue requires more upfront resources than a Deadline, it was the superior performance tradeoff because it doesn’t have any additional overhead when its value is frequently polled by executing Fibers.

If the scenario was a little different and we had to allocate a large number of less-frequently polled objects, the Deadline would win because it wouldn’t require a large allocation of threads (used by the Timer) and heavyweight resources.

Wrapping Up

High-performance code isn’t black magic – in a general sense it comes down to the following:

Discussion, links, and tweets

I'm the CTO and founder of Petabridge, where I'm making distributed programming for .NET developers easy by working on Akka.NET, Phobos, and more..