Five recent results in high-performance data paths.

Five recent results in high-performance data paths.

First, a little context: Co-op at Coho

At this time of year, Coho’s engineering team interviews applicants for co-op (4 month) and intern (12 month) student placements on the team. We take these student placements really seriously: it’s an opportunity to have spectacular young software developers come and work with us, and we’ve had really productive results from students that we’ve brought on in the past. A successful co-op term at Coho involves (a) a specific mentor from the technical staff to work directly with the student through the term, and (b) a clear project to work on. We generally try to pick projects that allow students the opportunity to contribute to our product, because it’s rewarding to have your code ship, but also to give the opportunity to exceed expectations by shaping the work they do in their own direction.

A great example of this happened in the last term: Josh Otto, an undergraduate from the co-op program at the University of Waterloo joined us for the fall term. In discussing potential projects, Josh said very clearly that he “wanted to do some low-level C programming.” Josh certainly got to hack on some low-level code: he ended up working with real time stats collection on wait queues, to measure and visualize where requests spend their time as they traverse our stack. Josh turned out to be a pretty full-stack sort of guy, in that the stats that he collected were plumbed through collectd, into InfluxDB and visualized with Grafana. Extensions to Josh’s project continue to be worked on, and are making their way into a toolset that our support team uses to analyze performance data from customer sites. All in all, it was a pretty high impact result, and it was hardly the only thing that Josh achieved as a member of the team.

Josh demoing his co-op project to some of the team.

Josh demoing his co-op project to some of the team.

What does this have to do with fast data paths?

Co-op candidates ask a lot of questions during interviews, and the best candidates are often really, really interested in trying to understand the sort of problems that we face in building Coho’s product. I had a fun conversation about one aspect of what we do last week: a candidate that I was talking to was very surprised that building software for fast, scalable I/O could even be a problem after decades of software research and development. “How isn’t this a solved problem?” they asked. This, incidentally, is exactly the sort of job interview discussion that I love to get into.

The answer to this particular question is especially important to me because it characterizes the exact thing that I love about both building software systems and doing systems research: The environment in which systems research is carried out is in a constant state of change. Software systems, broadly, is any code that sits between top-level application logic, and hardware. Hardware improves non-uniformly: the balance of power between computation, communication, and storage performance swing wildly over the years, resulting in completely different solutions being valid at different points in time. Applications do the same thing: they change continuously, are shaped by their environments, but (a little counterintuitively) tend to move more slowly than the hardware that they run on. Good systems design needs to strike a balance between idealistically reinventing the universe and pragmatically proving value for the millions of lines of code that people already depend on today.

This balance of pragmatism and idealism, and the fact that the environment is in a perpetual state of change is the very core of what makes systems research challenging, exciting, and above all a creative endeavour.

Five recent research data paths.

Building data paths for high-performance I/O is a very good example of systems research being driven in new directions as a result of environmental change. Over the past decade, a pair of really interesting things have happened:

  1. CPUs stopped getting faster. (Well, at least they stopped getting faster as fast as they were getting faster before.)  Instead, they started to become parallel, resulting in either an absolutely head-twisting mess for application developers or a bunch of idle cores. One of the core values of virtualization (as very presciently anticipated by Disco, the research-parent of VMware’s hypervisor) was that it allowed systems to take advantage of extra cores by making them look like a whole bunch of single- or dual-core systems.
  2. I/O suddenly started to get really, really fast. Commodity Ethernet suddenly became as fast as, or faster, than (traditionally more) expensive proprietary fabrics such as Fibre Channel or Infiniband. We figured out how to build commodity switches with full-bisection bandwidth. Then we revisited Clos networks as a mechanism for scaling those full bisection switches up into richly interconnected data centers. Meanwhile, nonvolatile memories made seeks free, and moved storage from the disk bus, to the PCIe bus, to the main memory bus.  Storage, which was traditionally the slowest part of the system has become as fast as RAM in some cases, and about a million times lower latency that developers have traditional assumed it was to access.

The sum of these two facts is that we are currently in a period of systems design in which I/O performance is in its ascent:  it is becoming proportionally faster relative to computation.   This environmental change is demanding that systems researchers and designers reconsider the parameters in how they architect systems.  As evidence of this trend, here are five spectacularly interesting papers that have been published at top systems and networking conferences over the past 12 months.  I’ll summarize them quickly before making a couple of observations about the commonalities that thread through all of them.

  1. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems
  2. Arrakis: The Operating System is the Control Plane
  3. IX: A Protected Dataplane Operating System for High Throughput and Low Latency
  4. Network stack specialization for performance
  5. MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

All five of these papers reflect a pretty sophisticated position on systems performance in which we want to achieve low latencies, high throughputs, and efficient utilization of the hardware that we buy.

Here is a preposterously brief summary of all five of these papers: mTCP argues that the OS kernel is standing in the way of achieving good scalable network performance because of the machinery (like data copies and context switching) that are involved in moving data between kernel and user space. Taking advantage of new NIC and CPU features (such as multiqueue), mTCP moves the device driver and network stack directly into an application and takes the kernel completely out of the IO path. Arrakis does the same thing, but also thinks a little bit about doing the same for storage devices.

IX, which was published alongside Arrakis, says that you would have to be absolutely crazy to trust applications with direct access to devices because of both concerns over software stability and over all of the horrible things that they might do to your networks. It does some really clever stuff with Intel’s virtualization extensions to allow the I/O path to be co-resident with application code — realizing many of the benefits that Arrakis and mTCP do in terms of mapping queues to cores and making things go fast — but still manages to run the network stack in an isolated protection domain where the application can’t muck with it.

The Sandstorm paper (Network stack specialization for performance) goes a bit further down the road of layer/API violations than mTCP, which attempts to preserve POSIX socket interfaces to client apps, despite their being recompiled to link against mTCP instead of libc for the network.  It also implements a user-level stack with application-specific tweaks to the network code that deliver impressive speedups for both a web and DNS server implementation.

Finally, MICA is a spectacular “reinvent the world”-category paper that considers just how fast you might be able to make a key-value store go on cool new hardware. The paper shamelessly ignores all of the properties that you might want in a real key-value store (such as durability or replication) and enjoys an opportunity to meditate entirely on the design of the datapath itself, posting a peak throughput of almost 80M operations per second to a single server.  When I have taught this paper to grad students, I’ve enjoyed referring to it as an exercise in constructing systems for drag racing, and a very successful one at that.

What do these papers have in common?

All five of these systems are focussed on achieving saturation performance in the face of new hardware. The main problem that all five struggle with is that in order to scale, even within a single server, the CPU cores are the trickiest thing to manage. In order to saturate these fast I/O interfaces, we need to first remove any unnecessary layers of abstraction, such as the OS’s I/O stack, if those things interfere with performance. Second, and this turns out to be a tricky thing to get right, we need to find ways to efficiently balance work across multiple cores, with close to zero state shared across them.

The reason for this latter thing is that of coherence.  When you attempt to parallelize work across a set of cores, (sanely written) applications generally try to keep memory consistent across those cores.  A TCP implementation that received packets from a given connection to any random core needs to serialize accesses to TCP connection’s state, such as a flow’s sequence number.  Developers are forced to do this using concurrency primitives like locks. At the rate of request processing in these systems, contested locks are a complete disaster, and even uncontested locks can add enough overhead to severely impact performance.

A secondary concern in parallelizing datapath implementations is the fact that the x86 architecture is cache coherent: CPU caches implement a coherence protocol to ensure that all cores see a single version of memory, and they will work very hard at copying data around to ensure that view stays coherent. If performing I/O on multiple cores involves sharing in-memory state across them, things go wrong in a hurry.  These coherence overheads can come from operations that are as simple as maintaining performance counters (how many packets have I dropped) or debug data structures, resulting in extra subtleties and considerations for developers that are trying to achieve good parallel performance.

Between these two levels of coherence (application-level serialization and cache coherence overheads) a poorly designed implementation will achieve the same (or worse) performance in parallel as it would running the whole system on a single core.

To address this, all five of the systems above carefully engineer a per-core execution context that is as close to shared nothing as they possibly can. Flows are hashed and directed to a specific core. Even things like performance counters are replicated and stored as per-core local state, and only aggregated when higher-level tools actually read them.

MICA goes the interesting additional step of really thinking about the problem in an end-to-end fashion, and effectively addressing requests to the core on which data lives. This sort of approach paves the way for cluster-level resource management for request traffic across large numbers of devices.

“So how isn’t this a solved problem?”

Given these five papers, it might be natural to think that we’ve figured out how to build high-performance data paths in the face of emerging hardware. Unfortunately, (well, okay, maybe I mean fortunately here), this couldn’t be farther from the truth: One big challenge that remains unsolved in all of these systems has to do with how we should reason about trust in architectures that delegate so much more responsibility directly to applications. From a network context, it is important to enforce that applications not be able to forge MAC addresses or VLAN/VxLAN tags. On the storage side, it simply isn’t safe to allow multiple tenants to share access to the same block device if an addressing error in one application can corrupt data from another.

Simple hardware-level techniques, such as NVMe namespaces or “stamping” fields on the network transmit queue go some small distance to help, but the core issue is that most datacenter environments are multi-tenant, and involve a collection of shared data center services that provide benefit across multiple tenants. In a storage system, for example, implementing deduplication or cluster reconfiguration in response to failure is complex and important stuff. Trusting an application to embed their own datapath is great for performance, but it requires careful design to get right while still managing to allow important cluster-level services to be safely and efficiently deployed.

This is one example out of a long list of challenging consequences that we are facing in building thinner data paths in large-scale distributed systems.  Others surround things like managing the placement and locality of data, and similarly organizing the network and application components that connect to it, alternatively, there is the careful balance that I alluded to above in building systems that can realize the performance wins of the research systems above while still working under unchangeable legacy applications, which in turn may be running on unchangeable legacy OSes and VMMs.

Interesting times…

It’s exciting to have the opportunity to work on a large number of really challenging systems problems at once.  While the datapath is a big component of Coho’s product and something that is under aggressive development, there are many other areas of exciting work: We have teams working with equally interesting problems with SDN switching, protocol design, distributed dynamic data structures, time-series data analysis, and even dynamic constraint satisfaction.  We have teams working closely with customers on integration with technologies like Docker and Apache Hadoop.

It’s been wonderful to see the engineering team continue to grow over the past year, and to be able to take on more and bigger problems as a group.  If any of these areas sound interesting to you — either as a co-op or as a full-time member of our technical staff — there are plenty of opportunities to join the team.  Just drop a note to

And with that, I’ll get back to talking to co-op applicants!

Note:  The car image at the top of this article is from wikimedia commons, and has been remixed a little.  Here’s the original and here’s the attribution:  By GSenkow (Own work) [CC BY-SA 3.0 (], via Wikimedia Commons 

20,732 total views, 3 views today