Engineering for Change

There’s this Spoon song that I really like, called “The Underdog.”  It’s a simple, catchy tune, and, to my listening at least, it talks about the challenges of a small independent band struggling with the realities of working with the establishment in music recording, distribution, and promotion.  The song is angry and frustrated.  It’s about change, in particular about how the music industry as a whole (like many information-based industries today) continues to struggle with some pretty fundamental disruption in the face of both new technologies and also new approaches to doing business.

Maybe it’s a bit of a stretch, but the sentiment in the song echoes my own views of what is happening in infrastructure computing right now. I’ve frequently blogged about how changes in technology (like the rise of flash and software-defined networking) are catalysts for change, but there’s an even bigger story here: that of the broader generational change between new and old-generation IT companies. New IT is different, and these differences demand consideration.

Engineering for change.

In the past decade, we’ve seen absolutely massive disruption of traditional businesses: Amazon revolutionized the sale of consumer goods, starting with books. Netflix killed video rentals and now has its sights set on the cable networks. Uber is upsetting an established and clearly sub-optimal market for hired car dispatch. And Amazon again, with leased IT infrastructure in place of on-premises hosting.

Marc Andreessen has referred to this sweeping set of industrial changes as “software eats everything.”  I completely agree with this view, and I’m excited by it.  In fact, at the rate that technologies like 3D printing and 3D scanning are advancing, I am personally hoping that orthodontics-as-a-service will materialize before my kids need braces.

One of the central characteristics of these new companies, and one of the most significant benefits of cloud-based applications, is the notion of cycle time. Cycle time is a simple concept: how quickly can your company identify a problem or opportunity with what you sell, design a solution, implement that solution, and then deliver it to the customers of your product or service? Cycle time is agility and responsiveness. These new companies aren’t just offering a better product, they are offering products that get better faster.

What cycle time means for IT infrastructure products.

Cycle time is easy in the cloud. In fact, Facebook’s developer mantra, “move fast and break things,” captures exactly the idea of making changes quickly even at the expense of stability. With about a billion users, Facebook has the ability to roll new functionality out to a subset of them, even at the cost of occasional disruptions.

IT infrastructure products are not Facebook. Storage in particular: the systems we build aren’t centrally managed; they are installed as physical assets on customer sites. The single most important property of a storage system is that it not lose data. In an enterprise storage environment, “move fast and break things” probably isn’t the most appropriate mantra.

Let’s consider a different, and possibly more analogous, new-generation company. Automobiles today are software systems with even stricter correctness requirements than enterprise storage: if an automotive system fails at speed, people can die.

Fascinatingly, Tesla recently demonstrated the precise benefit of compressed cycle time. The company issued an over-the-air software patch to adjust the road clearance of all of its cars at highway speeds, in response to a specific and measured safety concern. Tesla is a very relevant example of what enterprise IT customers should be looking for in their products today: their cars are real physical things, literally deployed on premises. Despite this, the automaker treats the entire deployed fleet of its vehicles as something that can be monitored in aggregate, and something that can be carefully and responsively improved over time.

Upgrades are critical to cycle time.

And so here’s a perplexing thing: last week, EMC’s XtremIO (a product that I would until then have categorized as new-generation storage) announced that the upgrade to the newest release of their software would be destructive: all data stored on the system would be lost as a byproduct of the upgrade, and as a result, all production data would have to be migrated _twice_, once off and then back onto the new system.

There has been a lot of discussion about the implications of this process over the past week, but I’d like to consider it strictly from the perspective of agility: XtremIO needed to deploy some really significant improvements to their stack. They made the decision to do this in a simple, repeatable, but also highly disruptive manner.

Unfortunately, it also leaves their customers dreading future upgrades because of the pain potentially associated with them. This is a very traditional version of the enterprise software cycle: effectively yearly releases, with significant pain at each product upgrade. XtremIO is hardly the only example of this: NetApp famously angered its customers with destructive upgrades after the Spinnaker acquisition, and then did it again with the recent introduction of cluster-mode.

Cycle time at Coho.

Three years ago, when we sat down to rethink architectures for scalable enterprise storage, we assumed up front that we were not going to get everything perfect in the first release. We decided that two aspects of our design — irrespective of anything that we did in implementing the data path or other specific storage features — were critical in engineering for change. In particular:

  1. Measure. We built in, from the beginning, a remote monitoring system called OnStream, that allows us to monitor the health of Coho systems that are in production, and to learn, in aggregate, about how they can be improved over time.
  2. Improve.  We accepted up front that in order to get to the system that we really wanted to build, we were going to have to upgrade the software, regularly, on the customer site and in production.

Our most recent product release makes these upgrades completely in-service, allowing rolling metadata and layout changes to be deployed on a running Coho deployment. This has been hard work on an understated feature, and it has required the engineering team to build techniques for versioning, testing, and deployment that have impact across the entire stack. At the end of the day though, features like live and non-disruptive upgrade reflect a prioritization of engineering for change and a need for agility: we want our customers to expect, and be comfortable with, a software-based storage system that continuously improves over time.
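To make the rolling-upgrade idea concrete, here is a minimal, entirely hypothetical sketch of the kind of version negotiation such a system needs. None of the names (`Node`, `negotiated_format`) are Coho APIs; the point is only that a half-upgraded cluster must keep writing the newest metadata format that every node still understands, and only commit the format bump once the last old node is gone.

```python
# Hypothetical sketch of version-negotiated rolling upgrades.
# All names are illustrative; this is not Coho's implementation.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    supported_versions: set[int]  # metadata formats this build can read/write

def negotiated_format(nodes: list[Node]) -> int:
    """Return the newest metadata format that *every* node understands,
    so a partially upgraded cluster never writes data an old node can't read."""
    common = set.intersection(*(n.supported_versions for n in nodes))
    if not common:
        raise RuntimeError("no common metadata format; upgrade path is broken")
    return max(common)

# Upgrade nodes one at a time; the cluster-wide format only moves
# forward after the final node is replaced.
cluster = [Node("a", {1}), Node("b", {1}), Node("c", {1})]
assert negotiated_format(cluster) == 1

cluster[0] = Node("a", {1, 2})      # first node upgraded; still writes v1
assert negotiated_format(cluster) == 1

cluster[1] = Node("b", {1, 2})
cluster[2] = Node("c", {1, 2})      # all upgraded; format bump commits
assert negotiated_format(cluster) == 2
```

The design choice worth noticing is that the new software must carry read/write support for the old format, which is exactly the cross-stack versioning and testing burden described above.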

Learn by doing.

A second aspect of cycle time is the information that’s required to make good decisions and prioritize what needs to be improved. One of the biggest benefits that companies like Google and Facebook have today is the ability to look longitudinally across their deployed systems and have statistically significant samples for understanding things like application workloads or hardware failure properties.

I view the deployed set of Coho systems, across our diverse set of customer environments, as being remarkably similar to hosted, central systems like S3, EBS, or GoogleFS. We are a scale-out, multi-tenant storage system with very hard isolation between tenants: they each run their own independent hardware. Just like Tesla monitoring its deployed fleet, there is enormous opportunity to learn from the aggregate behaviour of those deployed systems, and to improve each individual Coho deployment as a result.

Coho’s OnStream does exactly this: it propagates relevant information about our deployed systems back for centralized analysis by our engineering team. Using the data collected by OnStream, we are able to study wear properties of flash hardware at scale, and to warn customers about hardware failure risk before failures happen. Even more exciting than this is the recent set of techniques that we have published on efficiently summarizing storage workloads, which allow us to study and actually improve the performance of deployed systems over time.
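As an illustration of the wear-warning idea, here is a small sketch of the kind of check a telemetry pipeline like OnStream might run over fleet data. The field names, the rated program/erase-cycle budget, and the 80% warning threshold are all assumptions for the example, not Coho internals.

```python
# Illustrative fleet-wide flash wear check; parameters are assumptions.

RATED_PE_CYCLES = 3000      # assumed endurance budget for a flash device
WARN_THRESHOLD = 0.80       # warn once 80% of rated endurance is consumed

def wear_fraction(pe_cycles_used: int) -> float:
    """Fraction of the rated program/erase budget already consumed."""
    return pe_cycles_used / RATED_PE_CYCLES

def days_to_threshold(pe_cycles_used: int, cycles_per_day: float) -> float:
    """Linear extrapolation: days until the drive crosses the warning line."""
    remaining = WARN_THRESHOLD * RATED_PE_CYCLES - pe_cycles_used
    return max(remaining, 0.0) / cycles_per_day

# Telemetry rows aggregated from deployed systems (made-up numbers).
fleet = [
    {"site": "A", "drive": "ssd0", "pe_used": 2500, "cycles_per_day": 4.0},
    {"site": "B", "drive": "ssd1", "pe_used": 900,  "cycles_per_day": 1.5},
]

for d in fleet:
    if wear_fraction(d["pe_used"]) >= WARN_THRESHOLD:
        print(f"{d['site']}/{d['drive']}: warn customer now")
    else:
        eta = days_to_threshold(d["pe_used"], d["cycles_per_day"])
        print(f"{d['site']}/{d['drive']}: ~{eta:.0f} days to warning")
```

The value of doing this centrally, rather than per-array, is statistical: wear-rate and failure models fit across the whole fleet are far better than anything a single deployment could learn on its own.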

Invested in change.

Over my career, I’ve been involved in the design and development of a number of large, complex software systems. One of the lessons I’ve taken from these experiences is that for any good system, design and architecture aren’t static: they are continuously evolving and (hopefully) improving. The measure, to me, of a good software system is only partially what it is capable of now; much more important is how well it anticipates the need to change and improve over time, and how willing its authors are to acknowledge weaknesses and to continue investing in improvement.

Our customers want infrastructure that responds to their needs, and that makes their IT practices more agile and efficient.  Balancing storage and network performance, analytically fitting storage resources to workload demands, and providing scale-out systems that can be acquired on demand are important technical aspects of helping them achieve this. More than all of this though is the importance of demonstrating to our customers that we respect their time and effort in building and delivering solutions. The most critical aspect of compressing cycle time is building systems that involve customers in the release cycle, and involve them directly in product evolution.  To do this, they need to look forward to upgrades, not live in fear of them.

“The thing that I tell you now
It may not go over well
And it may not be photo-op
In the way that I spell it out

But you won’t hear from the messenger,
Don’t wanna know bout something that you don’t understand,
You got no fear of the underdog,
That’s why you will not survive.”

“Ouroboros 1”. Licensed under Public domain via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:Ouroboros_1.jpg#mediaviewer/File:Ouroboros_1.jpg

Interested in learning more about Coho and our products? Check out ESG’s report on our initial product offering, or our slightly gorier technical white paper that describes the system in a bit more detail.
