The New Data Services

For as long as we’ve had enterprise storage, products have defined themselves as providing a combination of two things: data reliability and data services. Reliability — and I’m going to wind up some storage greybeards by imprecisely using that word to also include specific related properties such as durability and availability — means that enterprise storage systems take unreliable components like disks and servers, and combine them to build a storage abstraction that works, remains available, and protects data in the face of all sorts of failure.

Data services are kind of the other half of enterprise storage. They typically include a whole bunch of additional storage-level functionality: things like snapshots, deduplication, and remote replication. Data services are functional extras: they aren’t just about presenting a more reliable or faster storage device; they are about adding new functionality at the storage layer that lets you (you being the administrator, in this context) do useful new stuff with the data you store.

To me, NetApp is the canonical example of a company that differentiated its storage product based on data services: NetApp’s file system, WAFL, made volume-level snapshots cheap, fast, and easy to take. The decision to then expose these snapshots to users, initially in the filesystem namespace as a .snapshot directory and then as a programmatic abstraction through things like SIS clone, changed how users and administrators understood what the storage system actually provided. NetApp snapshots are a great example of the power of data services because they expose useful functionality within the storage layer and improve our ability to work with our data, as both administrators and end users.

The data services drought

This year marks the twentieth anniversary of NetApp’s original USENIX paper describing the design and implementation of WAFL. Here’s the sad thing: data services in the storage industry haven’t changed an awful lot over those twenty years. Over that time, the conventional understanding of data services grew to include a bunch of volume-level notions of storage system reconfiguration (like volume concatenation), a few flavours of remote replication such as continuous data protection, and data reduction techniques like deduplication. Unlike NetApp’s original snapshot implementation, none of these data services are user-facing or programmable. In terms of its interface, storage hasn’t really materially changed in a couple of decades.

To the cloud

And so here’s where it’s really interesting to look at what cloud-based environments have done in terms of evolving storage interfaces. Google built a file system, later turned it into a database, then extended the system to continuously process and index changes, and eventually to provide efficient consistency at large geographic scale. Talk about data services.

Interestingly, not a lot is written about the actual APIs that are available to Google employees internally, but Amazon is different: they resell these services. Amazon built an object store (S3) and a block store (EBS), and both have grown to enormous scale. This week at re:Invent, Amazon announced several interesting new services that bring compute closer to data by exposing APIs that work directly in response to events at the storage layer. In particular, S3 has been extended with the ability to generate events in response to things like object creation, and an entirely new programmatic service, AWS Lambda, has been introduced to let developers write small JavaScript programs (well, continuations really) that run in response to these events.
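
To make that wiring concrete, here’s a minimal sketch, using the AWS SDK for JavaScript, of subscribing a Lambda function to object-creation events on a bucket. The bucket name and function ARN are placeholders, and S3 must separately be granted permission to invoke the function (via Lambda’s add-permission API); that step is omitted here.

    // Sketch: ask S3 to invoke a Lambda function whenever an object is
    // created in a bucket. Bucket name and function ARN are placeholders.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();

    s3.putBucketNotificationConfiguration({
      Bucket: 'my-images-bucket',
      NotificationConfiguration: {
        LambdaFunctionConfigurations: [{
          Events: ['s3:ObjectCreated:*'],  // fire on any newly created object
          LambdaFunctionArn:
            'arn:aws:lambda:us-east-1:123456789012:function:makeThumbnail'
        }]
      }
    }, function (err) {
      if (err) console.error('failed to configure bucket notifications:', err);
    });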

In Amazon’s words:

AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. AWS Lambda starts running your code within milliseconds of an event such as an image upload, in-app activity, website click, or output from a connected device. You can also use AWS Lambda to create new back-end services where compute resources are automatically triggered based on custom requests.

In other words, Lambda combines two things: an event-triggering system that is directly integrated with the storage system, and a lightweight runtime that associates small bits of code with those events. As a result, I can write a small program to generate thumbnails whenever I add an image to an S3 bucket, and then add those thumbnails to the end of a pre-existing web page. I don’t need to build a VM to watch for new images, nor do I need to host a background PaaS-based service to monitor S3. As a developer, I just write code to respond to the event, and the storage system runs it in an efficient, integrated manner.
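
As a rough sketch of what that thumbnail handler might look like: the companion “-thumbnails” bucket naming is an assumption, and resize() is a hypothetical stand-in for a real image library (the launch-era AWS examples used the ImageMagick binding bundled with the Node.js runtime).

    // Sketch of a Node.js Lambda handler for the thumbnail scenario above.
    // Error handling is minimal for brevity.
    var AWS = require('aws-sdk');
    var s3 = new AWS.S3();

    // Hypothetical stand-in for a real image-resizing step (e.g. ImageMagick);
    // here it just hands the bytes back unmodified.
    function resize(imageBytes, callback) {
      callback(null, imageBytes);
    }

    exports.handler = function (event, context) {
      // S3 delivers one or more records describing the objects that changed.
      var record = event.Records[0];
      var bucket = record.s3.bucket.name;
      var key = record.s3.object.key;

      s3.getObject({ Bucket: bucket, Key: key }, function (err, data) {
        if (err) return context.fail(err);
        resize(data.Body, function (err, thumb) {
          if (err) return context.fail(err);
          s3.putObject({
            Bucket: bucket + '-thumbnails',  // assumed companion bucket
            Key: key,
            Body: thumb
          }, function (err) {
            if (err) return context.fail(err);
            context.succeed('thumbnail written for ' + key);
          });
        });
      });
    };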

My suspicion is that Lambda existed in some form within AWS’s implementation long before it was surfaced to users: many internal storage facilities benefit from exactly this kind of eventing, so Lambda’s release may really be the productization of something that was originally purely internal infrastructure.

The new data services

I’m excited about what Lambda represents, in that it is effectively an entirely new form of data service. I frequently hear our customers talk about a usage pattern in which data is ingested into enterprise storage, immediately pulled back off to be transformed in a VM, and the results of that transformation then written back to the storage system. This is a common processing pipeline that is highly inefficient in traditional enterprise computing, and it’s one that I think enterprise storage vendors should be paying a lot of attention to.

In a talk at Storage Field Day last week, I described some early work that we’ve been doing at Coho to integrate the Cloudera CDH stack directly within our storage platform. This is part of a broader bit of (very new) work that we’ve been exploring, in collaboration with Intel and a couple of customer design partners, around the efficient incorporation of Docker containers that are plumbed directly into both our SDN network integration and our high-performance clustered storage system.

We’ve been getting great initial customer reaction to these new points of integration, and the Lambda announcement from Amazon is squarely in the same direction. Unlike hosting VMs or language runtimes (as traditional IaaS and PaaS systems do), storage-facing APIs for compute have the potential to let developers work very directly with their data, efficiently, and without the operational distraction of having to build, provision, and spin up environments.

After twenty pretty uneventful years in storage, it’s exciting to see data services start to grow and evolve in interesting new ways. It’s a direction that I look forward to seeing Coho provide for our own customers as we continue to grow.


The image at the top of this article is from a postcard photo ad for the Harwood-Barley Company of Marion, Indiana, a maker of trucks and fire engines. http://commons.wikimedia.org/wiki/File:Old_delivery_truck_Harwood-Barley_Marion_Indiana.JPG
