Site to Site Replication: How It Works

We recently added a big feature to our storage system: site to site replication. An earlier post gave an overview of the feature, SiteProtect — what it’s for, why it’s important, and how it’s used. This post is a follow-up that goes into detail about how SiteProtect works. It works a bit differently from how other systems approach the problem, and its implementation shows off some of the special features of our architecture. So let’s jump in.


Site to site replication is powered by two core features of our storage system: lightweight snapshots and dynamic data replication. Let’s start by looking at how dynamic replication works.

Dynamic replication is surprisingly powerful — it does much more than protect your data from hardware failure. For example, we use it to dynamically move active data onto new nodes when you increase the size of your cluster: we create new replicas of your files on the new storage and remove old replicas to balance your data across all of your available storage. Replication can do this in the background, even as your files are being modified. You can almost think of site to site replication as adding a new storage node, except that instead of sitting in the same rack as your existing storage, it lives in a different data center or even a different country. That metaphor is more accurate than it sounds: internally, we really do represent the remote cluster as just another storage node, one that happens to proxy requests over your replication link.
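To make the “remote cluster as just another node” idea concrete, here is a minimal sketch in Python. It is not Coho’s actual code; the class names and the placement logic are illustrative assumptions. It shows how a replica placement layer that only knows about generic “storage nodes” can treat a remote site as one of them, with writes proxied over the WAN link.

```python
# A minimal sketch (not Coho's actual code) of the idea that replica
# placement only sees generic "storage nodes", and that a remote site can
# participate simply by implementing the same interface.

class StorageNode:
    """A local target that holds object replicas (in memory, for this example)."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def write(self, object_id, data):
        self.objects[object_id] = data


class RemoteSiteNode(StorageNode):
    """Behaves like any other node, but proxies writes over the WAN link."""
    def __init__(self, name, send_over_wan):
        super().__init__(name)
        self.send_over_wan = send_over_wan  # callable standing in for the link

    def write(self, object_id, data):
        # Instead of storing the data locally, ship it to the remote cluster.
        self.send_over_wan(object_id, data)


def place_replicas(object_id, data, nodes, copies=2):
    """Write `copies` replicas of one object across the available nodes."""
    for node in nodes[:copies]:
        node.write(object_id, data)


# Usage: the remote site participates exactly like a local node would.
local = StorageNode("rack1-node3")
remote = RemoteSiteNode("dr-site", send_over_wan=lambda oid, data: None)
place_replicas("vm-disk-0001", b"...block data...", [local, remote])
```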

Replication alone isn’t the whole story, though. By design, once a replica has been created, we keep it synchronized with the other replicas at all times (to protect your data from hardware failures). But site to site replication needs to work over wide area networks, where latency and bandwidth are out of our control and would badly hurt performance if we tried to keep a synchronous replica of your data on the far end of such a link. At the same time, we don’t want to simply let remote replicas drift out of sync — that would make it difficult to guarantee the remote replicas are in a consistent state, and hard to know how old your backup is. This is where lightweight snapshots enter the picture.

In our system, snapshots are first-class objects: cheap to create and as efficient to access as the original files. They are implemented as copy-on-write clones of the source files (similar to ZFS or Btrfs snapshots). This means a snapshot consumes space only for the parts of the files that have changed since it was taken, and there is very little other overhead to keeping snapshots around. So, when we replicate your data to another site, what we’re really doing is taking regular snapshots of your files and replicating those. Since the snapshots never change, the problem of keeping the files synchronized goes away. And since each snapshot only contains the data that differs from the previous one, snapshots give you an easy way to fit your replication traffic into your wide area link — just schedule them at a rate the link can absorb. We’ll only replicate the difference.
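Here is a rough sketch of that delta idea, using a toy block map rather than the real on-disk format (the structures and function names below are assumptions made for illustration). Because a snapshot is immutable, replicating snapshot N only requires sending the blocks that differ from snapshot N-1.

```python
# A toy illustration of copy-on-write snapshots and delta replication.
# Real snapshots track block references rather than full copies; this just
# shows why only changed blocks need to cross the WAN.

def take_snapshot(live_blocks):
    """Freeze the current view of the file's blocks."""
    return dict(live_blocks)  # the snapshot never changes after this point

def delta(prev_snapshot, new_snapshot):
    """Blocks that are new or modified since the previous snapshot."""
    return {
        block_id: data
        for block_id, data in new_snapshot.items()
        if prev_snapshot.get(block_id) != data
    }

# Usage: replicate each snapshot as the difference from the one before it.
live = {0: b"AAAA", 1: b"BBBB"}
snap1 = take_snapshot(live)

live[1] = b"bbbb"   # writes after the snapshot only touch the live copy
live[2] = b"CCCC"
snap2 = take_snapshot(live)

to_send = delta(snap1, snap2)   # only blocks 1 and 2 go over the link
```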

Coho’s site to site replication feature is designed to stay out of your way until you need it. Replicated files are stored on the destination in a completely separate file system, so they don’t clutter your view of your local workloads. At the same time, you don’t have to statically reserve space to hold the data — the underlying capacity is shared automatically between your local data and backups of remote data. When you do want to activate a backup, just clone it into your local storage. It’ll behave just like a local VM and automatically replicate itself back to your primary site if and when that site comes back online. This means the VM will be waiting for you if you choose to “fail back” to its original home.
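As a sketch of that activate-and-fail-back workflow, here is what the sequence of operations might look like. The Cluster class and its methods are hypothetical stand-ins for illustration, not the product’s real API.

```python
# A hypothetical sketch of the failover workflow at the backup site.
# clone() and schedule_replication() are illustrative names, not real APIs.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.local_fs = {}    # live, writable VMs
        self.backup_fs = {}   # replicated snapshots from the remote site

    def clone(self, vm_name):
        # Copy-on-write clone from the backup file system into local storage,
        # so the VM can be powered on and written to immediately.
        self.local_fs[vm_name] = dict(self.backup_fs.get(vm_name, {}))
        return vm_name

    def schedule_replication(self, vm_name, target):
        # Once the primary site is reachable again, changes flow back to it,
        # so the VM is ready if you later decide to fail back.
        print(f"replicating {vm_name} from {self.name} back to {target}")


# Usage during a failover at the backup site:
dr_site = Cluster("dr-site")
dr_site.backup_fs["web-server-01"] = {0: b"...", 1: b"..."}
vm = dr_site.clone("web-server-01")
dr_site.schedule_replication(vm, target="primary-site")
```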

That last operation is worth a closer look: when you fail back a VM, you replace the old VM with the version that ran on the backup site. The magic behind that is our new SnapAttach feature, which lets us safely replace the disk images of your VM with those from the failback snapshot, without otherwise affecting the configuration of your VM. That’s just one of the many things SnapAttach can do — check out the blog post about it.
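Conceptually, the disk swap looks something like the sketch below. This is a simplified model built on an assumed VM configuration record, not SnapAttach’s real implementation: the point is that only the disk references change, while CPU, memory, and network settings are left alone.

```python
# A minimal sketch of the disk-swap idea behind a fail-back: replace only the
# VM's disk images and leave the rest of its configuration untouched.
# The config dictionary and snap_attach() are illustrative assumptions.

def snap_attach(vm_config, failback_disks):
    """Return a VM config whose disks point at the failback snapshot clones."""
    new_config = dict(vm_config)             # CPUs, memory, NICs stay the same
    new_config["disks"] = list(failback_disks)
    return new_config

vm = {
    "name": "web-server-01",
    "cpus": 4,
    "memory_gb": 16,
    "disks": ["/local/web-server-01/disk0.img"],
}
vm = snap_attach(vm, ["/failback/web-server-01/disk0.img"])
```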

That covers the central mechanisms behind site to site replication. To me, the most interesting thing about the feature is how naturally it arises from the architecture of our core product. As an engineer, I’ve found it very gratifying to build big new features into our storage system by composing elements we already have, and I hope some sense of that has come across in this post. For a more user-centric description of the feature, don’t forget to check out the original post on site to site replication.
