Coho’s SRA plugin – helping you do the right thing™

Coho Data is soon to release v2.6 of its DataStream software. This release is chock-full of goodness; don’t worry, there’ll be a blog post covering all the high-level features. I wanted to take this opportunity to explain an associated feature that we’re also releasing – Coho’s SRA plugin. For the uninitiated, an SRA (Storage Replication Adapter) is the third-party software module you install on a VMware SRM server to offload VM replication duties to your Enterprise Storage devices. Our SRA builds on an existing key Coho feature: SiteProtect, our asynchronous replication engine. “Nothing particularly interesting here,” I hear you say. “Many Enterprise Storage solutions have async replication,” and “well, vSphere has its own replication these days. And that’s free. Where’s the benefit?” Let me explain why I think you should care, and what makes SiteProtect and an ABR (array-based replication) strategy the right thing to do. First, let’s talk about the title of this blog post.

 

Do I really have to do that?

A lifetime ago (well, a long time ago at least), I spent a year in a place called Sandhurst. During a pretty exhausting twelve months I was taught many interesting principles. It was super tough to get in the door, and even harder to stay there.

One of the fundamental lessons my Troop Commander, an aloof and laconic Cavalry Officer, would drum into us was “do the right thing!”. The concept was simple. The year-long course was an exercise in making rapid, albeit considered, decisions when things are upside down – training for “the fog of war” as it’s known. He posited that whatever decision lay in front of you, if the course of action was unclear, simply ask yourself – “are you doing the *right* thing?”. He would chant this at seemingly tangential moments, perhaps to solidify the universal nature of his favourite phrase.

Aside from my acerbic Captain back then, there are plenty of highfalutin, quotable leadership studies saying much the same thing, such as this from Warren Bennis:

Managers are people who do things right and leaders are people who do the right thing. Both roles are crucial, and they differ profoundly. I often observe people in top positions doing the wrong things well.

So Forbes, what’s this got to do with the price of fish?

Coming back to our new SRA plugin: it builds on SiteProtect, the async replication we launched in 2014. To avoid getting bogged down in verbose fluff, here are the salient points on why I think Coho’s SiteProtect stands out from the crowd.

  • VM-centric
  • Bi-directional replication (both sites can be running VMs and protecting VMs at the same time)
  • All cross-WAN traffic is compressed and sent over an encrypted tunnel
  • Replication-specific bandwidth throttling
  • Also works with OpenStack VMs
  • No array specific management controllers or VMs need to be deployed (if you’re shopping for storage – check the implementation details!)
  • Space-efficient delta snapshots and clones on both sides (no re-hydration or full cloning at any point)
  • VM protection can be scheduled with VMware snapshots and VMware tools quiescing to ensure VM, OS and application consistency
  • No advanced array or vSphere configuration settings, policy changes, LUN reservation, etc. have to be applied first. Just enable.
  • Everything is thin-provisioned
  • Super simple to enable and manage
  • Recovery Time in seconds (RTO)
  • Flexible Recovery Point (RPO) schedule
  • No extra licensing required – everything for SiteProtect works out of the box. We don’t nickel-and-dime you.
  • A failback resync option to the original site (per VM) after the initial recovery
  • Failover/recovery is fully integrated with vCenter

That’s a long list, and there’s certainly more I could include, but it’s the summation of these features that really makes it so compelling. I don’t mean to trivialize the points either. For example, consider the last one: “Failover/recovery is fully integrated with vCenter”. Coho’s complete solution means that if you need to fail over one or more VMs, it does the heavy lifting by presenting and registering each VM on the recovery site’s vCenter. Yeah, browsing each VM’s folder and right-click > Add to Inventory > complete the register VM wizard gets pretty boring after a handful of VMs. Do you think you’ll have the patience to do this when you have hundreds to fail over and your boss is waiting impatiently at your cube?

Yes, a storage company can still innovate with mainstay features and produce genuinely smart improvements that actually help their users.

It’s true that in many environments and business cases, our (free to customers) SiteProtect is a simple and judicious choice for their DR & BC requirements.  Don’t assume that you need to pay for VMware’s SRM software – as always in infrastructure design, you should go back to the functional requirements.  That’s your linchpin.

Don’t get me wrong, SRM is a great tool for companies with bet-the-business vSphere deployments. For me, there are three key things that SRM adds to regular async storage replication strategies:

  • Automation of VM recovery at scale.
  • Pre-planned orchestration to ensure that services are brought back in-order. (consider both the technically correct order, and business priority order)
  • Network reconfiguration of each VM for foreign subnets.

(Remember, Coho’s own SiteProtect async replication can take care of the first bullet point)
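To make the second bullet a little more concrete, here is a minimal sketch of how a recovery order could be derived from both technical dependencies and business priority. It is not how SRM implements its recovery plans; the VM names, the `depends_on` map and the `priority` numbers are all hypothetical, and the approach is simply a topological sort that breaks ties by priority.

```python
import heapq


def recovery_order(depends_on, priority):
    """Order VMs so dependencies come first, then by business priority.

    depends_on: {vm: [vms that must be running before it]} (hypothetical input)
    priority:   {vm: int}, lower number = bring back sooner (hypothetical input)
    """
    # Count unmet dependencies per VM and build the reverse mapping.
    pending = {vm: len(deps) for vm, deps in depends_on.items()}
    dependents = {vm: [] for vm in depends_on}
    for vm, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(vm)

    # VMs with nothing left to wait for, cheapest priority number first.
    ready = [(priority[vm], vm) for vm, count in pending.items() if count == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, vm = heapq.heappop(ready)
        order.append(vm)
        for child in dependents[vm]:
            pending[child] -= 1
            if pending[child] == 0:
                heapq.heappush(ready, (priority[child], child))

    if len(order) != len(depends_on):
        raise ValueError("dependency cycle in recovery plan")
    return order


# Hypothetical environment: a domain controller, a database, a web tier
# and a low-priority intranet server.
depends_on = {
    "dc01": [],
    "sql01": ["dc01"],
    "web01": ["sql01"],
    "intranet01": ["dc01"],
}
priority = {"dc01": 1, "sql01": 1, "web01": 1, "intranet01": 3}

print(recovery_order(depends_on, priority))
# ['dc01', 'sql01', 'web01', 'intranet01']
```

The technically correct order always wins: a high-priority VM is never started before something it depends on, and priority only decides between VMs that are both ready to go.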

At the end of the day you need to decide which is more suitable for the protection of your Enterprise’s data. Time for Rock, Paper, Scissors, Lizard, Spock – “All hail Sam Kass. Hail!”

 

Bring out the runbooks

With VMware’s SRM software, probably the biggest design decision coming your way is the method of replication: Host Based Replication (HBR), aka vSphere Replication, or Array Based Replication (ABR).

Red or blue pill?


Avoiding the Primrose Path

I’ve listened to many of my peers debating the merits of each strategy. Check out GS‘s great blog post here: https://blogs.vmware.com/vsphere/2015/04/srm-abrvsvr.html as a really good starting point. That comparison has to make a generic representation on the ABR side. That’s a completely fair and reasonable thing to do (GS: awesome post as always). But for many of the reasons I gave above, it wouldn’t represent a true comparison of VMware’s HBR versus Coho’s SiteProtect. At the end of the day, both provide the replication that SRM needs, so why wouldn’t you just pick the default?

"You take the blue pill, the story ends. You wake up in your bed and believe whatever you want to believe. You take the red pill, you stay in wonderland, and I show you how deep the rabbit hole goes."


 

The above diagram shows probably the most significant reason why you should (if you can) pick an ABR solution. I’ll come to describe this diagram in more detail, but first here are just some of the other reasons I like to table when I’m talking to storage and virtualization admins.

  • HBR collapses VMware snapshots on the target site. Coho’s independent replication doesn’t do this – what you recover is what you were protecting. Do you really want to lose that history?
  • SiteProtect is available to all our users no matter their vSphere licensing; VMware’s vSphere Replication is only available on particular licensing levels.
  • VMware’s HBR has a 2,000 VM limit (no such limit with SiteProtect).
  • Coho SiteProtect has a more flexible RPO policy (vSphere replication has a maximum RPO of 24 hours).
  • vSphere replication can only keep a maximum of 24 recovery points, Coho doesn’t limit this.
  • SiteProtect can replicate all of these:
    • Fault Tolerance (FT) VMs
    • Powered-off VMs
    • Templates
    • Microsoft Failover Cluster VMs (MSCS)
    • vApps
    • Linked clones

VMware’s HBR can’t. And that’s a long list – do you really want to segment your recovery plan here?

  • Coho’s ABR doesn’t require any replication appliance VMs. That’s one less thing to deploy, maintain and update, on both sides. Not to mention the host resources (RAM, CPU cycles, …).
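If you want to sanity-check an existing protection plan against the HBR limits listed above, a quick script along these lines can help. This is only a sketch: the limits are the ones quoted in this post at the time of writing (check VMware’s current documentation before relying on them), and the `ProtectedVM` record, its `kind` labels and the sample plan are hypothetical.

```python
from dataclasses import dataclass

# Limits as quoted in this post; they are assumptions here and may have
# changed in later vSphere Replication releases.
HBR_MAX_VMS = 2000
HBR_MAX_RPO_HOURS = 24
HBR_MAX_RECOVERY_POINTS = 24
HBR_UNSUPPORTED_KINDS = {
    "ft", "powered_off", "template", "mscs", "vapp", "linked_clone",
}


@dataclass
class ProtectedVM:
    name: str
    kind: str             # hypothetical label, e.g. "standard" or "template"
    rpo_hours: float      # desired recovery point objective
    recovery_points: int  # how many point-in-time copies to retain


def hbr_blockers(vms):
    """Return human-readable reasons the plan does not fit the limits above."""
    problems = []
    if len(vms) > HBR_MAX_VMS:
        problems.append(f"{len(vms)} VMs exceeds the {HBR_MAX_VMS} VM limit")
    for vm in vms:
        if vm.kind in HBR_UNSUPPORTED_KINDS:
            problems.append(f"{vm.name}: VM type '{vm.kind}' is not replicated")
        if vm.rpo_hours > HBR_MAX_RPO_HOURS:
            problems.append(
                f"{vm.name}: RPO of {vm.rpo_hours}h exceeds {HBR_MAX_RPO_HOURS}h"
            )
        if vm.recovery_points > HBR_MAX_RECOVERY_POINTS:
            problems.append(
                f"{vm.name}: {vm.recovery_points} recovery points exceeds "
                f"{HBR_MAX_RECOVERY_POINTS}"
            )
    return problems


# Hypothetical plan with one unsupported VM type and one over-long retention.
plan = [
    ProtectedVM("web01", "standard", 4, 12),
    ProtectedVM("gold-image", "template", 24, 1),
    ProtectedVM("archive01", "standard", 48, 30),
]

for problem in hbr_blockers(plan):
    print(problem)
```

Every line it prints is a VM you would have to protect some other way, which is exactly the segmentation of the recovery plan mentioned above.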

Let’s come back to the above diagram as it represents a significant difference and has several wide-ranging design impacts. With an ABR strategy, obviously all replication traffic traverses the WAN direct from one array to another. With HBR all VM replication is pulled from the array up to the host, across to a remote host and pushed down to the remote array. When you first turn this on, and every time you create and register a VM for HBR replication, the entire VM has to be “sucked up the same pipe” you’re relying on for your VMs’ latency-sensitive storage link. And back down on the other side. Now, maybe this ain’t so bad if you only have to do it once, right? Unfortunately, any time you have a host failure, even though VMware HA will recover those precious workloads lickety-split, every single VM (with its associated VMDK files) protected by vSphere Replication on that host needs to go through a “Full Sync”. Here’s what that means:

Full sync: vSphere Replication compares the source virtual disk to the target copy by using checksums to determine which blocks are “out of sync.” The “out of sync” blocks are then replicated from the source to the target. **vSphere Replication must read the entire contents of both the source and target virtual disk files. This operation can be very time consuming.**

Source: vSphere Replication FAQs

No kidding.
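To see why that is expensive, here is a toy illustration of the checksum comparison the FAQ describes. It is not VMware’s implementation: the block size, the use of SHA-256 and the in-memory “disks” are all assumptions made to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4  # absurdly small, purely for illustration


def block_checksums(disk: bytes, block_size: int = BLOCK_SIZE):
    """Checksum every block. Note that this reads the entire disk."""
    return [
        hashlib.sha256(disk[offset:offset + block_size]).digest()
        for offset in range(0, len(disk), block_size)
    ]


def out_of_sync_blocks(source: bytes, target: bytes, block_size: int = BLOCK_SIZE):
    """Return the indices of blocks that differ between source and target."""
    src = block_checksums(source, block_size)  # full read on the source side
    tgt = block_checksums(target, block_size)  # full read on the target side
    total = max(len(src), len(tgt))
    return [
        index
        for index in range(total)
        # A block missing on either side counts as out of sync.
        if index >= len(src) or index >= len(tgt) or src[index] != tgt[index]
    ]


source = b"AAAABBBBCCCCDDDD"
target = b"AAAAXXXXCCCCDDDD"
print(out_of_sync_blocks(source, target))  # [1]
```

Only one block needs to cross the WAN here, but both copies had to be read end to end just to find that out.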

On a regular-sized ESXi host with a bunch of chunky VMs, it’s not unreasonable that this could be many, many TBs that need to be re-read. On both sites. All while you’re not protected. And that’s aside from the knock-on effect on your other bandwidth-sensitive, WAN-dependent services that are competing with the resending of that data. You’re effectively hammering your storage in both datacenters with a pile of IO, and unnecessarily clogging up the host-to-array connections. Extra IO on all the switch infrastructure. And don’t forget the load on the hosts themselves.
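A quick back-of-the-envelope calculation shows what “many, many TBs” can mean in practice. The figures below (20 TB of protected VMDKs on the failed host, 500 MB/s of sustained read throughput) are made-up assumptions; plug in your own.

```python
def full_sync_read_hours(protected_tb: float, read_mb_per_s: float) -> float:
    """Hours needed to read `protected_tb` terabytes once at a sustained rate."""
    total_mb = protected_tb * 1_000_000  # decimal TB to MB
    return total_mb / read_mb_per_s / 3600


# Hypothetical host: 20 TB of protected VMDKs, 500 MB/s sustained reads.
hours = full_sync_read_hours(20, 500)
print(f"{hours:.1f} hours per site")  # 11.1 hours per site
```

That is the read time alone, on each site, before a single changed block is resent, and it assumes the full-sync reads are not competing with production IO for the same throughput.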

Okay, so I’m talking about how heavily the infrastructure gets hit with HBR on initial setup, and also on host failures. Maybe that alone is acceptable.  But because vSphere Replication can’t protect powered-off VMs, every time you power on a VM this happens. Again.

Look at the ABR red pill. It becomes pretty obvious that a seemingly minor simplification makes a huge difference. Much less LAN traffic, much less load on the ESXi hosts. Coho’s SiteProtect can carefully balance and re-prioritize replication traffic against active client IO to ensure that your VMs continue to do what they should be doing, all while the storage array is efficiently trickling data across the WAN to protect your workloads. Doing it this way means the ESXi hosts don’t need to track the changed blocks of every VMDK, which eats up host memory. This is all offloaded to your storage hardware, as it should be.

If you’re interested in learning more about how simple it is to use the SiteProtect feature, take a look at our Evaluation Guide written by our very own Christopher Wells.  Yeah, Chris knows a thing or two about what it’s like to deal with real DR/BC situations. His blog post “Real Life DR & BC, with VMware SRM”, is still one of the most well-known SRM community stories out there.

 

The Inertia of a Default Option

The Default Effect is a recognized psychological response that extends well beyond computer software. It’s often blamed on several factors such as cognitive indifference, a perceived additional cost, or the belief that a default implies a recommendation. I think it’s safe to say that most VMware customers think of vSphere’s HBR as the default way to set up SRM. It’s true that using HBR can be a perfectly appropriate choice in particular use-cases. However, I think it’s down to misplaced loyalty, lethargy and successful marketing that the wider community of datacenter architects is gravitating so heavily towards HBR being the firm default.


Specious: it’ll work, looks an attractive way to do it; but is this really the best tool for the job?

We make the assumption that a default is easier and in most cases the best option. We’re usually surprised when it isn’t. That’s the Default Effect at play. I can tell you from experience with many different virtualization-focussed replication tools that Coho’s async replication is the most straightforward I’ve ever configured. Consider all the benefits of offloading what is essentially a storage function to your storage hardware. Call that lowering OPEX, tactical efficiency, leveraging holistic alignments of a synergistic paradigm-shifting strategy; or whatever you need to explain it to your CIO. I call it the *right thing* to do. Sometimes it can take a little more effort to make the right choice and lead those around you, but that’s when you win.

It’s perhaps simplistic to just call this common sense. The decisions you make in an infrastructure design should always be dependent on the unique requirements of each environment. But I don’t think it’s reductive to consider approaching your next DR/BC project with a different viewpoint. Array Based Replication solutions vary widely in what they can do for you. Please, take the time to look at what Coho’s storage, our SiteProtect replication and SRM integration can offer. I know you won’t be disappointed.
