This post is long past due, but I’ve been using this technology more and more lately for all sorts of fun scenarios and figured it was time to write about it.
The basics of DR and BCP
If you are a virtualization administrator in any capacity, then you know that one of the hottest irons in your CTO/CIO’s fire is DR (disaster recovery) and BCP (business continuity planning). Just about every other day I see people asking on Twitter and Reddit for DR recommendations for their vSphere or Hyper-V virtualization stack. One of the worst directions these conversations take is into backups and testing them – this makes me cringe (sorry, everyone). “Make sure you have backups” and “An untested backup is a failed backup” – right, OK, yes. However, none of these things is DR in its truest sense.
Let me pull back a bit – I suppose backups can, in fact, satisfy a DR requirement for some companies. If a company can sustain a catastrophic hardware or facility failure by pulling data down from cloud storage or from tapes, then there’s no need to revisit the requirement. In today’s world, however, very few companies can operate like this. This introduces two well-known acronyms: RTO and RPO.
Recovery Time Objective (RTO): How long it takes you to organize your tapes, cloud storage, replicated data, etc. and set it all back up into some semblance of the environment(s) it once was and allow business functionality to continue.
Recovery Point Objective (RPO): The point in time prior to the disaster from which you are able to actually recover useful data. Say your backups run at 5 PM daily and take 4 hours – if a disaster occurs at 8 PM, the backup in flight is useless, so your most recent complete backup captured data as of 5 PM the previous day and your effective RPO is roughly 27 hours.
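The arithmetic in that example is easy to get wrong, so here’s a quick sketch of it (my own illustration – the function name and the assumption that a backup captures data as of its start time are mine, not from any particular backup product):

```python
from datetime import datetime, timedelta

def effective_rpo(backup_start_hour: int, backup_duration_hours: float,
                  disaster: datetime) -> timedelta:
    """Return the data-loss window for a once-daily backup schedule.

    Assumes one backup per day starting at `backup_start_hour`, capturing
    data as of its start time, and usable only once it has finished.
    """
    usable_starts = []
    # Check today's backup and the previous two days' backups.
    for days_back in range(3):
        start = disaster.replace(hour=backup_start_hour, minute=0,
                                 second=0, microsecond=0) - timedelta(days=days_back)
        finish = start + timedelta(hours=backup_duration_hours)
        if finish <= disaster:          # only completed backups count
            usable_starts.append(start)
    latest_usable = max(usable_starts)
    return disaster - latest_usable     # everything written since then is lost

# Daily 5 PM backup taking 4 hours; disaster strikes at 8 PM:
loss = effective_rpo(17, 4, datetime(2018, 3, 14, 20, 0))
```

The 5 PM backup running at the moment of the disaster never finishes, so the newest usable restore point is the previous day’s – hence the loss window is a full day plus the three hours since that backup started.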
There’s got to be a better way
I am going to be pretty blunt here – I hate backup solutions (and many of my colleagues feel the same way). They’ve come a long way since I had to administer them, but backups are still clunky and require a lot of attention. Then, if you actually need to restore from a backup, you are either transferring data from, say, an AWS S3 bucket back down to your data center, or you’re calling up your offsite storage location and paying a ton of money to have them go dig up a box with your stuff in it.
Just as storage is often made redundant via replication, virtual infrastructure can also leverage replication for redundancy. The main problem with relying solely on storage replication is that you need to design your LUNs, volumes, or file systems around your replication domain. I often see clients with a few large LUNs holding all of their VM storage. When you want to replicate some VMs on a large LUN, you end up replicating VMs you don’t intend to. That means you have to start thinking about how your VMs map to LUNs, which LUNs are replicated, and so on. And in most cases you’re going to use asynchronous replication between arrays for these LUNs, since geographic distance and inter-site connectivity mean you can’t replicate synchronously. What happens if you have a 16 TB LUN with heavy IO from ETL/DB/etc. processing? You may not be able to replicate the LUN reliably within the replication window, and your RPO goes through the roof. There has to be a better way!
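To put a rough number on the over-replication problem, here’s a back-of-the-napkin sketch (the VM names and daily change rates are made up for illustration):

```python
def lun_replication_overhead(vm_daily_change_gb: dict, protected: set):
    """Compare the data an array-based LUN replica must ship per day
    against the change you actually wanted to protect.

    vm_daily_change_gb: VM name -> GB of changed data per day on one LUN.
    protected: the VMs you actually care about replicating.
    """
    wanted = sum(gb for vm, gb in vm_daily_change_gb.items() if vm in protected)
    shipped = sum(vm_daily_change_gb.values())  # the array replicates the whole LUN
    return wanted, shipped

# Hypothetical LUN: only db01 needs DR, but the LUN carries three other VMs.
wanted, shipped = lun_replication_overhead(
    {"db01": 120, "etl01": 300, "fileshare01": 40, "test-vm": 15},
    protected={"db01"},
)
```

In this made-up case the array ships 475 GB a day across the WAN to protect 120 GB of intended change – and that chatty ETL VM is the one blowing out your replication window.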
Enter VM-level replication. If you follow my blog you may remember a post a while back about vSphere Replication. VMware provides a hypervisor-based replication solution using changed block tracking (CBT) within VMs. As of the latest release (I believe it’s vSphere Replication 6.5), you can tighten your RPO down to 5 minutes. It’s technically asynchronous replication, but you don’t get to schedule it: you can set the RPO timer, but not when a cycle starts. So if you choose to replicate a VM every 24 hours, you don’t get to pick the time of day – but as long as the data transfer can finish in time, it’ll replicate every 24 hours.
vSphere Replication, while great for what it is, does nothing more than make a VM go to another place and pop out the other side as the same exact VM (with a snapshot tree for point-in-time recovery). That sounds like what you want, usually, but it provides little automation during failover and no way to test the replication. When you fail the VM over to the destination, that’s it – there’s no “Oops, just kidding, that was just a test.” You now have a VM at the destination, and rolling back means re-configuring and re-replicating the whole VM in the other direction. Also, since you can’t make any post-cutover configuration changes, the VM keeps the same IP address at the destination as at the source, which means you can’t test automatically – you’d have to create VLANs and re-IP your DR environment just to run a test. As for automation, you can layer on VMware Site Recovery Manager (SRM), which can utilize vSphere Replication or SAN/array-based replication and perform some post-recovery configuration, but it’s still not perfect.
Also, needless to say, no matter how hard you try, if you are using vSphere Replication or SRM then you are replicating from a vSphere platform to a vSphere platform (be it your own or a cloud provider).
This is where Zerto comes in.
What makes Zerto great?
Before we start talking about all of the things Zerto can do for you, let me give you a very general idea of what Zerto looks like:
The above image depicts what a typical Zerto-replicated environment might look like. The important parts to note: at its core, Zerto is storage-agnostic, consists of a Zerto Virtual Manager (ZVM) and Virtual Replication Appliances (VRAs), and ties into some sort of management plane (vCenter, SCVMM, etc.) for API calls. You’ll notice that the ZVM is connected to the “management” layer of the stack while the VRAs live among the other VMs, with one VRA per host.
The reason there is a single VRA per host becomes obvious once you understand how Zerto works. As mentioned earlier, other replication technologies rely on changed block tracking (CBT), snapshots, and even guest agents to keep a running log of changes that gets pushed to the replica side. Zerto, on the other hand, taps directly into the virtual machine’s IO stream, witnessing the vSCSI commands the VM issues and replicating those writes to the destination once the source array acknowledges them. This means Zerto does not replicate ghost writes or writes the source has already deduplicated. The VRA also compresses the IO stream, and on top of that Zerto layers in WAN optimization and bandwidth throttling.
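Conceptually, that write path looks something like the toy model below. To be clear, this is my own simplification in Python – the class names, the queue, and the in-memory disk are all hypothetical illustrations, not Zerto’s actual implementation or API:

```python
import queue
import zlib

class InMemoryDisk:
    """Stand-in for the source datastore; just records writes."""
    def __init__(self):
        self.blocks = {}
    def write(self, offset: int, payload: bytes) -> None:
        self.blocks[offset] = payload

class WriteSplitter:
    """Toy model of a VRA-style IO tap: the VM's write lands on the source
    disk first (that's where the ack comes from), then the same payload is
    compressed and queued for asynchronous shipment to the recovery site."""
    def __init__(self, source_disk: InMemoryDisk, wan_queue: queue.Queue):
        self.source_disk = source_disk
        self.wan_queue = wan_queue
    def write(self, offset: int, payload: bytes) -> None:
        self.source_disk.write(offset, payload)   # source array acks the write
        compressed = zlib.compress(payload)       # VRA compresses the stream
        self.wan_queue.put((offset, compressed))  # shipped asynchronously

class RecoveryJournal:
    """Destination side: drain the WAN queue into an ordered journal of
    write records that point-in-time recovery could later replay."""
    def __init__(self, wan_queue: queue.Queue):
        self.wan_queue = wan_queue
        self.records = []
    def drain(self) -> None:
        while not self.wan_queue.empty():
            offset, compressed = self.wan_queue.get()
            self.records.append((offset, zlib.decompress(compressed)))
```

The key idea the sketch captures is that the tap sits beside the normal write path rather than in front of it: the source write is never delayed waiting on the WAN, which is what keeps this asynchronous and near-continuous instead of snapshot-driven.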
It gets better yet. In addition to being completely storage-agnostic, what if I also told you it is, practically speaking, hypervisor-agnostic? As of this post, the current build of Zerto is 6.0 and supports VMware vSphere, Microsoft Hyper-V, Amazon AWS, and Microsoft Azure. I’ve migrated clients from Hyper-V to vSphere, from vSphere to Azure and AWS, and every combination in between with minimal downtime, utilizing automated conversion and re-IP – which ultimately leads to on- or ahead-of-schedule migrations and excellent customer satisfaction.
What does Zerto do for my RPO and RTO?
Utilizing the vSCSI witnessing combined with compression, deduplication, and the rest, it is not uncommon to see a 5–7 second RPO on VMs being replicated across the country with Zerto. The worst-performing VPG (Virtual Protection Group) I’ve seen replicated with Zerto had a 20–30 second RPO. It’s pretty amazing. As for RTO? That depends on where you’re failing over to. vSphere to vSphere, you can expect an RTO under 60 seconds. AWS and Azure have longer RTOs because the destination VMs have to be built within the cloud provider, with Azure being a bit quicker than AWS at the time of this writing (when replicating to AWS, your data lands in S3 and then gets pulled back to EBS upon failover).
What’s more, as of the 6.0 release Zerto has really been focused on “Any2Any” cloud migration. With this come enhanced features like orchestrated failback from AWS (this used to be a pain), inter-region Azure replication, and a focus on journal file-level recovery (JFLR). Not only can you keep a sub-20-second RPO to any cloud, you can also keep up to 30 days of journal retention with thousands of recovery points, letting you restore a VM from any point in time within the journal history – and recover individual files as well. Zerto is now challenging conventional backup strategies, too.
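Journal-based point-in-time recovery ultimately boils down to picking the newest recovery point at or before the moment you want to roll back to. A minimal sketch of that selection (the checkpoint representation and function name are my own, not Zerto’s):

```python
from bisect import bisect_right
from datetime import datetime

def select_checkpoint(checkpoints: list, target: datetime) -> datetime:
    """Return the newest journal checkpoint at or before `target`.

    `checkpoints` must be sorted ascending (oldest first); raises
    ValueError if the target predates the journal's retention window.
    """
    idx = bisect_right(checkpoints, target)   # checkpoints <= target end at idx
    if idx == 0:
        raise ValueError("target predates the journal retention window")
    return checkpoints[idx - 1]
```

With checkpoints landing every few seconds and a 30-day window, that sorted list is exactly why a journal can offer thousands of recovery points while still answering “roll me back to 2:03:47 PM” instantly.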
Wrapping it up
If you read my blog you’ll know that nothing wins me over without serious convincing. When I first heard of Zerto I thought, “What a funky name – probably some bobo software trying to do what vSphere Replication does.” It wasn’t until I started looking at demos and reading about the product that I wanted to give it a go. Once I read about how the technology works I was hooked – we demo’d and now offer it to clients all the time. The pricing is pretty reasonable and it just works – but don’t take it from me, give it a try yourself.
As usual, this post is not sponsored by Zerto and I am in no way affiliated with the company – I just really like the product and have seen firsthand the problems it solves. It has made my life so much easier, both on my hosting platform for DR purposes and on client projects where the requirement is to migrate from one solution to another with minimal headache. Before Zerto, if a client wanted a DR test performed, it basically meant a full day in a conference room coordinating efforts among application teams, network engineers, and systems engineers to stand up the environment. It also usually meant either severing the VPN tunnel between production and DR (and thus pausing replication), or creating a new network on the DR side, failing VMs over into isolated networks, and handing the client and application teams RA VPN profiles for a standalone DR test. Either way, it sucked. Now, a DR test of 100 VMs can be performed by a single engineer within an hour. Simply amazing.
Check back soon – I’ll be covering actual installation and some real-world use cases for Zerto in upcoming posts!