Disaster Recovery for VMware – Part 1


Disaster recovery planning seems to challenge every business. Virtualization, in addition to its operational flexibility and cost reduction benefits, has helped companies improve their DR posture. It has made it easier to move machines from production to recovery sites, but many of today’s disaster recovery tools still function at the storage layer. Legacy technologies like storage array snapshots and LUN-based replication restrict the configuration options of upstream technologies like VMware Storage DRS. If you wanted to replicate a virtual machine, you had to replicate the entire LUN it resided on. You weren’t free to leverage Storage DRS for its automated performance balancing features, because a VM could be migrated from one LUN to another, mucking up your storage-based replication.


Fortunately, over the past few years there has been great advancement in hypervisor-based replication technologies. There’s a wealth of competing products vying for customer attention, and as always, competition drives innovation and value for the consumer. This is the first of a four-part blog series that looks at various hypervisor-based disaster recovery products. Note that this isn’t a review of backup products, which are a separate category; we are looking at products specifically designed to assist companies in a disaster scenario.


Before talking about products, however, we should understand their underlying architectures and how they relate to their storage-based predecessors. Like storage-based technologies, hypervisor-based replication technologies currently come in two flavors:

Snap and replicate

Write journaling

These technologies should be very familiar to storage administrators. Write journaling is the newer of the two, and the market leader is currently EMC’s RecoverPoint product. Different storage arrays all have slightly different terms for snap and replicate technologies, but the principles are the same. It’s important to understand this because the technology will dictate how tightly you can define your recovery time objectives (RTOs) and recovery point objectives (RPOs).

First we will cover snap and replicate technologies. Snap and replicate at the hypervisor level works similarly to its storage counterpart. Instead of taking a snapshot of a storage LUN on a scheduled basis, VMware takes a snapshot of the virtual machine’s disks on a scheduled basis. This allows products to copy those disks off of the primary storage media to a secondary location. A nice benefit of VMware snap and replicate technologies is that you can use completely different types of storage systems at the production and DR sites. You can use an enterprise-class SAN in the production datacenter and, if desired, internal storage at the disaster recovery location. As long as the storage subsystem is supported by VMware and has the proper performance characteristics, the technology works. Typically a technology called Changed Block Tracking keeps track of the blocks that have changed since the last copy, so only that data needs to be sent during each replication window.
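To make the snap and replicate cycle concrete, here is a minimal sketch in Python. It is a toy model under stated assumptions, not VMware’s or any vendor’s implementation: the `TrackedDisk` class, the `replicate` function, and the block sizes are all hypothetical names invented for illustration. It shows the two ideas described above: only blocks changed since the last cycle are shipped, and anything written after the last cycle is not yet protected.

```python
# Toy model of snap-and-replicate with changed block tracking.
# All names here are hypothetical; this is not a VMware or vendor API.

class TrackedDisk:
    """A virtual disk modelled as a list of blocks."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.changed = set()  # indexes written since the last replication cycle

    def write(self, index, data):
        self.blocks[index] = data
        self.changed.add(index)

    def snapshot_changes(self):
        """Freeze a point-in-time copy of the changed blocks and reset tracking."""
        delta = {i: self.blocks[i] for i in self.changed}
        self.changed = set()
        return delta


def replicate(delta, replica_blocks):
    """Apply one cycle's delta to the replica at the DR site."""
    for index, data in delta.items():
        replica_blocks[index] = data
    return len(delta)


prod = TrackedDisk(num_blocks=8)
replica = [b""] * 8

prod.write(1, b"a")
prod.write(5, b"b")
sent_first = replicate(prod.snapshot_changes(), replica)   # scheduled cycle 1

prod.write(5, b"c")
sent_second = replicate(prod.snapshot_changes(), replica)  # scheduled cycle 2

prod.write(2, b"d")  # written after the last cycle, so not yet protected
print(sent_first, sent_second, replica == prod.blocks)  # 2 1 False
```

The final write to block 2 is the data you would lose if the production site failed before the next scheduled cycle, which is why the schedule interval puts a floor under the RPO for this style of replication.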

Write journaling, which works by splitting writes, is the second technology we will examine. Like snap and replicate technologies, write splitting at the hypervisor level doesn’t require the same storage type at the primary and secondary sites. Write splitting at the hypervisor level is a fairly new technology, but it was developed by the same team that developed write splitting at the storage layer. When I evaluate a technology, I like to know there’s a history of success behind the team that created it.

Virtual machine write journaling works differently than storage-based write journaling. Instead of having a physical appliance that sits in front of your storage arrays, the write splitting occurs inside the ESXi kernel. Because the technology is splitting every write, there are some significant technical benefits. As a general rule, snap and replicate technologies can, in best-case scenarios, only achieve 15-minute RTOs and RPOs. Write journaling, under best-case scenarios, can deliver RTOs and RPOs of 5 to 10 seconds.
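The difference from snap and replicate is easier to see in code. The sketch below is a simplified model, assuming a single disk and integer timestamps; `WriteSplitter` and `recover` are hypothetical names, and real products do this inside the hypervisor rather than in Python. Every write goes to production storage and is also appended to a journal, so the DR side can rebuild the disk as of any recorded point in time.

```python
# Toy model of hypervisor-level write splitting with a journal.
# Hypothetical names throughout; not any vendor's implementation.

class WriteSplitter:
    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.journal = []  # (timestamp, block index, data), in write order

    def write(self, timestamp, index, data):
        self.primary[index] = data                     # normal write to production storage
        self.journal.append((timestamp, index, data))  # copy sent to the DR journal


def recover(journal, num_blocks, point_in_time):
    """Rebuild the disk as it looked at point_in_time by replaying the journal."""
    disk = [b""] * num_blocks
    for timestamp, index, data in journal:
        if timestamp > point_in_time:
            break  # journal is ordered, so later entries can be skipped
        disk[index] = data
    return disk


splitter = WriteSplitter(num_blocks=4)
splitter.write(10, 0, b"good")
splitter.write(20, 1, b"good")
splitter.write(30, 0, b"corrupt")  # e.g. a bad write we want to roll back past

before_corruption = recover(splitter.journal, 4, point_in_time=25)
latest = recover(splitter.journal, 4, point_in_time=30)
print(before_corruption[0], latest == splitter.primary)  # b'good' True
```

Because each write is captured as it happens rather than on a schedule, the gap between production and the journal is bounded by how quickly writes reach the DR site, not by a snapshot interval.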

While there is certainly an RTO and RPO benefit to write journaling, there are other things to consider. Hero numbers are great for the marketing team, but anyone who has worked in operations knows that what really matters about a product generally isn’t on a spec sheet. All of the products we will talk about work differently, but they all seek to achieve the same result. The supporting infrastructure and associated management costs for all of these products are critical.

Every technology we’re examining works on a management server / replication server architecture. Some of these packages use Windows proxies, while others use Linux-based proxies. If you’re planning a massive DR project, consider whether there are dozens of Windows licenses you have to account for, along with the time to patch and manage those virtual machines. If you fall into the scope of PCI, you will most likely be required to manage anti-virus and some sort of log monitoring on all those Windows servers, whereas on Linux systems anti-virus is more of an “option” according to PCI. Linux also has native syslog capabilities built in, whereas Windows does not. All of these factors can add to or reduce the total cost of ownership of a disaster recovery product.

Through the rest of this series we will look at three products that are the leaders in the disaster recovery space for VMware.

VMware SRM (running on top of vSphere Replication)

Veeam Backup & Replication

Zerto Virtual Replication

It goes without saying that another factor to consider is the price of the solution. Generally, the tighter the RTO and RPO a solution provides, the more expensive it will be. However, list pricing isn’t always cut and dried once you weigh total cost of ownership against the potential gains in RTO and RPO. In addition, each software vendor’s pricing model lends itself to a specific virtual machine configuration. If you have a virtual environment with fewer, larger servers, product X may be more favorable from a cost perspective. If you have a virtual environment with many smaller servers, product Y’s pricing model may be more favorable.
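A quick back-of-the-envelope calculation shows how much the shape of the environment can matter. The sketch below compares a per-VM licensing model with a per-socket one. Both models and every price in it are invented for illustration; they are assumptions, not the list prices or licensing terms of any product in this series.

```python
# Illustrative arithmetic only: the prices below are invented, not vendor list prices.

def per_vm_cost(num_vms, price_per_vm):
    """Licensing that charges for every protected virtual machine."""
    return num_vms * price_per_vm


def per_socket_cost(num_hosts, sockets_per_host, price_per_socket):
    """Licensing that charges for every CPU socket in the protected hosts."""
    return num_hosts * sockets_per_host * price_per_socket


PRICE_PER_VM = 1000      # hypothetical
PRICE_PER_SOCKET = 5000  # hypothetical

# Environment A: 4 two-socket hosts running 20 large VMs.
a_vm = per_vm_cost(20, PRICE_PER_VM)
a_socket = per_socket_cost(4, 2, PRICE_PER_SOCKET)

# Environment B: the same 4 hosts running 200 small VMs.
b_vm = per_vm_cost(200, PRICE_PER_VM)
b_socket = per_socket_cost(4, 2, PRICE_PER_SOCKET)

print(a_vm, a_socket)  # 20000 40000
print(b_vm, b_socket)  # 200000 40000
```

With these made-up numbers, per-VM pricing is cheaper for the environment with a few large servers, and per-socket pricing is cheaper for the one with many small servers. Run the same arithmetic with real quotes and your own VM counts before drawing any conclusions.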


See the chart above for a quick-and-dirty comparison of the three technologies we will be diving into over the next few weeks in this series.

Disaster recovery is a challenging project, but thankfully there are more options than ever for businesses to select from. Many of them are technically sound and will accomplish business goals. Many times it comes down to selecting the right architecture and price model for your business.



