
Quick-Take: How Virtual Backup Can Invite Disaster

August 1, 2012

There have always been things about virtualizing the enterprise that have concerned me. Most boil down to Uncle Ben’s admonishment to his nephew, Peter Parker, in Stan Lee’s Spider-Man: “with great power comes great responsibility.” Nothing could be more applicable to the state of virtualization today.

Back in “the day” when all this VMware stuff was scary and “complicated,” it carried enough “voodoo mystique” that (often de facto) VMware admins either knew everything there was to know about their infrastructure, or they left it to the experts. Today, virtualization has become so accessible that I think even my 102-year-old Nana could clone a live VM; now that is scary.

Enter Veeam Backup, et al.

Case in point: Veeam Backup & Replication 6 (VBR6). Once an infrastructure exceeds the limits of VMware Data Recovery (VDR), it just doesn’t get much easier to back up your cadre of virtual machines than VBR6. Unlike VDR, VBR6 has three modes of access to virtual machine disks:

  1. Direct SAN Access – the VBR6 backup server/proxy has direct access to the VMFS LUNs containing virtual machine disks – very fast, very low overhead;
  2. Virtual Appliance – the VBR6 backup server/proxy, running as a virtual machine, leverages its relationship to the ESXi host to access virtual machine disks, using the ESXi host as a go-between – fast, moderate overhead;
  3. Network – the VBR6 backup server/proxy accesses virtual machine disks from the ESXi hosts across the LAN, in a manner similar to the way the vSphere Client grants access to virtual machine disks – slower, with more overhead.

For block-based storage, option (1) appears to be the best way to go: it’s fast, with very little overhead in the data channel. For those of us with grey hair, think VMware Consolidated Backup (VCB) proxy server and you’re on the right track; for everyone else, think shared-disk environment. And that, boys and girls, is where we come to the point of today’s lesson…

Enter Windows Server, Updates

For all of its warts, my favorite aspect of VMware Data Recovery is that it is a virtual appliance based on a stripped-down Linux distribution. Those two qualities say “do not tamper” better than anything else, so admins – especially Windows admins – tend to just install it and use it as directed. At the very least, the appliance factor offers an opportunity for “special case” handling of updates (read: very controlled and tightly scripted).

The other “advantage” of VDR is that it uses a relatively safe method for accessing virtual machine disks: something akin to VBR6’s Virtual Appliance mode of operation. By allowing the ESXi host(s) to “proxy” access to the datastore(s), a couple of things are accomplished:

  1. Access to VMDKs is protocol-agnostic – direct-attach, iSCSI, AoE, SAS, Fibre Channel and/or NFS all work the same;
  2. Unlike “Direct SAN Access” mode, no additional initiators need to be added to the target(s)’ ACL;
  3. If the host can access the VMDK, it stands a good chance of being backed up fairly efficiently.

However, VBR6 installs onto a Windows Server, and Windows has no knowledge of what VMFS looks like nor how to handle VMFS disks. This means Windows disk management needs to be “tweaked” to ignore VMFS targets by disabling “automount” on VBR6 servers and VCB proxies (a DISKPART sketch appears below). For most shops, it also means keeping up with patch management and Windows Update (or the appropriate derivative). For an active backup server with a (pre-approved, tested) critical update, that might go something like this:

  1. Schedule the update with change management;
  2. Stage the update to the server;
  3. Put server into maintenance mode (services and applications disabled);
  4. Apply patch, reboot;
  5. Mitigate patch issues;
  6. Test application interaction;
  7. Rinse, repeat;
  8. Release server back to production;
  9. Update change management.

See the problem? If Windows Server 2008 R2 SP1 is involved, you just might have one right around step 5…
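Incidentally, the automount “tweak” itself is nothing exotic. Here is a minimal DISKPART sketch of it – the script file name is arbitrary, it should be run on the backup server/proxy before any VMFS LUNs are presented to it, and the SAN-policy line at the end is an optional extra precaution of mine, not part of any vendor procedure:

    rem no-automount.txt -- run with: diskpart /s no-automount.txt (file name is arbitrary)
    rem show the current automount setting first
    automount
    rem keep Windows from automatically mounting any new basic volumes it discovers
    automount disable
    rem housekeeping: remove mount points and registry entries for volumes no longer present
    automount scrub
    rem optional extra on 2008/2008 R2: keep newly discovered shared-bus disks offline as well
    san policy=offlineshared

Nothing in that script touches volumes Windows has already mounted; it only keeps new arrivals at arm’s length.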

And the Wheels Came Off…

Service Pack 1 for Windows Server 2008 R2 requires an update to the boot configuration data (BCD), so existing VCB or VBR5/6 installations – where automount has been disabled, as above – will fail to install the service pack. In an environment with no VCB or VBR5/6 test platform, this could result in a resume-writing event for the patching guy or the backup administrator if they follow Microsoft’s advice and “fix” SP1. Why?
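As for the failure mechanism itself: the SP1 servicing stack needs to update the boot files, and on a default install those live on the normally letter-less “System Reserved” partition. A quick, read-only peek from an elevated prompt shows where they sit – a sketch only, with the exact volume path being whatever your system reports:

    rem from an elevated command prompt
    rem under "Windows Boot Manager," the device line points at the partition
    rem holding bootmgr and the BCD store; with no drive letter assigned it is
    rem reported as a \Device\HarddiskVolume path rather than a letter
    bcdedit /enum {bootmgr}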

Fixing the SP1 installation problem is quite simple:

Quick steps to do this in case you forgot are:

  1. Run DISKPART
  2. automount enable
  3. Restart
  4. Install SP1

– Technet Blogs, Windows Servicing Guy, “SP1 Fails with 0x800f0a12”

Done, right? Possibly in more ways than one. By GLOBALLY enabling automount, rebooting Windows Server and installing SP1, you’ve opened up the potential for Windows to write a signature to the VMFS volumes holding your critical infrastructure. Fortunately, it doesn’t have to end that way.

Avoiding the Avoidable

Veeam’s been around long enough to have some great forum participants from across the administrative spectrum. Fortunately, one of them posted a method that keeps us well away from VMFS corruption and still solves the SP1 issue in a targeted way: temporarily mount the “hidden” system partition instead of enabling the global automount feature. Here’s my take on the process (GUI mode, with a DISKPART equivalent at the end):

  1. Inside Server Manager, open Disk Management (or run diskmgmt.msc from an admin command prompt);
  2. Right-click the partition labeled “System Reserved” and select “Change Drive Letter and Paths…”;
  3. On the pop-up, click the “Add…” button, accept the default drive letter offered and click “OK”;
  4. Retry the Service Pack 1 installation and reboot;
  5. Once SP1 is installed, re-run Disk Management;
  6. Right-click the “System Reserved” partition and select “Change Drive Letter and Paths…”;
  7. Click the “Remove” button to unmap the drive letter;
  8. Click “Yes” at the “Are you sure…” prompt;
  9. Click “Yes” at the “Do you want to continue?” prompt;
  10. Reboot (for good measure).

This process assumes a standard deployment of the Server 2008 R2 boot volume. Of course, if there is no separate “System Reserved” partition, you wouldn’t run into the SP1 installation failure in the first place…
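For those who would rather not click through Disk Management – or who have several proxies to patch – the same mount-and-unmount dance can be done in DISKPART. A rough command-line equivalent, assuming the “System Reserved” volume shows up as volume 1 in the listing and that the letter S is free:

    rem from an elevated prompt, run: diskpart
    list volume
    rem identify the ~100 MB "System Reserved" volume (assumed to be volume 1 here)
    select volume 1
    assign letter=S
    rem exit DISKPART, retry the SP1 installation, reboot, then return to DISKPART and clean up:
    select volume 1
    remove letter=S

Either way, the point is the same: only the one partition SP1 actually needs gets mounted, and the VMFS LUNs stay invisible to Windows.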

SOLORI’s Take: The takeaway here is “consider your environment” (and the people tasked with maintaining it) before deploying Direct SAN Access mode into a VMware cluster. While it may represent “optimal” backup performance, it is not without its potential pitfalls (as demonstrated herein). Native access to SAN LUNs must come with a heavy dose of respect, caution and understanding of the underlying architecture; otherwise, I recommend Virtual Appliance mode (similar to Data Recovery’s approach).

While no VMFS volumes were harmed in the making of this blog post, the thought of what could have happened in a production environment chilled me into writing it. Direct access to the SAN layer unlocks tremendous power for modern backup: just be safe and don’t forget to heed Uncle Ben’s advice! If the idea of VMFS corruption scares you beyond your risk tolerance, Virtual Appliance mode will deliver acceptable results with minimal risk or complexity.

One comment

  1. I saw this happen with a VMware VCB a few years ago. It wiped out AD, Exchange and SQL VMs. Ugly.
