Posts Tagged ‘nexenta’


Quick-Take: NexentaStor 4.0.1GA

April 14, 2014

Our open storage partner, Nexenta Systems Inc., hit a milestone this month by releasing NexentaStor 4.0.1 for general availability. This release is significant mainly because it is the first commercial release of NexentaStor based on the open source Illumos kernel and not Oracle’s OpenSolaris (now closed source). With this move, NexentaStor adheres to the company’s promise of “open source technology” that enables hardware independence and targeted flexibility.

Some highlights in 4.0.1:

  • Faster Install times
  • Better HA Cluster failover times and “easier” cluster manageability
  • Support for large memory host configurations – up to 512GB of DRAM per head/controller
  • Improved handling of intermittently faulty devices (disks with irregular I/O responses under load)
  • New (read: “not backward compatible”) Auto-Sync replication (user-configurable zfs+ssh still available for backward compatibility; see the sketch after this list) with support for replication of HA to/from non-HA clusters
    • Includes LZ4 compression (fast) option
    • Better Control of “Force Flags” from NMV
    • Better Control of Buffering and Connections
  • L2ARC Compression now supported
    • Potentially doubles the effective coverage of L2ARC (for compressible data sets)
    • Supports LZ4 compression (fast)
    • Automatically applied if dataset is likewise compressed
  • Server Message Block v2.1 support for Windows (some caveats for IDMAP users)
  • iSCSI support for Microsoft Server 2012 Cluster and Cluster Shared Volume (CSV)
  • Guided storage pool configuration wizards – Performance, Balanced and Capacity modes
  • Enhanced Support Data and Log Gathering
  • High Availability Cluster plug-in (RSF-1) binaries are now part of the installation image
  • VMware: Much better VMXNET3 support
    • no more log spew
    • MTU settings work from NMV
  • VMware: Install to PVSCSI (boot disk) from ISO no longer requires tricks
  • Upgrade from 3.x is currently “disruptive” – promised “non-disruptive” in next maintenance update
  • Improved DTrace capabilities from NMC shell for
    • COMSTAR/iSCSI/FC
    • general IO
  • Snappier, more stable NMV/GUI
    • Service port changes from 2000 to 8457
    • Multi-NMS default
    • Fast refresh for ZFS containers
    • RSF-1 defaults in “Server” settings
    • Improved iSCSI
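For shops staying on the legacy replication path called out above, the zfs+ssh mechanism is plain ZFS incremental send/receive piped over SSH. A minimal sketch of that pattern (pool, dataset, snapshot names and the target host are illustrative, not Nexenta defaults):

zfs snapshot tank/vmstore@rep-20140414
zfs send -i tank/vmstore@rep-20140413 tank/vmstore@rep-20140414 | ssh backup-head zfs receive -F backup/vmstore

The new Auto-Sync service wraps this sort of primitive with scheduling and NMV management; the raw form remains useful for one-off seeding or for moving data between otherwise incompatible versions.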

See Nexenta’s 4.0.1 Release Notes for additional changes and details.

Note, the 18TB Community Edition EULA is still hampered by the “non-commercial” language, restricting its use to home, education and academic (i.e. training, testing, lab, etc.) targets. However, the “total amount of Storage Space” license for Community is a deviation from the Enterprise licensing (typically a “raw” storage entitlement):

2.2 If You have acquired a Community Edition license, the total amount of Storage Space is limited as specified on the Site and is subject to change without notice. The Community Edition may ONLY be used for educational, academic and other non-commercial purposes expressly excluding any commercial usage. The Trial Edition licenses may ONLY be used for the sole purposes of evaluating the suitability of the Product for licensing of the Enterprise Edition for a fee. If You have obtained the Product under discounted educational pricing, You are only permitted to use the Product for educational and academic purposes only and such license expressly excludes any commercial purposes.

– NexentaStor EULA, Version 4.0; Last updated: March 18, 2014

For those who operate under the Community license, this means your total physical storage is UNLIMITED, provided your space “IN USE” falls short of 18TB (18,432 GB) at all times. Where this is important is in constructing useful arrays from “currently available” disks (SATA, SAS, etc.). Let’s say you needed 16TB of AVAILABLE space using “modern” 3TB disks. Because those spinning disks are individually larger than 600GB, array rebuild times raise the risk of a second failure PRIOR to the completion of the rebuild (and with it, data loss), so mirror or raidz2/raidz3 would be your best bet for array configuration.
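As a practical matter, the Community limit is measured against space actually in use, which is easy to keep an eye on from the appliance shell. A hedged example (Nexenta’s license meter may not map one-for-one onto these columns):

root@NexentaStor:~# zpool list
root@NexentaStor:~# zfs list -o name,used,available -r pool0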

SOLORI Note: Richard Elling made this concept exceedingly clear back in 2010, and his “ZFS data protection comparison” of 2, 3 and 4-way mirrors to raidz, raidz2 and raidz3 is still a great reference on the topic.

Elling’s MTTDL Comparison by RAID Type

 

Given 16TB in 3-way mirror or raidz2 (roughly equivalent MTTDL predictors), your 3TB disk count would follow as:

3-way Mirror Disks := RoundUp( 16 * (1024 / 1000)^3 / 70% / ( 3 * (1000 / 1024)^3 )  ) * 3 = 27 disks, or

6-disk Raidz2 Disks := RoundUp( 16 * (1024 / 1000)^3 / 70% / ( 4 * 3 * (1000 / 1024)^3 )  ) * 6 = 18 disks
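A quick sanity check of the disk-count arithmetic above – a hypothetical one-liner, with awk standing in for a calculator:

awk 'BEGIN {
  need = 16 * (1024/1000)^3 / 0.70;              # 16TB target with a 30% free-space reserve
  per_mirror = 3 * (1000/1024)^3;                # usable contribution of one 3-disk mirror set of 3TB disks
  per_raidz2 = 4 * 3 * (1000/1024)^3;            # usable contribution of one 6-disk raidz2 vdev (4 data disks)
  m = int(need/per_mirror); if (m*per_mirror < need) m++;   # RoundUp
  r = int(need/per_raidz2); if (r*per_raidz2 < need) r++;
  printf "3-way mirror : %d sets x 3 = %d disks\n", m, m*3;   # 9 x 3 = 27
  printf "6-disk raidz2: %d vdevs x 6 = %d disks\n", r, r*6;  # 3 x 6 = 18
}'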

By “raw” licensing standards, the 3-way mirror would require a 76TB license while the raidz2 volume would require a 51TB license – a difference of 25TB in licensing (around $5,300 retail). However, under the Community License, the “cost” is exactly the same, allowing for a considerable amount of flexibility in array loadout and configuration.

Why do I need 54TiB in disk to make 16TB of “AVAILABLE” storage in Raidz2?

The RAID grouping we’ve chosen is 6-disk raidz2 – akin to 4 data and 2 parity disks in RAID6 (without the fixed stripe requirement or the “write hole” penalty). This means, on average, one third of the space consumed on-disk will be in the form of parity information, so right off the top we’re losing 33% of the disk capacity. Likewise, disk manufacturers make TB not TiB disks, so we lose about 7% of “capacity” in the conversion from TB to TiB. Additionally, we like to have a healthy amount of space reserved for new block allocation and recommend 30% unused space as a target. All combined, a 6-disk raidz2 array is, at best, about 43% efficient in terms of capacity (by contrast, a 3-way mirror is only 22% space efficient). For an array based on 3TB disks, we therefore get only about 1.3TB of usable storage – per disk – with 6-disk raidz2 (by contrast, 10-disk raidz nets only 160GB additional “usable” space per disk).
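Reducing the same assumptions to the capacity-efficiency figures quoted above (again, a hypothetical awk stand-in for a calculator):

awk 'BEGIN {
  conv = (1000/1024)^3;    # the ~7% lost converting marketed TB into filesystem TiB
  printf "6-disk raidz2: %.0f%% usable, %.1f TB per 3TB disk\n", (4.0/6)*conv*0.70*100, 3*(4.0/6)*conv*0.70;
  printf "3-way mirror : %.0f%% usable, %.1f TB per 3TB disk\n", (1.0/3)*conv*0.70*100, 3*(1.0/3)*conv*0.70;
}'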

SOLORI’s Take: If you’re running 3.x in production, 4.0.1 is not suitable for in-place upgrades (yet), so testing and waiting for the “non-disruptive” maintenance release is your best option. For new installations – especially inside a VM or hypervisor environment as a Virtual Storage Appliance (VSA) – version 4.0.1 presents a better option than its 3.x siblings. If you’re familiar with 3.x, there’s not much new on the NMV side beyond better tunables and snappier response.


NFS and VMware: Perfect for Small Business? Part 1 – Introduction

August 22, 2012

Nexenta Systems’ “open storage” software made significant inroads into the VMware community over the last year with NFS storage. Even though Nexenta has been a partner with VMware for much longer, the storage vendor really made its debut at last year’s VMworld 2011 Hands-on-Labs by showcasing its NFS-for-VMware solution running on commodity hardware:

And, here’s the kicker, NexentaStor was running on industry standard hardware from Supermicro with STEC drives for write and read cache and 7200 rpm SAS drives for capacity.  Monday some DRAM on one of the four servers (two HA pairs) failed.  And no end users noticed because of our HA cluster performed correctly and failed over.  Meanwhile our load increased from a designed 33% to over 60% of the total load of the Hands on Lab due to unspecified issues with either NetApp or EMC.

Evan Powell, CEO – Nexenta Systems, VMworld Reviewed

While this was indeed an important inflection point in the VMware/Nexenta relationship, in broader terms Nexenta’s success at VMworld was probably the moment when commodity NFS stepped out of the shadow of block storage. To be fair, there are many enterprise alternatives to Nexenta for NFS storage – like NetApp and EMC – but there are few that can be deployed on commodity hardware, fewer that do both hardware and virtual storage appliances, and fewer still that have commercially licensed and community licensed distributions of the same platform.

If you’ve ever asked the question, “what’s the best storage solution for my vSphere stack?” I’d be willing to bet that NFS was not high on the list of recommendations. If you’ve looked at the related product marketing materials, as I have, or engaged front-line VMware personnel in a discussion of primary storage solutions, between 2009 and 2011, as I have, you’d be hard pressed to leave the conversation with a recommendation to use NFS. If Nexenta’s appearance can “prove” that open storage solutions based on NFS (and commodity hardware) are “ready” for big cloud infrastructures, can it be true that it’s a perfect fit for a small business’ private cloud? I’d say a resounding YES, but…

Introduction, NFS versus Block Storage

Before you say, “thanks for the tip, Collin, but who needs commercial stuff when NFS services are included in practically every Linux distribution, and ‘no cost’ solutions – like FreeNAS – make NFS cheap and easy?” – consider this: while solutions like these have been very popular with lab and bare-bones users, most enterprises (even small ones) require a “bet the business” level of support and stability that isn’t often found in “community supported” distributions and do-it-yourself implementations. Even though any NFSv3 server – properly sized and configured – should work with VMware according to its abilities, it’s up to you to decide if the basket fits your eggs. The commercial NFS vendors really know their stuff, so you’re buying expertise, experience and a well-refined playbook: something you’ll be giving up when you go it alone.

Despite NFS being “block storage’s whipping boy,” to say it is “not ready for prime time” in today’s VMware product matrix would be the height of FUD-peddling. On the contrary, a well-known multi-vendor post from noted EMC’r Chad Sakac and NetApp’s Vaughn Stewart made a great case for NFS in the enterprise back in 2009. Since then, many improvements in NFS offerings and vSphere capabilities have increased NFS’ appeal in that space, not diminished it. To quote the Virtual Geek:

“NFS is an absolutely legitimate storage model for VMware – with many advantages.”

– Chad Sakac, aka Virtual Geek, EMC VP VMware Technology Alliance

Certainly there is a lot to like in pairing NFS with vSphere 5.x no matter the scale of the enterprise. Here are some of the high-points:

  • NFS works seamlessly with Storage I/O Control and Network I/O Control to support converged network architectures;
  • NFS exposes VMDKs to 3rd party tools and scripts without VMFS proxies, enabling:
    • Simple Backup/Recovery of VM, VMDK from NAS is a file copy operation
    • Linux, Windows 7, etc. support NFS clients out of the box
    • Replication of a VM or VMDK from NAS can be achieved simply with rsync (see the sketch after this list)
    • Use of snapshotted NFS volumes does not require ESX/VMFS
  • Reclamation of unused storage is not array dependent (file deletes return to storage immediately without SCSI Unmap support or equivalent)
  • Not subject to LUN locking and related performance issues in block/VMFS
  • It’s simpler to use: in the link above, VMware dedicates 24 pages to block/VMFS and only 3 to NFS
  • Presentation and management of NAS storage is very familiar (it’s a filer)
  • NFS is very forgiving of “imperfect” network configurations – compared to iSCSI, especially where network time-outs and latency are concerned
  • NFS storage does not need to be available at ESXi boot time, enabling VMs to exist on VSA running on-top of the host ESXi server (enabling recursive storage possibilities and reduced/shared hardware costs)
  • Mounting an NFS snapshot to vSphere does not include a signature operation (or risk possible collision)
  • NFS does not require VAAI to resolve SCSI file locking and VM loading limitations consistent with SCSI-based block storage
  • vSphere 5 currently supports 256 NFS mounts per host
    • NFS.MaxVolumes (per host) – default 8, max 256
  • Single file size is not limited by the NFS datastore itself; however:
    • Without 3rd party NAS VAAI, all VMDKs on NAS are always thin provisioned
    • Single file size limited to NAS vendor file system constraints
    • VMDK uses 512-byte sectors, so it suffers from the same limitations as physical disks and will still have a 2TB-minus-512-byte limit (since VMware has no 4KB-sector VMDK, there is no way to support 2TB+ VMDKs on NFS until that changes)
  • NFS volumes are not limited in size
    • For NetApp WAFL, the limit is up to 100TB (with restrictions)
    • For NexentaStor, the limit is determined by the zpool size
  • On-line expansion of an NFS file system is a one-step operation: expand the file system on the filer

That said, NFS still cannot replace block storage on Tier 1 applications that were designed for block storage. Even iSCSI – arguably the least common denominator in shared block storage for VMware – still has some built-in advantages (and unique disadvantages) as compared to NFS. Likewise, when we’re talking about block storage in VMware we’re usually talking about VMFS too:

  • Writes are almost always asynchronous, making even low-end iSCSI “appear” to be faster than low-end NFS
  • Interface redundancy is straightforward and deterministic, with many good options available
  • Storage latency in iSCSI/block is “more predictable” across common use cases
  • vSphere 5 currently supports 256 LUNs per host (similar to NFS mount limit)
    • Disk.MaxLUN (per target) – default 256, max 256 (see the example after this list)
    • Total VMFS LUNs per host cannot exceed Disk.MaxLUN, regardless of type (FC, SAS, iSCSI, etc.)
  • vSphere VMFS3/5 limits single file size (VMDK and virtual RDM) to 2TB (minus 512 bytes)
  • VMFS3 limits single volume size to 50-64TB depending on block size chosen when formatted
  • VMFS5 limits single volume size to 64TB for VMFS5 (always uses 1MB block size)
  • vSphere’s storage telemetry is still geared towards block rather than filer storage, making troubleshooting of “performance issues” more accessible
  • Pairing storage to interface is much easier to do, even on-the-fly
  • Exchange 2010 expressly forbids the use of NAS storage as VMDK datastores
  • Virtual RDM and Clustering (shared block) require block storage (in some cases, not even iSCSI qualifies for support)
  • Tier 1 application support on block-based storage is generally better (familiarity and testing)
  • VMware VAAI for block storage ships with vSphere; similar acceleration features for NAS must come from the vendor (creating a much less robust out-of-the-box experience for SMB)
  • On-line VMFS expansion usually requires two steps, with some caveats:
    • For a single-LUN expansion under 2TB: (1) expand the underlying LUN on the SAN, (2) expand VMFS with the new space on the LUN
    • Single LUN expansions over 2TB require VMFS5
    • VMFS3 volume expansion beyond 2TB requires multiple extents, each of which may not exceed 2TB-512B – loss of a single extent in a multi-extent volume could mean loss of the entire volume
    • VMFS5 supports single LUNs (extents) as large as 60TB
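The per-host limits called out in these lists (NFS.MaxVolumes above, Disk.MaxLUN here) are ordinary ESXi advanced settings. A hedged example of inspecting and raising them from the vSphere 5.x ESXi shell – verify the option paths and your vendor’s companion tunables before changing anything:

~ # esxcfg-advcfg -g /NFS/MaxVolumes       # query the current NFS mount limit
~ # esxcfg-advcfg -s 64 /NFS/MaxVolumes    # raise it (NAS vendors typically also recommend raising Net.TcpipHeapSize/HeapMax)
~ # esxcfg-advcfg -g /Disk/MaxLUN          # query the highest LUN ID scanned per target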

Sparse VAAI issues aside, NFS is a great go-to storage protocol for most virtual workloads that do not strictly require block or shared-block storage back-ends (clustering, et al). Where NFS struggles today – in terms of VMware implementations in the SMB space – is in network resiliency. It is not that you cannot make NFS resilient to network failures; it’s more that redundancy is not neatly baked into the service or protocol like it is for iSCSI, SAS and Fibre Channel – these block-based services have mature, multi-session and multi-path capabilities at the service level (multi-path targets and initiators).

Note about 2TB VMDK limitations – given that most modern OSes running as supported virtual machines support some form of LUN concatenation (extents) to bypass 2TB physical disk limitations, the very same facilities can be leveraged to bypass the 2TB VMDK limits for these OSes. While this is not an optimal solution, it is a supported one. Today’s physical disks that exceed 2TB in size do so with 4KB sectors instead of 512B sectors. Currently, there is no 4KB sector VMDK analog.

Next Up, NFS and Path Redundancy

Hopefully by now there’s a compelling argument to look deeper into the NFS/VMware question, but – as with most shared, network storage – the rubber meets the road at the network layer. To me, the secret to making NFS more robust is in the network architecture that underpins it: depending on the complexity of the environment, the network layer will make or break an NFS implementation. In some ways there’s a lot more to making NFS “redundant” (due to its lack of multipath capabilities): it’s not impossible; it’s not difficult; it’s just full of options and caveats.

Unlike block storage, you can’t “throw up two network interfaces, two target ports and two initiator ports” and easily have path redundancy and multipath data. With NFS, the network – not the storage service – does most of the “heavy lifting” and – as you’ll see in the next post – NFS has absolutely no concept of multipath. Therefore, I’m going to spend the next entry reviewing some of the main points driving network and NFS service dependencies that make understanding NFS network resiliency a bit more accessible.


In-the-Lab: NexentaStor vs ESXi, Redux

February 24, 2012

In my last post, I mentioned a much-complained-about “idle” CPU utilization quirk with NexentaStor when running as a virtual machine. After reading many supposed remedies in forum postings (some referenced in the last blog, none of which worked) I went pit-bull on the problem… and got lucky.

For this avid (er, frequent) NexentaStor user, the luster of the NMV (Nexenta’s Web GUI) has worn off. Nearly 100% of my day-to-day operations are on the command line and/or Nexenta’s CLI (dubbed NMC). This process includes power-off events (from NMC, issue “setup appliance power-off” or “setup appliance reboot”).

For me, the problem cropped-up while running storage benchmarks on some virtual storage appliances for a client. These VSA’s are bound to a dedicated LSI 9211-8i SAS/6G controller using VMware’s PCI pass-through (Host Configuration, Hardware, Advanced Settings). The VSA uses the LSI controller to access SAS/6G disks and SSDs in a connected JBOD – this approach allows for many permutations on storage HA and avoids physical RDMs and VMDKs. Using a JBOD allows for attachments to PCIe-equipped blades, dense rack servers, etc. and has little impact on VM CPU utilization (in theory).

So I was very surprised to find idle CPU utilization (according to ESXi’s performance charting) hovering around 50% from a fresh installation. This runs contrary to my experience with NexentaStor, but I’ve seen reports of such issues on the forums and even on my own blog. I’ve never been able to reproduce more than a 15-20% per vCPU bias between what’s reported in the VM and what ESXi/vCenter sees. I’ve always attributed this difference to vSMP and virtual hardware (disk activity) which is not seen by the OS but is handled by the VMM.

CPU record of idle and IOzone testing of SAS-attached VSA

During the testing phase, I’m primarily looking at the disk throughput, but I notice a persistent CPU utilization of 50% – even when idle. Regardless, the 4 vCPU VSA appears to perform well (about 725MB/sec 2-process throughput on initial write) despite the CPU deficit (3 vCPU test pictured above, about 600MB/sec write). However, after writing my last blog entry, the 50% CPU leech just kept bothering me.

After wasting several hours researching and tweaking with very little (positive) effect, a client e-mail prompted an NMV walk-through which resulted in an unexpected consequence: the act of powering-off the VSA from the web GUI (NMV) resulted in significantly reduced idle CPU utilization.

Getting lucky: noticing a trend after using NMV to reboot for a client walk-through of the GUI.

Working with the 3 vCPU VSA over the previous several hours, I had consistently used the NMC (CLI) to reboot and power-off the VM. Simply using the NMV to shut down the VSA couldn’t have anything to do with idle CPU consumption, could it? Remembering that these were fresh installations, I wondered whether this was specific to a fresh installation or could also show up in an upgrade. According to the forums, this only hampered VMs, not hardware installations.

I grabbed a NexentaStor 3.1.0 VM out of the library (one that had been progressively upgraded from 3.0.1) and set about the upgrade process. The result was unexpected: no difference in idle CPU from the previous version; this problem was NOT specific to 3.1.2, but specific to the installation/setup process itself (at least that was the prevailing hypothesis.)

Regression into my legacy VSA library, upgrading from 3.1.1 to 3.1.2 to check if the problem follows the NexentaStor version.

If anything, the upgraded VSA exhibited slightly less idle CPU utilization than its previous version. Noteworthy, however, was the extremely high CPU utilization as the VSA sat waiting for a yes/no response (NMC/CLI) to the “would you like to reboot now” question at the end of the upgrade process (see chart above). Once “no” was selected, CPU dropped immediately to normal levels.

Now it seemed apparent that perhaps a vestige of the web-based setup process (completed by a series of “wizard” pages) must be lingering around (much like the yes/no CPU glutton). Fortunately, I had another freshly installed VSA to test with – exactly configured and processed as the first one. I fired-up the NMV and shut down the VSA…

Confirming the impact of the "fix" on a second fresh installed NexentaStor VSA

After powering-on the VM from the vSphere Client it was obvious. This VSA had been running idle for some time, so its idle performance baseline – established prior across several reboots from the CLI – was well recorded by the ESXi host (see above). The resulting drop in idle CPU was nothing short of astounding: the 3 vCPU configuration had dropped from a 50% average utilization to 23% idle utilization. Naturally, these findings (still anecdotal) have been forwarded on to engineers at Nexenta. Unfortunately, now I have to go back and re-run my storage benchmarks; hopefully clearing the underlying bug has reduced the needed vCPU count…


In-the-Lab: NexentaStor and VMware Tools, You Need to Tweak It…

February 24, 2012

While working on an article on complex VSAs (i.e. a virtual storage appliance with PCIe pass-through SAS controllers) an old issue came back up again: NexentaStor virtual machines still have a problem installing VMware Tools since the platform branched from OpenSolaris and began using Illumos. While this isn’t totally Nexenta’s fault – there is no “Nexenta” OS type in VMware to choose from – it would be nice if a dummy package were present to allow a smooth installation of VMware Tools; this is even the case with the latest NexentaStor release: 3.1.2.

I could not find where I had documented the fix in SOLORI’s blog, so here it is… Note, the NexentaStor VM is configured as an Oracle Solaris 11 (64-bit) virtual machine for the purpose of vCenter/ESXi. This establishes the VM’s relationship to a specific VMware Tools load. Installation of VMware Tools in NexentaStor is covered in detail in an earlier blog entry.

VMware Tools bombs-out at SUNWuiu8 package failure. Illumos-based NexentaStor has no such package.

Instead, we need to modify the vmware-config-tools.pl script directly to compensate for the loss of the SUNWuiu8 package that is explicitly required in the installation script.

Commenting out the SUNWuiu8 related section allows the tools to install with no harm to the system or functionality.

Note that the full “if” stanza where the VMware Tools installer checks for ‘tools-for-solaris’ must be commented out. Since the SUNWuiu8 package does not exist – and, more importantly, is not needed for Illumos/Nexenta – removing the reference to it is a good thing. Now the installation can proceed as normal.
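A hedged sketch of the workaround from the NexentaStor root shell – the script location assumes the usual vmware-tools-distrib layout and may differ by Tools version, so keep a backup of anything you edit:

cd /tmp/vmware-tools-distrib
cp bin/vmware-config-tools.pl bin/vmware-config-tools.pl.orig     # keep a pristine copy
vi bin/vmware-config-tools.pl       # prefix each line of the 'tools-for-solaris' SUNWuiu8 stanza with '#'
./vmware-install.pl                 # re-run the installer; it should now run to completion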

After the changes, installation completes as normal.

That’s all there is to getting the “Oracle Solaris” version of VMware Tools to work in newer NexentaStor virtual machines – now back to really fast VSA’s with JBOD-attached storage…

SOLORI’s Note: There is currently a long-standing bug that affects NexentaStor 3.1.x running as a virtual machine. Currently there is no known workaround to keep NexentaStor from running up 50% CPU utilization from ESXi’s perspective. Inside the NexentaStor VM we see very little CPU utilization, but from the VM’s performance tab we see 50% utilization on every vCPU allocated to the VM. Nexenta is reportedly looking into the cause of the problem.

I looked through this and there is nothing that stands out other than a huge number of interrupts while idle. I am not sure where those interrupts are coming from. I see something occasionally called volume-check and nmdtrace which could be causing the interrupts.

Nexenta Support

A bug report was reportedly filed a couple of days ago to investigate the issue further.


In-the-Lab: Default Rights on CIFS Shares

December 6, 2010

Following-up on the last installment of managing CIFS shares, there have been a considerable number of questions as to how to establish domain user rights on the share. From these questions it is apparent that my explanation about root-level share permissions could have been clearer. To that end, I want to look at default shares from a Windows SBS Server 2008 R2 environment and translate those settings to a working NexentaStor CIFS share deployment.

Evaluating Default Shares

In SBS Server 2008, a number of default shares are promulgated from the SBS Server. Excluding the “hidden” shares, these include:

  • Address
  • ExchangeOAB
  • NETLOGON
  • Public
  • RedirectedFolders
  • SYSVOL
  • UserShares
  • Printers

Therefore, it follows that a useful exercise in rights deployment might be to recreate a couple of these shares on a NexentaStor system and detail the methodology. I have chosen the NETLOGON and SYSVOL shares as these two represent default shares common in all Windows server environments. Here are their relative permissions:

NETLOGON

From the Windows file browser, the NETLOGON share has default permissions that look like this:

NETLOGON Share permissions

Looking at this same permission set from the command line (ICACLS.EXE), the permissions look like this:

NETLOGON permissions as reported from icacls
The key thing to observe here is the use of Windows built-in users and NT AUTHORITY accounts. Also, it is noteworthy that some administrative privileges differ depending on inheritance. For instance, the Administrators’ rights are less than “Full” on the share itself, yet they are “Full” when inherited by sub-directories and files, whereas SYSTEM’s permissions are “Full” in both contexts.

SYSVOL

From the Windows file browser, the SYSVOL share has default permissions that look like this:

SYSVOL network share permissions

Looking at this same permission set from the command line (ICACLS.EXE), the permissions look like this:

SYSVOL permissions from ICACLS.EXE
Note that Administrators privileges are truncated (not “Full”) with respect to the inherited rights on sub-dirs and files when compared to the NETLOGON share ACL.

Create CIFS Shares in NexentaStor

On a ZFS pool, create a new folder using the Web GUI (NMV) that will represent the SYSVOL share. This will look something like the following:
Creating the SYSVOL share


ZFS Pool Import Fails After Power Outage

July 15, 2010

The early summer storms have taken their toll on Alabama, and UPS failures (and shortfalls) have been popping up all over. Add consolidated, shared storage to the equation – JBODs with separate power rails, limited UPS run-time and/or no generator backup – and you have a recipe for potential data loss; at least, this is what we’ve been seeing recently.

Even with ZFS pools, data integrity in a power event cannot be guaranteed – especially when employing “desktop” drives and RAID controllers with RAM cache and no BBU (or perhaps a “bad storage admin” who has managed to disable the ZIL). When this happens, NexentaStor (and other ZFS storage devices) may even show all members of the ZFS pool as “ONLINE” as if they are awaiting proper import. However, when an import is attempted (either automatically on reboot or manually) the pool fails to import.

From the command line, the suspect pool’s status might look like this:

root@NexentaStor:~# zpool import
pool: pool0
id: 710683863402427473
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
        pool0        ONLINE
          mirror-0   ONLINE
            c1t12d0  ONLINE
            c1t13d0  ONLINE
          mirror-1   ONLINE
            c1t14d0  ONLINE
            c1t15d0  ONLINE
Looks good, but the import may fail like this:
root@NexentaStor:~# zpool import pool0
cannot import 'pool0': I/O error
Not good. This probably indicates that something is not right with the array. The next step is to try forcing the import – and this is the point where most people start to get nervous: their neck tightens up a bit and they begin to flip through a mental calendar of backup schedules and catalog of backup repositories – I know I do. It’s the forced import’s response that makes most administrators really nervous:

root@NexentaStor:~# zpool import -f pool0
pool: pool0
id: 710683863402427473
status: The pool metadata is corrupted and the pool cannot be opened.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
cannot import 'pool0': I/O error
Really not good. Did it really suggest going to backup? Ouch!

In this case, something must have happened to corrupt metadata – perhaps the non-BBU cache on the RAID device when power failed. Expensive lesson learned? Not yet. The ZFS file system still presents you with options, namely “acceptable data loss” for the period of time accounted for in the RAID controller’s cache. Since ZFS writes data in transaction groups and transaction groups normally commit in 20-30 second intervals, that RAID controller’s lack of BBU puts some or all of that pending group at risk. Here’s how to tell by testing the forced import as if data loss was allowed:

root@NexentaStor:~# zpool import -nfF pool0
Would be able to return data to its state as of Fri May 7 10:14:32 2010.
Would discard approximately 30 seconds of transactions.
or
root@NexentaStor:~# zpool import -nfF pool0
WARNING: can't open objset for pool0
If the first output is acceptable, then proceeding without the “n” option will produce the desired effect by “rewinding” (read: ignoring) the last couple of transaction groups and importing the “truncated” pool. The import will report the exact number of “seconds” worth of data that cannot be restored. Depending on the bandwidth and utilization of your system, this could be very little data or several MB worth of transactions.
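In command form, the real (non-dry-run) recovery is the same import without the “n” – a hedged example, destructive by design, so only proceed if the reported loss is tolerable:

root@NexentaStor:~# zpool import -fF pool0
root@NexentaStor:~# zpool scrub pool0          # then verify what survived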

What to do about the second option? From the man pages on “zpool import” Sun/Oracle says the following:

zpool import [-o mntopts] [-o property=value] … [-d dir | -c cachefile] [-D] [-f] [-R root] [-F [-n]] -a
Imports all pools found in the search directories. Identical to the previous command, except that all pools with a sufficient number of devices available are imported. Destroyed pools, pools that were previously destroyed with the “zpool destroy” command, will not be imported unless the -D option is specified.

-o mntopts
Comma-separated list of mount options to use when mounting datasets within the pool. See zfs(1M) for a description of dataset properties and mount options.

-o property=value
Sets the specified property on the imported pool. See the “Properties” section for more information on the available pool properties.

-c cachefile
Reads configuration from the given cachefile that was created with the “cachefile” pool property. This cachefile is used instead of searching for devices.

-d dir
Searches for devices or files in dir. The -d option can be specified multiple times. This option is incompatible with the -c option.

-D
Imports destroyed pools only. The -f option is also required.

-f
Forces import, even if the pool appears to be potentially active.

-F
Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.

-a
Searches for and imports all pools found.

-R root
Sets the “cachefile” property to “none” and the “altroot” property to “root”.

-n

Used with the -F recovery option. Determines whether a non-importable pool can be made importable again, but does not actually perform the pool recovery. For more details about pool recovery mode, see the -F option, above.

No real help here. What the documentation omits is the “-X” option. This option is only valid with the “-F” recovery mode setting, however it is NOT well documented – suffice it to say it is the last resort before acquiescing to real problem solving… Assuming the standard recovery mode “depth” of transaction replay is not quite enough to get you over the hump, the “-X” option gives you an “extended replay” by seemingly providing a scrub-like search through the transaction groups (read: potentially time consuming) until it arrives at the last reliable transaction group in the dataset.
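In command form, the extended rewind looks like the hedged example below – undocumented, potentially very slow, and strictly a last resort:

root@NexentaStor:~# zpool import -fFX pool0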
Lessons to be learned from this excursion into pool recovery are as follows:
  1. Enterprise SAS good; desktop SATA could be a trap
  2. Redundant Power + UPS + Generator = Protected; Anything else = Risk
  3. SAS/RAID Controller + Cache + BBU = Fast; SAS/RAID Controller + Cache – BBU = Train Wreck

The data integrity functions in ZFS are solid when used appropriately. When architecting your HOME/SOHO/SMB NAS appliance, pay attention to the hidden risks of “promised performance” that may walk you down the plank towards a tape-backup (or resume-writing) event. Better to leave the 5-15% performance benefit on the table or purchase adequate BBU/UPS/generator resources to sustain your system in worst-case events. In complex environments, a pending power loss can be properly mitigated through management supervisors and clever scripts: turning down resources in advance of total failure. How valuable is your data?


In-the-Lab: Install VMware Tools on NexentaStor VSA

June 17, 2010

Physical lab resources can be a challenge to “free-up” just to test a potential storage appliance. With NexentaStor, you can download a pre-configured VMware (or Xen) appliance from NexentaStor.Org, but what if you want to build your own? Here’s a little help on the subject:

  1. Download the ISO from NexentaStor.Org (see link above);
  2. Create a VMware virtual machine:
    1. 2 vCPU
    2. 4GB RAM (leaves about 3GB for ARC);
    3. CD-ROM (mapped to the ISO image);
    4. One (optionally two if you want to simulate the OS mirror) 4GB, thin provisioned SCSI disks (LSI Logic Parallel);
    5. Guest Operating System type: Sun Solaris 10 (64-bit)
    6. One E1000 for Management/NAS
    7. (optional) One E1000 for iSCSI
  3. Streamline the guest by disabling unnecessary components:
    1. floppy disk
    2. floppy controller (remove from BIOS)
    3. primary IDE controller (remove from BIOS)
    4. COM ports (remove from BIOS)
    5. Parallel ports (remove from BIOS)
  4. Boot to ISO and install NexentaStor CE
    1. (optionally) choose second disk as OS mirror during install
  5. Register your installation with Nexenta
    1. http://www.nexenta.com/register-eval
    2. (optional) Select “Solori” as the partner
  6. Complete initial WebGUI configuration wizard
    1. If you will join it to a domain, use the domain FQDN (i.e. microsoft.msft)
    2. If you choose “Optimize I/O performance…” remember to re-enable ZFS intent logging under Settings>Preferences>System
      1. Sys_zil_disable = No
  7. Shutdown the VSA
    1. Settings>Appliance>PowerOff
  8. Re-direct the CD-ROM
    1. Connect to Client Device
  9. Power-on the VSA and install VMware Tools
    1. login as admin
      1. assume root shell with “su” and root password
    2. From vSphere Client, initiate the VMware Tools install
    3. cd /tmp
      1. untar VMware Tools with “tar zxvf  /media/VMware\ Tools/vmware-solaris-tools.tar.gz”
    4. cd to /tmp/vmware-tools-distrib
      1. install VMware Tools with “./vmware-install.pl”
      2. Answer with defaults during install
    5. Check that VMware Tools shows an OK status
      1. IP address(es) of interfaces should now be registered

        VMware Tools are registered.

  10. Perform a test “Shutdown” of your VSA
    1. From the vSphere Client, issue VM>Power>Shutdown Guest

      System shutting down from VMware Tools request.

    2. Restart the VSA…

      VSA restarting in vSphere

Now VMware Tools has been installed and you’re ready to add more virtual disks and build ZFS storage pools. If you get a warning about HGFS not loading properly at boot time:

HGFS module mismatch warning.

it is not usually a big deal, but the VMware Host-Guest File System (HGFS) has been known to cause issues in some installations. Since the NexentaStor appliance is not a general-purpose operating system, you should customize the install to not use HGFS at all. To disable it, perform the following:

  1. Edit "/kernel/drv/vmhgfs.conf" (or script the change as shown after this list)
    1. Change:     name="vmhgfs" parent="pseudo" instance=0;
    2. To:     #name="vmhgfs" parent="pseudo" instance=0;
  2. Re-boot the VSA
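If you would rather script the change, a hedged one-liner follows – NexentaStor ships a GNU userland, but confirm your sed supports -i and keep a copy of the original first:

cp /kernel/drv/vmhgfs.conf /kernel/drv/vmhgfs.conf.orig
sed -i 's/^name="vmhgfs"/#name="vmhgfs"/' /kernel/drv/vmhgfs.conf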

Upon reboot, there will be no complaint about the offending HGFS module. Remember that, after updating VMware Tools at a future date, the HGFS configuration file will need to be adjusted again. By the way, this process works just as well on the NexentaStor Commercial edition, however you might want to check with technical support prior to making such changes to a licensed/supported deployment.


NexentaStor 3.0 Announced

March 2, 2010

Nexenta Systems announced its 3.0 iteration at CeBIT and in a press release this week and has provided a few more details about how the next version is shaping up. Along with the previously announced deduplication features, the NexentaStor 3.0 edition will include several enhancements to accelerate performance and virtualization applications:

  • In-line deduplication for increased storage savings (virtual machine templates, clones, etc.);
  • Broader support for 10GE adapters and SAS-2 (6Gbps, zoning, etc.) adapters;
  • Replication enhancements to simplify disaster recovery implementations;
  • An updated Virtual Machine Data Center (VMDC v3.0) optional plug-in with VMware, Xen and Hyper-V support (storage-centric control of virtual machine resource provisioning and management);

Additionally, Nexenta is promising easier high-availability (Simple and HA cluster) provisioning for mission critical implementations. Existing NexentaStor license holders will be able to upgrade to NexentaStor 3.0 at no additional cost. Nexenta Systems plans to make NexentaStor 3.0 available by the end of March 2010.

As a Nexenta partner, Solution Oriented will provide upgrade guidance to clients with valid NexentaStor support contracts once NexentaStor 3.0 has been released and tested against SOLORI stable-image storage platforms. As always, Nexenta’s VMware Ready virtual storage appliance should be your first step in evaluating upgrade potential.


Sun Adds De-Duplication to ZFS

November 3, 2009

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS – Sun’s Zettabyte File System that is at the heart of Sun’s Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:

Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS’s 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data — files, blocks, or byte ranges — are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Deduplication in ZFS can be Verified

[If you are paranoid about potential “hash collisions”] ZFS provides a ‘verify’ option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you’re so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk — but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

Jeff Bonwick’s Blog, November 2, 2009

What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.
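Once the bits land in OpenSolaris (and, later, downstream distributions), enabling it should be a one-line, per-dataset property change. A hedged sketch using the syntax from Bonwick’s post – pool and dataset names are illustrative:

zfs set dedup=on tank/vm-templates           # opt a single dataset in
zfs set dedup=sha256,verify tank/critical    # belt-and-suspenders variant for the collision-averse
zfs get dedup tank/vm-templates              # confirm the property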

What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPU’s available in the mainstream, this problem is very easily resolved. Just like the L2ARC tables consume main memory, the DDT’s will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.

When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, “in roughly a month.”

Updated: 11/04/2009 – Link to Nexenta corrected. Was incorrectly linked to “nexent.com” – typo – now correctly linked to “http://www.nexenta.com”


In-the-Lab: Full ESX/vMotion Test Lab in a Box, Part 5

September 28, 2009

In Part 4 of this series we created two vSphere virtual machines – one running ESX and one running ESXi – from a set of master images we can use for rapid deployment in case we want to expand the number of ESX servers in our lab. We showed you how to use NexentaStor to create snapshots of NFS and iSCSI volumes and create ZFS clone images from them. We then showed you how to stage the startup of the VSA and ESX hosts to “auto-start” the lab on boot-up.

In this segment, Part 5, we will create a VMware Virtual Center (vCenter) virtual machine and place the ESX and ESXi machines under management. Using this vCenter instance, we will complete the configuration of ESX and ESXi using some of the new features available in vCenter.

Part 5, Managing our ESX Cluster-in-a-Box

With our VSA and ESX servers purring along in the virtual lab, the only thing stopping us from moving forward with vMotion is the absence of a working vCenter to control the process. Once we have vCenter installed, we have 60-days to evaluate and test vSphere before the trial license expires.

Prepping for vCenter Server for vSphere

We are going to install Microsoft Windows Server 2003 STD for the vCenter Server operating system. We chose Server 2003 STD since we have limited CPU and memory resources to commit to the management of the lab and because our vCenter has no need of 64-bit resources in this use case.

Since one of our goals is to have a fully functional vMotion lab with reasonable performance, we want to create a vCenter virtual machine with at least the minimum requirements satisfied. In our 24GB lab server, we have committed 20GB to ESX, ESXi and the VSA (8GB, 8GB and 4GB, respectively). Our base ESXi instance consumes 2GB, leaving only 2GB for vCenter – or does it?

Memory Use in ESXi

VMware ESX (and ESXi) does a good job of conserving resources by limiting commitments for memory and CPU. This is not unlike any virtual-memory-capable system that puts a premium on “real” memory by moving less frequently used pages to disk. With a lot of idle virtual machines, this ability alone can create significant over-subscription possibilities for VMware; this is why it is possible for 32GB worth of VMs to run on a 16-24GB host.

Do we really want this memory paging to take place? The answer – for the consolidation use cases – is usually “yes.” This is because consolidation is born out of the need to aggregate underutilized systems in a more resource efficient way. Put another way, administrators tend to provision systems based on worst case versus average use, leaving 70-80% of those resources idle in off-peak times. Under ESX’s control those underutilized resources can be re-tasked to another VM without impacting the performance of either one.

On the other hand, our ESX and VSA virtual machines are not the typical use case. We intend to fully utilize their resources and let them determine how to share them in turn. Imagine a good number of virtual machines running on our virtualized ESX hosts: will they perform well with the added hardship of memory paging? Also, when we begin to use vMotion, those CPU and memory resources will appear on BOTH virtualized ESX servers at the same time.

It is pretty clear that if all of our lab storage is committed to the VSA, we do not want to page its memory. Remember that any additional memory not in use by the SAN OS in our VSA is employed as ARC cache for ZFS to increase read performance. Paging memory that is assumed to be “high performance” by NexentaStor would result in poor storage throughput. The key to “recursive computing” is knowing how to anticipate resource bottlenecks and deploy around them.

This brings the question: how much memory is left after reserving 4GB for the VSA? To figure that out, let’s look at what NexentaStor uses at idle with 4GB provisioned:

NexentaStor's RAM footprint with 4GB provisioned, at idle.

As you can see, we have specified a 4GB reservation which appears as “4233 MB” of Host Memory consumed (4096MB+137MB). Looking at the “Active” memory we see that – at idle – the NexentaStor is using about 2GB of host RAM for OS and to support the couple of file systems mounted on the host ESXi server (recursively).

Additionally, we need to remember that each VM has a memory overhead to consider that increases with the vCPU count. For the four vCPU ESX/ESXi servers, the overhead is about 220MB each; the NexentaStor VSA consumes an additional 140MB with its two vCPU’s. Totaling-up the memory plus overhead identifies a commitment of at least 21,828MB of memory to run the VSA and both ESX guests – that leaves a little under 1.5GB for vCenter if we used a 100% reservation model.

Memory Over Commitment

The same concerns about memory hold true for our ESX and ESXi hosts – albeit in a less obvious way. We obviously want to “reserve” the memory required by the VMM – about 2.8GB and 2GB for ESX and ESXi, respectively. Additionally, we want to avoid over-subscription of memory on the host ESXi instance – if at all possible – since it will already be busy running our virtual ESX and ESXi machines.
