Posts Tagged ‘Intel’


Quick Take: IBM Tops VMmark, Crushes Record with 4P Nehalem-EX

April 7, 2010

It was merely a matter of time before one of the new core-rich titans – Intel’s 8-core “Beckton” Nehalem-EX (Xeon 7500) or AMD’s 12-core “Magny-Cours” (Opteron 6100) – made a name for itself on VMware’s VMmark benchmark. Today, Intel draws first blood in the form of a 4-processor, 32-core, 64-thread monster from IBM: the x3850 X5 running four Xeon X7560 processors (2.266GHz, up to 2.67GHz with turbo, 130W TDP each) and 384GB of DDR3-1066 low-power registered DIMMs. Weighing-in at 70.78@48 tiles, the 4P IBM System x3850 X5 handily beats the next highest system – the 48-core HP DL785 G6, which set the record of 53.73@35 tiles back in August 2009 – besting it by over 30%.

At $3,800+ per socket for the tested Beckton chip, this is no real 2P alternative. In fact, a pair of Cisco UCS B250 M2 blades will get 52 tiles running for much less money. Looking at processor and memory configurations alone, this is a $67K+ enterprise server, resulting in a moderately-high $232/VM price point for the IBM x3850 X5.

SOLORI’s Take: The most interesting aspect of the EX benchmark is its clock-adjusted scaling factor: between 70% and 91% versus a 2P/8-core Nehalem-EP reference (Cisco UCS B200 M1, 25.06@17 tiles). The unpredictable nature of Intel’s “turbo” feature – varying with thermal loads and per-core conditions – makes an exact clock-for-clock comparison difficult. However, if the scaling factor is 90%, the EX blows away our previous expectations about the platform’s scalability. Where did we go wrong when we predicted a conservative 44@39 tiles? We’re looking at three things: (1) a bad assumption about the effectiveness of “turbo” in the EP VMmark case (setting Ref_EP_Clock to 3.33GHz), (2) underestimating EX’s scaling efficiency (assumed 70%), and (3) assuming a 2.26GHz clock for EX.

Choosing our minimum QPI/HT3 scalability factor of 75%, the predicted performance was derived as follows, using the HP ProLiant BL490 G6 as a baseline:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 75% ) * EX_Clock( 2.26 ) / EP_Clock( 2.93 ) = 39 tiles

Est. Score = Est_Tiles( 40 ) * EP_Score_per_Tile( 1.43 ) * Est_EX_Clock( 2.26 ) / Ref_EP_Clock( 2.93 ) = 44.12

Est. Nehalem-EX VMmark -> 44.12@39 tiles
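
For readers who want to play with the arithmetic, here is a minimal sketch of the estimator implied by the formulas above (Python; the function and parameter names are ours, not VMware’s or Intel’s, and the tile count is rounded before scoring, as in the post):

    # Sketch of the scaling estimate used in this post; names are ours.
    def est_tiles(ref_tiles_per_core, cores, scaling_eff, clock, ref_clock):
        """Scale a reference platform's tiles-per-core figure to a target platform."""
        return ref_tiles_per_core * cores * scaling_eff * clock / ref_clock

    def est_score(tiles, ref_score_per_tile, clock, ref_clock):
        """Scale the reference score-per-tile by clock and multiply by the tile count."""
        return tiles * ref_score_per_tile * clock / ref_clock

    # The conservative Nehalem-EX estimate above (EP baseline: 2.13 tiles/core,
    # 1.43 score/tile, 2.93GHz reference clock):
    tiles = est_tiles(2.13, 32, 0.75, 2.26, 2.93)   # ~39.4 -> reported as 39 tiles
    score = est_score(40, 1.43, 2.26, 2.93)         # ~44.1, using the rounded-up tile count
    print(f"{score:.2f}@{int(tiles)} tiles")        # 44.12@39 tiles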

Correcting for the as-tested clock/turbo numbers, using AMD’s 2P-to-4P VMmark scaling efficiency of 83%, and shifting to the new UCS baseline (with a newer ESX version), the Nehalem-EX prediction becomes:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 83% ) * EX_Clock( 2.67 ) / EP_Clock( 2.93 ) = 51 tiles

Est. Score = Est_Tiles( 51 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.67 ) / Ref_EP_Clock( 2.93 ) = 68.32

Est. Nehalem-EX VMmark -> 68.3@51 tiles

Clearly, this approach either overestimates the scaling efficiency or underestimates the “turbo” mode. IBM claims that a 2.93 GHz “turbo” setting is viable where Intel suggests 2.67 GHz is the maximum, so there is a source of potential bias. Looking at the tiles-per-core ratio of the VMmark result, the Nehalem-EX drops from 2.13 tiles per core on EP/2P platforms to 1.5 tiles per core on EX/4P platforms – about a 30% drop in per-core loading efficiency. That indicator matches well with our initial 75% scaling efficiency moving from 2P to 4P – something that AMD demonstrated with Istanbul last August. Given the high TDP of EX and IBM’s 2.93 GHz “turbo” specification, it’s possible that “turbo” is adding clock cycles (and power consumption) and compensating for a “lower” scaling efficiency than we’ve assumed. Looking at the same estimation with 2.93GHz “clock” and 71% efficiency (1.5/2.13), the numbers fall in line with VMmark:

Est. Tiles = EP_Tiles_per_core( 2.13 ) * 32 cores * Scaling_Efficiency( 71% ) * EX_Clock( 2.93 ) / EP_Clock( 2.93 ) = 48 tiles

Est. Score = Est_Tiles( 48 ) * EP_Score_per_Tile( 1.47 ) * Est_EX_Clock( 2.93 ) / Ref_EP_Clock( 2.93 ) = 70.56

Est. Nehalem-EX VMmark -> 70.56@48 tiles

This gives us a good basis for evaluating 2P vs. 4P Nehalem systems: a scaling factor of 71%, with the processor capable of pushing its clock towards the 3GHz mark within its thermal envelope. Both of these conclusions fit typical 2P-to-4P norms and Intel’s process history.

SOLORI’s 2nd Take: So where does that leave AMD’s newest 12-core chip? To date, no VMmark exists for AMD’s Magny-Cours, and AMD chips tend not to do as well in VMmark as their Intel peers due to the benchmark’s SMT-friendly loads. However, we can’t resist using the same analysis against AMD/MC’s 2.4GHz Opteron 6174SE (theoretical), using the 2P HP DL385 G6 as a baseline for core loading and the HP DL785 G6 for tile performance (best of the best cases):

Est. Tiles = HP_Tiles_per_core( 0.92 ) * 48 cores * Scaling_Efficiency( 83% ) * MC_Clock( 2.3 ) / HP_Clock( 2.6 ) = 33 tiles

Est. Score = Est_Tiles( 33 ) * HP_Score_per_Tile( 1.54 ) * Est_MC_Clock( 2.3 ) / Ref_HP_Clock( 2.8 ) = 41.8

Est. 4P Magny-Cours VMmark -> 41.8@33 tiles
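
Applying the est_tiles/est_score helpers from the sketch above to the Magny-Cours inputs gives a rough check (note the tile and score baselines use different reference systems and clocks, as described above):

    mc_tiles = est_tiles(0.92, 48, 0.83, 2.3, 2.6)   # ~32.4; the post rounds this to 33 tiles
    mc_score = est_score(33, 1.54, 2.3, 2.8)         # ~41.7, within rounding of the 41.8@33 figure
    print(round(mc_tiles, 1), round(mc_score, 1))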

That’s nowhere near good enough to top the current 8P, 48-core Istanbul VMmark at 53.73@35 tiles, so we’ll likely have to wait for faster 6100 parts to see any new AMD records. However, assuming AMD’s proposition is still “value 4P,” roughly 200 VMs at under $18K per server works out to around $90/VM or less.


Quick Take: Q1 DRAM Price Follow-up, 8GB DDR3 Below Target

March 3, 2010

In September 2009 we predicted that average 8GB DIMM prices (DDR2 and DDR3) would reach $565/stick by year end (with DDR3 being higher than DDR2), and now we’re seeing a reversal of fortunes for DDR2. At year end, the average price for benchmark DDR2/DDR3 was $591 retail, with promotional pricing pushing that below $550 as predicted. Today, we’re seeing DDR3 begin to overtake DDR2 in the 8GB ECC category, dropping below $510/stick, while DDR2 climbs to $550/stick (promotional, on $625/stick retail).

In 4GB ECC configurations, DDR2 enjoys only a slight retail advantage (13%), while promotional pricing (likely due to inventory reduction initiatives) is providing a bit better value short term. However, the price gap is only about half the power gap, with DDR3 delivering a greater than 35% reduction in power over its DDR2 equivalent (about $1.25/year/stick at $0.10/kWh). The honeymoon is almost over for DDR2.
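
As a back-of-the-envelope check on that power figure, here is a sketch; we assume the comparison is between the 4GB DDR2-800 and DDR3-1333 parts in the tables below, at $0.10/kWh:

    ddr2_watts, ddr3_watts = 5.400, 3.960   # KVR800D2D4P6/4G vs KVR1333D3D4R9S/4G operating power
    delta_kwh = (ddr2_watts - ddr3_watts) * 24 * 365 / 1000.0
    print(f"{delta_kwh:.1f} kWh/yr -> ${delta_kwh * 0.10:.2f}/yr per stick")   # ~12.6 kWh -> ~$1.26/yr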

Benchmark Server (Spot) Memory Pricing – Dual Rank DDR2 Only
DDR2 Reg. ECC Series (1.8V)

KVR800D2D4P6/4G – 4GB 800MHz DDR2 ECC Reg with Parity CL6 DIMM Dual Rank, x4 (5.400W operating)
  Jun ’09: $100.00 | Sep ’09: $117.00 (up 17%) | Dec ’09: $140.70 (up 23%, promo) | Mar ’10: $128.90 ($151 retail)

KVR667D2D4P5/4G – 4GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (5.940W operating)
  Jun ’09: $80.00 | Sep ’09: $103.00 (up 29%) | Dec ’09: $97.99 (down 5%) | Mar ’10: $128.74 ($149 retail)

KVR667D2D4P5/8G – 8GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (7.236W operating)
  Jun ’09: $396.00 | Sep ’09: $433.00 | Dec ’09: $433.00 (promo) | Mar ’10: $550.00 (promo, $625 retail)
Benchmark Server (Spot) Memory Pricing – Dual Rank DDR3 Only
DDR3 Reg. ECC Series (1.5V)

KVR1333D3D4R9S/4G – 4GB 1333MHz DDR3 ECC Reg w/Parity CL9 DIMM Dual Rank, x4 w/Therm Sen (3.960W operating)
  Jun ’09: $138.00 | Sep ’09: $151.00 (up 10%) | Dec ’09: $135.99 (down 10%) | Mar ’10: $150.74 ($170 retail)

KVR1066D3D4R7S/4G – 4GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (5.085W operating)
  Jun ’09: $132.00 | Sep ’09: $151.00 (up 15%) | Dec ’09: $137.59 (down 9%, promo) | Mar ’10: $150.74 ($170 retail)

KVR1066D3D4R7S/8G – 8GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (4.110W operating)
  Jun ’09: $1035.00 | Sep ’09: $917.00 (down 11.5%) | Dec ’09: $667.00 (down 28%) | Mar ’10: $506.59 ($584 retail, avail. 3/15)

KVR1333D3D4R9S/8GHA – 8GB 1333MHz DDR3 ECC Reg CL9 DIMM 2R x4 w/TS Server Hynix A (4.635W operating)
  Mar ’10: $584.00

SOLORI’s Take: With strong DDR3 demand and shortfalls in DDR2 supply (according to DRAMeXchange), the only things keeping DDR3 prices above DDR2 at this point are demand and inventory. As Q2/2010 introduces a rush of new workstation and server products based on DDR3, the DRAM production ramp should eventually catch up with demand somewhere towards the end of Q3/2010. Meanwhile, technology companies like VMware, Microsoft, Intel and AMD are betting on new infrastructure spending on operating systems, virtualization and hardware refresh to drive up economic market factors. If the global economic crisis deepens, this anticipated spending spree could be short-lived and its impact shallow.


Fujitsu RX300 S5 Rack Server Takes 8-core VMmark Lead

November 11, 2009

Fujitsu’s RX300 S5 rack server takes the top spot in VMware’s VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel’s top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ – the HP ProLiant BL490c G6 blade – by only about 2.5%.

With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today’s x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel’s “turbo mode” in its top-bin Nehalem-EP processors – especially in the virtualization use case – but we’ll get to that later. In any case, the resulting equation is:

More * (Threads + Memory + I/O) = Dense Virtualization

We could have added “higher execution rates” to that equation; however, virtualization is a scale-out application where threads, memory pool and I/O capabilities dominate the capacity equation – not clock speed. Adding 50% more clock provides less virtualization gain than adding 50% more cores, and reducing memory and context latency likewise provides better gains than simply upping the clock speed. That’s why a dual quad-core Nehalem at 2.6GHz will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.

Speaking of Tulsa, unlike Tulsa’s rather anaemic first-generation hyper-threading, Intel’s improved SMT in Nehalem “virtually” adds more core “power” to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP’s 2-tiles-per-core contribution to VMmark, where AMD’s six-core Istanbul provides only 1 tile per core. But exactly what is a VMmark tile, and how does core versus thread play into the result?

[Image: The Illustrated VMmark “Tile” Load]

As you can see, a “VMmark Tile” – or just “tile” for short – is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the VMs run in 64-bit mode while the other half run in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:

Operating System / Mode                  32-bit   64-bit   Memory   vCPU   Disk
Windows Server 2003 R2                   67%      33%      45%      50%    58%
SUSE Linux Enterprise Server 10 SP2      33%      67%      55%      50%    42%
32-bit                                   50%      N/A      30%      40%    58%
64-bit                                   N/A      50%      70%      60%    42%

If we stop here and accept that today’s best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and the resulting memory requirement. While that sounds like a good “rule of thumb” approach, it ignores the fact that synthetic threads (like HT and SMT) do not scale linearly the way core threads do; SMT accounts for only about 12% gain over a single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores – 12 for AMD and 8 for Intel in their Magny-Cours and Nehalem-EX (aka “Beckton”), respectively.
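
As a sketch of that rule of thumb (our reading; it deliberately ignores the SMT caveat in the paragraph above, and the per-tile figures come from the tile description earlier):

    def naive_capacity(hw_threads, vms_per_tile=6, ram_per_tile_gb=5):
        """One tile per hardware thread; 5GB of VM RAM per tile."""
        tiles = hw_threads
        return tiles, tiles * vms_per_tile, tiles * ram_per_tile_gb

    # 2P/8-core Nehalem-EP with SMT: 16 threads -> 16 tiles, 96 VMs, ~80GB of VM RAM,
    # which lands close to the 17-tile/102-VM results discussed here.
    print(naive_capacity(16))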

Learning from the Master

If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu’s latest dual-processor entry has definitely earned the title “Master of VMmark” in 2P systems – at least for now. So instead of the usual VMmark $/VM analysis (which is well established for recent VMmark entries), let’s look at the solution profile and try to glean some nuggets to take back to our data centers.

It’s Not About Raw Speed

First, we’ve noted that the processor used is not Intel’s standard “rack server” fare, but the more workstation-oriented W-series Nehalem at 130W TDP. With “turbo mode” active, this CPU is capable of driving the 3.33GHz core – on a per-core basis – up to 3.6GHz. Since we’re seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can extrapolate that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz – its “turbo” speed – while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? By looking at the tile ratio as a function of clock speed.

We know that the X5570 can run up to 3.33GHz, per core, depending on thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so), then the ratio of the tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming a clock speed of 3.33GHz, we get 0.444 – a difference of 2.5%, or the contribution of “turbo” in the W5590. Likewise, if you back-figure the “apparent speed” of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the X5570 (an 11% gain over base clock). In either case, it is clear that “turbo” is a better value at the low end of the Nehalem spectrum, as there isn’t enough thermal headroom for it to work well for the W-series.
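
Here is the same clock-normalized comparison in a few lines (a sketch; the scores and tile counts are from the two submissions, and treating both parts as 3.33GHz is the assumption discussed above):

    x5570 = {"score": 24.54, "tiles": 17, "clock": 3.33}   # HP ProLiant BL490c G6 (X5570, turbo to 3.33GHz)
    w5590 = {"score": 25.16, "tiles": 17, "clock": 3.33}   # Fujitsu RX300 S5 (W5590, 3.33GHz base)

    ratio = lambda s: s["score"] / s["tiles"] / s["clock"]
    print(round(ratio(x5570), 3), round(ratio(w5590), 3))              # 0.433 vs 0.444 (~2.5% apart)
    # "Apparent" X5570 clock if judged by the W5590's per-GHz efficiency:
    print(round(x5570["score"] / x5570["tiles"] / ratio(w5590), 2))    # ~3.25GHz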

VMmark Equals Meager Network Use

Second, we’re not seeing “fancy” networking tricks out of VMmark submissions. In the past, we’ve commented on the use of “consumer grade” switches in VMmark tests. For this reason, we can consider VMmark’s I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran the test. Here’s what that looks like:

[Image: Networking Simplified – the leader’s simple virtual networking topology]

Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin – the i82571EB. What is key here is that VMmark is not bound by network performance; if anything, additional network ports are more likely to increase IRQ sharing and reduce performance than to deliver any meaningful “optimization” of network flows.

Keeping Storage “Simple”

Third, Fujitsu’s approach to storage is elegantly simple: several “inexpensive” arrays with intelligent LUN allocation. For this, Fujitsu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujitsu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 – all connected through a Brocade 5100 fabric switch. The result looked something like this:

[Image: Fujitsu’s VMmark Storage Topology – 8 Controllers, 7 Shelves, 172 Disks and 23 LUNs]

And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This “non-default” approach allows the working VMDKs of the virtual machines to be isolated – from a storage perspective – from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.

Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is “economic” requires a deeper look. At 33 rack units for the solution – including the FC switch but not including the blade chassis – this configuration has a hefty datacenter footprint. In contrast to the old-school server/blade approach, one rack housing roughly 3 (virtual) servers per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers it replaces. Had each of those servers or blades had a mirrored disk pair, we’d be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so the arrays still represent a savings of 15.7% in storage-related power/space.

When will storage catch up?

Compared to a 98% reduction in network ports, a 30-80% reduction in server/storage CAPEX (based on $1K/TB SAN), and a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage – in the Fujitsu VMmark case – now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.

How can storage catch up to virtualization densities? First, with 2.5″ SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only about 60% of the datacenter footprint – 10U for hypervisors, 16U for storage, 26U total for this example. Moving from 3.5″ drives to 2.5″ drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.
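
The footprint arithmetic above reduces to a few lines (a sketch; we assume “storage” counts the eight DX80 heads, seven shelves and the fabric switch – about 31U – which roughly reproduces the 94% and 75% figures quoted above):

    def storage_share(storage_u, compute_u):
        """Percentage of the total rack footprint taken by storage."""
        total = storage_u + compute_u
        return round(storage_u / total * 100), total

    print(storage_share(31, 2))    # as-tested: one 2U host        -> (94, 33)
    print(storage_share(31, 10))   # five hosts at ~20:1 loading   -> (76, 41)
    print(storage_share(16, 10))   # 2.5-inch SAS rebuild          -> (62, 26)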

Saving power in storage platforms is not going to be achieved by simply shrinking disk drives – shrinking the NUMBER of disks required per “effective” LUN is what’s necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count – when applied properly – and that means 30-40% reduction in datacenter space and power utilization.

Lessons Learned

Here are our “take aways” from the Fujitsu VMmark case:

1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin’s “little brother” has the same architecture and feature set and go with the unit priced for the mainstream. (Don’t forget to factor memory density into the equation…) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio (that’s at least $3,800-5,600 per server/blade).

2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in-use ports – including out-of-band management – when factoring in switch density. For net new instalments, look for a switch that provides 10GE/SR or 10GE/CX4 options and go with 10GE/SR if power savings are driving your solution.

3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget – about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that’s a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here’s where offerings from Sun and NexentaStor are showing real gains.
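
Putting the budget guidance from items 1 and 3 together – a sketch using the 20:1 consolidation ratio and the per-VM ranges quoted above, for a three-host stack:

    vms_per_host, hosts = 20, 3
    hypervisor_budget = [rate * vms_per_host for rate in (190, 280)]        # $3,800-$5,600 per host
    storage_budget = [rate * vms_per_host * hosts for rate in (150, 750)]   # $9,000-$45,000 per stack
    print(hypervisor_budget, storage_budget)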

We’d like to see VMware update VMmark to include system power specifications so we can better gauge – from the sidelines – which solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of resultant performance. With the world’s eyes on power consumption and the ecological impact of datacenter choices, adding a “power utilization component” to the “server side” of the VMmark test would not be that significant of a “tweak.” Here’s how we think it can be done:

  1. Require power consumption of the server/VMmark related components be recorded, including:
    1. the ESX platform (rack server, blade & blade chassis, etc.)
    2. the storage platform providing ESX and test LUN(s) (all heads, shelves, switches, etc.)
    3. the switching fabric (i.e. Ethernet, 10GE, FC, etc.)
  2. Power delivered to the test harness platforms, client load machines, etc. can be ignored;
  3. Power measurements should be recorded at the following times:
    1. All equipment off (validation check);
    2. Start-up;
    3. Single tile load;
    4. 100% tile capacity;
    5. 75% tile capacity;
    6. 50% tile capacity;
  4. Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;
  5. Notations should be made concerning “cache warm-up” intervals, if applicable, where “cache optimized” storage is used.

Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to “consume” way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate “packaged solutions” instead of uncorrelated vendor “propaganda.”

Let us know what your thoughts are on the subject, either on Twitter or on our blog…


Quick Take: Red Hat and Microsoft Virtual Inter-Op

October 9, 2009

This week Red Hat and Microsoft announced support for certain of each other’s operating systems as guests in their respective hypervisor implementations: Kernel Virtual Machine (KVM) and Hyper-V, respectively. This comes on the heels of Red Hat’s Enterprise Server 5.4 announcement last month.

KVM is Red Hat’s new hypervisor that leverages the Linux kernel to accelerate support for hardware and capabilities. It was Red Hat and AMD that first demonstrated live migration between AMD- and Intel-based hypervisors using KVM late last year – then somewhat of a “Holy Grail” of hypervisor feats. With nearly a year of improvements and integration into their Red Hat Enterprise Server and Fedora “free and open source” offerings, Red Hat is almost ready to strike out in a commercially viable way.

Microsoft now officially supports the following Red Hat guest operating systems in Hyper-V:

Red Hat Enterprise Linux 5.2, 5.3 and 5.4

Red Hat likewise officially supports the following Microsoft guest operating systems in KVM:

Windows Server 2003, 2008 and 2008 R2

The goal of the announcement and associated agreements between Red Hat and Microsoft was to enable a fully supported virtualization infrastructure for enterprises with Red Hat and Microsoft assets. As such, Microsoft and Red Hat are committed to supporting their respective products whether the hypervisor environment is all Red Hat, all Hyper-V or totally heterogeneous – mixing Red Hat KVM and Microsoft Hyper-V as necessary.

“With this announcement, Red Hat and Microsoft are ensuring their customers can resolve any issues related to Microsoft Windows on Red Hat Enterprise Virtualization, and Red Hat Enterprise Linux operating on Microsoft Hyper-V, regardless of whether the problem is related to the operating system or the virtualization implementation.”

Red Hat press release, October 7, 2009

Many in the industry cite Red Hat’s adoption of KVM as a step backwards [from Xen] requiring the re-development of a significant amount of support code. However, Red Hat’s use of libvirt as a common management API has allowed the change to happen much more rapidly than critics’ assumptions had allowed. At Red Hat Summit 2009, key Red Hat officials were keen to point out just how tasty their “dog food” is:

Tim Burke, Red Hat’s vice president of engineering, said that Red Hat already runs much of its own infrastructure, including mail servers and file servers, on KVM, and is working hard to promote KVM with key original equipment manufacturer partners and vendors.

And Red Hat CTO Brian Stevens pointed out in his Summit keynote that with KVM inside the Linux kernel, Red Hat customers will no longer have to choose which applications to virtualize; virtualization will be everywhere and the tools to manage applications will be the same as those used to manage virtualized guests.

Xen vs. KVM, by Pam Derringer, SearchDataCenter.com

For system integrators and virtual infrastructure practices, Red Hat’s play is creating opportunities for differentiation. With a focus on light-weight, high-performance, I/O-driven virtualization applications and no need to support years-old established processes that are dragging on Xen and VMware, KVM stands to leap-frog the competition in the short term.

SOLORI’s Take: This news is good for all Red Hat and Microsoft customers alike. Indeed, it shows that Microsoft realizes that its licenses are being sold into the enterprise whether or not they run on physical hardware. With 20+:1 consolidation ratios now common, that represents a 5:1 license to hardware sale for Microsoft, regardless of the hypervisor. With KVM’s demonstrated CPU agnostic migration capabilities, this opens the door to an even more diverse virtualization infrastructure than ever before.

On the Red Hat side, it demonstrates how rapidly Red Hat has matured its offering following the shift to KVM earlier this year. While KVM is new to Red Hat, it is not new to Linux or to aggressive early adopters, having been added to the Linux kernel as of 2.6.20 back in early 2007. With support already in active projects like ConVirt (VM life cycle management), OpenNebula (cloud administration tools), Ganeti, and Enomaly’s Elastic Computing Platform, the game of catch-up for Red Hat and KVM is very likely to be a short one.


Quick Take: Nehalem/Istanbul Comparison at AnandTech

October 7, 2009

Johan De Gelas and crew present an interesting comparison of Dunnington, Shanghai, Istanbul and Nehalem in a new post at AnandTech this week. In the test line-up are the “top bin” parts from Intel and AMD in 4-core and 6-core incarnations:

  • Intel Nehalem-EP Xeon, X5570 2.93GHz, 4-core, 8-thread
  • Intel “Dunnington” Xeon, X7460, 2.66GHz, 6-core, 6-thread
  • AMD “Shanghai” Opteron 2389/8389, 2.9GHz, 4-core, 4-thread
  • AMD “Istanbul” Opteron 2435/8435, 2.6GHz, 6-core, 6-thread

Most important for virtualization systems architects is how vCPU scheduling affects “measured” performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:

[Image: AnandTech’s Quad Sockets vs. Dual Sockets Comparison, Oct 6, 2009]

When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:

The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570’s crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.

Johan De Gelas, AnandTech, Oct 2009

This is yet another issue that VMware architects struggle with in complex deployments. The latency in “Dunnington” is a huge contributor to its downfall and why the Penryn architecture was a dead end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts as Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.

When larger sets are used – as in vAPUS – Istanbul’s additional cores allow it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark, which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.

SOLORI’s Take: We differ with De Gelas on the reduction in vAPUS’ data set to accommodate the “cheaper” memory build of the Nehalem system. While this offers some advantages in testing, it also diminishes one of Opteron’s greatest strengths: access to cheap and abundant memory. Here we have the testing conundrum: fit the test around the competitors or the competitors around the test. The former approach presents a bias on the “pure performance” aspect of the competitors, while the latter is more typical of use-case testing.

We do not construe this issue as intentional bias on AnandTech’s part, however it is another vector to consider in the evaluation of the results. De Gelas delivers a report worth reading in its entirety, and we view this as a primer to the issues that will define the first half of 2010.


Quick Take: Dell/Nehalem Take #2, 2P VMmark Spot

September 9, 2009

The new 1st runner-up spot for VMmark in the “8 core” category was taken yesterday by Dell’s R710 – just edging out the previous second-spot HP ProLiant BL490c G6 by 0.1% – a virtual dead heat. Equipped with a pair of Xeon X5570 processors ($1,386/ea, bulk list) and 96GB of registered DDR3/1066 (12x8GB), the 2U, rack-mount R710 weighs in with a tile ratio of 1.43 over 102 VMs:

  • Dell R710 w/redundant high-output power supply, ($18,209)
  • 2 x Intel Xeon X5570 Processors (included)
  • 96GB ECC DDR3/1066 (12×8GB) (included)
  • 2 x Broadcom NetXtreme II 5709 dual-port GigabitEthernet w/TOE (included)
  • 1 x Intel PRO 1000VT quad-port GigabitEthernet (1x PCIe-x4 slot, $529)
  • 3 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
  • 1 x LSI1078 SAS Controller (on-board)
  • 8 x 15K SAS OS drive, RAID10 (included)
  • Required ProSupport package ($2,164)
  • Total as Configured: $24,559 ($241/VM, not including storage)

Three Dell/EMC CX3-40f arrays were used as the storage backing for the test. The storage system included 8GB of cache, 2 enclosures and 15 15K disks per array, delivering 19 LUNs at about 300GB each. Intel’s Hyper-Threading and “Turbo Boost” were enabled – for 8 threads per socket and up to 3.33GHz core clocking – as was VT; however, embedded SATA and USB were disabled, as is common practice.

At about $1,445/tile ($241/VM) the new “second dog” delivers its best at a 20% price premium over Lenovo’s “top dog” – although the non-standard OS drive configuration makes up half of the difference, with Dell’s mandatory support package making up the remainder. Using a simple RAID1 SAS pair and eliminating the support package would have dropped the cost to $20,421 – a dead heat with Lenovo at $182/VM.

Comparing the Dell R710 to the 2P, 12-core HP DL385 G6 Istanbul system benchmarked at 15.54@11 tiles:

  • HP DL385 G6  ($5,840)
  • 2 x AMD 2435 Istanbul Processors (included)
  • 64GB ECC DDR2/667 (8×8GB) ($433/DIMM)
  • 2 x Broadcom 5709 dual-port GigabitEthernet (on-board)
  • 1 x Intel 82571EB dual-port GigabitEthernet (1x PCIe slot, $150/ea)
  • 1 x QLogic QLE2462 FC HBA (1x PCIe slot, $1,219/ea)
  • 1 x HP SAS Controller (on-board)
  • 2 x SAS OS drive (included)
  • $10,673/system total (versus $14,696 complete from HP)

Direct pricing shows Istanbul’s numbers at $1,336/tile ($223/VM), which is a 7.5% savings per VM over the Dell R710. Going to the street – for memory only – changes the Istanbul picture to $970/tile ($162/VM), representing a 33% savings over the R710.
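
The per-tile and per-VM figures above reduce to simple division (a quick check; 6 VMs per tile):

    def per_tile_and_vm(system_cost, tiles):
        """Return ($/tile, $/VM) for a VMmark configuration."""
        return round(system_cost / tiles), round(system_cost / (tiles * 6))

    print(per_tile_and_vm(24_559, 17))   # Dell R710 as configured      -> (1445, 241)
    print(per_tile_and_vm(14_696, 11))   # HP DL385 G6, direct pricing  -> (1336, 223)
    print(per_tile_and_vm(10_673, 11))   # HP DL385 G6, street memory   -> (970, 162)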

SOLORI’s Take: Istanbul continues to offer a 20-30% CAPEX value proposition against Nehalem in the virtualization use case – even without the IOMMU and higher memory bandwidth promised in the upcoming Magny-Cours. With the HE parts running around $500 per processor, the OPEX benefits are there for Istanbul too. It is difficult to understand why HP wants to charge $900/DIMM for 8GB PC2-5300 sticks when they are available on the street for 50% less – that’s a 100% markup. Looking at what HP charges for 8GB DDR3/1066 – $1,700/DIMM – they are at least consistent. HP’s memory pricing practice makes one thing clear – customers are not buying large memory configurations from their system vendors…

On the contrary, Dell appears to be happy to offer decent prices on 8GB DDR3/1066 with their R710 at approximately $837/DIMM – almost par with street prices.  Looking to see if this parity held up with Dell’s AMD offerings, we examined the prices offered with Dell’s R805: while – at $680/DIMM – Dell’s prices were significantly better than HP’s, they still exceeded the market by 50%. Still, we were able to configure a Dell R805 with AMD 2435’s for much less than the equivalent HP system:

  • Dell R805 w/redundant power ($7,214)
  • 2 x AMD 2435 Istanbul Processors (included)
  • 64GB ECC DDR2/667 (8×8GB) ($433/ea, street)
  • 4 x Broadcom 5708 GigabitEthernet (on-board)
  • 1 x Intel PRO/1000 PT dual-port GigabitEthernet (1x PCIe slot, included)
  • 1 x QLogic QLE2462 FC HBA (1x PCIe slot, included)
  • 1 x Dell PERC SAS Controller (on-board)
  • 2 x SAS OS drive (included)
  • $10,678/system total (versus $12,702 complete from Dell)

This offering from Dell should be able to deliver equivalent performance with HP’s DL385 G6 and likewise savings/VM compared to the Nehalem-based R710. Even at the $12,702 price as delivered from Dell, the R805 represents a potential $192/VM price point – about $50/VM (25%) savings over the R710.


Intel’s $1.1B Euro Slap On the Wrist, Must Sell 2.3M Chips

May 13, 2009

May 13th, 2009 – besides being my birthday – marks the day that the European Competition Commission levied a €1.1B fine (about $1.4B US) on Intel for going “to great lengths to cover up its anti-competitive actions” and in the process having “harmed millions of European consumers.” This according to EU commissioner Neelie Kroes in an address in Brussels today. The fine could have been as large as €4B, and will go to the EU’s annual budget – not to consumers.

Commissioner Kroes was seen holding up an Intel PII/PIII processor card (SECC2) during the news conference, giving some scope to what has been a very long and drawn-out process, going back to 2000. At the heart of the matter has been Intel’s “illegal anticompetitive practices to exclude competitors from the market for computer chips called x86 central processing units (CPUs)” – namely AMD. These were apparently manifested in behind-the-scenes rebates and discounts in exchange for a reduction or termination of AMD-based products.

In a press release from Intel’s President and CEO, Paul Otellini, the fined chip maker offered this defense:

Intel takes strong exception to this decision. We believe the decision is wrong and ignores the reality of a highly competitive microprocessor marketplace – characterized by constant innovation, improved product performance and lower prices. There has been absolutely zero harm to consumers. Intel will appeal.

Intel must cover the fine immediately with a bank guarantee, which will stay sequestered until its appeal is either exhausted or the decision reversed. Based on the EU’s hunger for this type of commercial justice, the money could be tied up for many years. But the question remains: does Intel have a history of anti-competitive behavior beyond the test of rigorous competition?

Intel’s history tells a compelling story: the EU joins Japan (2004) and South Korea (2008) in finding Intel engaged in anti-competitive behavior. The question remains: how will the EU’s decision play in the US courts as AMD’s ongoing antitrust suit (2005) against Intel continues to unfold? Delayed until 2010 due to the lengthy list of depositions scheduled for the case, the EU’s decision will likely do more to tarnish Intel’s new “Promoting Innovation” campaign than settle the dispute.

So what does Intel need to do to weather the EU’s wrath? In product terms, Intel needs to move 2,262,752 of its Nehalem-EP (5500-series) chips to cover the loss. Based on a predicted 40M-unit replacement market in the US, that’s less than 5% of the market – and under 2.5% if they are 2P systems. However, Intel has promised a 9:1 value for the replacement, with some estimating that number moves to 18:1 with good results for SMT (depending on the workload).

What does this mean from an Intel 5500-series sales perspective? Here’s our estimate, using Intel’s 9:1 and 18:1 math (not forgetting the 4.5:1 for the dual-core):

Nehalem   Units Needed   Retail Value          9:1      18:1
W5580     12,545         $20,072,000.00        0.56%
X5570     121,713        $168,694,218.00       5.48%
X5560     168,227        $197,162,044.00       7.57%
X5550     174,450        $167,123,100.00       7.85%
E5540     531,715        $395,595,960.00       23.93%
E5530     419,636        $222,407,080.00       18.88%
E5520     183,533        $68,457,809.00        8.26%
E5506     262,704        $69,879,264.00        5.91%
E5504     250,051        $56,011,424.00        5.63%
E5502     106,312        $19,986,656.00        1.20%
L5520     10,516         $5,573,480.00         0.24%
L5506     21,350         $9,031,050.00         0.96%
Total     2,262,752      $1,399,994,085.00     12.97%   73.49%
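
A quick sanity check on the table: the per-SKU retail values do sum to roughly the $1.4B (≈ €1.1B) fine (a sketch; the values are copied straight from the table above):

    retail_values = [
        20_072_000, 168_694_218, 197_162_044, 167_123_100,   # W5580, X5570, X5560, X5550
        395_595_960, 222_407_080, 68_457_809, 69_879_264,    # E5540, E5530, E5520, E5506
        56_011_424, 19_986_656, 5_573_480, 9_031_050,        # E5504, E5502, L5520, L5506
    ]
    print(f"${sum(retail_values):,}")    # $1,399,994,085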

By these estimates, Intel will need to close 86.5% of the total replacement market to be able to cover the EU fines. All this assumes, of course, that they don’t offer discounts off of their “published” per-1000 chip prices. Good luck, Intel, on an exciting marketing campaign!


Quick Take: Nutty Intel VT Story

May 6, 2009

ZDnet has an interesting story that’s getting some traction about Windows 7’s XP mode and how you may not be able to run it on your Intel platform. Since the technology relies on Intel VT or AMD-V to work, if your chip doesn’t have it, you’re cooked. Unlike AMD’s all-or-nothing approach, which creates uniformity across server and workstation platforms – delivering all features to all but the “Sempron” versions of the AMD64 line – Intel likes to market “reduced feature” versions to keep price points meaningful.

Intel’s approach also makes it a nightmare for consumer end-users to determine what they get from their money, as described very well in ZDnet’s blog:

Here’s a real-world example. Dell’s Vostro 420 is a well-built, no-frills desktop PC designed for the small and medium business market. The screen [graph] below shows the current lineup of CPUs that you can choose from when you build this system to order at Dell’s website. Four of the six options support Intel VT; I’ve circled the two CPUs that don’t support VT.

(see ZDnet’s blog entry for graphic and story)


Shanghai Economics 101 – Conclusion

May 6, 2009

In the past entries, we’ve looked only at the high-end processors as applied to system prices, and we’ll continue to use those as references through the end of this one. We’ll take a look at other price/performance tiers in a later blog, but we want to finish-up on the same footing as we began; again, with an eye to how these systems play in a virtualization environment.

We decided to finish this series with an analysis of real-world application instead of just theory. We keep seeing 8-to-1, 16-to-1 and 20-to-1 consolidation ratios (VM-to-host) being offered as “real world” in today’s environment, so we wanted to analyze what that means from the economic side.

The Fallacy of Consolidation Ratios

First, consolidation ratios that speak in terms of VM-to-host are not very informative. For instance, a 16-to-1 consolidation ratio sounds good until you realize it was achieved on a $16,000 4Px4C platform. This ratio results in a $1,000-per-VM cost to the consolidator.

In contrast, take the same 16-to-1 ratio on a $6,000 2Px4C platform and it results in a $375-per-VM cost to the consolidator: a savings of nearly 60%. The key to the savings is the vCPU-to-core consolidation ratio (provided sufficient memory exists to support it). In the first example that ratio was 1:1, but in the second example the ratio is 2:1. Can we find 16:1 vCPU-to-core ratios out there? Sure, in test labs, but in the enterprise we think the valid range of vCPU-to-core consolidation ratios is much more conservative, ranging from 1:1 to 8:1 with the average (or sweet spot) falling somewhere between 3:1 and 4:1.
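
A minimal sketch of the comparison above (names are ours; we assume single-vCPU VMs, consistent with the 1:1 and 2:1 ratios cited):

    def consolidation_economics(system_cost, vms, vcpus_per_vm, cores):
        """Return ($/VM, vCPU-to-core ratio) for a consolidation scenario."""
        return system_cost / vms, vms * vcpus_per_vm / cores

    print(consolidation_economics(16_000, 16, 1, 16))   # 4Px4C: ($1,000/VM, 1.0 vCPU per core)
    print(consolidation_economics(6_000, 16, 1, 8))     # 2Px4C: ($375/VM, 2.0 vCPU per core)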

Second, we must note that memory is a growing aspect of the virtualization equation. Modern operating systems no longer “sip” memory and 512MB for a Windows or Linux VM is becoming more an exception than a rule. That puts pressure on both CPU and memory capacity as driving forces for consolidation costs. As operating system “bloat” increases, administrative pressure to satisfy their needs will mount, pushing the “provisioned” amount of memory per VM ever higher.

Until “hot add” memory is part of DRS planning and the requisite operating systems support it, system admins will be forced to either over commit memory, purchase memory based on peak needs or purchase memory based on average memory needs and trust DRS systems to handle the balancing act. In any case, memory is a growing factor in systems consolidation and virtualization.

Modeling the Future

Using data from the University of Chicago as a baseline and extrapolating forward through 2010, we’ve developed a simple model to predict vMEM and vCPU allocation trends. This approach establishes three key metrics (already used in previous entries) that determine/predict system capacity: Average Memory/VM (vMVa), Average vCPU/VM (vCVa) and Average vCPU/Core (vCCa).

Average Memory per VM (vMVa)

Average memory per VM is determined by taking the allocated memory of all VMs in a virtualized system – across all hosts – and dividing that by the total number of VMs in the system (not including non-active templates). This number is assumed to grow as virtualization moves from consolidation to standardized deployment.
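
As a sketch, the vMVa metric reduces to a simple average (the allocation list below is hypothetical, for illustration only):

    # Hypothetical allocated-memory list (MB) for the active VMs across all hosts;
    # templates and other non-active VMs are excluded, per the definition above.
    allocated_mb = [2048, 1024, 4096, 512, 1024]
    vmva = sum(allocated_mb) / len(allocated_mb)
    print(f"vMVa = {vmva:.0f} MB per VM")     # 1741 MB for this example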