THE SQL Server Blog Spot on the Web

Welcome to - The SQL Server blog spot on the web Sign in | |
in Search

The Rambling DBA: Jonathan Kehayias

The random ramblings and rantings of frazzled SQL Server DBA

Some Thoughts on Clustering SQL Server Virtual Machines

On my blog post Virtualizing SQL on VMware Reference List, Oscar Zamora (Blog | Twitter) asked the following question in a comment:

As a virtualized instance has the benefit of "failing over" to another physical box, would you consider clustering a virtualized instance?

The answer to this question more than I want to write up in a comment, so I decided instead to blog my response.  Since at least ESX 3, VMware has provided a number of high availability features in their enterprise server virtualization product.  A detailed listing of the HA features available in VSphere 4 can be found in the vSphere Availability Guide.
In this post I’ll talk about the most popular ones and the ones that VMware marketing uses to try to convince people that VMware HA solves all of their High Availability needs.

High Availability and DRS Clusters

ESX hosts can be clustered together providing high availability from the hardware failure of a host for the guests running inside of the cluster.  If a host fails, the VM’s that were running on that host fail as well, but the cluster detects this and powers the VM’s up on other hosts inside of the cluster automatically, bringing the VM’s back online and restoring application serviceability.  Other features such as Server vMotion and Storage vMotion allow movement of the VM’s and storage dynamically to allow for hardware maintenance and upgrades with little to no downtime to the applications running on the VM’s. In addition to this the VMware Tools installed inside of the guests can provide monitoring of the guest to detect Operating System failures and lock ups inside of the VM and automatically restart the VM.

VMware High Availability: Easily Deliver High Availability for All of Your Virtual Machines
VMware High Availability: Concepts, Implementation, and Best Practices

Fault Tolerance

VSphere 4 introduced a new high availability feature for VM guests called Fault Tolerance.  Fault Tolerance creates a synchronized Secondary virtual machine on another host in the high availability cluster that is lock stepped with the Primary VM.  In the event of a host failure, guests that have Fault Tolerance enabled immediately failover to their Secondary in a manner that is similar to vMotion preventing application downtime from occurring.  When this occurs a new Secondary is created on another host inside of the cluster and synchronized with the new primary maintaining the fault tolerance of the guest inside of the environment.

VMware vSphere™ 4 Fault Tolerance: Architecture and Performance
VMware® Fault Tolerance Recommendations and Considerations on VMware vSphere™ 4
Protecting Mission-Critical Workloads with VMware Fault Tolerance

What does this all mean to SQL Server?

These features are really great features provided by virtualization, but that doesn’t make them the solution to all of your High Availability needs.  SQL Server is often considered and treated by server administrators as just another application server, especially when it comes to virtualization.  However, SQL Server is not just another application and SQL Server provides its own High Availability options, like clustering, that may be more appropriate based on your environmental requirements.  There are specific reasons that the above features may not be acceptable HA features for SQL Server. 

Fault Tolerance is currently limited to single vCPU guest VM’s only, so unless your SQL Server VM’s are all single vCPU, that’s not going to help you out.  This leaves you with VMware HA and the potential for failure with automatic restart on another host.  However, what none of the VMware features provides is minimization of downtimes associated with planned Windows Updates of the guest VM’s or the application of SQL Server Service Packs.  If you have the ability to take periodically planned downtimes of one to two hours for patching the basic HA features of VMware will probably meet your needs.  This may be an acceptable configuration in your environment, and if it is, I won’t fault you for deciding that it meets your high availability needs.  I have VM’s in production that are protected first, by good database backups, and then by VMware HA. 

However, I also have systems that have minimal downtime requirements, and because of this, relying on VMware HA as my primary HA solution doesn’t provide the level of availability required for those systems.  This leads to SQL Server clustering, which is supported in virtualization if it meets specific requirements as documented on the Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment and in Bob Wards blog post SQL Server Support Policy for Failover Clustering and Virtualization gets an update.  This brings us to the question that started this discussion: “Would I consider clustering a virtualized instance?”  Maybe, if the host environment had the resources to support it, but using VM’s for my cluster nodes wouldn’t be my first choice for clustering for a couple of reasons.

First if a database has a minimal downtime requirement it probably also has a minimum performance requirement that is coupled with it.  While you can, and should, setup reservations for the resources allocated to a SQL Server VM, in NUMA enabled hosts like newer Nahalem systems, the maximum recommended size of a single VM is the resources available in a single NUMA node.  That means if you have a quad socket quad core server with 128GB RAM with 4 NUMA nodes, on per socket, each node would have 4 cores and 32GB RAM, making the largest VM 4 vCPU and 32GB RAM, unless you trade off the NUMA optimizations in ESX and memory locality.  In addition to this, you have to be careful where your SQL Server VM cluster nodes exist inside of the VMware host cluster to actually maintain the high availability of the SQL Server cluster.  If both of the SQL Server VM cluster nodes exist on the same physical host, and that host fails, both of the SQL Server VM cluster nodes are going to fail as well.  On top of these reasons, configuration of the VM guests for MSCS is not a trivial process as shown by the 36 page whitepaper Setup for Failover Clustering and Microsoft Cluster Service.

In my current environment we have multiple clustered SQL Server instances, and none of them are virtualized and at the current time we are planning to build additional clustered SQL Server instances to migrate databases that have high availability and minimal downtime requirements to.  Despite having a dedicated VMware host cluster for our SQL VM environment comprised of very powerful hardware, there isn’t any added benefit to building the clusters inside of virtual machines.  Your environment may be different and clustering inside of virtual machines may make sense for your specific requirements, but in the end it is the database downtime requirements that should determine whether or not the database should be clustered, not that VMware provides built in High Availability, because that only covers one aspect of minimizing downtime.

Published Wednesday, November 24, 2010 12:58 AM by Jonathan Kehayias



John Sansom said:

Great post JK and very well explained. So much so that even virtualization novices such as myself can get to grips with it :-)

I find this whole discussion very interesting but what I can’t seem to get my head around why or think of a scenario for, is why we would want to create a clustered solution that resides within a virtualized space, for a production scenario. I’m probably missing something, but what benefit or business case do you think there could be where such a scenario would be more preferable over a natural clustered solution?

November 24, 2010 10:37 AM

Oscar Zamora said:

Great writeup Jonathan. Thanks for elaborating.

November 24, 2010 12:31 PM

Jonathan Kehayias said:


My biggest point in this post was to point out that while VMware HA provides some significant enhancements and benefits over individual physical servers, it doesn't provide the same high availabilty that SQL Clustering does.  If you have a near zero downtime requirement a SQL Cluster is your best choice for implementing that, and it is possible to do so using two VM's on the same host, 2 VM's on different hosts, or even a physical Primary Active Node that is physical, and a Secondary Passive Node that is a VM. I probably should have pointed out that last option in my original post, but you can do a Physical/Virtual cluster for SQL Server as well.

I can create a couple of theoretical scenarios that someone might create a virtualized cluster.  I had a private chat with Brent Ozar about this topic as well, to make sure that everything I was thinking was technically accurate, and he pointed out that the passive node has no real utlization until a failover actually occurs, so while it may have 4 vCPU's and 16GB RAM, it hasn't actually committed that against the host.  

So one scenario could be a small business that is already committed to virtualization as a overall platform and has a virtualized SQL Server.  They apply a service pack to the SQL Server, and it is down for over an hour while the patches are applied, and the business decides that this level of downtime not acceptible, but they don't want to purchase two servers for a new cluster.  You could build a virtual cluster, and once everything is running, they could power off the passive node and only power it on for times when they are planning maintenance and patching.  This doesn't provide protect from a active node BSOD, or from a host failure, but they can continue to patch without substantial downtimes associated with them.

Like I said, I don't run any virtual clusters in my own environment.  Clusters are a complex thing physically for a lot of people, so adding the virtualization layer to things only complicates matters worse.

November 24, 2010 3:38 PM

John Sansom said:

Thanks for your detailed reply JK, I really it.

Oh that's a great point in your small business scenario whereby the passive node utilization (or rather lack thereof) costs could be significantly reduced, when compared to the costs of an idle physical passive node.

So we can certainly see some business benefits to a virtualized clustering solution.

Are there any advantages from a HA perspective that a virtualized clustering solution can offer that the pure physical can not?

I guess what I'm getting at is cost aside, would if be fair to say that if HA in terms of as near to zero downtime is the priority then physical SQL Server Clustering is the preferable choice?

November 25, 2010 6:29 AM

Rob said:

One thing that complicates things, even in a cluster, VM or not, SQL will have downtime regardless when applying SQL Server specific patches and SP's.  The cluster only provides benefits when installing Windows or non-SQL related patches.  I'm pretty sure this was the case in older cluster setups.  Has this changes with latest releases of Windows/SQL or the share nothing clusters?

November 25, 2010 9:11 AM

Jonathan Kehayias said:


You can do rolling updates in SQL 2008 as detailed in the following KB:

November 25, 2010 9:49 AM

Jonathan Kehayias said:


There are no advantages added to having a cluster on a virtual machine over physical.  There are actually limitations to the VMware HA features when running a MSCS cluster virtualized. My personal preference would be physical for a SQL cluster at the moment, but I may change my mind on that at some point.  I have some testing/work to do as a follow up on this topic.  If that changes my opinion I'll be sure to update this thread.

November 25, 2010 10:15 AM

retracement said:

I think this article pretty much covers everything, so well done for your efforts. One point I would stress is that Clustering is protecting your SQL Instance against failure on your host hardware and OS. Moving this setup into a Virtual environment doesn't really change anything, it is *still* the single point of failure. If you lose your virtual machine's OS due to human error, patching, software installation or just a simple glitch then your SQL instance is DOWN period. Yes snapshotting is another way but is probably not the most performant method, is manual (and therefore prone to human error) and secondly the outage for recovery would be longer than with clustering.

For this reason alone, in the right hands, I would choose Clustered Instances over non-Clustered ones (virtual or not) every time. I would also add that in the wrong hands I would avoid Clustering completely since it requires a thorough understanding and professional due diligence.

November 27, 2010 4:09 PM

HarryH said:

Good article, Jonathan.

The company I am working for is considering moving and upgrading their critical SQL Server 2005, which is currently clustered, to a VM\SQL Server 2008 environment. The people who made the proposal were under the impression that by doing so they could get rid of the clustering - something they view as being overly complex.

The question I had, as the system architect, was the one you tackle here, is Windows clustering still required? and I believe (and previously suspected) from the perspective of seamless O\S and SQL patching\upgrades, the answer is 'yes' (this is in regard to a system that must have minimal, preferably zero, downtime).

What I particularly find interesting in your article, and I feel I need to research further, is the point you make about VM limitations in terms of number of CPUs and RAM available to allocate to a SQL Server through the VM environment - I was not aware of such limitations previously, though this no doubt shows my lack of knowledge in this area.

The other point you make that is of particular interest for me, is that the cluster can be configured physical\virtual, an idea that seems to offer much potential.    

Again, thanks for article.

December 3, 2010 5:18 PM

Jonathan Kehayias said:


The numbers mentioned in this post are nowhere near the actual maximum configuration limits for a VM running on VMware.  The 4 vCPU and 32GB size mentioned in this post deals strictly with the maximum recommendation under NUMA to allow the VM to use the hypervisors built-in NUMA optimizations.  If you had a newer server with 8 core Nahalem processors and 1TB of RAM, the maximum recommendation under NUMA would be 8 vCPU's and 255GB RAM, which is also the maximum limit of vCPU's and RAM that a VM can be assigned in VSphere 4 and 4.1.  The Maximum Configuration limits for VM's can be found in the following Whitepaper:

You can certainly trade off the NUMA optimizations for a SQL VM, I was just pointing out that it is against the recommendations made by VMware.  You can find this specific recommendation in the whitepaper Virtualized SAP Performance with VMware VSphere 4:

Doing a physical to virtual cluster certainly has promise, but you have to plan for the what if scenario that the physical machine actually dies and you have to run on the VM Node for a period of time, for example by making sure that the VM host has the resources to run SQL, or that you can vMotion other machines around to allow the passive VM Node in the cluster the resources necessary in the event that you have a failover to it.  Our VM Admin found this an interesting discussion and wants to do some testing of clustering across VM Hosts in our environment.  I'm not sure when we'll get around to doing the that, but I'll blog about that at some point in the future.

December 3, 2010 6:50 PM
Anonymous comments are disabled

This Blog


Privacy Statement