Should You Use VMware Fault Tolerance All the Time?

by Davoud Teimouri · 28/07/2023

VMware Fault Tolerance (VMware FT) is one of the vSphere clustering features. It can protect services with zero downtime, but it is not suitable for every service or every virtual machine.

This post is not a tour of VMware Fault Tolerance capabilities; instead, I’ll review its use cases. By the end of this post, you’ll see why this feature is not suitable for every service or environment, and when you should and should not use it.

Some customers expect this feature to be used, but they don’t know that it cannot protect their service in every situation. This feature, like other features, has many good capabilities, but relying on it can create false confidence and leave the service exposed. You may have assured customers that the service will run without interruptions or problems, without having considered some important things.

How Does VMware Fault Tolerance Help Protect Services?

This feature protects a service by creating a copy of the virtual machine and placing that secondary virtual machine on another host. vSphere keeps the primary and secondary virtual machines in sync through a continuous mirroring process. So, if the first ESXi host fails for any reason, the service is not affected and the secondary virtual machine takes over as the primary virtual machine. FT also protects the virtual machine against storage failure if the secondary virtual machine is placed on a different storage array.

This process also helps prevent data loss, because the primary and secondary virtual machines replicate continuously.

It avoids “split-brain” situations, which can lead to two active copies of a virtual machine after recovery from a failure. Atomic file locking on shared storage is used to coordinate failover so that only one side continues running as the Primary VM and a new Secondary VM is respawned automatically.
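The idea behind an atomic tie-breaker can be illustrated with a small Python sketch. This is not how vSphere implements it; it only shows the general technique of an exclusive file create, where exactly one contender can win. The function name try_claim_primary and the lock file name are made up for this illustration.

```python
import os
import tempfile


def try_claim_primary(lock_path: str, owner: str) -> bool:
    """Try to become the Primary by atomically creating a lock file.

    O_CREAT | O_EXCL makes the create fail if the file already exists,
    so only one contender can succeed.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another contender already holds the role
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)  # record who won, for troubleshooting
    return True


if __name__ == "__main__":
    # A temporary directory stands in for the shared storage.
    with tempfile.TemporaryDirectory() as shared:
        lock = os.path.join(shared, "ft-tiebreaker.lock")  # hypothetical name
        print(try_claim_primary(lock, "vm-on-host-a"))  # True
        print(try_claim_primary(lock, "vm-on-host-b"))  # False
```

Whichever side creates the file first continues as the primary; the other side sees that the file exists and backs off, so two active copies never run at once.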

Fault Tolerance Requirements

Among the most important requirements are network bandwidth and dedicated uplinks. VMware recommends a 10 Gb or faster network connection for FT.

You should also separate the FT uplinks and network traffic from other traffic in the virtual environment. Using jumbo frames is another recommended best practice.

Fault Tolerance Limitations and Interoperability

When you want to enable and use this feature, you should consider some limitations and interoperability constraints.

Limitations

Licensing:
- vSphere Standard and Enterprise. Allows up to 2 vCPUs
- vSphere Enterprise Plus. Allows up to 8 vCPUs
RAM per FT VM: 128 GB
VM Disk Size: 2 TB
VM Virtual Disk Count: 16
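The limits above are easy to check before you try to turn FT on. Here is a minimal Python sketch that does so. The VmSpec class, the edition keys and the function name are my own inventions for this example, and the numbers are simply the ones listed above, which may differ between vSphere versions.

```python
from dataclasses import dataclass
from typing import List

# Limits copied from the list above; verify them for your vSphere version.
MAX_VCPUS_BY_LICENSE = {"standard": 2, "enterprise": 2, "enterprise_plus": 8}
MAX_RAM_GB = 128
MAX_DISK_TB = 2
MAX_DISK_COUNT = 16


@dataclass
class VmSpec:
    """Hypothetical description of a VM; not a vSphere API object."""
    name: str
    vcpus: int
    ram_gb: int
    disk_sizes_tb: List[float]


def ft_limit_violations(vm: VmSpec, license_edition: str) -> List[str]:
    """Return a list of human-readable reasons the VM exceeds FT limits."""
    problems = []
    max_vcpus = MAX_VCPUS_BY_LICENSE[license_edition]
    if vm.vcpus > max_vcpus:
        problems.append(f"{vm.vcpus} vCPUs exceeds {max_vcpus} for {license_edition}")
    if vm.ram_gb > MAX_RAM_GB:
        problems.append(f"{vm.ram_gb} GB RAM exceeds {MAX_RAM_GB} GB")
    if len(vm.disk_sizes_tb) > MAX_DISK_COUNT:
        problems.append(f"{len(vm.disk_sizes_tb)} disks exceeds {MAX_DISK_COUNT}")
    for index, size in enumerate(vm.disk_sizes_tb):
        if size > MAX_DISK_TB:
            problems.append(f"disk {index} is {size} TB, above {MAX_DISK_TB} TB")
    return problems


if __name__ == "__main__":
    vm = VmSpec(name="db01", vcpus=4, ram_gb=64, disk_sizes_tb=[0.5, 1.0])
    print(ft_limit_violations(vm, "standard"))
    print(ft_limit_violations(vm, "enterprise_plus"))
```

An empty list means the VM fits within the listed limits for that license edition; it says nothing about the interoperability restrictions.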

Interoperability

The following vSphere features are not supported for FT virtual machines:

Snapshots: Snapshots must be removed or committed before FT can be enabled on a virtual machine. In addition, it is not possible to take snapshots of virtual machines on which FT is enabled.
Storage vMotion: You cannot invoke Storage vMotion for virtual machines with FT turned on. To migrate the storage, you should temporarily turn off FT, and perform the storage vMotion action. When this is complete, you can turn FT back on.
Linked clones: You cannot use FT on a virtual machine that is a linked clone, nor can you create a linked clone from an FT-enabled virtual machine.
Virtual Volume datastores.
Storage-based policy management: Not supported in general, but storage policies are supported for vSAN storage.
I/O filters.
TPM.
VBS enabled VMs.

Not all third-party devices, features, or products can interoperate with FT.

For a virtual machine to be compatible with FT, the Virtual Machine must not use the following features or devices.

Incompatible Feature or Device	Corrective Action
Physical Raw Disk mapping (RDM).	With legacy FT you can reconfigure virtual machines with physical RDM-backed virtual devices to use virtual RDMs instead.
CD-ROM or floppy virtual devices backed by a physical or remote device.	Remove the CD-ROM or floppy virtual device or reconfigure the backing with an ISO installed on shared storage.
USB and sound devices.	Remove these devices from the virtual machine.
N_Port ID Virtualization (NPIV).	Deactivate the NPIV configuration of the virtual machine.
NIC passthrough.	This feature is not supported by FT so it must be turned off.
Hot-plugging devices.	The hot plug feature is automatically deactivated for FT virtual machines. To hot plug devices (either adding or removing), you must momentarily turn off FT, perform the hot plug, and then turn on FT. Note: When using FT, changing the settings of a virtual network card while a virtual machine is running is a hot-plug operation, since it requires “unplugging” the network card and then “plugging” it in again. For example, with a virtual network card for a running virtual machine, if you change the network that the virtual NIC is connected to, FT must be turned off first.
Serial or parallel ports	Remove these devices from the virtual machine.
Video devices that have 3D activated.	FT does not support video devices that have 3D activated.
Virtual Machine Communication Interface (VMCI)	Not supported by FT.
2TB+ VMDK	FT is not supported with a 2TB+ VMDK.
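The table above can be turned into a simple lookup that lists what has to be fixed on a VM before FT is enabled. This is only a sketch: the short device keys are made up for the example, the actions are paraphrased from the table, and in practice you would fill the device list from your own inventory.

```python
from typing import Dict, List

# Keys are hypothetical labels; actions are paraphrased from the table above.
CORRECTIVE_ACTIONS: Dict[str, str] = {
    "physical_rdm": "Reconfigure to use a virtual RDM instead.",
    "physical_cdrom_or_floppy": "Remove the device or back it with an ISO on shared storage.",
    "usb": "Remove the device from the virtual machine.",
    "sound": "Remove the device from the virtual machine.",
    "npiv": "Deactivate the NPIV configuration of the virtual machine.",
    "nic_passthrough": "Turn off NIC passthrough.",
    "serial_port": "Remove the device from the virtual machine.",
    "parallel_port": "Remove the device from the virtual machine.",
    "video_3d": "Not supported by FT.",
    "vmci": "Not supported by FT.",
}


def ft_device_report(devices: List[str]) -> Dict[str, str]:
    """Map each FT-incompatible device on the VM to its corrective action."""
    return {d: CORRECTIVE_ACTIONS[d] for d in devices if d in CORRECTIVE_ACTIONS}


if __name__ == "__main__":
    vm_devices = ["vmxnet3", "usb", "physical_rdm"]  # example inventory
    for device, action in ft_device_report(vm_devices).items():
        print(f"{device}: {action}")
```

Devices that are not in the lookup, such as the vmxnet3 adapter in the example, are simply ignored by this sketch.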

Fault Tolerance Use Cases

This feature helps you to keep your critical services operative during failures. Several typical situations can benefit from the use of vSphere Fault Tolerance.

FT provides a higher level of business continuity than vSphere HA. When a Secondary VM is called upon to replace its Primary VM counterpart, the Secondary VM immediately takes over the Primary VM’s role with the entire state of the virtual machine preserved. Applications are already running, and data stored in memory does not need to be reentered or reloaded. Failover provided by vSphere HA restarts the virtual machines affected by a failure.

This higher level of continuity and the added protection of state information and data informs the scenarios when you might want to deploy FT.

Applications which must always be available, especially applications that have long-lasting client connections that users want to maintain during hardware failure.
Custom applications that have no other way of doing clustering.
Cases where high availability might be provided through custom clustering solutions, which are too complicated to configure and maintain.

Another key use case for protecting a virtual machine with FT can be described as On-Demand FT. In this case, a virtual machine is adequately protected with vSphere HA during normal operation. During certain critical periods, you might want to enhance the protection of the virtual machine. For example, you might be running a quarter-end report which, if interrupted, might delay the availability of critical information.

With vSphere FT, you can protect this virtual machine before running this report and then turn off or suspend FT after the report has been produced. You can use On-Demand FT to protect the virtual machine during a critical time period and return the resources to normal during non-critical operation.
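If you script On-Demand FT, the pattern is "turn on, run the job, always turn off". A minimal Python sketch of that pattern is below. The turn_on and turn_off callables are placeholders for whatever you use to drive vSphere; no real vSphere call is shown here.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def on_demand_ft(turn_on: Callable[[], None],
                 turn_off: Callable[[], None]) -> Iterator[None]:
    """Keep FT on only while the wrapped block runs."""
    turn_on()
    try:
        yield
    finally:
        # Runs even if the critical job raises, so resources are returned.
        turn_off()


if __name__ == "__main__":
    events = []
    with on_demand_ft(lambda: events.append("FT on"),
                      lambda: events.append("FT off")):
        events.append("quarter-end report")
    print(events)  # ['FT on', 'quarter-end report', 'FT off']
```

The try/finally is the important part: if the report fails halfway, FT is still turned off and the cluster gets its resources back.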

Most Common Fault Tolerance Problems

FT protects services against failure and data loss, but it does not cover some failures, and some of them even affect its performance. So, you should consider these specific failures if you want to run fault tolerant virtual machines without problems.

Fault Tolerance Metadata Datastore

The Fault Tolerance metadata datastore must be accessible to both the primary and secondary VMs in order for FT to function properly. If the datastore becomes unavailable, FT will terminate and the VMs will be unavailable.

The Fault Tolerance metadata datastore is typically placed on a highly available storage system, such as a SAN or NAS. This ensures that the datastore will be available even if a single storage device fails.

Here are some of the benefits of using a Fault Tolerance metadata datastore:

It provides a central location for storing information about FT-enabled VMs.
It ensures that the information about FT-enabled VMs is always available, even if a single storage device fails.
It simplifies the management of FT-enabled VMs.

Here are some of the risks related to the Fault Tolerance metadata datastore:

If the datastore becomes unavailable, FT will terminate and the VMs will be unavailable.
It can be more difficult to manage FT-enabled VMs if the information about them is not stored in a central location.
There is a risk of data loss if the datastore is not properly backed up.

Change Fault Tolerance Metadata Datastore

To change the FT metadata datastore, you will need to use the vSphere Web Client or the vSphere CLI.

Using the vSphere Web Client

Log in to the vSphere Web Client.
Click on the Hosts and Clusters view.
Select the host that is hosting the FT-enabled VM.
Click on the Configuration tab.
Click on the Datastores section.
Select the datastore that you want to use as the new FT metadata datastore.
Click on the Edit button.
In the Edit Datastore dialog box, select the Use as Fault Tolerance Metadata Datastore check box.
Click on the OK button.

Once you have changed the FT metadata datastore, you will need to restart the primary and secondary VMs for the changes to take effect.

Here are some things to keep in mind when changing the FT metadata datastore:

The new datastore must be accessible to both the primary and secondary VMs.
The new datastore must be compatible with FT.
You must restart the primary and secondary VMs for the changes to take effect.

Failure on More Than One Host

FT keeps the primary and secondary virtual machines on separate ESXi hosts, but what if both ESXi servers crash at the same time, the exact same time?

In fact, FT won’t help you in this situation or in similar ones. Because there is no third protected virtual machine, you will face service downtime even with fault tolerant virtual machines.

Secondary VM Affects Primary VM Performance!

If a Primary VM appears to be executing slowly, even though its host is lightly loaded and retains idle CPU time, check the host where the Secondary VM is running to see if it is heavily loaded.

When a Secondary VM resides on a host that is heavily loaded, the Secondary VM can affect the performance of the Primary VM.

A Secondary VM running on a host that is overcommitted (for example, with its CPU resources) might not get the same amount of resources as the Primary VM. When this occurs, the Primary VM must slow down to allow the Secondary VM to keep up, effectively reducing its execution speed to the slower speed of the Secondary VM.

If the Secondary VM is on an overcommitted host, you can move the VM to another location without resource contention problems. Or more specifically, do the following:

For FT networking contention, use vMotion technology to move the Secondary VM to a host with fewer FT VMs contending on the FT network. Verify that the quality of the storage access to the VM is not asymmetric.
For storage contention problems, turn FT off and on again. When you recreate the Secondary VM, change its datastore to a location with less resource contention and better performance potential.
To resolve a CPU resources problem, set an explicit CPU reservation for the Primary VM at an MHz value sufficient to run its workload at the desired performance level. This reservation is applied to both the Primary and Secondary VMs, ensuring that both VMs can execute at a specified rate. For guidance in setting this reservation, view the performance graphs of the virtual machine (before Fault Tolerance was enabled) to see how many CPU resources it used under normal conditions.
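For the CPU reservation, one way to turn the performance graphs into a number is to export the CPU usage samples (in MHz) from before FT was enabled and pick a high percentile plus some headroom. The sketch below does that. The 95th percentile and the 10% headroom are my assumptions, not VMware guidance, and the function name is made up.

```python
from typing import Sequence


def suggest_cpu_reservation_mhz(samples_mhz: Sequence[int],
                                percentile: int = 95,
                                headroom_percent: int = 10) -> int:
    """Suggest a reservation from pre-FT CPU usage samples (whole MHz)."""
    if not samples_mhz:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mhz)
    # Nearest-rank percentile: the smallest sample that covers
    # `percentile` percent of all samples (integer ceiling division).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    base = ordered[rank - 1]
    # Add headroom and round up to a whole MHz.
    return (base * (100 + headroom_percent) + 99) // 100


if __name__ == "__main__":
    usage = [1000] * 19 + [3000]  # mostly steady, with one spike
    print(suggest_cpu_reservation_mhz(usage))
    print(suggest_cpu_reservation_mhz(usage, percentile=100))
```

A high percentile ignores rare spikes, while percentile=100 reserves for the worst sample seen. Which one is right depends on how much slowdown the service can tolerate.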

Managing Multiple FT VMs in Cluster

You should consider many factors before enabling FT for multiple virtual machines in a cluster that has limited resources. Other services can affect the performance of FT virtual machines.

So, it’s recommended to keep FT virtual machines limited to a small group of mission-critical services.

The problems below can affect FT virtual machines, and preventing or solving them is complicated in clusters with limited resources:

Partial Hardware Failure Related to Storage
Partial Hardware Failure Related to Network
Insufficient Bandwidth on the Logging NIC Network
vMotion Failures Due to Virtual Machine Activity Level
Too Much Activity on VMFS Volume Can Lead to Virtual Machine Failovers
Lack of File System Space Prevents Secondary VM Startup

Most of these problems terminate FT, and you will lose FT protection on the primary, the secondary, or both virtual machines.

Virtual Machine Internal Problems

VMware FT protects services against hardware failure, but it does not protect them against internal failures. So, if the virtual machine’s guest OS fails because of human error or anything else, both the primary and secondary virtual machines are affected.

When Should You Use FT?

You should use FT when you are sure that the service has no native solution for service protection and data loss prevention.

You should also be ready to prepare the infrastructure for Fault Tolerance by following best practices.

When Should You Not Use FT?

You should not use FT in the below situations:

Your services need more resources than FT supports for a virtual machine.
You need to use a vSphere feature that is not supported by FT.
You want to use VMware vSAN and other storage providers together.
Your service has a native solution for high availability or supports clustering features such as Windows Server Failover Clustering.
The service is growing and needs more resources, or you need load balancing across virtual servers.

Using VMware FT and Clustering Together

I don’t recommend using VMware FT together with clustering features or software provided by the operating system or by third-party applications.

Implementing clustering requires some vSphere features that may not be available for FT VMs.

Combining clustering and FT is a bad idea, because you need more resources and get no actual benefit from doing so.

Conclusion

VMware Fault Tolerance is a unique feature that may not be available on other virtualization platforms, but you should not use it for every situation or every service.

When you plan to implement FT and protect services with it, you should consider the best practices, requirements and limitations.

If the service supports clustering, I highly recommend using clustering features or software instead of FT.

Server clustering is more flexible for scaling services, and if the service needs more resources, expansion is much easier.

The highest level of availability can be achieved with server clustering. Servers can also be clustered across multiple sites, and load balancing can use resources much more efficiently compared to FT.

Whether to use VMware Fault Tolerance depends on you and your requirements, but use any vSphere feature wisely.