Understanding Adaptive Resync in VMware vSAN 6.7

Introduction to Adaptive Resync

Consistent performance and data resiliency are two key tenets of an enterprise storage solution. In case of a host or disk failure, some components of a virtual machine may become non-compliant with their storage policy because of the missing data. To bring the impacted virtual machine back into compliance with its policy, vSAN resynchronizes the data onto hosts or drives that have sufficient resources available.

Resync operations aim to finish creating the missing components as soon as possible. They consume I/O on the disk drives where vSAN is creating the missing components of the impacted virtual machine. The longer the resync takes, the longer you are at risk. If background operations such as resync and rebalancing consume all the I/O, application performance will suffer. On the other hand, if application servers consume all the I/O, the backend cannot safely maintain availability. Some of the reasons for resynchronization include:

Object policy changes
Host or disk group evacuations
Host upgrades (Hypervisor, on-disk format)
Object or component rebalancing
Object or component repairs

In earlier releases, there was a manual throttling mechanism to handle these kinds of situations. In vSAN 6.6, a throttling mechanism was introduced in the UI that allowed a user to define a static limit on resync I/O. It required manual intervention and knowledge of performance metrics, and it had limited ability to control I/O types at specific points in the storage stack.

With vSAN 6.7, VMware introduced a new method to balance the use of resources during background activities such as resync and rebalancing. vSAN 6.7 distinguishes four types of I/O and has a pending queue for each I/O class.

vSAN employs a sophisticated, highly adaptive congestion control scheme to manage I/O from one or more resources. vSAN 6.7 has two distinct types of congestion to help regulate I/O, improving upon the single congestion type found in vSAN 6.6 and earlier.

Bandwidth congestion. This type of congestion can come from the feedback loop in the “bandwidth regulator”, and is used to tell the vSAN layer on the host that manages vSAN components the speed at which to process I/O.

Backpressure congestion. This type of congestion can come as the result of the pending queues for the various I/O classes filling to capacity. Backpressure congestion is visible in the UI by highlighting the cluster, clicking Monitor > vSAN > Performance, and selecting the “VM” category.
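To make the idea of per-class pending queues and backpressure more concrete, here is a minimal Python sketch. It is a toy model, not vSAN's implementation: the `PendingQueues` class, its `capacity` parameter, and the `METADATA` and `NAMESPACE` class names are assumptions made for illustration (the text above only names VM and resync I/O explicitly).

```python
from collections import deque
from enum import Enum


class IOClass(Enum):
    # VM and RESYNC are named in the article; the other two names are assumed.
    VM = "vm"
    RESYNC = "resync"
    METADATA = "metadata"
    NAMESPACE = "namespace"


class PendingQueues:
    """Hypothetical model: one bounded pending queue per I/O class."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queues = {io_class: deque() for io_class in IOClass}

    def submit(self, io_class, request):
        """Queue a request; return False (backpressure) if that class's queue is full."""
        queue = self.queues[io_class]
        if len(queue) >= self.capacity:
            return False
        queue.append(request)
        return True

    def backpressure(self, io_class):
        """Fill level of one queue as a fraction between 0.0 and 1.0."""
        return len(self.queues[io_class]) / self.capacity


queues = PendingQueues(capacity=2)
queues.submit(IOClass.RESYNC, "r1")
queues.submit(IOClass.RESYNC, "r2")
resync_accepted = queues.submit(IOClass.RESYNC, "r3")  # resync queue is already full
vm_accepted = queues.submit(IOClass.VM, "w1")          # VM queue is independent
```

Because each class has its own queue, a full resync queue pushes back only on resync I/O; in this sketch the VM write is still accepted.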

The benefit of this optimized congestion control method is the ability to better isolate the impact of congestion and improve resource utilization. The dispatch/fairness scheduler is at the heart of vSAN's ability to manage and regulate I/O based on the conditions of the environment. The separate queues for the I/O classes allow vSAN to prioritize new incoming I/O that may have an inherently higher priority than existing I/O waiting to be processed.

If virtual machine latency reaches the high watermark, vSAN cuts the bandwidth of background operations in half. vSAN then checks again, and if virtual machine latency is still above the high watermark, it halves the bandwidth for background operations once more. When latency falls below the low watermark, vSAN gradually increases the bandwidth of resync traffic until the low watermark is reached, and then holds it at that level.

When resync and VM I/O activity is occurring, and the aggregate bandwidth of the I/O classes exceeds the advertised bandwidth, resync I/O is assigned no less than approximately 20% of the bandwidth, leaving approximately 80% for VM I/O.
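The watermark behavior described above can be sketched as a simple feedback step. This is only an illustration of the halve-then-ramp idea: the function name `adjust_resync_bandwidth`, the latency thresholds, the `step` size, and the `ceiling` are all hypothetical values chosen for the example, since the article does not give vSAN's actual numbers.

```python
def adjust_resync_bandwidth(current, vm_latency, high_watermark, low_watermark,
                            step, ceiling):
    """Return the next resync bandwidth given the latest VM latency sample.

    All thresholds and units are illustrative assumptions, not vSAN defaults.
    """
    if vm_latency >= high_watermark:
        # VM latency is too high: cut background bandwidth in half.
        return current / 2
    if vm_latency < low_watermark:
        # VM latency is comfortable: ramp resync bandwidth up gradually.
        return min(ceiling, current + step)
    # Between the watermarks: hold the current level.
    return current


bandwidth = 100.0
history = []
for latency in [12, 11, 7, 3, 3]:  # made-up VM latency samples
    bandwidth = adjust_resync_bandwidth(
        bandwidth, latency,
        high_watermark=10, low_watermark=5, step=10.0, ceiling=100.0,
    )
    history.append(bandwidth)

print(history)  # [50.0, 25.0, 25.0, 35.0, 45.0]
```

The two samples above the high watermark halve the bandwidth twice, the sample between the watermarks leaves it unchanged, and the two low-latency samples raise it one step at a time.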

Conclusion

Adaptive Resync in vSAN 6.7 introduces a fully intelligent, adaptive flow control mechanism for managing resync I/O and VM I/O. If no resync activity is occurring, VM I/O can consume up to 100% of the available bandwidth. If the combined resync and VM I/O is below the advertised available bandwidth, neither I/O class is throttled. If the aggregate bandwidth of resync and VM I/O exceeds the advertised bandwidth, resync I/O is assigned no less than approximately 20% of the bandwidth, leaving approximately 80% for VM I/O.
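The three cases above can be expressed as a small allocation function. This is a sketch under stated assumptions rather than vSAN's scheduler: `allocate_bandwidth`, its parameters, and the choice to let resync use more than its 20% share when VM demand leaves room are illustrative, and the bandwidth figures are made up.

```python
def allocate_bandwidth(vm_demand, resync_demand, advertised, resync_floor_pct=20):
    """Split the advertised bandwidth between VM and resync I/O.

    Returns (vm_share, resync_share). Hypothetical model of the ~80/20 rule.
    """
    if vm_demand + resync_demand <= advertised:
        # No contention: neither I/O class is throttled.
        return vm_demand, resync_demand

    # Contention: resync is guaranteed at least its floor, and may use
    # whatever VM I/O leaves unused beyond that.
    guaranteed = advertised * resync_floor_pct / 100
    resync_share = min(resync_demand, max(guaranteed, advertised - vm_demand))
    vm_share = min(vm_demand, advertised - resync_share)
    return vm_share, resync_share


no_resync = allocate_bandwidth(1500, 0, 1000)     # VM I/O takes all of it
no_contention = allocate_bandwidth(300, 400, 1000)
contention = allocate_bandwidth(900, 600, 1000)   # falls back to the 80/20 split
```

With an advertised bandwidth of 1000 units, the first call gives VM I/O the full 1000, the second leaves both demands untouched, and the third yields 800 for VM I/O and 200 for resync.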