Degraded Device Handling in VMware VSAN

As the name suggests, Degraded Device Handling (DDH), also called Dying Disk Handling, is an unhealthy-drive detection method that helps VMware VSAN customers avoid cluster performance degradation caused by an unhealthy drive. A drive that is part of VSAN may not have failed completely, yet it can behave inconsistently and generate a lot of IO retries and errors. So how do we deal with such a situation?

With VSAN 6.1, VMware introduced Degraded Device Handling (DDH), in which vSAN itself monitors drives for excessive read or write latency. If vSAN observed that the average latency of a drive was higher than 50 ms over a 10-minute period, it dismounted that drive. Once the drive is unmounted, the components on it are marked as absent, and rebuilding of those components starts after 60 minutes. Dismounting a drive based only on the last 10 minutes of data led to a number of challenges, because the drive might be reporting higher average latency only temporarily.

To overcome the issues caused by false positives from drives temporarily reporting higher average latencies, VMware made a number of enhancements to the VMware VSAN unhealthy drive detection method in later releases:

Average latency is tracked over multiple, randomly selected 10-minute intervals, not just the last 10 minutes. A drive is marked unhealthy only when its average write IO round-trip latency exceeds the configured threshold four times within the last six hours.
High read latency does not cause cache or capacity devices to be dismounted.
High write latency does not cause cache devices to be dismounted.
DDH dismounts only capacity devices with high write latency.
The latency threshold is set to 500 ms for a magnetic disk and 200 ms for an SSD.
DDH tries to remount a drive it has unmounted approximately 24 times over a 24-hour period.
DDH does not unmount a drive that holds the last remaining copy of the data. In that case, DDH starts evacuating data from the device immediately. This is in contrast to waiting for the vSAN CLOM Rebuild Timer (60 minutes by default) to expire before rebuilding copies of “absent” components.
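
The rules above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this post, not vSAN's actual implementation. The names DriveMonitor, record_interval and ddh_action are hypothetical, and the caller is assumed to supply the average write latency for each sampled 10-minute interval.

```python
from collections import deque
from dataclasses import dataclass, field

SIX_HOURS_S = 6 * 60 * 60
# Write-latency thresholds from the list above (milliseconds).
WRITE_LATENCY_THRESHOLD_MS = {"hdd": 500.0, "ssd": 200.0}
EXCEEDANCES_REQUIRED = 4


@dataclass
class DriveMonitor:
    """Hypothetical per-drive tracker; not a real vSAN data structure."""
    media: str  # "hdd" or "ssd"
    tier: str   # "cache" or "capacity"
    exceed_times: deque = field(default_factory=deque)

    def record_interval(self, now_s: float, avg_write_latency_ms: float) -> bool:
        """Record one sampled 10-minute interval.

        Returns True when the threshold has been exceeded at least
        four times within the last six hours.
        """
        if avg_write_latency_ms > WRITE_LATENCY_THRESHOLD_MS[self.media]:
            self.exceed_times.append(now_s)
        # Forget exceedances that fell out of the six-hour window.
        while self.exceed_times and now_s - self.exceed_times[0] > SIX_HOURS_S:
            self.exceed_times.popleft()
        return len(self.exceed_times) >= EXCEEDANCES_REQUIRED


def ddh_action(monitor: DriveMonitor, unhealthy: bool, holds_last_copy: bool) -> str:
    """Map the rules from the list to one of: none, evacuate, unmount."""
    if not unhealthy:
        return "none"
    if monitor.tier == "cache":
        return "none"      # cache devices are not dismounted
    if holds_last_copy:
        return "evacuate"  # start evacuation immediately, no unmount
    return "unmount"       # capacity device with high write latency


# Usage: an SSD capacity drive sampled once an hour at 250 ms.
drive = DriveMonitor(media="ssd", tier="capacity")
flags = [drive.record_interval(hour * 3600, 250.0) for hour in range(4)]
print(flags, ddh_action(drive, flags[-1], holds_last_copy=False))
```

Read latency is deliberately absent from the sketch, since high read latency never leads to a dismount under these rules.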

Once the VMware VSAN unhealthy drive detection method detects an unhealthy disk, it logs key disk SMART attributes for monitoring and detecting errors on that disk. The SMART attributes listed below give an idea of why the device was inconsistent and why DDH chose to unmount it.

Re-allocated sector count. Attribute ID = 0x05.
Uncorrectable errors count. Attribute ID = 0xBB.
Command timeouts count. Attribute ID = 0xBC.
Re-allocated sector event count. Attribute ID = 0xC4.
Pending re-allocated sector count. Attribute ID = 0xC5.
Uncorrectable sector count. Attribute ID = 0xC6.
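
When reading these attributes from a log or a monitoring tool, a small lookup table helps, because some tools print the IDs in decimal rather than hex. The snippet below is a minimal sketch: the attribute names and IDs come from the list above, while the describe helper and its input format (a mapping of attribute ID to raw value) are assumptions made for illustration.

```python
from typing import Dict, List

# SMART attributes logged by DDH, keyed by attribute ID.
SMART_ATTRIBUTES: Dict[int, str] = {
    0x05: "Re-allocated sector count",
    0xBB: "Uncorrectable errors count",
    0xBC: "Command timeouts count",
    0xC4: "Re-allocated sector event count",
    0xC5: "Pending re-allocated sector count",
    0xC6: "Uncorrectable sector count",
}


def describe(raw_values: Dict[int, int]) -> List[str]:
    """Format the attributes DDH cares about; other IDs are ignored."""
    lines = []
    for attr_id, name in SMART_ATTRIBUTES.items():
        if attr_id in raw_values:
            lines.append(f"0x{attr_id:02X} ({attr_id}) {name}: {raw_values[attr_id]}")
    return lines


# Usage with made-up raw values, keyed by decimal ID.
report = describe({5: 12, 197: 3, 9: 40000})
for line in report:
    print(line)
```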

We have seen a few cases where a drive failed without any warning. Predicting device failure and proactively evacuating data from a degraded device enhances the resilience of a vSAN datastore.
