After going through various “Signoff tests” with a customer today I came across an interesting “Gotcha”. I was running through the availability/fault tolerance tests when we came to simulating a switch failure and confirming that network and storage (iSCSI) were unaffected and virtual machines continued to run.
The network configuration consisted of 2x 3Com SuperStack 5500 switches configured as a stack. I haven’t had a lot of 3Com exposure but they seem to do the job, and being stackable allows for redundant switches and the ability to configure link aggregation between the two switches and the ESX hosts. I normally work with ProCurve, where this configuration is not possible, so it was good to try something new.
So anyway I pulled the power on one of the switches and, after a few lost pings and a bit of VM unresponsiveness while ESX did its NMP magic, everything returned to normal. So it was yet another tick in the box for that test :)
The “Gotcha” came when we plugged the switch back in. As the previously powered-off switch began to boot, everything started going pear shaped. VMs began shutting down and the hosts reported HA errors.
What had happened is that as the second switch booted and joined the stack, the remaining switch went offline while the stack re-organised itself and reassigned unit IDs. This resulted in each host being unable to ping either the gateway or the secondary isolation address, and beginning the default “power off VMs when isolated” routine.
There are two options to overcome this issue of VMs shutting down when this happens:
- Increase the das.failuredetectioninterval advanced setting to something longer than the time it takes the switches to sort themselves out (60 seconds ought to do it). This fixes the issue but will cause a longer outage if anything really does happen.
- Change the default action for isolation to “Leave Virtual Machines Powered On”. This stops the virtual machines dying, however the server will need to be “Re-enabled for HA” afterwards.
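To make the trade-off between the two options concrete, here is a minimal Python sketch of the isolation decision as I understand it. The function name, the default detection interval and the 45-second stack-rejoin figure are illustrative assumptions of mine, not VMware’s actual implementation (note that the advanced setting itself takes a value in milliseconds, so 60 seconds would be entered as 60000):

```python
# Hypothetical model of the HA host-isolation behaviour described above.
# A host that cannot ping its gateway or secondary isolation address for
# longer than the failure detection interval declares itself isolated and
# applies the configured isolation response.

DEFAULT_DETECTION_INTERVAL_MS = 15_000   # assumed default (~15 seconds)
STACK_REJOIN_OUTAGE_MS = 45_000          # illustrative: time the stack takes to re-form

def isolation_response(outage_ms: int,
                       detection_interval_ms: int = DEFAULT_DETECTION_INTERVAL_MS,
                       isolation_action: str = "power_off") -> str:
    """Return what HA does to the VMs for a connectivity outage of outage_ms."""
    if outage_ms <= detection_interval_ms:
        return "keep_running"            # host never declares isolation
    return isolation_action              # "power_off" or "leave_powered_on"

# With default settings the stack rejoin outage exceeds the detection
# interval, so the VMs get powered off.
assert isolation_response(STACK_REJOIN_OUTAGE_MS) == "power_off"

# Option 1: raise the interval above the rejoin time (60 s = 60000 ms).
assert isolation_response(STACK_REJOIN_OUTAGE_MS,
                          detection_interval_ms=60_000) == "keep_running"

# Option 2: change the isolation response to leave VMs powered on.
assert isolation_response(STACK_REJOIN_OUTAGE_MS,
                          isolation_action="leave_powered_on") == "leave_powered_on"
```

Either way the toy model shows the same thing the test did: with the defaults, any outage longer than the detection interval powers the VMs off.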
Of course this fixes the issue of virtual machines powering down, but not the problem of losing complete connectivity to the VMs for more than 15 seconds. So it was decided with the client that the most appropriate action would be to arrange a maintenance outage before adding a failed or new switch to the stack.