As a follow-up to the Spectre/Meltdown situation, VMware pushed out an update identified as ESXi650-201801402-BG for 6.5 (ESXi600-201801402-BG for 6.0 and ESXi550-201801401-BG for 5.5) that included the VIB “VMware_bootbank_cpu-microcode_6.5.0-1.38.7526125”, which was meant to address the speculative execution issues.
It turns out that, according to VMware KB 52345, “it has been recommended that VMware remove exposure of the speculative-execution mechanism to virtual machines on ESXi hosts using the affected Intel processors until Intel provides new microcode at a later date.” Oh… why? Not sure…
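If you keep a list of the VIB IDs installed on each host, a small helper can flag which hosts are carrying the microcode VIB. This is only a rough sketch: the `<vendor>_<bank>_<name>_<version>` layout is my assumption based on the one ID quoted above, and `parse_vib_id` and `is_microcode_vib` are hypothetical names, not anything VMware ships.

```python
from typing import NamedTuple


class VibId(NamedTuple):
    vendor: str
    bank: str
    name: str
    version: str


def parse_vib_id(vib_id: str) -> VibId:
    """Split a VIB ID into its parts.

    Assumed layout (inferred from a single example, not from VMware docs):
    <vendor>_<bank>_<name>_<version>
    The name may contain hyphens; the version is assumed to contain no underscores.
    """
    # Peel vendor and bank off the front, then the version off the back.
    vendor, bank, rest = vib_id.split("_", 2)
    name, version = rest.rsplit("_", 1)
    return VibId(vendor, bank, name, version)


def is_microcode_vib(vib_id: str) -> bool:
    """True if the VIB ID names the cpu-microcode package."""
    return parse_vib_id(vib_id).name == "cpu-microcode"


if __name__ == "__main__":
    vib = "VMware_bootbank_cpu-microcode_6.5.0-1.38.7526125"
    print(parse_vib_id(vib))
    print(is_microcode_vib(vib))
```

Run against the ID from the bulletin, this separates the package name (`cpu-microcode`) from the version string (`6.5.0-1.38.7526125`), which is enough to tell patched and unpatched hosts apart in an inventory.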
But here’s something interesting. Yesterday I posted a benchmark of a ZFS VM with and without this microcode patch, running on a Dell R620 with E5-2640 v1 CPUs, in order to gauge the performance impact on a storage solution running the “fixed” microcode. The benchmark ran fine with and without the microcode applied, and I went on my way, leaving my R620 running the microcode and my Supermicro X9DRI-LN4F+ system running without it.
I had shut down the R620 since I was done testing, only to come back later wanting to use it again. I powered the host up and watched ESXi boot, but it didn’t reconnect to vCenter. It can take a bit sometimes, so I wasn’t worried about it. After about 5 minutes I logged into the iDRAC to find that the system was booting up and was just at POST. Wait… I already booted the system up…
Turns out the machine experienced some fatal errors involving the PCI Express bus and power cycled:
Sigh. The entry on Jan 13 at 23:02:28 marks when the system power cycled. The subsequent 4 errors were from me trying the host again later (one boot yielded all 4 errors).
The only thing that changed was the Intel Microcode update via VMware.
Now, that’s not to say I didn’t experience some sort of super coincidental issue, but I had just finished some decently strenuous benchmarking and everything was fine, and this system has been rock solid for about a year. I looked into it some and found that my LSI 9211-8i is in slot 1, which is the slot the iDRAC system logs seem to be pointing to as having an issue. Again, nothing was moved or bumped or anything. The server has been in the rack and never touched. I cleared the logs, then removed power from the system for 5 minutes and tried again. So far so good: the system has been up for 18 to 20 hours now without issue.
I am not sure if this is coincidental or not, but I’ve heard some reports of issues with the new microcode (though those have mostly involved Haswell and Broadwell machines). For now, you can bet I’m not applying microcode updates in production until this is all ironed out.
What a whirlwind! Thanks for reading!