More microcode madness – VMware recommends NOT patching

Posted By Jon on Jan 14, 2018 | 4 comments

Oy vey.

As a continuation of the Spectre/Meltdown situation, VMware had pushed out an update identified as ESXi650-201801402-BG for 6.5 (ESXi600-201801402-BG for 6.0 and ESXi550-201801401-BG for 5.5) that included a VIB, “VMware_bootbank_cpu-microcode_6.5.0-1.38.7526125”, intended to address the speculative execution issues.

It turns out that, per VMware KB 52345, “it has been recommended that VMware remove exposure of the speculative-execution mechanism to virtual machines on ESXi hosts using the affected Intel processors until Intel provides new microcode at a later date.” Oh… why? Not sure…

But here’s something interesting. Yesterday I posted a benchmark of a ZFS VM with and without this microcode patch, running on a Dell R620 with E5-2640 v1 CPUs, to gauge the performance impact on a storage solution running the “fixed” microcode. The benchmark ran fine with and without the microcode applied, and I went on my way, leaving my R620 running the microcode and my Supermicro X9DRI-LN4F+ system running without it.

I had shut down the R620 since I was done testing, only to come back later wanting to use it again. I powered the host up and watched ESXi boot, but it didn’t reconnect to vCenter. It can take a bit sometimes, so I wasn’t worried about it. After about 5 minutes I logged into the iDRAC to find that the system was booting up and was just at POST. Wait… I already booted the system up…

Turns out the machine experienced some fatal errors involving the PCI Express bus and power cycled, as recorded in the iDRAC system log.

Sigh. The entry on Jan 13 at 23:02:28 was the time that the power cycled. The subsequent 4 errors were me trying the host again later (one boot yielded the 4 errors).

The only thing that changed was the Intel Microcode update via VMware.

Now, that’s not to say I didn’t hit some sort of super coincidental issue, but I had just finished some decently strenuous benchmarking and everything was fine, and this system has been rock solid for about a year. I looked into it some and found that my LSI 9211-8i is in slot 1, which is the slot the iDRAC system logs seem to be pointing to as having an issue. Again, nothing was moved or bumped or anything; the server has been in the rack and never touched. I cleared the logs, then removed power from the system for 5 minutes and tried again. So far so good: the system has been up for 18 to 20 hours now without issue.

I am not sure if this is coincidental or not, but I’ve heard some reports of issues with the new microcode (though so far they’ve mostly been on Haswell and Broadwell machines). For now, you can bet I’m not applying microcode updates in production until this is all ironed out.

What a whirlwind! Thanks for reading!

4 Comments

none

January 14, 2018

So far the microcode updates seem to be only for gen4 and newer CPUs. Sandy Bridge came out in 2011, so Intel might not even release updates for it.


Jon
January 15, 2018

Yep, that’s what it’s all saying. It’s just extremely odd that I had issues on literally the first reboot after applying the microcode VIB… eek!

January 14, 2018

That’s pretty interesting. Thanks for sharing. I wonder if something was cached and a cold boot cleared it? I’m not a big fan of “coincidence”.


Jon
January 14, 2018

CPU cache is volatile, so it shouldn’t have mattered. It is somewhat concerning… I don’t like coincidence either!