How to Avoid PSOD - VMware Purple Screen of Death
The Purple Screen of Death (PSOD) is a fatal crash of a VMware ESX/ESXi host that kills all virtual machines running on it. In highly virtualized data centers that run microservices and technologies like Docker, a single PSOD can terminate tens, hundreds or even thousands of services. Some PSODs are hardware-related, but most result from a combination of problematic drivers, BIOS issues and software bugs (assert errors). Proactively detecting and resolving these issues goes a long way toward preventing PSOD outages.
What is Purple Screen of Death - PSOD
A Purple Screen of Death (PSOD) is a diagnostic screen with white type on a purple background. The term is a play on the Blue Screen of Death, the informal name users gave to the Windows stop error screen. Typically, the PSOD shows the memory state at the time of the crash along with other information: the ESXi version and build, the exception type, a register dump, what was running on each CPU at the time of the crash, a backtrace, the server uptime, error messages and core dump information.
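Because the screen starts with the ESXi version and build, that line is often the first thing to record when searching for a matching known issue. The following is a minimal Python sketch of pulling those two values out of captured PSOD text. The header format and the sample text are assumptions based on commonly seen PSOD screens, and `parse_psod_header` is a hypothetical helper, not part of any VMware or Runecast tool.

```python
import re

# Assumed header format, e.g. "VMware ESXi 6.5.0 [Releasebuild-4564106 x86_64]".
HEADER_RE = re.compile(
    r"VMware ESXi? (?P<version>\d+(?:\.\d+)+) \[Releasebuild-(?P<build>\d+)"
)


def parse_psod_header(text):
    """Return (version, build) from the first matching line, or None."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("version"), int(match.group("build"))


# Illustrative sample only; real screens contain many more lines.
sample = (
    "VMware ESXi 6.5.0 [Releasebuild-4564106 x86_64]\n"
    "#PF Exception 14 in world 68037:vmx"
)
print(parse_psod_header(sample))
```

The version and build can then be compared against the release notes and knowledge base articles for that release.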
What to do about it
Prevent:
- Make sure patches for vCenter and ESXi are applied
- Keep drivers and firmware up to date
- Check that your hardware is on the VMware Hardware Compatibility List (HCL)
- Use Runecast Analyzer to scan systems for known bugs, driver issues or configurations that have led to PSODs
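The idea behind checking drivers against known issues can be sketched in a few lines of Python. This is an illustration of the principle only, not how Runecast Analyzer works internally. The driver names, versions and the `KNOWN_BAD` set are all hypothetical and do not refer to real advisories.

```python
# Hypothetical data: these names and versions do not refer to real advisories.
KNOWN_BAD = {
    ("example_nic", "1.0.0"),
    ("example_hba", "2.3.1"),
}


def flag_risky_drivers(installed):
    """Return sorted (driver, version) pairs that appear in KNOWN_BAD."""
    return sorted(pair for pair in installed.items() if pair in KNOWN_BAD)


# Hypothetical inventory of one host: driver name -> installed version.
host_drivers = {"example_nic": "1.0.0", "example_hba": "2.4.0"}
print(flag_risky_drivers(host_drivers))
```

In practice the hard part is keeping the list of known-bad combinations current, which is why automating the check matters.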
Be prepared:
- Leverage vSphere HA / FT
- Configure dump locations for troubleshooting
- Have remote console access to the ESXi host (iLO, iDRAC, IMM)
- Configure ESXi to restart after PSOD
- Know your enemy! Research which scenarios have led to PSODs by reviewing logs and syslogs, for example with Runecast Log Analysis
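The preparedness items above can also be tracked per host as a simple checklist. The Python sketch below assumes you already collect these settings somewhere. The key names in `CHECKS` are hypothetical and need to be mapped to however your environment records host configuration.

```python
# Hypothetical setting names; map them to your own inventory data.
CHECKS = {
    "ha_enabled": "Enable vSphere HA / FT for the cluster",
    "coredump_configured": "Configure a core dump location",
    "remote_console": "Set up a remote console (iLO, iDRAC, IMM)",
    "auto_restart": "Configure ESXi to restart after a PSOD",
}


def missing_preparations(host):
    """Return the advice for every check the host does not satisfy."""
    return [advice for key, advice in CHECKS.items() if not host.get(key, False)]


# Hypothetical host record with two of the four items in place.
host = {"ha_enabled": True, "coredump_configured": True}
for advice in missing_preparations(host):
    print(advice)
```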
Being proactive in this way will help you avoid critical PSOD-related service outages. Runecast Analyzer was designed to minimize and even completely eliminate PSOD crashes of ESXi hosts. Many of the root causes behind PSODs are hard to detect manually because they typically involve a combination of several factors, not just a single software problem. Automating this process is the most viable way to keep your datacentres as reliable as possible.
Below is an example of Runecast Analyzer detecting a PSOD problem.
If you are interested in hearing more about PSOD errors, our CTO and VMware VCAP-DCD Aylin Sali has recorded a short webinar dedicated to the topic.