BSOD Analysis Help!

azhang253

Member
Joined
Nov 15, 2024
Posts
5
  • A brief description of your problem (but you can also include the steps you tried)
    • 6-machine cluster - one of the nodes seemed to be hung, causing issues with the cluster.
    • One of the nodes was hung - showed Isolated FCM
    • Login via iDRAC - black screen.
    • To bring the system back up, I shut down the node from iDRAC. The cluster failed over resources automatically and all VMs started on other nodes.
  • System Manufacturer?
    • Dell
  • Laptop or Desktop?
    • Server
  • Exact model number (if laptop, check label on bottom)
    • M640
  • OS ? (Windows 11, 10, 8.1, 8, 7, Vista)
    • Server 2022
  • x86 (32bit) or x64 (64bit)?
    • x64
  • What was original installed OS on system?
    • None
  • Is the OS an OEM version (came pre-installed on system) or full retail version (YOU purchased it from retailer)?
    • No
  • Age of system? (hardware)
    • Unsure, probably 5-6 years?
  • Age of OS installation?
    • 2 years
  • Have you re-installed the OS?
    • Yes, clean install in 2022.
  • CPU
    • Xeon Gold 5217 x2
  • RAM (brand, EXACT model, what slots are you using?)
    • Unsure of exact brand, all 16 slots are used
  • Video Card
    • None/Integrated
  • MotherBoard - (if NOT a laptop)
    • Dell
  • Power Supply - brand & wattage (if laptop, skip this one)
    • Multiple inside Dell M1000E Chassis
  • Is driver verifier enabled or disabled?
    • Unsure
  • What security software are you using? (Firewall, antivirus, antimalware, antispyware, and so forth)
    • SentinelOne
  • Are you using proxy, vpn, ipfilters or similar software?
    • No
  • Are you using Disk Image tools? (like daemon tools, alcohol 52% or 120%, virtual CloneDrive, roxio software)
    • No, but our backup software Rubrik uses VSS
  • Are you currently under/overclocking? Are there overclocking software installed on your system?
    • No
 
Hi!

You should have a memory dump in C:\Windows\MEMORY.DMP; please, upload it.
Read More:
(I'm guessing here) It looks like it was querying information about the storage (physical disk), but the storage wasn't available.
There are a lot of warnings logged by disk, filtermanager and partmgr (several hours, and even days, before the BSOD).
Read More:
One of them is (maybe) more interesting:
Read More:
Then I searched for KB2983588, but the first result was "Event ID 158 for identical disk GUIDs - Windows Client".
While you wait for further opinions, you could try to fix this.
 
Those disk errors are caused by many Hyper-V backup solutions that use VSS (e.g. Rubrik, when it creates a snapshot of a disk to take the backup). According to them, they can be ignored. I've seen them before, and I believe they only occur during my scheduled backup window. They also appear on other Hyper-V hosts that have not BSOD'd. Today's BSOD occurred when no backups were running. That being said, I'm not ruling out Rubrik causing BSODs... after all, Veeam did in the past too, and so did SentinelOne....

How should I upload the memory.dmp file? It's 36GB.
 
Hello, and welcome to the forum.

I'm not experienced at all with server systems, and I've not encountered the two bugchecks in those two dumps before, so I'm going to flag @x BlueRobot and @axe0, who are way more experienced than I am in these areas.

What I can tell you is that the two bugchecks are a 0x9E USER_MODE_HEALTH_MONITOR and a 0x23 FAT_FILE_SYSTEM.

The 0x9E bugcheck indicates that a user-mode process failed a health check; in particular, it failed a WatchdogSourceRhsResourceDeadlockPhysicalDisk check. The failing user-mode process (as shown in that watchdog record) was rhs.exe - the Resource Hosting Subsystem, which manages cluster resources. The failure would appear to have been a deadlock condition at the drive level, perhaps?

The 0x23 bugcheck indicates a problem in a drive using the FAT file system; the drive in question appears to be labelled Device\MPIODisk4, if that helps. These two bugchecks would thus seem to be linked to a drive issue?
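
If it helps when going through more dumps from the other nodes, here is a minimal Python sketch of a lookup for the two bugcheck codes above. The table only holds the two codes from these dumps, and the names `BUGCHECKS` and `describe_bugcheck` are made up for illustration; it is not a full bugcheck reference.

```python
# Only the two bugcheck codes seen in these dumps; extend as needed.
BUGCHECKS = {
    0x9E: "USER_MODE_HEALTH_MONITOR",
    0x23: "FAT_FILE_SYSTEM",
}


def describe_bugcheck(code):
    """Return the code as zero-padded hex plus its symbolic name, if known."""
    name = BUGCHECKS.get(code, "unknown (not in this table)")
    return f"0x{code:08X} {name}"


if __name__ == "__main__":
    for code in (0x9E, 0x23):
        print(describe_bugcheck(code))
```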

I won't attempt to go further because, as I said, server systems are an unknown beast to me. Hopefully @x BlueRobot and/or @axe0 will be along soon. I will continue to monitor this thread, partly in case I can help, but also as a learning experience.
 
If this helps, you got three errors before the BSOD:
Read More:
(The time has been adjusted automatically by my system; the real time is probably six hours earlier.)
You should be able to create a log with Get-ClusterLog (FailoverClusters).
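
As a rough sketch of how that could be scripted, the Python below builds a PowerShell command line for Get-ClusterLog and, optionally, runs it. The destination folder and the 360-minute window are assumptions chosen for illustration, and the helper names are hypothetical. It relies on the -Destination, -TimeSpan (minutes) and -UseLocalTime parameters of Get-ClusterLog, and has to be run on a cluster node with the FailoverClusters module installed.

```python
import subprocess


def build_cluster_log_command(destination, minutes, use_local_time=True):
    """Return the argv list for a PowerShell call to Get-ClusterLog."""
    script = f"Get-ClusterLog -Destination '{destination}' -TimeSpan {int(minutes)}"
    if use_local_time:
        # Write timestamps in local time instead of UTC.
        script += " -UseLocalTime"
    return ["powershell.exe", "-NoProfile", "-Command", script]


def collect_cluster_logs(destination, minutes):
    """Run the command; this only works on a node of the cluster."""
    return subprocess.run(build_cluster_log_command(destination, minutes), check=True)


if __name__ == "__main__":
    # Hypothetical folder and time window (last 6 hours); adjust to the BSOD time.
    print(build_cluster_log_command(r"C:\Temp\ClusterLogs", 360))
```

Limiting the time span keeps the logs small enough to attach, which is easier than sharing the full history from all six nodes.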
 
