BSOD on every server in the cluster

miranon (New member, joined Oct 13, 2024, 4 posts)
Hello everyone.
I have an S2D cluster of 4 servers.
A BSOD occurs with the error "driver_irql_not_less_or_equal" in vmswitch.sys. At the same time, an "Uncorrectable ECC" error message appears, each time on the second CPU. This happens from time to time on all servers in the cluster.
We noticed that the problem occurs more often with SR-IOV enabled on the VMs and under heavy network load. All servers in the cluster were completely reinstalled, including a switch to the Core edition, and all drivers and firmware were updated to the latest versions.

The configuration of all 4 servers is identical:
ASUS RS700-E10-RS12U
2xXeon 6346 (Hyper-Threading disabled)
512GB RAM (8x64GB Samsung M393A8G40BB4-CWE, installed in A1, C1, E1, G1, J1, L1, N1, R1)
2x480GB SSD SATA in RAID 1 for boot (SAMSUNG MZ7L3480HCHQ-00A07)
4x3,84TB NVMe for Storage (only 3,2TB available - we use the Micron Flex Capacity feature to increase DWPD. Micron 7300 MTFDHBE3T8TDF)
1x400GB NVMe for Storage (Intel Optane DC P5800X SSDPF21Q400GB)
1x6,4TB NVMe for Storage (Intel D7-P5620 SSDPF2KE064T1)
Intel E810-XXV Network Adapter, iWARP RDMA enabled.
2x1600W Power Supply (CHICONY POWER R1K6AW03P)
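
As a quick sanity check of the figures in this list, here is a minimal Python sketch of the arithmetic. The function names are hypothetical and chosen only for illustration; the inputs are the values quoted above (8 DIMMs of 64 GB, and 3,84 TB drives exposed as 3,2 TB through Flex Capacity).

```python
def total_ram_gb(dimm_count: int, dimm_size_gb: int) -> int:
    """Total installed memory in GB for identical DIMMs."""
    return dimm_count * dimm_size_gb


def reserved_fraction(native_tb: float, usable_tb: float) -> float:
    """Share of the native drive capacity that is not exposed to the OS."""
    return (native_tb - usable_tb) / native_tb


if __name__ == "__main__":
    # Values taken from the configuration list in this post.
    print(f"RAM per server: {total_ram_gb(8, 64)} GB")
    print(f"Capacity held back per Micron drive: {reserved_fraction(3.84, 3.2):.1%}")
```

With these inputs the memory comes to 512 GB per server, and roughly one sixth of each Micron drive is held back as extra spare area.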

I use Windows Server 2022 Datacenter Core with all updates.
I don't have any additional software on the servers.
I don't use overclocking (disabled in the BIOS).
The High performance power plan is enabled (in the BIOS too).
I ran the memory and processor tests recommended by the vendor (stress-ng --cpu 32 --cpu-method all --metrics --timeout 8h for the CPU and stress-ng --vm 8 --vm-bytes 80% --timeout 8h for memory).
Everything is OK. The vendor said it was not a hardware problem.

http://speccy.piriform.com/results/sOtiaDuzbQ4TPf860Ja8ACt

I apologize for any mistakes - English is not my native language.

I would appreciate your help.
Alex.
 

Hi,

Typically, we do not work with Windows servers. I would like to ask if you have permission to ask for help on a public forum like Sysnative forums if these servers are used in a company.
 
Hi

Yes, I have this permission. Unfortunately, we do not have a support contract, so I was allowed to ask here.
 
Note that any suggestion or troubleshooting step applies only to the specific Windows setup it was given for. For any other server, please provide the relevant logs as well so we can look at the situation on that server. If you apply a suggestion to a different server than the one it was provided for, it may not be effective, because the situation, as judged from the logs, might be different. We have had cases before where multiple computers seemed to have the same problem according to the user's description, but each computer needed a different solution because the problems were not identical.
 
I do not see any crashes with the server from the sysnativefilecollectionapp-3 file.

With the others, it looks to be mostly related to the driver of the device Intel(R) Ethernet Network Adapter E810-XXV-2. At what moment and where does this uncorrectable ECC error show up? I didn't see any WHEA logs related to it.

Also, to avoid confusion, I would recommend that you add in the filename from which server the logs are from.
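
For reference, one way to check for WHEA entries directly on a server is to query the System event log for the Microsoft-Windows-WHEA-Logger provider. Below is a minimal Python sketch that builds and runs a wevtutil query. It assumes Python is available on the machine, which is not the case on the servers described in this thread, so the same wevtutil command line can also be typed by hand. The function name and the default event count are hypothetical.

```python
import subprocess
import sys

# Event provider that Windows uses for hardware error (WHEA) records.
WHEA_PROVIDER = "Microsoft-Windows-WHEA-Logger"


def whea_query_command(max_events: int = 20) -> list[str]:
    """Build a wevtutil command listing the newest WHEA events in the System log."""
    xpath = f"*[System[Provider[@Name='{WHEA_PROVIDER}']]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on the provider name
        f"/c:{max_events}",   # maximum number of events to return
        "/rd:true",           # newest events first
        "/f:text",            # human-readable output
    ]


def main() -> None:
    if sys.platform != "win32":
        print("wevtutil is only available on Windows.")
        return
    result = subprocess.run(
        whea_query_command(), capture_output=True, text=True, check=False
    )
    print(result.stdout or "No WHEA-Logger events found in the System log.")


if __name__ == "__main__":
    main()
```

An empty result would be consistent with the ECC error being recorded only by the BMC and not being reported to Windows.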
 
You don't see any crashes because all the servers in the cluster were reinstalled. It just hasn't happened on this server yet. It doesn't happen very often.

I have tried Intel drivers from at least version 28.2 to 29.3 with the same result. I have no idea what to do.


The ECC error can only be seen in the web console or via IPMI. I have attached a screenshot.

1729353960998.png
 
Are there more details you can see in this web console or IPMI about the ECC error?
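
If the BMC is reachable with ipmitool, the System Event Log usually holds more detail than the web console summary. Here is a minimal Python sketch that reads the log with `ipmitool sel elist` and keeps only the lines mentioning ECC. It assumes ipmitool is installed on the machine running the script and can talk to the BMC; the function names are hypothetical, and the exact wording of the SEL entries depends on the BMC firmware.

```python
import shutil
import subprocess


def ecc_entries(sel_text: str) -> list[str]:
    """Return the SEL lines that mention ECC (case-insensitive)."""
    return [
        line.strip()
        for line in sel_text.splitlines()
        if "ecc" in line.lower()
    ]


def read_sel() -> str:
    """Read the extended SEL listing, or return an empty string if ipmitool is missing."""
    if shutil.which("ipmitool") is None:
        return ""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


if __name__ == "__main__":
    for entry in ecc_entries(read_sel()):
        print(entry)
```

The timestamp and sensor name on each matching line can then be compared with the time of the BSOD on that server.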
 