BSOD on every server in the cluster

miranon · Oct 19, 2024

Hello everyone.
I have a S2D cluster of 4 servers.
A BSOD occurs with the error "driver_irql_not_less_or_equal" in vmswitch.sys. At the same time, an "Uncorrectable ECC" error message appears, each time on the second CPU. This happens from time to time on all servers in the cluster.
We noticed that the problem occurs more often with SR-IOV enabled on VMs and under heavy network load. We completely reinstalled all servers in the cluster, including switching to the Core edition, and updated all drivers and firmware to the latest versions.

The configuration of all 4 servers is identical:
ASUS RS700-E10-RS12U
2xXeon 6346 (Hyper-Threading disabled)
512GB RAM (8x64GB Samsung M393A8G40BB4-CWE, installed in A1, C1, E1, G1, J1, L1, N1, R1)
2x480GB SSD SATA in RAID 1 for boot (SAMSUNG MZ7L3480HCHQ-00A07)
4x3,84TB NVMe for Storage (only 3,2TB available - we use the Micron Flex Capacity Feature to increase DWPD. Micron 7300 MTFDHBE3T8TDF)
1x400GB NVMe for Storage (Intel Optane DC P5800X SSDPF21Q400GB)
1x6,4TB NVMe for Storage (Intel D7-P5620 SSDPF2KE064T1)
Intel E810-XXV Network Adapter. iWarp RDMA Enabled.
2x1600 Power Supply (CHICONY POWER R1K6AW03P)

I use Windows Server 2022 Datacenter Core with all updates.
I don't have any additional software on servers
I don't use overclocking (Disabled in Bios)
High performance power schema is enabled (In the BIOS too)
I ran the memory and processor tests recommended by the vendor (stress-ng --cpu 32 --cpu-method all --metrics --timeout 8h for CPU and stress-ng --vm 8 --vm-bytes 80% --timeout 8h for memory).
Everything is OK. The vendor said it was not a hardware problem.

http://speccy.piriform.com/results/sOtiaDuzbQ4TPf860Ja8ACt

I apologize for any mistakes - English is not my native language.

I would appreciate your help.
Alex.

axe0 · Oct 19, 2024

Hi,

Typically, we do not work with Windows servers. I would like to ask if you have permission to ask for help on a public forum like Sysnative forums if these servers are used in a company.

miranon · Oct 19, 2024

Hi

Yes, I have this permission. Unfortunately, we do not have a support contract, so I was allowed to ask here.

axe0 · Oct 19, 2024

Note that any suggestion/troubleshooting step is specific to a specific Windows setup only, for any other server please provide relevant logs as well so we can have a look at the situation with those servers. If you apply a suggestion to a different server than the one it was provided for, the suggestion may not be effective because the situation might be different which is judged by the logs. We have had situations before where multiple computers seemed to have had the same problem per the user's description, but for each computer a different solution was necessary because the problem was not identical for each computer.

miranon · Oct 19, 2024

Thank you very much for agreeing to help me.

Here are the files from servers 1-3. The original file is from server 4.

axe0 · Oct 19, 2024

I do not see any crashes with the server from the sysnativefilecollectionapp-3 file.

With the others it looks to be mostly related to the driver of device Intel(R) Ethernet Network Adapter E810-XXV-2. At what moment and where does this uncorrectable ECC error show? I didn't see any WHEA logs related to it.

Also, to avoid confusion, I would recommend that you add in the filename from which server the logs are from.
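
One way to follow that naming suggestion is a minimal Python sketch like the one below. It assumes the collected archives are gathered on a machine with Python installed; the function name `tagged_name` and the file and server names used with it are hypothetical, chosen only for illustration.

```python
from pathlib import Path
import socket


def tagged_name(path, server=None):
    """Return the same path with the server name prefixed to the file name.

    `server` defaults to the local hostname; pass it explicitly when the
    archives have already been copied to another machine.
    """
    p = Path(path)
    server = server or socket.gethostname()
    return p.with_name(f"{server}_{p.name}")


# Hypothetical file and server names, for illustration only.
print(tagged_name("SysnativeFileCollectionApp.zip", "server1"))
```

To actually rename a file, `Path(old).rename(tagged_name(old, "server1"))` would apply the new name on disk.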

miranon · Oct 19, 2024

axe0 said:
I do not see any crashes with the server from the sysnativefilecollectionapp-3 file.

With the others it looks to be mostly related to the driver of device Intel(R) Ethernet Network Adapter E810-XXV-2. At what moment and where does this uncorrectable ECC error show? I didn't see any WHEA logs related to it.

Also, to avoid confusion, I would recommend that you add in the filename from which server the logs are from.

You don't see any crashes because all the servers in the cluster were reinstalled. It just hasn't happened on this server yet. It doesn't happen very often.

I have tried Intel drivers from at least version 28.2 to 29.3 with the same result. I have no idea what to do.

The ECC error can only be seen from the web console or IPMI. I have attached a screenshot.

axe0 · Oct 20, 2024

Are there more details you can see in this web console or IPMI about the ECC error?

axe0 · Oct 23, 2024

Are you still with me?

miranon · Oct 24, 2024

Thank you for your message.

Sorry, I was sick. I only came to work today.

No further information other than this message.

There have been several more reboots over the past few days. There are now reboots on all 4 servers.

However, there is an observation that this may be related to Jumbo Frames. We have turned them off and so far we have not seen any reboots. I can collect all the information again from all 4 servers.

axe0 · Oct 25, 2024

I only need to see logs if crashes occurred, otherwise there's no data to look at.

miranon · Oct 26, 2024

I collected logs from a third server that had no BSOD before.

axe0 · Oct 26, 2024

Is this with jumbo frames disabled?

miranon · Oct 28, 2024

No. Jumbo frames were disabled afterwards.

axe0 · Oct 28, 2024

Then it looks like the issue lies with the jumbo frames, because the dump looks pretty much the same as the earlier dumps.

miranon · Oct 29, 2024

This is very bad news for us, as it reduces productivity. Could there be any other reasons?
By the way, sometimes we see strange behavior of the VM - when SR-IOV is enabled, the connection to the VM is lost. The solution is to restart the VM or disable SR-IOV. I don't know if this makes a difference.

Since my message of 10/24, there have been no more BSODs. On two servers the ECC error has already cleared, while on the other two it still remains. It is not clear why.
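
To put a rough number on the cost of running without jumbo frames, here is a back-of-the-envelope Python sketch. It assumes plain TCP over IPv4 on Ethernet with no header options and ignores the extra iWARP layers, so the constants are assumptions and not measurements from this cluster. In practice the larger cost is usually the higher packet rate the CPU and NIC have to handle, not the header bytes themselves.

```python
import math

# Assumed per-frame overhead outside the MTU: preamble+SFD (8),
# Ethernet header (14), FCS (4), inter-frame gap (12).
ETH_OVERHEAD = 38
# Assumed headers inside the MTU: IPv4 (20) + TCP (20), no options.
IP_TCP_HEADERS = 40


def payload_efficiency(mtu):
    """Fraction of on-wire bytes that carry application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_gib(mtu):
    """Number of full-size frames needed to move 1 GiB of payload."""
    return math.ceil(2**30 / (mtu - IP_TCP_HEADERS))


for mtu in (1500, 9000):
    print(f"MTU {mtu}: efficiency {payload_efficiency(mtu):.2%}, "
          f"{frames_per_gib(mtu)} frames per GiB")
```

Under these assumptions the difference in wire efficiency is only a few percent, while the number of frames per GiB is roughly six times higher at MTU 1500.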

axe0 · Oct 29, 2024

How long have these machines been running with Jumbo Frames enabled before this problem started?

miranon · Oct 29, 2024

After the last reinstallation the problem appeared just a few days later.

axe0 · Oct 29, 2024

I assume these machines have been running just fine for a long time with jumbo frames enabled before the reinstallation?

miranon · Oct 29, 2024

No. It was the constant BSODs that made us try reinstalling.
These servers originally had X710 network cards installed. They were replaced with E810 cards, and RDMA, jumbo frames, and SR-IOV were enabled. I think it was after this that the problems started.
