PCI-E WHEA errors (0x124)

satrow · May 3, 2012

Vir Gnarus said:
Can you attach them to your post directly? They are small enough that zipping them should work. I cannot access them due to firewall restrictions against that site.

Attached.

Vir Gnarus · May 3, 2012

Thanks satrow. I am looking through them. So far I haven't found anything definitive, but I'm leaning toward CPU/mobo. I'll explain things later as I garner more info from these.

Time to do some brute force hardware tests:

RAM: Memtest86+ - 7+ passes
CPU: Prime95 - Torture Test; Large FFTs; overnight (9+ hours)
GPU: MemtestCL - Run twice (if any of the tests work on your GPU; ATI cards will need to install the ATI APP SDK as it requires OpenCL)
Drives: Seatools - All basic tests aside from the Fix all or the advanced ones.

The ones you want to run first are CPU and GPU. All of these (excluding MemtestCL) are included in the UBCD if you prefer a Live CD environment (which is the best environment to test hardware on). Note that Prime95 currently does not work on the UBCD. Also, please provide us temps/voltages using HWInfo with Sensors only option checked. Log two 30-minute instances: one for idle, and one for high load.

Teln3t · May 4, 2012

Alright, two days ago I got fed up with it all and decided to take every piece of the computer apart. I bought a de-dusting can so that I could de-dust every cm^2 of the machine. I put everything back together and it hasn't crashed in two days. I will note that when I pulled out the graphics card, there was an incredible amount of dust inside the actual PCI-E slot. There was also a ton of dust in the same area where a red light would activate on my Radeon 6980. I looked with a magnifying glass and the red light was marked D1000. After looking for documentation on what the D1000 red LED was supposed to signify... I found nothing.

But yes, for the past two days it hasn't crashed since the de-dusting event.

I'm going to get some McDonald's now.

Vir Gnarus · May 4, 2012

Did the red LED continue to turn on after you cleaned everything up? I personally tried to look up any official documentation on it but nothing comes up. All I can find on Google is people reporting that the light came on when their graphics card was suffering issues.

It's good to know that things are stable now after dusting. I would advise that you still be vigilant, however, given that the cleaning may not actually resolve the issue but only alleviate it. The last time I diagnosed a PCI-E WHEA error for someone, they thought dusting fixed the problem too, until the crashes crept back, only not nearly as frequently as prior to the cleanup work. Of course, if everything works great after this, then wonderful. I just want you to be aware of this as you continue to use the system.

I'll forgo looking at the crashdumps any further. If it was dust after all, then no amount of debugging with WinDbg will help one discover that.

Teln3t · May 6, 2012

Sorry completely forgot to update you on this Vir Gnarus. My bad.

No, the D1000 LED on the GPU did not activate after I had cleaned everything out. (And I mean not just blowing it out, but taking the tip of a paper towel and carefully getting rid of all the dust.) Going on the fourth day now (straight) and no crashes thus far.

Vir Gnarus · May 7, 2012

Sounds like a winner. Gave me the chance to dive into the PCI-E bus details, so thanks for that. It's unfortunate that it didn't lead to any solid answers, but given that this was caused by dust, there was no way the dumps could have exposed the cause. I should've recommended cleaning things up; that was an oversight on my part, and I apologize.

I'm glad to hear everything's workin fine now. I'll retain this for future reference for others wishing to diagnose PCI/PCI-E predicaments.

stucko · Jul 27, 2012

Hi all,

I've been reading this thread and it has been helpful.

I have a similar problem but don't know how to solve it, even though I think I know the source of the problem.

It started after I reinstalled Windows 7 on my new SSD. The biggest problem is that I didn't have a Windows 7 installer.

So now I have two Windows 7 installs, one on my old HD and one on my new SSD. The old one works fine, with an occasional crash that doesn't produce a bluescreen (I'm guessing it's heat).

My new one, however, is very annoying: almost every time I plug a hard disk into a particular USB port, the OS crashes with a minidump as follows:

Code:

Microsoft (R) Windows Debugger Version 6.2.8400.4218 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Windows\Minidump\072812-15038-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*C:\Program Files (x86)\Windows Kits\8.0\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows 7 Kernel Version 7601 (Service Pack 1) MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7601.17835.amd64fre.win7sp1_gdr.120503-2030
Machine Name:
Kernel base = 0xfffff800`02e0b000 PsLoadedModuleList = 0xfffff800`0304f670
Debug session time: Sat Jul 28 08:56:08.271 2012 (UTC + 8:00)
System Uptime: 0 days 0:10:29.239
Loading Kernel Symbols
.
SYMSRV:  c:\program files (x86)\windows kits\8.0\symbols*http://msdl.microsoft.com/download/symbols needs a downstream store
..............................................................
................................................................
.......................
Loading User Symbols
Loading unloaded module list
.......
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 124, {4, fffffa800eb65038, 0, 0}

*** WARNING: Unable to verify timestamp for win32k.sys
*** ERROR: Module load completed but symbols could not be loaded for win32k.sys
Probably caused by : GenuineIntel

Followup: MachineOwner
---------

7: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000004, PCI Express Error
Arg2: fffffa800eb65038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


BUGCHECK_STR:  0x124_GenuineIntel

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  WIN7_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  7

STACK_TEXT:  
fffff880`031aea78 fffff800`03405a3b : 00000000`00000124 00000000`00000004 fffffa80`0eb65038 00000000`00000000 : nt!KeBugCheckEx
fffff880`031aea80 fffff800`02f97b03 : 00000000`00000001  fffffa80`0e5680c0 00000000`00000000 fffffa80`0e567b60 :  hal!HalBugCheckSystem+0x1e3
fffff880`031aeac0 fffff880`00f2fbcf : fffffa80`00000750  fffffa80`0e5680c0 00000000`00000001 fffffa80`0eb64ab0 :  nt!WheaReportHwError+0x263
fffff880`031aeb20 fffff880`00f2f5f6 : 00000000`00000000  fffff880`031aec70 fffffa80`0ecfed80 fffffa80`0f7721a0 :  pci!ExpressRootPortAerInterruptRoutine+0x27f
fffff880`031aeb80 fffff800`02e8601c : fffff880`03186180  00000000`ffffffff fffffa80`0ecfed80 fffff880`04198501 :  pci!ExpressRootPortInterruptRoutine+0x36
fffff880`031aebf0 fffff800`02e81ea2 : fffff880`03186180  fffff880`00000002 00000000`00000002 fffff800`00000000 :  nt!KiInterruptDispatch+0x16c
fffff880`031aed80 00000000`00000000 : 00000000`00000000  00000000`00000000 00000000`00000000 00000000`00000000 :  nt!KiIdleLoop+0x32


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: GenuineIntel

IMAGE_NAME:  GenuineIntel

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x124_GenuineIntel_PCIEXPRESS

BUCKET_ID:  X64_0x124_GenuineIntel_PCIEXPRESS

Followup: MachineOwner
---------

7: kd> !errrec fffffa800eb65038
===============================================================================
Common Platform Error Record @ fffffa800eb65038
-------------------------------------------------------------------------------
Record Id     : 01cd6c5a4fe290f5
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 7/28/2012 0:56:08 (UTC)
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa800eb650b8
Section       @ fffffa800eb65148
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Fatal

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x4010/0x0506
Device Id     :
  VenId:DevId : 8086:340c
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x05
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa800eb6517c
  Device Caps : 00008021 Role-Based Error Reporting: 1
  Device Ctl  : 0107 ur FE NF CE
  Dev Status  : 0005 ur FE nf CE
   Root Ctl   : 0008 fs nfs cs

AER Information @ fffffa800eb651b8
  Uncorrectable Error Status    : 00040000 ur ecrc MTLP rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 00000040 adv rtto rnro dllp TLP re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 00000012 ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 3f000000 04000030 00000000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800eb65100
Section       @ fffffa800eb65218
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x64
CPU Version   : 0x00000000000106a5
Processor ID  : 0x0000000000000007

The source is definitely the USB port, but I just don't know how to fix it. Can anybody give some pointers? Thanks in advance.

Teln3t · Jul 28, 2012

Vir Gnarus can correct me if I'm wrong, as he is a far better kernel researcher than I am. However, from looking over your memory dump, it appears that the issue is coming from a faulty PCI-E port. Then again, I might be incorrect.

Vir Gnarus · Jul 30, 2012

Is this one of those SSD drives that's a PCI/PCI-E card? Or do you have your SSD drive plugged into a drive controller card? If either, that could probably explain what's going on here. Unfortunately I cannot tell yet from the WHEA info here what exactly caused the issue, but I do know what the issue is.

This is caused by a malformed TLP, meaning a packet of data sent through the PCI/PCI-E bus got altered during transit. Teln3t and I discovered this can be attributed to physical issues such as a dirty PCI/PCI-E slot or a card that's not seated properly. Make sure neither is the case. In addition, look online for any firmware updates for your SSD drive and/or drive controller card, as well as chipset drivers for your motherboard and a BIOS update. Any of these, especially the firmware for your SSD drive, is well capable of causing instability when using said drive.
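For anyone who wants to read the "Uncorrectable Error Status" value without relying on the capitalized letters in the !errrec output, here is a minimal Python sketch. The bit positions follow my reading of the PCIe AER Uncorrectable Error Status register layout, and the table and function names are made up for illustration, so treat it as a reading aid rather than an authoritative decoder.

```python
# Bit positions assumed from the PCIe AER Uncorrectable Error Status layout.
# The short names mirror the abbreviations printed by !errrec.
UNCORRECTABLE_BITS = {
    0: ("und", "Undefined"),
    4: ("dlp", "Data Link Protocol Error"),
    5: ("sd", "Surprise Down"),
    12: ("ptlp", "Poisoned TLP"),
    13: ("fcp", "Flow Control Protocol Error"),
    14: ("cto", "Completion Timeout"),
    15: ("ca", "Completer Abort"),
    16: ("uc", "Unexpected Completion"),
    17: ("rof", "Receiver Overflow"),
    18: ("mtlp", "Malformed TLP"),
    19: ("ecrc", "ECRC Error"),
    20: ("ur", "Unsupported Request"),
}


def decode_uncorrectable(status):
    """Return the long names of every error bit set in the status value."""
    return [
        name
        for bit, (_, name) in sorted(UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


# Value taken from the "Uncorrectable Error Status" line in the dump above.
print(decode_uncorrectable(0x00040000))  # ['Malformed TLP']
```

The same function can be pointed at any other record's status value; only the single hex number needs to be copied out of the AER Information section.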

After this is done, you may need to do a repair install of Windows 7 on your SSD drive or run an SFC scan to ensure Windows files were not corrupted by your drive's potential instability.

There is also, of course, the possibility we're dealing with a bad SSD drive that may need replacing.

usasma · Jul 30, 2012

Just an observation - but when STOP 0x124 first came out (in Vista), the majority of the errors were with Arg1 = 4
It wasn't until much, much later (probably a year or two) that the Arg1 = 1 errors became more common.

Back then they were suggestive of a video problem - so you might want to check your video stuff also.
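As a quick aid for telling these Arg1 values apart, the sketch below pulls the bug check code and arguments out of a WinDbg "BugCheck" summary line and looks up the first argument. The table only covers the first few WHEA error source types as I understand them from the 0x124 documentation, and the names `WHEA_SOURCE` and `parse_bugcheck` are hypothetical.

```python
import re

# First few WHEA error source types for bug check 0x124 (partial, assumed list).
WHEA_SOURCE = {
    0: "Machine check exception",
    1: "Corrected machine check exception",
    2: "Corrected platform error",
    3: "Non-maskable interrupt",
    4: "PCI Express error",
}


def parse_bugcheck(line):
    """Parse 'BugCheck <code>, {a1, a2, a3, a4}' into (code, [args]) as ints."""
    m = re.match(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if m is None:
        raise ValueError("not a BugCheck summary line")
    code = int(m.group(1), 16)
    args = [int(a.strip(), 16) for a in m.group(2).split(",")]
    return code, args


code, args = parse_bugcheck("BugCheck 124, {4, fffffa800eb65038, 0, 0}")
print(hex(code), WHEA_SOURCE.get(args[0], "other"))
```

The second argument is the address you would hand to !errrec, as stucko did in his dump.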

Vir Gnarus · Jul 30, 2012

Just a guess here, but that was probably because when Vista came out PCI-E was still rather early in its life, and driver developers hadn't yet ironed out everything to work properly with the new standard, so more bugs were bound to happen, especially in video drivers. Video driver developers will tell you that their drivers are designed to fix issues in their hardware's architecture, not to enhance it (though fixing often does do so).

stucko · Aug 3, 2012

Hi guys,

thanks for the replies.

FYI,

I have two OSes (Windows 7) running on the computer:
(1) on a normal HDD (ST31000528AS - 1TB) running on SATA 3 Gb/s (this OS was installed by default by Dell)
(2) on an SSD (OCZ-AGILITY) running on SATA 6 Gb/s (this OS is the one I recently installed myself)

I have been using this computer for about 1.5 years now with the OS on (1) and didn't encounter any BSOD (although once or twice a week I get crashes (non-BSOD, just restarts), which I assume are due to heat because I rarely turn my PC off).

After I installed the SSD (not PCI) the old OS still works fine. But on OS (2) a BSOD comes up every time I plug something into a USB 3.0 port.

I was puzzled as to why my OS (1) is fine while my OS (2) had problems. I compared the drivers and it seems my new OS had an older version.
So I did some net searching and finally found a driver for my NEC/Renesas USB controller, which runs on a PCIe card. After installing it, my USB can now be used without causing a BSOD most of the time. It still sometimes crashes with a BSOD, not during plug-in but sometimes when running some programs or exploring the disk.

The USB runs fine on the old OS (1) even though it's running an older driver.

Does anyone still think it's the video card's fault or the SSD's fault?
How do I double-check/confirm whether the SSD or the PCI card is faulty? (I'd cry if it's the SSD, since it's new and expensive.)

I noticed that the PCI card was a bit dusty, so I've cleaned it up and reseated it. But I'm still wondering why OS (1) works fine while the new OS (2), using the latest drivers, is not working so well.

Since the BSODs are not frequent anymore (hard to reproduce), I'll just use my computer as it is, report back on whether things have improved, and check here for any other suggestions on how to fix it.

Again thanks for the replies.

Vir Gnarus · Aug 6, 2012

How did you install the OS on the SSD? Did you make any adjustments to the BIOS concerning your drive (such as switching from SATA IDE to AHCI) since then? Also, did you apply the SSD firmware updates before or after installing the OS? The thing is, you need to set up the SSD drive properly first (install firmware updates, set BIOS settings, etc.), and then install the OS. Changing the SSD or BIOS setup during or after the OS install can affect the stability of the OS. There's also the potential that the OS got corrupted during installation. Overall, you'll want to reinstall the OS on the SSD drive after you've completed all the changes to it.

One cannot discern very well whether an SSD drive is bad because there aren't any reliable diagnostic tests for it. Check OCZ's website and see if they have one; otherwise there's really nothing you can do aside from swapping drives. I also should let you know that OCZ has a reputation for having faster SSD drives than most competitors - but at the price of reduced reliability; their drives have a tendency to bug out moreso than others. Hopefully you still have a warranty available to put to use.

Btw, you want to make sure that the slots themselves are clean, not so much the cards (though the connector should be clean). It's making sure the connection between the slot and the card is not impeded by any residue.

cluberti · Aug 7, 2012

It's worth noting that a lot of motherboards hang certain controllers (USB3, SATA) off the PCI-E bus internally, so bad drives and bad (or misbehaving) USB devices on those buses can cause issues. I've seen this on a lot of motherboards running Vista or 7 on an SSD where it worked fine with a spinner, and XP had no issues either (mostly because it made no distinction between SSD and mechanical, whereas Vista and higher at least make an attempt to behave differently). Not saying it's the issue here, but it can be, esp. on OEM boards.

rijeka051 · Aug 7, 2012

Hello all,

This thread really has a wealth of information. I stumbled onto it while digging into my problem, which is very similar.
I have an HP laptop that is BSOD-ing (STOP 0x124), and the Event Viewer is full of repeated WHEA messages about a corrected hardware error, etc.
The laptop is fairly new, runs Win7 64-bit, and has had its MB and graphics card replaced without good results.

I am not sure how to interpret the following :

Device Id :
VenId:DevId : 8086:d138
Class code : 030400

PCIdatabase.com does not recognize the d138 device ID. I consulted the PCI Express Base Specification document, but could not decipher Class code: 030400.

Perhaps you can help me shed some more light on my WHEA listing.

Code:

===============================================================================
Common Platform Error Record @ fffffa8007b41038
-------------------------------------------------------------------------------
Record Id     : 01cd6fb01727dcad
Severity      : Fatal (1)
Length        : 672
Creator       : Microsoft
Notify Type   : PCI Express Error
Timestamp     : 8/1/2012 9:28:17
Flags         : 0x00000000

===============================================================================
Section 0     : PCI Express
-------------------------------------------------------------------------------
Descriptor    @ fffffa8007b410b8
Section       @ fffffa8007b41148
Offset        : 272
Length        : 208
Flags         : 0x00000001 Primary
Severity      : Recoverable

Port Type     : Root Port
Version       : 1.1
Command/Status: 0x0010/0x0407
Device Id     :
  VenId:DevId : 8086:d138
  Class code  : 030400
  Function No : 0x00
  Device No   : 0x03
  Segment     : 0x0000
  Primary Bus : 0x00
  Second. Bus : 0x00
  Slot        : 0x0000
Dev. Serial # : 0000000000000000
Express Capability Information @ fffffa8007b4117c
  Device Caps : 00008021 Role-Based Error Reporting: 1
  Device Ctl  : 0107 ur FE NF CE
  Dev Status  : 0003 ur fe NF CE
   Root Ctl   : 0008 fs nfs cs

AER Information @ fffffa8007b411b8
  Uncorrectable Error Status    : 00014000 ur ecrc mtlp rof UC ca CTO fcp ptlp sd dlp und
  Uncorrectable Error Mask      : 00000000 ur ecrc mtlp rof uc ca cto fcp ptlp sd dlp und
  Uncorrectable Error Severity  : 00062010 ur ecrc MTLP ROF uc ca cto FCP ptlp sd DLP und
  Correctable Error Status      : 00002000 ADV rtto rnro dllp tlp re
  Correctable Error Mask        : 00000000 adv rtto rnro dllp tlp re
  Caps & Control                : 0000000e ecrcchken ecrcchkcap ecrcgenen ecrcgencap FEP
  Header Log                    : 4a000001 01000004 00180000 00000000
  Root Error Command            : 00000000 fen nfen cen
  Root Error Status             : 00000000 MSG# 00 fer nfer fuf mur ur mcr cer
  Correctable Error Source ID   : 00,00,00
  Correctable Error Source ID   : 00,00,00

===============================================================================
Section 1     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa8007b41100
Section       @ fffffa8007b41218
Offset        : 480
Length        : 192
Flags         : 0x00000000
Severity      : Informational

Proc. Type    : x86/x64
Instr. Set    : x64
CPU Version   : 0x00000000000106e5
Processor ID  : 0x0000000000000001

Vir Gnarus · Aug 7, 2012

@Cluberti:

Thanks, Carl. I did notice the SATA class codes in the PCI-E classification, but I didn't think much about it because I figured it was related to SATA controller cards and not any actual onboard controller. In retrospect, I now understand certain incidents I've come across dealing with these WHEA errors and fixing it by fiddling with the HD. Again, thanks!

@rijeka051:

Hi mate, the PCI database is often helpful, but it isn't the most robust database for this, as it depends primarily on voluntary contributions. I'm afraid it won't help much whether you figure them out or not, though: from my previous research, the record only identifies the root port that reported the issue and not the actual device the problem originated from (if there even was one). If the WHEA error record showed that an actual device reported the error, rather than a bridge or root port, then it would be helpful. I assume our only option is to look at the Header Log under the AER information and try to interpret it to figure out the device that sent this packet.
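As a rough illustration of what interpreting the Header Log could look like, here is a Python sketch that splits the first three DWORDs into TLP header fields. It assumes header byte 0 sits in the top byte of the first DWORD and that !errrec prints the DWORDs as read, and it only goes beyond the first DWORD for completion TLPs. The function names are made up for this example, and the result should be cross-checked against the PCI Express Base Specification before drawing conclusions.

```python
def split_id(bdf):
    """Split a 16-bit requester/completer ID into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7


def decode_header_log(dw0, dw1, dw2):
    """Decode the first three DWORDs of an AER Header Log entry.

    Assumes header byte 0 is the top byte of dw0. Only completion TLPs
    (type 0b01010) are decoded past the common first DWORD.
    """
    info = {
        "fmt": (dw0 >> 29) & 0x7,     # header format bits
        "type": (dw0 >> 24) & 0x1F,   # TLP type bits
        "length_dw": dw0 & 0x3FF,     # raw payload length field, in DWORDs
    }
    if info["type"] == 0b01010:  # Cpl / CplD
        info["completer"] = split_id(dw1 >> 16)
        info["status"] = (dw1 >> 13) & 0x7
        info["byte_count"] = dw1 & 0xFFF
        info["requester"] = split_id(dw2 >> 16)
        info["tag"] = (dw2 >> 8) & 0xFF
    return info


# Header Log values from rijeka051's record.
print(decode_header_log(0x4A000001, 0x01000004, 0x00180000))
```

Under those assumptions the logged packet reads as a completion with data whose requester is bus 0, device 3, function 0 (matching the "Device No : 0x03" root port in the record) and whose completer is bus 1, device 0, function 0, i.e. something sitting behind that root port. Feeding in the first DWORD from stucko's record (0x3F000000) gives fmt 1 and type 0x1F, which this sketch does not recognise as a completion.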

As for the error, there are two: UC and CTO are capitalized (highlighted) in Uncorrectable Error Status. Since this is from a root port, we'll look at the PCI_EXPRESS_ROOT_PORT_AER_CAPABILITY structure. These turn out to be Unexpected Completion and Completion Timeout, and I would venture to guess the latter caused the former. If a transaction timed out, that probably means a packet either went the wrong way and never reached its proper destination, or was lost or discarded.
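The Uncorrectable Error Severity line in the same record indicates which error types the port treats as fatal: as I understand the AER layout, a set bit means fatal and a clear bit means non-fatal. A small Python sketch, with a made-up helper name, shows how the two registers combine:

```python
CTO = 1 << 14  # Completion Timeout
UC = 1 << 16   # Unexpected Completion


def split_by_severity(status, severity):
    """Split the set status bits into (fatal, non_fatal) bit masks.

    Assumes a set bit in the severity register marks that error type as
    fatal and a clear bit marks it as non-fatal.
    """
    return status & severity, status & ~severity


# Status and severity values from rijeka051's AER Information section.
fatal, non_fatal = split_by_severity(0x00014000, 0x00062010)
print(hex(fatal), hex(non_fatal))  # 0x0 0x14000
print(non_fatal == (UC | CTO))     # True
```

With these values neither UC nor CTO falls under the fatal mask, which lines up with the section reporting "Severity : Recoverable" and the NF flag in Dev Status. Running the same check on the earlier record (status 0x00040000, same severity mask) puts MTLP on the fatal side, matching its "Fatal" section severity.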

As for translating this into potential causes, I'm not sure. I would figure anything that could cause a packet to be lost unexpectedly in transit (PSU instability, motherboard failure, overheating, dust/foreign material) would be initially suspect. As Cluberti has explained, this could also be related to USB or SATA, since on OEM mobos those controllers are commonly connected through the PCI-E bus.

I'd like to dabble in it, but I'll have to scrutinize it later.

Wrench97 · Aug 7, 2012

It's not only on OEM boards that the ports can be connected through the PCIe bus, but on most if not all retail boards as well.

stucko · Aug 10, 2012

@Vir Gnarus:

I also should let you know that OCZ has a reputation for having faster SSD drives than most competitors - but at the price of reduced reliability; their drives have a tendency to bug out moreso than others

Haha, guess I didn't do enough research to find that out. I think I did all the required steps to install the OS on the SSD as you have stated. I guess I'll go about making sure that my SSD is not faulty.

@Cluberti:
Thanks for pointing out that Windows Vista and higher behave differently on spinner/SSD and that motherboards hang certain controllers off the PCI-E bus. Mine is OEM. Oh well, I guess I'll have to live with it; it'll be a couple of years till I get a new rig.

Thanks guys for being very helpful :)

writhziden · Aug 10, 2012

stucko, have you tried a power cycle with the SSD? I have seen OCZ SSD firmware updates and the like cause problems between the BIOS/SSD interface that can result in 0x124 and 0x7A crashes. Resetting the BIOS/SSD connection can resolve the problem and the steps to do so involve power cycling the SSD. I just gave these steps to another SSD OCZ user, and I gave them to an OCZ user a couple months back getting a 0x124 crash that was resolved with the power cycle.

SSD Troubleshooting:

Try doing a power cycle of the SSD. The following steps should be carried out and take ~1 hour to complete.

1. Power off the system.
2. Remove all power supplies (AC adapter then battery for laptop, AC adapter for desktop).
3. Hold down the power button for 30 seconds to close the circuit and drain all components of power.
4. Reconnect all power supplies (battery then AC adapter for laptop, AC adapter for desktop).
5. Turn on the system and enter the BIOS (see your manual for the steps to enter the BIOS).
6. Let the computer remain in the BIOS for 20 minutes.
7. Follow steps 1-3 and physically remove the SSD from the system by disconnecting the cables for a desktop or disconnecting the drive from the junction for a laptop.
8. Leave the drive disconnected for 30 seconds to let all power drain from it.
9. Replace the drive connection(s) and then do steps 4-8 again.
10. Repeat steps 1-4.
11. Start your computer normally and run Windows.

The above steps were a result of: Why did my SSD "disappear" from my system? - Crucial Community

While that may not be your drive, a power cycle should be the same on all SSD drives. See how the system responds after the SSD power cycle.

Wrench97 · Nov 5, 2012

Not PCIe x124's but an interesting read from Intel on 124 MCE's > http://download.intel.com/design/intarch/PAPERS/324077.pdf
