VIDEO_TDR_FAILURES after S3 sleep - Windows 7 x64

I'm going out very soon so I can't comment properly right now; I'll take a look later. I never would have guessed to do this much disassembling. Very nice work, mate :)
 
Ha, thanks. Well, I didn't really give you the option, since the minidumps I uploaded didn't include the memory needed to read where the jmp was pointing. Let me know if you want the full kernel dump and I'll upload it somewhere and send you a link.
 
Yes, I wouldn't mind getting a link so I can take a look at the dump files myself.
 
Yes I got it, I'll download it and do some debugging, good practice :)

How old is that Kernel memory dump?
Is this a new crash or just some good old analysis?
 
It's the last crash I had from a week ago. Since then I had the freeze with the 117 with my old 260 card and the flickers without crashes or TDRs on my 670. No problems yet since I switched to my new case and PSU, but I also haven't had it in S3 sleep since then yet either.

Like you say, I think it's mostly an academic exercise at the moment as I need to see how things work out over the next few days.
 
I haven't got round to looking at the dump file but I noticed something in the analysis.

Is there a way for me to reconstruct the callstack that led to fffff8800f804530 in nvlddmkm?

Code:
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02befa68 fffff880`05356140 : 00000000`00000116 fffffa80`0bd5d010 fffff880`0f804530 ffffffff`c00000b5 : nt!KeBugCheckEx
fffff880`02befa70 fffff880`05355f1b : fffff880`0f804530 fffffa80`0bd5d010 fffffa80`0bae5350 fffffa80`0ba42010 : dxgkrnl!TdrBugcheckOnTimeout+0xec
fffff880`02befab0 fffff880`0520ff13 : fffffa80`0bd5d010 00000000`c00000b5 fffffa80`0bae5350 fffffa80`0ba42010 : dxgkrnl!TdrIsRecoveryRequired+0x273
fffff880`02befae0 fffff880`05239cf1 : 00000000`ffffffff 00000000`00000a9f 00000000`00000000 00000000`00000000 : dxgmms1!VidSchiReportHwHang+0x40b
fffff880`02befbc0 fffff880`0520b2e1 : fffffa80`0ba42010 ffffffff`00000000 00000000`00000a9f 00000000`00000000 : dxgmms1!VidSchiCheckHwProgress+0x71
fffff880`02befbf0 fffff880`05237ff6 : 00000000`00000000 fffffa80`0bae5350 00000000`00000080 fffffa80`0ba42010 : dxgmms1!VidSchiScheduleCommandToRun+0x1e9
fffff880`02befd00 fffff800`02d7073a : 00000000`01eb24a1 fffffa80`0ba564c0 fffffa80`068a7890 fffffa80`0ba564c0 : dxgmms1!VidSchiWorkerThread+0xba
fffff880`02befd40 fffff800`02ac58e6 : fffff800`02c4fe80 fffffa80`0ba564c0 fffff800`02c5dcc0 fffff880`03962201 : nt!PspSystemThreadStartup+0x5a
fffff880`02befd80 00000000`00000000 : fffff880`02bf0000 fffff880`02bea000 fffff880`02bef7f0 00000000`00000000 : nt!KxStartSystemThread+0x16
 
Hmm, that just looks like the parameter being passed to the bugcheck routine so it can report where the driver was at when it failed to respond to CheckHwProgress. I was wondering if I could look at how the driver itself got to that point. I need to find some time to learn about the kernel threading model, I guess...
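
For reference, the arguments on the nt!KeBugCheckEx frame can be labelled with a few lines of Python. This is only a sketch: the parameter meanings are taken from Microsoft's description of bug check 0x116 (VIDEO_TDR_FAILURE), the helper names are made up for illustration, and "Args to Child" values on x64 are not always trustworthy, although here they do line up with the bugcheck.

```python
# Sketch only: the helper names are hypothetical, not part of any debugger API.

# NTSTATUS values worth naming; only the one seen in this dump is listed.
NTSTATUS_NAMES = {0xC00000B5: "STATUS_IO_TIMEOUT"}

# The nt!KeBugCheckEx frame copied from the stack listing above.
FRAME = ("fffff880`02befa68 fffff880`05356140 : 00000000`00000116 "
         "fffffa80`0bd5d010 fffff880`0f804530 ffffffff`c00000b5 : nt!KeBugCheckEx")


def parse_kd_qword(text):
    """Convert a WinDbg-style 64-bit value such as fffff880`0f804530 to an int."""
    return int(text.replace("`", ""), 16)


def describe_tdr_bugcheck(frame_line):
    """Label the bug check code and first three parameters of a 0x116 frame."""
    _, arg_field, call_site = [part.strip() for part in frame_line.split(" : ")]
    if call_site != "nt!KeBugCheckEx":
        raise ValueError("expected the nt!KeBugCheckEx frame")
    code, p1, p2, p3 = (parse_kd_qword(a) for a in arg_field.split())
    if code != 0x116:
        raise ValueError("not a VIDEO_TDR_FAILURE (0x116) bug check")
    status = p3 & 0xFFFFFFFF  # NTSTATUS is 32 bits; the upper half is sign extension
    return {
        "bugcheck_code": code,
        "tdr_recovery_context": p1,    # pointer to the internal TDR recovery context
        "driver_module_address": p2,   # pointer into the responsible driver module
        "last_failed_status": status,  # NTSTATUS of the last failed operation
        "status_name": NTSTATUS_NAMES.get(status, "unknown"),
    }


if __name__ == "__main__":
    for key, value in describe_tdr_bugcheck(FRAME).items():
        shown = f"{value:#x}" if isinstance(value, int) else value
        print(f"{key}: {shown}")
```

Read this way, fffff8800f804530 is just the second bugcheck parameter: an address inside nvlddmkm that identifies the driver, not a return address on the hung thread's stack.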
 
I'm a few days into my new PSU and case and no hint of trouble so far. I'm running with only basic peripherals so far, so it'll be interesting to see if the problem stays away when I start plugging everything back in. This said, the problem went away for a full month before, so I don't really know what to expect...! I still haven't got around to trying to take my earlier analysis a little further. :-)
 
OK, here's the latest. I'm still mostly problem free, but yesterday I had a recovered TDR (a single 117 live kernel report) while playing Civ 5, which crashed the game. Three hours later it was followed by a series of six 117 live kernel reports in quick succession, this time freezing the PC without a 116 bugcheck (like I had with the GTX 260 installed). Both times, the 117s were triggered by exactly the same action in game, and I've found other reports of people getting TDRs under the same circumstances, so this may be a driver glitch associated with Civ 5. However, I'm not quite ready to let things drop yet, so I've been doing some more thinking...

I was thinking about the disassembly at the pointer into nvlddmkm a little more and came up with an idea based on totally the wrong conclusion about it! (will explain that in a minute) Looking at the hardware IRQ assignments on my machine, I noticed that my video card was sharing IRQ 16 with a "VIA Rev 5" USB Universal Host Controller. Now, I already happened to have suspicions about this device. It's actually part of a TV tuner card I have installed. There are two such controllers which each claim to have a "USB Composite Device" attached to them. I don't fully understand what these do and will have to investigate further. However, what I can tell you is that they don't respond nicely to S3 sleep at all - every time I switch my computer back on from S3 sleep, they claim to disconnect and reconnect, sometimes producing errors and often causing the attached infrared controller to stop working properly. My guess would be that it's because the PCI bus loses power in S3 sleep while "normal" USB controllers do not, so when we go to S3, the power to these controllers is suddenly lost unexpectedly.

So, for my next step of diagnostics, I have disabled the one VIA USB controller that is sharing IRQ 16 (the other is on IRQ 18). I'll see how that goes and try removing the TV tuner card completely as the next step if it doesn't work.

Now, here is the "wrong conclusion" I made and some interesting extra information from it. You'll recall that the disassembly in the NVidia driver showed lots of "int 3" instructions surrounding all those jumps (I'm back on the full memory dump from the 116 from a few weeks ago). In fact, the instruction immediately before the stop point was also an int 3:

Code:
4: kd> u fffff8800f804530-10 L20
nvlddmkm+0x14f520:
fffff880`0f804520 48ff2571837100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x188388 (fffff880`0ff1c898)]
fffff880`0f804527 cc              int     3
fffff880`0f804528 e9b799f0ff      jmp     nvlddmkm+0x58ee4 (fffff880`0f70dee4)
fffff880`0f80452d cc              int     3
fffff880`0f80452e cc              int     3
fffff880`0f80452f cc              int     3
fffff880`0f804530 48ff25d9817100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x188200 (fffff880`0ff1c710)]
fffff880`0f804537 cc              int     3
fffff880`0f804538 e94fa2f0ff      jmp     nvlddmkm+0x5978c (fffff880`0f70e78c)
fffff880`0f80453d cc              int     3
fffff880`0f80453e cc              int     3
fffff880`0f80453f cc              int     3
fffff880`0f804540 48ff2589817100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x1881c0 (fffff880`0ff1c6d0)]
fffff880`0f804547 cc              int     3
fffff880`0f804548 e94ba7f0ff      jmp     nvlddmkm+0x59c98 (fffff880`0f70ec98)
fffff880`0f80454d cc              int     3
fffff880`0f80454e cc              int     3
fffff880`0f80454f cc              int     3
fffff880`0f804550 48ff25c1827100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x188308 (fffff880`0ff1c818)]
fffff880`0f804557 cc              int     3
fffff880`0f804558 e9d3a8f0ff      jmp     nvlddmkm+0x59e30 (fffff880`0f70ee30)
fffff880`0f80455d cc              int     3
fffff880`0f80455e cc              int     3
fffff880`0f80455f cc              int     3
fffff880`0f804560 48ff25d1817100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x188228 (fffff880`0ff1c738)]
fffff880`0f804567 cc              int     3
fffff880`0f804568 48ff2559817100  jmp     qword ptr [nvlddmkm!nvDumpConfig+0x1881b8 (fffff880`0ff1c6c8)]
fffff880`0f80456f cc              int     3
fffff880`0f804570 e99bb1f0ff      jmp     nvlddmkm+0x5a710 (fffff880`0f70f710)
fffff880`0f804575 cc              int     3
fffff880`0f804576 cc              int     3
fffff880`0f804577 cc              int     3
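
As a cross-check on the listing, the jump targets the debugger prints can be recomputed from the raw instruction bytes. The sketch below handles only the three encodings that appear above (48 ff 25 for a RIP-relative indirect jmp, e9 for a direct relative jmp, and cc for int 3); the function name is made up for illustration, and this is not a general disassembler.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def jump_target(address, code):
    """Classify one instruction from the listing and compute what it refers to.

    Returns (kind, value):
      "indirect": value is the address of the 8-byte pointer slot that the
                  jmp reads its destination from (the destination itself
                  lives in memory and is not in the instruction bytes)
      "direct":   value is the jump destination
      "int3":     value is None
    """
    if code[:3] == bytes.fromhex("48ff25"):
        # REX.W + FF /4 with RIP-relative addressing: jmp qword ptr [rip+disp32].
        # The displacement is relative to the end of the 7-byte instruction.
        (disp,) = struct.unpack("<i", code[3:7])
        return "indirect", (address + 7 + disp) & MASK64
    if code[:1] == b"\xe9":
        # E9 rel32: the displacement is relative to the end of the 5-byte instruction.
        (disp,) = struct.unpack("<i", code[1:5])
        return "direct", (address + 5 + disp) & MASK64
    if code[:1] == b"\xcc":
        return "int3", None
    raise ValueError("encoding not covered by this sketch")


if __name__ == "__main__":
    samples = [
        (0xFFFFF8800F804528, "e9b799f0ff"),
        (0xFFFFF8800F80452F, "cc"),
        (0xFFFFF8800F804530, "48ff25d9817100"),
    ]
    for addr, hex_bytes in samples:
        kind, value = jump_target(addr, bytes.fromhex(hex_bytes))
        shown = "" if value is None else f" {value:#018x}"
        print(f"{addr:#018x} {kind}{shown}")
```

For the instruction at fffff880`0f804530 this gives the pointer slot fffff880`0ff1c710, matching the debugger output. That slot's contents are what a minidump without the relevant memory cannot show.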

My mind immediately went from "int", i.e. "interrupt", to hardware interrupts and hence IRQs. In fact, after a little more reading, I realise now that int 3 is a software interrupt that invokes vector 3 of the interrupt descriptor table (IDT), not a hardware IRQ. We can find out what handles that vector:

Code:
4: kd> !idt

Dumping IDT: fffff880009bd6c0

00:    fffff80002ad1940 nt!KiDivideErrorFault
01:    fffff80002ad1a40 nt!KiDebugTrapOrFault
02:    fffff80002ad1c00 nt!KiNmiInterrupt    Stack = 0xFFFFF880009BD0C0
03:    fffff80002ad1f80 nt!KiBreakpointTrap
04:    fffff80002ad2080 nt!KiOverflowTrap
05:    fffff80002ad2180 nt!KiBoundFault
06:    fffff80002ad2280 nt!KiInvalidOpcodeFault
07:    fffff80002ad24c0 nt!KiNpxNotAvailableFault
08:    fffff80002ad2580 nt!KiDoubleFaultAbort    Stack = 0xFFFFF880009B90C0
09:    fffff80002ad2640 nt!KiNpxSegmentOverrunAbort
0a:    fffff80002ad2700 nt!KiInvalidTssFault
0b:    fffff80002ad27c0 nt!KiSegmentNotPresentFault
0c:    fffff80002ad2900 nt!KiStackFault
0d:    fffff80002ad2a40 nt!KiGeneralProtectionFault
0e:    fffff80002ad2b80 nt!KiPageFault
10:    fffff80002ad2f40 nt!KiFloatingErrorFault
11:    fffff80002ad30c0 nt!KiAlignmentFault
12:    fffff80002ad31c0 nt!KiMcheckAbort    Stack = 0xFFFFF880009BB0C0
13:    fffff80002ad3540 nt!KiXmmException
1f:    fffff80002ac77d0 nt!KiApcInterrupt
2c:    fffff80002ad3700 nt!KiRaiseAssertion
2d:    fffff80002ad3800 nt!KiDebugServiceTrap
2f:    fffff80002b1f950 nt!KiDpcInterrupt
37:    fffffa80068aba90 hal!HalpApicSpuriousService (KINTERRUPT fffffa80068aba00)

3f:    fffffa80068abb30 hal!HalpApicSpuriousService (KINTERRUPT fffffa80068abaa0)

50:    fffffa80068abc70 hal!HalpCmciService (KINTERRUPT fffffa80068abbe0)

52:    fffffa8006853990 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853900)

61:    fffffa8006853090 serial!SerialCIsrSw (KINTERRUPT fffffa8006853000)

62:    fffffa80068538d0 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853840)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853780)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853480)

72:    fffffa8006853690 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853600)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853300)

82:    fffffa8006853210 USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853180)

92:    fffffa8006853e10 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853d80)

                     ataport!IdePortInterrupt (KINTERRUPT fffffa8006853cc0)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853540)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068530c0)

                     portcls!CInterruptSync::`scalar deleting destructor'+0xb8 (KINTERRUPT fffffa800a1b3f00)

a0:    fffffa8006853750 ndis!ndisMiniportMessageIsr (KINTERRUPT fffffa80068536c0)

a2:    fffffa8006853b10 HDAudBus!HdaController::Isr (KINTERRUPT fffffa8006853a80)

b0:    fffffa8006853c90 storport!RaidpAdapterMSIInterruptRoutine (KINTERRUPT fffffa8006853c00)

b1:    fffffa8006853f90 ACPI!ACPIInterruptServiceRoutine (KINTERRUPT fffffa8006853f00)

b2:    fffffa8006853bd0 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853b40)

                     ataport!IdePortInterrupt (KINTERRUPT fffffa8006853e40)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068539c0)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068533c0)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853240)

                     dxgkrnl!DpiFdoLineInterruptRoutine (KINTERRUPT fffffa800a1b3e40)

c1:    fffff80002a46450 hal!HalpBroadcastCallService (KINTERRUPT fffff80002a463c0)

d1:    fffff80002adfe10 nt!KiSecondaryClockInterrupt
d2:    fffff80002a46590 hal!HalpHpetRolloverInterrupt (KINTERRUPT fffff80002a46500)

df:    fffff80002a463b0 hal!HalpApicRebootService (KINTERRUPT fffff80002a46320)

e1:    fffff80002ade970 nt!KiIpiInterrupt
e2:    fffffa80068abd10 hal!HalpDeferredRecoveryService (KINTERRUPT fffffa80068abc80)

e3:    fffffa80068abbd0 hal!HalpLocalApicErrorService (KINTERRUPT fffffa80068abb40)

fd:    fffffa80068abdb0 hal!HalpProfileInterrupt (KINTERRUPT fffffa80068abd20)

fe:    fffffa80068abe50 hal!HalpPerfInterrupt (KINTERRUPT fffffa80068abdc0)

ff:    0000000000000000

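So vector 3 is simply nt!KiBreakpointTrap: int 3 is the breakpoint instruction and has nothing to do with hardware IRQs or with handling the bugcheck. The cc bytes between the jumps are most likely just padding placed between the jump stubs, which never executes. Oh well, the wrong conclusion may still have pointed me in the right direction, so we'll see how things go now that I've got IRQ 16 a little more cleared up. Unfortunately, I still don't know how to reconstruct the call stack in nvlddmkm and I still don't understand the kernel driver threading model, so I need to do some more reading.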
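
The same !idt listing can also be used to see what is chained on the same vector as the video interrupt routine. The sketch below is a rough parser for the text format shown above, with made-up function names. It assumes that indented lines without a vector prefix belong to the preceding vector, and note that the vector number is not the IRQ number.

```python
import re

# "b2:    fffffa8006853bd0 routine ..." (the routine part is optional, as on "ff:")
VECTOR_LINE = re.compile(r"^([0-9a-f]{2}):\s+([0-9a-f]+)(?:\s+(\S.*))?$")

# Excerpt of the !idt output above, trimmed to a few vectors.
SAMPLE = """\
03:    fffff80002ad1f80 nt!KiBreakpointTrap
b1:    fffffa8006853f90 ACPI!ACPIInterruptServiceRoutine (KINTERRUPT fffffa8006853f00)

b2:    fffffa8006853bd0 ataport!IdePortInterrupt (KINTERRUPT fffffa8006853b40)

                     ataport!IdePortInterrupt (KINTERRUPT fffffa8006853e40)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068539c0)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa80068533c0)

                     USBPORT!USBPORT_InterruptService (KINTERRUPT fffffa8006853240)

                     dxgkrnl!DpiFdoLineInterruptRoutine (KINTERRUPT fffffa800a1b3e40)

ff:    0000000000000000
"""


def routine_name(text):
    """Strip the trailing '(KINTERRUPT ...)' or 'Stack = ...' annotation."""
    return re.split(r"\s+\(KINTERRUPT|\s{2,}Stack", text)[0].strip()


def parse_idt(text):
    """Map each vector number to the list of routines registered on it."""
    chains = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Dumping IDT"):
            continue
        match = VECTOR_LINE.match(line)
        if match:
            current = int(match.group(1), 16)
            rest = match.group(3)
            chains[current] = [routine_name(rest)] if rest else []
        elif current is not None:
            # Continuation line: another routine chained on the same vector.
            chains[current].append(routine_name(line))
    return chains


def vector_sharing(chains, prefix):
    """Return (vector, routines) for the first vector with a routine starting with prefix."""
    for vector, routines in chains.items():
        if any(name.startswith(prefix) for name in routines):
            return vector, routines
    return None


if __name__ == "__main__":
    vector, routines = vector_sharing(parse_idt(SAMPLE), "dxgkrnl!")
    print(f"vector {vector:02x}:")
    for name in routines:
        print("  " + name)
```

In this dump the dxgkrnl line interrupt routine sits on vector b2 alongside two ataport and three USBPORT routines. The listing alone does not say which of those USBPORT entries is the VIA controller on the TV tuner card.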
 
