0x119 VIDEO_SCHEDULER_INTERNAL_ERROR & Fence IDs

Vir Gnarus · Mar 16, 2012

Hi all,

I'd like to take you all on a little adventure with me.

I'm working with a missionary I'm good friends with who has been experiencing some BSODs. At the moment he's only given me a few dumps. While a couple were inconclusive (despite all being DV-enabled), one sorta stuck out, which is attached and shown below:

Code:

VIDEO_SCHEDULER_INTERNAL_ERROR (119)
The video scheduler has detected that fatal violation has occurred. This resulted
in a condition that video scheduler can no longer progress. Any other values after
parameter 1 must be individually examined according to the subtype.
Arguments:
Arg1: 0000000000000001, The driver has reported an invalid fence ID.
Arg2: 0000000000008be2
Arg3: 0000000000008c5c
Arg4: 0000000000008c5b

Debugging Details:
------------------


CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VERIFIER_ENABLED_VISTA_MINIDUMP

BUGCHECK_STR:  0x119

PROCESS_NAME:  svchost.exe

CURRENT_IRQL:  a

LAST_CONTROL_TRANSFER:  from fffff880044d822f to fffff80002e93d00

STACK_TEXT:  
fffff880`0a714de8 fffff880`044d822f : 00000000`00000119 00000000`00000001 00000000`00008be2 00000000`00008c5c : nt!KeBugCheckEx
fffff880`0a714df0 fffff880`04137eb9 : 00000000`00000000 00000000`00008be2 00000000`00000000 00000000`00008c5c : watchdog!WdLogEvent5+0x11b
fffff880`0a714e40 fffff880`04138125 : fffffa80`09b4f000 fffff880`0a714f70 00000000`000011ac fffff8a0`12a87c10 : dxgmms1!VidSchiVerifyDriverReportedFenceId+0xad
fffff880`0a714e70 fffff880`04137f76 : 00000000`00008be2 fffff880`0a715001 fffffa80`09b43000 00000000`00000001 : dxgmms1!VidSchDdiNotifyInterruptWorker+0x19d
fffff880`0a714ec0 fffff880`0403f13f : fffffa80`087a5040 fffff800`02e968a4 fffff800`00000002 fffff800`00000000 : dxgmms1!VidSchDdiNotifyInterrupt+0x9e
fffff880`0a714ef0 fffff880`00c1ecca : 00000000`00000000 fffffa80`087a3040 00000000`00000000 fffff800`02e966ef : dxgkrnl!DxgNotifyInterruptCB+0x83
fffff880`0a714f20 00000000`00000000 : fffffa80`087a3040 00000000`00000000 fffff800`02e966ef fffff880`03164180 : atikmpag+0x4cca


STACK_COMMAND:  kb

FOLLOWUP_IP: 
dxgmms1!VidSchiVerifyDriverReportedFenceId+ad
fffff880`04137eb9 c744244053eeffff mov     dword ptr [rsp+40h],0FFFFEE53h

SYMBOL_STACK_INDEX:  2

SYMBOL_NAME:  dxgmms1!VidSchiVerifyDriverReportedFenceId+ad

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: dxgmms1

IMAGE_NAME:  dxgmms1.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  4ce799c1

FAILURE_BUCKET_ID:  X64_0x119_VRF_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad

BUCKET_ID:  X64_0x119_VRF_dxgmms1!VidSchiVerifyDriverReportedFenceId+ad

Followup: MachineOwner

Now just so you know, I initially had about as much understanding of this as you probably do while reading it. I had absolutely no clue what Fence IDs are. However, I did some looking up and found the following concerning them: Windows Vista and Later Display Driver Model Operation Flow.

So I went through it and got a bit of an idea of what a Fence ID is. It's apparently a "ticket" the GPU uses to get access to process a DMA buffer. For those unaware, DMA means Direct Memory Access, which lets a device - in this case the GPU - read and write system memory directly without having to hassle the CPU or OS. The relevant steps of the process are quoted below. Do you see anything familiar in relation to the call stack listed above in the crashdump?

14. The DirectX graphics kernel subsystem calls the display miniport driver's DxgkDdiSubmitCommand function to queue the DMA buffer to the GPU execution unit. Each DMA buffer submitted to the GPU contains a fence identifier, which is a number. After the GPU finishes processing the DMA buffer, the GPU generates an interrupt.

15. The display miniport driver is notified of the interrupt in its DxgkDdiInterruptRoutine function. The display miniport driver should read, from the GPU, the fence identifier of the DMA buffer that just completed.

16. The display miniport driver should call the DxgkCbNotifyInterrupt function to notify the DirectX graphics kernel subsystem that the DMA buffer completed. The display miniport driver should also call the DxgkCbQueueDpc function to queue a deferred procedure call (DPC).

So where in this process did the crash occur? As you can tell from the stack, it's during the "NotifyInterrupt" call at the very end, in step 16, which notifies DirectX that a DMA buffer completed. Part of this notification is a pointer to a data structure (DXGKARGCB_NOTIFY_INTERRUPT_DATA), and part of the data in that structure is the fence ID.

Apparently what we have here, is that after the GPU finished processing the DMA buffer, it notified the graphics driver that it finished doing what it wanted to do and gave it the id number for the DMA buffer (the Fence ID). The graphics driver gives this as part of the notification to DirectX that it got done, DirectX took a look at the Fence ID, and bugs out, thinking, "This fence ID doesn't look familiar at all. Something ain't right!" So it tells Windows to stop everything cuz it *appears* as if the gpu got illegal access to memory.
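
To make the idea concrete, here is a minimal sketch of the kind of check the scheduler might be performing. It is an assumption for illustration only: `ToyScheduler`, its fields, and the "must fall between last completed and last submitted" rule are hypothetical and are not taken from dxgmms1.sys, whose internals are not public.

```python
from dataclasses import dataclass


@dataclass
class ToyScheduler:
    """Toy stand-in for the scheduler's fence bookkeeping (hypothetical)."""
    last_submitted: int = 0   # newest fence ID attached to a queued DMA buffer
    last_completed: int = 0   # newest fence ID the driver reported as finished

    def submit(self) -> int:
        """Queue a DMA buffer and return the fence ID attached to it."""
        self.last_submitted += 1
        return self.last_submitted

    def notify_interrupt(self, reported: int) -> None:
        """Accept a completion report, or raise if the fence ID makes no sense."""
        # Assumed rule: a reported fence must be newer than the last one
        # completed and no newer than the last one submitted.
        if not (self.last_completed < reported <= self.last_submitted):
            raise RuntimeError(
                f"invalid fence ID {reported:#x}: expected a value in "
                f"({self.last_completed:#x}, {self.last_submitted:#x}]"
            )
        self.last_completed = reported


sched = ToyScheduler()
first = sched.submit()
second = sched.submit()
sched.notify_interrupt(first)       # completes in order
sched.notify_interrupt(second)      # completes in order
try:
    sched.notify_interrupt(first)   # stale ID that was already completed
except RuntimeError as err:
    print("bugcheck-like failure:", err)
```

In this toy model a stale or never-issued ID is rejected the moment it is reported, which is roughly where VidSchiVerifyDriverReportedFenceId sits in the call stack above.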

Part of me thinks this isn't so much a graphics driver issue as it is a graphics hardware issue. That's my initial diagnosis, and right now I'm still working with him to gather more info on this to verify what's what. As for my end, right now I'd like to know a few things in case anyone can help me:

1) Whether anyone else has had similar BSODs that they've resolved, and what the culprits were. Was it typically hardware, and what hardware was it? Was it the drivers?
2) I'd like to know what the fence ID was. However, I'm unfamiliar with the dt command in Windbg and I'm not sure where to point it or how. For those wondering, this command takes a data structure and displays its contents and layout. Since the fence ID is part of the notification to DirectX about the DMA buffer completion, I should be able to see it inside the notification data structure.
3) I'd like to know what the fence ID was prior to the DMA buffer completion. If I knew this as well as what it was after the completion (when it bugged out), I could discern whether the DMA buffer access itself was bad or whether the fence ID returned from the GPU got corrupted somehow. Not sure how, or whether it's even possible, to get this info, though.

This obviously isn't the end of my journey on this. I'll keep posting as I make progress on finding an answer and get more info from the guy about the situation.

UPDATE:

Comments from previous discussion on this:

cluberti said:
It looks like the crash is in the DirectX routine that reports out-of-order fence returns. There are quite a number of bugs logged on this for Windows 7, and they run the gamut of ATI, Nvidia, and Intel video drivers as root causes. What is actually happening under the covers is that these fence IDs are being returned out of order, hence the bugcheck (that's why DX says "that's not right": there's a proper order to return these in). Again, in every case I can find, it was a driver (not hardware) issue, and the external vendor would be tasked with resolving the issues with their driver on customer hardware.

Unfortunately, the problem happens in the external driver before it hits directx, so I can't tell you why it's happening, but the likelihood it's a hardware issue is probably almost nil if it isn't also bugchecking with a 116. It could be power-related, though, so if the machine is older checking the PSU isn't a bad idea.

Sorry I can't provide the debug, but the directx drivers aren't public on purpose, and I don't feel comfortable putting any of that out here even amongst this small group given the protection around this source.

VirGnarus said:
I can see how this could be problematic; if I recall correctly, DMA jobs completing out of order can potentially cause memory corruption. I was figuring it was maybe just a bad fence ID altogether.

Though I wonder, what are the parameters the watchdog sent to KeBugCheckEx? Obviously the first one is the subtype of error, but are the other three the Fence IDs (like expected/received)?

cluberti said:
Yes, they are the received fences. I would still recommend testing a different card to be safe, but the driver is still the likely culprit.
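
As a quick sanity check on the "out of order" reading, the three values from the dump at the top of the thread can be compared directly. This is only arithmetic on the reported numbers; which argument is the "expected" fence and which is the "received" one is not established here, so the dictionary keys below are neutral on purpose.

```python
# Arg2..Arg4 from the first dump in this thread (subtype 1: invalid fence ID).
arg2, arg3, arg4 = 0x8BE2, 0x8C5C, 0x8C5B


def fence_gaps(a2: int, a3: int, a4: int) -> dict:
    """Return the signed distances between the three reported values."""
    return {
        "arg3_minus_arg4": a3 - a4,
        "arg4_minus_arg2": a4 - a2,
    }


gaps = fence_gaps(arg2, arg3, arg4)
for name, value in gaps.items():
    print(f"{name}: {value} ({value:#x})")
```

Arg3 and Arg4 are consecutive, while Arg2 trails them by 121 (0x79). All three sit in the same numeric neighbourhood, which fits a stale or out-of-order fence better than a wildly corrupted value.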

Patrick · Jul 30, 2012

Currently in the middle of an analysis with a 119: http://www.techsupportforum.com/forums/f299/bsod-on-startup-657906.html#post3824527

Very fun to analyze, and as always, I learned an unreasonable amount from this thread alone :hug:

No response yet, hopefully the OP does respond... I'd really like to see how it plays out.

Patrick · Jul 31, 2012

And bump, going through another..

Rather than it being - DxgNotifyInterrupt

it's

DxgNotifyDpc

in the stack..

what does that mean?

writhziden · Jul 31, 2012

Hmm, I find no reference to DxgNotifyDpc through a Google search. I probably need to buy a book on debugging for that one.

I am posting to let you both know I am also interested in this one.

Patrick · Jul 31, 2012

Yeah, no luck for me either..

Thanks Mike, let me know if you find anything yourself.

Vir Gnarus · Aug 1, 2012

Interrupt handlers (ISRs, or Interrupt Service Routines) for device I/O need to finish very quickly or risk holding up the entire system (because they run at high IRQL), so the ISR is usually designed to merely queue a DPC, or Deferred Procedure Call, to defer (hence the name) the responsibility of handling the I/O till later. The DPC itself, once it reaches the front of the DPC queue, then does the actual servicing of the device's I/O. The interrupt is only there to notify the system to prepare for I/O, while it is the DPC that does all the work. Windows Internals 5th Edition explains all this in the I/O System chapter. If you have the 6th edition, you'll have to wait until Part 2 of it comes out.

So what's going on is that the interrupt has already done its work, and it is now the DPC (the actual I/O servicing) that's running, with DirectX involved (obviously some form of video/audio I/O). You can check the DPC queue for each processor using !dpcs in Windbg. This information, like most, is usually not available in a minidump, but if you give it the number of the processor that was running at the time of the crash (you can tell which proc you're in from the Windbg prompt) you may get lucky, though I doubt it.

Example:

Code:

2: kd> !dpcs
CPU Type      KDPC       Function
 5: Normal  : 0xfffffa8025ae6c28 0xfffffa600106e8f0 tcpip!TcpPeriodicTimeoutHandler

2: kd> dt !_KDPC                                            < Template for KDPC data structure
nt!_KDPC
   +0x000 Type             : UChar
   +0x001 Importance       : UChar
   +0x002 Number           : Uint2B
   +0x008 DpcListEntry     : _LIST_ENTRY
   +0x018 DeferredRoutine  : Ptr64     void 
   +0x020 DeferredContext  : Ptr64 Void
   +0x028 SystemArgument1  : Ptr64 Void
   +0x030 SystemArgument2  : Ptr64 Void
   +0x038 DpcData          : Ptr64 Void

2: kd> dt !_KDPC fffffa8025ae6c28
nt!_KDPC
   +0x000 Type             : 0x13 ''
   +0x001 Importance       : 0x1 ''
   +0x002 Number           : 0x45
   +0x008 DpcListEntry     : _LIST_ENTRY [ 0xfffffa60`01ab8580 - 0xfffffa60`01ab8580 ]
   +0x018 DeferredRoutine  : 0xfffffa60`0106e8f0     void  tcpip!TcpPeriodicTimeoutHandler+0
   +0x020 DeferredContext  : 0x00000000`00000005 Void
   +0x028 SystemArgument1  : 0x00000000`cba8328a Void
   +0x030 SystemArgument2  : 0x00000000`01ccae02 Void
   +0x038 DpcData          : 0xfffffa60`01ab8580 Void

Understand that the KDPC data structure is opaque: it is an internal structure with only limited public documentation, so you kinda have to walk it out, fiddle with it, and figure it out on your own. Also, it's not something a driver is allowed to manipulate directly; only the Windows kernel can. So if you discover that a driver has tampered with it or is even attempting to write to it, you know the driver is misbehaving (a driver can hold a pointer to it, though, just not edit it). That's not to say it's the case you're dealing with, however.
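
The ISR/DPC split described at the start of this post can be sketched as a toy model. Everything here is illustrative: `isr`, `handle_completion`, and `drain_dpcs` are made-up names, and a Python deque stands in for a per-processor DPC queue. It only shows the ordering (quick interrupt first, deferred work later), not how Windows actually implements it.

```python
from collections import deque

dpc_queue = deque()   # stands in for one processor's DPC queue
log = []              # records the order in which things happen


def isr(device: str, fence_id: int) -> None:
    """Interrupt time: do the bare minimum, then defer the real work."""
    log.append(f"ISR: {device} signalled")
    dpc_queue.append((handle_completion, device, fence_id))


def handle_completion(device: str, fence_id: int) -> None:
    """Deferred routine: the actual servicing of the device's I/O."""
    log.append(f"DPC: {device} finished fence {fence_id}")


def drain_dpcs() -> None:
    """Later, once interrupts are out of the way: run queued DPCs in FIFO order."""
    while dpc_queue:
        routine, *args = dpc_queue.popleft()
        routine(*args)


isr("gpu", 1)
isr("gpu", 2)
drain_dpcs()
print("\n".join(log))
```

Both interrupts are logged before either deferred routine runs, which is why a stack can show DxgNotifyDpc rather than DxgNotifyInterrupt: the crash happened in the deferred half of the same completion path.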

Patrick · Aug 1, 2012

Brilliant explanation, thank you.

Gah, I don't think the OP will respond though, mentioned he does not have time for hardware diagnostics / troubleshooting so it's likely I won't be able to take in much knowledge from this specific analysis. Also, I tried running a !dpcs command and got the following -

3: kd> !dpcs
CPU Type KDPC Function
Failed to read DPC at 0xfffffa800615b0c8
Failed to read DPC at 0xfffff88002fd5318

I'm assuming that is because as you said, it's a minidump, and you cannot access this info with a minidump?

You also mentioned you can find the current processor at the time of the crash for that specific dump from WinDbg, where do you find that information?

Vir Gnarus · Aug 1, 2012

When you load up a crashdump, the processor and thread context that's initially loaded is the one that was most recent during time of the crashdump, as in the one that was active at that time. You can tell the thread by doing !thread, but the processor is much easier, by just looking at the Windbg prompt:

Code:

Processor
\/
2: kd> !thread
THREAD fffffa80273ed040  Cid 0004.0a64  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 2
Not impersonating
DeviceMap                 fffff88000006150
Owning Process            fffffa8024c1a040       Image:         System
Attached Process          N/A            Image:         N/A
Wait Start TickCount      5438           Ticks: 405495019 (73:05:09:22.838)
Context Switch Count      2              IdealProcessor: 2             
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address rpcxdr!RxWorkThread (0xfffffa6008394f88)
Stack Init fffffa60089b4db0 Current fffffa60089b4a10
Base fffffa60089b5000 Limit fffffa60089af000 Call 0
Priority 12 BasePriority 12 PriorityDecrement 0 IoPriority 2 PagePriority 5
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffffa60`089b4a58 fffff800`01e63661 : 00000000`00000050 fffffa60`08c4d718 00000000`00000008 fffffa60`089b4b50 : nt!KeBugCheckEx
fffffa60`089b4a60 fffff800`01e53219 : 00000000`00000008 fffffa60`019d8180 fffffa80`273ed000 fffffa80`2b379010 : nt!MmAccessFault+0x1371
fffffa60`089b4b50 fffffa60`08c4d718 : fffffa60`0839537b fffffa80`273ed040 00000000`00000080 fffffa60`0839e350 : nt!KiPageFault+0x119 (TrapFrame @ fffffa60`089b4b50)
fffffa60`089b4ce8 fffffa60`0839537b : fffffa80`273ed040 00000000`00000080 fffffa60`0839e350 fffffa80`2b379010 : <Unloaded_TmPreFlt.sys>+0x4718
fffffa60`089b4cf0 fffff800`020788b3 : 00000000`00000001 00000000`0000000f 00000000`00000000 fffffa80`2b379010 : rpcxdr!RxWorkThread+0x3f3
fffffa60`089b4d50 fffff800`01e8e7f6 : fffffa60`01966180 fffffa80`273ed040 fffffa60`0196fd40 00000000`00000001 : nt!PspSystemThreadStartup+0x57
fffffa60`089b4d80 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16


Vir Gnarus · Aug 1, 2012

Oh, of course, if you change the processor context using ~, then the prompt will adjust accordingly, but this is the one that showed up for me when I opened this particular kernel dump. Instead of defaulting to processor 0, it automatically was set to proc 2, which was running at time of crash.
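
If you are scraping saved debugger logs, the processor number can be pulled out of the prompt with a small regular expression. This is a sketch that assumes the prompt looks like the ones quoted in this thread (`2: kd>`); `prompt_processor` is a hypothetical helper, not part of any debugger tooling.

```python
import re
from typing import Optional

# Matches prompts such as "2: kd> !thread" and captures the leading number.
PROMPT = re.compile(r"^\s*(\d+):\s*kd>")


def prompt_processor(line: str) -> Optional[int]:
    """Return the processor number shown in a kernel-debugger prompt line."""
    match = PROMPT.match(line)
    return int(match.group(1)) if match else None


print(prompt_processor("2: kd> !thread"))
print(prompt_processor("3: kd> !dpcs"))
print(prompt_processor("nt!_KDPC"))
```

Lines that are not prompts return None, so the helper can be run over a whole log without filtering first.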

Patrick · Aug 1, 2012

Got it, thanks :)

Wrench97 · Oct 28, 2012

Here's another one > https://www.sysnative.com/forums/showthread.php/4216-Bluescreen-119-in-games?p=30737#post30737

Code:

Debug session time: Sun Oct 14 09:21:02.426 2012 (UTC - 4:00)
Loading Dump File [C:\Users\Owner\Bsodapps\101412-13431-01.dmp]
BugCheck 119, {1, 1000060, e3963, e3961}
Probably caused by : dxgmms1.sys ( dxgmms1!VidSchiVerifyDriverReportedFenceId+ad )
Bugcheck code 00000119
Arguments: 

Arg1: 0000000000000001, The driver has reported an invalid fence ID.

Arg2: 0000000001000060

Arg3: 00000000000e3963

Arg4: 00000000000e3961


DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
BUGCHECK_STR:  0x119
PROCESS_NAME:  Wow-64.exe
MaxSpeed:     3100
CurrentSpeed: 3100
BiosVersion = 0506
BiosReleaseDate = 05/07/2012

Vir Gnarus · Oct 29, 2012

Yah, I saw that, good catch. Look at Arg 2-4, which are actually the fence ids. Either Arg3 is the fence id it expected, or the fence id directly prior to the one we're dealing with. Either way, it's evident we're looking at one messed up fence id in Arg 2. Clearly it's an overwritten value, perhaps from stack overflow or some other driver nonsense.
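
One way to test the "overwritten value" idea is to count how many bits differ between the arguments. A value damaged by a single flipped bit would differ from its neighbours in only one or two positions; a value that came from somewhere else entirely would differ in many. This is a rough heuristic sketch, not a diagnosis, and `bits_changed` is a made-up helper.

```python
# Arg2..Arg4 from the Oct 14, 2012 dump quoted above.
arg2, arg3, arg4 = 0x1000060, 0xE3963, 0xE3961


def bits_changed(a: int, b: int) -> int:
    """Count the bit positions in which two values differ."""
    return bin(a ^ b).count("1")


print("Arg3 vs Arg4:", bits_changed(arg3, arg4), "bit(s), distance", arg3 - arg4)
print("Arg2 vs Arg3:", bits_changed(arg2, arg3), "bit(s), distance", arg2 - arg3)
```

Arg3 and Arg4 are two apart and differ in a single bit, while Arg2 differs from Arg3 in ten bits and is more than fifteen million away. That is consistent with Arg2 being a foreign value rather than a neighbouring fence with one bit flipped, though it does not say where the value came from.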

x BlueRobot · Jan 3, 2014

Sorry to resurrect an old thread, but here's some additional information - TDR changes in Windows 8 (Windows Drivers) and Supplying Fence Identifiers (Windows Drivers)

x BlueRobot · Jan 3, 2014

A fence is an instruction that contains 64 bits of data and an address. The display miniport driver can insert a fence in the direct memory access (DMA) stream that is sent to the graphics processing unit (GPU). When the GPU reads the fence, the GPU writes the fence data at the specified fence address. However, before the GPU can write the fence data to memory, it must ensure that all of the pixels from the primitives that precede the fence instruction are retired and properly written to memory.
Note The GPU is not required to stall the entire pipeline while it waits for the last pixel from the primitives that precede the fence instruction to retire; the GPU can instead run the primitives that follow the fence instruction.
Hardware that supports per-GPU-context virtual address space must support the following types of fences:

Regular fences are fences that can be inserted in a DMA buffer that is created in user mode. Because the content of a DMA buffer from user mode is not trusted, fences within such a DMA buffer must refer to a virtual address in the GPU context address space and not to a physical address. Access to such a virtual address is bound by the same memory validation mechanism as any other virtual address that the GPU accesses.

Privileged fences are fences that can be inserted only in a DMA buffer that is created (and only accessible) in kernel mode. Fences within such a DMA buffer refer to a physical address in memory.
Note that if the fence target address was accessible in user mode, malicious software could perform a graphics operation over the memory location for the fence and therefore override the content of what the kernel expected to receive.

Source: DxgkDdiQueryCurrentFence routine (Windows Drivers)
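
The fence behaviour quoted above can be pictured with a toy DMA stream. This sketch assumes a strictly serial GPU (the real hardware may overlap work, as the note says) and uses invented command names (`"draw"`, `"fence"`) and addresses purely for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # a fence carries 64 bits of data


def run_dma_stream(stream):
    """Execute a toy DMA stream and return (memory, fence_log).

    Each fence writes its data to its address only after every
    primitive that precedes it in the stream has been retired.
    """
    memory = {}      # toy address space: address -> value
    retired = []     # primitives retired so far
    fence_log = []   # (fence data, primitives retired when it was written)
    for op, *args in stream:
        if op == "draw":
            retired.append(args[0])
        elif op == "fence":
            address, data = args
            memory[address] = data & MASK64
            fence_log.append((data, tuple(retired)))
        else:
            raise ValueError(f"unknown command: {op}")
    return memory, fence_log


stream = [
    ("draw", "triangle A"),
    ("draw", "triangle B"),
    ("fence", 0x1000, 7),     # hypothetical fence address and data
    ("draw", "triangle C"),
    ("fence", 0x1000, 8),
]
memory, fence_log = run_dma_stream(stream)
print(memory)
print(fence_log)
```

Reading the fence address afterwards tells the driver how far the GPU got: a value of 8 at the address means everything before the second fence has been written to memory.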

I'll keep doing some research on Fence IDs, and formulate it into a blog post if anyone is still interested or still learning about Fence IDs.

x BlueRobot · Jan 4, 2014

Update:

If parameter 1 of a Stop 0x119 is 2 (the driver failed upon the submission of a command buffer), then the other parameters are as follows:

2) The NTSTATUS error code returned from the failed driver call
3) A pointer to the DXGKARG_SUBMITCOMMAND structure
4) A pointer to an internal scheduler data structure

Reference - DxgkDdiSubmitCommand routine (Windows Drivers)
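
Putting the two subtypes discussed in this thread side by side, a small lookup table can label the arguments of a 0x119 bugcheck. This is a sketch limited to what the thread states: only subtypes 1 and 2 are covered, the labels for subtype 1 are left generic because their meaning is not established here, and `decode_0x119` is a hypothetical helper.

```python
# Subtype text and labels come from this thread; other subtypes are not decoded.
SUBTYPES = {
    0x1: {
        "meaning": "The driver has reported an invalid fence ID.",
        "labels": ("Arg2", "Arg3", "Arg4"),  # meanings not established in the thread
    },
    0x2: {
        "meaning": "The driver failed upon the submission of a command buffer.",
        "labels": (
            "NTSTATUS returned from the failed driver call",
            "pointer to the DXGKARG_SUBMITCOMMAND structure",
            "pointer to an internal scheduler data structure",
        ),
    },
}


def decode_0x119(arg1: int, arg2: int, arg3: int, arg4: int) -> list:
    """Return human-readable lines describing a 0x119 bugcheck's arguments."""
    entry = SUBTYPES.get(arg1)
    if entry is None:
        return [f"Arg1 = {arg1:#x}: subtype not covered by this sketch"]
    lines = [f"Arg1 = {arg1:#x}: {entry['meaning']}"]
    for label, value in zip(entry["labels"], (arg2, arg3, arg4)):
        lines.append(f"{label} = {value:#x}")
    return lines


# Arguments from the Oct 14, 2012 dump earlier in the thread.
for line in decode_0x119(0x1, 0x1000060, 0xE3963, 0xE3961):
    print(line)
```

For a subtype 2 crash you would pass the four values from `BugCheck 119, {...}` the same way and get the NTSTATUS and pointer labels instead.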
