Class 101 for 0x101 Bugchecks

Vir Gnarus · Mar 16, 2012

NOTICE: 0x101 BUGCHECKS CANNOT BE ANALYZED USING MINIDUMPS!!! YOU WILL NEED AT LEAST A KERNEL DUMP!

Hi guys, I'd figure I'd run you through my personal experience delving into a 0x101 bugcheck, also known as a CLOCK_WATCHDOG_TIMEOUT. This will cover:

- Interrupts & IRQLs
- Simple disassembly
- Looking at PRCB and PCR
- Observing the context of multiple processors

Warning: At the time of writing this, I'm only 95% sure on my observation, so take this with a grain of salt. This is one of my more thorough investigations that I've come across lately on TSF, and I'm still rusty with dealing with interrupts, so I wouldn't doubt that some of my approach or understanding may be erroneous. Keep that in mind when reading this, and any corrections are welcome. I don't want to end up being the blind leading the blind. Always refer to the book Windows Internals if you want solid understanding of this stuff. Otherwise, what I see here is just my exploration into it and serves for navigating a crashdump and collecting information more effectively (it's hard doing all this without someone that knows better pointing your faults :) ).

The example I'd like to use is the case mentioned in this thread. Note that all my work involved operating with the client's kernel dump he uploaded to a 3rd-party filesharing site. Since I doubt everyone is eager or has the resources or time to scrutinize over a whole kernel dump, I will be more descriptive in my presentation to compensate.

So why did I use a kernel dump? Because minidumps do not retain the data required to analyze 0x101 bugchecks. I repeat: you cannot analyze 0x101 bugchecks from minidumps! Minidumps only save the processor context for the processor core that was awake and able to successfully run KeBugCheckEx and dump the crashdump. Why is this a problem? You'll understand once you start reading.

As a primer that I used to get myself in the right direction, this article was used. It was extremely beneficial in giving me my bearings and what data I should be most concerned about when analyzing this type of bugcheck.

_____

We start off by viewing exactly what an 0x101 bugcheck is. Let's run the initial !analyze -v on it.

Code:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000060, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff880009ea180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.

...

I'm sure the explanation may not be understood by some here, so I'll go through an explanation of the explanation. :)

- An expected clock interrupt was not received on a secondary processor -

To understand this, one must understand what interrupts are. As the name implies, an interrupt is an operation that interrupts what a processor was doing at the time. This is often crucial because sometimes a processor may be working on something relatively unimportant and something more time-sensitive and of higher priority needs to get fulfilled real quick, such as hardware I/O.

Say a processor core is running an operation that involves drawing a window on your screen for a user application like Skype. Then, all of a sudden, the core receives a request (interrupt) to do some disk I/O to move stuff from paging file to memory. The core obviously isn't going to neglect the request and continue drawing the window. It sees the higher priority and puts everything on standby while it gets done with the request (interrupt). Once that gets done, it will move back to the lesser priority stuff it was doing.

Note:

As you can understand, having everything stop to take care of an interrupt sounds like it'd hang the PC. It can, and is often referred too as interrupt storms. The reason why it doesn't do it all the time is because an interrupt shouldn't be designed to satisfy the request Windows received from the hardware. If it did this all the time then nothing gets done quickly. Instead, an interrupt will construct and issue a DPC (Deferred Procedure Call) that gets queued up and serviced later in an orderly fashion. That way, the interrupt finishes VERY quickly (it has too), and the actual workload with dealing with the hardware's interrupt for service (the actual I/O work) is done in a fashion that won't halt everything else up in the system.

In a fast-paced, multi-threaded and highly complex environment like Windows on newer hardware, it's easy to see this process of handling interrupts is more than just simply looking at a table of priorities and seeing which ones needs checking off first (which kinda does exist - called IRQLs - which we'll get too shortly). There's algorithms and whatnot involved in an attempt to provide the snappiest and smoothest means of computer operation for the user without having everything become imbalanced and chaotic swirling into a crash. But aside from all this complex methodology, the general idea is that some stuff has higher priority than others and needs faster and/or longer attention than the rest, and all of this is done to make sure that all gets taken care of decently and in order.

Now that I've given a general idea what interrupts are and what they're for, a clock interrupt is a type of interrupt in which involves counting the actual cycles of the processor core, which is essentially running a clock on the processors to keep everything in sync. If you read in the NTDebugging blog I used as a primer, you'll read that the clock interrupt gets handed out to all the cores and all of them have to report in. When one doesn't report at all in a designated time frame, this crash happens. Now it also mentions that the clock interrupt in x86 systems has an IRQL of 28. So what's an IRQL?

A good explanation for an IRQL is by this blog entry. It is essentially called the Interrupt Request Level, and is merely the priority of an interrupt given by Windows. The higher the number, the higher the priority. All typical activity running on a processor core is at level 0, which is called Passive or LOW_LEVEL (note that this isn't even an interrupt. An interrupt is anything above IRQL 0). When an interrupt kicks in, the IRQL reaches the level specified by the interrupt, the interrupt is serviced and then when it's done it drops back down to 0. If there were any other interrupts that popped up but at lower or same IRQLs, it will service those afterwards in an orderly fashion, dropping down the list till it goes back to IRQL 0. So while every interrupt has its own IRQL, a processor core also has it's own IRQL "state" if you will. It will service anything with the same IRQL as the state that it's in, and then will lower itself to service lower priority IRQLs until it's all back to 0 again. There's a bit more to it than that, but that's the gist of it.

- in an MP system -

A multi-processor system.

- within the allocated interval -

As mentioned before, it didn't respond in a designated time frame.

- This indicates that the specified processor is hung and not processing interrupts -

This can be a bit misleading. While yes, it does mean the processor is hung, it doesn't actually mean that it won't service any interrupts. In many cases it could be stuck servicing a higher IRQL interrupt but simply won't deal with anything lower than it.

Now that we got that description taken care of, now on to business. Ultimately, there's steps we should carry out to figure this out. In essence it should be, in order:

- what happened,
- how it happened, and lastly,
- why it happened.

WHAT
_____

First, what happened? There was a CLOCK_WATCHDOG_TIMEOUT. Yes, we know this. How long did it timeout? 60 clock ticks the bugcheck says. Now, exactly what processor buggered out? Well, there's a couple places we can look, all of which is in the !analyze -v output:

Code:

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000060, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR], The PRCB address of the hung processor.
Arg4: 000000000000000[COLOR=#ff0000][B]1[/B][/COLOR], 0.

Debugging Details:
------------------


BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  d

STACK_TEXT:  
fffff800`0402f938 fffff800`02a39463 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000 fffff800`00000000 00000000`00000000 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000 fffff800`0402fb80 fffff800`00000000 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000 fffff800`03026460 fffffa80`041baf00 00000000`00000001 : hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 fffff880`03f991f2 : fffff800`02a990ca 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 : nt!KiInterruptDispatchNoLock+0x163
fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000 fffff800`0402a000 fffff800`0402fd40 00000000`00000000 : nt!KiIdleLoop+0x2c


STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE

Followup: MachineOwner
---------

I highlighted in bold the places you can find the answer. Here's a rundown of each:

Code:

Arg3: [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR], The PRCB address of the hung processor.

The PRCB - or PRocessor Control Block - contains data regarding the state of the processor (note that we've been talking about logical processors this whole time, not physical). It is an extension of the PCR - Processor Control Region - though it does contain the bulk of the data concerning that particular processor. You can do a dt command for nt!_KPRCB or nt!_KPCB to get details on each structure. You can also use the Windbg extensions !prcb and !pcr to get such details on the PRCB and PCR, respectively. Just type in the number of the processor you want and it'll give you the details. Let's try !prcb, since that's what the bugcheck is referring too. Do it for each existing processor (in this case, 2 exist):

Code:

0: kd> !prcb 0
PRCB for Processor 0 at fffff80002c05e80:
Current IRQL -- 13
Threads--  Current fffff80002c13c40 Next 0000000000000000 Idle fffff80002c13c40
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 0131a6aa
Times -- Dpc    00000368 Interrupt 0000036d 
         Kernel 000b11d9 User      0003e82a 

0: kd> !prcb 1
PRCB for Processor 1 at [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR]:
Current IRQL -- 0
Threads--  Current fffff880009f4f40 Next fffffa8005a95b60 Idle fffff880009f4f40
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 0176aaf8
Times -- Dpc    00002cc2 Interrupt 00001e01 
         Kernel 000d4b6e User      0001ac92

See the address for the PRCB for Processor 1? That matches the address mentioned in this third argument for the bugcheck code, so that's the irresponsible processor right there.

Note:

- Processor numbers start at 0, so the first processor will be called "Processor 0". There are a couple exceptions, as we'll see later, but that's usually the case.
- The information provided by these commands are basic. If you want all the nitty gritty on the processor, you'll want to dump the contents of the PRCB structure using the dt command. Take the address of the PRCB mentioned to use as the address for the nt!_KPRCB symbols to feed on. In this case, the syntax for the dt command will be:

dt nt!_KPRCB fffff880009ea180

Code:

Arg4: 000000000000000[COLOR=#ff0000][B]1[/B][/COLOR], 0.

This is a bit of a hidden surprise. It's an undocumented argument for this bugcheck, but it apparently reveals the processor number right there. Again, processor numbers start at 0, so this would be the second processor.

Code:

BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC

FAILURE_BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE

Bucket IDs are just what Microsoft uses to get an idea on the cause of crash when reports are submitted. In many cases they have been helpful in discovering clues on cause, but in this case it's just the usual fare. As you can see, there's a discrepancy between the number of the processor provided here and the one in the rest of the data. This is one of those few and far inbetween exceptions to the rule. It's just to clarify for individuals looking at it as to exactly what processor it is, without having to remember the numbering starts at 0. So 2 would just mean "second processor".

HOW
_____

Ok, so far we've explained what just happened: Processor 1 (second processor) reached 60 clock ticks without responding to the clock interrupt and so the system BSODs. Now, how did this take place? To determine this, we need to look at what was going on during this time before Windows intervened and crashed the system.

Our first step in this venture involves checking the IRQL of each processor before the system decided to crash. During the crash process, the IRQL is altered so nothing else gets in its way, so it is pointless to see the latest existing IRQL. Fortunately, Windows developers are aware of this and made it so that the IRQL of the processor before the crash process kicked in was saved. You can grab this saved IRQL using the !irql command, followed by the number of the processor.

Code:

0: kd> !irql 0
Debugger saved IRQL for Processor 0x0 -- [COLOR=#006400][B]13[/B][/COLOR]
0: kd> !irql 1
Debugger saved IRQL for Processor 0x1 -- [COLOR=#ff0000][B]0[/B][/COLOR] (LOW_LEVEL)

As you can see, the IRQL of the first processor is 13 (which is CLOCK for x64 processors) and the second is at 0. So we can see that only Processor 0 was at CLOCK level. In the NTDebugging blog - and previously mentioned - the clock interrupt for x86 processors is 28, but since this is an x64 environment, things change around a bit (fewer IRQL levels) and it is 13 for x64 environments.

Note: For information about interrupts, consult the Windows Internals book, pages 85-114 (5th edition)

Note: compare the IRQL mentioned in !irql with what you see from !pcr and !prcb. Notice something? The IRQL for Processor 1 is listed as 0 in the PCR, but 13 in the PRCB and in the !irql output. That's because the PCR saves the current IRQL, but the PRCB stores the one previously before control passed to the bugcheck code. Details are mentioned in Windows Internals page 95 (5th edition). You may also consult the help documentation on Windbg for !irql.

Now that we have the IRQL, let's garner some more information. We'll look at the callstack present in each processor to see what was the latest activity taking place on them. You can use a variation of the k command to do this. I use kv as it's most verbose. Let's start with Processor 0:

Code:

0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff800`0402f938 fffff800`02a39463 : 00000000`00000101 00000000`00000060 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000 fffff800`00000001 00000000`00002626 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000 fffff800`0402fb80 fffff800`03026460 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000 fffff800`03026460 fffffa80`041baf00 00000000`00000001 : hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 fffff880`03f991f2 : fffff800`02a990ca 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff800`0402fb00)
fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000 fffff800`0402a000 fffff800`0402fd40 00000000`00000000 : nt!KiIdleLoop+0x2c

Now let's see latest thread running on Processor 1. In order to do so, we should change the processor context of Windbg. Currently Windbg is displaying all the data on what's taking place on Processor 0. You can tell because of the number visible in the Windbg prompt:

Code:

[COLOR=#006400][B]0[/B][/COLOR]: kd>

The number displays which processor is current. You can switch to the other processor using the tilde (~) key, followed by the number of the processor. So let's do that and dump the callstack afterwards:

Code:

0: kd> ~1
1: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff880`02f1bc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04ec1ab8 fffff880`009f4f40 00000000`00000001 : amdk8!C1Halt+0x2
fffff880`02f1bca0 fffff800`02a93d5c : fffff880`009ea180 fffff880`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff880`02f1bd80 00000000`00000000 : fffff880`02f1c000 fffff880`02f16000 fffff880`02f1bd40 00000000`00000000 : nt!KiIdleLoop+0x2c

Ok, so we got the callstack for each existing thread at the time of the crash for both our processors. Now we have to interpret what's going on with each. Here's some stuff we can conjure up from these two callstacks:

Both started with the IdleLoop routine, which is essentially the start of the System Idle Process you see in Task Manager. So they both were sitting and waiting to do something.
Processor 0 then received an interrupt. This interrupt happened to be a clock interrupt.
The clock interrupt then involved updating the system time. This is something that is replicated across all processors so that all the processors update their own timers and things are kept track of.
Processor 0 was the one that performed the bugcheck.

So from what we can tell by this, Processor 0 took care of a systematic clock interrupt that keeps track of the timing of things, while Processor 1 apparently didn't even both with it and just stayed dormant. This is further expressed by the IRQL for Processor 0 being 13 (clock interrupt) and the IRQL for Processor 1 being 0. This is getting suspicious. But for now, let's move on.

WHY
_____

So far we've gained enough information to where we can see that a clock interrupt (as expected) popped up and tried to do its routine thing of keeping the system timer up-to-date. Processor 0 had no problem handling it, and did so just fine. However, we can see that when it came time for Processor 1 to do the same, nothing happened. The interrupt did not pop up on the callstack, nor did the IRQL change. Things continued to lie dormant, and Processor 1 was sound asleep. Why did this happen?

The initial and immediate suspect right here is the 3rd-party driver and its routine present in both callstacks:

Code:

amdk8!C1Halt+0x2

This is present as part of the whole Idle Loop (System Idle Process) that goes on when there's a wait on doing stuff. Instead of the default routine that takes place here (I think it's hal!HalProcessorIdle), it has been replaced by the one provided by amdk8.sys. Since this has been a consistent issue with the victim, it seems rather evident that we're dealing with a possibly buggy driver. However, we will need to dig even further to verify this suspicion. We will actually need to disassemble the code here to see what the problem is. So let's go ahead and do that.

There are a number of ways you can disassemble code in Windbg. The most common (and relatively easiest) is to use the Disassembly window provided by Windbg. Go to View followed by Disassembly to bring it up. Then in the prompt labeled Offset, type in the address you desire to start at. We can easily discover this by looking at the callstack for the thread on Processor 0 (not 1). What we want is the return address in the frame (each row in the callstack is called a frame of the callstack) above the amdk8!C1Halt frame. What these return addresses show is that when a routine performs a return instruction after it's done its work, it returns to the frame below it at address specified. We can use this address to look at what's inside amdk8!C1Halt.

Code:

0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff800`0402f938 fffff800`02a39463 : 00000000`00000101 00000000`00000060 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000  fffff800`00000001 00000000`00002626 00000000`00000000 : nt! ??  ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000  fffff800`0402fb80 fffff800`03026460 00000000`00000000 :  nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000  fffff800`03026460 fffffa80`041baf00 00000000`00000001 :  hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 [COLOR=#ff0000]fffff880`03f991f2[/COLOR] : fffff800`02a990ca  00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 :  nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff800`0402fb00) [COLOR=#006400][I]< the return address we want (it's above the amdk8!C1Halt+0x2 frame)[/I][/COLOR]
[COLOR=#0000cd]fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2[/COLOR] [COLOR=#006400][I]< the frame[/I][/COLOR]
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000  fffff800`0402a000 fffff800`0402fd40 00000000`00000000 :  nt!KiIdleLoop+0x2c

Why did I decide that we take a look at the callstack for Processor 0 instead of 1 for the answer? Because as you can tell, the very last frame in the callstack for Processor 1 is amdk8!C1Halt. We need the return address from a frame above amdk8!C1Halt in order for it to help us. If we grabbed the return address from the frame with amdk8!C1Halt in it, it would bring us to the instruction address that amdk8!C1Halt would return from if it was finished. This address would direct towards the function below it, which is specifically nt!PoIdle+0x53a. So we'd be looking in the code for nt!PoIdle and not amdk8!C1Halt.

If you're wondering if there's an easier and more reliable answer, there is. You can simply take the symbol name for the routine that's mentioned in the callstack and use that instead of the return address for the Offset requested in the Disassembly window. So if we wanna look exactly at amdk8!C1Halt+0x2, we'll just slap that exactly as the offset and it'll figure it out from there. This also works for the u command and other variations mentioned next. In addition, it will also work the same for routines listed in the callstack that lack symbols. So if this is so much easier, why explain the alternative? Because it'll teach you some things about what's going on in a callstack and code flow.

Another common way to get details on this is to use the uf command followed by the address of where you want to start (which we got from that return address). In this case, it will show the entire function (in this case, amdk8!C1Halt) and that's it. We really don't need to see anything more (like what the Disassembly window displays) so this option is conveniently concise.

You can also do u and variants thereof, but I'd rather have the Windbg help documentation explain it for ya.

Let's try the uf command:

Code:

1: kd> uf fffff880`03f991f2
amdk8!C1Halt:
fffff880`03f991f0 fb              sti
fffff880`03f991f1 f4              hlt
fffff880`03f991f2 c3              ret

Notice even though we gave it address fffff880`03f991f2 it started with fffff880`03f991f0. That's because it simply presents the entire function whatever address we gave it is sitting in. So no matter where in the function we give it, it'll display the whole thing from start to finish (more or less). Now what to explain from this output. We see three instructions: sti (set interrupt flag), hlt (halt) and ret (return). Do a google and you'll find that this is also present in the generic hal!HalProcessorIdle routine that amdk8!C1Halt replaced.

Honestly, with research you will find there's no code difference or really anything that could bug this code out. It's standard fare, and nothing suspect about it. This has also been confirmed by the person that has been victimized by these crashes: they updated their chipset drivers that covers this specific driver, and no change has been made at all to the code (nor has the problem been resolved). So we're left with a rather innocuous little bit of code.

But wait now, what does this code actually do? Well, the first instruction is sti, or the set interrupt flag instruction. It sets the processor to be ready to process any interrupts that comes its way. The next instruction, hlt, halts the processor until the next interrupt pops up. Then there's the standard ret or return. So, wait a minute, if it was set to accept any interrupts and wait until it gets one, why hasn't the clock interrupt popped up on Processor 1's callstack? Wasn't this all done to get it to wait and ready to grab on the next available interrupt? Why didn't it even bother starting to deal with the clock interrupt?

We can get even more info on this by understanding how the sti instruction works. It does so by setting a flag register (which is only either 0 or 1, hence the name 'flag register') called the interrupt flag (if) from 0 to 1 if it has not already been 1, otherwise it just leaves it as 1. Once this flag is set, this processor should accept any interrupts coming its way. So, the last thing we can check here is the interrupt flag. Maybe, perhaps, the sti instruction set the interrupt flag to 1 and something else later changed it back to 0! Let's take a look. Remember to be in the processor context of Processor 1 when you do this. Type the r command (register) followed by the name of the register you want, which in this case is if.

Code:

1: kd> r if
if=1

So there it is. The interrupt flag at the time was in fact 1. It should've handled an interrupt to it, but it didn't. Nothing here reveals that it should've stopped Processor 1 from accepting the clock interrupt which would've prevented this crash. Instead, it continued to sleep, and crashed it went.

What you see here is the example of what very well appears to be a perfectly good condition, but somehow something still went wrong. It means there had to be a hardware issue involved, and since everything else says "everything's cool" we can only point blame at the CPU simply not responding to the interrupt despite being given the green light to do so.

And sure enough, that appears to have been the case. At the time of this writing, the client has installed a new replacement CPU and has had no issues since.

Thanks for reading this, and I hope it has entertained and enlightened you on some stuff, as well as given you questions that hopefully I or someone else can answer. Again, if you suspected any errors (typos, misconceptions, etc.), don't hesitate to pass comment to me. Have fun, guys!

Note:

One thing I want to add concerning this particular 0x101 bugcheck is for people to observe the context for Processor 1 (the processor core that hung). Notice that the entire context is intact: the registers, the callstack of the latest thread, etc., all are present and preserved. If Processor 1 was hung, how could this be? Just so you know, while the entire processor context is stored into memory (which can be later dumped like everything else into a crashdump during a crash), the processor still is responsible to fill in that context. So how did everything get preserved if the processor involved was stuck?

The only way I see it could happen, is that Processor 1 did wake up during the crash to report everything in its context to fill in for the crashdump. What this could mean is that Processor 1 core wasn't exactly hung permanently, but rather it took too long to be aware that there was a clock interrupt that need servicing (or any interrupt for that matter) and it took its sweet time to wake up. It did eventually wake up for the crash, but it failed to do so in a prompt manner for the clock interrupt. It would mean the processor itself is out of sync, which is a hardware malfunction (hence why the answer was replacing the CPU).

You will see 0x101 bugchecks in which for whatever reason (hardware or driver) a processor core or two was not responsive during the crashdump (and so it was really hung), and so not everything for its context was preserved (a clear example is if you attempt to dump the callstack using kv you'll get only one frame, which is empty, and all registers will be zeroed). This is explained in the article I used as a primer that I linked too at the beginning of this little 'tutorial', as in that case the processor contexts weren't saved all the way, so the only way for the person to get a callstack was to find the last thread that ran on it and dump the raw stack and then reconstruct it from there using a special variation of the kv command. Unfortunately, because the example in that primer was on a x32 system, that attempt only applies to x32 systems. I have yet to figure out how to do the same for x64 systems.

Food for thought.

niemiro · Jun 24, 2012

Thanks a lot for this very informative post.

I have a user with this kernel dump: http://www.mediafire.com/?ubbabtx5tpwty3a

It looks almost identical to the one you showed here, and I wondered whether you could think of anything other than hardware? I guess I will ask to update chipset drivers, and see if that does it. A bit desperate, I know.

Also, the OP claims that the computer works fine in Safe Mode, but crashes in normal mode. Finally, I notice that this computer seems to have 8 cores. That seems like quite a few. Do you think this may even be a multi-processor machine, perhaps even a small server? I will ask the OP.

Thanks a lot for any insight you may be able to offer.

Code:

Microsoft (R) Windows Debugger Version 6.2.8229.0 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [D:\MEMORY (2).DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is: SRV*D:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS Personal
Built by: 6002.18607.amd64fre.vistasp2_gdr.120402-0336
Machine Name:
Kernel base = 0xfffff800`03003000 PsLoadedModuleList = 0xfffff800`031c7dd0
Debug session time: Sun Jun 24 14:10:20.781 2012 (UTC + 1:00)
System Uptime: 0 days 0:01:39.562
Loading Kernel Symbols
...............................................................
..................................................
Loading User Symbols
Loading unloaded module list
.....
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck 101, {18, 0, fffffa60019d8180, 3}
*** ERROR: Module load completed but symbols could not be loaded for intelppm.sys
Probably caused by : Unknown_Image ( ANALYSIS_INCONCLUSIVE )
Followup: MachineOwner
---------
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000018, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffffa60019d8180, The PRCB address of the hung processor.
Arg4: 0000000000000003, 0.
Debugging Details:
------------------

BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_8_PROC
DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
PROCESS_NAME:  System
CURRENT_IRQL:  d
STACK_TEXT:  
fffff800`04416a98 fffff800`030193a0 : 00000000`00000101 00000000`00000018 00000000`00000000 fffffa60`019d8180 : nt!KeBugCheckEx
fffff800`04416aa0 fffff800`030543aa : 00000000`00000000 fffff800`04416bc0 fffffa80`08765330 fffff800`03548320 : nt! ?? ::FNODOBFM::`string'+0x2de4
fffff800`04416ae0 fffff800`0352b8af : 00000000`00000000 fffff800`04416bc0 fffff800`03548320 fffffa80`08d91170 : nt!KeUpdateSystemTime+0xea
fffff800`04416b10 fffff800`03053b6d : 00000000`00000000 fffff800`03548320 00000000`00000000 fffffa60`0390b6d6 : hal!HalpRtcClockInterrupt+0x127
fffff800`04416b40 fffffa60`00d407a2 : fffffa60`00d3f685 fffff800`04410000 00000000`00000000 00000000`00000001 : nt!KiInterruptDispatchNoLock+0x14d
fffff800`04416cd8 fffffa60`00d3f685 : fffff800`04410000 00000000`00000000 00000000`00000001 00000000`0000000c : intelppm+0x37a2
fffff800`04416ce0 fffff800`0305f173 : 0000003d`d5f3b80e 00000000`00000000 fffffa80`00000001 fffff800`03179a80 : intelppm+0x2685
fffff800`04416d10 fffff800`0305ee91 : fffff800`03176680 fffff800`00000000 00000000`0f088bae 00000000`00000000 : nt!PoIdle+0x183
fffff800`04416d80 fffff800`0322e860 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x21
fffff800`04416db0 00000000`fffff800 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!zzz_AsmCodeRange_End+0x4
fffff800`044100b0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00680000`00000000 : 0xfffff800

STACK_COMMAND:  kb
SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: Unknown_Module
IMAGE_NAME:  Unknown_Image
DEBUG_FLR_IMAGE_TIMESTAMP:  0
FAILURE_BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_8_PROC_ANALYSIS_INCONCLUSIVE
BUCKET_ID:  X64_CLOCK_WATCHDOG_TIMEOUT_8_PROC_ANALYSIS_INCONCLUSIVE
Followup: MachineOwner
---------

0: kd> !prcb 0
PRCB for Processor 0 at fffff80003176680:
Current IRQL -- 13
Threads--  Current fffff8000317bb80 Next 0000000000000000 Idle fffff8000317bb80
Number 0 SetMember 1
Interrupt Count -- 0001471f
Times -- Dpc    000000bc Interrupt 00000018 
         Kernel 000018ab User      00000000 

0: kd> !prcb 1
PRCB for Processor 1 at fffffa60005ec180:
Current IRQL -- 0
Threads--  Current fffffa60005f5d40 Next 0000000000000000 Idle fffffa60005f5d40
Number 1 SetMember 2
Interrupt Count -- 0000c511
Times -- Dpc    00000000 Interrupt 00000000 
         Kernel 0000189a User      00000000 

0: kd> !prcb 2
PRCB for Processor 2 at fffffa6001966180:
Current IRQL -- 0
Threads--  Current fffffa600196fd40 Next fffffa80054d7bb0 Idle fffffa600196fd40
Number 2 SetMember 4
Interrupt Count -- 0000bc7e
Times -- Dpc    00000015 Interrupt 00000000 
         Kernel 000012ec User      00000000 

0: kd> !prcb 3
PRCB for Processor 3 at fffffa60019d8180:
Current IRQL -- 0
Threads--  Current fffffa8009df3bb0 Next 0000000000000000 Idle fffffa60019e1d40
Number 3 SetMember 8
Interrupt Count -- 0000bf6c
Times -- Dpc    00000001 Interrupt 00000002 
         Kernel 000012d2 User      00000000 

0: kd> !prcb 4
PRCB for Processor 4 at fffffa6001a43180:
Current IRQL -- 0
Threads--  Current fffffa80054e6210 Next 0000000000000000 Idle fffffa6001a4cd40
Number 4 SetMember 10
Interrupt Count -- 0000939e
Times -- Dpc    00000000 Interrupt 00000000 
         Kernel 00001897 User      00000000 

0: kd> !prcb 5
PRCB for Processor 5 at fffffa6001ab5180:
Current IRQL -- 0
Threads--  Current fffffa6001abed40 Next 0000000000000000 Idle fffffa6001abed40
Number 5 SetMember 20
Interrupt Count -- 000091dc
Times -- Dpc    00000000 Interrupt 00000010 
         Kernel 00001895 User      00000000 

0: kd> !prcb 6
PRCB for Processor 6 at fffffa6001b27180:
Current IRQL -- 0
Threads--  Current fffffa80054eebb0 Next 0000000000000000 Idle fffffa6001b30d40
Number 6 SetMember 40
Interrupt Count -- 0000bdf3
Times -- Dpc    00000001 Interrupt 00000004 
         Kernel 00001155 User      00000000 

0: kd> !prcb 7
PRCB for Processor 7 at fffffa6001b99180:
Current IRQL -- 0
Threads--  Current fffffa80069e6bb0 Next 0000000000000000 Idle fffffa6001ba2d40
Number 7 SetMember 80
Interrupt Count -- 0000bff6
Times -- Dpc    00000000 Interrupt 00000001 
         Kernel 0000114e User      00000000 

0: kd> !prcb 8
Cannot get PRCB address


0: kd> !irql 0
Debugger saved IRQL for processor 0x0 -- 13
0: kd> !irql 1
Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)
0: kd> !irql 2
Debugger saved IRQL for processor 0x2 -- 0 (LOW_LEVEL)
0: kd> !irql 3
Debugger saved IRQL for processor 0x3 -- 0 (LOW_LEVEL)
0: kd> !irql 4
Debugger saved IRQL for processor 0x4 -- 0 (LOW_LEVEL)
0: kd> !irql 5
Debugger saved IRQL for processor 0x5 -- 0 (LOW_LEVEL)
0: kd> !irql 6
Debugger saved IRQL for processor 0x6 -- 0 (LOW_LEVEL)
0: kd> !irql 7
Debugger saved IRQL for processor 0x7 -- 0 (LOW_LEVEL)


0: kd> ~0
0: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffff800`04416a98 fffff800`030193a0 : 00000000`00000101 00000000`00000018 00000000`00000000 fffffa60`019d8180 : nt!KeBugCheckEx
fffff800`04416aa0 fffff800`030543aa : 00000000`00000000 fffff800`04416bc0 fffffa80`08765330 fffff800`03548320 : nt! ?? ::FNODOBFM::`string'+0x2de4
fffff800`04416ae0 fffff800`0352b8af : 00000000`00000000 fffff800`04416bc0 fffff800`03548320 fffffa80`08d91170 : nt!KeUpdateSystemTime+0xea
fffff800`04416b10 fffff800`03053b6d : 00000000`00000000 fffff800`03548320 00000000`00000000 fffffa60`0390b6d6 : hal!HalpRtcClockInterrupt+0x127
fffff800`04416b40 [COLOR=#ff0000][B]fffffa60`00d407a2[/B][/COLOR] : fffffa60`00d3f685 fffff800`04410000 00000000`00000000 00000000`00000001 : nt!KiInterruptDispatchNoLock+0x14d (TrapFrame @ fffff800`04416b40)
fffff800`04416cd8 [B][COLOR=#0000ff]fffffa60`00d3f685[/COLOR][/B] : fffff800`04410000 00000000`00000000 00000000`00000001 00000000`0000000c : intelppm+0x37a2
fffff800`04416ce0 fffff800`0305f173 : 0000003d`d5f3b80e 00000000`00000000 fffffa80`00000001 fffff800`03179a80 : intelppm+0x2685
fffff800`04416d10 fffff800`0305ee91 : fffff800`03176680 fffff800`00000000 00000000`0f088bae 00000000`00000000 : nt!PoIdle+0x183
fffff800`04416d80 fffff800`0322e860 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x21
fffff800`04416db0 00000000`fffff800 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!zzz_AsmCodeRange_End+0x4
fffff800`044100b0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00680000`00000000 : 0xfffff800

0: kd> ~1
1: kd> kv
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffffa60`0191bcd8 fffffa60`00d3f685 : fffffa80`054d7720 fffffa60`005f5d40 fffffa60`00000001 fffffa60`0191bd50 : intelppm+0x37a2
fffffa60`0191bce0 fffff800`0305f173 : 00000000`00000001 fffffa80`054d7818 fffffa80`054d7720 fffffa60`005f5d40 : intelppm+0x2685
fffffa60`0191bd10 fffff800`0305ee91 : fffffa60`005ec180 fffffa60`00000000 00000000`0f096483 00000000`00000000 : nt!PoIdle+0x183
fffffa60`0191bd80 fffff800`0322e860 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x21
fffffa60`0191bdb0 00000000`fffffa60 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!zzz_AsmCodeRange_End+0x4
fffffa60`005efd00 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00680000`00000000 : 0xfffffa60


1: kd> uf [B][COLOR=#ff0000]fffffa60`00d407a2[/COLOR][/B]
intelppm+0x37a0:
fffffa60`00d407a0 fb              sti
fffffa60`00d407a1 f4              hlt
fffffa60`00d407a2 c3              ret

1: kd> uf [B][COLOR=#0000ff]fffffa60`00d3f685[/COLOR][/B]
intelppm+0x267c:
fffffa60`00d3f67c 4883ec28        sub     rsp,28h
fffffa60`00d3f680 e81b110000      call    intelppm+0x37a0 (fffffa60`00d407a0)
fffffa60`00d3f685 33c0            xor     eax,eax
fffffa60`00d3f687 4883c428        add     rsp,28h
fffffa60`00d3f68b c3              ret


1: kd> ~3
3: kd> r if
if=1


3: kd> !thread
THREAD fffffa8009df3bb0  Cid 0234.0238  Teb: 000007fffffdd000 Win32Thread: fffff900c0004d50 RUNNING on processor 3
Not impersonating
DeviceMap                 fffff880000073d0
Owning Process            fffffa8009e03040       Image:         csrss.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      5915           Ticks: 457 (0:00:00:07.140)
Context Switch Count      140            IdealProcessor: 3                 LargeStack
UserTime                  00:00:00.000
KernelTime                00:00:00.468
Win32 Start Address 0x0000000049d6153c
Stack Init fffffa600569cdb0 Current fffffa600569b360
Base fffffa600569d000 Limit fffffa6005695000 Call 0
Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for nvlddmkm.sys - 
Child-SP          RetAddr           : Args to Child                                                           : Call Site
fffffa60`0569b6c8 fffff800`03527699 : 00000000`00000010 00000000`00000246 fffffa60`0569b6f0 00000000`00000018 : hal!HalpPciReadMmConfigUlong+0x7
fffffa60`0569b6d0 fffff800`035274aa : 00000000`00000000 fffffa60`0569b800 00000000`00000040 fffff800`0351b000 : hal!HalpPCIPerformConfigAccess+0x55
fffffa60`0569b700 fffff800`035272ef : fffffa60`0569b800 00000000`00000000 00000000`00000000 fffffa60`0569b8d0 : hal!HalpPCIConfigHoldingConfigLock+0x17a
fffffa60`0569b750 fffff800`035270d8 : 00000000`00000000 fffffa60`0569b8d0 fffffa60`0569b800 00000000`00000040 : hal!HalpPCIConfig+0x87
fffffa60`0569b790 fffff800`03526d1c : 00000000`00000000 00000000`00000000 00000000`00000040 fffff800`0353aa80 : hal!HalpReadPCIConfig+0x60
fffffa60`0569b7d0 fffff800`03528190 : 00000000`00000002 fffff800`03526d9a 00000000`00000000 00000000`0000000a : hal!HalpGetPCIData+0x89
fffffa60`0569b8a0 fffffa60`02c17c44 : 00000000`00000000 00000000`00000000 00000000`00000028 fffffa60`0569b9d0 : hal!HalGetBusDataByOffset+0x9c
fffffa60`0569b990 fffffa60`02cdc48e : 00000000`00000000 00000000`0000ffff 00000000`00000007 00000000`00000000 : nvlddmkm+0x208c44
fffffa60`0569b9d0 fffffa60`02cdffa4 : fffffa80`ffff8086 fffffa80`08a72870 fffffa60`03537888 fffffa80`08a72c41 : nvlddmkm+0x2cd48e
fffffa60`0569ba50 fffffa60`02ce0344 : fffffa80`09f40300 fffffa80`09f4d000 fffffa80`08a72870 fffffa80`08757610 : nvlddmkm+0x2d0fa4
fffffa60`0569bab0 fffffa60`02cd2867 : fffffa80`08a705e3 fffffa80`08a72870 fffffa80`08a710de fffffa80`08a72870 : nvlddmkm+0x2d1344
fffffa60`0569bb40 fffffa60`02c0231e : fffffa80`09f4d000 fffffa80`08a72870 fffffa80`08a72870 fffffa80`09e5e010 : nvlddmkm+0x2c3867
fffffa60`0569bb70 fffffa60`02d7f2db : fffffa80`09e5e010 fffffa80`09e5e010 fffffa80`08a72870 00000000`00000012 : nvlddmkm+0x1f331e
fffffa60`0569bbb0 fffffa60`02cc860c : 00000000`00000000 00000000`00000000 00000000`00000001 00000000`00000012 : nvlddmkm+0x3702db
fffffa60`0569bbf0 fffffa60`02ccd5c8 : 00000000`00000000 00000000`00000012 00000000`00000000 fffffa80`09f63d30 : nvlddmkm+0x2b960c
fffffa60`0569bc20 fffffa60`02c383e5 : 00000000`00000000 fffffa80`09f4d000 00000000`00000000 fffffa80`095ca000 : nvlddmkm+0x2be5c8
fffffa60`0569bcb0 fffffa60`02c0a5cd : fffffa80`09f4d000 fffffa80`09f4d000 00000000`00000001 00000000`00000001 : nvlddmkm+0x2293e5
fffffa60`0569bce0 fffffa60`02c0a73b : fffffa60`00000000 00000000`d0000000 00000000`00000000 00000000`00000000 : nvlddmkm+0x1fb5cd
fffffa60`0569bdb0 fffffa60`02b2ae91 : 00000000`00000000 00000000`00000000 00000000`d0000000 00000000`00000000 : nvlddmkm+0x1fb73b
fffffa60`0569be70 fffffa60`02b2b3d0 : fffffa80`08d3d000 fffffa60`02b2ae0f 00000000`00000001 00000000`00000000 : nvlddmkm+0x11be91
fffffa60`0569bf20 fffffa60`02ae8292 : fffffa80`0017f71e fffffa80`08d3d000 fffffa80`09ea2240 fffffa80`09ea2240 : nvlddmkm+0x11c3d0
fffffa60`0569bf60 fffffa60`02a6473a : fffffa80`08d3d000 fffffa60`00000001 fffffa80`08d3d000 00000000`00000000 : nvlddmkm+0xd9292
fffffa60`0569c020 fffffa60`03749ca9 : fffffa80`08d3d000 fffffa80`08d3d000 fffffa60`0569c990 fffffa60`0569c8d0 : nvlddmkm+0x5573a
fffffa60`0569c560 fffffa60`03753389 : fffffa60`03749c27 fffffa80`08d3d000 fffffa60`0569c990 fffffa80`08b2d72c : nvlddmkm!nvDumpConfig+0x23f999
fffffa60`0569c600 fffffa60`03756d25 : fffffa80`08b2d72c fffffa80`08d3d000 fffffa60`0569c990 00000000`00000000 : nvlddmkm!nvDumpConfig+0x249079
fffffa60`0569c7f0 fffffa60`03882b46 : fffffa80`08d3d000 fffffa80`08b2d72c fffffa80`08b2d728 fffff800`030df8b8 : nvlddmkm!nvDumpConfig+0x24ca15
fffffa60`0569c830 fffffa60`0388073a : 00000000`40020056 00000000`00000000 fffffa80`08b2d040 00000000`00000000 : dxgkrnl!DpiDxgkDdiStartDevice+0x62
fffffa60`0569c880 fffffa60`03880baa : fffffa80`00000000 00000000`00000000 00000000`00000000 fffffa80`08b3fd80 : dxgkrnl!DpiFdoStartAdapter+0x382
fffffa60`0569c9e0 fffffa60`0387b66f : 00000000`00000000 00000000`00000000 00000000`00000000 fffffa60`0569cca0 : dxgkrnl!DpiFdoStartAdapterThread+0x17a
fffffa60`0569ca70 fffffa60`038f71be : fffffa60`00000000 00000000`00000000 00000000`00000000 00000000`00292d00 : dxgkrnl!DpiSessionCreateCallback+0x1b
fffffa60`0569caa0 fffffa60`038f70f6 : 00000000`00000000 00000000`00000054 fffffa80`09e03040 00000000`00000000 : watchdog!SMgrSessionOpen+0x42
fffffa60`0569cae0 fffff960`00043ecb : fffffa80`09e03040 00000000`000007ff fffffa60`0569cb48 00000000`00000000 : watchdog!SMgrNotifySessionChange+0x22
fffffa60`0569cb20 fffff960`00046a9c : fffffa80`0000067b fffffa60`0569cca0 fffffa80`054d5080 00000000`000007ff : win32k!InitializeGreCSRSS+0x23
fffffa60`0569cbe0 fffff800`0305a573 : fffffa80`09df3bb0 000007fe`fd7d8a20 fffffa80`09f3f630 00000000`0018f808 : win32k!NtUserInitialize+0x13c
fffffa60`0569cc20 000007fe`fd72cd9a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffffa60`0569cc20)
00000000`0018f768 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x7fe`fd72cd9a


3: kd> r
rax=00000000ffffffff rbx=0000000000000028 rcx=ffffffffffd18000
rdx=fffffa600569b818 rsi=fffffa600569b818 rdi=0000000000000018
rip=fffff80003533b47 rsp=fffffa600569b6c8 rbp=ffffffffffd18000
 r8=0000000000000018  r9=0000000000000018 r10=0000000000000000
r11=0000000000000000 r12=fffff8000353a980 r13=0000000000000003
r14=fffffa600569b907 r15=fffff800035424a0
iopl=0         nv up ei pl zr na po nc
cs=0010  ss=0018  ds=0000  es=0000  fs=0000  gs=0000             efl=00000246
hal!HalpPciReadMmConfigUlong+0x7:
fffff800`03533b47 8902            mov     dword ptr [rdx],eax ds:fffffa60`0569b818=ffffffff

Patrick · Jun 25, 2012

Wow, this is a great post... thank you for the bump, Richard. In the past, when I've seen 101's... I have always dreaded it because I didn't know how to analyze them, and now I have much to look for :)

Vir Gnarus · Jun 25, 2012

Good job on the approach. Looking at intelppm.sys wasn't really necessary since in this case when you look at the running thread for the faulting proc (proc 3) you can see that intelppm was not involved but rather nvlddmkm.sys, or the PCI-E bus, as the last few frames in the callstack show. In the specific situation I was dealing with in the OP, the amd chipset driver was responsible, but not in your case. I'll take a look at the kernel dump myself, but from what I see it looks like you'll want to ask the guy to remove the graphics card, clean up any foreign material that may be in the slot, and then reinsert it, as well as update graphics drivers if they haven't already.

I'm concerned about one thing though, in that you actually are retrieving a thread with all its info n stuff from the faulting proc, which isn't really supposed to happen if that proc was actually frozen. I would think what took place is the IRQL that proc was on at the time was higher than clock interrupt but not higher than the bugcheck, but if that was the case why didn't it save the IRQL (which shows up as 0), or if it did successfully save it, then why on earth would a thread at IRQL 0 stop a clock interrupt?

Perplexing, but I'd like to look into it further. One of the things I'd like to do is check to see if anything in the callstack actually called to increase the IRQL (KeRaiseIRQL). There's a script someone made at codemachine.com that will parse through a module to see if there's any calls it makes to a function name that you give it, which is very convenient. It's not perfect, but does the job well. There's other ways of approaching this as well, but I'll determine that to the best of my ability when I take a look at it.

Patrick · Jun 25, 2012

Oh, and a question, what exactly is intelppm.sys? I know it's the Intel Processor Driver, but what is this specific driver in charge of?

Vir Gnarus · Jun 26, 2012

From what I understand, CPU drivers like that are commonly used to just handle a lot of ACPI stuff like the newer power settings and dynamic overclocking stuff that's built into CPUs nowadays. There's nothing really critical for them as anything of necessity is built into Windows kernel to begin with.

I have to admit that 0x101 bugchecks are probably one of the nastiest buggers to analyze because they often end up with a lot of missing data which needs to be reconstructed manually, and/or having to deal with analyzing interprocess communications.

Patrick · Jun 26, 2012

Very annoying indeed. Thanks for the explanation! :)

Shintaro · Jun 26, 2012

Vir Gnarus,

Outstanding post!

Patrick · Jul 17, 2012

Giving this a bump as I require some assistance in learning how to do this :)

I'm working on a BSOD situation right now in which the user has attached many dumps, all of which are 101 bugchecks. Every single one of them. The user recently attached one today that mentioned it couldn't find the symbols for the Intel Storage Drivers (iaStor.sys), and after updating them the issue still remains and new dumps are still pointing to iaStor.sys..

For example, here' where I am having trouble. Here's the dump file:

Read More:

Okay, I first notice that the dump says it was timed out for 19 clock ticks.

Next, it mentions to run a !prcb command on the amount of processors. If I am reading this and understanding this correctly, there are 8 processors in the dump I have shown. So, with that known, I would run a !prcb on 1 through 8, right? Well I am going wrong somewhere, see here.

Here's what happens when I run a !prcb on processor 0:

0: kd> !prcb 0
PRCB for Processor 0 at fffff780ffff0000:
Current IRQL -- 13
Threads-- Current fffffa8009fa03a0 Next fffffa800d8c8060 Idle fffff8000320bcc0
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 00123ee0
Times -- Dpc 000003af Interrupt 000007e1
Kernel 00008075 User 00000c2a

Works just fine. Alright, let's move onto processor 1:

0: kd> !prcb 1
Cannot get PRCB address

Nope.

Let's try processor 2:

0: kd> !prcb 2
Cannot get PRCB address

Nope.

3?

0: kd> !prcb 3
Cannot get PRCB address

Nada.

So far, I have only been able to get the 0 processor to show info... am I going wrong somewhere?

niemiro · Jul 17, 2012

It looks like you are doing things correctly. Are you trying to get this information out of a minidump? I don't think that minidumps retain that information, or enough information to properly diagnose this particular bugcheck. You might have to ask the user to upload something bigger

Richard

Shintaro · Jul 17, 2012

Minidumps only have the information on the process and cpu that crashed. If you want to be able to change CPU's then you need a full memory dump as far as I know.

EDIT: CPU's are zero numbered. On a 4 core CPU: CPU1 = 0, CPU2 = 1, CPU3 = 2, CPU4 = 3.

To digress slightly,
The way that I determine how many CPU's there are is from this line:

Windows 7 Kernel Version 7601 (Service Pack 1) MP (4 procs) Free x64

As I understand the above line:
Win 7 with SP1, Multi-processor ( 4 procs/cpu) Free (not the checked build) x64 (64 bit version)

I am no guru, but if you want to upload a minidump we could have a look.

Patrick · Jul 17, 2012

Ah, you guys are awesome. That's exactly why then, they are minidumps.

Thanks! :)

Shintaro · Jul 17, 2012

Mate,

Its all about sharing information and learning.

Patrick · Jul 17, 2012

Shintaro said:
Mate,

Its all about sharing information and learning.

Absolutely.

Vir Gnarus · Jul 17, 2012

I added a line at the top of the article to get people's attention of the somber truth that minidumps are worthless for these bugchecks.

Btw, when or if you plan on looking at the kernel dump, a quick way of running through data on all the processors instead of having to tediously having to churn out the thread for each processor is using the !running command:

Code:

0: kd> !running -it

System Processors:  (000000000000000f)
  Idle Processors:  (0000000000000000) (0000000000000000) (0000000000000000) (0000000000000000)

       Prcbs             Current         (pri) Next            (pri) Idle
  0    fffff80002df3e80  fffffa8006261060 ( 8)                       fffff80002e01cc0  ................

Child-SP          RetAddr           Call Site
fffff880`08cd6918 fffff800`02cd7f3a nt!KeBugCheckEx
fffff880`08cd6920 fffff800`02c8ace7 nt! ?? ::FNODOBFM::`string'+0x4e2e
fffff880`08cd69b0 fffff800`031f4895 nt!KeUpdateSystemTime+0x377
fffff880`08cd6ab0 fffff800`02c7d713 hal!HalpHpetClockInterrupt+0x8d
fffff880`08cd6ae0 00000000`7518cffd nt!KiInterruptDispatchNoLock+0x163
00000000`0008e350 00000000`00000000 0x7518cffd

  1    fffff880009ec180  fffff880009f6fc0 ( 0) fffffa800394f540 (22) fffff880009f6fc0  ................

Child-SP          RetAddr           Call Site
00000000`00000000 00000000`00000000 0x0

  2    fffff88002f65180  fffffa80036aead0 (16) fffffa80036ae5e0 (23) fffff88002f6ffc0  ................

Child-SP          RetAddr           Call Site
fffff880`03316740 fffff800`02c6afe4 nt!KeFlushMultipleRangeTb+0x260
fffff880`03316810 fffff800`02cf94f5 nt!MiAgeWorkingSet+0x64a
fffff880`033169c0 fffff800`02c6b116 nt! ?? ::FNODOBFM::`string'+0x4cd46
fffff880`03316a40 fffff800`02c6b5cb nt!MmWorkingSetManager+0x6e
fffff880`03316a90 fffff800`02f17e6a nt!KeBalanceSetManager+0x1c3
fffff880`03316c00 fffff800`02c71f06 nt!PspSystemThreadStartup+0x5a
fffff880`03316c40 00000000`00000000 nt!KiStartSystemThread+0x16

  3    fffff88002fd7180  fffffa80037c02c0 (11) fffffa80065ffb50 (21) fffff88002fe1fc0  ................

Child-SP          RetAddr           Call Site
fffff880`09462880 fffff800`02cb5acd nt!KxFlushEntireTb+0x93
fffff880`094628c0 fffff800`02cd9e90 nt!KeFlushTb+0x119
fffff880`09462940 fffff800`02c8de8d nt! ?? ::FNODOBFM::`string'+0xae02
fffff880`09462980 fffff800`02c7f2ee nt!MmAccessFault+0xa7d
fffff880`09462ae0 00000000`6862ce6a nt!KiPageFault+0x16e
00000000`0e29fba0 00000000`00000000 0x6862ce6a

The "i" argument causes it to display idle procs too, and "t" displays the stack trace for the thread running on each proc. Of course, to get the IRQLs and some other PRCB/PCR data, you unfortunately still have to type !irql or !prcb for each separate proc. Oh, and concerning the line of periods at the end of each, those are related to spinlock queues. Read up on it from the Windbg help manual for !running, as well as Windows Internals book on the chapter about Processes and Threads.

Patrick · Jul 17, 2012

Right on, thanks a lot Vir!

Shintaro · Jul 17, 2012

Another cool command. :cool3: But hidden in there more reading.:smile:

Cayden · Jul 25, 2012

Wow this was as informative as it was amusing to read! Great work, I learned things.

gmcallis · Mar 12, 2013

Hi Richard,

I stumbled upon your post searching for my BSOD issue and besides having an Arg1 of 19 in lieu of your clients 18, my details are exactly the same.

I read Vir Gnarus suggests "to remove the graphics card, clean up any foreign material that may be in the slot, and then reinsert it, as well as update graphics drivers if they haven't already."

I'm currently downloading the latest drivers for my GEForce 570, and will clean up any foreign material that may be in the slot when the pc is shutdown, but I'm wondering if this rectified the BSOD issue for your client?

Cheers

Graham

Vir Gnarus · Mar 12, 2013

Huh? Who? What? <.>;;

Class 101 for 0x101 Bugchecks

BSOD Kernel Dump Expert

Senior Administrator, Windows Update Expert

Sysnative Staff

BSOD Kernel Dump Expert

Sysnative Staff

BSOD Kernel Dump Expert

Sysnative Staff

Well-known member

Sysnative Staff

Senior Administrator, Windows Update Expert

Well-known member

Sysnative Staff

Well-known member

Sysnative Staff

BSOD Kernel Dump Expert

Sysnative Staff

Well-known member

Member

New member

BSOD Kernel Dump Expert