Vir Gnarus
BSOD Kernel Dump Expert
- Mar 2, 2012
- 474
NOTICE: 0x101 BUGCHECKS CANNOT BE ANALYZED USING MINIDUMPS!!! YOU WILL NEED AT LEAST A KERNEL DUMP!
Hi guys, I'd figure I'd run you through my personal experience delving into a 0x101 bugcheck, also known as a CLOCK_WATCHDOG_TIMEOUT. This will cover:
- Interrupts & IRQLs
- Simple disassembly
- Looking at PRCB and PCR
- Observing the context of multiple processors
Warning: At the time of writing this, I'm only 95% sure on my observation, so take this with a grain of salt. This is one of my more thorough investigations that I've come across lately on TSF, and I'm still rusty with dealing with interrupts, so I wouldn't doubt that some of my approach or understanding may be erroneous. Keep that in mind when reading this, and any corrections are welcome. I don't want to end up being the blind leading the blind. Always refer to the book Windows Internals if you want solid understanding of this stuff. Otherwise, what I see here is just my exploration into it and serves for navigating a crashdump and collecting information more effectively (it's hard doing all this without someone that knows better pointing your faults :) ).
The example I'd like to use is the case mentioned in this thread. Note that all my work involved operating with the client's kernel dump he uploaded to a 3rd-party filesharing site. Since I doubt everyone is eager or has the resources or time to scrutinize over a whole kernel dump, I will be more descriptive in my presentation to compensate.
So why did I use a kernel dump? Because minidumps do not retain the data required to analyze 0x101 bugchecks. I repeat: you cannot analyze 0x101 bugchecks from minidumps! Minidumps only save the processor context for the processor core that was awake and able to successfully run KeBugCheckEx and dump the crashdump. Why is this a problem? You'll understand once you start reading.
As a primer that I used to get myself in the right direction, this article was used. It was extremely beneficial in giving me my bearings and what data I should be most concerned about when analyzing this type of bugcheck.
_____
We start off by viewing exactly what an 0x101 bugcheck is. Let's run the initial !analyze -v on it.
Code:
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000060, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff880009ea180, The PRCB address of the hung processor.
Arg4: 0000000000000001, 0.
...
I'm sure the explanation may not be understood by some here, so I'll go through an explanation of the explanation. :)
- An expected clock interrupt was not received on a secondary processor -
To understand this, one must understand what interrupts are. As the name implies, an interrupt is an operation that interrupts what a processor was doing at the time. This is often crucial because sometimes a processor may be working on something relatively unimportant and something more time-sensitive and of higher priority needs to get fulfilled real quick, such as hardware I/O.
Say a processor core is running an operation that involves drawing a window on your screen for a user application like Skype. Then, all of a sudden, the core receives a request (interrupt) to do some disk I/O to move stuff from paging file to memory. The core obviously isn't going to neglect the request and continue drawing the window. It sees the higher priority and puts everything on standby while it gets done with the request (interrupt). Once that gets done, it will move back to the lesser priority stuff it was doing.
Note:
As you can understand, having everything stop to take care of an interrupt sounds like it'd hang the PC. It can, and is often referred too as interrupt storms. The reason why it doesn't do it all the time is because an interrupt shouldn't be designed to satisfy the request Windows received from the hardware. If it did this all the time then nothing gets done quickly. Instead, an interrupt will construct and issue a DPC (Deferred Procedure Call) that gets queued up and serviced later in an orderly fashion. That way, the interrupt finishes VERY quickly (it has too), and the actual workload with dealing with the hardware's interrupt for service (the actual I/O work) is done in a fashion that won't halt everything else up in the system.
In a fast-paced, multi-threaded and highly complex environment like Windows on newer hardware, it's easy to see this process of handling interrupts is more than just simply looking at a table of priorities and seeing which ones needs checking off first (which kinda does exist - called IRQLs - which we'll get too shortly). There's algorithms and whatnot involved in an attempt to provide the snappiest and smoothest means of computer operation for the user without having everything become imbalanced and chaotic swirling into a crash. But aside from all this complex methodology, the general idea is that some stuff has higher priority than others and needs faster and/or longer attention than the rest, and all of this is done to make sure that all gets taken care of decently and in order.
Now that I've given a general idea what interrupts are and what they're for, a clock interrupt is a type of interrupt in which involves counting the actual cycles of the processor core, which is essentially running a clock on the processors to keep everything in sync. If you read in the NTDebugging blog I used as a primer, you'll read that the clock interrupt gets handed out to all the cores and all of them have to report in. When one doesn't report at all in a designated time frame, this crash happens. Now it also mentions that the clock interrupt in x86 systems has an IRQL of 28. So what's an IRQL?
A good explanation for an IRQL is by this blog entry. It is essentially called the Interrupt Request Level, and is merely the priority of an interrupt given by Windows. The higher the number, the higher the priority. All typical activity running on a processor core is at level 0, which is called Passive or LOW_LEVEL (note that this isn't even an interrupt. An interrupt is anything above IRQL 0). When an interrupt kicks in, the IRQL reaches the level specified by the interrupt, the interrupt is serviced and then when it's done it drops back down to 0. If there were any other interrupts that popped up but at lower or same IRQLs, it will service those afterwards in an orderly fashion, dropping down the list till it goes back to IRQL 0. So while every interrupt has its own IRQL, a processor core also has it's own IRQL "state" if you will. It will service anything with the same IRQL as the state that it's in, and then will lower itself to service lower priority IRQLs until it's all back to 0 again. There's a bit more to it than that, but that's the gist of it.
- in an MP system -
A multi-processor system.
- within the allocated interval -
As mentioned before, it didn't respond in a designated time frame.
- This indicates that the specified processor is hung and not processing interrupts -
This can be a bit misleading. While yes, it does mean the processor is hung, it doesn't actually mean that it won't service any interrupts. In many cases it could be stuck servicing a higher IRQL interrupt but simply won't deal with anything lower than it.
Now that we got that description taken care of, now on to business. Ultimately, there's steps we should carry out to figure this out. In essence it should be, in order:
- what happened,
- how it happened, and lastly,
- why it happened.
WHAT
_____
First, what happened? There was a CLOCK_WATCHDOG_TIMEOUT. Yes, we know this. How long did it timeout? 60 clock ticks the bugcheck says. Now, exactly what processor buggered out? Well, there's a couple places we can look, all of which is in the !analyze -v output:
Code:
CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 0000000000000060, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR], The PRCB address of the hung processor.
Arg4: 000000000000000[COLOR=#ff0000][B]1[/B][/COLOR], 0.
Debugging Details:
------------------
BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC
DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
PROCESS_NAME: System
CURRENT_IRQL: d
STACK_TEXT:
fffff800`0402f938 fffff800`02a39463 : 00000000`00000000 00000000`00000000 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000 fffff800`00000000 00000000`00000000 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000 fffff800`0402fb80 fffff800`00000000 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000 fffff800`03026460 fffffa80`041baf00 00000000`00000001 : hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 fffff880`03f991f2 : fffff800`02a990ca 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 : nt!KiInterruptDispatchNoLock+0x163
fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000 fffff800`0402a000 fffff800`0402fd40 00000000`00000000 : nt!KiIdleLoop+0x2c
STACK_COMMAND: kb
SYMBOL_NAME: ANALYSIS_INCONCLUSIVE
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Unknown_Module
IMAGE_NAME: Unknown_Image
DEBUG_FLR_IMAGE_TIMESTAMP: 0
FAILURE_BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE
BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE
Followup: MachineOwner
---------
I highlighted in bold the places you can find the answer. Here's a rundown of each:
Code:
Arg3: [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR], The PRCB address of the hung processor.
The PRCB - or PRocessor Control Block - contains data regarding the state of the processor (note that we've been talking about logical processors this whole time, not physical). It is an extension of the PCR - Processor Control Region - though it does contain the bulk of the data concerning that particular processor. You can do a dt command for nt!_KPRCB or nt!_KPCB to get details on each structure. You can also use the Windbg extensions !prcb and !pcr to get such details on the PRCB and PCR, respectively. Just type in the number of the processor you want and it'll give you the details. Let's try !prcb, since that's what the bugcheck is referring too. Do it for each existing processor (in this case, 2 exist):
Code:
0: kd> !prcb 0
PRCB for Processor 0 at fffff80002c05e80:
Current IRQL -- 13
Threads-- Current fffff80002c13c40 Next 0000000000000000 Idle fffff80002c13c40
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 0131a6aa
Times -- Dpc 00000368 Interrupt 0000036d
Kernel 000b11d9 User 0003e82a
0: kd> !prcb 1
PRCB for Processor 1 at [COLOR=#ff0000][B]fffff880009ea180[/B][/COLOR]:
Current IRQL -- 0
Threads-- Current fffff880009f4f40 Next fffffa8005a95b60 Idle fffff880009f4f40
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 0176aaf8
Times -- Dpc 00002cc2 Interrupt 00001e01
Kernel 000d4b6e User 0001ac92
See the address for the PRCB for Processor 1? That matches the address mentioned in this third argument for the bugcheck code, so that's the irresponsible processor right there.
Note:
- Processor numbers start at 0, so the first processor will be called "Processor 0". There are a couple exceptions, as we'll see later, but that's usually the case.
- The information provided by these commands are basic. If you want all the nitty gritty on the processor, you'll want to dump the contents of the PRCB structure using the dt command. Take the address of the PRCB mentioned to use as the address for the nt!_KPRCB symbols to feed on. In this case, the syntax for the dt command will be:
dt nt!_KPRCB fffff880009ea180
Code:
Arg4: 000000000000000[COLOR=#ff0000][B]1[/B][/COLOR], 0.
This is a bit of a hidden surprise. It's an undocumented argument for this bugcheck, but it apparently reveals the processor number right there. Again, processor numbers start at 0, so this would be the second processor.
Code:
BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC
FAILURE_BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE
BUCKET_ID: X64_CLOCK_WATCHDOG_TIMEOUT_[COLOR=#ff0000][B]2[/B][/COLOR]_PROC_ANALYSIS_INCONCLUSIVE
Bucket IDs are just what Microsoft uses to get an idea on the cause of crash when reports are submitted. In many cases they have been helpful in discovering clues on cause, but in this case it's just the usual fare. As you can see, there's a discrepancy between the number of the processor provided here and the one in the rest of the data. This is one of those few and far inbetween exceptions to the rule. It's just to clarify for individuals looking at it as to exactly what processor it is, without having to remember the numbering starts at 0. So 2 would just mean "second processor".
HOW
_____
Ok, so far we've explained what just happened: Processor 1 (second processor) reached 60 clock ticks without responding to the clock interrupt and so the system BSODs. Now, how did this take place? To determine this, we need to look at what was going on during this time before Windows intervened and crashed the system.
Our first step in this venture involves checking the IRQL of each processor before the system decided to crash. During the crash process, the IRQL is altered so nothing else gets in its way, so it is pointless to see the latest existing IRQL. Fortunately, Windows developers are aware of this and made it so that the IRQL of the processor before the crash process kicked in was saved. You can grab this saved IRQL using the !irql command, followed by the number of the processor.
Code:
0: kd> !irql 0
Debugger saved IRQL for Processor 0x0 -- [COLOR=#006400][B]13[/B][/COLOR]
0: kd> !irql 1
Debugger saved IRQL for Processor 0x1 -- [COLOR=#ff0000][B]0[/B][/COLOR] (LOW_LEVEL)
As you can see, the IRQL of the first processor is 13 (which is CLOCK for x64 processors) and the second is at 0. So we can see that only Processor 0 was at CLOCK level. In the NTDebugging blog - and previously mentioned - the clock interrupt for x86 processors is 28, but since this is an x64 environment, things change around a bit (fewer IRQL levels) and it is 13 for x64 environments.
Note: For information about interrupts, consult the Windows Internals book, pages 85-114 (5th edition)
Note: compare the IRQL mentioned in !irql with what you see from !pcr and !prcb. Notice something? The IRQL for Processor 1 is listed as 0 in the PCR, but 13 in the PRCB and in the !irql output. That's because the PCR saves the current IRQL, but the PRCB stores the one previously before control passed to the bugcheck code. Details are mentioned in Windows Internals page 95 (5th edition). You may also consult the help documentation on Windbg for !irql.
Now that we have the IRQL, let's garner some more information. We'll look at the callstack present in each processor to see what was the latest activity taking place on them. You can use a variation of the k command to do this. I use kv as it's most verbose. Let's start with Processor 0:
Code:
0: kd> kv
Child-SP RetAddr : Args to Child : Call Site
fffff800`0402f938 fffff800`02a39463 : 00000000`00000101 00000000`00000060 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000 fffff800`00000001 00000000`00002626 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000 fffff800`0402fb80 fffff800`03026460 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000 fffff800`03026460 fffffa80`041baf00 00000000`00000001 : hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 fffff880`03f991f2 : fffff800`02a990ca 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff800`0402fb00)
fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000 fffff800`0402a000 fffff800`0402fd40 00000000`00000000 : nt!KiIdleLoop+0x2c
Now let's see latest thread running on Processor 1. In order to do so, we should change the processor context of Windbg. Currently Windbg is displaying all the data on what's taking place on Processor 0. You can tell because of the number visible in the Windbg prompt:
Code:
[COLOR=#006400][B]0[/B][/COLOR]: kd>
The number displays which processor is current. You can switch to the other processor using the tilde (~) key, followed by the number of the processor. So let's do that and dump the callstack afterwards:
Code:
0: kd> ~1
1: kd> kv
Child-SP RetAddr : Args to Child : Call Site
fffff880`02f1bc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04ec1ab8 fffff880`009f4f40 00000000`00000001 : amdk8!C1Halt+0x2
fffff880`02f1bca0 fffff800`02a93d5c : fffff880`009ea180 fffff880`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff880`02f1bd80 00000000`00000000 : fffff880`02f1c000 fffff880`02f16000 fffff880`02f1bd40 00000000`00000000 : nt!KiIdleLoop+0x2c
Ok, so we got the callstack for each existing thread at the time of the crash for both our processors. Now we have to interpret what's going on with each. Here's some stuff we can conjure up from these two callstacks:
- Both started with the IdleLoop routine, which is essentially the start of the System Idle Process you see in Task Manager. So they both were sitting and waiting to do something.
- Processor 0 then received an interrupt. This interrupt happened to be a clock interrupt.
- The clock interrupt then involved updating the system time. This is something that is replicated across all processors so that all the processors update their own timers and things are kept track of.
- Processor 0 was the one that performed the bugcheck.
So from what we can tell by this, Processor 0 took care of a systematic clock interrupt that keeps track of the timing of things, while Processor 1 apparently didn't even both with it and just stayed dormant. This is further expressed by the IRQL for Processor 0 being 13 (clock interrupt) and the IRQL for Processor 1 being 0. This is getting suspicious. But for now, let's move on.
WHY
_____
So far we've gained enough information to where we can see that a clock interrupt (as expected) popped up and tried to do its routine thing of keeping the system timer up-to-date. Processor 0 had no problem handling it, and did so just fine. However, we can see that when it came time for Processor 1 to do the same, nothing happened. The interrupt did not pop up on the callstack, nor did the IRQL change. Things continued to lie dormant, and Processor 1 was sound asleep. Why did this happen?
The initial and immediate suspect right here is the 3rd-party driver and its routine present in both callstacks:
Code:
amdk8!C1Halt+0x2
This is present as part of the whole Idle Loop (System Idle Process) that goes on when there's a wait on doing stuff. Instead of the default routine that takes place here (I think it's hal!HalProcessorIdle), it has been replaced by the one provided by amdk8.sys. Since this has been a consistent issue with the victim, it seems rather evident that we're dealing with a possibly buggy driver. However, we will need to dig even further to verify this suspicion. We will actually need to disassemble the code here to see what the problem is. So let's go ahead and do that.
There are a number of ways you can disassemble code in Windbg. The most common (and relatively easiest) is to use the Disassembly window provided by Windbg. Go to View followed by Disassembly to bring it up. Then in the prompt labeled Offset, type in the address you desire to start at. We can easily discover this by looking at the callstack for the thread on Processor 0 (not 1). What we want is the return address in the frame (each row in the callstack is called a frame of the callstack) above the amdk8!C1Halt frame. What these return addresses show is that when a routine performs a return instruction after it's done its work, it returns to the frame below it at address specified. We can use this address to look at what's inside amdk8!C1Halt.
Code:
0: kd> kv
Child-SP RetAddr : Args to Child : Call Site
fffff800`0402f938 fffff800`02a39463 : 00000000`00000101 00000000`00000060 00000000`00000000 fffff880`009ea180 : nt!KeBugCheckEx
fffff800`0402f940 fffff800`02a93c87 : fffff880`00000000 fffff800`00000001 00000000`00002626 00000000`00000000 : nt! ?? ::FNODOBFM::`string'+0x4d8e
fffff800`0402f9d0 fffff800`0300b090 : 00000000`00000000 fffff800`0402fb80 fffff800`03026460 00000000`00000000 : nt!KeUpdateSystemTime+0x377
fffff800`0402fad0 fffff800`02a87ab3 : 00000000`00000000 fffff800`03026460 fffffa80`041baf00 00000000`00000001 : hal!HalpRtcClockInterrupt+0x130
fffff800`0402fb00 [COLOR=#ff0000]fffff880`03f991f2[/COLOR] : fffff800`02a990ca 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 : nt!KiInterruptDispatchNoLock+0x163 (TrapFrame @ fffff800`0402fb00) [COLOR=#006400][I]< the return address we want (it's above the amdk8!C1Halt+0x2 frame)[/I][/COLOR]
[COLOR=#0000cd]fffff800`0402fc98 fffff800`02a990ca : 00000000`00369e99 fffffa80`04eb44e8 fffff800`02c13c40 00000000`00000001 : amdk8!C1Halt+0x2[/COLOR] [COLOR=#006400][I]< the frame[/I][/COLOR]
fffff800`0402fca0 fffff800`02a93d5c : fffff800`02c05e80 fffff800`00000000 00000000`00000000 fffff880`03e76480 : nt!PoIdle+0x53a
fffff800`0402fd80 00000000`00000000 : fffff800`04030000 fffff800`0402a000 fffff800`0402fd40 00000000`00000000 : nt!KiIdleLoop+0x2c
Why did I decide that we take a look at the callstack for Processor 0 instead of 1 for the answer? Because as you can tell, the very last frame in the callstack for Processor 1 is amdk8!C1Halt. We need the return address from a frame above amdk8!C1Halt in order for it to help us. If we grabbed the return address from the frame with amdk8!C1Halt in it, it would bring us to the instruction address that amdk8!C1Halt would return from if it was finished. This address would direct towards the function below it, which is specifically nt!PoIdle+0x53a. So we'd be looking in the code for nt!PoIdle and not amdk8!C1Halt.
If you're wondering if there's an easier and more reliable answer, there is. You can simply take the symbol name for the routine that's mentioned in the callstack and use that instead of the return address for the Offset requested in the Disassembly window. So if we wanna look exactly at amdk8!C1Halt+0x2, we'll just slap that exactly as the offset and it'll figure it out from there. This also works for the u command and other variations mentioned next. In addition, it will also work the same for routines listed in the callstack that lack symbols. So if this is so much easier, why explain the alternative? Because it'll teach you some things about what's going on in a callstack and code flow.
Another common way to get details on this is to use the uf command followed by the address of where you want to start (which we got from that return address). In this case, it will show the entire function (in this case, amdk8!C1Halt) and that's it. We really don't need to see anything more (like what the Disassembly window displays) so this option is conveniently concise.
You can also do u and variants thereof, but I'd rather have the Windbg help documentation explain it for ya.
Let's try the uf command:
Code:
1: kd> uf fffff880`03f991f2
amdk8!C1Halt:
fffff880`03f991f0 fb sti
fffff880`03f991f1 f4 hlt
fffff880`03f991f2 c3 ret
Notice even though we gave it address fffff880`03f991f2 it started with fffff880`03f991f0. That's because it simply presents the entire function whatever address we gave it is sitting in. So no matter where in the function we give it, it'll display the whole thing from start to finish (more or less). Now what to explain from this output. We see three instructions: sti (set interrupt flag), hlt (halt) and ret (return). Do a google and you'll find that this is also present in the generic hal!HalProcessorIdle routine that amdk8!C1Halt replaced.
Honestly, with research you will find there's no code difference or really anything that could bug this code out. It's standard fare, and nothing suspect about it. This has also been confirmed by the person that has been victimized by these crashes: they updated their chipset drivers that covers this specific driver, and no change has been made at all to the code (nor has the problem been resolved). So we're left with a rather innocuous little bit of code.
But wait now, what does this code actually do? Well, the first instruction is sti, or the set interrupt flag instruction. It sets the processor to be ready to process any interrupts that comes its way. The next instruction, hlt, halts the processor until the next interrupt pops up. Then there's the standard ret or return. So, wait a minute, if it was set to accept any interrupts and wait until it gets one, why hasn't the clock interrupt popped up on Processor 1's callstack? Wasn't this all done to get it to wait and ready to grab on the next available interrupt? Why didn't it even bother starting to deal with the clock interrupt?
We can get even more info on this by understanding how the sti instruction works. It does so by setting a flag register (which is only either 0 or 1, hence the name 'flag register') called the interrupt flag (if) from 0 to 1 if it has not already been 1, otherwise it just leaves it as 1. Once this flag is set, this processor should accept any interrupts coming its way. So, the last thing we can check here is the interrupt flag. Maybe, perhaps, the sti instruction set the interrupt flag to 1 and something else later changed it back to 0! Let's take a look. Remember to be in the processor context of Processor 1 when you do this. Type the r command (register) followed by the name of the register you want, which in this case is if.
Code:
1: kd> r if
if=1
So there it is. The interrupt flag at the time was in fact 1. It should've handled an interrupt to it, but it didn't. Nothing here reveals that it should've stopped Processor 1 from accepting the clock interrupt which would've prevented this crash. Instead, it continued to sleep, and crashed it went.
What you see here is the example of what very well appears to be a perfectly good condition, but somehow something still went wrong. It means there had to be a hardware issue involved, and since everything else says "everything's cool" we can only point blame at the CPU simply not responding to the interrupt despite being given the green light to do so.
And sure enough, that appears to have been the case. At the time of this writing, the client has installed a new replacement CPU and has had no issues since.
Thanks for reading this, and I hope it has entertained and enlightened you on some stuff, as well as given you questions that hopefully I or someone else can answer. Again, if you suspected any errors (typos, misconceptions, etc.), don't hesitate to pass comment to me. Have fun, guys!
Note:
One thing I want to add concerning this particular 0x101 bugcheck is for people to observe the context for Processor 1 (the processor core that hung). Notice that the entire context is intact: the registers, the callstack of the latest thread, etc., all are present and preserved. If Processor 1 was hung, how could this be? Just so you know, while the entire processor context is stored into memory (which can be later dumped like everything else into a crashdump during a crash), the processor still is responsible to fill in that context. So how did everything get preserved if the processor involved was stuck?
The only way I see it could happen, is that Processor 1 did wake up during the crash to report everything in its context to fill in for the crashdump. What this could mean is that Processor 1 core wasn't exactly hung permanently, but rather it took too long to be aware that there was a clock interrupt that need servicing (or any interrupt for that matter) and it took its sweet time to wake up. It did eventually wake up for the crash, but it failed to do so in a prompt manner for the clock interrupt. It would mean the processor itself is out of sync, which is a hardware malfunction (hence why the answer was replacing the CPU).
You will see 0x101 bugchecks in which for whatever reason (hardware or driver) a processor core or two was not responsive during the crashdump (and so it was really hung), and so not everything for its context was preserved (a clear example is if you attempt to dump the callstack using kv you'll get only one frame, which is empty, and all registers will be zeroed). This is explained in the article I used as a primer that I linked too at the beginning of this little 'tutorial', as in that case the processor contexts weren't saved all the way, so the only way for the person to get a callstack was to find the last thread that ran on it and dump the raw stack and then reconstruct it from there using a special variation of the kv command. Unfortunately, because the example in that primer was on a x32 system, that attempt only applies to x32 systems. I have yet to figure out how to do the same for x64 systems.
Food for thought.
Last edited: