BSOD Method and Tips

Vir Gnarus

BSOD Kernel Dump Expert
I think, given the number of BSODs present, that it'd be good to have a general thread providing analysis tips that anyone can access for convenience's sake. A good, reliable knowledge base is a powerful tool. I think anyone should contribute their own proven methods and advice.

NOTE: If there's something you'd like explained about debugging, BSODs n all that jazz, that isn't covered in this or other articles, go ahead and ask! I'm sure I or someone else around here would be eager to provide you some knowledge, and I might end up adding it to this big ole ugly thread!

NEWCOMERS: To those new to Windows debugging and crashdump analysis, please set aside an hour to watch the following 9-part video series. Thanks, JC.

Windows Hang and Crash Dump Analysis 1/9 - YouTube


Last update: March 04, 2021 (x BlueRobot)

- Added three books to Links O'Plenty


Previous updates:

March 12, 2014 (Patrick)
- Added a book to Links O' Plenty.
- Added a website to Links O' Plenty.
- Added two databases to Links O' Plenty.

Sept 3, 2013
- Added the procdumpext Windbg extension to the list in Links O' Plenty.

June 26, 2013
- Added TechEd 2013 videos link to the Links O' Plenty.

June 5, 2013
- Fixed a longstanding misconception of IRP Logging setting for Driver Verifier. Please read in the appropriate section and correct accordingly.

May 2, 2013
- Added a bit more info on Special Pool under Driver Verifier and when to use caution in using it. Added a couple videos in Links O' Plenty. Updated Windows Internals info in Links O' Plenty.

March 28, 2013
- Added several commands to WinDBG Commands as well as adding a bit of info to some existing ones, like ? command. Carried some descriptions over from an old post I made in another forum.

February 08, 2013
- Added BAD_POOL_HEADER (0x19) article link under ERROR CODES section. Fixed link to MDLs article under ERROR CODES.

January 8, 2013
- Appended some info on the WHEA (0x124) description to help you interpret the error code mnemonics.

Oct 2, 2012
- Added Databases section to Links O' Plenty. Added 4 DBs to it.

Sept 20, 2012
- Included a new video under Links O' Plenty. Finally worked up the energy to do it. Only took nearly a month! Thanks to shintaro for this one!

August 28, 2012
- Added some tips under General Concepts. Split General Tips into Debugging Tips and Learning Tips.

August 2, 2012
- Added a blog and 3 books to Links O' Plenty.
- Corrected what I believe is a misinterpretation of the WHEA error explained in the example for WHEA Errors.
- Explained for Windows Internals that there's a new 6th Edition (though only Part 1 right now).

July 12, 2012
- Corrected outdated link for the NotMyfault application in Links O' Plenty. The application has since been changed with a lot of extra crash goodness added.
- Added a new section under Miscellaneous links in Links O' Plenty called Debugging Extensions. Added cmkd extension there.

June 14, 2012
- Added anchor tags for Table of Contents (finally).
- Added Processes, Threads and Code Flow under General Concepts.
- Added CodeMachine to list of website links under Links O' Plenty.

March 23, 2012
- Added a link to the article for PCI-Express WHEA errors in the WHEA error (0x124) section.

January 18, 2012
- Added NotMyFault as an application in the Links O' Plenty section under Miscellaneous.

January 10, 2012
- Corrected the link for Debug Tutorial under Miscellaneous links in the Links O' Plenty section. Now it'll bring you to the list of parts for the tutorial.

January 5, 2012
- Corrected the link for details on Driver Verifier options.

January 3, 2012
- Added a much-needed Table of Contents. Thanks, usasma.

December 30, 2011
- Added !verifier under Windbg Commands section. Changed formatting a wee bit.

Dec 21, 2011
- Added DRIVER_POWER_STATE_FAILURE to Error Codes section.

Dec 06, 2011
- Nice sized update.
- Split Windbg Commands and General Concepts sections.
- Added Contexts to General Concepts section.
- Added a number of simple yet effective commands, extensions, etc. under Windbg Commands section.
- Added CLOCK_WATCHDOG_TIMEOUT (0x101) under Error Codes section.

Oct 24, 2011
- Added Debugging in Process into blogs list in Links O' Plenty. Make sure to thank Reventon for this. :)

Oct 21, 2011
- Added an Introduction for those who are new.
- Added Uninformed Journal to site list in Links O' Plenty.

Oct 18, 2011
- Added CSI:Internet to list of Blogs in Links O' Plenty.

Sept 7, 2011
- Replaced the Mysteries of Memory Management with Mark's Webcasts, which is an entire collection of Mark Russinovich's videos.

July 29, 2011
- Added subgroups under Miscellaneous links. Slapped a few links for stuff that good ole JC provided. Thanks, mate!

July 25, 2011
- Added Videos to Links section. Added Mystery of Memory Management to new list.

July 20, 2011
- Mark Russinovich just released his book, Windows Sysinternals Administrator's Reference. Added it to list of books.

July 5, 2011
- Minor edits on General Concepts tips. Added link to a wonderful case example for tip 3.

June 28, 2011
- Added Alex Ionescu's blog to Blog links

June 23, 2011
- Corrected misnomer describing the !chkimg extension. Gave a more accurate explanation for it.
- In describing the WHEA error, I mentioned that it helpfully provides the "process" number. It is actually the processor number.

TABLE OF CONTENTS



INTRODUCTION


LINKS O' PLENTY

GENERAL CONCEPTS

WINDBG COMMANDS

ERROR CODES


DRIVER VERIFIER



_____

INTRODUCTION


Motive


The blue screen, aka BSOD (Blue Screen of Death) or, in properly technical terms, a bugcheck or stop error, and to many simply "My Windows crashed", has a tendency to be unnerving, and for the PC technician and sysadmin it can be very convoluted to figure out its cause. The material written on the screen can be extremely foreboding and intimidating, and if they even have the gall to install Debugging Tools for Windows and type !analyze -v, their brains turn to mush at the sight of its esoteric and utterly unintelligible jargon. Their only hope is to read the solitary line, "Probably caused by:" and be done with it, only to find it led them nowhere and they're back at square one of their troubleshooting venture.

Possibly, if you're reading this, you're an individual who has no intention of letting that get the best of you. You want to go above and beyond the normal procedure for investigating computer woes. Perhaps - even better - despite the headache !analyze -v gives you, you happen to admire it and even feel curious about what all of that gibberish means. Your desire is to become a fine PC troubleshooter and your passion is to learn the ins & outs of a computer. If any of this describes you, then the world of kernel debugging is for you.

Debugging


Kernel debugging is simply the process of analyzing with the intent of finding and squashing bugs, but on the kernel side of things. The kernel itself is the main component of an OS, whose responsibility is to be the middleman: to get your applications - aka "user mode" or "userland" - to work with your hardware, as well as with each other. So what you're dealing with isn't so much debugging or troubleshooting crashes out there in userland, but the deep down layer of how it all comes together and actually works in unison in a single environment. Crash dump analysis is merely a particular avenue of kernel debugging, in which you analyze, interpret and diagnose through the use of crash dump files.

BSODs


So what is a BSOD? Think of an application crash, where it stops the application and prompts you an error message with details. That's the user mode way of crashing, whereas the BSOD is no different, just on the kernel mode. Like the user mode version, it is the event where Windows recognizes there is a problem and there may be serious consequences if it lets things proceed, so it takes a snapshot of everything - called a crashdump - and then stops everything in its tracks, showing you the blue screen with info on the cause.

Understand that the word "recognize" is the keyword here. Windows recognizes an error and responds. Let me rephrase that: A crash doesn't occur when a problem occurs, only when Windows finds one. It is imperative that your mind wraps around this. What it means is that the crash may or may not occur at the right time! The bug may have started working its ugly magic on things and it only ended up festering to a point where Windows finally sees it and reacts.

This is the general reason why it's not as easy as simply running a crashdump through !analyze -v and getting the exact problem with the exact cause. The culprit may have scampered off never to be seen, and when an innocent comes to the scene and picks up the murder weapon, the cops - aka Windows - charge in and catch them "red-handed". It may also be a case where an accident - aka hardware problem - occurs and all the cops can blame is whoever was at the scene at the time. There are numerous ways things can go awry when it comes to analyzing BSODs, but when one realizes why the BSOD happens, they have the knowledge to press on to find the real answer, beyond what !analyze -v wants to tell them.

You


Therefore, with all that in mind, it is really the responsibility of a person to do the actual detective work of figuring out the case. You have plenty of help, however, with various tools and data to work with. Why, the BSOD wasn't made to scare consumers, but rather to help you, as it provides as much information as it can for you to dive in. Still, none of these resources do any good without the knowledge to utilize them. Not only that, but you're also dealing with the bridge that links hardware to software, so it is eventually necessary to get a good grasp of both in order to properly conduct analysis for BSODs. There's a lot to learn, and it is certainly bewildering (especially when you whip open the book Windows Internals for the first time), but a motivated individual who already is savvy with PCs should be able to approach it, slowly but surely.

This thread


I hope the contents here prove beneficial and can be used as a springboard for anyone to be able to approach this level of troubleshooting. It is not designed to be a tutorial from start to finish, but as a place to go that will make finding answers easier. Because believe me, with as technical as this field is, you will have plenty of questions.



_____

LINKS O' PLENTY



Here's a list of links to articles, websites and whatnot that I personally used to help me explore the wild mysterious world that is Windows internals, troubleshooting and debugging. Hopefully this will help guide you to better understand some things. Like other sections, this will be updated as often as I find more resources (or remember any) that are pertinent.

BOOKS


Windows Internals - The Book. An absolute necessity. This along with the WDK will be your primary resources for debugging. You cannot get far without it, and it will make things a whole lot easier for you in the long run. Recommended you have a digital version available for quick and easy searching. It is extremely thorough, yet laid out with beautiful finesse, and is very easily digestible and explains things with very little assumption on your previous knowledge. The newest version at the time of this writing (6th Edition; Windows 7) comes in two parts, so make sure you're aware of that. First part covers some basic elements while the second part gets deep into the OS kernel functionality.

Note (x BlueRobot): The latest version of Windows Internals is now the 7th Edition (again in two parts) which covers Windows 8.x and Windows 10.

Advanced Windows Debugging - A great companion in helping to understand various debugging techniques. A lot more can be extracted from this book once you get the hang of debugging.

Windows Sysinternals Administrator's Reference - While not exactly BSOD-related, this is a book that is pretty much imperative to any PC technician that isn't already fluent with the Sysinternals suite of utilities. Knowledge of these tools will sharpen your command of the Windows OS, and no hang, error, sluggishness, or other various ugliness in Windows will evade your troubleshooting prowess.

The Linux Process Manager - Linux? In my Windows debugging article? It's more likely than you think. Regardless of the differences between Linux and Windows kernels, knowledge of OS design in general is especially crucial in understanding all the data Windbg throws at you. While I personally have not yet had the opportunity to read this, I've heard through the grapevine that it is second to none in understanding how thread scheduling and other thread/process management systems operate in OSes. Be warned: it is very code heavy, but it still pushes a lot into explaining design.

The IDA Pro Book - Often when you go debugging you'll have to end up disassembling code for various and sundry reasons. While you will be doing it most frequently with Windbg, IDA Pro is also known as the de facto standard in disassembly software. More than that, this book in particular not only covers IDA Pro usage for disassembly, but it also covers a lot on the basics of disassembly in general, so it is great for beginners. While IDA Pro in itself is not a cheap item to obtain, it has a reduced functionality version available for free (one version below newest, and no kernel debugger).

Memory Dump Analysis Anthology - Patterns, patterns, patterns. Often debugging can be expedited and difficulty reduced when one works through detecting patterns in data. This is especially true when analyzing minidumps, because oftentimes patterns are all you have, and you can't deep dive to figure out more. This one linked (Volume 1), along with the 4 other volumes after it, serves as an exhaustive and comprehensive reference to explain how to detect various and sundry patterns of all the assortment of faults that can cause BSODs and application crashes. He then goes on to explain a bit on how to analyze them further to find cause and other important bits of information for your analysis. If his writing is anything like the other books or his site that I've read, it can be kinda difficult to work with as it's very cut n dry because of the whole language barrier thing, but it's still readable and won't cause you much headache. If anything, he simplifies nomenclature a lot, which will give those new to debugging a better chance at understanding his explanations. Regardless, these are gems, and will serve very well in increasing your knowledge and exposure to debugging.

Inside Windows Debugging: A Practical Guide to Debugging and Tracing Strategies in Windows - Patrick - Along with Windows Internals, Advanced Windows Debugging, etc, this particular book in my opinion is a must have regarding debugging. It goes into Kernel Mode & User-Mode specifics (application processes, system processes, low-level communication mechanisms, NTDLL, USER32, etc), Kernel Mode debugging, User-Mode debugging, live debugging, remote debugging, postmortem debugging, breaking-in, breakpoints, access violations, heap corruptions, stack corruptions, stack overflows, handle leaks, memory leaks (Kernel/User-mode), deadlocks, system calls, Xperf, and so much more.

Honestly, to me, this book is right up there in terms of usefulness with Windows Internals. In fact, I think I like it a little more (in terms of debugging, of course).

Practical Reverse Engineering: x86, x64, ARM, Windows Kernel, Reversing Tools and Obfuscation - x BlueRobot - This is a must-have book for anyone who is interested in debugging. The book explains the x86, x64 and ARM architectures, along with assembly patterns for common control-flow logic. It also covers stack operations and some address translation. The book briefly covers some Windows Internals as well.

What Makes It Page? The Windows 7 Virtual Memory Manager - x BlueRobot - While the book is technically written for Windows 7, not much has changed since then, and this book covers practically everything about memory address translation at an operating system level. This is a great book in my opinion and well worth the read. It covers the TLB cache, working sets, VADs and page protection, as well as memory mapped files. It fills in the gaps that the Windows Internals book wasn't able to cover in terms of memory management.

Windows Kernel Programming - x BlueRobot - From a co-author of Windows Internals: Pavel covers driver development from start to finish. By the end of the book, you'll have a basic file system filter driver. The author assumes the reader has very little prior knowledge of the Windows kernel and/or driver development; however, please note that some foundational knowledge of C++ is required.

Programming the Microsoft Windows Driver Model - x BlueRobot - This book is technically written for Windows XP and is now unfortunately out of print, so you'll have to rely on getting a good second-hand copy if possible. You may be thinking, why bother with such an old book? It's written for Windows XP. I actually wasn't aware of this book until I read Windows Kernel Programming, which recommends purchasing it if you wish to learn all the aspects of driver development that its author wasn't able to cover. As mentioned before with the memory manager, not much has changed since Windows XP regarding the WDM apart from some of the APIs, and thus I can't recommend this book enough. It's a great addition to the Windows Internals book.

BLOGS


Mark's Blog - Mark Russinovich offers incredible insight on troubleshooting various crashes, hangs and other conundrums in Windows, primarily by using the Sysinternals Suite. Utilizing a down-to-earth approach, any power user of Windows or PCs can turn PC technician by learning from this guy's blog and the handy dandy Sysinternals Suite.

NTDebugging Blog - An advanced debugging and troubleshooting blog from Microsoft's finest in the field. While initially you may not understand much of what's being said, it's a good idea to grab one of the more 'generic' blog articles and google search & research anything in it that confuses you. You'll end up digesting the articles a lot easier. Plus they have some articles that cater towards debugging novices so you can get your bearings.

Analyze-V - Scott Noone offers a bunch of information on a variety of debugging scenarios, as well as numerous items on Windbg commands and applications. Definitely worth perusing.

Alex Ionescu's Blog - Alex Ionescu is the co-author of the Windows Internals books, so he's definitely got the know-how. His blog hasn't been updated since the release of the latest book (2009), but it's still got a number of articles that are worth a look.

CSI:Internet - Unlike the show's critically acclaimed technical accuracy, this web gem has 2 series which features scenarios of real experts as they diagnose, analyze, scrutinize and nullify nasty malware such as rootkits. They are very down-to-earth with their descriptions and yet the amount of information they provide is a very beneficial lesson towards understanding the internal mechanics of Windows and how things work in the OS environment. This is a good read for everyone who wishes to delve into the realm of debugging and advanced PC troubleshooting.

Debugging in Progress... - Reventon - This is looking to be a very promising asset to anyone coming across familiar crashes that would otherwise induce headaches. Sometimes a crash problem is a simple hotfix away but you don't know what to get and where. This guy goes through the trouble of mentioning that. Unfortunately he does not explain how he reached the conclusion most of the time, but it's still valuable as a reference for hotfix solutions to crashes.

Of Filesystems And Other Demons - A good solid blog on the inner workings of all things I/O, filesystem, and filter driver related. The guy provides explanations on a lot of things and is very in-depth. More advanced learning here.

VIDEOS


Mark's Webcasts - Mark Russinovich has videos covering a wide variety of Windows-based materials, from troubleshooting using Sysinternals to debugging scenarios to explaining the nitty gritty of Windows core operations like memory management. These will greatly assist even new PC techs into expanding their expertise. Best of all, they're all free (except the ancient Sysinternals Video Library located at the bottom).

Debugging Heap Memory Corruptions - Shintaro - Tarik Soulami, the author of a new Windows kernel debugging book, has decided to bless us with a look into debugging heap memory corruptions. While heap memory deals with userland (applications and services) and not driver and kernel area, nothing hurts learning about that other half of the Windows environment - you know, the one that is the most used? Figuring out why an application decided to spontaneously bug out or cause slowdown by sucking up memory is kinda important, and this video aims to cover that aspect. Plus it gives you a bit of a look into the userland brother of Driver Verifier, named Application Verifier.

Defrag Tools - mgrzeg - A solid show that covers in-depth details on certain debugging and troubleshooting aspects using some of the best tools out there, from Windbg to Sysinternals Suite. Very valuable stuff.

Defrag - The show that Defrag Tools spun off of, this show covers various user-requested troubleshooting and explains a bit how to approach each. Do you remember Call for Help on TechTV? This show basically operates in that manner. While some of the tips are rather basic troubleshooting, there are some subjects you can glean info on about deeper topics. They don't cover much on each item, however, but again it's good for grabbing tidbits of info on all sorts of items.

TechEd 2013 - The TechEd 2013 site has all sorts of videos that cover details on debugging and troubleshooting methods, from the simple, like getting started with Sysinternals tools in their Sysinternals Primer video, to getting down deep in Hardcore Debugging. Check the list of videos noted as relevant to the cause!


WEBSITES


MSDN - Naturally this is a must. An online resource providing anything in the WDK help documentation and much more. If a routine in a stack confuses you, or you need a better understanding on how Windows handles memory, this and more is explained thoroughly in the MSDN and WDK help documentation.

Crash Dump Analysis - Crash dumps often generate patterns in behavior that can be discerned. To help figure out these patterns and the best way to approach solutions from them, this website will work wonders. I also personally am eager to afford getting the guy's Memory Dump Analysis Anthology set, which is as comprehensive as you can get on crash dump patterns.

OSR Online - While the articles on this website aren't posted often, the forums are still active and the articles and their NT Insider are just chock full of delicious debugging goodness. A good bit is directed towards driver developers, but you should still be able to scrounge up a plethora of info on stuff from this website.

Uninformed Journal - A technical research journal that delves into various reverse engineering stuff. Very advanced. A couple of good ones I personally found to be easier to read and very relevant to BSOD analysis are Introduction to Reverse Engineering Win32 Applications and Improving Automated Analysis of Windows x64 Binaries. They give you the lowdown on general analysis tips, and the x64 one covers how to read code flow and callstacks for x64 systems.

CodeMachine - A sweet site containing various articles. I especially love their most recent one that explains various data structures, as well as their more famous x64 Deep Dive article which helps clarify things on x64 stuff.

KernelMode - Patrick - An advanced forum that has a few faces such as malware analysis, reverse engineering, debugging, Windows development, etc. Most of it as I noted is advanced malware analysis and/or debugging (not so much focused on BSOD's), however, it's still useful to read for commands, and just information in general. You never know what you'll learn!

DATABASES


Pool Tag DB - Kernel and drivers use pool memory for storage much like userland apps/services use heap. For tracking purposes these allocations will often be marked by drivers by a 3/4-letter abbreviation describing the purpose behind the allocation. These can therefore help us determine origins for pool allocations, but it doesn't help if we can't interpret the abbreviation. Therefore, a database such as this one is very appreciated. It even has a pooltag.txt file you can download which you can put in Windbg directory and it'll give additional details for any pool tags it finds identical to what's in the DB. Sweet!
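To give an idea of what a pool tag lookup amounts to, here's a minimal Python sketch. It assumes each line of pooltag.txt follows a "Tag - binary - description" layout with "rem" or "//" comment lines; check that against your own copy before relying on it. The sample entries and the parse_pooltags name are made up for illustration and are not taken from the real file.

```python
from typing import Dict, Tuple

# Hypothetical sample in the assumed "Tag - binary - description" layout.
SAMPLE = """\
rem illustrative excerpt only, not the real pooltag.txt
Abcd - example.sys - hypothetical buffer allocation
Xyz  - <unknown> - hypothetical three-letter tag
"""


def parse_pooltags(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each pool tag to a (binary, description) pair."""
    tags: Dict[str, Tuple[str, str]] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue  # blank line or // comment
        if stripped.split()[0].lower() == "rem":
            continue  # rem comment
        parts = [part.strip() for part in stripped.split(" - ", 2)]
        if len(parts) != 3:
            continue  # not in the assumed layout, skip it
        tag, binary, description = parts
        tags[tag] = (binary, description)
    return tags


tags = parse_pooltags(SAMPLE)
print(tags.get("Abcd", ("?", "tag not in the database")))
```

Windbg does this lookup for you once pooltag.txt is in place, so the sketch is only meant to show how the tag, the owning binary and the description relate.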

PCI Database - Unknown device in Dev Manager that you can't figure out? Not sure on a VEN/DEV for a particular piece of PCI hardware? Let your worries be alleviated with this robust database. Just enter in the VEN or DEV for the device you found (DeviceManager > Properties window > Details tab > Hardware IDs) and get the nitty gritty on that mysterious item!

PCI Repository - Alternative to the PCI Database. Doesn't have a search function. Search here if you can't find in the previous listed site.

USB Repository - Like the PCI version but caters to USB devices and their respective VID/PIDs.
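The vendor and device numbers these databases ask for sit inside the Hardware IDs string. Here's a rough Python sketch that pulls them out, assuming the usual VEN_xxxx&DEV_xxxx (PCI) and VID_xxxx&PID_xxxx (USB) layout with four hex digits each. The sample IDs and the parse_hardware_id name are invented for illustration.

```python
import re
from typing import Dict

# Four hex digits each for vendor and device, as commonly seen in Hardware IDs.
_PCI = re.compile(r"VEN_([0-9A-Fa-f]{4})&DEV_([0-9A-Fa-f]{4})")
_USB = re.compile(r"VID_([0-9A-Fa-f]{4})&PID_([0-9A-Fa-f]{4})")


def parse_hardware_id(hwid: str) -> Dict[str, str]:
    """Return the bus type plus vendor and device IDs found in a Hardware ID string."""
    for bus, pattern in (("PCI", _PCI), ("USB", _USB)):
        match = pattern.search(hwid)
        if match:
            return {
                "bus": bus,
                "vendor": match.group(1).upper(),
                "device": match.group(2).upper(),
            }
    raise ValueError("no VEN/DEV or VID/PID pair found in: " + hwid)


# Made-up example IDs, not real devices.
print(parse_hardware_id(r"PCI\VEN_1234&DEV_ABCD&SUBSYS_00000000&REV_01"))
print(parse_hardware_id(r"USB\VID_1234&PID_5678"))
```

Feed the vendor value into the database first; once the vendor is known, the device value narrows it down to the exact part.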

Carrona Driver Reference Table (DRT - John Carrona/usasma) - Patrick - The holy grail to us debuggers, truly. Are you debugging a crash dump and see a driver that you don't recognize and/or know? Check here. 99.9% of the time, unless it's a new and/or obscure driver, it's on this list. If it's not on the list, please do your best to remember to submit it to the DRT. It is actively checked and it will be added as long as you submit the driver.

Sysnative Driver Reference Table (DRT) - Patrick - Same exact as above, but a mirror. Out of habit, I personally check John Carrona's DRT primarily, but if it's ever down (rarely), I'll go to Sysnative's.

MISCELLANEOUS



Applications


Task Manager - JCGriff - Ole Task Manager. While I expect everyone here to be using and encouraging use of Process Explorer, this article generally covers stuff that you'll see in both, so it's a decent reading regardless.

NotMyFault - This is a link directly to a zip file containing NotMyFault. It is an exe with its own driver whose sole purpose is to cause BSODs in various and sundry ways. Use it to experiment and test your BSOD analysis prowess. Try going further than just looking for the myfault.sys driver in the callstack and see exactly how and why it messed things up! Great teaching tool! Remember, don't use it on a production system! Note that the zip file contains source code. The actual executable is located in two subdirectories, \exe\Release and \exe\x64 for the 64-bit version.

Debugger Extensions
(add-ons for Windbg)

cmkd - This awesome fella has a couple of neat features. My favorite is the !stack command it gives you, which with the -p argument will analyze the callstack for the thread you specify (or current thread context) and figure out the parameters given to each function, and the -t argument included with -p will explain to you how it retrieved each one. This is great for seeing what data is being passed from function to function. It's also great as a learning tool for understanding x64 callstacks, thread stack construction and parameter passing/saving, as it gives you an idea how it walks the raw stack and disassembled function code to see where it got the parameters. Like any automated analysis, it isn't 100% guaranteed, but it sure saves you a buttload of time and eases the process of walking callstacks immensely. Don't forget the other goodies it has, too!

procdumpext - Created by Andrew Richards from the Defrag Tools show. It's hosted on his personal Skydrive, so the link may break later. Oh, but man, this thing is a beauty. There are some extensions like !grep, which parses output like your typical Linux grep command, and !dpx, which does all the manual labor of finding anything worthwhile in a raw thread stack. I nearly wept when viewing its output the first time. Very good stuff. Type !procdumpext.help to get a list of commands and info.



Intro to x64 Debugging - 4 part series of articles on this guy's blog pertaining to x64 debugging. In many cases debugging x64 stuff is different and in some cases easier (and some cases more difficult) than x86. This guy helps explain that. He might also have some other nuggets on debugging in his blog.

Debug Tutorial - CodeProject offers a nice debugging tutorial provided by Toby Opferman. You may wanna skip to part 2 since first part is related to user-mode debugging. This tutorial covers essential items such as the stack, so that you can understand how to debug properly by interpreting what you're looking at in Windbg.

Memory


RAM, Virtual Memory, Page file and all that stuff - JCGriff - Simple explanation of memory usage and application. Make sure to check the links at bottom of the page for more in-depth articles.

General Windows Info - RAM, Virtual Memory, Pagefile and all that stuff - JCGriff - An expanded variation of the MS Support article listed above.

Memory Performance Info - JCGriff - Details on what structures and whatnot to hook onto for performance counters in both Performance Monitor usage as well as debugging purposes. Powerful info on setting up a PC to analyze memory usage.

Mark's Blog: Pushing the Limits - JCGriff - Mark Russinovich's blog comes up with a series on memory and other related resources. This one directly points to the third entry, but all of them are well worth the read.




_____

GENERAL CONCEPTS


Debugging Tips


1. Windows code is usually not the problem - Despite the common misconception that Windows is the buggiest OS alive, in reality most of the time it's 3rd-party drivers getting in the way. Kernel code is often not responsible for a crash. So if you see a module like nt, tcpip.sys, or any other Windows code having the finger pointed at it, it's almost always a wrong diagnosis, or it's accurate in that the Windows module faulted and caused the crash, but only because it caught something going wrong. Still, check the cause of the crash thoroughly.

2. 3rd-party drivers in callstack aren't always the problem - This is a classic example. Typically this applies to anti-virus drivers showing up. Often removing/replacing an AV program is the solution, but it also usually is not the problem either. Considering how they work, when a program bugs out, the AV program might be blamed. Take care diagnosing when the only 3rd-party driver mentioned in a stack is an AV driver. Don't immediately jump the gun thinking it's responsible for the bad behavior. This also generally applies to any 3rd-party driver on the stack. Do not immediately assume it did something ugly when it's being blamed. Analyze further to discern what happened and why it happened.

3. Discover Patterns - This is a bit redundant given it's what many do here already, but just to make sure everyone is aware, finding patterns in crashes is very effective in determining cause. Also be aware that stacks with no pattern are a pattern too. If you see crashes all over the place with different stacks from different programs, you can start to guess it's hardware like memory issues or that a driver function has a bad pointer and is pointing here, there and everywhere. It's not as much info as finding an actual precise pattern, but it still reduces possible causes.

4. Do not fully trust all stacks - If you see "Stack unwind information not available. Following frames may be wrong." that means that the analysis engine either didn't have symbols for a module in the stack or the stack is corrupt, and either way it ends up having to make a best guess. Finding this message in the middle of a stack can help pinpoint where a 3rd-party driver is, but don't be sure that it's the right stack. Determining if it is requires further evaluation and prior knowledge of stacks and whatnot, so if you can't look at a stack and figure out if it's healthy or not when it has the "stack unwind info not available" message in it, be wary of using it to diagnose issues. I'd recommend asking for help from someone who can, or disregarding it altogether. The same goes for oddball stack entries with no symbol or module names (like one with the number 0x654 in it). These usually mean the stack is trashed or something tripped up the stack unwind and had it running in the wrong direction.

5. Interpret, Verify, and Diagnose, in that order - It's easy to jump straight to diagnosing an issue just by the readout given by !analyze -v from a crashdump someone gave you, but be careful. Many things can prevent the data you and the analysis engine see from being accurate and will lead you in the wrong direction. You must first be able to interpret what you see in the first place so you can understand it, then you should verify that the data given is legit and can be used to diagnose the problem. If you fail to follow all three steps in that order, both your conclusions and assumptions will be misleading, and end up causing the user to do stuff they shouldn't have to do to fix the problem, often finishing unsuccessfully.

6. The Scientific Method Works - Observe. Hypothesize. Test. Examine. Conclude. If it worked for science (save origins science) then it can work for debugging. While you don't want to be too formulaic, a good orderly approach will prevent you from running all over the place with a crashdump and requesting data from the client that's irrelevant or, worse, requesting changes to their system that are unnecessary and detrimental. This goes right up there with the previous tip. It's necessary to have a structured method for debugging, and the scientific method is a very solid means of doing it. The step people most often skip isn't the examining portion, as many tend to initially assume, but the hypothesis step. Most people will just collect data after data and rummage through crashdumps and logs trying to find something relevant, when they really should start by making an educated guess at the cause from the client's description of their situation, the initial !analyze -v, and the logs, and then work through the data to find out if that's true or not. If it isn't, refine the hypothesis and start over. Doing it this way will save a load of stress and time both for you and the client.

7. Need more data? Ask! - Don't start and stop just at whatever crashdump(s) you have available at first. If your hypothesis on the cause isn't satisfactorily answered (be it "correct" or "incorrect") with whatever data you have now, don't hesitate to ask for more! Too many times I see people try hard with what the client first gives them, then because it's not enough they throw their arms up and ask the client to do all sorts of random things, therefore losing structure in their debugging approach and just going after everything to find an answer. Work with what you got as much as you can, but if there comes a time where a crash occurred too late (happens often) or a crashdump doesn't have enough info (i.e. minidump) then sit down and contemplate what can be done to get you the best data possible, and then continue from there. Remember this: quality surpasses quantity. It's neither fair for the client nor for your headache-addled brain to ask for copious amounts of data just to end up with 90% of it (and sometimes 100%!) being worthless and you having had to deal with sifting through it all to come up short! Sometimes even a kernel dump will do nothing for you when it crashed too late, and often times even a minidump from a Driver Verifier crashdump will suffice!

8. Avoid the Caveman Approach - Swapping hardware. Uninstalling software. Changing various settings. Anything that alters the environment with the intent not to resolve an issue but to find an answer is never wise. You often will cause problems for the client later on, and it'll often leave them dissatisfied with a messy PC in the end, even if the problem does somehow manage to get fixed after it all. Granted, there are those desperate times where your knowledge runs short or resources are sparse and there doesn't seem to be any other option, but too often people resort to caveman approaches way too early, effectively giving up on a proper diagnosis and just deciding to go hog wild and make a lot of noise in the process. Ask others for assistance (or knowledge), run tests instead of changing things around, acquire more data, re-evaluate the situation. Do whatever it takes before having it come to this. Though in all honesty, you're probably better off swallowing your pride and saying "I cannot help you" than making a mess of their PC in a vain effort to close the case. If you're determined to go this route, ask them first and explain the ramifications involved before proceeding.


Learning Tips


1. Blogs - Stuff like the Ntdebugging blog and Scott Noone's Analyze -v are great places to study cases from professionals that have dealt with it before. If you desire to witness how a professional handles the same BSODs that you come across (and more) then blogs are an excellent source. Google them, as there are always blogs floating out there with single entries where people have had to debug some code and explain how they've done it.

2. Books - Outside of the great Windows Internals, there's plenty more material out there to cover. Remember, there's more to analyzing BSODs and hangs than just figuring out Windbg. There are many facets to debugging that you need to understand in order to gain a better grasp of the information Windbg throws at you, and there's plenty of material on each of them that'll help you through. Are you having trouble understanding Assembly? Read up on stuff about ASM like Art of Assembly. Can you grasp the gist of the code but have a hard time walking through it? Read up on reverse engineering material and tutorials. Debugging is a little more than the sum of its parts, and fortunately, while little is written about kernel debugging in itself, there's plenty on each part that'll help expedite your approach.

3. Windbg Help Manual - Sometimes when I have spare time I just open up Windbg's Help Manual that comes with it and peruse all the commands (prefixed with a period '.') and extensions (prefixed with a bang "!") and read up on each one. This helps me two-fold: I learn more about Windbg and its vast amount of tools to get the job done quickly and painlessly; when I come across one that covers data about mechanics I don't understand (e.g. what are arbiters from !arbiter output?) then I can whip out Windows Internals as well as google it to get a grasp of it. Overall I learn more about Windows and debugging and I also learn more about the debugger I'm using. I cannot tell you how many times I've been stumped trying to find the data I seek from a crashdump when later on I read up on the help manual and find an extension that would've given me the answer right there, or at least would've been a great leap to finding it. There's more to Windbg than just the occasional !analyze -v and lm. Explore!

4. Mess Around - Grab NotMyFault from the Sysinternals site and give it a whirl! Set up a VM or some extra computer and let that application crash and hang it till it's a mess. Then take those crashdumps, examine them, and figure out just what happened to make each crash/hang possible. Use the opportunity to refine your approach for various crashes and extract more pertinent data from each. You'll know you've reached a solid level of understanding when you can comfortably (somewhat) navigate your way all the way to the exact code that messed things up, and/or know just about everything on what faulted and how exactly it came to that state. Also, don't forget to read the source code for the MyFault driver, as it has comments that'll explain what it's doing to cause each problem. When you have firm knowledge of a crash where you know how it happened, you'll have an easier time figuring out those where you initially don't.


Contexts


When dealing with registers, callstacks, etc., basically the entire environment you work with as you debug is the context of what you're working in. For example, in a crashdump, you will start off in the context of the latest thread and process that has taken place - that is, the actual bugcheck code. But to actually see what was going on at the actual time of the crash, you have to change the context to that environment. Look at this example snippet of !analyze -v:

Code:
TRAP_FRAME:  fffffa60089b4b50 -- (.trap 0xfffffa60089b4b50)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffffa8028ead010 rbx=0000000000000000 rcx=fffffa8028ead010
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
rip=fffffa6008c4d718 rsp=fffffa60089b4ce8 rbp=0000000000000080
r8=fffffa802b379080  r9=fffffa60089b4d50 r10=fffffa8028ead010
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
<Unloaded_TmPreFlt.sys>+0x4718:
fffffa60`08c4d718 ??              ???
Resetting default scope

Notice the register values. These are the values captured at the time of the erroneous instruction. Now, after the !analyze -v, try doing an r command to get the registers again:

Code:
2: kd> r
rax=fffff6fd30046200 rbx=0000000000000000 rcx=0000000000000050
rdx=fffffa6008c4d718 rsi=fffffa80273ed040 rdi=0000000000000000
rip=fffff80001e54690 rsp=fffffa60089b4a58 rbp=0000000000000000
r8=0000000000000008  r9=fffffa60089b4b50 r10=fffff6fb7dbf4c00
r11=00000000000001f4 r12=fffff6fd30046211 r13=fffffa6008c4d718
r14=fffff80001fc6220 r15=fffff80001fc6220
iopl=0         nv up ei ng nz ac po cy
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000297
nt!KeBugCheckEx:
fffff800`01e54690 48894c2408      mov     qword ptr [rsp+8],rcx ss:0018:fffffa60`089b4a60=0000000000000050

Wait a minute, something's not right here. The registers are different, and the latest instruction it shows is KeBugCheckEx! This isn't what was present during the crash! Rather what we see here, is that this is the context of what occurred at the very end of the bugcheck: when you see the bluescreen and it dumps the crashdump file. What we saw in !analyze -v before, was the context of everything as it was (more or less) when the faulting instruction (in this crashdump's case, <Unloaded_TmPreFlt.sys>+0x4718: fffffa60`08c4d718) occurred.

Context changes occur frequently during PC operations. A computer switches between threads very often while they're in the middle of working, so when a thread gets to work again, it has to have all its data right there as it was before another thread came in and interfered. As such, it needs its context saved so that when it works again it can continue normally as if it was never interrupted. Sometimes a thread might want to be interrupted by another thread that will alter the original thread's context with the results of work the new thread did. This can be done legitimately (expected), or not (unexpected). Whatever the case, context switches happen frequently, and it's the job of the OS and the CPU to keep these contexts saved.

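As such, you too can change contexts with the debugger. In the example above, the TRAP_FRAME line of !analyze -v already shows the command to use: .trap 0xfffffa60089b4b50 switches to the context saved in that trap frame. If you want further explanations on what contexts you can change, how to change them, etc., you can refer to the help documentation for Windbg. Look up context.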


Processes


Just about anyone that has used a computer for an extended period of time will come across this name, and many of them think it's a program or application. What they don't realize is that this is a misnomer. While a process is often associated with a program or application, and it does hold an image of the executable running the whole thing, it is not the one doing all the work. In actuality, a process does no work at all. Rather, the threads associated with it do. A process is merely an environment which threads run in. It is a local area in which various and sundry items, including the threads themselves, are stored for use by the threads doing all the work. This is akin to a business, in which you see the building, and may associate all its products/services with that building and the name of its business, but the work is really being done by all the people inside the building. While this sounds like being pedantic with definitions, it is very important to understand this, otherwise you will make dubious assumptions.


Threads


The worker bees of a process. They are what make things happen on a computer. They do not have to be within the process environment to perform functions for that process. This is often true when a thread inside the process calls for a worker thread in Windows to do work for it. This is much like a company (the process) outsourcing some of its services and operations to another company (Windows kernel/service). While the threads inside the process are doing all the major work, they have the option of calling worker threads to do all the repetitive and menial tasks. Oftentimes you may see this in a crashdump where a worker thread appears to have caused the crash, but in reality it's the work that it's doing at the behest of the culprit process that's really at fault. This is impossible to diagnose without at least a kernel dump.


Code Flow


The direction in which all operations are going. The basics of it are that you have a thread, and in that thread you have an initial function responsible for a specific task. In order to accomplish it, it will need the assistance of other functions.

Much like a product being built by a company, you have it go through various hands and sometimes even various places before the finished product is made. Tack on also all the other personnel responsible for various indirect duties to supply the needs of those actually creating the product. If you had the same person/people doing all the tasks, then things slow to a crawl and you're lucky to even have a satisfactory result in the end.

That's much like what goes on in your typical code flow. It is not enough to have "one function to rule them all"; that's just daft. Rather, you start with the initial function, like say, one for drawing a popup window. There's a multitude of facets to this job, so the initial function will call another function to do something, like telling DirectX to draw the window, and DirectX will figure "ok, but with what?" and then pass that request to another, and that to another, and so on and so forth. How things fragment and flow through this whole chain of events is the essence of code flow.



_____

WINDBG COMMANDS




?


Need to do a quick calculation but don't want to whip out a programming calculator (like on Windows 7)? Use this little fella. Remember, numbers by default are hexadecimal unless you specify otherwise. Use 0n for decimal, 0x for hex, 0t for octal, and 0y for binary. For example, binary for 7 should be written as 0y111.

Code:
0: kd> ? 16+30
Evaluate expression: 70 = 00000000`00000046
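
If the hex-by-default rule is tripping you up, here's a rough Python sketch of how the example above works out. The function name parse_windbg_number is made up for illustration, and it only handles plain numeric tokens with the prefixes listed above; the real evaluator also deals with registers, symbols, operators and more.

```python
# Rough illustration only: parse_windbg_number is a hypothetical helper,
# not part of any debugger API.
def parse_windbg_number(token: str) -> int:
    """Read a numeric token the way the text describes: hex unless prefixed."""
    prefixes = {"0n": 10, "0x": 16, "0t": 8, "0y": 2}
    lowered = token.lower().replace("`", "")  # drop the 64-bit separator
    for prefix, base in prefixes.items():
        if lowered.startswith(prefix):
            return int(lowered[len(prefix):], base)
    return int(lowered, 16)  # no prefix: treat as hexadecimal


# "? 16+30" means 0x16 + 0x30, not sixteen plus thirty.
total = parse_windbg_number("16") + parse_windbg_number("30")
print(f"Evaluate expression: {total} = {total:#x}")  # 70 = 0x46
print(parse_windbg_number("0y111"))                  # 7
```

So 16+30 comes out as 70 decimal because both operands were read as hex (22 + 48). If you actually meant sixteen plus thirty, you'd type 0n16+0n30.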
__

@


Not really a command or extension. You use it to express that you want to use the value stored in a register. Say you know the rcx register's value is a memory address, and it points to some data. Instead of having to look at the register, copy the value it has, and paste that into, like, db, you can instead just do db @rcx. You can use this for any extension or command. Remember to be in the right context!

Code:
0: kd> db @rbx
fffff800`02c05e80  80 1f 00 00 00 00 00 01-40 3c c1 02 00 f8 ff ff  ........@<......
fffff800`02c05e90  00 00 00 00 00 00 00 00-40 3c c1 02 00 f8 ff ff  ........@<......
fffff800`02c05ea0  01 00 00 00 00 00 00 00-b0 fd 02 04 00 f8 ff ff  ................
fffff800`02c05eb0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
fffff800`02c05ec0  31 00 05 80 00 00 00 00-38 26 af 00 00 00 00 00  1.......8&......
fffff800`02c05ed0  00 70 18 00 00 00 00 00-f8 06 00 00 00 00 00 00  .p..............
fffff800`02c05ee0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
fffff800`02c05ef0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

__

0: kd> ? @rcx+10
Evaluate expression: 273 = 00000000`00000111
__
2: kd> u @rip
nt!RtlEnumerateEntryHashTable+0xbf:
fffff800`030c1c25 48897908        mov     qword ptr [rcx+8],rdi
fffff800`030c1c29 498939          mov     qword ptr [r9],rdi
fffff800`030c1c2c 488b7c2410      mov     rdi,qword ptr [rsp+10h]
fffff800`030c1c31 c3              ret
fffff800`030c1c32 483900          cmp     qword ptr [rax],rax
fffff800`030c1c35 740a            je      nt!RtlEnumerateEntryHashTable+0xdb (fffff800`030c1c41)
fffff800`030c1c37 483912          cmp     qword ptr [rdx],rdx
fffff800`030c1c3a 75cf            jne     nt!RtlEnumerateEntryHashTable+0xa5 (fffff800`030c1c0b)
__

d


Means simply to dump data. There are so many variations of this command that you're better off looking at the Windbg help manual for details. In general though, it just takes the address (or address range) you give it and dumps the data in that memory range in the format you specify. For example, dc will dump separate double-word values as well as present any ASCII characters they may represent. Good for finding strings of text in memory. Another very popular one is dps, which will dump the data as pointer-sized values (32-bit or 64-bit, depending on the architecture of the system the dump, or live debugging connection, came from), and also will show any symbols it finds for those values, making it a perfect fit for thread stacks. Again, lots of options here, so take a look at the help manual and test em out.
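
To get a feel for what the different d variants are doing with the same bytes, here's a small Python sketch that regroups raw bytes into double-words plus ASCII, roughly in the spirit of dc. The function name dump_dwords is made up, the output layout is simplified (no backtick in the address, for one), and it assumes little-endian data as on x86/x64. The sample bytes are the first row of the db @rbx output shown earlier.

```python
import struct


def dump_dwords(data: bytes, base_address: int = 0) -> list:
    """Format bytes as little-endian 32-bit values with an ASCII column."""
    lines = []
    for offset in range(0, len(data), 16):
        chunk = data[offset:offset + 16]
        usable = len(chunk) - len(chunk) % 4  # ignore a trailing partial dword
        dwords = struct.unpack(f"<{usable // 4}I", chunk[:usable])
        values = " ".join(f"{value:08x}" for value in dwords)
        # Printable ASCII is shown as-is, everything else as a dot.
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{base_address + offset:016x}  {values}  {text}")
    return lines


# First row of the db output from the @ section above.
sample = bytes.fromhex("801f000000000001403cc10200f8ffff")
for line in dump_dwords(sample, 0xFFFFF80002C05E80):
    print(line)
```

Notice how the last two double-words (02c13c40 and fffff800) are really the two halves of one 64-bit pointer, fffff800`02c13c40. That's why a pointer-sized dump is the better choice when you're walking a stack on a 64-bit system.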
__

dt


Means Display Type, which is used to display data types, most often structures. To keep things orderly, any kind of data that needs to be retained is usually kept in some type of structure. With the proper symbols, one can dump that structure in a format that can be easily read by human eyes. Give it the symbols to use and the starting address of the actual data and it'll parse the data based on the symbols you specified. For example, we'll use the PRCB, or Processor Control Block. The easiest way to find it is by merely typing !prcb:

Code:
1: kd> !prcb
PRCB for Processor 1 at ffdff120:
Current IRQL -- 2
Threads--  Current 861d4798 Next 00000000 Idle 807c7800
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 00000223
Times -- Dpc    00000000 Interrupt 00000000
         Kernel 00000224 User      00000000
It can be rather misleading, but the value on the first line (ffdff120) is the address of the PRCB structure, not of the actual associated processor (which would be the PCR, since the PCR of a processor is the structure that represents that processor to Windows). So we just take that value and give it the right symbols, in this case _KPRCB.

Understand that Windows kernel stuff names its items in a common nomenclature. For data structures it'll prefix it with an underscore (_) and for kernel-code it'll prefix it with the letter K. Keep this in mind with determining the symbols. So as an extra example, for the PCR structure it'd be _KPCR. If you want to know the exact structure names for other stuff, the WDK has a portion of it called Build Environments that offers public symbols and other extra documentation on all its kernel modules (that's covered publicly, of course). You'll find the symbols in those.

Anyways, let's dump the PRCB using the address and the _KPRCB structure symbols:

Code:
1: kd> dt _KPRCB ffdff120
nt!_KPRCB
   +0x000 MinorVersion     : 1
   +0x002 MajorVersion     : 1
   +0x004 CurrentThread    : 0x861d4798 _KTHREAD
   +0x008 NextThread       : (null)
   +0x00c IdleThread       : 0x807c7800 _KTHREAD
   +0x010 LegacyNumber     : 0x1 ''
   +0x011 NestingLevel     : 0 ''
   +0x012 BuildType        : 0
   +0x014 CpuType          : 6 ''
   +0x015 CpuID            : 1 ''
   +0x016 CpuStep          : 0xe0c
   +0x016 CpuStepping      : 0xc ''
   +0x017 CpuModel         : 0xe ''
   +0x018 ProcessorState   : _KPROCESSOR_STATE
   +0x338 KernelReserved   : [16] 0
   +0x378 HalReserved      : [16] 0xe100
   +0x3b8 CFlushSize       : 0x40
   +0x3bc CoresPerPhysicalProcessor : 0x2 ''
   +0x3bd LogicalProcessorsPerCore : 0x1 ''
   +0x3be PrcbPad0         : [2]  ""
   +0x3c0 MHz              : 0x6c4
   +0x3c4 CpuVendor        : 0x1 ''
   +0x3c5 GroupIndex       : 0x1 ''
   +0x3c6 Group            : 0
   +0x3c8 GroupSetMember   : 2
   +0x3cc Number           : 1
   +0x3d0 PrcbPad1         : [72]  ""
   +0x418 LockQueue        : [17] _KSPIN_LOCK_QUEUE
   +0x4a0 NpxThread        : (null)
   +0x4a4 InterruptCount   : 0x223
   +0x4a8 KernelTime       : 0x224
   +0x4ac UserTime         : 0
   +0x4b0 DpcTime          : 0
   +0x4b4 DpcTimeCount     : 0
   +0x4b8 InterruptTime    : 0
   +0x4bc AdjustDpcThreshold : 0xd
   +0x4c0 PageColor        : 0xd2
   +0x4c4 DebuggerSavedIRQL : 0x2 ''
   +0x4c5 NodeColor        : 0 ''
   +0x4c6 PrcbPad20        : [2]  ""
   +0x4c8 NodeShiftedColor : 0
   +0x4cc ParentNode       : 0x83d4a300 _KNODE
   +0x4d0 SecondaryColorMask : 0x3f
   +0x4d4 DpcTimeLimit     : 0x280
   +0x4d8 PrcbPad21        : [2] 0
   +0x4e0 CcFastReadNoWait : 0
   +0x4e4 CcFastReadWait   : 0x57
   +0x4e8 CcFastReadNotPossible : 0
   +0x4ec CcCopyReadNoWait : 0
   +0x4f0 CcCopyReadWait   : 0x61
   +0x4f4 CcCopyReadNoWaitMiss : 0
   +0x4f8 MmSpinLockOrdering : 0n0
   +0x4fc IoReadOperationCount : 0n93
   +0x500 IoWriteOperationCount : 0n0
   +0x504 IoOtherOperationCount : 0n191
   +0x508 IoReadTransferCount : _LARGE_INTEGER 0x1276c3
   +0x510 IoWriteTransferCount : _LARGE_INTEGER 0x0
   +0x518 IoOtherTransferCount : _LARGE_INTEGER 0x41ad
   +0x520 CcFastMdlReadNoWait : 0
   +0x524 CcFastMdlReadWait : 0
   +0x528 CcFastMdlReadNotPossible : 0
   +0x52c CcMapDataNoWait  : 0
   +0x530 CcMapDataWait    : 0x273
   +0x534 CcPinMappedDataCount : 0xf
   +0x538 CcPinReadNoWait  : 0
   +0x53c CcPinReadWait    : 0x5b
   +0x540 CcMdlReadNoWait  : 0
   +0x544 CcMdlReadWait    : 0
   +0x548 CcLazyWriteHotSpots : 0
   +0x54c CcLazyWriteIos   : 0
   +0x550 CcLazyWritePages : 0
   +0x554 CcDataFlushes    : 0x38
   +0x558 CcDataPages      : 0x63
   +0x55c CcLostDelayedWrites : 0
   +0x560 CcFastReadResourceMiss : 0
   +0x564 CcCopyReadWaitMiss : 0xa2
   +0x568 CcFastMdlReadResourceMiss : 0
   +0x56c CcMapDataNoWaitMiss : 0
   +0x570 CcMapDataWaitMiss : 0x24
   +0x574 CcPinReadNoWaitMiss : 0
   +0x578 CcPinReadWaitMiss : 0x14
   +0x57c CcMdlReadNoWaitMiss : 0
   +0x580 CcMdlReadWaitMiss : 0
   +0x584 CcReadAheadIos   : 0x1f
   +0x588 KeAlignmentFixupCount : 0
   +0x58c KeExceptionDispatchCount : 0xb
   +0x590 KeSystemCalls    : 0x4322
   +0x594 AvailableTime    : 0x2f
   +0x598 PrcbPad22        : [2] 0
   +0x5a0 PPLookasideList  : [16] _PP_LOOKASIDE_LIST
   +0x620 PPNPagedLookasideList : [32] _GENERAL_LOOKASIDE_POOL
   +0xf20 PPPagedLookasideList : [32] _GENERAL_LOOKASIDE_POOL
   +0x1820 PacketBarrier    : 0
   +0x1824 ReverseStall     : 0n3
   +0x1828 IpiFrame         : 0x88722bec Void
   +0x182c PrcbPad3         : [52]  ""
   +0x1860 CurrentPacket    : [3] (null)
   +0x186c TargetSet        : 0
   +0x1870 WorkerRoutine    : 0x83c4828c     void  nt!KiFlushTargetSingleTb+0
   +0x1874 IpiFrozen        : 0
   +0x1878 PrcbPad4         : [40]  ""
   +0x18a0 RequestSummary   : 0
   +0x18a4 SignalDone       : (null)
   +0x18a8 PrcbPad50        : [56]  ""
   +0x18e0 DpcData          : [2] _KDPC_DATA
   +0x1908 DpcStack         : 0x807e3000 Void
   +0x190c MaximumDpcQueueDepth : 0n4
   +0x1910 DpcRequestRate   : 0
   +0x1914 MinimumDpcRate   : 3
   +0x1918 DpcLastCount     : 0xf
   +0x191c PrcbLock         : 0
   +0x1920 DpcGate          : _KGATE
   +0x1930 ThreadDpcEnable  : 0x1 ''
   +0x1931 QuantumEnd       : 0 ''
   +0x1932 DpcRoutineActive : 0 ''
   +0x1933 IdleSchedule     : 0 ''
   +0x1934 DpcRequestSummary : 0n0
   +0x1934 DpcRequestSlot   : [2] 0n0
   +0x1934 NormalDpcState   : 0n0
   +0x1936 DpcThreadActive  : 0y0
   +0x1936 ThreadDpcState   : 0n0
   +0x1938 TimerHand        : 0x2cf
   +0x193c LastTick         : 0x2d0
   +0x1940 MasterOffset     : 0n0
   +0x1944 PrcbPad41        : [2] 0
   +0x194c PeriodicCount    : 0
   +0x1950 PeriodicBias     : 0
   +0x1958 TickOffset       : 0
   +0x1960 TimerTable       : _KTIMER_TABLE
   +0x31a0 CallDpc          : _KDPC
   +0x31c0 ClockKeepAlive   : 0n1
   +0x31c4 ClockCheckSlot   : 0 ''
   +0x31c5 ClockPollCycle   : 0x64 'd'
   +0x31c6 PrcbPad6         : [2]  ""
   +0x31c8 DpcWatchdogPeriod : 0n1920
   +0x31cc DpcWatchdogCount : 0n1451
   +0x31d0 ThreadWatchdogPeriod : 0n0
   +0x31d4 ThreadWatchdogCount : 0n0
   +0x31d8 KeSpinLockOrdering : 0n0
   +0x31dc PrcbPad70        : [1] 0
   +0x31e0 WaitListHead     : _LIST_ENTRY [ 0x861d14d4 - 0x861d280c ]
   +0x31e8 WaitLock         : 0
   +0x31ec ReadySummary     : 0
   +0x31f0 QueueIndex       : 1
   +0x31f4 DeferredReadyListHead : _SINGLE_LIST_ENTRY
   +0x31f8 StartCycles      : 0x4`bfc91224
   +0x3200 CycleTime        : 0x1`3bb032ff
   +0x3208 HighCycleTime    : 1
   +0x320c PrcbPad71        : 0
   +0x3210 PrcbPad72        : [2] 0
   +0x3220 DispatcherReadyListHead : [32] _LIST_ENTRY [ 0x807c5340 - 0x807c5340 ]
   +0x3320 ChainedInterruptList : (null)
   +0x3324 LookasideIrpFloat : 0n2147483647
   +0x3328 MmPageFaultCount : 0n16258
   +0x332c MmCopyOnWriteCount : 0n5
   +0x3330 MmTransitionCount : 0n11600
   +0x3334 MmCacheTransitionCount : 0n0
   +0x3338 MmDemandZeroCount : 0n1300
   +0x333c MmPageReadCount  : 0n851
   +0x3340 MmPageReadIoCount : 0n175
   +0x3344 MmCacheReadCount : 0n0
   +0x3348 MmCacheIoCount   : 0n0
   +0x334c MmDirtyPagesWriteCount : 0n0
   +0x3350 MmDirtyWriteIoCount : 0n0
   +0x3354 MmMappedPagesWriteCount : 0n0
   +0x3358 MmMappedWriteIoCount : 0n0
   +0x335c CachedCommit     : 0x100
   +0x3360 CachedResidentAvailable : 0x87
   +0x3364 HyperPte         : 0x807e3005 Void
   +0x3368 PrcbPad8         : [4]  ""
   +0x336c VendorString     : [13]  "GenuineIntel"
   +0x3379 InitialApicId    : 0x1 ''
   +0x337a LogicalProcessorsPerPhysicalProcessor : 0x2 ''
   +0x337b PrcbPad9         : [5]  ""
   +0x3380 FeatureBits      : 0xa08f3fff
   +0x3388 UpdateSignature  : _LARGE_INTEGER 0x54`00000000
   +0x3390 IsrTime          : 0
   +0x3398 RuntimeAccumulation : 0x6b49d20
   +0x33a0 PowerState       : _PROCESSOR_POWER_STATE
   +0x3468 DpcWatchdogDpc   : _KDPC
   +0x3488 DpcWatchdogTimer : _KTIMER
   +0x34b0 WheaInfo         : 0x867dd81c Void
   +0x34b4 EtwSupport       : 0x861f0940 Void
   +0x34b8 InterruptObjectPool : _SLIST_HEADER
   +0x34c0 HypercallPageList : _SLIST_HEADER
   +0x34c8 HypercallPageVirtual : (null)
   +0x34cc VirtualApicAssist : (null)
   +0x34d0 StatisticsPage   : (null)
   +0x34d4 RateControl      : (null)
   +0x34d8 Cache            : [5] _CACHE_DESCRIPTOR
   +0x3514 CacheCount       : 3
   +0x3518 CacheProcessorMask : [5] 2
   +0x352c PackageProcessorSet : _KAFFINITY_EX
   +0x3538 PrcbPad91        : [1] 0
   +0x353c CoreProcessorSet : 2
   +0x3540 TimerExpirationDpc : _KDPC
   +0x3560 SpinLockAcquireCount : 0x8d4a8
   +0x3564 SpinLockContentionCount : 0x40
   +0x3568 SpinLockSpinCount : 0xca8
   +0x356c IpiSendRequestBroadcastCount : 0
   +0x3570 IpiSendRequestRoutineCount : 0x280c
   +0x3574 IpiSendSoftwareInterruptCount : 0x8ca
   +0x3578 ExInitializeResourceCount : 0x99
   +0x357c ExReInitializeResourceCount : 2
   +0x3580 ExDeleteResourceCount : 0x52
   +0x3584 ExecutiveResourceAcquiresCount : 0x82c4
   +0x3588 ExecutiveResourceContentionsCount : 0x22
   +0x358c ExecutiveResourceReleaseExclusiveCount : 0x629
   +0x3590 ExecutiveResourceReleaseSharedCount : 0x7c83
   +0x3594 ExecutiveResourceConvertsCount : 5
   +0x3598 ExAcqResExclusiveAttempts : 0x4eb
   +0x359c ExAcqResExclusiveAcquiresExclusive : 0x3d7
   +0x35a0 ExAcqResExclusiveAcquiresExclusiveRecursive : 0x10f
   +0x35a4 ExAcqResExclusiveWaits : 0xb
   +0x35a8 ExAcqResExclusiveNotAcquires : 5
   +0x35ac ExAcqResSharedAttempts : 0x7d71
   +0x35b0 ExAcqResSharedAcquiresExclusive : 0x161
   +0x35b4 ExAcqResSharedAcquiresShared : 0x78e1
   +0x35b8 ExAcqResSharedAcquiresSharedRecursive : 0x32f
   +0x35bc ExAcqResSharedWaits : 0x17
   +0x35c0 ExAcqResSharedNotAcquires : 0
   +0x35c4 ExAcqResSharedStarveExclusiveAttempts : 0x6d
   +0x35c8 ExAcqResSharedStarveExclusiveAcquiresExclusive : 1
   +0x35cc ExAcqResSharedStarveExclusiveAcquiresShared : 0x69
   +0x35d0 ExAcqResSharedStarveExclusiveAcquiresSharedRecursive : 3
   +0x35d4 ExAcqResSharedStarveExclusiveWaits : 0
   +0x35d8 ExAcqResSharedStarveExclusiveNotAcquires : 0
   +0x35dc ExAcqResSharedWaitForExclusiveAttempts : 0
   +0x35e0 ExAcqResSharedWaitForExclusiveAcquiresExclusive : 0
   +0x35e4 ExAcqResSharedWaitForExclusiveAcquiresShared : 0
   +0x35e8 ExAcqResSharedWaitForExclusiveAcquiresSharedRecursive : 0
   +0x35ec ExAcqResSharedWaitForExclusiveWaits : 0
   +0x35f0 ExAcqResSharedWaitForExclusiveNotAcquires : 0
   +0x35f4 ExSetResOwnerPointerExclusive : 0
   +0x35f8 ExSetResOwnerPointerSharedNew : 2
   +0x35fc ExSetResOwnerPointerSharedOld : 0
   +0x3600 ExTryToAcqExclusiveAttempts : 0
   +0x3604 ExTryToAcqExclusiveAcquires : 0
   +0x3608 ExBoostExclusiveOwner : 1
   +0x360c ExBoostSharedOwners : 0
   +0x3610 ExEtwSynchTrackingNotificationsCount : 0
   +0x3614 ExEtwSynchTrackingNotificationsAccountedCount : 0
   +0x3618 Context          : 0x807c2138 _CONTEXT
   +0x361c ContextFlags     : 0x10017
   +0x3620 ExtendedState    : 0x807f3000 _XSAVE_AREA
Notice that it automagically determined that the symbols you wanted come from the nt module. If you need to specify the exact module whose symbols you wanna look up, type the module name followed by an exclamation mark and then the symbol name, kinda like what you would see in a callstack for a thread. For this it'd be nt!_KPRCB instead of just typing _KPRCB for the symbol name in the command.

As you can tell, the info here is a lot more verbose than from the alternative !prcb extension you could've used. There are also some substructures lying in here (and, btw, the PRCB is itself a substructure of the PCR structure). Navigate to those the same way you would with this one, using their offsets to work out the starting address. An example would be the processor state substructure, which is represented by the _KPROCESSOR_STATE symbol. It's at offset 0x018, so add 18 (hex, the debugger's default) to the previous address we used and we'll have the correct address:

Code:
1: kd> dt _KPROCESSOR_STATE ffdff120+18
nt!_KPROCESSOR_STATE
   +0x000 ContextFrame     : _CONTEXT
   +0x2cc SpecialRegisters : _KSPECIAL_REGISTERS
This is also split into two substructures. Follow those in the same manner.

There's actually a more simplified and less meticulous method to perusing data structures than this, but I gave this example to show how data structures work and how they can be navigated without such luxuries. If you wanna know the easier ways, look at the Windbg manual for the dt command.

One thing to be aware of is that symbols are only a form of template with which to present data. They are not intelligent, nor does the debugger try to figure out if you're using the right symbols or not (in most cases). You can pretty much hand the dt command any data you like, and it will parse it based on the symbols you gave it, regardless of whether those symbols actually represent that data or not. The results will obviously be quite ugly if the data does not match up with the symbols, so you must gain a general understanding of different data structures and whatnot before you can use dt effectively. You know you're getting better at debugging when you can look at raw data, and just from the first few values as well as other clues you may have gathered, you can tell what structure it is and then pump all of it through the correct symbols using dt. So dt - while powerful - is not a magical panacea. You first have to work out what the data is, and then match up the right symbols for it, before dt can do its job.
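
To make the "symbols are just a template" point concrete, here's a Python sketch using ctypes. Both structure names (PrcbStart and WrongGuess) are made up for illustration. PrcbStart borrows the first few field names and offsets from the 32-bit _KPRCB output above, and the raw bytes are my reconstruction of those displayed values assuming little-endian x86; WrongGuess is a deliberately unrelated layout.

```python
import ctypes


class PrcbStart(ctypes.LittleEndianStructure):
    """Hypothetical template mirroring the first fields of the dump above."""
    _fields_ = [
        ("MinorVersion", ctypes.c_uint16),   # +0x000
        ("MajorVersion", ctypes.c_uint16),   # +0x002
        ("CurrentThread", ctypes.c_uint32),  # +0x004 (32-bit pointer)
        ("NextThread", ctypes.c_uint32),     # +0x008
        ("IdleThread", ctypes.c_uint32),     # +0x00c
    ]


class WrongGuess(ctypes.LittleEndianStructure):
    """Hypothetical unrelated layout, applied to the same bytes on purpose."""
    _fields_ = [
        ("Flink", ctypes.c_uint32),
        ("Blink", ctypes.c_uint32),
        ("Count", ctypes.c_uint64),
    ]


# Reconstructed from the values dt displayed; not copied from a real dump.
raw = bytes.fromhex("01000100" "98471d86" "00000000" "00787c80")

right = PrcbStart.from_buffer_copy(raw)
wrong = WrongGuess.from_buffer_copy(raw)

print(f"+{PrcbStart.CurrentThread.offset:#05x} CurrentThread : {right.CurrentThread:#x}")
print(f"+{PrcbStart.IdleThread.offset:#05x} IdleThread    : {right.IdleThread:#x}")
print(f"Wrong template, same bytes: Flink={wrong.Flink:#x} Count={wrong.Count:#x}")

# Substructure address = structure address + field offset (all hex).
print(f"{0xFFDFF120 + 0x18:#x}")  # address handed to dt _KPROCESSOR_STATE
```

Both templates parse without complaint. Only the first one produces values that make sense (version 1.1, thread pointers in kernel address range), and nothing in the parsing itself tells you which is right. That judgment is on you, same as with dt.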
__

lm


Means list modules, where module is another name for a driver or other loaded image such as a DLL. In a typical crashdump, it'll list the drivers and other kernel stuff loaded at the time of the crash. You can also have it list stuff like timestamps with t or give verbose info on a particular module with v. There's a number of different arguments you can give it to cater to your interests, so read the Windbg manual on it. Another alternative is to click Debug on the Windbg window, then Modules..., which gives you a list that you can sort by column. It does not provide quite the amount of info that v does.
__

ln


Means to list nearest symbol. Give it an address and it'll look up if any module data sits in that address, and if so, it'll give the nearest symbols for it, if available. Good to determine if a portion of memory is a function, data structure, etc.

This is not to be used as an alternative to lma, which lists any loaded module that has a portion sitting at the address you give it. lma only checks whether a module is present, whereas ln looks for symbols present at that address. Even if a module is present at that address, if you don't have any symbols loaded for it (often the case with 3rd-party drivers) then ln will bring up nothing whereas lma will. ln matters most when you're actually looking for the name of some function or data structure that you believe sits there, but you don't know what it is. If you have the symbols, it'll present it to you.
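Here's a small Python sketch of that difference, just to illustrate the idea. The module ranges are invented, and "thirdparty" is a made-up driver with no symbols. The three Ntfs symbol addresses are worked backwards from the example output further down in this post (e.g. 89a98ca2 minus 0x12b). The real ln prints more than this, so treat it as a simplification.

```python
import bisect

# (name, start, end) for each loaded module. Ranges are hypothetical.
MODULES = [
    ("nt", 0x81E00000, 0x821BA000),
    ("Ntfs", 0x89A00000, 0x89B10000),
    ("thirdparty", 0x8C000000, 0x8C020000),
]

# Symbols we "have loaded", per module, sorted by start address.
# Nothing is listed for the hypothetical thirdparty driver.
SYMBOLS = {
    "Ntfs": [
        (0x89A98B77, "NtfsFindPrefixHashEntry"),
        (0x89A991B4, "NtfsIsFileNameValid"),
        (0x89A99256, "NtfsCreateFcb"),
    ],
}


def find_module(address):
    """lma-style lookup: which module's range contains the address?"""
    for name, start, end in MODULES:
        if start <= address < end:
            return name
    return None


def nearest_symbol(address):
    """ln-style lookup: nearest preceding symbol, or None when no symbols exist."""
    module = find_module(address)
    symbols = SYMBOLS.get(module)
    if not symbols:
        return None
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None
    start, name = symbols[index]
    return f"{module}!{name}+0x{address - start:x}"


print(find_module(0x89A98CA2), nearest_symbol(0x89A98CA2))
print(find_module(0x8C000100), nearest_symbol(0x8C000100))
```

The second lookup finds the module but no symbol, which is the 3rd-party driver situation described above.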
__

r


Explained in this article. Use to show what was present in a specific register. Use by itself to get a general list of registers (not all of them, though), along with the latest instruction in that context. Of course, remember to be in the right context!
__

.formats


Best used for finding bit flips. This simple little command will display all the various formats for a number that you give it (you can even give it expressions like 55+3). Also great for interpreting bit flags.
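If you want to double-check a suspected bit flip outside the debugger, the idea is just an XOR of the value you expected against the value you got. A minimal Python sketch follows; the two addresses are made-up examples, not taken from a real dump.

```python
def flipped_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


def is_single_bit_flip(expected, actual):
    """True when exactly one bit differs between the two values."""
    return len(flipped_bits(expected, actual)) == 1


# Hypothetical example: a pointer that should have been ...a5c000 shows up as ...a7c000.
expected = 0xFFFFF80002A5C000
actual = 0xFFFFF80002A7C000

print(f"binary expected: {expected:064b}")
print(f"binary actual:   {actual:064b}")
print("flipped bits:", flipped_bits(expected, actual))
print("single bit flip:", is_single_bit_flip(expected, actual))
```

Two values differing in exactly one bit is a classic hint of a hardware bit flip, though it's only a hint and not proof on its own.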
__

!chkimg and image corruption


!chkimg is often used to discern the validity of data in an executable image.

To explain, an executable image is what you get when the Windows memory manager takes the entire dll/exe/etc. - an actual file - and slaps a copy of it - the image - into memory. It's a little more of an involved process than that, but that's the general idea. This is done whenever you run a process or service, so when you run a program it takes all the files it needs for starters and slaps em into memory so that they can be executed from there without altering the original executable files and whatnot that are present on the disk (amongst other reasons).

!chkimg takes a known-good copy of the image file (found through your symbol/image path, e.g. the symbol server) and compares it with the image present in memory at the time. If there are discrepancies, it'll mention them. Note that it depends on finding the right copy to validate against, so if it can't find one or (worse) finds an incorrect one, it's either not gonna work at all or tell you something's wrong when it's not.

!chkimg is primarily used to determine memory corruption problems. There have been times when viruses and the like have corrupted images in this manner, but there's various ways it's done and not as easy to discern.

Let's look at an example:
Code:
1: kd> !thread
GetPointerFromAddress: unable to read from 81f6f86c
THREAD 84a04230 Cid 0cc4.0490 Teb: 7ffde000 Win32Thread: 00000000 RUNNING on processor 1
IRP List:
Unable to read nt!_IRP @ 847b27a8
Impersonation token: a866bca0 (Level Impersonation)
GetUlongFromAddress: unable to read from 81f47394
Owning Process 87fe06d8 Image: svchost.exe
Attached Process N/A Image: N/A
ffdf0000: Unable to get shared data
Wait Start TickCount 10504
Context Switch Count 35
ReadMemory error: Cannot get nt!KeMaximumIncrement value.
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address 0x7729d8ac
Stack Init a5f50fe0 Current a5e86ba0 Base a5f51000 Limit a5f4e000 Call ddc
Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5
ChildEBP RetAddr Args to Child
a5f50564 89a21316 00000024 001904aa a5f50a90 nt!KeBugCheckEx+0x1e
a5f5058c 89a1c2c6 849fa008 a5f505c0 89a15f54 Ntfs!NtfsExceptionFilter+0xad (FPO: [Non-Fpo])
a5f50598 89a15f54 00000000 a5f50d2c 89a3b1a0 Ntfs!NtfsCommonCreateCallout+0x37 (FPO: [SEH])
a5f505ac 89a200ba 00000000 00000000 00000000 Ntfs!_EH4_CallFilterFunc+0x12 (FPO: [Uses EBP] [0,0,4])
a5f505d4 81eedb62 fffffffe a5f50d1c a5f5078c Ntfs!_except_handler4+0x8e (FPO: [Non-Fpo])
a5f505f8 81eedb34 a5f50a90 a5f50d1c a5f5078c nt!ExecuteHandler2+0x26
a5f506b0 81e6e567 a5f50a90 a5f5078c db4f8e1b nt!ExecuteHandler+0x24
a5f50a74 81e905ea a5f50a90 00000000 a5f50ae4 nt!KiDispatchException+0x170
a5f50adc 81e9059e a5f50b5c 89a990f2 badb0d00 nt!CommonDispatchException+0x4a (FPO: [0,20,0])
a5f50b5c 89a98ca2 864867c8 00000e69 a5f50b7c nt!Kei386EoiHelper+0x186
a5f50bb4 89a97c20 849fa008 864867c8 8bf3bd08 Ntfs!NtfsFindPrefixHashEntry+0x12b (FPO: [Non-Fpo])
a5f50c10 89a9667f 849fa008 847b27a8 8bf3bd08 Ntfs!NtfsFindStartingNode+0x73b (FPO: [Non-Fpo])
a5f50cec 89a1c2aa 849fa008 847b27a8 a5e86324 Ntfs!NtfsCommonCreate+0x620 (FPO: [Non-Fpo])
a5f50d2c 81ef12e8 a5e862bc 00000000 ffffffff Ntfs!NtfsCommonCreateCallout+0x20 (FPO: [Non-Fpo])
a5f50d2c 81ef13e1 a5e862bc 00000000 ffffffff nt!KiSwapKernelStackAndExit+0x118 (FPO: [0,0] TrapFrame @ a5f50d44)
a5e8624c 00000000 00000000 00000000 00000000 nt!KiSwitchKernelStackAndCallout+0x31

Look at the callstack. Before nt!Kei386EoiHelper kicks in (which is part of the crash process), we see the problem arose while Ntfs.sys was doing something (Ntfs!NtfsFindPrefixHashEntry+0x12b). You'll often use !chkimg to look at modules, and in this case it'll be the Ntfs module, since that's where the problem was discovered. Specifically, the error occurred in a routine residing in the Ntfs module's code, called "NtfsFindPrefixHashEntry". So we can do one of two things: either search the entire module, or search specifically in the area where the problem arose. Let's try searching all modules for all routines that start with "Ntfs", like the one that bugged out:

Code:
1: kd> !chkimg -d -db !Ntfs
89a99202 - Ntfs!NtfsIsFileNameValid+4e
[ f6:00 ]
89a9920a - Ntfs!NtfsIsFileNameValid+56 (+0x08)
[ ca:00 ]
89a99212 - Ntfs!NtfsIsFileNameValid+5e (+0x08)
[ fb:00 ]
89a9921a - Ntfs!NtfsIsFileNameValid+66 (+0x08)
[ 74:00 ]
89a99222 - Ntfs!NtfsIsFileNameValid+6e (+0x08)
[ c6:00 ]
89a9922a - Ntfs!NtfsIsFileNameValid+76 (+0x08)
[ f9:00 ]
89a9923a - Ntfs!NtfsIsFileNameValid+86 (+0x10)
[ 0e:00 ]
89a99242 - Ntfs!NtfsIsFileNameValid+8e (+0x08)
[ 02:00 ]
89a9924a - Ntfs!NtfsIsFileNameValid+96 (+0x08)
[ 45:00 ]
89a99252 - Ntfs!NtfsIsFileNameValid+9e (+0x08)
[ 90:00 ]
89a9925a - Ntfs!NtfsCreateFcb+4 (+0x08)
[ d7:00 ]
89a99262 - Ntfs!NtfsCreateFcb+c (+0x08)
[ c6:00 ]
89a9926a - Ntfs!NtfsCreateFcb+14 (+0x08)
[ dc:00 ]
89a99272 - Ntfs!NtfsCreateFcb+1c (+0x08)
[ 75:00 ]
89a9927a - Ntfs!NtfsCreateFcb+24 (+0x08)
[ 45:00 ]
89a99282 - Ntfs!NtfsCreateFcb+2c (+0x08)
[ 8b:00 ]
89a9928a - Ntfs!NtfsCreateFcb+34 (+0x08)
[ 89:00 ]
89a99292 - Ntfs!NtfsCreateFcb+3c (+0x08)
[ 02:00 ]
89a9929a - Ntfs!NtfsCreateFcb+44 (+0x08)
[ cc:00 ]
89a992a2 - Ntfs!NtfsCreateFcb+4c (+0x08)
[ a4:00 ]
89a992aa - Ntfs!NtfsCreateFcb+54 (+0x08)
[ 89:00 ]
89a992ba - Ntfs!NtfsCreateFcb+64 (+0x10)
[ 47:00 ]
89a992c2 - Ntfs!NtfsCreateFcb+6c (+0x08)
[ 89:00 ]
89a992ca - Ntfs!NtfsCreateFcb+74 (+0x08)
[ 9c:00 ]
89a992d2 - Ntfs!NtfsCreateFcb+7c (+0x08)
[ 05:00 ]
89a992da - Ntfs!NtfsCreateFcb+84 (+0x08)
[ b0:00 ]
89a992e2 - Ntfs!NtfsCreateFcb+8c (+0x08)
[ c6:00 ]
89a992ea - Ntfs!NtfsCreateFcb+94 (+0x08)
[ 5d:00 ]
89a992fa - Ntfs!NtfsCreateFcb+a4 (+0x10)
[ 5c:00 ]
89a99302 - Ntfs!NtfsCreateFcb+ac (+0x08)
[ 54:00 ]
89a9930a - Ntfs!NtfsCreateFcb+b4 (+0x08)
[ 0f:00 ]
89a99312 - Ntfs!NtfsCreateFcb+bc (+0x08)
[ 41:00 ]
89a9932a - Ntfs!NtfsCreateFcb+d4 (+0x18)
[ ff:00 ]
89a9933a - Ntfs!NtfsCreateFcb+e4 (+0x10)
[ ff:00 ]
89a99342 - Ntfs!NtfsCreateFcb+ec (+0x08)
[ b7:00 ]
89a9934a - Ntfs!NtfsCreateFcb+f4 (+0x08)
[ 04:00 ]
89a99352 - Ntfs!NtfsCreateFcb+fc (+0x08)
[ eb:00 ]
89a9935a - Ntfs!NtfsCreateFcb+104 (+0x08)
[ 85:00 ]
89a99362 - Ntfs!NtfsCreateFcb+10c (+0x08)
[ c6:00 ]
89a9936a - Ntfs!NtfsCreateFcb+114 (+0x08)
[ 40:00 ]
89a99372 - Ntfs!NtfsCreateFcb+11c (+0x08)
[ 75:00 ]
89a9937a - Ntfs!NtfsCreateFcb+124 (+0x08)
[ 7d:00 ]
89a99382 - Ntfs!NtfsCreateFcb+12c (+0x08)
[ 06:00 ]
89a9938a - Ntfs!NtfsCreateFcb+134 (+0x08)
[ 30:00 ]
89a99392 - Ntfs!NtfsCreateFcb+13c (+0x08)
[ 2b:00 ]
89a9939a - Ntfs!NtfsCreateFcb+144 (+0x08)
[ 8b:00 ]
89a993aa - Ntfs!NtfsCreateFcb+154 (+0x10)
[ f7:00 ]
89a993c2 - Ntfs!NtfsCreateFcb+16c (+0x18)
[ 2b:00 ]
89a993ca - Ntfs!NtfsCreateFcb+174 (+0x08)
[ 8b:00 ]
89a993da - Ntfs!NtfsCreateFcb+184 (+0x10)
[ f7:00 ]
89a993fa - Ntfs!NtfsCreateFcb+1a4 (+0x20)
[ 95:00 ]
89a99402 - Ntfs!NtfsCreateFcb+1ac (+0x08)
[ 8b:00 ]
89a99412 - Ntfs!NtfsCreateFcb+1bc (+0x10)
[ f7:00 ]
89a99422 - Ntfs!NtfsCreateFcb+1cc (+0x10)
[ 04:00 ]
89a9942a - Ntfs!NtfsCreateFcb+1d4 (+0x08)
[ 04:00 ]
89a99432 - Ntfs!NtfsCreateFcb+1dc (+0x08)
[ 68:00 ]
89a9943a - Ntfs!NtfsCreateFcb+1e4 (+0x08)
[ ab:00 ]
57 errors : !Ntfs (89a99202-89a9943a)
89a99200 55 0c *00 da 1b d2 83 e2 08 83 *00 04 23 c2 74 1f U...........#.t.
89a99210 66 83 *00 3a 74 19 66 83 fb 5c *00 13 66 83 fb 2e f..:t.f..\..f...
89a99220 74 04 *00 45 fe 00 47 46 46 3b *00 72 b5 eb 04 c6 t..E..GFF;.r....
89a99230 45 ff 00 80 7d fe 00 5e 5b 74 *00 83 f9 01 74 05 E...}..^[t....t.
89a99240 83 f9 *00 75 04 c6 45 ff 00 8a *00 ff 5f c9 c2 08 ...u..E....._...
89a99250 00 90 *00 90 90 90 6a 58 68 f8 *00 a3 89 e8 66 ca ......jXh.....f.
89a99260 f7 ff *00 45 e7 00 33 f6 89 75 *00 89 75 d8 89 75 ...E..3..u..u..u
89a99270 d0 39 *00 24 75 06 8d 45 e6 89 *00 24 8b 45 14 89 .9.$u..E...$.E..
89a99280 45 a4 *00 45 18 89 45 a8 33 ff *00 7d e0 8b 45 0c E..E..E.3..}..E.
89a99290 05 a8 *00 00 00 89 45 d4 8d 4d *00 51 8d 4d c8 51 ......E..M.Q.M.Q
89a992a0 8d 4d *00 51 50 ff 15 8c 93 a3 *00 3b c6 0f 84 a1 .M.QP......;....
89a992b0 00 00 00 8b 78 08 89 7d e0 f6 *00 04 01 74 2f 8b ....x..}.....t/.
89a992c0 47 08 *00 45 98 8b 47 0c 89 45 *00 8d 45 98 50 8b G..E..G..E..E.P.
89a992d0 47 38 *00 a8 02 00 00 50 ff 15 *00 95 a3 89 83 67 G8.....P.......g
89a992e0 04 bf *00 45 e7 01 89 75 e0 8b *00 10 eb 71 8b 45 ...E...u.....q.E
89a992f0 24 c6 00 01 8b 5d 10 3b de 74 *00 39 b7 d8 00 00 $....].;.t.9....
89a99300 00 75 *00 8d 43 30 33 c9 41 f0 *00 c1 08 8d 43 14 .u..C03.A.....C.
89a99310 33 c9 *00 f0 0f c1 08 39 b7 d8 00 00 00 74 2d 8b 3......9.....t-.
89a99320 87 d8 00 00 00 83 c0 14 83 c9 *00 f0 0f c1 08 8b ................
89a99330 87 d8 00 00 00 83 c0 30 83 c9 *00 f0 0f c1 08 75 .......0.......u
89a99340 0b ff *00 d8 00 00 00 e8 38 1d *00 00 89 9f d8 00 ........8.......
89a99350 00 00 *00 03 8b 5d 10 3b fe 0f *00 5c 02 00 00 8b .....].;...\....
89a99360 45 24 *00 00 00 89 75 fc 33 c0 *00 89 45 24 80 7d E$....u.3...E$.}
89a99370 1c 00 *00 77 39 45 14 76 72 83 *00 14 08 74 6c 83 ...w9E.vr....tl.
89a99380 7d 14 *00 74 66 80 7d 20 00 74 *00 68 c0 5c a4 89 }..tf.} .t.h.\..
89a99390 e8 36 *00 f8 ff 8b f0 89 75 dc *00 fe 89 7d e0 68 .6......u....}.h
89a993a0 e8 03 00 00 6a 00 56 e8 21 ce *00 ff 83 c4 0c 81 ....j.V.!.......
...
89a993c0 e8 06 *00 f8 ff 8b f0 89 75 dc *00 fe 89 7d e0 68 ........u....}.h
89a993d0 70 03 00 00 6a 00 56 e8 f1 cd *00 ff 83 c4 0c 81 p...j.V.........
...
89a993f0 68 f0 00 00 00 6a 10 ff 15 f0 *00 a3 89 8b f0 89 h....j..........
89a99400 75 dc *00 fe 89 7d e0 68 f0 00 00 00 6a 00 56 e8 u....}.h....j.V.
89a99410 b9 cd *00 ff 83 c4 0c 80 7d 1c 00 74 4b 8b 45 0c ........}..tK.E.
89a99420 f7 40 *00 00 00 00 02 74 34 a0 *00 4c a4 89 84 c0 .@.....t4..L....
89a99430 74 14 *00 7c 07 00 00 68 36 6d *00 89 68 a2 00 00 t..|...h6m..h...

Note that the command had "!Ntfs" in it, which is to say, "Look in any loaded module for any routine whose symbol name starts with 'Ntfs'."

This output has two parts, one for the -d argument and the other for the -db argument given to the !chkimg command.

The first part displays two lines per corrupted item found:


89a99202 - Ntfs!NtfsIsFileNameValid+4e
[ f6:00 ]


The first line shows the location of the problem area. It displays the memory address (89a99202), the symbol name (Ntfs!NtfsIsFileNameValid), and the offset (+4e). Note that later lines also have a hex number in parentheses, which is the offset - or distance - from the previous corrupted data listed. Note that this is just one byte corrupted; in other cases you may see an address range as opposed to just an address. Also notice in this example that the distance between corrupted bytes is nearly always the same.

The second line shows the actual problem. It displays, in brackets, two values: the one before the colon is the correct value that !chkimg expects, and the one after the colon is what was actually found (the corrupted byte). In this case, it got zeroed out. In fact, all of them listed in this case are zeroed.


The second part of the output displays the actual data as if you had used the 'db' command on the data range. It shows the actual hex values, followed by the ASCII character associated with each byte. An asterisk next to a byte signifies that it's a corrupt byte. Here we get a more visual indication of the pattern of this corruption than in the -d output, where we used the offset between corrupted bytes (mostly 0x8 and 0x10) to determine a pattern. Here we actually see a couple of columns of zeroes strung down the length of the range.

With this info, you can conclude - at least in this case - that we're looking at memory problems, in that a contiguous length of memory perchance went dead and lost its contents (hence the zeroes).
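If the -d output is long, you can let a script spot the pattern for you. The following is a rough Python sketch, assuming the two-lines-per-item text format shown above (I have only checked it against the first five entries of this example, pasted in as SAMPLE). It tallies the spacing between corrupted bytes and checks whether every corrupted byte was zeroed.

```python
import re
from collections import Counter

# First five entries of the !chkimg -d output shown above.
SAMPLE = """\
89a99202 - Ntfs!NtfsIsFileNameValid+4e
[ f6:00 ]
89a9920a - Ntfs!NtfsIsFileNameValid+56 (+0x08)
[ ca:00 ]
89a99212 - Ntfs!NtfsIsFileNameValid+5e (+0x08)
[ fb:00 ]
89a9921a - Ntfs!NtfsIsFileNameValid+66 (+0x08)
[ 74:00 ]
89a99222 - Ntfs!NtfsIsFileNameValid+6e (+0x08)
[ c6:00 ]
"""

ADDRESS_LINE = re.compile(r"^([0-9a-f]+) - \S+")
BYTE_LINE = re.compile(r"^\[ ([0-9a-f]{2}):([0-9a-f]{2}) \]$")


def parse_mismatches(text):
    """Return a list of (address, expected_byte, actual_byte) tuples."""
    entries = []
    address = None
    for line in text.splitlines():
        line = line.strip()
        match = ADDRESS_LINE.match(line)
        if match:
            address = int(match.group(1), 16)
            continue
        match = BYTE_LINE.match(line)
        if match and address is not None:
            entries.append((address, int(match.group(1), 16), int(match.group(2), 16)))
            address = None
    return entries


def summarize(entries):
    """Count the gaps between consecutive corrupted addresses and test for zeroing."""
    addresses = [address for address, _, _ in entries]
    gaps = Counter(later - earlier for earlier, later in zip(addresses, addresses[1:]))
    all_zeroed = all(actual == 0 for _, _, actual in entries)
    return gaps, all_zeroed


entries = parse_mismatches(SAMPLE)
gaps, all_zeroed = summarize(entries)
print("mismatches:", len(entries))
print("gap counts:", {hex(gap): count for gap, count in gaps.items()})
print("all corrupted bytes zeroed:", all_zeroed)
```

A regular stride plus every byte being zeroed is the same pattern we eyeballed above. The script only saves you the eyeballing; the conclusion is still yours to draw.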

Look up !chkimg in the Windbg help manual for more information and more ways you can use this extension.
__

!idt


Means Interrupt Dispatch Table. Each processor (logical) has its own dispatch table filled in by drivers and the like so that if a specific interrupt vector (a type of very small instruction) is hit in code, it'll divert code execution to the function associated with that interrupt vector. Think of interrupt vectors as very tiny call instructions that operate in their own way.
__

!pcr & !prcb


Every processor is required to dump its information - everything ranging from register values to current context to processor ID and state info - and keep updating it in a particular data structure called the PCR, or Processor Control Region. It's what keeps everything nice and tidy for the operations of a particular logical processor (logical processors are made based on processor cores, physical CPUs and whether Hyperthreading or some other similar feature is active). The !pcr extension dumps this information in an easy-to-read format. Note that every PCR has a subsection called the PRCB, or Processor Control Block, which contains the bulk of the PCR's info. You can view that with !prcb. However, its output isn't nearly as robust as dumping the PRCB using the dt command; it just outputs basic information.
__

!r


Operates much like r but dumps all registers present.
__

!verifier


Use this to confirm that Driver Verifier was operational at the time of the crash and that it was properly set. Without any additional arguments, !verifier by itself will show a list of enabled options, as well as special pool allocations and other summary info. You can use various arguments to garner further details on certain items (might not work on minidumps).

Code:
--Example of Driver Verifier off--

2: kd> !verifier

Verify Level 0 ... enabled options are:

Summary of All Verifier Statistics

RaiseIrqls                             0x0
AcquireSpinLocks                       0x0
Synch Executions                       0x0
Trims                                  0x0

Pool Allocations Attempted             0x0
Pool Allocations Succeeded             0x0
Pool Allocations Succeeded SpecialPool 0x0
Pool Allocations With NO TAG           0x0
Pool Allocations Failed                0x0
Resource Allocations Failed Deliberately   0x0

Current paged pool allocations         0x0 for 00000000 bytes
Peak paged pool allocations            0x0 for 00000000 bytes
Current nonpaged pool allocations      0x0 for 00000000 bytes
Peak nonpaged pool allocations         0x0 for 00000000 bytes

--Example of Driver Verifier on--

4: kd> !verifier

Verify Level 9bb ... enabled options are:
    Special pool
    Special irql
    All pool allocations checked on unload
    Io subsystem checking enabled
    Deadlock detection enabled
    DMA checking enabled
    Security checks enabled
    Miscellaneous checks enabled

Summary of All Verifier Statistics

RaiseIrqls                             0x0
AcquireSpinLocks                       0x100bd28
Synch Executions                       0x83c
Trims                                  0x46a39

Pool Allocations Attempted             0xb9d19
Pool Allocations Succeeded             0xb9d19
Pool Allocations Succeeded SpecialPool 0xb9d19
Pool Allocations With NO TAG           0x24
Pool Allocations Failed                0x0
Resource Allocations Failed Deliberately   0x0

Current paged pool allocations         0x32ec for 00B27EA4 bytes
Peak paged pool allocations            0x3342 for 00B2EABC bytes
Current nonpaged pool allocations      0x6b95 for 04682B30 bytes
Peak nonpaged pool allocations         0x6b97 for 0498E234 bytes

There's also a good chance that with DV on, it will show up in the bucket ID in the !analyze -v output:

Code:
DEFAULT_BUCKET_ID:  VERIFIER_ENABLED_VISTA_MINIDUMP

as opposed to something like:

Code:
DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

If you get an "Unable to retrieve verifier list" result from !verifier, that means the verifier information was not written to the minidump. Often this is because DV wasn't on, but there are cases where DV can be on and this still occurs. This should not occur on a kernel dump.
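The "Verify Level" number in the !verifier output is a bitmask of the enabled options. Here's a small Python sketch that decodes it. The bit values are the ones I know from the verifier.exe /flags documentation for Vista/7-era Windows; treat them as an assumption, since newer Windows versions add more bits and !verifier words the option names a bit differently.

```python
# Bit values per the verifier.exe /flags documentation (assumed, Vista/7 era).
VERIFIER_FLAGS = {
    0x001: "Special pool",
    0x002: "Force IRQL checking",
    0x004: "Randomized low resources simulation",
    0x008: "Pool tracking",
    0x010: "I/O verification",
    0x020: "Deadlock detection",
    0x040: "Enhanced I/O verification",
    0x080: "DMA verification",
    0x100: "Security checks",
    0x200: "Force pending I/O requests",
    0x400: "IRP logging",
    0x800: "Miscellaneous checks",
}


def decode_verify_level(level):
    """Return the names of all options whose bit is set in the Verify Level value."""
    return [name for bit, name in sorted(VERIFIER_FLAGS.items()) if level & bit]


# 0x9bb is the Verify Level from the "Driver Verifier on" example above.
for option in decode_verify_level(0x9BB):
    print(option)
```

For 0x9bb this gives eight options, which line up with the eight "enabled options" lines in the example. Note that the two settings warned against later in this post (Low Resource Simulation and Force Pending I/O Requests) are not among them.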



_____

ERROR CODES



DRIVER_VERIFIER_DETECTED_VIOLATION (0xc4)






WHEA errors (0x122, 0x124)


Note: for PCI-Express WHEA errors (subcode 0x4), visit this article.

WHEA is the Windows Hardware Error Architecture, which is essentially like Werfault or other error handling processes but specifically catered to handle hardware issues. You will only see it on Vista and 2008 or later OSes, as earlier OSes will just provide bugcheck 0x9C instead. Refer to Windbg help manual for info on that. Two error codes are possible with WHEA:

- 0x122 : WHEA_INTERNAL_ERROR is like a double fault, where something caused even the WHEA error handling code to mess up somewhere. If this happens, you may have to look elsewhere, as even WHEA error handling cannot be trusted in this case.

- 0x124 : WHEA_UNCORRECTABLE_ERROR is the most common, and provides details on what the hardware error may be. Details change depending on whether the system is 32-bit (x86) or x64: the 32-bit version adds the MCi_STATUS value. However, most of the time you can just look at the error record for the answer and disregard MCi_STATUS.

Example of 0x124 on an x64 system (a 32-bit system will have Arg3 & Arg4 filled):
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa80049f78f8, Address of the WHEA_ERROR_RECORD structure.
Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

You can get details for an error record by looking at the WHEA_ERROR_RECORD structure, which is Arg2. Use the !errrec extension cmd for this:
1: kd> !errrec fffffa80049f78f8
===============================================================================
Common Platform Error Record @ fffffa80049f78f8
-------------------------------------------------------------------------------
Record Id : 01cb818aa5caac05
Severity : Fatal (1)
Length : 928
Creator : Microsoft
Notify Type : Machine Check Exception
Timestamp : 11/11/2010 10:24:46
Flags : 0x00000002 PreviousError

===============================================================================
Section 0 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ fffffa80049f7978
Section @ fffffa80049f7a50
Offset : 344
Length : 192
Flags : 0x00000001 Primary
Severity : Fatal

Proc. Type : x86/x64
Instr. Set : x64
Error Type : BUS error
Operation : Generic

Flags : 0x00
Level : 0
CPU Version : 0x000000000001067a
Processor ID : 0x0000000000000000

===============================================================================
Section 1 : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor @ fffffa80049f79c0
Section @ fffffa80049f7b10
Offset : 536
Length : 128
Flags : 0x00000000
Severity : Fatal

Local APIC Id : 0x0000000000000000
CPU Id : 7a 06 01 00 00 08 04 00 - bd e3 08 0c ff fb eb bf
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0 @ fffffa80049f7b10

===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ fffffa80049f7a08
Section @ fffffa80049f7b90
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal

Error : BUSL0_SRC_ERR_M_NOTIMEOUT_ERR (Proc 0 Bank 0)
Status : 0xf200084000000800
In this example, the error is BUSL0_SRC_ERR_M_NOTIMEOUT_ERR. Broken down, that is a bus/interconnect error (BUS) at memory hierarchy level 0 (L0), where the processor itself was the source of the request (SRC), the request type was generic (ERR), it was a memory access (M), and the request did not time out (NOTIMEOUT). This helpfully provides the exact processor number (Proc 0) and machine check bank (Bank 0) for multi-core/multi-cpu systems to further diagnose.

To decipher the mnemonics presented in the error message, consult the CPU developer manual associated with the CPU installed in the offending system. For Intel CPUs, it's in Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3: System Programming Guide (for 64-bit of course), and the tables for the mnemonics are located in section 15.9.2 (that's chapter 15, section 9, subsection 2). While Intel and AMD CPUs present their errors differently, typically the mnemonics are the same.
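For the curious, here's a minimal Python sketch of how the mnemonic falls out of the Status value. It handles only the bus/interconnect form of the compound error code (bit pattern 0000 1PPT RRRR IILL in the low 16 bits), using the sub-field encodings as I understand them from the Intel manual's tables, so check them against the manual for your CPU. The helper names are mine, the PP3/II1/II3-style fallbacks are placeholders of my own rather than Intel mnemonics, and splitting the example status into two halves only illustrates how Arg3/Arg4 would combine.

```python
# Sub-field mnemonics (assumed from the Intel SDM compound error code tables).
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}
PARTICIPATION = {0: "SRC", 1: "RES", 2: "OBS"}
TIMEOUT = {0: "NOTIMEOUT", 1: "TIMEOUT"}
MEMORY_OR_IO = {0: "M", 2: "IO"}
REQUEST = {
    0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
    5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP",
}

# High bits of MCi_STATUS: (bit position, name).
STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def status_from_args(high, low):
    """Join the high and low 32-bit halves (Arg3, Arg4) into one 64-bit status."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def status_flags(status):
    """Return the names of the flag bits that are set in the status value."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]


def decode_bus_error(status):
    """Build the BUS mnemonic, or return None if this is not a bus error code."""
    code = status & 0xFFFF
    if (code & 0xF800) != 0x0800:
        return None
    pp = (code >> 9) & 0x3
    t = (code >> 8) & 0x1
    rrrr = (code >> 4) & 0xF
    ii = (code >> 2) & 0x3
    ll = code & 0x3
    parts = [
        "BUS" + LEVEL[ll],
        PARTICIPATION.get(pp, f"PP{pp}"),    # placeholder for unlisted values
        REQUEST.get(rrrr, f"RRRR{rrrr}"),    # placeholder for unlisted values
        MEMORY_OR_IO.get(ii, f"II{ii}"),     # placeholder for unlisted values
        TIMEOUT[t],
        "ERR",
    ]
    return "_".join(parts)


# Status value from the !errrec output above, rebuilt from its two halves.
status = status_from_args(0xF2000840, 0x00000800)
print(hex(status))
print(decode_bus_error(status))
print(status_flags(status))
```

In practice !errrec already does this decoding for you; the sketch is only meant to show where the pieces of the mnemonic come from.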


_____

DRIVER VERIFIER


Here's the link to a page of articles explaining in detail the changes that each setting in DV creates. I would especially recommend everyone review the Special Pool one, as that one is typically used improperly, resulting in DV being next to useless despite being active:
Driver Verifier Options Details

Concerning Special Pool, for those who want the quick n easy explanation:

- Try to determine problematic driver(s) prior to enabling DV with Special Pool
- Try to select the drivers you think are the problem instead of "anything but Windows". The fewer the better.
- If crashes don't seem specific on a cause, try tagging another small set of drivers.
- Special Pool uses up a lot of virtual and physical memory. If EITHER almost gets exhausted, Special Pool requests will fail and normal memory allocating will continue (which means Special Pool doesn't get made, which means the setting becomes worthless).
- If you recommend anyone to set Special Pool, ask that they increase their paging file size to 2x/3x usual (only in low RAM cases).
- If the person is not using a 64-bit version of Windows, be more cautious about using Special Pool. 32-bit Windows memory limitations are more stringent with pool use than just the max available RAM. Consult Mark's article for details.

Concerning IRP Logging:

- Very valuable when the problem is suspected to involve I/O.
- Only available on kernel dumps or larger.
- Must have I/O Verification enabled as well.
- IRP logs are available through one of two options:
  • After setting up Driver Verifier with the setting and restarting, the user should run DC2WMIParser. This is available under Tools in the WDK (Windows Driver Kit). Run it with the proper arguments (/f and /t) and then trigger the problematic symptoms while it's running. This is best when the symptoms don't involve a crash (e.g. hang, lag/unresponsive). The resulting file should be sent for analysis.
  • Crash the system. The kernel dump will retain the necessary data. Use !verifier 0x100 to access the IRP log.

Other misc. DV-related tips:

- If you are diagnosing Windows hangs, use Deadlock Detection. Keep in mind:
  • DD only helps you if the resulting crash is a 0xC4 (DRIVER_VERIFIER_DETECTED_VIOLATION)
  • Use !deadlock 1 to get the nitty gritty

- Force Pending I/O Requests and Low Resource Simulation are bad, mmk? They are designed to stress test drivers in worst-case conditions, and impose an artificially restrictive environment on the drivers in order to do so. Never recommend these.
- Learn what each setting actually does! This lets you use the right tools for the right issues.
 
Bump. Got off my lazy duff and added anchors to ToC. In addition, added a couple more general concepts just to clarify people's perceptions of some things. May add some rough definitions of different stacks (thread stacks, call stacks, IRP stacks, etc.) later. In addition, I also added a link to the CodeMachine website which offers great articles, including their superb x64 Deep Dive one.
 
Thanks, usasma. It was a pain in the rear to get em all up but so far it looks like no broken links or anything like that. Notify me if you find anything missing or broken. Thanks!
 
Yeah, looks like his blog bit the dust for whatever unforeseen reason. I'll comment on it being broken at the time. If it doesn't come back up in a month, I'll remove the entry altogether. Which is a shame cuz its got some good tidbits of info in it.
 
Thanks, went ahead and did that. Not like the blog will be updated any further.
 
Bump. Fixed a link and added the cmkd Windbg extension that gives you a bunch of extra debugging goodness.

Btw, I haven't explained how to load Windbg extensions. I'll add it later in the OP, but for now, all you do (for most of them) is place it in the parent directory of Windbg (where the Windbg executable sits) and then open up Windbg and type .load [dbgext] where [dbgext] is the name of the extension. There's no need to add the .dll extension to the name. In the case with cmkd, type .load cmkd. That's it. It'll error if it fails to load, otherwise it'll be silent in which you can then use the commands it gives you.
 
From memory, can't you also load it by running one of the extension's commands for the first time, e.g.:

!cmkd.stack

I use cmkd quite a bit. It is incredibly useful!
 
Yes, yer right. You can get straight into using it by doing so. It'll load up the extension, run the command you specified, and then will continue to keep the extension loaded so any subsequent command can be typed in with the normal fashion.

There's other extensions I'd like to try out, and it wouldn't hurt to see your own come into fruition!
 
Bump. Added a few links, made a few adjustments. Read the update section for details.
 
Bump. Added some good wholesome tips under General Concepts, enough to split them into two groups: Debugging Tips & Learning Tips.

Btw, I found a recent book that's out, called Inside Windows Debugging: Practical Strategies. Anyone noticed this or got a good look at it? It looks to be approaching things from a developer standpoint, but there's a myriad of ways it tells you how to debug, and various tools beyond Windbg (like Xperf).
 
That book looks awesome! Thanks for the link, I had not seen it! Have just ordered mine from Amazon. Looking forward to it arriving!
 
Awesome. Tell us what ya think when you peruse it. I'd be willing to put it in the big ole OP post if it sounds good.
 
I haven't watched this yet, so if it proves not useful, please delete this.

Debugging Heap Memory Corruptions

Date: This event took place live on August 02 2012
Presented by: Tarik Soulami
Duration: Approximately 60 minutes.
Cost: Free
Heap corruptions are one of the most common causes of program crashes. They are also often hard to reproduce, which makes them even more challenging to track down. In this interactive webcast Tarik Soulami, author of Inside Windows Debugging will present numerous tips and ideas for how to investigate these conditions, both in the native as well as the managed Windows programming models. There will also be demos to illustrate these techniques.


http://oreillynet.com/pub/e/2350
 
Given that this is dealing with heap (userland processes, be they applications/services) and not pool (drivers/kernel/system processes), this probably won't be able to help much with those who wish to analyze BSODs and system-wide hangs. However this is awesome stuff for those who wish to work on the user mode side of things and deal with crashing services/apps, certain memory leaks and other miscellaneous misbehavings. I'll have to give it a look later on when I have the time available, as I'm very curious about it. Looks like a great find.
 
Watched it. Did a fairly good job with it all, and certainly helped. I'll add it to the OP for sure once I manage to generate the energy to do so. :)
 