Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows

Vir Gnarus said:
Though I still don't see how any 'black box' can record everything
I don't believe I ever said it can. But a black box type device can record everything up to and including the clock cycle just before a crash. And that might lead an analyzer towards the fault. But it might not.

Ah, I guess our definition of crash is different
Well, I am going by 40+ years supporting computing hardware to put a roof over my family's head. You seem to want to define crash as a specific failure, or a specific category of failure. Not so. "Crash" is a very generic term used to signify the sudden failure of pretty much anything. Hard drives can crash. For example, the R/W heads can literally "crash" - as in hard physical contact with the platters. Browsers can crash, as can operating systems and other programs. Faulty RAM can cause a crash, as can a bad NIC, PSU, or graphics card, or malware. The term "crash", when referring to computer systems coming to a halt, is probably as old as Adm. Grace Hopper's term "bug" - and just as ambiguous. It means one thing - it came to a stop.

but even that doesn't resolve the fact that most of the work done is inside the components connected to the mobo!
Ummm, no. Unless you mean a mounted CPU, the RAM, and graphics card. Most of the work, by far, is done in the CPU with data transferring back and forth over the motherboard bus to and from the RAM. The main exception might be the graphics solution which has its own dedicated number cruncher/CPU - the GPU - and dedicated graphics RAM to work in. The Chipset, which is really a small computer in itself, is somewhat autonomous, but not entirely. It still needs a processor - the CPU. The memory manager, for example, just sits there, until instructed by the CPU and OS to put or fetch this or that data. Yes, the memory manager may be programmed how to manage that data, but it is still not doing it on its own.

The other devices are not as autonomous as your comment seems to suggest either.

There's also, again, the whole performance thing involved.
As if having Windows setup in a super-paranoid state wasn't crippling enough, but to have all the hardware constantly report on their operations, and the system would be reduced to something akin to a 386 IBM!
Nah! I don't agree with that either. Not today. It is extremely difficult to intentionally max out CPU and RAM utilization as it is. Anti-malware solutions and firewalls are constantly monitoring nearly every aspect of our systems already. Watchdog software and hardware devices have been around for years. Mission critical computers and networks are often self healing - to a point, of course. At least to where they notice a problem, record what is happening, cut-over to redundant equipment while sending out all sorts of alarms, pages, text messages, and emails to the admins.

Sure, real-time debugging does eat up resources and some adjustments (more resources - RAM, CPU horsepower, and disk space) may be desired. But the real problem is not performance, but cost.

jcgriff2 said:
I don't think it is all that impossible as we head toward Windows 9 now for improvements in crash recording/ reporting to be implemented far beyond what we think is possible now.
I agree. And much of that is due to the fact that today's systems are much more capable than what they are typically asked to do (meaning there is lots of headroom for the crash recovery process to operate in), and today's hardware and operating systems are much more "robust" to start with.

While BSODs and system crashes may seem like commonplace events, they really aren't. With 1 billion plus Windows systems out there, even 1% is 10 million. It is just not cost effective for OS makers, hardware makers, and most users to have a failsafe, watchdog, crash recording, crash recovery system in place. Not yet anyway.

I certainly think BSOD analysis is a valuable troubleshooting tool, but we must understand that crashes that create BSOD errors are just one small type of crash. There are many causes of crashes that leave no crumbs to follow - regardless of how sophisticated real-time debugging and status recording may be.
 
Vir Gnarus said:
There's also, again, the whole performance thing involved.
As if having Windows setup in a super-paranoid state wasn't crippling enough, but to have all the hardware constantly report on their operations, and the system would be reduced to something akin to a 386 IBM!
Nah! I don't agree with that either. Not today. It is extremely difficult to intentionally max out CPU and RAM utilization as it is. Anti-malware solutions and firewalls are constantly monitoring nearly every aspect of our systems already. Watchdog software and hardware devices have been around for years. Mission critical computers and networks are often self healing - to a point, of course. At least to where they notice a problem, record what is happening, cut-over to redundant equipment while sending out all sorts of alarms, pages, text messages, and emails to the admins.

Sure, real-time debugging does eat up resources and some adjustments (more resources - RAM, CPU horsepower, and disk space) may be desired. But the real problem is not performance, but cost.

I would consider ProcMon to be somewhat close to ideal real-time recording of events (from a system resource usage standpoint anyway).

ProcMon can bring my P7350 Core2duo 4 GB RAM system to a dead standstill within 2 or 3 hours, albeit due to virtual memory being eaten at ~5 MB/sec.
 
I would consider ProcMon to be somewhat close to ideal real-time recording of events (from a system resource usage standpoint anyway).
"Somewhat close", I agree. But of course, it is included in Windows, or running by default. And it most users would not run it, unless they were already having problems.

I just fired up Process Monitor (PM) and Task Manager (TM). My Physical Memory use went from 26% to 29%. Will keep it running for a bit to see what happens as far as resources getting eaten up over time.
 
ProcMon can bring my P7350 Core2duo 4 GB RAM system to a dead standstill within 2 or 3 hours, albeit due to virtual memory being eaten at ~5 MB/sec.
Hmmm, just came back from an errand and the first thing I noticed was my computer was still awake. Process Monitor recorded over 9 million events. But my real point of replying again is TM said my memory use had dropped a little to 20% while I was away. That is, there is no sign it was eating up all my resources. I'll keep watching it.

I used to be a hardware guy in a software company. And when not supporting computers or networks, I was usually alpha and beta testing software. And the developers (almost 200 of them to 5 or 6 of us) always wanted the PID and other information as found in PM and similar programs. And I understand that. But I can tell you from experience as a hardware guy in a software company that it is almost impossible to convince those bull-headed people :grin1: that hardware doesn't have to play by their rules! It does not always give advance warning of failure - it simply ceases to exist, electronically - no current flow. So I am just saying we really have to understand that hardware may not give Windows time to record its status before a crash.

Fortunately, today's hardware is pretty darn reliable. But status monitoring/debugging can be invaluable, especially when unscheduled downtime is not really an option.

Which brings to mind, power. The ATX Form Factor Guide for PSUs only requires a hold-up time of 19 ms (19 thousandths of a second) if the AC line-in voltage is disrupted. After that, power to the computer stops. It doesn't fade away, it stops. With no warning. The ATX Form Factor does not require line-in monitoring for debugging purposes. (Your - speaking to everyone - computers are all on a good UPS w/AVR, right?)

That said - I just hit "X" instead of "-" :banghead: :mad7: PM is going again and I'll let it run for more than two hours this time, and with more than just idle time.

OT - BTW, I just clicked on that little [More] button on the right of this reply text box to look at more smilies and it "crashed" my browser session! Tried again, and it crashed too. Never had that happen before anywhere else, so I am not sure if my IE9 choked or the site's link did - or if it had something to do with PM!? It was bad enough that IE did not recover. And that is odd too. But I was able to right-click on the locked tab and select duplicate tab, and here I am, with a working tab again. I'm not clicking that button a third time! :shame2: ;)
 
Give ProcMon some help - run a/v scan; anything intensive.

It was about 2+ years ago when this laptop was near-frozen after leaving ProcMon running for hours. It writes to the page file, so WMI showed virtual memory usage escalating rapidly.

The WMI apps - "Recoveros + Page file"
- HTML output - https://www.sysnative.com/0x8/WMIC_Recoveros_Pagefile_04-2010_jcgriff2_html.exe
- TXT output - https://www.sysnative.com/0x8/WMI_recoveros_pagefile_jcgriff2_com_.exe

My system output -
HTML - https://www.sysnative.com/jcgriff2/postedfiles/wmic_recoveros_page_jcgriff2_12-11-2012.html
TXT - https://www.sysnative.com/jcgriff2/postedfiles/wmic_recoveros_page_jcgriff2_12-11-2012.txt

Page file - virtual memory usage (MB) -
Code:
AllocatedBaseSize=4062
Caption=C:\pagefile.sys
CurrentUsage=333
Description=C:\pagefile.sys
InstallDate=20100626093920.056423-240
Name=C:\pagefile.sys
PeakUsage=1480
Status=
TempPageFile=FALSE
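
For reference, output like the above can also be parsed programmatically. Here is a minimal Python sketch of my own (an illustration, not part of the WMI apps linked above) that parses the key=value lines WMIC emits with /format:list; the field names match the Win32_PageFileUsage output shown:

```python
# Minimal sketch: parse WMIC "/format:list" key=value output (as shown above)
# into a dict and report page-file usage. Field names follow Win32_PageFileUsage.

def parse_wmic_list(text: str) -> dict:
    """Parse key=value lines from `wmic ... /format:list` style output."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            key, _, value = line.partition("=")
            result[key] = value
    return result

# Sample text mirroring the output posted above
sample = """\
AllocatedBaseSize=4062
Caption=C:\\pagefile.sys
CurrentUsage=333
Name=C:\\pagefile.sys
PeakUsage=1480
TempPageFile=FALSE
"""

info = parse_wmic_list(sample)
print(f"Page file: {info['Name']}")
print(f"Current usage: {info['CurrentUsage']} MB of {info['AllocatedBaseSize']} MB "
      f"(peak {info['PeakUsage']} MB)")
```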
 
Digerati, I recommend you turn on all heap and tagging gflags as well as the stack trace database and object listing gflags (installed with WinDBG), restart your PC, then additionally have Procmon running. All of that will be the closest one can get to a full system log. You'll see how even the mightiest systems are crippled by its relentless logging. I've even had all gflags set (no Procmon) on one of our enterprise application servers in order to try and find the cause of a recurring bugcheck, and it was still enough to reduce our dual-Xeon system with 36 GB of RAM to a sloth. I had to wait 5 minutes just to open a browser!
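
For anyone wanting to try that setup, a rough command sketch follows (gflags.exe ships with the Debugging Tools for Windows / WinDbg). The flag abbreviations are taken from the GFlags documentation; the exact combination the poster used isn't specified, and "myapp.exe" is just a placeholder. These commands write configuration to the registry, so the registry-wide (/r) flags take effect after a reboot:

```
:: Enable the user-mode stack trace database system-wide (reboot required)
gflags /r +ust

:: Enable heap tagging and the object type list
gflags /r +htg +otl

:: Per-image: full page heap plus stack traces for a single suspect program
gflags /i myapp.exe +hpa +ust

:: Show what is currently set registry-wide
gflags /r
```

Remember to clear these afterward (gflags /r -ust, etc.) - leaving page heap and stack trace databases enabled is exactly what drags a system down to the "sloth" state described above.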
 
Well, this morning, my RAM usage was up to 45% and system performance had taken a hit. I suspect if I let it go for another day or so, I too would have run out of resources. I am actually a bit surprised there does not appear to be a limit on the data logged, just to prevent running out of resources. I think they need to use some form of FIFO logging, rather than something that just gets bigger and bigger. Oh well. I really don't think PM was meant to run for days on end.
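
The FIFO idea is easy to sketch. Here's a minimal, hypothetical illustration in Python (nothing to do with how ProcMon is actually implemented): a bounded ring buffer that silently discards the oldest events once full, so the log can never grow without limit.

```python
# Minimal sketch of FIFO-style event logging: keep only the most recent
# N events in a bounded ring buffer so the log cannot grow without limit.
from collections import deque

class RingLog:
    def __init__(self, max_events: int):
        # A deque with maxlen discards the oldest entry when a new one
        # is appended to a full buffer - exactly the FIFO behavior wanted.
        self.events = deque(maxlen=max_events)

    def record(self, event: str) -> None:
        self.events.append(event)

    def dump(self) -> list:
        return list(self.events)

log = RingLog(max_events=3)
for i in range(5):
    log.record(f"event {i}")

# Only the newest three events survive; events 0 and 1 were discarded.
print(log.dump())  # → ['event 2', 'event 3', 'event 4']
```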

It had recorded well over 100 million events! That's ridiculous, IMO - in terms of an analyst scouring through the logs looking for clues. Now if there were another program that read those logs and culled out the useless data, leaving only applicable error messages, that might be useful.

I have to admit my biases here. I hate analyzing logs. I had to do it for years in the Air Force. I'd much rather have a multimeter probe in one hand and a soldering iron in the other. That is one reason why I don't use PM much. It is valuable (to me), however, when I can duplicate a BSOD but have not yet isolated the actual culprit.
 
There's a setting in ProcMon to provide a cutoff. Go to Options then History Depth. The default is 199 million events, so obviously you can see why that would be so resource intensive!

I think ProcMon is very powerful for what it is, especially with additional items like everything under the Tools section, such as the Process Tree, which makes wading through all that data a lot easier - but it's still quite a bit to chew. Without having a firsthand idea of what to look for, no amount of graphs or filters is going to help. Xperf is also in the same ballpark.
 
Thanks. I saw History Depth in the drop-down menu but neglected to see what it meant. :embarrasment5:

I might try it again - when I am not doing real work! ;)
 
Not to revive an old thread for no reason, but I just came across this today and, with all the back and forth, felt I wanted to add my 2 pence. Procmon generally takes its data from the ETW channels in Windows - note that those same channels are available from the dump, and usually the last few seconds are still in RAM if you have a complete memory dump. That includes the event log, all of which can be queried from the debugger. Procmon (and ETW in general) is not real-time, but it's close. Procmon adds some overhead with filtering and such, so ETW channels from a dump are generally more helpful in the event of a crash, but if all you have is a ProcMon trace, that is at least something.
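
For anyone who wants to try querying ETW from a dump: the WinDbg !wmitrace extension can enumerate and dump the ETW buffers captured in a complete memory dump. A rough sketch below - the commands are from the Debugging Tools documentation, and LoggerId is a placeholder you read off the first command's output:

```
$$ List the ETW loggers that were active at the time of the crash
0: kd> !wmitrace.strdump

$$ Dump the in-memory buffers of one logger (use an Id from strdump's output)
0: kd> !wmitrace.logdump <LoggerId>
```

Decoding works best when the matching trace message format (TMF) information or symbols are available, but even raw buffer contents can show the last few seconds of activity before the bugcheck.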
 
