New computer, but experiencing nvlddmkm.sys VIDEO_DXGKRNL_FATAL_ERROR (code 141)?

Discussion in 'Sager and Clevo' started by Amnvex, Sep 3, 2019.

  1. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    28
    Likes Received:
    2
    Trophy Points:
    6
    Does this always point to a faulty card? Or can it be a driver issue or something? A windows 1903 issue, perhaps?

    I've tried just about everything to troubleshoot it. It's not temps and it's not overclocking. SFC and DISM bring back nothing. All memory checks, stress tests, etc. come back fine, too. I think it may be a driver issue somewhere, but I can't imagine which driver, because a fresh install of Windows didn't change anything. Driver Verifier on 3rd-party (non-Microsoft) drivers has not caused any crashes, but Microsoft-only driver verification caused BSOD loops on startup (when loading into an account) that forced me to reset/turn off the verifier in safe mode (this was especially bad after a particular Windows update). I can't fathom that a laptop with an RTX 2060 that's only ~2 months old is already failing... and the nature of this issue has changed from previous times.

    I noticed something peculiar: the longer the PC is on, the more likely it is to fail. For example, I can play a game for 2-3 hours and then it'll crash, but it never crashes before that 2-3 hour mark after a fresh reboot. Basically no early crashes. That's not the case for another game, though. Victor Vran has a map where you choose a destination, and opening and closing it causes a crash. At some point it got so bad (I don't know how it got worse) that opening the map guaranteed an immediate crash: a video TDR, a ~30-40 second wait, then a message that Windows is shutting down while the fans suddenly blast, and then the whole thing unexpectedly shuts off. If I try to shut the PC down manually during that calm period before the automatic shutdown, it BSODs instead, and the BSOD names ntoskrnl as the responsible driver rather than nvlddmkm.sys. Sometimes it does initiate the shutdown procedure and goes to a blue screen to shut down, but it never finishes. It just turns off completely.

    Very confused... anyone have any idea? :\
     
    Last edited: Sep 3, 2019
  2. joluke

    joluke Notebook Deity

    Reputations:
    175
    Messages:
    914
    Likes Received:
    420
    Trophy Points:
    76
    What temperatures are you getting in full load and idle?
     
  3. DaMafiaGamer

    DaMafiaGamer Notebook Deity

    Reputations:
    480
    Messages:
    839
    Likes Received:
    1,015
    Trophy Points:
    156
    The GPU is getting unreliable voltage to the die in its full-load P-state. The VRAM seems fine, but the power phases seem unreliable: when you do sudden things in a game the GPU may require more voltage, and the VRMs can't supply it stably. This causes the GPU core to crash unexpectedly, leading to the blue screen. Long story short, there is a hardware failure, but it's not really bad yet; it seems the GPU can still run properly if you adjust the core frequency correctly. If things were 'bad' bad, the laptop would freeze or black-screen and restart or shut down. The fact that the OS knows something is wrong suggests the situation isn't that severe.

    Please try downloading and using NVIDIA Inspector, offset the core clock in the negative by around 100 to 150 MHz, and try running your games. Let me know how that goes :)

    Offsetting the core in the negative means that it needs less vcore to power the gpu which leads to less wattage which in turn stresses the vrms less.
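
    To put rough numbers on that, here is a minimal sketch using the textbook CMOS approximation that dynamic power scales with V^2 * f. It is not a model of this particular card: the function name is made up for illustration, the 1580 MHz figure is just an example boost clock, and the voltages are placeholders because the real vcore values aren't known here.

```python
def relative_dynamic_power(f_old_mhz, f_new_mhz, v_old=1.0, v_new=1.0):
    """Return new/old dynamic power, assuming P is proportional to V^2 * f.

    This is only the usual first-order approximation; it ignores leakage
    and whatever the GPU's own boost algorithm does with voltage.
    """
    if f_old_mhz <= 0 or v_old <= 0:
        raise ValueError("old frequency and voltage must be positive")
    return (f_new_mhz / f_old_mhz) * (v_new / v_old) ** 2


if __name__ == "__main__":
    # Example only: a -150 MHz offset from a 1580 MHz boost clock,
    # first at unchanged voltage, then with a hypothetical 5% voltage drop.
    same_voltage = relative_dynamic_power(1580, 1430)
    lower_voltage = relative_dynamic_power(1580, 1430, v_old=1.00, v_new=0.95)
    print(f"same voltage:  {same_voltage:.3f} of original power")
    print(f"lower voltage: {lower_voltage:.3f} of original power")
```

    The point of the sketch is that the clock drop alone only helps linearly; most of the relief for the VRMs would come from the lower voltage that the lower clock allows.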

    The fact that there is no artifacting of any sort shows that this is INDEED A VOLTAGE ISSUE!
    Clevo, fix your VRM schematics!
     
  4. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    28
    Likes Received:
    2
    Trophy Points:
    6
    Err, sorry, I should have said that I did not OC the graphics card. I underclocked it, and undervolted it (with turbo reduced to 3.6 GHz from its 4.1). I ran Furmark for a while, too, and it's stable. Are you saying I should reduce its clock speed and try it then? Maybe it is a voltage issue as you're saying. I guess I can try to reduce the clock. The base is already reduced since it is a 2060 mobile and not a desktop card. (I see in Furmark that it hits around ~1580 MHz on the boost. I don't know how that's possible if it is supposed to go no higher than 1200 MHz according to the inspector.) The other thing that's weird is this: Nvidia Inspector flashes these numbers in and out (compare the images):

    upload_2019-9-3_18-33-0.png
    upload_2019-9-3_18-33-8.png
    upload_2019-9-3_18-38-10.png
    upload_2019-9-3_18-39-11.png

    Note when it refreshes, if I have not yet saved the settings in adjustments for offsets, it will refresh those back to 0 the next time it gets info from the card (e.g. sensor).

    Here's what happens when I reduce the clocks. When it flashes, I happened to capture it. Check the sensor data now (I had to capture it fast since it goes away also equally fast):
    upload_2019-9-3_18-50-36.png

    Idle is around ~40 °C and full load is around 70 °C. I have not yet seen it go above 70 °C in Furmark after running it for 10 mins or so.
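
    Since the Inspector readout flashes in and out, one way to get steadier numbers over a long session might be to log them from nvidia-smi instead. This is only a sketch: it assumes nvidia-smi is on the PATH and that the driver reports these query fields for this card (some laptop GPUs return [N/A] for things like power draw). The function names, the 5-second interval, and the output file name are arbitrary choices for illustration.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in the order they come back in the CSV row.
QUERY = "timestamp,clocks.gr,clocks.mem,temperature.gpu,power.draw"


def parse_sample(line):
    """Turn one CSV row from nvidia-smi into a dict keyed by query field.

    Values are kept as strings because unsupported fields come back
    as "[N/A]" rather than as numbers.
    """
    values = [v.strip() for v in line.split(",")]
    return dict(zip(QUERY.split(","), values))


def read_sample():
    """Run nvidia-smi once and return the first GPU's readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


if __name__ == "__main__":
    # Append one line every 5 seconds; after a crash, the last lines of the
    # file show what the GPU was doing just before it went down.
    with open("gpu_log.csv", "a") as log:
        while True:
            s = read_sample()
            log.write(",".join(s[k] for k in QUERY.split(",")) + "\n")
            log.flush()
            time.sleep(5)
```

    Leaving this running in the background during a game would show whether the crashes line up with a particular clock, temperature, or nothing obvious at all.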
     
    Last edited: Sep 3, 2019
  5. DaMafiaGamer

    DaMafiaGamer Notebook Deity

    Reputations:
    480
    Messages:
    839
    Likes Received:
    1,015
    Trophy Points:
    156
    Wait the laptop has Optimus? This could be a whole different issue entirely, still related to the gpu but the mux switch could also be bad...
     
  6. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    28
    Likes Received:
    2
    Trophy Points:
    6
    I don't really know... it's a 630 HD intel and an RTX 2060. I assumed that the RTX 2060 and the Intel GPU switch between each other as necessary and that's what Optimus is for? Maybe that's not right and I don't have it. I'm probably making a mistake by claiming that I do have it.

    But here's something I found out: if I close Nvidia Inspector, the clock offsets reset. Maybe it has something to do with the BIOS's speed scaling being enabled? I don't really know. But yeah, everything goes back to stock after applying clocks and voltages to the card if I close the inspector. Seems like it doesn't stick.

    I opened Furmark just now. The sensors are stably reading the card. No flashing, no flickering.

    upload_2019-9-3_19-1-54.png

    Here's with Furmark running:
    upload_2019-9-3_19-3-16.png

    Here's with Furmark with -150MHz on the core clock:
    upload_2019-9-3_19-4-54.png

    And ~3 minutes in:
    upload_2019-9-3_19-5-23.png

    Furmark closed, BUT the application itself is still open. Still with -150MHz on the core clock.
    upload_2019-9-3_19-7-23.png

    What could possibly be the problem...? Is there a way to permanently underclock the card in another way? And again I don't understand how the card is reaching such high clock values when I thought it was supposed to be downclocked for laptops.
     
    Last edited: Sep 3, 2019
  7. DaMafiaGamer

    DaMafiaGamer Notebook Deity

    Reputations:
    480
    Messages:
    839
    Likes Received:
    1,015
    Trophy Points:
    156
    Seems that the card is working fine on the underclock. You need to put the values in two or three times for them to stick. Run Furmark and put in the negative offset values. They will stick because the dedicated GPU is active. Remember to literally click "Apply Clocks & Voltage" a good two or three times. It shouldn't revert then, as long as you don't refresh the program or close it...
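
    If the offsets keep resetting, another thing that might be worth a try is capping the clock range with nvidia-smi's --lock-gpu-clocks option rather than using an offset. Big caveats: it needs an admin prompt, it does not survive a reboot, and it may simply be reported as unsupported on a mobile GeForce card, so treat this as an untested sketch. The helper names and the 300/1400 MHz values are examples only, not recommended settings.

```python
import subprocess


def lock_clocks_command(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that limits the GPU core clock range."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("need 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index),
            f"--lock-gpu-clocks={min_mhz},{max_mhz}"]


def reset_clocks_command(gpu_index=0):
    """Build the nvidia-smi command that removes the clock limit again."""
    return ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"]


if __name__ == "__main__":
    # Example values only; run from an elevated prompt.
    cmd = lock_clocks_command(300, 1400)
    print("running:", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    # nvidia-smi prints its own message if the card or driver refuses this.
    print(result.stdout or result.stderr)
```

    Unlike an Inspector offset, this sets a hard ceiling, so it would also answer the question of whether the crashes only happen at the highest boost clocks.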
     
  8. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    28
    Likes Received:
    2
    Trophy Points:
    6
    Ok, I did. I had Furmark open when I applied the clocks and voltages. I spammed the button like 10 times with an offset of -100. When I did that, the estimated max kept dropping (it was at ~1900, then ~1800, then 1455 or something, and now 1355). It doesn't go any lower if I keep spamming the button. If I turn it off, the values should stick, right? They don't show as "stuck" after closing and reopening Nvidia Inspector (everything in the adjustments panel goes back to 0). Idk if that's normal, but the estimated max is no longer as high as it used to be.

    This is with Furmark @ 5 mins runtime and with nvidia inspector restarted. Seems that it is generally not going above ~1500 MHz, which is still more than what the card is rated for (especially with boost).
    upload_2019-9-3_19-17-36.png

    This is a really weird problem... really weird.
     
    Last edited: Sep 3, 2019
  9. DaMafiaGamer

    DaMafiaGamer Notebook Deity

    Reputations:
    480
    Messages:
    839
    Likes Received:
    1,015
    Trophy Points:
    156
    That gpu vbios is really not playing nice with nv inspector lol. There is something up with your laptop. Even I’m struggling to find out!
     
  10. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    28
    Likes Received:
    2
    Trophy Points:
    6
    *shrug*
    I wish I had an answer. Something driver-related is my guess, but trying to single that out is impossible. It could simply be Microsoft fked up or something. I get Intel HD driver errors saying "The description for Event ID 0 from source igfxCUIService2.0.0.0 cannot be found." I've reinstalled the GPU driver 100 times. Doesn't help. Tried all versions. This is a Clevo P970ED, one of Clevo's newest models, so I can only guess at why there are so many issues.

    I don't know what's going on X_X...I've had driver issues from the beginning. I've gotten rid of most of them at this point. I also had Windows 10 upgrade 1903 fail on me and BSOD on me with updates with ntoskrnl driver being the culprit. And yet all tests, memory, SSD, HDD, etc, pass with no problems.

    There is another funky thing: I currently have a custom fan profile. After I updated my drivers (this was not a problem before the Windows and other driver updates), the GPU fan alone would ramp to maximum at start-up and stay that way if I was on the performance profile in the CCC. On the entertainment profile, this doesn't happen (CCC 3.0). But yeah, idk what's going on anymore. The computer doesn't BSOD or crash or anything anymore unless it's dealing with Win10 updates. Then it may. But that's been rare and has only happened a couple of times in the last month of using it. The other ~30 crashes were all related to the GPU.

    Next time I boot, I am going to undo the overclocking speed stepping technology that Clevo has enabled in the BIOS. But that's after I try the Nvidia Inspector underclock. I used to have MSI Afterburner, but that was pretty useless for this card other than to change the clock speeds (voltages are locked for RTX laptop cards).
     
    Last edited: Sep 3, 2019