Some Tasks failing with STATUS_ACCESS_VIOLATION

Message boards : Number crunching : Some Tasks failing with STATUS_ACCESS_VIOLATION

MossyRock

Joined: 3 Aug 13
Posts: 15
Credit: 4,353,376
RAC: 0
Message 97918 - Posted: 5 Jul 2020, 5:05:10 UTC
Last modified: 5 Jul 2020, 5:06:49 UTC

Hey,

I have a long-standing problem on one of my machines. It began with Seti@Home WUs last year and is continuing now with Rosetta@Home WUs. I run World Community Grid WUs on this same machine and they nearly all run to completion without problems; on the very rare occasions when one fails, it blows with the SAME error, at the SAME address.

Other people who run the same WU nearly always have success. Here is an example of one of my problem workunits:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1213749857
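
For reference, STATUS_ACCESS_VIOLATION is the Windows NTSTATUS code 0xC0000005, which tools often print as a signed 32-bit exit code instead. A minimal Python sketch of matching the two forms follows. The signed value below is illustrative and is not copied from the linked result, and the helper name is made up for this example.

```python
# NTSTATUS value for an access violation on Windows.
STATUS_ACCESS_VIOLATION = 0xC0000005


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


# Illustrative signed exit code (assumption: how a log might print it).
signed_exit_code = -1073741819

status = to_ntstatus(signed_exit_code)
print(hex(status))
print(status == STATUS_ACCESS_VIOLATION)
```

If the two values match, the task died on an invalid memory access, which is consistent with either a software bug or unstable hardware.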

This happens to about 15% of my R@H workunits, so I consider this to be a big problem. I run all cores and threads on this machine 24x7, there are no GPU WUs running, and the CPU is comfortable (40 to 50 °C).

I have run extensive diagnostics on the machine (Prime95 for 12 hours; MemTest86 for 12 hours) and no problems ever show up.

Does anyone have any guidance here?

Thanks.
ID: 97918 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1487
Credit: 14,674,522
RAC: 14,452
Message 97919 - Posted: 5 Jul 2020, 5:33:49 UTC
Last modified: 5 Jul 2020, 5:36:33 UTC

Is your CPU or RAM overclocked/volted? If so, go back to the default stock settings & see if that helps. There are cases where Prime95/Memtest have run OK, but actual applications have not been able to.

How many CPU cores/threads are you using?
I was getting the same errors you are when trying to use all cores & threads on my 6c/12t systems with only 16GB of RAM. You've got 8c/16t there, and only 16GB of RAM.
Limiting the number of cores/threads in use until I upgraded my RAM stopped the errors from occurring. Generally allow for 1.3GB of RAM per core/thread being used.
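
As a rough sketch of that rule of thumb in Python: the 1.3GB figure is the one quoted above, while the function name is hypothetical and the calculation assumes all installed RAM is available to BOINC, which overstates things since the OS needs some too.

```python
import math


def max_threads(ram_gb: float, gb_per_thread: float = 1.3) -> int:
    """Largest thread count that fits in ram_gb at gb_per_thread each."""
    return math.floor(ram_gb / gb_per_thread)


installed_ram_gb = 16   # the machine discussed in this thread
logical_cpus = 16       # 8 cores / 16 threads

# RAM the rule of thumb would want if every thread were in use.
wanted_gb = logical_cpus * 1.3
print(f"All {logical_cpus} threads: about {wanted_gb:.1f}GB wanted")

# Thread count the rule of thumb allows with the RAM actually installed.
print(f"With {installed_ram_gb}GB: at most {max_threads(installed_ram_gb)} threads")
```

By this estimate, running all 16 threads wants more RAM than a 16GB machine has.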


The fact that you were getting issues with Seti tasks previously makes a system problem more likely (CPU/Memory, Power supply).
Grant
Darwin NT
ID: 97919 · Rating: 0
MossyRock

Joined: 3 Aug 13
Posts: 15
Credit: 4,353,376
RAC: 0
Message 97920 - Posted: 5 Jul 2020, 5:45:45 UTC - in response to Message 97919.  

Hi Grant,

No, it is not overclocked/volted. I'm running all 8 cores / 16 threads.

I've never come close to maxing out the memory that I've seen, but interesting on limiting the cores. I'll notch it down to 13 threads and see what happens.

Thanks.
ID: 97920 · Rating: 0
Falconet

Joined: 9 Mar 09
Posts: 350
Credit: 1,017,068
RAC: 223
Message 97925 - Posted: 5 Jul 2020, 11:34:51 UTC - in response to Message 97920.  
Last modified: 5 Jul 2020, 11:40:33 UTC

Undervolted? I had some issues causing similar errors to yours so I had to slightly decrease my undervolt on my Ryzen 1400.
Instead of undervolting by -0.15, I had to go with -0.10. That seems to have fixed those issues. I also run my RAM at 1600 MHz instead of the stock 2666, but that's just for the extra power saving.
ID: 97925 · Rating: 0
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 97927 - Posted: 5 Jul 2020, 13:57:34 UTC - in response to Message 97918.  

Whether by coincidence or design, your failed work unit has been sent to a very similar computer for another go. It will be interesting to keep an eye on that task to see whether it completes successfully.
ID: 97927 · Rating: 0
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 97929 - Posted: 5 Jul 2020, 18:46:02 UTC

I agree that this is likely a hardware issue but I'd like to add some thoughts that haven't come up yet. First, for the relevance of MemTest, I had a computer that could run it for hours without a single error but would crash after seconds under real (BOINC) load until I increased RAM voltage. Second, early first series Ryzens were bugged. I have a 1700 too and it has several problems, including frequent access violations when running R@H or WCG's MIP. I also have a later 1700x that does all that just fine.
ID: 97929 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1487
Credit: 14,674,522
RAC: 14,452
Message 97933 - Posted: 6 Jul 2020, 6:59:18 UTC - in response to Message 97929.  

floyd wrote: "Second, early first series Ryzens were bugged. I have a 1700 too and it has several problems, including frequent access violations when running R@H or WCG's MIP. I also have a later 1700x that does all that just fine."
Motherboard BIOS updates that include CPU microcode updates should sort out those sorts of issues.
Grant
Darwin NT
ID: 97933 · Rating: 0
MossyRock

Joined: 3 Aug 13
Posts: 15
Credit: 4,353,376
RAC: 0
Message 97936 - Posted: 6 Jul 2020, 15:22:58 UTC

So that task that failed on my machine completed successfully on a similar machine. That confirms that there is something wrong on MY machine.

Thank you for your recommendations here. There are tips to alter CPU voltage, both upwards and downwards. This is something that I'm not comfortable with as I've never messed around with voltages.

One observation - since I've notched performance from 16 threads down to 13 as per Grant's recommendation, only one task has failed and it was right around the time of the setting change so I'm not sure if it was PRE- or POST-change.

In the meantime, I will check for a new BIOS update for my mobo. The last time I updated the BIOS was quite some time ago.

If it turns out that I need to tinker with CPU voltages, I will need guidance from someone who knows what they're doing.

Thank you, all.
ID: 97936 · Rating: 0
Keith Myers

Joined: 29 Mar 20
Posts: 95
Credit: 289,903
RAC: 0
Message 97937 - Posted: 6 Jul 2020, 16:45:07 UTC - in response to Message 97936.  

There are plenty of Ryzen tuning videos on YouTube. Search your favorite TechTuber's channel and you will find explanations of how to set CPU voltages on Ryzen.

I would suggest going to a manual voltage and fixed clock multiplier if the machine is a dedicated crunching machine and you aren't concerned about gaming frames per second.

You can achieve an overall higher all-core clock frequency at LESS Vcore, with attendant lower temperatures and power consumption.
ID: 97937 · Rating: 0
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 97938 - Posted: 6 Jul 2020, 18:39:31 UTC - in response to Message 97933.  

Grant (SSSF) wrote: "Motherboard BIOS updates that include CPU microcode updates should sort out those sorts of issues."
They haven't so far and I don't expect any more to come.
ID: 97938 · Rating: 0
floyd

Joined: 26 Jun 14
Posts: 23
Credit: 10,268,639
RAC: 0
Message 97939 - Posted: 6 Jul 2020, 18:47:01 UTC - in response to Message 97937.  

Keith Myers wrote: "I would suggest going to a manual voltage and fixed clock multiplier"
And I would suggest not doing that before the system is stable at or close to default settings, if at all. The OP seems more concerned about stability than performance.
ID: 97939 · Rating: 0
Keith Myers

Joined: 29 Mar 20
Posts: 95
Credit: 289,903
RAC: 0
Message 97940 - Posted: 6 Jul 2020, 20:02:39 UTC - in response to Message 97939.  

You don't have to overclock. Just take it off Auto and run at the default base clocks for maximum stability. Even at base clocks you can significantly reduce the Vcore that Auto sets and it will still be totally stable.
ID: 97940 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,735,541
RAC: 5,655
Message 97941 - Posted: 6 Jul 2020, 21:00:35 UTC - in response to Message 97940.  


If you go to manual and reduce the Vcore too far so that it becomes unstable enough that it won’t boot into the bios, how do you recover?
ID: 97941 · Rating: 0
Keith Myers

Joined: 29 Mar 20
Posts: 95
Credit: 289,903
RAC: 0
Message 97942 - Posted: 6 Jul 2020, 23:33:21 UTC - in response to Message 97941.  


Well, first, you don't reduce the Vcore that far in the first place. Just set Auto, run your normal full crunching load, look at what the Vcore FIT voltage actually is and use that as your set point in the BIOS.
And if you can't boot into the BIOS, just clear the BIOS and start over. Or use BIOS Flashback to flash a pristine image back to the BIOS and start over.
You can always recover from a failed boot with a new BIOS image.
ID: 97942 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,735,541
RAC: 5,655
Message 97943 - Posted: 7 Jul 2020, 4:28:21 UTC - in response to Message 97942.  


OK, thanks. I’ll have a look on number 2 desktop - half the RAC and half as much again on the power.
ID: 97943 · Rating: 0
MossyRock

Joined: 3 Aug 13
Posts: 15
Credit: 4,353,376
RAC: 0
Message 97951 - Posted: 7 Jul 2020, 17:58:20 UTC

It's been three days and not a single WU has blown. The only step I took was reducing the number of running threads down to 13 from 16 as per Grant's suggestions.

The problem seems to be fixed, but I'll keep a close eye on it. I'll leave twiddling with the voltages alone unless the problem seriously recurs.

Thank you, all.
ID: 97951 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,536,805
RAC: 15,887
Message 97953 - Posted: 7 Jul 2020, 22:06:00 UTC - in response to Message 97951.  

Reducing from 16 to 13 is a lot.
Now you've found a successful level, take it up to 14 and see if it remains stable, then 15.
If it's stable with either of those, fine. It's just all 16 that's a problem. If you get errors again, say at 15, you can drop it down to 14 again and know that's the best compromise.
You can do all this without getting into voltages, which is a whole other ball-game.
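
BOINC expresses the CPU limit as a percentage ("Use at most N% of the CPUs" in the computing preferences), so each step in that ladder maps to a percentage setting. A small Python sketch, assuming 16 logical CPUs as on this machine; the function name is hypothetical, and how the client rounds a fractional result is not covered here.

```python
def cpu_percent_for(threads_wanted: int, logical_cpus: int) -> float:
    """Percentage of CPUs that corresponds to threads_wanted running tasks."""
    if not 1 <= threads_wanted <= logical_cpus:
        raise ValueError("threads_wanted must be between 1 and logical_cpus")
    return round(100 * threads_wanted / logical_cpus, 2)


logical_cpus = 16  # 8 cores / 16 threads

# Step up one thread at a time from the known-good level of 13.
for threads in (13, 14, 15, 16):
    pct = cpu_percent_for(threads, logical_cpus)
    print(f"{threads} threads -> use at most {pct}% of the CPUs")
```

Let each step run for a few days before moving to the next one, since the failures only hit a fraction of tasks.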
ID: 97953 · Rating: 0
MarkJ

Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 27
Message 97977 - Posted: 9 Jul 2020, 8:15:32 UTC - in response to Message 97951.  

MossyRock wrote: "The only step I took was reducing the number of running threads down to 13 from 16 as per Grant's suggestions. The problem seems to be fixed, but I'll keep a close eye on it."

You've got a Ryzen 1700 (8c/16t) and 16GB of memory. I'd say it runs out of memory, and reducing the number of threads allows it to fit. It all depends on the work mix and how much memory the tasks need at the time. Some work units need 1.5GB and others are happy with 400MB.
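
To see why the work mix matters, here is a hedged Python sketch of the best-case and worst-case demand using the two per-task figures above (400MB and 1.5GB). The function name is hypothetical, and real tasks will fall somewhere between these bounds.

```python
def demand_range_gb(threads: int, low_gb: float = 0.4, high_gb: float = 1.5):
    """Return (best case, worst case) total RAM demand in GB for threads tasks."""
    return threads * low_gb, threads * high_gb


installed_ram_gb = 16

for threads in (13, 16):
    best, worst = demand_range_gb(threads)
    verdict = "can exceed" if worst > installed_ram_gb else "fits within"
    print(f"{threads} threads: {best:.1f}GB to {worst:.1f}GB, "
          f"worst case {verdict} {installed_ram_gb}GB")
```

Even at 13 threads the worst case is above 16GB, so fewer threads lowers the odds of running short; it does not rule it out.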

For what it's worth, I ended up upgrading the memory on all my 6c/12t machines to 32GB. For a couple of 4GB Pi 4s I ended up reducing the threads to 3.
BOINC blog
ID: 97977 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org