Message boards : Number crunching : Some Tasks failing with STATUS_ACCESS_VIOLATION
Author | Message |
---|---|
MossyRock Send message Joined: 3 Aug 13 Posts: 15 Credit: 4,353,376 RAC: 0 |
Hey, I have a long-standing problem on one of my machines. It began with Seti@Home WUs last year and is continuing now with Rosetta@Home WUs. I run World Community Grid WUs on this same machine and they usually all run to completion without problems, except on very rare occasions. When they do fail, they blow with the SAME error, at the SAME address. Other people who run the same WU nearly always have success. Here is an example of one of my problem workunits: https://boinc.bakerlab.org/rosetta/result.php?resultid=1213749857 This happens to about 15% of my R@H workunits, so I consider this to be a big problem. I run all cores and threads on this machine 24x7, there are no GPU WUs running, and the CPU is comfortable (40 to 50 °C). I have run extensive diagnostics on the machine (Prime95 for 12 hours; MemTest86 for 12 hours) and no problems ever show up. Does anyone have any guidance here? Thanks. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,747,692 RAC: 22,903 |
Is your CPU or RAM overclocked/volted? If so, go back to the default stock settings & see if that helps. There are cases where Prime95/Memtest have run OK, but actual applications have not been able to. How many CPU cores/threads are you using? I was getting the same errors you are when trying to use all cores & threads on my 6c/12t systems with only 16GB of RAM. You've got 8c/16t there, and only 16GB of RAM. Limiting the number of cores/threads in use until I upgraded my RAM stopped the errors from occurring. Generally allow for 1.3GB of RAM per core/thread being used. The fact that you were getting issues with Seti tasks previously makes a system problem more likely (CPU, memory, power supply). Grant Darwin NT |
MossyRock Send message Joined: 3 Aug 13 Posts: 15 Credit: 4,353,376 RAC: 0 |
Hi Grant, No, it is not overclocked/volted. I'm running all 8 cores / 16 threads. I've never come close to maxing out the memory that I've seen, but interesting on limiting the cores. I'll notch it down to 13 threads and see what happens. Thanks. |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,836 |
Hi Grant, Undervolted? I had some issues causing similar errors to yours so I had to slightly decrease the undervolt on my Ryzen 1400. Instead of undervolting by -0.15, I had to go with -0.10. That seems to have fixed those issues. I also run my RAM at 1600 MHz instead of the stock 2666, but that's just because I get an extra power saving. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Whether by coincidence or design, your failed work unit has been sent to a very similar computer for another go. It will be interesting to keep an eye on that task to see whether it completes successfully. |
floyd Send message Joined: 26 Jun 14 Posts: 23 Credit: 10,268,639 RAC: 0 |
I agree that this is likely a hardware issue but I'd like to add some thoughts that haven't come up yet. First, for the relevance of MemTest, I had a computer that could run it for hours without a single error but would crash after seconds under real (BOINC) load until I increased RAM voltage. Second, early first series Ryzens were bugged. I have a 1700 too and it has several problems, including frequent access violations when running R@H or WCG's MIP. I also have a later 1700x that does all that just fine. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1677 Credit: 17,747,692 RAC: 22,903 |
"Second, early first series Ryzens were bugged. I have a 1700 too and it has several problems, including frequent access violations when running R@H or WCG's MIP. I also have a later 1700x that does all that just fine." Motherboard BIOS updates (for the CPU microcode updates) should sort those sorts of issues out. Grant Darwin NT |
MossyRock Send message Joined: 3 Aug 13 Posts: 15 Credit: 4,353,376 RAC: 0 |
So that task that failed on my machine completed successfully on a similar machine. That confirms that there is something wrong on MY machine. Thank you for your recommendations here. There are tips to alter CPU voltage, both upwards and downwards. This is something that I'm not comfortable with as I've never messed around with voltages. One observation - since I've notched performance from 16 threads down to 13 as per Grant's recommendation, only one task has failed and it was right around the time of the setting change so I'm not sure if it was PRE- or POST-change. In the meantime, I will check for a new BIOS update for my mobo. The last time I updated the BIOS was quite some time ago. If it turns out that I need to tinker with CPU voltages, I will need guidance from someone who knows what they're doing. Thank you, all. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,473 RAC: 472 |
There are plenty of Ryzen tuning videos on YouTube. Search on your favorite TechTuber's channel and you will find explanations about how to set CPU voltages on Ryzen. I would suggest going to a manual voltage and fixed clock multiplier if the machine is a dedicated crunching machine where you aren't concerned about gaming frames per second. You can achieve an overall higher all-core clock frequency at LESS Vcore, with attendant lower temperatures and power consumption. |
floyd Send message Joined: 26 Jun 14 Posts: 23 Credit: 10,268,639 RAC: 0 |
"Motherboard BIOS updates for the CPU Microcode updates should sort those sort of issues out," They haven't so far and I don't expect any more to come. |
floyd Send message Joined: 26 Jun 14 Posts: 23 Credit: 10,268,639 RAC: 0 |
"I would suggest going to a manual voltage and fixed clock multiplier" And I would suggest not doing that before the system is stable at or close to default settings, if at all. The OP seems more concerned about stability than performance. |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,473 RAC: 472 |
"I would suggest going to a manual voltage and fixed clock multiplier" "And I would suggest not doing that before the system is stable at or close to default settings, if at all. The OP seems more concerned about stability than performance." You don't have to overclock. Just take it off Auto and run at the default base clocks for maximum stability. Even at base clocks you can significantly reduce the Vcore that Auto sets and it will still be totally stable. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 390 Credit: 12,073,013 RAC: 4,827 |
If you go to manual and reduce the Vcore so far that it becomes unstable enough that it won’t boot into the BIOS, how do you recover? |
Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 332,473 RAC: 472 |
Well, first you don't reduce the Vcore that far in the first place. Just set Auto, run your normal full crunching load, look at what the Vcore FIT voltage actually is and use that as your set point in the BIOS. And if you can't boot into the BIOS, just clear the BIOS and start over. Or use BIOS Flashback to flash a pristine image back to the BIOS and start over. You can always recover from a failed boot with a new BIOS image. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 390 Credit: 12,073,013 RAC: 4,827 |
OK, thanks. I’ll have a look on number 2 desktop - half the RAC and half as much again on the power. |
MossyRock Send message Joined: 3 Aug 13 Posts: 15 Credit: 4,353,376 RAC: 0 |
It's been three days and not a single WU has blown. The only step I took was reducing the number of running threads down to 13 from 16 as per Grant's suggestions. The problem seems to be fixed, but I'll keep a close eye on it. I'll leave the voltages alone unless the problem seriously recurs. Thank you, all. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
"It's been three days and not a single WU has blown. The only step I took was reducing the number of running threads down to 13 from 16 as per Grant's suggestions." Reducing from 16 to 13 is a lot. Now you've found a successful level, take it up to 14 and see if it remains stable, then 15. If it's stable with either of those, fine. It's just all 16 that's a problem. If you get errors again, say at 15, you can drop it down to 14 again and know that's the best compromise. You can do all this without getting into voltages, which is a whole other ball game. |
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
"The only step I took was reducing the number of running threads down to 13 from 16 as per Grant's suggestions." You've got a Ryzen 1700 (8c/16t) and 16GB of memory. I'd say it runs out of memory and reducing the number of threads allows the work to fit. It all depends on the work mix and how much memory the tasks need at the time. Some work units need 1.5GB and others are happy with 400MB. For what it's worth, I ended up upgrading the memory on all my 6c/12t machines to 32GB. For a couple of 4GB Pi4s I ended up reducing the threads to 3. BOINC blog |
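
The memory rule of thumb from this thread (allow roughly 1.3GB of RAM per running core/thread) can be turned into a quick back-of-the-envelope check. The Python sketch below is only an illustration: the function names, the `reserved_gb` parameter and the flat per-thread allowance are assumptions made here, not anything provided by BOINC or Rosetta@home, and real tasks vary (the thread mentions anything from 400MB to 1.5GB).

```python
import math


def threads_that_fit(total_ram_gb, per_thread_gb=1.3, reserved_gb=0.0):
    """Estimate how many tasks fit in RAM under a flat per-thread allowance.

    per_thread_gb defaults to the 1.3GB rule of thumb quoted in the thread.
    reserved_gb is a hypothetical allowance for the OS and other programs.
    """
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    return math.floor(usable_gb / per_thread_gb)


def cpu_percentage(threads_wanted, threads_total):
    """Express a thread limit as a percentage of all hardware threads."""
    return 100.0 * threads_wanted / threads_total


if __name__ == "__main__":
    total_threads = 16  # Ryzen 1700: 8 cores / 16 threads
    # Never suggest more threads than the CPU actually has.
    suggested = min(threads_that_fit(16), total_threads)
    print(suggested)               # 12 threads for 16GB at 1.3GB each
    print(cpu_percentage(13, 16))  # 81.25, the 13-thread setting used above
```

By this estimate a 16GB machine comfortably feeds about 12 threads, which is in line with the errors stopping at 13 threads and with the 6c/12t machines being upgraded to 32GB. The percentage is the sort of value that would go into BOINC's "use at most N% of the CPUs" computing preference; how the client rounds a fractional result is not covered here, so check the thread count it actually runs.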
©2024 University of Washington
https://www.bakerlab.org