Completed and validated task but early reporting and error message in the Stderr output?

Peti

Joined: 17 Mar 20
Posts: 5
Credit: 142,053
RAC: 0
Message 92281 - Posted: 25 Mar 2020, 17:02:51 UTC
Last modified: 25 Mar 2020, 17:06:39 UTC

Hi everyone!
I'm new to this topic, and I'm a little confused.
So, my PC has calculated some tasks.
Sometimes I see early reporting (a few hours remaining) after rebooting the PC and restarting BOINC Manager,
and I also had some computation errors (I guess due to incorrect settings while trying to overclock the CPU; it's back to normal settings now).

When I browse the reported tasks on the web page, I still see errors in the "Stderr output" text,
even though the task is shown as "Completed and validated". Why is that, if there is an error?

For example, this: https://boinc.bakerlab.org/rosetta/result.php?resultid=1132408641
ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: src/core/pose/symmetry/util.cc line: 884
And the run time is much less than expected (1 hour 19 min 56 sec instead of >7 hours).
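
Note on the first error line: as quoted, it reads like an assertion that a matrix named coordsys_rot has a determinant within 1e-6 of 1, which is the property a proper rotation matrix has. The sketch below only illustrates that kind of check in Python. It is not Rosetta's code, and the helper names (det3, is_proper_rotation, rotation_z) are made up for this example.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_proper_rotation(m, tol=1e-6):
    """Same shape as the quoted check: |det(m) - 1| < tol (hypothetical helper)."""
    return abs(det3(m) - 1.0) < tol

def rotation_z(angle):
    """Rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

good = rotation_z(math.radians(30))
# Scale every entry by 0.1%: the determinant becomes about 1.003, outside the tolerance.
bad = [[1.001 * x for x in row] for row in good]

print(is_proper_rotation(good))  # True
print(is_proper_rotation(bad))   # False
```

A tolerance of 1e-6 is tight, so a matrix that is only slightly off already fails the check. The sketch says nothing about why the matrix in the real task was off.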

Another example: https://boinc.bakerlab.org/rosetta/result.php?resultid=1131638832
Run time 2 hours 59 min 50 sec instead of >7 hours, with the same error in stderr.

And this: https://boinc.bakerlab.org/rosetta/result.php?resultid=1131616913
ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose
ERROR:: Exit from: src/protocols/protein_interface_design/filters/RmsdFilter.cc line: 323
(also early reporting, 4 hours 16 min 49 sec)

Then how is this "valid"? Whom should I tell that my PC might have made mistakes that went unnoticed?
(Just to note: I had overclocked the PC and was running Rosetta@home like that. The PC did not crash and seemed stable for a day at 90% CPU load. I have now reset the CPU clock to normal and reset the project in BOINC Manager. I'm curious whether the computer will keep making mistakes, and whether this is a software bug or due to the overclock.)

(By the way: Linux Mint 19.3, BOINC Manager 7.9.3 x64, EVGA SR2 with dual Xeon X5670, 64 GB ECC memory.)
Peti

Joined: 17 Mar 20
Posts: 5
Credit: 142,053
RAC: 0
Message 92306 - Posted: 26 Mar 2020, 1:47:10 UTC - in response to Message 92281.  
Last modified: 26 Mar 2020, 2:32:54 UTC

Update:
A handful of new tasks have finished with no errors. This time: no overclock, and I did not touch BOINC Manager after the start.
(Also, I reset the project yesterday and downloaded all the tasks and necessary files again, so it's a clean start.)

I'm still not sure whether my overclock was bad or there was a bug or problem in the stop and restart...
(I exited BOINC Manager and stopped the tasks before rebooting. There should be checkpoints, so it should not matter, should it? I guess it should restart properly from a checkpoint even after a power outage?)

My overclock looked stable: it passed memtest and stress tests (a program calculating pi to lots of digits), and it also reported some of the Rosetta tasks without problems.
(I'm running Folding@home in parallel, and I had no calculation errors there in either CPU or GPU work units.)
So I don't know whether it's a random hardware error that happens rarely or a software bug.

By the way, why doesn't BOINC Manager suspend computation when Folding@home manages to receive a CPU work unit and starts calculating? There is a setting that says "Suspend when non-BOINC CPU usage is above 25%", but it does nothing, and I have to pause it manually. (I cannot run the CPU at 100%, since I need one CPU core for GPU Folding@home, so I set Rosetta to use 88%.)

Edit:
There are some other people here who had similar error messages:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12242&postid=92152#92152
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92309 - Posted: 26 Mar 2020, 3:48:59 UTC

R@h is very CPU and memory intensive. It is not uncommon for over-clockers to crash some work units.

I'm not saying the work units are perfect; I won't have time to review them. Hopefully things remain stable at normal clock speed. Thank you for joining the project.
Rosetta Moderator: Mod.Sense




©2024 University of Washington
https://www.bakerlab.org