Posts by AdeB

21) Message boards : Number crunching : Report long-running models here (Message 59760)
Posted 23 Feb 2009 by AdeB
Post:
Preferred runtime + 4 hrs for workunit loopbuild_mamaln_ideal_hb_t312__IGNORE_THE_REST_1zjc_1_7634_42_0.

stderr out:
...
BOINC:: Worker startup. 
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 43200
Hbond tripped.
====>
called boinc_finish
...


AdeB
22) Message boards : Number crunching : Report long-running models here (Message 59195)
Posted 31 Jan 2009 by AdeB
Post:
Another one that was stopped after preferred runtime + 4hrs.

CPU time 57846.37

stderr out:
...
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 43200
Starting work on structure: _00002
====>
called boinc_finish
...


AdeB
23) Message boards : Number crunching : Report long-running models here (Message 59155)
Posted 29 Jan 2009 by AdeB
Post:
Here is one where the watchdog stepped in. It was stopped after 16 hrs with a 12 hr preference:

task: 224203414

CPU time 57633.47

stderr out:
...
Starting watchdog...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
====>
called boinc_finish
...
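
For anyone checking their own tasks, the numbers above can be sanity-checked with a quick sketch. It assumes the watchdog limit is simply the runtime preference plus 4 hours, which is what these reports suggest rather than something confirmed from the Rosetta code. The function names are made up for illustration.

```python
# Assumption: the 4-hour grace period is inferred from the reports in this
# thread, not taken from the Rosetta/BOINC source code.
WATCHDOG_GRACE_SECONDS = 4 * 3600


def watchdog_limit(cpu_run_time_pref: int) -> int:
    """CPU seconds after which the watchdog is assumed to end a task."""
    return cpu_run_time_pref + WATCHDOG_GRACE_SECONDS


def overrun(cpu_time: float, cpu_run_time_pref: int) -> float:
    """CPU seconds beyond the assumed limit (negative if the task stayed under it)."""
    return cpu_time - watchdog_limit(cpu_run_time_pref)


if __name__ == "__main__":
    pref = 43200  # "# cpu_run_time_pref: 43200" from stderr, i.e. 12 hours
    print(watchdog_limit(pref) / 3600)         # assumed limit in hours
    print(round(overrun(57633.47, pref), 2))   # seconds past the assumed limit
```

With the reported CPU time of 57633.47 s, the task ended about half a minute past the 16-hour mark, which fits the "preference + 4 hrs" pattern.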


AdeB
24) Message boards : Number crunching : Problems with Minirosetta v1.54 (Message 59153)
Posted 29 Jan 2009 by AdeB
Post:
This task was aborted after my preferred runtime + 4 hours. It was working on the 3rd model.
stderr out:
...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
====>
called boinc_finish


AdeB
25) Message boards : Number crunching : Report long-running models here (Message 58661)
Posted 7 Jan 2009 by AdeB
Post:
I aborted lr5_score12_rlbd_2fls_IGNORE_THE_REST_DECOY_5559_1293_1 after running for more than 30 hours.

AdeB
26) Message boards : Number crunching : Report long-running models here (Message 58543)
Posted 5 Jan 2009 by AdeB
Post:
AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it...

After another crash the task has been aborted.
27) Message boards : Number crunching : Report long-running models here (Message 58508)
Posted 4 Jan 2009 by AdeB
Post:
long-running models:
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_134659_0 took more than 3x my preferred time (which is 12 hours)
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_188618_1 is still running. I am the second one to try this workunit; the first attempt ended in an error because there were too many restarts.
Yesterday I saw that the CPU time was over 13 hours, and when I tried to look at the graphics it crashed. Today (after crunching for some other projects) it restarted at 6 hours. This time the graphics worked fine, but it took 20 minutes to go from 'model 1 step 203980' to 'model 1 step 203991'.
So, what to do? How many steps are there in a model? Should I let it run because it is almost finished, or abort it because there is no way I can finish this model?

AdeB
28) Message boards : Number crunching : Rosetta adds 100,000th host! (Message 58207)
Posted 28 Dec 2008 by AdeB
Post:
Hosts and users are going up and teraflops are going down as of Dec 22. I'd like to see a way on the home page to show only hosts active within the last 30 or 60 days; it's hard to see where we stand as far as active hosts and users.


Check this: users and hosts. Both numbers are dropping.
29) Message boards : Number crunching : Expired deadline (Message 57925)
Posted 16 Dec 2008 by AdeB
Post:
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were already too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid to get sent to your computer, as the other 2 systems did not reply for whatever reason, which does not count as an error since no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.


I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks:
max # of error/total/success tasks	1, 2, 1




Yeah, I see the human logic and the computer logic do not match. BOINC ticket 276 explains things pretty well. Surprised they haven't fixed this bug. Must be super low priority.


Looks like someone stepped in and granted credit for the task. I hope it was also possible to save the results, because that's what it's all about.
30) Message boards : Number crunching : Expired deadline (Message 57899)
Posted 15 Dec 2008 by AdeB
Post:
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were already too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB



I think it was still valid to get sent to your computer, as the other 2 systems did not reply for whatever reason, which does not count as an error since no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes.


I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks:
max # of error/total/success tasks	1, 2, 1
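
To make the point concrete, here is a small sketch of the check I would expect before a task is reissued. It is only an illustration of the reasoning above, not the BOINC server's actual code; the class and function names are hypothetical, and the limits are the 1, 2, 1 values shown for this workunit.

```python
from dataclasses import dataclass


@dataclass
class WorkunitLimits:
    """Hypothetical container for the 'max # of error/total/success tasks' values."""
    max_error: int
    max_total: int
    max_success: int


def may_issue_another(limits: WorkunitLimits, issued_so_far: int) -> bool:
    """Allow a new task only while the total issued stays within max_total."""
    return issued_so_far < limits.max_total


if __name__ == "__main__":
    limits = WorkunitLimits(max_error=1, max_total=2, max_success=1)
    # Two tasks had already gone out (both unanswered) before mine was sent.
    print(may_issue_another(limits, issued_so_far=2))
```

Under this check the third copy would never have been sent, so no CPU time would have been spent on a result that cannot validate.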


31) Message boards : Number crunching : Expired deadline (Message 57887)
Posted 15 Dec 2008 by AdeB
Post:
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified


I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were already too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate.
Bad luck for me, and a complete waste of CPU-time.

AdeB
32) Message boards : Number crunching : Minirosetta v1.45 bug thread (Message 57702)
Posted 8 Dec 2008 by AdeB
Post:
ERROR: Illegal value for integer option -run:jran specified:

in workunit 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_5476_258_1

AdeB
33) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57282)
Posted 27 Nov 2008 by AdeB
Post:
for the team to know what is going on, please post your affected work units links in your next message.


This is going to be a tedious task, as the WorkUnits (most of them) complete normally after the deadlock is solved.
And after BOINC has crashed, I have no way of telling which WorkUnit may have caused it, since I'm looking at up to 8 WorkUnits per Host, all of which restart normally when BOINC is relaunched.

For now I'm afraid I'm best off with just solving the deadlocks, had to do that ~8 times today already.

(the only real solution I'd see is to run BOINC in debug mode to get behind it crashing or the MiniRosetta Client failing, which I'm very hesitant to do on 24 active production Systems running 24/7 at full speed - sounds like loads of work :p )

Anyway, for now I haven't seen any such behaviour on my 32bit Win32 Systems so far, only my Linux Systems seem randomly affected.

-- edit --

Oh, forgot :
How does Rosetta react to undervolting of CPUs ?

Most of my Systems run with reduced Vcore tested stable with Prime95, given a small safety buffer and have 100% validation on other Projects (Einstein, MalariaControl, SETI, LHC).

I'm very careful before I blame anything on a Project Client when I'm not running hardware 100% to its specifications.


FalconFly, I noticed that you are crunching for LHC@home as well.
It might be that LHC@home is causing your crashes. I've had some crashes too this week. Next time it happens, check your boinc.log file: the last message there, before SIGSEGV and the stack trace, is probably: [lhcathome] Scheduler request
A few weeks ago this was also mentioned by several people in the LHC@home message boards.
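
If digging through the log by hand is too tedious, something along these lines could pull out the last message before the crash. It is a rough sketch: the helper name is made up, it just looks for the text "SIGSEGV" anywhere in a line, and the sample lines are placeholders rather than real boinc.log output.

```python
from typing import Iterable, Optional


def last_line_before(lines: Iterable[str], marker: str = "SIGSEGV") -> Optional[str]:
    """Return the last non-empty line seen before the first line containing marker.

    Returns None if the marker never appears or nothing precedes it.
    """
    previous = None
    for raw in lines:
        line = raw.strip()
        if marker in line:
            return previous
        if line:
            previous = line
    return None


if __name__ == "__main__":
    # Placeholder lines for illustration only; a real boinc.log looks different.
    sample = [
        "some earlier message",
        "[lhcathome] Scheduler request",
        "",
        "SIGSEGV",
        "stack trace ...",
    ]
    print(last_line_before(sample))
```

For a real log, open the file and pass the file object in, e.g. `with open("boinc.log") as f: print(last_line_before(f))`, with the path adjusted to wherever your client writes its log.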

AdeB
34) Message boards : Number crunching : Rosetta Mini with new score terms bug thread (Message 56657)
Posted 3 Nov 2008 by AdeB
Post:
No problems here: linux, AMD Athlon XP
35) Message boards : Number crunching : Problems with Rosetta version 5.98 (Message 56039)
Posted 26 Sep 2008 by AdeB
Post:
This workunit is valid but stderr out is enormous:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 43200
# random seed: 2792818
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
.
/// This line is repeated 516 times ///
.
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
======================================================
DONE :: 1 starting structures 43239.7 cpu seconds
This process generated 45 decoys from 45 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>
36) Message boards : Number crunching : Minirosetta v1.32 bug thread (Message 55207)
Posted 21 Aug 2008 by AdeB
Post:
Compute error in this workunit.

stderr out:
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# cpu_run_time_pref: 43200

ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763
called boinc_finish

</stderr_txt>
]]>
37) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 51008)
Posted 26 Jan 2008 by AdeB
Post:
resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.


Oh no, you did get 20. You should have got at least an extra 100 for all the effort you put into it.
38) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 50995)
Posted 26 Jan 2008 by AdeB
Post:
Sorry for the triple-post. I had some problems with my connection.
39) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 50994)
Posted 26 Jan 2008 by AdeB
Post:
Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And I received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.
40) Message boards : Number crunching : Problems with Rosetta version 5.93 (Message 50993)
Posted 26 Jan 2008 by AdeB
Post:
Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And I received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.





©2024 University of Washington
https://www.bakerlab.org