minirosetta 2.14

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66714 - Posted: 30 Jun 2010, 14:40:41 UTC

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.
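
A rough way to picture the averaging described above. This is only a sketch: it assumes each result is granted the running average of claimed credit per model multiplied by its own model count, which is an interpretation of the explanation rather than the project's actual server code. The function name and all numbers are invented for illustration.

```python
def granted_credits(reports):
    """Return the credit granted to each result, in reporting order.

    reports: list of (claimed_credit, models_completed) tuples.
    Assumption (not confirmed by the project): a result is granted the
    running average of claimed credit per model, times its own model count.
    """
    total_claimed = 0.0
    total_models = 0
    granted = []
    for claimed, models in reports:
        total_claimed += claimed
        total_models += models
        granted.append(total_claimed / total_models * models)
    return granted


# Nine ordinary results (100 credits claimed for 100 models each), then one
# result whose single long-running model claimed 50 credits, then one more
# ordinary result. All numbers are invented for illustration.
reports = [(100, 100)] * 9 + [(50, 1), (100, 100)]
granted = granted_credits(reports)
print(round(granted[9], 2), round(granted[10], 2))
```

Under these assumptions the long-running result is granted far less than the 50 credits it claimed, while the ordinary result reported after it is granted a little more than its claimed 100. That is the "fraction more" that is hard to notice.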

Keep crunching.
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66715 - Posted: 30 Jun 2010, 14:42:01 UTC - in response to Message 66713.  

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.

Does this task still create CPU load? If yes, leave it running. If not, try restarting the BOINC Manager (and make sure the client processes are stopped as well). If it still doesn't create CPU load after restarting the manager, you should abort it.
cu

Joe



That would be my advice as well. Sounds like a long-running model, like billy ewell 1931 just reported as well. But if it is still using CPU time, then it should take care of itself without any tinkering.
Rosetta Moderator: Mod.Sense
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 66719 - Posted: 30 Jun 2010, 16:50:27 UTC - in response to Message 66714.  

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.

Sometimes I understand these things and sometimes not, so please pull me up if I'm getting this wrong, but...

Some WUs seem to be of the type where 500 steps are attempted for a decoy/model, and if nothing useful seems to be coming up it gets ended very quickly and moves on to the next. Then the next, then the next, etc.

Eventually a decoy/model comes up that looks promising and rather than stopping at 500 it seems to go on (and on) until the watchdog cuts in.

So, perversely, you either have (say) 1000 models taking (say) 2h 50m or 1001 taking 7 hours. It's when the task over-runs that it's working on the most valuable stuff, which then gets a low credit award.
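
To put numbers on that comparison, here is a small sketch using the figures above. It assumes the first 1000 models take the same 2h 50m in both cases, so the whole difference is spent on the one extra model. The variable names are illustrative.

```python
# Figures taken from the example in the post; everything else is derived.
normal_seconds = 2 * 3600 + 50 * 60   # 2h 50m for 1000 models
overrun_seconds = 7 * 3600            # 7h for 1001 models

# Average CPU time for an ordinary model that is dropped quickly.
per_model = normal_seconds / 1000

# Assumption: the first 1000 models cost the same in both runs, so the
# remainder is the time spent on the single long-running model.
extra = overrun_seconds - normal_seconds

# Fraction of the whole 7-hour run spent on that one model.
share = extra / overrun_seconds

print(per_model, extra, round(share, 3))
```

So one model would account for roughly 60% of the run, while the model count, which appears to drive the credit, goes up by only 0.1%.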

Alternatively, the task is getting into a loop and is going nowhere very slowly indeed, as we've seen recently.

An example of a long running WU for me is simIF2_1f0s_1PBV_ProteinInterfaceDesign_28Jun2010_21501_4_0

These simIF tasks seem to be particularly susceptible.
billy ewell 1931

Joined: 30 Mar 07
Posts: 14
Credit: 6,981,439
RAC: 3,135
Message 66720 - Posted: 30 Jun 2010, 19:24:26 UTC - in response to Message 66714.  

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.

Keep crunching.


MS: Thanks for the reply. I understand in principle, but wish to emphasize, as I stated previously, that "I am not a credits chaser" but a dedicated supporter of scientific research and extremely happy to be one. All that having been said, the work unit recently completed and reported below highlighted my concerns brilliantly. I last checked this work unit at about 6.6 hours of completion time, and the last checkpoint at that time was 00:26:13. I started and stopped this unit at least 15 times without effect. My main concern is that my 21 CPUs are being used efficiently, and your answer has reassured me. Again, thanks.

Bill

Task ID 349219039
Name fc_A_noSmallMvs_fc6x_2hwx_ProteinInterfaceDesign_20Jun2010_21458_97_0
Workunit 318903612
Created 29 Jun 2010 8:01:24 UTC
Sent 29 Jun 2010 8:03:08 UTC
Received 30 Jun 2010 18:46:19 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1273687
Report deadline 9 Jul 2010 8:03:08 UTC
CPU time 28848.84
stderr out <core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6-30 3:17: 1:] :: BOINC:: Initializing ... ok.
[2010- 6-30 3:17: 1:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
BOINC:: CPU time: 28846.8s, 14400s + 14400s[2010- 6-30 11:27:54:] :: BOINC
InternalDecoyCount: 64
======================================================
DONE :: 2 starting structures 28846.8 cpu seconds
This process generated 64 decoys from 64 attempts
======================================================
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 184.044030472935
Granted credit 16.2806625211898
application version 2.14
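
For anyone comparing several of these reports, the key figures can be pulled out of the stderr text mechanically. The sketch below is an illustration only: the regular expressions are written against the three log lines quoted above and are an assumption about their format, not a documented interface of minirosetta or BOINC. The function name is made up.

```python
import re


def summarize_stderr(text):
    """Extract runtime preference, CPU seconds and decoy count from stderr text.

    Returns None if any of the three expected lines is missing.
    """
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", text)
    done = re.search(r"DONE ::\s*\d+ starting structures\s+([\d.]+) cpu seconds", text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    if not (pref and done and decoys):
        return None
    pref_s = int(pref.group(1))
    cpu_s = float(done.group(1))
    count = int(decoys.group(1))
    return {
        "pref_seconds": pref_s,
        "cpu_seconds": cpu_s,
        "decoys": count,
        "overrun_seconds": cpu_s - pref_s,   # time spent beyond the preference
        "seconds_per_decoy": cpu_s / count,  # average only, not per-decoy truth
    }


# The three relevant lines from the report above.
sample = """# cpu_run_time_pref: 14400
DONE :: 2 starting structures 28846.8 cpu seconds
This process generated 64 decoys from 64 attempts
"""
summary = summarize_stderr(sample)
print(summary)
```

For this task the overrun comes out at about 14,447 seconds, just past four hours beyond the 14,400-second preference. That appears to match the "14400s + 14400s" in the watchdog line, though the meaning of that line is an inference from the log rather than something stated in the thread.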
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66721 - Posted: 30 Jun 2010, 20:08:18 UTC - in response to Message 66720.  

...I started and stopped this unit at least 15 times...

AFAIR this is not a good idea, as long as the task is not kept in memory.
If you suspend a task and keep it in memory, it does not have any effect at all. If you suspend a task and don't keep it in memory, you will lose any work done since the last checkpoint.

Again, I would recommend just leaving the tasks running as long as they create CPU load. This is probably the best you can do. And again, AFAIR this is the best way to ensure that no CPU time is wasted.
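
The rule above can be written down as a tiny sketch. It assumes the simple model described in this post: a task kept in memory loses nothing, and a task removed from memory restarts from its last checkpoint. The function names are hypothetical, and the figures are the ones Bill quoted (about 6.6 hours of CPU time, last checkpoint at 00:26:13).

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_seconds_lost(cpu_seconds_now, checkpoint_cpu_seconds, kept_in_memory):
    """CPU time discarded by one suspend under the simple model above."""
    if kept_in_memory:
        return 0.0
    # Work after the last checkpoint has to be redone.
    return max(0.0, cpu_seconds_now - checkpoint_cpu_seconds)


checkpoint = hms_to_seconds("00:26:13")  # last checkpoint Bill reported
now = 6.6 * 3600                         # roughly where he checked the task
lost = cpu_seconds_lost(now, checkpoint, kept_in_memory=False)
print(checkpoint, round(lost))
```

This is a worst-case illustration, not a claim about what actually happened to that task. Whether a suspended task really leaves memory depends on the client settings.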

Joe, the jinx
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 66736 - Posted: 1 Jul 2010, 23:35:53 UTC

This failed after 11min.

td-only-2-ARF1_4-15_21413_114_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=319298520

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

Starting work on structure: _00001
# cpu_run_time_pref: 14400

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 66828 - Posted: 9 Jul 2010, 9:04:17 UTC

This one completed successfully but stopped well short of its 6 hour run.
ab_07_06_T0581_21_136_homs_h004__SAVE_ALL_OUT.IGNORE_THE_REST_10_11_21556_1


ERROR: expected to read 18 libraries from Dun02, but read 0
ERROR:: Exit from: ..\..\src\core\scoring\dunbrack\RotamerLibrary.cc line: 865

Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 808,337
RAC: 1
Message 67060 - Posted: 1 Aug 2010, 21:20:22 UTC

Result ID 356014820 errored out after 28 minutes; the default runtime is 10 hours.
Error output:
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>
Have a crunching good day!!
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 67091 - Posted: 4 Aug 2010, 2:38:36 UTC

Hi.

Someone might want to have a look at this one. I got a validate error, and none of the other copies have been returned. I can't see a problem with it.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=321742035

ab_07_08_T0606_27_169_h001_disulf_SAVE_ALL_OUT.IGNORE_THE_REST_06_07_21584_248_2


Starting work on structure: _00001
# cpu_run_time_pref: 14400
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
======================================================
DONE :: 1 starting structures 13766.2 cpu seconds
This process generated 17 decoys from 17 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 67098 - Posted: 4 Aug 2010, 15:25:27 UTC

P.P.L. it looks like you received the third issue of that specific work unit. The first two never reported back. So the third resulted in too many tasks, as two is the configured maximum. So, yes, the BOINC server should not send out such results that are doomed to failure in the first place. It is a bug that I believe was recently fixed, so the next time the Project Team upgrades the servers, it shouldn't happen anymore. It only happens when some very rare circumstances combine. That was part of what made it hard for Berkeley to track it down.

So, your machine completed it OK, i.e. no computation errors. But the validator discovered three reports for something with a maximum of two and hence produced the validation error.
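
As a toy model of that outcome (a sketch only, not the BOINC server's actual validator logic, which is not shown in this thread): a task can compute without error and still fail validation if its work unit has been issued more times than the configured maximum. The function and parameter names are chosen for illustration and are not claimed to match the server configuration.

```python
def validate_outcome(issued_count, max_results=2, computed_ok=True):
    """Classify a returned result under the simplified rule described above.

    issued_count: how many times the work unit has been sent out so far.
    max_results:  hypothetical name for the configured maximum (two here).
    """
    if not computed_ok:
        return "compute error"
    if issued_count > max_results:
        # The result itself is fine, but the work unit is over its limit.
        return "validate error"
    return "valid"


# The case in this thread: third issue of a work unit with a maximum of two.
print(validate_outcome(issued_count=3))
```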
Rosetta Moderator: Mod.Sense
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 67108 - Posted: 5 Aug 2010, 1:12:22 UTC

Hi Mod Sense.

Yes, a bug indeed. I received credit for it anyway, so that's O.K. :)


12kpp

Joined: 4 Jul 09
Posts: 2
Credit: 256,800
RAC: 0
Message 67112 - Posted: 6 Aug 2010, 2:05:16 UTC

Hi!
I have the same error: a validate error.

324097911
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 67113 - Posted: 6 Aug 2010, 3:42:32 UTC

This one errored after 2 min.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326440378

cs-only-2-DinI_3-14_20161_242_0

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>


( Left out the bits in between )


Starting work on structure: _00001
# cpu_run_time_pref: 14400

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 67123 - Posted: 6 Aug 2010, 18:45:46 UTC
Last modified: 6 Aug 2010, 18:51:03 UTC

Reposting speedy's comments for the Project Team to investigate. Speedy's tasks report:

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178


Tasks 357381146, 357381134 and 357381125 all start with lrm_jorj_combined_torsion. All tasks end with a compute error. I'm thinking lrm_jorj_combined_torsion is a bad batch of tasks.

Rosetta Moderator: Mod.Sense
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 67133 - Posted: 7 Aug 2010, 22:00:28 UTC

Another one failed after 2 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326517114

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1mgw__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_221_0

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 67135 - Posted: 8 Aug 2010, 3:25:14 UTC
Last modified: 8 Aug 2010, 3:26:01 UTC

And another one, this went for 12 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326653711


lrm_jorj_combined_torsion_it06_run01_A_rlbd_1o4w__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_430_1

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 808,337
RAC: 1
Message 67137 - Posted: 8 Aug 2010, 5:52:30 UTC

Task 357811058
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

( left out lines in middle )

Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr13_2iiy.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Validate state: Invalid. Ran for 7 seconds.
Have a crunching good day!!
[DPC]NGS~StugIII

Joined: 8 Mar 06
Posts: 2
Credit: 58,616
RAC: 0
Message 67141 - Posted: 8 Aug 2010, 16:22:09 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=355266280

This one failed but it took 59251 seconds.

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0057DC0B write attempt to address 0x00D0954C


duftkerze

Joined: 7 Jul 06
Posts: 2
Credit: 692,624
RAC: 0
Message 67161 - Posted: 11 Aug 2010, 3:23:01 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=357781244
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 67162 - Posted: 11 Aug 2010, 4:03:58 UTC
Last modified: 11 Aug 2010, 4:05:11 UTC

Moved duftkerze's post here. Their result has the "Unable to open weights" error that is reported in the prior posts here.
Rosetta Moderator: Mod.Sense
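
Since several posts in this thread turn out to share the same error, a small grouping script makes that kind of pattern easier to spot. This is a sketch under stated assumptions: it takes the first line starting with "ERROR: " as the signature of a failure, and the function names are made up. The task names and error lines are copied from the posts above, with the long "Unable to open weights" message shortened to its first sentence.

```python
from collections import defaultdict


def first_error_line(stderr_text):
    """Return the first 'ERROR: ...' line, skipping 'ERROR:: Exit from' lines."""
    for line in stderr_text.splitlines():
        if line.startswith("ERROR: "):
            return line.strip()
    return None


def group_by_error(reports):
    """reports: iterable of (task_name, stderr_text). Returns {error: [names]}."""
    groups = defaultdict(list)
    for name, text in reports:
        groups[first_error_line(text)].append(name)
    return dict(groups)


frame_err = "ERROR: rsd_type_list.size()\nERROR:: Exit from: src/core/fragment/Frame.cc line: 62\n"
weights_err = "ERROR: Unable to open weights.\nERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178\n"

sample_reports = [
    ("td-only-2-ARF1_4-15_21413_114_0", frame_err),
    ("cs-only-2-DinI_3-14_20161_242_0", frame_err),
    ("lrm_jorj_combined_torsion_it06_run01_A_rlbd_1mgw__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_221_0", weights_err),
    ("lrm_jorj_combined_torsion_it06_run01_A_rlbd_1o4w__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_430_1", weights_err),
]
groups = group_by_error(sample_reports)
for error, names in groups.items():
    print(error, len(names))
```

Grouped this way, the lrm_jorj_combined_torsion tasks fall under the weights error and the td-only and cs-only tasks under the Frame.cc error, which fits the suspicion that each is a batch-level problem rather than a fault on individual machines.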



©2024 University of Washington
https://www.bakerlab.org