Message boards : Number crunching : minirosetta 2.14
Author | Message |
---|---|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable, so I cannot agree with your comment about any CPU time being wasted. I would simply say that, ideally, the runtime per model of these tasks would be more consistent. Rest assured that for every long-running model, more credit is granted for each normal one. This is how the credit system works: your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed. Keep crunching. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"Would hate to kill a job so close to completion," That would be my advice as well. Sounds like a long-running model, like the one billy ewell 1931 just reported. But if it is still using CPU time, then it should take care of itself without any tinkering. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed. Sometimes I understand these things and sometimes not, so please pull me up if I'm getting this wrong, but... Some WUs seem to be of the type where 500 steps are attempted for a decoymodel, and if nothing useful seems to be coming up it gets ended very quickly and moves on to the next. Then the next, then the next, etc. Eventually a decoymodel comes up that looks promising, and rather than stopping at 500 it seems to go on (and on) until the watchdog cuts in. So, perversely, you either have (say) 1000 models taking (say) 2h 50m or 1001 taking 7 hours. It's when the task overruns that it's working on the most valuable stuff, which then gets a low credit award. Alternatively, the task is getting into a loop and is going nowhere very slowly indeed, as we've seen recently. An example of a long-running WU for me is simIF2_1f0s_1PBV_ProteinInterfaceDesign_28Jun2010_21501_4_0. These simIF tasks seem to be particularly susceptible. |
billy ewell 1931 Send message Joined: 30 Mar 07 Posts: 14 Credit: 6,981,439 RAC: 3,135 |
billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed. MS: Thanks for the reply; I principally understand but wish to emphasize as I stated previously that "I am not a credits chaser" but a dedicated supporter of scientific research and extremely happy to do so. All that having been said, the work unit recently completed and reported below highlighted my concerns brilliantly. I last checked this work unit at about 6.6 hours of completion time and the last checkpoint at that time was 00:26:13. I started and stopped this unit at least 15 times without effect. My main concern is that my 21 CPUs are being used efficiently and your answer has reassured me. Again, thanks. Bill Task ID 349219039 Name fc_A_noSmallMvs_fc6x_2hwx_ProteinInterfaceDesign_20Jun2010_21458_97_0 Workunit 318903612 Created 29 Jun 2010 8:01:24 UTC Sent 29 Jun 2010 8:03:08 UTC Received 30 Jun 2010 18:46:19 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 1273687 Report deadline 9 Jul 2010 8:03:08 UTC CPU time 28848.84 stderr out <core_client_version>6.10.56</core_client_version> <![CDATA[ <stderr_txt> [2010- 6-30 3:17: 1:] :: BOINC:: Initializing ... ok. [2010- 6-30 3:17: 1:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. 
BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 14400 BOINC:: CPU time: 28846.8s, 14400s + 14400s[2010- 6-30 11:27:54:] :: BOINC InternalDecoyCount: 64 ====================================================== DONE :: 2 starting structures 28846.8 cpu seconds This process generated 64 decoys from 64 attempts ====================================================== called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 184.044030472935 Granted credit 16.2806625211898 application version 2.14 |
Jochen Send message Joined: 6 Jun 06 Posts: 133 Credit: 3,847,433 RAC: 0 |
"...I started and stopped this unit at least 15 times..." AFAIR this is not a good idea as long as the task is not kept in memory. If you suspend a task and keep it in memory, suspending has no effect on it at all. If you suspend a task and don't keep it in memory, you will lose any work done since the last checkpoint. Again, I would recommend just leaving the tasks running as long as they create CPU load. This is probably the best you can do, and again AFAIR it is the best way to ensure that no CPU time is wasted. Joe, the jinx |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This failed after 11min. td-only-2-ARF1_4-15_21413_114_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=319298520 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt Starting work on structure: _00001 # cpu_run_time_pref: 14400 ERROR: rsd_type_list.size() ERROR:: Exit from: src/core/fragment/Frame.cc line: 62 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
This one completed successfully but stopped well short of its 6-hour run. ab_07_06_T0581_21_136_homs_h004__SAVE_ALL_OUT.IGNORE_THE_REST_10_11_21556_1 ERROR: expected to read 18 libraries from Dun02, but read 0 ERROR:: Exit from: ..\..\src\core\scoring\dunbrack\RotamerLibrary.cc line: 865 |
Speedy Send message Joined: 25 Sep 05 Posts: 163 Credit: 808,337 RAC: 1 |
Result ID 356014820 errored out after 28 minutes; default runtime is 10 hours. Error output: ERROR: rsd_type_list.size() ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> Have a crunching good day!! |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Someone might want to have a look at this one. I got a validate error, and none of the other copies have been returned. I can't see a problem with it. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=321742035 ab_07_08_T0606_27_169_h001_disulf_SAVE_ALL_OUT.IGNORE_THE_REST_06_07_21584_248_2 Starting work on structure: _00001 # cpu_run_time_pref: 14400 Starting work on structure: _00002 Starting work on structure: _00003 Starting work on structure: _00004 Starting work on structure: _00005 Starting work on structure: _00006 Starting work on structure: _00007 Starting work on structure: _00008 Starting work on structure: _00009 Starting work on structure: _00010 Starting work on structure: _00011 Starting work on structure: _00012 Starting work on structure: _00013 Starting work on structure: _00014 Starting work on structure: _00015 Starting work on structure: _00016 Starting work on structure: _00017 ====================================================== DONE :: 1 starting structures 13766.2 cpu seconds This process generated 17 decoys from 17 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
P.P.L. it looks like you received the third issue of that specific work unit. The first two never reported back. So the third resulted in too many tasks, as two is the configured maximum. So, yes, the BOINC server should not send out such results that are doomed to failure in the first place. It is a bug that I believe was recently fixed, so the next time the Project Team upgrades the servers, it shouldn't happen anymore. It only happens when some very rare circumstances combine. That was part of what made it hard for Berkeley to track it down. So, your machine completed it ok, i.e. no computation errors. But the validator discovered three reports for something with a maximum of two and hence produces the validation error. Rosetta Moderator: Mod.Sense |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi Mod.Sense. Yes, a bug indeed. I received credit for it anyway, so that's O.K. :) |
12kpp Send message Joined: 4 Jul 09 Posts: 2 Credit: 256,800 RAC: 0 |
|
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This one errored after 2 min. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326440378 cs-only-2-DinI_3-14_20161_242_0 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> ( Left out the bits in between ) Starting work on structure: _00001 # cpu_run_time_pref: 14400 ERROR: rsd_type_list.size() ERROR:: Exit from: src/core/fragment/Frame.cc line: 62 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Reposting Speedy's comments for the Project Team to investigate. Speedy's tasks report: ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178 Tasks 357381146, 357381134 & 357381125 all start with lrm_jorj_combined_tlrm_jorj_combined_torsion. All tasks end with Compute error. I'm thinking lrm_jorj_combined_tlrm_jorj_combined_torsion is a bad batch of tasks. Rosetta Moderator: Mod.Sense |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Another one failed after 2 sec. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326517114 lrm_jorj_combined_torsion_it06_run01_A_rlbd_1mgw__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_221_0 <core_client_version>6.10.17</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> ( left out bits in middle ) Setting database description ... Setting up checkpointing ... Setting up graphics native ... ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
And another one, this went for 12 sec. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=326653711 lrm_jorj_combined_torsion_it06_run01_A_rlbd_1o4w__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_430_1 <core_client_version>6.10.17</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> ( left out bits in middle ) Setting database description ... Setting up checkpointing ... Setting up graphics native ... ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> |
Speedy Send message Joined: 25 Sep 05 Posts: 163 Credit: 808,337 RAC: 1 |
Task 357811058 <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> ( left out lines in middle ) Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr13_2iiy.fix.out.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> Validate state Invalid Ran for 7 seconds. Have a crunching good day!! |
[DPC]NGS~StugIII Send message Joined: 8 Mar 06 Posts: 2 Credit: 58,616 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=355266280 This one failed but it took 59251 seconds. Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0057DC0B write attempt to address 0x00D0954C |
duftkerze Send message Joined: 7 Jul 06 Posts: 2 Credit: 692,624 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=357781244 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Moved duftkerze's post here. Their result has the "Unable to open weights" error that is reported in the prior posts here. Rosetta Moderator: Mod.Sense |
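
The credit averaging that Mod.Sense describes earlier in the thread can be sketched roughly as follows. This is only an illustration of the idea, not the project's actual credit code: the `CreditPool` class, its field names, and the choice to fold a host's own claim into the average before computing its grant are all assumptions made for the example. The numbers are invented as well.

```python
from dataclasses import dataclass


@dataclass
class CreditPool:
    """Running totals for one batch of work units (hypothetical structure)."""

    total_claimed: float = 0.0
    total_models: int = 0

    def average_per_model(self) -> float:
        """Average credit claimed per model over all results reported so far."""
        return self.total_claimed / self.total_models

    def report(self, claimed_credit: float, models: int) -> float:
        """Record one result and return the credit granted for it.

        Assumption: the reporting host's own claim is added to the totals
        first, and the grant is models * (average claimed credit per model).
        """
        self.total_claimed += claimed_credit
        self.total_models += models
        return models * self.average_per_model()


pool = CreditPool()

# A typical task: 100 models for a claim of 100 credits.
normal = pool.report(claimed_credit=100.0, models=100)
avg_before = pool.average_per_model()

# A task with long-running models: same CPU time and claim, fewer models.
slow = pool.report(claimed_credit=100.0, models=60)
avg_after = pool.average_per_model()

# The next typical task benefits from the raised average.
next_normal = pool.report(claimed_credit=100.0, models=100)

print(f"average per model before: {avg_before}, after: {avg_after}")
print(f"slow task granted {slow} of 100.0 claimed")
print(f"next normal task granted {next_normal:.2f} of 100.0 claimed")
```

Under these assumptions the slow task is granted visibly less than it claimed, while every later task in the batch is granted a little more than it claimed, which is the effect described above: the shortfall is easy to spot on one result, and the gain is spread thinly over many.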