Discussion on increasing the default run time

Message boards : Number crunching : Discussion on increasing the default run time

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94583 - Posted: 16 Apr 2020, 0:51:55 UTC - in response to Message 94581.  

It could happen, certainly. Such events generally are with new protocols, which are revised to prevent such long-running models. If you would link to the WU, we could see more about it. How much credit did it get? This would be a clue as to whether other systems running WUs from the same batch are having similar struggles. Was the WU ended by the watchdog? What was the runtime preference when the WU was run?
Rosetta Moderator: Mod.Sense
ID: 94583
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94584 - Posted: 16 Apr 2020, 1:10:45 UTC - in response to Message 94583.  
Last modified: 16 Apr 2020, 1:31:19 UTC

It could happen, certainly. Such events generally are with new protocols, which are revised to prevent such long-running models. If you would link to the WU, we could see more about it. How much credit did it get? This would be a clue as to whether other systems running WUs from the same batch are having similar struggles. Was the WU ended by the watchdog? What was the runtime preference when the WU was run?
2 completed, another 1 (at least) to come (Edit- it just finished, over 8hrs but less than 12hrs).
Target CPU Runtime- not set (so 8 hrs).

Edit- looks like up to another 5 or 6 yet to start.


rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_09_913288_66_0

      Run time 12 hours 14 min 46 sec
      CPU time 12 hours 9 min 21 sec
Validate state Valid
        Credit 348.05


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe @rb_04_13_21398_21021_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 7 5 4 2 6 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_13_21398_21021_ab_t000__robetta.zip -frag3 rb_04_13_21398_21021_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_13_21398_21021_ab_t000__robetta.200.9mers.index.gz -fragB rb_04_13_21398_21021_ab_t000__robetta.200.6mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3090037
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43761.3s, 14400s + 28800s[2020- 4-15 16:59:26:] :: BOINC 
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE ::     1 starting structures  43761.3 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
16:59:26 (9808): called boinc_finish(0)

</stderr_txt>
]]>





rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_08_913288_47_0

      Run time 12 hours 14 min 46 sec
      CPU time 12 hours 9 min 20 sec
Validate state Valid
        Credit 379.23


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe @rb_04_13_21398_21021_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 7 5 4 2 6 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_13_21398_21021_ab_t000__robetta.zip -frag3 rb_04_13_21398_21021_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_13_21398_21021_ab_t000__robetta.200.8mers.index.gz -fragB rb_04_13_21398_21021_ab_t000__robetta.200.5mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3090470
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43759.8s, 14400s + 28800s[2020- 4-15 16:50:11:] :: BOINC 
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE ::     1 starting structures  43759.8 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
16:50:11 (9784): called boinc_finish(0)

</stderr_txt>
]]>





rb_04_14_20382_21147_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_09_11_913349_93_0

      Run time 9 hours 14 min 30 sec
      CPU time 9 hours 11 min 10 sec
Validate state Valid
        Credit 337.77

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe @rb_04_14_20382_21147_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 3 1 4 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_14_20382_21147_ab_t000__robetta.zip -frag3 rb_04_14_20382_21147_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_14_20382_21147_ab_t000__robetta.200.11mers.index.gz -fragB rb_04_14_20382_21147_ab_t000__robetta.200.9mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3019737
Starting watchdog...
Watchdog active.
======================================================
DONE ::     1 starting structures  33070.7 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
BOINC :: WS_max 9.49916e+08

BOINC :: Watchdog shutting down...
10:06:00 (9848): called boinc_finish(0)

</stderr_txt>
]]>

Grant
Darwin NT
ID: 94584
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94586 - Posted: 16 Apr 2020, 1:29:32 UTC - in response to Message 94584.  


CPU time: 43761.3s, 14400s + 28800s


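That is how you know the watchdog ended the task. The 14,400 seconds is the 4-hour watchdog allowance, which is added to the WU target runtime (28,800 seconds, i.e. 8 hours); the CPU time of 43,761.3 seconds exceeds that 43,200-second total.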
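As a rough illustration of reading that stderr line, here is a minimal Python sketch. It assumes, following the explanation above, that the two numbers after the CPU time are the watchdog allowance and the target runtime. The function name `watchdog_summary` and the regular expression are made up for this example and are not part of Rosetta or BOINC.

```python
import re

# Matches the stderr fragment quoted above, e.g.
# "BOINC:: CPU time: 43761.3s, 14400s + 28800s"
CPU_LINE = re.compile(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s")


def watchdog_summary(stderr_text):
    """Return (cpu_seconds, limit_seconds, exceeded), or None if the line is absent."""
    match = CPU_LINE.search(stderr_text)
    if match is None:
        return None
    cpu = float(match.group(1))
    allowance = int(match.group(2))   # assumed: the 4-hour watchdog allowance
    target = int(match.group(3))      # assumed: the target CPU runtime
    limit = allowance + target
    return cpu, limit, cpu > limit


line = "BOINC:: CPU time: 43761.3s, 14400s + 28800s[2020- 4-15 16:59:26:] :: BOINC"
print(watchdog_summary(line))  # (43761.3, 43200, True)
```

Tasks whose stderr has no such line (like the 9-hour one) return None, which fits the reading that they ended normally at a model boundary rather than being stopped by the watchdog.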

So two were ended by watchdog. All three of them got over 300 points of credit. So that implies the batch has some incredibly tough models.

Similar task names:
rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_09_913288_66_0
rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_08_913288_47_0
rb_04_14_20382_21147_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_09_11_913349_93_0
Rosetta Moderator: Mod.Sense
ID: 94586
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94587 - Posted: 16 Apr 2020, 2:05:03 UTC - in response to Message 94586.  


CPU time: 43761.3s, 14400s + 28800s
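That is how you know the watchdog ended the task. The 14,400 seconds is the 4-hour watchdog allowance, which is added to the WU target runtime (28,800 seconds, i.e. 8 hours); the CPU time of 43,761.3 seconds exceeds that 43,200-second total.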

So two were ended by watchdog. All three of them got over 300 points of credit. So that implies the batch has some incredibly tough models.

Similar task names:
rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_09_913288_66_0
rb_04_13_21398_21021_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_08_913288_47_0
rb_04_14_20382_21147_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_09_11_913349_93_0
I'm waiting to see if
rb_04_12_21173_20924_ab_t000__robetta_cstwt_1.0_IGNORE_THE_REST_07_10_911326_20_1 is similar in nature.






All three of them got over 300 points of credit.
Which, unfortunately, is the same payout as (and sometimes less than) ones that only take 8 hours.
And a lot less than those that run for much shorter periods- eg
Mini_Protein_binds_IL1R_COVID-19_design7_SAVE_ALL_OUT_IGNORE_THE_REST_7ob9cg6j_909102_4_0

    Run time 1 hours 4 min 24 sec
    CPU time 1 hours 2 min 56 sec
Validate state Valid
      Credit 222.80

209 Credits per hour, would (should) mean 2,508 for a 12hr processing time for the same type of Task (and ideally close to that for other Tasks).
Someone mentioned that the "Estimated computation size" for a Task is the same (80,000 GFLOPs) regardless of whether the Target CPU time is 2 hours or 36 hours. If so, it would go a long way to explaining this discrepancy in Credit.
If 8 hours is the default ("Estimated computation size 80,000 GFLOPs"), then a 16hr Target CPU Runtime Task should have an "Estimated computation size" of 160,000 GFLOPs. A 2 hour Target time, only 20,000 GFLOPs etc. Not only would it help settle Credit down, but it would also help with Estimated completion times (unfortunately it's all tied in together).
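The arithmetic in this post can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything the Rosetta scheduler is known to do: the function names are invented here, and the 8-hour default with 80,000 GFLOPs is the figure reported second-hand above. Using the run time of the short task, the rate comes out at roughly 208 Credits per hour, close to the figure quoted above.

```python
def credit_per_hour(credit, hours, minutes, seconds):
    """Credit divided by run time expressed in hours."""
    runtime_hours = hours + minutes / 60 + seconds / 3600
    return credit / runtime_hours


def scaled_estimate_gflops(target_hours, default_hours=8, default_gflops=80_000):
    """Hypothetical: scale the estimated computation size with the target runtime."""
    return default_gflops * target_hours / default_hours


# The Mini_Protein task above: 222.80 Credit in 1 h 4 min 24 s of run time.
rate = credit_per_hour(222.80, 1, 4, 24)
print(f"{rate:.1f} Credits/hour, {rate * 12:.0f} Credits over 12 hours at that rate")

# Proportional estimates for a few Target CPU Runtime settings.
for hours in (2, 8, 16):
    print(hours, "hours ->", scaled_estimate_gflops(hours), "GFLOPs")
```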

Sorry, off topic i know.
Grant
Darwin NT
ID: 94587
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94589 - Posted: 16 Apr 2020, 4:12:57 UTC - in response to Message 94587.  

Believe it or not, the WU is created before it is assigned to a host. Which means the WU itself is not built knowing how long it will be run. In fact, the runtime preference can be changed while the WU is running (and if it changes by more than 4 hours less than the present runtime, Mr. Watchdog licks it up). So, instead, things are just based on the duration correction factor, and settle in over time. Less than perfect, but it is not practical to create WUs on-the-fly, as they are assigned to hosts.
Rosetta Moderator: Mod.Sense
ID: 94589
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94590 - Posted: 16 Apr 2020, 6:07:32 UTC - in response to Message 94589.  

Believe it or not, the WU is created before it is assigned to a host.
Makes sense. You can't assign something, to something that doesn't exist.


Which means the WU itself is not built knowing how long it will be run. In fact, the runtime preference can be changed while the WU is running (and if it changes by more than 4 hours less than the present runtime, Mr. Watchdog licks it up).
Which is why the "Estimated computation size" (wu.rsc_fpops_est) value is supplied by the project- when the Task is allocated to a Host. It is an estimate of how much computation will be required to complete the Task (and is a core value in Credit allocation & Runtime estimates).
While not 100% accurate (as some mathematical operations have higher & lower overheads than others), a Task that runs for 4 hours will require half the computational work of one that runs for 8 hours. A Task that runs for 16 hours will require double the computational work of one that runs for 8 hours. The "Estimated computation size" that is supplied when these Tasks are allocated needs to reflect that for Credit awarded to be more consistent, and for Estimated completion times to get to their true values sooner.


So, instead, things are just based on the duration correction factor, and settle in over time.
So Rosetta has some custom code to continue to use DCF? (as it was deprecated when Credit New was introduced, or is it still floating around in the server code? From memory the DCF was meant to be frozen at 1 (and that's what is showing on my systems)).


Less than perfect, but it is not practical to create WUs on-the-fly, as they are assigned to hosts.
It's not about creating Tasks on the fly, it's about supplying the correct "Estimated computation size" value for a given Task as it is allocated. And the mechanism is there to do so (Tasks processed by Anonymous Platform applications are supplied with a modified "Estimated computation size" so that the Estimate completion time matches up with reality).*

A Host requests work, if it's deemed ok then the Scheduler will allocate the work. As the Scheduler has queried the host for its Target CPU Runtime (necessary to determine if the host is OK to get work or not), it uses that value to modify the "Estimated computation size" that is allocated to the Task.
Credit awarded will become much, much, much more consistent than it is (but it will still have issues as it is using Credit New), and Estimated completion times should settle closer to their actual values much sooner. Issues will still occur when new Applications are released as they have no prior computation time history, but those issues shouldn't be as severe as they tend to be at present.






*the Anonymous Platform wu.rsc_fpops_est value isn't used for Credit calculations, the unmodified value is. But for non-Anonymous Platforms, the wu.rsc_fpops_est allocated to the Task is a significant value in the determination of how much Credit to award.
Grant
Darwin NT
ID: 94590
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94592 - Posted: 16 Apr 2020, 6:38:16 UTC - in response to Message 94583.  

It could happen, certainly. Such events generally are with new protocols, which are revised to prevent such long-running models. If you would link to the WU, we could see more about it. How much credit did it get? This would be a clue as to whether other systems running WUs from the same batch are having similar struggles. Was the WU ended by the watchdog? What was the runtime preference when the WU was run?
Some more results to add to my previous post.
One ran short of the CPU Target time, the others pretty close.


rb_04_14_20382_21147_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_913349_26_0

      Run time 6 hours 6 min 39 sec
      CPU time 6 hours 5 min 55 sec
Validate state Valid
        Credit 311.55


<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe @rb_04_14_20382_21147_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 3 1 4 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_14_20382_21147_ab_t000__robetta.zip -frag3 rb_04_14_20382_21147_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_14_20382_21147_ab_t000__robetta.200.5mers.index.gz -fragB rb_04_14_20382_21147_ab_t000__robetta.200.6mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3023759
Starting watchdog...
Watchdog active.
======================================================
DONE ::     1 starting structures    21955 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
BOINC :: WS_max 1.0745e+09

BOINC :: Watchdog shutting down...
12:14:37 (8096): called boinc_finish(0)

</stderr_txt>
]]>





rb_04_14_21373_21193_ab_t000__robetta_cstwt_5.0_IGNORE_THE_REST_03_13_913372_5_0

      Run time 7 hours 40 min 24 sec
      CPU time 7 hours 36 min 21 sec
Validate state Valid
        Credit 227.59


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe @rb_04_14_21373_21193_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_14_21373_21193_ab_t000__robetta.zip -frag3 rb_04_14_21373_21193_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_14_21373_21193_ab_t000__robetta.200.13mers.index.gz -fragB rb_04_14_21373_21193_ab_t000__robetta.200.3mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3005791
Starting watchdog...
Watchdog active.
======================================================
DONE ::     1 starting structures  27380.9 cpu seconds
This process generated     10 decoys from      10 attempts
======================================================
BOINC :: WS_max 5.0348e+08

BOINC :: Watchdog shutting down...
11:18:42 (4580): called boinc_finish(0)

</stderr_txt>
]]>





rb_04_14_21373_21193_ab_t000__robetta_cstwt_5.0_IGNORE_THE_REST_12_06_913372_6_0

      Run time 7 hours 51 min 15 sec
      CPU time 7 hours 45 min 21 sec
Validate state Valid
        Credit 228.86


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe @rb_04_14_21373_21193_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_14_21373_21193_ab_t000__robetta.zip -frag3 rb_04_14_21373_21193_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_14_21373_21193_ab_t000__robetta.200.6mers.index.gz -fragB rb_04_14_21373_21193_ab_t000__robetta.200.12mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 1000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3006062
Starting watchdog...
Watchdog active.
======================================================
DONE ::     1 starting structures  27921.6 cpu seconds
This process generated     11 decoys from      11 attempts
======================================================
BOINC :: WS_max 5.84385e+08

BOINC :: Watchdog shutting down...
14:49:01 (4196): called boinc_finish(0)

</stderr_txt>
]]>

Grant
Darwin NT
ID: 94592
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94613 - Posted: 16 Apr 2020, 13:48:48 UTC - in response to Message 94590.  

Credit awarded to you depends entirely on the number of completed models you are reporting back. And nothing else. Credit claim of your WU is then dropped into the average and factored in for the reports that follow yours.

Estimated time to completion would be the only improvement.

And again, the time to run is NOT a part of the issued WU, and it can change during the run of a WU. However the runtime used does get reported back with the WU results.
Rosetta Moderator: Mod.Sense
ID: 94613
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94614 - Posted: 16 Apr 2020, 13:57:16 UTC - in response to Message 94592.  


One ran short of the CPU Target time, the others pretty close.


When it took your machine 6 hours and 6 minutes to complete the first model, a reasonable estimate of the time to complete a second would take you well past an 8hr runtime. So the WU run is ended. I think of it as the task coming up for air at the end of a model. Taking its head out of the pool and checking the clock. There is no way to report a model that is only partially done.

So any time the estimate goes more than something like 5 minutes past the preference, the WU is ended. So, I think if a task came up for air at 7hrs 50 min. and was cruising at 15 minutes per model, it might do one more. But, keep in mind, no one has ever explored that next model before. No idea what you might find there. And the runtime varies depending on what you do find there. So how much "ignore the rest" is that next model able to ignore? No one knows until someone explores it.
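The "coming up for air" check can be sketched roughly as follows in Python. This is a guess at the shape of the logic based only on the description above: the function name is invented, the 300-second tolerance is the "something like 5 minutes" figure, and how Rosetta actually estimates the next model's duration is not stated in this thread.

```python
def should_start_next_model(elapsed_s, est_next_model_s, target_s, tolerance_s=300):
    """At a model boundary, decide whether one more model fits the runtime preference.

    elapsed_s        -- CPU seconds used so far
    est_next_model_s -- estimated CPU seconds for the next model (assumed to be known)
    target_s         -- target CPU runtime preference in seconds
    tolerance_s      -- how far past the preference the estimate may go (assumption)
    """
    projected_finish = elapsed_s + est_next_model_s
    return projected_finish <= target_s + tolerance_s


EIGHT_HOURS = 8 * 3600

# First model took 6 h 6 min; a second one like it would run far past 8 hours.
print(should_start_next_model(6 * 3600 + 6 * 60, 6 * 3600 + 6 * 60, EIGHT_HOURS))  # False

# At 7 h 50 min with 15 minutes per model, one more model just fits.
print(should_start_next_model(7 * 3600 + 50 * 60, 15 * 60, EIGHT_HOURS))  # True
```

Because the check only runs between models, a single very long model cannot be cut short this way, which is where the watchdog comes in.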
Rosetta Moderator: Mod.Sense
ID: 94614
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94643 - Posted: 16 Apr 2020, 23:56:57 UTC - in response to Message 94613.  

Credit awarded to you depends entirely on the number of completed models you are reporting back. And nothing else.
Sorry, but i can't see any indication of that occurring. Tasks with 1 decoy, 600 decoys or 50 decoys can all get the same Credit, or one more than the others, or the others more than one, in all possible combinations (and the Task i referenced in the post that started this discussion backs that up).



And again, the time to run is NOT a part of the issued WU, and it can change during the run of a WU. However the runtime used does get reported back with the WU results.
And again- at no time anywhere have i said that Runtime is related to the Tasks being issued.
Not once, anywhere.

All i have pointed out, is that the "Estimated computation size" is used to determine Credit, the fact that a fixed value is being used for different Runtime Tasks explains the extreme variability in Credit awarded (and also some of the issues with Estimated completion times settling down as they are tied together).
Unless Rosetta have modified the code in their implementation of BOINC that is what is happening- wu.fpops_est is a major factor in determining Credit. And the variability in Credit between Tasks that run for 2 hrs v those that run for 8 hrs is easily explained by that fact.


From the section on Credit New
For each job, the project supplies

an estimate of the FLOPs used by a job (wu.fpops_est)
a limit on FLOPs, after which the job will be aborted (wu.fpops_bound).
Previously, inaccuracy of fpops_est caused problems. The new system still uses fpops_est, but its primary purpose is now to indicate the relative sizes of jobs.

A 2 hour task is relatively a quarter the size of an 8 hour Task. A 16 hour Task is relatively double the size of an 8 hour Task. Having the same value for Tasks that run for different times breaks the underlying purpose of the wu.fpops_est value. And it is a significant value used in determining Credit.

That's all i've tried to point out.
Grant
Darwin NT
ID: 94643
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94678 - Posted: 17 Apr 2020, 13:43:39 UTC - in response to Message 94643.  

In my way of looking at it, using the runtime preference to compute a FLOPS estimate is incorporating the runtime preference into each WU. As you point out, that is not what is being done. So I don't follow why you assert that an estimate, which is the same for all WUs, is a factor in determining credit. The reason credit for 1 decoy vs 600 could (you are exaggerating) be the same is that some WUs are much harder to compute than others. Because the hosts before you all found they were hard to compute (more CPU seconds per model), they are granted high credit per decoy. Because the hosts before a report of 600 decoys completed all found they only take 5 minutes per decoy, the credit per decoy is lower. All regardless of runtime preference and FLOPS estimates, neither of which reflect actual work done. So, I agree, a dynamic FLOPS estimate aligned with runtime preference, incorporated into each WU as it is assigned, would give you a beautiful estimated time to completion. But it would slow the assignment of work, and not impact granted credit.
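To make the per-model idea concrete, here is a toy Python sketch of a running average of claimed credit per model within one batch. It is purely illustrative: the class name, the handling of the first report, and the way a "claim" is represented are all assumptions, and the real Rosetta credit code is not shown anywhere in this thread.

```python
class BatchCreditAverage:
    """Toy running average of claimed credit per model for one batch of WUs."""

    def __init__(self):
        self.total_claimed = 0.0
        self.total_models = 0

    def grant(self, models, claimed_credit):
        """Grant credit from earlier reports, then fold this claim into the average."""
        if self.total_models == 0:
            granted = claimed_credit  # assumption: the first report receives its own claim
        else:
            granted = models * self.total_claimed / self.total_models
        self.total_claimed += claimed_credit
        self.total_models += models
        return granted


# Hard batch: earlier hosts needed a whole run for a single model.
hard = BatchCreditAverage()
hard.grant(1, 300.0)
print(hard.grant(1, 350.0))    # 300.0 for one decoy

# Easy batch: earlier hosts produced 600 models in the same time.
easy = BatchCreditAverage()
easy.grant(600, 300.0)
print(easy.grant(600, 280.0))  # 300.0 for 600 decoys
```

Under these assumptions one decoy from a hard batch and 600 decoys from an easy batch can be granted the same credit, which is the situation being debated here.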
Rosetta Moderator: Mod.Sense
ID: 94678
rzlatic
Joined: 20 Nov 07
Posts: 3
Credit: 327,897
RAC: 0
Message 94686 - Posted: 17 Apr 2020, 15:06:58 UTC
Last modified: 17 Apr 2020, 15:09:20 UTC

is there a reason some of my workunits come to a crawl after 98% (8-10 hours at that time), with ETA stuck at 10:05 +/- a few seconds. after that the crunching of that particular WU lasts for hours.

it's not with all tasks, but every day there is a (different) task with a very similar percentage and very similar ETA. this has happened repeatedly during the last week.

tried to tweak disk, memory and cpu settings but ended up with the same issue. anybody else?
ID: 94686
CIA
Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 94688 - Posted: 17 Apr 2020, 15:35:22 UTC - in response to Message 94686.  

is there a reason some of my workunits come to a crawl after 98% (8-10 hours at that time), with ETA stuck at 10:05 +/- a few seconds. after that the crunching of that particular WU lasts for hours.

it's not with all tasks, but every day there is a (different) task with a very similar percentage and very similar ETA. this has happened repeatedly during the last week.

tried to tweak disk, memory and cpu settings but ended up with the same issue. anybody else?


You aren't the only one seeing this. I had a similar problem to what you described happen to me yesterday. I thought it was just a bug in that one WU. The one in question was https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1035999438.

I let it run for over 24 hours and it slowly crawled up to about 99.6% complete. Finally I decided to just restart my machine (it had been running for 36 days, so I wanted to anyway for housekeeping) and when it came back up the WU restarted from the beginning.

I'll see if it finishes this time.
ID: 94688
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,071,286
RAC: 12
Message 94692 - Posted: 17 Apr 2020, 16:56:28 UTC - in response to Message 94688.  

There is apparently an issue with some (not all) tasks whose name starts with 12v1n. I had one go over 36 hours before I finally aborted it. Others have reported similar issues. However, not all tasks that start 12v1n have the issue. Just abort it if it runs past where the watchdog would abort it.

-Charlie
ID: 94692
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94697 - Posted: 17 Apr 2020, 17:46:47 UTC - in response to Message 94686.  

rzlatic, it appears you are seeing tasks running more than 4 hours past the runtime preference, then ended by the watchdog on a Linux system running the i686 application. Please see the discussion here.
Rosetta Moderator: Mod.Sense
ID: 94697
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94710 - Posted: 17 Apr 2020, 21:04:06 UTC - in response to Message 94678.  
Last modified: 17 Apr 2020, 21:27:04 UTC

In my way of looking at it, using the runtime preference to compute a FLOPS estimate is incorporating the runtime preference into each WU. As you point out, that is not what is being done. So I don't follow why you assert that an estimate, which is the same for all WUs, is a factor in determining credit.
Because that is the way the BOINC Credit mechanism is written (i posted a link in my previous post), so unless Rosetta have modified that code, that's what happens.



The reason credit for 1 decoy vs 600 could (you are exaggerating)
Yes, but only a little bit.
be the same is that some WUs are much harder to compute than others.
Yep. That is the case.



Because the hosts before you all found they were hard to compute (more CPU seconds per model), they are granted high credit per decoy. Because the hosts before a report of 600 decoys completed all found they only take 5 minutes per decoy, the credit per decoy is lower.
That is the way it is meant to work, and it mostly does.



All regardless of runtime preference and FLOPS estimates, neither of which reflect actual work done.
Once again- i agree.
The time to compute reflects work done, but Rosetta has fixed runtimes. So the work done is determined based on the benchmark of the CPU, and the Estimated computation size for that Task.
A 1GFLOP CPU does so many FLOPs if it processes work for 4 hours.
A 1,000 GFLOP CPU does so many FLOPs if it processes work for 4 hours.

A 1GFLOP CPU does so many FLOPs if it processes work for 24 hours.
A 1,000 GFLOP CPU does so many FLOPs if it processes work for 24 hours.
If the Estimate for computation size is the same for 24 hours as it is for 4 hours those that process work for 24 hours will get less Credit than those that process work for 4 hours. Yes, the Credit mechanism does self-correct, but it takes time & a lot of results.
Hence the frequent spray of Credit values even for Tasks of the same type on the same application.
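A small worked example of the point above, as a hedged Python sketch. It assumes "a 1GFLOP CPU" means a sustained 1 GFLOP per second, takes the fixed 80,000 GFLOPs figure mentioned earlier in the thread, and assumes credit would follow the estimate rather than the work done; none of this is confirmed Rosetta behaviour.

```python
FIXED_ESTIMATE_GFLOPS = 80_000  # fixed "Estimated computation size" quoted earlier in the thread


def work_done_gflops(speed_gflops_per_s, hours):
    """Floating point work performed at a sustained speed over a run time."""
    return speed_gflops_per_s * hours * 3600


for hours in (4, 24):
    done = work_done_gflops(1, hours)
    # If credit tracked the fixed estimate, the same payout would be spread over `hours`.
    print(f"{hours:>2} h: {done} GFLOPs done, "
          f"{FIXED_ESTIMATE_GFLOPS / hours:.0f} estimated GFLOPs credited per hour")
```

With the estimate held constant, the 24-hour run does six times the work of the 4-hour run but is credited per hour at one sixth the rate, which is the discrepancy described here.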



So, I agree, a dynamic FLOPS estimate aligned with runtime preference, incorporated into each WU as it is assigned, would give you a beautiful estimated time to completion. But it would slow the assignment of work, and not impact granted credit.
Given the servers you have, and the amount of work returned per hour, there would be no impact on the time taken to allocate work- the Scheduler has all those values anyway when it comes to deciding if people get work or not. Dividing or multiplying a number by another number before putting it in a field would have no impact on the Scheduler's performance.
And as i've mentioned before- i've posted the link showing that the Estimated computation size is central to the allocation of Credit (unless Rosetta have implemented their own system)
Grant
Darwin NT
ID: 94710
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94722 - Posted: 18 Apr 2020, 3:10:45 UTC - in response to Message 94583.  
Last modified: 18 Apr 2020, 3:11:50 UTC

It could happen, certainly. Such events generally are with new protocols, which are revised to prevent such long-running models. If you would link to the WU, we could see more about it. How much credit did it get? This would be a clue as to whether other systems running WUs from the same batch are having similar struggles. Was the WU ended by the watchdog? What was the runtime preference when the WU was run?
Looks like there are some more RB Tasks with longer than selected CPU Runtimes, but with different Task names from the last group.

Will be a while yet before they are finished, but one hasn't had a checkpoint in 30min.
Grant
Darwin NT
ID: 94722
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94726 - Posted: 18 Apr 2020, 3:44:53 UTC - in response to Message 94710.  
Last modified: 18 Apr 2020, 3:45:59 UTC

...i've posted the link showing that the Estimated computation size is central to the allocation of Credit (unless Rosetta have implemented their own system)


Yes, this is what I've been saying, the only way to allocate credit on a per model basis, and flex with the variability of model computation across WU batches, is to implement your own system. See this sticky thread with the details.
Rosetta Moderator: Mod.Sense
ID: 94726
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94732 - Posted: 18 Apr 2020, 4:35:47 UTC - in response to Message 94726.  

Yes, this is what I've been saying, the only way to allocate credit on a per model basis, and flex with the variability of model computation across WU batches, is to implement your own system. See this sticky thread with the details.
Thanks for that.
It doesn't explain how the Credit system works, but it does show that it isn't the standard Credit New system.
Grant
Darwin NT
ID: 94732
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1506
Credit: 14,889,290
RAC: 20,843
Message 94738 - Posted: 18 Apr 2020, 7:01:57 UTC - in response to Message 94722.  
Last modified: 18 Apr 2020, 7:42:39 UTC

It could happen, certainly. Such events generally are with new protocols, which are revised to prevent such long-running models. If you would link to the WU, we could see more about it. How much credit did it get? This would be a clue as to whether other systems running WUs from the same batch are having similar struggles. Was the WU ended by the watchdog? What was the runtime preference when the WU was run?
Looks like there are some more RB Tasks with longer than selected CPU Runtimes, but with different Task names from the last group.

Will be a while yet before they are finished, but one hasn't had a checkpoint in 30min.

Looks like the Watchdog isn't working on 2 of the Tasks presently in progress.

Target CPU Run time 8hrs


Runtime 13hrs 39min, CPU time 13hrs 32min 27sec, Last checkpoint at 13hrs 10min 12sec.
So far.
rb_03_31_20049_19874__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904837_1472_0


Runtime 13hrs 20min 27sec, CPU time 13hrs 17min 51sec, Last checkpoint at 13hrs 08min 24sec.
So far.
rb_03_31_20031_19865__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904757_832_0
Grant
Darwin NT
ID: 94738



©2024 University of Washington
https://www.bakerlab.org