long/large work units, cpu_run_time limit and how to check 'progress'?

Message boards : Number crunching : long/large work units, cpu_run_time limit and how to check 'progress'?


sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90614 - Posted: 5 Apr 2019, 0:33:44 UTC
Last modified: 5 Apr 2019, 0:57:31 UTC

i kind of started crunching rosetta@home again and received a set of Rosetta 4.07 jobs
rb_04_03_2501_2629_ab_1000_robetta_cstwt_5.0*
these seem to be somewhat bigger, more complex proteins, judging from the 'show graphics'
what surprises me is that i have the run time limit cpu_run_time set to 4 hours
but these WUs have run beyond the 4 hours i've come to expect (more on this below: there are apparently no results, i.e. no decoy found, in that 4 hour run)

from the 'show graphics', the lowest value shown for 'low energy' is a little below -200, but i'm not sure whether that is 'low enough'. and it keeps bouncing up to try other conformations which have higher energy

is that cpu_run_time limit still in effect and used anywhere? (oops, looked it up in the online preferences page, Target CPU run time is still 4 hours, so i'd guess it is still used?)

another thing: is there a way i can check the 'progress'?
i tried going into the slots/n/ directory and looking at the stderr
and apparently neither stderr nor stdout lists any 'decoys' (models) messages, and the jobs keep running.
is stderr or stdout the correct file to find out if any 'decoys' (models) have been found for the WU (in particular while it is running)?

limiting the continuous run time for the wu is necessary as i'd normally switch off the pc after that. i normally let the jobs run at night as room temperatures are cooler, and they run as i sleep so that they get as much uninterrupted cpu usage as possible to complete the WU

i'll try to suspend the jobs as they have run for some 5 hours, beyond the cpu_run_time of 4 hours, and apparently there are no decoys found yet, if stderr is the correct file to check. hopefully the data is still in the checkpoint and the wu can continue later. note that for long running WUs i'm ok for them to be 'suspended' and continue from that point, say, the next night. that would in a way allow more 'difficult' WUs to complete
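
A rough sketch of the check described above: scan the log files in every slot directory for lines that mention decoys. The file names stderr.txt and stdout.txt, the data directory layout, and the idea that a finished model is logged with the word "decoy" are all assumptions here, not confirmed Rosetta behaviour, so an empty result does not prove that no model was produced.

```python
import re
from pathlib import Path

# Assumption: any line mentioning "decoy" counts as a hit; the real
# Rosetta log wording may differ between application versions.
DECOY_PATTERN = re.compile(r"decoy", re.IGNORECASE)


def decoy_lines(text):
    """Return the lines of a log that mention decoys."""
    return [line for line in text.splitlines() if DECOY_PATTERN.search(line)]


def scan_slots(boinc_data_dir):
    """Map each slots/N/std*.txt file to its decoy-related lines."""
    hits = {}
    # Assumed layout: <data dir>/slots/<n>/stderr.txt and stdout.txt
    for log in sorted(Path(boinc_data_dir).glob("slots/*/std*.txt")):
        text = log.read_text(errors="replace")
        hits[str(log)] = decoy_lines(text)
    return hits
```

Calling scan_slots() with the local BOINC data directory returns a dictionary keyed by log file path; a task whose lists are all empty has not logged anything matching the pattern yet.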
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90617 - Posted: 5 Apr 2019, 9:08:13 UTC
Last modified: 5 Apr 2019, 9:16:38 UTC

ok finally it completes after 5 hours of run time, 6 hours elapsed, suspended once in between, no fanfare
http://boinc.bakerlab.org/rosetta/result.php?resultid=1066394376
a single decoy in that 5 hours
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 892
Credit: 3,347,076
RAC: 3,113
Message 90618 - Posted: 5 Apr 2019, 9:44:37 UTC - in response to Message 90617.  
Last modified: 5 Apr 2019, 9:44:55 UTC

ok finally it completes after 5 hours of run time, 6 hours elapsed, suspended once in between, no fanfare
http://boinc.bakerlab.org/rosetta/result.php?resultid=1066394376
a single decoy in that 5 hours


Same here on my Xeon
Runtime 2hs, 6hs of calculation, 1 decoy
These are big proteins, i think
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90619 - Posted: 5 Apr 2019, 10:00:17 UTC - in response to Message 90618.  

i've been wary of suspending jobs, concerned that they may not checkpoint adequately and continue from that point.
it is good that suspending them did not cause any visible harm and i can continue the long jobs at a separate sitting
that may allow me to use a higher cpu_run_time so that i can crunch the bigger jobs as well, just like this batch
Mod.Sense
Volunteer moderator
Project administrator

Joined: 22 Aug 06
Posts: 3528
Credit: 0
RAC: 0
Message 90621 - Posted: 5 Apr 2019, 15:48:47 UTC

The graphics is the simplest way to see how many decoys a given WU has produced. Also, from the properties of the WU display, you can see the number of CPU seconds currently, and the number at the time of the last checkpoint. If the PC is powered down or the task is removed from memory to run another BOINC task, the work will resume at the checkpoint (if any). If no checkpoint has been taken yet, work will be restarted from the beginning.

The goal is to checkpoint every 10-15 minutes. But new protocols and large proteins often start by going longer between checkpoints. If the protocol proves itself useful, then it is generally enhanced to do more frequent checkpointing.
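
As a small illustration of the point about CPU seconds, the work at risk is the current CPU time minus the CPU time at the last checkpoint. The sketch below parses a text report for those two numbers. The field labels ("name", "current CPU time", "checkpoint CPU time") are modelled on what `boinccmd --get_tasks` is assumed to print, and the sample report is made up, so check the labels against the local client before relying on it.

```python
def unsaved_cpu_seconds(task_report):
    """Return CPU seconds done since the last checkpoint, per task name."""
    results = {}
    name = None
    fields = {}
    for raw in task_report.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # separator lines carry no "label: value" pair
        key, value = key.strip(), value.strip()
        if key == "name":
            # A new task entry starts; forget the previous task's numbers.
            name = value
            fields = {}
        elif key in ("current CPU time", "checkpoint CPU time") and name:
            fields[key] = float(value)
            if len(fields) == 2:
                results[name] = (
                    fields["current CPU time"] - fields["checkpoint CPU time"]
                )
    return results


# Hypothetical report for one task that has never checkpointed.
SAMPLE = """\
1) -----------
   name: example_task_0
   WU name: example_task
   current CPU time: 18000.0
   checkpoint CPU time: 0.0
   fraction done: 0.6
"""

print(unsaved_cpu_seconds(SAMPLE))
```

A task showing a checkpoint CPU time of zero after hours of work is one that would restart from the beginning if the PC were shut down at that moment.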
Rosetta Moderator: Mod.Sense
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90622 - Posted: 5 Apr 2019, 17:30:19 UTC - in response to Message 90621.  
Last modified: 5 Apr 2019, 17:35:30 UTC

thanks! :)
i didn't realise the graphics show the number of models (decoys) found :)
it would indeed be good if larger proteins checkpoint at longer intervals, as the files are rather large, and i'd think more so for large molecules.
the compromise of course is that if the task is suspended, or for that matter the pc is shut down, more is lost between the checkpoints, so it would take longer to resume the job and finish. but i think for administrative suspend on the panel, it would seem that would kind of trigger a checkpoint, i'm not too sure if it does, but i'd think it should

boinc preferences apparently has a 'checkpoint at most every' interval parameter, which i set to 2 minutes. intervals that are too close together may cause a lot of 'disk thrashing'; hard disks are still pretty much the norm even though ssds are gaining popularity. hence, users should be able to influence it with that parameter as well
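
The trade-off between disk writes and lost work can be put into rough numbers. This is a back-of-the-envelope sketch under two loud assumptions: the task really does checkpoint at exactly the given spacing (the thread makes clear that large proteins often do not), and the shutdown happens at a uniformly random moment, so on average half an interval of work is lost.

```python
def checkpoint_tradeoff(interval_s, run_time_s):
    """Return (checkpoint writes during the run, expected CPU seconds lost)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    writes = int(run_time_s // interval_s)
    # Assumption: a stop at a uniformly random time loses half an interval.
    expected_loss_s = interval_s / 2
    return writes, expected_loss_s


# 4-hour run (14400 s) with 2, 15 and 30 minute checkpoint spacing.
for interval in (120, 900, 1800):
    writes, loss = checkpoint_tradeoff(interval, 4 * 3600)
    print(f"every {interval:>4} s: {writes:>3} writes, ~{loss:.0f} s lost on average")
```

Longer spacing cuts the number of large writes but raises the average loss per interruption, which is the compromise described above.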

but the checkpoint proves useful as in that set of tasks i suspended them on the panel and restarted them today and they complete without issues.
it is good as this alleviates the concern that long running jobs lose all that work after running for hours, and i'd be able to crunch bigger jobs which may take more than a single 'sitting' (continuous run interval)
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90623 - Posted: 5 Apr 2019, 18:05:07 UTC - in response to Message 90622.  

off-topic:
it seems that large molecules / proteins have many more folding permutations, so it is much harder for them to reach an appropriate fold.
that may point to a natural cause of diseases due to protein misfolding, e.g. alzheimer's and cancer
misfolded proteins cause alzheimer or cancer?
Mod.Sense
Volunteer moderator
Project administrator

Joined: 22 Aug 06
Posts: 3528
Credit: 0
RAC: 0
Message 90640 - Posted: 8 Apr 2019, 18:57:26 UTC - in response to Message 90622.  


... i think for administrative suspend on the panel, it would seem that would kind of trigger a checkpoint, i'm not too sure if it does, but i'd think it should


Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So, it is not possible to call something in the task and command it to take a checkpoint now.
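
A toy model may make this clearer. It is only a sketch with made-up names (Task, run_step, restart), not Rosetta's or BOINC's actual code: state can be saved only at a safe point between whole work steps, so a stop that arrives in the middle of a long step falls back to the previous save.

```python
class Task:
    """Toy task that can only save state between whole work steps."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.cpu_time = 0.0          # CPU seconds done so far
        self.checkpoint_time = 0.0   # CPU seconds covered by the last save

    def run_step(self, step_seconds):
        # Mid-step the state is incomplete, so nothing can be saved here.
        self.cpu_time += step_seconds
        # Safe point: the step is finished and the state is consistent.
        if self.cpu_time - self.checkpoint_time >= self.min_interval_s:
            self.checkpoint_time = self.cpu_time

    def restart(self):
        # After a reboot, work resumes from the last checkpoint (or zero).
        self.cpu_time = self.checkpoint_time


task = Task(min_interval_s=600)
task.run_step(300)      # 300 s done, interval not reached, no save
task.run_step(300)      # safe point at 600 s, state is saved
task.cpu_time += 5000   # partway through a long step, no safe point yet
task.restart()          # suspend + reboot: back to the 600 s save
print(task.cpu_time)
```

In this model a suspend request has nothing to call: the only place a save can happen is inside run_step, after a step has finished.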
Rosetta Moderator: Mod.Sense
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 892
Credit: 3,347,076
RAC: 3,113
Message 90660 - Posted: 12 Apr 2019, 8:50:31 UTC - in response to Message 90640.  

Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So, it is not possible to call something in the task and command it to take a checkpoint now.


And, after a reboot, all my "_robetta_cstwt_5.0*" restart from 0%
5hs of crunching lost...
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90662 - Posted: 12 Apr 2019, 14:29:50 UTC - in response to Message 90660.  
Last modified: 12 Apr 2019, 14:34:01 UTC

Unfortunately, it does not work that way. The task has to reach a point where it can completely store and reload itself. So, it is not possible to call something in the task and command it to take a checkpoint now.


And, after a reboot, all my "_robetta_cstwt_5.0*" restart from 0%
5hs of crunching lost...

next time try to do a full proper suspend for all the tasks before you shut down. that may make a difference
i'm not sure why, but for that batch, perhaps i'm lucky, i was able to continue from that point forwards after restarting

perhaps it isn't possible for all wu, but for a fraction of them it works.
r@h should seriously look at resumable checkpoints. even if the checkpoint interval is 15 or even 30 minutes, it would at least place a savepoint there, so that if for any reason the pc is shut down the wu can be continued. otherwise some (many?) participants may not be able to run the long jobs
Jim1348

Joined: 19 Jan 06
Posts: 294
Credit: 8,416,590
RAC: 8,815
Message 90663 - Posted: 12 Apr 2019, 15:01:10 UTC - in response to Message 90662.  

r@h should seriously look at resumable checkpoints. even if the checkpoint interval is 15 or even 30 minutes, it would at least place a savepoint there, so that if for any reason the pc is shut down the wu can be continued. otherwise some (many?) participants may not be able to run the long jobs

I run my machines 24/7, and it would not be much of a problem. So if they are going to produce a large number of such workunits, they could set up a separate queue, and allow the users to select it.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 892
Credit: 3,347,076
RAC: 3,113
Message 90664 - Posted: 12 Apr 2019, 19:33:51 UTC - in response to Message 90662.  

next time try to do a full proper suspend for all the tasks before you shut down. that may make a difference
i'm not sure why, but for that batch, perhaps i'm lucky, i was able to continue from that point forwards after restarting


Nope. Restarting from 0% after pause and reboot :-(
Mod.Sense
Volunteer moderator
Project administrator

Joined: 22 Aug 06
Posts: 3528
Credit: 0
RAC: 0
Message 90665 - Posted: 12 Apr 2019, 20:19:24 UTC - in response to Message 90662.  

r@h should seriously look at resumable checkpoints. even if the checkpoint interval is 15 or even 30 minutes, it would at least place a savepoint there, so that if for any reason the pc is shut down the wu can be continued. otherwise some (many?) participants may not be able to run the long jobs


In early development of a new method of analysis of a large protein, it is pretty common to span long periods of time without checkpoints. If the new method proves useful, and yields better models, then further development is done to improve runtime per model and checkpointing.
Rosetta Moderator: Mod.Sense
rjs5

Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 90671 - Posted: 14 Apr 2019, 17:13:03 UTC - in response to Message 90665.  

r@h should seriously look at resumable checkpoints. even if the checkpoint interval is 15 or even 30 minutes, it would at least place a savepoint there, so that if for any reason the pc is shut down the wu can be continued. otherwise some (many?) participants may not be able to run the long jobs


In early development of a new method of analysis of a large protein, it is pretty common to span long periods of time without checkpoints. If the new method proves useful, and yields better models, then further development is done to improve runtime per model and checkpointing.



Why isn't this new and possibly disruptive work done on RALPH? Seems like RALPH is the place where Rosetta experimentation takes place and not on the main Rosetta@home. The RALPH volunteers are expecting this and it does not disrupt those who don't want to be messed up.

Other projects that perform their development work on their main site have the option for crunchers to opt-out of this testing.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 892
Credit: 3,347,076
RAC: 3,113
Message 90672 - Posted: 14 Apr 2019, 20:34:43 UTC - in response to Message 90671.  

Why isn't this new and possibly disruptive work done on RALPH? Seems like RALPH is the place where Rosetta experimentation takes place and not on the main Rosetta@home. The RALPH volunteers are expecting this and it does not disrupt those who don't want to be messed up.


+1.
I crunch on both Ralph and Rosetta.
When i crunch on Ralph i have no problems with crash, errors, etc. It's normal in beta test.
When i crunch on Rosetta i would like stability and no errors.
sgaboinc

Joined: 2 Apr 14
Posts: 188
Credit: 134,531
RAC: 0
Message 90674 - Posted: 15 Apr 2019, 6:35:48 UTC - in response to Message 90672.  

Why isn't this new and possibly disruptive work done on RALPH? Seems like RALPH is the place where Rosetta experimentation takes place and not on the main Rosetta@home. The RALPH volunteers are expecting this and it does not disrupt those who don't want to be messed up.


+1.
I crunch on both Ralph and Rosetta.
When i crunch on Ralph i have no problems with crash, errors, etc. It's normal in beta test.
When i crunch on Rosetta i would like stability and no errors.


errors can be the results themselves, e.g. if a researcher generates lots of arbitrary models and maybe only 1 in 1,000,000 is a model (protein) that would assemble and run to completion, then the other 999,999 would *run to failure* with an error and only that last 1 in 1,000,000 runs to completion
lol

the extreme of which i'd think some proteins may be completely synthetic, i.e. not seen in nature
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 892
Credit: 3,347,076
RAC: 3,113
Message 90675 - Posted: 15 Apr 2019, 9:09:59 UTC - in response to Message 90674.  

errors can be the results themselves, e.g. if a researcher generates lots of arbitrary models and maybe only 1 in 1,000,000 is a model (protein) that would assemble and run to completion, then the other 999,999 would *run to failure* with an error and only that last 1 in 1,000,000 runs to completion


Errors are results in test projects (like Ralph), because debugging is welcome.
I'm thinking about technical errors like "validation error", "c++(out of memory) error", etc., in production projects like Rosetta.




©2019 University of Washington
http://www.bakerlab.org