No checkpoint in more than 1 hour - Largescale_large_fullatom...

Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...

Shoikan

Joined: 4 Apr 06
Posts: 14
Credit: 180,211
RAC: 0
Message 13944 - Posted: 17 Apr 2006, 9:47:18 UTC - in response to Message 13877.  

Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid


ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones it could mean 5 or 6 hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting.

The BEST answer, if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein.

Also keep in mind that "keep in memory" only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory.


This issue has to be addressed ASAP. Many cycles go straight into the trash because of this. An improved checkpointing system should be the #1 priority on the to-do list of Rosie's development team.

Regards.
Insidious

Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13950 - Posted: 17 Apr 2006, 14:20:19 UTC - in response to Message 13944.  

Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid


ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones it could mean 5 or 6 hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting.

The BEST answer, if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein.

Also keep in mind that "keep in memory" only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory.


This issue has to be addressed ASAP. Many cycles go straight into the trash because of this. An improved checkpointing system should be the #1 priority on the to-do list of Rosie's development team.

Regards.


Bin Qian addressed this already above (we all agree on this!)

Proudly crunching with TeAm Anandtech
River~~

Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 14022 - Posted: 18 Apr 2006, 9:10:31 UTC - in response to Message 13912.  
Last modified: 18 Apr 2006, 9:27:24 UTC

You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you.


If you go for the separate project, then at times when there are no extreme WUs to run, you could always deliver ordinary Rosetta work to the Rex users to keep their project share in use.

Ralph was a shortening of Rosetta alpha, a nickname suggested by another Bill, in fact.

Maybe Rosetta extreme could be Rex?

Another way to do this would be to implement some form of user preference flag, off by default, but which, when set manually by the user, made them eligible to receive work that was Extreme in some way. The project team could ask for volunteers to set the FullOn flag if they wanted to opt in.

One method needs more coding; the other needs a separate URL and some work at sysadmin level on one or more servers. The Bakerlab folk will know which of these is easier to deliver.

I'd suggest either would be a good solution from the user's point of view. Your users already have experience of choosing the development project (Ralph) and of customising the run length of their work, and my impression is that both forms of user control have been well received.

I am running down my CPDN participation due to their refusal to implement any form of user-specified control over the size of work that is issued. That is despite the fact that I personally feel their science to be very important.

It is good that you acted to protect some users, but sad that you lose out on some interesting work by doing so. I feel sure that increased user control is the way forward. It won't attract any more people, but it will help you keep the ones you've already got.

River~~
Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14023 - Posted: 18 Apr 2006, 9:28:41 UTC

I think it makes MUCH more sense to fix the mistakes of the large-scale work units than it does to try to design a sub-project like this "extreme" thing.

-Sid
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 14026 - Posted: 18 Apr 2006, 10:26:56 UTC - in response to Message 14023.  

I think it makes MUCH more sense to fix the mistakes of the large-scale work units than it does to try to design a sub-project like this "extreme" thing.

-Sid


Sid, although obviously I don't know the internals of Rosetta's code, it may well be quite difficult to "checkpoint" at intermediate points within a model.

In that case, it could mean the program needs to write many MB of memory to disk, and that could adversely affect performance on an "average" PC.

So I wouldn't call it a "mistake" to "fix"; these are just bigger jobs which aren't suited to many BOINC PCs.

So it seems to me that the options are:

1/ Only run those big WUs internally, on the project's own 500-node Linux cluster.

2/ Send those WUs only to PCs which are eligible / capable / willing to crunch them: the BigWU flag I've been talking about since last month. (This needs changes to the BOINC software, which could take time.)

3/ Establish a new BOINC project, for the purpose of running those big WUs, until #2 is possible.

Unless someone can tweak the BOINC server code to implement #2 ASAP (bug-free?) so it can be ready for CASP, it seems to me that #3 is the only option immediately available.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14036 - Posted: 18 Apr 2006, 13:35:10 UTC
Last modified: 18 Apr 2006, 13:37:13 UTC

Yep, a Rosetta Advanced for high-demand models would be a good idea and very easy to implement. It's like CPDN, which has a subproject, the Seasonal Attribution Project, only for those who can cope with the high specs and with the fact that it checkpoints only every 4-8 hours (sic!).

I contribute to the Seasonal Attribution Project with my standalone comp, partly because I know my computer can do this while others' can't. I'm sure a Rosetta Extreme (or Advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters.
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14069 - Posted: 18 Apr 2006, 19:25:26 UTC - in response to Message 14036.  
Last modified: 18 Apr 2006, 19:53:14 UTC

Yep, a Rosetta Advanced for high-demand models would be a good idea and very easy to implement. It's like CPDN, which has a subproject, the Seasonal Attribution Project, only for those who can cope with the high specs and with the fact that it checkpoints only every 4-8 hours (sic!).

I contribute to the Seasonal Attribution Project with my standalone comp, partly because I know my computer can do this while others' can't. I'm sure a Rosetta Extreme (or Advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters.


Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory.

Technically the time setting could serve two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way; if the value was equal to 4 days, you would get any available Work Unit, including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help; they might want to avoid the 4-day setting.

Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors.
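A server-side sketch of this idea, purely illustrative: the function name, the WU field names, and the 4-day cutoff are assumptions taken from this post, not actual BOINC scheduler code.

```python
# Illustrative sketch of the "time setting as an opt-in" idea from this post.
# select_workunit, the "large" field, and the 4-day cutoff are assumptions.

def select_workunit(pref_days, available_wus):
    """Return the first WU this host may receive, or None.

    Hosts whose user-selectable time setting is 4 days or more opt in to
    any available WU, large ones included; everyone else only gets small
    WUs, so modem and farm users are unaffected unless they ask for more.
    """
    if pref_days >= 4:
        eligible = available_wus
    else:
        eligible = [wu for wu in available_wus if not wu["large"]]
    return eligible[0] if eligible else None
```

Note this relies on the "discipline" mentioned above: nothing here checks whether the host actually has enough memory for a large WU.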

Moderator9
ROSETTA@home FAQ
Moderator Contact
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14072 - Posted: 18 Apr 2006, 20:02:30 UTC - in response to Message 14069.  



Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory.

Technically the time setting could serve two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way; if the value was equal to 4 days, you would get any available Work Unit, including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help; they might want to avoid the 4-day setting.

Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors.


AFAIK it is not currently possible to discriminate between hosts when distributing WUs. That would need a change in the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment.

P.S.: Even one laptop of mine with only 256 MB RAM has already crunched two Large WUs successfully and is crunching a third at the moment. Perhaps with proper checkpointing there is no need to discriminate.
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14074 - Posted: 18 Apr 2006, 20:24:05 UTC - in response to Message 14072.  



Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory.

Technically the time setting could serve two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way; if the value was equal to 4 days, you would get any available Work Unit, including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help; they might want to avoid the 4-day setting.

Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors.


AFAIK it is not currently possible to discriminate between hosts when distributing WUs. That would need a change in the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment.

P.S.: Even one laptop of mine with only 256 MB RAM has already crunched two Large WUs successfully and is crunching a third at the moment. Perhaps with proper checkpointing there is no need to discriminate.

You are correct, BOINC cannot do this by itself.

But what I am suggesting is that users could ASK for larger Work Units using the time setting. As I said, this would require SOME discipline on the part of the people using the feature. The project certainly knows which WUs are large and which are small, so there might be a way to send large ones to systems that specifically request them using a preference setting.

But if there is no BOINC capability for distributing particular Work Units based on a preference setting then it would not work at all.

Right now the best option seems to be to get the real bugs out of the system and then tackle the checkpointing issue. But with the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application with some use restrictions than an application that checkpoints a lot but fails routinely for other reasons.


Moderator9
ROSETTA@home FAQ
Moderator Contact
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 14084 - Posted: 18 Apr 2006, 22:20:06 UTC

I wonder if BOINC could redefine "Homogeneous Redundancy" to suit this need.

Predictor uses Homogeneous Redundancy.

tony
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14160 - Posted: 20 Apr 2006, 5:23:39 UTC - in response to Message 14074.  

Right now the best option seems to be to get the real bugs out of the system and then tackle the checkpointing issue. But with the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application with some use restrictions than an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and the discussion here:

1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04%, and we have been testing the fixes on Ralph since yesterday.

2. We've coded Rosetta up to do more frequent checkpointing in the modeling process. Now, for the large jobs, we expect less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate the stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H and make your crunching effort much more enjoyable!
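The watchdog in point 3 can be pictured as a monitor thread that watches model progress and steps in after a stall. This is a minimal sketch of that idea only; the names, the one-hour stall limit, and the polling period are all illustrative assumptions, not Rhiju's actual code.

```python
import threading
import time

def start_watchdog(get_progress, on_stuck, stall_limit=3600.0, poll=5.0):
    """Sketch of a watchdog for stuck jobs: if get_progress() stops
    advancing for stall_limit seconds, call on_stuck() so intermediate
    results can be returned instead of the job hanging forever.
    All names and timings here are illustrative assumptions."""
    def monitor():
        last = get_progress()
        stalled = 0.0
        while stalled < stall_limit:
            time.sleep(poll)
            current = get_progress()
            if current != last:
                last, stalled = current, 0.0  # progress seen: reset the clock
            else:
                stalled += poll
        on_stuck()  # no progress for too long: report what we have

    t = threading.Thread(target=monitor, daemon=True)
    t.start()
    return t
```

Running it as a daemon thread means the watchdog never blocks a normal, healthy exit of the work unit.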
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 14162 - Posted: 20 Apr 2006, 6:36:50 UTC

Bin Qian stated:
Now for the large jobs, we are expecting less than 30 minutes for the time between two check points.

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400 to 600 MHz machines with 512 MB of RAM?
Insidious

Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 14168 - Posted: 20 Apr 2006, 9:33:34 UTC - in response to Message 14160.  

Right now the best option seems to be to get the real bugs out of the system and then tackle the checkpointing issue. But with the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application with some use restrictions than an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and the discussion here:

1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04%, and we have been testing the fixes on Ralph since yesterday.

2. We've coded Rosetta up to do more frequent checkpointing in the modeling process. Now, for the large jobs, we expect less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate the stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H and make your crunching effort much more enjoyable!


This sounds like good progress.
Thanks for the update!

-Sid

Proudly crunching with TeAm Anandtech
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14185 - Posted: 20 Apr 2006, 17:25:46 UTC - in response to Message 14162.  
Last modified: 20 Apr 2006, 18:23:14 UTC

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400 to 600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz, 512 MB, running Linux. The longest interval between checkpoints is about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

Now for most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without "leave in memory", Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or switch projects without leaving the application in memory.
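The restart limit described above might look something like the sketch below. This is only an illustration of the idea; the function name, the persisted-state format, and whether the cutoff fires on the 5th or 6th restart are assumptions, not David Kim's implementation.

```python
MAX_RESTARTS = 5  # limit mentioned in the post

def on_restart(state):
    """Called each time the app starts from a checkpoint rather than from
    scratch. `state` is a dict persisted alongside the checkpoint.
    Returns "continue" or "finish_early". A sketch of the idea only."""
    state["restarts"] = state.get("restarts", 0) + 1
    if state["restarts"] >= MAX_RESTARTS:
        # Too many restarts without "leave in memory": stop here and
        # report whatever models were completed so far.
        return "finish_early"
    return "continue"
```

The key design point is that hitting the limit is not an error: the WU is ended gracefully and its intermediate results are still returned and credited.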
Osku87

Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 14195 - Posted: 20 Apr 2006, 20:25:00 UTC

David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without "leave in memory", Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or switch projects without leaving the application in memory.

This is something I have been waiting for. It would be nice if you could let us know when it is in use.
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14196 - Posted: 20 Apr 2006, 20:46:56 UTC - in response to Message 14195.  

David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without "leave in memory", Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or switch projects without leaving the application in memory.

This is something I have been waiting for. It would be nice if you could let us know when it is in use.


You bet! We will announce the changes when we update the R@H application. Currently this feature is being tested on Ralph.
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 14215 - Posted: 21 Apr 2006, 2:40:07 UTC - in response to Message 14185.  

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400 to 600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz, 512 MB, running Linux. The longest interval between checkpoints is about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

Now for most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

So a typical WU will checkpoint much more often than every 25 min, say every 5 min on a fast CPU. I'm not sure how much data has to be written at each checkpoint, but this sounds like it might create quite a lot of disk thrashing. I certainly don't want checkpoints written that frequently (I'm not a points tight-arse, so I don't care if I lose "a few" minutes' worth of points), especially if they are big checkpoints.

I would prefer a longer time interval between checkpoints, or even better a user option (like the "model cpu time" option).

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14218 - Posted: 21 Apr 2006, 3:08:58 UTC - in response to Message 14215.  

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")


Don't worry: the checkpoint file is only about 300 kB for larger proteins, which will have longer intervals, and much smaller (less than 100 kB) for smaller proteins with shorter checkpointing intervals. Also, this checkpointing mechanism is optional; we will only turn it on in WUs with larger proteins or longer searches (based on scientific justification). So most likely you will not have any checkpoint files for normal jobs.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14223 - Posted: 21 Apr 2006, 3:45:31 UTC - in response to Message 14218.  
Last modified: 21 Apr 2006, 3:46:38 UTC

I'm ecstatic about all the changes and improvements we've seen in just a short time here. Great job!

we will only turn it on in WUs with larger proteins or longer searches


Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?

What I'm thinking is that a very slow PC might have spent an hour just to get a portion of the way through even a small WU. Why not "come up for air" when you reach a point where a checkpoint is possible, and see how much CPU time has gone by since your last checkpoint? If it is >20 min., then write the checkpoint.

That way, you'd ship WUs that always checkpoint at least once an hour, regardless of the protein size.
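This "come up for air" suggestion amounts to a simple time throttle applied at each checkpointable stage. A minimal sketch, assuming a 20-minute threshold and invented names; this is an illustration of the suggestion, not Rosetta's actual code.

```python
CHECKPOINT_THRESHOLD = 20 * 60.0  # seconds of CPU time; assumed from the post

class CheckpointThrottle:
    """At every stage where a checkpoint is structurally possible, write
    one only if enough CPU time has passed since the last. A sketch of
    the suggestion above, not Rosetta's actual implementation."""

    def __init__(self, threshold=CHECKPOINT_THRESHOLD):
        self.threshold = threshold
        self.last_cpu = 0.0  # CPU time at the last checkpoint written

    def maybe_checkpoint(self, cpu_now, write_checkpoint):
        # Called only at checkpointable stages of the model; fast hosts
        # will skip most opportunities, slow hosts will take them.
        if cpu_now - self.last_cpu >= self.threshold:
            write_checkpoint()
            self.last_cpu = cpu_now
            return True
        return False
```

This also addresses casio7131's disk-thrashing worry above: a fast CPU that reaches checkpointable stages every few minutes still only writes to disk once per threshold interval.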
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14226 - Posted: 21 Apr 2006, 4:37:06 UTC - in response to Message 14223.  

Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?


Great idea! The checkpointing has to be done at certain stages of the modeling process, but we can (and should!) let the WU decide whether to checkpoint when it reaches a checkpointable stage. Thanks for the great suggestion. Will implement it now.




©2024 University of Washington
https://www.bakerlab.org