No checkpoint in more than 1 hour - Largescale_large_fullatom...

Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...

Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14023 - Posted: 18 Apr 2006, 9:28:41 UTC

I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing.

-Sid
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 14026 - Posted: 18 Apr 2006, 10:26:56 UTC - in response to Message 14023.  

I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing.

-Sid


Sid, although I obviously don't know the internals of Rosetta's code, it may well be quite difficult to "checkpoint" at intermediate points within a model.

In that case, the program might need to write many MB of memory to disk, which could adversely affect performance on an "average" PC.

So I wouldn't call it a "mistake" to "fix"; these are just bigger jobs that aren't suited to many BOINC PCs.

So it seems to me that the options are:

1/ Only run those big WUs internally, on the project's own 500-node Linux cluster.

2/ Send those WUs only to PCs that are eligible / capable / willing to crunch them: the BigWU flag I've been talking about since last month. (This needs changes to the BOINC software, which could take time.)

3/ Establish a new BOINC project, for the purpose of running those big WUs, until #2 is possible.

Unless someone can tweak the BOINC server code to implement #2 ASAP (bug-free?) so it's ready for CASP, it seems to me that #3 is the only option immediately available.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14036 - Posted: 18 Apr 2006, 13:35:10 UTC
Last modified: 18 Apr 2006, 13:37:13 UTC

Yep, a "Rosetta advanced" for high-demand models would be a good idea and very easy to implement. It would be like CPDN's Seasonal Attribution subproject, which is only for those who can cope with the high specs and the fact that it checkpoints only every 4-8 hours (sic!).

I contribute to the Seasonal Attribution project with my standalone comp partly because I know my computer can do this while others can't. I'm sure a Rosetta extreme (or advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters.
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14072 - Posted: 18 Apr 2006, 20:02:30 UTC - in response to Message 14069.  



Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory.

Technically the time setting could be used for two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way; if the value was equal to 4 days, you would get any available work unit, including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help. They might want to avoid the 4-day setting.

Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors.


AFAIK it is not currently possible to discriminate between hosts when distributing WUs. That needs a change to the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment.

P.S.: Even one of my laptops with only 256 MB RAM has already crunched two large WUs successfully and is crunching the third at the moment. Perhaps with proper checkpointing there is no need to discriminate.
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 14084 - Posted: 18 Apr 2006, 22:20:06 UTC

I wonder if BOINC could redefine "Homogeneous Redundancy" to suit this need.

Predictor uses Homogeneous Redundancy.

tony
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14160 - Posted: 20 Apr 2006, 5:23:39 UTC - in response to Message 14074.  

Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and discussion here:

1. We've tracked down the bugs that were causing some jobs to get stuck at 1.04%, and have been testing the fixes on Ralph since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. For the large jobs, we now expect less than 30 minutes between two checkpoints. This code has been tested locally and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 14162 - Posted: 20 Apr 2006, 6:36:50 UTC

Bin Qian stated:
For the large jobs, we now expect less than 30 minutes between two checkpoints.

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?
Insidious

Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 14168 - Posted: 20 Apr 2006, 9:33:34 UTC - in response to Message 14160.  

Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and discussion here:

1. We've tracked down the bugs that were causing some jobs to get stuck at 1.04%, and have been testing the fixes on Ralph since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. For the large jobs, we now expect less than 30 minutes between two checkpoints. This code has been tested locally and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!


This sounds like good progress.
Thanks for the update!

-Sid

Proudly crunching with TeAm Anandtech
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14185 - Posted: 20 Apr 2006, 17:25:46 UTC - in response to Message 14162.  
Last modified: 20 Apr 2006, 18:23:14 UTC

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz with 512 MB RAM, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.
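For anyone curious how such a cap might work, here is a minimal sketch, assuming the restart count is kept in a small state file next to the checkpoint. The file name, function names, and recovery behavior are hypothetical illustrations, not the actual Rosetta code:

```cpp
// Hypothetical sketch of a per-WU restart limit (not the actual Rosetta code).
#include <cstdio>

const int kMaxRestarts = 5;                     // limit mentioned above
const char* kStateFile = "restart_count.txt";   // hypothetical file name

// Record this restart in the state file and return the total so far.
int record_restart() {
    int count = 0;
    if (FILE* in = std::fopen(kStateFile, "r")) {
        std::fscanf(in, "%d", &count);
        std::fclose(in);
    }
    ++count;                                    // this launch is another restart
    if (FILE* out = std::fopen(kStateFile, "w")) {
        std::fprintf(out, "%d\n", count);
        std::fclose(out);
    }
    return count;
}

// Called at startup when resuming from a checkpoint: if the WU has already
// been restarted kMaxRestarts times, stop searching and report the models
// finished so far instead of starting another long stretch of work.
bool should_stop_and_report() {
    return record_restart() >= kMaxRestarts;
}
```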
Osku87

Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 14195 - Posted: 20 Apr 2006, 20:25:00 UTC

David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.

This is something I have been waiting for. It would be nice if you could let us know when this is in use.
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14196 - Posted: 20 Apr 2006, 20:46:56 UTC - in response to Message 14195.  

David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.

This is something I have been waiting for. It would be nice if you could let us know when this is in use.


You bet! We will announce the changes when we update the R@H application. Currently this feature is being tested on Ralph.
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 14215 - Posted: 21 Apr 2006, 2:40:07 UTC - in response to Message 14185.  

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz with 512 MB RAM, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

So a typical WU will checkpoint much more often than every 25 min, say every 5 min on a fast CPU. I'm not sure how much data has to be written at each checkpoint, but this sounds like it might create quite a lot of disk thrashing. I certainly don't want checkpoints written that frequently (I'm not a points tight-arse, so I don't care if I lose "a few" minutes' worth of points), especially if they are big checkpoints.

I would prefer a longer time interval between checkpoints, or even better a user option (like the "model CPU time" option).

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14218 - Posted: 21 Apr 2006, 3:08:58 UTC - in response to Message 14215.  

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")


Don't worry - the checkpoint file is only about 300 kB for the larger proteins, which have the longer intervals, and much smaller (less than 100 kB) for smaller proteins with shorter checkpointing intervals. Also, this checkpointing mechanism is optional - we will only turn it on in WUs with larger proteins or longer searches (based on scientific justification). So most likely you will not have any checkpoint files for normal jobs.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14223 - Posted: 21 Apr 2006, 3:45:31 UTC - in response to Message 14218.  
Last modified: 21 Apr 2006, 3:46:38 UTC

I'm ecstatic about all the changes and improvements we've seen in just a short time here. Great job!

we will only turn it on in WUs with larger proteins or longer searches


Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really said whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?

What I'm thinking is that a very slow PC might have spent an hour just to get a portion of the way through even a small WU. Why not "come up for air" when you reach a point where a checkpoint is possible, and see how much CPU time has gone by since your last checkpoint? If it is >20 min, then write the checkpoint.

So, you'd ship WUs that will always checkpoint at least once an hour, regardless of the protein size.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14226 - Posted: 21 Apr 2006, 4:37:06 UTC - in response to Message 14223.  

Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really said whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?


Great idea! The checkpointing has to be done at certain stages of the modeling process, but we can (and should!) let the WU decide whether to checkpoint when it reaches a checkpointable stage. Thanks for the great suggestion. Will implement it now.
Osku87

Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 14231 - Posted: 21 Apr 2006, 6:03:33 UTC

And it would be even better if the user could decide the time limit.
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14241 - Posted: 21 Apr 2006, 7:31:52 UTC - in response to Message 14231.  
Last modified: 21 Apr 2006, 7:53:23 UTC

And it would be even better if the user could decide the time limit.


Actually there is already such a parameter in the "General settings":

"Write to disk at most every" = xx seconds

Default is 60 (one minute). I set it to 600 (10 minutes), since every minute seems very frequent.

So such a parameter is already present in the BOINC framework. Most projects ignore it anyway, but one could use it. And with checkpoints under 100 kB it is really not a problem to checkpoint often (not every minute, but say every 5-10 minutes).
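For context, this is roughly the standard pattern a BOINC application uses to honor that preference: the client library's boinc_time_to_checkpoint() only returns true once the user's "write to disk at most every N seconds" interval has elapsed. A minimal sketch follows; the work loop and the write_state_file() helper are hypothetical, not Rosetta's actual code:

```cpp
// Minimal sketch of the standard BOINC checkpointing pattern.
#include <cstdio>
#include "boinc_api.h"   // boinc_init, boinc_time_to_checkpoint, ...

// Hypothetical helper: persist whatever is needed to resume at 'step'.
static void write_state_file(int step) {
    FILE* f = std::fopen("checkpoint.dat", "w");
    if (f) { std::fprintf(f, "%d\n", step); std::fclose(f); }
}

int main() {
    boinc_init();
    const int total_steps = 1000000;          // hypothetical workload
    for (int step = 0; step < total_steps; ++step) {
        // ... do one unit of work here ...

        if (boinc_time_to_checkpoint()) {     // respects the disk-interval preference
            write_state_file(step);
            boinc_checkpoint_completed();     // tell the client a checkpoint was saved
        }
        boinc_fraction_done(double(step) / total_steps);
    }
    boinc_finish(0);
    return 0;
}
```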
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14276 - Posted: 21 Apr 2006, 16:35:27 UTC

Yes, it would be better if the user could define it. I just thought 20 min would help improve throughput for folks with the 1-hour switch-between-applications setting. It might also help avoid the "stuck at 1%" and "5 strikes and you're out" conditions, because it will make checkpoints dynamically, as appropriate for the box and the WU it's running. It should also help folks who only have their PC on for brief periods of time. I'm glad to hear Bin feels he can implement the idea of dynamically determining whether a checkpoint is desirable at the same time.

This should make the WU run experience more consistent, regardless of the CPU speed and length of the WU's protein. And I think that will help avoid confusion, and any perception of instability.

If you're curious, there's also a very similar thread on the BOINC boards:
Preempt only at checkpoints. THIS would be the ultimate. Now instead of "only" losing an average of 10min per preempt... you'd lose an average of... well... ZERO! An 18% improvement over the improved R@H checkpointing!
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14300 - Posted: 21 Apr 2006, 20:40:00 UTC - in response to Message 14276.  

User-controllable checkpointing would be ideal. Unfortunately, the current checkpointing can only be done at certain stages of the modeling process. In a nutshell, the process has to reach a stage where the previous search history (which would be a huge amount of data if we were to record all of it) can be discarded, so that we can get away with checkpointing only a minimal amount of data for the future searches. We cannot checkpoint at an arbitrary point in the modeling process yet.

I've implemented Feet1st's idea: when the WU reaches a stage where checkpointing is possible, it checks how long it has been since the last checkpoint; if it's over 20 minutes, it checkpoints.
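In rough C++ the decision Bin describes might look something like the sketch below. This is only an illustration: the 20-minute threshold comes from the post, but the function names and the use of boinc_wu_cpu_time() to measure elapsed CPU time are assumptions, not the actual Rosetta source.

```cpp
// Sketch: checkpoint only at eligible stages, and only if >= 20 minutes of
// CPU time have passed since the last checkpoint (hypothetical names).
#include <cstdio>
#include "boinc_api.h"   // boinc_wu_cpu_time, boinc_checkpoint_completed

const double kMinCheckpointInterval = 20.0 * 60.0;   // 20 minutes of CPU time
static double last_checkpoint_cpu = 0.0;

// Hypothetical: write the minimal search state needed to resume this model.
static void write_checkpoint_file() {
    FILE* f = std::fopen("model_checkpoint.dat", "w");
    if (f) { std::fprintf(f, "placeholder state\n"); std::fclose(f); }
}

// Called only at stages of the modeling process where the accumulated search
// history can be discarded, i.e. where a checkpoint is actually possible.
void maybe_checkpoint() {
    double cpu_now = 0.0;
    boinc_wu_cpu_time(cpu_now);                      // CPU time used by this task so far
    if (cpu_now - last_checkpoint_cpu >= kMinCheckpointInterval) {
        write_checkpoint_file();
        boinc_checkpoint_completed();                // notify the BOINC client
        last_checkpoint_cpu = cpu_now;
    }
}
```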
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 14439 - Posted: 23 Apr 2006, 3:02:20 UTC

This checkpointing news sounds great.