Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...
Rebel Alliance Send message Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0
I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing. -Sid
Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0
I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing.

Sid, although obviously I don't know the internals of Rosetta's code, it's quite possible that it would be quite difficult to "checkpoint" at intermediate points within a model. In that case, the program might need to write many MBytes of memory to disk, and this could adversely affect performance on an "average" PC. So I wouldn't call it a "mistake" to "fix"; these are just bigger jobs which aren't suited to many BOINC PCs. So it seems to me that the options are:

1. Only run those big WUs internally, on the project's own 500-node Linux cluster.
2. Send those WUs only to PCs which are eligible / capable / willing to crunch them: the BigWU flag I've been talking about since last month. (This needs changes to the BOINC software, which could take time.)
3. Establish a new BOINC project for the purpose of running those big WUs, until #2 is possible.

Unless someone can tweak the BOINC server code to implement #2 asap (bug free?), so it can be ready for CASP, it seems to me that #3 is the only option immediately available.

Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Yep, a "Rosetta advanced" for highly demanding models would be a good idea and very easy to implement. It's like CPDN, which has a subproject, the Seasonal Attribution Project, only for those who can cope with the high specs and the fact that it checkpoints only every 4-8 hours (sic!). I contribute to the Seasonal Attribution Project with my standalone comp partly because I know my computer can do this while others can't. I'm sure a Rosetta extreme (or advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters.
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
AFAIK it is not possible at the moment to discriminate between hosts when distributing WUs. That needs a change in the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment. P.S.: Even one laptop of mine with only 256 MB RAM has already crunched two Large WUs successfully and is crunching a third at the moment. Perhaps with proper checkpointing there is no need to discriminate.
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0
I wonder if BOINC could redefine "Homogeneous Redundancy" to suit this need. Predictor uses Homogeneous Redundancy. tony
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.

Hi Moderator9 and all, I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and the discussion here:

1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04%, and we have been testing the fixes on Ralph since yesterday.
2. We've coded up Rosetta to checkpoint more frequently in the modeling process. For the large jobs, we now expect less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on Ralph within a couple of days.
3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!
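The watchdog described in point 3 can be illustrated with a small sketch. Everything here is an illustrative assumption (function name, inputs, and the safety multiple); the actual watchdog is Rhiju's code, described in his linked post.

```cpp
#include <cassert>

// Hypothetical sketch of a watchdog decision: if no model has finished
// within some safety multiple of the expected per-model time, treat the
// job as stuck so the task can be stopped and its intermediate results
// reported. All names and the STUCK_FACTOR value are assumptions, not
// Rosetta's actual implementation.
bool watchdog_should_terminate(double seconds_since_last_model,
                               double expected_model_seconds) {
    const double STUCK_FACTOR = 10.0;  // assumed safety multiple
    return seconds_since_last_model > STUCK_FACTOR * expected_model_seconds;
}
```

In a real implementation this check would run periodically on a separate thread, so a hung search loop cannot prevent the result from being returned.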
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0
Bin Qian stated: 30 mins on.. a 2 GHz Athlon / 3 GHz P4? Or on 400 to 600 MHz machines with 512 MB of RAM?
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0
Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.

This sounds like good progress. Thanks for the update! -Sid

Proudly crunching with TeAm Anandtech
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
30 mins on.. a 2 GHz Athlon / 3 GHz P4? Or on 400 to 600 MHz machines with 512 MB of RAM?

Good point! I was testing a 220 residue protein (a monster size for our current capacity) on a relatively slow machine, 800 MHz with 512 MB, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary based on hardware/software configuration and protein size, and that's why we need to test it on Ralph first, but in general this should give a client more room to finish a WU. For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or switch projects without "Leave in memory" enabled.
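The "5 restarts" guard could be sketched as below. The counter would be persisted in the checkpoint file and incremented each time the task resumes from disk rather than from memory; the names and structure here are illustrative assumptions, not David Kim's actual code.

```cpp
#include <cassert>

// Assumed restart cap from the post above: after the fifth restart
// without "Leave in memory", stop generating further models and return
// whatever results exist at that point.
const int MAX_RESTARTS = 5;

// restart_count would be read from the checkpoint file at startup and
// incremented on every resume-from-disk.
bool should_stop_and_report(int restart_count) {
    return restart_count >= MAX_RESTARTS;
}
```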
Osku87 Send message Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0
David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory.

This is something I have been waiting for. It would be nice if you could let us know when this is in use.
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory.

You bet! We will announce the changes when we update the R@H application. Currently this feature is being tested on Ralph.
casio7131 Send message Joined: 10 Oct 05 Posts: 35 Credit: 149,748 RAC: 0
30 mins on.. a 2Ghz Athlon/3Ghz p4? Or on 400 to 600 Mhz machines with 512Megs of Ram?

So a typical WU will checkpoint much more often than every 25 min, say every 5 min on a fast CPU. I'm not sure how much data has to be written at each checkpoint, but this sounds like it might create quite a lot of disk thrashing. I certainly don't want checkpoints written that frequently (I'm not a points tight-arse, so I don't care if I lose "a few" minutes worth of points), especially if they are big checkpoints. I would prefer a longer time interval between checkpoints, or even better a user option (like the "model cpu time" option). I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's 10s of MB, then I'm "not happy, Jan!")
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
i suppose my question is how much data is written to disk at each checkpoint? (if it's only a couple of hundred kB, then it's not too bad, but if it's 10s of MB, then i'm "not happy jan!".)

Don't worry - the checkpoint file is only 300 kB for larger proteins, which will have longer intervals, and much smaller (less than 100 kB) for smaller proteins with shorter checkpointing intervals. Also, this checkpointing mechanism is optional - we will only turn it on in WUs with larger proteins or longer searches (based on scientific justification). So most likely you will not have any checkpoint files for normal jobs.
Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
I'm ecstatic about all the changes and improvements we've seen in just a short time here. Great job!

we will only turn it on in WUs with larger proteins or longer searches

Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model? What I'm thinking is that a very slow PC might have spent an hour just to get a portion of the way through even a small WU. Why not "come up for air" when you reach a point where a checkpoint is possible, and see how much CPU time has gone by since your last checkpoint? If it is >20 min., then write the checkpoint. That way, you'd ship WUs that will always checkpoint at least once an hour, regardless of the protein size.

Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined if it is now possible to basically checkpoint at any time or not. Do you have to reach a certain phase within a model?

Great idea! The checkpointing has to be done at certain stages of the modeling process, but we can (and should!) let the WU decide whether to checkpoint when it reaches a checkpointable stage. Thanks for the great suggestion. Will implement it now.
Osku87 Send message Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0
And it would be even better if the user could decide the time limit.
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
And it would be way better if user could decide the timelimit.

Actually, there is already such a parameter in the "General settings": "Write to disk at most every" = xx seconds. The default is 60 (one minute). I set it to 600 (10 minutes), since once a minute seems very frequent. So such a parameter is already present in the BOINC framework. Most projects ignore it, but one could use it. As for checkpoints under 100 kB, it is really not a problem to checkpoint often (not every minute, but say every 5-10 minutes).
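For context, the BOINC API does expose this preference to applications: an app calls `boinc_time_to_checkpoint()` in its compute loop, which returns true only after the user's "Write to disk at most every" interval has elapsed, then calls `boinc_checkpoint_completed()` after writing its state. The struct below is a simulated stand-in for that throttling behavior, for illustration only; it is not the real API.

```cpp
#include <cassert>

// Simulated stand-in for BOINC's checkpoint throttling. The real client
// library tracks the user's "Write to disk at most every" preference and
// boinc_time_to_checkpoint() returns true only once that much time has
// passed since the last checkpoint. This struct merely models that logic.
struct BoincCheckpointSim {
    double disk_interval;          // user preference, in seconds
    double last_checkpoint_time;   // when the last checkpoint was written

    bool time_to_checkpoint(double now) const {
        return now - last_checkpoint_time >= disk_interval;
    }
    void checkpoint_completed(double now) { last_checkpoint_time = now; }
};
```

A project that honors the preference simply gates its disk writes on this check, so a user who sets 600 seconds never sees checkpoints more often than every 10 minutes.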
Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
Yes, it would be better if the user could define it. I just thought 20 min would help improve throughput for the folks with the 1 hr switch-between-apps setting. It might also help avoid the "1% stuck" and "5 strikes and you're out" conditions, because it makes checkpoints dynamically, as appropriate for the box and the WU it's running. It should also help folks that only have their PC on for brief periods of time.

I'm glad to hear Bin feels he can implement the idea to dynamically determine whether a checkpoint is desirable at the same time. This should make the WU run experience more consistent, regardless of the CPU speed and the length of the WU's protein. And I think that will help avoid confusion, and any perception of instability.

If you're curious, there's also a very similar thread on the BOINC boards: Preempt only at checkpoints. THIS would be the ultimate. Instead of "only" losing an average of 10 min per preempt... you'd lose an average of... well... ZERO! An 18% improvement over the improved R@H checkpointing!

Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0
User-controllable checkpointing would be ideal. Unfortunately, the current checkpointing machinery can only run at certain stages of the modeling process. In a nutshell, the process has to reach a stage where the previous search history (which would include a huge amount of data if we were to record all of it) can be discarded, so that we can get away with checkpointing only a minimal amount of data for the future searches. We cannot checkpoint at an arbitrary point in the modeling process yet. I've implemented Feet1st's idea above: when the WU reaches a stage where checkpointing is possible, it checks how long it has been since the last checkpoint. If it's over 20 minutes, it checkpoints.
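The scheme Bin describes reduces to a simple time gate evaluated only at checkpointable stages. A minimal sketch, with names and structure as illustrative assumptions rather than Rosetta's actual code:

```cpp
#include <cassert>

// Assumed threshold from the discussion above: checkpoint at a
// checkpointable stage only if at least 20 minutes of CPU time have
// passed since the last checkpoint.
const double CHECKPOINT_THRESHOLD_SECONDS = 20.0 * 60.0;

// Called only at stages where the search history can be discarded and
// the remaining state is small enough to write cheaply.
bool should_checkpoint_at_stage(double cpu_time_now,
                                double cpu_time_at_last_checkpoint) {
    return cpu_time_now - cpu_time_at_last_checkpoint
           >= CHECKPOINT_THRESHOLD_SECONDS;
}
```

Because the test uses elapsed CPU time rather than stage count, a slow machine checkpoints as soon as it can after 20 minutes, while a fast machine skips stages it reaches sooner, keeping checkpoint frequency roughly uniform across hardware.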
casio7131 Send message Joined: 10 Oct 05 Posts: 35 Credit: 149,748 RAC: 0
this checkpointing news sounds great.
©2025 University of Washington
https://www.bakerlab.org