No checkpoint in more than 1 hour - Largescale_large_fullatom...

Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...

Rebel Alliance

Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 14023 - Posted: 18 Apr 2006, 9:28:41 UTC

I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing.

-Sid
Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 14026 - Posted: 18 Apr 2006, 10:26:56 UTC - in response to Message 14023.  

I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing.

-Sid


Sid, although I obviously don't know the internals of Rosetta's code, it may well be quite difficult to "checkpoint" at intermediate points within a model.

In that case, the program might need to write many MB of memory to disk, which could adversely affect performance on an "average" PC.

So I wouldn't call it a "mistake" to "fix"; these are just bigger jobs that aren't suited to many BOINC PCs.

So it seems to me that the options are:

1/ Only run those big WUs internally, on the project's own 500-node Linux cluster.

2/ Send those WUs only to PCs that are eligible / capable / willing to crunch them: the BigWU flag I've been talking about since last month. (This needs changes to the BOINC software, which could take time.)

3/ Establish a new BOINC project, for the purpose of running those big WUs, until #2 is possible.

Unless someone can tweak the BOINC server code to implement #2 ASAP (bug-free?) so it's ready for CASP, it seems to me that #3 is the only option immediately available.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14036 - Posted: 18 Apr 2006, 13:35:10 UTC
Last modified: 18 Apr 2006, 13:37:13 UTC

Yep, a "Rosetta advanced" for high-demand models would be a good idea and very easy to implement. It would be like CPDN's Seasonal Attribution subproject, which is only for those who can cope with the high specs and the fact that it checkpoints only every 4-8 hours (sic!).

I contribute to the Seasonal Attribution project with my standalone comp partly because I know my computer can do this while others can't. I'm sure a Rosetta extreme (or advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters.
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14072 - Posted: 18 Apr 2006, 20:02:30 UTC - in response to Message 14069.  



Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory.

Technically the time setting could be used for two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way; if the value was equal to 4 days, you would get any available work unit, including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help. They might want to avoid the 4-day setting.

Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors.


AFAIK it is not currently possible to discriminate between hosts when distributing WUs. That needs a change to the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment.

P.S.: Even one of my laptops with only 256 MB RAM has already crunched two large WUs successfully and is crunching the third at the moment. Perhaps with proper checkpointing there is no need to discriminate.
Astro

Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 14084 - Posted: 18 Apr 2006, 22:20:06 UTC

I wonder if BOINC could redefine "Homogeneous Redundancy" to suit this need.

Predictor uses Homogeneous Redundancy.

tony
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14160 - Posted: 20 Apr 2006, 5:23:39 UTC - in response to Message 14074.  

Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and discussion here:

1. We've tracked down the bugs that were causing some jobs to get stuck at 1.04%, and have been testing the fixes on Ralph since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. For the large jobs, we now expect less than 30 minutes between two checkpoints. This code has been tested locally and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 14162 - Posted: 20 Apr 2006, 6:36:50 UTC

Bin Qian stated:
For the large jobs, we now expect less than 30 minutes between two checkpoints.

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?
Insidious

Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 14168 - Posted: 20 Apr 2006, 9:33:34 UTC - in response to Message 14160.  

Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and discussion here:

1. We've tracked down the bugs that were causing some jobs to get stuck at 1.04%, and have been testing the fixes on Ralph since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. For the large jobs, we now expect less than 30 minutes between two checkpoints. This code has been tested locally and will be tested on Ralph within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!


This sounds like good progress.
Thanks for the update!

-Sid

Proudly crunching with TeAm Anandtech
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14185 - Posted: 20 Apr 2006, 17:25:46 UTC - in response to Message 14162.  
Last modified: 20 Apr 2006, 18:23:14 UTC

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz with 512 MB RAM, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.
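For anyone curious how such a cap might work, here is a minimal sketch, assuming the restart count is kept in a small state file next to the checkpoint. The file name, function names, and recovery behavior are hypothetical illustrations, not the actual Rosetta code:

```cpp
// Hypothetical sketch of a per-WU restart limit (not the actual Rosetta code).
#include <cstdio>

const int kMaxRestarts = 5;                     // limit mentioned above
const char* kStateFile = "restart_count.txt";   // hypothetical file name

// Record this restart in the state file and return the total so far.
int record_restart() {
    int count = 0;
    if (FILE* in = std::fopen(kStateFile, "r")) {
        std::fscanf(in, "%d", &count);
        std::fclose(in);
    }
    ++count;                                    // this launch is another restart
    if (FILE* out = std::fopen(kStateFile, "w")) {
        std::fprintf(out, "%d\n", count);
        std::fclose(out);
    }
    return count;
}

// Called at startup when resuming from a checkpoint: if the WU has already
// been restarted kMaxRestarts times, stop searching and report the models
// finished so far instead of starting another long stretch of work.
bool should_stop_and_report() {
    return record_restart() >= kMaxRestarts;
}
```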
Osku87

Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 14195 - Posted: 20 Apr 2006, 20:25:00 UTC

David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.

This is something I have been waiting for. It would be nice if you could let us know when this is in use.
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14196 - Posted: 20 Apr 2006, 20:46:56 UTC - in response to Message 14195.  

David Kim has added a limit of 5 restarts for each WU. If a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that switch projects without "leave applications in memory" enabled.

This is something I have been waiting for. It would be nice if you could let us know when this is in use.


You bet! We will announce the changes when we update the R@H application. Currently this feature is being tested on Ralph.
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 14215 - Posted: 21 Apr 2006, 2:40:07 UTC - in response to Message 14185.  

30 mins on... a 2 GHz Athlon / 3 GHz P4? Or on 400-600 MHz machines with 512 MB of RAM?


Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine: 800 MHz with 512 MB RAM, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary with hardware/software configuration and protein size, which is why we need to test it on Ralph first, but in general this should give a client more room to finish a WU.

For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases.

So a typical WU will checkpoint much more often than every 25 min, say every 5 min on a fast CPU. I'm not sure how much data has to be written at each checkpoint, but this sounds like it might create quite a lot of disk thrashing. I certainly don't want checkpoints written that frequently (I'm not a points tight-arse, so I don't care if I lose "a few" minutes' worth of points), especially if they are big checkpoints.

I would prefer a longer time interval between checkpoints, or even better a user option (like the "model CPU time" option).

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14218 - Posted: 21 Apr 2006, 3:08:58 UTC - in response to Message 14215.  

I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's tens of MB, then I'm "not happy, Jan!")


Don't worry - the checkpoint file is only about 300 kB for the larger proteins, which have the longer intervals, and much smaller (less than 100 kB) for smaller proteins with shorter checkpointing intervals. Also, this checkpointing mechanism is optional - we will only turn it on in WUs with larger proteins or longer searches (based on scientific justification). So most likely you will not have any checkpoint files for normal jobs.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14223 - Posted: 21 Apr 2006, 3:45:31 UTC - in response to Message 14218.  
Last modified: 21 Apr 2006, 3:46:38 UTC

I'm ecstatic about all the changes and improvements we've seen in just a short time here. Great job!

we will only turn it on in WUs with larger proteins or longer searches


Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really said whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?

What I'm thinking is that a very slow PC might have spent an hour just to get a portion of the way through even a small WU. Why not "come up for air" when you reach a point where a checkpoint is possible, and see how much CPU time has gone by since your last checkpoint? If it is >20 min, then write the checkpoint.

So, you'd ship WUs that will always checkpoint at least once an hour, regardless of the protein size.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14226 - Posted: 21 Apr 2006, 4:37:06 UTC - in response to Message 14223.  

Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really said whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model?


Great idea! The checkpointing has to be done at certain stages of the modeling process, but we can (and should!) let the WU decide whether to checkpoint when it reaches a checkpointable stage. Thanks for the great suggestion. Will implement it now.
Osku87

Joined: 1 Nov 05
Posts: 17
Credit: 280,268
RAC: 0
Message 14231 - Posted: 21 Apr 2006, 6:03:33 UTC

And it would be even better if the user could decide the time limit.
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14241 - Posted: 21 Apr 2006, 7:31:52 UTC - in response to Message 14231.  
Last modified: 21 Apr 2006, 7:53:23 UTC

And it would be even better if the user could decide the time limit.


Actually there is already such a parameter in the "General settings":

"Write to disk at most every" = xx seconds

Default is 60 (one minute). I set it to 600 (10 minutes), since every minute seems very frequent.

So such a parameter is already present in the BOINC framework. Most projects ignore it anyway, but one could use it. And with checkpoints under 100 kB it is really not a problem to checkpoint often (not every minute, but say every 5-10 minutes).
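For context, this is roughly the standard pattern a BOINC application uses to honor that preference: the client library's boinc_time_to_checkpoint() only returns true once the user's "write to disk at most every N seconds" interval has elapsed. A minimal sketch follows; the work loop and the write_state_file() helper are hypothetical, not Rosetta's actual code:

```cpp
// Minimal sketch of the standard BOINC checkpointing pattern.
#include <cstdio>
#include "boinc_api.h"   // boinc_init, boinc_time_to_checkpoint, ...

// Hypothetical helper: persist whatever is needed to resume at 'step'.
static void write_state_file(int step) {
    FILE* f = std::fopen("checkpoint.dat", "w");
    if (f) { std::fprintf(f, "%d\n", step); std::fclose(f); }
}

int main() {
    boinc_init();
    const int total_steps = 1000000;          // hypothetical workload
    for (int step = 0; step < total_steps; ++step) {
        // ... do one unit of work here ...

        if (boinc_time_to_checkpoint()) {     // respects the disk-interval preference
            write_state_file(step);
            boinc_checkpoint_completed();     // tell the client a checkpoint was saved
        }
        boinc_fraction_done(double(step) / total_steps);
    }
    boinc_finish(0);
    return 0;
}
```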
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14276 - Posted: 21 Apr 2006, 16:35:27 UTC

Yes, it would be better if the user could define it. I just thought 20 min would help improve throughput for folks with the 1-hour switch-between-applications setting. It might also help avoid the "stuck at 1%" and "5 strikes and you're out" conditions, because it will make checkpoints dynamically, as appropriate for the box and the WU it's running. It should also help folks who only have their PC on for brief periods of time. I'm glad to hear Bin feels he can implement the idea of dynamically determining whether a checkpoint is desirable at the same time.

This should make the WU run experience more consistent, regardless of the CPU speed and length of the WU's protein. And I think that will help avoid confusion, and any perception of instability.

If you're curious, there's also a very similar thread on the BOINC boards:
Preempt only at checkpoints. THIS would be the ultimate. Now instead of "only" losing an average of 10min per preempt... you'd lose an average of... well... ZERO! An 18% improvement over the improved R@H checkpointing!
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 14300 - Posted: 21 Apr 2006, 20:40:00 UTC - in response to Message 14276.  

User-controllable checkpointing would be ideal. Unfortunately, the current checkpointing can only be done at certain stages of the modeling process. In a nutshell, the process has to reach a stage where the previous search history (which would be a huge amount of data if we were to record all of it) can be discarded, so that we can get away with checkpointing only a minimal amount of data for the future searches. We cannot checkpoint at an arbitrary point in the modeling process yet.

I've implemented Feet1st's idea: when the WU reaches a stage where checkpointing is possible, it checks how long it has been since the last checkpoint; if it's over 20 minutes, it checkpoints.
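In rough C++ the decision Bin describes might look something like the sketch below. This is only an illustration: the 20-minute threshold comes from the post, but the function names and the use of boinc_wu_cpu_time() to measure elapsed CPU time are assumptions, not the actual Rosetta source.

```cpp
// Sketch: checkpoint only at eligible stages, and only if >= 20 minutes of
// CPU time have passed since the last checkpoint (hypothetical names).
#include <cstdio>
#include "boinc_api.h"   // boinc_wu_cpu_time, boinc_checkpoint_completed

const double kMinCheckpointInterval = 20.0 * 60.0;   // 20 minutes of CPU time
static double last_checkpoint_cpu = 0.0;

// Hypothetical: write the minimal search state needed to resume this model.
static void write_checkpoint_file() {
    FILE* f = std::fopen("model_checkpoint.dat", "w");
    if (f) { std::fprintf(f, "placeholder state\n"); std::fclose(f); }
}

// Called only at stages of the modeling process where the accumulated search
// history can be discarded, i.e. where a checkpoint is actually possible.
void maybe_checkpoint() {
    double cpu_now = 0.0;
    boinc_wu_cpu_time(cpu_now);                      // CPU time used by this task so far
    if (cpu_now - last_checkpoint_cpu >= kMinCheckpointInterval) {
        write_checkpoint_file();
        boinc_checkpoint_completed();                // notify the BOINC client
        last_checkpoint_cpu = cpu_now;
    }
}
```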
casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 14439 - Posted: 23 Apr 2006, 3:02:20 UTC

This checkpointing news sounds great.