Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...
Author | Message |
---|---|
Shoikan Send message Joined: 4 Apr 06 Posts: 14 Credit: 180,211 RAC: 0 |
Thanks for the quick replies! This issue has to be addressed ASAP. Many cycles go straight to the trash can because of this. An improved checkpointing system should be the #1 priority on the to-do list of Rosie's development team. Regards. |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Thanks for the quick replies! Bin Qian addressed this already above (we all agree on this!) Proudly crunching with TeAm Anandtech |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. If you go for the separate project, then at times when there is no extreme work to run, you could always deliver ordinary Rosetta work to the Rex users to keep the project share in use. Ralph was a shortening of Rosetta alpha, a nickname suggested by another Bill, in fact. Maybe Rosetta Extreme could be Rex? Another way to do this would be to implement some form of user preference flag, off by default, but which, when set manually by the user, made them liable to receive work that was extreme in some way. The core team could ask for volunteers to set the FullOn flag if they wanted to opt in. One method needs more coding; the other needs a separate URL and some work at sysadmin level on one or more servers. The Bakerlab folk will know which of these is easier to deliver. I'd suggest either would be a good solution from the user's point of view. Your users already have experience of choosing the development project (Ralph) and of customising the run length of their work, and my impression is that both forms of user control have been well received. I am running down my CPDN participation due to their refusal to implement any form of user-specified control over the size of work that is issued, despite the fact that I personally feel their science to be very important. It is good that you acted to protect some users, but sad that you lose out on some interesting work by doing so. I feel sure that increased user control is the way forward. It won't attract any more people, but it will help you to keep the ones you've already got. River~~ |
Rebel Alliance Send message Joined: 4 Nov 05 Posts: 50 Credit: 3,579,531 RAC: 0 |
I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing. -Sid |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
I think it makes MUCH more sense to fix the mistakes of the largescale work units than it does to try and design a sub-project like this "extreme" thing. Sid, although obviously I don't know the internals of Rosetta's code, it's quite possible that it would be quite difficult to "checkpoint" at intermediate points within a model. In such a case, it could mean that the program needs to write many MBytes of memory to disk, and this could adversely affect performance on an "average" PC. So I wouldn't call it a "mistake" to "fix"; these are just bigger jobs which aren't suited to many BOINC PCs. So it seems to me that the options are: 1/ Only run those big WUs internally, on the project's own 500-node Linux cluster. 2/ Send those WUs only to PCs which are eligible / capable / willing to crunch them — the BigWU flag I've been talking about since last month (this needs changes to BOINC software which could take time). 3/ Establish a new BOINC project for the purpose of running those big WUs, until #2 is possible. Unless someone can tweak the BOINC server code to implement #2 asap (bug free?), so it can be ready for CASP, it seems to me that #3 is the only option immediately available. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Yep, a Rosetta Advanced for high-demanding models would be a good idea and very easy to implement. It's like CPDN has a subproject, the Seasonal Attribution Project, only for those who can cope with the high specs and the fact that it checkpoints only every 4-8 hours (sic!). I contribute to the Seasonal Attribution Project with my standalone comp partly because I know my computer can do this while others can't. I'm sure a Rosetta Extreme (or Advanced) with higher memory requirements and longer intervals between checkpoints would find its supporters. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Yep Rosetta advanced for high-demanding models would be a good idea and very easy to implement. It's like CPDN has a subproject Seasonal attribution project only for those which can cope with the high specs and the fact that it checkpoints only every 4-8 hours(sic!). Actually, the Time setting could be used to produce a similar effect for systems with sufficient memory. Technically the time setting could be used for two purposes. If set to less than 4 days, it would only issue you small proteins that run in the normal way, if the value was equal to 4 days you would get any available WorkUnit including large ones. The only real issue would be the effect this might have on the modem and farm users this feature was originally designed to help. They might want to avoid the 4 day setting. Of course it would also require some discipline on the part of the user community. If someone with 256MB of memory set their system to 4 days, they would have a lot of errors. Moderator9 ROSETTA@home FAQ Moderator Contact |
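Moderator9's idea above can be sketched in code. This is an editor's illustration only, not actual BOINC scheduler code: the function name, the 4-day threshold, and the 512 MB memory floor are all assumptions drawn from the discussion in this thread.

```python
# Hypothetical sketch: use the user's "target CPU run time" preference as an
# opt-in signal for large work units. Hosts that explicitly choose the maximum
# run time AND have enough RAM are offered the large proteins; everyone else
# only gets the normal ones. Names and thresholds are illustrative.

FOUR_DAYS_SECONDS = 4 * 24 * 3600

def eligible_wu_sizes(target_run_time_seconds, host_memory_mb):
    """Return the work-unit classes a host may receive."""
    sizes = ["small"]
    # Guarding on memory addresses Moderator9's worry that a 256MB host
    # set to 4 days "would have a lot of errors".
    if target_run_time_seconds >= FOUR_DAYS_SECONDS and host_memory_mb >= 512:
        sizes.append("large")
    return sizes
```

Under this sketch the "discipline" problem becomes a server-side check: a 256 MB host never receives large work, even if its owner sets the 4-day preference.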
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
AFAIK it is not possible at the moment to discriminate between hosts in the distribution of WUs. That needs a change in the BOINC server software. Although I think it should not be too much work, it is not an option available at the moment. P.S.: Even a laptop of mine with only 256 MB RAM has already crunched two large WUs successfully and is crunching the third at the moment. Perhaps with proper checkpointing there is no need to discriminate. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
You are correct, BOINC cannot do this by itself. But what I am suggesting is that the users could ASK for larger Work Units using the time setting. As I said this would require SOME discipline on the part of the people using the feature. The project certainly knows which WUs are large and which are small. There might be a way to send large ones to systems that specifically request them using a preference setting. But if there is no BOINC capability for distributing particular Work Units based on a preference setting then it would not work at all. Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons. Moderator9 ROSETTA@home FAQ Moderator Contact |
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I wonder if BOINC could redefine "Homogeneous Redundancy" to suit this need. Predictor uses Homogeneous Redundancy. tony |
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons. Hi Moderator9 and all, I want to give you some updates on the long jobs / stuck jobs issue, based on your suggestions and the discussion here: 1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04%, and have been testing the fixes on Ralph since yesterday. 2. We've coded Rosetta to do more frequent checkpointing in the modeling process. Now, for the large jobs, we expect less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on Ralph within a couple of days. 3. Rhiju has coded a watchdog thread for Rosetta which will terminate stuck jobs and return the intermediate results; see his post in this thread. This will be tested on Ralph within a couple of days too. We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable! |
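The watchdog thread Bin Qian mentions (item 3) can be sketched as follows. The real watchdog is inside the Rosetta C++ application; this editor's sketch only illustrates the idea, and every name here is hypothetical.

```python
import threading
import time

# Hypothetical sketch of a watchdog for a stuck modeling job: the worker
# reports progress via heartbeat(); a monitor calls check() periodically,
# and if no progress has been seen for too long, the job is flagged so it
# can stop and return its intermediate result instead of hanging forever.

class Watchdog:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_progress = time.monotonic()
        self.stuck = False
        self._lock = threading.Lock()

    def heartbeat(self):
        """Called by the worker each time a modeling step completes."""
        with self._lock:
            self.last_progress = time.monotonic()

    def check(self):
        """Called periodically; flags the job as stuck if progress stalled."""
        with self._lock:
            if time.monotonic() - self.last_progress > self.timeout:
                self.stuck = True  # worker should wrap up and report
            return self.stuck
```

The key design point, matching the thread's goal, is that a stuck job still returns whatever intermediate results it produced rather than wasting the cycles entirely.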
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Bin Qian stated: 30 mins on.. a 2Ghz Athlon/3Ghz p4? Or on 400 to 600 Mhz machines with 512Megs of Ram? |
Insidious Send message Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons. This sounds like good progress. Thanks for the update! -Sid Proudly crunching with TeAm Anandtech |
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
30 mins on.. a 2Ghz Athlon/3Ghz p4? Or on 400 to 600 Mhz machines with 512Megs of Ram? Good point! I was testing a 220-residue protein (a monster size for our current capacity) on a relatively slow machine, 800MHz with 512MB, running Linux. The longest interval between checkpoints was about 25 minutes. Of course this number will vary based on hardware/software configurations and protein size, and that's why we need to test it on Ralph first, but in general this should give a client more room to finish a WU. For most WUs we'd expect a much shorter interval between checkpoints, since the computation decreases geometrically as the protein size decreases. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU is restarted 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot, or that swap projects without the "leave in memory" setting enabled. |
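The 5-restart limit described above can be sketched like this. This is an editor's illustration, not David Kim's actual implementation: the state-file format and function name are invented for the example.

```python
import json
import os

# Hypothetical sketch of a per-WU restart limit: a restart counter is
# persisted alongside the checkpoint; after the limit is exceeded, the WU
# stops and reports whatever models it completed so far, instead of
# looping through restarts forever on machines that reboot frequently.

MAX_RESTARTS = 5

def note_restart(state_path):
    """Record one restart; return True if the WU should stop now and
    report its intermediate result."""
    state = {"restarts": 0}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    state["restarts"] += 1
    with open(state_path, "w") as f:
        json.dump(state, f)
    return state["restarts"] > MAX_RESTARTS
```

Persisting the counter on disk is what makes this work across reboots: the in-memory process is gone after each restart, so only the state file remembers how many times the WU has been resumed.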
Osku87 Send message Joined: 1 Nov 05 Posts: 17 Credit: 280,268 RAC: 0 |
David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory. This is something I have been waiting for. It would be nice if you could inform us when this is in use. |
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory. You bet! We will announce the changes when we update the R@H application. Currently this feature is being tested on Ralph. |
casio7131 Send message Joined: 10 Oct 05 Posts: 35 Credit: 149,748 RAC: 0 |
30 mins on.. a 2Ghz Athlon/3Ghz p4? Or on 400 to 600 Mhz machines with 512Megs of Ram? So a typical WU will checkpoint much more often than every 25 min, say every 5 min for a fast CPU. I'm not sure how much data is required to be written at each checkpoint, but this sounds like it might create quite a lot of disk thrashing. I certainly don't want checkpoints written so frequently (I'm not a points tight-arse, so I don't care if I lose "a few" minutes' worth of points), especially if they are big checkpoints. I would prefer a longer time interval between checkpoints, or even better would be a user option (like the "model cpu time" option). I suppose my question is: how much data is written to disk at each checkpoint? (If it's only a couple of hundred kB, then it's not too bad, but if it's 10s of MB, then I'm "not happy, Jan!") |
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
i suppose my question is how much data is written to disk at each checkpoint? (if it's only a couple of hundred kB, then it's not too bad, but if it's 10s of MB, then i'm "not happy jan!".) Don't worry - the checkpoint file is only 300kB for larger proteins, which will have longer intervals, and much smaller (less than 100kB) for smaller proteins with shorter checkpointing intervals. Also, this checkpointing mechanism is optional - we will only turn it on in WUs with larger proteins or longer searches (based on scientific justification). So most likely you will not have any checkpoint files for normal jobs. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I'm ecstatic about all the changes and improvements we've seen in just a short time here. Great job! we will only turn it on in WUs with larger proteins or longer searches Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined whether it is now possible to checkpoint at basically any time or not. Do you have to reach a certain phase within a model? What I'm thinking is that a very slow PC might have spent an hour just to get a portion of the way through even a small WU. Why not "come up for air" when you reach a point where a checkpoint is possible, and see how much CPU time has gone by since your last checkpoint? If it is >20 min., then write the checkpoint. That way, you'd ship WUs that always checkpoint at least once an hour, regardless of the protein size. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Bin Qian Send message Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
Would it be possible to have it take a look around when it reaches a point that CAN be checkpointed? You haven't really defined if it is now possible to basically checkpoint at any time or not. Do you have to reach a certain phase within a model? Great idea! The checkpointing has to be done in certain stages of the modeling process, but we can (and should!) let the WU decide if it should checkpoint when it reaches a checkpointable stage. Thanks for the great suggestion. Will implement it now. |
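Feet1st's suggestion, which Bin Qian agrees to implement, can be sketched as follows. This is an editor's illustration of the policy, not the Rosetta code: the class, the injectable clock, and the 20-minute default are assumptions taken from the post above.

```python
import time

# Hypothetical sketch of "checkpoint if due": the modeling loop periodically
# reaches stages where a checkpoint is POSSIBLE, and only actually writes
# one if enough CPU time has elapsed since the last write. This bounds the
# work lost on any machine, fast or slow, without thrashing the disk.

CHECKPOINT_INTERVAL = 20 * 60  # seconds; the ">20 min" figure from the post

class CheckpointPolicy:
    def __init__(self, interval=CHECKPOINT_INTERVAL, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable for testing
        self.last_checkpoint = clock()

    def maybe_checkpoint(self, write_state):
        """Call at every checkpointable stage; writes only when due."""
        now = self.clock()
        if now - self.last_checkpoint >= self.interval:
            write_state()  # caller supplies the actual state-writing routine
            self.last_checkpoint = now
            return True
        return False
```

The point of the design is exactly the one made in the thread: checkpoints happen at structurally valid stages of the model, but the decision to write is driven by elapsed time, so even a slow PC on a small WU checkpoints at least once an hour.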
©2024 University of Washington
https://www.bakerlab.org