Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...
Author | Message |
---|---|
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
Hi, I just had to reboot my computer (Centrino 1.5, 1G RAM). R@H was showing around 1.5% progress and had been running a little more than 1 hour. It was working on its first model. After the reboot it started from the beginning. That means there was no checkpoint in more than 1 hour. The standard settings in BOINC are: switch projects after 60 minutes and do not leave them in memory. I think people with several projects and standard settings will never finish 1 model. Is that so? I think there should at least be a warning on the front page that this project checkpoints rarely. (Something similar to what CPDN Seasonal has.) Regards, Aglarond |
[DPC]Charley Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0 |
yes you're right. Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea). |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
yes you're right. OK, I'm not sure I get it??? Is the answer to leave these in memory? Change settings to switch projects less frequently? Only do Rosey on dedicated (one project) machines? Abort these WUs? Any advice would be appreciated -Sid Proudly crunching with TeAm Anandtech |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
There was mention of the new large WUs taking over 4 hours to finish a model on some machines. The application switch time needs to be set for more than the time it takes to finish a model/decoy on your machine. Or the keep-in-memory flag needs to be turned on. |
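
The rule described in the post above can be sketched as a small check. This is an illustrative model only, not BOINC or Rosetta code: the function name and parameters are hypothetical, and it assumes (as reported in this thread) that a task unloaded from memory loses all progress since its last end-of-model checkpoint.

```python
def model_can_finish(model_hours, switch_hours, keep_in_memory):
    """Return True if one model can reach its end-of-model checkpoint.

    Hypothetical helper. Assumes the task checkpoints only when a model
    completes, and that being swapped out without "keep in memory"
    discards all progress on the current model.
    """
    if keep_in_memory:
        # A suspended task keeps its state in RAM, so progress accumulates
        # across switches (as long as BOINC and the machine stay up).
        return True
    # Otherwise the switch time must be longer than the model time.
    return switch_hours > model_hours


print(model_can_finish(4.0, 1.0, keep_in_memory=False))  # False
print(model_can_finish(4.0, 1.0, keep_in_memory=True))   # True
print(model_can_finish(4.0, 5.0, keep_in_memory=False))  # True
```

With the default 60-minute switch time and a 4-hour model, the first call shows why the model never completes unless the application is kept in memory.
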
nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0 |
I have one dedicated machine for Rosetta. I have seen it take nearly 9 hrs to complete one of the largescale WUs. Mostly they take about 4 hrs, but some take more or less. It's a bit of a pain if you run more than one project on a machine and don't change the switch-project setting to be longer than the WU completion time. |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Thanks for the quick replies! So do I need to have my switch time > time for the entire WU to complete? Isn't there any checkpointing? -Sid Proudly crunching with TeAm Anandtech |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Thanks for the quick replies! ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones this could mean 5 or 6 hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting. The BEST answer, if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that it will make it to a checkpoint. It depends on the size of the protein. Also keep in mind that "keep in memory" only works if you do not turn your machine off or stop BOINC for some reason, as these actions would also remove the application from memory. Moderator9 ROSETTA@home FAQ Moderator Contact |
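
The two behaviours described in the post above (a checkpoint only at model completion, and at least one model per Work Unit regardless of the time setting) can be sketched as a toy simulation. This is not the Rosetta application: `run_work_unit`, its parameters, and the exact stopping rule are assumptions made for illustration.

```python
def run_work_unit(model_durations, target_runtime):
    """Simulate a work unit; return (models_completed, checkpoint_times).

    model_durations: hours each successive model would take (made-up input).
    target_runtime: the user's preferred run time in hours.
    """
    elapsed = 0.0
    checkpoints = []
    for duration in model_durations:
        # Stop once the target run time is used up, but never before
        # the first model has been completed.
        if checkpoints and elapsed >= target_runtime:
            break
        elapsed += duration
        checkpoints.append(elapsed)  # checkpoint only at model completion
    return len(checkpoints), checkpoints


# Small models: a checkpoint every half hour until the 2-hour target is reached.
print(run_work_unit([0.5] * 10, target_runtime=2.0))  # (4, [0.5, 1.0, 1.5, 2.0])

# Large models: the first checkpoint arrives only after 6 hours.
print(run_work_unit([6.0, 6.0], target_runtime=2.0))  # (1, [6.0])
```

In the second call the only checkpoint is at the 6-hour mark, so any interruption before then would restart the model from zero unless the application stays in memory.
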
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Rosetta doesn't checkpoint until after it's done with a decoy/model. And if it takes up to 9 hours to complete a single decoy/model, then you won't get checkpointed until that's done. Perhaps the keep in memory flag is the way to go, in case we get more of these really large WUs in the future. (Nairb - what are the math specs on that system that took 9 hours?) |
DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! It does sound like a rather stringent requirement. Is this going to be the norm from here on out? Proudly crunching with TeAm Anandtech |
nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0 |
(Nairb - what are the math specs on that system that took 9 hours?) The WU was this one: https://boinc.bakerlab.org/rosetta/result.php?resultid=16830240 The machine is a dual-CPU 1 GHz Coppermine with 700+ RAM. It spent 9.3 hrs at 1.6% or so. It very nearly got the abort option... except I forgot about it and it worked OK. Not the fastest bit of kit, but it's very stable (Win NT4). |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely. Moderator9 ROSETTA@home FAQ Moderator Contact |
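
As a general illustration of what "checkpoint more often" can look like, here is a minimal sketch of periodic checkpointing with an atomic file write. It is not Rosetta's implementation: the state layout, file format, and function names are all hypothetical, and it ignores the concern raised above that interrupting a model may affect its outcome.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_model(total_steps, path, checkpoint_every=1000):
    """Resume from the last checkpoint (if any) and run up to total_steps."""
    state = load_checkpoint(path, {"step": 0, "energy": 0.0})
    while state["step"] < total_steps:
        state["energy"] += 1.0  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

With a scheme like this, a reboot costs at most `checkpoint_every` steps of work instead of the whole model. The trade-off is the one mentioned in the post: the more state there is to dump, the more expensive each checkpoint becomes.
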
DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! Well, NOT having checkpoints/memory images whatever will adversely affect the entire project, as the small home-crunchers are getting fed up with not getting any results in because the same WU restarts from scratch over and over again. Maybe there is a way to hand out these large WUs only to computers having a high RAC? Or only to people who have their run-time preferences set to 8 hours or something similar? |
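
The suggestion in the post above (send large WUs only to hosts with a high RAC or a long run-time preference) can be sketched as a simple filter. This is purely hypothetical: it does not reflect how the BOINC scheduler actually assigns work, and the `Host` fields and threshold values are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    rac: float                 # recent average credit
    runtime_pref_hours: float  # user's preferred run time per work unit


def eligible_for_large_wu(host, min_rac=500.0, min_runtime_hours=8.0):
    """Hypothetical policy: a high RAC or a long run-time preference qualifies."""
    return host.rac >= min_rac or host.runtime_pref_hours >= min_runtime_hours


hosts = [
    Host("always-on server", rac=900.0, runtime_pref_hours=2.0),
    Host("evening laptop", rac=40.0, runtime_pref_hours=2.0),
    Host("patient desktop", rac=120.0, runtime_pref_hours=8.0),
]
print([h.name for h in hosts if eligible_for_large_wu(h)])
# ['always-on server', 'patient desktop']
```
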
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! I'm in a fortunate position that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot assure that a computer will not be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365. I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer. A thread asking for suggestions to increase project participation was posted some time ago. I believe the single most important answer is... your project must be hands-off, with utter transparency to the work the PC was procured to do in the first place. I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available. Respectfully, -Sid Proudly crunching with TeAm Anandtech |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has the largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent! In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together. I should add that users who crunch Rosetta 24/7 or have "leave in memory" on can choose to let the largescale jobs currently on their computers keep running. These results are still of great interest to us! I'm in a fortunate position that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants just simply cannot assure a computer will not be re-booted or available to crunch nothing but Rosetta 24/7/365. |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent! Thanks Bin! Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor! -Sid Proudly crunching with TeAm Anandtech |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent! I get the impression from your response that you have misinterpreted the information I tried to provide, to imply that I had in some way said the project does not think this is important. Nothing could be further from the truth. The original question was why did the Work Unit run so long without a checkpoint (see thread title). I never said this was an issue the project was ignoring, or not trying to fix. I just tried to explain WHY it was working the way it was. Moderator9 ROSETTA@home FAQ Moderator Contact |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent! I didn't think you were dismissing the concerns. Actually, I saw your post as an attempt to help shed light. It was appreciated as was Bin's. Please believe me when I tell you, I am much more the fan than the critic. -Sid Proudly crunching with TeAm Anandtech |
Buffalo Bill Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. Big WU's only project. Bill |
Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
... maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. Big WU's only. RosettaExtreme?! Hmmmm.... sounds interesting. :-D "I'm trying to maintain a shred of dignity in this world." - Me |
©2024 University of Washington
https://www.bakerlab.org