Message boards : Number crunching : Report Problems with Rosetta Version 5.16 I
Author | Message |
---|---|
NJMHoffmann Send message Joined: 17 Dec 05 Posts: 45 Credit: 45,891 RAC: 0 |
You are already using the version that has had checkpoints added. Originally the checkpoints only were done at the end of a full model. Now they are every ~20 min. It's much better now and we lose less work with this shorter checkpoint interval. But if it is possible to insert checkpoints, why not respect the user settings? Norbert |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
You are already using the version that has had checkpoints added. Originally the checkpoints only were done at the end of a full model. Now they are every ~20 min. What settings do you feel are not being respected by the current (improved) checkpointing? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting a new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :) Hang in there Jose! Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Thanks Mod9 for the quick reply. But that's not what I noticed. The first 5.16 unit(s) I processed didn't show checkpoints every ~20 min. The one I'm currently working on seems to behave nicely in the suggested way. Maybe it was a glitch in the first 5.16 units and nobody else noticed... I'll keep an eye on it and report back if I notice anything unusual. 20 min checkpoint intervals are fine with me. I can live with that. Thor Thor, Thanks for the report. I will also watch for this. I am aware that the first checkpoint is usually longer than the others, but it should still not exceed ~35 min. So what you have reported is interesting. Moderator9 ROSETTA@home FAQ Moderator Contact |
NJMHoffmann Send message Joined: 17 Dec 05 Posts: 45 Credit: 45,891 RAC: 0 |
What settings do you feel are not being respected by the current (improved) checkpointing? I would interpret the setting "write to disk at most..." as: After a checkpoint wait for x seconds before a new checkpoint and then do it as soon as possible. Norbert |
Tallguy-13088 Send message Joined: 14 Dec 05 Posts: 9 Credit: 843,378 RAC: 0 |
I think you ought to think about a wristpad/mousepad too! That way he doesn't do quite as much damage pounding his head/hands on the desk <grin>. Jose, just remember, it doesn't have to cooperate ... it's a machine! I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting a new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :) |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting a new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :) I was able to track the offending application. ARGH a maintenance application run amok. Hey, let's face it, without me posting my weird problems and my non-standard attempts at solving them this thread would be boring. I think my computer should be inducted into the Rosetta@Home Hall of Fame. Either that or a citation in the next scientific paper by Dr Baker and the team would be nice. :) Please remember that, as the "minus inter pares" of my team, I am in charge of stat reporting and non-traditional credit production methods, so all voodoo is reserved for that and not for my personal gain. But, should you want, I can send you the specs for the computation system of my dreams. :) It will take a lot of 5 cents. LOL LOL LOL As to the next voodoo ceremony, it may involve a moderator or a poster being sacrificed to the team production deities (specifically the 500,000 Credit a Day deity). Numero 9 is in the sacrificial pool; want to join him? LOL LOL LOL LOL Okies, the pain killers are working. I better go to bed. :) "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Send message Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
I think you ought to think about a wristpad/mousepad too! That way he doesn't do quite as much damage pounding his head/hands on the desk <grin>. Tall guys make good candidates for the sacrificial pool. Te he te he :) "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
What settings do you feel are not being respected by the current (improved) checkpointing? I believe that's exactly what they are doing. The problem is that "as soon as possible" isn't as often as many people would like. It's only about every 20 minutes that they reach a point in the model where they can checkpoint. But it depends on the protein and the CPU. A faster CPU hits that same point much faster than a slow CPU. So, what they do is... reach a point in the model where a checkpoint COULD be made, and if more than 20 min. has gone by since the last checkpoint was made, then another is made.... which I guess is your point now that I type it. Let me see if I can restate it... "Why use the arbitrary 20 minutes number, when the user's preference might be for write to disk every 5 minutes, and my model may be hitting a checkpointable state every 5 minutes?" It seems like that point was brought up on Ralph. The project is under maintenance at the moment so can't post a link. [edit] I think it boiled down to the volume of data they have to write for the checkpoint. It was like 100+MB. And if they wrote that much data every... (I think the default is) 1 min, then your "faster" computer, which is reaching checkpointable points in the model rapidly, would be spending a considerable fraction of time writing the checkpoints rather than getting work done. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
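A minimal sketch of the rule Feet1st describes above, assuming a checkpoint can only be written at certain "checkpointable" points in the model and is written only if at least ~20 minutes have passed since the last one. The function names, the spacing between checkpointable points, and the disk throughput figure are all hypothetical illustrations, not Rosetta's actual code or measured values; only the ~20 minute interval and the 100 MB checkpoint size come from the thread.

```python
MIN_INTERVAL_S = 20 * 60  # the ~20 minute figure quoted in the thread


def should_checkpoint(now_s, last_checkpoint_s, min_interval_s=MIN_INTERVAL_S):
    """Return True if enough time has passed since the last checkpoint."""
    return now_s - last_checkpoint_s >= min_interval_s


def simulate(step_s, total_s, min_interval_s=MIN_INTERVAL_S):
    """Return the times (in seconds) at which checkpoints get written.

    step_s is a hypothetical, fixed spacing between checkpointable points;
    a faster CPU would have a smaller step_s for the same model.
    """
    written = []
    last = 0
    t = step_s
    while t <= total_s:
        # A checkpoint COULD be made here; make it only if the interval elapsed.
        if should_checkpoint(t, last, min_interval_s):
            written.append(t)
            last = t
        t += step_s
    return written


def write_overhead(size_mb, disk_mb_per_s, interval_s):
    """Fraction of wall-clock time spent writing checkpoints.

    disk_mb_per_s is an assumed sustained write speed, not a measured one.
    """
    write_s = size_mb / disk_mb_per_s
    return write_s / (interval_s + write_s)


if __name__ == "__main__":
    # Fast CPU: a checkpointable point every 5 minutes, over one hour.
    print(simulate(step_s=300, total_s=3600))
    # Slow CPU: a checkpointable point every 25 minutes, over one hour.
    print(simulate(step_s=1500, total_s=3600))
    # 100 MB checkpoints on an assumed 20 MB/s disk, every 1 min vs every 20 min.
    print(write_overhead(100, 20, 60), write_overhead(100, 20, 1200))
```

With a 5 minute step the fast CPU still writes only at 20, 40 and 60 minutes, which is the behaviour Norbert is asking about. The overhead function shows why a shorter interval costs more: the write time per checkpoint is fixed, so its share of the total grows as the interval shrinks.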
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
The write to disk parameter has no direct relationship to checkpointing. It can prevent checkpointing if the interval is set too long, but it is a disk use parameter to control disk access only. It is really there to let laptop drives spin down between write accesses. But it in no way is a setting to request more frequent checkpointing. Moderator9 ROSETTA@home FAQ Moderator Contact |
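One way to read Moderator9's point, as a sketch only: the "write to disk at most every N seconds" preference acts as a lower bound on the time between writes, so it can delay checkpoints but never make them more frequent than the application's own interval. The function names below are hypothetical, and the `max()` combination is an interpretation of the post, not the BOINC client's documented behaviour.

```python
def effective_interval(app_min_interval_s, user_disk_interval_s):
    """Smallest gap between checkpoints under both constraints (assumed model).

    The application will not checkpoint more often than app_min_interval_s,
    and the user's disk preference forbids writes more often than
    user_disk_interval_s, so the larger of the two wins.
    """
    return max(app_min_interval_s, user_disk_interval_s)


def checkpoint_allowed(elapsed_s, app_min_interval_s, user_disk_interval_s):
    """True if a checkpoint may be written after elapsed_s seconds."""
    return elapsed_s >= effective_interval(app_min_interval_s, user_disk_interval_s)


if __name__ == "__main__":
    # A 60 s disk preference does not speed up a 20 minute application interval.
    print(effective_interval(20 * 60, 60))
    # A 1 hour disk preference holds checkpoints back beyond 20 minutes.
    print(effective_interval(20 * 60, 60 * 60))
```

Under this reading, setting the preference to 5 minutes changes nothing, while setting it to an hour would suppress the 20 minute checkpoints.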
Astro Send message Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Jose, have you tried searching for malware/adware with Ad-Aware SE, and searching for spyware with Spybot Search & Destroy, in addition to your virus program? They're free. tony |
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
The following WU grew steadily in memory usage up to 550 MB physical RAM and about 700 MB virtual memory (I have 1 GB RAM and 1.24 GB virtual memory on that host): https://boinc.bakerlab.org/rosetta/workunit.php?wuid=17772949 After three and a half hours and 26 decoys I restarted BOINC and memory usage started from 36 MB but is again growing with each completed model. Seems to me like a memory leak. Btw, I never looked at the graphics. Edit: It seems Rosetta is no longer writing to the file stdout.txt after restarting BOINC. However it is writing to the file xxt283.out. Don't know if this means anything. |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Tralala: Thanks for posting about this problem. I thought I had fixed this issue on this workunit, but apparently there are still problems on some clients. I am canceling these workunits now. Aborting the jobs was the right thing to do. The following WU grew steadily in memory usage up to 550 MB physical RAM and about 700 MB virtual memory (I have 1 GB RAM and 1.24 GB virtual memory on that host): |
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Tralala: Thanks for posting about this problem. I thought I had fixed this issue on this workunit, but apparently there are still problems on some clients. I am I could abort it the soft way by lowering the run time preference, but I was afraid it would kill one of my remote hosts with only 512 MB RAM. Fortunately that was not the case. You can safeguard against those incidents if you specify a memory bound for all WUs. If the virtual memory exceeds this bound the WU gets automatically aborted. |
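The memory bound tralala suggests can be sketched as a simple check over periodic memory readings. This is an illustration only: the function name and the sample readings are made up (loosely following the 36 MB to 700 MB growth reported above), and it says nothing about how BOINC or Rosetta would actually enforce such a bound.

```python
def first_violation(samples_mb, bound_mb):
    """Return the index of the first sample above bound_mb, or None.

    samples_mb is a sequence of virtual-memory readings in MB, taken at
    whatever interval the (hypothetical) monitor uses.
    """
    for i, used_mb in enumerate(samples_mb):
        if used_mb > bound_mb:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical readings for a work unit whose memory grows with each model.
    readings = [36, 150, 300, 450, 600, 700]
    hit = first_violation(readings, bound_mb=512)
    if hit is not None:
        print(f"abort: sample {hit} used {readings[hit]} MB, over the bound")
```

With a 512 MB bound this run would be aborted at the 600 MB reading, before it could exhaust a host with only 512 MB of RAM.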
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
Curious behaviour.... Two work units "exited with 0" but had no finish file. They then restarted and appear to have resumed where they left off. They are still running. Here's the log:
5/23/2006 9:21:11 AM||Rescheduling CPU: application exited
5/23/2006 9:21:11 AM|rosetta@home|Computation for task u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0 finished
5/23/2006 9:21:11 AM|rosetta@home|Starting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 using rosetta version 516
5/23/2006 9:21:13 AM|rosetta@home|Started upload of file u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0_0
5/23/2006 9:21:19 AM|rosetta@home|Finished upload of file u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0_0
5/23/2006 9:21:19 AM|rosetta@home|Throughput 29328 bytes/sec
5/23/2006 9:21:24 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/23/2006 9:21:24 AM|rosetta@home|Reason: To report completed tasks
5/23/2006 9:21:24 AM|rosetta@home|Reporting 1 tasks
5/23/2006 9:21:29 AM|rosetta@home|Scheduler request succeeded
5/23/2006 10:22:32 AM||Rescheduling CPU: application exited
5/23/2006 10:22:32 AM|rosetta@home|Computation for task b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0 finished
5/23/2006 10:22:32 AM|rosetta@home|Starting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 using rosetta version 516
5/23/2006 10:22:34 AM|rosetta@home|Started upload of file b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0_0
5/23/2006 10:22:40 AM|rosetta@home|Finished upload of file b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0_0
5/23/2006 10:22:40 AM|rosetta@home|Throughput 28853 bytes/sec
5/23/2006 10:22:45 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/23/2006 10:22:45 AM|rosetta@home|Reason: To report completed tasks
5/23/2006 10:22:45 AM|rosetta@home|Reporting 1 tasks
5/23/2006 10:22:50 AM|rosetta@home|Scheduler request succeeded
5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 exited with zero status but no 'finished' file
5/23/2006 11:04:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
5/23/2006 11:04:46 AM||Rescheduling CPU: application exited
5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 exited with zero status but no 'finished' file
5/23/2006 11:04:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
5/23/2006 11:04:46 AM||Rescheduling CPU: application exited
5/23/2006 11:04:46 AM|rosetta@home|Restarting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 using rosetta version 516
5/23/2006 11:04:46 AM|rosetta@home|Restarting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 using rosetta version 516 |
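For anyone who wants to check their own logs for the same symptom, here is a small sketch that picks out the "exited with zero status but no 'finished' file" messages. It assumes the client log has one entry per line in the `timestamp|project|message` form shown in Mike's post; the function name is made up for illustration, and the pattern is derived only from the lines quoted here, so other BOINC versions may word the message differently.

```python
import re

# Pattern built from the message wording quoted in the post above.
NO_FINISH_FILE = re.compile(
    r"^(?P<ts>\d+/\d+/\d+ \d+:\d+:\d+ [AP]M)\|(?P<project>[^|]*)\|"
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file$"
)


def tasks_without_finish_file(lines):
    """Return (timestamp, task name) pairs for every matching log line."""
    hits = []
    for line in lines:
        match = NO_FINISH_FILE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("task")))
    return hits


if __name__ == "__main__":
    sample = [
        "5/23/2006 10:22:50 AM|rosetta@home|Scheduler request succeeded",
        "5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 exited with zero status but no 'finished' file",
        "5/23/2006 11:04:46 AM||Rescheduling CPU: application exited",
        "5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 exited with zero status but no 'finished' file",
    ]
    for ts, task in tasks_without_finish_file(sample):
        print(ts, task)
```

Counting how often the same task name shows up would tell you whether it is a one-off restart or the "happens repeatedly" case the client warns about.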
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
LINUX problem: OK, it looks to be the same issue. Rosetta "frozen" (SN=Sleeping,Nice and consuming 0% CPU) although BOINC thinks it's running. Also, for some reason BOINC won't pre-empt Rosetta after say 1hr, so effectively the whole DC queue is stuck. I see (e.g. here) you've encountered Rosetta "hangs" recently under Linux using BOINC v5.4.9 (as I see you're using now), we can rule out the BOINC v5.2.14 possibility. Also you have a different kernel 2.6.x (both myself and Aglarond had kernel 2.4.x and BOINC v5.2.14), so we can rule that out too. Although I reiterate that my Linux box that had this issue has been running smoothly for over 3 months, 24/7, crunching 90% Rosetta/Ralph, not a single "hung" instance. I thought it was some odd issue that was "solved" by re-compiling R with new BOINC API, but apparently you guys still have it... Maybe do some thinking about SIGSEGV and SIGABRT: SIGSEGV: segmentation violation Stack trace (11 frames): [0x882fbb3] Exiting... SIGABRT: abort called Stack trace (18 frames): [0x882fbb3] https://boinc.bakerlab.org/rosetta/result.php?resultid=20134206 Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Steve Shedroff Send message Joined: 7 Nov 05 Posts: 11 Credit: 250,657 RAC: 0 |
This may be coincidence, but I just downloaded the most recent BOINC Client and all my numbers are dropping. Work per day is about 1/2 of what it was before the new client. This is true on MacX and Intel P4 systems. Is it just me? |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
This may be coincidence, but I just downloaded the most recent BOINC Client and all my numbers are dropping. Work per day is about 1/2 of what it was before the new client. This is true on MacX and Intel P4 systems. Is it just me? What version of BOINC did you install? Moderator9 ROSETTA@home FAQ Moderator Contact |
senatoralex85 Send message Joined: 27 Sep 05 Posts: 66 Credit: 169,644 RAC: 0 |
LINUX problem: OK, it looks to be the same issue. Rosetta "frozen" (SN=Sleeping,Nice and consuming 0% CPU) although BOINC thinks it's running. Also, for some reason BOINC won't pre-empt Rosetta after say 1hr, so effectively the whole DC queue is stuck. ----------------------------------------------------------------------------- I am not sure but I may have a similar problem. Once in a while I will leave my computer running for a few consecutive hours. When I come back, it seems that BOINC got stuck and stranded a workunit at "100% ready to report" status. If I hit the update button under the projects tab, it sends the workunit and simultaneously downloads another one. Why would it get stuck like that? I am running BOINC 4.45. |
Aglarond Send message Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
BAD ERROR! Boinc 5.4.9 crunching WU t283__CASP7_ABRELAX_SAVE_ALL_OUT_hom024__528_13504_0, screensaver appeared.. suddenly windows error message appeared about Rosetta@home doing illegal operation and windows had to end this process.. "send report to microsoft? [send] [don't send]" you probably know that message.. after closing the message: boinc happily crunches another WU.. now it looks like it was normal computing error .. but it wasn't .. |
©2024 University of Washington
https://www.bakerlab.org