Message boards : Number crunching : Problems with Rosetta version 5.64
Author | Message |
---|---|
Craig Arno Send message Joined: 26 Mar 06 Posts: 1 Credit: 50,539 RAC: 0 |
I am experiencing reboots about 3x/day with Rosetta 5.64 on my SuSE Linux 10.1 AMD Athlon 3800+ system. The reboots stop when I suspend Rosetta. Since this is my main server (phones, email, web presence, etc.), I've had to suspend Rosetta work indefinitely on this machine. Any ideas why? Or is there something I can do to provide further information? SETI crunches along just fine on this same system. Version info: > uname -a Linux suse 2.6.16.27-0.6-default #1 Wed Dec 13 09:34:50 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux BOINC> ./boinc -version 5.4.9 i686-pc-linux-gnu rosetta_5.64_i686-pc-linux-gnu Rosetta is working on my AMD Duron/OpenSUSE 10.2 Linux system and on my WindowsXPpro AMD Turion64M T-30 system. This will allow me to continue to contribute something to the Rosetta project. I can be reached at craig@arno.com |
DJH@GB-Ro Send message Joined: 11 Mar 06 Posts: 1 Credit: 740 RAC: 0 |
|
LohnesinPR Send message Joined: 17 Jan 06 Posts: 1 Credit: 145,651 RAC: 0 |
After almost a year and a half I suddenly started having problems. My work units get down to a point where there are only 9.50 minutes left to finish and then go no further. Some 2-3 hour work units have run more than 8 hours (overnight) without ever finishing. This has happened 5 times in the last 4 days. In each case I abort and go on to another work unit. Now it has happened twice in a row. I hesitate to go on to another unit because I am becoming frustrated. Lohnesinpr Lohnes in Puerto Rico |
Neil Send message Joined: 7 Mar 07 Posts: 25 Credit: 135,539 RAC: 0 |
I am experiencing reboots about 3x/day with Rosetta 5.64 on my SuSE Linux 10.1 AMD Athlon 3800+ system. The reboots stop when I suspend Rosetta. Was Rosetta ever running successfully on this Athlon 3800+ and then it recently developed this problem? Or is this a new Rosetta client that's having trouble getting off the ground? Any ideas why? I'm just a beginner, but: 1. Hot microprocessor? You can reduce the load on your CPU in your General Preferences > Processor Usage > Use at most XX percent of CPU time (enforced by version 5.6 and greater). 2. Linux BOINC having trouble with Rosie 5.64? Maybe install WinXP onto your Linux computer, using an installation disk from one of your other machines. You should have a month before the Activation/Verification thingy kicks in to see if XP helps the spontaneous restart problem. 3. You might try downloading a copy of Memtest86, which exercises RAM. Let it run for a day, especially during warmer ambient temperatures, to see if errors are detected. Google for "spontaneous reboot." http://ask.metafilter.com/52269/XP-spontaneous-reboot-related-to-full-folders "... If the machine is "spontaneously rebooting," are you getting any event error log codes or blue screen errors, or stop log (memory dump) error codes? If so, investigating these could pinpoint your problem. In my experience, 99% of crashes resulting in "spontaneous reboots" turn out to be memory related..." He certainly sounds like he knows what he's talking about. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
LohnesinPR, I am seeing some tasks taking 5+ hours per model on a 3 GHz Pentium 4. Your machine has less memory, so perhaps doesn't get sent this type of task. The difference basically is that the models of many of the current tasks are much larger than they used to be. So the time per model is going up. The 90% thing is the new estimated runtime. Once you exceed your runtime preference (as configured in your Rosetta preferences) they don't really know what % complete to estimate. So they set it to a point that shows you've got about 10 minutes left and reduce exponentially from there. The results of your crunching are collected at the end of a model. If you look at the graphic, I'll bet you will see you are still working on model 1 and that the steps are increasing. So, things are progressing (even if the % completed may not have a good way to indicate that). If the above is indeed the case, please do this once. Let the task run for even longer. Let's say up to 48 hours (on your 0.5 GHz CPU). The system is already watching the task for you with its "watchdog". If the watchdog sees no progress being made, it will end the task for you. But I think you will find that you finally complete that first model and the task reports back normally. I for one enjoy seeing these huge and complex RNA structures being studied. It shows that Rosetta has come a long way from just a year ago when tasks with just 30-50 amino acids were still "interesting". I hope that once you see that your crunching is indeed working properly you will agree. You will also get much more credit for these long-running tasks since they take so much longer to complete. Rosetta Moderator: Mod.Sense |
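The progress behaviour described in the post above (count down normally, then hold at roughly 10 minutes remaining and shrink that estimate exponentially) can be sketched as follows. This is only an illustration, not the Rosetta code: the function names, the 600-second floor and the one-hour half-life are all assumptions chosen to make the idea concrete.

```python
def estimate_remaining(elapsed_s, target_s, floor_s=600.0, half_life_s=3600.0):
    """Hypothetical remaining-time estimate, in seconds.

    Until the task is within `floor_s` of its target runtime, the estimate
    is simply target minus elapsed. After that point the estimate starts at
    `floor_s` (about 10 minutes) and halves every `half_life_s` seconds.
    Both constants are assumptions, not values taken from Rosetta.
    """
    switch_point = target_s - floor_s
    if elapsed_s < switch_point:
        return target_s - elapsed_s
    overrun = elapsed_s - switch_point
    return floor_s * 0.5 ** (overrun / half_life_s)


def fraction_done(elapsed_s, target_s):
    """Percent-complete style figure derived from the estimate above."""
    remaining = estimate_remaining(elapsed_s, target_s)
    return elapsed_s / (elapsed_s + remaining)
```

With a 3-hour target (10800 s), this sketch reports 600 s remaining at 10200 s elapsed and 300 s remaining an hour later, so the displayed percentage keeps creeping upward without ever reaching 100% until the model actually finishes.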
Doug Worrall Send message Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
LohnesinPR, I am seeing some tasks taking 5+ hours per model on a 3 GHz Pentium 4. Your machine has less memory, so perhaps doesn't get sent this type of task. The difference basically is that the models of many of the current tasks are much larger than they used to be. So the time per model is going up. Thank you, Mod.Sense. Below is a large w/u: https://boinc.bakerlab.org/rosetta/result.php?resultid=80490081 I have had 4 in a row that stop at 10 minutes, and they finish quite quickly after that. These particular w/u find only 1 decoy, which gives this project a really good sequence, from what I understand. I am very pleased there are checkpoints now. I was concerned lately and am happy to see this post. "Happy Crunching" sluger |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 2,520 |
copied from the 'Hey SETI crunchers, will you continue crunching Rosetta?' thread: Crazybob @ SETI.USA wrote:
It's this machine: GenuineIntel x86 Family 6 Model 7 Stepping 3 547MHz |
Ross Parlette Send message Joined: 10 Nov 05 Posts: 32 Credit: 2,165,044 RAC: 0 |
Normally, I see rosetta completing a work unit, uploading it and requesting a new work unit, in a fairly short time frame. Then it reports the work. Now it's not requesting any new work. When I manually requested an update, it didn't. Even when I suspended rosetta and resumed it, nothing happened. I have SAH at 100 and rosetta at 50, but I now have 2.5 GB RAM, so there is clearly room in RAM. My hard drive has 113 GB free, so that's not a problem. Any suggestions? During the long SAH outage, rosetta went like gang-busters. I also suspended SAH so as not to generate unlimited pesky red messages.
5/19/2007 3:40:40 PM|rosetta@home|Resuming result CNTRL_01ABRELAX_SAVE_ALL_OUT_-1pgx_-_filters_1737_4861_0 using rosetta version 564
5/19/2007 3:40:40 PM|SETI@home|Pausing result 04mr05ab.17213.29265.454810.3.175_0 (left in memory)
5/19/2007 4:30:55 PM||request_reschedule_cpus: process exited
5/19/2007 4:30:55 PM|rosetta@home|Computation for result CNTRL_01ABRELAX_SAVE_ALL_OUT_-1pgx_-_filters_1737_4861_0 finished
5/19/2007 4:30:55 PM|SETI@home|Resuming result 04mr05ab.17213.29265.454810.3.175_0 using setiathome_enhanced version 515
5/19/2007 4:30:57 PM|rosetta@home|Started upload of CNTRL_01ABRELAX_SAVE_ALL_OUT_-1pgx_-_filters_1737_4861_0_0
5/19/2007 4:31:03 PM|rosetta@home|Finished upload of CNTRL_01ABRELAX_SAVE_ALL_OUT_-1pgx_-_filters_1737_4861_0_0
5/19/2007 4:31:03 PM|rosetta@home|Throughput 35624 bytes/sec
5/19/2007 5:24:25 PM||request_reschedule_cpus: process exited
5/19/2007 5:24:25 PM|SETI@home|Computation for result 04mr05ab.17213.29265.454810.3.175_0 finished
5/19/2007 5:24:25 PM|SETI@home|Starting result 27fe05aa.20439.6162.692336.3.65_1 using setiathome_enhanced version 515
5/19/2007 5:24:27 PM|SETI@home|Started upload of 04mr05ab.17213.29265.454810.3.175_0_0
5/19/2007 5:25:13 PM|SETI@home|Finished upload of 04mr05ab.17213.29265.454810.3.175_0_0
5/19/2007 5:25:13 PM|SETI@home|Throughput 548 bytes/sec
5/19/2007 6:55:08 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/19/2007 6:55:08 PM|rosetta@home|Reason: To report results
5/19/2007 6:55:08 PM|rosetta@home|Reporting 1 results
5/19/2007 6:55:13 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
...
5/19/2007 10:33:19 PM||request_reschedule_cpus: project op
5/19/2007 10:33:24 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/19/2007 10:33:24 PM|rosetta@home|Reason: Requested by user
5/19/2007 10:33:24 PM|rosetta@home|Note: not requesting new work or reporting results
5/19/2007 10:33:29 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
5/19/2007 10:34:18 PM||request_reschedule_cpus: project op
5/19/2007 10:34:26 PM||request_reschedule_cpus: project op |
anders n Send message Joined: 19 Sep 05 Posts: 403 Credit: 537,991 RAC: 0 |
Normally, I see rosetta completing a work unit, uploading it and requesting a new work unit, in a fairly short time frame. Then it reports the work. Now it's not requesting any new work. When I manually requested an update, it didn't. Even when I suspended rosetta and resumed it, nothing happened. I think you have a Long Term Debt to SETI after its downtime. Take it easy and things will get back to normal in a few days. Anders n |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I think you have a Long Term Debt to SETI after its downtime. I think so too. Especially given the msg below that I've colored red.
5/19/2007 10:33:19 PM||request_reschedule_cpus: project op
5/19/2007 10:33:24 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/19/2007 10:33:24 PM|rosetta@home|Reason: Requested by user
5/19/2007 10:33:24 PM|rosetta@home|Note: not requesting new work or reporting results
If you run with SETI on a 2/3 resource share, and your machine has been crunching only Rosetta for a week... it may need to crunch nothing but SETI for about 2 weeks to get back into balance. As always, if you would LIKE to crunch more Rosetta, you can increase your resource allocation in your Rosetta Preferences. You will still have the "debt" to pay back to SETI first, but if you increase your Rosetta share, the debt will be balanced sooner than otherwise. Rosetta Moderator: Mod.Sense |
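The "about 2 weeks" figure above follows from simple share arithmetic, which can be sketched like this. It is a deliberately simplified model, not BOINC's actual long-term debt accounting; the function names are hypothetical, and the shares (SETI 100, Rosetta 50) are the ones quoted earlier in the thread.

```python
from fractions import Fraction


def debt_accrued(owed_share, other_share, weeks):
    """CPU-weeks owed to a project that received no CPU time for `weeks`.

    Simplified model: the project was entitled to its resource-share
    fraction of the CPU and got none of it.
    """
    entitled = Fraction(owed_share, owed_share + other_share)
    return entitled * weeks


def weeks_to_repay(debt, owed_share, other_share):
    """Weeks of running only the owed project until the debt is cleared.

    While it runs alone, the project gets 100% of the CPU, so only the
    part above its entitled fraction counts toward repayment.
    """
    entitled = Fraction(owed_share, owed_share + other_share)
    surplus_per_week = 1 - entitled
    return debt / surplus_per_week


# One week of Rosetta only, with SETI at 100 and Rosetta at 50.
debt = debt_accrued(100, 50, 1)            # 2/3 of a CPU-week owed to SETI
print(weeks_to_repay(debt, 100, 50))       # 2
# Raising Rosetta's share to 100 shortens the payback period.
print(weeks_to_repay(debt, 100, 100))      # 4/3
```

Under these assumptions the same debt is repaid in 2 weeks at the original shares and in about 1.3 weeks once Rosetta's share is raised to match SETI's, which is the "balanced sooner" effect the post describes.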
mdettweiler Send message Joined: 15 Oct 06 Posts: 33 Credit: 2,509 RAC: 0 |
I think you have a Long Term Debt to SETI after its downtime. You can fix that using BOINCDV's "Clear Debts" function--yeah, it resets all debts, but it is often what you need to clear stuff up. (It's not a good idea to fix one by hand; I tried that once, but all it did was mess up the debts for the other projects, and I had to clear debts anyway to fix things--well, they weren't actually broken, but a lot of projects weren't getting any work.) You can download BOINCDV from the "Additional Applications" page that is linked to on the home pages of most projects. |
mdettweiler Send message Joined: 15 Oct 06 Posts: 33 Credit: 2,509 RAC: 0 |
I'm thinking, if we could get some better checkpointing around here, it would be nice. Yes, I know that the WU can only checkpoint at certain points in the model, but for people who don't crunch 24/7 (not to mention having other projects that have to share the part of the day that the computer is on), this can be a bit of a problem, at least for the big "Abinition-relax" and "Symmetric Fold And Dock" workunits that seem to be so common lately. When the model takes up to 3 hours to complete, and the model can't be checkpointed very often throughout, you get a problem. Isn't there a better way to checkpoint, that allows you to do so at any time? I was thinking of a way it might be done (although it probably won't be feasible, I'll say it anyway, in case it helps): You know how if you have your preferences set to leave apps in memory while preempted, if the workunit just pauses, it can sit there, with the entire workunit loaded into memory? I'm thinking, the Rosetta application can simply save everything it has loaded into memory to disk, and then when it needs to restore from the checkpoint, it will simply load that back into memory, and restore as if it was restoring from a preempted task left in memory. Would this work at all? If so, then if it could be at all possible to make the application save its contents to disk as described every 5 or 10 minutes, that would be enough to make sure that hours of work are not wasted when a computer needs to be shut down! Hope this helps--although it probably won't (as I am neither a protein expert nor a programmer), but maybe it will. :-) |
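The save-and-restore idea in the post above can be sketched in a few lines. Note what is swapped in here: instead of dumping the whole process memory, this toy version saves an explicit state object, and it checkpoints every fixed number of steps rather than every 5 or 10 minutes. The function names, the state layout and the `checkpoint_every` parameter are all hypothetical, and nothing here reflects how Rosetta or BOINC actually checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then atomically swap it into place,
    so a shutdown mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(total_steps, path, checkpoint_every=100):
    """Toy computation (sum of step indices) that resumes from a checkpoint."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

The practical obstacle for a real application is the size and complexity of the state: everything needed to resume has to be serializable and consistent at the moment of the save, which is one plausible reason checkpoints are tied to specific points in a model.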
Doug Worrall Send message Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
I'm thinking, if we could get some better checkpointing around here, it would be nice. Yes, I know that the WU can only checkpoint at certain points in the model, but for people who don't crunch 24/7 (not to mention having other projects that have to share the part of the day that the computer is on), this can be a bit of a problem, at least for the big "Abinition-relax" and "Symmetric Fold And Dock" workunits that seem to be so common lately. When the model takes up to 3 hours to complete, and the model can't be checkpointed very often throughout, you get a problem. Isn't there a better way to checkpoint, that allows you to do so at any time? Myself, I am very happy checkpointing has "just" been implemented. Having a Linux box, I don't shut down for weeks and months at a time. Therefore no problems. It is so easy to download a Linux distro and dual boot, if you want to save your Windows OS. Myself, I was happy to lose defragging, viruses, constant probing of ports, spyware and malware. If you must shut down, click "No new work", finish what you have in your queue, then reboot if you must. Doug |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
I shut down last night and instead of losing 4 hours on an RNA computation I only lost a couple of minutes. |
mdettweiler Send message Joined: 15 Oct 06 Posts: 33 Credit: 2,509 RAC: 0 |
I'm thinking, if we could get some better checkpointing around here, it would be nice. Yes, I know that the WU can only checkpoint at certain points in the model, but for people who don't crunch 24/7 (not to mention having other projects that have to share the part of the day that the computer is on), this can be a bit of a problem, at least for the big "Abinition-relax" and "Symmetric Fold And Dock" workunits that seem to be so common lately. When the model takes up to 3 hours to complete, and the model can't be checkpointed very often throughout, you get a problem. Isn't there a better way to checkpoint, that allows you to do so at any time? It's not about system stability, in my case; it's just that I don't necessarily want/need to have my computer on when I'm not going to be awake! No need to run up the electric bill like that--the whole point of BOINC is to utilize spare CPU time (although if you have some time and money to put into making it a full-time thing for your computer, that's a different story, but most people are not in that situation). And, I'm saying, as for clicking "No New Tasks" before you shut down, that's all fine and dandy, if you can finish a workunit in half an hour. Otherwise, it's not going to have any effect on that workunit that's already downloaded and doesn't have time to finish before you need to shut down. |
mdettweiler Send message Joined: 15 Oct 06 Posts: 33 Credit: 2,509 RAC: 0 |
Yay! For once, my Rosetta workunit checkpointed and resumed OK--it didn't lose any CPU time this time around! Workunit: here Hopefully this is the way they will all be now. :-| Probably not though. |
Doug Worrall Send message Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
Yay! For once, my Rosetta workunit checkpointed and resumed OK--it didn't lose any CPU time this time around! WTG. The link says the w/u is "in progress", but I am sure you copied the wrong path. I am sure all w/u will be successful. GL "Happy Crunching" |
mdettweiler Send message Joined: 15 Oct 06 Posts: 33 Credit: 2,509 RAC: 0 |
Yay! For once, my Rosetta workunit checkpointed and resumed OK--it didn't lose any CPU time this time around! No, I'm sure it's the right one. When I copied the link, I checked the name of the WU with the one listed in BOINC Manager, so I'm sure it's correct. Anyway, I'm on to the next one now. It's kind of funny in this way: Its models take hardly any time at all (about two minutes per model) and it buzzes through steps 1-500, then sits there for the rest of the two minutes or so, until it finishes the model. I think I heard something about this over on RALPH, and that it's just this particular kind of workunit that does that. Well, I'm not complaining, because if it has shorter models, then that means it can checkpoint more often! :-) |
Ty Send message Joined: 2 Mar 06 Posts: 2 Credit: 50,697 RAC: 0 |
Work Unit ID 70636999. My computer completed the work and returned the results and received granted credit of 0.00 for over 9228 seconds of work. Another computer received 29.25 credit granted for this workunit. What goes on? |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
Work Unit ID 70636999. My computer completed the work and returned the results and received granted credit of 0.00 for over 9228 seconds of work. Another computer received 29.25 credit granted for this workunit. What goes on? The person before you was assigned the unit but did not complete it within the scheduled time. As such, the unit was sent to you. However, the person before you finished it late but returned it before you returned yours, so he got the credit. |
©2025 University of Washington
https://www.bakerlab.org