Message boards : Number crunching : Problems with Rosetta version 5.64
Author | Message |
---|---|
Odysseus Joined: 3 May 07 Posts: 14 Credit: 241,831 RAC: 0 |
I had two tasks crash on my G4/733 (Mac OS 10.3.9) today, during the first few seconds of processing. (I was actually watching the graphics on one of them; it was still “Initializing” when it went down.) Both output files have extensive crash-dumps: 2chf__BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-2chf_-frags83__1714_688 and 1acf__BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1acf_-frags83__1714_2284. Exit status 1 (0x1) for both. |
Paul Hayslett Joined: 9 Dec 05 Posts: 1 Credit: 1,511,165 RAC: 0 |
Dunno if this is due to 5.64 or not, but last night 30+ WUs stopped with error -107374 before doing any work at all. Cleaned out all pending work in the queue in less than a second. I downloaded new work and it's been fine since. XP Pro on a Core 2 Duo, Boinc 5.8.15, Rosetta 5.64. |
Odysseus Joined: 3 May 07 Posts: 14 Credit: 241,831 RAC: 0 |
Another crash, this time with exit status 6 (0x6): 1e6iA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1e6iA-frags83__1714_4852. Instead of failing right away, this one wasted more than three hours of CPU-time. |
Neil Joined: 7 Mar 07 Posts: 25 Credit: 135,539 RAC: 0 |
David Kim, Forum moderator, Project administrator, Project developer, & Project scientist wrote: There are three types of checkpointing. From the longest to shortest interval between checkpoints: I have an old Celeron 1.4 GHz with massive 256 kB L2 cache and Boinc 5.8.16. My antivirus wanted me to do a re-start, an infrequent request. I checked Rosetta, and my 5.64 WU was at 1:50 CPU Time, and 1:05 To Completion. After restarting, the work unit reverted to 1:20 CPU Time and 1:35 To Completion. I think it would preserve lots of work (especially on my general-use computer) if Checkpoints were also saved when Boinc is manually exited. Do we have the technology? ---- Query: Regarding "posing" and "jumping jobs," what is posing? I searched a few days ago and couldn't find a definition. I don’t sup-pose it has to do with manually exiting Boinc? Thanks. Neil |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Neil, BOINC does not notify the application prior to exit. And even if it did, Rosetta does not have the capability to checkpoint at a forced point in time. It reaches predefined points in the model's computation and those are the only points where it can do a checkpoint. The recent changes added such predefined points to some types of tasks which did not previously have them. The pose and jumping were references to types of Rosetta tasks that now have the checkpointing. You will see those words in the task name. Rosetta Moderator: Mod.Sense |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Mod.Sense wrote: Neil, BOINC does not notify the application prior to exit. And even if it did, Rosetta does not have the capability to checkpoint at a forced point in time. It reaches predefined points in the model's computation and those are the only points where it can do a checkpoint. The recent changes added such predefined points to some types of tasks which did not previously have them. Is it the way BOINC or ROH is written, or what, that there cannot be a save point on exit? Actually 30 mins is not too great a loss, but if you take that over all the computers on here, there is a lot of lost time due to computers crashing or restarting or being turned back on after the owner shuts down for the night. It really is too bad there is no way to save before exiting. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
As I tried to explain, it is both a BOINC limitation and a Rosetta limitation. And I'm not sure if the different operating systems have any standards about what course to take when the user closes an application. There is always a trade-off between taking the time to save work done, and using the time to get more work done. In other words, the more checkpointing you do, the less time you have to crunch. If a computer is shut down once per day and crunches for about 10 hours each day, would you be better off overall to checkpoint every minute? Two minutes? 10? What about all the machines crunching 24 hrs? Their RAC will drop slightly if you add a bunch of checkpoints. Someone who is ending tasks frequently would see their RAC increase. So, the Project Team felt the balance was out of alignment on these tasks, especially those with long runtime per model, where sometimes over an hour of crunching was lost. They've made changes to bring things back closer to that balance between losing work that is not checkpointed, and losing crunch time due to the time to capture checkpoints. Rosetta Moderator: Mod.Sense |
Knorr Joined: 18 Feb 06 Posts: 21 Credit: 373,953 RAC: 0 |
I'm crunching this WU at the moment 1npsA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1npsA-frags83__1714_6954 I made an exit just after the ab initio stage was finished for the first model. Once restarted the model started at a checkpoint where the relax stage started. But the CPU time and percentage reset to zero. Not a major bug, but if I remember correctly you have tried to fix this in a prior release, so I thought I'd let you know. |
EW-3 Joined: 1 Sep 06 Posts: 27 Credit: 2,561,427 RAC: 0 |
Running WIN XP SP2 getting 5/12/2007 11:48:53 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 5/12/2007 11:48:53 AM|rosetta@home|Reason: To fetch work 5/12/2007 11:48:53 AM|rosetta@home|Requesting 8640 seconds of new work 5/12/2007 11:48:58 AM|rosetta@home|Scheduler request succeeded 5/12/2007 11:48:58 AM|rosetta@home|No work from project |
EW-3 Joined: 1 Sep 06 Posts: 27 Credit: 2,561,427 RAC: 0 |
Must be magic - all OK now ;) 5/12/2007 12:18:45 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 5/12/2007 12:18:45 PM|rosetta@home|Reason: To fetch work 5/12/2007 12:18:45 PM|rosetta@home|Requesting 8640 seconds of new work 5/12/2007 12:18:50 PM|rosetta@home|Scheduler request succeeded 5/12/2007 12:18:52 PM|rosetta@home|Started download of file 1ctf_.fasta 5/12/2007 12:18:52 PM|rosetta@home|Started download of file 1ctf_.psipred_ss2.gz 5/12/2007 12:18:53 PM|rosetta@home|Finished download of file 1ctf_.fasta Running WIN XP SP2 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Don't think we will run out of work now! 5,000+ in queue and 48,000+ ready to send, so that's 53,000+ WUs in line! |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Workunit 71222475 is hanging on one of my systems (state is running, but no cpu time accumulates). The following is the contents of stderr.txt: Graphics are disabled due to configuration... # cpu_run_time_pref: 28800 # random seed: 2862215 ERROR:: Exit from: pose.cc line: 769 SIGABRT: abort called Stack trace (21 frames): [0x8cbf0fb] [0x8cb9f2c] [0xffffe420] [0x8d2a0b4] [0x8d3ef9f] [0x8d44005] [0x8d442e3] [0x8d14d11] [0x8d16739] [0x84aacad] [0x8d2a5ff] [0x8cbbbbf] [0x8063a3d] [0x8064905] [0x88baf95] [0x83402ed] [0x85b4a7f] [0x86d8113] [0x86d81be] [0x8d22ff4] [0x8048111] Exiting... I'm going to abort this workunit since it is obviously not going to go anywhere. Team Helix |
Neil Joined: 7 Mar 07 Posts: 25 Credit: 135,539 RAC: 0 |
Mod.Sense wrote: BOINC does not notify the application prior to exit. And even if it did, Rosetta does not have the capability to checkpoint at a forced point in time. It reaches predefined points in the model's computation and those are the only points where it can do a checkpoint. I was slow to reply because your meaning was slow to sink into my dense cranium. However, you couldn't have said it any clearer. Checkpoints can only be created at predefined points in the model's computation. And it sounds like Rosetta is already taking advantage of most of those predefined points. OK, how about this for a compromise: How about adding an audible alert whenever Rosetta "does a checkpoint" or starts a new WU? If my WinXP starts to run hairy and if I'm able to wait until Boinc beeps, then I could take the opportunity to restart Windows while losing hardly any work. Of course, the beep should be user-selectable with an On/Off switch. I would only switch the alert On when my computer starts getting an anxious aura, and it should conveniently automatically reset to Off after Boinc restarts. Then, I would be one with my Boinc. "And one day, man will serve machines." -Neil- PS. Greg_BE, thanks for the moral support :) |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Well, I keep my audio off, so maybe a balloon message, or something generated by RAH that is written into the message portion of BOINC? |
Doug Worrall Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=79610297 Hello, I have not posted in a long time due to job duties and health. I had this rather "LARGE" w/u (URL at top of post) that crunched for 4 hours while finding 1 decoy from 1 attempt. To me this must be an error, even though it was successful. Maybe I just want to post and say hello to the staff and say "Great Job". All Rosetta w/u have been mostly the same size, some fetching 18 decoys, with an average of 10. I noticed Rosetta had no w/u queued on Friday, I believe. I had to crunch another experiment until there were more w/u in the queue. I have not been to the boards in a long time. No complaints; hope I have the right thread also. Am running a Linux distro. Doug |
Odysseus Joined: 3 May 07 Posts: 14 Credit: 241,831 RAC: 0 |
Another crash on my Mac G4/733, this time with exit status 6 (0x6): 1e6iA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1e6iA-frags83__1714_4852. As before, lots of data that I don’t understand, but that a programmer might, in the output file. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi Doug, thanks for posting, and also thanks for your encouragement. That was indeed a really big job you crunched -- these workunits with 1GID in the name can take between two and four hours, depending on the machine. Due to the amount of time required, we're assigning quite a bit of credit for each decoy from these workunits. The nice thing is that we've implemented checkpointing, so that even if you stop crunching for a while, when you return to Rosetta@home, you can pick up basically where you left off. The other *really* nice thing is that the results look awesome -- we're seeing some beautiful structures for this very large molecule. https://boinc.bakerlab.org/rosetta/result.php?resultid=79610297 |
Doug Worrall Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
Thanks RHIJU, just finished another biggy here: https://boinc.bakerlab.org/rosetta/result.php?resultid=79676296 And it makes me very happy to know that these 1 decoy w/u are actually good for Rosey. Am learning that the checkpoints are working well: I quit a session, actually rebooted during this very w/u, and it did not fail. Rosey has come a long way. I should take a look at Ralph again soon, after Rosey is no longer the Project of the Month at B.S. "Happy Crunching" Great work Scientists, and Moderators, and all staff at Rosetta@home Sincerely Doug |
zoom314 Joined: 4 May 07 Posts: 13 Credit: 118,553 RAC: 0 |
Never mind, I think I fixed it. I set it to 70% of memory when the computer is in use; the stock setting is lower. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This work unit must be stuck someplace, as it is still on the web page as new from the 9th and it's due the 19th, but it is no longer in my BOINC manager. 1r69__BOINC_ABRELAX_SAVE_ALL_OUT_BARCODE-1r69_-frags83__1706_5360_0 My current work is from the 13-15th and due the 23-25th. Any ideas? |
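
Mod.Sense's question in this thread (checkpoint every minute? two minutes? 10?) can be worked through with a little arithmetic. The sketch below is only a rough model, not anything taken from Rosetta or BOINC: it assumes each checkpoint costs a fixed number of seconds to write, and that a shutdown lands at a random moment between two checkpoints, so on average half an interval of work is lost. The function names and the 2-second checkpoint cost are hypothetical and chosen purely for illustration.

```python
import math


def daily_overhead_seconds(interval_s, crunch_s_per_day, restarts_per_day,
                           checkpoint_cost_s):
    """Expected seconds per day that do not go into new work.

    Assumed model (not from the thread): a fixed cost per checkpoint,
    plus half an interval of lost work for every restart.
    """
    write_cost = (crunch_s_per_day / interval_s) * checkpoint_cost_s
    lost_work = restarts_per_day * interval_s / 2.0
    return write_cost + lost_work


def best_interval_seconds(crunch_s_per_day, restarts_per_day,
                          checkpoint_cost_s):
    """Interval that minimises daily_overhead_seconds under the same model.

    Setting the derivative of T*c/tau + n*tau/2 to zero gives
    tau = sqrt(2*c*T/n). A machine that never restarts has no finite
    optimum, because every checkpoint is pure cost for it.
    """
    if restarts_per_day == 0:
        return math.inf
    return math.sqrt(2.0 * checkpoint_cost_s * crunch_s_per_day
                     / restarts_per_day)


if __name__ == "__main__":
    ten_hours = 10 * 3600   # the "crunches about 10 hours a day" machine
    cost = 2.0              # hypothetical seconds per checkpoint
    for minutes in (1, 2, 10):
        lost = daily_overhead_seconds(minutes * 60, ten_hours, 1, cost)
        print(f"checkpoint every {minutes:2d} min: {lost:.0f} s/day overhead")
    best = best_interval_seconds(ten_hours, 1, cost)
    print(f"model optimum: about {best / 60:.1f} min between checkpoints")
```

Under these assumed numbers, the machine that shuts down once a day loses about 1230 s per day when checkpointing every minute, 660 s every two minutes, and 420 s every ten minutes. For a machine crunching 24 hours with no restarts, the same model counts every checkpoint as pure cost, which matches the point that such machines would see their RAC drop slightly if many checkpoints were added. Real tasks can only checkpoint at the predefined points described above, so the actual interval cannot be tuned freely.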
©2025 University of Washington
https://www.bakerlab.org