Problems with Rosetta version 5.64

Message boards : Number crunching : Problems with Rosetta version 5.64

Odysseus
Joined: 3 May 07
Posts: 14
Credit: 241,831
RAC: 0
Message 40697 - Posted: 11 May 2007, 7:48:03 UTC

I had two tasks crash on my G4/733 (Mac OS 10.3.9) today, during the first few seconds of processing. (I was actually watching the graphics on one of them; it was still “Initializing” when it went down.) Both output files have extensive crash-dumps: 2chf__BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-2chf_-frags83__1714_688 and 1acf__BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1acf_-frags83__1714_2284. Exit status 1 (0x1) for both.

Paul Hayslett
Joined: 9 Dec 05
Posts: 1
Credit: 1,511,165
RAC: 0
Message 40704 - Posted: 11 May 2007, 11:40:54 UTC

Dunno if this is due to 5.64 or not, but last night 30+ WUs stopped with error -107374 before doing any work at all. Cleaned out all pending work in the queue in less than a second. I downloaded new work and it's been fine since. XP Pro on a Core 2 Duo, Boinc 5.8.15, Rosetta 5.64.

Odysseus
Joined: 3 May 07
Posts: 14
Credit: 241,831
RAC: 0
Message 40713 - Posted: 11 May 2007, 15:51:54 UTC
Last modified: 11 May 2007, 15:52:52 UTC

Another crash, this time with exit status 6 (0x6): 1e6iA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1e6iA-frags83__1714_4852. Instead of failing right away, this one wasted more than three hours of CPU-time.

Neil
Joined: 7 Mar 07
Posts: 25
Credit: 135,539
RAC: 0
Message 40726 - Posted: 11 May 2007, 18:19:19 UTC - in response to Message 40688.  
Last modified: 11 May 2007, 18:23:00 UTC


David Kim, Forum moderator, Project administrator, Project developer, & Project scientist wrote:

There are three types of checkpointing. From the longest to shortest interval between checkpoints:

1.
2.
3. and, a more recent addition, checkpointing for pose and jumping jobs. These types of jobs should checkpoint at intervals depending on your disk write interval preference.


I have an old Celeron 1.4 GHz with massive 256 kB L2 cache and Boinc 5.8.16. My antivirus wanted me to do a re-start, an infrequent request. I checked Rosetta, and my 5.64 WU was at 1:50 CPU Time, and 1:05 To Completion.

After restarting, the work unit reverted to 1:20 CPU Time and 1:35 To Completion.

I think it would preserve lots of work (especially on my general-use computer) if Checkpoints were also saved when Boinc is manually exited. Do we have the technology?

----

Query: Regarding "posing" and "jumping jobs," what is posing? I searched a few days ago and couldn't find a definition. I don’t sup-pose it has to do with manually exiting Boinc? Thanks.

Neil

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 40729 - Posted: 11 May 2007, 18:54:38 UTC

Neil, BOINC does not notify the application prior to exit. And even if it did, Rosetta does not have the capability to checkpoint at a forced point in time. It reaches predefined points in the model's computation and those are the only points where it can do a checkpoint. The recent changes added such predefined points to some types of tasks which did not previously have them.

The pose and jumping were references to types of Rosetta tasks that now have the checkpointing. You will see those words in the task name.
Rosetta Moderator: Mod.Sense
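
To make the point above concrete, here is a minimal sketch of checkpointing that can only happen at predefined points. It is not Rosetta's or BOINC's actual code: the function `run_task`, the JSON checkpoint file, and the "work units" are all hypothetical, and the stage names are borrowed from task descriptions in this thread purely for illustration.

```python
import json
import os
import tempfile


def run_task(stages, checkpoint_path, stop_after=None):
    """Run named stages in order, checkpointing only at stage boundaries.

    stages: list of (name, work_units) pairs; names are illustrative only.
    stop_after: simulate the client exiting after this many work units.
    Returns the number of work units completed in this session.
    """
    # Resume from the last completed stage, if a checkpoint exists.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_stage"]

    done = 0
    for index in range(start, len(stages)):
        name, units = stages[index]
        for _ in range(units):
            if stop_after is not None and done >= stop_after:
                return done  # exit mid-stage: nothing can be saved here
            done += 1
        # Predefined point: the only place where state is written out.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_stage": index + 1}, f)
    return done


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    stages = [("ab initio", 5), ("relax", 5)]
    print(run_task(stages, path, stop_after=7))  # first session, interrupted
    print(run_task(stages, path))                # second session, resumed
```

In this toy run the first session exits two units into the second stage, so the second session has to redo those two units from the stage boundary. Adding more predefined points shrinks that loss, which is what the recent changes are described as doing.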

Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 40738 - Posted: 11 May 2007, 20:44:34 UTC - in response to Message 40729.  

Is it the way BOINC or RAH is written, or what, that there cannot be a save point on exit? Actually 30 mins is not too great of a loss, but if you take that over all the computers on here, there is a lot of lost time due to computers crashing or restarting or being turned back on after the owner shuts down for the night.
It really is too bad there is no way to save before exiting.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 40740 - Posted: 11 May 2007, 21:26:36 UTC

As I tried to explain, it is both a BOINC limitation and a Rosetta limitation. And I'm not sure if the different operating systems have any standards about what course to take when the user closes an application.

There is always a trade-off between taking the time to save work done, and using the time to get more work done. In other words, the more checkpointing you do, the less time you have to crunch. If a computer is shut down once per day and crunches for about 10 hours each day, would you be better off overall checkpointing every minute? Two minutes? 10? What about all the machines crunching 24 hours a day? Their RAC will drop slightly if you add a bunch of checkpoints. Someone who is ending tasks frequently would see their RAC increase.

So, the Project Team felt the balance was out of alignment on these tasks. Especially those with long runtime per model, where sometimes over an hour of crunching was lost. They've made changes to bring things back closer to that balance between losing work that is not checkpointed, and losing crunch time due to the time to capture checkpoints.
Rosetta Moderator: Mod.Sense
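
The trade-off above can be put into rough numbers. This is only a back-of-the-envelope model under stated assumptions: checkpoints are evenly spaced (which Rosetta cannot actually guarantee), each one costs a fixed number of seconds, and an interruption loses on average half an interval of work. The 5-second checkpoint cost and the function names are hypothetical.

```python
import math


def daily_loss(interval_s, checkpoint_cost_s, crunch_s, interruptions):
    """Seconds lost per day to writing checkpoints plus redone work."""
    checkpoints = crunch_s / interval_s
    # On average an interruption lands halfway between two checkpoints.
    redone = interruptions * interval_s / 2
    return checkpoints * checkpoint_cost_s + redone


def best_interval(checkpoint_cost_s, crunch_s, interruptions):
    """Interval minimizing daily_loss; infinite if never interrupted."""
    if interruptions == 0:
        return math.inf
    return math.sqrt(2 * checkpoint_cost_s * crunch_s / interruptions)


if __name__ == "__main__":
    ten_hours = 10 * 3600
    # One shutdown per day, 10 hours of crunching, assumed 5 s per checkpoint.
    print(best_interval(5, ten_hours, 1))
    for interval in (60, 600, 3600):
        print(interval, daily_loss(interval, 5, ten_hours, 1))
```

Under these assumptions, checkpointing every minute costs far more than it saves for the once-a-day case, while a machine that is never interrupted only ever pays the checkpoint cost, so every extra checkpoint is pure overhead for it.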

Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 373,953
RAC: 0
Message 40771 - Posted: 12 May 2007, 9:39:50 UTC

I'm crunching this WU at the moment: 1npsA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1npsA-frags83__1714_6954

I exited just after the ab initio stage finished for the first model.
Once restarted, the model resumed from a checkpoint at the start of the relax stage.

But the CPU time and percentage reset to zero. Not a major bug, but if I remember correctly you have tried to fix this in a prior release, so I thought I'd let you know.

EW-3
Joined: 1 Sep 06
Posts: 27
Credit: 2,561,427
RAC: 0
Message 40830 - Posted: 12 May 2007, 16:03:23 UTC

Running WIN XP SP2

getting

5/12/2007 11:48:53 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/12/2007 11:48:53 AM|rosetta@home|Reason: To fetch work
5/12/2007 11:48:53 AM|rosetta@home|Requesting 8640 seconds of new work
5/12/2007 11:48:58 AM|rosetta@home|Scheduler request succeeded
5/12/2007 11:48:58 AM|rosetta@home|No work from project

EW-3
Joined: 1 Sep 06
Posts: 27
Credit: 2,561,427
RAC: 0
Message 40837 - Posted: 12 May 2007, 16:55:33 UTC - in response to Message 40830.  

Must be magic - all OK now ;)

5/12/2007 12:18:45 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/12/2007 12:18:45 PM|rosetta@home|Reason: To fetch work
5/12/2007 12:18:45 PM|rosetta@home|Requesting 8640 seconds of new work
5/12/2007 12:18:50 PM|rosetta@home|Scheduler request succeeded
5/12/2007 12:18:52 PM|rosetta@home|Started download of file 1ctf_.fasta
5/12/2007 12:18:52 PM|rosetta@home|Started download of file 1ctf_.psipred_ss2.gz
5/12/2007 12:18:53 PM|rosetta@home|Finished download of file 1ctf_.fasta
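
For anyone wanting to scan a saved message log for the "No work from project" case, here is a small sketch. The `timestamp|project|message` layout and the date format are inferred only from the lines quoted in these posts, not from BOINC documentation, and the names `LogEntry`, `parse_log_line` and `got_no_work` are made up for the example.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    project: str
    message: str


def parse_log_line(line: str) -> LogEntry:
    """Split one 'timestamp|project|message' line into its three fields."""
    # maxsplit=2 keeps any '|' inside the message text intact.
    stamp, project, message = line.strip().split("|", 2)
    # Format matches the quoted lines, e.g. '5/12/2007 11:48:53 AM'.
    timestamp = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return LogEntry(timestamp, project, message)


def got_no_work(lines) -> bool:
    """True if any entry reports 'No work from project'."""
    return any(
        parse_log_line(line).message == "No work from project"
        for line in lines
    )


if __name__ == "__main__":
    sample = [
        "5/12/2007 11:48:53 AM|rosetta@home|Requesting 8640 seconds of new work",
        "5/12/2007 11:48:58 AM|rosetta@home|No work from project",
    ]
    print(got_no_work(sample))
```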


Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 40851 - Posted: 12 May 2007, 18:38:06 UTC

Don't think we will run out of work now!
5,000+ in queue and 48,000+ ready to send, so that's 53,000+ WUs in line!

Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 40916 - Posted: 14 May 2007, 0:32:41 UTC
Last modified: 14 May 2007, 0:33:39 UTC

Workunit 71222475 is hanging on one of my systems (state is running, but no cpu time accumulates). The following is the contents of stderr.txt:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 2862215
ERROR:: Exit from: pose.cc line: 769
SIGABRT: abort called
Stack trace (21 frames):
[0x8cbf0fb]
[0x8cb9f2c]
[0xffffe420]
[0x8d2a0b4]
[0x8d3ef9f]
[0x8d44005]
[0x8d442e3]
[0x8d14d11]
[0x8d16739]
[0x84aacad]
[0x8d2a5ff]
[0x8cbbbbf]
[0x8063a3d]
[0x8064905]
[0x88baf95]
[0x83402ed]
[0x85b4a7f]
[0x86d8113]
[0x86d81be]
[0x8d22ff4]
[0x8048111]

Exiting...


I'm going to abort this workunit since it is obviously not going to go anywhere.
Team Helix

Neil
Joined: 7 Mar 07
Posts: 25
Credit: 135,539
RAC: 0
Message 40930 - Posted: 14 May 2007, 5:46:13 UTC - in response to Message 40729.  


Mod.Sense wrote:
BOINC does not notify the application prior to exit. And even if it did, Rosetta does not have the capability to checkpoint at a forced point in time. It reaches predefined points in the model's computation and those are the only points where it can do a checkpoint.


I was slow to reply because your meaning was slow to sink into my dense cranium. However, you couldn't say it any clearer. Checkpoints can only be created at predefined points in the model's computation. And it sounds like Rosetta is already taking advantage of most of those predefined points.

OK, how about this for a compromise: How about adding an audible alert whenever Rosetta "does a checkpoint" or starts a new WU? If my WinXP starts to run hairy and if I'm able to wait until Boinc beeps, then I could take the opportunity to restart Windows without losing hardly any work.

Of course, the beep should be user-selectable with an On/Off switch. I would only switch the alert On when my computer starts getting an anxious aura, and it should conveniently automatically reset to Off after Boinc restarts.

Then, I would be one with my Boinc.

"And one day, man will serve machines."

-Neil-

PS. Greg_Be, thanks for the moral support :)

Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 40934 - Posted: 14 May 2007, 8:50:59 UTC - in response to Message 40930.  

Well, I keep my audio off, so maybe a balloon message, or something generated by RAH that is written into the messages portion of BOINC?



Doug Worrall
Joined: 19 Sep 05
Posts: 60
Credit: 58,445
RAC: 0
Message 40940 - Posted: 14 May 2007, 11:06:30 UTC
Last modified: 14 May 2007, 11:10:42 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=79610297

Hello,
I have not posted in a long time due to job duties and health. Had this rather "LARGE" w/u {url at top of post} that crunched for 4 hours while finding 1 decoy from 1 attempt. To me this must be an error, even though it was successful. Maybe I just want to post, say hello to the staff and say "Great Job". All Rosetta w/u have been mostly the same size, some fetching 18 decoys, with an average of 10.
Noticed Rosetta had no w/u queued on Friday, I believe. Had to crunch "another" experiment until there were more w/u in the queue. Have not been to the boards in a long time. No complaints; hope I have the right thread also. Am running a Linux distro.

Doug

Odysseus
Joined: 3 May 07
Posts: 14
Credit: 241,831
RAC: 0
Message 40969 - Posted: 14 May 2007, 19:41:03 UTC

Another crash on my Mac G4/733, this time with exit status 6 (0x6): 1e6iA_BOINC_CORRECTION2_ABRELAX_SAVE_ALL_OUT_BARCODE-1e6iA-frags83__1714_4852. As before, lots of data that I don’t understand, but that a programmer might, in the output file.

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 40999 - Posted: 15 May 2007, 5:08:31 UTC - in response to Message 40940.  
Last modified: 15 May 2007, 5:10:01 UTC

Hi Doug, thanks for posting, and also thanks for your encouragement. That was indeed a really big job you crunched -- these workunits with 1GID in the name can take between two and four hours, depending on the machine. Due to the amount of time required, we're assigning quite a bit of credit for each decoy from these workunits.

The nice thing is that we've implemented checkpointing, so that even if you stop crunching for a while, when you return to Rosetta@home, you can pick up basically where you left off. The other *really* nice thing is that the results look awesome -- we're seeing some beautiful structures for this very large molecule.


Doug Worrall
Joined: 19 Sep 05
Posts: 60
Credit: 58,445
RAC: 0
Message 41001 - Posted: 15 May 2007, 5:39:58 UTC
Last modified: 15 May 2007, 5:41:04 UTC

Thanks RHIJU,

Just finished another biggie here: https://boinc.bakerlab.org/rosetta/result.php?resultid=79676296

And it makes me very happy to know that these 1 decoy w/u are actually good for Rosey. Am learning that the checkpoints are working well: I quit a session, actually rebooted during this very w/u, and it did not fail. Rosey has come a long way. Should take a look at Ralph again soon, after Rosey is no longer the Project of the Month at B.S.
"Happy Crunching"
Great work Scientists, and Moderators, and all staff at Rosetta@home
Sincerely
Doug

zoom314
Joined: 4 May 07
Posts: 13
Credit: 118,553
RAC: 0
Message 41002 - Posted: 15 May 2007, 8:08:24 UTC
Last modified: 15 May 2007, 8:31:44 UTC

Never mind, I think I fixed it.

70% of memory when computer is in use; stock is lower.

Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 41014 - Posted: 15 May 2007, 14:51:49 UTC

This work unit must be stuck someplace, as it is still on the web page as new from the 9th and it's due the 19th, but it is no longer in my BOINC manager.

1r69__BOINC_ABRELAX_SAVE_ALL_OUT_BARCODE-1r69_-frags83__1706_5360_0

My current work is from the 13-15th and due the 23-25th.

Any ideas?

©2024 University of Washington
https://www.bakerlab.org