Report Problems with Rosetta Version 5.22

tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 19012 - Posted: 20 Jun 2006, 20:09:31 UTC
Last modified: 20 Jun 2006, 20:15:08 UTC

When I restarted my computer I lost over an hour on this WU. On restart it went back to 0% after running for about an hour on my fast Athlon 64 @ 2.44 GHz. Obviously no checkpoint occurred during this time. I know t296 is very big, but no checkpoint within an hour is not good (since one hour is the default switch time of BOINC).

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19016 - Posted: 20 Jun 2006, 21:05:18 UTC

rriggs, if you are actively using that computer much at all... your settings are preventing you from getting much work done. You see you've told BOINC to wait until you've not used your computer for 3 minutes before it runs. Then it starts running. When you return to use your computer, you've told it to remove the applications from memory, and so any work it has performed since the last checkpoint will be lost. Since Rosetta typically checkpoints no more than every 20 minutes, if you have left your computer for 15 minutes, then you've crunched for 12 minutes (after waiting for the 3 minute delay before it starts) and then when you use your computer again, you are throwing away the 12 minutes of work. And so you later have to redo that 12 minutes of work.

I don't know how aggressively you intend to crunch. But you can preserve the work done (12 min. in my example) to continue on it later by setting the "leave in memory" setting to YES. You've got 2GB of memory, so that gives you plenty of room. Also, it just keeps it in virtual memory, not actually the physical memory of the machine. So, changing this setting will preserve these short work periods, and not impact your computer use. By keeping applications in memory, you would only lose bits of work when you actually turn off the computer.
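
The arithmetic in the example above can be sketched in a few lines of Python. This is a simplified model, not how BOINC or Rosetta actually schedules checkpoints: the function name `lost_minutes` and its parameters are made up for illustration, and it assumes a checkpoint lands exactly every `checkpoint_interval` minutes of crunching, which real work units will not do.

```python
def lost_minutes(away, idle_delay, checkpoint_interval, leave_in_memory):
    """Estimate minutes of crunching discarded when the user returns.

    All arguments are in minutes. Hypothetical model: checkpoints are
    assumed to occur exactly every `checkpoint_interval` minutes.
    """
    if leave_in_memory:
        # The task is only suspended, so nothing is thrown away.
        return 0
    # BOINC waits `idle_delay` minutes before it starts crunching.
    crunched = max(0, away - idle_delay)
    # Only the work done since the last checkpoint is lost.
    return crunched % checkpoint_interval


print(lost_minutes(15, 3, 20, leave_in_memory=False))  # 12
print(lost_minutes(15, 3, 20, leave_in_memory=True))   # 0
```

With the numbers from the post (15 minutes away, 3 minute delay, 20 minute checkpoint interval) the model gives the same 12 minutes of lost work, and none once "leave in memory" is set to YES.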

Now, you also have a dual-core CPU. So you could be crunching 2 work units at the same time. But you have set BOINC to only use one. You can set the "On multiprocessors, use at most" setting to 2 and use both of them. I'm not positive what it does when you have that set to zero.

It would be more aggressive still to crunch while your computer is in use. I take it you've got 2GB of memory because you have some pretty intense applications you wish to use. So, your current setting of NOT working while your computer is in use should probably remain. But, just FYI, I run with half as much memory and run it all the time, and there is no noticeable effect on my running applications.

Having said all of that... your errors are mostly the -107 errors. Looks like you get either a -107 or a -1 about 10% of the time. I'm not sure, perhaps leaving in memory will reduce your chances of hitting the -107 errors. But otherwise, I don't believe the above will resolve the problem you are having with erroring work units. They are already working on a fix for the -107 errors. There are a number of people hitting that more frequently lately.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 19037 - Posted: 21 Jun 2006, 7:42:53 UTC

I have t307__CASP7_ABRELAX_SAVE_ALL_OUT_BARCODE_hom001__714_20997_0 using rosetta version 5.22 and it has been running now for 24 hrs. It has been stuck on 100% for at least the last hour I have been watching it. Mem usage of Rosetta was 88M and is now 94M after 30 mins. Now 97M and climbing.
CPU usage doesn't change when I suspend the task from the BOINC manager.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=20861564

The show graphics screen says...
68.601% complete
CPU time: 24 hr 0 min
Stage: Ab initio + relax
Model 116 step 0
Accepted Energy 44.55485

Nothing is changing on the screen. The protein looks like a single zig-zag line

Target CPU time is set to 8 hrs.

Do I abort ?

Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 19041 - Posted: 21 Jun 2006, 7:59:09 UTC - in response to Message 19037.  

I ended up aborting it... The machine became unworkable.
I have reported it in another thread.

Ian
Joined: 14 Apr 06
Posts: 29
Credit: 361,497
RAC: 687
Message 19044 - Posted: 21 Jun 2006, 8:14:24 UTC

Another:

https://boinc.bakerlab.org/rosetta/result.php?resultid=25008379 (WU 21172214)

<core_client_version>5.2.13</core_client_version>
<message>process exited with code 131 (0x83)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3248719
SIGBUS: bus error

Ooo-er. That doesn't sound healthy.
Ian Cundell, St Albans, UK

Bober [B@P]
Joined: 12 Jun 06
Posts: 3
Credit: 48,690
RAC: 0
Message 19053 - Posted: 21 Jun 2006, 12:15:28 UTC - in response to Message 19044.  
Last modified: 21 Jun 2006, 12:17:42 UTC

Recently I've had -107 errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946846
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946856

I've just started crunching for Rosetta. I don't use any screensaver.
The same error has just occurred on my Ralph, but with the 5.24 app.

What can I do to avoid those errors?

rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 19060 - Posted: 21 Jun 2006, 14:38:07 UTC - in response to Message 19016.  
Last modified: 21 Jun 2006, 14:46:22 UTC

I never even saw this page, let alone adjusted the settings so these are the defaults. Perhaps the setup process should either pick better defaults or bring this page to my attention so I would have found it sooner?

I guess when I installed I just picked Activity|Run Always and Activity|Network always available, so it has been running non-stop! This may potentially invalidate your hypothesis about why I'm getting -107 errors since the app is never leaving memory.

ps. It had crashed this morning when I came in, so I tried to debug it, but my machine locked up launching the debugger. I will try again tomorrow!

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19062 - Posted: 21 Jun 2006, 14:45:45 UTC - in response to Message 19060.  
Last modified: 21 Jun 2006, 15:26:31 UTC

Perhaps the setup process should either pick better defaults or bring this page to my attention so I would have found it sooner?

My task manager has always shown them both crunching at 50% CPU.

What do you think?


OK, my apologies. As I said I wasn't certain what it does when CPUs is set to zero. So that isn't an issue. If you've got 2 WUs running at 50% CPU each then you are fully crunching... when your computer is not in use.

I would still suggest you set the leave in memory to YES. Save all the work done during coffee breaks and during meetings or conference calls or whatever pulls you away from the computer.

As for changing the setup process, unfortunately that is not something Rosetta could change. It would be changed by the BOINC folks. So you would have to take up that suggestion on the BOINC boards.

Every time your PC is idle for 3 minutes, BOINC will start crunching... then when you come back and use the computer, BOINC suspends... and removes from memory, that was the thought. ...except since you've not said to "run based on preferences"... it's actually running all the time, regardless of whether other applications are in use?

rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 19063 - Posted: 21 Jun 2006, 14:47:56 UTC - in response to Message 19062.  

Oops. I edited my post while you were replying. Please recheck it now!

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 0
Message 19064 - Posted: 21 Jun 2006, 14:48:52 UTC

Before leaving for work this morning I checked my Linux box. CPU was at 0%. The ps command showed the boinc and rosetta processes there but doing nothing. It looked like it had stopped just a short while after starting a new WU. I stopped and restarted boinc and the WU took off normally. It just finished and reported. Here's the WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=25090497

Charlie

-Charlie

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19065 - Posted: 21 Jun 2006, 15:23:11 UTC - in response to Message 19053.  

Lukasz, you've already done what you can (so far as I know). One of your results reported a lot of useful information that will help analyze the problem.

Your computer time is still helping the project, and you are still getting credit for all the time crunching, so do not be deterred. Running on Ralph records additional diagnostic information back to the project. Hopefully they can determine the root cause soon. I see you have 2 out of 5 of your WUs failed, and another that you aborted for some reason.

Bober [B@P]
Joined: 12 Jun 06
Posts: 3
Credit: 48,690
RAC: 0
Message 19069 - Posted: 21 Jun 2006, 15:48:54 UTC - in response to Message 19065.  
Last modified: 21 Jun 2006, 15:53:12 UTC


I see you have 2 out of 5 of your WUs failed, and another that you aborted for some reason.


The reason is I thought that it's my computer's fault and I didn't want it to spoil more WUs. I have to admit that PC was overclocked a bit and it is very hot today, so I had to change some settings. But don't worry I'm far from being discouraged. I will crunch again for Rosetta soon:)

Thank you for the reply!

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 0
Message 19080 - Posted: 21 Jun 2006, 20:45:20 UTC - in response to Message 19064.  

This might be a problem on my end. I came home from work to find the machine in the same state. I checked STDOUT from boinc (I redirect it to a file) and both this morning and this afternoon it complained about network problems. However, this afternoon restarting boinc didn't work. It was trying to download new work but the network problems were preventing it. Boinc kept shutting down. I also could not get out to the net in my web browser. So, I reset my router. It's either a problem with my router or the cable connection is messing up. Hard to tell which at this point, but this past weekend the router was hung so badly I had to do a hard reset and reconfigure it from scratch. The cable company's network status page shows some problems in some surrounding areas but not my particular area. Time to use that gift card from Best Buy!

Charlie
-Charlie

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 19102 - Posted: 22 Jun 2006, 3:49:26 UTC - in response to Message 18612.  

rosetta 5.22
WU Name: t316__CASP7_JUMPABINITIO_SAVE_ALL_OUT_BARCODE_secondhalf_hom019__726_329
running on Mac OS 10.4.6

BOINC Manager Tasks tab shows CPU Time stuck at 03:21:43 and 35.5%
top command shows TIME = 37:51:05 and climbing

stopped and restarted BOINC
CPU Time reverted to 02:50:49 and 35.5% but no longer stuck

This is on a G5 crunching only for rosetta.
The two previous instances of this problem occurred on a G4 crunching rosetta + ralph + einstein.
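
The size of the rollback after the restart follows from the two CPU times reported above. Here is a minimal sketch of that subtraction. The helper names are hypothetical, and reading the difference as "time since the last checkpoint" is an assumption rather than something the report confirms.

```python
from datetime import timedelta


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown in BOINC Manager, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def rollback(before_restart, after_restart):
    """Return how much reported CPU time disappeared across the restart."""
    return timedelta(seconds=to_seconds(before_restart) - to_seconds(after_restart))


print(rollback("03:21:43", "02:50:49"))  # 0:30:54
```

So roughly half an hour of reported CPU time was dropped when BOINC restarted, on top of whatever time the task spent stuck.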

[B^S] Dr. Bill Skiba
Joined: 26 Oct 05
Posts: 5
Credit: 238,426
RAC: 0
Message 19241 - Posted: 24 Jun 2006, 21:52:31 UTC

Just aborted this work unit.

https://boinc.bakerlab.org/rosetta/result.php?resultid=25006316

Stuck at 1hr 7min - suspended and resumed several times to no avail. The next Rosetta work unit seems to be running normally.

rosetta 5.22
windows 2K
athlon xp 2500 barton

Clare Jarvis
Joined: 14 Dec 05
Posts: 8
Credit: 874,698
RAC: 0
Message 19338 - Posted: 27 Jun 2006, 1:48:14 UTC

I have been having similar problems. I cannot leave Rosetta alone or it simply hangs. But if I visit and hit "Update" every day then I get much better production. Is this a problem with Rosetta or with BOINC? It is very frustrating. I wish the statistics page had the start time and date of each run along with the deadline.


Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 19418 - Posted: 28 Jun 2006, 14:38:25 UTC
Last modified: 28 Jun 2006, 15:09:05 UTC

I have had the problem of stalled/hanging Rosettas occasionally (somewhere in the middle, not at 0% or 1% or 100% progress) for ages already, on Red Hat EL 4.1. I am now using BCC 5.4.9, attached to 7 projects; Rosetta's share is ~20%. The computer runs for months between reboots, without graphics.

The symptoms are that the Rosetta app seems to be running, but the CPU time does not increase. Recently I've noticed that BCC is not even able to run benchmarks if this happens. IIRC, previously if BCC was able to switch to another app, that app got 0 CPU cycles (because Rosetta was consuming them all) and did not increment its time. Usually the only way to overcome this problem was to manually restart BCC. This way the Rosettas were able to continue and finish. (Whether correctly? I can now see a few (5) "process exited with code 131 (0x83)" messages since March in the logs.)
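
The symptom described here (the task appears to run but its CPU time stops advancing) can be checked mechanically from periodic readings. The following is only a sketch: `is_stalled`, its parameters, and the sample readings are hypothetical, and how the readings are collected (BOINC Manager, `ps`, `top`) is left open.

```python
def is_stalled(samples, window=3600.0, min_cpu_gain=1.0):
    """Return True if CPU time gained less than `min_cpu_gain` seconds
    over at least the last `window` seconds of wall-clock time.

    `samples` is a list of (wall_clock_seconds, cpu_seconds) pairs in
    chronological order. With too little history the answer is False.
    """
    if not samples:
        return False
    latest_wall, latest_cpu = samples[-1]
    # Walk backwards to the most recent sample that is old enough.
    for wall, cpu in reversed(samples):
        if latest_wall - wall >= window:
            return latest_cpu - cpu < min_cpu_gain
    return False


# Made-up readings: CPU time frozen at 2:43:29 (9809 s) for an hour.
frozen = [(0, 9809), (1800, 9809), (3600, 9809)]
print(is_stalled(frozen))  # True
```

A one hour window is a guess chosen to be longer than the gap between normal progress updates; a shorter window would risk flagging healthy work units.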

This time, a week ago, I made a few snapshots of the suspended Rosetta 5.22 result t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 and reported them in the "Rosetta WU's stall on RedHat Fedora" thread. It is stuck at 28.80% (2:43:29 CPU time), maybe for a day already. I'll try restarting BCC to see whether anything new shows up in the files in its slot/3/ dir. And then abort and report; it's now past the deadline anyway...

Yes, it restarted happily, CPU time jumped from 2:43:29 to 1:43:29 and is incrementing, but progress stayed at 28.80% and does not move. Aborting...

<core_client_version>5.4.9</core_client_version>
<message>
aborted by user
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 1940641
# cpu_run_time_pref: 21600
SIGSEGV: segmentation violation
Stack trace (14 frames):
[0x884cb9f]
[0x8864cfc]
[0x88cade8]
[0x8621564]
[0x87f229b]
[0x873b844]
[0x873d0af]
[0x85a95e9]
[0x85b190a]
[0x83d6c9f]
[0x86022d3]
[0x84740c8]
[0x88c41e4]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 21600
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x884cb9f]
[0x8864cfc]
[0x88cade8]
[0x88e5473]
[0x88b6601]
[0x88b8029]
[0x805fdd8]
[0x83d75de]
[0x83d90a0]
[0x83d8f89]
[0x83d72ca]
[0x88cb7ef]
[0x885bff0]
[0x8865f65]
[0x88f771a]

Exiting...
SIGSEGV: segmentation violation
Stack trace (14 frames):
[0x884cb9f]
[0x8864cfc]
[0x88cade8]
[0x853664c]
[0x854a184]
[0x830867c]
[0x8308fdf]
[0x86c4a6a]
[0x86c6f15]
[0x83d6f08]
[0x86022d3]
[0x84740c8]
[0x88c41e4]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 21600
ERROR:: Exit at: fragments.cc line:459
FILE_LOCK::unlock(): close failed.: Bad file descriptor

</stderr_txt>
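
A log like the one above is easier to compare across crashes once the stack traces are pulled out. This is a rough sketch, assuming the layout seen in this stderr output (a `Stack trace (N frames):` header followed by one bracketed hex address per line); the function names are hypothetical and nothing here resolves the addresses to symbols.

```python
import re

HEADER = re.compile(r"Stack trace \((\d+) frames\):")
FRAME = re.compile(r"\[(0x[0-9a-f]+)\]")


def parse_traces(stderr_text):
    """Return one dict per stack trace: declared frame count and addresses."""
    traces = []
    current = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        header = HEADER.fullmatch(line)
        frame = FRAME.fullmatch(line)
        if header:
            current = {"declared": int(header.group(1)), "frames": []}
            traces.append(current)
        elif frame and current is not None:
            current["frames"].append(frame.group(1))
        else:
            # Any other line (blank, "Exiting...", comments) ends the trace.
            current = None
    return traces


def common_prefix(traces):
    """Return the leading frames that every trace has in common."""
    prefix = []
    for frames in zip(*(trace["frames"] for trace in traces)):
        if len(set(frames)) != 1:
            break
        prefix.append(frames[0])
    return prefix
```

Applied to the output above, the three traces share their first three frames (0x884cb9f, 0x8864cfc, 0x88cade8) and diverge at the fourth. Those shared frames may just be the signal handling path rather than the crash site, but that is a guess without symbols.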

Peter

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19422 - Posted: 28 Jun 2006, 16:21:28 UTC - in response to Message 19338.  

But if I visit and hit "Update" every day then I get much better production. Is this a problem with Rosetta or with Boinc.


BOINC is responsible for contacting the projects that it needs to get work from. Performing an update wouldn't have much to do with a hung work unit. Are you saying you end up without work? Or are you saying that your existing WUs are not ending properly?

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19423 - Posted: 28 Jun 2006, 16:24:06 UTC

Pepo: I'm not clear how long you observed the running of the WU after restarting it. But the progress % does not change very frequently and this is normal. Here is some relevant information on the subject. Perhaps you are saying you let it run for over an hour with no progress... that would be another matter. But, if not, that portion of what you are describing is probably normal and does not require your intervention to abort.

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 19425 - Posted: 28 Jun 2006, 16:39:14 UTC - in response to Message 19423.  
Last modified: 28 Jun 2006, 16:40:16 UTC

Yes, I've read the FAQ. If you look at the Rosetta WU's stall on RedHat Fedora thread I mentioned, the Rosetta was hung for at least more than a day, I could look into the logs to tell exactly.

I usually check the machine once every day or two (because of Rosetta :-) and restart BOINC if this happens. And it has been happening for a long time already, I'm pretty sure for a few months.

Peter
©2025 University of Washington
https://www.bakerlab.org