Report Problems with Rosetta Version 5.22

Message boards : Number crunching : Report Problems with Rosetta Version 5.22

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 18940 - Posted: 19 Jun 2006, 17:26:32 UTC - in response to Message 18934.  

It looks like most of them were ended by the "watchdog". One was a -107 error (which is something that's been under review for a while already).


Correction, I misread that "watchdog is shutting down" message (again!). I keep thinking this message indicates that the watchdog is shutting down the WU, not just ending itself as a normal end of processing a WU.

Most of their errors were -107s.

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 18940 · Rating: 0
rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 18978 - Posted: 20 Jun 2006, 15:14:25 UTC - in response to Message 18934.  


The watchdog is trying to ensure your computer doesn't get stuck in an unexpected loop on a work unit. If it notices no progress on a work unit in 5 restarts, then it ends it. Do you restart this computer frequently? Or have a number of other projects running in BOINC?

If you would, go to your General Preferences, and let us know what you have set for "Switch between applications every...minutes", and for "Leave applications in memory while preempted?". And is Rosetta your only BOINC project?


I'll try to answer your questions here:

Machine is rarely restarted, once every 2-3 days.

This is the only project I have under BOINC. No other background/SETI type applications are installed.

I'm not sure where this "General Preferences" dialog is you're referring to. I don't see anything like this in BOINC.

I am an accomplished C++/Java/.NET developer with Visual Studio installed on this box. If you need me to grab a stack trace, I'd be happy to next time!
ID: 18978 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 18983 - Posted: 20 Jun 2006, 15:50:45 UTC - in response to Message 18978.  

I'm not sure where this "General Preferences" dialog is you're referring to. I don't see anything like this in BOINC.


Now that you are viewing this message board, click the "Participants" link in the heading of the screen. In the "Preferences" section, click the link for "view or edit" of General preferences. Any changes made there require BOINC to update to the project to take effect. This is done from the projects tab of BOINC, select Rosetta, then click the update button.

ID: 18983 · Rating: 0
Bandit
Joined: 21 May 06
Posts: 12
Credit: 197,197
RAC: 0
Message 18998 - Posted: 20 Jun 2006, 17:54:18 UTC - in response to Message 18196.  

In follow-up to Message ID 18855, as long as I don't have IE running, I don't seem to have any BOINC problems. If I leave IE on, I have intermittent BOINC crashes. For me, it does not seem to be the screensaver at this time.

Bandit's Mom
ID: 18998 · Rating: 0
andrewsi
Joined: 19 Jun 06
Posts: 1
Credit: 10,139,108
RAC: 0
Message 19008 - Posted: 20 Jun 2006, 19:16:58 UTC
Last modified: 20 Jun 2006, 19:19:37 UTC

Ran into a compute error with 5.22.

6/20/2006 12:12:35 PM|rosetta@home|Unrecoverable error for result t304__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom001__691_17229_0 ( - exit code -1 (0xffffffff)).
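
For anyone collecting these from their own message log, here is a rough sketch of pulling the fields out of a line like the one above. It assumes the log fields are separated by "|" as in this example, and it shows why -1 and 0xffffffff are the same exit code (the signed value viewed as an unsigned 32-bit number). The helper names are made up for illustration and are not part of BOINC.

```python
import re

def parse_boinc_log_line(line):
    # Assumption: fields are "timestamp|project|message", as in the line quoted above.
    timestamp, project, message = line.split("|", 2)
    return {"timestamp": timestamp, "project": project, "message": message}

def to_unsigned32(code):
    # View a signed exit code as an unsigned 32-bit value.
    return code & 0xFFFFFFFF

line = ("6/20/2006 12:12:35 PM|rosetta@home|Unrecoverable error for result "
        "t304__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom001__691_17229_0 "
        "( - exit code -1 (0xffffffff)).")

entry = parse_boinc_log_line(line)
exit_code = int(re.search(r"exit code (-?\d+)", entry["message"]).group(1))
print(entry["project"], exit_code, hex(to_unsigned32(exit_code)))
```

The hex value carries no extra information beyond the -1; it does not by itself say what went wrong inside the application.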

Looks like it was: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=21222160

What other information should I provide?

ID: 19008 · Rating: 0
rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 19009 - Posted: 20 Jun 2006, 19:20:55 UTC - in response to Message 18983.  


Now that you are viewing this message board, click the "Participants" link in the heading of the screen. In the "Preferences" section, click the link for "view or edit" of General preferences. Any changes made there require BOINC to update to the project to take effect. This is done from the projects tab of BOINC, select Rosetta, then click the update button.


You didn't say what these 'should be' so I'm just reporting what they currently are and not changing anything:

work on batteries: no
work while in use: no
idle: 3 mins
hours: (no restrictions)
leave in memory: no
switch between: 60 mins
multiprocessors: 0 processors (although I have two of them!?)
use at most: 100 percent of CPU


ID: 19009 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 19012 - Posted: 20 Jun 2006, 20:09:31 UTC
Last modified: 20 Jun 2006, 20:15:08 UTC

When I restarted my computer I lost over an hour on this WU. It went back at restart to 0% after running about an hour on my fast Athlon 64 @2.44 GHz. Obviously no checkpoint occurred during this time. I know t296 is very big but no checkpoint within an hour is not good (since one hour is the default switch time of BOINC).
ID: 19012 · Rating: 1
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19016 - Posted: 20 Jun 2006, 21:05:18 UTC

rriggs, if you are actively using that computer much at all... your settings are preventing you from getting much work done. You see you've told BOINC to wait until you've not used your computer for 3 minutes before it runs. Then it starts running. When you return to use your computer, you've told it to remove the applications from memory, and so any work it has performed since the last checkpoint will be lost. Since Rosetta typically checkpoints no more than every 20 minutes, if you have left your computer for 15 minutes, then you've crunched for 12 minutes (after waiting for the 3 minute delay before it starts) and then when you use your computer again, you are throwing away the 12 minutes of work. And so you later have to redo that 12 minutes of work.

I don't know how aggressively you intend to crunch. But you can preserve the work done (12 min. in my example) to continue on it later by setting the "leave in memory" setting to YES. You've got 2GB of memory, so that gives you plenty of room. Also, it just keeps it in virtual memory, not actually the physical memory of the machine. So, changing this setting will preserve these short work periods, and not impact your computer use. By keeping applications in memory, you would only lose bits of work when you actually turn off the computer.
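
To put numbers on the example above, here is a small sketch of the arithmetic. It is a simplification under a loud assumption: checkpoints are treated as landing at a fixed interval, which real work units do not guarantee. The function and parameter names are only for illustration.

```python
def work_lost_minutes(away_min, idle_delay_min, checkpoint_interval_min, leave_in_memory):
    # Crunching only starts after the idle delay has passed.
    crunched = max(0, away_min - idle_delay_min)
    if leave_in_memory:
        # The task stays suspended in memory, so nothing has to be redone.
        return 0
    # Assumption: checkpoints land exactly every checkpoint_interval_min minutes,
    # so only the work since the last checkpoint is thrown away.
    return crunched % checkpoint_interval_min

# Away 15 minutes, 3 minute idle delay, 20 minute checkpoint interval.
print(work_lost_minutes(15, 3, 20, leave_in_memory=False))
print(work_lost_minutes(15, 3, 20, leave_in_memory=True))
```

With "leave in memory" set to no, the first call gives the 12 minutes from the example; with it set to yes, nothing is lost.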

Now, you also have a dual-core CPU. So you could be crunching 2 work units at the same time. But you have set BOINC to only use one. You can set the "On multiprocessors, use at most" setting to 2 and use both of them. I'm not positive what it does when you have that set to zero.

It would be even more aggressive to crunch while your computer is in use. I take it you've got 2GB of memory because you have some pretty intense applications you wish to use. So, your current setting of NOT working while your computer is in use should probably remain. But, just FYI, I run with half as much memory and run it all the time, and there is no noticeable effect on my running applications.

Having said all of that... your errors are mostly the -107 errors. Looks like you get either a -107 or a -1 about 10% of the time. I'm not sure, perhaps leaving in memory will reduce your chances of hitting the -107 errors. But otherwise, I don't believe the above will resolve the problem you are having with erroring work units. They are already working on a fix for the -107 errors. There are a number of people hitting that more frequently lately.
ID: 19016 · Rating: 0
Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 19037 - Posted: 21 Jun 2006, 7:42:53 UTC

I have t307__CASP7_ABRELAX_SAVE_ALL_OUT_BARCODE_hom001__714_20997_0 using rosetta version 5.22 and it has been running now for 24 hrs. It has been stuck on 100% for at least the last hour I have been watching it. Mem usage of Rosetta was 88M and is now 94M after 30 mins. Now 97M and climbing.
CPU usage doesn't change when I suspend the task from the BOINC manager.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=20861564

The show graphics screen says...
68.601% complete
CPU time: 24 hr 0 min
Stage: Ab initio + relax
Model 116 step 0
Accepted Enrgy 44.55485

Nothing is changing on the screen. The protein looks like a single zig-zag line

Target CPU time is set to 8 hrs.

Do I abort ?
ID: 19037 · Rating: 1
Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 19041 - Posted: 21 Jun 2006, 7:59:09 UTC - in response to Message 19037.  

I ended up aborting it... The machine became unworkable.
I have reported it in another thread.
ID: 19041 · Rating: 0
Ian
Joined: 14 Apr 06
Posts: 29
Credit: 164,684
RAC: 446
Message 19044 - Posted: 21 Jun 2006, 8:14:24 UTC

Another:

https://boinc.bakerlab.org/rosetta/result.php?resultid=25008379 (WU 21172214)

<core_client_version>5.2.13</core_client_version>
<message>process exited with code 131 (0x83)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3248719
SIGBUS: bus error

Ooo-er. That doesn't sound healthy.
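
In case it helps when reading these reports, here is a quick sketch of picking the "# name: value" header lines out of a stderr block like the one above. It assumes cpu_run_time_pref is given in seconds (which would make 10800 a 3 hour target); that unit is an assumption, not something the output states. It also confirms that 131 and 0x83 are the same number, without claiming what the code means.

```python
import re

stderr_txt = """# cpu_run_time_pref: 10800
# random seed: 3248719
SIGBUS: bus error"""

def parse_header_fields(text):
    # Collect lines of the form "# name: integer"; other lines are ignored.
    fields = {}
    for line in text.splitlines():
        match = re.match(r"#\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

fields = parse_header_fields(stderr_txt)
# Assumption: the preference is expressed in seconds.
run_time_hours = fields["cpu_run_time_pref"] / 3600
print(fields, run_time_hours, hex(131))
```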
Ian Cundell, St Albans, UK
ID: 19044 · Rating: 0
Bober [B@P]
Joined: 12 Jun 06
Posts: 3
Credit: 48,690
RAC: 0
Message 19053 - Posted: 21 Jun 2006, 12:15:28 UTC - in response to Message 19044.  
Last modified: 21 Jun 2006, 12:17:42 UTC

Recently I've had -107 errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946846
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946856

I've just started crunching for Rosetta. I don't use any screensaver.
The same error has just occurred on my Ralph but with 5.24 app.

What can I do to avoid those errors?
ID: 19053 · Rating: 0
rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 19060 - Posted: 21 Jun 2006, 14:38:07 UTC - in response to Message 19016.  
Last modified: 21 Jun 2006, 14:46:22 UTC

rriggs, if you are actively using that computer much at all... your settings are preventing you from getting much work done. You see you've told BOINC to wait until you've not used your computer for 3 minutes before it runs. Then it starts running. When you return to use your computer, you've told it to remove the applications from memory, and so any work it has performed since the last checkpoint will be lost. Since Rosetta typically checkpoints no more than every 20 minutes, if you have left your computer for 15 minutes, then you've crunched for 12 minutes (after waiting for the 3 minute delay before it starts) and then when you use your computer again, you are throwing away the 12 minutes of work. And so you later have to redo that 12 minutes of work.

Now, you also have a dual-core CPU. So you could be crunching 2 work units at the same time. But you have set BOINC to only use one. You can set the "On multiprocessors, use at most" setting to 2 and use both of them. I'm not positive what it does when you have that set to zero.


I never even saw this page, let alone adjusted the settings so these are the defaults. Perhaps the setup process should either pick better defaults or bring this page to my attention so I would have found it sooner?

I guess when I installed I just picked Activity|Run Always and Activity|Network always available, so it has been running non-stop! This may potentially invalidate your hypothesis about why I'm getting -107 errors since the app is never leaving memory.

ps. It was crashed this morning when I came in, so I tried to debug it, but my machine locked up launching the debugger. I will try again tomorrow!

ID: 19060 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19062 - Posted: 21 Jun 2006, 14:45:45 UTC - in response to Message 19060.  
Last modified: 21 Jun 2006, 15:26:31 UTC

Perhaps the setup process should either pick better defaults or bring this page to my attention so I would have found it sooner?

My task manager has always shown them both crunching at 50% CPU.

What do you think?


OK, my apologies. As I said I wasn't certain what it does when CPUs is set to zero. So that isn't an issue. If you've got 2 WUs running at 50% CPU each then you are fully crunching... when your computer is not in use.

I would still suggest you set the leave in memory to YES. Save all the work done during coffee breaks and during meetings or conference calls or whatever pulls you away from the computer.

As for changing the setup process, unfortunately that is not something Rosetta could change. It would be changed by the BOINC folks. So you would have to take up that suggestion on the BOINC boards.

Every time your PC is idle for 3 minutes, BOINC will start crunching... then when you come back and use the computer, BOINC suspends... and removes from memory, that was the thought. ...except since you've not said to "run based on preferences"... it's actually running all the time, regardless of whether other applications are in use?
ID: 19062 · Rating: 0
rriggs
Joined: 5 Jun 06
Posts: 5
Credit: 48,672
RAC: 0
Message 19063 - Posted: 21 Jun 2006, 14:47:56 UTC - in response to Message 19062.  

OK, my apologies. As I said I wasn't certain what it does when CPUs is set to zero. So that isn't an issue. If you've got 2 WUs running at 50% CPU each then you are fully crunching... when your computer is not in use.


Oops. I edited my post while you were replying. Please recheck it now!

ID: 19063 · Rating: 0
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 19064 - Posted: 21 Jun 2006, 14:48:52 UTC

Before leaving for work this morning I checked my Linux box. CPU was at 0%. The ps command showed the boinc and rosetta processes there but doing nothing. Looked like it had stopped just a short while after starting a new WU. I stopped and restarted boinc and the WU took off normally. It just finished and reported. Here's the WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=25090497

Charlie

-Charlie
ID: 19064 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 19065 - Posted: 21 Jun 2006, 15:23:11 UTC - in response to Message 19053.  

Recently I've had -107 errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946846
https://boinc.bakerlab.org/rosetta/result.php?resultid=24946856

I've just started crunching for Rosetta. I don't use any screensaver.
The same error has just occurred on my Ralph but with 5.24 app.

What can I do to avoid those errors?


Lukasz, you've already done what you can (so far as I know). One of your results reported a lot of useful information that will help analyze the problem.

Your computer time is still helping the project, and you are still getting credit for all the time crunching, so do not be deterred. Running on Ralph records additional diagnostic information back to the project. Hopefully they can determine the root cause soon. I see you have 2 out of 5 of your WUs failed, and another that you aborted for some reason.
ID: 19065 · Rating: 0
Bober [B@P]
Joined: 12 Jun 06
Posts: 3
Credit: 48,690
RAC: 0
Message 19069 - Posted: 21 Jun 2006, 15:48:54 UTC - in response to Message 19065.  
Last modified: 21 Jun 2006, 15:53:12 UTC


I see you have 2 out of 5 of your WUs failed, and another that you aborted for some reason.


The reason is I thought that it was my computer's fault and I didn't want it to spoil more WUs. I have to admit that the PC was overclocked a bit and it is very hot today, so I had to change some settings. But don't worry, I'm far from being discouraged. I will crunch again for Rosetta soon :)

Thank you for the reply!
ID: 19069 · Rating: 0
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 19080 - Posted: 21 Jun 2006, 20:45:20 UTC - in response to Message 19064.  

Before leaving for work this morning I checked my Linux box. CPU was at 0%. The ps command showed the boinc and rosetta processes there but doing nothing. Looked like it had stopped just a short while after starting a new WU. I stopped and restarted boinc and the WU took off normally. It just finished and reported. Here's the WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=25090497

Charlie


This might be a problem on my end. Came home from work to find the machine in the same state. Checked STDOUT from boinc (I redirect it to a file) and both this morning and this afternoon it complained about network problems. However, this afternoon restarting boinc didn't work. It was trying to download new work but the network problems were preventing it. Boinc kept shutting down. I also could not get out to the net in my web browser. So, I reset my router. It's either a problem with my router or the cable connection is messing up. Hard to tell which at this point, but this past weekend the router was hung so badly I had to do a hard reset and reconfigure it from scratch. The cable company's network status page shows some problems in some surrounding areas but not my particular area. Time to use that gift card from Best Buy!

Charlie
-Charlie
ID: 19080 · Rating: 0
TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 19102 - Posted: 22 Jun 2006, 3:49:26 UTC - in response to Message 18612.  

rosetta 5.22
WU Name: t316__CASP7_JUMPABINITIO_SAVE_ALL_OUT_BARCODE_secondhalf_hom019__726_329
running on Mac OS 10.4.6

BOINC Manager Tasks tab shows CPU Time stuck at 03:21:43 and 35.5%
top command shows TIME = 37:51:05 and climbing

stopped and restarted BOINC
CPU Time reverted to 02:50:49 and 35.5% but no longer stuck

This is on a G5 crunching only for rosetta.
The two previous instances of this problem occurred on a G4 crunching rosetta + ralph + einstein.
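
A rough sketch of how much CPU time the restart rolled back, using the two BOINC Manager values above and assuming both are in HH:MM:SS. The top TIME value is left out because its format is less certain. The helper names are only for illustration.

```python
def hms_to_seconds(text):
    # Assumption: the value is formatted as HH:MM:SS.
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def seconds_to_hms(total):
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

stuck_at = hms_to_seconds("03:21:43")       # CPU Time shown while stuck
after_restart = hms_to_seconds("02:50:49")  # CPU Time after restarting BOINC
rolled_back = stuck_at - after_restart
print(seconds_to_hms(rolled_back))
```

That works out to a rollback of just under 31 minutes, presumably back to the last checkpoint.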
ID: 19102 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org