Report Problems with Rosetta Version 5.25

mnb

Joined: 15 Dec 05
Posts: 51
Credit: 69,458
RAC: 0
Message 21378 - Posted: 29 Jul 2006, 14:24:40 UTC - in response to Message 21354.  


Ananas wrote:
Maybe the fan is dirty?

Yeah, there was a problem with my CPU fan. For some reason it dropped to 0 rpm about every 3 seconds, although there was no visible sign that it was malfunctioning. I cleaned the heatsink and fan and removed the metal ring covering the fan, and from now on I'm going to keep the PCalert4 monitoring program running. I'm also using BES to throttle CPU usage to 75%. It seems to lower the temperature by some 5 degrees Celsius.

Thank you very much.


list of my results
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 3,251,459
RAC: 1,311
Message 21482 - Posted: 31 Jul 2006, 16:55:13 UTC

Result ID 30771861
Name t347__CASP7_ASSEMBLEABRELAX_SAVE_ALL_OUT_new6to205hom022__991_873_1
Workunit 25596465
Created 31 Jul 2006 6:59:08 UTC
Sent 31 Jul 2006 10:15:14 UTC
Received 31 Jul 2006 16:47:30 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -2147483645 (0x80000003)
Computer ID 263791
Report deadline 7 Aug 2006 10:15:14 UTC
CPU time 14406.359375
stderr out <core_client_version>5.4.9</core_client_version>
<message>
One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003)
</message>
<stderr_txt>
# random seed: 3336638
# cpu_run_time_pref: 3600
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 14406.4 seconds. Greater than 4X preferred time: 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xxt347.out
WARNING! attempt to gzip file .xxt347.out failed: file does not exist.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77F767CD

Engaging BOINC Windows Runtime Debugger...



********************
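The exit status in the result above appears both as a signed decimal (-2147483645) and as hex (0x80000003). A minimal Python sketch of how the two relate, assuming the status is a 32-bit value printed as a signed integer (the helper name `to_unsigned32` is made up for illustration):

```python
def to_unsigned32(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


# Exit status copied from the result page above.
status = -2147483645
print(hex(to_unsigned32(status)))  # 0x80000003
```

The hex form is the one to search for; it matches the "Breakpoint Encountered (0x80000003)" line in the same stderr output.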

Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 21605 - Posted: 2 Aug 2006, 5:48:24 UTC
Last modified: 2 Aug 2006, 5:57:28 UTC

There have been a few WUs lately with much higher RAM requirements than usual, in the region of ~200MB. Maybe one of those has caused some of the problems.

p.s.: those WUs have a much lower decoys/hour rate too (e.g. this one and this one too)
Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 21613 - Posted: 2 Aug 2006, 8:17:32 UTC

I've had at least 5 WUs fail in the last week because they ran for over 12 hours (with the default 3 hour target CPU time in effect), and another is going to fail within the next hour (it's already up to 12.5 hours). They're not getting credit either.
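The 12 hour cutoff is consistent with the watchdog message in the stderr output posted earlier in this thread ("Greater than 4X preferred time"). A rough Python sketch of that rule, assuming the watchdog simply compares CPU time against four times the preferred run time; the function names are hypothetical and this is not the actual Rosetta code:

```python
# Factor taken from the stderr line "Greater than 4X preferred time".
WATCHDOG_FACTOR = 4


def watchdog_limit(preferred_seconds: float, factor: int = WATCHDOG_FACTOR) -> float:
    """CPU-time limit (seconds) after which the watchdog would end the run."""
    return preferred_seconds * factor


def should_end_run(cpu_seconds: float, preferred_seconds: float) -> bool:
    """True when the CPU time exceeds the assumed watchdog limit."""
    return cpu_seconds > watchdog_limit(preferred_seconds)


# Default 3 hour target: the limit works out to 12 hours.
print(watchdog_limit(3 * 3600) / 3600)  # 12.0

# Values from the earlier stderr output: 14406.4 s of CPU time, 3600 s preference.
print(should_end_run(14406.4, 3600))  # True
```

Under this assumption, raising the target CPU time raises the limit in proportion.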
Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 21614 - Posted: 2 Aug 2006, 8:23:46 UTC

If your PC isn't very fast and this WU needs so much time for one decoy, it might help to set a higher target time and contact/update the Rosetta server. The Rosetta client notices the increased time limit and (hopefully) will allow more time for that result.
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 21647 - Posted: 2 Aug 2006, 16:26:14 UTC
Last modified: 2 Aug 2006, 16:38:54 UTC

I was having this problem with 5.22 (also there) already and now the same happens with 5.25 - stalled/hanging Rosettas.

I noticed that the running Rosetta app (result 29859752) would not exit, and BOINC did not start any other app for days. It was also unable to run benchmarks or remove the task from memory, until I suspended the result.

I'll try to unsuspend the result and wait to see (maybe a few hours, until tomorrow), but progress and time do not increment at all, boinc.log does not mention restarting Rosetta, only pausing the previous app (although BCC says Rosetta is running), and the machine is 99% idle )-:

Peter

Relevant lines from log:

2006-07-24 23:49:13 [---] Rescheduling CPU: files downloaded
2006-07-25 00:22:50 [---] Rescheduling CPU: application exited
2006-07-25 00:22:51 [Einstein@Home] Computation for task h1_0208.0_S5R1__5364_S5R1a_0 finished
2006-07-25 00:22:51 [rosetta@home] Starting task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 using rosetta version 525
2006-07-25 01:22:51 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 01:22:51 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-25 02:22:51 [SETI@home Beta Test] Restarting task 02ap05aa.20527.464.490894.3.48_1 using setiathome_enhanced version 512
2006-07-25 02:22:51 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-25 03:22:52 [SETI@home Beta Test] Pausing task 02ap05aa.20527.464.490894.3.48_1 (removed from memory)
2006-07-25 03:22:52 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 04:22:52 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-25 04:22:52 [Einstein@Home] Starting task h1_0208.0_S5R1__5363_S5R1a_0 using einstein_S5R1 version 401
2006-07-25 05:22:53 [Einstein@Home] Pausing task h1_0208.0_S5R1__5363_S5R1a_0 (removed from memory)
2006-07-25 06:22:53 [SETI@home] Restarting task 16my06ad.2870.14096.47174.3.139_3 using setiathome_enhanced version 512
2006-07-25 06:22:53 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-25 06:22:55 [---] Suspending work fetch because computer is overcommitted.
2006-07-25 07:22:53 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-07-25 07:22:53 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)
2006-07-28 16:52:39 [---] Suspending computation - running CPU benchmarks
2006-07-28 16:52:39 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-28 16:52:41 [---] Running CPU benchmarks
2006-07-28 16:52:49 [---] Failed to stop applications; aborting CPU benchmarks
2006-07-28 16:52:50 [---] Resuming computation
2006-07-28 16:52:50 [---] Rescheduling CPU: Resuming computation
2006-07-28 16:52:50 [---] Process 9951 not found
2006-07-31 12:13:04 [-manually-suspended-rosetta-] Rescheduling CPU: result suspended, resumed or aborted by user
2006-07-31 12:13:08 [-manually-suspended-rosetta-] Rescheduling CPU: result suspended, resumed or aborted by user
2006-07-31 12:13:08 [rosetta@home] Pausing task t347__CASP7_ABRELAX_SAVE_ALL_OUT_121to205hom003__847_2818_0 (removed from memory)
2006-07-31 12:13:08 [Einstein@Home] Restarting task h1_0208.0_S5R1__5363_S5R1a_0 using einstein_S5R1 version 401
......
2006-08-02 18:31:26 [---] Rescheduling CPU: result suspended, resumed or aborted by user
2006-08-02 18:31:27 [---] Using earliest-deadline-first scheduling because computer is overcommitted.
2006-08-02 18:31:27 [SETI@home] Pausing task 10ap05ac.26689.31026.679814.3.137_2 (removed from memory)
2006-08-02 18:31:27 [---] Suspending work fetch because computer is overcommitted.
2006-08-02 18:36:04 [---] Rescheduling CPU: result suspended, resumed or aborted by user


Peter
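One way to spot a stall in a log like the one above is to look for the longest silence between consecutive entries. A minimal Python sketch, assuming every entry has the form `YYYY-MM-DD HH:MM:SS [project] message` as in the excerpt; `parse_line` and `largest_gap` are names invented here, and lines that do not match (such as the `......` elision) are skipped, so a gap that spans an elision is not meaningful:

```python
import re
from datetime import datetime

# Assumed entry format: "2006-07-25 07:22:53 [project] message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def parse_line(line):
    """Return (timestamp, project, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, project, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), project, message


def largest_gap(lines):
    """Return (gap, start, end) for the longest pause between consecutive entries."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    best = None
    for (t0, _, _), (t1, _, _) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, t0, t1)
    return best


# Four consecutive lines copied from the log above.
sample = [
    "2006-07-25 06:22:55 [---] Suspending work fetch because computer is overcommitted.",
    "2006-07-25 07:22:53 [---] Using earliest-deadline-first scheduling because computer is overcommitted.",
    "2006-07-25 07:22:53 [SETI@home] Pausing task 16my06ad.2870.14096.47174.3.139_3 (removed from memory)",
    "2006-07-28 16:52:39 [---] Suspending computation - running CPU benchmarks",
]
gap, start, end = largest_gap(sample)
print(gap)  # 3 days, 9:29:46
```

For these lines the longest silence runs from 25 July 07:22:53 to 28 July 16:52:39, which matches the period in which nothing was scheduled.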
TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 21653 - Posted: 2 Aug 2006, 16:55:46 UTC - in response to Message 21647.  

I was having this problem with 5.22 (also there) already and now the same happens with 5.25 - stalled/hanging Rosettas.



The first time I saw this problem was with Ralph 5.18, then a couple of instances with Rosetta 5.22, then some with Rosetta 5.25 (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1891#20832).

I've seen the problem on Mac OS X and Linux (CentOS 4.3) but never on Windows. When the problem occurs on Linux, I stop BOINC but the Rosetta process remains in the process list. I have to kill it manually before restarting BOINC.
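The manual cleanup on Linux could be scripted roughly as follows. This is only a sketch under assumptions: it parses the output of `ps -eo pid,comm`, treats any command name containing a given substring as a leftover process, and sends SIGTERM. The substring "rosetta" and the sample process names and PIDs are placeholders, not taken from a real machine.

```python
import os
import signal


def matching_pids(ps_output, needle):
    """Return PIDs from `ps -eo pid,comm` output whose command contains needle."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # The header line ("PID COMMAND") is skipped because "PID" is not numeric.
        if len(parts) == 2 and parts[0].isdigit() and needle in parts[1]:
            pids.append(int(parts[0]))
    return pids


def terminate(pids, sig=signal.SIGTERM):
    """Send sig to each PID, ignoring processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process already gone


# Hypothetical ps output, for illustration only.
sample = """  PID COMMAND
 4242 rosetta_example
 1234 boinc
"""
print(matching_pids(sample, "rosetta"))  # [4242]

# Real use, after stopping BOINC (left commented out on purpose):
# import subprocess
# out = subprocess.run(["ps", "-eo", "pid,comm"], capture_output=True, text=True).stdout
# terminate(matching_pids(out, "rosetta"))
```

Check the list of PIDs by eye before terminating anything, since a substring match can catch unrelated processes.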
Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 21657 - Posted: 2 Aug 2006, 17:16:39 UTC
Last modified: 2 Aug 2006, 17:37:00 UTC

Maybe this helps the developers: it is the stdout of an endlessly running one that I am currently trying:

stdout belongs to FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_866_1060_22

The result seems to be stuck at fraction_done=0.273800 (for nearly 9 hours now)

stdout.txt is updated and growing but that's about all that changes.

Random seed is 2033319

RAM usage is at 151MB. The box doesn't have much RAM, but there is still physical RAM left without swapping, as the other results need less.
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 21709 - Posted: 3 Aug 2006, 8:04:52 UTC - in response to Message 21647.  

I noticed that the running Rosetta app (result 29859752) would not exit, and BOINC did not start any other app for days.......

I'll try to unsuspend the result and wait to see (maybe a few hours, until tomorrow), but progress and time do not increment at all, boinc.log does not mention restarting Rosetta, only pausing the previous app (although BCC says Rosetta is running), and the machine is 99% idle )-:

The CPU time stayed at the same 0:59:20 (probably the 1 hour switch point) for the whole night, and no other app was started in between, as expected. Aborted. Another very similar WU is already overdue, but I'll let it run through to see if it succeeds.

Peter
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,794,203
RAC: 2,238
Message 21715 - Posted: 3 Aug 2006, 10:34:02 UTC

Have also noticed a lot of work units that just stop. Boinc Manager says they are running but no counter is moving, either "cpu time" or "to completion". Suspending and resuming does not work. Stopping and restarting Boinc Manager does not work, a reboot seems to have gotten the work units going again. Only had one on the Windows XP machine but up to 8 at once on the Linux machines, all AMD processors. My Intel Windows machine has had no problem so far. Have had to abort one that would not move on Linux machine.
This has only been happening with 5.25.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,794,203
RAC: 2,238
Message 21721 - Posted: 3 Aug 2006, 11:25:43 UTC - in response to Message 21715.  

Have also noticed a lot of work units that just stop. Boinc Manager says they are running but no counter is moving, either "cpu time" or "to completion". Suspending and resuming does not work. Stopping and restarting Boinc Manager does not work, a reboot seems to have gotten the work units going again. Only had one on the Windows XP machine but up to 8 at once on the Linux machines, all AMD processors. My Intel Windows machine has had no problem so far. Have had to abort one that would not move on Linux machine.
This has only been happening with 5.25.


Well, a follow-up on these stopped and restarted work units: 3 of the 4 restarted units errored out at the same time (I happened to be watching the screen), giving back "unrecoverable error". This happened when another project's WU finished and BOINC switched to start another WU. This seems to have caused the 3 Rosetta WUs to switch as well, but instead of checkpointing they all just failed.
The work units are :- t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3203_0
(process exited with code 131 (0x83)) (SIGSEGV: segmentation violation)
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom015__1022_3194_0
(process exited with code 131 (0x83)) (SIGSEGV: segmentation violation)
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom010__1022_3198_0
(process exited with code 131 (0x83)) (SIGSEGV: segmentation violation)

I have had at least 7 WU's fail with the same error since 31/7.


Alexander W. Janssen
Joined: 31 May 06
Posts: 33
Credit: 97,311
RAC: 0
Message 21722 - Posted: 3 Aug 2006, 11:51:43 UTC
Last modified: 3 Aug 2006, 11:52:09 UTC

Got another error 131 (0x83):
Wed 02 Aug 2006 10:09:39 PM CEST|rosetta@home|Unrecoverable error for result FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_321_1060_1_0 (process exited with code 131 (0x83))

BOINC 5.4.9
Linux 2.6.8-3-686-smp #1 SMP Sat Jul 15 08:52:57 UTC 2006 i686 GNU/Linux

HTH, Alex.
"I am tired of all this sort of thing called science here... We have spent
millions in that sort of thing for the last few years, and it is time it
should be stopped."
-- Simon Cameron, U.S. Senator, on the Smithsonian Institute, 1901.
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,794,203
RAC: 2,238
Message 21789 - Posted: 4 Aug 2006, 1:41:13 UTC
Last modified: 4 Aug 2006, 1:42:36 UTC

After checking another WU that has stopped on my AMD Opteron Linux machine, I too have noticed that when the WU stops, BOINC does not switch to another project's WU after the due time but stays locked to the Rosetta WU, with the status showing "running" but nothing happening.
So far a reboot is the only way to get them moving again, but I don't plan on doing that every time a WU locks up. Will probably just abort.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21790 - Posted: 4 Aug 2006, 2:35:02 UTC

I noticed this on a box that was crunching only Rosetta. Sorry, I wasn't on the box long enough to get details. I'd told it to crunch 1hr WUs, and it was doing fine, but then it hit one that ran overnight; in the morning it showed 100% but was still crunching. I didn't wait long, but I never saw the steps increase, and it had other WUs on deck but never began them.

It was this host; the next WU (in time-issued order) was probably the one it hung on, so I guess that would make it this WU: FRA_t386_CASP7_hom001_4_t386_4_2f6sA_IGNORE_THE_REST_121_1061_32_0
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,794,203
RAC: 2,238
Message 21833 - Posted: 4 Aug 2006, 13:43:50 UTC

Have had another 10 WU's fail with the same error code;
(process exited with code 131 (0x83)) (SIGSEGV: segmentation violation)
All 10 WU's working on 2 Linux machines.
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom015__1022_3194_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom006__1022_3195_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom010__1022_3198_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3202_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom013__1022_3203_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom017__1022_3203_0
t372__CASP7_ABRELAX_SAVE_ALL_OUT_1to115hom002__1022_3204_0
t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80239_0
t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80282_0
t382__CASP7_ABRELAX_SAVE_ALL_OUT_hom001__1012_80297_0
t347__CASP7_ASSEMBLEABRELAX_SAVE_ALL_OUT_new6to205hom001__991_3157_1




RosettaMac
Joined: 16 Jul 06
Posts: 2
Credit: 1,053
RAC: 0
Message 21872 - Posted: 5 Aug 2006, 1:42:05 UTC

The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort?
RosettaMac
Joined: 16 Jul 06
Posts: 2
Credit: 1,053
RAC: 0
Message 21873 - Posted: 5 Aug 2006, 1:57:38 UTC - in response to Message 21872.  

The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort?


Never mind...just minutes after I posted the above, the work unit perversely finished. I'm new to this and never had one run that long.
Vester
Joined: 2 Nov 05
Posts: 257
Credit: 2,940,854
RAC: 17,685
Message 21874 - Posted: 5 Aug 2006, 2:37:17 UTC - in response to Message 21873.  

The work unit FRA_t384_CASP7_hom001_4_t384_4_1ofgA_IGNORE_THE_REST_724_1060_27 has been running for about 12 hours and shows progress at 1.27 percent. Is this normal or should I abort?


Never mind...just minutes after I posted the above, the work unit perversely finished. I'm new to this and never had one run that long.

Welcome to the forums. You can choose to run jobs for up to 1 day.



BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 21878 - Posted: 5 Aug 2006, 3:32:20 UTC

RosettaMac's results seem to hover around the 10800 second mark; so he's probably using the default 3 hour time setting.

That's an impressive time for a single decoy. If someone sits there and watches the client at 1.xx% and considers it stuck, chances are it's working on the first model/decoy. i.e. it'd be ticking right along if the client figured it could produce 100-300 models a day, when you saw the 1.xx% statement. :)

Congratulations on sticking it out and finishing that WU. If you're tempted to kill off a job in the future, you're supposed to be able to view the graphics to see the model moving and changing.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21954 - Posted: 6 Aug 2006, 19:56:20 UTC

Here are some references describing the completion % and requirement to run at least one complete model.

Progress % not advancing
Time to completion going up
adjustable work units FAQ

Welcome to Rosetta!...Mac :)
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/



©2024 University of Washington
https://www.bakerlab.org