Problems with version 5.96

Message boards : Number crunching : Problems with version 5.96

Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 53728 - Posted: 16 Jun 2008, 21:07:10 UTC

100 % complete, not ready to report.

Hello all,
I'm running Ubuntu 7.10 x86 on a single-core AMD Sempron 3000+ with BOINC 5.10.45. This WU:

t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_31667_0

ran to 100% complete and then sat at 0% CPU usage without ever becoming ready to report.
That left me no choice but to abort the WU.

Have a nice day,
Path7.
ID: 53728 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 53729 - Posted: 16 Jun 2008, 21:46:38 UTC

I haven't had any problems for a while, but this one ran (or tried to) for 1 min 24 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156473339

6/17/2008 7:24:12 AM|rosetta@home|Starting task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 using rosetta_beta version 596

6/17/2008 7:25:40 AM|rosetta@home|Output file FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1_0 for task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 absent
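
For anyone who wants to pull these failures out of a longer message log, here is a minimal sketch. It assumes the three-field `timestamp|project|message` layout seen in the two lines above and the US-style 12-hour timestamp this particular client prints; other posts in this thread show different date formats, so `STAMP_FORMAT` would need adjusting. The function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Two message-log lines copied from the post above.
LOG = """\
6/17/2008 7:24:12 AM|rosetta@home|Starting task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 using rosetta_beta version 596
6/17/2008 7:25:40 AM|rosetta@home|Output file FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1_0 for task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 absent
"""

# Assumption: US-style 12-hour timestamps, as in this client's log.
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_line(line):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, STAMP_FORMAT), project, message


def seconds_until_output_absent(log_text):
    """Seconds between the first 'Starting task' line and the first
    'Output file ... absent' line, or None if either is missing."""
    started = failed = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _project, message = parse_line(line)
        if started is None and message.startswith("Starting task"):
            started = when
        elif (failed is None and message.startswith("Output file")
              and message.endswith("absent")):
            failed = when
    if started is None or failed is None:
        return None
    return (failed - started).total_seconds()


elapsed = seconds_until_output_absent(LOG)
print(elapsed)
```

On the two lines quoted here this gives 88 seconds of wall-clock time between start and failure, in the same range as the run time mentioned above.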

pete.



ID: 53729 · Rating: 0
Dr Who Fan
Joined: 28 May 06
Posts: 58
Credit: 219,040
RAC: 1
Message 53737 - Posted: 17 Jun 2008, 9:05:01 UTC

This one failed after almost 81.5 seconds:
Task ID 171496331
Name FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_7492_0
Workunit 156540555

CPU time 81.46875
stderr out

<core_client_version>6.1.0</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 3355787
ERROR:: Exit from: .loop_relax.cc line: 1745

</stderr_txt>

]]>
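
When comparing several failed results, it can help to extract just the exit location from the `stderr_txt` section. The sketch below is only an illustration: the regular expression is inferred from the single `ERROR:: Exit from:` line in this report, and the helper names are hypothetical, so other Rosetta builds may print something different.

```python
import re

# stderr_txt section copied from the report above.
STDERR_TXT = """\
# cpu_run_time_pref: 7200
# random seed: 3355787
ERROR:: Exit from: .loop_relax.cc line: 1745
"""

# Pattern inferred from this one sample line (an assumption, not a spec).
EXIT_RE = re.compile(
    r"^ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)", re.MULTILINE
)


def find_exit_site(stderr_text):
    """Return (source_file, line_number) of the reported exit, or None."""
    match = EXIT_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))


def read_settings(stderr_text):
    """Collect '# key: value' lines into a dict of strings."""
    settings = {}
    for line in stderr_text.splitlines():
        if line.startswith("# ") and ": " in line:
            key, value = line[2:].split(": ", 1)
            settings[key] = value
    return settings


print(find_exit_site(STDERR_TXT))
print(read_settings(STDERR_TXT))
```

Grouping results by the `(file, line)` pair would show whether different hosts are failing at the same place in the code.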


ID: 53737 · Rating: 0
vicel
Joined: 28 Mar 06
Posts: 5
Credit: 957,142
RAC: 0
Message 53742 - Posted: 17 Jun 2008, 13:34:28 UTC

The task never reaches "Finished" status. It shows 100% complete but is not ready to report.

Ubuntu 8.04, Intel Core 2 Duo, 3MB.
WU: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_46726


Best regards, Victor
ID: 53742 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53743 - Posted: 17 Jun 2008, 13:58:44 UTC

I aborted this t405_ WU after it got stuck at 100% done.

https://boinc.bakerlab.org/rosetta/result.php?resultid=171135706
ID: 53743 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53744 - Posted: 17 Jun 2008, 14:30:19 UTC

On the t404 & t405 CASP8 units that get stuck, there is a way to save the work.

It's a bit of a pain, but if you shut down the client and then restart it, the task will finish on restart about 9 times out of 10. The ones that don't finish will either continue from a percentage below 100% or abort themselves with a client error. It's the luck of the draw; I see no indicator of why some units fail in this manner, but the restart does save the majority of them.

This leads me to believe that there may be some sort of a watchdog issue on this. Why it would affect just those units is odd, but, this workaround may save a few hassles until the problem is found...
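
To decide when a restart is worth trying, the "stuck" condition reported earlier in the thread (100% complete, CPU time no longer increasing) can be written down as a simple check. This is a sketch under stated assumptions: the `Snapshot` type, the `looks_stuck` name, and the idea of sampling progress and CPU time at intervals are all hypothetical, and nothing here talks to BOINC itself. It also does not cover tasks that keep using CPU without advancing.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One hypothetical reading of a task, taken at a fixed interval."""
    fraction_done: float  # 0.0 to 1.0, as shown in the manager
    cpu_seconds: float    # accumulated CPU time for the task


def looks_stuck(history, window=3, eps=1e-9):
    """True if the last `window` readings all show 100% done
    and the CPU time did not advance across them."""
    if len(history) < window:
        return False  # not enough evidence yet
    recent = history[-window:]
    at_full = all(s.fraction_done >= 1.0 for s in recent)
    no_cpu = recent[-1].cpu_seconds - recent[0].cpu_seconds <= eps
    return at_full and no_cpu


stuck = [Snapshot(0.98, 7100.0), Snapshot(1.0, 7200.0),
         Snapshot(1.0, 7200.0), Snapshot(1.0, 7200.0)]
healthy = [Snapshot(0.5, 3000.0), Snapshot(0.6, 3600.0),
           Snapshot(0.7, 4200.0)]

print(looks_stuck(stuck), looks_stuck(healthy))
```

Requiring several consecutive readings avoids flagging a task that is merely writing its final output at 100%.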


Looking for a team ??? Join BoincSynergy!!


ID: 53744 · Rating: 0
dlsqbinder
Joined: 23 Nov 05
Posts: 3
Credit: 371,859
RAC: 0
Message 53745 - Posted: 17 Jun 2008, 14:34:53 UTC - in response to Message 53743.  

I had my 3rd one hang today:

6/17/2008 9:29:00 AM|rosetta@home|Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_12504_0 (aborted via GUI RPC)


I aborted it and have now suspended Rosetta until the issue is fixed.

Larry
ID: 53745 · Rating: 0
Warped
Joined: 15 Jan 06
Posts: 47
Credit: 1,577,620
RAC: 0
Message 53748 - Posted: 17 Jun 2008, 18:24:18 UTC

2008/06/17 19:13:08|rosetta@home|Reason: Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22480_0 ( - exit code 1073807364 (0x40010004))
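
The decimal and hexadecimal forms in that log line are the same number, which can be checked with a short snippet. The helper name `as_hex32` is made up for illustration. If this host runs Windows (the post does not say), 0x40010004 matches the NTSTATUS constant `DBG_TERMINATE_PROCESS`, which would be consistent with the process being terminated from outside rather than exiting on its own; treat that reading as an assumption.

```python
EXIT_CODE = 1073807364  # decimal value from the log line above


def as_hex32(code):
    """Format an exit code as an unsigned 32-bit hexadecimal string."""
    return f"0x{code & 0xFFFFFFFF:08x}"


print(as_hex32(EXIT_CODE))
```

Masking with `0xFFFFFFFF` also makes negative exit codes print in the same 32-bit form used in the log.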

I'm out of here. It seems that more than half of my WUs have bombed out. The downloads are quite big as well, using up valuable bandwidth.
Warped

ID: 53748 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53750 - Posted: 17 Jun 2008, 23:28:36 UTC

This t405 hung at 100%. I restarted Boinc, it crunched for a while, then it hung at 100% again. So I aborted it.

ID: 53750 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53751 - Posted: 18 Jun 2008, 3:20:12 UTC

Yet another t405 got stuck at 100%. I tried restarting BOINC several times. Each time it crunched another decoy and then got stuck at 100% again. Finally I gave up and aborted it.
ID: 53751 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 608
Credit: 9,771,233
RAC: 3,502
Message 53752 - Posted: 18 Jun 2008, 6:01:18 UTC
Last modified: 18 Jun 2008, 6:20:52 UTC

I also have 3 t405 WUs stuck, using their CPU quota but not advancing. I have suspended Rosetta on all machines for now; I had cores sitting idle. Q6600, Win XP SP-3, BOINC 5.10.45. I have no idea how long they have been stuck like that...
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 53752 · Rating: 0
Brbe
Joined: 17 Dec 05
Posts: 1
Credit: 5,630,816
RAC: 44
Message 53754 - Posted: 18 Jun 2008, 10:52:15 UTC

The application is frozen and stuck in time.
When I press "Show graphics", the application crashes with this message:
18.6.2008 12:50:31|rosetta@home|Computation for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 finished
18.6.2008 12:50:31|rosetta@home|Output file t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0_0 for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 absent

What bothers me is that I must manually click on the application before BOINC moves on. In the meantime, one of my 4 cores does nothing.
A waste of cycles!
ID: 53754 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4875
Credit: 4,515,389
RAC: 1,207
Message 53755 - Posted: 18 Jun 2008, 10:55:09 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=170401804
My machine and another one both got validate errors on this.
ID: 53755 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53759 - Posted: 18 Jun 2008, 12:05:11 UTC

I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort.
ID: 53759 · Rating: 0
ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53764 - Posted: 18 Jun 2008, 13:17:04 UTC - in response to Message 53759.  

Don't these failed units ever disappear?

I also just had to terminate these off my systems for the second time and I see we are not the only ones!

These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task.
ID: 53764 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4875
Credit: 4,515,389
RAC: 1,207
Message 53765 - Posted: 18 Jun 2008, 13:33:14 UTC - in response to Message 53764.  

They cycle twice, to see if the same error happens on a different system.
That's why so many people are complaining: double errors.
Some get lucky and don't get the error if they are number 2 in line.
ID: 53765 · Rating: 0
Keith T.
Joined: 1 Mar 07
Posts: 58
Credit: 34,135
RAC: 0
Message 53771 - Posted: 18 Jun 2008, 14:34:55 UTC

I am currently only running Rosetta on my old PII 233 MHz system. My faster system is busy running AstroPulse, SETI, Orbit and RALPH.

I have a WU that started OK, but seems to have got stuck as the last checkpoint was @ 13:04, over 2 hours ago. The task id is t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_15981_0.

I know that this system is very slow by modern standards, but it has managed at least 9 other Rosetta tasks OK recently https://boinc.bakerlab.org/rosetta/results.php?hostid=455894, and quite a few for RALPH as well http://ralph.bakerlab.org/results.php?hostid=14021.

Keith
ID: 53771 · Rating: 0
ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53772 - Posted: 18 Jun 2008, 15:21:52 UTC - in response to Message 53765.  

Unfortunately these crash but the failure is not reported until the user aborts them or the deadlines pass. Consequently the bugs are not fixed and potentially many users are wasting resources.
ID: 53772 · Rating: 0
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53774 - Posted: 18 Jun 2008, 16:03:24 UTC

So BOINC sat idle all night long. Neither of my two cores was using cycles for at least 8 hours. I exited BOINC and then started it again. This WU immediately finished (even though it was not due to, according to my 3-hour runtime) and it shows an unhandled exception. This happened once before, with application version minirosetta 1.28. Others are apparently having problems as well. Is this being looked into?

Tim





ID: 53774 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53781 - Posted: 18 Jun 2008, 17:48:37 UTC

My success ratio for restarts has now dropped below 50%. I will now abort the hung units and will preemptively abort all t404 & t405 CASP8 units. My apologies to Rosetta, but I can't have my crunchers useless because of a bug in one project. Yes, I primarily crunch Rosetta, but.....


Looking for a team ??? Join BoincSynergy!!


ID: 53781 · Rating: 0

©2021 University of Washington
https://www.bakerlab.org