Problems with version 5.96

Author	Message
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0	Message 53728 - Posted: 16 Jun 2008, 21:07:10 UTC
100 % complete, not ready to report.
Hello all, I'm running Ubuntu 7.10 x86 on a single-core AMD Sempron 3000+ with BOINC 5.10.45. The following WU, t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_31667_0, ran until it showed 100 % complete and 0 % CPU usage, but it never became ready to report. That left me with no choice but to abort the WU.
Have a nice day, Path7.

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 53729 - Posted: 16 Jun 2008, 21:46:38 UTC
I haven't had any problems for a while, but this one ran (or tried to) for 1 min 24 sec.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156473339
6/17/2008 7:24:12 AM|rosetta@home|Starting task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 using rosetta_beta version 596
6/17/2008 7:25:40 AM|rosetta@home|Output file FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1_0 for task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 absent
pete.
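For anyone wanting to scan a saved copy of the client messages for this failure, here is a minimal sketch. It assumes the messages are plain text in the `timestamp|project|message` layout shown in the post above; the function names (`parse_line`, `tasks_with_absent_output`) are made up for illustration and are not part of BOINC.

```python
import re

# Matches the "output file absent" message as it appears in the log excerpt above.
ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent$")

def parse_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    # Some copies of the log show the separator as '\|'; accept both forms.
    line = line.replace("\\|", "|")
    timestamp, project, message = line.split("|", 2)
    return timestamp.strip(), project.strip(), message.strip()

def tasks_with_absent_output(lines):
    """Return the names of tasks whose output file was reported absent."""
    failed = []
    for line in lines:
        if line.replace("\\|", "|").count("|") < 2:
            continue  # not a client message line
        _, _, message = parse_line(line)
        match = ABSENT_RE.match(message)
        if match:
            failed.append(match.group(2))
    return failed

LOG = [
    "6/17/2008 7:24:12 AM|rosetta@home|Starting task "
    "FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 "
    "using rosetta_beta version 596",
    "6/17/2008 7:25:40 AM|rosetta@home|Output file "
    "FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1_0 for task "
    "FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 absent",
]

print(tasks_with_absent_output(LOG))
```

Matching on the message text alone keeps the sketch independent of the timestamp format, which differs between the posts in this thread.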

Dr Who Fan Joined: 28 May 06 Posts: 107 Credit: 292,109 RAC: 0	Message 53737 - Posted: 17 Jun 2008, 9:05:01 UTC
This one failed after almost 81.5 seconds:
Task ID: 171496331
Name: FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_7492_0
Workunit: 156540555
CPU time: 81.46875
stderr out:
<core_client_version>6.1.0</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 3355787
ERROR:: Exit from: .loop_relax.cc line: 1745
</stderr_txt>
]]>

vicel Joined: 28 Mar 06 Posts: 5 Credit: 957,142 RAC: 0	Message 53742 - Posted: 17 Jun 2008, 13:34:28 UTC
The task doesn't switch to "Finish" status. It is 100 % complete, but not ready to report. Ubuntu 8.04, Intel Core2 Duo, 3MB.
WU: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_46726
Best regards, Victor

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53743 - Posted: 17 Jun 2008, 13:58:44 UTC
I aborted this t405_ WU after it got stuck at 100% done.
https://boinc.bakerlab.org/rosetta/result.php?resultid=171135706

netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0	Message 53744 - Posted: 17 Jun 2008, 14:30:19 UTC
On the t404 & t405 CASP8 units that get stuck, there is a way to save the work. It's a bit of a pain, but if you shut down the connected client and then restart it, the task will finish on restart about 9 times out of 10. The ones that don't finish will either continue from a percentage below 100% or self-abort with a client error. It's the luck of the draw...
I see no indication of why some units fail in this manner, but the restart does save a majority of them. This leads me to believe that there may be some sort of watchdog issue. It is odd that it would affect just those units, but this workaround may save a few hassles until the problem is found...
*Looking for a team ??? Join BoincSynergy!!*

dlsqbinder Joined: 23 Nov 05 Posts: 3 Credit: 371,859 RAC: 0	Message 53745 - Posted: 17 Jun 2008, 14:34:53 UTC - in response to Message 53743.
I had my 3rd one hang today:
6/17/2008 9:29:00 AM|rosetta@home|Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_12504_0 (aborted via GUI RPC)
I aborted it and have now suspended Rosetta until the issue is fixed.
Larry

Warped Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0	Message 53748 - Posted: 17 Jun 2008, 18:24:18 UTC
2008/06/17 19:13:08|rosetta@home|Reason: Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22480_0 ( - exit code 1073807364 (0x40010004))
I'm out of here. It seems that more than half of my WUs have bombed out. The downloads are quite big as well, using up valuable bandwidth.
*Warped*
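The decimal exit code and the hexadecimal value in parentheses in the log line above are the same number written two ways. A small sketch to confirm that, and to produce the same `decimal (0xhex)` form for any other exit code you come across; the helper name `exit_code_forms` is made up for illustration:

```python
def exit_code_forms(code):
    """Format an exit code as 'decimal (0xhex)', the form used in the client log."""
    return f"{code} ({hex(code)})"

# Exit code from the t405 failure quoted above.
print(exit_code_forms(1073807364))  # 1073807364 (0x40010004)

# Exit code from the earlier FRA_t423 failure in this thread.
print(exit_code_forms(1))  # 1 (0x1)
```

The hexadecimal form is usually the one to search for, since operating-system status codes are normally documented in hex.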

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53750 - Posted: 17 Jun 2008, 23:28:36 UTC
This t405 hung at 100%. I restarted BOINC, it crunched for a while, then it hung at 100% again. So I aborted it.

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53751 - Posted: 18 Jun 2008, 3:20:12 UTC
Yet another t405 got stuck at 100%. I tried restarting BOINC several times. Each time it crunched another decoy and then got stuck at 100% again. Finally I gave up and aborted it.

adrianxw Joined: 18 Sep 05 Posts: 662 Credit: 12,167,519 RAC: 0	Message 53752 - Posted: 18 Jun 2008, 6:01:18 UTC Last modified: 18 Jun 2008, 6:20:52 UTC
I also have 3 t405 WUs stuck: they are using their CPU quota but not advancing. I have suspended Rosetta on all machines for now, since I had cores sitting idle. Q6600, Win XP SP-3, BOINC 5.10.45. I have no idea how long they have been stuck like that...
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Brbe Joined: 17 Dec 05 Posts: 1 Credit: 5,641,827 RAC: 0	Message 53754 - Posted: 18 Jun 2008, 10:52:15 UTC
The application froze and is stuck in time... When I press "Show graphics", the application crashes with this message:
18.6.2008 12:50:31|rosetta@home|Computation for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 finished
18.6.2008 12:50:31|rosetta@home|Output file t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0_0 for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 absent
What bothers me is that I must manually click on the application before BOINC moves on... In the meantime one of my 4 cores does nothing... a waste of cycles!!!

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 53755 - Posted: 18 Jun 2008, 10:55:09 UTC
https://boinc.bakerlab.org/rosetta/result.php?resultid=170401804
My machine and another one got validate errors on this.

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 53759 - Posted: 18 Jun 2008, 12:05:11 UTC
I just had to abort several more t405 WUs. BTW, the stack trace is already in the stderr file in the slot directory while the WU is stuck, before I do the abort.

ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0	Message 53764 - Posted: 18 Jun 2008, 13:17:04 UTC - in response to Message 53759.
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort."
Don't these failed units ever disappear? I also just had to terminate these off my systems for the second time and I see we are not the only ones! These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 53765 - Posted: 18 Jun 2008, 13:33:14 UTC - in response to Message 53764.
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort."
ConflictingEmotions wrote: "Don't these failed units ever disappear? I also just had to terminate these off my systems for the second time and I see we are not the only ones! These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task."
They cycle twice, to see if the same error happens on a different system. That's why so many people are complaining: double errors. Some get lucky and don't get the error if they are number 2 in line.

Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0	Message 53771 - Posted: 18 Jun 2008, 14:34:55 UTC
I am currently only running Rosetta on my old PII 233 MHz system. My faster system is busy running AstroPulse, SETI, Orbit and RALPH.
I have a WU that started OK but seems to have got stuck: the last checkpoint was at 13:04, over 2 hours ago. The task ID is t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_15981_0.
I know that this system is very slow by modern standards, but it has managed at least 9 other Rosetta tasks OK recently https://boinc.bakerlab.org/rosetta/results.php?hostid=455894, and quite a few for RALPH as well http://ralph.bakerlab.org/results.php?hostid=14021.
Keith

ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0	Message 53772 - Posted: 18 Jun 2008, 15:21:52 UTC - in response to Message 53765.
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort."
ConflictingEmotions wrote: "Don't these failed units ever disappear? I also just had to terminate these off my systems for the second time and I see we are not the only ones! These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task."
Greg_BE wrote: "They cycle twice, to see if the same error happens on a different system. That's why so mant people are complaining, double error's. Some get lucky and don't get the error if they are number 2 in line."
Unfortunately these units crash, but the failure is not reported until the user aborts them or the deadline passes. Consequently the bugs are not fixed and potentially many users are wasting resources.

sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0	Message 53774 - Posted: 18 Jun 2008, 16:03:24 UTC
So BOINC sat idle all night long. Neither of my two cores was using cycles for at least 8 hours. I exited BOINC and then started it again. This WU immediately finished (even though it was not due to do so according to my 3-hour runtime) and it shows an unhandled exception. This happened once before for application version: minirosetta 1.28. Others are apparently having problems as well. Is this being looked into?
Tim

netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0	Message 53781 - Posted: 18 Jun 2008, 17:48:37 UTC
My success ratio for restarts has now dropped below 50%... I will now abort the hung units and will preemptively abort all t404 & t405 CASP8 units... My apologies to Rosetta, but I can't have my crunchers useless because of a bug in one project. Yes, I primarily crunch Rosetta, but...
*Looking for a team ??? Join BoincSynergy!!*