Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
100% complete, not ready to report. Hello all, I am running Ubuntu 7.10 x86 on a single-core AMD Sempron 3000+ with BOINC 5.10.45. The WU t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_31667_0 ran until 100% complete with 0% CPU usage but wasn't ready to report. That left me with no choice but to abort the WU. Have a nice day, Path7. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I haven't had any problems for a while, but this one ran (tried to) for 1 min 24 sec. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156473339 6/17/2008 7:24:12 AM|rosetta@home|Starting task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 using rosetta_beta version 596 6/17/2008 7:25:40 AM|rosetta@home|Output file FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1_0 for task FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1 absent pete. |
Dr Who Fan Joined: 28 May 06 Posts: 64 Credit: 259,474 RAC: 408 |
This one failed after almost 81.5 seconds: Task ID 171496331 Name FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_7492_0 Workunit 156540555 CPU time 81.46875 stderr out <core_client_version>6.1.0</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 7200 # random seed: 3355787 ERROR:: Exit from: .loop_relax.cc line: 1745 </stderr_txt> ]]> |
vicel Joined: 28 Mar 06 Posts: 5 Credit: 957,142 RAC: 0 |
The task never gets set to "Finished" status. It is 100% complete but not ready to report. Ubuntu 8.04, Intel Core 2 Duo, 3MB. WU: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_46726 Best regards, Victor |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I aborted this t405_ WU after it got stuck at 100% done. https://boinc.bakerlab.org/rosetta/result.php?resultid=171135706 |
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
On the t404 & t405 CASP8 units that get stuck, there is a way to save the work. It's a bit of a pain, but if you shut down the connected client and then restart it, the task will finish on restart about 9 out of 10 times. The ones that don't finish will either continue from a percentage less than 100% or will self-abort with a client error. Luck of the draw on this... I see no indicator as to why some units fail in this manner, but it does save a majority of the units. This leads me to believe that there may be some sort of watchdog issue here. Why it would affect just those units is odd, but this workaround may save a few hassles until the problem is found... Looking for a team ??? Join BoincSynergy!! |
dlsqbinder Joined: 23 Nov 05 Posts: 3 Credit: 371,859 RAC: 0 |
I had my 3rd one hang today: 6/17/2008 9:29:00 AM|rosetta@home|Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_12504_0 (aborted via GUI RPC) I aborted it and have now suspended Rosetta until the issue is fixed. Larry |
Warped Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
2008/06/17 19:13:08|rosetta@home|Reason: Unrecoverable error for result t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22480_0 ( - exit code 1073807364 (0x40010004)) I'm out of here. It seems that more than half of my WUs have bombed out. The downloads are quite big as well, using up valuable bandwidth. Warped |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
This t405 hung at 100%. I restarted Boinc, it crunched for a while, then it hung at 100% again. So I aborted it. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Yet another t405 got stuck at 100%. I tried restarting BOINC several times. Each time it crunched another decoy and then got stuck at 100% again. Finally I gave up and aborted it. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 335 |
I also have 3 t405 WUs stuck, using their CPU quota but not advancing. I have suspended Rosetta on all machines for now, since I had cores sitting idle. Q6600, Win XP SP3, BOINC 5.10.45. I have no idea how long they have been stuck like that... Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Brbe Joined: 17 Dec 05 Posts: 1 Credit: 5,641,827 RAC: 0 |
The application froze and got stuck in time... When I press "Show graphics", the application crashes with this message: 18.6.2008 12:50:31|rosetta@home|Computation for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 finished 18.6.2008 12:50:31|rosetta@home|Output file t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0_0 for task t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22996_0 absent What bothers me is that I must manually click on the application before BOINC goes ahead... In the meantime one of my 4 cores does nothing... a waste of cycles!!! |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=170401804 my machine and another one got validate errors on this |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort. |
ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort." Don't these failed units ever disappear? I also just had to terminate these off my systems for the second time and I see we are not the only ones! These units expose a very nasty bug with rosetta. The worst part is that these prevent boinc from starting another task. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort." They cycle twice, to see if the same error happens on a different system. That's why so many people are complaining: double errors. Some get lucky and don't get the error if they are number 2 in line. |
Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0 |
I am currently only running Rosetta on my old PII 233 MHz system. My faster system is busy running AstroPulse, SETI, Orbit and RALPH. I have a WU that started OK, but it seems to have got stuck, as the last checkpoint was at 13:04, over 2 hours ago. The task id is t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_15981_0. I know that this system is very slow by modern standards, but it has managed at least 9 other Rosetta tasks OK recently https://boinc.bakerlab.org/rosetta/results.php?hostid=455894, and quite a few for RALPH as well http://ralph.bakerlab.org/results.php?hostid=14021. Keith |
ConflictingEmotions Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
AMD_is_logical wrote: "I just had to abort several more t405 WUs. BTW, the stack trace stuff is already in the stderr file in the slot directory when the WU is stuck, and before I do the abort." Unfortunately these crash but the failure is not reported until the user aborts them or the deadlines pass. Consequently the bugs are not fixed and potentially many users are wasting resources. |
sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
So BOINC sat idle all night long. Neither of my two cores was using cycles for at least 8 hours. I exited BOINC and then started it again. This WU immediately finished (even though it was not due to, according to my 3-hour runtime) and it shows an unhandled exception. This happened once before for application version: minirosetta 1.28. Others are apparently having problems as well. Is this being looked into? Tim |
netwraith Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
My success ratio for restarts has now dropped below 50%... I will now abort the hung units and will preemptively abort all t404 & t405 CASP8 units.... My apologies to Rosetta, but I can't have my crunchers useless because of a bug in one project. Yes, I primarily crunch Rosetta, but..... Looking for a team ??? Join BoincSynergy!! |
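
For anyone sorting through their own message log: the log lines pasted in this thread all seem to follow a `timestamp|project|message` pattern. Below is a rough Python sketch, not anything from the Rosetta or BOINC developers, that scans such lines for the two failure messages quoted above ("Output file ... absent" and "Unrecoverable error for result ...") and, when a matching "Starting task" line was seen, reports the time between the two timestamps. The names `parse_line`, `find_failures` and `SAMPLE`, the three timestamp layouts, and the regular expressions are all assumptions based only on the lines posted here; other BOINC versions or locales may log differently.

```python
import re
from datetime import datetime

# Timestamp layouts that appear in the log lines quoted in this thread.
# Treating these as the only possible layouts is an assumption.
TIMESTAMP_FORMATS = (
    "%m/%d/%Y %I:%M:%S %p",  # 6/17/2008 7:24:12 AM
    "%Y/%m/%d %H:%M:%S",     # 2008/06/17 19:13:08
    "%d.%m.%Y %H:%M:%S",     # 18.6.2008 12:50:31
)

# Message patterns, guessed from the pasted lines (hypothetical, not official).
START_RE = re.compile(r"Starting task (\S+)")
ABSENT_RE = re.compile(r"Output file \S+ for task (\S+) absent")
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+)")


def parse_line(line):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(stamp.strip(), fmt), project, message.strip()
        except ValueError:
            continue
    raise ValueError(f"unrecognised timestamp: {stamp!r}")


def find_failures(lines):
    """Return (task, problem, seconds_since_start) for each failure line.

    seconds_since_start is None when no 'Starting task' line was seen.
    """
    started = {}
    failures = []
    for line in lines:
        when, _project, message = parse_line(line)
        match = START_RE.search(message)
        if match:
            started[match.group(1)] = when
            continue
        for pattern, problem in ((ABSENT_RE, "output file absent"),
                                 (ERROR_RE, "unrecoverable error")):
            match = pattern.search(message)
            if match:
                task = match.group(1)
                begun = started.get(task)
                elapsed = (when - begun).total_seconds() if begun else None
                failures.append((task, problem, elapsed))
                break
    return failures


# Lines copied from the posts by P . P . L . and Warped above.
FRA_TASK = "FRA_t423_CASP8_1G3U_11_IGNORE_THE_RESTt423_3764_1166_1"
T405_TASK = "t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_22480_0"
SAMPLE = [
    f"6/17/2008 7:24:12 AM|rosetta@home|Starting task {FRA_TASK} "
    "using rosetta_beta version 596",
    f"6/17/2008 7:25:40 AM|rosetta@home|Output file {FRA_TASK}_0 "
    f"for task {FRA_TASK} absent",
    "2008/06/17 19:13:08|rosetta@home|Reason: Unrecoverable error for result "
    f"{T405_TASK} ( - exit code 1073807364 (0x40010004))",
]

if __name__ == "__main__":
    for task, problem, elapsed in find_failures(SAMPLE):
        print(task, problem, elapsed)
```

On the two lines P . P . L . posted, the timestamps are 88 seconds apart. Note that this only catches tasks that actually logged a failure; a WU that sits at 100% without writing anything to the log, which is what most people here are describing, would not show up.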
©2024 University of Washington
https://www.bakerlab.org