Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 11856 - Posted: 10 Mar 2006, 14:10:34 UTC

After 27 seconds:

Unrecoverable error for result HOMSdt_homDB027_1dtj__352_1825_1 (Incorrect function. (0x1) - exit code 1 (0x1))

On this page I saw that the job had already errored on another computer:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10447424
(I haven't uploaded mine yet; I'll do that in a few hours.)
Member of Dutch Power Cows
ID: 11856 · Rating: 0
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 11995 - Posted: 14 Mar 2006, 2:25:58 UTC

In the last couple of days I've seen several hangs on my system. Too bad they don't abort automatically: each one sits there for a day or so, taking up a slot without doing anything, until I abort it manually.

dag

2006-03-10 13:05:17 [rosetta@home] Unrecoverable error for result HOMSog_homDB015_1ogw__352_1003_0 (process exited with code 131 (0x83))
2006-03-10 13:05:20 [rosetta@home] Unrecoverable error for result HOMSn0_homDB004_1n0u__352_1003_0 (process exited with code 131 (0x83))
2006-03-10 19:31:16 [rosetta@home] Unrecoverable error for result HOMSdt_homDB009_1dtj__352_1783_2 (process exited with code 1 (0x1))
2006-03-12 10:35:27 [rosetta@home] Unrecoverable error for result HOMSti_homDB025_1tif__352_1208_1 (process exited with code 1 (0x1))
2006-03-12 20:55:21 [rosetta@home] Unrecoverable error for result HOMSdt_homDB003_1dtj__352_1942_2 (process exited with code 1 (0x1))
2006-03-13 02:18:27 [rosetta@home] Unrecoverable error for result HOMSdt_homDB009_1dtj__352_992_2 (process exited with code 1 (0x1))
2006-03-13 11:27:59 [rosetta@home] Unrecoverable error for result HOMSn0_homDB017_1n0u__352_1447_0 (aborted by user)
2006-03-13 15:33:04 [rosetta@home] Unrecoverable error for result FA_RLXce_hom004_1cei__360_79_0 (process got signal 11)
2006-03-13 15:33:08 [rosetta@home] Unrecoverable error for result FA_RLXbg_hom001_1bgf__359_105_0 (process got signal 11)
2006-03-13 18:18:25 [rosetta@home] Unrecoverable error for result FA_RLXb3_hom012_1b3aA_359_81_0 (aborted by user)
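
For anyone who wants to summarize a log like the one above, here is a minimal Python sketch that was not part of the original post. It assumes every failure line has the shape "Unrecoverable error for result NAME (REASON)" as shown; the names `ERROR_RE` and `tally_errors` are made up for illustration.

```python
import re
from collections import Counter

# Log lines copied from the post above.
LOG = """\
2006-03-10 13:05:17 [rosetta@home] Unrecoverable error for result HOMSog_homDB015_1ogw__352_1003_0 (process exited with code 131 (0x83))
2006-03-10 13:05:20 [rosetta@home] Unrecoverable error for result HOMSn0_homDB004_1n0u__352_1003_0 (process exited with code 131 (0x83))
2006-03-10 19:31:16 [rosetta@home] Unrecoverable error for result HOMSdt_homDB009_1dtj__352_1783_2 (process exited with code 1 (0x1))
2006-03-12 10:35:27 [rosetta@home] Unrecoverable error for result HOMSti_homDB025_1tif__352_1208_1 (process exited with code 1 (0x1))
2006-03-12 20:55:21 [rosetta@home] Unrecoverable error for result HOMSdt_homDB003_1dtj__352_1942_2 (process exited with code 1 (0x1))
2006-03-13 02:18:27 [rosetta@home] Unrecoverable error for result HOMSdt_homDB009_1dtj__352_992_2 (process exited with code 1 (0x1))
2006-03-13 11:27:59 [rosetta@home] Unrecoverable error for result HOMSn0_homDB017_1n0u__352_1447_0 (aborted by user)
2006-03-13 15:33:04 [rosetta@home] Unrecoverable error for result FA_RLXce_hom004_1cei__360_79_0 (process got signal 11)
2006-03-13 15:33:08 [rosetta@home] Unrecoverable error for result FA_RLXbg_hom001_1bgf__359_105_0 (process got signal 11)
2006-03-13 18:18:25 [rosetta@home] Unrecoverable error for result FA_RLXb3_hom012_1b3aA_359_81_0 (aborted by user)
"""

# Hypothetical pattern: captures the result name and the reason text
# inside the outermost parentheses at the end of the line.
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)\s*$")


def tally_errors(log_text):
    """Count failed results by the reason text printed in the log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            _name, reason = match.groups()
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    for reason, count in tally_errors(LOG).most_common():
        print(f"{count:2d}  {reason}")
```

Grouping by reason keeps manual aborts apart from real crashes (signal 11, nonzero exit codes), which is the split the project team would probably care about.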

dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 11995 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 11999 - Posted: 14 Mar 2006, 4:31:45 UTC - in response to Message 11998.  

I'm having major problems with the WU's that start with FA_RLX. They are either getting stuck at 1% or they run until 90 - 95% and error.


Are these failing at a much higher rate than other WUs on your computer?
ID: 11999 · Rating: 0
arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 12008 - Posted: 14 Mar 2006, 12:51:42 UTC - in response to Message 8741.  
Last modified: 14 Mar 2006, 12:54:28 UTC

SSFEATURES_BARCODE_ABINITIO_5croA_334_286_0
Stuck at 1% after 9 hours. This seems to happen a lot on this P3.

Can't get it to run via command line either, the window just closes.
Looks like I'll have to abort it.
ID: 12008 · Rating: 0
Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12009 - Posted: 14 Mar 2006, 13:06:33 UTC
Last modified: 14 Mar 2006, 13:07:59 UTC

Had 3 today that were stuck on 1% after anywhere between 3 and 17 hours (runtime set to 2 hours).

3.1 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11019262

4.3 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11008654

16.9 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10949277

All were aborted.
ID: 12009 · Rating: 0
hob.
Joined: 4 Nov 05
Posts: 64
Credit: 250,683
RAC: 0
Message 12025 - Posted: 14 Mar 2006, 23:30:52 UTC

3/4/2006 6:51:19 AM|rosetta@home|Starting result ABINITli_hom010_1lis__322_14_1 using rosetta version 482


This job has been running for over 10 1/2 days now. It's been on 84.89% for at least 24 hrs, maybe a lot longer. It's still using CPU power, so I assume it's doing something? Runtime is listed as 256 1/2 hours so far, and still counting up.

any advice as to what to do with it would be welcome.
46 years dc so far

join team FaDbeens
join us

ID: 12025 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 12037 - Posted: 15 Mar 2006, 5:05:55 UTC - in response to Message 12025.  

3/4/2006 6:51:19 AM|rosetta@home|Starting result ABINITli_hom010_1lis__322_14_1 using rosetta version 482


This job has been running for over 10 1/2 days now. It's been on 84.89% for at least 24 hrs, maybe a lot longer. It's still using CPU power, so I assume it's doing something? Runtime is listed as 256 1/2 hours so far, and still counting up.

any advice as to what to do with it would be welcome.

You could try stopping boinc, waiting a minute, then restarting boinc. That could get the WU going again.
ID: 12037 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 12055 - Posted: 15 Mar 2006, 15:46:19 UTC

I just had a WU stuck at 23%. It was using CPU but wasn't making any progress. I stopped and restarted boinc and it finished normally.

WU: FA_RLXbq_hom006_1bq9A_359_221_0
Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=13655767
ID: 12055 · Rating: 0
Stwato
Joined: 11 Jan 06
Posts: 150
Credit: 655,634
RAC: 0
Message 12064 - Posted: 15 Mar 2006, 19:03:21 UTC
Last modified: 15 Mar 2006, 19:08:23 UTC

I had a work unit stuck on 1% for over 3 hours. It was FA_RLXbg_hom004_1bgf_359_376_0

If another WU gets stuck should I abort it (I aborted this one, sorry if that was wrong)?

[Edit: wrote down wrong work unit, oops]
ID: 12064 · Rating: 0
Steve Shedroff
Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 12079 - Posted: 16 Mar 2006, 1:41:03 UTC

I aborted the following WU. It appeared stuck at 1% after 44 hours, with 59 hours to go, and did not appear to be progressing.

3/15/2006 6:54:16 PM|rosetta@home|Unrecoverable error for result FA_RLXai_hom022_1ail__359_179_0 (aborted via GUI RPC)
Regards,
Steve
ID: 12079 · Rating: 0
Ib Rasmussen
Joined: 27 Sep 05
Posts: 16
Credit: 211,416
RAC: 0
Message 12088 - Posted: 16 Mar 2006, 8:00:13 UTC
Last modified: 16 Mar 2006, 8:04:36 UTC

This morning my machine ID 5309, which only runs Rosetta, showed an empty BOINC Manager: no work, no projects.

stderrdae.txt says

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0038F114 read attempt to address 0x00000008

1: 03/15/06 22:02:49
1: SymGetLineFromAddr(): GetLastError = 126

stdoutdae.txt says (with some editing)

2006-03-15 20:08:46 [rosetta@home] Starting result FA_RLXb3_hom003_1b3aA_359_477_0 using rosetta version 482
...
2006-03-15 20:36:39 [rosetta@home] Computation for result FA_RLXbk_hom026_1bk2__359_477_0 finished
2006-03-15 20:36:40 [rosetta@home] Starting result FA_RLXac_hom001_1acf__359_478_0 using rosetta version 482
...
2006-03-15 21:01:42 [---] request_reschedule_cpus: files downloaded
2006-03-15 22:02:41 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2006-03-15 22:02:41 [rosetta@home] Reason: To fetch work
2006-03-15 22:02:41 [rosetta@home] Requesting 1470 seconds of new work
2006-03-15 22:02:46 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.fasta.gz
2006-03-15 22:02:48 [rosetta@home] Started download of hom027_1ig5A.psipred_ss2.gz
2006-03-15 22:02:50 [rosetta@home] Finished download of hom027_1ig5A.psipred_ss2.gz
2006-03-15 22:02:50 [rosetta@home] Throughput 1687 bytes/sec
2006-03-15 22:02:50 [rosetta@home] Started download of hom027_aa1ig5A03_05.200_v1_3.gz


After restarting the manager the two wus picked up at 87% and 71%.

The last wu in the queue: FA_RLXig_hom027_1ig5a_360_73_0

/Ib
ID: 12088 · Rating: 0
Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 12089 - Posted: 16 Mar 2006, 8:12:38 UTC

Another one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11167215

Stuck on 1% after 12 hours.
ID: 12089 · Rating: 0
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12113 - Posted: 16 Mar 2006, 23:42:08 UTC - in response to Message 12089.  

This WU was stuck at 1% for a day, then started "on its own" and got to about 70% done. Then BOINC switched over to one of the other projects I'm running (BBC CCE), and immediately I got a "computation error" and the percentage went to 100%.

16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c))
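
A side note on reading that line: the two numbers in "exit code -164 (0xffffff5c)" are the same 32-bit value printed once as a signed integer and once in hex. The short Python sketch below (an illustration added here, not from the original post) shows the conversion; the helper names are made up, and it says nothing about what -164 means to Rosetta or BOINC.

```python
def to_signed32(value):
    """Interpret a 32-bit bit pattern as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def to_unsigned32(value):
    """Return the unsigned 32-bit bit pattern of a (possibly negative) integer."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    print(to_signed32(0xFFFFFF5C))      # -164
    print(hex(to_unsigned32(-164)))     # 0xffffff5c
    print(hex(to_unsigned32(131)))      # 0x83, as in "code 131 (0x83)"
```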


So, that's more wasted CPU cycles.

This project used to be bullet-proof - what's changed....?

regards,

Tim

PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March.

see here: http://www.ukboincteam.org.uk/uk-boinc-team.html

If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....!


ID: 12113 · Rating: 0
m.mitch
Joined: 10 Feb 06
Posts: 34
Credit: 1,928,904
RAC: 0
Message 12117 - Posted: 17 Mar 2006, 1:09:27 UTC

I have a work unit that has been stuck at 1% for 10:53:49. I just noticed it.

Does anyone want to have a look or should I just restart it?


Click here to join the #1 Aussie Alliance on Rosetta
ID: 12117 · Rating: 0
xrobert
Joined: 28 Oct 05
Posts: 3
Credit: 168,865
RAC: 0
Message 12131 - Posted: 17 Mar 2006, 7:08:54 UTC

The WU I have has got stuck on 1%.

FA_RLXai_hom023_1aiu__359_454_0

I've now also aborted another unit which begins with ai_hom, to be safe.


ID: 12131 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12133 - Posted: 17 Mar 2006, 7:28:06 UTC - in response to Message 12113.  

This WU was stuck at 1% for a day, then started "on its own" and got to about 70% done. Then BOINC switched over to one of the other projects I'm running (BBC CCE), and immediately I got a "computation error" and the percentage went to 100%.

16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c))


So, that's more wasted CPU cycles.

This project used to be bullet-proof - what's changed....?

regards,

Tim

PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March.

see here: http://www.ukboincteam.org.uk/uk-boinc-team.html

If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....!




believe me, we are trying!

ID: 12133 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12150 - Posted: 17 Mar 2006, 14:35:33 UTC

WU 11139882 has been stuck for a couple of days, since 15 Mar.


ps u -U boinc
USER       PID %CPU %MEM    VSZ  RSS TTY STAT START    TIME COMMAND
boinc    27870  0.0  1.3   6800 2944 ?   S    Jan18    2:07 ./boinc
boinc     4280 81.3  0.1 141408  284 ?   RN   Mar15 2324:55 rosetta_4.81_i68
boinc     4281  0.0  0.1 141408  284 ?   SN   Mar15    0:00 rosetta_4.81_i686
boinc     4282  0.0  0.1 141408  284 ?   SN   Mar15    0:00 rosetta_4.81_i686

fgrep pct_ stdout.txt
BOINC :: [2006-03-15 16:43:09] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: number_of_output: 35 :: pct_complete: 0.0283208
BOINC :: [2006-03-15 17:09:48] :: num_decoys: 2 :: number_of_output: 36 :: pct_complete: 0.0546174


I've killed the task, let's see how it goes...
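
The same check can be scripted. Below is a minimal Python sketch (added for illustration, not part of the original post) that reads the `pct_complete` lines shown above and reports how long ago the last progress line was written. It assumes the bracketed timestamp format seen in this stdout.txt; `PROGRESS_RE`, `last_progress`, and `hours_since_progress` are hypothetical names, and the "now" value in the usage is made up.

```python
import re
from datetime import datetime

# Progress lines copied from the fgrep output above.
SAMPLE = [
    "BOINC :: [2006-03-15 16:43:09] :: mode: abrelax :: nstartnm: 1 :: number_of_output: 10 :: num_decoys: 0 :: pct_complete: 0.01",
    "BOINC :: [2006-03-15 16:57:00] :: num_decoys: 1 :: number_of_output: 35 :: pct_complete: 0.0283208",
    "BOINC :: [2006-03-15 17:09:48] :: num_decoys: 2 :: number_of_output: 36 :: pct_complete: 0.0546174",
]

# Hypothetical pattern: bracketed timestamp, then the pct_complete value.
PROGRESS_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*pct_complete: ([0-9.]+)"
)


def last_progress(lines):
    """Return (timestamp, pct_complete) from the last progress line, or None."""
    last = None
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            last = (stamp, float(match.group(2)))
    return last


def hours_since_progress(lines, now):
    """Hours between the last progress line and `now`, or None if no line matched."""
    last = last_progress(lines)
    if last is None:
        return None
    return (now - last[0]).total_seconds() / 3600.0


if __name__ == "__main__":
    # Made-up "now", exactly two days after the last progress line.
    print(hours_since_progress(SAMPLE, datetime(2006, 3, 17, 17, 9, 48)))  # 48.0
```

A task that keeps burning CPU (81.3% in the ps output) while this number keeps growing is a reasonable candidate for a restart or an abort.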
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 12150 · Rating: 0
Ib Rasmussen
Joined: 27 Sep 05
Posts: 16
Credit: 211,416
RAC: 0
Message 12151 - Posted: 17 Mar 2006, 14:38:27 UTC

HB_BARCODE_30_1elwA_351_8105_0 stuck for two hours. Still stuck after restart. Aborted.

Computer ID 177498.

/Ib
ID: 12151 · Rating: 0
UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12154 - Posted: 17 Mar 2006, 16:01:19 UTC - in response to Message 12133.  

This project used to be bullet-proof - what's changed....?

believe me, we are trying!



GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys.

What sort of percentage of the work returned to you is being trashed by this bug?

I would imagine it's fairly high, although if it were an epidemic failure, I assume you would have stopped sending out work before now.

But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers?

regards and good luck.

Tim
ID: 12154 · Rating: 0
anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 12155 - Posted: 17 Mar 2006, 17:30:15 UTC - in response to Message 12154.  


ID: 12155 · Rating: 0