Report stuck & aborted WU here please

Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 10973 - Posted: 19 Feb 2006, 22:11:46 UTC
Last modified: 19 Feb 2006, 22:37:45 UTC

PRODUCTION_ABINITIO_INCREASECYCLES50_1ten__312_1196_0
Stuck at 1% after 4 h; it runs 2 million steps again and again. Last entries in stdout:

Starting score3 moves...
kk,score3,low_score,rms_err,low_rms,rms_min,naccept
0 -61.034 -61.034 11.848 11.848 8.788 15290
converged 2.07775331 108316
converged 2.71214509 112159
converged 2.55168295 125540
converged 2.40547872 129158
converged 1.95232618 132867
converged 2.9595387 137826
converged 2.75581789 140668
converged 2.2488966 144434
converged 2.80967975 158799
converged 2.50006342 162126
converged 2.39554954 169710
converged 2.04850674 183329
converged 1.99719334 187299
1 -40.606 -78.377 11.668 12.506 8.788 20134
converged 2.95864868 138902
converged 2.02392745 173508
converged 2.22490144 324940
2 -12.235 -78.377 8.579 12.506 6.717 26539
converged 2.71321511 126774
converged 1.66896379 159099

The stdout file's timestamp is no longer updated, but its content still is, and it is still at model 1!
Restarting BOINC didn't solve the problem; it only produced a new random seed. What can I do?
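To see whether a run like the one above is still producing output, the stdout excerpt can be summarized with a short script. This is only a minimal sketch: the column names are copied from the header line in the excerpt, their meaning is not documented in this thread, and the function name summarize_stdout is made up for illustration.

```python
import re

# Column names copied from the header line of the stdout excerpt.
# What each column means is an assumption; only the names come from the log.
COLUMNS = ["kk", "score3", "low_score", "rms_err", "low_rms", "rms_min", "naccept"]

def summarize_stdout(text):
    """Return (summary_rows, converged_count) for a Rosetta stdout excerpt."""
    rows = []
    converged = 0
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "converged":
            # Lines such as "converged 2.07775331 108316"
            converged += 1
        elif len(parts) == len(COLUMNS) and re.fullmatch(r"\d+", parts[0]):
            # Summary rows such as "0 -61.034 -61.034 11.848 11.848 8.788 15290"
            values = [int(parts[0])] + [float(p) for p in parts[1:-1]] + [int(parts[-1])]
            rows.append(dict(zip(COLUMNS, values)))
    return rows, converged

sample = """\
Starting score3 moves...
kk,score3,low_score,rms_err,low_rms,rms_min,naccept
0 -61.034 -61.034 11.848 11.848 8.788 15290
converged 2.07775331 108316
1 -40.606 -78.377 11.668 12.506 8.788 20134
converged 2.95864868 138902
2 -12.235 -78.377 8.579 12.506 6.717 26539
"""

rows, converged = summarize_stdout(sample)
print([row["kk"] for row in rows], converged)
```

If the kk counter keeps increasing between two looks at the file, the work unit is still computing even though the percentage shown by BOINC has not moved.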
ID: 10973 · Rating: 0
Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 10997 - Posted: 20 Feb 2006, 13:34:04 UTC

Another one, 9442770

Over 8 hours in and still stuck on 1%. It's running rosetta 4.82 too, so I guess that didn't fix the 1% problem then. Max CPU setting is 2 hours.
ID: 10997 · Rating: 0
Team TMR
Joined: 2 Nov 05
Posts: 21
Credit: 1,583,679
RAC: 0
Message 11002 - Posted: 20 Feb 2006, 15:42:03 UTC
Last modified: 20 Feb 2006, 15:42:31 UTC

Well it finished eventually, at 8hr 39mins. But it never did get off 1% as far as I could see.
ID: 11002 · Rating: 0
Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 11014 - Posted: 20 Feb 2006, 17:49:40 UTC
Last modified: 20 Feb 2006, 18:12:20 UTC

Damn, everything is going wrong here: another one fell back from 45% to 15% after 14 h :(. I will cancel them all and wait for a fix. (Rosetta 4.82, BOINC 5.2.13). I have never had any problems before... 2 of 3 failed :o
https://boinc.bakerlab.org/rosetta/result.php?resultid=11747939
https://boinc.bakerlab.org/rosetta/result.php?resultid=11748069
ID: 11014 · Rating: 0
Stwainer
Joined: 9 Nov 05
Posts: 27
Credit: 4,406,829
RAC: 0
Message 11018 - Posted: 20 Feb 2006, 18:23:53 UTC

I had the following WU stuck at 1% for 2 hours: PRODUCTION_ABINITIO_INCREASECYCLES50_1dhn__312_608_0
ID: 11018 · Rating: 0
Jon Kennedy
Joined: 1 Oct 05
Posts: 6
Credit: 418,027
RAC: 0
Message 11063 - Posted: 21 Feb 2006, 4:16:21 UTC

This workunit was stuck at 1% after 27h35m:
https://boinc.bakerlab.org/rosetta/result.php?resultid=11510637

Nothing occurred on the machine to interrupt crunching - Message log:

2/19/2006 5:04:22 PM|rosetta@home|Starting result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 using rosetta version 481
2/19/2006 5:04:24 PM|rosetta@home|Started upload of PRODUCTION_ABINITIO_RANDOMFRAG_1ughI_309_445_0_0
2/19/2006 5:04:31 PM|rosetta@home|Finished upload of PRODUCTION_ABINITIO_RANDOMFRAG_1ughI_309_445_0_0
2/19/2006 5:04:31 PM|rosetta@home|Throughput 23263 bytes/sec
2/20/2006 8:26:06 PM||request_reschedule_cpus: project op
2/20/2006 8:26:10 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2/20/2006 8:26:10 PM|rosetta@home|Reason: Requested by user
2/20/2006 8:26:10 PM|rosetta@home|Reporting 7 results
2/20/2006 8:26:15 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2/20/2006 10:35:53 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 (aborted via GUI RPC)

The next WU PRODUCTION_ABINITIO_RANDOMFRAG_2acy__309_389_0 is also seemingly stuck at 1% after 37+ minutes... <sigh>
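The wall-clock time between the "Starting result" and "Unrecoverable error" entries in the message log above can be worked out with a short script. This is a minimal sketch that assumes the US-style date format and the date|project|message layout seen in this log; the function names parse_line and wall_clock_between are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

def parse_line(line):
    """Split 'date time|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    # Assumed format, taken from this log: month/day/year, 12-hour clock.
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

def wall_clock_between(lines, start_marker, end_marker):
    """Return the timedelta from the first start entry to the last end entry."""
    start = end = None
    for line in lines:
        when, _project, message = parse_line(line)
        if start is None and message.startswith(start_marker):
            start = when
        elif message.startswith(end_marker):
            end = when
    if start is None or end is None:
        return None
    return end - start

log = [
    "2/19/2006 5:04:22 PM|rosetta@home|Starting result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 using rosetta version 481",
    "2/20/2006 8:26:06 PM||request_reschedule_cpus: project op",
    "2/20/2006 10:35:53 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 (aborted via GUI RPC)",
]

elapsed = wall_clock_between(log, "Starting result", "Unrecoverable error")
print(elapsed)
```

For this log the result is a little over 29.5 hours of wall-clock time, which need not match the CPU time that BOINC reports for the work unit.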

ID: 11063 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 11064 - Posted: 21 Feb 2006, 4:22:51 UTC
Last modified: 21 Feb 2006, 5:12:01 UTC

I have one stuck on a Mac G4 laptop running OS 10.4.5. The WU is here. The application version is 4.82. The previous "owner" had a client error on this WU.

This will be the result ID if I can make it finish.

The WU is stuck at 1% complete after 2:15 of CPU time. My time setting is set for 2 hours. It has completed 97345 steps but shows 1% complete.

The WU name is -PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_1acf__311_807

Regards
Phil

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 11064 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 11070 - Posted: 21 Feb 2006, 5:24:16 UTC - in response to Message 11063.  
Last modified: 21 Feb 2006, 5:25:02 UTC

This workunit was stuck at 1% after 27h35m:
https://boinc.bakerlab.org/rosetta/result.php?resultid=11510637

Nothing occurred on the machine to interrupt crunching - Message log:

2/19/2006 5:04:22 PM|rosetta@home|Starting result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 using rosetta version 481
2/19/2006 5:04:24 PM|rosetta@home|Started upload of PRODUCTION_ABINITIO_RANDOMFRAG_1ughI_309_445_0_0
2/19/2006 5:04:31 PM|rosetta@home|Finished upload of PRODUCTION_ABINITIO_RANDOMFRAG_1ughI_309_445_0_0
2/19/2006 5:04:31 PM|rosetta@home|Throughput 23263 bytes/sec
2/20/2006 8:26:06 PM||request_reschedule_cpus: project op
2/20/2006 8:26:10 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2/20/2006 8:26:10 PM|rosetta@home|Reason: Requested by user
2/20/2006 8:26:10 PM|rosetta@home|Reporting 7 results
2/20/2006 8:26:15 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2/20/2006 10:35:53 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_RANDOMFRAG_1urnA_309_445_0 (aborted via GUI RPC)

The next WU PRODUCTION_ABINITIO_RANDOMFRAG_2acy__309_389_0 is also seemingly stuck at 1% after 37+ minutes... <sigh>


I wonder if we actually have stuck WUs or if they are just some of the ones that used to take 30 hours. Both yours and mine are "PRODUCTION_ABINITIO_xxxx". I just watched the screen saver for a while, and it has run over 100,000 steps on the first model, but it is running. It could be that it is just doing more steps per model and therefore taking longer to checkpoint, which would delay the percent complete. It has run over 20 min, and all the WUs I have seen since the new version of the software was released have taken only about 5 min per model.
ID: 11070 · Rating: 0
Peter Ingham
Joined: 27 Sep 05
Posts: 14
Credit: 4,215,134
RAC: 3
Message 11092 - Posted: 21 Feb 2006, 10:05:05 UTC

FYI, I've just aborted a WU stuck at 1% after 175K seconds

Name: PRODUCTION_ABINITIO_RANDOMFRAG_1vcc__309_441
WU: 9337995
ResultID: 11582797

ID: 11092 · Rating: 0
KwintenB
Joined: 24 Nov 05
Posts: 6
Credit: 183,329
RAC: 0
Message 11109 - Posted: 21 Feb 2006, 13:19:58 UTC
Last modified: 21 Feb 2006, 13:21:54 UTC

I've got a WU that has already been crunching for 51 h; I have now suspended it. Is there any chance that I'll get points for this job if I abort it? This is obviously a project fault.
Details of the WU:
19/02/2006 04:12:00|rosetta@home|Starting result PRODUCTION_ABINITIO_DBFLAGS_1lis__307_738_0 using rosetta version 481
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9300013
https://boinc.bakerlab.org/rosetta/result.php?resultid=11470130
ID: 11109 · Rating: 0
arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 11138 - Posted: 21 Feb 2006, 19:18:50 UTC

PRODUCTION_ABINITIO_INCREASECYCLES50_1tul__317_178_0
Appears stuck on 1%. Can't start it from the DOS window, it crashed.
ID: 11138 · Rating: 0
arklms
Joined: 17 Dec 05
Posts: 7
Credit: 177,488
RAC: 0
Message 11143 - Posted: 21 Feb 2006, 19:32:24 UTC - in response to Message 11138.  

PRODUCTION_ABINITIO_INCREASECYCLES50_1tul__317_178_0
Appears stuck on 1%. Can't start it from the DOS window, it crashed.


I just clicked on the Rosetta graphics, which crashed the computer. Upon reboot, it's 17% and ongoing. Strange, but true.
ID: 11143 · Rating: 0
Daral
Joined: 13 Jan 06
Posts: 13
Credit: 870,334
RAC: 0
Message 11154 - Posted: 21 Feb 2006, 21:51:07 UTC

Got a 1% error for 1 hr 21 minutes. Work Unit Production_Abinitio_increasecycles50_1ten_317_127_0

Running it from command line now with seed 1037999 seems to also get stuck on the first iteration. It's run over 512k steps and is still on the first model.
ID: 11154 · Rating: 0
Nico
Joined: 29 Sep 05
Posts: 1
Credit: 548,959
RAC: 0
Message 11160 - Posted: 21 Feb 2006, 22:57:52 UTC

PRODUCTION_ABINITIO_QUADRUPLELONGRANGEANTIPARALLEL_1tul__311_863 is stuck at 1%:
(I requested 2 h WUs, and this one has been running for more than 2 h now and is still at 1%)
http://666kb.com/i/117ucnv1ep5vl.gif
ID: 11160 · Rating: 0
O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 11164 - Posted: 22 Feb 2006, 0:09:47 UTC

Hello David

PRODUCTION_ABINITIO_1acf__250_809_2

My computer did 13.32 hours on this WU ... before it errored out with -177 (0xffffff4f) Exit status and "Maximum CPU time exceeded".

What about the ... 131.16 claimed credits?

Regards,
O&O
ID: 11164 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11171 - Posted: 22 Feb 2006, 1:06:33 UTC - in response to Message 11092.  

FYI, I've just aborted a WU stuck at 1% after 175K seconds

Name: PRODUCTION_ABINITIO_RANDOMFRAG_1vcc__309_441
WU: 9337995
ResultID: 11582797


Some of the PRODUCTION_ABINITIO work units take a long time to pass 1%, in some cases over 4 1/2 hours on a reasonably fast machine. Before aborting them and losing ALL the time spent, you should check the graphic display and make certain it is not actually running. In many cases these work units take well over 700,000 steps to complete a single model. The work unit will look hung between model completions. If it is hung, you can usually preserve some of the time spent running it by restarting BOINC. This will in most cases cause the work unit to run successfully to completion.

The project team is aware of this and they are making an adjustment in the WUs to fix the problem. But it will take a few days to two weeks for the old work units to run through the system.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 11171 · Rating: 0
Runaway1956
Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 11229 - Posted: 23 Feb 2006, 1:15:31 UTC - in response to Message 11171.  

Well, glad I stopped in to look around.

After the upgrade to 4.81, it seemed that none of my previously downloaded WU wanted to run. Which was odd, as I'd already returned a number of similar WU from the same batch.

I put those all on hold, and ran some of the newer WU, which said they were for 4.81. I got the 1% glitch on about 4 of them.

Hit reset. Everything goes away, and BOINC downloads some new WU. Same thing. 1% lasts about 2 1/2 eternities.

Was about to hit reset again, but decided to come here.....

Thanks guys. I'll let the little monster run.




FYI, I've just aborted a WU stuck at 1% after 175K seconds

Name: PRODUCTION_ABINITIO_RANDOMFRAG_1vcc__309_441
WU: 9337995
ResultID: 11582797


Some of the PRODUCTION_ABINITIO work units take a long time to pass 1%, in some cases over 4 1/2 hours on a reasonably fast machine. Before aborting them and losing ALL the time spent, you should check the graphic display and make certain it is not actually running. In many cases these work units take well over 700,000 steps to complete a single model. The work unit will look hung between model completions. If it is hung, you can usually preserve some of the time spent running it by restarting BOINC. This will in most cases cause the work unit to run successfully to completion.

The project team is aware of this and they are making an adjustment in the WUs to fix the problem. But it will take a few days to two weeks for the old work units to run through the system.


ID: 11229 · Rating: 0
XS_team_germany
Joined: 2 Jan 06
Posts: 6
Credit: 1,469,591
RAC: 0
Message 11268 - Posted: 23 Feb 2006, 20:57:05 UTC

I uploaded these results today and I received no credit for them:


23/02/2006 21:11:55|rosetta@home|Started upload of NO_SIM_ANNEAL_BARCODE_30_2reb_286_9242_3_0
23/02/2006 21:14:42|rosetta@home|Finished upload of NO_SIM_ANNEAL_BARCODE_30_2reb_286_9242_3_0
23/02/2006 21:14:42|rosetta@home|Throughput 2945 bytes/sec
23/02/2006 21:14:42|rosetta@home|Started upload of BARCODE_30_256bA_299_5130_3_0
23/02/2006 21:17:14|rosetta@home|Finished upload of BARCODE_30_256bA_299_5130_3_0
23/02/2006 21:17:14|rosetta@home|Throughput 2734 bytes/sec
23/02/2006 21:17:14|rosetta@home|Started upload of BARCODE_30_1cc8A_299_5509_3_0
23/02/2006 21:20:43|rosetta@home|Finished upload of BARCODE_30_1cc8A_299_5509_3_0
23/02/2006 21:20:43|rosetta@home|Throughput 2978 bytes/sec
23/02/2006 21:20:43|rosetta@home|Started upload of BARCODE_30_2chf__299_5291_3_0
23/02/2006 21:22:23|rosetta@home|Finished download of aa1b72_09_05.400_v1_3.gz
23/02/2006 21:22:23|rosetta@home|Throughput 1391 bytes/sec
23/02/2006 21:22:23|rosetta@home|Started upload of BARCODE_30_1elwA_299_3950_3_0
23/02/2006 21:23:54|rosetta@home|Finished upload of BARCODE_30_2chf__299_5291_3_0
23/02/2006 21:23:54|rosetta@home|Throughput 2598 bytes/sec
23/02/2006 21:23:54|rosetta@home|Started upload of BARCODE_30_1a68__299_29202_3_0
23/02/2006 21:26:43|rosetta@home|Finished upload of BARCODE_30_1elwA_299_3950_3_0
23/02/2006 21:26:43|rosetta@home|Throughput 1789 bytes/sec
23/02/2006 21:26:43|rosetta@home|Started upload of BARCODE_30_2chf__299_7685_2_0
23/02/2006 21:29:13|rosetta@home|Finished upload of BARCODE_30_1a68__299_29202_3_0
23/02/2006 21:29:13|rosetta@home|Throughput 1758 bytes/sec
23/02/2006 21:29:13|rosetta@home|Started upload of BARCODE_30_2chf__299_7671_2_0
23/02/2006 21:31:14|rosetta@home|Finished upload of BARCODE_30_2chf__299_7685_2_0
23/02/2006 21:31:14|rosetta@home|Throughput 1770 bytes/sec
23/02/2006 21:31:14|rosetta@home|Started upload of BARCODE_30_256bA_299_4860_3_0
23/02/2006 21:33:39|rosetta@home|Finished upload of BARCODE_30_2chf__299_7671_2_0
23/02/2006 21:33:39|rosetta@home|Throughput 1731 bytes/sec
23/02/2006 21:33:39|rosetta@home|Started upload of BARCODE_30_1a68__299_7544_2_0
23/02/2006 21:35:46|rosetta@home|Finished upload of BARCODE_30_256bA_299_4860_3_0
23/02/2006 21:35:46|rosetta@home|Throughput 1941 bytes/sec
23/02/2006 21:35:46|rosetta@home|Started upload of BARCODE_30_1elwA_299_7622_2_0
23/02/2006 21:39:57|rosetta@home|Finished upload of BARCODE_30_1elwA_299_7622_2_0
23/02/2006 21:39:57|rosetta@home|Throughput 1583 bytes/sec
23/02/2006 21:39:57|rosetta@home|Started upload of BARCODE_30_1opd__299_7625_2_0
23/02/2006 21:40:30|rosetta@home|Finished upload of BARCODE_30_1a68__299_7544_2_0
23/02/2006 21:40:30|rosetta@home|Throughput 1335 bytes/sec
23/02/2006 21:43:39|rosetta@home|Finished upload of BARCODE_30_1opd__299_7625_2_0


Host: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=166212

The above work units ran 7+ hours each.

:(
ID: 11268 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11324 - Posted: 24 Feb 2006, 14:04:07 UTC

For people having many Work Unit errors!

I have received an e-mail from Dr. Baker with information for any of you who are having a lot of Work Unit errors.

"Could you help us to recommend to people having problems with lots of WU to set the target run time to a smaller value like 2 hours. We think there aren't any new bugs, just with longer run times it is more likely for a WU to have problems."

So if you are having a lot of errors please reset your Time setting to 2 hours and see if that helps.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 11324 · Rating: 0
Marie Lucie
Joined: 9 Dec 05
Posts: 5
Credit: 40,616
RAC: 0
Message 11375 - Posted: 25 Feb 2006, 9:28:40 UTC

Hello, I made the change in the Rosetta settings as requested and got an error again. It ran for 53 minutes and then ...

25/02/2006 10:28:41|rosetta@home|Unrecoverable error for result HBLR_1.0_1hz6_321_998_0 ( - exit code -1073741819 (0xc0000005))
25/02/2006 10:28:42||request_reschedule_cpus: process exited
25/02/2006 10:28:42|rosetta@home|Computation for result HBLR_1.0_1hz6_321_998_0 finished

I've one WU remaining. We will see
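For anyone wondering how the negative exit codes in these logs relate to the hex values printed next to them: the hex form is the same number read as an unsigned 32-bit value. The sketch below shows the conversion for the two codes quoted in this thread; the helper names are made up for illustration. As a side note, 0xc0000005 is the Windows status code for an access violation.

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF

def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit exit code."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value

# The two exit codes quoted in this thread.
for code in (-177, -1073741819):
    print(code, hex(to_unsigned32(code)))
```

Going the other way, to_signed32(0xC0000005) gives back -1073741819, so either form can be used when searching for an error.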
ID: 11375 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org