Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
warlock Send message Joined: 9 Oct 05 Posts: 1 Credit: 3,379,414 RAC: 0 |
4/1/2006 9:53:00 PM|rosetta@home|Resuming result FA_RLXpt_hom006_1ptq__361_13_0 using rosetta version 482 seems to be another w/u that gets stuck. The graphic display indicates 1.00% done after 26+ hours and: Stage: Full Atom Relax Model: 1 Step: 331496 Accepted RMSD: 10.84 Accepted Energy: -51.44441 My machine and relevant software are: 4/1/2006 5:24:19 PM||Starting BOINC client version 5.2.13 for windows_intelx86 4/1/2006 5:24:19 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3 4/1/2006 5:24:19 PM||Data directory: C:\Program Files\BOINC 4/1/2006 5:24:20 PM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz 4/1/2006 5:24:20 PM||Memory: 1006.73 MB physical, 2.37 GB virtual |
Kelemvor Send message Joined: 28 Dec 05 Posts: 7 Credit: 458,146 RAC: 0 |
Just checked a couple of my PCs that weren't reporting and found a few stuck on WUs with the FA_ start. One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours. The other is FA_RLXpt_hom003_1ptq_361_283_0. That one's been running for 227 hours! What should I do? Let me know if you need more info. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours." Those look like old WUs from before the latest round of fixes. Just abort them. |
Kelemvor Send message Joined: 28 Dec 05 Posts: 7 Credit: 458,146 RAC: 0 |
"One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours." In the queue there's a ton more WUs with that same general name. Should I go through and abort them all? |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"In the queue there's a ton more WUs with that same general name. Should I go through and abort them all?" If you have WUs that call for the older client, and they are giving you problems, then go ahead and abort them all. I believe the current version of rosetta is 4.83 for Windows and 4.82 for Linux. (The "application" field of the "work" tab of BOINC Manager gives the version that the WU is asking for.) |
Jimi@0wned.org.uk Send message Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0 |
First failure on this rig, a bit odd and not one I've seen before: 7449_2_fullatom_relax_dec7449_2_09_3.pdb_415_4_0 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # random seed: 1794757 # cpu_run_time_pref: 14400 # DONE :: 1 starting structures built 4 (nstruct) times # This process generated 4 decoys from 4 attempts # 0 starting pdbs were skipped ERROR:: Exit at: .read_paths.cc line:346 </stderr_txt> |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Had my first crash in a very long time. This is the first Rosetta WU I have done since the new bug-tracking code was put into the WUs, and I wonder if that caused my crash: https://boinc.bakerlab.org/rosetta/result.php?resultid=15959192 I have not got my messages for it; I searched high and low and could not find any. I woke up this morning and this WU was just sitting there at around 12% with nothing else happening, so I aborted it. Will see what the next WU does. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
This WU seems to be stuck: 13107954 Over 3 hours in and it's on 1.19%. Job CPU time is set to 2 hours. Edit: Now at 4 hours and 1.30%. |
Dufva Send message Joined: 2 Jan 06 Posts: 1 Credit: 31,453 RAC: 0 |
I have experienced an unrecoverable error twice in the last week when opening graphics from the BOINC client. It happened both times while I was running heavy parallel processes (perhaps CPU overload?). Windows XP got stuck and switched over to "secure" 16 color mode as I tried to open the graphics, and it resulted both times in unrecoverable errors: 2006-04-05 00:11:46 [rosetta@home] Unrecoverable error for result DOUBLE_SS_WEIGHT_1vie__419_6_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-05 00:11:49 [---] request_reschedule_cpus: process exited 2006-04-05 00:11:49 [rosetta@home] Computation for result DOUBLE_SS_WEIGHT_1vie__419_6_0 finished |
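For anyone reading the exit code in the log above: the signed number and the hex value in parentheses are the same 32-bit value, and 0xC0000005 is the Windows status code for an access violation. A minimal sketch of the conversion follows; the function name `to_ntstatus` and the one-entry lookup table are illustrative, not part of BOINC or Rosetta.

```python
def to_ntstatus(exit_code: int) -> str:
    """Return the unsigned 32-bit hex form of a signed Windows exit code."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Tiny illustrative lookup; Windows defines many more status codes than this.
KNOWN_STATUS = {
    "0xc0000005": "STATUS_ACCESS_VIOLATION",
}

code = to_ntstatus(-1073741819)
print(code, KNOWN_STATUS.get(code, "unknown"))
```

This only names the error class; it does not say whether the fault was in the graphics code or elsewhere.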
[DPC]Alexcj Send message Joined: 21 Mar 06 Posts: 3 Credit: 8,374 RAC: 0 |
Hi, I've got a stuck workunit too: FAH_RLXpt_hom007_1ptq_361_166_1, also known as 11689531. It is stuck at 1.04% (according to the GUI) with 6:54:44 hours CPU time done and 8:50:57 to go ;-) It's on this machine. The other participant wasn't so lucky either. |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
I've got two more somewhere that are stuck on 1.04% after 2 hours and 1.5 hours. |
Delk Send message Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
Another frozen workunit: name: FA_RLXpt_hom006_1ptq__361_80_1 WU name: FA_RLXpt_hom006_1ptq__361_80 app version num: 483 checkpoint CPU time: 2112.812500 current CPU time: 121543.171875 fraction done: 0.293420 VM usage: 0.000000 resident set size: 0.000000 estimated CPU time remaining: 89449.556235 result id: 15996230 workunit id: 11646530 It was meant to be a 2-hour unit, although obviously things went wrong; I noticed the issue 121543 seconds later. |
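To put the figures in the post above into hours: 121543 seconds of CPU time is roughly 33.8 hours against a 2-hour target, with the last checkpoint written at about 35 minutes. The sketch below does that arithmetic; the `looks_stuck` helper and its factor-of-3 threshold are hypothetical choices for illustration, not a rule used by BOINC or Rosetta.

```python
def cpu_hours(seconds: float) -> float:
    """Convert CPU seconds to hours."""
    return seconds / 3600.0


def looks_stuck(current_cpu_s: float, target_s: float, factor: float = 3.0) -> bool:
    """Flag a task whose CPU time exceeds `factor` times its target runtime.

    The factor is an arbitrary illustrative threshold.
    """
    return current_cpu_s > factor * target_s


# Values copied from the post above.
current_cpu_s = 121543.171875
checkpoint_cpu_s = 2112.8125
target_s = 2 * 3600  # the 2-hour preference mentioned in the post

print(f"CPU time so far: {cpu_hours(current_cpu_s):.1f} h")
print(f"CPU time since last checkpoint: {cpu_hours(current_cpu_s - checkpoint_cpu_s):.1f} h")
print("Looks stuck:", looks_stuck(current_cpu_s, target_s))
```

A large gap between current and checkpoint CPU time is only a hint; some work units legitimately checkpoint rarely.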
Delk Send message Joined: 20 Feb 06 Posts: 25 Credit: 995,624 RAC: 0 |
In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"In regard to my previous post, shouldn't the max unit runtime of 24 hours have aborted it automatically, or has that been removed?" It should have; we will try to figure out why these weren't terminated. Also, the reports here will be very helpful in pinning down the problem. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I think all these FA_* WUs are old. Someone else aborted them, so they were sent out again. You can check the creation date on the WUs page, and anything created in March should be aborted if it seems stuck. The 24-hour timeout is only in newer WUs, which started coming out at the end of March. |
Dave Wilson Send message Joined: 8 Jan 06 Posts: 35 Credit: 379,049 RAC: 0 |
Just found https://boinc.bakerlab.org/rosetta/result.php?resultid=16005553 sorry I did not get the rest of the info but it was stuck at around 17 hours and 34.--- % |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=15702999 |
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
Somebody needs to find a solution for this series. If this goes on, it's no fun anymore. All crashed after more than an hour of crunching. 2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC) 2006-04-07 14:28:40 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_3457_0 (aborted via GUI RPC) 2006-04-07 18:49:47 [rosetta@home] Unrecoverable error for result HBLR_1.0_1ogw_424_6531_0 (aborted via GUI RPC) 2006-04-08 10:29:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1b72_424_2309_1 (aborted via GUI RPC) 2006-04-08 15:26:29 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_426_2934_0 (aborted via GUI RPC) 2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005)) |
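When a message log has many lines like the ones above, it can help to count them by the reason given in parentheses before posting a report. A minimal sketch, assuming the line layout shown in this thread; the regular expression and the `tally_reasons` helper are illustrative and not part of any BOINC tool.

```python
import re
from collections import Counter

# Matches "Unrecoverable error for result <name> (<reason>)" at the end of a line.
PATTERN = re.compile(r"Unrecoverable error for result (\S+) \((.*)\)\s*$")


def tally_reasons(lines):
    """Count unrecoverable-error lines by the reason text in parentheses."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # Drop the leading " - " that precedes exit-code reasons.
            reason = match.group(2).strip().lstrip("-").strip()
            counts[reason] += 1
    return counts


# Three of the lines from the post above.
sample = [
    "2006-04-07 12:03:15 [rosetta@home] Unrecoverable error for result HBLR_1.0_1hz6_424_1768_0 (aborted via GUI RPC)",
    "2006-04-08 15:32:46 [rosetta@home] Unrecoverable error for result HBLR_1.0_1di2_426_4357_0 ( - exit code -1073741819 (0xc0000005))",
    "2006-04-08 15:41:50 [rosetta@home] Unrecoverable error for result HBLR_1.0_1r69_426_325_1 ( - exit code -1073741819 (0xc0000005))",
]

for reason, count in tally_reasons(sample).most_common():
    print(count, reason)
```

Separating the two reasons matters for a report, since the exit-code lines and the "aborted via GUI RPC" lines are logged differently and may have different causes.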
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
Just seconds after I posted this, the next one went down the drain. 2006-04-08 16:32:43 [rosetta@home] Unrecoverable error for result HBLR_1.0_1mky_426_4765_0 (aborted via GUI RPC) I'm going on hold until these problems have been solved and things are stable. I couldn't get a result or a point on the board. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Can you try setting the work unit time to 1 hour? Thanks, David |