Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
Runaway1956 Send message Joined: 5 Nov 05 Posts: 19 Credit: 535,400 RAC: 0 |
I saw this message last week for the first time, just aborted the WU. But twice this morning: 3/30/2006 12:26:10 PM|rosetta@home|Started upload of FA_RLXb3_hom001_1b3aA_357_21_1_0 3/30/2006 12:26:16 PM|rosetta@home|Started upload of FA_RLXti_hom001_1tif__357_26_1_0 3/30/2006 12:27:39 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/17e/FA_RLXb3_hom001_1b3aA_357_21_1_0 98304 bytes != offset 0 bytes 3/30/2006 12:27:39 PM|rosetta@home|Temporarily failed upload of FA_RLXb3_hom001_1b3aA_357_21_1_0: transient upload error 3/30/2006 12:27:39 PM|rosetta@home|Backing off 2 hours, 39 minutes, and 22 seconds on upload of file FA_RLXb3_hom001_1b3aA_357_21_1_0 3/30/2006 12:28:18 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/3b9/FA_RLXti_hom001_1tif__357_26_1_0 141256 bytes != offset 0 bytes 3/30/2006 12:28:18 PM|rosetta@home|Temporarily failed upload of FA_RLXti_hom001_1tif__357_26_1_0: transient upload error 3/30/2006 12:28:18 PM|rosetta@home|Backing off 3 hours, 10 minutes, and 46 seconds on upload of file FA_RLXti_hom001_1tif__357_26_1_0 This is on the Opteron 144, machine identified as nunyabiz-s2pvzz |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
|
CremionisD Send message Joined: 10 Mar 06 Posts: 9 Credit: 37,604,006 RAC: 0 |
Work unit aborted at 1.00%, CPU time used ~5:28:00 WU Name = "HB_BARCODE_30_1pgx__351_35027_0" Application Rosetta 4.82, System CPU Pentium M 1600MHz, 1GB ram. Windows XP SP 2. |
Laurenu2 Send message Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
I think most of the problems reported in the last few posts were from work units created before the March 28 update--hopefully these older wu will all get through the system in the next day or two. I think you are Right David. It has been 36 Hrs and NO 1% stuck W/Us (*_*) THANK YOU David!! Is the data retrieval you added to your client / WU working to find out what is/was causing this Bug? If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I think most of the problems reported in the last few posts were from work units created before the March 28 update--hopefully these older wu will all get through the system in the next day or two. That is great!! I'm particularly glad in your case because of all the computers you had to be watching over. I had hoped to be reading reports of "WU stuck at 5.0733 %", which would have helped to locate the errors, but it is even better to see that the "stuck" work units problem seems to be much reduced. Please spread the word! |
Jon Kennedy Send message Joined: 1 Oct 05 Posts: 6 Credit: 418,027 RAC: 0 |
This WU was stuck at 1% for over 53 hours: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11860123 random seed: 2232363 Stuck at model 1, step 22837 Claimed credit: 269.87 Graphic frozen. Should I abort all my 4.82 WUs, or just the ones named similarly to this one, or none? |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
This WU was stuck at 1% for over 53 hours: If you are having problems with "stuck at 1%", please do abort pre-4.83 WUs. The 4.83 WUs seem to get stuck less often, and if/when they do get stuck, we will be able to trace the problem more easily. |
pieface Send message Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0 |
I have a 'stuck' 4.83, wuid=11843998, cpid=163786. Noticed that it was still running after 20+ hours of CPU time. Looked at the graphics and it was at 21.742 pct complete. Suspended the unit and BM (this guy is still running 5.2.13), closed down Windows and did a cold start. Brought BM back up and un-suspended the unit. CPU time went back to about 52 minutes, then started moving forward. Graphics looked OK, lots of movement. Now, after a couple of hours, it's stuck at 21.742 percent complete again, model 8, step 266356. Task Manager says it's pulling 100 pct of the CPU. Edit: just noticed that someone else with a similar machine (Pentium M, 1.86) had already aborted this unit...interesting... |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I have a 'stuck' 4.83, wuid=11843998, cpid=163786. Sorry about this, but your information will be very helpful in tracking down the problem. The ".742" tells us where the sticking is happening. Thanks, David |
pieface Send message Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0 |
Not a problem, I suspended the WU again instead of aborting, so I could get on with some new work without losing it (in case you folks want something else from it). |
warlock Send message Joined: 9 Oct 05 Posts: 1 Credit: 3,379,414 RAC: 0 |
4/1/2006 9:53:00 PM|rosetta@home|Resuming result FA_RLXpt_hom006_1ptq__361_13_0 using rosetta version 482 seems to be another w/u that gets stuck. The graphic display indicates 1.00% done after 26+ hours and: Stage: Full Atom Relax Model: 1 Step: 331496 Accepted RMSD: 10.84 Accepted Energy: -51.44441 My machine and relevant software are: 4/1/2006 5:24:19 PM||Starting BOINC client version 5.2.13 for windows_intelx86 4/1/2006 5:24:19 PM||libcurl/7.14.0 OpenSSL/0.9.8 zlib/1.2.3 4/1/2006 5:24:19 PM||Data directory: C:\Program Files\BOINC 4/1/2006 5:24:20 PM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz 4/1/2006 5:24:20 PM||Memory: 1006.73 MB physical, 2.37 GB virtual |
Kelemvor Send message Joined: 28 Dec 05 Posts: 7 Credit: 458,146 RAC: 0 |
Just checked a couple of my PCs that weren't reporting and found a few stuck on WUs with the FA_ start. One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours. The other is FA_RLXpt_hom003_1ptq_361_283_0. That one's been running for 227 hours! What should I do? Let me know if you need more info. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours. Those look like old WUs from before the latest round of fixes. Just abort them. |
Kelemvor Send message Joined: 28 Dec 05 Posts: 7 Credit: 458,146 RAC: 0 |
One is FA_RLXpt_hom006_1ptq__361_29_0. It's been running for 95 hours. In the queue there's a ton more WUs with that same general name. Should I go through and abort them all? |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
In the queue there's a ton more WUs with that same general name. Should I go through and abort them all? If you have WUs that call for the older client, and they are giving you problems, then go ahead and abort them all. I believe the current version of rosetta is 4.83 for windows and 4.82 for linux. (The "application" field of the "work" tab of boinc manager gives the version that the WU is asking for.) |
Jimi@0wned.org.uk Send message Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0 |
First failure on this rig, a bit odd and not one I've seen before: 7449_2_fullatom_relax_dec7449_2_09_3.pdb_415_4_0 <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # random seed: 1794757 # cpu_run_time_pref: 14400 # DONE :: 1 starting structures built 4 (nstruct) times # This process generated 4 decoys from 4 attempts # 0 starting pdbs were skipped ERROR:: Exit at: .read_paths.cc line:346 </stderr_txt> |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Had my 1st crash in a very long time. This is the 1st Rosetta WU I have done since the new bug tracking code was put into the WUs, and I wonder if that caused my crash: https://boinc.bakerlab.org/rosetta/result.php?resultid=15959192 I have not got my messages for it; I searched high and low and could not find any. Woke up this morning and this WU was just sat there at around 12%, nothing else happening, so I have aborted it. Will see what the next WU does. Join us in Chat (see the forum) Click the Sig Join UBT |
Team TMR Send message Joined: 2 Nov 05 Posts: 21 Credit: 1,583,679 RAC: 0 |
This WU seems to be stuck: 13107954 Over 3 hours in and it's on 1.19%. Job CPU time is set to 2 hours. Edit: Now at 4 hours and 1.30%. |
Dufva Send message Joined: 2 Jan 06 Posts: 1 Credit: 31,453 RAC: 0 |
Have experienced an unrecoverable error twice in the last week when opening graphics from the BOINC client. It has happened both times when I have run heavy parallel processes (perhaps CPU overload?). Windows XP got stuck and switched over to "secure" 16 color mode as I tried to open the graphics, and it resulted both times in unrecoverable errors: 2006-04-05 00:11:46 [rosetta@home] Unrecoverable error for result DOUBLE_SS_WEIGHT_1vie__419_6_0 ( - exit code -1073741819 (0xc0000005)) 2006-04-05 00:11:49 [---] request_reschedule_cpus: process exited 2006-04-05 00:11:49 [rosetta@home] Computation for result DOUBLE_SS_WEIGHT_1vie__419_6_0 finished |
[DPC]Alexcj Send message Joined: 21 Mar 06 Posts: 3 Credit: 8,374 RAC: 0 |
Hi, I got a stuck workunit also: It's this one: FAH_RLXpt_hom007_1ptq_361_166_1 also known as:11689531 It is stuck at 1.04% (according to the GUI) with 6:54:44 hours CPU time done and 8:50:57 to go ;-) It's on this machine. The other participant wasn't so lucky either. |
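
For anyone sorting through message logs like the ones pasted in this thread, here is a minimal sketch of two small helpers. It assumes the pipe-delimited `timestamp|project|message` layout seen in the posts above, and that exit codes are signed 32-bit values. The function names `exit_code_to_hex` and `parse_message_line` are hypothetical and are not part of BOINC or Rosetta. The first helper shows why `-1073741819` and `0xc0000005` in the unrecoverable-error report are the same number (on Windows that value is the access-violation status code).

```python
def exit_code_to_hex(code: int) -> str:
    """Render a signed 32-bit exit code as unsigned hex (hypothetical helper)."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{code & 0xFFFFFFFF:08x}"


def parse_message_line(line: str) -> tuple[str, str, str]:
    """Split a 'timestamp|project|message' log line into its three fields.

    Assumes the pipe-delimited layout seen in the pasted logs; the project
    field may be empty (as in 'date time||Starting BOINC client ...').
    """
    # maxsplit=2 keeps any further '|' characters inside the message text.
    timestamp, project, message = line.split("|", 2)
    return timestamp.strip(), project.strip(), message.strip()


if __name__ == "__main__":
    print(exit_code_to_hex(-1073741819))
    sample = ("3/30/2006 12:27:39 PM|rosetta@home|Temporarily failed upload of "
              "FA_RLXb3_hom001_1b3aA_357_21_1_0: transient upload error")
    print(parse_message_line(sample))
```

Filtering the parsed messages for phrases such as "Error on file upload" or "Unrecoverable error" is one way to pull out just the lines worth posting here.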
©2024 University of Washington
https://www.bakerlab.org