Report stuck & aborted WU here please

Message boards : Number crunching : Report stuck & aborted WU here please

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12796 - Posted: 29 Mar 2006, 19:11:41 UTC - in response to Message 12789.  

[quote][quote]That is not good. With the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is.[/quote]

Yes, on the stuck units, if you restart BOINC it resets the timer to 0.
I aborted another 4 W/Us today, which brings the total to 9 since Sunday.
Sorry, I am not much good at gathering info. I just hope the returned W/Us will help give you the info you need to stop this bug.[/quote]

Laurenu2: If I remember your description of your pharm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck WU rate around 10% for yesterday, well above the average failure rate. (The error rate seems high even if you've expanded to 80 machines.)
Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed, o/c or not, amount of RAM, OS version, BOINC version, any monitoring apps running in the background. And how are the failing machines different from the ones that aren't failing? (If there are machines that aren't randomly getting stuck.)

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12801 - Posted: 29 Mar 2006, 22:34:06 UTC - in response to Message 12796.  

[quote]Laurenu2: If I remember your description of your pharm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck WU rate around 10% for yesterday, well above the average failure rate. (The error rate seems high even if you've expanded to 80 machines.)
Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed, o/c or not, amount of RAM, OS version, BOINC version, any monitoring apps running in the background. And how are the failing machines different from the ones that aren't failing? (If there are machines that aren't randomly getting stuck.)[/quote]

I run about 70 nodes here at my home, with about 40 on Rosetta. Most of the 40 are AMD 2400 or so (ranging from 1800 to 2800) with 256MB or more memory. 29 of the 40 run XP Pro; the other 11 still have WinME but should be upgraded to XP within a week.
Now, the 1% stalls, I think, come mostly on the XP nodes. On the WinME nodes the clock just seems to stop; I understand Rosetta does not work well with ME, and that is why I am doing the upgrade.
I do not overclock at all. All, or 98%, of the 40 nodes do nothing but crunch Rosetta, with no other programs running on them at all.

I do not think it is a hardware issue; if it were, it would not be this widespread. So if it is not hardware, it must be the code in the software.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12803 - Posted: 29 Mar 2006, 22:43:24 UTC

Let me add one more thing: I run many other DC projects, none with a problem or failure rate like it is here at Rosetta. That alone tells me it is not a hardware issue.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 12812 - Posted: 30 Mar 2006, 4:00:08 UTC

The question was not whether your systems were stable enough to run DC projects (as I've seen your stats in other DC projects), but to try and find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when BOINC is in control of Rosetta (Rosetta alone crunches through that sticking point), and it seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set and are running through up to 480 WUs a day, having a few get caught might be the average failure rate.)

The more data about the machines with 1% failures we can give Rom, the more likely he'll be able to track down the intermittent problem. And when we help track it down and get it eliminated, it'll make life easier for everyone dealing with the problem.

In the meantime, is the problem showing up on your machines that have 512 Megs, or just on the ones with 256 Megs? Do you have BOINC set up as a service on the WinXP machines, or as a standard app?
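The two readings of the failure rate discussed above can be checked with a quick calculation. This is only a rough sketch using the figures quoted in this thread (4 aborted WUs in a day, about 40 machines, and the upper estimate of 480 WUs a day); the function and variable names are made up for illustration.

```python
def stuck_rate(stuck_count, total):
    """Fraction of `total` items (machines or work units) that got stuck."""
    if total <= 0:
        raise ValueError("total must be positive")
    return stuck_count / total


# Figures taken from the posts in this thread.
aborted_in_one_day = 4   # stuck WUs aborted in a single day
machines = 40            # approximate number of nodes running Rosetta
wus_per_day = 480        # upper estimate of WUs processed per day

# Reading 1: stuck WUs per machine per day.
per_machine = stuck_rate(aborted_in_one_day, machines)

# Reading 2: stuck WUs per work unit processed.
per_wu = stuck_rate(aborted_in_one_day, wus_per_day)

print(f"per machine per day: {per_machine:.1%}")
print(f"per work unit:       {per_wu:.1%}")
```

The same four stuck WUs look alarming when divided by the number of machines and fairly ordinary when divided by the number of work units processed, which is why the per-WU figure is probably the fairer one to compare against a project-wide average.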





Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 12814 - Posted: 30 Mar 2006, 4:55:07 UTC - in response to Message 12801.  

[quote]I run about 70 nodes here at my home, with about 40 on Rosetta. Most of the 40 are AMD 2400 or so (ranging from 1800 to 2800) with 256MB or more memory. 29 of the 40 run XP Pro; the other 11 still have WinME but should be upgraded to XP within a week.
Now, the 1% stalls, I think, come mostly on the XP nodes. On the WinME nodes the clock just seems to stop; I understand Rosetta does not work well with ME, and that is why I am doing the upgrade.
I do not overclock at all. All, or 98%, of the 40 nodes do nothing but crunch Rosetta, with no other programs running on them at all.

I do not think it is a hardware issue; if it were, it would not be this widespread. So if it is not hardware, it must be the code in the software.[/quote]


Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box and Linux's remote-control capabilities are very good.

With regard to my experience with Rosetta's 1% issue: in my almost 3 months with the project, I have so far had one (1) WU get stuck on one of my 2 P4s w/512MB RAM running WinXP Pro, but it was a "faulty" WU (it got stuck within 10 sec of starting to run on model #1, at the same step # every time).

Initially, in Jan 06, I had some problems (3-4 WUs) with Rosetta getting stuck on a Linux box which had just 256MB RAM and was running many (100+) other processes and 6 BOINC projects (all left in virtual memory while pre-empted). Since I reduced the number of BOINC projects to 4 (rosetta, ralph, simap, lhc), I have had no problems during the last 1.5 months.

All 3 PCs have Intel CPUs.

Obviously this sample of 3 PCs is not comparable with your 40 systems, but maybe there is a pattern?
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12815 - Posted: 30 Mar 2006, 5:12:59 UTC
Last modified: 30 Mar 2006, 5:14:05 UTC

Egad another 86:29 hours down the loo... This one stuck @86.0%....Nil movement in graphics mode. Please note:*

Result ID 15004859
Name HB_BARCODE_30_4ubpA_351_23915_0
Workunit 12180494
* Created 26 Mar 2006 4:33:09 UTC
Sent 26 Mar 2006 13:47:34 UTC
Received ---
Server state In Progress
Outcome Unknown
Client state New
Exit status 0 (0x0)
Computer ID 53940
Report deadline 9 Apr 2006 13:47:34 UTC


Join the Teddies@WCG

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12821 - Posted: 30 Mar 2006, 7:35:45 UTC - in response to Message 12812.  

[quote]The question was not whether your systems were stable enough to run DC projects (as I've seen your stats in other DC projects), but to try and find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when BOINC is in control of Rosetta (Rosetta alone crunches through that sticking point), and it seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set and are running through up to 480 WUs a day, having a few get caught might be the average failure rate.)

The more data about the machines with 1% failures we can give Rom, the more likely he'll be able to track down the intermittent problem. And when we help track it down and get it eliminated, it'll make life easier for everyone dealing with the problem.

In the meantime, is the problem showing up on your machines that have 512 Megs, or just on the ones with 256 Megs? Do you have BOINC set up as a service on the WinXP machines, or as a standard app?[/quote]

The stalls are not confined to any one PC or group of PCs, and they may not happen on the same PC twice.

Most work units are posted to finish in the 2 to 3 hr range. The PCs normally finish 25 to 35% faster than the est. time posted.

No, BOINC is not run as a service. I start the project I want to run at startup.

I am not sure whether the PCs with 512+ memory stall out.

I thought David and Rom had implemented data gathering to help weed out or find out what is causing this problem.

I am limited in time here, between working, running my company and taking care of my family. Just doing a check of all my nodes takes about 1 hr, so when I find a node that has stalled I just abort it and move on.

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12822 - Posted: 30 Mar 2006, 7:45:11 UTC - in response to Message 12814.  

[quote]Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box and Linux's remote-control capabilities are very good.[/quote]

I am sorry, I would find it hard to learn a new OS right now, and I have little time to format and install a new OS system-wide.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

Rich
Joined: 30 Nov 05
Posts: 5
Credit: 594,384
RAC: 0
Message 12826 - Posted: 30 Mar 2006, 9:45:02 UTC

WU aborted at 1.00%: https://boinc.bakerlab.org/rosetta/result.php?resultid=15048830. WU was HB_BARCODE_30_2ci2I_351_26295_0.

If I was to get any additional information in the percent quote or from the database update, I did not see it.

Take care and have a good day.
Rich Seyfert
Eatontown, NJ
SeyfertR@att.net

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12833 - Posted: 30 Mar 2006, 14:57:19 UTC

I think most of the problems reported in the last few posts were from work units created before the March 28 update; hopefully these older WUs will all get through the system in the next day or two.

Runaway1956
Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 12843 - Posted: 30 Mar 2006, 18:36:06 UTC

I saw this message last week for the first time and just aborted the WU. But I have seen it twice this morning:


3/30/2006 12:26:10 PM|rosetta@home|Started upload of FA_RLXb3_hom001_1b3aA_357_21_1_0
3/30/2006 12:26:16 PM|rosetta@home|Started upload of FA_RLXti_hom001_1tif__357_26_1_0
3/30/2006 12:27:39 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/17e/FA_RLXb3_hom001_1b3aA_357_21_1_0 98304 bytes != offset 0 bytes
3/30/2006 12:27:39 PM|rosetta@home|Temporarily failed upload of FA_RLXb3_hom001_1b3aA_357_21_1_0: transient upload error
3/30/2006 12:27:39 PM|rosetta@home|Backing off 2 hours, 39 minutes, and 22 seconds on upload of file FA_RLXb3_hom001_1b3aA_357_21_1_0
3/30/2006 12:28:18 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/3b9/FA_RLXti_hom001_1tif__357_26_1_0 141256 bytes != offset 0 bytes
3/30/2006 12:28:18 PM|rosetta@home|Temporarily failed upload of FA_RLXti_hom001_1tif__357_26_1_0: transient upload error
3/30/2006 12:28:18 PM|rosetta@home|Backing off 3 hours, 10 minutes, and 46 seconds on upload of file FA_RLXti_hom001_1tif__357_26_1_0



This is on the Opteron 144, machine identified as nunyabiz-s2pvzz
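
A minimal sketch of how lines like these could be picked out of a saved copy of the BOINC message log. It assumes the three-field `date|project|message` layout shown above and the exact wording of the "Temporarily failed upload" message; the function names are hypothetical, and nothing here is part of BOINC itself.

```python
import re
from datetime import datetime

# A few lines copied from the log above: timestamp|project|message.
SAMPLE_LOG = """\
3/30/2006 12:26:10 PM|rosetta@home|Started upload of FA_RLXb3_hom001_1b3aA_357_21_1_0
3/30/2006 12:27:39 PM|rosetta@home|Temporarily failed upload of FA_RLXb3_hom001_1b3aA_357_21_1_0: transient upload error
3/30/2006 12:28:18 PM|rosetta@home|Temporarily failed upload of FA_RLXti_hom001_1tif__357_26_1_0: transient upload error
"""

# Captures the file name and the reason given after the colon.
FAILED_UPLOAD = re.compile(r"Temporarily failed upload of (\S+): (.+)")


def parse_line(line):
    """Split one log line into (timestamp, project, message text)."""
    stamp, project, text = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, text


def failed_uploads(log_text):
    """Return (timestamp, project, file name, reason) for each failed upload."""
    results = []
    for line in log_text.splitlines():
        if line.count("|") < 2:
            continue  # skip blank or malformed lines
        when, project, text = parse_line(line)
        match = FAILED_UPLOAD.search(text)
        if match:
            results.append((when, project, match.group(1), match.group(2)))
    return results


for when, project, name, reason in failed_uploads(SAMPLE_LOG):
    print(when.isoformat(), project, name, reason)
```

Posting the file names and timestamps extracted this way, along with the machine name, should be enough for someone on the project side to match the failures against the upload server.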

Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 12857 - Posted: 31 Mar 2006, 0:13:49 UTC



CremionisD
Joined: 10 Mar 06
Posts: 9
Credit: 37,604,006
RAC: 0
Message 12874 - Posted: 31 Mar 2006, 13:11:33 UTC

Work unit aborted at 1.00%, CPU time used ~5:28:00

WU Name = "HB_BARCODE_30_1pgx__351_35027_0"

Application Rosetta 4.82, System CPU Pentium M 1600MHz, 1GB ram. Windows XP SP 2.

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 12890 - Posted: 31 Mar 2006, 19:51:04 UTC - in response to Message 12833.  

[quote]I think most of the problems reported in the last few posts were from work units created before the March 28 update; hopefully these older WUs will all get through the system in the next day or two.[/quote]

I think you are right, David. It has been 36 hrs and NO 1% stuck W/Us (*_*) THANK YOU David!! Is the data retrieval you added to your client / WU working to find out what is/was causing this bug?


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12891 - Posted: 31 Mar 2006, 20:04:35 UTC - in response to Message 12890.  

[quote][quote]I think most of the problems reported in the last few posts were from work units created before the March 28 update; hopefully these older WUs will all get through the system in the next day or two.[/quote]

I think you are right, David. It has been 36 hrs and NO 1% stuck W/Us (*_*) THANK YOU David!! Is the data retrieval you added to your client / WU working to find out what is/was causing this bug?[/quote]

That is great!! I'm particularly glad in your case because of all the computers you had to be watching over.

I had hoped to be reading reports of "WU stuck at 5.0733 %", which would have helped to locate the errors, but it is even better to see that the "stuck" work units problem seems to be much reduced.
Please spread the word!


Jon Kennedy
Joined: 1 Oct 05
Posts: 6
Credit: 418,027
RAC: 0
Message 12896 - Posted: 1 Apr 2006, 1:13:45 UTC

This WU was stuck at 1% for over 53 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11860123
random seed: 2232363
Stuck at model 1, step 22837
Claimed credit: 269.87
Graphic frozen. Should I abort all my 4.82 WUs, or just the ones named similarly to this one, or none?

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12897 - Posted: 1 Apr 2006, 1:34:47 UTC - in response to Message 12896.  

[quote]This WU was stuck at 1% for over 53 hours:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11860123
random seed: 2232363
Stuck at model 1, step 22837
Claimed credit: 269.87
Graphic frozen. Should I abort all my 4.82 WUs, or just the ones named similarly to this one, or none?[/quote]

If you are having problems with "stuck at 1%", please do abort pre-4.83 WUs. The 4.83 WUs seem to get stuck less often, and if/when they do get stuck, we will be able to trace the problem more easily.


pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 12926 - Posted: 2 Apr 2006, 0:04:36 UTC
Last modified: 2 Apr 2006, 0:08:35 UTC

I have a 'stuck' 4.83, wuid=11843998, cpid=163786.
Noticed that it was still running after 20+ hours of CPU time. Looked at the graphics and it was on 21.742 pct complete. Suspended the unit and BM (this guy is still running 5.2.13), closed down Windows and did a cold start. Brought BM back up and un-suspended the unit. CPU time went back to about 52 minutes, then started moving forward. Graphics looked OK, lots of movement. Now, after a couple of hours, it's stuck on 21.742 percent complete again, model 8, step 266356. Task Manager says it's pulling 100 pct of the CPU.

Edit: just noticed that someone else with a similar machine (pentium-m, 1.86) had already aborted this unit...interesting...

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12927 - Posted: 2 Apr 2006, 1:07:40 UTC - in response to Message 12926.  

[quote]I have a 'stuck' 4.83, wuid=11843998, cpid=163786.
Noticed that it was still running after 20+ hours of CPU time. Looked at the graphics and it was on 21.742 pct complete. Suspended the unit and BM (this guy is still running 5.2.13), closed down Windows and did a cold start. Brought BM back up and un-suspended the unit. CPU time went back to about 52 minutes, then started moving forward. Graphics looked OK, lots of movement. Now, after a couple of hours, it's stuck on 21.742 percent complete again, model 8, step 266356. Task Manager says it's pulling 100 pct of the CPU.

Edit: just noticed that someone else with a similar machine (pentium-m, 1.86) had already aborted this unit...interesting...[/quote]

Sorry about this, but your information will be very helpful in tracking down the problem. The ".742" tells us where the sticking is happening. Thanks, David
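
The symptom described here (CPU time keeps climbing while the percent complete stays at exactly the same value) can be written down as a simple check. The sketch below is only an illustration under stated assumptions: the readings are taken by hand or by some hypothetical monitoring script, and `find_stall` and the one-hour threshold are invented for this example rather than anything in BOINC or Rosetta.

```python
def find_stall(samples, min_cpu_seconds=3600.0):
    """Return the percent-complete value at which a work unit stalled.

    samples: (cpu_seconds, percent_complete) readings in time order.
    A stall is reported once the percent value has stayed identical while
    at least `min_cpu_seconds` of CPU time accumulated. Returns None if no
    stall is seen.
    """
    stall_percent = None
    stall_start_cpu = None
    for cpu, pct in samples:
        if stall_percent is None or pct != stall_percent:
            # Progress moved (or first reading): restart the window.
            stall_percent = pct
            stall_start_cpu = cpu
        elif cpu < stall_start_cpu:
            # CPU time went backwards, e.g. after a restart from a checkpoint.
            stall_start_cpu = cpu
        elif cpu - stall_start_cpu >= min_cpu_seconds:
            return stall_percent
    return None


# Readings loosely modelled on the report above (values are illustrative).
readings = [(0, 0.0), (1800, 10.5), (3600, 21.742), (5400, 21.742), (7200, 21.742)]
print(find_stall(readings))
```

Reporting the value at full precision (21.742 rather than 21.7) is the useful part, since the fractional digits are what indicate where the sticking happens.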


pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 12930 - Posted: 2 Apr 2006, 4:04:26 UTC

Not a problem, I suspended the WU again instead of aborting, so I could get on with some new work without losing it (in case you folks want something else from it).



©2025 University of Washington
https://www.bakerlab.org