Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
TA_GeoffS Joined: 16 Dec 05 Posts: 2 Credit: 704,640 RAC: 0 |
|
Rich Joined: 30 Nov 05 Posts: 5 Credit: 594,384 RAC: 0 |
Workunit: FA_RLXub_hom008_4ubpA_362_450_0 stuck at 1% for 20 hrs. URL: https://boinc.bakerlab.org/rosetta/result.php?resultid=14787271. Rich Seyfert Eatontown, NJ SeyfertR@att.net |
[AF>Libristes>Jip] otax Joined: 25 Sep 05 Posts: 1 Credit: 312,969 RAC: 0 |
Hello, this is my list of WU client errors: FA_RLX56_hom007_256bA_362_202, FA_RLXch_hom015_2chf__362_223, FA_RLXwi_hom026_1wit__362_411, FA_RLXac_hom021_2acy__362_430, FA_RLXch_hom017_2chf__362_264, FA_RLXci_hom024_2ci2I_362_380, FA_RLXpt_hom006_1ptq__361_347, FA_RLXpt_hom002_1ptq__361_380, FA_RLXwh_hom024_1who__362_476, FA_RLXwh_hom017_1who__362_476. For a total of about 60 hours (on 3 PCs in 2 days). Otax. |
Brf Joined: 17 Jan 06 Posts: 1 Credit: 901,500 RAC: 0 |
I have: FA_RLXai_hom028_1aiu_359_210_0 stuck at 46.06%. If I close BOINC or reboot, it starts up again, the CPU time resets to 55 minutes, and it runs until the CPU time is at 57 minutes and 57 seconds, where it gets stuck at Model 2, Step 21273. The CPU time continues counting up, but will rewind to 55 minutes if I restart BOINC. |
John Perko Joined: 1 Jan 06 Posts: 3 Credit: 604,568 RAC: 0 |
3/28/2006 4:17:39 PM|rosetta@home|Starting result HB_BARCODE_30_2chf__351_32846_0 using rosetta version 482 The above WU was running for 35 minutes (out of a total time of 2:35). At that point, I turned on the graphic and saw that it was stuck at 1%. A second later it jumped to 29.5% and started filling up the graphs in the graphic box, which were previously empty. |
TCU Computer Science Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
The following were aborted today. All were stuck at 1.00% after running for 20+ hours: ID=12326404 name = HB_BARCODE_30_1c8cA_351_32403; ID=12261321 name = HB_BARCODE_30_256bA_351_28680; ID=12034212 name = HB_BARCODE_30_1bk2__351_16205; ID=11076727 name = FA_RLXb3_hom001_1b3aA_359_347; ID=11972587 name = FA_RLXb3_hom010_2chf__362_384; ID=11761822 name = FA_RLXur_hom004_1urnA_362_308 |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
[quote]The following were aborted today. All were stuck at 1.00% after running for 20+ hours[/quote] That is not good. With the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is. |
RC Joined: 27 Sep 05 Posts: 13 Credit: 262,048 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=12388765 I suspended this unit after it remained at 1% for almost 4 hours. After suspending BOINC I tried running rosetta standalone for a while; it went to 17% within 15 minutes. When I restarted BOINC and resumed processing on this unit, it reset itself to zero, so I aborted it. |
Grutte Pier [Wa Oars]~Nemesis Joined: 8 Nov 05 Posts: 3 Credit: 386,730 RAC: 0 |
After a bogus WU on one of my PCs that cost me over 300 credits (it was hanging for a long time), I went through all of my WUs. This is a list of all my recent WUs that were aborted with an error: I'm wondering if the claimed credits will be awarded for these bogus WUs? |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
[quote]that is not good. with the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is.[/quote] Yes, on the stuck units, if you restart BOINC it resets the timer to 0. I aborted another 4 WUs today, which brings the total to 9 since Sunday. Sorry, I am not much good at gathering info. I just hope the returned WUs will help give you the info you need to stop this bug. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
[quote]that is not good. with the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is.[/quote] Laurenu2: If I remember your description of your pharm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck WU rate around 10% for yesterday, well above the average failure rate. (The error rate seems high, even if you've expanded to 80 machines.) Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed, o/c or not, amount of RAM, OS version, BOINC version, any monitoring apps running in the background. And how are the failing machines different from the ones that aren't failing? (If there are machines that aren't randomly getting stuck.) |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
[quote]Laurenu2: If I remember your description of your pharm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck WU rate around 10% for yesterday, and well above the average failure rate. (The error rate seems high, even if you've expanded to 80 machines.) Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed/ o/c or not/ amount of ram/ OS version, Boinc version, any monitoring apps running in the background. And how are the failing machines different than the ones that aren't failing? (If there's machines that aren't randomly getting stuck.)[/quote] I run about 70 nodes here at my home, with about 40 on Rosetta. Most of the 40 are around AMD 2400 (ranging from 1800 to 2800) with 256MB or more memory. 29 of the 40 have XP Pro for the OS; the other 11 still have WinME but should be upgraded to XP within a week. I think the 1% stall comes mostly on the XP nodes. On the WinME nodes the clock just seems to stop, and I understand Rosetta does not work well with ME, which is why I am doing the upgrade. I do not overclock at all. All, or 98%, of the 40 nodes do nothing but crunch Rosetta, with no other programs running on them at all. I do not think it is a hardware issue; if it were, it would not be this widespread. So if it is not hardware, it must be the code in the software. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
Let me add one more thing: I run many other DC projects, none with a problem or failure rate like the one here at Rosetta. That alone tells me it is not a hardware issue. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
The question was not whether your systems were stable enough to run DC projects (I've seen your stats in other DC projects), but to try to find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when BOINC is in control of Rosetta (Rosetta alone crunches through that sticking point), and it seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set, and are running through up to 480 WUs a day, to have a few get caught might be the average failure rate.) The more data about the machines with 1% failures we can give Rom, the more likely he'll be able to track down the intermittent problem. And when we help track it down and get it eliminated, it'll make life easier for everyone dealing with the problem. In the meantime, is the problem showing up on your machines that have 512 Megs, or just on ones with 256 Megs? Do you have BOINC set up as a service on the WinXP machines, or as a standard app? |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
[quote]I run about 70 nodes here at my home I have about 40 on Rosetta most of the 40 are AMD 2400 +/- 1800 to 2800 with 256MB or more memory, 29 of the 40 have XP pro for the OS the other 11 still have WinME but should be upgraded to XP with in a week[/quote] Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box, and Linux's remote-control capabilities are very good. With regard to my experience with Rosetta's 1% issue, in my almost 3 months with the project, I have had so far one (1) WU get stuck on one of my 2 P4s w/512MB RAM running WinXP Pro, but it was a "faulty" WU (it got stuck within 10 sec of starting on Model #1, at the same step # every time). Initially, in Jan06, I had some problems (3-4 WUs) with Rosetta getting stuck on a Linux box, which had just 256MB RAM and was running many (100+) other processes and 6 BOINC projects (all left in virtual memory while pre-empted). Since I reduced the number of BOINC projects to 4 (rosetta, ralph, simap, lhc), I have had no problems during the last 1.5 months. All 3 PCs have Intel CPUs. Obviously this sample of 3 PCs is not comparable with your 40 systems, but maybe there is a pattern? Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Nite Owl Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
Egad another 86:29 hours down the loo... This one stuck @86.0%....Nil movement in graphics mode. Please note:* Result ID 15004859 Name HB_BARCODE_30_4ubpA_351_23915_0 Workunit 12180494 * Created 26 Mar 2006 4:33:09 UTC Sent 26 Mar 2006 13:47:34 UTC Received --- Server state In Progress Outcome Unknown Client state New Exit status 0 (0x0) Computer ID 53940 Report deadline 9 Apr 2006 13:47:34 UTC Join the Teddies@WCG |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
[quote]The question was not whether your systems were stable enough to run dc projects (as I've seen your stats in other dc projects).. but to try and find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when Boinc is in control of Rosetta (Rosetta alone crunches through that sticking point) - and seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set, and are running through up to 480 WUs a day, to have a few get caught might be the average failure rate..)[/quote] The stalls are not confined to any one PC or group of PCs, and they may not happen on the same PC twice. Most work units are posted to finish in the 2 to 3 hr range. The PCs normally finish 25 to 35% faster than the estimated time posted. No, BOINC is not run as a service; I start the project I want to run at startup. I am not sure whether the PCs with 512+ MB of memory stall out. I thought David and Rom had implemented data gathering to help find out what is causing this problem. I am limited in time here, working, running my company, and taking care of my family. Just checking all my nodes takes about 1 hr, so when I find a node that has stalled I just abort it and move on. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
[quote]Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box and Linux's remote-control capabilities are very good.[/quote] I am sorry, I would find it hard to learn a new OS right now, and I have little time to format and install a new OS system-wide. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Rich Joined: 30 Nov 05 Posts: 5 Credit: 594,384 RAC: 0 |
WU aborted at 1.00%: https://boinc.bakerlab.org/rosetta/result.php?resultid=15048830. WU was HB_BARCODE_30_2ci2I_351_26295_0. If I was to get any additional information in the percent quote or from the database update, I did not see it. Take care and have a good day. Rich Seyfert Eatontown, NJ SeyfertR@att.net |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I think most of the problems reported in the last few posts were from work units created before the March 28 update--hopefully these older wu will all get through the system in the next day or two. |
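
For anyone wanting to check the arithmetic behind the failure-rate discussion above, here is a rough sketch in Python. The inputs are assumptions pieced together from the posts, not project data: about 40 machines on Rosetta, 4 stuck WUs aborted in one day, and a hypothetical 2-hour run time per WU (which is one way to arrive at the "up to 480 WUs a day" figure). The function name `stuck_rate` is made up for illustration.

```python
def stuck_rate(stuck: int, total: int) -> float:
    """Return stuck items as a percentage of the total."""
    return 100.0 * stuck / total


# Assumed inputs, taken loosely from the posts in this thread.
machines = 40        # nodes crunching Rosetta
stuck_today = 4      # WUs aborted in one day
hours_per_wu = 2     # hypothetical low run-time setting

# Each machine finishes 24 / 2 = 12 WUs per day, so the farm does 480.
wus_per_day = machines * (24 // hours_per_wu)

per_machine = stuck_rate(stuck_today, machines)    # stuck WUs per machine
per_wu = stuck_rate(stuck_today, wus_per_day)      # stuck WUs per WU processed

print(f"WUs per day: {wus_per_day}")
print(f"Rate per machine: {per_machine:.1f}%")
print(f"Rate per WU: {per_wu:.2f}%")
```

Counted per machine the rate looks like 10%, but counted per work unit processed it comes out under 1%, which is the point raised about a farm with a low max time set.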
©2024 University of Washington
https://www.bakerlab.org