21)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 14108)
Posted 19 Apr 2006 by Laurenu2 Post: [quote]But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. Letting bad WU's run on your systems or ours is a BAD THING.[/quote] Well David, it seems you could not or did not REMOVE the bad WU's. I and others are still getting them. I just found this one, TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0, that WASTED another 28 Hrs. This is not good. I am nearing the end of my patience with these BAD jobs and the THOUSANDS of Hrs of wasted work time that you will not give points for. David, I AM VERY UPSET ABOUT THIS.
22)
Message boards :
Number crunching :
Miscellaneous Work Unit Errors - II
(Message 13999)
Posted 18 Apr 2006 by Laurenu2 Post: Rhiju, did you ever think it might be motherboard related? I have had some motherboards that would not work on some projects. I say this because I have batches of the same model motherboard, and when the whole batch will not work, I would say it is M/B or driver related. Just a thought!!
23)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13997)
Posted 18 Apr 2006 by Laurenu2 Post: Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 Hrs; it did not self-abort as you said it would... Are you telling me this WU was not running for 100 to 150 Hrs as I thought, but instead was running for 1,000+ Hrs in an endless loop? (Grrrr)
24)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13982)
Posted 17 Apr 2006 by Laurenu2 Post: [quote]Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why.[/quote] I'm sorry, I can not; I looked for it but could not find it. Your system for tracking WU's is hard for me to use. It might work OK if I had only a few nodes on this project, but I have over 50 nodes working it, and jobs get lost across so many pages of WU's. It might help if you put in page numbers (1, 10, 20, 30, 40, 50) instead of just NEXT PAGE. But getting the points for lost jobs was not the point of my post; putting a PC into an endless LOOP was, to let you see a problem you (Rosetta) need to address. You have sent out bad WU's in the past and you will send out more new bad ones; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client side. Doing this would flush all the WU's on the client. I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or only a fraction of the points for the total Hrs spent.
25)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13932)
Posted 17 Apr 2006 by Laurenu2 Post: [quote]But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. Letting bad WU's run on your systems or ours is a BAD THING.[/quote] Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 Hrs. It did not self-abort as you said it would and send itself in; IT JUST RESTARTED, and that was most likely the 3rd time it restarted. That is 142 Hrs of wasted CPU time, and I will not get credit for it as you said we would, because your timeout did not work. This is why I said you must come up with a way to auto-abort on our clients: a script to do a project reset, or something. Expecting us to clean up after you send out bad WU's is not right, and you know it. You must come up with a better plan than doing nothing.
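The "project reset" flush requested above can be driven through BOINC's real command-line tool, `boinccmd`, which supports a per-project `reset` operation that discards that project's queued work. The sketch below is a minimal illustration, not an official Rosetta tool; it assumes `boinccmd` is installed on the node, and the Rosetta project URL shown is an assumption.

```python
# Hypothetical sketch: flush a project's queued work units on the local
# BOINC client via `boinccmd --project <url> reset`.
# Assumptions: boinccmd is on PATH; the Rosetta URL below is illustrative.
import subprocess

ROSETTA_URL = "http://boinc.bakerlab.org/rosetta/"  # assumed project URL

def reset_project_cmd(url=ROSETTA_URL):
    """Build (but do not run) the boinccmd invocation that resets one project."""
    return ["boinccmd", "--project", url, "reset"]

if __name__ == "__main__":
    # Actually ask the local client to reset; requires a running BOINC client.
    subprocess.run(reset_project_cmd(), check=True)
```

Returning the argument list from a helper, rather than calling `subprocess.run` directly, lets a farm operator log or dry-run the command across many nodes before executing it.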
26)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13798)
Posted 15 Apr 2006 by Laurenu2 Post: But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. Letting bad WU's run on your systems or ours is a BAD THING.
27)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13780)
Posted 14 Apr 2006 by Laurenu2 Post: The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points that are NOT granted for the wasted CPU time. If this project is going to keep letting out BAD WU's, Rosetta needs to find a way to purge these bad WU's from their servers when they are found to cause problems like these have, and/or send commands to the users' clients to delete or abort the bad WU's on any upload/download to the Rosetta servers. Keeping all the bad WU's in the system or on the Rosetta servers, and forcing us to run them to purge them from the Rosetta system, is unfair to us and does damage to the project's reputation. If this continues without relief, people will start to abandon this project.
28)
Message boards :
Number crunching :
Report stuck & aborted WU here please - II
(Message 13548)
Posted 12 Apr 2006 by Laurenu2 Post: ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH. Well, don't feel too bad, Jose; I seem to have to abort 60 to 100 Hrs of wasted CPU time every DAY. Just today I aborted 7 WU's STUCK at 1.04%, for a total of 80 Hrs. DAVID, what are you going to do about solving this problem??? Any end in sight? Babysitting your client consumes a lot of my time.
29)
Message boards :
Number crunching :
Miscellaneous Work Unit Errors
(Message 13286)
Posted 8 Apr 2006 by Laurenu2 Post: David, what is up with all the BAD WU's? I must have close to 500 "4/8/2006 3:22:39 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1hz6_426_5085_0 ( - exit code -1073741819 (0xc0000005))" messages on all of my nodes running Rosetta. You asked for my/our help here. Well, you must realize that if you continue to give out WU's that stall our PCs, or that can not complete, the DC'ers here WILL lose faith in this project and the quality of the data we produce. As I see it, you must do some more in-house testing before a new version number or WU batch. And as stated below, you should NOT do any releases at a time when you can not be at full staff to make fixes. This project costs me/us a lot of money to run, not counting my time, and I do expect a lot more than 1 week of good WU's to keep running. Please remove the bad WU's or version that is causing this so we/I can stop spinning our cooling fans.
30)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12890)
Posted 31 Mar 2006 by Laurenu2 Post: [quote]I think most of the problems reported in the last few posts were from work units created before the March 28 update--hopefully these older wu will all get through the system in the next day or two.[/quote] I think you are right, David. It has been 36 Hrs and NO 1%-stuck WU's (*_*). THANK YOU, David!! Is the data retrieval you added to your client/WU working to find out what is, or was, causing this bug?
31)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12822)
Posted 30 Mar 2006 by Laurenu2 Post: [quote]Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box, and Linux's remote-control capabilities are very good.[/quote] I am sorry; I would find it hard to learn a new OS right now, and I have little time to format and install a new OS system-wide.
32)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12821)
Posted 30 Mar 2006 by Laurenu2 Post: [quote]The question was not whether your systems were stable enough to run dc projects (as I've seen your stats in other dc projects)... but to try and find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when BOINC is in control of Rosetta (Rosetta alone crunches through that sticking point) - and seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set, and are running through up to 480 WUs a day, to have a few get caught might be the average failure rate.)[/quote] The stalls are not confined to any one PC or group of PCs, and they may not happen on the same PC twice. Most work units are posted to finish in the 2 to 3 Hr range, and my PCs normally finish 25 to 35% faster than the estimated time posted. No, BOINC is not run as a service; I start the project I want to run at startup. I am not sure whether the PCs with 512+ MB of memory stall out. I thought David and Ron had implemented data gathering to help weed out or find what is causing this problem. I am limited in time here, running my company and taking care of my family; just a check of all my nodes takes about 1 Hr, so when I find a node that has stalled, I just abort it and move on.
33)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12803)
Posted 29 Mar 2006 by Laurenu2 Post: Let me add one more thing: I run many other DC projects, none with a problem or failure rate like it is here at Rosetta. That alone tells me it is not a hardware issue.
34)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12801)
Posted 29 Mar 2006 by Laurenu2 Post: [quote]Laurenu2: If I remember your description of your farm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck-WU rate around 10% for yesterday, and well above the average failure rate. (The error rate seems high, even if you've expanded to 80 machines.) Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed/ o/c or not/ amount of RAM/ OS version, BOINC version, any monitoring apps running in the background. And how are the failing machines different from the ones that aren't failing? (If there are machines that aren't randomly getting stuck.)[/quote] I run about 70 nodes here at my home, with about 40 on Rosetta. Most of the 40 are AMD 2400 (+/- 1800 to 2800) with 256MB or more of memory. 29 of the 40 have XP Pro for the OS; the other 11 still have WinME, but should be upgraded to XP within a week. The 1% stalls, I think, come mostly from the XP nodes; on the WinME nodes the clock just seems to stop, and I understand Rosetta does not work well with ME, which is why I am doing the upgrade. I do not overclock at all. All, or 98%, of the 40 nodes do nothing but crunch Rosetta, with no other programs running on them at all. I do not think it is a hardware issue; if it were, it would not be this widespread. So if it is not hardware, it must be the code in the software.
35)
Message boards :
Number crunching :
Report stuck & aborted WU here please
(Message 12789)
Posted 29 Mar 2006 by Laurenu2 Post: [quote]that is not good. with the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is.[/quote] Yes, on the stuck units, if you restart BOINC it resets the timer to 0. I aborted another 4 WU's today; that brings the total to 9 since Sunday. Sorry, I am not much good at gathering info; I just hope the returned WU's will help give you the info you need to stop this BUG.
36)
Message boards :
Number crunching :
Improvements to Rosetta@home based on user feedback
(Message 12750)
Posted 28 Mar 2006 by Laurenu2 Post: [quote]would it be possible for someone to write in a script that would automatically abort a wu if it's stuck at 1% for more than say an hour? for those of us running farms this would be a godsend until the 1% stall problem is solved.[/quote] Any word back yet? There is a lot of CPU time being wasted every day by this.
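The watchdog script requested above could, in principle, be built on `boinccmd`, whose real interface includes `--get_tasks` (to list running tasks with their progress) and `--task <url> <name> abort`. The sketch below is a minimal illustration under stated assumptions: the thresholds, the project URL, and the task-record format are all hypothetical, and parsing the `--get_tasks` output is left to the caller.

```python
# Hypothetical watchdog sketch: find BOINC tasks stuck near 1% progress
# despite long run times, and ask the local client to abort them.
# Assumptions: boinccmd is on PATH; the Rosetta URL and thresholds are
# illustrative; task records are dicts parsed from `boinccmd --get_tasks`.
import subprocess

ROSETTA_URL = "http://boinc.bakerlab.org/rosetta/"  # assumed project URL
STUCK_FRACTION = 0.02   # "stuck at 1%" -> progress still below 2%
STUCK_SECONDS = 3600.0  # running longer than one hour

def find_stuck(tasks, max_fraction=STUCK_FRACTION, min_elapsed=STUCK_SECONDS):
    """Return names of tasks whose fraction done is below max_fraction even
    after more than min_elapsed seconds of CPU time.
    Each task is a dict: {"name": str, "fraction_done": float, "elapsed": float}."""
    return [t["name"] for t in tasks
            if t["fraction_done"] < max_fraction and t["elapsed"] > min_elapsed]

def abort_task(name, project_url=ROSETTA_URL):
    """Ask the local BOINC client to abort one task via boinccmd."""
    subprocess.run(["boinccmd", "--task", project_url, name, "abort"], check=True)
```

A farm operator could run `boinccmd --get_tasks` on a schedule, parse each task's name, fraction done, and elapsed time into the dict shape above, and pass the result through `find_stuck` before calling `abort_task`; keeping the detection logic pure makes it easy to dry-run on many nodes first.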
37)
Message boards :
Number crunching :
Help us solve the 1% bug!
(Message 12748)
Posted 28 Mar 2006 by Laurenu2 Post: [quote]All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this[/quote] Not so. Today I aborted 3 that were at 1% for 28 to 38 Hrs. Your self-abort is NOT working. I hope it at least sends you back data as to why it did not abort and why it got stuck.
38)
Message boards :
Number crunching :
Help us solve the 1% bug!
(Message 12241)
Posted 19 Mar 2006 by Laurenu2 Post: [quote]This thread is approaching 100 posts - Is the project ANY closer to solving the 1% bug?[/quote] The WU's I aborted were at a minimum 11 Hrs, and that was only by luck; the others were at about 30, 55, 77, 85, and 114 Hrs. I see no reason why you would want a WU to work past 30 Hrs when it should take 2 Hrs; I could have done 50 WU's in the time it took me to abort that one 114 Hr WU. It seems you are having problems fixing the 1% problem, and that's OK, BUT you have to give us some kind of temporary fix: a time limit, a top end, something to stop it from wasting computer time that can run into the hundreds of Hrs. As for restarting the WU, I myself have lost faith in that WU, and I really do not want to rerun it or WASTE any more time on it. I do feel sorry that Rosetta is having troubles with this, but Rosetta also should feel sorry that we crunchers have to pay for the troubles.
39)
Message boards :
Number crunching :
Help us solve the 1% bug!
(Message 12199)
Posted 18 Mar 2006 by Laurenu2 Post: [quote]This thread is approaching 100 posts - Is the project ANY closer to solving the 1% bug?[/quote] Yes, this is what I would like to know: "ANY closer to solving the 1% bug?" I have had to abort about 10 WU's stuck at 1%, for a loss of about 300 Hrs of computer time, in just the past week. Maybe an auto self-abort if a WU goes past 3 times the limit? People like me sometimes can not check up on all the nodes every day, and to let a WU run for 114 Hrs is just a waste of time and money. I do not work in IT, and I pay the total cost of running DC.
40)
Message boards :
Number crunching :
Report stuck work units here
(Message 7225)
Posted 22 Dec 2005 by Laurenu2 Post: Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away. Maybe you should take a look at your style of help. When people come here looking for help, or just expressing what they see as a problem, they may not express themselves in a clear or to-the-point manner. If this is a hard thing for you to handle, perhaps you should stop giving help. I did not come here to get insulted, or to be made a fool of by you, or to do damage to this project; just to express things that I am having a problem with.
©2024 University of Washington
https://www.bakerlab.org