Posts by Laurenu2

21) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 14108)
Posted 19 Apr 2006 by Laurenu2
Post:
But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your systems or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.


Well David, it seems you can not or did not REMOVE the bad WU's; I and others are still getting them. I just found this one, TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0, that WASTED another 28 Hrs. This is not good. I am nearing the end of my patience with these BAD jobs and the THOUSANDS+ of Hrs of wasted work time that you will not give points for.
David, I AM VERY UPSET ABOUT THIS.
22) Message boards : Number crunching : Miscellaneous Work Unit Errors - II (Message 13999)
Posted 18 Apr 2006 by Laurenu2
Post:
Rhiju
Did you ever think it might be Motherboard related?
I have had some motherboards that would not work in some projects. I say this because I have batches of the same model motherboard, and when that whole batch will not work, I would say it is M/B or driver related.

Just a thought!!
23) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13997)
Posted 18 Apr 2006 by Laurenu2
Post:
Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 Hrs; it did not self-abort as you said it would ...


The FA_* WUs are old ones from mid-March so they don't have the timeout enabled. As far as I know, these FA_* WUs were never cancelled, so they pop up now and then and get sent out again until 4 people have rejected them.


Are you telling me this WU was not running for the 100 to 150 Hrs I thought it was, but instead was running for 1,000+ Hrs in an endless Loop? (Grrrr)
24) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13982)
Posted 17 Apr 2006 by Laurenu2
Post:
Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why.



I'm sorry, I can not; I looked for it but could not find it. Your system for tracking WU's is hard for me to use. It might work OK if I had only a few nodes working this, but I have over 50 nodes on this project, and jobs get lost with so many pages of WU's. It might help if you put in page numbers (1 10 20 30 40 50) instead of just NEXT PAGE.

But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP, that was the point of my post: to let you see a problem you (Rosetta) need to address. You have sent out bad WU's in the past, and you will send out more new bad ones; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client side. Doing this would flush all the WU's on the client.
I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just get a fraction of the points for the total Hrs spent.
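
A flush like this can in fact be scripted against BOINC's command-line tool. Below is a minimal sketch, not anything the project shipped, assuming boinccmd (boinc_cmd in older releases) is installed and each node allows remote GUI RPCs; the hostnames and password are placeholders:

    #!/usr/bin/env python3
    # Sketch of a client-side "flush": a BOINC project reset discards all
    # of that project's work units on the client. boinccmd exposes this
    # as "--project <URL> reset". Node names and password are made up.
    import subprocess

    PROJECT_URL = "http://boinc.bakerlab.org/rosetta/"
    NODES = ["node01", "node02", "node03"]  # hypothetical farm hostnames

    for host in NODES:
        # Remote control requires the client to allow GUI RPCs from this
        # machine (remote_hosts.cfg) plus a matching RPC password.
        subprocess.run(
            ["boinccmd", "--host", host, "--passwd", "changeme",
             "--project", PROJECT_URL, "reset"],
            check=True,
        )

The trade-off is exactly the one described above: a reset throws away any work in progress, so a few points are lost in exchange for never looping for hundreds of hours.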
25) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13932)
Posted 17 Apr 2006 by Laurenu2
Post:
But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your systems or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.

Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 Hrs. It did not self-abort as you said it would and send itself in; IT JUST RESTARTED, and it most likely was the 3rd time it restarted. That is 142 Hrs of wasted CPU time.
And I will not get credit for it as you said we would, because your timeout did not work.
This is why I said you must come up with a way to auto-abort on our clients.
A script to do a project reset or something.
To expect us to clean up after you send out bad W/Us is not right and you know it. You must come up with a better plan than doing nothing.
26) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13798)
Posted 15 Apr 2006 by Laurenu2
Post:
But what about the removal of bad WU's from your servers? You must set up a way to stop resending the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your systems or ours is a BAD THING.
27) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13780)
Posted 14 Apr 2006 by Laurenu2
Post:
The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points in wasted CPU time that are NOT granted.
If this project is going to keep letting out BAD WU's,
Rosetta needs to find a way to purge these bad WU's from their servers when they are found to cause problems like these have. And/or send commands to the user's client to delete or abort the bad WU's on any upload/download to the Rosetta servers.
To keep all the bad WU's in the system or on the Rosetta servers, forcing us to run them to purge them from the Rosetta system, is unfair to us and does damage to the project's reputation.
If this continues without relief, people will start to abandon this project.
28) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13548)
Posted 12 Apr 2006 by Laurenu2
Post:
ARGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHH

Another failed project 17027763 12417632 11 Apr 2006 20:35:20 UTC 12 Apr 2006 12:28:07 UTC Over Client error Computing 44,064.20 136.62

This makes at least 5 projects with crashes and more than 5 CPU days wasted in total.

What the hell is happening? To say I am frustrated is an understatement.


Well, don't feel too bad, Jose. I seem to have to abort 60 to 100 Hrs of wasted CPU time every DAY. Just today I aborted 7 WU's STUCK at 1.04%, for a total of 80 Hrs.

DAVID, what are you going to do about solving this problem??? Any end in sight?
Babysitting your client does consume a lot of my time.

29) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 13286)
Posted 8 Apr 2006 by Laurenu2
Post:
David, what is up with all the BAD W/U's? I must have close to 500
4/8/2006 3:22:39 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1hz6_426_5085_0 ( - exit code -1073741819 (0xc0000005))
messages on all of my nodes running Rosetta.
You asked for my/our help here. Well, you must realize that if you continue to give out W/U's that stall our PC's, or that can not complete, the DC'ers here WILL lose faith in this project and in the quality of the data we produce.
As I see it, you must do some more in-house testing before a new Ver.# or WU batch.
And as stated below, you should NOT do any releases at a time when you can not be at full staff to make fixes.
This project costs me/us a lot of money to run, not counting my time. And I do expect a lot more than 1 week of good W/U's to keep running.
Please remove the bad W/U's or Ver.# that is causing this so we/I can stop spinning our cooling fans.
30) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12890)
Posted 31 Mar 2006 by Laurenu2
Post:
I think most of the problems reported in the last few posts were from work units created before the March 28 update--hopefully these older wu will all get through the system in the next day or two.


I think you are right, David. It has been 36 Hrs and NO 1%-stuck W/Us (*_*) THANK YOU David!! Is the data retrieval you added to your client / WU working to find out what is/was causing this bug?

31) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12822)
Posted 30 Mar 2006 by Laurenu2
Post:
Lauren, since 35+ of your nodes are "crunching boxes", i.e. dedicated to work for projects like Rosetta, have you ever considered running Linux instead of WinXX (XX=XP, 2K, ME etc) on them? Linux consumes less RAM than WinXX for a minimal system. You don't need the GUI anyway for such a box and Linux's remote-control capabilities are very good.

I am sorry, I would find it hard to learn a new OS right now, and I have little time to format and install a new OS system-wide.
32) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12821)
Posted 30 Mar 2006 by Laurenu2
Post:
The question was not whether your systems were stable enough to run DC projects (as I've seen your stats in other DC projects), but to try and find out what's different about your hardware/software configuration that makes it more susceptible to the 1% bug than average. It's a problem that only shows up when BOINC is in control of Rosetta (Rosetta alone crunches through that sticking point), and it seems to be showing up more often on certain hardware. (Come to think of it, if you have a low max time set, and are running through up to 480 WUs a day, to have a few get caught might be the average failure rate.)

The more data about the machines with 1% failures we can give Rom, the more likely he'll be able to track down the intermittent problem. And when we help track it down and get it eliminated, it'll make life easier for everyone dealing with the problem.

In the meantime, is the problem showing up on your machines that have 512 MB, or just on the ones with 256 MB? Do you have BOINC set up as a service on the WinXP machines, or as a standard app?


The stalls are not confined to any one PC or group of PC's, and they may not happen on the same PC twice.

Most work units are posted to finish in the 2 to 3 Hr range. The PC's normally finish 25 to 35% faster than the Est. time posted.

No, BOINC is not run as a service; I start the project I want to run at startup.

Not sure whether the PC's with 512+ MB of memory stall out.

I thought David and Rom had implemented data gathering to help weed out or find what is causing this problem.

I am limited in time here, running my company and taking care of my family. Just to do a check of all my nodes takes about 1 Hr, so when I find a node that has stalled I just abort it and move on.
33) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12803)
Posted 29 Mar 2006 by Laurenu2
Post:
Let me add one more thing: I run many other DC projects, none with a problem or failure rate like it is here at Rosetta. That alone tells me it is not a hardware issue.
34) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12801)
Posted 29 Mar 2006 by Laurenu2
Post:
Laurenu2: If I remember your description of your farm from the Dutch Mad Cow Invasion at FaD, you had about 40 systems. That would make your stuck WU rate around 10% for yesterday, and well above the average failure rate. (The error rate seems high, even if you've expanded to 80 machines.)
Would you mind describing the hardware and OS configurations of the machines that are failing? Processor/speed, o/c or not, amount of RAM, OS version, BOINC version, any monitoring apps running in the background. And how are the failing machines different from the ones that aren't failing? (If there are machines that aren't randomly getting stuck.)

I run about 70 nodes here at my home, with about 40 on Rosetta. Most of the 40 are AMD 2400 +/- (1800 to 2800) with 256 MB or more memory; 29 of the 40 have XP Pro for the OS, and the other 11 still have WinME but should be upgraded to XP within a week.
Now, the 1% stalls I think come mostly on the XP nodes. On the WinME ones the clock just seems to stop, and I understand Rosetta does not work well with ME; that is why I am doing the upgrade.
I do not overclock at all. All, or 98%, of the 40 nodes do nothing but crunch Rosetta, with no other programs running on them at all.

I do not think it is a hardware issue; if it was, it would not be this widespread. So if it is not hardware, it must be the code in the software.

35) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12789)
Posted 29 Mar 2006 by Laurenu2
Post:
That is not good. With the jobs currently released, this problem should be greatly reduced, and from the "percent complete" we will be able to tell where the problem is.

Yes, on the stuck units, if you restart BOINC it resets the timer to 0.
I aborted another 4 W/Us today; that brings the total to 9 since Sunday.
Sorry, I am not much good at gathering info. I just hope the returned W/Us will help give you the info you need to stop this BUG.
36) Message boards : Number crunching : Improvements to Rosetta@home based on user feedback (Message 12750)
Posted 28 Mar 2006 by Laurenu2
Post:
Would it be possible for someone to write a script that would automatically abort a WU if it's stuck at 1% for more than, say, an hour? For those of us running farms this would be a godsend until the 1% stall problem is solved.




I just yesterday emailed David Anderson asking whether such a feature could be incorporated into BOINC. I haven't heard back yet. If someone can figure out how to do this outside of BOINC it would be great.

Any word back yet? There is a lot of CPU time being wasted every day by this.
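
No such auto-abort made it into the client at the time, but a farm operator could approximate one outside of BOINC with its command-line tool. A minimal sketch, assuming boinccmd is on the PATH and that --get_tasks prints the name, fraction done, and current CPU time fields of recent clients (the exact field names are an assumption and may vary by version):

    #!/usr/bin/env python3
    # Watchdog sketch: abort any Rosetta task still reporting ~1%
    # progress after more than an hour of CPU time. Field names are
    # assumed from recent boinccmd --get_tasks output.
    import re
    import subprocess

    PROJECT_URL = "http://boinc.bakerlab.org/rosetta/"
    STUCK_FRACTION = 0.011  # "stuck at 1%" as the client reports it
    MAX_CPU_SECONDS = 3600  # an hour with essentially no progress

    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout

    # --get_tasks prints one indented block per task, separated by
    # lines like "1) -----------"; inspect each block independently.
    for block in re.split(r"\n\d+\) -+", out):
        name = re.search(r"^\s*name: (\S+)", block, re.M)
        frac = re.search(r"fraction done: ([\d.]+)", block)
        cpu = re.search(r"current CPU time: ([\d.]+)", block)
        if not (name and frac and cpu):
            continue  # task not active, or fields differ in this client
        if (float(frac.group(1)) <= STUCK_FRACTION
                and float(cpu.group(1)) > MAX_CPU_SECONDS):
            print("aborting stuck task", name.group(1))
            subprocess.run(["boinccmd", "--task", PROJECT_URL,
                            name.group(1), "abort"], check=True)

Run from cron (or the Windows task scheduler) every few minutes; an aborted result is reported back as a client error, so the server can reissue the work unit to another host.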
37) Message boards : Number crunching : Help us solve the 1% bug! (Message 12748)
Posted 28 Mar 2006 by Laurenu2
Post:
All work units sent out since Friday have a maximum time limit of roughly 24 hours, so no computers should be getting stuck much longer than this

Not so. Today I just aborted 3 that were at 1% for 28 to 38 Hrs. Your self-abort is not working. I hope it at least sends you back data as to why it did not abort and why it got stuck.
38) Message boards : Number crunching : Help us solve the 1% bug! (Message 12241)
Posted 19 Mar 2006 by Laurenu2
Post:
This thread is approaching 100 posts - Is the project ANY closer to solving the 1% bug?


Yes, this is what I would like to know: "ANY closer to solving the 1% bug?"
I have had to abort about 10 WU's stuck at 1%, for a loss of about 300 Hrs of computer time in just the past week.
Maybe an auto self-abort if it goes past 3 times the limit.
People like me sometimes can not check up on all the nodes every day, and to let a WU run for 114 Hrs is just a waste of time and money. I do not work in IT, and I pay for the total cost to run DC.



In most cases an automatic abort feature causes more problems than it solves. The Max time errors were caused by an attempt at automatic aborts. But more often than not, restarting the WU will work to "unstick" it.

While this sticking problem is a bigger issue for unattended systems, I am seeing a lot of people on this thread aborting WUs in less than 1/2 hour of run time. Very few of the WUs will get to more than the 1% stage in under a half hour. There are some that will, but the current batch is not among those.

So if you are aborting in under a half hour, especially if you are not checking the screen saver to see if the WU is stepping, you are making your problem worse. If the WU is stepping (even slowly) it is not stuck. If there is no activity on the screen saver except for the clock, then it may be stuck, and then it is appropriate to take some action. But the first choice should be a restart of the WU. In most cases rebooting the system is not required, only stopping and starting BOINC.

But remember, there are times in the normal process where the time between steps may become significant. I have seen this interval exceed 20 seconds in some cases. The slower the system, the longer the interval. So examine the graphic display carefully for activity.

The RALPH project is testing a possible solution for this issue right now, so help is on the way as Dr. Baker said in his post below.


The WU's I aborted were at a minimum 11 Hrs, and that was only by luck; the others were about 30, 55, 77, 85, and 114 Hrs.
I see no reason why you would want a WU to work past 30 Hrs when it should be 2 Hrs. I could have done 50 WU's in the time it took me to abort that one 114 Hr WU.
It seems you are having problems fixing the 1% problem, and that's OK, BUT you have to give us some kind of temporary fix: a time limit, a top end, something to stop it from wasting computer time that can go into the hundreds of Hrs.
As for restarting the WU, I myself have lost faith in that WU, and I really do not want to rerun it or WASTE any more time with it.
I do feel sorry Rosetta is having troubles with this. But Rosetta also should feel sorry that we crunchers have to pay for the troubles.
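
The restart-first advice quoted above can itself be scripted. A minimal sketch, assuming a node where the client runs as a plain application rather than a service and the boinc and boinccmd binaries are on the PATH:

    #!/usr/bin/env python3
    # Sketch of the "stop and start BOINC" un-sticking step: ask the
    # running client to exit cleanly, then relaunch it. The task then
    # resumes from its last checkpoint, which per the moderator's note
    # is often enough to get it stepping again.
    import subprocess
    import time

    subprocess.run(["boinccmd", "--quit"], check=True)  # clean shutdown
    time.sleep(15)  # allow time to checkpoint and release files
    subprocess.Popen(["boinc"])  # relaunch the core client, detached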
39) Message boards : Number crunching : Help us solve the 1% bug! (Message 12199)
Posted 18 Mar 2006 by Laurenu2
Post:
This thread is approaching 100 posts - Is the project ANY closer to solving the 1% bug?


Yes, this is what I would like to know: "ANY closer to solving the 1% bug?"
I have had to abort about 10 WU's stuck at 1%, for a loss of about 300 Hrs of computer time in just the past week.
Maybe an auto self-abort if it goes past 3 times the limit.
People like me sometimes can not check up on all the nodes every day, and to let a WU run for 114 Hrs is just a waste of time and money. I do not work in IT, and I pay for the total cost to run DC.
40) Message boards : Number crunching : Report stuck work units here (Message 7225)
Posted 22 Dec 2005 by Laurenu2
Post:
Well, a reply like this one, accusing me of just doing it for the points, will do NOTHING but push me away.


First, realize that _I_ am not "project staff" - I'm a volunteer participant just like you are. That tag to the left under my name says "forum moderator", not "project" anything. However, I volunteer my time to help people who have a problem and ask for help on these boards.

Well, maybe you should take a look at your style of help.
When people come here looking for help, or just expressing what they see as a problem, they may not express themselves in a clear or to-the-point manner.
If this is a hard thing for you to handle, perhaps you should stop giving help.
I did not come here to get insulted or to be made a fool of by you, or to do damage to this project, just to express things that I am having a problem with.








