1% for 37 hours

Message boards : Number crunching : 1% for 37 hours

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 2724 - Posted: 9 Nov 2005, 15:32:08 UTC

Thanks for that, will give it a try the next time I see one.

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 2817 - Posted: 10 Nov 2005, 17:18:39 UTC

We have been 1% free for weeks now but in the last day we have had two....has something changed?? Or is it just a random thing as this thread suggests? We also lost about 20 hours total when we could have been doing useful work. Any closer to finding the cause, I wonder??

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 2838 - Posted: 10 Nov 2005, 20:26:18 UTC

It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix.

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 12,673,776
RAC: 54,546
Message 2843 - Posted: 10 Nov 2005, 20:53:16 UTC

HM, I have one with 1% after 8 hours

To get a better understanding of this problem, redundancy could help. I think it would be interesting to see whether a unit that sticks on one machine also fails on all the others ...




Supporting BOINC, a great concept !

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 2849 - Posted: 10 Nov 2005, 21:31:07 UTC - in response to Message 2838.  
Last modified: 10 Nov 2005, 21:33:28 UTC

It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix.

>OK, David....thanks for the prompt reply.....we have dropped our connection rate to .1 days and will monitor our boxes more closely (and hope for the best)....Cheers, Rog.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 2882 - Posted: 11 Nov 2005, 11:07:21 UTC - in response to Message 2843.  

To get a better understanding of this problem, redundancy could help. I think it would be interesting to see whether a unit that sticks on one machine also fails on all the others ...

Unfortunately, in *MY* experience the 1% work units have *ALL* run successfully after a restart. So, it *MAY* be due to random variations in the flux capacitors ...

Even running on a different computer may not "prove" anything. Unless using the same random seed *AND* an identical CPU/FPU, well, you are going to see different behavior out of the models.

And when I say identical, I mean identical down to the last transistor. That means the same stepping etc.

Compliance with the IEEE 754 (and later) standards does *NOT* imply identical results output. You *WILL* see variations in the outer fringes of precision. Even more, successive runs can still result in differences *IF* the FPU's operation is partially dependent on prior states. In other words, if it is not in the same state at the restart it *MAY* perform differently the second time through even though you THOUGHT you started at the same point.

Oh, and the random cosmic ray can also "flip" a bit ... :)
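
A minimal Python sketch of one related effect, not taken from the Rosetta code: floating-point addition is not associative, so two builds or two FPU modes that evaluate the same expression in a different order can disagree in the last bit. The values below are purely illustrative.

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently changes the rounded result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # rounds after a + b, then again after adding c
right_to_left = a + (b + c)   # rounds after b + c, then again after adding a

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # a difference in the last bit only
```

A difference this small is harmless on its own, but it can be enough to flip a later comparison, which is one way two runs with the same seed could end up taking different paths.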

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 2935 - Posted: 12 Nov 2005, 0:39:59 UTC

I caught a stuck wu on my laptop a while back and re-ran it manually with the same seed and it didn't get stuck.

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 2939 - Posted: 12 Nov 2005, 2:12:43 UTC - in response to Message 2935.  

I caught a stuck wu on my laptop a while back and re-ran it manually with the same seed and it didn't get stuck.


I get the odd one still and (as far as I know) restarting BOINC has always fixed it. Not everybody watches progress however, so there could be CPUs out there that have been (or will be) spinning their wheels for days, weeks, perhaps longer.

I'm not a programmer (I do some scripting for websites only) and I understand it may be hard to track the source of this seemingly random problem. I'm wondering though... When a WU is stuck at 1%, is it actually doing anything? Is there some way the app can trap a timeout or error condition and send a signal to BOINC to restart or resume the WU?
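
A rough sketch of the kind of stall check being asked about, in Python. This is not how BOINC or the Rosetta app works; `is_stalled`, the sample format, and the threshold are all hypothetical names and values chosen for illustration. The idea is simply to record (time, fraction done) pairs and flag the work unit when the fraction has not changed for longer than some limit.

```python
from typing import Sequence, Tuple


def is_stalled(samples: Sequence[Tuple[float, float]],
               max_idle_seconds: float) -> bool:
    """Return True if progress has not changed for more than max_idle_seconds.

    samples: (timestamp_seconds, fraction_done) pairs in chronological order.
    Both the function and its arguments are hypothetical, not a BOINC API.
    """
    if not samples:
        return False
    last_time, last_progress = samples[-1]
    stall_start = last_time
    # Walk backwards to find when the current progress value was first seen.
    for timestamp, progress in reversed(samples):
        if progress != last_progress:
            break
        stall_start = timestamp
    return (last_time - stall_start) > max_idle_seconds


# Stuck at 1% for two hours, with a one-hour limit: flagged as stalled.
print(is_stalled([(0, 0.01), (3600, 0.01), (7200, 0.01)], 3600))
```

The hard part is choosing the limit: a work unit that legitimately sits at 1% for a long time would be flagged too, so any fixed threshold would have to be generous.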


*** Join BOINC@Australia today ***

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 2946 - Posted: 12 Nov 2005, 6:04:43 UTC
Last modified: 12 Nov 2005, 6:19:11 UTC

It would be nice to get to the bottom of this vexing problem. We've had another box lock up at 1% (the third in the last 36 hours). We upgraded to BOINC 5.x on all boxes and that cured things for about 2 weeks. We are spending too much time monitoring and it is difficult to reset remotely housed boxes. Sadly, we will have to withdraw from the project until this is resolved. It's a great project and we will monitor your progress with one box and hope for the best. We will be back with the others once this bug is zapped....good hunting!

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 2947 - Posted: 12 Nov 2005, 7:14:07 UTC

I have had 3 or 4 on mine (tried everything to kick start with no luck at all). Not had one for days now tho'.

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 12,673,776
RAC: 54,546
Message 2957 - Posted: 12 Nov 2005, 9:01:53 UTC - in response to Message 2946.  

We are spending too much time monitoring and it is difficult to reset remotely housed boxes.

You know that BOINCView can help you save a lot of time monitoring your boxes?



Supporting BOINC, a great concept !

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 2964 - Posted: 12 Nov 2005, 12:02:24 UTC - in response to Message 2957.  

We are spending too much time monitoring and it is difficult to reset remotely housed boxes.

You know that BOINCView can help you save a lot of time monitoring your boxes?

Thanks for the tip, Yeti. I'll give it a try. I see they have released a new science app too. Maybe that will help as well......hope springs eternal!...Cheers, Rog.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 2990 - Posted: 12 Nov 2005, 16:11:27 UTC

Well, maybe when the graphics are available that will give us a clue ...

Rom looked for a similarly intermittent problem for, like, forever. Not a bad error, but it took forever to find out the cause ("no finished file" problem).

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 3008 - Posted: 12 Nov 2005, 18:48:37 UTC - in response to Message 2990.  

Well, maybe when the graphics are available that will give us a clue ...

Rom looked for a similarly intermittent problem for, like, forever. Not a bad error, but it took forever to find out the cause ("no finished file" problem).

I hear you, Paul.....so far no problems with R@H 4.79 on 5 boxes. Keeping my fingers crossed, though, as it was such a random, annoying thing for the Devs (and everyone :)....Cheers, Rog.

hugothehermit
Joined: 26 Sep 05
Posts: 238
Credit: 314,893
RAC: 0
Message 3011 - Posted: 12 Nov 2005, 19:33:33 UTC

I had one that got stuck; suspending / resuming didn't work, nor did a BOINC restart. A reboot did.


AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 3012 - Posted: 12 Nov 2005, 19:43:58 UTC - in response to Message 3011.  

I had one that got stuck; suspending / resuming didn't work, nor did a BOINC restart. A reboot did.


Hi Hugo. I take it that it was a R@H 4.79 WU that got stuck. If that is the case then I guess we aren't out of the woods yet. Thanks for the info...Cheers, Rog.

ralic
Joined: 22 Sep 05
Posts: 16
Credit: 46,481
RAC: 0
Message 3173 - Posted: 14 Nov 2005, 12:46:18 UTC - in response to Message 2703.  

I did suggest that a time "cap" be placed on the start up of a work unit, though it was pointed out that the use of a fixed amount of time is not viable...

It looks like there is at least some kind of cap present.
resultid=1372617 reports "Maximum CPU time exceeded" after 60,359.58 CPU time.

It hasn't been sent to another user, perhaps the project team can investigate this one?
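
For a sense of scale, here is a small Python conversion of that figure. It assumes the reported CPU time is in seconds, which the result page quoted above does not state explicitly; `format_cpu_time` is just an illustrative helper.

```python
def format_cpu_time(seconds: float) -> str:
    """Convert a CPU time in seconds (assumed unit) to hours, minutes, seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{int(hours)} h {int(minutes)} min {secs:.2f} s"


print(format_cpu_time(60359.58))  # 16 h 45 min 59.58 s
```

If the unit assumption holds, the cap kicked in after a little under 17 hours of CPU time.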

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 3463 - Posted: 17 Nov 2005, 2:41:46 UTC

Not sure whether it's relevant or not, but I had one that wedged at 1% for a couple of hours. Rather than shut it down, I let it run, but took a quick look inside the stdout.txt file. The last line was this:

pre-computing chuck/gunn move set for frag length 1

It's moved on now, but whatever that chuck/gunn move set thingumy is, it sure cogitated on it for a while. I'll save the stdout.txt file in case anyone's interested.
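
For anyone who wants to do the same check on their own boxes, a minimal Python sketch for pulling the last non-empty line out of a log file. The helper name `last_nonempty_line` is made up for illustration, and where stdout.txt lives depends on your BOINC installation, so the path is left to the caller.

```python
from pathlib import Path


def last_nonempty_line(path: str) -> str:
    """Return the last non-blank line of a text file, or "" if there is none."""
    # errors="replace" keeps the read from failing on odd bytes in the log.
    lines = Path(path).read_text(errors="replace").splitlines()
    for line in reversed(lines):
        if line.strip():
            return line.strip()
    return ""


# Example call (path is an assumption, adjust to your slot directory):
# print(last_nonempty_line("stdout.txt"))
```

Comparing that last line across several stuck work units would show whether they all wedge at the same step.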


Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 3883 - Posted: 22 Nov 2005, 8:01:35 UTC

When can you say there is a problem?
After how many minutes/hours should you stop and restart the client, or even abort the job?

Now running for 20 minutes and still 1%.

Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 3886 - Posted: 22 Nov 2005, 8:07:07 UTC - in response to Message 3883.  

When can you say there is a problem?
After how many minutes/hours should you stop and restart the client, or even abort the job?

Now running for 20 minutes and still 1%.


New WUs with _omega_ take much longer than the old ones; my P4 needs 1 to 1.5 h to jump to 20%. It's a little confusing because the checkpoints here are at 20, 40, 60, 80 and 100%. I have finished some in 2:20 h, others took up to 4 h.


©2024 University of Washington
https://www.bakerlab.org