1% for 37 hours

Author	Message
Scribe · Joined: 2 Nov 05 · Posts: 284 · Credit: 157,359 · RAC: 0	Message 2724 - Posted: 9 Nov 2005, 15:32:08 UTC
Thanks for that, will give it a try the next time I see one.

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 2817 - Posted: 10 Nov 2005, 17:18:39 UTC
We have been 1%-free for weeks now, but in the last day we have had two... has something changed? Or is it just a random thing, as this thread suggests? We also lost about 20 hours in total when we could have been doing useful work. Any closer to finding the cause, I wonder?

David E K · Volunteer moderator · Project administrator · Project developer · Project scientist · Joined: 1 Jul 05 · Posts: 1480 · Credit: 4,334,829 · RAC: 0	Message 2838 - Posted: 10 Nov 2005, 20:26:18 UTC
It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me, so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix.

Yeti · Joined: 2 Nov 05 · Posts: 45 · Credit: 16,195,236 · RAC: 0	Message 2843 - Posted: 10 Nov 2005, 20:53:16 UTC
Hm, I have one at 1% after 8 hours. To get a better understanding of this problem, redundancy could help. I think it would be interesting to see whether the problem appears on all machines ...
Supporting BOINC, a great concept !

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 2849 - Posted: 10 Nov 2005, 21:31:07 UTC - in response to Message 2838. Last modified: 10 Nov 2005, 21:33:28 UTC
Quote: It appears to be a random event. I could not reproduce this using the exact same input and random seed from the examples sent to me, so it will be very hard to debug. If the source code gets released (see this thread), which will most likely happen sometime in the future along with redundancy, this bug will be a good candidate for developers out there to try to fix.
OK, David... thanks for the prompt reply. We have dropped our connection rate to .1 days and will monitor our boxes more closely (and hope for the best)... Cheers, Rog.

Paul D. Buck · Joined: 17 Sep 05 · Posts: 815 · Credit: 1,812,737 · RAC: 0	Message 2882 - Posted: 11 Nov 2005, 11:07:21 UTC - in response to Message 2843.
Quote: To get a better understanding of this problem, redundancy could help. I think it would be interesting to see whether the problem appears on all machines ...
Unfortunately, in MY experience with the 1% work units, they have ALL run successfully after a restart. So, it MAY be due to random variations in the flux capacitors ...
Even running on a different computer may not "prove" anything. Unless you are using the same random seed AND an identical CPU/FPU, well, you are going to see different behavior out of the models. And when I say identical, I mean *identical* down to the last transistor. That means the same stepping etc. Compliance with the IEEE 754 (and later) standards does *NOT* imply identical output. You WILL see variations in the outer fringes of precision.
Even more, successive runs can still result in differences IF the FPU's operation is partially dependent on prior states. In other words, if it is not in the same state at the restart, it MAY perform differently the second time through even though you THOUGHT you started at the same point.
Oh, and the random cosmic ray can also "flip" a bit ... :)
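
The point about variations at the outer fringes of precision can be illustrated with a small sketch. It does not reproduce the 1% bug and says nothing about its actual cause; it only shows that floating-point addition is not associative, so the same sum evaluated in a different order (as different compilers or FPUs may do) can differ in the last bits. The values are arbitrary and chosen for illustration.

```python
# Illustrative only: floating-point addition is not associative, so the
# grouping used to evaluate a sum can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one evaluation order
right_to_left = a + (b + c)   # the same sum, grouped differently

# The two results are not bit-identical, even though both are "correct"
# roundings under IEEE 754 double precision.
print(left_to_right == right_to_left)       # False
print(abs(left_to_right - right_to_left))   # roughly 1.1e-16
```

A difference this small is harmless in a single sum, but a long simulation that branches on such values could in principle follow a different path on a second run.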

David E K · Volunteer moderator · Project administrator · Project developer · Project scientist · Joined: 1 Jul 05 · Posts: 1480 · Credit: 4,334,829 · RAC: 0	Message 2935 - Posted: 12 Nov 2005, 0:39:59 UTC
I caught a stuck WU on my laptop a while back and re-ran it manually with the same seed, and it didn't get stuck.

Webmaster Yoda · Joined: 17 Sep 05 · Posts: 161 · Credit: 162,253 · RAC: 0	Message 2939 - Posted: 12 Nov 2005, 2:12:43 UTC - in response to Message 2935.
Quote: I caught a stuck WU on my laptop a while back and re-ran it manually with the same seed, and it didn't get stuck.
I still get the odd one and (as far as I know) restarting BOINC has always fixed it. Not everybody watches progress, however, so there could be CPUs out there that have been (or will be) spinning their wheels for days, weeks, perhaps longer.
I'm not a programmer (I do some scripting for websites only) and I understand it may be hard to track the source of this seemingly random problem. I'm wondering though... When a WU is stuck at 1%, is it actually doing anything? Is there some way the app can trap a timeout or error condition and send a signal to BOINC to restart or resume the WU?
* Join BOINC@Australia today *
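
The timeout idea raised above could look roughly like the following sketch. It is a hypothetical helper, not something BOINC or the Rosetta application is known to provide: the class name `StallWatchdog`, its parameters, and the two-hour threshold are all assumptions made for illustration. It only flags a task whose reported progress has not changed for longer than a chosen limit; what to do next (restart, resume, abort) is left open.

```python
import time
from typing import Optional


class StallWatchdog:
    """Flags a task whose reported progress has not changed for too long.

    Hypothetical helper for illustration; not part of BOINC or Rosetta.
    """

    def __init__(self, max_stall_seconds: float) -> None:
        self.max_stall_seconds = max_stall_seconds
        self._last_progress: Optional[float] = None
        self._last_change: Optional[float] = None

    def update(self, progress: float, now: Optional[float] = None) -> bool:
        """Record a progress sample; return True if the task looks stalled."""
        if now is None:
            now = time.monotonic()
        if self._last_progress is None or progress != self._last_progress:
            # First sample, or progress moved: restart the stall timer.
            self._last_progress = progress
            self._last_change = now
            return False
        return (now - self._last_change) > self.max_stall_seconds


# Usage with explicit timestamps (in seconds) so the behavior is easy to follow.
watchdog = StallWatchdog(max_stall_seconds=2 * 3600)  # assumed threshold
print(watchdog.update(0.01, now=0.0))        # False: first sample
print(watchdog.update(0.01, now=3600.0))     # False: stuck for 1 h, under limit
print(watchdog.update(0.01, now=3 * 3600.0)) # True: stuck for 3 h, over limit
```

Choosing the threshold is the hard part: a limit that is too short would flag work units that are merely slow to start.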

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 2946 - Posted: 12 Nov 2005, 6:04:43 UTC Last modified: 12 Nov 2005, 6:19:11 UTC
It would be nice to get to the bottom of this vexing problem. We've had another box lock up at 1% (the third in the last 36 hours). We upgraded to BOINC 5.x on all boxes and that cured things for about 2 weeks. We are spending too much time monitoring, and it is difficult to reset remotely housed boxes. Sadly, we will have to withdraw from the project until this is resolved. It's a great project, and we will monitor your progress with one box and hope for the best. We will be back with the others once this bug is zapped... good hunting!

Scribe · Joined: 2 Nov 05 · Posts: 284 · Credit: 157,359 · RAC: 0	Message 2947 - Posted: 12 Nov 2005, 7:14:07 UTC
I have had 3 or 4 on mine (tried everything to kick-start them, with no luck at all). Not had one for days now, tho'.

Yeti · Joined: 2 Nov 05 · Posts: 45 · Credit: 16,195,236 · RAC: 0	Message 2957 - Posted: 12 Nov 2005, 9:01:53 UTC - in response to Message 2946.
Quote: We are spending too much time monitoring, and it is difficult to reset remotely housed boxes.
You know that BOINCView can help you save a lot of time monitoring your boxes?
Supporting BOINC, a great concept !

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 2964 - Posted: 12 Nov 2005, 12:02:24 UTC - in response to Message 2957.
Quote: We are spending too much time monitoring, and it is difficult to reset remotely housed boxes. You know that BOINCView can help you save a lot of time monitoring your boxes?
Thanks for the tip, Yeti. I'll give it a try. I see they have released a new science app too. Maybe that will help as well... hope springs eternal! Cheers, Rog.

Paul D. Buck · Joined: 17 Sep 05 · Posts: 815 · Credit: 1,812,737 · RAC: 0	Message 2990 - Posted: 12 Nov 2005, 16:11:27 UTC
Well, maybe when the graphics are available that will give us a clue ...
Rom looked for a similarly intermittent problem for, like, forever. Not a bad error, but it took forever to find out the cause ("no finished file" problem).

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 3008 - Posted: 12 Nov 2005, 18:48:37 UTC - in response to Message 2990.
Quote: Well, maybe when the graphics are available that will give us a clue ... Rom looked for a similarly intermittent problem for, like, forever. Not a bad error, but it took forever to find out the cause ("no finished file" problem).
I hear you, Paul... so far no problems with R@H 4.79 on 5 boxes. Keeping my fingers crossed, though, as it was such a random, annoying thing for the Devs (and everyone :)... Cheers, Rog.

hugothehermit · Joined: 26 Sep 05 · Posts: 238 · Credit: 314,893 · RAC: 0	Message 3011 - Posted: 12 Nov 2005, 19:33:33 UTC
I had one that got stuck; suspending/resuming didn't work, nor did a BOINC restart. A reboot did.

AnRM · Joined: 18 Sep 05 · Posts: 123 · Credit: 1,355,486 · RAC: 0	Message 3012 - Posted: 12 Nov 2005, 19:43:58 UTC - in response to Message 3011.
Quote: I had one that got stuck; suspending/resuming didn't work, nor did a BOINC restart. A reboot did.
Hi Hugo. I take it that it was an R@H 4.79 WU that got stuck. If that is the case, then I guess we aren't out of the woods yet. Thanks for the info... Cheers, Rog.

ralic · Joined: 22 Sep 05 · Posts: 16 · Credit: 46,481 · RAC: 0	Message 3173 - Posted: 14 Nov 2005, 12:46:18 UTC - in response to Message 2703.
Quote: I did suggest that a time "cap" be placed on the start up of a work unit, though it was pointed out that the use of a fixed amount of time is not viable...
It looks like there is at least some kind of cap present. resultid=1372617 reports "Maximum CPU time exceeded" after 60,359.58 CPU time. It hasn't been sent to another user; perhaps the project team can investigate this one?
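
For scale, and assuming the reported figure is in seconds (the post does not state the unit), 60,359.58 seconds of CPU time is a little under 17 hours. The sketch below does that conversion and shows one way a CPU-time cap can be enforced in general. `run_with_cpu_cap` and `CpuTimeExceeded` are hypothetical names invented for this example; this is not how BOINC or Rosetta implements its limit, which the thread does not describe.

```python
import time

# Worked conversion, assuming the reported CPU time is in seconds.
reported_cpu_seconds = 60_359.58
cpu_hours = reported_cpu_seconds / 3600
print(f"{cpu_hours:.2f} hours")  # 16.77 hours


class CpuTimeExceeded(RuntimeError):
    """Raised when the (hypothetical) CPU-time cap is hit."""


def run_with_cpu_cap(step, max_cpu_seconds: float, max_steps: int) -> int:
    """Call step() up to max_steps times, aborting once this process has
    used more than max_cpu_seconds of CPU time since the call began."""
    start = time.process_time()  # CPU time of the current process
    for done in range(max_steps):
        if time.process_time() - start > max_cpu_seconds:
            raise CpuTimeExceeded(
                f"Maximum CPU time exceeded after {done} steps"
            )
        step()
    return max_steps
```

Because the cap is measured in CPU time rather than wall-clock time, a suspended or preempted task does not count against it.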

dgnuff · Joined: 1 Nov 05 · Posts: 350 · Credit: 24,773,605 · RAC: 0	Message 3463 - Posted: 17 Nov 2005, 2:41:46 UTC
Not sure whether it's relevant or not, but I had one that wedged at 1% for a couple of hours. Rather than shut it down, I let it run, but took a quick look inside the stdout.txt file. The last line was this:
pre-computing chuck/gunn move set for frag length 1
It's moved on now, but whatever that chuck/gunn move set thingumy is, it sure cogitated on it for a while. I'll save the stdout.txt file in case anyone's interested.

Grutte Pier [Wa Oars]~MAB The Frisian · Joined: 6 Nov 05 · Posts: 87 · Credit: 497,588 · RAC: 0	Message 3883 - Posted: 22 Nov 2005, 8:01:35 UTC
When can you say there is a problem? After how many minutes/hours should you stop and restart the client, or even stop the job? Mine has now been running for 20 minutes and is still at 1%.

Rebirther · Joined: 17 Sep 05 · Posts: 116 · Credit: 41,315 · RAC: 0	Message 3886 - Posted: 22 Nov 2005, 8:07:07 UTC - in response to Message 3883.
Quote: When can you say there is a problem? After how many minutes/hours should you stop and restart the client, or even stop the job? Mine has now been running for 20 minutes and is still at 1%.
New WUs with _omega_ take much longer than the old ones; my P4 needs 1-1.5 h to jump to 20%. It's a little bit confusing because the checkpoints here are 20, 40, 60, 80, 100. I finished some in 2:20 h, and others took up to 4 h.
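
The checkpoint behavior described above explains why a healthy work unit can sit at 1% for a long time. A minimal sketch, assuming progress is only reported at the 20% checkpoints and that the display shows 1% until the first one is reached (the 1% floor and the function name `reported_percent` are assumptions for illustration, not taken from the application):

```python
def reported_percent(true_fraction: float, step: int = 20, floor: int = 1) -> int:
    """Map real progress (0.0 to 1.0) to a display that only advances at
    checkpoints every `step` percent, showing `floor` before the first one."""
    if not 0.0 <= true_fraction <= 1.0:
        raise ValueError("true_fraction must be between 0.0 and 1.0")
    # Round down to the last completed checkpoint.
    completed = int(true_fraction * 100) // step * step
    return max(completed, floor)


# A work unit that is 19% done still displays 1%; it jumps to 20 only
# once the first checkpoint is actually reached.
for fraction in (0.0, 0.19, 0.2, 0.59, 1.0):
    print(fraction, "->", reported_percent(fraction))
```

Under this model, 20 minutes at 1% is expected on a machine that needs over an hour to reach the first checkpoint.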