Message boards : Number crunching : Increased to 512MB as recommended memory requirement
Author | Message |
---|---|
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 206 |
Hey ARmassey, how you doing, good job you're doing for the Team ... :) David should be aware of this problem, or at least he was at one time, as people were sending him the stdout.txt from the slots of Hung WU's. I personally sent him 6 of them myself. Maybe he thought the problem went away because nobody's really been saying much about it lately, I don't know. But I can assure him the problem is still here & not just @ 1% either. So far I've seen it @ 1% - 8.33% - 75% & 91.66% ... I've had WU's Hung at all 4 of those % Points just today alone ... I only made mention of it in the earlier post because I hadn't seen any response from any of the Dev's on it recently. Hopefully something can be done about it, because it's frustrating as all get out to wake up in the morning & see WU's with 7 hr's of CPU Time still @ the 1% Mark. I would just as soon see them Error out after a set time and go on to the next WU rather than just sit there at the same point hour after hour. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Hey ARmassey, how you doing, good job you're doing for the Team ... :) I am aware of this problem and have been looking into it. I actually ran into this myself on my laptop running a 1acf_abrelax WU, which allowed me to look closely at the issue and try to debug... but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run, so it looks like it may not be an issue with the rosetta application but possibly with the boinc client. I'll keep looking into it. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 206 |
but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run ========= Yes, that's very common for the WU to continue on the Second try after you Shut the Manager down & restart it. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
but when I tried to run it again with the same random number seed on the same computer, it continued on past where it was stuck on the previous run Actually, I am running the tests in standalone mode. What I mean is that when I try the WU again with the exact same app, computer, and random number seed (which should give an identical run, and in fact the numbers that are returned in stdout are exactly the same, which confirms this), it does not get stuck. If it was the app, it should get stuck in the exact same place. I guess this is equivalent to restarting the manager since it also should use the same random number seed. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 206 |
Okay, I get what you mean ... but like I said & other people have said also, if we restart the WU again by some other means, most often it will run normally. If that's worth anything or means anything ... :) |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with. |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with. > That is great news, David. How soon will you issue the new app? As I have stated before, I wasn't having any problems until you updated your servers, which in my mind also points to a random BOINC bug. After I updated my boxes to 5.2.1, it also reduced the problems by 90%. Again, a BOINC issue. With your updated app using the most current BOINC API code, we all should be on the same page. Thanks for the feedback....Cheers, Rog. |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
After I updated my boxes to 5.2.1, it also reduced the problems by 90%. Again, a BOINC issue. With your updated app using the most current BOINC API code, we all should be on the same page. Thanks for the feedback....Cheers, Rog. For whatever it is worth, BOINC 5.2.2 is out now. Regards, Bob P. |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
Hey ARmassey, how you doing, good job you're doing for the Team ... :) > Thanks, Bob. Until I upgraded to 5.2.1 and it reduced the 'hang' rate I was tempted to bail out as well. Thanks for hanging in there and hopefully David is on to something with their new app. Also, it was good that you brought the subject up again as they maybe didn't realize what a pain it was to people who have multiple boxes. I rather suspect they don't want to lose someone like yourself who has such impressive processing capability.....Cheers, Rog. |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
For whatever it is worth, BOINC 5.2.2 is out now. > Good to know....Thanks, Bob. Will check it out. Cheers, Rog. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
5.2.2 is not "out" except as a test release. Use at your own risk ... |
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
5.2.2 is not "out" except as a test release. Use at your own risk ... OK, thanks for the 'heads up', Paul. (maybe 'heads up' has a different meaning for an ex-sailor?? :) Cheers, Rog. |
STE\/E Send message Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 206 |
I am hoping that our updated app using the most current boinc api code will fix this issue, assuming it is an issue with the boinc api that may have been dealt with. What will the new App # be David, right now I have the Rosetta 4.77 WU's ... ??? The reason I ask is because I'm trying to run the WU's I have now down to just a couple on each Computer, so when the New App comes out I can start running them right away and see if they Hang or not at certain % Points ... :) |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Yeah, no more lines ... :) |
kb7rzf Send message Joined: 7 Oct 05 Posts: 16 Credit: 35,427 RAC: 0 |
I am now running one of these WU's as well, it paused @ 1%, 8.33% now at 16.67%, and still sitting there. Tried the exiting of BOINC and rebooting and still stuck. Just posting for info. Thanks Jeremy [edit] Now paused at 25%.[/edit] |
Mike Gelvin Send message Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
David, A lot of people are concerned over these hangs at 1% (and other points). In your learned opinion, how long (in CPU time) should a person wait before they actually call it a hang? Be generous, some folks have slow computers. My concern is that people are giving up early. I have seen some "hang" for about 1.5 hrs and then continue on (they were not hung)... and this is on a Pent 4 running at 2.4 GHz. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
If it hangs for over 3 hours, it is likely to be stuck. I am going to change the rsc_fpops_bound value for new WU's so that it will abort WU's that are stuck (exceed the time it takes to do this upper bound of floating-point operations based on the computer's benchmark). I am hoping the updated BOINC api will deal with this issue because it does not seem like it is our application. |
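To make the bound described above concrete, here is a minimal Python sketch of the arithmetic: an upper limit on floating-point operations divided by the host's benchmarked speed gives a time limit, and a WU past that limit gets aborted. The function names and the sample numbers are assumptions for illustration only; the exact formula the BOINC client applies is not given in this thread.

```python
def max_elapsed_seconds(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """Time limit implied by an fpops upper bound on a host of given speed.

    rsc_fpops_bound: upper bound on floating-point operations for the WU.
    benchmark_flops: benchmarked floating-point operations per second
                     (hypothetical value supplied by the caller).
    """
    if benchmark_flops <= 0:
        raise ValueError("benchmark_flops must be positive")
    return rsc_fpops_bound / benchmark_flops


def should_abort(cpu_seconds: float, rsc_fpops_bound: float,
                 benchmark_flops: float) -> bool:
    """True once the WU has used more CPU time than the bound allows."""
    return cpu_seconds > max_elapsed_seconds(rsc_fpops_bound, benchmark_flops)


if __name__ == "__main__":
    # Illustrative numbers only: a 2e13 fpops bound on a 1e9 flops/s host.
    limit = max_elapsed_seconds(2e13, 1e9)
    print(f"limit: {limit / 3600:.2f} hours")
    print("abort after 7 h of CPU time?", should_abort(7 * 3600, 2e13, 1e9))
```

Because the limit scales with the benchmark, a slower computer gets proportionally more time before a WU is treated as stuck.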
AnRM Send message Joined: 18 Sep 05 Posts: 123 Credit: 1,355,486 RAC: 0 |
David, > Rest assured we are talking about the 8-12 hours or so range. Your point is well taken though as the % readout is not linear. Thanks for the fix, David. Cheers, Rog. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
Just for the record. I just restarted BOINC as I had a WU going nowhere - still on 1% after 2 hours on a 3.4GHz P4 with 1GB RAM. stderr.txt was empty (zero bytes). Since restarting, that same WU (1btn_abrelax_04549_2) has crunched for just 25 minutes and it's at 41.67%. David, if these things happen (for those of us willing and able to spend time monitoring), is there anything we can glean from the stdout.txt file (or another file in the slot) in terms of seeing whether it's stuck or not? *** Join BOINC@Australia today *** |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Just for the record. Good question. The stdout file should grow in smaller time increments compared to the structures being produced, so if you do not see output being appended to the stdout file for over an hour, particularly when it is still at the initial stages (1%), it is most likely stuck. |
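For anyone who would rather script the check described above than watch the slot by hand, a rough Python sketch follows. It only looks at the last-modified time of stdout.txt and flags the WU when nothing has been written for over an hour. The helper names and the example path are hypothetical, and the one-hour threshold is simply the rule of thumb from the post.

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def looks_stuck(path, threshold_seconds=3600.0, now=None):
    """True if nothing has been appended to the file for over the threshold."""
    return seconds_since_last_write(path, now) > threshold_seconds


if __name__ == "__main__":
    # Hypothetical location; point this at the stdout.txt in your own slot.
    stdout_path = os.path.join("slots", "0", "stdout.txt")
    if os.path.exists(stdout_path):
        idle_minutes = seconds_since_last_write(stdout_path) / 60
        print(f"idle for {idle_minutes:.0f} minutes")
        print("looks stuck" if looks_stuck(stdout_path) else "still writing")
    else:
        print("stdout.txt not found at", stdout_path)
```

A quiet file is only a hint, not proof: as noted earlier in the thread, some WUs sat at one point for about 1.5 hours and then carried on.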
©2024 University of Washington
https://www.bakerlab.org