Message boards : Number crunching : Help us solve the 1% bug!
Author | Message |
---|---|
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
Here's a WU that wasted 105.9 hours before I noticed it in BOINCView. I checked the graphics and saw no discernible movement. I suspended the WU and restarted it, with no joy. I exited BOINC and restarted it, still no joy, so I aborted the WU. Did I mention it wasted 105.9 hours? <grrrrrr> FA_RLXey_hom011_1eyvA_360_160_0, Result ID 13903946, Work unit 11233006, Computer ID 56899, CPU time 381298.796875. stderr out: <core_client_version>5.2.13</core_client_version> <message>aborted via GUI RPC </message> <stderr_txt> # random seed: 2665711 # cpu_run_time_pref: 36000 </stderr_txt> |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring, and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it. |
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
"Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring, and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it." The problem is it doesn't necessarily happen a lot on all machines. I don't think I've ever had two on the same computer. I already have a machine (computer # 1947) crunching Ralph WUs, and it's had 11 failures out of 40 downloaded but no 1%ers. I ran Ralph on another machine (computer # 317) and ran 19 WUs (when it could get one) without a problem... But that doesn't help with the other 29 machines. They have completed 43 WUs today (the 20th) with 6 failures, including the one I aborted for the 1% error. |
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I just did a random check on the rest of my computers and found a common problem that most of them have experienced at one time or another: Result ID: 12869089; Name: HOMSdt_homDB030_1dtj__352_802_0; Workunit: 10345130; Created: 7 Mar 2006 14:32:01 UTC; Sent: 8 Mar 2006 1:45:20 UTC; Received: 8 Mar 2006 1:49:41 UTC; Server state: Over; Outcome: Client error; Client state: Computing; Exit status: 1 (0x1); Computer ID: 142185; Report deadline: 22 Mar 2006 1:45:20 UTC; CPU time: 25.890625; stderr out: <core_client_version>5.2.13</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> </stderr_txt>; Validate state: Invalid; Claimed credit: 0.165637012638972; Granted credit: 0; application version: 4.82 |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
"Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring, and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it." OK David, I have started some RALPH units. And what's happening, you ask? The first two (I have a P4/HT) have both got "stuck" at 1%. Checked the graphics, having re-installed BOINC as a single-user: the time is increasing nicely, as it should, the pictures are real pretty and crunching seems to be taking place, but the 1% is not moving! What do I do now? Abort these 2 and see what happens with the next couple of WUs? Suspend them and see what happens with the next 2? Give up? regards, Tim |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
Have started some RALPH units. Having just written the last msg, I thought what the heck! Need to experiment to help you guys. So, I went back to BOINC and sure enough, only one of the 2 WUs was still at 1%; the other one had jumped up to 2.34%. But it's got stuck again. So, I suspended the 1% one and allowed BOINC to switch to the next RALPH WU. Upon starting, it immediately went to 1%... and stuck! So, I suspended that one and allowed a 4th WU to start. And that went straight to 1% and stuck. Same with the 5th and now the 6th. Have now shut down BOINC and am going to "play" a bit with my "project prefs". regards, Tim |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
Have now shut-down BOINC and going to "play" a bit with my "project prefs". OK - changed my project prefs from default to max: 50, 50 and 4 days. Also set my BOINC prefs to "pre-empted". Have also set computer to "visible" if it helps. Restarted BOINC. RALPH WUs are the only ones I have working. Immediately, when BOINC restarted, the very 1st WU reset the crunched time to zero, but it is still showing 1% progress. Did a manual update of the project. Still the same. The 2nd WU is now on 2.35% (was 2.34%), but it hasn't moved at all from there for the last 5 minutes. In "desperation mode", I've tried to suspend/resume various WUs in the hope of either causing a "computation error" or at least getting a WU to move off from the 1%. So far, nothing has changed! In both cases, the CPU time (for RALPH WUs) is continuing to increase; it's just the "Progress" that stays stuck. If it weren't for that, you'd think all was well! regards, Tim PS: System is: CPU: Pentium 4, inc HT @ 3.06GHz (not overclocked); Memory: 512 MB; OS: Windows XP + SP2; HDD: 24 GB free space; Graphics: Radeon 9500 Pro; BOINC: v5.2.13 (standard, not optimised). All other projects crunch OK. (edit) added BOINC version |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
This is getting stranger. After about 14 minutes total crunching time, the 1st WU: (HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493) has now changed to 0.178% progress (on the graphics screen) and is now stuck again. After 34 minutes crunching time the 2nd WU (HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493) is still at 2.35%. Will let these carry on for an hour or so and report back then. regards, Tim (edit) added WU Names |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
"Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring, and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it." Is the protein still jumping around on the screen? If so, definitely let it continue! |
BadThad Send message Joined: 8 Nov 05 Posts: 30 Credit: 71,834,523 RAC: 0 |
Arrgggg.....looks like the 1% stuck wu's are back: FA_RLXc9_1c9oA_359_372_0 1% after 19 hr 44 min. |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
This is getting stranger. OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%. Completion time was around 8 hr 30 m, but now reads 12 hrs 24m! The 2nd WU is now at 4 hr 47 mins and 4.75%, with a completion time of 12 hrs 25m (was about 8 hr 30m). In both cases, the graphics in the "Searching..." box *are* moving: with both the 1st WU and the 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly. After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right. Will let them continue and see what happens over the next 24 hours! regards, Tim (edit) typo |
doc :) Send message Joined: 4 Oct 05 Posts: 47 Credit: 1,106,102 RAC: 0 |
timbo, above you wrote that you changed your prefs to 4 days. if that was the target cpu run time in your ralph@home preferences, then the slow movement of the percentage and the increasing time to completion are perfectly normal, because the work unit will run for 4 days with that setting. (boinc doesn't know about that project-specific option yet, so it can't include it in its prediction. it has to finish some units first to make the prediction more accurate, and it will be far off again if you change the target cpu time.) as long as the graphics are still moving, even very slowly (when the stage says full atom relax), it's not stuck :) |
Doug Worrall Send message Joined: 19 Sep 05 Posts: 60 Credit: 58,445 RAC: 0 |
Hello, I feel embarrassed posting only one 1% stuck bug. It's 4.81_i6 "FA_RLXpt_h....." yada. It had a problem downloading also: 3 attempts got "Timed out" {error}. It's red anyway. LOL. Not too concerned about 1 w/u, but I will subscribe to this thread, and if I am able to help out I will. I just do not have enough time to read all these posts on this problem. Also, lots of people are running multiple boxes and they are the ones needing help with this bug. "Happy Crunching All" Sincerely Doug Sluger Worrall |
dag Send message Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
I'm having many 1% bugs on FA_RLX jobs. I may have a good set of data points here, as the failures are ~100% on one multi-processor Linux machine, but not on two other multi-processor Linux machines, and not on a single-processor XP-SP2 machine. The Linux machines are all 2.4.21-XXX Linux (slightly different patch levels) and all have four Intel Xeon processors but are clocked (no overclocking) at 2.8, 3.2, and 3.4. The slowest machine has the failures. They are all running the same BOINC client. Call if you need to. dag 719 590 3038 dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
Call if you need to. Please see David Baker's plea, which I quote: "Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring, and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it." Regards, Bob P. |
UBT - Timbo Send message Joined: 25 Sep 05 Posts: 20 Credit: 2,299,279 RAC: 436 |
"timbo, above you wrote you changed your prefs to 4 days, if that was the target cpu run time in your ralph@home preferences then the slow movement of percentage and the increasing time to completion is perfectly normal cuz it will run for 4 days with that setting (boinc doesnt know about that project specific option yet, so it cant include it in that prediction, it has to finish some units first to make the prediction more correct and will be far off again if you change the target cpu time)" OK - thanks for that info. I had assumed that the option to change prefs meant that the PROJECT ran for 4 days straight, not the actual work unit itself. And besides, I would have thought that if you allow the WU to have "direct control" over what BOINC is supposed to be doing (for these 4 days), then that must impact other WUs that you will be crunching. So, will BOINC get in a "tizz" if you work on 4-day-long Rosetta WUs and you have other WUs from other projects waiting and getting close to or past their deadlines? It's nice for the project to give users that amount of control, but I think it's a bit too much! BTW: Didn't the problem of these 1% WUs start sometime around when Rosetta allowed users to change these exact preferences? I've crunched quite a few Rosetta WUs and never really had a problem until recently. regards, Tim |
doc :) Send message Joined: 4 Oct 05 Posts: 47 Credit: 1,106,102 RAC: 0 |
the 1% stuck bug was there long before the cpu target time option was introduced. boinc will switch between projects according to the "switch between applications every" setting in your general preferences (and your resource shares, of course). and we are getting a little bit off-topic here :) |
MD_Willington Send message Joined: 8 Dec 05 Posts: 1 Credit: 47,751 RAC: 0 |
This is getting stranger. Same here, at ~75 hours. Should I ditch the WU or let it go for the long haul? MD |
Rom Walton (BOINC) Volunteer moderator Project developer Send message Joined: 17 Sep 05 Posts: 18 Credit: 40,071 RAC: 0 |
A new version of Rosetta has been posted in the RALPH@Home project (see the Release Notes). For those who are so inclined, please help us track down the issue by running RALPH@Home, and if/when you find a workunit with the '1% bug', feel free to abort it and call it out in this thread. Thanks in advance for any help you can provide. ----- Rom |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
This is getting stranger. As long as the graphics show movement, the calculation is proceeding, so it's best to stick with it. |
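
For anyone trying to tell a slow work unit from a stuck one, the reasoning in this thread can be sketched as a small calculation. This is only a rough illustration, not how BOINC or Rosetta actually computes progress: it assumes (and this is an assumption, not something confirmed above) that progress grows roughly linearly with CPU time until the target CPU run time is reached. The function names `parse_cpu_run_time_pref` and `expected_progress_percent` are hypothetical helpers made up for this example; the only input taken from the thread is the `# cpu_run_time_pref:` line that appears in the stderr output and the CPU times people reported.

```python
import re
from typing import Optional


def parse_cpu_run_time_pref(stderr_text: str) -> Optional[int]:
    """Return the target CPU run time in seconds from a stderr dump, or None.

    Hypothetical helper: it only looks for the '# cpu_run_time_pref: N'
    line that shows up in the stderr output quoted in this thread.
    """
    match = re.search(r"#\s*cpu_run_time_pref:\s*(\d+)", stderr_text)
    return int(match.group(1)) if match else None


def expected_progress_percent(cpu_seconds: float, target_seconds: float) -> float:
    """Assumed model: progress grows linearly with CPU time up to the target."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return min(100.0, 100.0 * cpu_seconds / target_seconds)


# Timbo's first RALPH work unit: 4 h 27 min of CPU time with a 4-day target.
timbo_cpu = 4 * 3600 + 27 * 60
timbo_target = 4 * 24 * 3600
print(f"{expected_progress_percent(timbo_cpu, timbo_target):.2f}%")

# Nite Owl's aborted work unit: target read from the stderr output.
stderr_sample = "# random seed: 2665711 # cpu_run_time_pref: 36000"
nite_owl_target = parse_cpu_run_time_pref(stderr_sample)
nite_owl_cpu = 381298.796875
print(f"{nite_owl_cpu / nite_owl_target:.1f}x the target run time")
```

Under that linear assumption, Timbo's work unit would be expected to show about 4.64% after 4 h 27 min, close to the 4.56% he reported, which fits the "slow but not stuck" explanation. Nite Owl's work unit, by contrast, had used about 10.6 times its 10-hour target, which is the kind of case the developers are asking people to report.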
©2024 University of Washington
https://www.bakerlab.org