Help us solve the 1% bug!

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12362 - Posted: 21 Mar 2006, 0:12:56 UTC
Last modified: 21 Mar 2006, 0:36:13 UTC

Here's a WU that wasted 105.9 hours before I noticed it in BOINCView.... Checked the graphics, no discernible movement observed. I suspended the WU, restarted it with no joy. Exited from BOINC, restarted BOINC, still no joy... Aborted WU. Did I mention it wasted 105.9 hours? <grrrrrr>

FA_RLXey_hom011_1eyvA_360_160_0 , Result ID 13903946, Work unit 11233006, Computer ID 56899, CPU time 381298.796875.

stderr out <core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# random seed: 2665711
# cpu_run_time_pref: 36000

</stderr_txt>
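For anyone who wants to sanity-check a result like this one, here is a minimal Python sketch that compares the reported CPU time with the `cpu_run_time_pref` value from stderr. The function names and both thresholds are hypothetical, chosen only for illustration; this is not how BOINC or Rosetta decides anything, and a WU whose graphics are still moving may be fine even if this flags it.

```python
def overrun_ratio(cpu_time_s: float, cpu_run_time_pref_s: float) -> float:
    """How many times over the preferred run time a result has gone."""
    if cpu_run_time_pref_s <= 0:
        raise ValueError("cpu_run_time_pref_s must be positive")
    return cpu_time_s / cpu_run_time_pref_s


def looks_stuck(cpu_time_s: float,
                cpu_run_time_pref_s: float,
                progress_pct: float,
                max_ratio: float = 2.0,            # hypothetical threshold
                stuck_progress_pct: float = 1.0    # hypothetical threshold
                ) -> bool:
    """Rough hint only: progress still at or below ~1% while the CPU time
    is far beyond the preferred run time."""
    ratio = overrun_ratio(cpu_time_s, cpu_run_time_pref_s)
    return progress_pct <= stuck_progress_pct and ratio > max_ratio


# Figures from the result above: CPU time 381298.796875 s, pref 36000 s.
cpu_time_s = 381298.796875
pref_s = 36000.0
print(f"{cpu_time_s / 3600:.1f} hours of CPU time")
print(f"{overrun_ratio(cpu_time_s, pref_s):.1f}x the preferred run time")
print("looks stuck:", looks_stuck(cpu_time_s, pref_s, progress_pct=1.0))
```

With these figures the WU ran to roughly ten times its preferred run time while still showing 1%, which is the pattern people in this thread are reporting.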





David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12374 - Posted: 21 Mar 2006, 4:21:07 UTC

Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12387 - Posted: 21 Mar 2006, 5:52:45 UTC - in response to Message 12374.  
Last modified: 21 Mar 2006, 6:30:12 UTC

The problem is it doesn't necessarily happen a lot on all machines. I don't think I've ever had two on the same 'puter.
I already have a machine (computer # 1947) crunching Ralph WUs, and it's had 11 failures of 40 downloaded but no 1%ers. I ran Ralph on another machine (computer # 317) and ran 19 WUs (when it could get one) without a problem... But that doesn't help with the other 29 machines. They have completed 43 WUs today (the 20th) with 6 failures, including the one I aborted for the 1% error.

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12393 - Posted: 21 Mar 2006, 6:17:16 UTC

I just did a random check on the rest of my computers and found a common problem that most of them have experienced at one time or another:

Result ID 12869089
Name HOMSdt_homDB030_1dtj__352_802_0
Workunit 10345130
Created 7 Mar 2006 14:32:01 UTC
Sent 8 Mar 2006 1:45:20 UTC
Received 8 Mar 2006 1:49:41 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 1 (0x1)
Computer ID 142185
Report deadline 22 Mar 2006 1:45:20 UTC
CPU time 25.890625
stderr out <core_client_version>5.2.13</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

</stderr_txt>



Validate state Invalid
Claimed credit 0.165637012638972
Granted credit 0
application version 4.82

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12417 - Posted: 21 Mar 2006, 11:35:07 UTC - in response to Message 12374.  

OK David,

Have started some RALPH units.

And what's happening you ask???

The first two (I have a P4/HT) have both got "stuck" at 1%.

Checked the graphics - having re-installed BOINC as a single-user - and the time is increasing nicely, as it should, the pictures are real pretty and crunching seems to be taking place, but the 1% is not moving...!

What do I do now?

Abort these 2 and see what happens with the next couple of WU's?

Suspend them and see what happens with the next 2?

Give up?

regards,

Tim

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12418 - Posted: 21 Mar 2006, 11:42:18 UTC - in response to Message 12417.  


Having just written the last msg, I thought what the heck !! Need to experiment to help you guys.

So, I went back to BOINC and sure enough, only one of the 2 WU's was still at 1% - the other one has jumped up to 2.34%. But it's got stuck again.

So, I suspended the 1% and allowed BOINC to switch to the next RALPH WU. Upon starting it immediately went to 1%....and stuck!

So, suspended that one and allowed a 4th WU to start. And that went straight to 1% and stuck. Same with 5th and now 6th.

Have now shut down BOINC and am going to "play" a bit with my "project prefs".

regards,

Tim

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12419 - Posted: 21 Mar 2006, 12:03:13 UTC - in response to Message 12418.  
Last modified: 21 Mar 2006, 12:14:58 UTC

OK - changed my project prefs from default to max - 50, 50 and 4 days.

Also set my BOINC prefs to "pre-empted".

Have also set computer to "visible" if it helps.


Restarted BOINC.

RALPH WU's are the only ones I have working.

Immediately, when BOINC restarted, the very 1st WU reset the crunched time to zero, but still showing 1% progress.

Did a manual update of the project.

Still the same.

The 2nd WU is now on 2.35% (was 2.34%). But hasn't moved at all from there for the last 5 minutes.


In "desperation mode", I've tried to suspend/resume various WU's in the hope of either causing a "computation error" or at least getting a WU to move off from the 1%. So far, nothing has changed.....!



In both cases, the CPU time (for RALPH WU's) is continuing to increase - it's just the "Progress" that stays stuck - if it weren't for that, you'd think all was well!!

regards,

Tim


PS: System is:
CPU: Pentium 4, inc HT @ 3.06GHz (not overclocked)
Memory: 512 MB
OS: Windows XP + SP2
HDD: 24 GB free space
Graphics: Radeon 9500 Pro
BOINC: v5.2.13 (standard, not optimised)
All other projects crunch OK.

(edit) added BOINC version

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12420 - Posted: 21 Mar 2006, 12:12:16 UTC - in response to Message 12419.  
Last modified: 21 Mar 2006, 12:19:04 UTC

This is getting stranger.

After about 14 minutes total crunching time, the 1st WU:

(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)

has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU

(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)

is still at 2.35%.


Will let these carry on for an hour or so and report back then.

regards,

Tim

(edit) added WU Names

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12427 - Posted: 21 Mar 2006, 15:16:58 UTC - in response to Message 12417.  

Is the protein still jumping around on the screen? If so, definitely let it continue!

BadThad
Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 12430 - Posted: 21 Mar 2006, 15:23:07 UTC

Arrgggg.....looks like the 1% stuck wu's are back:

FA_RLXc9_1c9oA_359_372_0

1% after 19 hr 44 min.

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12441 - Posted: 21 Mar 2006, 16:37:07 UTC - in response to Message 12420.  
Last modified: 21 Mar 2006, 17:09:16 UTC

OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%

Completion time was around 8 hr 30 m, but now reads: 12 hrs 24m !!!


The 2nd WU is now at 4 hr 47 mins and 4.75% with a completion time of 12 hrs 25m (was about 8 hr 30m)


In both cases, the graphics in the "Searching..." box *is* moving:

with both 1st WU and 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly.


After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right.


Will let them continue and see what happens over the next 24 hours...!

regards,

Tim
(edit) typo

doc :)
Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 12450 - Posted: 21 Mar 2006, 18:52:12 UTC

Timbo, above you wrote that you changed your prefs to 4 days. If that was the target CPU run time in your RALPH@home preferences, then the slow movement of the percentage and the increasing time to completion are perfectly normal, because the WU will run for 4 days with that setting. (BOINC doesn't know about that project-specific option yet, so it can't include it in its prediction; it has to finish some units first to make the prediction more accurate, and it will be far off again if you change the target CPU time.)
As long as the graphics are still moving, even very slowly (when the stage says full atom relax), it's not stuck :)
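To put rough numbers on that, here is a minimal Python sketch. It rests on an assumption not confirmed anywhere in this thread: that the progress percentage tracks CPU time divided by the target CPU run time. The function name is hypothetical and this is not BOINC's or Rosetta's actual formula; it only shows that a few percent after several hours is about what a 4-day target would give.

```python
SECONDS_PER_DAY = 86400


def expected_progress_pct(cpu_time_s: float, target_cpu_time_s: float) -> float:
    """Expected progress in percent, ASSUMING progress is simply
    CPU time / target CPU run time (an illustrative assumption)."""
    if target_cpu_time_s <= 0:
        raise ValueError("target_cpu_time_s must be positive")
    return min(100.0, 100.0 * cpu_time_s / target_cpu_time_s)


target_s = 4 * SECONDS_PER_DAY  # the 4-day setting mentioned above

# CPU times Tim reported earlier in the thread: 4 hr 27 min and 4 hr 47 min.
first_wu_s = 4 * 3600 + 27 * 60
second_wu_s = 4 * 3600 + 47 * 60

print(f"1st WU: about {expected_progress_pct(first_wu_s, target_s):.2f}%")
print(f"2nd WU: about {expected_progress_pct(second_wu_s, target_s):.2f}%")
```

Under that assumption the two WUs would sit at roughly 4.6% and 5.0%, in the same range as the 4.56% and 4.75% Tim saw, so slow progress with a 4-day target does not by itself mean a WU is stuck.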

Doug Worrall
Joined: 19 Sep 05
Posts: 60
Credit: 58,445
RAC: 0
Message 12454 - Posted: 21 Mar 2006, 20:24:19 UTC

Hello,
I feel embarrassed posting the only 1% stuck bug. It's 4.81_i6 "FA_RLXpt_h....."
yada. It had a problem downloading also. 3 attempts got "Timed out" {error}.
It's red anyways. LOL. Not too concerned about 1 w/u, but I will subscribe to this
thread, and if I am able to help out I will. Just do not have enough time to read
all these posts on this problem. Also, lots are running multiple boxes and they
are needing the help with this bug.
"Happy Crunching All"

Sincerely
Doug Sluger Worrall

dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12456 - Posted: 21 Mar 2006, 20:42:52 UTC
Last modified: 21 Mar 2006, 20:44:31 UTC

I'm having many 1% bugs on FA_RLX jobs. I may have a good set of data points here, as the failures are ~100% on one multi-processor Linux machine, but not on two other multi-processor Linux machines, and not on a single-processor XP-SP2 machine.

The Linux machines are all 2.4.21-XXX Linux (slightly different patch levels) and all have four Intel Xeon processors, but are clocked (no overclocking) at 2.8, 3.2, and 3.4 GHz. The slowest machine has the failures. They are all running the same BOINC client.

Call if you need to.

dag
719 590 3038
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 12457 - Posted: 21 Mar 2006, 20:47:21 UTC - in response to Message 12456.  

Please see David Baker's plea, which I quote:

Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.


Regards,
Bob P.

UBT - Timbo
Joined: 25 Sep 05
Posts: 20
Credit: 2,275,059
RAC: 0
Message 12461 - Posted: 21 Mar 2006, 21:24:22 UTC - in response to Message 12450.  

OK - thanks for that info.

Had assumed that the option to change pref's meant that the PROJECT ran for 4 days straight - not the actual work unit itself. And besides, I would have thought that if you allowed the WU to have "direct control" over what BOINC is supposed to be doing (for these 4 days), then that must impact other WUs that you will be crunching.

So, will BOINC get in a "tizz" if you work on 4-day-long Rosetta WU's and you have other WUs from other projects "waiting" and getting close to or past their deadlines...?

It's nice for the project to give users that amount of control, but I think it's a bit too much....!


BTW: Didn't the problem of these 1% WU's occur sometime around the time Rosetta allowed users to change these exact preferences...?

I've crunched quite a few Rosetta WU's and never really had a problem until recently.


regards,

Tim

doc :)
Joined: 4 Oct 05
Posts: 47
Credit: 1,106,102
RAC: 0
Message 12469 - Posted: 21 Mar 2006, 22:18:43 UTC

The 1% stuck bug was there long before the CPU target time option was introduced.
BOINC will switch between projects according to the "switch between applications every" setting in your general preferences (and your resource shares, of course).

And we are getting a little bit off-topic here :)

MD_Willington
Joined: 8 Dec 05
Posts: 1
Credit: 47,751
RAC: 0
Message 12472 - Posted: 21 Mar 2006, 22:43:58 UTC - in response to Message 12441.  

Same here, at ~75 hours. Should I ditch the WU or let it go for the long haul?

MD

Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 12478 - Posted: 22 Mar 2006, 1:42:23 UTC
Last modified: 22 Mar 2006, 1:42:41 UTC

A new version of Rosetta has been posted in the RALPH@Home project.

Release Notes

For those who are so inclined, please help us track down the issue by running RALPH@Home and if/when you find a workunit with the '1% bug' feel free to abort it and call it out in this thread.

Thanks in advance for any help you can provide.

----- Rom

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 12489 - Posted: 22 Mar 2006, 4:31:57 UTC - in response to Message 12472.  

As long as the graphics show movement, the calculation is proceeding, so it's best to stick with it.





©2024 University of Washington
https://www.bakerlab.org