Posts by UBT - Timbo

1) Message boards : Number crunching : Stalled WU (Message 105907)
Posted 11 Apr 2022 by UBT - Timbo
Post:
Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?


Yes, there is a problem with some of the Rosetta VBox tasks that causes this behaviour.


Hi

Yup - it certainly seems like that :-(

You'd have thought that a "tech admin" overseeing the results returned would have recognised that a certain percentage were taking far too long to be reported, and would be actively figuring out what the problem was and fixing it.

Instead, the situation seems to be that volunteers' computers are wasting time, money and electricity, by spinning their wheels, due to Rosetta's poor and inefficient management of the tasks they make available. :-(
2) Message boards : Number crunching : Stalled WU (Message 105901)
Posted 10 Apr 2022 by UBT - Timbo
Post:
Look at the difference between CPU time and elapsed time, either there is something serious running alongside Boinc or, far more likely, those tasks are dead.


Hi

Thanks for the feedback. :-)

I've seen this sort of behaviour before with other non-VBox projects and usually the rule of thumb is to "leave them be" and they will (eventually) complete...

But I've not had this happen with Rosetta's VBox tasks before - and indeed I have one other host, with the same OS (Win 7 Pro), the same VBox version and the same version of BOINC Manager, and that has been fairly rattling through the tasks...and both hosts have plenty of installed, working RAM - and no other significant non-BOINC tasks are taking place simultaneously.

eg: One VBoxHeadless.exe is taking up 71Mb, the other is at 39Mb and VirtualBox.exe is taking up 18.5Mb - which are minute amounts of RAM in the grand scheme of things...

So, it might be my old CPU on this one host could be "past it" - maybe the right CPU "core-functions" are not up to the mark ...but it works fine with LHC and QuChem VBox tasks...

Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?
3) Message boards : Number crunching : Stalled WU (Message 105898)
Posted 10 Apr 2022 by UBT - Timbo
Post:
Hi all

I don't think I have "stalled" tasks - as the %age work done is still increasing - but they are taking AGES to complete...

task #1

Application - rosetta python projects 1.03 (vbox64)
Name - aagb-NMPHE_pp-NMVAL-GGLY-mACPenC12C_pp_7_2674773_4
State - Running
Received - 08/04/2022 00:41:54
Report deadline - 11/04/2022 00:41:56
Estimated computation size - 80,000 GFLOPs
CPU time - 00:34:13
CPU time since checkpoint - 00:00:06
Elapsed time - 1d 20:46:22
Estimated time remaining - 01:06:58
Fraction done - 97.568%
Virtual memory size - 101.57 MB
Working set size - 2.79 GB
Directory - slots/3
Process ID - 5000
Progress rate - 2.160% per hour
Executable - vboxwrapper_26203_windows_x86_64.exe

=========
task #2

Application - rosetta python projects 1.03 (vbox64)
Name - aagb-mAZE-mPHE-GPN-mB3PHG_pp_9_2612326_4
State - Running
Received - 08/04/2022 00:41:11
Report deadline - 11/04/2022 00:41:13
Estimated computation size - 80,000 GFLOPs
CPU time - 00:37:56
CPU time since checkpoint - 00:00:06
Elapsed time - 2d 02:10:06
Estimated time remaining - 00:47:31
Fraction done - 98.446%
Virtual memory size - 101.04 MB
Working set size - 2.79 GB
Directory - slots/1
Process ID - 7280
Progress rate - 1.800% per hour
Executable - vboxwrapper_26203_windows_x86_64.exe


And from Task Manager I see that CPU usage fluctuates between 0% and maybe 1%
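To put a number on the gap between CPU time and elapsed time, here is a minimal Python sketch using the figures listed for the two tasks. The helper names (`to_seconds`, `cpu_efficiency`) are made up for this illustration and are not part of BOINC; the duration format is assumed to be the one BOINC Manager shows in the task properties.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '1d 20:46:22' or '00:34:13' to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    # The day part is optional, so a missing group counts as zero days.
    days, hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of the elapsed (wall-clock) time that was spent on the CPU."""
    return to_seconds(cpu_time) / to_seconds(elapsed_time)


# (CPU time, elapsed time) copied from the task properties listed above.
tasks = {
    "task #1": ("00:34:13", "1d 20:46:22"),
    "task #2": ("00:37:56", "2d 02:10:06"),
}

for name, (cpu, elapsed) in tasks.items():
    share = cpu_efficiency(cpu, elapsed)
    print(f"{name}: {share:.2%} of elapsed time spent on the CPU")
```

Both tasks come out at roughly 1.3%, which is in line with the Task Manager reading.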

This is very much a waste of computing time, if the tasks are not actually doing much...but I don't want to abort them, if the task is going to complete and the "result" file is of benefit...

Maybe some admin can provide more succinct answers as to why this is happening, as others seem to have reported similar issues with what appear to be "zombie" tasks...
4) Message boards : Number crunching : Turn off Virtualbox task from host details (Message 105857)
Posted 7 Apr 2022 by UBT - Timbo
Post:
Hi all

If you do not want to crunch VBox tasks on Rosetta then this might help:


You can STOP VBox tasks from being crunched...so you can then just crunch the standard Rosetta tasks.

Go to your account on Rosetta website:

Click on:

Computers on this account > "View"

and then for each computer (host) listed, click on "Details" and at the bottom of the list it says:

"VirtualBox VM jobs"

Change it to "Allow" and your account will say:

Host updated
This host will no longer receive new VirtualBox VM jobs

Do this for each computer.

That's it !!

(Oh and click on "Rosetta" project and then "Update" within BOINC Manager too, so it knows your new settings).
5) Message boards : Cafe Rosetta : Kings Distributed Systems - Alpha Registration (Message 87450)
Posted 3 Oct 2017 by UBT - Timbo
Post:
B8Ub8XjZgn9u5QtbrjST0E3V3hcVG1Bq
6) Message boards : Cafe Rosetta : Kings Distributed Systems - Alpha Registration (Message 87387)
Posted 26 Sep 2017 by UBT - Timbo
Post:
n4WcRh6jghkjaD87YPFqh2SAnNZdq-8T
7) Message boards : Cafe Rosetta : Kings Distributed Systems - Alpha Registration (Message 87358)
Posted 24 Sep 2017 by UBT - Timbo
Post:
mN49nzCRR4tf1HNUIM0BOBG6zh-vf1y1
8) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12802)
Posted 29 Mar 2006 by UBT - Timbo
Post:
And I've just had the following errors:

29/03/2006 13:36:59|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5153_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:37:42|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5070_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:38:25|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5196_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:08|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5188_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:50|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5215_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:40:31|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5154_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5114_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:53|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5117_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:42:32|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5175_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:43:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5184_0 ( - exit code -529697949 (0xe06d7363))


This is using 3GHz P4 (with HT), 512Mb memory, Win XP (Srv Pck 2) + BOINC v5.3.28

That error code usually means the machine ran out of memory during the execution of the workunit. Since you only have 512MB of RAM and one instance of Rosetta can use up to 250MB of RAM, I would recommend turning off HT.



Hi,

Thanks for the reply.

So, in this day and age of service to customers, why doesn't the error message say that?

(instead of "exit code -529697949")
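For what it's worth, the negative number and the hex value in brackets are the same 32-bit value written two ways, and the hex form is the one that can be looked up. Below is a small Python sketch of that conversion. The lookup table is deliberately tiny and only holds two Windows codes that appear in these threads; `describe_exit_code` is a name invented for this example, not a BOINC function. Note that 0xe06d7363 only says that a Microsoft C++ exception went unhandled. A failed memory allocation is one possible cause of such an exception, but the code by itself does not name the cause.

```python
# Two well-known Windows exception codes; anything else is left unexplained.
KNOWN_WINDOWS_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (invalid memory access)",
    0xE06D7363: "unhandled Microsoft C++ exception",
}


def describe_exit_code(exit_code: int) -> str:
    """Show a signed 32-bit exit code as unsigned hex, with a hint if known."""
    unsigned = exit_code & 0xFFFFFFFF  # two's-complement view of the same bits
    meaning = KNOWN_WINDOWS_CODES.get(unsigned, "not in this small table")
    return f"{exit_code} = 0x{unsigned:08x}: {meaning}"


print(describe_exit_code(-529697949))
# -529697949 = 0xe06d7363: unhandled Microsoft C++ exception
```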

2nd: from this page:

The minimum spec is: Windows XP CPU: 500MHz or higher HDD space: 200MB Memory: 512MB.

Think they need to "tweak" this to state: "PER PROCESS".
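The "PER PROCESS" point is easy to see with the two figures already quoted: 512MB installed and up to 250MB per Rosetta instance. A rough Python sketch of the arithmetic follows; it assumes the 250MB figure is a peak per running task and makes no attempt to guess how much Windows itself needs.

```python
TOTAL_RAM_MB = 512       # installed memory on this host
PER_TASK_PEAK_MB = 250   # "up to 250MB" per Rosetta instance, as quoted above


def headroom_mb(concurrent_tasks: int) -> int:
    """Memory left over if every running task reaches its quoted peak."""
    return TOTAL_RAM_MB - concurrent_tasks * PER_TASK_PEAK_MB


for tasks in (1, 2):  # HT off = 1 task at a time, HT on = 2
    print(f"{tasks} task(s): {headroom_mb(tasks)} MB left for everything else")
```

With HT on and two tasks running, that leaves 12 MB for the operating system and anything else, against 262 MB with a single task.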


In the meantime, will go back to crunching for other projects.

(edit) All the other projects I crunch for don't have any issues with regards to only having 512Mb of memory...!



Oh well.....

regards,

Tim

(Unless some-one's got a spare stick of 512Mb PC2700 memory lying around they might want to donate?)
9) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12793)
Posted 29 Mar 2006 by UBT - Timbo
Post:
[quote]Report all Work Unit errors on this thread that are NOT -

    "1% Hang"
    "Max Time Exceeded"
    or other "stuck" or "hung" workunits

[/quote]


Hi all,

Have seen the message about downloading the PDB file (I dl'd version 4.83 to "match" the v4.83 application I have) and having had issues before, thought that maybe, if I had problems this time around, then at least some decent reports will go back.

And I've just had the following errors:

29/03/2006 13:36:59|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5153_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:37:42|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5070_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:38:25|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5196_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:08|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5188_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:39:50|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5215_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:40:31|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5154_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5114_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:41:53|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5117_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:42:32|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5175_0 ( - exit code -529697949 (0xe06d7363))
29/03/2006 13:43:12|rosetta@home|Unrecoverable error for result NO_TERM_STRAND_1hz6_383_5184_0 ( - exit code -529697949 (0xe06d7363))


This is using 3GHz P4 (with HT), 512Mb memory, Win XP (Srv Pck 2) + BOINC v5.3.28

Hope this helps the "cause" to resolve the bugs.



Will go back to crunching on RALPH instead...!


regards,

Tim

10) Message boards : Number crunching : Help us solve the 1% bug! (Message 12461)
Posted 21 Mar 2006 by UBT - Timbo
Post:
timbo, above you wrote you changed your prefs to 4 days, if that was the target cpu run time in your ralph@home preferences then the slow movement of percentage and the increasing time to completion is perfectly normal cuz it will run for 4 days with that setting (boinc doesnt know about that project specific option yet, so it cant include it in that prediction, it has to finish some units first to make the prediction more correct and will be far off again if you change the target cpu time)
as long as the graphics are still moving, even very slowly (when the stage says full atom relax) its not stuck :)



OK - thanks for that info.

Had assumed that the option to change pref's meant that the PROJECT ran for 4 days straight - not the actual work unit itself. And besides, I would have thought that if you allowed the WU to have "direct control" over what BOINC is supposed to be doing, (for these 4 days), then that must impact other WU that you will be crunching for.

So, will BOINC get in a "tizz" if you work on 4 day long Rosetta WU's and you have other WU's from other projects waiting and getting close to or past their deadlines?

It's nice for the project to give users that amount of control, but I think it's a bit too much....!


BTW: Didn't the problem of these 1% WU's occur sometime around the time Rosetta allowed users to change these exact preferences...?

I've crunched quite a few Rosetta WU's and never really had a problem until recently.


regards,

Tim
11) Message boards : Number crunching : Help us solve the 1% bug! (Message 12441)
Posted 21 Mar 2006 by UBT - Timbo
Post:
This is getting stranger.
After about 14 minutes total crunching time, the 1st WU:
(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)
has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU
(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)
is still at 2.35%.



OK - so the 1st WU is now at 4 hr 27 mins of CPU time and the Progress is now at 4.56%

Completion time was around 8 hr 30 m, but now reads: 12 hrs 24m !!!


The 2nd WU is now at 4 hr 47 mins and 4.75% with a completion time of 12 hrs 25m (was about 8 hr 30m)
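As a rough cross-check on these figures, here is a Python sketch of a straight-line extrapolation from CPU time and progress to total run time. It assumes progress is proportional to CPU time, which may well not be how Rosetta reports it, and `projected_total_hours` is a name invented for this example.

```python
def projected_total_hours(cpu_hours: float, progress_percent: float) -> float:
    """Total CPU hours needed if progress keeps growing at the same rate."""
    if progress_percent <= 0:
        raise ValueError("progress must be positive")
    return cpu_hours * 100.0 / progress_percent


# (CPU time in hours, progress in percent) from the two work units above.
observations = {
    "1st WU": (4 + 27 / 60, 4.56),
    "2nd WU": (4 + 47 / 60, 4.75),
}

for name, (hours, percent) in observations.items():
    total = projected_total_hours(hours, percent)
    print(f"{name}: about {total:.0f} hours ({total / 24:.1f} days) at this rate")
```

Both work units project to a little over four days, which is in the region of the 4 day run-time preference mentioned elsewhere in this thread.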


In both cases, the graphics in the "Searching..." box *is* moving:

with both 1st WU and 2nd WU, the graphics seem to "settle down" for a bit (with the shapes in both boxes being "similar"). The bottom right numbers change slowly.


After a short while, in the "Searching..." box, the graphic then starts moving more rapidly. This corresponds to a faster rate of change of the numbers in the bottom right.


Will let them continue and see what happens over the next 24 hours...!

regards,

Tim
(edit) typo
12) Message boards : Number crunching : Help us solve the 1% bug! (Message 12420)
Posted 21 Mar 2006 by UBT - Timbo
Post:
This is getting stranger.

After about 14 minutes total crunching time, the 1st WU:

(HB_BARCODE_30_1bk2__352_137_0 using rosetta_beta version 493)

has now changed to 0.178% progress (on the graphics screen) and is now stuck again.

After 34 minutes crunching time the 2nd WU

(HB_BARCODE_30_5croA_352_136_0 using rosetta_beta version 493)

is still at 2.35%.


Will let these carry on for an hour or so and report back then.

regards,

Tim

(edit) added WU Names
13) Message boards : Number crunching : Help us solve the 1% bug! (Message 12419)
Posted 21 Mar 2006 by UBT - Timbo
Post:
Have now shut-down BOINC and going to "play" a bit with my "project prefs"


OK - changed my project prefs from default to max - 50, 50 and 4 days.

Also set my BOINC prefs to "pre-empted".

Have also set computer to "visible" if it helps.


Restarted BOINC.

RALPH WU's are the only ones I have working.

Immediately, when BOINC restarted, the very 1st WU reset the crunched time to zero, but was still showing 1% progress.

Did a manual update of the project.

Still the same.

The 2nd WU is now on 2.35% (was 2.34%). But hasn't moved at all from there for the last 5 minutes.


In "desperation mode", I've tried to suspend/resume various WU's in the hope of either causing a "computation error" or at least getting a WU to move off from the 1%. So far, nothing has changed.....!



In both cases, the CPU time (for RALPH WU's) is continuing to increase - it's just the "Progress" that stays stuck - if it weren't for that, you'd think all was well!!

regards,

Tim


PS: System is:
CPU: Pentium 4, inc HT @ 3.06GHz (not overclocked)
Memory: 512Mb
OS: Windows XP + SP2
HDD: 24Gb free space
Graphics: Radeon 9500 Pro
BOINC: v5.2.13 (standard, not optimised)
All other projects crunch OK.

(edit) added BOINC version
14) Message boards : Number crunching : Help us solve the 1% bug! (Message 12418)
Posted 21 Mar 2006 by UBT - Timbo
Post:
Have started some RALPH units.



Having just written the last msg, I thought what the heck !! Need to experiment to help you guys.

So, I went back to BOINC and sure enough, only one of the 2 WU's was still at 1% - the other one has jumped up to 2.34%. But it's got stuck again.

So, I suspended the 1% and allowed BOINC to switch to the next RALPH WU. Upon starting it immediately went to 1%....and stuck!

So, suspended that one and allowed a 4th WU to start. And that went straight to 1% and stuck. Same with 5th and now 6th.

Have now shut-down BOINC and going to "play" a bit with my "project prefs".

regards,

Tim

15) Message boards : Number crunching : Help us solve the 1% bug! (Message 12417)
Posted 21 Mar 2006 by UBT - Timbo
Post:
Please--if you have frequent occurrences of the 1% bug--it would help us enormously to solve it if you could sign up for RALPH@home. Rom can then identify the exact lines of code where the problem is occurring and it will be easy to fix from there. The problem is that many machines don't have this problem, and they can't help us to track it down and solve it.


OK David,

Have started some RALPH units.

And what's happening you ask???

The first two (I have a P4/HT) have both got "stuck" at 1%.

Checked the graphics - having re-installed BOINC as a single-user - and the time is increasing nicely, as it should, the pictures are real pretty and crunching seems to be taking place, but the 1% is not moving...!

What do I do now?

Abort these 2 and see what happens with the next couple of WU's?

Suspend them and see what happens with the next 2?

Give up?

regards,

Tim
16) Message boards : Number crunching : Help us solve the 1% bug! (Message 12254)
Posted 19 Mar 2006 by UBT - Timbo
Post:
But there are more users that are not farmers and that is why I suggest people use the Display function to look at the graphic.


But, if like me, you've installed BOINC as a service, the display option is NOT available.

I've had to re-install BOINC as a single-user, in order to figure out why Rosetta was messing around and failing to complete WU's.

(Luckily, I'm very PC literate, so this wasn't a problem - but for some newbies, who have joined this project and THINK they are doing useful work - for them, this could be a real deal breaker, if the project doesn't sort itself out - although with Rom doing his bit now, I have much greater faith that this will be resolved soon).


In the meantime, like others, I've lost faith in any new work that I might download and have now suspended Rosetta and am crunching more for other projects as a result, as I'm not keen on wasting the processing power at my disposal - it's not a lot, but the reason for joining BOINC was to make my PC do work, while the CPU was idle.

And having it run Rosetta and not generating useful results is a worse scenario than not having BOINC installed in the first place...!


In the meantime, I am going to have to call off the "Weekend Crunch" in favour of Rosetta that we had planned for next weekend, and we'll have to switch our crunching power over to another project, as I cannot accept responsibility for my team crunching for a project that cannot provide work units that can consistently be returned.

We'll be back supporting you when you have a solution (which I'm sure will happen soon, but maybe not in time for 25th-26th March ! )

regards,

Tim
17) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12207)
Posted 18 Mar 2006 by UBT - Timbo
Post:
The exciting news is that the Boinc consultant we have hired, Rom, has made an improvement in how the rosetta process terminates that seems to really have made a difference on Ralph. The problem seems to have been not any bug in the rosetta code, but a problem in how the [b]rosetta process shuts itself down when the processor starts doing something else[/b] (hence the leave in memory bug, etc.).


Hi David,

Well that ties in with an observation I can make, which I've seen once or twice.

I have noticed that Rosetta work units "fail" when my 3GHz P4/HT switches from one project to another - so there seem to be issues when the Rosetta process is "suspended" by BOINC as it then switches over to another project - (I'm running BBC CCE as a second simultaneous BOINC project on the same PC - this "switch over problem" tends to occur when one "CPU" switches out of working on a Rosetta WU and then switches over to the CCE WU).

Maybe this is a help - but seems Rom is on the right trail.


regards

Tim

(edit) typo
18) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12154)
Posted 17 Mar 2006 by UBT - Timbo
Post:
This project used to be bullet-proof - what's changed....?

believe me, we are trying!



GREAT NEWS - Perhaps it might be an idea to let people know there is a problem and to stop making work available until it's fixed - that'll take the pressure off you guys.

What sort of percentage of the work returned to you is being trashed by this bug?

Would imagine it's fairly high - although, if it was an epidemic failure, would assume you would have stopped sending out work before now.

But surely you must be going into damage limitation mode by now. Can you afford to lose lots of crunchers?

regards and good luck.

Tim
19) Message boards : Number crunching : Report stuck & aborted WU here please (Message 12113)
Posted 16 Mar 2006 by UBT - Timbo
Post:
This WU was stuck at 1% for a day - then started "on its own" - got to about 70% done and then BOINC switched over to one of the other projects I'm running (BBC CCE) and immediately I got a "computation error" and the percentage went to 100%.

16/03/2006 23:19:42|rosetta@home|Unrecoverable error for result FA_RLXdh_hom025_1dhn__360_62_0 ( - exit code -164 (0xffffff5c))


So, that's more wasted CPU cycles.

This project used to be bullet-proof - what's changed....?

regards,

Tim

PS - Our team with 430+ overall members (and at least 130 already joined up to Rosetta) were going to be concentrating on Rosetta for a "Crunching Weekend" on 25th-26th March.

see here: http://www.ukboincteam.org.uk/uk-boinc-team.html

If this project doesn't get sorted REAL QUICK, we'll be forced to switch our attentions to a different project....!


20) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12052)
Posted 15 Mar 2006 by UBT - Timbo
Post:
Report all Work Unit errors on this thread that are NOT -

    "1% Hang"
    "Max Time Exceeded"
    or other "stuck" or "hung" workunits





15/03/2006 00:31:28|rosetta@home|Unrecoverable error for result FA_RLXcc_hom003_1cc8A_359_158_0 ( - exit code -164 (0xffffff5c))
15/03/2006 03:04:02|rosetta@home|Unrecoverable error for result FA_RLXbq_hom005_1bq9A_359_158_0 ( - exit code -1073741819 (0xc0000005))
15/03/2006 11:26:32|rosetta@home|Unrecoverable error for result FA_RLXbk_hom002_1bk2__359_459_0 ( - exit code -164 (0xffffff5c))
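Message-log lines like the three above have a regular enough shape to be tallied by exit code, which may make a report like this easier for the project to use. The following Python sketch is one way to do that. The regular expression is written against the lines quoted in this post only and is an assumption about the format, not something taken from BOINC documentation; `count_exit_codes` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern fitted to the log lines quoted in this post (assumed format).
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<project>[^|]+)\|Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-f]+)\)\)$"
)

SAMPLE = """\
15/03/2006 00:31:28|rosetta@home|Unrecoverable error for result FA_RLXcc_hom003_1cc8A_359_158_0 ( - exit code -164 (0xffffff5c))
15/03/2006 03:04:02|rosetta@home|Unrecoverable error for result FA_RLXbq_hom005_1bq9A_359_158_0 ( - exit code -1073741819 (0xc0000005))
15/03/2006 11:26:32|rosetta@home|Unrecoverable error for result FA_RLXbk_hom002_1bk2__359_459_0 ( - exit code -164 (0xffffff5c))
"""


def count_exit_codes(log_text: str) -> Counter:
    """Count how often each exit code appears in 'Unrecoverable error' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            counts[int(match.group("code"))] += 1
    return counts


for code, count in count_exit_codes(SAMPLE).most_common():
    print(f"exit code {code}: {count} result(s)")
```

For the three lines in this post that gives two results with exit code -164 and one with -1073741819.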



Not too happy about getting these errors - but grateful to the project if they can fix it so that all WU's are good and can return useful results.

regards,

Tim








