Posts by Insidious

1) Message boards : Number crunching : Target CPU run time (Message 30948)
Posted 11 Nov 2006 by Insidious
Post:
Thanks Feet1st !!!!!

Exactly the kind of help I was in need of.

-Sid

Insidious, I was just wondering which optimized client you are using on your X2 3800's. I don't know if you will see this, but this was your most recent posting. If you don't feel like telling everyone, then please email it to me at motoguzzie123@aol.com


It looks like he is using
5.5.0 on two of them (most probably crunch3rs old one)
and 5.4.11 on the 3rd one (looks like standard client)


Hi Cloaked Chaos

Fluffy Chicken has it right. It's the old Crunch3R 5.5 on two of them. I can't get to the third machine as I gave it to my son for a graduation present and he has it with him at college.

Are you in need of an optimized client? Post here if so, and I'll try to help.

-Sid
2) Message boards : Number crunching : Target CPU run time (Message 30235)
Posted 29 Oct 2006 by Insidious
Post:
Thanks Feet1st !!!!!

Exactly the kind of help I was in need of.

-Sid
3) Message boards : Number crunching : Target CPU run time (Message 30228)
Posted 29 Oct 2006 by Insidious
Post:
Hi folks,

I'll be happy to set this preference wherever the project team feels it will do the most good.

I have reasonably fast CPUs (AMD X2 @ 2.5GHz) and lots of RAM available. (2GB on two machines and 1GB on the other)

I have NO idea whatsoever where to set target time to provide the most benefit to Rosetta@Home.

Can anyone provide guidance to me?

Thanks in Advance

-Sid
4) Message boards : Number crunching : Possible to delete a host? (not merging) (Message 14327)
Posted 22 Apr 2006 by Insidious
Post:
Ah, sorry then,

OK
go to the main rosetta page,
click on "your account"
click on "Computers on this account View computers "
Under this column "Computer ID", select the puter you want to merge or delete
Scroll to the bottom of the page, and if this wasn't disabled by the projects you'd see:

Click to merge this computer
or
Click to delete this host.

tony


OK, I'm with you now. I Can't delete a host atm... but will be able to in the future. (when the feature is turned on)

Thank you

-Sid
5) Message boards : Number crunching : Possible to delete a host? (not merging) (Message 14323)
Posted 22 Apr 2006 by Insidious
Post:
It'll take a while.


HOW ?

Once a result has been received, it waits in the data table for validation and the granting of credit. Once it's validated, the "canonical result" is written to the master science database. After this, the results you returned are no longer needed; they get tagged for purging and deleting. Once they are purged and deleted from the data table, you are normally free to delete the host, since it has no ties to the data table.
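
As a rough illustration of the lifecycle described above, here is a minimal Python sketch. The state names and the `can_delete_host` function are hypothetical and are not taken from the BOINC server code or its database schema; they only restate the rule that a host becomes deletable once none of its results remain in the data table.

```python
from enum import Enum, auto


class ResultState(Enum):
    """Hypothetical stages a returned result passes through."""
    RECEIVED = auto()    # waiting in the data table for validation
    VALIDATED = auto()   # credit granted, canonical result recorded
    PURGED = auto()      # removed from the data table


def can_delete_host(result_states):
    """Return True when every result for the host has been purged.

    A host with no results at all also qualifies, since nothing
    ties it to the data table.
    """
    return all(state is ResultState.PURGED for state in result_states)
```

Under this sketch, a host with zero results is deletable right away, while a host with even one validated but unpurged result is not. Whether the delete option is actually shown on the web page is a separate matter that the project controls.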

Does this answer your question?


No. I am asking when "I am allowed" to delete a host.... HOW do I do it.

ie: where do I find the option/Button/etc. and what do I click.

I have two hosts with 0 results and would like to delete them. I go to the view your computers, then to the host of concern and look everywhere on the page. (and every other page I can get to..). I see no option to delete a host computer.

So I am happy to accept that it can be done... I just would like to know HOW to do it.
6) Message boards : Number crunching : Possible to delete a host? (not merging) (Message 14307)
Posted 21 Apr 2006 by Insidious
Post:
Thanks for taking the time to convince your boss. We appreciate it. Tell your boss I/we appreciate it! If it comes down to it, we'll suck up to him/her!

We'll have 'host merge' back soon. As to deleting, the system will allow you to delete the hosts once all "Results" for that computer have been purged. It'll take a while.


HOW ?
7) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 14168)
Posted 20 Apr 2006 by Insidious
Post:
Right now the best option seems to be to get the real bugs out of the system, and then tackle the checkpointing issue. But in the face of the short time remaining before the beginning of CASP, increased checkpointing may be beyond reach at this time. Far better to have a stable application, with some use restrictions, than to have an application that checkpoints a lot but fails routinely for other reasons.


Hi Moderator9 and all,

I want to give you some updates on the long jobs / stuck jobs issue based on your suggestions and discussion here:

1, We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on ralph since yesterday.

2, We've coded up rosetta to do more frequent checkpointing in the modeling process. Now for the large jobs, we are expecting less than 30 minutes between two checkpoints. This code has been tested locally, and will be tested on ralph within a couple of days.

3, Rhiju has coded a watchdog thread for rosetta which will terminate stuck jobs and return the intermediate results. See his post in this thread. This will be tested on ralph within a couple of days too.

We think these measures will greatly improve the stability of R@H, and make your crunching effort much more enjoyable!


This sounds like good progress.
Thanks for the update!

-Sid
8) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13950)
Posted 17 Apr 2006 by Insidious
Post:
Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid


ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones this could mean five or six hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting.

The BEST answer if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein.

Also keep in mind that "keep in memory" only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory.
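
To make the checkpoint-per-model behaviour described above concrete, here is a small Python simulation. It is only a sketch under stated assumptions: `run_work_unit`, `model_costs`, and `budget` are made-up names, and the code does not reflect how Rosetta itself is implemented. It assumes a checkpoint is written only when a model finishes, and that being removed from memory mid-model loses all work since the last checkpoint.

```python
def run_work_unit(model_costs, start_model=0, budget=float("inf")):
    """Simulate one uninterrupted crunching session.

    model_costs: hours each model needs (hypothetical values).
    start_model: index of the first model not yet checkpointed.
    budget: hours available before the app is removed from memory.

    Returns (index of the next model to run, hours of work lost).
    """
    elapsed = 0.0
    next_model = start_model
    for cost in model_costs[start_model:]:
        if elapsed + cost > budget:
            # Interrupted mid-model: everything since the last
            # checkpoint is thrown away.
            return next_model, budget - elapsed
        elapsed += cost
        next_model += 1  # checkpoint happens here, at model completion
    return next_model, 0.0
```

With a single 6-hour model and a 4-hour window, `run_work_unit([6], budget=4)` returns `(0, 4.0)`: no progress is saved and four hours are lost, and the same thing happens on every retry. That is the situation "keep in memory" is meant to avoid.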


This issue has to be addressed ASAP. Many cycles go directly to the trash can because of this. An improved checkpointing system should be #1 priority on the TO DO list of the development team of Rosie.

Regards.


Bin Qian addressed this already above (we all agree on this!)
9) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13906)
Posted 16 Apr 2006 by Insidious
Post:
I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious to not let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is of our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!



Thanks Bin!

Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor!

-Sid

I get the impression from your response that you have misinterpreted the information I tried to provide, to imply that I had in some way said the project does not think this is important. Nothing could be further from the truth. The original question was why did the Work Unit run so long without a checkpoint (see thread title). I never said this was an issue the project was ignoring, or not trying to fix. I just tried to explain WHY it was working the way it was.


I didn't think you were dismissing the concerns. Actually, I saw your post as an attempt to help shed light. It was appreciated as was Bin's.

Please believe me when I tell you, I am much more the fan than the critic.

-Sid
10) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13902)
Posted 16 Apr 2006 by Insidious
Post:
I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious to not let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is of our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!



Thanks Bin!

Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor!

-Sid
11) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13896)
Posted 16 Apr 2006 by Insidious
Post:
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?


The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely.


I'm in a fortunate position that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants just simply cannot assure a computer will not be re-booted or available to crunch nothing but Rosetta 24/7/365.

I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve...your membership will suffer.

A thread asking for suggestions to increase project participation was posted some time ago.

I believe the single, most important answer is... your project must be hands-off with utter transparency to the work the PC was procured to do in the first place.

I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project.... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available.

Respectfully,
-Sid

12) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13887)
Posted 16 Apr 2006 by Insidious
Post:
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?
13) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13873)
Posted 16 Apr 2006 by Insidious
Post:
Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid
14) Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom... (Message 13867)
Posted 16 Apr 2006 by Insidious
Post:
yes you're right.
Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea).



OK, I'm not sure I get it???

Is the answer to leave these in memory? change settings to switch projects less frequently? only do Rosey on dedicated (one project) machines? Abort these WUs?

Any advice would be appreciated

-Sid
15) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11764)
Posted 7 Mar 2006 by Insidious
Post:
Sid, are your estimated times going down? Closer to actual? If so it is working. As others have pointed out, letting it go into panic mode will get the estimates closer faster.

Also, what sort of time frame are you looking at for your resource balance? If it's daily, then BOINC in general may be a lost cause for you; if it's longer-term balance, it will sort itself out over time.

As a side note SETI will futz with your completion times as well with the various angle ranges of the WU, and will be more pronounced when enhanced goes live.


Also a member of the TeAm. :)


I'm having great luck with my latest maneuver to let both projects crunch and let BOINC get its cache size adjusted to appropriate for these WUs.
I am seeing my estimated time go down (as expected) and I have several more WUs in the cache to keep it busy. Rosetta isn't asking for more work because it knows it has plenty and it is sharing nicely.
The only "loss" is the 7 work units I aborted (in their "ready to run" state) to eliminate the earliest deadline mode of ops.
From all the help I have received here in the way of explanation, I see that until BOINC updates their client to recognize Rosetta's 'time management' scheme (if you will allow my phrasing), this will just be necessary when Rosetta makes drastic changes in work unit crunch times, until the WUs from the earlier issue are cleared from the system.

I'm happy....

-Sid

16) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11761)
Posted 7 Mar 2006 by Insidious
Post:
That is a very accurate re-iteration of the issue I am trying to describe. (the idling of another BOINC project in favor of Rosetta for a day or so)
If it were only a matter of a single instance of this occurrence I wouldn't think too much of it. The trouble is that this particular machine has gone into this cycle 2 times now. The first time, I aborted the excess work units and the machine was fine for a while but overloaded itself again after a few days. So, this time I reset the project and again, after a few days I found that it had overloaded itself once again. I aborted about a half-dozen of the pending WUs and now it is happy... but it is frustrating to keep watching Rosetta push the other project aside. (I like the other project too)


Most of your problem is right here. The continual deletion of WU's and resetting the project keep BOINC from adjusting the duration correction factor properly to the new WU size.

I know you don't want to hear this, but leave BOINC alone, and you'll have fewer problems in the long run. Yes, you'll have days where one project monopolizes the machine (I've had this with Leiden and Sztaki recently.) But in the end, the adjustment factor will kick in, and the long-term debts will accrue correctly and you will have days with NO work for Rosetta. In the end, everything will balance out. But by micro-managing, you're probably making it worse, not better.
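
For readers wondering why leaving BOINC alone helps, here is a toy Python model of a correction factor that is nudged toward the observed ratio each time a work unit finishes. This is a simplified illustration, not BOINC's actual update rule: the function name `update_correction` and the smoothing `weight` of 0.1 are assumptions chosen only to show the general idea of gradual adjustment.

```python
def update_correction(factor, estimated_hours, actual_hours, weight=0.1):
    """Move the correction factor a small step toward reality.

    estimated_hours: the uncorrected estimate for the work unit.
    actual_hours: how long the work unit really took.
    weight: assumed smoothing weight (illustrative value only).
    """
    ratio = actual_hours / estimated_hours
    return (1 - weight) * factor + weight * ratio
```

In this toy model, a factor that starts at 8.0 (a 2-hour job shown as 16 hours) drops to about 7.3 after one completed work unit and keeps shrinking toward 1.0 with each further completion. The factor only moves when work units actually finish, which fits the advice above: aborting work or resetting the project gives it fewer completed results to learn from.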


Actually, the winning combination seems to be to delete only enough of the mis-estimated WUs in my cache to come out of earliest deadline mode, but let the crunching process continue (by not deleting ALL Work Units) until it "gets straightened out".

I lose no crunching time on the shared project and BOINC gets to continue adjusting its estimation of completion times until it is correct.

Yes, the Rosetta project gets a few returned WUs that have to be re-issued this way, but I think "sharing the pain" is only appropriate.

-Sid
17) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11749)
Posted 7 Mar 2006 by Insidious
Post:
Thanks for the explanations (and patience)

-Sid

(I don't delete ALL of the downloaded WUs, just enough to get out of earliest deadline mode)
18) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11743)
Posted 7 Mar 2006 by Insidious
Post:
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect next time boinc contacts the rosetta server.

If you haven't set a preference, the WU will use its built-in default value. This is 2 hr for current WUs and 8 hr for older ones.

The estimated crunch time that boinc displays has absolutely no effect on how long the WU will actually take.

See the FAQ for more details.


I have left the settings at default. Obviously if I had changed them, I wouldn't be complaining that there is an issue.

-Sid




On the three machines I have been observing, only one was forced into DCF mode when the new time setting became available. At first I tried to manually intervene but this only produced a temporary fix for the problem, and required me to constantly tinker with the machine. When I decided to allow the machine to sort itself out, I set the time to 4 hours, and the time between contacts to the server to .25 days. In less than 24 hours the system stabilized. I was then able to raise the connection time in increments over two days (about 5 adjustments total) and it is now running very well.

You could probably make larger adjustments than I did in the connect time to make it happen faster, but the point is that the system MUST be allowed to correct itself over time. BOINC does not have any information about the actual length of the WUs and so it must adjust to them over time. This same situation occurs on other projects when shorter WUs are replaced by longer ones. BOINC is designed to work this out for itself.



That is a very accurate re-iteration of the issue I am trying to describe. (the idling of another BOINC project in favor of Rosetta for a day or so)
If it were only a matter of a single instance of this occurrence I wouldn't think too much of it. The trouble is that this particular machine has gone into this cycle 2 times now. The first time, I aborted the excess work units and the machine was fine for a while but overloaded itself again after a few days. So, this time I reset the project and again, after a few days I found that it had overloaded itself once again. I aborted about a half-dozen of the pending WUs and now it is happy... but it is frustrating to keep watching Rosetta push the other project aside. (I like the other project too)

Something is telling BOINC initially that these work units will take several hours beyond the default to complete (despite the fact they will not) and confusing BOINC to the point it stops work on any other project on this machine. Yes, BOINC will figure it out over time (at the expense of other projects), but why can't you have Rosetta tell BOINC it will take the amount of time it is defaulted to instead of 16 hours? (Where is this 16 hour estimate coming from?)


-Sid
19) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11739)
Posted 6 Mar 2006 by Insidious
Post:
If you've set a preference for how long to crunch a WU, then it will try to crunch about that long. Note that the setting will take effect next time boinc contacts the rosetta server.

If you haven't set a preference, the WU will use its built-in default value. This is 2 hr for current WUs and 8 hr for older ones.

The estimated crunch time that boinc displays has absolutely no effect on how long the WU will actually take.

See the FAQ for more details.


I have left the settings at default. Obviously if I had changed them, I wouldn't be complaining that there is an issue.

-Sid
20) Message boards : Number crunching : WU scheduling issues remain an issue (Message 11735)
Posted 6 Mar 2006 by Insidious
Post:
While crunching a few WUs that take ~2 hours each, I get a download of WUs that take ~15 hours each... but in numbers that would require 2 hour completions to avoid machine over-commitment.

I share projects on some machines and "just wait until BOINC figures it out" doesn't work for me because I don't believe the other project should be idled to make up for this scheduling miscalculation.

I have been training my team mates to use the abort and reset buttons....

I would love to stop issuing 'refunds' of your Work Units...

Please help
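
The over-commitment described at the top of this post comes down to simple arithmetic, sketched here in Python. All numbers in the usage note (ten WUs, a 7-day deadline, a 50% resource share) are hypothetical and chosen only for illustration; the function names are likewise made up.

```python
def hours_needed(num_wus, hours_per_wu):
    """Total crunch time required to finish every cached work unit."""
    return num_wus * hours_per_wu


def is_overcommitted(num_wus, hours_per_wu, hours_until_deadline, cpu_share=1.0):
    """True if the cached work cannot finish before the deadline.

    cpu_share: fraction of CPU time this project normally receives.
    """
    available = hours_until_deadline * cpu_share
    return hours_needed(num_wus, hours_per_wu) > available
```

With ten WUs, a 168-hour deadline, and a 50% share, there are 84 hours available. At ~2 hours each the cache needs 20 hours and fits easily; at ~15 hours each it needs 150 hours and does not, so the only way to meet the deadline is to take time away from the other project.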





©2024 University of Washington
https://www.bakerlab.org