Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102836 - Posted: 24 Sep 2021, 13:45:24 UTC - in response to Message 102834.  

Now it's getting as complicated as what I am doing.
I have to run everything at 100, see how they work again, then fine tune the resource share for each project.
That's just as bad as the max_concurrent.
What is difficult about it?
If all projects are of equal importance, then just have them all set to 100. If they aren't, then which project is most important? Which is least? Rate them in order of importance and then set your Resource share values for each one accordingly, keeping in mind they are not a percentage, they are a ratio.
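To make the "ratio, not percentage" point concrete, here is a minimal Python sketch. The project names come from this thread, but the share values are made up for illustration, and `share_fractions` is a hypothetical helper, not anything BOINC provides.

```python
def share_fractions(resource_shares):
    """Convert resource share values into the long-run fraction each project should get."""
    total = sum(resource_shares.values())
    if total <= 0:
        raise ValueError("at least one project needs a positive resource share")
    return {name: share / total for name, share in resource_shares.items()}


# Hypothetical settings: everything at 100, then WCG raised to 200.
equal = share_fractions({"Rosetta": 100, "WCG": 100, "LHC": 100, "GPU project": 100})
weighted = share_fractions({"Rosetta": 100, "WCG": 200, "LHC": 100, "GPU project": 100})

for name, fraction in weighted.items():
    print(f"{name}: {fraction:.0%} of the work done over the long run")
```

Setting one project to 200 does not give it "200%" of anything. With four projects it ends up with 200 out of 500, and every other project's fraction shrinks to make room.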



With RAH, it's not about the type of project, it's about the core count.
That's what I am after.
X # of cores per project.
Once again, as it seems I am not getting through: BOINC does not work that way. Resource share is about work actually done. Not time. Not cores. Not threads. Not CPU. Not GPU. It is about the work done. While Credit is supposed to indicate work done, that isn't the case.
So the BOINC Scheduler uses REC (Recent Estimated Credit) to determine Scheduling.



Extremely roughly: any Task requires so many FLOPs (Floating Point Operations) to be performed. It takes however long to actually do the work on a given system for a given application. BOINC/the Project server keeps track of this time and the amount of work done. If all Projects paid out Credit according to the definition of the Cobblestone, then people's RAC could be used for Scheduling. Since that doesn't happen, people's RAC isn't used; REC is.
A Task that takes however long to do a certain amount of FLOPs would earn x amount of Credit using the official definition of the Cobblestone- that is the amount of REC for that Task. All the Tasks done for all the different applications for that project produce a total amount of REC value for that Project. All the work done by the applications for all of the Projects produce an REC for each Project.
The BOINC scheduler does enough work from each Project over a period of time (days, weeks, months if need be) so that the REC value between projects matches whatever Resource shares you have selected.

That is how it works. It is not based on just time. It is not based on Cores. It is not based on Threads. It is not based on CPU. It is not based on GPU.
It is based on work done.
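
The idea described above can be sketched as a toy simulation in Python. This is not the BOINC client's actual scheduler code. It assumes REC behaves like an exponentially decaying total, that every task earns one unit of estimated credit, that one task runs per hourly slot, and that the client always picks the project furthest below its share. The function name and the share values are made up for illustration.

```python
def simulate_rec(shares, half_life_days=10.0, days=30, slots_per_day=24):
    """Toy model: each slot, run one task for the project furthest below its share."""
    total_share = sum(shares.values())
    targets = {p: s / total_share for p, s in shares.items()}
    # Per-slot decay factor so that credit loses half its weight every half_life_days.
    decay = 0.5 ** (1.0 / (half_life_days * slots_per_day))
    rec = {p: 0.0 for p in shares}

    for _ in range(days * slots_per_day):
        total_rec = sum(rec.values())
        # How far each project's fraction of recent credit is below its target fraction.
        deficits = {
            p: targets[p] - (rec[p] / total_rec if total_rec > 0 else 0.0)
            for p in shares
        }
        chosen = max(deficits, key=deficits.get)
        for p in rec:
            rec[p] *= decay   # older work counts for less
        rec[chosen] += 1.0    # assumption: every task earns one unit of estimated credit

    total_rec = sum(rec.values())
    return {p: rec[p] / total_rec for p in rec}


shares = {"Rosetta": 100, "WCG": 200, "LHC": 100}  # hypothetical values
result = simulate_rec(shares)
for project, fraction in result.items():
    print(f"{project}: {fraction:.1%} of recent estimated credit")
```

Nothing in the loop counts cores, threads, or hours per project. The only thing being steered is the split of recent work done, which is the point being made above.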


Initially I left things alone and then credits got all out of whack, then I ran into issues with tasks taking up days on end and nothing else getting done by other projects and just a bunch of yoyo stuff going on.
So I took back control. Again, everything was working fine until this bug showed up. And that just showed up. Maybe after updating to the latest BOINC.

Anyway, I'll mess around with things until I find the right mix.
No need to clog up this thread.

Bryn Mawr
Joined: 26 Dec 18
Posts: 373
Credit: 10,598,568
RAC: 8,061
Message 102843 - Posted: 25 Sep 2021, 6:06:26 UTC - in response to Message 102836.  



As Grant has said, the more you mess around with things the worse the situation will become.

Set rec_half_life to 1, sit back and chill for a month and the system will follow your project shares.
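
A rough Python sketch of why a shorter half-life lets things settle sooner. It assumes REC is an exponentially decayed average with the half-life measured in days, and that the client default is 10 days. As far as I know the setting is the `rec_half_life_days` option in cc_config.xml, but check the BOINC client configuration documentation before relying on that name or default. `remaining_weight` is a hypothetical helper.

```python
def remaining_weight(age_days, half_life_days):
    """Fraction of its original weight that work done age_days ago still carries."""
    return 0.5 ** (age_days / half_life_days)


# 10 days is the assumed default; 1 day is the value suggested above.
for half_life in (10, 1):
    for age in (1, 7, 30):
        weight = remaining_weight(age, half_life)
        print(f"half-life {half_life:>2} d: work from {age:>2} d ago keeps {weight:.2%} of its weight")
```

With a 1 day half-life, whatever imbalance was there a week ago has almost no influence left, so the client reacts to the current Resource shares much faster. The flip side is that recent lumps and bumps count for more.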

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102844 - Posted: 26 Sep 2021, 23:33:41 UTC - in response to Message 102843.  



Just letting it ride now, no app_config.
See where things go.
I'll have to go back and look at your notes on the half life, I haven't done that yet.

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102848 - Posted: 27 Sep 2021, 23:35:02 UTC

Bryn Mawr - added the half_life, will sit back and see what happens.
Currently WCG is dying in credits; guess I will have to pump that one up higher in %.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 102849 - Posted: 28 Sep 2021, 5:52:04 UTC - in response to Message 102848.  

Or just let things be until they have a chance to settle down. With 8 active projects, even with the changed half life value, I'd expect you're looking at a couple of weeks. One week bare minimum.
Then adjust Resource share as necessary.
Grant
Darwin NT

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102850 - Posted: 28 Sep 2021, 7:57:46 UTC - in response to Message 102849.  
Last modified: 28 Sep 2021, 7:59:44 UTC

Ok..will do.
It's 5 active.
I thought I had 2 GPU projects, but it seems just one at the moment.
So it's 3-4 CPU projects.

Bryn Mawr
Joined: 26 Dec 18
Posts: 373
Credit: 10,598,568
RAC: 8,061
Message 102852 - Posted: 29 Sep 2021, 15:32:32 UTC - in response to Message 102850.  

I recently (6 weeks ago) added a 5th project (6 if you include Ralph which very rarely has work) because 3 of the projects were out of work / broken at the same time.

One of my crunchers is now back to running smoothly whilst the other still has the occasional lump or bump as one project or another grabs a bit extra but is almost there.

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102855 - Posted: 30 Sep 2021, 19:19:02 UTC - in response to Message 102852.  

I had more than a lump and a bump before I tried dividing up the computer.
Like now, WCG is really, really down, close to dead, and now that I have opened things back up it is still down, but the results I checked are pending. So there is hope.

Bryn Mawr
Joined: 26 Dec 18
Posts: 373
Credit: 10,598,568
RAC: 8,061
Message 102858 - Posted: 1 Oct 2021, 3:26:17 UTC - in response to Message 102855.  

That’s the project, not your machine. I’ve just had two days of low WCG credits and the shortfall turned up this morning - c’est la vie.

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102865 - Posted: 2 Oct 2021, 7:42:12 UTC - in response to Message 102858.  

I gave it 200% and now it's climbing like a jet plane. Just have to get LHC back up after WCG and then I think everything can go back to 100%.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 102871 - Posted: 2 Oct 2021, 9:50:42 UTC - in response to Message 102865.  
Last modified: 2 Oct 2021, 9:57:04 UTC

And then it will drop again.
So you'll change it, and it will rise again. So you'll change it and it will fall again. So you'll change it, and it will rise again. So you'll change it and it will fall again. etc, etc.
Most (if not all) of that rapid increase is not a result of your changes but is down to the reason Bryn posted: the Project had a delay in granting Credit, and now it's all coming through. Hence the surge in Credit.


RAC rises slowly, and falls quickly.
The half_life change Bryn suggested should allow things to settle down sooner rather than later, but with the number of projects you have we're still talking weeks, not days. And each time you change things and then change them back again, you just extend the time it will take for things to settle and actually meet whatever Resource share you finally leave things at for an extended period (i.e. over a few weeks).
Grant
Darwin NT

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102872 - Posted: 2 Oct 2021, 16:55:04 UTC - in response to Message 102871.  

Yeah, I know it drops. So I'm just ramming it through to get it up, and later, when I go back to work, I'll drop it.
Half life was changed last week.

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 103032 - Posted: 27 Oct 2021, 20:13:19 UTC

Project was down a little earlier, apparently to do a quick filesystem switch, but it got delayed and they didn't start it back up, so people would've seen

Server error: feeder not running
Project requested delay of 3600 seconds

Quickly fixed after a nudge. Looks fine now.

You didn't imagine it

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 103039 - Posted: 28 Oct 2021, 9:32:01 UTC

Quite a backlog of Validations now.
Given that there is no longer any work for minirosetta, they could probably shut down all of the minirosetta processes, and make use of the freed up resources for a few more Rosetta Assimilators and Validators.

From the Server Status page-
rah_assimilator_rosetta1 (rosetta)
rah_assimilator_rosetta2 (rosetta)
rah_assimilator_rosetta3 (rosetta)
rah_assimilator_rosetta4 (rosetta)
rah_assimilator_rosetta5 (rosetta)
rah_assimilator_mini1 (minirosetta)
rah_assimilator_mini2 (minirosetta)
rah_assimilator_mini3 (minirosetta)
rah_assimilator_mini4 (minirosetta)
rah_assimilator_mini5 (minirosetta)
rah_validator_rosetta1 (rosetta)
rah_validator_rosetta2 (rosetta)
rah_validator_mini1 (minirosetta)
rah_validator_mini2 (minirosetta)

Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 103044 - Posted: 28 Oct 2021, 20:36:08 UTC

Validation backlog appears to be growing- now over 104,000
The Server Status for the Validators might be showing green, but they don't appear to be actually doing anything at present.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 103045 - Posted: 28 Oct 2021, 22:15:38 UTC - in response to Message 103044.  

Now over 114,000.
Yep- it's broken.
Grant
Darwin NT

robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 103046 - Posted: 29 Oct 2021, 0:57:57 UTC
Last modified: 29 Oct 2021, 1:06:30 UTC

A task running MUCH longer than the expected 8 hours:

aaab_nNMALA_pp-SAR_pp-mPPS-BGLY_pp_2_2245795_6_1

https://boinc.bakerlab.org/rosetta/result.php?resultid=1441862159

2 days, 8 hours, 32 minutes so far

rosetta python 1.03 vbox64

This is elapsed time, not the much shorter CPU time.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1467
Credit: 14,331,566
RAC: 16,669
Message 103047 - Posted: 29 Oct 2021, 3:41:39 UTC - in response to Message 103045.  

Validation backlog appears to be growing- now over 104,000
The Server Status for the Validators might be showing green, but they don't appear to be actually doing anything at present.
Now over 114,000.
Yep- it's broken.
Now over 138k.
Grant
Darwin NT

Bryn Mawr
Joined: 26 Dec 18
Posts: 373
Credit: 10,598,568
RAC: 8,061
Message 103048 - Posted: 29 Oct 2021, 9:14:09 UTC - in response to Message 103047.  

And now over 176k but some must be getting through.

Yesterday I dropped to 3k credits for the day as everything was pending but today I have 11k :-)

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 103051 - Posted: 29 Oct 2021, 22:45:18 UTC - in response to Message 103048.  
Last modified: 29 Oct 2021, 22:50:44 UTC

Now up to 237k backlog, but I don't have any pending dated 28th Oct so some are going through, just nowhere near enough to keep up, let alone catch up.
I sent a message about 11hrs ago and got a reply about 8hrs ago that it'd be looked at when they got in, which I'm guessing would be ~6hrs ago.
That it's not fully fixed yet indicates it's not as straightforward as the feeder issue a few days before. I've heard nothing more since.

It's been reported and acknowledged. That's all I can say.

PS: On top of my being away from home from yesterday until Sunday week (apart from 1.5 days), my email provider has had a major outage which looks like it'll take 2-3 days to fix, making matters worse.
I will be able to check in here for 6 of the 9 days I'm away, and I am using a backup email account if anything new comes up. Hopefully I won't have to.
When it rains it pours...

Edit: When I started typing my credits were 300 less than what were showing here, so I did a manual update and my credits were 400 more than are showing here. Lots from 29th October updated, but in quite a funny order. Maybe things are moving much more rapidly right now? Fingers crossed
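
For anyone wanting to put a number on "nowhere near enough to keep up", here is a rough Python calculation using the backlog figures and post times from this thread. The figures were reported as "over" or "up to" a value, and the timestamps are when the posts were made rather than when the Server Status page was read, so treat the results as approximate.

```python
from datetime import datetime

# (post time in UTC, validation backlog reported in that post)
observations = [
    ("2021-10-28 20:36:08", 104_000),
    ("2021-10-28 22:15:38", 114_000),
    ("2021-10-29 03:41:39", 138_000),
    ("2021-10-29 09:14:09", 176_000),
    ("2021-10-29 22:45:18", 237_000),
]


def growth_rates(obs):
    """Backlog growth per hour between each pair of consecutive observations."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), count) for ts, count in obs]
    rates = []
    for (t0, n0), (t1, n1) in zip(parsed, parsed[1:]):
        hours = (t1 - t0).total_seconds() / 3600
        rates.append((n1 - n0) / hours)
    return rates


rates = growth_rates(observations)
first = datetime.strptime(observations[0][0], "%Y-%m-%d %H:%M:%S")
last = datetime.strptime(observations[-1][0], "%Y-%m-%d %H:%M:%S")
overall = (observations[-1][1] - observations[0][1]) / ((last - first).total_seconds() / 3600)

for rate in rates:
    print(f"about {rate:,.0f} results per hour added to the backlog")
print(f"overall: about {overall:,.0f} per hour")
```

The growth stays in the same ballpark across all four intervals, which fits the picture of validators that are letting a trickle through but are not keeping pace with incoming results.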



©2024 University of Washington
https://www.bakerlab.org