Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0
OR would project_max_concurrent be any better than max_concurrent? Ugghh... I guess I am screwed until someone figures this bug out. I am on vacation right now, so I could probably do something like you talk about, but on work days I just want to fire up my computer and get going for the day. I don't have time to fire up 6 different processes. RAH should get their act together and copy LHC's webpage setup, then I wouldn't have to do all this #$%
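For reference, a minimal app_config.xml sketch of the two options being asked about here: project_max_concurrent caps the number of running tasks for the whole project, while max_concurrent caps a single application. The app name "rosetta" below is an assumption for illustration; the real name is listed in client_state.xml. The over-fetch bug discussed in this thread is reported against both settings.

<app_config>
   <!-- cap the whole project at 4 running tasks at once -->
   <project_max_concurrent>4</project_max_concurrent>
   <!-- or cap a single application; the name "rosetta" is assumed here,
        check client_state.xml for the app name the project actually uses -->
   <app>
      <name>rosetta</name>
      <max_concurrent>4</max_concurrent>
   </app>
</app_config>

The file goes in the project's folder inside the BOINC data directory, and is picked up after a client restart or, in recent clients, after Options → Read config files in the Manager.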
Breno · Joined: 8 Apr 20 · Posts: 30 · Credit: 12,984,922 · RAC: 992
I'm not sure if this issue is already known, but I decided to install boinc and boinctui on my five NVIDIA Jetson TX2 and TX1 boards and attach them to R@h. I was able to add them to my account, but all the boards freeze when a task is supposed to be downloading. It is not a network problem, since I can still log into the boards over ssh. Does anyone have a clue on how to solve it? https://ibb.co/18KTxnG
dcdc · Joined: 3 Nov 05 · Posts: 1832 · Credit: 119,860,059 · RAC: 6,147
Can BOINC write to that directory?
Jim1348 · Joined: 19 Jan 06 · Posts: 881 · Credit: 52,257,545 · RAC: 0

> I have to have max concurrent in order to limit the number of cpu's RAH uses, otherwise my idea of splitting up my system so every project has its own group of cores is out the door and then I run into problem of every project dominating my system and some get all the work for days on end and others don't.

Until BOINC is fixed to work with max_concurrent, you can speed up the correction of the number of work units by placing this in your cc_config.xml file:

<cc_config>
   <options>
      <rec_half_life_days>1.000000</rec_half_life_days>
   </options>
</cc_config>

That will correct the number of work units within a couple of days to be in accordance with the resource share for each project. It still won't allow you to fine-tune the pythons versus the regular Rosettas, but that is basically set by the project anyway according to their priorities, so I don't know that it should be changed. But if you try to run too many projects, BOINC gets confused anyway, and an app_config.xml just makes it more confused than usual. I would never run more than three projects at a time, and one or two is better, especially if they have widely different run times.
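For what it's worth, cc_config.xml lives in the BOINC data directory itself (not in a project folder), and the client only picks the change up after a restart or after Options → Read config files in the Manager.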
Breno · Joined: 8 Apr 20 · Posts: 30 · Credit: 12,984,922 · RAC: 992
Interesting, you're suggesting it might be a sudo issue. It is surely not a disk space issue. I was thinking that maybe it was a version issue or perhaps network related. However, the project was recognised and the tasks were scheduled; they are just not downloading. Also, the network is set to be freely used by the project, so it can't be that. I'll keep checking.
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,417,319 · RAC: 20,286

> If RAH would do like LHC and allow ME to pick how many cores to give it, then I would not have to do max concurrent.

It's only an issue because you don't like how BOINC works - by allocating resources to each project depending on computation work done, not on time spent on each project. Which means that with multiple cores/threads and multiple projects, the number of Tasks for any given project being processed at any given time will vary. If you were happy to just let it process according to your Resource share settings, without micro-managing it, it wouldn't be an issue.

> RAH should get their act together and copy LHC's webpage setup, then I wouldn't have to do all this #$%

You don't have to do it. It's something you chose to do. Ideally BOINC would just fix the problem so that those who feel the need to micro-manage things can do so without such issues occurring.

Grant
Darwin NT
[VENETO] boboviz · Joined: 1 Dec 05 · Posts: 2002 · Credit: 9,787,940 · RAC: 5,329

> BUT: You need to make them selectable from the regular Rosettas.

+1
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> I have to have max concurrent in order to limit the number of cpu's RAH uses, otherwise my idea of splitting up my system so every project has its own group of cores is out the door and then I run into problem of every project dominating my system and some get all the work for days on end and others don't.

I'll give this a try in the morning. What I was getting tired of, and why I split my system up, was that long-running project tasks would interfere with short ones, my stats would go all haywire, and I would also get "cancelled - not started in time" issues. So I decided I would just split my system up to even out each project. Then each project can do what it wants. But with the max_concurrent problem, that means I have to watch RAH downloads. It looks like the BOINC team is looking into this, according to a post another group directed me to.
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> Interesting, you're suggesting it might be a sudo issue.

It's a code issue in BOINC. I was pointed to a GitHub thread where a person described exactly the condition I am experiencing: the max_concurrent option used in some projects causes this problem. I watched RAH download the usual amount to keep my system busy within the deadline, but then it would sneak in two extra tasks every few minutes, so if I don't watch it and shut it off, I will get a year's worth of work in a matter of a few hours.
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> > If RAH would do like LHC and allow ME to pick how many cores to give it, then I would not have to do max concurrent.
>
> It's only an issue because you don't like how BOINC works - by allocating resources to each project depending on computation work done, not on time spent on each project. Which means that with multiple cores/threads and multiple projects, the number of Tasks for any given project being processed at any given time will vary.

It's a matter of trying to even out the projects' credits and run times. Some of my projects have longer run times and also want to use all or most of the cores, causing an imbalance and missed deadlines. I decided that I would split up the projects into chunks of cores so that each project has its own dedicated set of cores to use and run as it pleases. It ran fine until this bug showed up. I just manually load in X number of tasks until eFMer's BoincTasks tells me I have 2 or 3 days of work, and then I switch back to no new work. BOINC is aware of this and it looks like the developers are digging into it. It's just going to take time, and before they release a new version they have to work out all the other bugs. So for now this is how it goes. But again, LHC allows me to select which sub-projects I want to work on (maybe later RAH can do that when they get more Python work), and LHC allows me to select the number of cores I want to give it. It's time for RAH to grow up and adapt. Not everyone runs just RAH as a dedicated project; some of us want to give time evenly to other projects.
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,417,319 · RAC: 20,286

> It's a matter of trying to even out the projects' credits and run times. Some of my projects have longer run times and also want to use all or most of the cores, causing an imbalance and missed deadlines.

That's what your Resource share setting is for, and having a zero (or next-to-no) cache stops the missed deadlines (when it's not ignored by a bug in one of the BOINC configuration options, of course).
Once BOINC knows how long various Tasks actually take (a minimum of ten Tasks of a particular type completed and Validated), it can then process Tasks for each project as it needs to in order to meet your Resource share settings. Of course the more active Projects you have, the longer it will take for things to settle down. And setting no new Tasks, then re-enabling them, and aborting work all add to the time it takes for BOINC to figure things out.
And given that there are times Projects may or may not have work, and some applications are more efficient than others, there will be times when BOINC does more work for one project (or several) than for another in order to meet your Resource share settings.
Remember - Resource share is not worked out based on time, but on an estimate of work actually done. And of course it does take time to do that - be it hours, days, weeks or months depending on your hardware, the number of projects you run and the size of your cache.

> But again, LHC allows me to select which sub-projects I want to work on (maybe later RAH can do that when they get more Python work)

As did Seti, and I'm sure many other projects. I agree that Rosetta should allow people to choose whether they want to run Rosetta 4.2 or the VBox application, or both.

> It's time for RAH to grow up and adapt. Not everyone runs just RAH as a dedicated project; some of us want to give time evenly to other projects.

And I will point out yet again - that is not how BOINC works. Processing time for a project isn't doled out based on time; it's based on work done, in order to meet your Resource share settings.
Someone having 10 projects, each with a Resource share value of 100, does not give each Project equal time. What it does do is give each project whatever time it needs for its applications to do an equivalent amount of work to each of the other projects.
Having 2 projects, one with a Resource share of 500 and the other only 10, does not mean the one with 500 gets 50 times more processing time. If the one with the 500 Resource share value has an extremely efficient GPU application, and the one with the Resource share value of 10 has a really inefficient CPU application, in an old system with a modern high-end GPU, then the 500-value project may only get 1/100th of the processing time that the CPU application gets. Most of the time will be spent processing work for the project that has only the 10 Resource share value. Yet because of the huge amount of work actually done by the GPU compared to the CPU, the Resource share settings will still be met, even though the CPU project gets 100 times more processing time than the GPU-based project.

Grant
Darwin NT
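To put rough numbers on that example (my own arithmetic, not taken from the BOINC source): with Resource shares s_A and s_B and delivered work rates r_A and r_B, the scheduler aims for the work ratio to match the share ratio, so

\[
\frac{W_A}{W_B} = \frac{s_A}{s_B}, \qquad W = r \, t
\quad\Rightarrow\quad
\frac{t_A}{t_B} = \frac{s_A}{s_B} \cdot \frac{r_B}{r_A}.
\]

Plugging in shares of 500:10 (a ratio of 50) and a GPU application that delivers, say, 5000 times the work per hour of the CPU application gives t_A/t_B = 50 × (1/5000) = 1/100 - the GPU project gets roughly a hundredth of the wall-clock time, yet the work ratio still lands on 50:1, just as described above.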
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> > It's a matter of trying to even out the projects' credits and run times. Some of my projects have longer run times and also want to use all or most of the cores, causing an imbalance and missed deadlines.
>
> That's what your Resource share setting is for, and having a zero (or next-to-no) cache stops the missed deadlines (when it's not ignored by a bug in one of the BOINC configuration options, of course).

Now it's getting as complicated as what I am doing. I have to run everything at 100, see how they work again, then fine-tune the resource share for each project. That's just as bad as the max_concurrent. With RAH, it's not about the type of project, it's about the core count. That's what I am after: X number of cores per project. Since the stats are all over the place, I will try your idea.
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,417,319 · RAC: 20,286

> Now it's getting as complicated as what I am doing.

What is difficult about it?
If all projects are of equal importance, then just have them all set to 100. If they aren't - then what project is most important? Which is least? Rate them in order of importance and then set your Resource share values for each one accordingly - keeping in mind they are not a percentage, they are a ratio.

> With RAH, it's not about the type of project, it's about the core count.

Once again, as it seems I am not getting through - BOINC does not work that way.
Resource share is about work actually done. Not time. Not cores. Not threads. Not CPU. Not GPU. It is about the work done.
While Credit is supposed to indicate work done, that isn't the case. So the BOINC Scheduler uses REC (Recent Estimated Credit) to determine scheduling.
Extremely roughly - any Task requires so many FLOPs (Floating Point Operations) to be performed. It takes however long to actually do that work on a given system with a given application. BOINC and the project server keep track of this time and the amount of work done. If all Projects paid out Credit according to the definition of the Cobblestone, then people's RAC could be used for scheduling. Since that doesn't happen, people's RAC isn't used - REC is.
A Task that takes however long to do a certain amount of FLOPs would earn x amount of Credit using the official definition of the Cobblestone - that is the REC for that Task. All the Tasks done for all the different applications of a project produce a total REC value for that Project. All the work done by the applications of all of your Projects produces an REC for each Project. The BOINC scheduler does enough work from each Project over a period of time (days, weeks, months if need be) so that the REC ratio between projects matches whatever Resource shares you have selected.
That is how it works. It is not based on just time. It is not based on Cores. It is not based on Threads. It is not based on CPU. It is not based on GPU. It is based on work done.

Grant
Darwin NT
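A rough sketch of why the rec_half_life_days setting mentioned earlier matters here (a simplification of the scheduler's behaviour, not the exact client code): REC behaves like an exponentially decaying average of estimated credit,

\[
\mathrm{REC}_{\mathrm{new}} \;\approx\; \mathrm{REC}_{\mathrm{old}} \cdot 2^{-\Delta t / h} \;+\; \big(\text{estimated credit earned during } \Delta t\big),
\]

where h is rec_half_life_days (the default is 10 days). Setting h to 1, as suggested above, discounts old history ten times faster, so the per-project REC values - and with them the scheduler's project mix - converge on the Resource share targets in days rather than weeks.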
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> > Now it's getting as complicated as what I am doing.
>
> What is difficult about it?

Initially I left things alone, and then credits got all out of whack, then I ran into issues with tasks taking up days on end with nothing else getting done by other projects, and just a bunch of yo-yo stuff going on. So I took back control. Again, everything was working fine until this bug showed up. And that only just showed up - maybe after updating to the latest BOINC. Anyway, I'll mess around with things until I find the right mix. No need to clog up this thread.
Bryn Mawr · Joined: 26 Dec 18 · Posts: 401 · Credit: 12,294,748 · RAC: 5,104
As Grant has said, the more you mess around with things the worse the situation will become. Set rec_half_life to 1, sit back and chill for a month, and the system will follow your project shares.
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0
Just letting it ride now, no app_config. See where things go. I'll have to go back and look at your notes on the half-life; I haven't done that yet.
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0
Bryn Mawr - added the half_life, will sit back and see what happens. Current WCG is dying in credits; guess I will have to pump that one up higher in %.
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,417,319 · RAC: 20,286

> Bryn Mawr - added the half_life, will sit back and see what happens.

Or just let things be until they have a chance to settle down - with 8 active projects, even with the changed half-life value, I'd expect you're looking at a couple of weeks. One week bare minimum.
Then adjust Resource share as necessary.

Grant
Darwin NT
Greg_BE · Joined: 30 May 06 · Posts: 5691 · Credit: 5,859,226 · RAC: 0

> > Bryn Mawr - added the half_life, will sit back and see what happens.
>
> Or just let things be until they have a chance to settle down - with 8 active projects, even with the changed half-life value, I'd expect you're looking at a couple of weeks. One week bare minimum.

Ok, will do. It's 5 active. I thought I had 2 GPU projects, but it seems just one at the moment. So it's 3-4 CPU projects.
Bryn Mawr · Joined: 26 Dec 18 · Posts: 401 · Credit: 12,294,748 · RAC: 5,104

> > Bryn Mawr - added the half_life, will sit back and see what happens.
>
> Or just let things be until they have a chance to settle down - with 8 active projects, even with the changed half-life value, I'd expect you're looking at a couple of weeks. One week bare minimum.

I recently (6 weeks ago) added a 5th project (6 if you include Ralph, which very rarely has work) because 3 of the projects were out of work or broken at the same time. One of my crunchers is now back to running smoothly, whilst the other still has the occasional lump or bump as one project or another grabs a bit extra, but it is almost there.