Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
Note: When I write "CPU" I am referring to a "logical CPU" or thread. I know how these things work and have carefully "tuned" the mix of projects such that my 8 total CPUs run 95-100% busy and all four GPUs run 95+% busy. SETI runs on the GPUs supported by two CPUs (each reserves 0.34 CPUs); Rosetta gets the rest. It IS the Rosetta short-deadline tasks going into panic mode that force my SETI GPU tasks off their two CPUs. I have four GPUs in some of my computers, such as the one I listed, and only one GPU was being used (because I run 2 SETI tasks per GPU, and 0.34*2 is less than one CPU). Rosetta was using all eight threads instead of the six it normally uses. Where is the Admin response? P.S. I just checked today's downloaded WUs and many/all of them ALSO have short deadlines. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 5,361 |
Why all the WUs with super short deadlines? I think you have the correct answer. GPU jobs typically require "some" CPU resources (which can consume at least 1 CPU). IF any project task gets within the queue-setting expiration time window, the task will go into HIGH PRIORITY mode and will not be interrupted. This is a BOINC design "feature". With a complex system like this, using an "app_config.xml" file in the Rosetta project directory, which allows CPU usage to be capped during HIGH PRIORITY cases like these, is the only way I know around this. |
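For anyone wanting to try what rjs5 describes, here is a minimal sketch using BOINC's standard app_config.xml mechanism. The cap of 6 is purely an illustration for an 8-thread host that wants to leave 2 threads free for GPU support, and the directory name shown is typical for Rosetta but may differ on your install:

```xml
<!-- app_config.xml, placed in the Rosetta project directory,
     e.g. projects/boinc.bakerlab.org_rosetta/ (exact path varies by install).
     project_max_concurrent caps how many Rosetta tasks run at once, so
     high-priority (earliest-deadline-first) mode cannot take every thread. -->
<app_config>
    <project_max_concurrent>6</project_max_concurrent>
</app_config>
```

After saving the file, tell the client to re-read its config files from the Manager (or restart the client) so the cap takes effect; the menu location varies by BOINC version.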
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
Why all the WUs with super short deadlines? Oops, wrong button pressed ... continuing with my reply. Yes, it is by design and I have no issue with that design. My issue is with Rosetta sending out large numbers of WUs with short deadlines (2 days in this instance). Having a 1+1 day queue (2 days) means those WUs immediately become HIGH PRIORITY and force other work off the CPUs. This is NOT fair and does not allow the normal mechanism to allocate resources to work. I have adjusted my app_config.xml to limit Rosetta's use of my CPUs. This will have the unavoidable effect of leaving unused CPU time when the queue of my other projects goes dry during normal times. Rosetta would normally just get that CPU time, but now it cannot. Where is the reply from the Admin as to WHY these have short deadlines? |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
Why all the WUs with super short deadlines? The short deadlines were originally required for CASP tasks, but seeing as CASP is over (I'm guessing they're re-running tasks to see if they can fine-tune them) the short deadlines are a bit redundant now. I suspect your problem is down to this 1+1 setting - we saw it before recently. Your system pulls in the minimum one day, then tries to fill the rest up with another day as task availability allows. We reached some consensus that the first setting, "Store at least", should always be zero (unless you really do only poll for tasks once per day) and the rest should go into "store up to an additional x.x days of work". While deadlines are currently 2 days, obviously the moment your 2-day buffer is filled, every task needs to run straight away to ensure they all get back in time. I'd advise you knock this down to 1.5 days so BOINC has some room to manoeuvre to get all your work done. Try it and see what happens. I wouldn't hold your breath waiting for the project to adjust deadlines on these tasks to something more appropriate. It took a couple of years before urgent tasks were given a different priority to non-urgent tasks. |
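For anyone who prefers to set this locally rather than through the website preferences, the same two numbers live in global_prefs_override.xml in the BOINC data directory. A sketch of the 0 + 1.5 setting Sid suggests, using the tag names from BOINC's global preferences format:

```xml
<!-- global_prefs_override.xml: local override of web preferences.
     work_buf_min_days        = "Store at least X days of work"
     work_buf_additional_days = "Store up to an additional X days of work" -->
<global_preferences>
    <work_buf_min_days>0.0</work_buf_min_days>
    <work_buf_additional_days>1.5</work_buf_additional_days>
</global_preferences>
```

The client picks this file up when told to re-read config files or on restart; values set here take precedence over the web preferences for that host.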
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Where is the reply from the Admin as to WHY these have short deadlines? I'll start by pointing out that I am not a project admin. Yes, I know, the message board tags indicate otherwise. This is just how BOINC server code reflects my message board administrative controls. I'm not clear how a reply explaining why the project has some 48hr-deadline work is actually going to be of any relief to your situation. Since you are attached to a project that does send 48hr deadlines, have a 1+1 day queue, and strongly desire not to have projects go to high priority due to BOINC Manager's fear that work might otherwise not be completed in time, it would seem you would be better served to set preferences that cause the BOINC Manager not to request so much work at one time. I should also point out that if your machine is running 24hrs a day, then once the first few short-deadline tasks complete, the BOINC Manager should be relieved of the concern that about a day's worth of work won't be completed in 48hrs. So it's not going to be in the way of your GPU work for 48hrs, or even for the roughly 20hrs of estimated compute time. It is also fairly unusual for all of your current cache of work to have the short deadlines. More commonly you'll see just a fraction of the tasks with the 48hr deadlines and others with 7-10 day deadlines. And such a mix also avoids the high-priority situation you happen to see today. Rosetta Moderator: Mod.Sense |
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
Where is the reply from the Admin as to WHY these have short deadlines? I'll start by pointing out that I am not a project admin. Yes, I know, the message board tags indicate otherwise. This is just how BOINC server code reflects my message board administrative controls. View the question I asked as rhetorical, meant to bring the impact to the Admin's attention. If the queue settings were ONLY for Rosetta, this would work. They apply to ALL the projects, so it won't work for me, since some projects seem to release WUs in batches hours or even days apart. I should also point out that if your machine is running 24hrs a day, then once the first few short-deadline tasks complete, the BOINC Manager should be relieved of the concern that about a day's worth of work won't be completed in 48hrs. So it's not going to be in the way of your GPU work for 48hrs, or even the roughly 20hrs of estimated compute time. That computer runs 24/7 (God and the electric company willing) and has finished ALL of the WUs about which I asked the question. It received another batch, at about 3:30 PM, of 59 more short-deadline WUs, so the situation is not resolved, even though I admit there may have been a time when BOINC was NOT in panic mode (HIGH PRIORITY) while I wasn't monitoring. It is also fairly unusual for all of your current cache of work to have the short deadlines. More commonly you'll see just a fraction of the tasks with the 48hr deadlines and others with 7-10 day deadlines. And such a mix also avoids the high priority situation you happen to see today. If there were just a few short-deadline WUs, then I would not have noticed or been concerned. When there are so many that they shut down the GPU work, then I have a concern. 
I will just continue to restrict Rosetta's access to my 5 computers' CPU time for the time being, since no one seems to have the real answer or appears to be taking any action to lengthen the deadlines to allow time for processing, or to throttle the release of short-deadline WUs. Thank you for your reply. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I keep the default buffer of 0.1+0.5 days, but have seen high priority work units due to short deadlines too. They seem to pop out of nowhere, but I suspect it is the BOINC scheduler downloading too many Rosettas at once as it tries to make up for some real or imaginary shortfall as compared to my other project, WCG. (I have seven Ivy Bridge cores split 100% for WCG and 40% for Rosetta, so there should ideally be two Rosettas running at any one time, but it can actually be four or five.) Maybe it is possible to bump off a GPU core in an extreme case, and I think it happened to me once, though I normally run Folding on the GPU and so just reserve a core for it, and there is nothing for BOINC to bump. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
I keep the default buffer of 0.1+0.5 days, but have seen high priority work units due to short deadlines too. They seem to pop out of nowhere, but I suspect it is the BOINC scheduler downloading too many Rosettas at once as it tries to make up for some real or imaginary shortfall as compared to my other project, WCG. (I have seven Ivy Bridge cores split 100% for WCG and 40% for Rosetta, so there should ideally be two Rosettas running at any one time, but it can actually be four or five.) That's a good (and obvious) point about BOINC scheduling. There is a <real> debt to Rosetta at the moment, because we've all been struggling to see Rosetta tasks at all for the last 3 weeks, during which time other projects will have had more than their share of runtime. This debt is being rebalanced now. Combine that with the highly annoying 24hr backoff thing, the feast-and-famine availability of Rosetta tasks and the short deadlines (which may've come to an end, as my most recent tasks have 5-day deadlines, but we'll see), and the problematic running of tasks is somewhat less surprising; it should improve as the Rosetta debt reduces. The only plus with the short-deadline tasks is they cut short on my PC to 4hr runs rather than the 8hrs expected, though that would also cause even more tasks to come down for other people, I guess. All things together, I've got masses of WCG tasks now, just when I want to be running Rosetta (my priorities are a 100:3 split), so I've had to set WCG to no new tasks while forcing the backlog to run, so that I've got space for Rosetta tasks to fill my buffer. All very unsatisfactory (and time-consuming on 'wrong' priorities) from a range of perspectives. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@Darrell, Thanks for the reply. Yes, the other projects and a fractional share of the time available would have that effect. As Sid points out, there is very likely a debt to R@h right now, and so in addition to short deadlines, you got a load more work than average. It sounds like you are now at a point where even longer deadlines would reach the "running at high priority" state, because of the fractional resource share. So another approach would be to clear the debts so they are all equal. That would help avoid BOINC requesting more work than the current cache window requires for any of your projects. At one time, there was a debt viewer tool which you could use to zero out the record of project debts. Appears you can now do it with boinccmd. Rosetta Moderator: Mod.Sense |
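A sketch of the boinccmd route, assuming the debt-era --set_debts option from older 6.x-generation clients (the 7.x scheduler replaced debt with REC, so check boinccmd --help on your version before relying on this; the URL shown is the standard Rosetta project URL):

```shell
# Zero the short-term and long-term debt recorded for Rosetta@home.
# Syntax on older clients: boinccmd --set_debts URL short_term long_term
boinccmd --set_debts http://boinc.bakerlab.org/rosetta/ 0 0
```

Repeating this for each attached project would leave all debts equal, which is the effect Mod.Sense describes.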
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
This 24 hour backoff following the "Rosetta Mini for Android is not available for your type of computer" message is driving me crackers. A week after the project came back online and there's still no consistent supply of tasks. I understand it takes a few days for people to get their buffers refilled, but just as they do, the 24 hour backoff returns, other projects make their calls during the 24 hours and we're back to square one. This was all bad enough when Charity Engine was making severe demands on capacity but my impression is that's not happening right now. So it's time to ask what the state of play is with the hardware upgrade talked about last September. Because it looks like the current setup is creaking just to handle the ordinary level of demand, with little room to handle anything out of the ordinary. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
Thanks for the reply. Yes, the other projects and a fractional share of the time available would have that effect. As Sid points out, there is very likely a debt to R@h right now, and so in addition to short deadlines, you got a load of more work than average. It sounds like you are now to a point that even longer deadlines would reach the "running at high priority" state, because of the fractional resource share. Would that still be the case if holding a 2-day (or 1.5-day) buffer? I'm not sure it would. 2 day deadlines, yes. 5 or 7 day deadlines, no. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Thanks for the reply. Yes, the other projects and a fractional share of the time available would have that effect. As Sid points out, there is very likely a debt to R@h right now, and so in addition to short deadlines, you got a load of more work than average. It sounds like you are now to a point that even longer deadlines would reach the "running at high priority" state, because of the fractional resource share. My thinking was that with only a fractional resource share to R@h, BOINC would still try to process all R@h WUs first, to make up the debt, and yet see more work onboard than it could process under its normal resource share. But I suppose while it is making up debt, it accounts for more than the average resource share before tripping high priority, so I may be wrong about that. Rosetta Moderator: Mod.Sense |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
Thanks for the reply. Yes, the other projects and a fractional share of the time available would have that effect. As Sid points out, there is very likely a debt to R@h right now, and so in addition to short deadlines, you got a load of more work than average. It sounds like you are now to a point that even longer deadlines would reach the "running at high priority" state, because of the fractional resource share. We don't get to choose whether we receive 2-, 5- or 7-day deadline tasks, so in Darrell's case I don't blame him for his reaction when everything came down as 2-day. His setup is a lot more involved than what I have to deal with, so I guess it requires a little extra maintenance. That said, I'm getting 5-day deadlines 2-to-1 over 2-days, so I suspect the problem has gone away for the moment, even if Darrell maintains a 2-day (1+1) buffer. I'd still recommend 0+1.5 though, to cover all eventualities. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
I'll do a full boinc server upgrade when we get our hardware. This relates to David EK's message in the previous thread, which reads: Our database server is running out of disk space. We had to reconfigure it, which took a long time because it was over 140 gigs; however, it is operating at a very sluggish pace. Our project has been quite busy lately, mainly due to Charity Engine providing 1000s of new hosts each day. This has been going on for quite some time and our database finally reached its space limit with the current project configuration. We are working on a temporary solution since our full upgrade will take some time, on the order of months I am told. That's dated 8 Sep 2016, so 4 months ago. The new hardware must be due for delivery quite soon. What's the latest plan, please? <whistles> |
Wiktor Jezioro | lakewik.pl Send message Joined: 22 Feb 15 Posts: 4 Credit: 24,056,394 RAC: 0 |
The problem with the 24-hour delay and the "Rosetta Mini for Android is not available for your type of computer" message is still occurring. My machines are bored. When can we expect a fix? |
BarryAZ Send message Joined: 27 Dec 05 Posts: 153 Credit: 30,843,285 RAC: 0 |
Problem with 24-hour delay and "Rosetta Mini for Android is not available for your type of computer" still occurring. My machines are bored. When can we expect a fix? As to the work not being available -- that is likely a case of loading up new work for non-Android devices and should resolve itself. As to correcting the error message there and making it like other BOINC projects -- that is, 'no work is available for your device' instead -- that would take an actual change at the project level, which in turn would require this ongoing report (I encounter this message as well) to be received by the project admins and then acted on. As to the 24-hour delay, that is a similar issue. That report (I run into it as well) requires not only that the message be received by the project admins and operators, but also that it be understood by them as a *project specific* issue and not a user issue, nor one that is a function of the BOINC software. The 24-hour back-off is very much a project-specific setting -- other projects use a progressive process when connects don't succeed, starting with a one-hour back-off. This is pretty clearly project specific and fixing it is up to the Rosetta project folks. As to your computer being bored, well, my approach has been to make sure that every computer I have which is connected to Rosetta is also set up with World Community Grid. They are doing good work and very rarely run out of work, do not have Rosetta's 24-hour back-off bug, and tend on their home site to provide good and updated status information for the user community. |
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
We don't get to choose whether we receive 2-, 5- or 7-day deadline tasks, so in Darrell's case I don't blame him for his reaction when everything came down as 2-day. His setup is a lot more involved than I have to deal with so I guess it requires a little extra maintenance. <sigh> I agree with you, Sid, that since I have many CPUs and GPUs running a mix of projects, and Rosetta is setting short deadlines, I must pay more attention than I would prefer. That said, I'm getting 5-day deadlines 2-to-1 over 2-days, so I suspect the problem has gone away for the moment, even if Darrell maintains a 2-day (1+1) buffer. I'd still recommend 0+1.5 though, to cover all eventualities. I am still getting many 2-day deadline WUs from Rosetta, so the situation continues. I have received so many 2-day WUs that some of my 5-day WUs are being delayed close to a HIGH PRIORITY state. Can anyone here tell me the REASON for such short deadlines? Today, tomorrow or next month shouldn't make much of a difference to the science, and a longer deadline would allow slower CPUs (or ones running less time) to participate. SETI, e.g., has about an EIGHT-WEEK deadline on their WUs. Is there a commercial reason Rosetta is pushing so hard? Or is it that their server doesn't have the capacity to store that many active WUs? I know a system upgrade is "somewhere" in the future. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
That said, I'm getting 5-day deadlines 2-to-1 over 2-days, so I suspect the problem has gone away for the moment, even if Darrell maintains a 2-day (1+1) buffer. I'd still recommend 0+1.5 though, to cover all eventualities. Well, I <was> getting two 5-day tasks for each 2-day task, but I suspect everyone's in the same boat as me again, as there have been no tasks of any type to find over the last day and a half. Can anyone here tell me the REASON for such short deadlines? Today, tomorrow or next month shouldn't make much of a difference to the science, and a longer deadline would allow slower CPUs (or ones running less time) to participate. I did actually answer that. They're CASP tasks which were required to be back within a couple of days. <However> CASP finished months ago, so it seems these are re-runs of those tasks and the urgency no longer exists. But it took several years to get urgent tasks given distinct deadlines in the first place, so I'm guessing we won't get a timely response to changing them back during this time of re-runs. tl;dr No good reason. SETI, e.g., has about an EIGHT WEEK deadline on their WUs. To be fair, Seti tasks could have a million-year deadline and that would be too soon. Every minute they run is a waste of time. Is there a commercial reason Rosetta is pushing so hard? Or is it that their server doesn't have the capacity to store that many active WUs? I know a system upgrade is "somewhere" in the future. Not the former, definitely the latter - that's been said here. I've just asked about the hardware upgrade, to be followed by the server software upgrade. Without either of those it's not looking good for any of us. One question for you: if you're maintaining your 1+1 setting, once you get down to 1.99 days you'll get more tasks, and with 2-day deadline tasks you'll <never> meet those deadlines, thereby forcing your PC into high-priority mode and preventing your GPU tasks from running. Why have you decided to have 2 days of buffer rather than 1.5? Or have you changed it now? 
It seems to me your problems will go away if you change, but they're bound to remain if you don't. Edit: Also, changing 1+1 to 0+2 would be a better setting as new (potentially 2-day) tasks will come down a full day before needed. This may well be the cause of 5-day tasks being pushed back and risking going into high priority. That's the only explanation I can think of for 5-day deadline tasks performing the way you describe. The first figure should only ever be non-zero if you only connect occasionally, not if you have a permanent connection (default setting is apparently a quite understandable 0.1). 0.1+1.5 is coherent with 8-hour default runtimes |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
As to the work not being available -- that is likely a case of loading up new work for non-Android devices and should resolve itself. I turned my PCs over to TN-Grid and their optimized app is running very well! |
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
Can anyone here tell me the REASON for such short deadlines? Today, tomorrow or next month shouldn't make much of a difference to the science, and a longer deadline would allow slower CPUs (or ones running less time) to participate. Hmmm. If CASP (whatever that acronym stands for) had to be back within a couple of days, why are they being sent out now (months later)? Seems like a contradiction to me. SETI, e.g., has about an EIGHT WEEK deadline on their WUs. Everyone has their own opinion and choices to make. Is there a commercial reason Rosetta is pushing so hard? Or is it that their server doesn't have the capacity to store that many active WUs? I know a system upgrade is "somewhere" in the future. Agreed. One question for you: If you're maintaining your 1+1 setting, once you get down to 1.99 days you'll get more tasks and with 2 day deadline tasks you'll <never> meet those deadlines, thereby forcing your PC into high-priority mode and preventing your GPU tasks from running. Why have you decided to have 2 days of buffer rather than 1.5? Or have you changed it now? It seems to me your problems will go away if you change, but they're bound to remain if you don't. The first number asks to maintain "at least" this much work, and the second number asks for "up to" this many days additional. Since some projects (like this one) run out of work on an irregular basis, I want a small supply to bridge the lack of supply. [Or you can think of it as an "occasional connection" to the supply]. Further, those numbers are for ALL projects (as I wrote in an earlier post), not just Rosetta. Today, SETI has maintenance and is offline for much or all of the day. My downloaded queue of WUs is being processed during this period. There is one wrinkle I have that few of the crunchers here have, and that is that I am not in the U.S. I am in Vietnam, and three of the trans-Pacific internet cables are broken (see here if interested). 
Thus I AM connected only intermittently even though I would rather be continuously connected. |
©2024 University of Washington
https://www.bakerlab.org