Large numbers of tasks aborted by project killing CPU performance

Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 96640 - Posted: 20 May 2020, 2:48:07 UTC

It's increasingly hard to believe that this project is accomplishing anything meaningful. The grasp of scheduling seems to be really weak.

What you [the project managers] seem to be doing now is sending large numbers of tasks on short deadlines. Many of these tasks seem to be linked to large blocks of data. Because the tasks can't possibly be completed within your short deadlines, you wind up aborting large numbers of them. Even the aborting is done crudely, basically stepping in a day later to abort another stack of tasks that cannot be completed within their deadlines.

More haste, less speed? Or worse?

Other times my various machines have more demands on memory than can be accommodated. That results in idle CPUs (actually cores) unless large "waiting for memory" tasks are manually aborted to make space for smaller tasks. Other times tasks that have nearly finished are aborted by the project for unclear reasons. Other tasks that are also past their deadlines are permitted to finish, though of course it is unclear if any of these tasks are earning any credit.

So we [the donors of computing resources] just have to hope that the individual projects themselves are better managed than the project as a whole seems to be? As I've noted before, if I were still involved in research I would be advising the researchers to be quite careful about any results coming from a system run like this one....

Solution time, but I'm sure mine is ugly. At this point I just always manually abort the pending tasks except for those issued today. That gives the running tasks the best chance to finish and be replaced by tasks that also have the best chance to finish without being aborted by the project itself. Tasks that are "waiting for memory" are also aborted, though often I have to go through a bunch of them before a sufficiently small task gets a chance to run on the available core. Main ugliness of this kludge is that I'm sure lots of data is being downloaded and discarded untouched. (However that's happening anyway with the tasks that get aborted by the project.)

REAL solution is realistic deadlines. Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Profile yoerik
Joined: 24 Mar 20
Posts: 128
Credit: 169,525
RAC: 0
Message 96641 - Posted: 20 May 2020, 2:59:47 UTC - in response to Message 96640.  

Uhh, reduce the work queue in your computing preferences?
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,324,523
RAC: 16,480
Message 96652 - Posted: 20 May 2020, 9:45:44 UTC - in response to Message 96640.  

REAL solution is realistic deadlines.
There is no problem with the deadlines, just your expectations.
It has been suggested that it would be best to wait until a Task is well past its deadline before re-issuing it, both to avoid cancelling a Task that is no longer needed if it is finally returned after its original deadline and to help reduce the server load.


Most of your issues are due to your system settings and your micro-managing of the BOINC Manager, combined with insufficient RAM on your systems.

So if you set your cache to a reasonable amount (ideally 0, but even 0.4 days plus 0.01 extra days would be OK), you would stop missing deadlines, even if you don't configure things to allow for your lack of system RAM.
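
To put some rough numbers on that (purely illustrative figures, not anyone's actual host; BOINC's real work-fetch logic weighs more factors than this):

# Sketch in Python of how a work cache interacts with a 3-day deadline.
cores = 4                 # cores crunching Rosetta
task_hours = 8.0          # target CPU run time per task (the project default)
cache_days = 2.0          # "store at least" + "store additional" days of work
hours_on_per_day = 12.0   # the machine only crunches half of each day
deadline_days = 3.0

tasks_queued = cache_days * 24 * cores / task_hours
crunch_days = tasks_queued * task_hours / (cores * hours_on_per_day)

print(f"~{tasks_queued:.0f} tasks queued, ~{crunch_days:.1f} days to clear them")
print("some will miss the 3-day deadline" if crunch_days > deadline_days
      else "everything fits inside the deadline")

With those assumed numbers the queue takes about 4 days to clear, so the tail end misses the deadline; with a 0.4-day cache it comfortably fits.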
The recommended (not the suggested minimum) amount of RAM for a PC has been 8 GB for over 4 years, and over the last year or two 16 GB has come to be considered the recommended amount for even a basic mid-range or higher core/thread CPU. (That many retailers will sell systems with the bare minimum doesn't make it right; they're just trying to make the price look as good as possible compared to other systems.)


Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.
If there isn't enough RAM, then the Task waits until there is. When there is, it runs.
Of course you could stop the problem yourself by configuring your systems to allow for their insufficient amounts of RAM: limit the number of CPU cores/threads you use, and set a more reasonably sized cache.
Grant
Darwin NT
Profile Ray Murray
Joined: 22 Apr 20
Posts: 17
Credit: 270,864
RAC: 0
Message 96657 - Posted: 20 May 2020, 13:23:33 UTC
Last modified: 20 May 2020, 13:44:51 UTC

Another reason for the Project aborting tasks is that they may have found a flaw in a batch of tasks from which returned results would be useless, so rather than wasting your time and theirs, they abort the whole batch.
My buffer is set to 0.1 and 0.1, so I only ever have 1 active task per core and perhaps 1, at most 2, spare. Often it is 1 out, 1 in.
Also, just now time is of the essence, so they want the work returned as soon as possible so the result can be analysed as to whether it can be discarded or is worthy of further investigation.
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 96672 - Posted: 20 May 2020, 19:26:12 UTC - in response to Message 96640.  

What you [the project managers] seem to be doing now is sending large numbers of tasks on short deadlines.

Two things here:
Short deadlines: Yes. Rosetta used to be a predominantly 8-day-deadline project, but about 6 weeks ago it became a 3-day-deadline project.
You shouldn't consider 3 days short. It's now the normal and ongoing deadline. If you need to make adjustments for that (i.e. a smaller offline cache of tasks, so you don't have too many to complete within a 3-day deadline), reduce your cache as appropriate, allowing for how much your computer is on each day.
Large number of tasks: This is a consequence of updating to the latest BOINC Manager and the new Rosetta app version. The more tasks you successfully complete (not abort), the better BOINC (not Rosetta) gets at downloading the right number of tasks.

Other times my various machines have more demands on memory than can be accommodated. That results in idle CPUs (actually cores) unless large "waiting for memory" tasks are manually aborted to make space for smaller tasks. Other times tasks that have nearly finished are aborted by the project for unclear reasons. Other tasks that are also past their deadlines are permitted to finish, though of course it is unclear if any of these tasks are earning any credit.

A few things here.
When you updated to the new Rosetta project URL, any running tasks as well as unstarted tasks were abandoned. This is a one-off from the changeover to https; it won't happen again.
Tasks that are running but failing to complete by the deadline are allowed to continue running to completion. Because they're already started, they should finish before any wingman completes the task, in which case you would get credit. The risk falls more on the wingman, who would have no indication that the task has already been completed.
Memory: yes, that is a problem now.
It used to be that 1 GB of RAM per core was required to run Rosetta. Now it's 2 GB per core (and even up to 3 GB on individual tasks).
Since re-purposing to COVID-19, the complexity of tasks has gone up by an order of magnitude. RAM demands are much higher and old assumptions have to be discarded.
I notice most of your PCs are 4-core with 4 GB RAM. Yes, you will be unable to run some tasks unless you do one of two things:
1) Buy more RAM, sufficient for 2 GB/core. For your old machines, only you will know if that's economically feasible.
2) Retain existing RAM but reduce cores running Rosetta from 4 to 3 (or 8 to 7 or 6). 3 cores running successfully will run better than 4 running unsuccessfully.
Also, if you can allocate more of your existing RAM to Rosetta (under Computing Preferences) that may help just a little. Experiment.
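
As a rough illustration of option 2 (a sketch only; the figures below are the ones being discussed in this thread, not measurements of your machines):

# Sketch of the RAM budgeting above, in Python; numbers are illustrative.
total_ram_gb = 4.0    # typical of the machines described above
gb_per_task = 2.0     # the figure quoted above (some tasks need less, some ~3 GB)
cores_in_box = 4

max_tasks_by_ram = int(total_ram_gb // gb_per_task)   # 2 on this box
cores_to_allow = min(cores_in_box, max_tasks_by_ram)
cpu_pct = 100 * cores_to_allow / cores_in_box
print(f"Allow ~{cores_to_allow} of {cores_in_box} cores "
      f"('Use at most {cpu_pct:.0f}% of the CPUs')")

In practice the OS needs some of that RAM too, so the sweet spot may be lower still; with a mix of smaller tasks it can be higher, which is why 3 out of 4 can also work.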

So we [the donors of computing resources] just have to hope that the individual projects themselves are better managed than the project as a whole seems to be? As I've noted before, if I were still involved in research I would be advising the researchers to be quite careful about any results coming from a system run like this one...

No.
Rosetta is accommodating 8x the number of hosts compared to 2 months ago while also turning round tasks 3x quicker within its existing server capacity. Without question that's a success of spectacular proportions.
When you equate your individual results not completing with results returning bad data, you're unequivocally wrong. They're marked as aborted or failed, no account is taken of them, and they're reissued to hosts that can successfully complete them. Yes, that doesn't take full advantage of the capacity available to the project, but as I've demonstrated above, it's in your hands to turn your failure into success. Try my suggestions and everyone wins.

Solution time, but I'm sure mine is ugly. At this point I just always manually abort the pending tasks except for those issued today. That gives the running tasks the best chance to finish and be replaced by tasks that also have the best chance to finish without being aborted by the project itself. Tasks that are "waiting for memory" are also aborted, though often I have to go through a bunch of them before a sufficiently small task gets a chance to run on the available core. Main ugliness of this kludge is that I'm sure lots of data is being downloaded and discarded untouched. (However that's happening anyway with the tasks that get aborted by the project.)

No. It is ugly and it only serves to ensure wasted capacity on your machines alone. At best, you just get to know it's wasted quicker. Everyone loses, but you lose 100% and the project loses 0.00000001%

REAL solution is realistic deadlines. Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.

No. The cost to the project would be slower task turnaround for everyone, while there will always be someone whose own settings are inappropriate for the demands of the project, as yours currently are for the hardware you have available.
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 96785 - Posted: 25 May 2020, 21:39:33 UTC

I have looked over the replies. They are remarkable. Remarkably uninformed. I think it is safe to express my doubts that any of the replies have come from professional programmers, system administrators, or students of computer science. I think you have nice intentions, but... Not sure what perspectives you are coming from, but it seems pretty obvious that I am not talking to you. Therefore if you have nothing to say that is relevant to what I wrote, then perhaps you should say nothing?

There are plenty of misconceptions that I could correct in detail. But I see no reason to do so. Go back and read what I wrote in the original comment. If you can't understand some part of it and if you actually want to understand it, then please try to write an intelligible question.

I'm just going to focus on one aspect from an old class on operating systems principles. It was one of the most influential classes of my entire life. The general principles apply far outside the bounds of computer science. Optimal scheduling is about identifying the critical resources. You always want to swap abundant resources to conserve the scarce ones. You NEVER want to create new bottlenecks where none exist. Time is NOT the critical resource here and the 3-day deadline is actually creating a bottleneck that has no justification. In addition, I have other uses for my time than trying to tweak configurations, especially since I have no access to the performance profiles (which also means my tweaks would be pointless). Nuking excess tasks is much quicker. I'm pretty sure it's causing wasted resources elsewhere, but I can only write reports like the original comment.

It actually reminds me of a system that was so badly tuned and overloaded that the character echo was taking several seconds. It feels like I'm insulting some of you if I explain what that means, but... when you typed a character, the computer was too busy to send the character back to you. The system wound up spending almost all of its computing resources keeping track of which users were supposed to receive which echoed characters, and almost no actual work was being accomplished.

I suppose I better apologize for my poor teaching, eh? Though I earned a living that way for some years, I never did learn how to motivate. Most of the time I was teaching required classes, so motivation wasn't my main problem. The good students wanted to learn and mostly I just had to stay out of their way and help when I could. Most of the students just wanted to pass, so I helped them do that. Then there's always a few students who want to fail, but I focused on making it harder to fail than to pass. Didn't lose one in my last course.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,324,523
RAC: 16,480
Message 96787 - Posted: 26 May 2020, 8:00:33 UTC - in response to Message 96785.  
Last modified: 26 May 2020, 8:01:56 UTC

Time is NOT the critical resource here and the 3-day deadline is actually creating a bottleneck that has no justification.
Seriously, you haven't heard about the COVID-19 issue?
It's a virus that has had an impact worldwide, with an infection & mortality rate that could cause even advanced health systems to collapse.
Finding a vaccine is something that is time critical; the sooner it is done, the sooner the world can settle down again.
So the deadlines reflect that. And given it takes 8 hours to return most results, expecting a result to be returned within 3 days is not an unreasonable expectation.



In addition, I have other uses for my time than trying to tweak configurations, especially since I have no access to the performance profiles (which also means my tweaks would be pointless).
Performance profiles and tweaking have nothing to do with anything.

As for the time it would take, you have spent more time typing out that reply than it would take to limit the number of tasks running at any given time to allow for your low-RAM systems. And it would save the time you consider so valuable that currently goes into aborting work that doesn't need to be aborted.
You could choose to resolve the issue, or you can carry on as you are.

Let's see if you're as wonderful as you claim to be.
But given the insulting pomposity of the opening paragraph of your response to everyone else, I doubt it.
Grant
Darwin NT
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 96789 - Posted: 26 May 2020, 8:26:07 UTC
Last modified: 26 May 2020, 9:19:48 UTC

Reducing the work queue and not pre-fetching work probably solves it, e.g. store 0.1 days of work and 0 additional days of work.
A good idea is really to run rosetta@home on a Pi 4 if you do switch off your desktops often.
Using a Pi 4 gives decent points-per-watt performance, and it resolves the dilemma of wanting to switch off the PC while there is still a queue of work outstanding.
Another way, which I've been doing when running on desktops, is to stop fetching work and let existing tasks run to completion and submit the results, but this would need to be done with the 0.1 / 0 work cache settings. Unless you are running one of those extremely high core-count processors, e.g. the Ryzen Threadripper 3990X,
https://www.amd.com/en/products/cpu/amd-ryzen-threadripper-3990x
there really isn't any point in downloading a large cache of WUs.

I prefer the low or no work-cache settings as well, as I tend to find that with a large cache one often downloads a list of very similar WUs. My thought is that this doesn't really help, since pseudo-random numbers are involved in the simulations. Running very similar WUs probably makes things less 'random' than if I only fetch what I can handle at any one time, so that others can crunch the other similar WUs. That would probably give the results higher entropy / dispersion and explore the search space more completely.

As for 'waiting for memory', those on the 'bleeding edge' running R@h on the Pi figured things out.
1. Reduce the number of CPUs (cores) used, e.g. 75% (3 out of 4 cores) or 50% (2 out of 4 cores). That should reduce the number of tasks downloaded and running concurrently, and it seems to reduce the occurrence of 'waiting for memory' incidents.
2. If you are running it on a PC that you *don't use*, you may like to set the BOINC memory use to 100%. That lets the bigger or more memory-intensive tasks run and also reduces the 'waiting for memory' incidents. For those playing with it on the Pi, some even resort to using zram and such to maximize memory availability on that one single DRAM chip.
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1847
Credit: 7,989,768
RAC: 8,523
Message 96795 - Posted: 26 May 2020, 15:15:00 UTC - in response to Message 96640.  

It's increasingly hard to believe that this project is accomplishing anything meaningful.
Scientific publications and the reputation of the project are against you.

REAL solution is realistic deadlines. Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.

- You can set the WU duration.
- You can set the cache of WUs.
- You can set the number of active cores.
- You can set the memory usage.
- You have checkpoints.
What more do you want?


P.S.
If you weren't so grouchy and annoying, the discussion would be better.
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1847
Credit: 7,989,768
RAC: 8,523
Message 96796 - Posted: 26 May 2020, 15:19:12 UTC - in response to Message 96787.  

You could choose to resolve the issue, or you can carry on as you are.
Let's see if you're as wonderful as you claim to be.


He, if he wants, can open a new BOINC project (about protein simulations) and show how great his work is.
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 96803 - Posted: 27 May 2020, 3:40:53 UTC - in response to Message 96785.  

I have looked over the replies. They are remarkable. Remarkably uninformed

All my tasks run and complete successfully. You are reporting problems you've actually never properly detailed. Take the hint.

You have several machines available. Select your worst-performing one and change its settings - it'll take you a one-time 30 seconds.

In Options/Computing Preferences, on the first tab, set the following:

On the Computing tab:
Use at most 75% of the CPUs
Use at most 100% of CPU time

Store at least 0.1 days of work
Store up to an additional 0.5 days of work

In the Disk & Memory tab:
When Computer is in use, use at most 75% - (if your setting is already higher, keep it as it is)
When Computer is not in use, use at most 85% - (if your setting is already higher, keep it as it is)
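
If clicking through the Manager on every machine gets tedious, the same values can go into a global_prefs_override.xml in the BOINC data directory. A minimal sketch, assuming BOINC's standard override-file tag names and a typical Linux data-directory path (adjust the path for your own install):

# Write the suggested values into BOINC's local preferences override file.
from pathlib import Path

OVERRIDE = """\
<global_preferences>
   <max_ncpus_pct>75</max_ncpus_pct>
   <cpu_usage_limit>100</cpu_usage_limit>
   <work_buf_min_days>0.1</work_buf_min_days>
   <work_buf_additional_days>0.5</work_buf_additional_days>
   <ram_max_used_busy_pct>75</ram_max_used_busy_pct>
   <ram_max_used_idle_pct>85</ram_max_used_idle_pct>
</global_preferences>
"""
# max_ncpus_pct            -> "Use at most 75% of the CPUs"
# cpu_usage_limit          -> "Use at most 100% of CPU time"
# work_buf_min_days        -> "Store at least 0.1 days of work"
# work_buf_additional_days -> "Store up to an additional 0.5 days of work"
# ram_max_used_busy/idle   -> the two memory percentages above

data_dir = Path("/var/lib/boinc-client")   # common Linux location; differs on Windows/macOS
(data_dir / "global_prefs_override.xml").write_text(OVERRIDE)
# Restart the client (or have it re-read local preferences) to pick this up.

This file overrides the website preferences for that one host, so it's the scripted equivalent of the per-PC settings above.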

Come back in 2 days and tell me how badly it's going.
Take a chance - you have nothing to lose if things are running so badly for you now
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,544,938
RAC: 10,345
Message 96834 - Posted: 29 May 2020, 12:28:16 UTC - in response to Message 96640.  

REAL solution is realistic deadlines. Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.


Nope, the REAL solution is a smaller cache of workunits on your machine!! There is no reason to have a 10-day cache of workunits that are ALL due in 2 to 3 days; NO ONE could finish all those in time!! Getting a smaller cache can be done in one of two basic ways... 1) you lower the cache settings on your machine, or 2) the Project sends out fewer workunits per request. If you want the 2nd option, then people with 128 CPU cores on a single machine are screwed thanks to you!! And YES, there are people crunching who have hardware like that!!!
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,544,938
RAC: 10,345
Message 96835 - Posted: 29 May 2020, 12:33:01 UTC - in response to Message 96785.  

I suppose I better apologize for my poor teaching, eh? Though I earned a living that way for some years, I never did learn how to motivate. Most of the time I was teaching required classes, so motivation wasn't my main problem. The good students wanted to learn and mostly I just had to stay out of their way and help when I could. Most of the students just wanted to pass, so I helped them do that. Then there's always a few students who want to fail, but I focused on making it harder to fail than to pass. Didn't lose one in my last course.


STOP being a Teacher (thanks for doing it, btw) and start being a cruncher! Stop telling people how things should be and start helping people figure out how to work within the system as it is. We are not programmers or even system admin folks; we are crunchers too, and that means we, like you, have ZERO ability to make the changes you are alluding to. We don't get to decide how this or that works or gets done or ANYTHING like that; we come here to crunch for our own reasons, and once those reasons change we move on to someplace else.
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 96957 - Posted: 31 May 2020, 5:19:27 UTC

Is the current outage related to adjustments to address these problems? I noticed a few long-deadline tasks recently...

I must miss the days when HPC was a thing, eh? When you're designing systems that are entirely under your control things are easier in many ways.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
MarkJ

Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 96988 - Posted: 31 May 2020, 12:21:06 UTC - in response to Message 96957.  
Last modified: 31 May 2020, 12:23:39 UTC

Is the current outage related to adjustments to address these problems? I noticed a few long-deadline tasks recently...

There isn’t an outage. There is a BOINC under Windows issue. See this thread which also has a solution until they can get a new BOINC version out.

As for the longer-deadline tasks, it seems one of the scientists submitted a bunch of MiniRosetta tasks after the project had supposedly retired the app.
BOINC blog
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 97197 - Posted: 3 Jun 2020, 19:11:22 UTC

There I was, all set to start with "Now, now, children, that's not how REAL science works." But I'm still planning to include my confession here. Maybe it's all my fault? Is there a historian of science in the house?

But the only comment in this thread since my last visit was actually rather useful, though it was about yet another problem. That's a different (and unrelated) problem (but I just count that as more evidence of how poorly managed Rosetta@home is, which goes back to my concern about the quality of the scientific results). (I hate to use the adjective "amateurish", because I know that real research often looks that way. There's a famous joke along the lines of "If I knew what I was doing, then it wouldn't be research.") From my perspective, this was a new problem since I only noticed it a few days ago. I mostly use that machine under Linux, but someone else uses it for Windows, which is where the problem is, and I only noticed it when asked to check on something else. (Thermal problem?) Pretty sure that the certificate problem explains what is happening on that machine as regards BOINC, which currently has at least 10 completed tasks hung up on it, and some more queued.

Not wanting to throw away so many hours of (overdue) work, I was going to let it finish the queued tasks and hope it would recover and at least download the results, even if no credit was granted because they are late. But that's the (manufactured) 3-day deadline problem yet again? But the new data about the new problem makes it seem clear that the work is irretrievably lost. The machine needs another project reset ASAP. Gotta ignore that sunk-cost fallacy feeling.

Right now I'm actually using my "biggest" machine. Looking over the queued tasks, it was obvious that over 30 of them had no chance of being completed in the 3-day window, so it was the usual choice of letting the project abort them or doing it myself. In either case, that means downloaded data tossed, which means a waste of the resources used to transmit that tossed data. Possibly additional resources for the new encryption?

But that's a natural segue to the actual problem I reported on my last visit here. That was an announced and scheduled outage, though badly announced (and possibly linked to an unscheduled outage about a day later?). Not only was the announcement not pushed to the clients (which would have allowed us, the volunteers, to have made some scheduling adjustments), but the announcement wasn't clear about the changes. If they are just adding encryption for connections to this website, that's one thing. Not exactly silly, and quite belated, but there may be some bits of personal information here, so why not? However, the wording of the description of the upgrade causing the outage makes it sound much heavier. Encryption for a website is trivial, but encryption for large quantities of data is something else again. Quite possibly it would involve a significant purchase of encryption hardware for the project side. (One of the researchers I used to work for designed such chips as entire families before he returned to academia. Our employer lost interest in "commodity" chips, so it's probably become yet another market niche dominated by the long-sighted Chinese. (Which actually reminds me of the first time I worked at the bleeding edge of computer science. Ancient history, but the punchline is that it was obvious (to me, at least) that the project in question would never make a profit, and the entire division (with some of my friends) was dumped and sold cheap to HP a few years after I had moved along. (CMINT)))

Is there a link between the 3-day deadline and the encryption? From an HPC perspective, the answer is probably yes. Throwing away lots of data becomes a larger cost, a larger waste of resources, when you have also invested in encrypting that data before you threw it away. It also raises questions from a scientific perspective. For one thing it indicates the results are probably not being replicated, which is a concern in a situation like this, but it might indicate worse problems. Which is actually a segue to my confession...

The story is buried in the history of BOINC now, going back about 25 years. Way back then, there was a project called seti@home that had a heavy client. In discussions on the (late and dearly departed) usenet I became one of the advocates for the kind of lightweight client that BOINC became, while seti@home became just another BOINC subproject. If there is a historian of science in the house, I think it would be interesting to find out where the BOINC design team got their ideas... Maybe some part of it is my fault? There was a company named Deja News that had a copy of much of usenet, and those archives were sold or transferred to the google later on... (I actually "discovered" the WWW on usenet (around 1994 during another academic stint) when I was searching for stuff about the (late and not so dearly departed) Gopher and WAIS knowledge-sharing systems.) (But I'm pretty sure the main linking guy at Berkeley must also be late by now. He was already an old-timer way back then.)

Now I'm the old-timer, and I'm still wheezing about the silly 3-day deadlines.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,544,938
RAC: 10,345
Message 97204 - Posted: 3 Jun 2020, 22:42:43 UTC - in response to Message 97197.  

Now I'm the old-timer, and I'm still wheezing about the silly 3-day deadlines.


Cut your work cache down: in the BOINC Manager, under Options, Computing preferences, then the Computing tab, set the first line to 1.0 days and the second line to 0.5 days. Those settings are at the bottom of that page. Be sure to click 'Save' to save the changes. What this will do is reduce the number of workunits you keep on your PC, so you have fewer and fewer expiring because you weren't able to get to them in time. You can also make the changes on the Rosetta website, under your account and then Computing Preferences, where in the Other section you will see lines like this:
Store at least 1 days of work
Store up to an additional 0.25 days of work

That is a global setting and will affect all of your PCs; the first setting I outlined is a PC-by-PC setting. I prefer to use the per-PC settings, as not all my PCs are exactly the same.
William Albert

Joined: 22 Mar 20
Posts: 23
Credit: 1,061,020
RAC: 0
Message 97208 - Posted: 4 Jun 2020, 4:31:20 UTC - in response to Message 97197.  
Last modified: 4 Jun 2020, 4:40:57 UTC

Continued whining from shanen about how crappy Rosetta@home is, how bad the people who run it are, and how everyone is wrong except for him.


Not only has Rosetta@home generally managed the influx of new users well given their resources, but Rosetta@home easily has the most predictable WU runtime of any BOINC project I've run, because you can literally set what that runtime is. The setting is located in your Rosetta@home account, under Preferences > Rosetta@home preferences > Target CPU run time. This combined with the caching options that BOINC provides (which numerous people in this thread have pointed you to, repeatedly) means that you can predict almost exactly what the turnaround time for a Rosetta@home WU will be.

Things get a bit more complicated if the computer isn't crunching for Rosetta@home 100% of the time, but you can compensate for this by reducing the "Target CPU run time" accordingly (e.g., setting it to 2 hours, down from the default of 8 hours).
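
Roughly, and with purely illustrative numbers (a sketch of the arithmetic only, not anyone's measured host):

# Back-of-envelope turnaround estimate for a Rosetta WU.
target_runtime_h = 8.0   # "Target CPU run time" preference
cache_days = 0.1 + 0.5   # store-at-least + additional days of work

# A newly downloaded task waits behind the cache, then runs:
worst_case_days = cache_days + target_runtime_h / 24
print(f"Worst-case turnaround ~{worst_case_days:.1f} days on an always-on host, "
      f"vs the 3-day deadline")
# A machine that only crunches part of the day stretches this proportionally,
# which is why dropping the target run time (or the cache) helps in that case.

So with those settings an always-on host returns work in about a day, well inside the deadline.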

The 3-day deadline is set as such by the project administrators because it allows researchers to get quick feedback on their work units, and it helps to mitigate the negative effects of slow or unreliable hosts. It can also reduce storage requirements, as WUs that are completed and validated can be removed more quickly to make room for other WUs. The cost of short deadlines is increased network load on the infrastructure to deal with WU timeouts and retransmissions, but given that the deadlines have shortened as Rosetta@home's computing power has increased, the project administrators have clearly found the benefits to be worth the cost.

In any case, Rosetta@home provides very fine grained control over how long a WU will execute and how many queued WUs you keep. Having to abandon work because you have more queued WU's than you can complete by the deadline is a problem ENTIRELY of your own making.

If, for whatever reason, you cannot overcome this issue (even though Rosetta@home seemingly works for hundreds of thousands of other volunteers), perhaps that's a sign that Rosetta@home isn't the right project for you? There are many other BOINC projects that have shorter run times, longer deadlines, and could potentially use your computing power. Maybe you should give World Community Grid or Ibercivis a try?
D_S_Spence

Joined: 29 Mar 18
Posts: 2
Credit: 1,265,223
RAC: 940
Message 97242 - Posted: 5 Jun 2020, 13:42:47 UTC

The title of this thread caught my attention.

After switching URLs to the new https one, my Linux machine got inundated with WUs. I don't normally pay very close attention to the Linux machine because it lives in the basement and is a 24/7 cruncher that is used for nothing else, but it normally works mainly on WCG. When it stopped returning work for WCG I took a look and saw that it has 50+ Rosetta WUs all due on June 7!

I don't know if you can see the host, but I'll put the link here: https://boinc.bakerlab.org/rosetta/results.php?hostid=4034811

I have the settings as "Store at least 0.2 days of work" and "Store up to an additional 0.3 days of work", but this is way more work than this little 4-core machine can do in so little time.

I don't mind short deadlines. Why did so many WUs get put in my queue, though?
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 97249 - Posted: 5 Jun 2020, 18:58:49 UTC
Last modified: 5 Jun 2020, 19:00:53 UTC

Try 0.1/0 instead, i.e. don't cache work. I did that and it works well (on a Pi 4), so I'd guess bigger PCs would probably manage fine with it.
All that out-of-whack work downloading probably boils down to BOINC's statistics, e.g. its estimate of how long the work takes to complete is too short; if you place 0.3 days of (extra) work in the cache, that can be a lot of tasks when those statistics incorrectly calculate that a task only takes a small amount of time to complete.
Over time, as you crunch more WUs, the numbers may fix themselves, but for a start use as little cache as possible,
and use a recent BOINC client; old ones may have bugs.
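
A little sketch of why an off estimate blows the queue out (made-up numbers, just to show the shape of it):

# The client turns "days of work" into a task count using its *estimated*
# task duration, so a low estimate over-fills the queue.
cores = 4
cache_days = 0.2 + 0.3        # the settings quoted a few posts above
estimated_task_hours = 2.0    # what the client currently thinks a task takes
actual_task_hours = 8.0       # what a task really takes

tasks_fetched = cache_days * 24 * cores / estimated_task_hours
real_backlog_days = tasks_fetched * actual_task_hours / (cores * 24)
print(f"~{tasks_fetched:.0f} tasks fetched = ~{real_backlog_days:.1f} days of real work")
# As completed tasks feed better duration estimates back, the over-fetch
# corrects itself - the "numbers may fix themselves" point above.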