Large numbers of tasks aborted by project killing CPU performance

Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,311,789
RAC: 16,097
Message 97252 - Posted: 5 Jun 2020, 21:42:26 UTC - in response to Message 97242.  

I have the setting as "Store at least 0.2 days of work" and "Store up to an additional 0.3 days of work",
You would be better served with "At least 0.2 days" and "Additional 0.01 days".
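If you manage preferences locally, here is a minimal global_prefs_override.xml sketch with those values (this file lives in the BOINC data directory and overrides the website preferences; the tag names are the standard BOINC ones):

    <global_preferences>
       <work_buf_min_days>0.2</work_buf_min_days>
       <work_buf_additional_days>0.01</work_buf_additional_days>
    </global_preferences>

The client rereads it via Options > Read local prefs file in the Manager, or with boinccmd --read_global_prefs_override.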



I don't mind short deadlines. Why did so many WUs get put in my queue, though?
Detaching & re-attaching resets everything (except for your Credit), so all of the processing performance & time history was gone, and it downloaded work based on the defaults, not past history.

However, a while back the project made changes to how the estimated completion times are worked out. If those changes were working correctly, you should only have received as much work as your cache settings allowed, i.e. half a day's worth.
I think the project needs to check that their implementation has been put in place & is working as it should; if your cache is set as you say it is, there is no way you should have received that much work.
Grant
Darwin NT
ID: 97252 · Rating: 0
MarkJ

Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 97253 - Posted: 5 Jun 2020, 21:44:29 UTC - in response to Message 97242.  
Last modified: 5 Jun 2020, 21:47:19 UTC

I don't mind short deadlines. Why did so many WUs get put in my queue, though?

There was a bug in work fetch, when the client refills its cache, if you have an app_config.xml with a max_concurrent statement. It was supposedly fixed in 7.16.6. That may or may not be relevant to your situation.
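For anyone who hasn't seen one, a minimal app_config.xml sketch with a max_concurrent statement (the app name and the limit of 4 are only examples; the file goes in the project's folder under the BOINC data directory):

    <app_config>
       <app>
          <name>rosetta</name>
          <max_concurrent>4</max_concurrent>
       </app>
    </app_config>

With the affected clients, work fetch reportedly kept requesting more tasks than the cache settings called for whenever such a file was present, which would look a lot like the over-download described above.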
BOINC blog
ID: 97253 · Rating: 0
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1847
Credit: 7,987,219
RAC: 8,801
Message 97267 - Posted: 7 Jun 2020, 10:10:06 UTC - in response to Message 97197.  

There I was, all set to start with "Now, now, children, that's not how REAL science works." But I'm still planning to include my confession here. Maybe it's all my fault? Is there a historian of science in the house?

But the only comment in this thread since my last visit was actually rather useful, though it was about yet another problem. That's a different (and unrelated) problem (but I just count that as more evidence of how poorly managed Rosetta@home is, which goes back to my concern about the quality of the scientific results). (I hate to use the adjective "amateurish", because I know that real research often looks that way. There's a famous joke along the lines of "If I knew what I was doing, then it wouldn't be research.") From my perspective, this was a new problem since I only noticed it a few days ago. I mostly use that machine under Linux, but someone else uses it for Windows, which is where the problem is, and I only noticed it when asked to check on something else. (Thermal problem?) Pretty sure that the certificate problem explains what is happening on that machine as regards BOINC, which currently has at least 10 completed tasks hung up on it, and some more queued.

Is there a link between the 3-day deadline and the encryption? From an HPC perspective, the answer is probably yes. Throwing away lots of data becomes a larger cost, a larger waste of resources, when you have also invested in encrypting that data before you threw it away. It also raises questions from a scientific perspective. For one thing it indicates the results are probably not being replicated, which is a concern in a situation like this, but it might indicate worse problems. Which is actually a segue to my confession...


I'm waiting for your Nobel Prize in Chemistry.
ID: 97267 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,533,627
RAC: 9,759
Message 97274 - Posted: 7 Jun 2020, 21:19:20 UTC - in response to Message 97197.  

There I was, all set to start with "Now, now, children, that's not how REAL science works." But I'm still planning to include my confession here. Maybe it's all my fault? Is there a historian of science in the house?

But the only comment in this thread since my last visit was actually rather useful, though it was about yet another problem. That's a different (and unrelated) problem (but I just count that as more evidence of how poorly managed Rosetta@home is, which goes back to my concern about the quality of the scientific results). (I hate to use the adjective "amateurish", because I know that real research often looks that way. There's a famous joke along the lines of "If I knew what I was doing, then it wouldn't be research.") From my perspective, this was a new problem since I only noticed it a few days ago. I mostly use that machine under Linux, but someone else uses it for Windows, which is where the problem is, and I only noticed it when asked to check on something else. (Thermal problem?) Pretty sure that the certificate problem explains what is happening on that machine as regards BOINC, which currently has at least 10 completed tasks hung up on it, and some more queued.

Not wanting to throw away so many hours of (overdue) work, I was going to let it finish the queued tasks and hope it would recover and at least download the results, even if no credit was granted because they are late. But that's the (manufactured) 3-day deadline problem yet again? But the new data about the new problem makes it seem clear that the work is irretrievably lost. The machine needs another project reset ASAP. Gotta ignore that sunk-cost fallacy feeling.

Right now I'm actually using my "biggest" machine. Looking over the queued tasks, it was obvious that over 30 of them had no chance of being completed in the 3-day window, so it was the usual choice of letting the project abort them or doing it myself. In either case, that means downloaded data tossed, which means a waste of the resources used to transmit that tossed data. Possibly additional resources for the new encryption?


ALL workunits that are aborted by you OR the project are put back into the pool of available workunits, so they are not... LOST or "tossed data". BOINC is over 20 years old now and has come a long way; your thinking of how it works shows how "amateurish" you really are!!

But that's a natural segue to the actual problem I reported on my last visit here. That was an announced and scheduled outage, though badly announced (and possibly linked to an unscheduled outage about a day later?). Not only was the announcement not pushed to the clients (which would have allowed us, the volunteers, to have made some scheduling adjustments), but the announcement wasn't clear about the changes. If they are just adding encryption for connections to this website, that's one thing. Not exactly silly, and quite belated, but there may be some bits of personal information here, so why not? However, the wording of the description of the upgrade causing the outage makes it sound much heavier. Encryption for a website is trivial, but encryption for large quantities of data is something else again. Quite possibly it would involve a significant purchase of encryption hardware for the project side. (One of the researchers I used to work for designed such chips as entire families before he returned to academia. Our employer lost interest in "commodity" chips, so it's probably become yet another market niche dominated by the long-sighted Chinese. (Which actually reminds me of the first time I worked at the bleeding edge of computer science. Ancient history, but the punchline is that it was obvious (to me, at least) that the project in question would never make a profit, and the entire division (with some of my friends) was dumped and sold cheap to HP a few years after I had moved along. (CMINT)))


Is there a link between the 3-day deadline and the encryption? From an HPC perspective, the answer is probably yes. Throwing away lots of data becomes a larger cost, a larger waste of resources, when you have also invested in encrypting that data before you threw it away. It also raises questions from a scientific perspective. For one thing it indicates the results are probably not being replicated, which is a concern in a situation like this, but it might indicate worse problems. Which is actually a segue to my confession...

The story is buried in the history of BOINC now, going back about 25 years. Way back then, there was a project called seti@home that had a heavy client. In discussions on the (late and dearly departed) usenet I became one of the advocates for the kind of lightweight client that BOINC became, while seti@home became just another BOINC subproject. If there is a historian of science in the house, I think it would be interesting to find out where the BOINC design team got their ideas... Maybe some part of it is my fault? There was a company named Deja News that had a copy of much of usenet, and those archives were sold or transferred to the google later on... (I actually "discovered" the WWW on usenet (around 1994 during another academic stint) when I was searching for stuff about the (late and not so dearly departed) Gopher and WAIS knowledge-sharing systems.) (But I'm pretty sure the main linking guy at Berkeley must also be late by now. He was already an old-timer way back then.)

Now I'm the old-timer, and I'm still wheezing about the silly 3-day deadlines.

You OBVIOUSLY have NOT been a BOINC cruncher for very long, as EVERY project has experienced shortages of workunits over time; even SETI, the first BOINC project, has shut down and is no longer creating any new workunits. In fact over 100 BOINC projects have started and closed for one reason or another, yet Rosetta chugs right along, still producing workunits!!

As for 3-day deadlines, you REALLY need to expand your crunching to other projects; more than a couple have 2-day deadlines that are met by 99% of their users with no problem. YOUR problem seems to be your unwillingness to adjust to the fact that not every project is the same or run in exactly the same way.

One basic 'rule' of BOINC is to always set your workunit cache to a very small amount until your computer and the new project can figure out how long each workunit takes to run and what cache size works for you. SEVERAL people have already said that, but you STILL seem to be saying the same old thing... adjust to ME, not me adjust to you!!!

In short, if you can't handle the 3-day deadlines then maybe Rosetta isn't the project best suited for you and your resources; it works for 99% of the people who are here, so it seems you are the outlier. Or in teacher terms, YOU are the one screwing up the curve!!!

If you prefer LOOOOOOOONG deadlines, why not try Climate Prediction, as some of their workunits take over 365 days to complete!!
ID: 97274 · Rating: 0
Profile Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 97290 - Posted: 8 Jun 2020, 16:03:09 UTC - in response to Message 97274.  
Last modified: 8 Jun 2020, 16:03:48 UTC

You OBVIOUSLY have NOT been a BOINC cruncher for very long, as EVERY project has experienced shortages of workunits over time; even SETI, the first BOINC project, has shut down and is no longer creating any new workunits. In fact over 100 BOINC projects have started and closed for one reason or another, yet Rosetta chugs right along, still producing workunits!!

As for 3-day deadlines, you REALLY need to expand your crunching to other projects; more than a couple have 2-day deadlines that are met by 99% of their users with no problem. YOUR problem seems to be your unwillingness to adjust to the fact that not every project is the same or run in exactly the same way.

One basic 'rule' of BOINC is to always set your workunit cache to a very small amount until your computer and the new project can figure out how long each workunit takes to run and what cache size works for you. SEVERAL people have already said that, but you STILL seem to be saying the same old thing... adjust to ME, not me adjust to you!!!

In short, if you can't handle the 3-day deadlines then maybe Rosetta isn't the project best suited for you and your resources; it works for 99% of the people who are here, so it seems you are the outlier. Or in teacher terms, YOU are the one screwing up the curve!!!

If you prefer LOOOOOOOONG deadlines, why not try Climate Prediction, as some of their workunits take over 365 days to complete!!


I admire y'all's patience dealing with him lol.
ID: 97290 · Rating: 0
Sven

Joined: 7 Feb 16
Posts: 8
Credit: 222,005
RAC: 0
Message 97490 - Posted: 22 Jun 2020, 12:33:51 UTC

Hmm, I must say that I can see a bit of truth in the suggestion to have longer deadlines. When it is sometimes mandatory to switch off computers over weekends, the 3-day deadline is reached very quickly. 5 days instead would help a great deal with getting the crunching done in time.

I usually adjust my system in a way that makes successful crunching possible; for example, store at least 0 days of work and up to an additional 0.1 days (the standard adjustment). But the 3 days over weekends are always problematic.
ID: 97490 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,533,627
RAC: 9,759
Message 97491 - Posted: 22 Jun 2020, 13:20:37 UTC - in response to Message 97490.  

Hmm, I must say that I can see a bit of truth in the suggestion to have longer deadlines. When it is sometimes mandatory to switch off computers over weekends, the 3-day deadline is reached very quickly. 5 days instead would help a great deal with getting the crunching done in time.

I usually adjust my system in a way that makes successful crunching possible; for example, store at least 0 days of work and up to an additional 0.1 days (the standard adjustment). But the 3 days over weekends are always problematic.


We are doing COVID-19 research right now; 1,000 people are dying per day in the world, and longer deadlines mean more delay in the info that might just save some people. With the way you have your cache set up, if you return all the workunits on Friday when you leave your PCs, then come Monday morning you should have all of the ones you crunched over the weekend done and ready to return on time. An option might be to increase your cache on Friday so your PC can crunch more workunits that you get Friday afternoon/evening, and then reduce your cache again on Monday morning when you return all those units you crunched over the weekend.
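One way to script that Friday/Monday switch, assuming boinccmd is available and you keep two pre-edited copies of the override file (the weekend/weekday file names here are just examples):

    # Friday evening: swap in the big-cache override and have the client reload it
    cp global_prefs_override_weekend.xml global_prefs_override.xml
    boinccmd --read_global_prefs_override

    # Monday morning: swap the small-cache override back in and reload again
    cp global_prefs_override_weekday.xml global_prefs_override.xml
    boinccmd --read_global_prefs_override

Run it from the BOINC data directory; --read_global_prefs_override is a standard boinccmd option, so no clicking through the Manager is needed.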
ID: 97491 · Rating: 0
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 97533 - Posted: 23 Jun 2020, 19:35:34 UTC

Hmm... If that guy has become a key manager of the Rosetta@home project, then (1) No wonder the project stopped sending tasks, and (2) This project is probably in its death throes.

Just reading another book about black-hat hackers. It has me wondering if the real problem with Rosetta@home is that we've all been "recruited" for mining Bitcoin or some similarly worthless task. That could actually be related to the push for more encryption, eh? Plus I see how it could explain the peculiar way the downloads were working, almost as though someone had imposed a paged memory system on the project, with data pages around half a GB each, notwithstanding large numbers of ostensibly different projects working on the same data.

Security is a chain, and the attackers are always looking for the weakest links. From reading the comments in this thread, some of which seem to be from honchos at Rosetta@home, the weak links seem pretty obvious...

Much as I disliked some of the management policies of WCG, it looks like I should switch back there. It might be amusing to find out if any of my suggestions were ever implemented. Rosetta@home seems to have clearly crossed into the territory of even more poorly managed projects. I've seen a couple of references to WCG in threads here, and it's a long-term project with some degree of corporate support (even if IBM is only a shadow of the great company it was when I was young). (But I still think HP has fallen harder and faster...)
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 97533 · Rating: 0
Falconet

Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 97537 - Posted: 23 Jun 2020, 20:50:10 UTC - in response to Message 97533.  

Curious, what were those suggestions?
ID: 97537 · Rating: 0
Profile yoerik
Joined: 24 Mar 20
Posts: 128
Credit: 169,525
RAC: 0
Message 97538 - Posted: 23 Jun 2020, 21:25:34 UTC - in response to Message 97533.  

Just reading another book about black-hat hackers. It has me wondering if the real problem with Rosetta@home is that we've all been "recruited" for mining Bitcoin or some similarly worthless task. That could actually be related to the push for more encryption, eh? Plus I see how it could explain the peculiar way the downloads were working, almost as though someone had imposed a paged memory system on the project, with data pages around half a GB each, notwithstanding large numbers of ostensibly different projects working on the same data.

Security is a chain, and the attackers are always looking for the weakest links. From reading the comments in this thread, some of which seem to be from honchos at Rosetta@home, the weak links seem pretty obvious...

Much as I disliked some of the management policies of WCG, it looks like I should switch back there. It might be amusing to find out if any of my suggestions were ever implemented. Rosetta@home seems to have clearly crossed into the territory of even more poorly managed projects. I've seen a couple of references to WCG in threads here, and it's a long-term project with some degree of corporate support (even if IBM is only a shadow of the great company it was when I was young). (But I still think HP has fallen harder and faster...)

No comment on the security/bitcoin question - but I am also curious about your criticisms of WCG. A quick search of the WCG forums brought up your concern over the lack of a server status page - I just made a new thread raising the issue, thanks to you. https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,42562_lastpage,yes#631089

If there are any other concerns about that project, I would certainly appreciate their being brought up in the forums there.
ID: 97538 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,533,627
RAC: 9,759
Message 97541 - Posted: 23 Jun 2020, 23:07:44 UTC - in response to Message 97533.  

Hmm... If that guy has become a key manager of the Rosetta@home project, then (1) No wonder the project stopped sending tasks, and (2) This project is probably in its death throes.

Just reading another book about black-hat hackers. It has me wondering if the real problem with Rosetta@home is that we've all been "recruited" for mining Bitcoin or some similarly worthless task. That could actually be related to the push for more encryption, eh? Plus I see how it could explain the peculiar way the downloads were working, almost as though someone had imposed a paged memory system on the project, with data pages around half a GB each, notwithstanding large numbers of ostensibly different projects working on the same data.

Security is a chain, and the attackers are always looking for the weakest links. From reading the comments in this thread, some of which seem to be from honchos at Rosetta@home, the weak links seem pretty obvious...

Much as I disliked some of the management policies of WCG, it looks like I should switch back there. It might be amusing to find out if any of my suggestions were ever implemented. Rosetta@home seems to have clearly crossed into the territory of even more poorly managed projects. I've seen a couple of references to WCG in threads here, and it's a long-term project with some degree of corporate support (even if IBM is only a shadow of the great company it was when I was young). (But I still think HP has fallen harder and faster...)


You must have missed the news article saying that mining has gone through the floor during the pandemic and is still dropping, so it's unlikely, given the INCREASE in Rosetta users, that WE are in fact mining. Tin-foil hats are cool but not always needed.
ID: 97541 · Rating: 0