Posts by shanen

1) Message boards : Number crunching : Large numbers of tasks aborted by project killing CPU performance (Message 97533)
Posted 23 Jun 2020 by Profile shanen
Post:
Hmm... If that guy has become a key manager of the Rosetta@home project, then (1) No wonder the project stopped sending tasks, and (2) This project is probably in its death throes.

Just reading another book about black-hat hackers. It has got me wondering whether the real problem with Rosetta@home is that we've all been "recruited" for mining Bitcoin or some similarly worthless task. That could actually be related to the push for more encryption, eh? Plus I see how it could explain the peculiar way the downloads were working, almost as though someone had imposed a paged memory system on the project, with data pages of around half a GB each, notwithstanding large numbers of ostensibly different projects working on the same data.

Security is a chain, and the attackers are always looking for the weakest links. From reading the comments in this thread, some of which seem to be from honchos at Rosetta@home, the weak links seem pretty obvious...

Much as I disliked some of the management policies of WCG, it looks like I should switch back there. It might be amusing to find out whether any of my suggestions were ever implemented. Rosetta@home seems to have clearly crossed into the territory of even more poorly managed projects. I've seen a couple of references to WCG in threads here, and it's a long-term project with some degree of corporate support (even if IBM is only a shadow of the great company it was when I was young). (But I still think HP has fallen harder and faster...)
2) Message boards : Number crunching : no new tasks? (Message 97532)
Posted 23 Jun 2020 by Profile shanen
Post:
Might be hard to believe, but some of the other BOINC projects have even worse project management than this one. The management bottleneck on this one seems to be focused on one or two people.

I've run a number of projects over the decades... Actually, I started with seti@home before BOINC was created.

Then again this one does seem to be entering its death throes. I tested out another project with one of my machines, but didn't like it much, so I haven't switched yet. Anyone have any recommendations? WCG keeps bugging me to come back, so maybe they've cleaned up their act.
3) Message boards : Number crunching : Large numbers of tasks aborted by project killing CPU performance (Message 97197)
Posted 3 Jun 2020 by Profile shanen
Post:
There I was, all set to start with "Now, now, children, that's not how REAL science works." But I'm still planning to include my confession here. Maybe it's all my fault? Is there a historian of science in the house?

But the only comment in this thread since my last visit was actually rather useful, though it was about yet another, unrelated problem. (I just count that as more evidence of how poorly managed Rosetta@home is, which goes back to my concern about the quality of the scientific results.) (I hate to use the adjective "amateurish", because I know that real research often looks that way. There's a famous joke along the lines of "If I knew what I was doing, then it wouldn't be research.") From my perspective, this was a new problem since I only noticed it a few days ago. I mostly use that machine under Linux, but someone else uses it for Windows, which is where the problem is, and I only noticed it when asked to check on something else. (Thermal problem?) Pretty sure the certificate problem explains what is happening on that machine as regards BOINC, which currently has at least 10 completed tasks hung up on it, and some more queued.

Not wanting to throw away so many hours of (overdue) work, I was going to let it finish the queued tasks and hope it would recover and at least upload the results, even if no credit was granted because they are late. But that's the (manufactured) 3-day deadline problem yet again. The new data about the new problem, though, makes it seem clear that the work is irretrievably lost. The machine needs another project reset ASAP. Gotta ignore that sunk-cost fallacy feeling.

Right now I'm actually using my "biggest" machine. Looking over the queued tasks, it was obvious that over 30 of them had no chance of being completed in the 3-day window, so it was the usual choice of letting the project abort them or doing it myself. In either case, downloaded data gets tossed, which wastes the resources used to transmit it. Possibly additional resources for the new encryption?
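
For anyone who wants to make that "no chance" judgment by arithmetic rather than by eye, here's a minimal sketch. Every number in it is a hypothetical placeholder except the roughly 8-hour default Rosetta run time; plug in your own queue depth, core count, and daily uptime.

    # Rough feasibility check for a fixed deadline window. All of the numbers
    # below are hypothetical placeholders, not measurements from my machine.
    hours_per_task = 8    # typical Rosetta target CPU time per task
    cores          = 4    # cores this host gives to BOINC
    hours_per_day  = 12   # the machine only crunches part of the day
    deadline_days  = 3    # the window being complained about
    queued_tasks   = 40

    cpu_hours_available = cores * hours_per_day * deadline_days   # 144 CPU-hours
    fits   = cpu_hours_available // hours_per_task                # 18 tasks
    doomed = max(0, queued_tasks - fits)                          # 22 tasks
    print(f"{fits} of {queued_tasks} queued tasks can finish; {doomed} will miss the deadline")

The point is just that CPU-hours, not calendar days, are the real budget; the 3-day window only matters because it converts a deep queue into forced aborts.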

But that's a natural segue to the actual problem I reported on my last visit here. That was an announced and scheduled outage, though badly announced (and possibly linked to an unscheduled outage about a day later?). Not only was the announcement not pushed to the clients (which would have allowed us, the volunteers, to make some scheduling adjustments), but the announcement wasn't clear about the changes. If they are just adding encryption for connections to this website, that's one thing. Not exactly silly, and quite belated, but there may be some bits of personal information here, so why not? However, the wording of the description of the upgrade causing the outage makes it sound much heavier. Encryption for a website is trivial, but encryption for large quantities of data is something else again. Quite possibly it would involve a significant purchase of encryption hardware for the project side. (One of the researchers I used to work for designed entire families of such chips before he returned to academia. Our employer lost interest in "commodity" chips, so it's probably become yet another market niche dominated by the long-sighted Chinese. (Which actually reminds me of the first time I worked at the bleeding edge of computer science. Ancient history, but the punchline is that it was obvious (to me, at least) that the project in question would never make a profit, and the entire division (with some of my friends) was dumped and sold cheap to HP a few years after I had moved along. (CMINT)))

Is there a link between the 3-day deadline and the encryption? From an HPC perspective, the answer is probably yes. Throwing away lots of data becomes a larger cost, a larger waste of resources, when you have also invested in encrypting that data before you threw it away. It also raises questions from a scientific perspective. For one thing it indicates the results are probably not being replicated, which is a concern in a situation like this, but it might indicate worse problems. Which is actually a segue to my confession...

The story is buried in the history of BOINC now, going back about 25 years. Way back then, there was a project called seti@home that had a heavy client. In discussions on the (late and dearly departed) usenet I became one of the advocates for the kind of lightweight client that BOINC became, while seti@home ended up as just another BOINC subproject. If there is a historian of science in the house, I think it would be interesting to find out where the BOINC design team got their ideas... Maybe some part of it is my fault? There was a company named Deja News that had a copy of much of usenet, and those archives were sold or transferred to Google later on... (I actually "discovered" the WWW on usenet (around 1994, during another academic stint) when I was searching for stuff about the (late and not so dearly departed) Gopher and WAIS knowledge-sharing systems.) (But I'm pretty sure the main linking guy at Berkeley must also be late by now. He was already an old-timer way back then.)

Now I'm the old-timer, and I'm still wheezing about the silly 3-day deadlines.
4) Message boards : News : Outage notice (Message 97040)
Posted 31 May 2020 by Profile shanen
Post:
It would have been nice if you had pushed that notice to the clients in case people wanted to make sure they had enough tasks queued. Because of the short-deadline problem, most of my machines are now idled.
5) Message boards : Number crunching : Large numbers of tasks aborted by project killing CPU performance (Message 96957)
Posted 31 May 2020 by Profile shanen
Post:
Is the current outage related to adjustments to address these problems? I noticed a few long-deadline tasks recently...

I must miss the days when HPC was a thing, eh? When you're designing systems that are entirely under your control things are easier in many ways.
6) Message boards : Number crunching : Large numbers of tasks aborted by project killing CPU performance (Message 96785)
Posted 25 May 2020 by Profile shanen
Post:
I have looked over the replies. They are remarkable. Remarkably uninformed. I think it is safe to express my doubts that any of the replies have come from professional programmers, system administrators, or students of computer science. I think you have nice intentions, but... Not sure what perspectives you are coming from, but it seems pretty obvious that I am not talking to you. Therefore if you have nothing to say that is relevant to what I wrote, then perhaps you should say nothing?

There are plenty of misconceptions that I could correct in detail. But I see no reason to do so. Go back and read what I wrote in the original comment. If you can't understand some part of it and if you actually want to understand it, then please try to write an intelligible question.

I'm just going to focus on one aspect from an old class on operating systems principles. It was one of the most influential classes of my entire life. The general principles apply far outside the bounds of computer science. Optimal scheduling is about identifying the critical resources. You always want to swap abundant resources to conserve the scarce ones. You NEVER want to create new bottlenecks where none exist. Time is NOT the critical resource here and the 3-day deadline is actually creating a bottleneck that has no justification. In addition, I have other uses for my time than trying to tweak configurations, especially since I have no access to the performance profiles (which also means my tweaks would be pointless). Nuking excess tasks is much quicker. I'm pretty sure it's causing wasted resources elsewhere, but I can only write reports like the original comment.

It actually reminds me of a system that was so badly tuned and overloaded that the character echo was taking several seconds. It feels like I'm insulting some of you if I explain what that means, but... when you typed a character, the computer was too busy to send it back to you. The system was finally spending almost all of its computing resources keeping track of which users were supposed to receive which echoed characters, and almost no actual work was being accomplished.

I suppose I better apologize for my poor teaching, eh? Though I earned a living that way for some years, I never did learn how to motivate. Most of the time I was teaching required classes, so motivation wasn't my main problem. The good students wanted to learn and mostly I just had to stay out of their way and help when I could. Most of the students just wanted to pass, so I helped them do that. Then there's always a few students who want to fail, but I focused on making it harder to fail than to pass. Didn't lose one in my last course.
7) Message boards : Number crunching : Large numbers of tasks aborted by project killing CPU performance (Message 96640)
Posted 20 May 2020 by Profile shanen
Post:
It's increasingly hard to believe that this project is accomplishing anything meaningful. The grasp of scheduling seems to be really weak.

What you [the project managers] seem to be doing now is sending large numbers of tasks on short deadlines. Many of these tasks seem to be linked to large blocks of data. Because the tasks can't possibly be completed within your short deadlines, you wind up aborting large numbers of them. However, even the aborting is done crudely, basically stepping in each day to abort another stack of tasks that cannot be completed within their deadlines.

More haste, less speed? Or worse?

Other times my various machines have more demands on memory than can be accommodated. That results in idle CPUs (actually cores) unless large "waiting for memory" tasks are manually aborted to make space for smaller tasks. Other times tasks that have nearly finished are aborted by the project for unclear reasons. Other tasks that are also past their deadlines are permitted to finish, though of course it is unclear if any of these tasks are earning any credit.

So we [the donors of computing resources] just have to hope that the individual projects themselves are better managed than the project as a whole seems to be? As I've noted before, if I were still involved in research I would be advising the researchers to be quite careful about any results coming from a system run like this one....

Solution time, but I'm sure mine is ugly. At this point I just always manually abort the pending tasks except for those issued today. That gives the running tasks the best chance to finish and be replaced by tasks that also have the best chance to finish without being aborted by the project itself. Tasks that are "waiting for memory" are also aborted, though often I have to go through a bunch of them before a sufficiently small task gets a chance to run on the available core. Main ugliness of this kludge is that I'm sure lots of data is being downloaded and discarded untouched. (However that's happening anyway with the tasks that get aborted by the project.)
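
For what it's worth, the nuking can be scripted instead of clicked through. Here is a minimal sketch built on the stock boinccmd tool: it lists tasks, keeps the ones that plausibly fit in the CPU-hour budget of the deadline window, and prints (rather than runs) abort commands for the rest. The parsed field names, the project URL, and all of the budget numbers are assumptions on my part, so check them against what your own client actually reports before trusting it.

    #!/usr/bin/env python3
    # Sketch only: flag queued tasks that cannot plausibly finish inside the
    # deadline window and print abort commands for them. The --get_tasks field
    # names ("name:", "active_task_state:") and the project URL are assumptions;
    # verify them against your own boinccmd output.
    import subprocess

    PROJECT_URL    = "https://boinc.bakerlab.org/rosetta/"  # as registered in your client
    DEADLINE_HOURS = 72   # the 3-day window
    CORES          = 4    # cores this host devotes to BOINC
    HOURS_PER_TASK = 8    # typical Rosetta target CPU time

    def queued_task_names():
        """Return names of tasks that are not currently executing."""
        out = subprocess.run(["boinccmd", "--get_tasks"],
                             capture_output=True, text=True, check=True).stdout
        names, current, running = [], None, False
        for raw in out.splitlines():
            line = raw.strip()
            if line.startswith("name:"):
                if current is not None and not running:
                    names.append(current)
                current, running = line.split(":", 1)[1].strip(), False
            elif line.startswith("active_task_state:") and "EXECUTING" in line:
                running = True   # leave running tasks alone
        if current is not None and not running:
            names.append(current)
        return names

    def main():
        queued = queued_task_names()
        budget = DEADLINE_HOURS * CORES          # CPU-hours available in the window
        keep   = budget // HOURS_PER_TASK        # rough count that can still finish
        for name in queued[keep:]:               # everything past that is doomed
            # Print instead of executing, so a human gets the final say.
            print(f"boinccmd --task {PROJECT_URL} {name} abort")

    if __name__ == "__main__":
        main()

Printing rather than executing keeps a human in the loop, which seems wise given how crude the feasibility estimate is; it also ignores that the already-running tasks eat into the same budget.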

REAL solution is realistic deadlines. Sophisticated solution would involve memory management, too, but right now I feel like that is beyond your capabilities.
8) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 93809)
Posted 8 Apr 2020 by Profile shanen
Post:
Trying and even eager to be of help, but...

All these short-deadline units are troublesome. Is it accomplishing anything if my contributions are just discarded? And discarded for the sake of deadlines that seem quite arbitrary, even silly. Exacerbated by more checkpoint problems, too.

Actually writing from the machine that has the most problems dealing with the deadlines, but even some of my bigger machines clearly have more queued tasks than they can possibly complete within the short deadlines. Obvious workaround (though it's tedious) is to manually abort the tasks that can't be completed, but that causes problems because the flow of tasks has become sporadic again... Plus it's wasting the bandwidth at the project end when they send data that is just discarded.

On top of that, some of the machines wind up wasting time because of large batches of tasks with large memory requirements that cause the "Waiting for memory" status on some tasks. Again, selective nuking of tasks can get the CPUs busy again, but I'm NOT supposed to be spending time managing memory problems because the people running Rosetta@home can't figure it out... I'm fairly confident that BOINC has the capabilities to assess and manage memory, but it seems they are not being used by the Baker Lab people.
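
To be fair, some of this can be capped on the volunteer side. Besides the "Use at most X% of memory" computing preferences, BOINC reads an app_config.xml from the project's directory. A minimal sketch follows; the app name "rosetta" and the limit of 2 are assumptions, so check client_state.xml for the app names your client actually reports and size the limit to your own RAM.

    <!-- Minimal sketch: cap concurrent Rosetta tasks so the rest wait in the
         queue instead of sitting in "Waiting for memory". The app name is an
         assumption; check client_state.xml for the real one. -->
    <app_config>
        <app>
            <name>rosetta</name>
            <max_concurrent>2</max_concurrent>
        </app>
        <project_max_concurrent>2</project_max_concurrent>
    </app_config>

Either the per-app limit or the project-wide one alone would do; having the project send memory-appropriate work in the first place would obviously be better.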

I've currently earned over 12 million points, which is supposed to indicate a moderate contribution, but I'm thinking about moving along. The reason I switched to Rosetta was that the projects I used to support were not well managed. I'm sure I could even shop around for other projects that are also working on Covid.

In addition, if I were still supporting researchers, I would not recommend that they rely on data processed on Rosetta because such problems make the entire thing dubious... There were a couple of teams in the lab that are probably doing Covid stuff now (but I'm retired, so I have no idea).
9) Message boards : Number crunching : Computation errors (Message 90934)
Posted 24 Jul 2019 by Profile shanen
Post:
Seems unlikely they've ever addressed this problem, eh? I see them pretty often. Especially annoying when they have run up 8 hours of effort before crashing, presumably with no points earned. And no, at this point I don't care enough to do the searching to try to figure out if the points were granted. I don't even care enough to read the rest of the thread beyond the Subject: and glancing at a couple of the posts.

Latest example:

Application: Rosetta Mini 3.78
Name: start_close_HHH_rd4_0056.min_rise1.83_whole_pass_aagb.bp_20190406150644_0001_0001_0001_0003_0001_0001_fragments_fold_SAVE_ALL_OUT_833066_1053
State: Computation error
Received: 22 Jul 2019, 08:13:16
Report deadline: 30 Jul 2019, 08:13:11
Estimated computation size: 80,000 GFLOPs
CPU time: 07:49:11
Elapsed time: 07:59:03
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
10) Message boards : Number crunching : More checkpointing problems (Message 90888)
Posted 4 Jul 2019 by Profile shanen
Post:
More sick puppies to report. Names start with "Cx_", where I have noticed x values from 3 to 5. Especially annoying in that the tasks claim to be checkpointing properly, but are lying about it. If you look at the Properties, it will say there was a recent checkpoint, perhaps a minute ago, but if you then reboot the computer, it typically loses 20% of its progress, representing about two hours of work. The elapsed time is conserved. In today's example, the task had over 7 hours in the Elapsed column and Remaining was under an hour; after rebooting the computer, Elapsed was still over 7, but Progress had fallen to 60% and Remaining was over 3 hours.

Usually I spot these things on a computer that only runs for a few hours at a time. However, this time I actually noticed it during the major OS upgrades last month. Just confirmed it on the short-running computer.

On your [the project management's] side it should probably show up as a series of peaks in completion times. At least on the evidence I've noticed, the 2-hour loss seems to be consistent, so there would be one peak around 8 hours for uninterrupted tasks, a second around 10 hours for once-interrupted tasks, and smaller and smaller peaks every two hours after that for more and more interruptions.
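
Just to make that expected pattern concrete, here is a trivial sketch of the arithmetic; the ~8-hour base and ~2-hour loss per restart are the figures assumed above, not anything measured on the project side.

    # Expected completion-time peaks if each restart re-crunches ~2 hours of
    # work on top of an ~8-hour task (assumed figures from the post above).
    BASE_HOURS = 8
    LOSS_PER_RESTART = 2

    for restarts in range(4):
        total = BASE_HOURS + restarts * LOSS_PER_RESTART
        print(f"{restarts} restart(s): peak near {total} hours of CPU time")
    # Peaks near 8, 10, 12, 14 hours, each smaller than the last, since fewer
    # and fewer tasks survive that many restarts.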

The rb sick puppies remain around 20% of all rb tasks. In their defense, at least they tell the truth about never completing a checkpoint. They seem to have been getting worse lately, often running from zero without a single checkpoint, so I'm back to scrubbing them from the short-running machine before they get a chance to start.
11) Message boards : Number crunching : Out of work (Message 90561)
Posted 23 Mar 2019 by Profile shanen
Post:
Gives the impression of amateur hour, eh?

As I've said before, my main concern is that it taints the results. Any results reported out of Rosetta@home have to be replicated because it all feels like preliminary work.
12) Message boards : Number crunching : Out of work (Message 90335)
Posted 10 Feb 2019 by Profile shanen
Post:
I don't think there is anything to resolve. They have more crunchers than work. That is great. The scientists can get their stuff done in a timely manner.

On the other hand, if you need to keep your room warm in the winter, there are plenty of other projects.

I actually suspect there is some manual step involved and there is no one around to do it most of the time. Or no one who cares that much.

If it were managed on a reasonable basis, then they would have enough low priority projects to run the rest of the time rather than go spastic (the way it's been running for the last few months). I was hoping to earn 10 million points, but maybe the project will die of neglect before that time. (Am I hoping to trade the 10 million points for a boxtop?)
13) Message boards : Number crunching : Problems with web site (Message 90334)
Posted 10 Feb 2019 by Profile shanen
Post:
User of the day is the long, long, long gone 'fesstess', who checked out in October of 2007.

Kind of amusing in a way. How low key can they go?

On the one hand, I believe that BOINC represents (in some sense) one of the largest supercomputers in the world, and this Rosetta@home project corresponds to one of the oldest and largest applications running on that supercomputer. But on the other hand, it really feels like there isn't much concern on the other side. Checking with my third hand, maybe Rosetta is the BOINC project that doesn't give a ... for contributors who don't give a ... Survival of the dullest? (In the sense of "Who cares?")

And yet I'm closing in on 10 million "points" of "Work done", so I might as well wait that long. Or maybe the Rosetta project will just go away before that?
14) Message boards : Number crunching : Out of work (Message 90178)
Posted 9 Jan 2019 by Profile shanen
Post:
Home for the Christmas holiday?

Christmas has gone...

Yup. I wish I cared what's wrong.
15) Message boards : Number crunching : More checkpointing problems (Message 90136)
Posted 3 Jan 2019 by Profile shanen
Post:
Not sure what you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

This one is mostly for checkpointing problems, which seem less severe than before. They have spread to some of the new subprojects, however.
16) Message boards : Number crunching : Out of work (Message 90135)
Posted 3 Jan 2019 by Profile shanen
Post:
I've also used WCG when this Rosetta project is too flaked out. Actually, I think WCG is pretty flaky, too, but at a more professional level of flakiness. Not sure I would trust the research results from either one of them.

I was actually hoping to find some explanation of the flakiness, an acknowledgement, a solution plan, or even a sign of life in Washington. Might be easier to check on Mars or Ultima Thule?
17) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90037)
Posted 19 Dec 2018 by Profile shanen
Post:
Since this is the preeminent and locked-at-the-top thread and it has such a broad Subject, I was hoping to see something about the current lack of tasks... Server statuses appear to be nominal.

However, I'll mention excessive memory use as an annoying problem on one of my machines with a relatively small SSD. Mostly, though, I blame that on Microsoft for another horrendous update.
18) Message boards : Number crunching : More checkpointing problems (Message 90036)
Posted 19 Dec 2018 by Profile shanen
Post:
Thanks for the data and sorry I haven't been checking in more frequently. Well, not really sorry, since that mostly means there are no problems that seem worth worrying about. Or back to the sorry side again, maybe not visiting just reflects a loss of hope of making things better...

Latest peculiarities:

(1) Tasks that terminate themselves en masse when the computer wakes up. Presumably there is another (possibly new) completion criterion related to wall clock time, and when the computer wakes up many of the tasks discover that they are now regarded as completed. Not bad as a sanity check of some sort.

(2) Sick puppies from new projects, but nothing as prevalent and annoying as the previous ones. Still seeing about 20% of the rb tasks behaving badly, but mostly ignoring that problem except for the 3-day tasks (which still get nuked whenever I spot them in time) and for the one machine with the limited run time.

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...
19) Message boards : Number crunching : More checkpointing problems (Message 89797)
Posted 29 Oct 2018 by Profile shanen
Post:
I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units. The current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it manages to checkpoint itself while I'm writing this message).

During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on.
20) Message boards : Number crunching : More checkpointing problems (Message 89510)
Posted 10 Sep 2018 by Profile shanen
Post:
Followup data: The task with 8 hours uncheckpointed actually did checkpoint sometime before 10 hours, and it finally finished around 12 hours.

Right now I'm actually on a Linux box, one of my machines that rarely runs for a long period. It has a small supply of non-PF... units and none of them appear to be sick puppies. I'm trying to avoid downloading any of the PF... units here, but worse than that, the project has apparently switched to the short-term rb... units. I see that one of them did the fancy finish with the Computation Error. If it crashed quickly (and I suspect it did), then there is little waste of my machine's computation time, but the Rosetta project is still wasting bandwidth for any data that was sent.

It should NOT be a battle to participate "effectively" in the project. If the project is having trouble retaining volunteers, then perhaps there is a connection?

