Posts by shanen

1) Message boards : Number crunching : Computation errors (Message 90934)
Posted 24 Jul 2019 by shanen
Post:
Seems unlikely they've ever addressed this problem, eh? I see them pretty often. Especially annoying when they have run up 8 hours of effort before crashing, presumably with no points earned. And no, at this point I don't care enough to do the searching to try to figure out if the points were granted. I don't even care enough to read the rest of the thread beyond the Subject: and glancing at a couple of the posts.

Latest example:

Application: Rosetta Mini 3.78
Name: start_close_HHH_rd4_0056.min_rise1.83_whole_pass_aagb.bp_20190406150644_0001_0001_0001_0003_0001_0001_fragments_fold_SAVE_ALL_OUT_833066_1053
State: Computation error
Received: 2019-07-22 08:13:16
Report deadline: 2019-07-30 08:13:11
Estimated computation size: 80,000 GFLOPs
CPU time: 07:49:11
Elapsed time: 07:59:03
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
2) Message boards : Number crunching : More checkpointing problems (Message 90888)
Posted 4 Jul 2019 by shanen
Post:
More sick puppies to report. Names start with "Cx_" where I have noticed x values from 3 to 5. Especially annoying in that the tasks claim to be checkpointing properly, but are lying about it. If you look at the Properties, it will say there was a recent checkpoint, perhaps a minute ago, but if you then reboot the computer, it typically loses 20% of its progress, representing about two hours of work. The elapsed time is conserved. In today's example, the task had over 7 hours in the Elapsed column and Remaining was under an hour, but after rebooting the computer, Elapsed was still over 7, but Progress had fallen to 60% and Remaining was over 3 hours.

Usually I spot these things on a computer that only runs for a few hours at a time. However, this time I actually noticed it during the major OS upgrades last month. Just confirmed it on the short-running computer.

On your [the project management's] side it should probably show as a series of peaks in completion times. At least on the evidence I've noticed, the 2-hour loss seems to be consistent, so there would be one peak around 8 hours for uninterrupted tasks, a second around 10 hours for once-interrupted tasks, and smaller and smaller peaks each two hours after that for more and more interruptions.
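The peak pattern described above can be sketched as a toy model. The 8-hour baseline and the 2-hour loss per interruption are taken from the observations in this post; the function and constant names are made up for illustration, and real completion times would of course be noisier.

```python
# Toy model of the completion-time peaks described in the post.
# Assumptions (from the poster's observations, not from project data):
#   - an uninterrupted task finishes in about 8 hours
#   - each interruption (reboot) throws away about 2 hours of progress

BASE_HOURS = 8.0
LOSS_PER_INTERRUPTION_HOURS = 2.0


def expected_completion_hours(interruptions: int) -> float:
    """Total run time for a task that was interrupted this many times."""
    if interruptions < 0:
        raise ValueError("interruptions must be non-negative")
    return BASE_HOURS + LOSS_PER_INTERRUPTION_HOURS * interruptions


def peak_positions(max_interruptions: int) -> list[float]:
    """Where the peaks would sit for 0..max_interruptions interruptions."""
    return [expected_completion_hours(n) for n in range(max_interruptions + 1)]


if __name__ == "__main__":
    for n, hours in enumerate(peak_positions(3)):
        print(f"{n} interruption(s): about {hours:g} hours")
```

If the loss really is consistent, a histogram of completion times on the server side would show bumps at roughly these positions, each smaller than the one before it.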

The rb sick puppies remain around 20% of all rb tasks. In their defense, at least they tell the truth about never completing a checkpoint. They seemed to be getting worse lately, often running from zero without a single checkpoint, so I'm back to scrubbing them from the short-running machine before they get a chance to start.
3) Message boards : Number crunching : Out of work (Message 90561)
Posted 23 Mar 2019 by shanen
Post:
Gives the impression of a rather amateurish hour, eh?

As I've said before, my main concern is that it taints the results. Any results reported out of Rosetta@home have to be replicated because it all feels like preliminary work.
4) Message boards : Number crunching : Out of work (Message 90335)
Posted 10 Feb 2019 by shanen
Post:
I don't think there is anything to resolve. They have more crunchers than work. That is great. The scientists can get their stuff done in a timely manner.

On the other hand, if you need to keep your room warm in the winter, there are plenty of other projects.

I actually suspect there is some manual step involved and there is no one around to do it most of the time. Or no one who cares that much.

If it were managed on a reasonable basis, then they would have enough low priority projects to run the rest of the time rather than go spastic (the way it's been running for the last few months). I was hoping to earn 10 million points, but maybe the project will die of neglect before that time. (Am I hoping to trade the 10 million points for a boxtop?)
5) Message boards : Number crunching : Problems with web site (Message 90334)
Posted 10 Feb 2019 by shanen
Post:
User of the day is long long long gone 'fesstess' who checked out October of 2007.

Kind of amusing in a way. How low key can they go?

On the one hand, I believe that BOINC represents (in some sense) one of the largest supercomputers in the world, and this Rosetta@home project corresponds to one of the oldest and largest applications running on that supercomputer. But on the other hand, it really feels like there isn't much concern on the other side. Checking with my third hand, maybe Rosetta is the BOINC project that doesn't give a ... for contributors who don't give a ... Survival of the dullest? (In the sense of "Who cares?")

And yet I'm closing in on 10 million "points" of "Work done", so I might as well wait that long. Or maybe the Rosetta project will just go away before that?
6) Message boards : Number crunching : Out of work (Message 90178)
Posted 9 Jan 2019 by shanen
Post:
Home for the Christmas holiday?

Christmas has gone...

Yup. I wish I cared what's wrong.
7) Message boards : Number crunching : More checkpointing problems (Message 90136)
Posted 3 Jan 2019 by shanen
Post:
Not sure where you were referencing, but if you mean the top thread in the "Number crunching" forum, then it's rarely useful. Currently it's 10 days old.

This one is mostly for checkpointing problems, which seem less severe than before. They have spread to some of the new subprojects, however.
8) Message boards : Number crunching : Out of work (Message 90135)
Posted 3 Jan 2019 by shanen
Post:
I've also used WCG when this Rosetta project is too flaked out. Actually, I think WCG is pretty flaky, too, but at a more professional level of flakiness. Not sure I would trust the research results from either one of them.

I was actually hoping to find some explanation of the flakiness, an acknowledgement, a solution plan, or even a sign of life in Washington. Might be easier to check on Mars or Ultima Thule?
9) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 90037)
Posted 19 Dec 2018 by shanen
Post:
Since this is the preeminent and locked-at-the-top thread and it has such a broad Subject, I was hoping to see something about the current lack of tasks... Server statuses appear to be nominal.

I'll also mention excessive memory use as an annoying problem on one of my machines with a relatively small SSD. Mostly, though, I blame that on Microsoft for another horrendous update.
10) Message boards : Number crunching : More checkpointing problems (Message 90036)
Posted 19 Dec 2018 by shanen
Post:
Thanks for the data and sorry I haven't been checking in more frequently. Well, not really sorry, since that mostly means there are no problems that seem worth worrying about. Or back to the sorry side again, maybe not visiting just reflects a loss of hope of making things better...

Latest peculiarities:

(1) Tasks that terminate themselves en masse when the computer wakes up. Presumably there is another (possibly new) completion criterion related to wall clock time, and when the computer wakes up many of the tasks discover that they are now regarded as completed. Not bad as a sanity check of some sort.

(2) Sick puppies from new projects, but nothing as prevalent and annoying as the previous ones. Still seeing about 20% of the rb tasks behaving badly, but mostly ignoring that problem except for the 3-day tasks (which still get nuked whenever I spot them in time) and for the one machine with the limited run time.

Today's visit was actually provoked by another out-of-tasks condition, so off to look for relevant posts...
11) Message boards : Number crunching : More checkpointing problems (Message 89797)
Posted 29 Oct 2018 by shanen
Post:
I wonder if that's in reference to the PF problems? Still running about 25% sick puppies when I don't get them nuked before they start. Same policy towards rb units. Current puppy has over an hour with no checkpoint, and I want to reboot the machine, so I've already queued some "safe" tasks and will nuke that one before shutting down (unless it manages to checkpoint itself while I'm writing this message).

During the recent task shortage I actually switched to a different project. I noticed that most of their tasks are on the order of 2 to 4 hours now. If the goal of longer work units is to save bandwidth, it certainly doesn't seem to be working in my case with all the nuking of likely sick puppies and other problematic work units that's going on.
12) Message boards : Number crunching : More checkpointing problems (Message 89510)
Posted 10 Sep 2018 by shanen
Post:
Followup data: The task with 8 hours uncheckpointed actually did checkpoint sometime before 10 hours and it finally finished around 12 hours.

Right now I'm actually on a Linux box, one of my machines that rarely runs for a long period. It has a small supply of non PF... units and none of them appear to be sick puppies. I'm trying to avoid downloading any of the PF... units here, but worse than that, the project has apparently switched to the short-term rb... units. I see that one of them did the fancy finish with the Computation Error. If it crashed quickly (and I suspect it did), then there is little waste of my machine's computation time, but the Rosetta project is just wasting bandwidth for any data that was sent.

It should NOT be a battle to participate "effectively" in the project. If the project is having trouble retaining volunteers, then perhaps there is a connection?
13) Message boards : Number crunching : More checkpointing problems (Message 89507)
Posted 9 Sep 2018 by shanen
Post:
I've actually started looking at the stats. Easier just now since there are nothing but PF units. I have two primary machines that are running twelve tasks between them, and usually 25% to 33% are in the sick puppy category. I have one on this machine that has over 8 hours of computation without a checkpoint. Just my feeling, but I think it will finish around 12 hours, but I doubt it will get the extra 50% of work points that it should get for the extra time...

There's a second task here that's just about to hit two hours without a checkpoint. I'm pretty sure that qualifies as another sick puppy. On the other machine... Two of the four appear to be sick puppies. Grand total is 4/12 sick puppies for the 33% reading, which is typical.

For my machines that run for less than 8 hours at a time, there is no reason to attempt running a sick puppy, but the question is "How soon can I be sure it's a sick puppy and abort it?" There's a startup period when normal tasks aren't checkpointed, but it seems to be variable. There may also be cases of sick puppies that only checkpoint at random intervals, sometimes longer or shorter.
14) Message boards : Number crunching : More checkpointing problems (Message 89493)
Posted 7 Sep 2018 by shanen
Post:
A normal PF unit is one that checkpoints on a reasonable schedule, while a sick puppy is one that can't checkpoint. I also regard tasks that run significantly longer than 8 hours as sick puppies, though this is less sick than the units that can't checkpoint.

Actually, the reason I dropped by today was to ask if there is some difference, some way to predict, which PF units are okay and which are bad. My newest theory is that I should let a possible sick-puppy task run for an hour, and if it hasn't checkpointed, then I should nuke it. That's for the machine that normally runs for short time periods and the rule only applies when there are only rb and PF units coming. Maybe I should reduce the time to 30 minutes?

If there is a mix of units coming, then the optimum algorithm appears to be to nuke the rb and PF units before they waste any run time at all, and just try to make sure the queue is full of units that are unlikely to be sick puppies. I already have a kill-on-sight policy for the short-deadline tasks.
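The one-hour rule from this post can be written down as a small decision function. This is only a sketch of the heuristic as described here: the `TaskStatus` record, the prefix list, and the function names are hypothetical, and nothing below reads real data from the BOINC client.

```python
from dataclasses import dataclass

# Name prefixes the post treats as likely sick puppies (assumption:
# matching on the start of the task name is good enough).
SUSPECT_PREFIXES = ("rb", "PF")


@dataclass
class TaskStatus:
    """Hypothetical snapshot of one task, filled in by hand or by a script."""
    name: str
    elapsed_hours: float
    has_checkpointed: bool


def is_suspect(task: TaskStatus) -> bool:
    """True if the task belongs to a family known for checkpoint trouble."""
    return task.name.startswith(SUSPECT_PREFIXES)


def should_abort(task: TaskStatus, grace_hours: float = 1.0) -> bool:
    """Abort a suspect task that has run past the grace period
    without ever writing a checkpoint."""
    return (
        is_suspect(task)
        and not task.has_checkpointed
        and task.elapsed_hours >= grace_hours
    )


if __name__ == "__main__":
    tasks = [
        TaskStatus("rb_example_task", 1.5, False),  # past grace, no checkpoint
        TaskStatus("PF_example_task", 0.4, False),  # still in startup period
        TaskStatus("other_example_task", 3.0, False),  # not a suspect family
    ]
    for t in tasks:
        print(t.name, "-> abort" if should_abort(t) else "-> keep")
```

Tightening the rule to 30 minutes, as considered in the post, is just `should_abort(task, grace_hours=0.5)`; the trade-off is a higher chance of killing a healthy task that is still in its startup period.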

Remember the objective: Avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply. There are still plenty of times when work appears to be wasted, though. I also think computational efficiency should be one of the objectives of the Rosetta project.
15) Message boards : Number crunching : New WUs failing (Message 89484)
Posted 4 Sep 2018 by shanen
Post:
Looks to be similar to the problem I just reported for Windows 10 with PF... tasks.
16) Message boards : Number crunching : More checkpointing problems (Message 89483)
Posted 4 Sep 2018 by shanen
Post:
Beware the wrath of PF units?

Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks.

This is not a heavy use computer, so I probably can't run it long enough to find out for sure, but I'm pretty sure this story ends with failure at some point, and presumably no credit. Insult on the injury, or vice versa? In the heavy usage scenario it would just use up a lot of time until it runs past its deadline and dies for that reason, but in the low usage reality of this computer, it will almost surely get nuked after a reboot has zeroed it. (Obviously there is no reason to repeat the same mistake again and attempt to recompute work that will never earn credit.)

As I've said before, if the tasks are buggy in any visible ways, then that casts doubt on ALL the work of the project. The less visible bugs are the ones to worry about most, but the visible bugs are sufficient to prove the existence of bugs in the project code.

I'd paste the details (Properties) here, but no easy way to do so under Windows 10. I think that's mostly a BOINC-level problem...
17) Message boards : Number crunching : More checkpointing problems (Message 89461)
Posted 31 Aug 2018 by shanen
Post:
Tasks stopped flowing again after a period of flow.

Checkpointing problems roughly unchanged. Most of the problematic ones I've noticed are still rb... tasks.

I looked at a couple of other threads first in hopes of finding some explanation of the problems, but as for any explanation or fix, it would seem not so much.
18) Message boards : Number crunching : More checkpointing problems (Message 89450)
Posted 27 Aug 2018 by shanen
Post:
Hmm... Seems at least as relevant as the other "active" thread about fewer hosts. The checkpointing problems are continuing, though they do seem less severe these days. Recent ones have mostly involved the bad ol' rb tasks...

However today's proximate problem appears to be a lack of fresh tasks. Not yet critical, but the unreliable supply of work is why I have to keep larger buffers on my machines which then results in throwing away deadline-constrained tasks on slower machines which means that some of the project's bandwidth is being wasted... That used to be a concern, at least at the university level.

Anyway, the server status appears to be nominal. I've never been fully clear on the difference between the "Tasks ready to send" at the upper right and the "Unsent" tasks farther down the page, under the heading of "Tasks by application". The top number is 18,082, which seems to indicate that there is plenty of work to send and it's just not getting sent to my computer. In contrast, the lower numbers could mean that there is almost no work to send and I'm just not lucky enough to get any of it. Under Rosetta it only shows 22 in the Unsent column and Rosetta Mini has 0. Not even certain of this, but pretty sure that my machines are not eligible for the third application category, "Rosetta for Android", even though there are 9991 unsent tasks there.
19) Message boards : Cafe Rosetta : Moderators Contact Point (Explanations, Assistance etc) Post here! (Message 89449)
Posted 27 Aug 2018 by shanen
Post:
Pretty sure this is the wrong place to ask, but it was the most recent sign of life in the forums... I have noticed a similar problem, and I'm pretty sure the condition has been going on for some hours, but I'll post more details over in the Number Crunching forum...
20) Message boards : Number crunching : More checkpointing problems (Message 89357)
Posted 30 Jul 2018 by shanen
Post:
Look, I'm just reporting the problems. It would be nice if they got fixed, but I don't really care. Not sure I ever cared regarding Rosetta, but I can say that I used to care more when I was running WCG and their inability to fix similar problems was probably the main reason I stopped running their projects. Only about a million units of work there, while I'm approaching 8 million on this project.

I continue to believe that the #1 cause of problems and lost work is the use of short-deadline tasks. I do NOT feel any urgency. Just annoyance.

From a scientific perspective, what worries me is NOT the obvious bugs or even the appearance of bugginess if I'm misunderstanding what is going on. What bothers me is that it looks like sloppy coding practices, mostly at the Rosetta end, but also at the BOINC level. In one example discussed elsewhere in this thread, it should actually be a responsibility of the BOINC client to prevent attempted execution of tasks that are incompatible with the particular machine. Remember the first computer proof of the 4-color theorem? Retracted for bugs, though they fixed them later.

Rosetta should also have economic concerns about paying for wasted bandwidth. Downloading lots of data and getting no results is not helping anyone.

I am absolutely uninterested in wasting more of my time trying to tinker with the settings of my various machines to avoid the wastage. I am somewhat annoyed when I have "invested" in electricity and the resulting contribution is lost for reasons outside of my scope. Today's example is only 8 hours and 18 minutes of an rb task that has been stuck on Uploading for several days, and which has now gone past its deadline:

Application: Rosetta Mini 3.78
Name: rb_07_18_84731_126613_ab_stage0_t000___robetta_IGNORE_THE_REST_06_18_682267_4
State: Uploading
Received: Sat 21 Jul 2018 09:55:21 AM JST
Report deadline: Sun 29 Jul 2018 09:55:21 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 07:36:03
Elapsed time: 08:18:41
Executable: minirosetta_3.78_x86_64-pc-linux-gnu


©2019 University of Washington
http://www.bakerlab.org