Posts by shanen

21) Message boards : Number crunching : More checkpointing problems (Message 89507)
Posted 9 Sep 2018 by Profile shanen
Post:
I've actually started looking at the stats. Easier just now since there's nothing but PF units. I have two primary machines that are running twelve tasks between them, and usually 25% to 33% are in the sick puppy category. I have one on this machine that has over 8 hours of computation without a checkpoint. Just my feeling, but I think it will finish around 12 hours, though I doubt it will get the extra 50% of work points that it should get for the extra time...

There's a second task here that's just about to hit two hours without a checkpoint. I'm pretty sure that qualifies as another sick puppy. On the other machine... Two of the four appear to be sick puppies. Grand total is 4/12 sick puppies for the 33% reading, which is typical.

For my machines that run for less than 8 hours at a time, there is no reason to attempt running a sick puppy, but the question is "How soon can I be sure it's a sick puppy and abort it?" There's a startup period when normal tasks aren't checkpointed, but it seems to be variable. There may also be cases of sick puppies that only checkpoint at random intervals, sometimes longer or shorter.
22) Message boards : Number crunching : More checkpointing problems (Message 89493)
Posted 7 Sep 2018 by Profile shanen
Post:
A normal PF unit is one that checkpoints on a reasonable schedule, while a sick puppy is one that can't checkpoint. I also regard tasks that run significantly longer than 8 hours as sick puppies, though this is less sick than the units that can't checkpoint.

Actually, the reason I dropped by today was to ask if there is some difference, some way to predict, which PF units are okay and which are bad. My newest theory is that I should let a possible sick-puppy task run for an hour, and if it hasn't checkpointed by then, I should nuke it. That rule is for the machine that normally runs for short periods, and it only applies when nothing but rb and PF units are coming. Maybe I should reduce the time to 30 minutes?

If there is a mix of units coming, then the optimum algorithm appears to be to nuke the rb and PF units before they waste any run time at all, and just try to make sure the queue is full of units that are unlikely to be sick puppies. I already have a kill-on-sight policy for the short-deadline tasks.

Remember the objective: avoid wasting computing time on tasks that earn no credit. From my side that seems to be the only metric I can apply, yet there are still plenty of times when work appears to be wasted. I also think computational efficiency should be one of the objectives of the Rosetta project.
23) Message boards : Number crunching : New WUs failing (Message 89484)
Posted 4 Sep 2018 by Profile shanen
Post:
Looks to be similar to the problem I just reported for Windows 10 with PF... tasks.
24) Message boards : Number crunching : More checkpointing problems (Message 89483)
Posted 4 Sep 2018 by Profile shanen
Post:
Beware the wrath of PF units?

Sort of joking, but pretty sure that all time invested in the current sick puppy of the PF stripe is going to be wasted. Haven't seen too many of this kind of problem recently, but it has accumulated over 5 hours of run time without a checkpoint. It seems to be making progress, but more slowly than the normal PF tasks.

This is not a heavy use computer, so I probably can't run it long enough to find out for sure, but I'm pretty sure this story ends with failure at some point, and presumably no credit. Insult on the injury, or vice versa? In the heavy usage scenario it would just use up a lot of time until it runs past its deadline and dies for that reason, but in the low usage reality of this computer, it will almost surely get nuked after a reboot has zeroed it. (Obviously there is no reason to repeat the same mistake again and attempt to recompute work that will never earn credit.)

As I've said before, if the tasks are buggy in any visible ways, then that casts doubt on ALL the work of the project. The less visible bugs are the ones to worry about most, but the visible bugs are sufficient to prove the existence of bugs in the project code.

I'd paste the details (Properties) here, but no easy way to do so under Windows 10. I think that's mostly a BOINC-level problem...
25) Message boards : Number crunching : More checkpointing problems (Message 89461)
Posted 31 Aug 2018 by Profile shanen
Post:
Tasks stopped flowing again after a period of flow.

Checkpointing problems roughly unchanged. Most of the problematic ones I've noticed are still rb... tasks.

I looked at a couple of other threads first in hopes of finding some explanation of the problems, but I didn't see any explanation or fix.
26) Message boards : Number crunching : More checkpointing problems (Message 89450)
Posted 27 Aug 2018 by Profile shanen
Post:
Hmm... Seems at least as relevant as the other "active" thread about fewer hosts. The checkpointing problems are continuing, though they do seem less severe these days. Recent ones have mostly involved the bad ol' rb tasks...

However today's proximate problem appears to be a lack of fresh tasks. Not yet critical, but the unreliable supply of work is why I have to keep larger buffers on my machines which then results in throwing away deadline-constrained tasks on slower machines which means that some of the project's bandwidth is being wasted... That used to be a concern, at least at the university level.

Anyway, the server status appears to be nominal. I've never been fully clear on the difference between the "Tasks ready to send" at the upper right and the "Unsent" tasks farther down the page, under the heading of "Tasks by application". The top number is 18,082, which seems to indicate that there is plenty of work to send and it's just not getting sent to my computer. In contrast, the lower numbers could mean that there is almost no work to send and I'm just not lucky enough to get any of it. Under Rosetta it only shows 22 in the Unsent column and Rosetta Mini has 0. Not even certain of this, but pretty sure that my machines are not eligible for the third application category, "Rosetta for Android", even though there are 9991 unsent tasks there.
27) Message boards : Cafe Rosetta : Moderators Contact Point (Explanations, Assistance etc) Post here! (Message 89449)
Posted 27 Aug 2018 by Profile shanen
Post:
Pretty sure this is the wrong place to ask, but it was the most recent sign of life in the forums... I have noticed a similar problem, and I'm pretty sure the condition has been going on for some hours, but I'll post more details over in the Number Crunching forum...
28) Message boards : Number crunching : More checkpointing problems (Message 89357)
Posted 30 Jul 2018 by Profile shanen
Post:
Look, I'm just reporting the problems. It would be nice if they got fixed, but I don't really care. Not sure I ever cared regarding Rosetta, but I can say that I used to care more when I was running WCG and their inability to fix similar problems was probably the main reason I stopped running their projects. Only about a million units of work there, while I'm approaching 8 million on this project.

I continue to believe that the #1 cause of problems and lost work is the use of short-deadline tasks. I do NOT feel any urgency. Just annoyance.

From a scientific perspective, what worries me is NOT the obvious bugs or even the appearance of bugginess if I'm misunderstanding what is going on. What bothers me is that it looks like sloppy coding practices, mostly at the Rosetta end, but also at the BOINC level. In one example discussed elsewhere in this thread, it should actually be a responsibility of the BOINC client to prevent attempted execution of tasks that are incompatible with the particular machine. Remember the first computer proof of the 4-color theorem? It had bugs, though they were fixed later.

Rosetta should also have economic concerns about paying for wasted bandwidth. Downloading lots of data and getting no results is not helping anyone.

I am absolutely uninterested in wasting more of my time trying to tinker with the settings of my various machines to avoid the wastage. I am somewhat annoyed when I have "invested" in electricity and the resulting contribution is lost for reasons outside of my scope. Today's example is only 8 hours and 18 minutes of an rb task that has been stuck on Uploading for several days, and which has now gone past its deadline:

Application: Rosetta Mini 3.78
Name: rb_07_18_84731_126613_ab_stage0_t000___robetta_IGNORE_THE_REST_06_18_682267_4
State: Uploading
Received: Sat 21 Jul 2018 09:55:21 AM JST
Report deadline: Sun 29 Jul 2018 09:55:21 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 07:36:03
Elapsed time: 08:18:41
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
29) Message boards : Number crunching : More checkpointing problems (Message 89251)
Posted 9 Jul 2018 by Profile shanen
Post:
In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I have doubts it is worth too much.

Actually my reason for visiting today was not checkpointing problems, though they persist and are still annoying. On the machine that has the most constraints, I just periodically check the status, and if all of the active tasks have recently checkpointed, then I jump on the opportunity to shut down the machine. When I can't wait and get forced to stop anyway, I try to use the sleep solution instead.
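That "has everything recently checkpointed?" check before shutdown is easy to automate. A sketch under the assumption that you can read each task's checkpoint and current CPU times, e.g. from `boinccmd --get_tasks` (whose field labels may differ by client version):

```python
SAFE_WINDOW = 300  # max seconds of uncheckpointed CPU time I'll risk losing

def safe_to_shutdown(cpu_times, window=SAFE_WINDOW):
    """True when every task has checkpointed within `window` seconds of
    CPU time, so a shutdown now would lose little work.

    `cpu_times` is a sequence of (checkpoint_cpu_time, current_cpu_time)
    pairs, one per running task.
    """
    return all(current - checkpoint <= window for checkpoint, current in cpu_times)
```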

So back to today's problem. Frequent computation errors on DRH tasks. Perhaps Linux specific? I initially thought it was something I was doing, but now I don't think so. Just another bug of some sort.

Since this is a kind of catchall thread (though I did search for more relevant threads to use instead), I'll go ahead and wonder aloud about the "Aborted by project" tasks. There were a bunch of those a while back, then they seemed to have gone away, but now they seem to be returning. Definitely a waste of bandwidth to send me the data and then abort the task from their end... Or maybe it's a race condition between volunteers?
30) Message boards : Number crunching : More checkpointing problems (Message 89151)
Posted 25 Jun 2018 by Profile shanen
Post:
And from a Windows 10 machine, a 4-hour computation error that is probably a checkpointing error in disguise, since it happened when the machine was booted after being shut down. Perhaps diagnostic that another task from the same sub-project managed to complete in just over 4 hours?

Can't paste the Properties from Windows 10. Not even as an image.
31) Message boards : Number crunching : More checkpointing problems (Message 89145)
Posted 25 Jun 2018 by Profile shanen
Post:
Here's another good example of the new checkpointing problem, though perhaps it's better described as likely lost work. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these, the checkpoint will never take place and the task will never complete no matter how long it runs. Buggy, buggy, buggy.

Application: Rosetta Mini 3.78
Name: rb_06_06_83627_125669__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_669729_480
State: Running
Received: Fri 22 Jun 2018 08:20:14 AM JST
Report deadline: Sat 30 Jun 2018 08:20:13 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 02:25:26
CPU time since checkpoint: 00:00:00
Elapsed time: 08:39:27
Estimated time remaining: 03:31:59
Fraction done: 20.198%
Virtual memory size: 155.29 MB
Working set size: 51.39 MB
Directory: slots/4
Process ID: 2359
Progress rate: 2.160% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu

At the same time I notice this machine has a couple of computation error tasks. Let's see if I can catch their Properties, too...


Application: Rosetta 4.07
Name: DRH_curve_X_h30_l3_h23_l2_16685_3_2_loop_21_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_663453_44
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: rosetta_4.07_x86_64-pc-linux-gnu


Application: Rosetta 4.07
Name: DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: 00:00:09
Executable: rosetta_4.07_x86_64-pc-linux-gnu

Also several more of those appeared, all DRH tasks. Buggy, buggy, buggy.
32) Message boards : Number crunching : More checkpointing problems (Message 89132)
Posted 21 Jun 2018 by Profile shanen
Post:
NOT a constructive reply. As if you appreciate your volunteers.

Actually, what it most reminds me of is spineless chicken hawks who thank me for my service. That is NOT why I enlisted, and I do NOT care about your gratitude or pretenses of gratitude or even the opposite in this case. If you served, then you know why and we don't have to thank each other. If you didn't serve when you could have, then I mostly doubt you have any understanding of what service is or why people should do it. (I'm NOT limiting that to military service, by the way. That's another newfangled form of fake patriotism.)

The kindest thing I can say is that 3-day deadlines are bad service in some form, and I don't care about your whiny excuses. Reminds me of an old military expression, which in the cleaned up version goes "Excuses are like armpits. Everyone's got 'em and they all stink."

Oh yeah. Two more things. (1) Large numbers of computation errors, mostly at the beginning and it seems more often under Linux, and (2) Eight more hours of computation lost due to the checkpointing problems.
33) Message boards : Number crunching : More checkpointing problems (Message 89128)
Posted 20 Jun 2018 by Profile shanen
Post:
Managed to capture the Properties after all. Looks like it may have been a regular unit, but if so, it must have been delayed by intervening 3-day tasks:

Application: Rosetta Mini 3.78
Name: rb_06_13_83780_125820__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670302_390
State: Running
Received: Thu 14 Jun 2018 08:28:54 AM JST
Report deadline: Fri 22 Jun 2018 08:28:53 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 03:55:32
CPU time since checkpoint: 00:00:42
Elapsed time: 04:01:49
Estimated time remaining: 03:54:42
Fraction done: 48.720%
Virtual memory size: 348.68 MB
Working set size: 290.45 MB
Directory: slots/3
Process ID: 1485
Progress rate: 11.520% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
34) Message boards : Number crunching : More checkpointing problems (Message 89127)
Posted 20 Jun 2018 by Profile shanen
Post:
Just confirmed a new version of the checkpointing problem. I had suspected something along those lines. It was a 3-day rb... unit this time. The Properties showed that it had about 30 minutes until it would finish, but it had been checkpointed 00:00 minutes ago. Usually that's supposed to mean it just finished checkpointing, but the value didn't change over several minutes. Highly suspicious. So I went ahead and shut down the machine anyway, and sure enough, it was the status % that was correct, and after I booted the machine the next time, it suddenly was 4 hours from completion--which basically guarantees the task will miss its deadline.

Do I need to say again that the 3-day deadline is fundamentally unreasonable, and even less reasonable when the checkpointing code is buggy, too?

Right now I suspect a lot of these rush units are really caused by what I regard as essentially bad project management and buggy programming. I would send along the details, but right now this Linux box is also unable to open the BOINC Manager. Happens pretty often, and I'm pretty sure the trick is to get it open (on both of my Linux boxen) before the Rosetta tasks have eaten up too much memory. For a long time I thought that was a BOINC-level problem, but considering some of the memory allocation problems mentioned elsewhere in this thread, I'm leaning back towards the caused-by-Rosetta hypothesis.
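The ambiguity above ("checkpointed 00:00 ago" can mean either a fresh checkpoint or a frozen counter) can be resolved by sampling more than once and seeing whether the checkpoint value ever moves. A sketch only; `read_times` is a hypothetical helper standing in for however you read the two CPU-time fields, e.g. parsed out of `boinccmd --get_tasks`:

```python
import time

def checkpoint_frozen(read_times, interval=120, samples=3, sleep=time.sleep):
    """Distinguish 'just checkpointed' from 'checkpoint counter frozen'.

    `read_times` is a zero-argument callable returning the pair
    (checkpoint_cpu_time, current_cpu_time) for one task.  If CPU time
    keeps advancing across `samples` readings while the checkpoint value
    never changes, the task looks frozen (a sick puppy); if the
    checkpoint moves at all, it is healthy.
    """
    ckpt0, cur0 = read_times()
    cur = cur0
    for _ in range(samples - 1):
        sleep(interval)
        ckpt, cur = read_times()
        if ckpt != ckpt0:
            return False  # checkpoint advanced: healthy
    # Checkpoint never moved; call it frozen only if the task was computing.
    return cur > cur0
```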
35) Message boards : Number crunching : More checkpointing problems (Message 89096)
Posted 10 Jun 2018 by Profile shanen
Post:
Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received.

Not motivating.

As a volunteer the demotivating part might be my main concern, but as a wannabe or former or retired scientist of some sort, my primary concern is actually what it says about the quality of the code. GIGO is not the only way to produce worthless results. Even the best data with bad analysis or with programming flaws will also produce garbage.

Right now I have another task that looks extremely similar to the one that just died in spasms of computation error. I am NOT predicting a happy ending for it.

By the way, it was also one of those especially troublesome 3-day-deadline tasks. At this point it looks like a race between timing out, blowing up in a computation error, and perhaps getting aborted by the project. (I just saw one of those hit a checkpointed task with 3 hours of work that was apparently tossed.) This rush task stuff reminds me of "More haste, less speed."
36) Message boards : Number crunching : More checkpointing problems (Message 89079)
Posted 8 Jun 2018 by Profile shanen
Post:
Once again I'm trying to shut down the computer and there's a task with a lot of uncheckpointed work. It's an rb... this time, which sometimes happens.

It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code.

Another interpretation is that they just don't care about how much of the donors' efforts and electricity they waste. If they actually did care, they would actually be able to see the results of reduced throughput for tasks with long checkpoints. In some cases, a computer could get stuffed with tasks that never make progress, constantly restarting until they get killed for passing their deadlines.

Right now I just nuke 3-day deadline tasks and nRoCM tasks on sight, as long as they haven't done much work. That way I eliminate most of the problems in advance, at the cost of wasting some bandwidth for discarded data.
37) Message boards : Number crunching : More checkpointing problems (Message 89066)
Posted 6 Jun 2018 by Profile shanen
Post:
Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now. That and the cursed 3-day deadline tasks are making this project into too much of a headache, notwithstanding having passed 7 million points...
38) Message boards : Number crunching : More checkpointing problems (Message 89028)
Posted 30 May 2018 by Profile shanen
Post:
This time it's the tasks named nRoCM....
39) Message boards : Number crunching : Output versus work unit size (Message 88996)
Posted 26 May 2018 by Profile shanen
Post:
It is one of the most important projects for its science and potential benefits. I believe they just rush the work into production without testing it thoroughly.

You are repeating one of my oft-repeated concerns: Any programs that are supposed to be producing scientific results need to be "tested thoroughly" or the research itself becomes questionable. The project staff seems to have a rather cavalier attitude towards testing, but maybe that's only on the side of the software that the volunteers see. Looks buggy to us, but maybe it's perfect on the results side. (But I doubt it and I strongly hope that they are running all crucial results several times in several ways.)

From what I've seen, if I were still a senior referee for the IEEE Computer Society and were reviewing a paper that relied on data from Rosetta@home calculations, I would start out with a highly skeptical attitude. At a minimum I would want to know that the code was audited, but more likely I would ask for replication of the key calculations by some other researchers.

Right now I'm just a volunteer, and my main annoyance is the 3-day deadlines. I'm mostly nuking those pending tasks on sight and NOT feeling sorry about wasting the project's bandwidth. Not at all.
40) Message boards : Number crunching : Output versus work unit size (Message 88985)
Posted 24 May 2018 by Profile shanen
Post:
Frankly amusing to see people worrying about such details as regards THIS particular BOINC project. I actually stopped by to wonder how many other volunteers are nuking 3-day projects on sight.





©2024 University of Washington
https://www.bakerlab.org