Posts by shanen

21) Message boards : Number crunching : More checkpointing problems (Message 89251)
Posted 9 Jul 2018 by shanen
Post:
In case it isn't clear enough, I'm trying not to care more than the project is worth. These days I doubt it is worth much.

Actually my reason for visiting today was not checkpointing problems, though they persist and are still annoying. On the machine that has the most constraints, I just periodically check the status, and if all of the active tasks have recently checkpointed, then I jump on the opportunity to shut down the machine. When I can't wait for that and am forced to stop anyway, I try the sleep solution instead.
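The shutdown rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the task names, and the 5-minute threshold are all hypothetical, and the input values are meant to be copied from the "CPU time since checkpoint" field of each active task's Properties.

```python
from typing import Dict


def parse_hms(text: str) -> int:
    """Convert an 'HH:MM:SS' string, as shown in Properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def safe_to_shut_down(since_checkpoint: Dict[str, str],
                      max_loss_seconds: int = 300) -> bool:
    """Return True when every active task checkpointed recently.

    since_checkpoint maps a task name to its 'CPU time since checkpoint'
    value. max_loss_seconds is an assumed tolerance (5 minutes), not a
    BOINC setting.
    """
    return all(parse_hms(value) <= max_loss_seconds
               for value in since_checkpoint.values())


# Hypothetical readings for two active tasks.
print(safe_to_shut_down({"task_a": "00:00:42", "task_b": "00:03:10"}))
print(safe_to_shut_down({"task_a": "00:00:42", "task_c": "00:44:04"}))
```

One caveat from elsewhere in this thread: a value that sits at 00:00:00 for several minutes can belong to a task that is not checkpointing at all, so a check like this is a hint rather than a guarantee.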

So back to today's problem. Frequent computation errors on DRH tasks. Perhaps Linux specific? I initially thought it was something I was doing, but now I don't think so. Just another bug of some sort.

Since this is a kind of catchall thread (though I did search for more relevant threads to use instead), I'll go ahead and wonder aloud about the "Aborted by project" tasks. There were a bunch of those a while back, then they seemed to go away, but now they seem to be returning. Definitely a waste of bandwidth to send me the data and then abort the task from their end... Or maybe it's a race condition between volunteers?
22) Message boards : Number crunching : More checkpointing problems (Message 89151)
Posted 25 Jun 2018 by shanen
Post:
And from a Windows 10 machine, a 4-hour computation error that is probably a checkpointing error in disguise, since it happened when the machine was booted after being shut down. Perhaps diagnostic that another task from the same sub-project managed to complete in just over 4 hours?

Can't paste the Properties from Windows 10. Not even as an image.
23) Message boards : Number crunching : More checkpointing problems (Message 89145)
Posted 25 Jun 2018 by shanen
Post:
Here's another good example of the new checkpointing problem, though perhaps it's better described as a lost-work problem. I noticed that the CPU time is also frozen, though the elapsed time is increasing. Based on prior experience with these tasks, the checkpoint will never take place, and the task will never be completed no matter how long it runs. Buggy, buggy, buggy.

Application: Rosetta Mini 3.78
Name: rb_06_06_83627_125669__t000__0_C2_SAVE_ALL_OUT_IGNORE_THE_REST_669729_480
State: Running
Received: Fri 22 Jun 2018 08:20:14 AM JST
Report deadline: Sat 30 Jun 2018 08:20:13 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 02:25:26
CPU time since checkpoint: 00:00:00
Elapsed time: 08:39:27
Estimated time remaining: 03:31:59
Fraction done: 20.198%
Virtual memory size: 155.29 MB
Working set size: 51.39 MB
Directory: slots/4
Process ID: 2359
Progress rate: 2.160% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
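The "CPU time frozen while elapsed time keeps growing" symptom in the listing above can be turned into a rough number by dividing CPU time by elapsed time. The sketch below is only an illustration: the function names and the 0.5 cutoff are assumptions, not anything BOINC or Rosetta defines.

```python
def parse_hms(text: str) -> int:
    """Convert an 'HH:MM:SS' string, as shown in Properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of wall-clock time the task actually spent on the CPU."""
    elapsed = parse_hms(elapsed_time)
    if elapsed == 0:
        return 0.0  # nothing has run yet, so there is no ratio to report
    return parse_hms(cpu_time) / elapsed


def looks_stalled(cpu_time: str, elapsed_time: str,
                  min_ratio: float = 0.5) -> bool:
    """Flag a task whose CPU time lags far behind its elapsed time.

    min_ratio is an assumed cutoff chosen for illustration only.
    """
    return cpu_efficiency(cpu_time, elapsed_time) < min_ratio


# Values from the listing above: 02:25:26 of CPU time in 08:39:27 elapsed.
ratio = cpu_efficiency("02:25:26", "08:39:27")
print(f"{ratio:.2f}", looks_stalled("02:25:26", "08:39:27"))
```

For this task the ratio comes out at roughly 0.28, which is consistent with a process that stopped consuming CPU hours ago. A low ratio can also come from a machine that was suspended or heavily loaded, so it is a reason to look closer rather than proof of a bug.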

At the same time I notice this machine has a couple of computation error tasks. Let's see if I can catch their Properties, too...


Application: Rosetta 4.07
Name: DRH_curve_X_h30_l3_h23_l2_16685_3_2_loop_21_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_663453_44
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: ---
Executable: rosetta_4.07_x86_64-pc-linux-gnu


Application: Rosetta 4.07
Name: DRH_curve_X_h19_l3_h26_l3_06738_1_loop_63_0001_one_capped_0001_fragments_fold_SAVE_ALL_OUT_666406_131
State: Computation error
Received: Mon 25 Jun 2018 06:15:24 PM JST
Report deadline: Tue 03 Jul 2018 06:15:24 PM JST
Estimated computation size: 80,000 GFLOPs
CPU time: ---
Elapsed time: 00:00:09
Executable: rosetta_4.07_x86_64-pc-linux-gnu

Also several more of those appeared, all DRH tasks. Buggy, buggy, buggy.
24) Message boards : Number crunching : More checkpointing problems (Message 89132)
Posted 21 Jun 2018 by shanen
Post:
NOT a constructive reply. As if you appreciate your volunteers.

Actually, what it most reminds me of is spineless chicken hawks who thank me for my service. That is NOT why I enlisted, and I do NOT care about your gratitude or pretenses of gratitude or even the opposite in this case. If you served, then you know why and we don't have to thank each other. If you didn't serve when you could have, then I mostly doubt you have any understanding of what service is or why people should do it. (I'm NOT limiting that to military service, by the way. That's another newfangled form of fake patriotism.)

The kindest thing I can say is that 3-day deadlines are bad service in some form, and I don't care about your whiny excuses. Reminds me of an old military expression, which in the cleaned up version goes "Excuses are like armpits. Everyone's got 'em and they all stink."

Oh yeah. Two more things. (1) Large numbers of computation errors, mostly at the beginning and it seems more often under Linux, and (2) Eight more hours of computation lost due to the checkpointing problems.
25) Message boards : Number crunching : More checkpointing problems (Message 89128)
Posted 20 Jun 2018 by shanen
Post:
Managed to capture the Properties after all. Looks like it may have been a regular unit, but if so, it must have been delayed by intervening 3-day tasks:

Application: Rosetta Mini 3.78
Name: rb_06_13_83780_125820__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_670302_390
State: Running
Received: Thu 14 Jun 2018 08:28:54 AM JST
Report deadline: Fri 22 Jun 2018 08:28:53 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 03:55:32
CPU time since checkpoint: 00:00:42
Elapsed time: 04:01:49
Estimated time remaining: 03:54:42
Fraction done: 48.720%
Virtual memory size: 348.68 MB
Working set size: 290.45 MB
Directory: slots/3
Process ID: 1485
Progress rate: 11.520% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
26) Message boards : Number crunching : More checkpointing problems (Message 89127)
Posted 20 Jun 2018 by shanen
Post:
Just confirmed a new version of the checkpointing problem. I had suspected something along those lines. It was a 3-day rb... unit this time. The Properties showed that it had about 30 minutes until it would finish, but it had been checkpointed 00:00 minutes ago. Usually that's supposed to mean it just finished checkpointing, but the value didn't change over several minutes. Highly suspicious. So I went ahead and shut down the machine anyway, and sure enough, it was the status % that was correct, and after I booted the machine the next time, it suddenly was 4 hours from completion--which basically guarantees the task will miss its deadline.

Do I need to say again that the 3-day deadline is fundamentally unreasonable, and much less reasonable when the checkpointing code is buggy, too?

Right now I suspect a lot of these rush units are really caused by what I regard as essentially bad project management and buggy programming. I would send along the details, but right now this Linux box is also unable to open the BOINC Manager. Happens pretty often, and I'm pretty sure the trick is to get it open (on both of my Linux boxen) before the Rosetta tasks have eaten up too much memory. For a long time I thought that was a BOINC-level problem, but considering some of the memory allocation problems mentioned elsewhere in this thread, I'm leaning back towards the caused-by-Rosetta hypothesis.
27) Message boards : Number crunching : More checkpointing problems (Message 89096)
Posted 10 Jun 2018 by shanen
Post:
Almost 10 hours of work on that task. Refused to checkpoint at any point, so all of the work was apparently held in memory with NO intermediate results. Suddenly ended with a computation error and presumably no credit received.

Not motivating.

As a volunteer the demotivating part might be my main concern, but as a wannabe or former or retired scientist of some sort, my primary concern is actually what it says about the quality of the code. GIGO is not the only way to produce worthless results. Even the best data with bad analysis or with programming flaws will also produce garbage.

Right now I have another task that looks extremely similar to the one that just died in spasms of computation error. I am NOT predicting a happy ending for it.

By the way, it was also one of those especially troublesome 3-day-deadline tasks. At this point I think it's looking like it's in a race condition between timing out, blowing up in a computation error, or perhaps getting aborted by the project. (Just saw one of those hit a checkpointed task with 3 hours of work that was apparently tossed.) This rush task stuff reminds me of "More haste, less speed."
28) Message boards : Number crunching : More checkpointing problems (Message 89079)
Posted 8 Jun 2018 by shanen
Post:
Once again I'm trying to shut down the computer and there's a task with a lot of uncheckpointed work. It's an rb... this time, which sometimes happens.

It might be "normal", but it's an excuse and everyone has 'em and they all... I certainly hope the code is properly checked and tested on the real scientific results side, but on the volunteer side, it sure looks like they aren't particularly competent coders. As I've noted before, if I were still refereeing papers for the journals, and someone submitted a paper that was based on results from rosetta@home, I would be extremely curious and concerned about the quality of the code.

Another interpretation is that they just don't care about how much of the donors' efforts and electricity they waste. If they actually did care, they would actually be able to see the results of reduced throughput for tasks with long checkpoints. In some cases, a computer could get stuffed with tasks that never make progress, constantly restarting until they get killed for passing their deadlines.

Right now I just nuke 3-day deadline tasks and nRoCM tasks on sight, as long as they haven't done much work. That way I eliminate most of the problems in advance, at the cost of wasting some bandwidth for discarded data.
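The rule of thumb in the last paragraph could be written down roughly as follows. This is a sketch of the policy as stated, not a tool that talks to BOINC: the function name, the 10-minute definition of "not much work", and the example task names are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple


def should_abort(name: str,
                 received: datetime,
                 deadline: datetime,
                 cpu_seconds: float,
                 max_cpu_seconds: float = 600.0,
                 short_deadline: timedelta = timedelta(days=3),
                 risky_prefixes: Tuple[str, ...] = ("nRoCM",)) -> bool:
    """Decide whether a task matches the 'nuke on sight' rule.

    A task qualifies only if it has done little work so far AND it either
    has a deadline of three days or less or carries a troublesome name
    prefix. max_cpu_seconds (10 minutes) is an assumed threshold.
    """
    if cpu_seconds > max_cpu_seconds:
        return False  # too much work invested to throw away
    has_short_deadline = (deadline - received) <= short_deadline
    has_risky_name = name.startswith(risky_prefixes)
    return has_short_deadline or has_risky_name


# Hypothetical tasks: both received at the same moment, neither started yet.
received = datetime(2018, 6, 8, 8, 0, 0)
print(should_abort("rb_example", received, received + timedelta(days=3), 0.0))
print(should_abort("rb_example", received, received + timedelta(days=8), 0.0))
```

Keeping the "little work done" test first reflects the stated trade-off: the cost of aborting is limited to the bandwidth for the discarded download, never hours of computation.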
29) Message boards : Number crunching : More checkpointing problems (Message 89066)
Posted 6 Jun 2018 by shanen
Post:
Got another nRoCM... task with over three hours of uncheckpointed work on it, and I want to shut down the computer now. That and the cursed 3-day deadline tasks are making this project into too much of a headache, notwithstanding having passed 7 million points...
30) Message boards : Number crunching : More checkpointing problems (Message 89028)
Posted 30 May 2018 by shanen
Post:
This time it's the tasks named nRoCM....
31) Message boards : Number crunching : Output versus work unit size (Message 88996)
Posted 26 May 2018 by shanen
Post:
It is one of the most important projects for its science and potential benefits. They just rush the work into production without testing it thoroughly I believe.

You are repeating one of my oft-repeated concerns: Any programs that are supposed to be producing scientific results need to be "tested thoroughly" or the research itself becomes questionable. The project staff seems to have a rather cavalier attitude towards testing, but maybe that's only on the side of the software that the volunteers see. Looks buggy to us, but maybe it's perfect on the results side. (But I doubt it and I strongly hope that they are running all crucial results several times in several ways.)

From what I've seen, if I were still a senior referee for the IEEE Computer Society and if I were reviewing a paper that relied on data from Rosetta@home calculations, I would start out with a highly skeptical attitude. At a minimum I would want to know that the code was audited, but more likely I would ask for replication of the key calculations by some other researchers.

Right now I'm just a volunteer, and my main annoyance is the 3-day deadlines. I'm mostly nuking those pending tasks on sight and NOT feeling sorry about wasting the project's bandwidth. Not at all.
32) Message boards : Number crunching : Output versus work unit size (Message 88985)
Posted 24 May 2018 by shanen
Post:
Frankly amusing to see people worrying about such details as regards THIS particular BOINC project. I actually stopped by to wonder how many other volunteers are nuking 3-day projects on sight.
33) Message boards : Number crunching : invalid results; 24 hours wasted (Message 88968)
Posted 21 May 2018 by shanen
Post:
Sounds similar to what I'm seeing. Unfortunately at this point I don't care that much, but maybe the laid back attitude is okay. Anyway, here are the properties of one of the sick tasks:

Application: Rosetta Mini 3.78
Name: nRoCM_new_01_P04805_group0_7_congq_SAVE_ALL_OUT_IGNORE_THE_REST_609269_3
State: Running
Received: Sat 19 May 2018 05:12:15 AM JST
Report deadline: Sun 27 May 2018 05:12:14 AM JST
Estimated computation size: 80,000 GFLOPs
CPU time: 00:44:04
CPU time since checkpoint: 00:44:04
Elapsed time: 13:13:50
Estimated time remaining: ---
Fraction done: 6.107%
Virtual memory size: 451.04 MB
Working set size: 308.38 MB
Directory: slots/0
Process ID: 7829
Progress rate: 0.360% per hour
Executable: minirosetta_3.78_x86_64-pc-linux-gnu
34) Message boards : Number crunching : Distributions of Linux (Message 88839)
Posted 9 May 2018 by shanen
Post:
Are you seeing the 3-day rb tasks with the "Computation error" status under Ubuntu 18.04? I've had a whole string of them, so I decided to visit the website and see if anyone else was reporting something along those lines... The normal deadline rb tasks seem to be okay, as are the various other tasks.

As regards performance variations among the various distros, I doubt there is much difference there. While I haven't tested between distros, the various distros are mostly using similar kernels. I do have one multi-boot machine that runs on Linux some of the time, and the performance on that machine seems pretty similar under completely different OSes. I have noticed some memory problems under Ubuntu on one machine, but I think that's a BOINC Manager problem, not a Linux thing.
35) Message boards : Number crunching : DRH tasks are related or contagious? (Message 88713)
Posted 17 Apr 2018 by shanen
Post:
Actually I wasted a lot of time trying to adjust the buffer sizes. Nothing I can do about the random mix of deadlines they send and I gave up. That was way back when I cared.

Stopped by this time because of another 9-hour computation error. Still curious if those earn any credit. I thought this thread was where the useless link was posted last time I stopped by, but evidently not. I was going to note that the linked thread didn't actually say anything useful or informative, though obviously some other people were asking about the computation errors. I could do some more searching, but I've already cared enough for a few weeks.
36) Message boards : Number crunching : DRH tasks are related or contagious? (Message 88628)
Posted 4 Apr 2018 by shanen
Post:
It seems the DRH tasks come down in large groups, but if you decide to kill one of them to evade deadline problems, the project may respond by deleting the entire group, both the DRH tasks that have some hours of work and the ones that haven't started yet.
37) Message boards : Number crunching : No work (Message 88627)
Posted 4 Apr 2018 by shanen
Post:
WCG is one of the projects I ran pretty heavily. I've concluded that I feel less forgiving towards them because IBM is (or was?) supporting the umbrella of WCG for other projects. One of the many problems that drove me away from WCG was confusing inconsistencies and problems among the projects, perhaps like the next poster noted.

Having said that, I actually stopped by today to warn people about the DRH project, and yet as I type this one I see another computation error from a d9244 project... At least it was an early failure. However I think the DRH warning calls for a fresh thread.
38) Message boards : Number crunching : No work (Message 88611)
Posted 2 Apr 2018 by shanen
Post:
Just stopped by to see if there was any explanation of the recent outages or for the increasing problem with "computation errors" that terminate long-running tasks... Used to be the computation errors usually happened within a few minutes of starting, but I just saw another as the task approached 8 hours.

As usual, I was unable to find much substantive information in these forums, but perhaps that is mostly a visibility-and-search problem for the information that might exist somewhere on the website. Perhaps I have actually come to prefer the "We don't care, so you shouldn't worry either" attitude of this project? It would be nice to know if I get any credit at all for 8 hours of computation that ends with a "computation error" and it would be nice to know if the computation errors were related to particular hardware or OSes, but if they don't care, why should I?

I guess from a BOINC-level perspective the solution is to run several projects. I've actually run a number of them over the years, but most of them were more or less problematic, so that approach doesn't much appeal to me.
39) Message boards : Number crunching : No work (Message 88577)
Posted 28 Mar 2018 by shanen
Post:
Back again, apparently affecting all types of machines. The server status page shows very few unsent units (with the requisite scrolling).

I still think I saw sufficient evidence the other day to suggest there was something different going on among the different OS/browser combinations.
40) Message boards : Number crunching : No work (Message 88558)
Posted 27 Mar 2018 by shanen
Post:
Hmm... Visited the other machines, and this is the only one that can't get any fresh tasks. Even the other Linux machine was fine and got some fresh work when I woke it up. Rebooted this machine (and checked under Windows 10 at the same time), but still no fresh tasks downloading...

As I've said before, the apparent bugginess of the project tends to cast a shadow on the results. If there is something wrong with the Rosetta@home projects on certain machines, then maybe all of the results need to be verified to make sure they ran on "safe" OSes?





©2019 University of Washington
http://www.bakerlab.org