Posts by mikus

1) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 56851)
Posted 11 Nov 2008 by mikus
Post:
Rosetta/BOINC does not validate against partial results. It should.

The typical Rosetta task runs multiple decoys (each of which I believe is an *independent* simulation). I had such a task terminate because, while calculating decoy 7, it came up with a NaN. The results from the correctly completed previous 6 decoys were discarded.

Looked in the 'Workunit Details' page and saw that another system was identified as successfully completing that same task. The catch -- it did only 5 decoys.

There is something fundamentally unfair when ALL the work from a system that did more crunching gets discarded, while accepting work from a system that crunched less.
.
2) Message boards : Number crunching : Longer tasks providing poor granted credit? (Message 56849)
Posted 11 Nov 2008 by mikus
Post:
I think it is unfair that, if one task takes my system four times as long to crunch as another, the longer task is NOT given proportionately greater credit.

More problematic to me, I run off-line. That means when I do connect, I need to fetch enough work to keep my system busy until the next time I connect. Rosetta was good in that it allowed me to set a "standard task time" for each task. When new tasks were fetched, they were enough to keep my system busy.

But BOINC calculates a "duration correction factor" that it applies to the tasks. As long as all Rosetta tasks adhered to the "standard task time", BOINC downloaded what I thought was a proper amount of work. But now that some Rosetta tasks take longer, BOINC is estimating that *all* Rosetta tasks will take longer, with the result that it downloads FEWER tasks when I do connect. Given that most of those downloaded tasks are not the longer ones, my system runs out of work that much sooner (and I have to make "unscheduled" connections to fetch more tasks).
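The effect described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the real BOINC client's work-fetch logic (which is more involved); the function and parameter names here are mine, not BOINC's.

```python
# Rough sketch of the work-fetch effect described above.  The real
# BOINC duration correction factor (DCF) logic is more involved;
# these names and numbers are illustrative only.

def tasks_fetched(buffer_days, est_hours, dcf):
    """How many tasks the client requests to fill its work buffer,
    given a per-task runtime estimate inflated by the DCF."""
    buffer_hours = buffer_days * 24
    corrected_estimate = est_hours * dcf
    return int(buffer_hours // corrected_estimate)

# With all tasks honoring an 8-hour "standard task time" (DCF ~ 1.0),
# a 6-day buffer fetches plenty of work:
print(tasks_fetched(6, 8, 1.0))   # 18 tasks

# After a few long tasks push the DCF up, *every* task is assumed
# longer, so fewer tasks are fetched -- even though most are short:
print(tasks_fetched(6, 8, 2.0))   # 9 tasks
```

The point of the sketch: because the DCF is a single per-project multiplier, a handful of genuinely long tasks halves the number of tasks fetched for everyone, short tasks included.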

PLEASE, not only give more credit to longer tasks, but also give them a higher run-time estimate. That will keep the BOINC client from mis-calculating the workload.
.
3) Message boards : Number crunching : Problems with version 5.96 (Message 53717)
Posted 16 Jun 2008 by mikus
Post:
Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.] Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly).
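The symptom above suggests a simple heuristic for spotting such a task. This is a hypothetical sketch of my own, not anything the BOINC client actually implements; `read_cpu_seconds` stands in for however one would sample a task's accumulated CPU time.

```python
import time

def looks_stuck(read_cpu_seconds, fraction_done, poll_interval=60.0):
    """Heuristic based on the symptom described above: a task that
    reports 100% complete but accrues no CPU time over a polling
    interval has probably finished without telling the client.

    Illustrative only -- read_cpu_seconds is a hypothetical callback
    returning the task's accumulated CPU seconds.
    """
    before = read_cpu_seconds()
    time.sleep(poll_interval)
    after = read_cpu_seconds()
    return fraction_done >= 1.0 and (after - before) == 0.0
```

A client applying a check like this could at least flag (or finalize) such tasks instead of leaving a core idle until the user aborts them by hand.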

I've encountered this failure before with Rosetta - but that was on a different hardware system, then using the 32-bit Linux client. My conclusion is that there is a problem (at least on multi-core hardware running Linux) with how the Rosetta executable notifies the BOINC client that an application's workunit has finished.
4) Message boards : Number crunching : Problems with Minirosetta version 1.09 (Message 51985)
Posted 16 Mar 2008 by mikus
Post:
I'm running Linux (32-bit SuSE 10.2, with 32-bit 5.10.42 BOINC).

So far, some 15 Minirosetta version 1.09 workunits have been downloaded to my system.
__EVERY ONE__ of them has failed with a segment violation.
.
5) Message boards : Number crunching : Problems with minirosetta version 1.+ (Message 51294)
Posted 10 Feb 2008 by mikus
Post:
Thumbs down.

I'm running the 32-bit minirosetta 1.07 under the 32-bit BOINC client 5.10.28, on an AMD 64-bit dual-core system using 32-bit SuSE 10.2.

So far I am only getting an 11% success rate on minirosetta -- various kinds of crashes, mostly segment violations. [My success rate on regular Rosetta is near 100%.]
.
6) Questions and Answers : Web site : huge width for forum window (Message 47925)
Posted 22 Oct 2007 by mikus
Post:
Thanks for pointing that out, I found a wide image incorporated into one of the posts and have now hidden it, so the margins on the thread should be better. If you get a chance, it would be worth mentioning that type of problem on the BOINC boards.

Did mention it there - one response was: "it's not an image. It's Message 46685".
However, this too was said: "... the patch was already made (SVN changeset 12620 - http://boinc.berkeley.edu/trac/changeset/12620), so Rosetta probably isn't using latest stylesheet file."
.
7) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47897)
Posted 21 Oct 2007 by mikus
Post:
Mikus, please join the discussion on Linux preemption issues in this thread.

I may do so -- but the reason I did not originally is that I believe that all of the recommendations in that thread were already in place on my system. As far as I can tell, the Rosetta workunit got "stuck" __after__ it had completed crunching. So to my mind there was no "task preemption" involved (only "task exit").

Also, if it were a preemption issue, I would expect other Rosetta tasks on my system to be failing in a similar fashion. But only that *beta* 5.80 has failed so far. I suspect the problem was triggered by something about that particular task. [Note: My 'Rosetta time to crunch' is 8 hours, meaning I run Rosetta applications (including 5.80) for a longer time than typical participants do.]
8) Questions and Answers : Web site : huge width for forum window (Message 47869)
Posted 19 Oct 2007 by mikus
Post:
Recently posted to the "Problems with Rosetta 5.80" thread. I had so much difficulty trying to read that thread that I may have missed noticing similar problem reports from other participants.

The problem was the __number-of-characters__ assigned to each text line of that thread, as formatted by my browser. Only about half the width of each line could be displayed on my screen (at maximum browser window stretch) -- meaning that for *far too many* lines I would have had to scroll horizontally (each line !!) in order to read that line.

My guess is that some people "pasted" information into the thread, which required that much horizontal extent. Nevertheless, PLEASE enforce a reasonable right margin for the formatting of text lines for forums. There are many more viewers of the forum than there are posters -- to me it makes sense to inconvenience the former rather than the latter.
.
9) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47868)
Posted 19 Oct 2007 by mikus
Post:
aborted beta - http://boinc.bakerlab.org/rosetta/result.php?resultid=112298830

Went to my computer (to make a connection), and saw (gkrellm) that one of the cores was idle. Boincmgr status showed two Rosetta WUs running. Top showed one of them using CPU, the other sitting there "stuck". Manually aborted the second.

I have plenty of memory; "leave work in memory" is specified. Judging by the CPU time accumulated by the "stuck" workunit, it had completed its quota of decoys, and was in the process of shutting down when it got "stuck". Dual-core Linux 32-bit system, boinc 5.10.21. Rosetta tasks usually complete just fine.

The problem that "stuck" workunits cause is that boinc keeps track of the number of seconds given to tasks. As near as I can tell, my system spent so much wall clock time __not__ executing that "stuck" WU that its boinc calculated efficiency has now been severely reduced. I run off-line, and connect only occasionally. The lowered efficiency value means that for a while I will be given *less* work each time I connect, and will therefore have to connect more often. Not good.
.
10) Message boards : Number crunching : Problems with Rosetta version 5.78 (Message 45892)
Posted 10 Sep 2007 by mikus
Post:
mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.
From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segment violation, nor in the fact that, whereas my expectation is that a failing task would go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly.

It may well be that BOINC code needs to be upgraded to handle this unusual situation - an application task "dispatched" by BOINC which does not use any CPU.

BUT it is likely that the existing BOINC code expected that an application task which (according to the task's stderr.txt) had received (SIGSEGV + SIGABRT) would perform a "final exit". My question is - did the Rosetta application task do that ? (If yes, then BOINC dropped the ball; but if no, then it was the application that did not do what BOINC expected.) That is why I would like to send the snapshot of the slot directory to someone at Rosetta (if I knew where to send it), so Rosetta people can check for how far the application had gotten.

mikus


p.s. I now see that when I "aborted" the task to get it out of the ready queue, only the "abort" shows in the result's stderr field - overwriting the task's previously accumulated stderr output.

Also, I believe boincmgr is merely the 'GUI' to the BOINC client - the client can (and does) run perfectly well if boincmgr has been closed. So while the BOINC manager *can* control the application tasks (I issued the "abort" from boincmgr), it is the client which performs the details of task scheduling. Unfortunately, I believe the principal means the client has to keep track of what the tasks are doing is to track their CPU consumption. When faced with a task that does not consume CPU, I think the current BOINC *will* lose track.
.
11) Message boards : Number crunching : Problems with Rosetta version 5.78 (Message 45886)
Posted 10 Sep 2007 by mikus
Post:
mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.

From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segment violation, nor in the fact that, whereas my expectation is that a failing task would go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
.
12) Message boards : Number crunching : Problems with Rosetta version 5.78 (Message 45825)
Posted 9 Sep 2007 by mikus
Post:
Had a problem with <http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94715507> (not reported yet, since Rosetta is not yet accepting uploads). Noticed in gkrellm that one of my CPUs was idle (though boincmgr said that the workunit on that CPU was "running").

(If you can tell me where to send it, I have a tar of the slot directory.)
Here is a copy of the stderr.txt from that slot directory:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1285195
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...
SIGABRT: abort called
Stack trace (23 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8db0514]
[0x8dc53df]
[0x8dca445]
[0x8dca723]
[0x8d9b171]
[0x8d9cb99]
[0x83f92c1]
[0x8db0a5f]
[0x8d45152]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...


Would prefer it if applications which terminated abnormally would go away, rather than making the boinc client (Linux 32-bit 5.10.8) believe they are still "running".
.
13) Message boards : Number crunching : Failed download (Message 38666)
Posted 29 Mar 2007 by mikus
Post:
p.s. After posting the above message, I tried to log out from my account. But the Rosetta webserver responded: "Unable to handle request" !!
.
14) Message boards : Number crunching : Failed download (Message 38665)
Posted 29 Mar 2007 by mikus
Post:
Been having problems for a number of hours now. The client first tried to download from srv3.bakerlab.org - no go. Then it tried to download from srv4.bakerlab.org - also no go. [It keeps (at long intervals) switching between the two.]

I tried accessing srv3.bakerlab.org and srv4.bakerlab.org with a browser - both attempts timed out. [Though I could access boinc.bakerlab.org/rosetta/download with my browser -- the client doesn't seem to ask there, though.]
.
15) Message boards : Number crunching : is seven-day deadline still needed ? (Message 25908)
Posted 2 Sep 2006 by mikus
Post:
I'm on a dial-up line, and would sometimes like to leave town for the weekend. The boinc client assumes that whatever I set my "connect interval" to, it needs to have work finished that much ahead of deadline to be sure the results get reported in time. Given that the last WU queued up at my system will not even be started until all the WUs ahead of it on the ready queue have been completed, the significant "queue size" I specify (to cover those weekends), plus a short deadline, combine in my circumstances to make the boinc client panic.
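The panic I describe can be reduced to simple arithmetic. This is a rough illustration only -- real BOINC work-fetch logic is more elaborate, and the names below are mine, not BOINC's.

```python
# Rough sketch of why a large connect interval plus a short deadline
# makes the BOINC client "panic" (switch to earliest-deadline-first
# scheduling).  Simplified and illustrative; not actual BOINC code.

def last_wu_finish_day(n_wus, hours_per_wu):
    """Day on which the last queued WU finishes, crunching 24 h/day."""
    return n_wus * hours_per_wu / 24

def client_panics(n_wus, hours_per_wu, deadline_days, connect_days):
    # The client wants the last WU finished a full connect interval
    # ahead of its deadline, since results can only be reported at
    # the next connection.
    effective_deadline = deadline_days - connect_days
    return last_wu_finish_day(n_wus, hours_per_wu) > effective_deadline

# 12 WUs at 12 h each finish on day 6; with a 7-day deadline and a
# 3-day connect interval the effective deadline is day 4 -> panic.
print(client_panics(12, 12, 7, 3))   # True
```

With the old 28-day (or even a 10-day) deadline, the same queue and the same connect interval would leave slack instead of triggering panic mode.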

I could see the need for a seven-day deadline for Rosetta while doing work for CASP. But is the project __still__ under that kind of time pressure? If the scientific work Rosetta is currently performing does not depend upon seven-day (i.e., short) turnaround times, then PLEASE -- out of consideration for project participants who are on dial-up connections -- set the Rosetta work deadlines to eight (or even better, ten) days instead.

Thanks.
.
16) Message boards : Number crunching : New Crediting system: questions (Message 22899)
Posted 18 Aug 2006 by mikus
Post:
I don't know what to think of this new crediting system. I'm running offline with a vanilla Linux client, which in the past kept receiving lower benchmark scores than did Windows clients. So far, I've reported to the new system (all at the same time, when I dialed in this Thursday) the completion of four WUs. The respective numbers, granted credit (old) on left, work credit (new) on right:
66.4 : 055.8
66.6 : 056.0
68.8 : 149.4
68.6 : 159.2

Is it just a coincidence that the two high (new) numbers are for WUs I reported on the day their deadline would have expired, whereas the two low (new) numbers are for WUs whose completion I reported ahead of their deadline date ?

And I am disappointed that there are two WUs on which the new crediting system will give me 16% lower numbers than the old system (which was already giving me lower numbers than for many other participants).
.
17) Message boards : Number crunching : Restarting Results ad infinitum (Message 17526)
Posted 2 Jun 2006 by mikus
Post:
The "zero status" warnings are typically caused by the client having trouble communicating internally (usually by using a port on 127.0.0.1).

Try leaving your boincmgr running all the time, and see if those messages diminish. If that doesn't help, there is probably something on your system that interferes with the handling of internal TCP/IP messages.

In any case, as long as those WUs eventually complete o.k., you can probably ignore the "zero status" warnings -- i.e., no need for you to "reset".
.
18) Message boards : Number crunching : users who run off-line are impacted by shorter deadlines (Message 17511)
Posted 1 Jun 2006 by mikus
Post:
It seems the client is just trying to assure you've got work to crunch on during the time it will be without a network connection, which sounds directly in-line with your objectives. So I clearly must be missing your point. Are you saying it should have downloaded FEWER WUs? Are you actually having problems completing them before the deadline? Or is it just BOINC's concern about it that is disturbing?

I personally think the *server* ought to have downloaded 10 or 8 WUs, thereby allowing me a "buffer" (for me meeting the deadline of whichever of these WUs will be completed last) of 2 or 3 days.

[Added by edit: OR the server could set a deadline of 7 days from the download for the *first* of those WUs, but a deadline of (7 days + 3 days) for those WUs whose crunching will be PRECEDED by an estimated three days of crunching (of other WUs that were downloaded first within this same "download assemblage").]


If I hear you right, if it had only given you the 8 or 10 WUs, you'd have been done crunching them in 4 or 5 days, gone a day or two without establishing an internet connection and then reported them all in when you sit down to that PC and dial up to the net. That was my point about how it's trying to keep you with enough WUs to KEEP you crunching the whole time. If you do want to shut the machine off, or be positive that you NEVER pass a deadline, then your cache size (and therefore your use of that machine really) is larger than fits a project with a 7 day deadline.

... They can't really tag the deadline to be 7 days from the 3 days of expected crunching on the first WUs... because that was the whole point: they shortened the deadlines because they want specific results RETURNED sooner. And that approach doesn't get them returned sooner... and your next connect time would be past that deadline anyway unless they WERE crunched in the first week because nothing went wrong.

The people who specify things appear to believe that the world runs according to __rigid__ timetables. In my case, if I see that my system is close to running out of work (or has a completed WU whose deadline is close), I will then and there connect to the server (no matter *what* the "interval between connects" value happens to say). [My reason for specifying a large (queue size) value is (given non-exceptional circumstances) to not run out of work if I happen to be absent for three days, or if the server is inaccessible when I try to connect.]

My point is that if the server gives me six days of work, ALL with a deadline of seven days, that seems "out of whack" to me. [There has been discussion that the server might "filter" the WUs it hands out according to the memory size of the client system -- why not consider "filtering" the WUs according to the estimated "time to crunch" of ALL the WUs that are now being handed out to that one client ?]

[And if the project needs the results returned "sooner", why hand out _so much_ work at one time that the last results from that download can't help being returned "later"?]

Please note that if on Jun 1 the server has available a 7-day-deadline WU, but that WU happens to be "handed out" only on Jun 2, the deadline seen by the user will be Jun 9. In effect, the "delay in handing out" of the WU __extended__ the date on which the results will be expected back. Why can't "delay due to crunching preceding WUs" (as "handed out" in this same "download assemblage") *also* be used to extend the date on which the results (of the last WUs in this "download assemblage") would be expected back ?


You say that my cache size is too large to fit a project with 7-day deadlines. I agree. But when I joined Rosetta, it had (I think) 28-day deadlines. [And the description of the project did NOT indicate that deadlines would come to be drastically shortened.] I believe that I am making a contribution to Rosetta. To repeat - if what I do doesn't "fit", I'll just leave.
.
19) Message boards : Number crunching : users who run off-line are impacted by shorter deadlines (Message 17490)
Posted 1 Jun 2006 by mikus
Post:
...making my computer overcommitted (on paper) before it even starts on the work... I am forced by the shorter deadlines to shorten my queue size, it is MUCH EASIER for me to simply stop participating in Rosetta.

It is the client that decides how much work to request. If your WU runtime preference is accurately reflected in the initial estimated runtime to completion, then it will work out just fine. BOINC goes in to a bit of a panic there for a half day or so, but what harm is it to run in earliest deadline first mode?

It seems the client is just trying to assure you've got work to crunch on during the time it will be without a network connection, which sounds directly in-line with your objectives. So I clearly must be missing your point. Are you saying it should have downloaded FEWER WUs? Are you actually having problems completing them before the deadline? Or is it just BOINC's concern about it that is disturbing?

I have no problem with the *client* requesting 518400 seconds of work if my queue size is 6 days. After all, the client does NOT know what kind of work will be downloaded.

My point was that the *server* downloaded 12 WUs, all of them having a deadline of seven days, KNOWING that for Rosetta my 'CPU time' specification is 12 hours. To finish crunching the last of those ought to take ((12 WUs * 12 hrs/Wu) / 24 hrs/day) = 6 days, when the deadline of whichever of those WUs is finished last is 7 days. That's cutting it pretty close: (7 days deadline - 6 days to complete) = 1 day "buffer". Should *anything* delay processing during those 6 days, it might take me MORE time than the deadline to finish crunching ALL the WUs that were downloaded by the server (not even counting any additional time taken for me to report back to the server the results of my having crunched the last of these downloaded WUs). I personally think the *server* ought to have downloaded 10 or 8 WUs, thereby allowing me a "buffer" (for me meeting the deadline of whichever of these WUs will be completed last) of 2 or 3 days.

[Added by edit: OR the server could set a deadline of 7 days from the download for the *first* of those WUs, but a deadline of (7 days + 3 days) for those WUs whose crunching will be PRECEDED by an estimated three days of crunching (of other WUs that were downloaded first within this same "download assemblage").]
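The staggered-deadline idea in the edit above can be sketched concretely. This is a hypothetical illustration of the proposal, not actual BOINC server behavior; the function name and parameters are mine.

```python
# Sketch of the staggered-deadline proposal above: WUs later in a
# batch get their base deadline extended by the estimated crunching
# time of the WUs queued ahead of them in the same download.
# Hypothetical -- not actual BOINC server behavior.

def staggered_deadlines(n_wus, hours_per_wu, base_deadline_days=7):
    """Per-WU deadlines (in days from download) for a single batch,
    assuming the WUs are crunched in order at 24 h/day."""
    deadlines = []
    for i in range(n_wus):
        queued_ahead_days = i * hours_per_wu / 24
        deadlines.append(base_deadline_days + queued_ahead_days)
    return deadlines

# 12 WUs at 12 h each: the first is due on day 7, the last on day
# 12.5, so the "buffer" for the last WU stays constant instead of
# shrinking as the queue is worked through.
print(staggered_deadlines(12, 12)[0])    # 7.0
print(staggered_deadlines(12, 12)[-1])   # 12.5
```

Under this scheme the first results still come back within the 7-day window the project wants, while the tail of the batch gets deadlines it can actually meet.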


Regarding BOINC's concern, I have advocated (here and in the BOINC forum) that the user ought to be allowed to specify __two__ parameter values - one for setting 'queue size' (how much work should be kept available to be crunched) and the other for specifying 'time between connects' (how long it might be between completing work and reporting it). But in my opinion a *severe* constraint on the "politically correct" sum of those two parameters is being imposed on off-line users by having deadlines as short as 7 days. With my queue size, I'd much prefer to see 8-day or even 9-day deadlines.
.
20) Message boards : Number crunching : users who run off-line are impacted by shorter deadlines (Message 17443)
Posted 31 May 2006 by mikus
Post:
I still had some 14-day-deadline WUs left in my queue, so I let them all crunch to completion. That made my queue empty. When I connected, a full six days (my queue size) of work was downloaded (#_of_WUs x specified_CPU_time). This of course plays havoc with the BOINC client, which subtracts (queue size + 1 day) from the deadlines -- making my computer overcommitted (on paper) before it even starts on the work. But, given the new deadlines of seven days, the *actual* completion time of the last WUs assigned to me is also tight. The __server__ ought to have recognized this, and not have downloaded so many WUs to my system.

As I said at the beginning of this thread, if I am forced by the shorter deadlines to shorten my queue size, it is MUCH EASIER for me to simply stop participating in Rosetta.
.





©2024 University of Washington
https://www.bakerlab.org