Posts by mikus

21) Message boards : Number crunching : Report Problems with Rosetta Version 5.16 I (Message 17065)
Posted 25 May 2006 by mikus
Post:
(Maximum disk usage exceeded)

Got same error (Rosetta version 5.16; BOINC version 5.4.9; Linux) -- Result ID

(1) Could NOT find anywhere that a user would have set a "disk limit" of 100 (megabytes?). If this is built into the Rosetta software, the limit ought to be made larger. [And if the error was not caused by the user+computer, credit ought to be given.]

(2) I noticed that __no__ results were uploaded to the server for the failing WU, despite my computer having spent eight hours crunching it. My understanding is that the Rosetta software constructs MULTIPLE 'decoys' while processing a WU -- if a problem arises after eight hours of crunching, *surely* this WU might have had one or more valid 'decoys' completed prior to the crash -- they deserve to be reported.
.
22) Message boards : Number crunching : users who run off-line are impacted by shorter deadlines (Message 17006)
Posted 24 May 2006 by mikus
Post:
Apologies, but your post does not address my situation.

Hopefully the dramatic reduction in the size of the WUs will help your download times to improve and counterbalance the negative of the shorter deadlines that are necessary in order to get results back for CASP.

Yes, the download times are shorter. But what I do is un-suspend (with boincmgr) the client's network connection and then go away, letting the uploads/downloads proceed at their own pace. I do something else in the meantime. It matters little to me whether the downloads finish in half an hour or in an hour and a half. [Except once, when due to a BUG in the BOINC client the downloads stretched over nine hours and over-loaded my cruncher with work !!]

If you describe your proxy situation a little further, perhaps there are some steps people could suggest to make it much easier for you to connect as well.

As far as I can tell, it *cannot* be done more easily. Connecting to the server takes three steps:

(1) Computer A has hardware connected to a telephone line. I'm typically not using the 'net, so there is no connection. When I do want to dial out, it takes a single click (at (A)) to activate the connection. [If there is a problem making the connection, obviously it takes more decisions. I've had auto-dialers run "wild" (e.g., when my ISP's authentication server crashed), and won't use them.]

(2) Normally, no 'ports' on computer A are activated (unless I'm using a browser at (A) - even then, direct http connections are initiated from "inside", not "outside"). When I want my OTHER computers to communicate with the outside (and am connected), it takes a single click (at (A)) to start squid on computer A as a proxy/nat. That activates/filters the 'ports' on computer A that relay http from my other computers. [I do not believe in leaving 'ports' open all the time.]

(3) Normally, the crunching computer (B) is locally on-line to (A), but has its BOINC client network communication suspended (via boincmgr). It takes a single click (at (B)) to make BOINC client network communication available. [If I didn't leave BOINC client network communication suspended, it would at times of its own choosing try to contact the server but fail, and would produce lots of error messages and try totally useless deferred communication intervals.]


Note that my PHILOSOPHY is: 'Don't keep anything "activated" unless it is actually in use'. I have no wish to change.
.
23) Message boards : Number crunching : users who run off-line are impacted by shorter deadlines (Message 16996)
Posted 24 May 2006 by mikus
Post:
By changing its deadlines to 7 days instead of 14 days, the Rosetta project has ALTERED the impact participation has upon off-line users like myself.

I have a slow dialup line; also, I need to manually set up a proxy whenever I want to connect the crunching computer to the server. I'm happiest when I can go days and days and days without having to connect.

By halving its deadlines, the Rosetta project is trying to __force__ me to connect twice as often as I'm comfortable with. I'll see how that works out -- but the result of this deadline change may be that I stop contributing to Rosetta, and go look for a project that is less burdensome for me to participate in.
.
24) Message boards : Number crunching : Is the Rosetta client "linear" ? (Message 16721)
Posted 20 May 2006 by mikus
Post:
The percent complete is not necessarily linear. If you change the time setting while the work unit is being processed it will affect the percent complete. ... The time to completion is almost never correct.

I think it's been many many days since I've changed ANYTHING. My last change was from BOINC 5.2.13 to BOINC 5.4.9. I have not changed the values in my preferences in weeks; there has been *plenty* of time for a "history" of my processing (with the 12 hour time setting) to stabilize. Plus, the work unit does *not* get removed from memory.

I corresponded with the BOINC developers about seeing a download and, a couple of hours later, seeing the BOINC client set EDF mode. (Rosetta is the *only* BOINC project on that computer.) They told me that the BOINC client __does__ use the value reported using boinc_fraction_done(). In particular, I interpreted what they said as: "If after one hour of processing it is reported that the result is 3.333% done, the BOINC client will test for deadlines using a formula for completing THAT result which evaluates closer to 30 hours than to the workunit's estimated time."

I believe that (immediately following the download) a report DURING THE PROCESSING OF THE CURRENT WORKUNIT that "inflated" its 'time to completion' would be *enough* to explain why my system was set to EDF mode. The download fetched so many workunits that (given a 14-day deadline but my 6-day cache size) the "safety margin comparing completion time vs. deadline" was only a matter of hours !! Thus very little added inflation (e.g., induced by non-linearity) would have been needed to trigger EDF mode.
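To make that arithmetic concrete, here is a minimal sketch (Python; my own reconstruction of the projection described above, NOT the actual BOINC scheduler code):

    # A minimal sketch (my own reconstruction, not the BOINC source) of the
    # projection described above: total time = time spent so far / fraction done.
    def projected_total_hours(cpu_hours_so_far, fraction_done):
        return cpu_hours_so_far / fraction_done

    # After one hour of crunching, the application reports 3.333% done:
    print(projected_total_hours(1.0, 0.03333))   # ~30 hours, vs. the ~12 expected

    # Spread across a freshly filled queue whose deadline margin is already only
    # a matter of hours, a few such inflated estimates are enough to trip EDF mode.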
.
25) Message boards : Number crunching : Is the Rosetta client "linear" ? (Message 16664)
Posted 19 May 2006 by mikus
Post:
Rosetta is the only BOINC project on my Linux computer. I normally run off-line. My 'work cache' is specified as six days, and my 'target CPU run time' is specified as 12 hours.

When I had BOINC 5.2.13 installed, with stable WUs I don't remember BOINC going into earliest-deadline-first scheduling mode. But now that I have installed BOINC 5.4.9 (which has implemented "tightened" scheduling policies), I've seen EDF mode entered following the downloading of new work. Although in my environment EDF mode makes no difference, this change in client system behavior made me curious.

In discussions on the BOINC list, it was suggested that the BOINC client gets "nervous" about scheduling when the 'progress on the result is non-linear'. [An example of non-linear would be if after the first hour of crunching a 12-hour WU, only 3% 'progress toward the result' were being reported.]


My question: Does the __Rosetta__ client behave linearly -- is the value being reported, EVERY TIME it uses the boinc_fraction_done API, accurate for (accumulated time spent crunching this WU / expected total time spent crunching this WU) ?
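For clarity, this is what I mean by "linear" -- a small sketch of my own (Python), not the Rosetta source:

    # What strictly "linear" progress reporting would look like (my own sketch,
    # not the Rosetta source): every report is elapsed CPU time divided by the
    # expected total CPU time for this WU.
    def linear_fraction_done(cpu_seconds_so_far, expected_total_cpu_seconds):
        return min(cpu_seconds_so_far / expected_total_cpu_seconds, 1.0)

    # With a 12-hour target, one hour of crunching should report about 8.3% done:
    print(linear_fraction_done(3600, 12 * 3600))   # 0.0833...

    # Reporting only ~3% at that point would be "non-linear", and (per the BOINC
    # list) is what makes the client nervous about deadlines.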
.
26) Message boards : Number crunching : The cheating thread (Message 15422)
Posted 3 May 2006 by mikus
Post:
My "GUESS" would be that it is more likely a look up table for different machine types and speeds with some kind of correction factor. But that is only a Guess.

That approach might not cover oddball cases. The computer I use for Rosetta is "vanilla" (no clock settings have been modified), but for another project I use a "strawberry" computer that has been under-clocked! Its BIOS reports the nearest "CPU type" it knows about, but the actual performance of that system is *better* than for a system with an actual CPU chip of that "type".
.
27) Message boards : Number crunching : Choices are limited for target CPU run time selection (Message 14632)
Posted 26 Apr 2006 by mikus
Post:
I just added more time options as requested.

Thank you. I have now changed that parameter value to 12 hours.

[I had seen occasional WUs take longer than the target specification. But given that the ideal scheduler should finish the WU *before* the target is reached, I felt that my target value deserved to be bumped up a little from where it was before.]
.
28) Message boards : Number crunching : Choices are limited for target CPU run time selection (Message 14579)
Posted 25 Apr 2006 by mikus
Post:
For several weeks I have been running with my 'target CPU run time' value at 10 hours. I wanted to increase that parameter SLIGHTLY.

But today the webpage offers me the choice of setting that value to 10 hours, or to 16 hours -- NOTHING in between.

Does the project not __trust__ us volunteer participants if instead we wished to set that value to, say, 12 hours ??
.
29) Message boards : Number crunching : Actual CPU run time not always same as target (Message 14536)
Posted 24 Apr 2006 by mikus
Post:
More often than not, the WUs finish before your preference. The golden exception to that being when it takes longer than your preference to complete model 1.

Not counting the ones I asked about, of the last 14 WUs (that had finished recently) 4 took longer than my preference. [Their nstructs/attempts were 32/34, 52/54, 7/8, 4/5.]

That's a sizeable minority that took longer than my preference.
.
30) Message boards : Number crunching : Actual CPU run time not always same as target (Message 14426)
Posted 23 Apr 2006 by mikus
Post:
For a long time now I have had my 'Target CPU run time' parameter set to 10 hours. This past week I have noticed several WUs on my system which completed in times that were NOT near 10 hours. I'm wondering how come ??

One WU ran 6 nstructs from 6 attempts and took 8.8 hours total. Was that close enough to 10 hours for it to not make another attempt? Another WU ran 1 nstruct from 1 attempt and took 5.8 hours total. At that rate I _would_ have expected it to try another attempt. A similar one ran 2 nstructs from 2 attempts in 7.4 hours total.

On the other hand, there was one WU that ran 2 nstructs from 4 attempts in 13.3 hours total. I guess it stopped when it realized it was over my 10 hour target.

Are such deviations from my 10 hour target normal ?
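For what it's worth, here is one rule that would fit the first three examples above (a guess of my own, in Python -- NOT the actual Rosetta code): after each completed model, estimate the next model's cost from the average so far, and start another only if it would still fit within the target.

    # A guess at the run-time rule (NOT the actual Rosetta code): models always
    # run to completion; a new model is started only if the average model time
    # so far suggests it would still fit within the target.
    def models_to_run(per_model_hours, target_hours=10.0):
        elapsed, done = 0.0, 0
        for hours in per_model_hours:
            elapsed += hours                  # a started model runs to the end
            done += 1
            average = elapsed / done
            if elapsed + average > target_hours:
                break                         # the next model probably won't fit
        return done, elapsed

    print(models_to_run([1.47] * 10))   # (6, ~8.8 h)  -- stops short of 10 hours
    print(models_to_run([5.8] * 10))    # (1, ~5.8 h)  -- one model, well under target
    print(models_to_run([3.7] * 10))    # (2, ~7.4 h)
    # The 13.3-hour WU is harder to model this way, since only 2 of its 4
    # attempts produced structures.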
.
31) Message boards : Number crunching : unnecessary earliest-deadline-first scheduling (Message 12045)
Posted 15 Mar 2006 by mikus
Post:
Well I guess I am having some trouble understanding your situation. If your queue is set to 6 days that is 144 hours, assuming you crunch 24/7. 144 divided by your 10 hour time setting is 14.4 WUs. The system should have given you 14 WUs, but if it gave you 15, which is possible, then the system would probably have gone into EDF mode right off. But let's assume it gave you 14, and as you say they are all 12 day workunits. All things being equal, that should not push the system into EDF mode in and of itself. But change any of those conditions and it could. And yes, the crashed WUs would affect the expected run time, but it should make it shorter, which might cause you to get more than the 14 WUs and push you into EDF mode.

Please note that in this thread I am NOT commenting on "how many seconds of work were requested/downloaded". I __AM__ commenting on the following situation:

- On Mar 12, enough WUs were downloaded to make up 144 or so hours of work for my system to do. On Mar 12, the system appeared to be processing normally (round-robin scheduling mode) the WUs then in its work queue.

- On Mar 14, some of those WUs having finished (and *no* new WUs having been added), the system had about 100 hours of work to do -- but NOW (on Mar 14) it switched to EDF scheduling mode (for deadlines of Mar 26) -- though it could expect to finish the existing work in its queue by Mar 19 !!


I thought this behavior worth posting about, since if I *were* to add another project, HOW the system behaved here was different from what I was expecting.
<Even taking into account that Darren told me that I should in my mind *subtract* the "connect interval" value from the WU deadline!>
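To put numbers on the Mar 14 situation, a plain completion-time check would have looked something like this (my own sketch, not the BOINC scheduler's code; the 8:00 morning start is just an assumption):

    # A plain completion-time check for the Mar 14 morning (my own sketch, not
    # the BOINC scheduler): ~100 hours of queued work, crunched around the clock,
    # drains days before the Mar 26 deadlines.
    from datetime import datetime, timedelta

    now              = datetime(2006, 3, 14, 8, 0)   # assumed morning start
    queued_cpu_hours = 100
    deadline         = datetime(2006, 3, 26, 0, 0)

    finish = now + timedelta(hours=queued_cpu_hours)
    print(finish)               # 2006-03-18 12:00 -- comfortably before Mar 26
    print(finish <= deadline)   # True: by this check, the box is NOT overcommitted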
.
32) Message boards : Number crunching : unnecessary earliest-deadline-first scheduling (Message 12043)
Posted 15 Mar 2006 by mikus
Post:
On another thread, it was said: 'if a project deadline is less than double the cache size, the box will run in EDF mode'.


That is correct, and that is what I thought had happened in your case as the manual connect you made 2 days after the prior connection would have "reset" the time frame for determining when the 6 day connection setting would apply - basically making boinc think it had to fit 6 + 6 (for 2 normal connections) days worth of work into 2 + 6 (the manual connection plus the next normal connection) days on the calendar. Since it occurred before that manual connection, it's something a bit more complex than that, though.

I *am* agreeing with what you've said -- but did you note my "minor correction"? There was a download Mar 8. The next download was on Mar 12 - it would have "reset" the time frame. By Mar 13, all WUs downloaded Mar 8 had finished, and only WUs downloaded Mar 12 were left. On Mar 14 (__BEFORE__ the Mar 14 connection was ever made) the projected time (the current time plus 'twice 6 days') moved past the Mar 26 deadline, and the client apparently decided that the computer was overcommitted.

At the time this decision by the client was made, the most recent "connection which reset the time frame" was that of Mar 12, __NOT__ that of Mar 14 (which was still to come - some hours later on Mar 14).

<I manually initiated the Mar 14 connection upon seeing the "switching to EDF mode" message already there in the BOINC window - I wished to have all 'results to date' uploaded to the server before I started posting about them. And I wondered that (without ever issuing any message about ending EDF mode) the client proceeded, once network connection was allowed via the BOINC manager, to *request* yet more work to be downloaded!>
.
33) Message boards : Number crunching : unnecessary earliest-deadline-first scheduling (Message 12029)
Posted 15 Mar 2006 by mikus
Post:
This can happen if some number of the workunits exceed the expected runtime. For example if a number of 10 hour WUs actually run for 10:45, your system would begin to realize that they might all run that way, and you would then possibly be over committed at the early part of the queue. But as the queue runs, this would change. What changed in this case was a few WUs crashed and reduced the workload in the queue.

You did not catch what I was complaining about:
- My reporting interval was specified as 6 days
- The queue at the client held ONLY about 100 hours of work (14-day WUs)

So far, *every* WU (before or after) has run very close to 10 hours. I think "exceeding the expected runtime" does not apply here (unless crashing 1-minute WUs *increase* the "history" value?).

On another thread, it was said: 'if a project deadline is less than double the cache size, the box will run in EDF mode'. That seems to be what was happening here - blind adherence to that rule. Perhaps people who prefer a sizeable cache size are not suitable for Rosetta's 14-day WUs.
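As a sketch of that rule (my own rendering in Python, not the literal BOINC code):

    # The quoted rule of thumb as I understand it (my rendering, not the literal
    # BOINC code): a result is "at risk" when its deadline is closer than twice
    # the connect/cache interval, regardless of how little work remains.
    def edf_by_rule_of_thumb(days_to_deadline, cache_days):
        return days_to_deadline < 2 * cache_days

    # Freshly downloaded 14-day Rosetta WUs against my 6-day cache: safe...
    print(edf_by_rule_of_thumb(14.0, 6))    # False
    # ...but two days into the queue the same WUs are due in just under 12 days:
    print(edf_by_rule_of_thumb(11.9, 6))    # True -- EDF mode, work queue or not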
.
34) Message boards : Number crunching : unnecessary earliest-deadline-first scheduling (Message 12026)
Posted 14 Mar 2006 by mikus
Post:

This is just one of those weird little oddities that boinc has. Keep in mind that when you give it a "connect every" setting, it assumes that it will actually be restricted exactly to that time frame.

In your case, with a 6 day connect setting, once boinc has gone a couple days into the queue, if it is allowed to actually reconnect again it will determine that it must finish everything that is due in less than about 10 days before the next connection that will occur in 6 more days - because it assumes that after the next connection it won't be allowed to connect again until 6 days later (and after the expiration of some of the work units).

I see from your results that 2 of the work units downloaded on the 12th were reported on the 14th. At that point, boinc assumes 6 days till the next connection (20th) - then 6 days after that for another (26th). Since that connection was made after the time of day for the workunits due on the 26th, boinc would assume that all work units due on the 26th would now have to be finished before the 20th - otherwise the connection (based on a strict 6 day connect interval) on the 26th would have occurred about an hour too late to report those work units due on the 26th.


Thanks for telling me. (Minor point: the client's EDF calculation <on the 14th> was done while running offline - hours before the connection occurred.)

But the client's arithmetic was WRONG !! It ran its EDF calculation on the morning of Mar 14. There were about 100 hours of work to do in its queue. It __would__ have finished before the 20th !!
.
35) Message boards : Number crunching : unnecessary earliest-deadline-first scheduling (Message 12016)
Posted 14 Mar 2006 by mikus
Post:
Rosetta is the only BOINC project on this system, which runs off-line (except for occasional connects). My preferences specify the CPU time per WU as 10 hours, and the interval between connects as 6 days.

On Mar 12 I did a download to "refresh" the queue of ready WUs at my system. When the currently-running WU completed, the next WU failed in less than one minute with code 1 (its problem is known, and has been solved by now). Two WUs then completed normally; the next WU after that again got code 1 (same problem). HOWEVER, after the 'finished' message for this failed WU, there now was the message "Resuming round-robin CPU scheduling". There had been *no* previous messages about scheduling methods.

The next WU completed normally. HOWEVER, __eight hours__ into the processing of the WU after that (the WU eventually completed normally), there was the message "Using earliest-deadline-first scheduling because computer is overcommitted".

Note that there were about 100 hours of work in the queue, with the earliest expiration-deadline about 10 days away. Plus the system had been processing 10 hour CPU time WUs for weeks now. There was NO REASON for the client to say the computer was overcommitted.


<Today I was again able to do a download to "refresh" the queue of ready WUs at the system, *despite* the last (spurious?) scheduling method message having told me the computer was 'overcommitted'.>
.
36) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 12013)
Posted 14 Mar 2006 by mikus
Post:
We have found the problem, and are resubmitting the jobs with a fix. There are still a few workunits with the following prefix out there that you can expect to fail very quickly:

HOMSdt_homDB0??_1dtj

this should not happen with the next batch.

Apparently some of these are still circulating. Within jobs downloaded on Mar 8:

http://boinc.bakerlab.org/rosetta/result.php?resultid=12969634
http://boinc.bakerlab.org/rosetta/result.php?resultid=12969645
.
37) Message boards : Number crunching : too many WUs downloaded (Message 11660)
Posted 4 Mar 2006 by mikus
Post:
For today's download of work, I did NOT interfere in any way. And everything *was* done more or less correctly. Yesterday, I had changed (at the website) my "time between connects" to 5 days. I've drawn the following two conclusions from today's download:

(1)
When the client calculates the amount of work to ask for, it only looks at the 'ready to run' WUs on its queue, and DOES NOT factor in already-requested work that is still in the process of being downloaded. This affects me, because I have a slow connection and my downloads take a LONG time -- that's why I started this topic in the first place.

(2)
However, today the client asked for work only twice -- once when the connection was first established (by me manually un-suspending communication); and once when it realized that the "time between connects" at the website had been upped from 3 days to 5 days. (This second request was for slightly too much -- probably because not ALL of the first request had been downloaded by the time the second request was made.)

That is how I *wanted* the client to behave -- NOT to periodically re-request work while downloads were still in progress. The most likely explanation for it behaving properly is that it had now established a "history" that WUs on my system take 10 hours each (as set in Rosetta preferences "CPU time").

On the earlier runs that I complained about, the new WUs had *longer* run times than the WUs for which the client had "history" -- possibly it was the short __"history"__ values that caused the client to repeat and repeat and repeat its (ever diminishing) work request calculations at four minute intervals (each time causing MORE and MORE and MORE work to be scheduled for download).
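Here is the work-request arithmetic as I believe the client was doing it (my reconstruction, in Python -- not the BOINC source; the 40-hour and 60-hour figures are made up for illustration):

    # The work-request arithmetic as I believe the client was doing it (my
    # reconstruction, not the BOINC source): only 'ready to run' results are
    # counted, so work still crawling down a slow link is invisible and gets
    # requested again on every recalculation.
    def seconds_to_request(cache_days, ready_secs, downloading_secs,
                           count_in_flight):
        target  = cache_days * 24 * 3600
        on_hand = ready_secs + (downloading_secs if count_in_flight else 0)
        return max(0, target - on_hand)

    ready     = 40 * 3600    # 40 hours of WUs already on the queue (illustrative)
    in_flight = 60 * 3600    # 60 hours requested earlier, still downloading

    print(seconds_to_request(5, ready, in_flight, count_in_flight=False))
    # 288000 s -- the in-flight work is ignored, so it gets requested all over again
    print(seconds_to_request(5, ready, in_flight, count_in_flight=True))
    # 72000 s -- counting the in-flight work would prevent the repeated over-request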
.
38) Message boards : Number crunching : unequal credit for same work unit to different systems (Message 11505)
Posted 1 Mar 2006 by mikus
Post:

Been reading the BOINC forums. Apparently for the SETI project, the server maintains a __table__ of the (averaged!) benchmark numbers as reported for each "CPU type". The server then uses that SINGLE internal *table value* when calculating the credit assigned for EVERY WU processed by that "CPU type", no matter which individual system reported that WU.

This seems to me to be much more "fair" than Rosetta's "blind acceptance" of the benchmark results reported by an individual system. In particular, the participant running an optimized client does NOT thereby receive a credits "bonus"; nor does someone running a Linux client thereby receive a credits "penalty" -- they are *all* credited based on that COMMON "average benchmark" for that 'CPU type'.

(The table values kept by the SETI server appear to be a "running average" -- as more recent benchmarks come in for that "CPU type", probably a small percentage of the table value is subtracted and the same small percentage of the reported benchmark is added. That way, improvements in the efficiency of the client are __gradually__ "merged in" to the table values. I believe that reported benchmark values with excessive deviation from the norm are simply ignored.)
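In code, the kind of running average I have in mind would look roughly like this (Python; the 5% weight and the outlier cutoff are guesses of mine, not SETI's actual values):

    # A guessed sketch of the running average (not SETI's actual code or values):
    # fold each new benchmark report into the per-"CPU type" table value,
    # ignoring wild outliers.
    def update_table_value(table_value, reported, weight=0.05, max_deviation=0.5):
        if abs(reported - table_value) > max_deviation * table_value:
            return table_value            # excessive deviation: simply ignored
        # Subtract a small share of the old value, add the same share of the new.
        return (1 - weight) * table_value + weight * reported

    table = 1000.0                          # averaged benchmark for this CPU type
    for report in (1020, 990, 3500, 1010):  # 3500 could be an "optimized client" report
        table = update_table_value(table, report)
    print(round(table, 1))                  # ~1000.9 -- drifts gently, outlier ignored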
.
39) Message boards : Number crunching : too many WUs downloaded (Message 11451)
Posted 27 Feb 2006 by mikus
Post:
To avoid another "flood" of WUs, as soon as the request (for 256200 seconds of work - that's what I currently have specified in my General preferences) was sent to the server, I *manually* clicked in BOINCmanager for: "No more work from Rosetta". I'm now watching 9 WUs being downloaded (preceded by 4.81). That would be correct if each ran for 8 hours. But earlier than today I had set my "Target CPU run time" value in the Rosetta preferences to 10 hours. Oh, well! Just another example of unexpected behavior, to keep in mind.

I have updated the FAQ on the time setting to try to better explain how this all works in practice. But the system MAY have got it right by giving you only 9 WUs. Remember, some of them may run longer than the time setting you have, because they will run to the completion time you asked for, but they will also always complete at least one model. So if your settings would allow 9.9 models to complete, it will run shorter than you expect. But if one model would take 10.5 hours it will run longer.

Assuming you complete a number of WUs ok, the quota will rise very quickly

You did not catch what I was complaining about:
- The client asked for 256200 seconds (72 hours) of work. (correct)
- The SERVER sent 9 WUs (as though each would run for 8 hours; 9 WUs times 8 hours/WU = 72 hours total). IF the server had realized that my "Target CPU run time" was set to 10 hours, 9 WUs represent __90__ hours of work (instead of the 72 hours that had been requested). I had expected the server to send me 8 WUs -- 8 WUs times 10 hours/WU = 80 hours total (the closest multiple of 10 <the hours/WU> that exceeds 72). To me, that total of 9 WUs is yet another case of TOO MANY WUs DOWNLOADED.
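The arithmetic of the complaint, spelled out (my own sketch of what I expected the server to compute, not its actual code):

    # My sketch of what I expected (not the scheduler's actual code): fill the
    # requested seconds using the 10 hours/WU from my Rosetta preferences,
    # not a built-in 8 hours/WU.
    import math

    requested_seconds = 256200

    def wus_to_send(requested_seconds, hours_per_wu):
        return math.ceil(requested_seconds / (hours_per_wu * 3600))

    print(wus_to_send(requested_seconds, 8))    # 9 WUs -- what the server sent (72 h)
    print(wus_to_send(requested_seconds, 10))   # 8 WUs -- what I expected     (80 h)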
.
40) Message boards : Number crunching : too many WUs downloaded (Message 11450)
Posted 27 Feb 2006 by mikus
Post:
The larger the max cpu time setting, the less frequent the communications with the project servers, and the less likely the system will be overwhelmed with communication requests, such as happened when we were handed out lots of 15 min WUs and we became a distributed denial of service attack on the project servers.

I have no argument with what you are saying. Yes, allowing longer-running WUs *does* mean that participants get to download from the server less frequently.

But it will take a client change to overcome the problem for which this thread was originally opened -- when downloads are SLOW, the current client can __re-request__ (more) work before the download of the *earlier-requested* work has completed. In that case, the second request (plus any follow-on requests) results in TOO MANY WUs DOWNLOADED (no matter what the preferences settings are).
.