How do I correct BOINC confusion?



Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 26106 - Posted: 5 Sep 2006, 14:21:39 UTC

One of my computers seems to have reached a confused state with respect to BOINC duration and completion.

This laptop crunches for Rosetta when it's powered up and I'm using it. It has a modest six-hour CPU target, and my last work unit seemed pretty normal in terms of CPU seconds used. Clock time did stretch out, since I was traveling over the long weekend.

When my next work unit downloaded (this morning) the message log stated that the computer was overcommitted immediately upon download and start. The task is 5.6% completed in 25m of CPU, but estimated time to completion is 512h 18m. Seems a bit beyond belief.

Do I need to do anything about this, and if so how do I correct whatever magic cookie is driving the estimate so far out? Thanks!
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 26109 - Posted: 5 Sep 2006, 14:43:51 UTC

It will correct itself after a few more WUs. However, you can reset it if you want:
1. Close BOINC.
2. Open the file "client_state.xml" with a text editor.
3. Search for the term "duration_correction_factor" under the Rosetta project (careful: if you are attached to several projects, each project has its own duration_correction_factor).
4. Change the value to "1", save the file, and restart BOINC.
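
For anyone who would rather script steps 2 to 4, here is a minimal Python sketch. It assumes client_state.xml keeps one <project> element per attached project, each with a <master_url> and a <duration_correction_factor> child; check your own file first, since the layout may differ between BOINC versions. The function name reset_dcf and the sample XML are made up for illustration. Close BOINC and back up the file before trying anything like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; a real client_state.xml holds far more.
SAMPLE = """<client_state>
  <project>
    <master_url>https://boinc.bakerlab.org/rosetta/</master_url>
    <duration_correction_factor>68.500000</duration_correction_factor>
  </project>
  <project>
    <master_url>https://example.org/other_project/</master_url>
    <duration_correction_factor>1.200000</duration_correction_factor>
  </project>
</client_state>"""


def reset_dcf(xml_text, url_fragment, new_value=1.0):
    """Return xml_text with the DCF reset for projects whose URL contains url_fragment."""
    root = ET.fromstring(xml_text)
    for project in root.iter("project"):
        url = project.findtext("master_url", default="")
        dcf = project.find("duration_correction_factor")
        # Only touch the matching project; other projects keep their own factor.
        if url_fragment in url and dcf is not None:
            dcf.text = f"{new_value:.6f}"
    return ET.tostring(root, encoding="unicode")


updated = reset_dcf(SAMPLE, "bakerlab.org")
```

Matching on the project URL is what keeps the other projects' factors untouched, which is the caveat in step 3.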
Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 26202 - Posted: 6 Sep 2006, 19:35:11 UTC
Last modified: 6 Sep 2006, 19:37:48 UTC

This is a workaround that helps quite often but it might not be sufficient in this specific case.

The "time_stats" struct in client_state.xml has an element called "active_frac". If this is well below 0.9999, it will influence the estimates and caching too. The same goes for "cpu_efficiency": a low value there confuses BOINC as well.


Actually, BOINC is not really confused: this behaviour is intended and protects against downloading far more work than the box can handle if it is running only a few hours per week.
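
To look at the two values mentioned above without scrolling through the whole file, something like the following Python sketch may help. It assumes <time_stats> sits directly under the root of client_state.xml and contains <active_frac> and <cpu_efficiency>; the function name read_time_stats and the numbers in the sample are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the numbers are invented for illustration.
SAMPLE = """<client_state>
  <time_stats>
    <active_frac>0.904000</active_frac>
    <cpu_efficiency>0.420000</cpu_efficiency>
  </time_stats>
</client_state>"""


def read_time_stats(xml_text, names=("active_frac", "cpu_efficiency")):
    """Return the requested time_stats values as floats (None if missing)."""
    stats = ET.fromstring(xml_text).find("time_stats")
    result = {}
    for name in names:
        text = stats.findtext(name) if stats is not None else None
        result[name] = float(text) if text is not None else None
    return result


stats = read_time_stats(SAMPLE)
```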
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 26593 - Posted: 11 Sep 2006, 13:46:22 UTC

So, things seem to be getting worse (or at least not better) for the laptop I mentioned at the start of this thread (link to the machine in the first post).

As my previous WU was approaching completion (evening of 9/9), I saw this traffic in my Messages:


9/9/2006 5:30:17 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
9/9/2006 5:30:17 PM|rosetta@home|Reason: To fetch work
9/9/2006 5:30:17 PM|rosetta@home|Requesting 4860 seconds of new work
9/9/2006 5:30:22 PM|rosetta@home|Scheduler request succeeded
9/9/2006 5:30:22 PM|rosetta@home|Message from server: No work sent
9/9/2006 5:30:22 PM|rosetta@home|Message from server: (won't finish in time) Computer on 84.4% of time, BOINC on 90.4% of that, this project gets 100.0% of that
9/9/2006 5:30:22 PM|rosetta@home|No work from project


I didn't get new work until the task had finished, uploaded, and BOINC made its next communications attempt after the upload (yes, a dreaded gap in working).

Hoping things would straighten themselves out, I haven't done anything, just carried on using my laptop as needed over the weekend. This morning, as the next WU approaches completion, I see:


9/11/2006 8:41:09 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
9/11/2006 8:41:09 AM|rosetta@home|Reason: To fetch work
9/11/2006 8:41:09 AM|rosetta@home|Requesting 1982 seconds of new work
9/11/2006 8:41:14 AM|rosetta@home|Scheduler request succeeded
9/11/2006 8:41:14 AM|rosetta@home|Message from server: No work sent
9/11/2006 8:41:14 AM|rosetta@home|Message from server: (won't finish in time) Computer on 85.2% of time, BOINC on 90.9% of that, this project gets 100.0% of that
9/11/2006 8:41:14 AM|rosetta@home|No work from project


Any suggestions? The machine up time (laptop on time), percent of time work is allowed (I suspend Rosetta manually and let the cooling system pull the CPU and hard drive temps back down before shutting down), and CPU efficiency (ThreadMaster throttling is set at 42% to keep the fan on its lower [noise] speed) all look more or less correct to me. I'm not asking it to grab a big queue of work, just would like the next job ready to go when this one finishes.
anders n

Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 26596 - Posted: 11 Sep 2006, 14:17:27 UTC

Hi Alan

Have you tried setting your Rosetta preference for target CPU run time to something like 2 or 3 hours?

Anders n
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 26601 - Posted: 11 Sep 2006, 14:43:49 UTC

Looks like the laptop has a 6hr runtime preference. Do you have your General Preference for "Connect to network every ... days" set to higher than 0.25?

I've not understood why the scheduler doesn't figure out that completely out of work means we should attempt to connect regardless, but I think if you just increase your cache size to match or exceed your runtime preference, then you will always have work. Actually, I'd set the cache to double the WU runtime. So, with your 6hr (.25 days) runtime, I'd set the cache to connect every .5 days.

In case it was unclear, the percentages shown in the message basically say that you don't have your laptop powered on 100% of the time and, as you say, that it's not running BOINC 100% of the time it is powered on. These percentages are shown there because they are used in the calculations that decide whether more tasks should be downloaded.

In the scenario above, if you try to keep 0.5 days of work on hand and your machine is only running BOINC 90% of 85% of the time, that is about 77%. So BOINC estimates that to keep your machine busy for 0.5 days, it actually only needs (0.5 times 77%) 0.38 days of work (about 9 hrs). That's part of what's behind my suggestion to double your WU runtime preference to get the value for your "connect to network" setting.
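
The arithmetic above can be sketched in Python. This only mirrors the reasoning in this post, not BOINC's actual scheduler code; the function name availability is hypothetical, and the message text is the one from the 9/11 log earlier in the thread.

```python
import re

# Server message copied from the log earlier in this thread.
MESSAGE = ("(won't finish in time) Computer on 85.2% of time, "
           "BOINC on 90.9% of that, this project gets 100.0% of that")


def availability(message):
    """Multiply every percentage in the message into a single fraction."""
    percents = [float(p) for p in re.findall(r"(\d+(?:\.\d+)?)%", message)]
    fraction = 1.0
    for p in percents:
        fraction *= p / 100.0
    return fraction


frac = availability(MESSAGE)          # roughly 0.77
cache_days = 0.5                      # the "connect every" value suggested above
work_hours = cache_days * frac * 24   # roughly 9.3 hours of work
```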

As long as you don't set WU runtime and cache so high that you end up requesting more work than you can crunch before the deadline (currently 7 days), it isn't a problem.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Alan Roberts

Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 26619 - Posted: 11 Sep 2006, 18:30:50 UTC
Last modified: 11 Sep 2006, 18:32:17 UTC

Anders n: Cranking down from my current 6hr to 2-3hr was my change of last resort, since I understood the argument to be that pulling down a bigger work unit resulted in less demand on the servers over time.

Feet1st: Was set to 6hrs and connect every 0.25 days. This morning's work unit finished and uploaded, and I was without work, in a long communication deferred interval. So I upped connect time to 0.5 days and forced an update.

I've got work again, but the message traffic went:


9/11/2006 1:35:25 PM|rosetta@home|Finished upload of file 1di2__ROTLARS_ABRELAX_JOINT_MAXFRAGS_BARCODE__1228_4615_0_0
9/11/2006 1:35:25 PM|rosetta@home|Throughput 44597 bytes/sec
9/11/2006 1:39:00 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
9/11/2006 1:39:00 PM|rosetta@home|Reason: Requested by user
9/11/2006 1:39:00 PM|rosetta@home|Requesting 21600 seconds of new work, and reporting 1 completed tasks
9/11/2006 1:39:05 PM|rosetta@home|Scheduler request succeeded
9/11/2006 1:39:05 PM|rosetta@home|General preferences have been updated
9/11/2006 1:39:05 PM||General prefs: from rosetta@home (last modified 2006-09-11 13:37:59)
9/11/2006 1:39:05 PM||General prefs: using separate prefs for home

...

9/11/2006 1:39:17 PM||Rescheduling CPU: files downloaded
9/11/2006 1:39:17 PM||Using earliest-deadline-first scheduling because computer is overcommitted.
9/11/2006 1:39:17 PM|rosetta@home|Starting task 1pgs__BOINC_ABINITIO_SAVE_ALL_OUT_hom001__1232_74630_0 using rosetta version 525
9/11/2006 1:39:21 PM||Suspending work fetch because computer is overcommitted.


Thus it thinks I'm overcommitted on a 6 CPU-hour task at the start of a seven-day (wall-clock) interval before the deadline, even though my average turnaround time for this machine is 2.63 days. Guess I'll see if it balks at downloading new work towards the end of this WU.

BOINC is currently estimating 55 hrs remaining to completion on this WU, even though I just finished Model #1 after 41 minutes of clock time and percent complete jumped to 4.72%. That seems like about 13.8 hours of crunching remaining to me.
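
The 13.8-hour figure comes from a simple linear extrapolation, sketched below in Python. It assumes the rest of the task progresses at the same pace as the first model, which may not hold; hours_remaining is a hypothetical helper, not anything BOINC provides.

```python
def hours_remaining(elapsed_minutes, fraction_done):
    """Linear extrapolation: assumes the rest of the task runs at the same pace."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_minutes = elapsed_minutes / fraction_done
    return (total_minutes - elapsed_minutes) / 60.0


remaining = hours_remaining(41, 0.0472)   # about 13.8 hours
```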

Does anyone know off hand if my Result duration correction factor should be: 1 / ( BOINC running fraction * work allowed fraction * CPU efficiency )? If not no need to go digging, I can find some time to head over to the BOINC site and read their documentation.
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 26623 - Posted: 11 Sep 2006, 19:03:24 UTC

Alan, the current estimated 55hrs is probably what triggered Earliest Deadline First. Not to worry. It will settle.

Yes, my point was that by doubling your cache size, you achieve the same result as if you dropped to a 2-3hr WU runtime pref.

I've never adjusted my result duration correction factor. Just keep running like you have been and it will all settle.



