BOINC unsure how many CPUs to use

Message boards : Number crunching : BOINC unsure how many CPUs to use

Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37075 - Posted: 21 Feb 2007, 19:19:12 UTC

I seem to have a problem on a dual-cpu box, which has only appeared since upgrading BOINC 5.4.11 -> 5.8.11.

BOINC can't decide if the box has one or two cpus.

When I look at the status in BoincView sometimes one task is running, and sometimes two.

I am attached to three projects, all of which have this host set to location Work.

In my general prefs, Default and Home both have max 1 cpu.

Work and School both had 2 cpus max. I have changed both of these to 4 cpus in the hope that the change would force a reload of whatever dodgy value is causing the problem, but the problem has recurred.

An additional point that makes troubleshooting harder is that 5.8.11 seems not to log when a task is suspended, only when it is started, restarted, or resumed. This makes it hard to tell from the log exactly when the second cpu is being stopped.

With all versions up to 5.4.11 when the client thought the number of usable cpus had changed, it ran the benchmarks which had the side-effect of making the change obvious in the log. This is not happening here.

Here is the log for the box; lines starting ### have been inserted by me.

21/02/2007 17:17:54||Starting BOINC client version 5.8.11 for windows_intelx86
21/02/2007 17:17:54||log flags: task, file_xfer, sched_ops
21/02/2007 17:17:54||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
21/02/2007 17:17:54||Executing as a daemon
21/02/2007 17:17:54||Data directory: C:\Program Files\BOINC
21/02/2007 17:17:54||BOINC is running as a service and as a non-system user.
21/02/2007 17:17:54||No application graphics will be available.
21/02/2007 17:17:54||Processor: 2 GenuineIntel x86 Family 6 Model 8 Stepping 3 665MHz [x86 Family 6 Model 8 Stepping 3] [fpu tsc sse mmx]
21/02/2007 17:17:54||Memory: 255.48 MB physical, 617.92 MB virtual
21/02/2007 17:17:54||Disk: 4.34 GB total, 794.00 MB free
21/02/2007 17:17:54|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 304101; location: work; project prefs: work
21/02/2007 17:17:54|Leiden Classical|URL: http://boinc.gorlaeus.net/; Computer ID: 8747; location: work; project prefs: default
21/02/2007 17:17:54|lhcathome|URL: http://lhcathome.cern.ch/; Computer ID: 2138390; location: work; project prefs: default
21/02/2007 17:17:54||General prefs: from rosetta@home (last modified 2007-02-21 09:22:18)
21/02/2007 17:17:54||Host location: work
21/02/2007 17:17:54||General prefs: using separate prefs for work
21/02/2007 17:17:55|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom018__1568_10058_0 using rosetta version 546

### note only one task started, yet there is another waiting to start

21/02/2007 17:18:52|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:18:52|lhcathome|Requesting 11887 seconds of new work
21/02/2007 17:18:57|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:18:57|lhcathome|Deferring communication for 7 sec
21/02/2007 17:18:57|lhcathome|Reason: requested by project
21/02/2007 17:18:57|lhcathome|Deferring communication for 1 min 0 sec
21/02/2007 17:18:57|lhcathome|Reason: no work from project
21/02/2007 17:20:00|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:20:00|lhcathome|Requesting 11888 seconds of new work
21/02/2007 17:20:05|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:20:05|lhcathome|Deferring communication for 7 sec
21/02/2007 17:20:05|lhcathome|Reason: requested by project
21/02/2007 17:20:05|lhcathome|Deferring communication for 1 min 0 sec
21/02/2007 17:20:05|lhcathome|Reason: no work from project
21/02/2007 17:21:11|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:21:11|lhcathome|Requesting 11888 seconds of new work
21/02/2007 17:21:16|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:21:16|lhcathome|Deferring communication for 7 sec
21/02/2007 17:21:16|lhcathome|Reason: requested by project
21/02/2007 17:21:16|lhcathome|Deferring communication for 2 min 27 sec
21/02/2007 17:21:16|lhcathome|Reason: no work from project
21/02/2007 17:21:22|rosetta@home|Sending scheduler request: Requested by user
21/02/2007 17:21:22|rosetta@home|(not requesting new work or reporting completed tasks)
21/02/2007 17:21:26|rosetta@home|Scheduler RPC succeeded [server version 509]
21/02/2007 17:21:26||General prefs: from rosetta@home (last modified 2007-02-21 17:21:05)
21/02/2007 17:21:26||Host location: work
21/02/2007 17:21:26||General prefs: using separate prefs for work
21/02/2007 17:21:26|rosetta@home|Deferring communication for 4 min 2 sec
21/02/2007 17:21:26|rosetta@home|Reason: requested by project
21/02/2007 17:23:48|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:23:48|lhcathome|Requesting 11890 seconds of new work
21/02/2007 17:23:53|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:23:53|lhcathome|Deferring communication for 7 sec
21/02/2007 17:23:53|lhcathome|Reason: requested by project
21/02/2007 17:23:53|lhcathome|Deferring communication for 2 min 58 sec
21/02/2007 17:23:53|lhcathome|Reason: no work from project
21/02/2007 17:26:56|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:26:56|lhcathome|Requesting 11891 seconds of new work
21/02/2007 17:27:01|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:27:01|lhcathome|Deferring communication for 7 sec
21/02/2007 17:27:01|lhcathome|Reason: requested by project
21/02/2007 17:27:01|lhcathome|Deferring communication for 10 min 52 sec
21/02/2007 17:27:01|lhcathome|Reason: no work from project
21/02/2007 17:27:35|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546

### suddenly for no obvious reason, it now decides to start the other task

21/02/2007 17:37:56|lhcathome|Sending scheduler request: To fetch work
21/02/2007 17:37:56|lhcathome|Requesting 11896 seconds of new work
21/02/2007 17:38:00|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 17:38:00|lhcathome|Deferring communication for 7 sec
21/02/2007 17:38:00|lhcathome|Reason: requested by project
21/02/2007 17:38:00|lhcathome|Deferring communication for 25 min 12 sec
21/02/2007 17:38:00|lhcathome|Reason: no work from project
21/02/2007 17:51:18|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546

### and here it is resuming the second task again without any sign of having suspended it

21/02/2007 18:03:16|lhcathome|Sending scheduler request: To fetch work
21/02/2007 18:03:16|lhcathome|Requesting 11906 seconds of new work
21/02/2007 18:03:21|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 18:03:21|lhcathome|Deferring communication for 7 sec
21/02/2007 18:03:21|lhcathome|Reason: requested by project
21/02/2007 18:03:21|lhcathome|Deferring communication for 14 min 52 sec
21/02/2007 18:03:21|lhcathome|Reason: no work from project
21/02/2007 18:04:09|rosetta@home|Sending scheduler request: Requested by user
21/02/2007 18:04:09|rosetta@home|(not requesting new work or reporting completed tasks)
21/02/2007 18:04:13|rosetta@home|Scheduler RPC succeeded [server version 509]
21/02/2007 18:04:13|rosetta@home|Deferring communication for 4 min 2 sec
21/02/2007 18:04:13|rosetta@home|Reason: requested by project
21/02/2007 18:18:21|lhcathome|Sending scheduler request: To fetch work
21/02/2007 18:18:21|lhcathome|Requesting 11909 seconds of new work
21/02/2007 18:18:31|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 18:18:31|lhcathome|Deferring communication for 7 sec
21/02/2007 18:18:31|lhcathome|Reason: requested by project
21/02/2007 18:18:31|lhcathome|Deferring communication for 4 min 10 sec
21/02/2007 18:18:31|lhcathome|Reason: no work from project
21/02/2007 18:22:45|lhcathome|Sending scheduler request: To fetch work
21/02/2007 18:22:45|lhcathome|Requesting 11912 seconds of new work
21/02/2007 18:22:56|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 18:22:56|lhcathome|Deferring communication for 7 sec
21/02/2007 18:22:56|lhcathome|Reason: requested by project
21/02/2007 18:22:56|lhcathome|Deferring communication for 1 min 0 sec
21/02/2007 18:22:56|lhcathome|Reason: no work from project
21/02/2007 18:23:59|lhcathome|Fetching scheduler list
21/02/2007 18:24:04|lhcathome|Master file download succeeded
21/02/2007 18:24:10|lhcathome|Sending scheduler request: To fetch work
21/02/2007 18:24:10|lhcathome|Requesting 11912 seconds of new work
21/02/2007 18:24:15|lhcathome|Scheduler RPC succeeded [server version 502]
21/02/2007 18:24:15|lhcathome|Deferring communication for 7 sec
21/02/2007 18:24:15|lhcathome|Reason: requested by project
21/02/2007 18:24:15|lhcathome|Deferring communication for 1 min 0 sec
21/02/2007 18:24:15|lhcathome|Reason: no work from project
21/02/2007 18:24:48|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546

### and again

Much more of the same follows. From the progress I'd estimate that the second task has run for less than 20 min in over two hours, whereas with two tasks and two cpus there should be full opportunity to run both tasks.
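One way to make the pattern in a log like this easier to see is to pull out just the task lines. The following is a minimal Python sketch, not anything that ships with BOINC: `task_events` is a hypothetical helper, it assumes the pipe-separated `date|project|message` layout shown in the log above, and it only recognises the two verbs that actually appear there ("Restarting task" and "Resuming task"). Since suspensions are not logged, each "Resuming" line is the only trace that the task was paused at some earlier point.

```python
from datetime import datetime

# A few task lines copied from the log above; paste a full log here to analyse it.
LOG = """\
21/02/2007 17:17:54||Starting BOINC client version 5.8.11 for windows_intelx86
21/02/2007 17:17:55|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom018__1568_10058_0 using rosetta version 546
21/02/2007 17:27:35|rosetta@home|Restarting task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
21/02/2007 17:51:18|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
21/02/2007 18:24:48|rosetta@home|Resuming task s021__BOINC_ABRELAX_NEWRELAXFLAGS_hom010__1568_10137_0 using rosetta version 546
"""

# Only the verbs seen in this thread's log; other wordings are not assumed.
VERBS = ("Restarting task", "Resuming task")


def task_events(log_text):
    """Return (timestamp, verb, task_name) for each task restart/resume line."""
    events = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # date | project | message
        if len(parts) != 3:
            continue
        stamp, _project, message = parts
        for verb in VERBS:
            if message.startswith(verb + " "):
                when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
                name = message[len(verb) + 1:].split(" ", 1)[0]
                events.append((when, verb.split()[0], name))
                break
    return events


events = task_events(LOG)
for when, verb, name in events:
    print(when.strftime("%H:%M:%S"), verb, name)

# Every "Resuming" line implies an earlier suspension that was never logged.
unlogged_pauses = sum(1 for _, verb, _ in events if verb == "Resuming")
print("suspensions implied but not logged:", unlogged_pauses)
```

Listing the events this way shows at a glance which task keeps being resumed and how far apart the resumes are.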

Is this an artefact of version 5.8.11 or something else?

River~~
ID: 37075 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 37078 - Posted: 21 Feb 2007, 19:27:22 UTC
Last modified: 21 Feb 2007, 19:29:24 UTC

Look at the status: after download, units report "ready to run"; if they are running they report "running"; if they have been paused/preempted, they read "waiting to run".

With the new memory settings in your "general prefs", if it decides you don't have enough memory, it used to (a couple of alpha versions ago) read "waiting for memory", and it will eventually change the status to "waiting to run" if memory wasn't available right away.

The default settings are 50% while in use and 90% while not in use. If you haven't changed them, you might want to. Also look at the BOINC Manager status to see if that's what you're seeing.
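To put rough numbers on those defaults, here is a small Python sketch of the arithmetic. It assumes the limit is simply the stated percentage of physical memory (how the client really applies it is not confirmed here), and it takes the 255.48 MB figure from the startup log earlier in this thread. The function name is made up for illustration.

```python
# Physical memory as reported by the client's startup log in this thread.
PHYSICAL_MB = 255.48

# Default limits quoted above, expressed as fractions of physical memory.
LIMIT_IN_USE = 0.50
LIMIT_NOT_IN_USE = 0.90


def memory_budget_mb(physical_mb, fraction):
    """Memory BOINC tasks may use, assuming a plain fraction of physical RAM."""
    return physical_mb * fraction


in_use_mb = memory_budget_mb(PHYSICAL_MB, LIMIT_IN_USE)
not_in_use_mb = memory_budget_mb(PHYSICAL_MB, LIMIT_NOT_IN_USE)

print(f"budget while computer is in use:     {in_use_mb:.1f} MB")
print(f"budget while computer is not in use: {not_in_use_mb:.1f} MB")
```

On a box this small the "in use" budget comes to roughly half of 256 MB, which is not much room for two memory-hungry tasks at once.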

I had quite an issue with this prior to Rosetta updating their server software, and when I was using Linux.
ID: 37078 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37080 - Posted: 21 Feb 2007, 19:36:35 UTC - in response to Message 37078.  

> Look at the status: after download, units report "ready to run"; if they are running they report "running"; if they have been paused/preempted, they read "waiting to run".

> With the new memory settings in your "general prefs", if it decides you don't have enough memory, it used to (a couple of alpha versions ago) read "waiting for memory", and it will eventually change the status to "waiting to run" if memory wasn't available right away.

> The default settings are 50% while in use and 90% while not in use. If you haven't changed them, you might want to. Also look at the BOINC Manager status to see if that's what you're seeing.

> I had quite an issue with this prior to Rosetta updating their server software, and when I was using Linux.



Thanks Astro, spot on. And thanks too for a very quick reply.

The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV.

R~~
ID: 37080 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 37081 - Posted: 21 Feb 2007, 19:47:29 UTC
Last modified: 21 Feb 2007, 19:48:32 UTC

It had me scratching my head for a while when I first saw that funky behavior, and I even made a fool of myself trying to convince Dr. Anderson that there was a problem. LOL

It seems Rosetta is one that either uses or reserves lots of memory, I'm still not positive how that works.
ID: 37081 · Rating: 0
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 37082 - Posted: 21 Feb 2007, 20:06:51 UTC - in response to Message 37081.  

> It had me scratching my head for a while when I first saw that funky behavior, and I even made a fool of myself trying to convince Dr. Anderson that there was a problem. LOL

> It seems Rosetta is one that either uses or reserves lots of memory, I'm still not positive how that works.


Rosetta uses a lot of memory and increasingly so as it works through. Eventually it may just stop. I am unsure whether or how this gets fixed, or what happens when it occurs, but I do remember it coming up in alpha (Mikus, I think).

Since you only have 256MB you will see this 'stalling' more often, with little chance of getting two Rosettas to run at the same time.


Team mauisun.org
ID: 37082 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37083 - Posted: 21 Feb 2007, 20:35:03 UTC - in response to Message 37082.  

> Rosetta uses a lot of memory and increasingly so as it works through.

That's interesting. As it works through a single decoy, or as it works through a series of decoys in a long run?

> Since you only have 256MB you will see this 'stalling' more often, with little chance of getting two Rosettas to run at the same time.


Well, I admit running two in 256 is a bit cheeky when the spec is for 256 anyway. Tho I could be even more cheeky and point out that the spec does not ask for any more for multi-cpus...

But in fact two are running happily together now I've given them 99% of the memory to share between them. The server had blanks in for these figures, so presumably they defaulted to something plausible.

R~~
ID: 37083 · Rating: 0
BobCat13

Joined: 18 Jun 06
Posts: 4
Credit: 130,387
RAC: 0
Message 37096 - Posted: 22 Feb 2007, 5:46:35 UTC - in response to Message 37075.  

> An additional point that makes troubleshooting harder is that 5.8.11 seems not to log when a task is suspended, only when it is started, restarted, or resumed. This makes it hard to tell from the log exactly when the second cpu is being stopped.

You need to set a flag in cc_config.xml to see those messages on 5.8.x

If you already have a cc_config.xml file, check that the following line appears inside its <log_flags> section:

<cpu_sched>1</cpu_sched>


If you don't have a cc_config.xml file, create a blank text file, then add the following:

<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
</log_flags>
</cc_config>

and save it as cc_config.xml
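If you would rather script that edit than do it by hand, something along these lines should work. It is only a sketch in Python using the standard library's ElementTree: `enable_log_flag` is a made-up helper name, the file layout is taken from the snippet above, and the demo writes to a temporary directory rather than guessing where your BOINC data directory is. It also assumes any existing file already has `<cc_config>` as its root element.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def enable_log_flag(config_path, flag="cpu_sched"):
    """Create or update cc_config.xml so <log_flags> contains <flag>1</flag>."""
    path = Path(config_path)
    if path.exists():
        # Assumes the existing file's root element is <cc_config>.
        tree = ET.parse(str(path))
        root = tree.getroot()
    else:
        root = ET.Element("cc_config")
        tree = ET.ElementTree(root)

    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")

    element = log_flags.find(flag)
    if element is None:
        element = ET.SubElement(log_flags, flag)
    element.text = "1"

    tree.write(str(path))
    return path


# Demo against a scratch directory; point it at your own data directory instead.
demo_dir = Path(tempfile.mkdtemp())
demo_file = enable_log_flag(demo_dir / "cc_config.xml")
print(demo_file.read_text())
```

Running it twice on the same file leaves a single `<cpu_sched>` entry, so it is safe to re-run.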

You can find the flags and options available for cc_config.xml here.
ID: 37096 · Rating: 0
BobCat13

Joined: 18 Jun 06
Posts: 4
Credit: 130,387
RAC: 0
Message 37097 - Posted: 22 Feb 2007, 5:50:57 UTC - in response to Message 37080.  

> The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV.

Which version of BV are you using? The 1.4.1 and 1.4.2 beta versions have the "Waiting for memory" message listed, but I have never had BOINC pause for that reason so I can't be sure it is displayed.
ID: 37097 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37098 - Posted: 22 Feb 2007, 6:16:25 UTC - in response to Message 37097.  

> The 'Waiting for memory' message is there if I look at the state from the new shiny BOINCmgr, but BOINCview simply says 'Paused'. So chalk one advantage up to the new BM over BV.

> Which version of BV are you using? The 1.4.1 and 1.4.2 beta versions have the "Waiting for memory" message listed, but I have never had BOINC pause for that reason so I can't be sure it is displayed.


Good spot. Still running BV 1.2.2 :-(

And thanks for the config file info. It still seems odd to me to have a default setting that shows things resuming but not pausing; to my mind it would be more logical to have both or neither. But at least if it is settable then I can tweak it to my liking.

btw I like your handle which you share with my downstairs neighbour who runs a home publishing business called BobCat press. Named after his cat, Bob, of course...

R~~
ID: 37098 · Rating: 0
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37099 - Posted: 22 Feb 2007, 6:32:35 UTC - in response to Message 37082.  
Last modified: 22 Feb 2007, 7:06:50 UTC

> Since you only have 256MB you will see this 'stalling' more often, with little chance of getting two Rosettas to run at the same time.


Been looking at this FC.

Where I win is by not having a GUI. Not windoze, not KDE, not Gnome, and not BoincMgr. Not even BV on the same machine. So here are my meminfo figures at a point where one Rosetta has been running 12hrs and is 4000 sec into the current decoy, and the other Rosetta has been running 600 sec on its first decoy:

ric-gw-live:~# cat /proc/meminfo
MemTotal:       256268 kB
MemFree:         15204 kB
Buffers:         17364 kB
Cached:          51656 kB
SwapCached:          0 kB
Active:         184028 kB
Inactive:        32752 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       256268 kB
LowFree:         15204 kB
SwapTotal:      682720 kB
SwapFree:       682720 kB
Dirty:             100 kB
Writeback:           0 kB
Mapped:         159144 kB
Slab:            19728 kB
Committed_AS:   298648 kB
PageTables:        824 kB
VmallocTotal:   770040 kB
VmallocUsed:      3064 kB
VmallocChunk:   766804 kB


As you can see, both Rosettas are fitting well into real memory. The problem was that before Astro's tip they were trying to both fit into half the memory, as BOINC seems clever enough to spot that this machine is also doing other stuff, like serving internal web pages and so on.
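For anyone who wants to read a listing like that without doing the sums by hand, here is a rough Python sketch. It assumes the usual "Name: value kB" layout of /proc/meminfo and treats free memory, buffers, and page cache as reclaimable, which is a common rule of thumb rather than an exact accounting. The values are copied from the listing above; `parse_meminfo` is just an illustrative helper.

```python
# Selected lines from the /proc/meminfo listing above.
MEMINFO = """\
MemTotal:       256268 kB
MemFree:         15204 kB
Buffers:         17364 kB
Cached:          51656 kB
SwapTotal:      682720 kB
SwapFree:       682720 kB
"""


def parse_meminfo(text):
    """Map each field name to its integer value in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


info = parse_meminfo(MEMINFO)

# Rule of thumb: free + buffers + cache can be handed back to programs.
reclaimable_kb = info["MemFree"] + info["Buffers"] + info["Cached"]
used_by_programs_kb = info["MemTotal"] - reclaimable_kb
swap_used_kb = info["SwapTotal"] - info["SwapFree"]

print("reclaimable:", reclaimable_kb, "kB")
print("used by programs:", used_by_programs_kb, "kB")
print("swap in use:", swap_used_kb, "kB")
```

The swap figure is the quickest check: if SwapFree equals SwapTotal, nothing has been pushed out of real memory.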

And, of course, I'd have noticed if there was a genuine memory problem, as the machine's main mission would have suffered and max cpus would have been set back to 1 (or even BOINC removed). The new memory limits will prove useful in future, when Rosetta does get big enough to cause this kind of issue, but on a Linux command-line only box there is a long way to go yet.

Edit, added:

I won't bore you with another meminfo listing, but my experiments indicate that running a second Rosetta adds between 80MB and 100MB of memory usage. For example, top shows the two Rosettas each with around 30% to 35% of memory. The BOINC client (v5.8.11) weighs in at 1.3% of 256MB, which is under 4MB, so the client is not a significant memory issue.

One area where the footprint of the second task is smaller than you'd expect is that all the shared library code is only loaded into real memory once, even if both tasks are using it and even if they both have it at different virtual addresses (the magic of the VM mapping - all credit to Intel for that, for their 386 memory design).

Perhaps there is a case for the System Requirements page to show a smaller figure for the memory usage needed on a command line only machine, and larger requirements for people hoping to run multi cpus.

So for a GUI operating system (Win, Mac, KDE, Gnome) you need (I am suggesting) around 150MB overhead plus 100MB per cpu running Rosetta, which for one cpu is consistent with the advice to have 256MB installed.

On a Linux command-line only box, something like 50MB overhead plus 100MB per cpu running Rosetta.

So the other way of looking at it is that by throwing out KDE/Windows I get to run a whole extra Rosetta. Seems like a good tradeoff on a box that doesn't even have its own monitor...
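The rule of thumb suggested above is easy to turn into a quick calculation. This Python sketch just encodes those rough figures (150MB overhead with a GUI, 50MB without, 100MB per Rosetta task); they are the estimates from this post, not official requirements, and the function names are invented for the example.

```python
# Rough figures suggested in this post, all in MB. Estimates, not official specs.
GUI_OVERHEAD_MB = 150
CLI_OVERHEAD_MB = 50
PER_ROSETTA_MB = 100


def estimated_need_mb(rosetta_tasks, gui):
    """Estimated memory needed to run the given number of Rosetta tasks."""
    overhead = GUI_OVERHEAD_MB if gui else CLI_OVERHEAD_MB
    return overhead + PER_ROSETTA_MB * rosetta_tasks


def max_rosetta_tasks(installed_mb, gui):
    """How many Rosetta tasks fit in installed memory under the same estimate."""
    overhead = GUI_OVERHEAD_MB if gui else CLI_OVERHEAD_MB
    return max(0, (installed_mb - overhead) // PER_ROSETTA_MB)


for gui in (True, False):
    label = "GUI box" if gui else "command-line box"
    print(label, "with 256MB fits", max_rosetta_tasks(256, gui), "Rosetta task(s)")
```

Swapping in your own installed memory and overhead guess gives a quick first check before raising the max cpus preference.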

R~~
ID: 37099 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org