Posts by David Ball

1) Message boards : Number crunching : weird upload problem (Message 79211)
Posted 13 Dec 2015 by David Ball
Post:
I finally got the uploads to work. I had upload speed limited to about 45K Bytes per second.

I tried limiting the speed to 5KBps and that got an error as soon as 100% of the file was sent.

Then, I tried completely removing the limit on upload speed and the files uploaded. I have no idea why this fixes the problem but I had 2 stuck uploads and they both uploaded as soon as I told them to retry.

I'm on Windows 7 64 bit with BOINC 7.6.9 and DSL (5 Mb download / 512Kb upload).

David
2) Message boards : Number crunching : weird upload problem (Message 79175)
Posted 9 Dec 2015 by David Ball
Post:
I'm using BOINC 7.6.9 on a Core 2 Quad 6600. I have the GPU disabled for BOINC.

I'm running into a really weird upload problem. Small files seem to upload OK and I can download files of any size from Rosetta. However, when I have to upload a large file, it uploads at high speed (about 45K Bytes per second) and reaches 100%. Then it waits for 20+ seconds at 100%, fails, and schedules a retry. I've seen this before. It can go on doing retries for days and then suddenly everything uploads and I stop having the problem for a few days.

The server status shows everything is fine and other file transfers work. BTW, the current file having trouble is 1.65 MB.
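
For a sense of scale, here is a rough sketch of how long the transfer itself should take at the rate mentioned above. The function name is made up for illustration, and treating 1 MB as 1024 KB is an assumption.

```python
def transfer_seconds(size_mb, rate_kbytes_per_s):
    """Estimate the time to send a file, assuming 1 MB = 1024 KB (assumption)."""
    return size_mb * 1024 / rate_kbytes_per_s


# Figures from the post: a 1.65 MB file sent at about 45 KB/s.
upload_time = transfer_seconds(1.65, 45)
print(f"Transfer time: about {upload_time:.1f} s")
print(f"With the 20 s stall at 100%: at least {upload_time + 20:.1f} s per attempt")
```

So each failed attempt costs roughly a minute of upstream bandwidth before the retry is scheduled.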

I am attached to several projects (WCG, POEM, Einstein, etc) and don't have the problem on any of those.

David
3) Message boards : Number crunching : Minirosetta 3.62-3.65 (Message 79166)
Posted 8 Dec 2015 by David Ball
Post:
I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating.


This is a bit of a bazooka-to-kill-a-housefly type of a solution. I'd encourage anyone not to abort WUs based on something as broad as starting with 'rb_' as 'rb_' work units are part of the Robetta prediction server and serve an incredibly wide range of research projects.

Secondly, I'll note that the two 'failed' WUs you listed above, David, actually DID grant you full credit. (Click on the WU link you posted and scroll to the bottom, and you'll see credit was granted. Even though it doesn't show in the summary of your WUs, it does count towards your total.)

Lastly, looking at some of the WUs you've aborted, most of them (like this one, and this one, and this one, for example) were successfully completed by other users after being aborted on your end :).


Thanks for the info. Basically I was waiting for more information and will start letting them run. BTW, I could be remembering wrong but when I checked the failed WUs shortly after they failed, ISTR that the granted credit on the workunit details was something like "-----".

Anyway, I'll let them process from now on.

-- David
4) Message boards : Number crunching : Minirosetta 3.62-3.65 (Message 79162)
Posted 7 Dec 2015 by David Ball
Post:
Validate errors on rb_11_2* tasks


Both workunits were canceled, how about sending a kill command to the clients in order to avoid wasting resources?


I'm also getting validate errors on some workunits that say they completed OK on the client but get a validate error on the server with the workunit details saying they were cancelled. I've started aborting any WUs that start with "rb_" since yours were in the "rb_11_2*" range and mine were in the "rb_12_06_61173_105565_ab_stage0_t000*" range. I also noticed that some in that range were completing and validating.

workunits:

failed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_04_10_313585_76_0
failed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_207_0

passed: rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_05_10_313585_82_0
passed: rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_47_0
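
A quick sketch of how the four results above break down by batch. Treating everything before the "___" separator as the batch name is an assumption about the naming scheme, and the helper names are made up for illustration.

```python
from collections import defaultdict

# The four work units listed above, with the outcome seen on the server.
RESULTS = {
    "rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_04_10_313585_76_0": "failed",
    "rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_207_0": "failed",
    "rb_12_06_61173_105565_ab_stage0_t000___robetta_IGNORE_THE_REST_05_10_313585_82_0": "passed",
    "rb_12_06_61173_105562_ab_stage0_t000___robetta_IGNORE_THE_REST_07_12_313580_47_0": "passed",
}


def batch_prefix(wu_name):
    """Return the part of the name before the first '___' (assumed batch name)."""
    return wu_name.split("___", 1)[0]


def summarize(results):
    """Count passed and failed work units per batch prefix."""
    summary = defaultdict(lambda: {"passed": 0, "failed": 0})
    for name, status in results.items():
        summary[batch_prefix(name)][status] += 1
    return {prefix: dict(counts) for prefix, counts in summary.items()}


for prefix, counts in summarize(RESULTS).items():
    print(prefix, counts)
```

Both batches show one pass and one failure, so the name alone does not tell you which tasks will validate.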

If there's a way to cancel WUs without them running for many hours on the client then I really wish Rosetta would use it.

Thanks,

David
5) Message boards : Number crunching : Only using 3 out of 4 available cores. (Message 75627)
Posted 19 May 2013 by David Ball
Post:

I've checked the local preferences here which say to use at most 16 cores, and I've set the local preferences to use 100% of the CPU.
Any ideas?


First, I'm fairly certain that the most recent versions (like 7.0.64) of boinc ignore the "use at most 16 cores" field and calculate the number of cores to use from the "use 100% of the CPU" field.

Second, there's the 2 GB memory problem. What percentage of the memory and swap space do you have boinc set to use? Is boinc set to "Leave applications in memory while suspended"?

If it's a memory problem you usually end up with a work unit that says "Waiting for memory" instead of running but there might be some situations where it will simply not start up a 4th work unit. I very rarely have a memory problem but I did see one today on a CPU work unit and it was because a GPU work unit grabbed a lot of memory. Apparently, boinc gives priority to the GPU work unit when it decides to make something wait for memory so it suspended a CPU work unit with the status message "Waiting for memory".

The operating system can change the amount of memory that it uses or start up some maintenance task and that can cause it to use more memory. In fact, the boinc manager is a separate program and it requires 20+ MB of memory. I'm attached to a lot of projects and my boinc manager is currently using 29MB and has used a peak of 54MB of memory with a commit size (IIRC this includes swap space) of 77MB.

I always set my page file to at least twice the size of memory so on that machine I would set it to a 4GB swap space and configure boinc to allow it to use up to 75% of the swap space.
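
The arithmetic behind that suggestion, as a small sketch. The 2 GB of RAM is inferred from the 4GB figure being twice the memory size, and the function name is made up for illustration.

```python
def page_file_plan(ram_gb, multiplier=2, boinc_swap_fraction=0.75):
    """Return (page file size in GB, GB of swap boinc may use)."""
    page_file_gb = ram_gb * multiplier
    return page_file_gb, page_file_gb * boinc_swap_fraction


page_file_gb, boinc_swap_gb = page_file_plan(2)  # 2 GB RAM is an assumption
print(f"Page file: {page_file_gb} GB, boinc may use up to {boinc_swap_gb} GB of it")
```

On that machine boinc would be allowed up to 3 GB of the 4 GB page file.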

Boinc.exe 7.0.x has a problem where it gradually uses more memory the longer it runs. I saw some of the test versions of boinc.exe in the 7.0.5x and 7.0.6x range actually reach over 450MB of memory, but I'm a boinc alpha tester and was running boinc.exe with all kinds of debug options set in cc_config.xml when that happened. With the normal options, boinc.exe doesn't grow much but you have so little memory that even a little growth could affect things.

Basically, I think you need to find a way to get more memory in that machine and make sure you have at least a 4GB page file.

Regards,

David Ball
6) Message boards : Number crunching : Low Credits RAC for 8-Core PC? (Message 64630)
Posted 29 Dec 2009 by David Ball
Post:

When the slowdown occurs, going to Windows Task Manager -> Performance -> Physical Memory often shows just 1 MB of free memory, so SOMETHING is using more memory than expected. However, in such a case, the graph at Windows Task Manager -> Performance usually shows not much more than 50% of the memory in use.


Instead of just letting unused physical memory sit idle, the Operating System dynamically uses extra physical memory for disk cache. The free memory number that you're looking at represents what is left over. It's normal for this number to be rather small, often in the lower double digits. When the OS needs more memory, it just discards some of the disk cache and uses that memory.

The performance graph only indicates how much memory is in use by programs. It excludes the use of idle memory as disk cache. That's why it's showing a lower number.
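
To make the two numbers line up, here is a simplified sketch. The 4096 MB total is purely a made-up figure for illustration (the real machine's memory size isn't given), and the model ignores kernel and other reserved memory.

```python
def implied_disk_cache_mb(total_mb, programs_fraction, free_mb):
    """Memory left for disk cache once program usage and free memory are subtracted."""
    programs_mb = total_mb * programs_fraction
    return total_mb - programs_mb - free_mb


# Hypothetical 4096 MB machine; 50% in use by programs and 1 MB free, as described.
cache_mb = implied_disk_cache_mb(total_mb=4096, programs_fraction=0.5, free_mb=1)
print(f"Roughly {cache_mb:.0f} MB would be holding disk cache")
```

In this simplified picture nearly half of the physical memory is disk cache, which is why "1 MB free" and "about 50% in use" can both be true at once.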

What you're seeing sounds like memory thrashing. Something suddenly wants a lot of memory and the OS is discarding disk cache to give it to programs. Memory allocation is something of an art and can represent a choke point in the system. Not only is the OS having to track all of that memory but it has to coordinate the operation between CPUs/cores and virtual memory tables are being updated which means some cpu caches are being flushed as well.

Also, program startup is when some anti-virus programs scan the program being loaded and scan the files being opened. This can significantly slow program startup.

Depending on what version of Windows you're using, there are some tunable parameters. On Vista SP2, you might want to go into control panel and check the following:

Go to Control Panel -> System and select the item on the left titled "Advanced System Settings". You'll get a popup window titled "System Properties". Select the "Advanced" tab. In the section on performance, press the "Settings" button. This will give you a popup window titled "Performance Options". Select the "Advanced" tab on this window. At the top is a section titled "Processor Scheduling". In this section, you have a choice of 2 options, "Programs" or "Background Services". You want to select "Programs". If you're on an earlier version of Windows, beneath the "Processor Scheduling" section there's another section titled "Memory Usage" which lets you adjust for the best performance of "Programs" or "System Cache". You'd want to select "Programs" in this section if it's present. When you're done with your selection(s), click the Apply button at the bottom. Then hit the various OK buttons to get back out of the nested windows. A reboot may be required. Use at your own risk!

I don't know what options are present in Win7. Microsoft seems to be making this less and less tunable from the user perspective. If you're a programmer and very daring, there are some additional ways to tune this. See:

Blog: Too Much Cache? : http://blogs.msdn.com/ntdebugging/archive/2007/11/27/too-much-cache.aspx

GetSystemFileCacheSize Function: http://msdn.microsoft.com/en-us/library/aa965224%28VS.85%29.aspx

SetSystemFileCacheSize Function: http://msdn.microsoft.com/en-us/library/aa965240%28VS.85%29.aspx
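
For the read-only side of this, a minimal sketch of calling GetSystemFileCacheSize from Python through ctypes. It only queries the current limits and changes nothing. The wrapper name is made up, it returns None on anything other than Windows, and I haven't verified how it behaves on every Windows version, so treat it as a starting point rather than a tested tool.

```python
import ctypes
import sys


def get_system_file_cache_size():
    """Return (min_bytes, max_bytes, flags) on Windows, or None on other systems."""
    if sys.platform != "win32":
        return None

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    func = kernel32.GetSystemFileCacheSize
    # BOOL GetSystemFileCacheSize(PSIZE_T min, PSIZE_T max, PDWORD flags)
    func.argtypes = [
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(wintypes.DWORD),
    ]
    func.restype = wintypes.BOOL

    min_size = ctypes.c_size_t()
    max_size = ctypes.c_size_t()
    flags = wintypes.DWORD()
    if not func(ctypes.byref(min_size), ctypes.byref(max_size), ctypes.byref(flags)):
        raise ctypes.WinError(ctypes.get_last_error())
    return min_size.value, max_size.value, flags.value


print(get_system_file_cache_size())
```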

That's probably more than you ever wanted to know about the system cache and I've barely scratched the surface.




7) Message boards : Number crunching : Minirosetta v1.45 bug thread (Message 57668)
Posted 7 Dec 2008 by David Ball
Post:
Vista 64 bit on stock HP machine with Q6600 CPU and 5 GB memory - no OC
BOINC 6.2.19
App: Mini 1.45

Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hr1958_olange_5387_12341_1

Ran for around 4 hours and exited with
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000

Stack trace is in the result

http://boinc.bakerlab.org/rosetta/result.php?resultid=212406604
8) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57275)
Posted 27 Nov 2008 by David Ball
Post:
http://boinc.bakerlab.org/rosetta/result.php?resultid=210193423

2vik__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--2vik_-_4768_1689_1

Vista home premium 64 bit system with 5 GB of ram. C2 Quad Q6600. Only running BOINC. 2 rosetta tasks were running along with 2 tasks from other projects. Lots of free memory and disk space. BOINC is set to leave tasks in memory. BOINC is not used as a screensaver. BOINC client version is 6.2.19.

The WU above was running but the CPU time (3 hours 50 minutes 2 seconds) and percent complete (about 69%) weren't increasing. I checked with task manager and it WAS using 25% cpu (1 of 4 cores in a C2Q Q6600). I suspended the WU and the status in the BOINC manager changed from running to waiting to run. However, windows task manager showed that it was still running. I had another rosetta task running so I suspended the second WU as well to make sure I had the right one. The second rosetta WU stopped using CPU when it was suspended but remained in memory as it should. BOINC manager now showed NO rosetta tasks running, but windows task manager showed the problem WU was still using all the cpu time it could get. I killed it in task manager and aborted the WU. When looking at the result, I found that I was the second person to get the WU and it had died on the other computer after about 3 minutes.

IIRC, the WU was on the 5th model when this happened.

Hope this helps.
9) Message boards : Number crunching : Minirosetta v1.39 bug thread (Message 56577)
Posted 31 Oct 2008 by David Ball
Post:
I've got multiple machines erroring out WUs immediately with the following stdout:


<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ....srcappspublicboincminirosetta.cc line: 74
called boinc_finish

</stderr_txt>
]]>


They're not able to get any more work. These are reliable, non-overclocked, HP Quads (Q6600) with 3 GB of memory each, running Vista Home Premium SP1 32 bit with all patches current.
10) Message boards : Number crunching : Problems with web site (Message 55971)
Posted 23 Sep 2008 by David Ball
Post:
Stats export hasn't run for 3 or 4 days. Did some server process stop?
11) Message boards : Number crunching : minirosetta v1.24 bug thread (Message 53297)
Posted 23 May 2008 by David Ball
Post:
For some reason a Mini-Rosetta 1.24 WU got the following error:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
ERROR: Option matching -description_file not found in command line top-level context

</stderr_txt>
]]>





Other 1.24 WU on the same machine are ok.


EDIT: This was on a linux machine. I just noticed that the same WU got the same error on an XP Pro machine so it's not being re-issued due to "Too many error results". Could the command line have contained an illegal character?

The XP pro machine said:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Funzione non corretta. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR: Option matching -description_file not found in command line top-level context

</stderr_txt>
]]>
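
To check that the two machines really hit the same error, here is a small sketch that pulls the ERROR lines out of both outputs. The helper names are made up for illustration, and regular expressions are used because the pasted fragment is not a complete XML document.

```python
import re

LINUX_RESULT = """<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
ERROR: Option matching -description_file not found in command line top-level context

</stderr_txt>
]]>"""

XP_RESULT = """<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Funzione non corretta. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR: Option matching -description_file not found in command line top-level context

</stderr_txt>
]]>"""


def extract(tag, text):
    """Return the stripped text between <tag> and </tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def error_lines(text):
    """Return the lines of stderr_txt that start with ERROR."""
    stderr = extract("stderr_txt", text) or ""
    return [line for line in stderr.splitlines() if line.startswith("ERROR")]


print(extract("message", LINUX_RESULT))
print(extract("message", XP_RESULT))
print(error_lines(LINUX_RESULT) == error_lines(XP_RESULT))
```

The message text differs by platform and locale, but the stderr line is identical on both, which points at the work unit rather than either machine.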
12) Message boards : Cafe Rosetta : Word link 9 (Message 42162)
Posted 14 Jun 2007 by David Ball
Post:
pregnant penguin


New Linux distro
13) Message boards : Cafe Rosetta : Word link 9 (Message 42093)
Posted 12 Jun 2007 by David Ball
Post:
Gilbert Gottfried


Annoying
14) Message boards : Cafe Rosetta : Word link 9 (Message 42008)
Posted 10 Jun 2007 by David Ball
Post:
Jenny


Penny
15) Message boards : Cafe Rosetta : Word link 9 (Message 41943)
Posted 8 Jun 2007 by David Ball
Post:
thief
16) Message boards : Cafe Rosetta : Word link 9 (Message 41912)
Posted 6 Jun 2007 by David Ball
Post:
An Affair to Remember (http://www.imdb.com/title/tt0050105/)

hillary clinton had an affair
~ development

shocking!





17) Message boards : Cafe Rosetta : Personal Milestones (Message 41910)
Posted 6 Jun 2007 by David Ball
Post:
Rosetta is my first project to finally get above 100,000 :-)
18) Message boards : Cafe Rosetta : Word link 9 (Message 41908)
Posted 6 Jun 2007 by David Ball
Post:
~ development

shocking!



19) Message boards : Number crunching : Problems with Rosetta version 5.51 (Message 37802)
Posted 14 Mar 2007 by David Ball
Post:
http://boinc.bakerlab.org/rosetta/result.php?resultid=67366016

1esy__BOINC_RNA_ABINITIO-1esy_-_1609_5735_0

CPU time 2009.7416

# random seed: 1591680
# cpu_run_time_pref: 86400
======================================================
DONE :: 1 starting structures built 30 (nstruct) times
This process generated 0 decoys from 0 attempts
======================================================

This has been a very reliable cruncher that's set for 24 hour execution preference.

-- David
20) Message boards : Number crunching : How much has your RAC Dropped Since 12/6/06 (Message 35160)
Posted 20 Jan 2007 by David Ball
Post:
dcdc, now we just need to figure out why my machine gets 32 models crunched in 10638 seconds and yours gets 1 done in 8511 seconds.

Comparing the host of mine that crunched the 32 models with the host of yours.......

DCDC's machine has less than 1/2 the memory of Feet1st, though the cache is the same. Might this have some bearing on the disparity?


Actually, the cache size you're seeing (976.56 KB) is the default reported by some versions of the BOINC software. My P-4 based Celeron 2.3 GHz has 128KB L2, but reports the same value, 976.56KB, for cache size on some versions of the BOINC software.

P-III Celerons never got higher than 256KB cache and most had 128KB cache. If dcdc's 1 GHz Celeron is a Coppermine, it should have a 128KB cache. If it's a Tualatin, it should have 256KB. The FSB is 100 MHz for a 1 GHz on both Coppermine and Tualatin.

A P-4 2.8 should have 512KB cache (Northwood) or 1MB cache (Prescott / 5xx series). The Pentium 6xx series has 2 MB cache, but I think it starts at 3.0 GHz. The FSB should be either 533 MHz (early Northwood) or 800 MHz.
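
As a side calculation on the gap quoted at the top of this post, here is the per-model time for each host, using only the figures given there. The function name is made up for illustration.

```python
def seconds_per_model(total_seconds, models):
    """Average CPU seconds spent per model."""
    return total_seconds / models


fast_host = seconds_per_model(10638, 32)  # 32 models in 10638 seconds
slow_host = seconds_per_model(8511, 1)    # 1 model in 8511 seconds

print(f"Fast host: {fast_host:.1f} s per model")
print(f"Slow host: {slow_host:.1f} s per model")
print(f"Ratio: about {slow_host / fast_host:.1f}x")
```

That is a difference of roughly 25x per model, far more than the cache sizes and clock speeds above would suggest on their own.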

HTH,

-- David





©2024 University of Washington
https://www.bakerlab.org