Problems with Minirosetta v1.54

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 3,752,294
RAC: 1,726
Message 60898 - Posted: 29 Apr 2009, 21:21:28 UTC
Last modified: 29 Apr 2009, 21:23:04 UTC

BOINC v6.6.20 seems to be causing failures due to too many restarts.
https://boinc.bakerlab.org/rosetta/result.php?resultid=247095859
https://boinc.bakerlab.org/rosetta/result.php?resultid=246620233

It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.

I have an HT P4, so 2 CPUs. As the primary task cycles through periods of lower memory usage, BOINC attempts to fire up the second core, only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles into another phase of higher memory usage.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Speedy
Joined: 25 Sep 05
Posts: 161
Credit: 685,798
RAC: 269
Message 60915 - Posted: 30 Apr 2009, 5:24:59 UTC
Last modified: 30 Apr 2009, 5:35:33 UTC

transient
Joined: 30 Sep 06
Posts: 376
Credit: 9,666,665
RAC: 2,241
Message 60921 - Posted: 30 Apr 2009, 12:04:12 UTC

frb_0_8_mike_chosen_cst_hb_t367__IGNORE_THE_REST_1UFBA_2_11071_831_0

Interesting task, IMO. It generated 99 decoys in a bit more than 20 minutes.
WilMar

Joined: 29 Mar 09
Posts: 1
Credit: 1,984
RAC: 0
Message 60922 - Posted: 30 Apr 2009, 13:23:20 UTC

Hello!
Now, with the arrival of version 1.64, I'm having difficulty uploading my last crunched file from version 1.54. I repeatedly get the following messages:
30/04/2009 13:40:31|rosetta@home|Started upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:19||Project communication failed: attempting access to reference site
30/04/2009 13:42:19|rosetta@home|Temporarily failed upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0: connect() failed
30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:21||Internet access OK - project servers may be temporarily down.

According to the server status page, all servers are running. So why is this happening, and how can I fix it?

Martin
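
For anyone who wants to see how long the client plans to wait before retrying, the "Backing off" lines above can be read with a small script. This is only a sketch: the regular expression is inferred from the log lines quoted in this thread, not from any BOINC specification, and `parse_backoff` is a hypothetical helper name.

```python
import re

# Pattern inferred from messages such as
# "Backing off 1 hr 50 min 57 sec on upload of <file>".
# Other BOINC versions may word this differently (assumption).
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(?:(\d+) sec )?on upload of (\S+)"
)


def parse_backoff(line):
    """Return (backoff_seconds, file_name), or None if the line is not a backoff message."""
    match = BACKOFF_RE.search(line)
    if match is None:
        return None
    # Missing units (for example no "hr" part) count as zero.
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups()[:3])
    return hours * 3600 + minutes * 60 + seconds, match.group(4)


line = ("30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of "
        "lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0")
seconds, name = parse_backoff(line)
print(seconds)  # 6657
```

The delay only tells you when the next automatic retry happens; it says nothing about why the upload failed.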
robertmiles
Joined: 16 Jun 08
Posts: 819
Credit: 10,190,161
RAC: 3,605
Message 60923 - Posted: 30 Apr 2009, 14:22:24 UTC - in response to Message 60898.  
Last modified: 30 Apr 2009, 15:18:34 UTC

BOINC v6.6.20 seems to be causing failures due to too many restarts.
https://boinc.bakerlab.org/rosetta/result.php?resultid=247095859
https://boinc.bakerlab.org/rosetta/result.php?resultid=246620233

It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.

I have an HT P4, so 2 CPUs. As the primary task cycles through periods of lower memory usage, BOINC attempts to fire up the second core, only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles into another phase of higher memory usage.


BOINC 6.6.20 is working better for me, so let's compare our machines and settings. My newer machine, with BOINC 6.6.20 under 64-bit Vista SP1 with 8 GB of memory, does not appear to have any memory problems.

My 32-bit Vista SP1 machine, with BOINC 6.2.28, originally came with 1 GB of memory. I found that wasn't enough to even start running two minirosetta workunits at the same time. After enough other problems showed up which I decided were memory problems, I used this site to find out how much memory my motherboard could handle, and then ordered enough to raise it to the 2 GB limit for my motherboard:

http://www.crucial.com/

This was enough to allow it to start running two minirosetta workunits at once on my 2 CPU cores, but still not enough to run them well. Eventually, I raised both the amount of disk space BOINC is allowed to use and the amount of swap space BOINC is allowed to use. It's not clear which of the last two steps was actually needed, or whether both were, but that combination handled the memory problems on that machine.

At least some versions of BOINC do not divide up the available swap space in the most efficient way - they first divide it up into equal shares for each BOINC project you have subscribed to, then those shares into smaller shares for each CPU core. If these smaller shares aren't large enough, it can't preserve any work done since the last checkpoint by simply swapping one into the swap space on the hard drive.
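
To make the arithmetic concrete, here is a rough sketch of the equal-split behaviour described in the paragraph above. It models only what the post claims to have observed, not the actual BOINC scheduler, and the numbers (2048 MB of swap allowed, 4 projects, 2 cores) are made up for illustration.

```python
def swap_share_per_task(swap_allowed_mb, n_projects, n_cores):
    """Equal split as described in the post: first per project, then per CPU core."""
    per_project = swap_allowed_mb / n_projects
    return per_project / n_cores


# Hypothetical figures, not taken from any real machine in this thread.
share = swap_share_per_task(swap_allowed_mb=2048, n_projects=4, n_cores=2)
print(f"{share:.0f} MB per task")  # 256 MB per task
```

If a split like this really happens, a memory-hungry task could run out of its share long before the total allowance is used up, which would fit the symptoms described.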

Does the HT stand for hyperthreaded, a method of appearing to have twice as many CPU cores by giving each one of them an extra set of registers? If so, I've seen messages from other BOINC users saying that this does not increase the total throughput very much. Therefore, until you are able to handle the memory and swapfile problems, you may find it worthwhile to tell BOINC to use only one of the two apparent CPU cores on your machine.
robertmiles
Joined: 16 Jun 08
Posts: 819
Credit: 10,190,161
RAC: 3,605
Message 60926 - Posted: 30 Apr 2009, 16:05:45 UTC

I've recently had two workunits with the lockfile problem:

https://boinc.bakerlab.org/rosetta/result.php?resultid=247527853

https://boinc.bakerlab.org/rosetta/result.php?resultid=247443039

Both were then completed successfully by someone else.

Could minirosetta be modified to check for the lockfile problem sooner, and at least produce more debug information about it instead of wasting CPU time first?
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 3,752,294
RAC: 1,726
Message 60927 - Posted: 30 Apr 2009, 16:19:58 UTC

Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.

Yes, by HT, I meant hyperthreaded. But I believe setting number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time, with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT; then I'd be focusing all the resources on one task at a time, and wouldn't need enough memory for two tasks.

I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.

I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.
Speedy
Joined: 25 Sep 05
Posts: 161
Credit: 685,798
RAC: 269
Message 60929 - Posted: 30 Apr 2009, 21:15:55 UTC - in response to Message 60922.  

Hello!
I'm having difficulty uploading my last crunched file from version 1.54. I repeatedly get

I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
5/1/2009 8:51:54 AM rosetta@home Backing off 2 hr 52 min 32 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0
5/1/2009 8:51:54 AM rosetta@home Started upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:51:56 AM Internet access OK - project servers may be temporarily down.
5/1/2009 8:51:59 AM rosetta@home Finished upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.

Should I abort these transfers? I will wait for further instructions before I do anything to these.
Have a crunching good day!!
robertmiles
Joined: 16 Jun 08
Posts: 819
Credit: 10,190,161
RAC: 3,605
Message 60930 - Posted: 30 Apr 2009, 21:26:28 UTC - in response to Message 60927.  

Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.

Yes, by HT, I meant hyperthreaded. But I believe setting number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time, with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT; then I'd be focusing all the resources on one task at a time, and wouldn't need enough memory for two tasks.

I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.

I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.


Do you have enough free disk space to let BOINC increase the swap space it can use, so that partly completed work can be stored in a way that allows it to resume where it was interrupted? That way, BOINC could simply switch to projects with lower memory requirements while you need more memory for something else; for example, the POEM@HOME project requires less memory, but helps an earlier step in medical research. The suspended tasks then move off the list of tasks currently running, but in a way that lets them move back onto the list later at the point of interruption, instead of being dropped entirely. Such tasks will need to go back to the last checkpoint if you reboot for any reason, though.

If you prefer to run mainly Rosetta@home, just keep the percentage of your CPU time assigned to these lower-memory projects below the percentage of time you actually need to run with lower memory requirements. Also, ensuring that there is enough swap space for all the projects BOINC tries to keep running at once allows you to suspend all BOINC projects at once if you need to run something with even higher requirements. It seems that the default for the amount of swap space BOINC is allowed to use isn't good enough if you attach to enough BOINC projects at once and even one of them is as memory-hungry as Rosetta@home.

http://boinc.fzk.de/poem/

Also, turning off one of a pair of hyperthreaded CPUs shouldn't cause you to get only half the credits, since it then allows you to run the other one at full speed, instead of at barely more than half the full speed. It would, however, give you only half the credits if you actually had two fully independent CPU cores instead of a hyperthreaded pair, or if you use an older version of BOINC that isn't aware that it needs to keep track of CPU core sharing between hyperthreaded pairs.
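
The hyperthreading argument above comes down to simple arithmetic, sketched here. The 0.55 figure is purely an assumed stand-in for "barely more than half the full speed"; nobody in this thread measured it, and real numbers will vary by CPU and workload.

```python
def relative_throughput(tasks, speed_per_task):
    """Total work per unit time, relative to one task running at full speed."""
    return tasks * speed_per_task


# Assumption: with both logical CPUs busy, each task runs at 55% of full speed.
both_logical_cpus = relative_throughput(tasks=2, speed_per_task=0.55)
# Assumption: with only one logical CPU in use, the single task runs at full speed.
one_logical_cpu = relative_throughput(tasks=1, speed_per_task=1.0)

print(f"two tasks: {both_logical_cpus:.2f}, one task: {one_logical_cpu:.2f}")
```

Under those assumed figures, dropping to one task costs far less than half the throughput. If instead the single task still ran at only about half speed, the loss would be close to half, which is the case Feet1st is worried about.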

If your main concern is credits for helping medical research and you happen to have one of the newer graphics boards GPUGRID can use (mainly recent Nvidia cards), consider adding GPUGRID to your list of BOINC projects. It will require switching to the newest version of BOINC I've read about, but then can run workunits on your graphics card instead of on your CPUs. Shouldn't interfere with your regular computer use if it isn't graphics-intensive.

http://www.gpugrid.net/

Also, check if that web site I gave mentions how much memory your machine can handle and what the price is. I spent only about $50 (US) to reach the maximum amount this computer can use, but that was with me installing the new and faster memory myself.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 3987
Credit: 0
RAC: 0
Message 60931 - Posted: 30 Apr 2009, 21:32:24 UTC

Speedy, no don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.
Rosetta Moderator: Mod.Sense
Speedy
Joined: 25 Sep 05
Posts: 161
Credit: 685,798
RAC: 269
Message 60932 - Posted: 30 Apr 2009, 21:40:26 UTC - in response to Message 60931.  

Speedy, no don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.

Thank you. All my results that need to be uploaded have just been uploaded. All is good at my end. Thank you for your continued hard work.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 3,752,294
RAC: 1,726
Message 60933 - Posted: 30 Apr 2009, 21:42:18 UTC

Robert, yes, I've had all the same thoughts, and have plenty of disk allowed to BOINC, and to my swap file. But I'm finding that BOINC isn't smart enough to realize which projects require less memory. It cycles through all the work you currently have for the project it wants to repay debt to, and only after it gets about 2 minutes into every single downloaded Rosetta task will it try to run a 10MB WCG rice task. But if I don't happen to have any WCG work, it isn't smart enough to think about getting some rather than leaving a CPU idle.

I'd love if it were smart enough to run one Rosetta and one rice during the day when I'm using the machine, and then run dual Rosetta tasks at night when my machine is idle and I allow more memory to BOINC. But it's just not smart enough to do so without major manual adjustments.

I could keep a larger cache of work, and therefore help assure I always have something from each project, but then it would cycle through 10 Rosetta tasks, running each for 2 minutes, rather than just 6.

Hopefully with all the discussion on the client work fetch policies, something will shake out that will work better.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 3,752,294
RAC: 1,726
Message 60934 - Posted: 30 Apr 2009, 21:48:23 UTC

Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.
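
The numbers in this post can be checked with a couple of lines. This sketch assumes BOINC applies the 75% setting to the 3397M figure, which is not confirmed anywhere in the thread. Note also that, as far as I know, the commit charge limit in Task Manager is generally physical RAM plus page file, so the page file on its own would be smaller than 3.4GB.

```python
commit_limit_mb = 3397       # limit shown as "commit charge" in Task Manager
commit_used_mb = 1477        # current commit charge for the whole system
boinc_swap_fraction = 0.75   # BOINC swap-space setting quoted in the post

# Assumption: BOINC takes its percentage of the full commit limit.
boinc_swap_allowance_mb = commit_limit_mb * boinc_swap_fraction
system_headroom_mb = commit_limit_mb - commit_used_mb

print(f"BOINC allowance: {boinc_swap_allowance_mb:.0f} MB")  # 2548 MB
print(f"System headroom: {system_headroom_mb} MB")           # 1920 MB
```

So "over 2GB" is about right under that assumption, and the whole system is currently using well under what BOINC alone would be allowed.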
robertmiles
Joined: 16 Jun 08
Posts: 819
Credit: 10,190,161
RAC: 3,605
Message 60935 - Posted: 30 Apr 2009, 21:52:27 UTC - in response to Message 60929.  

Hello!
I'm having difficulty uploading my last crunched file from version 1.54. I repeatedly get

I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
...
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.

Should I abort these transfers? I will wait for further instructions before I do anything to these.


If the transfers aren't too close to their deadlines, I'd just let BOINC keep trying. I've had workunits upload successfully after getting similar messages for days, when router problems kept me from reaching the internet at all for several days. However, it's occasionally useful in such circumstances to first start viewing the Rosetta@home web site to make sure the connection is open, then without closing your browser, start the BOINC manager program if it isn't already running, click on Advanced View if the simplified view appears first, then click on the Transfers tab, click on Advanced, then click on Do network communication in order to make it retry the communications while your connection to the internet is still open.

For some BOINC projects, even returning the results after their deadlines is useful, if you manage to return the results before anyone else does for the same workunit. Not all BOINC projects allow this, though.
robertmiles
Joined: 16 Jun 08
Posts: 819
Credit: 10,190,161
RAC: 3,605
Message 60948 - Posted: 2 May 2009, 9:55:40 UTC - in response to Message 60934.  

Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.


At least some versions of Windows automatically expand the swap space if BOINC is allowed to use a large enough fraction of it to come close enough to the amount already provided. I'd expect the name page file to be what some people call the swap file.

I've set up my machines to start up with the swap file size already set to 30 GB, with no sign of coming close to that limit. That doesn't allow any further expansion, but should keep the disk head from needing to move very far when going from one place in the swap file to another.

I have seen signs that BOINC divides the available swap space equally among either the active slots or all the enabled BOINC projects before deciding how much to give to each workunit, and does not adjust this based on how much memory each BOINC project is expected to require. For that reason, if you have enough free disk space, allowing both the swap file and the disk space for each workunit to be significantly more than the average required is helpful for the applications with high requirements, such as minirosetta.

©2020 University of Washington
https://www.bakerlab.org