Problems and Technical Issues with Rosetta@home

TJ

Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 72391 - Posted: 23 Feb 2012, 16:39:17 UTC

I also have files that won't upload; they go to 16% and that's it.
BOINC does retry, Mod.Sense, and it has been doing that for days, even after the return deadline has passed. At that point I abort the task, as it is no use anymore.
When will this be solved, and why do I have this on an AMD machine with BOINC 6.10.60?
Other Intel machines with the same BOINC version run fine.
Please help!
Greetings,
TJ.
captainkc

Joined: 20 Jul 08
Posts: 2
Credit: 6,060,310
RAC: 3,409
Message 72417 - Posted: 1 Mar 2012, 3:04:10 UTC

I have been trying to run Rosetta v3.22 in the screen saver on a spare IBM T42. When the application runs, it locks up the computer completely if it runs for more than about 1 minute. The screen saver display status line alternates between the name of the job and "app suspended" and makes very little progress. Checking the preferences, I found that it was set to allow 100% CPU use, which would never allow it to be interrupted. I backed that down to 80%, but it still locks up the computer once it has been running for more than about 1 minute. Once it locks up, even <ctrl> <alt> <delete> can't get to the task manager - I have to force the power off and then cold reboot. Does anyone have any suggestions? (I have Rosetta running on another desktop and it doesn't seem to have any problems.)
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72442 - Posted: 4 Mar 2012, 16:54:46 UTC

While I don't have a specific answer or suggestion, I will attempt to clarify a few points. You say that allowing a BOINC application to use 100% of the CPU means it "would never allow it to be interrupted". This is untrue. BOINC runs applications at low priority. Almost any other task your computer is asked to do will have a higher priority and will interrupt the BOINC task.

You also say that the screensaver shows "app suspended". I believe that is what would be shown if BOINC were no longer actively running a task for some reason. i.e. if you were able to look at the BOINC Manager display at the same time, it would show "suspended" rather than "running" for the status of that particular task.

I see you have two hosts. I've never been too hot at translating facts like "IBM T42" into CPU specs as shown by BOINC... but you've got one machine with 2GB of memory and a RAC approaching 500. The other has 512MB of memory and a RAC under 100. So if that second one is the host machine you are talking about, you do not have much memory there. Especially for running Rosetta tasks. Perhaps BOINC suspended the task because it was exceeding any of the many memory limits that can be configured. Perhaps your Windows swap file is not properly extending when it is used more than it used to be.

As a Rosetta task begins, it initializes various memory spaces etc. and gradually grows in memory use to several hundred MB. If your machine had a memory fault or problem with the swap file, that would explain why it doesn't fail immediately.

It doesn't sound like BOINC suspending the task went all that well. It is possible that a newer BOINC version would help resolve any problems it has running with low memory and suspending tasks.

Ultimately, the memory thresholds are there to help assure that such lock-up problems are avoided. I'm not suggesting that opening the floodgates on memory settings is a good idea. But it should be possible for the machine to behave better than it is.
Rosetta Moderator: Mod.Sense
captainkc

Joined: 20 Jul 08
Posts: 2
Credit: 6,060,310
RAC: 3,409
Message 72462 - Posted: 6 Mar 2012, 20:03:32 UTC - in response to Message 72442.  

Thanks for the feedback - I try to keep any machine that I have running Rosetta@Home when idle, so I can provide at least some help for what I consider the valuable research that you are doing.

You are correct: the problem machine is pretty short on resources. Maybe it just isn't up to the task. I think I'll just remove BOINC from there and see if I can get some different hardware up.

Thanks again for the ideas!

muddocktor

Joined: 11 May 07
Posts: 17
Credit: 14,543,886
RAC: 0
Message 72584 - Posted: 24 Mar 2012, 3:23:41 UTC

I've run into a problem with Rosetta and memory usage on one of my machines. I came in from the rig (work offshore) and found one of my crunchers almost totally unresponsive, with Rosetta thrashing the living hell out of the hard drive because of swap file usage. This particular machine is an i7 2600k that I had set to run at 4400 and had 4 gigs of RAM installed in it. Once I was able to get BOINC shut down, the machine started responding normally and the hard drive thrashing went away. I tried turning off swap file usage by setting it to 0, but upon rebooting I still found the system thrashing the hard drive. Luckily, I had an extra set of 2 x 4GB RAM that wasn't in use and just replaced the 2 x 2GB kit that was in it. After rebooting, the computer is now running essentially normally and is responsive. Since I can now see what is going on there, I opened up task manager and see that of the 8 Rosetta processes running, 7 are using 500+ megs of RAM each and the last one is just under 400 megs. In my opinion, Rosetta is getting way too greedy about the memory footprint it is taking on our machines. No wonder this poor system was brought to its knees: it didn't have enough free RAM to do crap.

Is there a way to set a maximum RAM usage requirement in preferences? I know about the one in the BOINC preferences page and have set that already, which didn't make a bit of difference. I was planning to sell that kit of RAM, since it's some very expensive RAM and I could use the money for other things. If Rosetta keeps getting a larger and larger memory footprint, I might be forced to go to some other project that knows how to keep from bringing a modern system to its knees like this.
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 72589 - Posted: 25 Mar 2012, 1:21:42 UTC

What's with all the errors in CASP9 and rb_03_xxxxxx tasks?
If a task name begins with one of these, it errors out on my system.
Bugs, bugs and more bugs.
I thought RALPH was supposed to find these problems and let you know before you release them here?!
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72597 - Posted: 25 Mar 2012, 17:10:12 UTC - in response to Message 72584.  


Is there a way to set a maximum RAM usage requirement in preferences? I know about the one in the BOINC preferences page and have set that already, which didn't make a bit of difference. I was planning to sell that kit of RAM, since it's some very expensive RAM and I could use the money for other things. If Rosetta keeps getting a larger and larger memory footprint, I might be forced to go to some other project that knows how to keep from bringing a modern system to its knees like this.


Difficult to answer. There is certainly a whole pane of memory and disk configuration options, so if you've only found "the one", review the BOINC Manager preferences more closely. You can set thresholds on swap space and on memory usage.

Rather than switching projects, you might consider running Rosetta alongside other projects with lower memory requirements. And keep in mind that the Project Team is always working to rein in the memory usage. It seems to be part of the evolution as new protocols are developed: they begin with large memory footprints and, if a protocol shows promising results, it is honed down from there as they find ways to make it more efficient.
Rosetta Moderator: Mod.Sense
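
The memory arithmetic behind this exchange can be sketched in a few lines of Python. This is only a rough illustration: the 500 MB per-task figure and the 4 GB / 8 GB totals come from the report above, while the 90% limit and the 1 GB reserved for the operating system are assumed values chosen for the example, not BOINC defaults.

```python
def max_concurrent_tasks(total_ram_mb, per_task_mb, ram_limit_pct=100.0, reserved_mb=0):
    """Return how many tasks of a given size fit in the RAM available to them."""
    usable_mb = total_ram_mb * ram_limit_pct / 100.0 - reserved_mb
    if usable_mb <= 0 or per_task_mb <= 0:
        return 0
    return int(usable_mb // per_task_mb)


# Figures reported above: 8 Rosetta processes, 7 near 500 MB and 1 near 400 MB.
observed_mb = 7 * 500 + 400

for total_mb in (4096, 8192):
    # Assumed for illustration: a 90% memory limit and 1 GB kept for the OS.
    fits = max_concurrent_tasks(total_mb, per_task_mb=500, ram_limit_pct=90, reserved_mb=1024)
    print(f"{total_mb} MB RAM: about {fits} tasks of 500 MB fit; observed load is {observed_mb} MB")
```

Under these assumptions, eight tasks of that size do not fit in 4 GB without swapping, but they do fit in 8 GB, which matches the behaviour described before and after the RAM swap.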
JKitterman

Joined: 21 Oct 05
Posts: 11
Credit: 814,463
RAC: 0
Message 72600 - Posted: 26 Mar 2012, 2:24:52 UTC

I am noticing errors on tasks starting with CASP9 also. I had one task that was stuck with 17 hours on it and wasn't actually running. I exited BOINC and restarted it. It appears to be running now.
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 72620 - Posted: 29 Mar 2012, 0:05:49 UTC

Anything that begins with CASP9 or rb03 is being aborted.
Every single friggin time: error, error, error. Same with 2 other wingmen... error, error, error.
Fix the problem already.
You're wasting my cycles with this bugged-out task.
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 72629 - Posted: 30 Mar 2012, 14:05:25 UTC - in response to Message 72620.  

*after shutting down the overclocking things are back to normal*
Anyone know why these new tasks are so sensitive to overclock speeds?
Solar

Joined: 6 Jun 07
Posts: 1
Credit: 15,105,547
RAC: 0
Message 72669 - Posted: 5 Apr 2012, 8:02:53 UTC - in response to Message 72584.  

I have had similar problems in the past. Normally a shutdown and reboot seems to clear the HD thrashing, but not always. I have never been able to find a way to set a limit on RAM usage. I'd be interested to know if anybody has worked out a way to set RAM resources for BOINC that actually works.
peristalsis

Joined: 29 Mar 09
Posts: 8
Credit: 2,421,694
RAC: 0
Message 72671 - Posted: 5 Apr 2012, 13:52:43 UTC

Problems again.
Early last month I had many WUs error out in both Rosetta and Einstein. I tried replacing the RAM (easiest fix) with no joy. So: new CPU, motherboard, 16 gigs of RAM, new hard drive and Win7 64-bit instead of 32-bit. All was golden. Now I am suddenly getting errors again. The Rosetta errors all seem to be:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00801867 read attempt to address 0x4E406652

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00401013 read attempt to address 0x20524820

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7520B9BC

I'll run four instances (quad core) of Rosetta/Einstein. Should I only run three instances? My days of being conversant/interested in figuring out this computer nonsense are over with. Could be some peccadillo of Windows for all I know. Any help would be appreciated. My computer ID: 1495120
Thanks in advance..
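
When a host produces many such records, it can help to tally them by exception code before deciding whether the cause is likely hardware, memory limits, or the task itself. The following is a minimal sketch that assumes the task output has been saved as plain text; the function name `summarize_exceptions` is hypothetical and the sample is copied from the records above.

```python
import re
from collections import Counter

# Matches lines such as:
#   Reason: Access Violation (0xc0000005) at address 0x00801867 ...
REASON_RE = re.compile(r"Reason:\s*(?P<name>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)")


def summarize_exceptions(stderr_text):
    """Count each (reason, exception code) pair found in the task output."""
    counts = Counter()
    for match in REASON_RE.finditer(stderr_text):
        counts[(match.group("name"), match.group("code").lower())] += 1
    return counts


SAMPLE = """\
Reason: Access Violation (0xc0000005) at address 0x00801867 read attempt to address 0x4E406652
Reason: Access Violation (0xc0000005) at address 0x00401013 read attempt to address 0x20524820
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7520B9BC
"""

summary = summarize_exceptions(SAMPLE)
for (reason, code), count in summary.most_common():
    # 0xc0000005 is the Windows access-violation status code;
    # 0xe06d7363 is the code the Microsoft C++ runtime uses for C++ exceptions.
    print(f"{count} x {reason} [{code}]")
```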
peristalsis

Joined: 29 Mar 09
Posts: 8
Credit: 2,421,694
RAC: 0
Message 72673 - Posted: 5 Apr 2012, 16:06:08 UTC - in response to Message 72671.  

The below is from my error file:
05-Apr-2012 10:39:56 [rosetta@home] Output file heterodimer_design_24_pose_BA_perturbation_JOBID_SAVE_ALL_OUT_46205_479_0_0 for task heterodimer_design_24_pose_BA_perturbation_JOBID_SAVE_ALL_OUT_46205_479_0 absent

11:39:15 [rosetta@home] Computation for task heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0 finished
05-Apr-2012 11:39:15 [rosetta@home] Output file heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0_0 for task heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0 absent

So a file is not being written? Am I missing something obvious? Should I just go back to XP or linux? Somewhat serious about that. Just depressed (small grin)..
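
To see which tasks are affected, the "absent" messages can be pulled out of the BOINC event log with a short script. This is a sketch that assumes the log lines have the same layout as the two quoted above; the helper name `tasks_with_missing_output` is hypothetical. An absent output file usually means the task ended before it could write its result, which would be consistent with the unhandled exceptions reported earlier, though the log alone does not prove that.

```python
import re

# Layout assumed from the quoted lines:
#   <timestamp> [<project>] Output file <file> for task <task> absent
ABSENT_RE = re.compile(
    r"\[(?P<project>[^\]]+)\]\s+Output file (?P<file>\S+) for task (?P<task>\S+) absent"
)


def tasks_with_missing_output(log_text):
    """Return the names of tasks whose output file was reported absent."""
    return [match.group("task") for match in ABSENT_RE.finditer(log_text)]


LOG = """\
05-Apr-2012 10:39:56 [rosetta@home] Output file heterodimer_design_24_pose_BA_perturbation_JOBID_SAVE_ALL_OUT_46205_479_0_0 for task heterodimer_design_24_pose_BA_perturbation_JOBID_SAVE_ALL_OUT_46205_479_0 absent
11:39:15 [rosetta@home] Computation for task heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0 finished
05-Apr-2012 11:39:15 [rosetta@home] Output file heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0_0 for task heterodimer_design_22_pose_CC_perturbation_JOBID_SAVE_ALL_OUT_46203_614_0 absent
"""

missing = tasks_with_missing_output(LOG)
for task in missing:
    print(task)
```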
JayPi
Joined: 8 Dec 06
Posts: 2
Credit: 10,000,558
RAC: 0
Message 72679 - Posted: 5 Apr 2012, 20:29:30 UTC

Since the first workunits with version 3.26 arrived, my successfully finished 3.24 workunits can't be uploaded to the server. The download of the new workunits is very slow, which is OK: 90Mb per computer must be downloaded from Rosetta@home, and it works. But while uploading my 3.24 workunits, the following message appears:
05.04.2012 21:37:43 | rosetta@home | Temporarily failed upload of heterodimer_design_2_pose_C_abinitio_SAVE_ALL_OUT_45729_4564_0_0: connect() failed
05.04.2012 21:37:48 | | Internet access OK - project servers may be temporarily down.

What is happening?

Regards
JayPi
Rocco Moretti

Send message
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 72680 - Posted: 5 Apr 2012, 20:35:09 UTC - in response to Message 72679.  

Internet access OK - project servers may be temporarily down.

What is happening?


Whenever we release a new version of the application, the servers get hammered with everyone automatically downloading it. This results in intermittent failures in uploading.

Don't worry, BOINC should keep trying to send the results, and they will get through shortly once the load on the servers finally goes down.
JayPi
Joined: 8 Dec 06
Posts: 2
Credit: 10,000,558
RAC: 0
Message 72682 - Posted: 5 Apr 2012, 20:39:47 UTC

Thanks for the quick response. I'll wait a day to see whether the workunits are sent.

Regards
JayPi
BarryAZ

Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 39
Message 72684 - Posted: 5 Apr 2012, 22:23:39 UTC - in response to Message 72680.  

Ah -- thanks for that -- and just when I was shifting processing over to Rosetta. Fair enough, I'll revert back to Malaria and Einstein for a couple of days and then hope that my uploads will process.



Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72689 - Posted: 6 Apr 2012, 14:45:36 UTC

I may be mistaken here, but based on external observation over the years, I believe the servers actually keep up fairly well with the demand caused by all the downloads of a new program version. I would estimate that the main period of server stress lasts only 6-12 hours. And if your machine has already downloaded the new version, the load caused by you requesting more work is minimal. Even if you need the new version, the load from one machine requesting a copy of it is minimal; it is the cumulative effect of thousands of machines doing so at the same time that causes the strain.

In short, BOINC will handle the situation rather nicely for you, retrying over time and getting the files exchanged in either direction. The retries begin where they left off, so a large file does not have to start over.

Dr. Baker has requested as much help as we can muster to help study new code they intend to use for CASP this Summer. Please don't let a few automatically recovered retries dissuade you.
Rosetta Moderator: Mod.Sense
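
The idea of retrying over time and resuming where the transfer left off can be illustrated with a small simulation. This is a sketch of the general technique (resumable transfer with exponential backoff), not BOINC's actual implementation: `FlakyServer`, `upload_with_resume`, the chunk size and the delay values are all hypothetical and chosen only for the example.

```python
class FlakyServer:
    """Hypothetical stand-in for an upload server that drops some connections."""

    def __init__(self, fail_every=3):
        self.received = bytearray()
        self.calls = 0
        self.fail_every = fail_every

    def offset(self):
        """Number of bytes the server already holds for this file."""
        return len(self.received)

    def send(self, chunk):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise ConnectionError("connect() failed")
        self.received.extend(chunk)


def upload_with_resume(data, server, chunk_size=4, max_attempts=20,
                       base_delay=1.0, max_delay=3600.0, sleep=lambda seconds: None):
    """Upload data in chunks, resuming from the server's offset after each failure."""
    delays = []
    attempt = 0
    while server.offset() < len(data):
        start = server.offset()  # resume point: never resend bytes already stored
        try:
            server.send(data[start:start + chunk_size])
            attempt = 0  # a success resets the backoff
        except ConnectionError:
            attempt += 1
            if attempt > max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delays.append(delay)
            sleep(delay)  # no-op by default so the example runs instantly
    return delays


server = FlakyServer(fail_every=3)
payload = b"0123456789abcdef"
delays = upload_with_resume(payload, server)
print(bytes(server.received) == payload, server.calls, delays)
```

Because each retry starts from the offset the server reports, a dropped connection costs only one extra attempt rather than a restart of the whole file.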
Robert Gammon
Joined: 9 Nov 07
Posts: 14
Credit: 969,848
RAC: 0
Message 72697 - Posted: 8 Apr 2012, 11:30:22 UTC - in response to Message 72680.  

Internet access OK - project servers may be temporarily down.

What is happening?


Whenever we release a new version of the application, the servers get hammered with everyone automatically downloading it. This results in intermittent failures in uploading.

Don't worry, Boinc should keep trying to send the results, and it will get through shortly when the load on the servers finally goes down.


This does not appear to be my current issue. Last night and this morning, I turned in 6-8 completed workunits. My list of tasks shows the reporting of the uploads and credit granted. However, total credit and average credit figures are still stuck at yesterday afternoon figures.
Mike

Joined: 30 Apr 09
Posts: 44
Credit: 65,019
RAC: 0
Message 72749 - Posted: 13 Apr 2012, 17:39:17 UTC
Last modified: 13 Apr 2012, 17:39:37 UTC

I am not getting a response in the thread I posted 10 days ago, so I thought I would post it here... It is a technical issue...

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5611



©2024 University of Washington
https://www.bakerlab.org