Rosetta Client Routinely Hangs

Questions and Answers : Unix/Linux : Rosetta Client Routinely Hangs

Raster

Joined: 1 Apr 06
Posts: 2
Credit: 213,377
RAC: 0
Message 47397 - Posted: 4 Oct 2007, 18:35:52 UTC

I'm running BOINC 5.8.16 for i686-pc-linux-gnu. I've noticed that the Rosetta clients routinely hang at the beginning, in the middle, or near the end of processing a work unit. The hardware is a dual-processor machine with hyper-threaded Intel Xeons, 2 cores per CPU, so the OS sees essentially 8 CPUs. I have it configured to process up to 8 work units simultaneously; however, after running for about 2 weeks, work units start to get stuck. This morning it was down to 1 active process.
I've tried the beta version of BOINC with the same results. Do I just need to restart the BOINC client every week or so?

thanks,
Mike Morgan
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,242,482
RAC: 389
Message 47475 - Posted: 7 Oct 2007, 3:23:00 UTC - in response to Message 47397.  

First, check your memory usage and settings. With 8 cores, chances are R@H will run out of memory before all 8 processes can run. BOINC will (sometimes) leave tasks in memory but not "run" them if there isn't enough free memory according to your settings. R@H uses between 120 and 360 MB of memory for each task.

Second, memory contention would explain what several users and I have experienced with R@H. After a WU is suspended (it doesn't matter why or how), it will not resume properly even though BOINC thinks it's running. Eventually it crashes, or you have to kill the PID, resulting in a compute error.

A workaround is to limit the number of CPUs, set the memory limit settings high, and set "leave suspended applications in memory" to yes. This will not solve the problem, but it makes the bug occur less often.
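
To put rough numbers on the first point, here is a minimal Python sketch of the combined memory demand. It only multiplies the 120 to 360 MB per-task range quoted above by the task count; the function name is made up for illustration, and real usage will vary per work unit.

```python
# Per-task memory range quoted in this post, in MB.
TASK_MB_MIN = 120
TASK_MB_MAX = 360


def total_demand_mb(n_tasks):
    """Return (lowest, highest) combined memory demand in MB for n_tasks."""
    return n_tasks * TASK_MB_MIN, n_tasks * TASK_MB_MAX


low, high = total_demand_mb(8)
print(f"8 tasks need roughly {low} to {high} MB in total")
```

For 8 tasks that works out to somewhere between 960 and 2880 MB, so the upper end is well beyond what a machine with a couple of GB can give to BOINC.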

Does that describe your problem?

Read more here:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3481
Raster

Joined: 1 Apr 06
Posts: 2
Credit: 213,377
RAC: 0
Message 47513 - Posted: 8 Oct 2007, 14:49:39 UTC - in response to Message 47475.  

Thanks for the suggestions! I think they may minimize the problem.

I think I'm hitting condition #2 because of condition #1.

My machine has 2 GB of RAM but is configured to use 50% of available memory "while computer is in use" and 90% if otherwise idle. Rosetta is my only project on this machine, so I suspect that when the machine was idle it was able to start 8 clients, but when it detected that the machine was busy it suspended some WUs. Since my preferences were set to not keep the processes in memory, I guess I hit the bug you described in condition #2. After several weeks of suspend/resume failures, I was left with just one WU being processed.
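
As a rough check on that suspicion, here is a small Python sketch of how many tasks fit under each memory limit. It assumes "2G" means 2048 MB, uses the 120 to 360 MB per-task range from the earlier reply, and caps the result at 8 CPUs. The function name is hypothetical, and BOINC's real scheduling logic is certainly more involved than this.

```python
def tasks_that_fit(ram_mb, limit_pct, task_mb, n_cpus):
    """Tasks that fit in limit_pct percent of RAM, capped at the CPU count."""
    # Integer arithmetic: floor((ram_mb * limit_pct / 100) / task_mb)
    fit = (ram_mb * limit_pct) // (100 * task_mb)
    return min(n_cpus, fit)


RAM_MB = 2048  # assumption: "2G" taken as 2048 MB
N_CPUS = 8

for label, pct in (("in use", 50), ("idle", 90)):
    worst = tasks_that_fit(RAM_MB, pct, 360, N_CPUS)  # largest tasks
    best = tasks_that_fit(RAM_MB, pct, 120, N_CPUS)   # smallest tasks
    print(f"{label} ({pct}%): {worst} to {best} tasks")
```

With the largest tasks, about 5 fit while idle but only 2 while the machine is in use, so going from idle to busy would force several WUs to be suspended.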

Is this a BOINC defect? The discussions at the link you provided seem to suggest it's a R@H problem.

thanks,
Mike
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,242,482
RAC: 389
Message 47520 - Posted: 8 Oct 2007, 17:55:35 UTC - in response to Message 47513.  

Raster wrote: "Is this a BOINC defect? The discussions at the link you provided seem to suggest it's a R@H problem."


It is definitely a Rosetta problem. Other projects I've run, including CPDN, Einstein, Seasonal Attribution, and SETI, do not have this problem uninitializing.

I've posted about it several times, but the admins ignore me and other users who confirm my diagnosis. I don't think they care much about their Linux application.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 47523 - Posted: 8 Oct 2007, 21:12:19 UTC

DJ, I wouldn't say you are ignored. I've been thinking we should start a thread in the Number Crunching forum about Linux task preemption problems. And the tips and things to check that you've added here would be a good start to putting helpful information about the topic in a single place. I'd also like to collect the symptoms all in one place, and if one reverses your recommendations, they can see configurations that seem to expose the problem.

I should also point out that just because other projects do not see the problem does not mean the Rosetta team will be able to make the fix. I believe BOINC is in charge of ending and tearing the thread down when preempted tasks are not retained in memory. So, if the thread isn't ending when BOINC wants it to, it may prove to be a Linux bug in the end.

Please start a thread to discuss this in detail. List specifics about Linux versions and memory preferences, and perhaps we can get some input on whether there are flavors of Linux that don't have the problem, or whether other factors determine when people see it occur and when they don't.
Rosetta Moderator: Mod.Sense
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,242,482
RAC: 389
Message 47532 - Posted: 9 Oct 2007, 3:47:43 UTC - in response to Message 47523.  

Done.

I may post more links and do more research once I finish this paper for school on Thursday.
