Preemption Failures on Linux

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47531 - Posted: 9 Oct 2007, 3:46:32 UTC

Original thread:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3640

Per the suggestion of Mod.Sense, I'm posting this thread for all Linux/Unix users who are having problems with Rosetta to read, share problems, and perhaps share solutions.

When the general preference "leave applications in memory" is set to no, BOINC versions 5.8.16 and earlier try to UNINITIALIZE (preempt) the Rosetta application. However, the Rosetta app just stops and stays hung in memory, and BOINC can no longer resume the task. The only solution is to shut down BOINC or kill the process. Upon restarting the task, the Rosetta task will sometimes stop with "compute error".

Steps known to reproduce the problem:
1. Set preference "leave applications in memory" = no.
2. Update preferences.
3. Start a R@H task on Linux.
4. At any point in the computation, stop the task.
5. Wait a minute.
6. Start the task again.
7. If the task does not consume CPU, it is frozen in memory and must be killed.
8. Upon restart, it will fail with a compute error (typically error 193).
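
For anyone who wants to check the "does not consume CPU" step without watching top, here is a rough Python sketch. It is not part of BOINC or Rosetta; the function names are made up for illustration, and it assumes a Linux /proc filesystem. It samples the process's accumulated CPU time twice and reports whether it moved. Only use it on a task that BOINC shows as running, since a properly suspended task will also show no CPU use.

```python
import time


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' before reading the numeric fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 of the full line), so
    # utime (field 14) and stime (field 15) are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def looks_frozen(pid, wait_seconds=60):
    """Return True if the process used no CPU time during wait_seconds."""
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks_from_stat(f.read())
    time.sleep(wait_seconds)
    with open(path) as f:
        after = cpu_ticks_from_stat(f.read())
    return after == before
```

Calling looks_frozen() with the PID of the main Rosetta process (taken from top or ps) a minute after resuming the task should return False on a healthy task.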

Common elements of systems observed:
* Linux 2.6 kernel or higher
* All are AMD Athlon, Sempron, or Opteron CPUs
* Preference "leave applications in memory" = no
* Preference "switch between applications" less than the target runtime preference setting for R@H.

Reference Threads:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3640
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1795
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=309
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3481
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=51
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1201
...shall I go on?

[AF] VeauX
Joined: 26 Mar 07
Posts: 8
Credit: 27,098
RAC: 0
Message 47533 - Posted: 9 Oct 2007, 4:53:13 UTC
Last modified: 9 Oct 2007, 4:57:24 UTC

Just to share a bit: it is not just a problem with AMD processors. I'm running BOINC on a C2D E4300 @ stock with Ubuntu 7.04 x86_64, and I have that problem too.

Setting "leave applications in memory" = yes does not solve the problem. I'm running 100% Rosetta, so the "switch between applications" option does not come into consideration here (I detached all the other projects).

This occurs with all Rosetta apps, 5.69 and 5.80.

Btw, this occurs ONLY with Rosetta. The other projects I'm running are not freezing (QMC, Riesel, WCG, Cosmology ...)

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47541 - Posted: 9 Oct 2007, 13:55:25 UTC

Since I can't edit my own posts, let me add this. It appears the R@H application does not uninitialize unless it has written its first checkpoint. Otherwise, it just suspends. R@H appears to crash only when uninitializing after writing this first checkpoint.

ziegenmelker
Joined: 26 Jul 06
Posts: 10
Credit: 26,061
RAC: 0
Message 47548 - Posted: 9 Oct 2007, 16:45:57 UTC

My system: AMD 64 X2 4400+, 2GB RAM, 64-Bit OpenSuse 10.2

General Preferences(default):

Use at most 75% of page file (swap space)
Use at most 50% of memory when computer is in use
Use at most 90% of memory when computer is idle

At least if I don't start any VMWare sessions, there should be plenty of RAM left.
But anyway, I have raised the 'mem in use' value to 60%.

But the host belongs to location 'home', where no preferences were defined yet. Now I've defined the same preferences for 'home' as for 'default'.

At the moment, there is just one Rosetta task running and one ABC task.
Top says there are 5 Rosetta processes, and the amounts of reserved and shared memory they are using change simultaneously (e.g. 153/75MB).

Gnome system monitor sees only one Rosetta parent process, one child and this child has another two children. Today I didn't start any memory expensive apps like VMWare. Both cores run at almost 100%, 1.3GB of memory is used, swap is 528kB.

I'm proud: today this host delivered its first and only successfully finished WU! But this WU still has a couple of SIGSEGVs in the log file. Maybe the NX bit is a reason for this?

cu,
Michael

[AF] VeauX
Joined: 26 Mar 07
Posts: 8
Credit: 27,098
RAC: 0
Message 47559 - Posted: 9 Oct 2007, 18:52:06 UTC


DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47571 - Posted: 9 Oct 2007, 21:02:51 UTC - in response to Message 47559.  

Is there a developer taking care of this issue?


That's why I tried to amalgamate all the Linux posts with the compute error problem. I'm hoping to light a fire under their Linux development team/person. We're a solid 10% of the Rosetta crunchers and growing.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 47576 - Posted: 9 Oct 2007, 23:01:43 UTC

I had several hopes in asking DJ to start this thread.

First, there seem to be an increasing number of Linux users, and an increasing number of problems (although it is possible I'm just seeing more posts about problems). Either way, I wanted to get some information together to help those affected.

Second, in order to resolve a problem, it is necessary to have a clear explanation of it, and each report is a bit different and many are a little vague about exactly what occurred. So I was hoping that by getting several people who have seen odd preemption and run-away threads into the same thread, we might find some commonality in when the problem occurs and when it does not.

Third, since it is not a problem specific to any given Rosetta release, I wanted to discuss it outside of the "problems with..." thread, so that it can remain dedicated to other recent and pending problems.

Please be sure to describe which Linux distro you are using, and what update level is installed.

Does anyone have a machine or Linux version where they do not seem to encounter these problems?

Does the version of BOINC have any impact on seeing problems occur more or less frequently?

When a task "freezes", does it still use CPU? What status does BOINC show on the task?

Has a thorough memory test been run on the machine? What % of memory is BOINC allowed to use (see your General Preferences)? And how much memory is on the machine?

Is the machine overclocked?

Are some task names more likely to see a problem than others?
Rosetta Moderator: Mod.Sense

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 47580 - Posted: 10 Oct 2007, 0:49:28 UTC - in response to Message 47576.  
Last modified: 10 Oct 2007, 0:50:25 UTC


Does anyone have a machine or Linux version where they do not seem to encounter these problems?

Does the version of BOINC have any impact on seeing problems occur more or less frequently?

When a task "freezes", does it still use CPU? What status does BOINC show on the task?

Has a thorough memory test been run on the machine? What % of memory is BOINC allowed to use (see your General Preferences)? And how much memory is on the machine?

Is the machine overclocked?

Are some task names more likely to see a problem than others?



I've been running BOINC for quite a while on my home Linux box. It's running Fedora 5 with BOINC 5.10.8. It is not overclocked. The MB is an ASUS A7V8X with an AMD XP 2600+ processor. It has 1 GB of memory, and preferences are for 50% when in use and 90% when idle. This machine also runs my website.

The darn thing is pretty much rock solid. I rarely have any aborts, freezes, or other problems with RAH or any other project. Right now I'm crunching SIMAP, but when that project does not have work, it crunches RAH. (SIMAP typically has 2-5 days' worth of work at the beginning of each month, but this month is unusual in that there was a ton of new work.) So, if you check my results (computers are visible, I believe) you might not see too many results listed.

The only time I've seen problems is when my cable connection or router has a problem. Then BOINC can't phone home and results (not just RAH) seem to abort at times.

Charlie
-Charlie

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47583 - Posted: 10 Oct 2007, 2:22:55 UTC - in response to Message 47580.  
Last modified: 10 Oct 2007, 2:23:04 UTC

I believe if a Rosetta WU runs completely uninterrupted (from start to finish) on a machine it will finish successfully (same rate as Windows at least).

The trouble comes when it's preempted by BOINC settings, user activity, or some other reason. I'm glad to hear that your system is running well.

ziegenmelker
Joined: 26 Jul 06
Posts: 10
Credit: 26,061
RAC: 0
Message 47588 - Posted: 10 Oct 2007, 5:04:44 UTC

Some minutes ago I started my box and again Boinc Manager shows just one active RAH task, but top and gnome-system-monitor show 4(!) RAH processes.
When I start this host, all apps that were active during shutdown get loaded again. There is no swap used and only 835MB memory is used so far.
RAH is allowed to use more than 1GB of RAM, so atm there is no reason to preempt in any way.
Btw. Boinc is started through init i.e. before user apps are loaded.

I got two more successfully finished WUs, but again their logs show one SIGABRT and one SIGSEGV.

All three successful WUs so far are of the type 'BENCH_051207_ABRELAX_SAVE_ALL_OUT...', and I abort all 'sen15__RESAMPLE' WUs because they always crash.

Are all these problems maybe mostly/only related to 64-Bit systems?
Are all affected Kernel versions 2.6.8 and later? (-> NX-Bit)

My Boinc version is 5.10.8 (64-Bit beta), system is always fully patched.

Next I will start a VMWare session with 768MB RAM to see what happens. ;-)

cu,
Michael

[AF] VeauX
Joined: 26 Mar 07
Posts: 8
Credit: 27,098
RAC: 0
Message 47589 - Posted: 10 Oct 2007, 5:06:26 UTC

OK, then. To add more info:

When a task freezes, BOINC shows the RUNNING status, but the CPU time does not increase. This is noticeable because I have CPU Freq installed (a CPU frequency monitor), and when it happens, the core running the WU just goes back to the idle state and throttles down (the SpeedStep thing).

I'm not running an overclocked machine for your info. For the WU name that stops, those are all the ones that appear as aborted under my results list on this host:
https://boinc.bakerlab.org/rosetta/results.php?hostid=581242

I tried BOINC version 5.10.8 and 5.10.21 the result is similar.


Today I changed my Rosetta prefs: I reduced the WU time to 1 hour, set "leave applications in memory" to yes, and forced the BOINC manager to always run and always be connected. I've been running for 14 hours without flaws.

More to come tomorrow...wait a minute... BOINC did an automatic CPU benchmark and stopped the 2 WUs that were running... those 2 WUs are not starting again... I'll post a screenshot.

ziegenmelker
Joined: 26 Jul 06
Posts: 10
Credit: 26,061
RAC: 0
Message 47590 - Posted: 10 Oct 2007, 5:41:24 UTC

Another crash:

2007-10-10 07:18:09 [ABC@home] Starting abc_wu_1389383660000_400000_0
2007-10-10 07:18:09 [ABC@home] Starting task abc_wu_1389383660000_400000_0 using abc-finder version 103
2007-10-10 07:18:10 [rosetta@home] Deferring communication for 1 min 0 sec
2007-10-10 07:18:10 [rosetta@home] Reason: Unrecoverable error for result BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 (process exited with code 193 (0xc1, -63))
2007-10-10 07:18:10 [rosetta@home] Computation for task BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 finished
2007-10-10 07:18:10 [rosetta@home] Output file BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0_0 for task BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 absent
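
If you have a long BOINC log and want to pull out every failure like the one above, a small Python sketch such as the following may help. The regular expression is modelled only on the "Unrecoverable error" line quoted here, so it is an assumption that other BOINC versions word the message the same way; the function name is made up for illustration.

```python
import re

# Pattern modelled on the log line quoted above; other BOINC versions
# may word this message differently (assumption).
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<task>\S+) "
    r"\(process exited with code (?P<code>\d+)"
)


def failed_tasks(log_lines):
    """Return (task_name, exit_code) pairs for every failure line found."""
    failures = []
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            failures.append((match.group("task"), int(match.group("code"))))
    return failures
```

As a side note, the three numbers in the message are the same byte written three ways: 193 is 0xc1 in hex, and 193 - 256 = -63 when read as a signed 8-bit value.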

cu,
Michael

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47596 - Posted: 10 Oct 2007, 14:08:36 UTC - in response to Message 47589.  

That sounds very typical of what I'm seeing, both in my WUs and others'. Thanks for the info.

Jean-David Beyer
Joined: 2 Nov 05
Posts: 172
Credit: 5,654,074
RAC: 3,327
Message 47607 - Posted: 10 Oct 2007, 21:43:53 UTC

I, too, have problems with Rosetta, and Mod.Sense suggested I post here.

First of all, I have two 3.06 GHz Hyperthreaded Xeon processors, 8 GBytes RAM, and a dedicated disk partition of 16 GBytes for BOINC stuff. I run Red Hat Enterprise Linux 5 with (at the moment) kernel 2.6.18-8.1.14.el5PAE. Swap space is set up as two partitions of 2 GBytes each. My network connection is Verizon FiOS with 20 Megabit/second download speed and 5 Megabit/second upload speed. I usually get these speeds.

As far as BOINC is concerned, I say to leave applications in memory when they are suspended, use all 4 processors, switch applications every 60 minutes, and use at most 100% of the processor time. Use at most 15.75 GBytes of disk space, leave at least .1 GByte free, and use at most 98% available disk space. Use at most 75% of swap space, 75% of memory when computer is in use and 90% of memory when computer is not in use. (Computer is turned on about 100% of the time.)

For Rosetta, I say give the application 11.11% resource share.

The original problem I thought I had was that a Rosetta application ran up about 4 hours of time, which is about what I expect, and indicated that there was -- left to complete the work unit, progress 100%, and so on. But it continued running for a long time (about 30 hours), really running up CPU time. I.e., it did not freeze. As Mod.Sense suggested, I stopped the BOINC client by running /etc/rc.d/init.d/boinc stop. This shut down everything _except_ the Rosetta applications. I nominally had one running, but pstree revealed (in part) something like this (before shutting down):

─su───boinc─┬─hadam3_4.07_i68─┬─hadam3_um_4.07_───{hadam3_um_4.07_}
│           │                 └─2*[{hadam3_4.07_i68}]
│           ├─2*[hadcm3trans_5.4─┬─hadcm3transum_5───{hadcm3transum_5}]
│           │                    └─2*[{hadcm3trans_5.4}]]
│           ├─malariacontrol_───{malariacontrol_}
│           ├─rosetta_beta_5.───rosetta_beta_5.───2*[rosetta_beta_5.]
│           ├─setiathome-5.27───setiathome-5.27───2*[setiathome-5.27]
│           └─wcg_faah_autodo───3*[{wcg_faah_autodo}]

(This is one that, as far as I know, is actually running correctly.)

Now this time, when everything seems to be running correctly, stopping the BOINC client causes all the BOINC applications to stop too.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 47739 - Posted: 14 Oct 2007, 22:40:50 UTC

Recently I relocated two of my diskless nodes (both dual core AMD running Linux). This involved powering down and restarting each of them twice. For each node shutdown I first shut down BOINC (using the standard "kill" signal).

The four Rosetta WUs being run ended up with complaints about segmentation violations and other such things in their stdout.txt files. They did restart and finish successfully in the end. In the past I have seen such improperly exiting WUs hang on exit (including when the watchdog shut them down) or fail to restart properly.

https://boinc.bakerlab.org/rosetta/result.php?resultid=112226513
https://boinc.bakerlab.org/rosetta/result.php?resultid=112103685
https://boinc.bakerlab.org/rosetta/result.php?resultid=112232914
https://boinc.bakerlab.org/rosetta/result.php?resultid=112162647

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 47797 - Posted: 17 Oct 2007, 3:12:55 UTC

I've been plagued by this problem (task running but accumulating no CPU time) since May of last year. The problem has occurred on a variety of machines (HT, non-HT, Core2, etc.), most running CentOS (4.1 through 4.5) but also Fedora 7 and Mac OS X 10.3 and 10.4. Some of the machines crunch only Rosetta, some Rosetta+RALPH.

I moved most of my Linux boxes and all of the Macs to Einstein a year ago because I didn't have time to deal with the stalls on a few dozen machines. I moved the Linux boxes back to Rosetta a few months ago and didn't seem to have as many problems. But the last couple of weeks have been bad, so I'm about to switch them back to Einstein. I'll post some info about the latest problems in a following message.

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 47798 - Posted: 17 Oct 2007, 3:15:46 UTC - in response to Message 47797.  
Last modified: 17 Oct 2007, 3:31:25 UTC

Here is the latest example of the common type of stall:

boincmgr shows two tasks running but only one is accumulating CPU time.
top command shows 50% idle.

Intel Core2 6300 1.86GHz
1 GB RAM
Linux version 2.6.22.9-91.fc7

"leave applications in memory" = yes
memory preferences are the default (75/50/90)

BOINC 5.8.16
STM0082_BOINC_MFR_ABRELAX_PICKED_2175_14024
rosetta_beta version 580

The machine is a lightly used server.
Rosetta and RALPH are running on it.

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 47799 - Posted: 17 Oct 2007, 3:30:47 UTC - in response to Message 47798.  

And here is a different type of stall that has occurred six or seven times (total occurrences across 20+ Linux boxes) in the past week:

The task does not terminate when it reaches the "Target CPU run time" which I have set to 8 hours. For a WU that had been running for 87 hours, the client_state.xml file contained:

<active_task>
<project_master_url>https://boinc.bakerlab.org/rosetta/</project_master_url>
<result_name>STM0082_BOINC_MFR_ABRELAX_PICKED_2175_1341_0</result_name>
<active_task_state>1</active_task_state>
<app_version_num>580</app_version_num>
<slot>1</slot>
<scheduler_state>2</scheduler_state>
<checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>313292.725724</current_cpu_time>
<swap_size>73195520.000000</swap_size>
<working_set_size>61992960.000000</working_set_size>
<working_set_size_smoothed>61992960.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>

This instance was on a CentOS 4.5 (Linux 2.6.9-55.0.9.ELsmp) box crunching only Rosetta.
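
With 20+ boxes it may be easier to scan client_state.xml for this state than to open each file by hand. Below is a rough Python sketch, assuming the file (or the excerpt you feed it) parses as well-formed XML, which is not guaranteed for every BOINC version. The function name and the target_cpu_seconds parameter are made up for illustration. It flags tasks whose fraction_done has reached 1.0 while current_cpu_time is past the target run time.

```python
import xml.etree.ElementTree as ET


def overrun_tasks(xml_text, target_cpu_seconds):
    """Return (result_name, cpu_hours) for tasks past their target run time.

    A task is reported when fraction_done has reached 1.0 and its
    current_cpu_time exceeds target_cpu_seconds.
    """
    root = ET.fromstring(xml_text)
    stuck = []
    # iter() finds <active_task> at any depth, including the root itself.
    for task in root.iter("active_task"):
        cpu = float(task.findtext("current_cpu_time", default="0"))
        done = float(task.findtext("fraction_done", default="0"))
        if done >= 1.0 and cpu > target_cpu_seconds:
            stuck.append((task.findtext("result_name"), cpu / 3600.0))
    return stuck
```

For the excerpt above, with the 8-hour target (8 * 3600 = 28800 seconds), the one task would be flagged: 313292.725724 seconds of CPU time is about 87 hours, which matches the 87 hours mentioned.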

davidtaille
Joined: 7 Oct 07
Posts: 2
Credit: 1,470,348
RAC: 0
Message 47841 - Posted: 18 Oct 2007, 21:01:37 UTC

Hi all,
I too experience stalls with rosetta.

I have a dedicated hosted server that has spent almost all its time on BOINC since Oct 7.

$ uname -a
Linux xx 2.6.18-8.1.14.el5 #1 SMP Thu Sep 27 18:58:54 EDT 2007 i686 i686 i386 GNU/Linux
It's a CentOS 5.0

The machine is a piece of hardware made by www.dedibox.fr to run their hosting business.
CPU : Centaur VIA Esther processor 2000MHz stepping 09 ; NX bit activated.
Motherboard : VIA-made, chipsets VIA CN700 & VT8237
RAM : 1GB

$ ./boinc -version
5.4.11 i686-pc-linux-gnu

BOINC settings : all default.

I attached to 3 projects : seti, lhc, rosetta.
Seti & lhc have no problems, and the machine has successfully processed 410 credits' worth of WUs since Oct 7 (2007).
Rosetta : only one WU got to successful completion ; all others have been reported in error, and I aborted 3 of them.

The typical situation when boinc thinks rosetta is at work is as follows :
----------------------------------------------
$top
top - 22:23:36 up 13 days, 18 min, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1019144k total, 915868k used, 103276k free, 352632k buffers
Swap: 1044216k total, 0k used, 1044216k free, 342212k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
....
9866 seti_at_ 15 0 5836 4124 1828 S 0.0 0.4 0:27.99 boinc
20128 seti_at_ 35 19 28564 22m 12 S 0.0 2.3 0:00.03 rosetta_beta_5.
20304 seti_at_ 35 19 27540 22m 12 S 0.0 2.3 0:00.03 rosetta_beta_5.
...
29430 seti_at_ 34 19 186m 99m 44 S 0.0 10.0 60:01.75 rosetta_beta_5.
29431 seti_at_ 34 19 186m 99m 44 S 0.0 10.0 0:00.00 rosetta_beta_5.
29432 seti_at_ 35 19 186m 99m 44 S 0.0 10.0 0:00.00 rosetta_beta_5.
29433 seti_at_ 34 19 186m 99m 44 S 0.0 10.0 0:00.00 rosetta_beta_5.
----------------------------------------------
Then I can see in the BOINC log that it has gone through hourly suspend/resume cycles for tens of hours, but the process times for Rosetta never change... while WUs for seti or lhc complete!
The stderr files in the R@H slot show nasty things:
----------------------------------------
$ cat stderr.txt
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 1890337
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d7cf2f]
[0x8d77d1c]
[0xb7f0b420]
[0x8e024c7]
[0x8dd2715]
[0x8dd2481]
[0x83f9b4c]
[0x8de873f]
[0x8d79987]
[0x8d7afa5]
[0x8d73f9d]
[0x8e1487a]

Exiting...
-----------------------
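
To see at a glance which signals turn up in a slot's stderr.txt, something like this Python sketch could be used. It assumes every crash is reported on its own line in the same "SIGNAME: description" form as the SIGSEGV line above; whether SIGABRT and others are printed that way is an assumption. The function name is made up for illustration.

```python
import re

# Matches lines such as "SIGSEGV: segmentation violation" as quoted above.
# Assumption: other signals are logged in the same "SIGNAME:" form.
SIGNAL_RE = re.compile(r"^(SIG[A-Z]+):", re.MULTILINE)


def count_signals(stderr_text):
    """Return a dict mapping each signal name to how often it was logged."""
    counts = {}
    for name in SIGNAL_RE.findall(stderr_text):
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Feed it the contents of stderr.txt from the slot directory; an empty dict means no signal lines in that form were found.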
Rosetta applications are rosetta_5.69_i686-pc-linux-gnu and rosetta_beta_5.80_i686-pc-linux-gnu.

Hope this helps.

David

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,239,073
RAC: 373
Message 47843 - Posted: 19 Oct 2007, 0:08:13 UTC

Thank you to everyone who has posted their application failures in Linux/Unix.

If you're not sure what info to include, see Mod.Sense's post (about the 7th post down from the top).

KEEP THE REPORTS COMING! :)



©2024 University of Washington
https://www.bakerlab.org