Rosetta WU's stall on RedHat Fedora

Questions and Answers : Unix/Linux : Rosetta WU's stall on RedHat Fedora

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 2848 - Posted: 10 Nov 2005, 21:28:08 UTC

Most of my systems run Windows 2K or XP, and handle Rosetta without a problem.

However, I have one box running RedHat Fedora Core 4, on a Dell Dimension 4700. This system occasionally gets a WU that just stalls. Boincmgr shows it as running, but if I use ps to look at it, there are three copies of rosetta_4.78_i686-pc-linux-gnu running, all of which are sleeping: the STAT column shows S for all of them, and the CPU time (both in ps and boincmgr) does not increase.

Any ideas? Anything else I can do to help with this?
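For anyone who wants to check for this symptom without watching ps by hand, here is a minimal sketch. It assumes a Linux /proc filesystem; the helper names (parse_stat, looks_stalled, watch) and the 60-second interval are made up for illustration and are not part of BOINC or Rosetta. It samples a process twice and reports whether it stayed in state S with no increase in CPU time.

```python
import time

def parse_stat(stat_line):
    """Return (state, cpu_ticks) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces,
    # so split after the LAST ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    state = rest[0]                              # field 3: R, S, D, T, Z, ...
    cpu_ticks = int(rest[11]) + int(rest[12])    # fields 14 + 15: utime + stime
    return state, cpu_ticks

def looks_stalled(before, after):
    """True if the process was sleeping at both samples and used no CPU between them."""
    (state_a, ticks_a), (state_b, ticks_b) = before, after
    return state_a == "S" and state_b == "S" and ticks_a == ticks_b

def sample(pid):
    """Read and parse /proc/<pid>/stat for one process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())

def watch(pid, interval=60):
    """Sample a process twice, `interval` seconds apart (hypothetical default)."""
    first = sample(pid)
    time.sleep(interval)
    return looks_stalled(first, sample(pid))
```

Calling watch() on each rosetta PID would only flag a candidate: a sleeping process with flat CPU time can also be one that BOINC has deliberately suspended, so the result still needs to be compared with what the manager reports.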
Andrew
Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 2861 - Posted: 11 Nov 2005, 0:12:36 UTC

What boinc client version are you using on the linux side?
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 2943 - Posted: 12 Nov 2005, 4:19:42 UTC - in response to Message 2861.  

[quote]What boinc client version are you using on the linux side?[/quote]

5.2.6
tmr0
Joined: 9 Nov 05
Posts: 2
Credit: 3,012,594
RAC: 0
Message 3561 - Posted: 18 Nov 2005, 1:59:26 UTC - in response to Message 2943.  

[quote][quote]What boinc client version are you using on the linux side?[/quote]
5.2.6[/quote]

I've had this happen twice now with FC4; ps aux shows boinc as S+.

Here is the output. At 16:42 I started poking around in Boinc Manager:

2005-11-15 19:26:29 [rosetta@home] Requesting 646 seconds of new work, and reporting 1 results
2005-11-15 19:26:34 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2005-11-15 19:26:34 [rosetta@home] Message from server: No work sent
2005-11-15 19:26:34 [rosetta@home] Message from server: No work sent
2005-11-15 19:26:34 [rosetta@home] Message from server: (there was work for other platforms)
2005-11-15 19:26:34 [rosetta@home] Message from server: (there was work for other platforms)
2005-11-15 19:26:34 [rosetta@home] No work from project
2005-11-15 19:26:34 [rosetta@home] No work from project
2005-11-15 20:59:26 [---] request_reschedule_cpus: process exited
2005-11-15 20:59:26 [rosetta@home] Computation for result 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0 finished
2005-11-15 20:59:29 [rosetta@home] Started upload of 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0_0
2005-11-15 20:59:34 [rosetta@home] Finished upload of 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0_0
2005-11-15 20:59:34 [rosetta@home] Throughput 14967 bytes/sec
2005-11-16 16:42:29 [---] request_reschedule_cpus: project op
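
The interesting part of a log like this is the silence between the last upload and the next entry. A minimal sketch for finding it, assuming each client log line begins with a "YYYY-MM-DD HH:MM:SS" timestamp as in the output above (the function names are made up for illustration):

```python
from datetime import datetime

def parse_timestamp(line):
    """Return the leading 'YYYY-MM-DD HH:MM:SS' timestamp, or None if the line has none."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Blank lines and wrapped continuation lines carry no timestamp.
        return None

def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    stamped = [(parse_timestamp(line), line) for line in lines]
    stamped = [(t, line) for t, line in stamped if t is not None]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best
```

Run over the lines above, this picks out the jump from 20:59:34 on 15 Nov to 16:42:29 on 16 Nov, a little under 20 hours with nothing logged. A long gap by itself does not prove a stall, but it shows where to start looking.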

tmr0
Joined: 9 Nov 05
Posts: 2
Credit: 3,012,594
RAC: 0
Message 3564 - Posted: 18 Nov 2005, 2:04:44 UTC - in response to Message 3561.  

My Boinc version is 5.2.7.
In Boinc Manager the status is shown as "Communication Deferred"
paul.g
Joined: 5 Jan 06
Posts: 1
Credit: 174,359
RAC: 0
Message 8459 - Posted: 6 Jan 2006, 1:42:13 UTC

I don't know if this helps, but I'm running boinc ver. 5.2.13 and rosetta ver. 4.80. I've also got two other projects on the go, seti@home and predictor@home, both of which have not given me any problems to date (for about 4 weeks now).

I'm running on a somewhat upgraded slackware 7.0 which is now closer to 10.0.

I just subscribed to rosetta and my first WU got to about 20%, then also stalled. The boinc manager shows rosetta is running, and the processes are in memory; however, they are all sleeping and are not actually taking any CPU time.

I didn't try exiting out of the manager altogether and restarting it. Instead, I aborted the WU; it got a new one and started processing that one. If it stalls again, I'll try restarting the manager, but I somehow doubt it will solve the problem.

I'm wondering if there is either a problem with unloading the app from memory before the WU is complete when it schedules another project, or some kind of race condition within the rosetta app.
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 19402 - Posted: 28 Jun 2006, 10:53:59 UTC
Last modified: 28 Jun 2006, 11:30:40 UTC

I have had this problem (occasionally) for ages, on Red Hat EL 4.1. I am now using BCC 5.4.9, attached to 7 projects; Rosetta's share is ~20%.

The symptoms are that the Rosetta app seems to be running, but its CPU time does not increase. Recently I've noticed that when this happens, even BCC is not able to run benchmarks. IIRC, previously, if BCC was able to switch to another app, that app got 0 CPU cycles (Rosetta was consuming them all) and did not increment its time. Usually the only way to overcome this problem was to manually restart BCC.

This time I've made a few snapshots and suspended Rosetta in memory so that I can test something, if anyone is interested.

About the snapshots: sorry for the inconvenience, they read from the bottom up and are pretty wide; the [ code ] formatting seems to be ugly, and it's even worse without it.

[edit] I'll read the Rosetta 5.22 problems reporting thread through...[/edit]

Peter

[size=10]
[pepo@orc ~]$ top -U pepo
top - 21:21:52 up 65 days,  7:27,  1 user,  load average: 0.89, 0.42, 0.20
Tasks: 102 total,   1 running, 101 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3% us,  0.7% sy, 99.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:    515744k total,   510592k used,     5152k free,   139564k buffers
Swap:  1048568k total,      180k used,  1048388k free,   103004k cached
[/size]
[size=9]
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31420 pepo   17   0  2212  980  764 R  0.7  0.2   0:00.60 top -U pepo
14747 pepo   16   0  5684 4104 1680 S  0.0  0.8   3693:33 ./boinc -allow_remote_gui_rpc
22383 pepo   34  19 78472  51m 4792 S  0.0 10.3  59:40.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o.....
22384 pepo   34  19 78472  51m 4792 S  0.0 10.3   0:00.02 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o.....
22385 pepo   35  19 78472  51m 4792 S  0.0 10.3   0:00.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22386 pepo   34  19 78472  51m 4792 S  0.0 10.3   0:00.14 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
31382 pepo   34  19 30004 6672 2056 S  0.0  1.3   0:00.01 albert_4.58_i686-pc-linux-gnu @conf --IFO=LHO --Freq=287.667431193 --FreqBand=0.00114678899083 --startTime=7951
31390 pepo   16   0  8128 2212 1808 S  0.0  0.4   0:00.10 sshd: pepo@pts/1
31391 pepo   15   0  6308 1456 1176 S  0.0  0.3   0:00.35 -bash
[/size]
[size=10]
XXXX|Rosetta manually suspended (but still in memory), Einstein now increments time.
Einstein@Home|22 Jun 2006 21:21:15|Resuming task r1_0287.5__542_S4R2a_3 using albert version 458
rosetta@home|22 Jun 2006 21:21:15|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 21:21:15|Rescheduling CPU: project suspended by user

[pepo@orc ~]$ top -U pepo
top - 21:20:37 up 65 days,  7:25,  1 user,  load average: 0.16, 0.25, 0.13
Tasks: 101 total,   2 running,  99 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5% us,  0.1% sy, 86.8% ni, 12.5% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:    515744k total,   512748k used,     2996k free,   140196k buffers
Swap:  1048568k total,      180k used,  1048388k free,   101104k cached
[/size]
[size=9]
  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
31420 pepo   16   0  2208  900  696 R  3.8  0.2   0:00.04 top -U pepo
14747 pepo   16   0  5684 4104 1680 S  1.9  0.8   3693:32 ./boinc -allow_remote_gui_rpc
22383 pepo   34  19 78472  51m 4792 S  0.0 10.3  59:40.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22384 pepo   34  19 78472  51m 4792 S  0.0 10.3   0:00.02 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22385 pepo   35  19 78472  51m 4792 S  0.0 10.3   0:00.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22386 pepo   34  19 78472  51m 4792 S  0.0 10.3   0:00.14 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
31382 pepo   34  19 30004 6672 2056 S  0.0  1.3   0:00.01 albert_4.58_i686-pc-linux-gnu @conf --IFO=LHO --Freq=287.667431193 --FreqBand=0.00114678899083 --startTime=7951
31390 pepo   15   0  8128 2212 1808 S  0.0  0.4   0:00.03 sshd: pepo@pts/1
31391 pepo   15   0  6308 1456 1176 S  0.0  0.3   0:00.35 -bash
[/size]


Project|Time|Messages
XXXX|Rosetta unsuspended, now active & "running", but not incrementing time, is consuming CPU cycles
Einstein@Home|22 Jun 2006 21:16:48|Pausing task r1_0287.5__542_S4R2a_3 (removed from memory)
--- |22 Jun 2006 21:16:48|Rescheduling CPU: project resumed by user
XXXX|Rosetta manually suspended (but still in memory), Einstein now increments time.
Einstein@Home|22 Jun 2006 21:14:25|Starting task r1_0287.5__542_S4R2a_3 using albert version 458
rosetta@home|22 Jun 2006 21:14:25|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 21:14:25|Rescheduling CPU: project suspended by user
---|22 Jun 2006 3:00:00|Resuming network activity
---|22 Jun 2006 2:00:00|Suspending network activity - time of day
XXXX |22 Jun 2006 21:12:00|still the same, Rosetta seems to be active, but idle and not incrementing time.
---|22 Jun 2006 3:00:00|Resuming network activity
---|22 Jun 2006 2:00:00|Suspending network activity - time of day
---|22 Jun 2006 0:53:34|Process 26964 not found
---|22 Jun 2006 0:53:34|Resuming network activity
---|22 Jun 2006 0:53:34|Rescheduling CPU: Resuming computation
---|22 Jun 2006 0:53:34|Resuming computation
---|22 Jun 2006 0:53:33|Failed to stop applications; aborting CPU benchmarks
---|22 Jun 2006 0:53:25|Running CPU benchmarks
---|22 Jun 2006 0:53:23|Suspending network activity - running CPU benchmarks
rosetta@home|22 Jun 2006 0:53:23|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 0:53:23|Suspending computation - running CPU benchmarks
Einstein@Home|21 Jun 2006 18:35:25|Scheduler request succeeded
Einstein@Home|21 Jun 2006 18:35:20|Reporting 1 tasks
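
Since the messages above are pasted newest first, it can help to put them back in chronological order before reading them. A minimal sketch, assuming the "Project|Time|Messages" layout shown above and English month abbreviations (the function names are made up for illustration); hand-written note lines without a timestamp are skipped:

```python
from datetime import datetime

def parse_message(line):
    """Split 'project|time|message' into (project, datetime, message), else None."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None          # e.g. an annotation with no time field
    project, stamp, message = parts
    try:
        when = datetime.strptime(stamp, "%d %b %Y %H:%M:%S")
    except ValueError:
        return None          # e.g. the 'Project|Time|Messages' header row
    return project, when, message

def chronological(lines):
    """Return parsed messages oldest first.

    The input is newest first, so it is reversed before the stable sort;
    entries that share a timestamp then keep their original logging order.
    """
    parsed = [m for m in map(parse_message, lines) if m is not None]
    return sorted(reversed(parsed), key=lambda m: m[1])
```

Read oldest first, the benchmark attempt at 0:53 becomes easier to follow: the Rosetta task is paused at 0:53:23, the benchmarks start at 0:53:25, and "Failed to stop applications" follows at 0:53:33.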

Bob Bowen
Joined: 22 Mar 06
Posts: 14
Credit: 6,140,706
RAC: 186
Message 20268 - Posted: 15 Jul 2006, 21:29:22 UTC

The way I keep this to a minimum on my Fedora boxes is to leave my Rosetta preference "Target CPU run time" set to "not selected", which defaults to 3 hours. That way the ones that hang get cleaned out faster.




©2024 University of Washington
https://www.bakerlab.org