Posts by TCU Computer Science

1) Message boards : Number crunching : Rosetta work units freeze up on Mac OS 10.5.2 dual core machine (Message 52174)
Posted 29 Mar 2008 by TCU Computer Science
Post:
There is a WU "running" for the other project & the Rosetta WU that had been "waiting to run" last night while the Rosetta WU that's now completed was "running" is now listed as "running, high priority" but has the exact same CPU time and percentage of progress that it has shown for about the last 18 hours.


I run BOINC on 60+ computers. Most of them are Linux, some are Windows XP, and a few are Mac OS. I have seen the problem that you describe occur frequently on Mac OS, less often on Linux and rarely on the Windows machines. It is probably a bug in the BOINC Manager. But in my experience Rosetta seems to trigger the problem much more frequently than other projects that I run, more often on Mac OS than other platforms, and more often when there are multiple projects running on the machine.
2) Message boards : Number crunching : Preemption Failures on Linux (Message 48184)
Posted 31 Oct 2007 by TCU Computer Science
Post:
Here is a RALPH WU that has been stuck for 8 hours:

<active_task>
<project_master_url>http://ralph.bakerlab.org/</project_master_url>
<result_name>2dlb__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2dlb_-native__2480_49_1</result_name>
<app_version_num>581</app_version_num>
<slot>1</slot>
<checkpoint_cpu_time>14379.437992</checkpoint_cpu_time>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>14421.431608</current_cpu_time>
<swap_size>410079232.000000</swap_size>
<working_set_size>231837696.000000</working_set_size>
<working_set_size_smoothed>231837696.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>


Intel P4 3.0 GHz HT
RAM: 512 MB
CentOS 4.5 (Linux 2.6.9-55.0.9.ELsmp)
BOINC 5.10.21
preferences are the defaults
running Rosetta & RALPH
3) Message boards : Number crunching : Preemption Failures on Linux (Message 47799)
Posted 17 Oct 2007 by TCU Computer Science
Post:
And here is a different type of stall that has occurred six or seven times (total occurrences across 20+ Linux boxes) in the past week:

The task does not terminate when it reaches the "Target CPU run time" which I have set to 8 hours. For a WU that had been running for 87 hours, the client_state.xml file contained:

<active_task>
<project_master_url>http://boinc.bakerlab.org/rosetta/</project_master_url>
<result_name>STM0082_BOINC_MFR_ABRELAX_PICKED_2175_1341_0</result_name>
<active_task_state>1</active_task_state>
<app_version_num>580</app_version_num>
<slot>1</slot>
<scheduler_state>2</scheduler_state>
<checkpoint_cpu_time>0.000000</checkpoint_cpu_time>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>313292.725724</current_cpu_time>
<swap_size>73195520.000000</swap_size>
<working_set_size>61992960.000000</working_set_size>
<working_set_size_smoothed>61992960.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>

This instance was on a CentOS 4.5 (Linux 2.6.9-55.0.9.ELsmp) box crunching only Rosetta.
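
A check for this kind of overrun could be scripted against the `<active_task>` entries. What follows is only a rough Python sketch, not anything that ships with BOINC: the function name `find_overruns` is made up for illustration, the fragment is trimmed to the fields the check needs, and it expects a bare fragment like the one above rather than a complete client_state.xml file.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the post above (only the fields the check needs).
SAMPLE = """
<active_task>
<result_name>STM0082_BOINC_MFR_ABRELAX_PICKED_2175_1341_0</result_name>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>313292.725724</current_cpu_time>
</active_task>
"""

def find_overruns(xml_text, target_hours=8.0):
    """Return (result_name, cpu_hours) for tasks past the target run time."""
    # Wrap in a dummy root so several <active_task> blocks parse as one document.
    root = ET.fromstring("<root>" + xml_text + "</root>")
    overruns = []
    for task in root.iter("active_task"):
        name = task.findtext("result_name", default="?").strip()
        cpu_hours = float(task.findtext("current_cpu_time", default="0")) / 3600.0
        if cpu_hours > target_hours:
            overruns.append((name, cpu_hours))
    return overruns

for name, hours in find_overruns(SAMPLE):
    print(f"{name}: {hours:.1f} CPU hours (target 8)")
```

For the task above, 313292.725724 seconds works out to roughly 87 hours, which is the figure quoted in the post.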
4) Message boards : Number crunching : Preemption Failures on Linux (Message 47798)
Posted 17 Oct 2007 by TCU Computer Science
Post:
Here is the latest example of the common type of stall:

boincmgr shows two tasks running but only one is accumulating CPU time.
top command shows 50% idle.

Intel Core2 6300 1.86GHz
1 GB RAM
Linux version 2.6.22.9-91.fc7

"leave applications in memory" = yes
memory preferences are the default (75/50/90)

BOINC 5.8.16
STM0082_BOINC_MFR_ABRELAX_PICKED_2175_14024
rosetta_beta version 580

The machine is a lightly used server.
Rosetta and RALPH are running on it.
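
The symptom (a task listed as running whose CPU time never moves) can be spotted by comparing two readings of each task's CPU time taken some minutes apart. This is a minimal Python sketch of that comparison only; the function name `stalled_tasks`, the task names, and the sample times are all invented, and how the readings are collected is left open.

```python
def stalled_tasks(before, after, min_gain=1.0):
    """Names of tasks in both samples whose CPU time grew by less than min_gain seconds."""
    return sorted(
        name for name in before
        if name in after and after[name] - before[name] < min_gain
    )

# Hypothetical samples taken a few minutes apart (names and times are invented).
sample_1 = {"task_a": 5000.0, "task_b": 7200.0}
sample_2 = {"task_a": 5290.5, "task_b": 7200.0}

print(stalled_tasks(sample_1, sample_2))
```

A task that is legitimately suspended or waiting would also show no gain, so the comparison only means something for tasks the client reports as running.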
5) Message boards : Number crunching : Preemption Failures on Linux (Message 47797)
Posted 17 Oct 2007 by TCU Computer Science
Post:
I've been plagued by this problem (task running but accumulating no CPU time) since May of last year. The problem has occurred on a variety of machines (HT, non-HT, Core2, etc.), most running CentOS (4.1 through 4.5) but also Fedora 7 and Mac OS X 10.3 and 10.4. Some of the machines crunch only Rosetta, some Rosetta+RALPH.

I moved most of my Linux boxes and all of the Macs to Einstein a year ago because I didn't have time to deal with the stalls on a few dozen machines. I moved the Linux boxes back to Rosetta a few months ago and didn't seem to have as many problems. But the last couple of weeks have been bad, so I'm about to switch them back to Einstein. I'll post some info about the latest problems in a following message.
6) Message boards : Number crunching : Process in Limbo (Message 44847)
Posted 9 Aug 2007 by TCU Computer Science
Post:
I noticed that one of my Linux boxes still had a WU that was a few days old. boincmgr shows it 30% complete and "Waiting to run".

Checking the log file, I see only two messages for that WU:

=====================================
2007-08-06 09:27:01 [rosetta@home] Starting 1utx__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1utx_-foldanddock__1878_1543_0
2007-08-06 09:27:01 [rosetta@home] Starting task 1utx__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1utx_-foldanddock__1878_1543_0 using rosetta_beta version 573
=====================================


The client_state.xml file has this:
=====================================
<active_task>
<project_master_url>http://boinc.bakerlab.org/rosetta/</project_master_url>
<result_name>1utx__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1utx_-foldanddock__1878_1543_0</result_name>
<active_task_state>8</active_task_state>
<app_version_num>573</app_version_num>
<slot>1</slot>
<scheduler_state>1</scheduler_state>
<checkpoint_cpu_time>8846.305156</checkpoint_cpu_time>
<fraction_done>0.307476</fraction_done>
<current_cpu_time>8857.308483</current_cpu_time>
<swap_size>252416000.000000</swap_size>
<working_set_size>152301568.000000</working_set_size>
<working_set_size_smoothed>152301568.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>
=====================================

That is an <active_task_state> that I had not seen previously. Searching for the meaning of it, I found a thread that listed the defines for <active_task_state> which included
#define PROCESS_IN_LIMBO 8

That sounds like the catch-all "an unknown error has occurred". Any suggestions before I stop and restart boinc and/or the computer?
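
For anyone who wants to pull the state out of the file automatically, here is a small Python sketch. It is an illustration under assumptions, not BOINC code: the lookup table holds only the one define quoted above (I have not filled in the other values), the function name `summarize` is made up, and the fragment is trimmed to the fields used.

```python
import xml.etree.ElementTree as ET

# Only the value quoted in this post is listed; other codes are left unnamed.
STATE_NAMES = {8: "PROCESS_IN_LIMBO"}

FRAGMENT = """
<active_task>
<active_task_state>8</active_task_state>
<scheduler_state>1</scheduler_state>
<checkpoint_cpu_time>8846.305156</checkpoint_cpu_time>
<current_cpu_time>8857.308483</current_cpu_time>
</active_task>
"""

def summarize(fragment):
    """Return (state name, CPU seconds accumulated since the last checkpoint)."""
    task = ET.fromstring(fragment.strip())
    state = int(task.findtext("active_task_state"))
    since_checkpoint = (float(task.findtext("current_cpu_time"))
                        - float(task.findtext("checkpoint_cpu_time")))
    return STATE_NAMES.get(state, f"unknown ({state})"), since_checkpoint

name, gap = summarize(FRAGMENT)
print(f"state={name}, {gap:.1f} s of CPU time since last checkpoint")
```

With the values above, the task had accumulated about 11 seconds of CPU time past its last checkpoint when it landed in this state.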
7) Message boards : Number crunching : Problems with Rosetta version 5.68 and 5.70 (Message 42807)
Posted 29 Jun 2007 by TCU Computer Science
Post:
With 5.68 on a Debian Linux 2.6x machine, most Rosetta tasks will run to about 84% completion, and then hang. The "CPU time" does not increment for the task, and the task will remain hung for as long as it is the executable task.


I had a problem similar to this a year ago. On CentOS (and Mac OS X) the Rosetta task would hang. boincmgr showed Rosetta running but the accumulated CPU time did not increase. Usually, the Rosetta task would remain in the process list after I stopped boinc. I had to manually kill the Rosetta task. Then when I restarted boinc, the Rosetta task would resume accumulating CPU time. I switched most of my Linux boxes and all of my Macs to Einstein because I didn't have time to check those machines for hung tasks.

Recently, I tried switching back to Rosetta. On machines with CentOS 4.1 (kernel 2.6.9-11) Rosetta still hung but machines with CentOS 4.5 (kernel 2.6.9-55) have not experienced that problem. So, all of my Linux boxes have been updated and most switched back to Rosetta.

I still have the problem on Mac OS X.
8) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 25402)
Posted 29 Aug 2006 by TCU Computer Science
Post:

Before I pass it along (I apologize for not having a clue), what is CentOS?


It is a free version of a "prominent North American Enterprise Linux vendor" product.
9) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 25392)
Posted 29 Aug 2006 by TCU Computer Science
Post:
Yet another stuck work unit:

NMR_1i27_CASPR_1_1i27__1_id_model_10IGNORE_THE_REST_idl_1218_1949
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=30244351

It has been running for 27 hours but it is stuck at 2 hours accumulated CPU time. BOINC says Rosetta is running but the CPU usage for the process is at 0% and the load average is 0.

As with the previous stuck WU, I rebooted the machine and the WU immediately terminated with the error

ERROR:: Exit at: initialize.cc line:1618

This machine is running CentOS 4.3
10) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 22898)
Posted 18 Aug 2006 by TCU Computer Science
Post:
Does the problem on your Linux machines go away if you switch to "leave app in memory"?



I leave the app in memory and still have the problem, but it doesn't occur very often. Boinc Manager says Rosetta is running, but the CPU Time is not increasing and the CPU is idle. Usually when the problem occurs and I stop the Boinc process, the Rosetta process remains in the process list. I have to manually kill it or reboot the machine.

I have seen the problem on Mac OS X and Linux (CentOS 4.3) but never under Windows. It has occurred on machines running only Rosetta and machines running Rosetta and Einstein but it only affects the Rosetta app.
11) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 22414)
Posted 13 Aug 2006 by TCU Computer Science
Post:
Another stuck work unit:

2f21X_BOINC_ABRELAX_SAVE_ALL_OUT_BARCODE__1075_31308
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=27912763

It has been running for more than 2 days but accumulated only 6 hours of CPU time and is stuck at 74.4%.

When I stopped BOINC, the Rosetta processes did not terminate.
I rebooted the machine and the WU immediately terminated with the error

ERROR:: Exit at: initialize.cc line:1618
12) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 21653)
Posted 2 Aug 2006 by TCU Computer Science
Post:
I was having this problem with 5.22 (also there) already and now the same happens with 5.25 - stalled/hanging Rosettas.



The first time I saw this problem was with Ralph 5.18, then a couple of instances with Rosetta 5.22, then some with Rosetta 5.25 (http://boinc.bakerlab.org/rosetta/forum_thread.php?id=1891#20832).

I've seen the problem on Mac OS X and Linux (CentOS 4.3) but never on Windows. When the problem occurs on Linux, I stop BOINC but the Rosetta process remains in the process list. I have to kill it manually before restarting BOINC.
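
Checking for a leftover process after stopping BOINC can be done by scanning a process listing. The Python sketch below only parses text in the shape of `ps -eo pid,comm` output; the function name `leftover_processes`, the assumption that the executable name begins with "rosetta", and the sample listing are all mine, so adjust the prefix to whatever actually shows up in the process list.

```python
import os

def leftover_processes(ps_output, prefix="rosetta"):
    """Return (pid, command) pairs whose command name starts with prefix."""
    found = []
    for line in ps_output.splitlines()[1:]:   # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, command = int(parts[0]), parts[1].strip()
        # Some systems print a full path in the command column; compare the file name only.
        if os.path.basename(command).startswith(prefix):
            found.append((pid, command))
    return found

# Invented sample in the shape of `ps -eo pid,comm` output.
SAMPLE_PS = """  PID COMMAND
    1 init
 4321 rosetta_5.25_i686-pc-linux-gnu
 4400 sshd
"""
print(leftover_processes(SAMPLE_PS))
```

Any PID it reports would still have to be killed by hand, as described above.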
13) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 20832)
Posted 21 Jul 2006 by TCU Computer Science
Post:
Another stuck work unit:

Mac OS X 10.4.7
BOINC 5.4.9
wuid=25113559

The Messages tab shows the entry

Thu Jul 20 08:28:59 2006|rosetta@home|Starting task FRA_t370_CASP7_hom001_4_t370_4_1g76A_IGNORE_THE_REST_46_1010_22_0 using rosetta version 525

followed by a few lines about uploading the previous result and reporting task completion. Then nothing for 24 hours.

The Tasks tab shows CPU Time stuck at 00:00:01

top command shows over 24 hours accumulated and rising.

Had to stop and restart BOINC.
14) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 20421)
Posted 17 Jul 2006 by TCU Computer Science
Post:
Here is another one:

t321__CASP7_ABRELAX_SAVE_ALL_OUT_nterm_hom004__685_15852
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=23020078


It was running for six days.
The messages show it being paused and resumed at one hour intervals.
But the accumulated time was stuck at 8 hr 16 mins.

When I stopped boinc, I noticed that the process for that work unit remained in the process list.

I rebooted the machine and the work unit finished immediately.

This occurred on a Linux box different from the previous ones.
15) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 19944)
Posted 9 Jul 2006 by TCU Computer Science
Post:
Here is another one:

t329__MAPBACK_CLUSTER02_CASP7_ABRELAX_SAVE_ALL_OUT_CONTACT_ncap_hom001__826_17779
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=22976377


It was running for four days.
The messages show it being paused and resumed at one hour intervals.
But the accumulated time was stuck at 7 hr 48 mins.

I stopped and restarted boinc; the accumulated time began increasing, and the work unit finished normally about 30 minutes later.

This occurred on a Linux box different from the previous one.
16) Message boards : Number crunching : Report Problems with Rosetta Version 5.25 (Message 19744)
Posted 3 Jul 2006 by TCU Computer Science
Post:
WU ID 22634421

It's been running for three days.
The messages show it being paused and resumed at one hour intervals.
But the accumulated time was stuck at 1 hr 55 mins.

Stopped and restarted boinc and the accumulated time is now increasing.

This occurred on a Linux box.
17) Message boards : Number crunching : Report Problems with Rosetta Version 5.22 (Message 19102)
Posted 22 Jun 2006 by TCU Computer Science
Post:
rosetta 5.22
WU Name: t316__CASP7_JUMPABINITIO_SAVE_ALL_OUT_BARCODE_secondhalf_hom019__726_329
running on Mac OS 10.4.6

BOINC Manager Tasks tab shows CPU Time stuck at 03:21:43 and 35.5%
top command shows TIME = 37:51:05 and climbing

stopped and restarted BOINC
CPU Time reverted to 02:50:49 and 35.5% but no longer stuck

This is on a G5 crunching only for rosetta.
The two previous instances of this problem occurred on a G4 crunching rosetta + ralph + einstein.
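
To put a number on the mismatch between the two clocks, the times can be converted to seconds and subtracted. A short Python sketch, assuming both values are in HH:MM:SS form as written above (the helper name `hms_to_seconds` is just for illustration):

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

manager_cpu = hms_to_seconds("03:21:43")   # BOINC Manager, Tasks tab
top_time = hms_to_seconds("37:51:05")      # TIME column from top

gap_hours = (top_time - manager_cpu) / 3600
print(f"top is ahead of BOINC Manager by {gap_hours:.1f} hours")
```

The gap is CPU time the process consumed without BOINC Manager recording any of it.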
18) Message boards : Number crunching : Report Problems with Rosetta Version 5.22 (Message 18612)
Posted 14 Jun 2006 by TCU Computer Science
Post:
rosetta 5.22
WU Name: t314__CASP7_ABRELAX_SAVE_ALL_OUT_hom004__666_16529_0
running on Mac OS 10.4.6

BOINC Manager Tasks tab shows CPU Time stuck at 01:30:40 and 15%
top command shows TIME = 28:53:41 and climbing

stopped and restarted BOINC
CPU Time reverted to 01:13:00 and 15% but no longer stuck

Symptoms are identical to my post for ralph 5.18
19) Message boards : Number crunching : Report stuck & aborted 5.01 WU here please - III (Message 14898)
Posted 28 Apr 2006 by TCU Computer Science
Post:
Four more 5.01 WUs were aborted this morning

50.1 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18296499
HB_BARCODE_30_5croA_351_21027

51.9 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18296492
HB_BARCODE_30_1a19A_351_28780_3

53.0 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18296362
HBLR_1.0_1dtj_ROT_TRIALS_TRIE_449_27

89.6 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18037119
FA_RLXfn_hom001_1fna__357_63
20) Message boards : Number crunching : Report stuck & aborted 5.01 WU here please - III (Message 14559)
Posted 25 Apr 2006 by TCU Computer Science
Post:
Another three 5.01 WUs were aborted this afternoon:

23.6 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18037118
FA_RLXpg_hom001_1pgx__357_8

25.2 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18036816
HB_BARCODE_30_1a19A_351_23815

26.3 hrs
http://boinc.bakerlab.org/rosetta/result.php?resultid=18036689
HB_BARCODE_30_2chf__351_24265




