Posts by ziegenmelker

1) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 48257)
Posted 1 Nov 2007 by ziegenmelker
Post:
Some more:

5.80: SIGSEGV and '*** glibc detected *** corrupted double-linked list: 0x0a01aa28 ***', but valid and granted credits (32 for 4h ???)
5.80: process got signal 11 and 2 SIGSEGV: Invalid
5.69: process exited with code 193 (0xc1) and 3 SIGSEGV: Invalid
5.80: process exited with code 193 (0xc1) and 1 SIGSEGV: Invalid

I shortened the crunching time from 4 h to 1 h.

5.80: *** glibc detected *** corrupted double-linked list: 0x097ea480 *** and 1 SIGSEGV: Valid
5.69: resultid=116839781: Valid
5.69: resultid=116896290 1 SIGSEGV: Valid

The '*** glibc detected *** corrupted double-linked list:' is an error in the app.
One of the last (valid) WUs got stuck, so I shut down BOINC and restarted it, and the WU then finished successfully.

This host is doing work for Einstein (32-bit), ABC (64-bit), SETI (64-bit) and WCG (32-bit) without problems.

cu,
Michael

2) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47678)
Posted 13 Oct 2007 by ziegenmelker
Post:
A valid WU, but still with errors:

<core_client_version>5.10.8</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 3647667
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d7cf2f]
[0x8d77d1c]
[0xffffe500]
[0x8e024c7]
[0x8dd2715]
[0x8dd2481]
[0x83f9b8b]
[0x8de873f]
[0x8d79987]
[0x8d7afa5]
[0x8d73f9d]
[0x8e1487a]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
ERROR:: Exit from: fragments.cc line: 465
FILE_LOCK::unlock(): close failed.: Bad file descriptor
*** glibc detected *** double free or corruption (fasttop): 0x0909e348 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
*** glibc detected *** corrupted double-linked list: 0x09757f20 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
*** glibc detected *** corrupted double-linked list: 0x09511408 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 14211.6 cpu seconds
This process generated 19 decoys from 19 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>
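
A minimal sketch for tallying the crash markers in a stderr dump like the one above, so that counts such as "1 SIGSEGV" do not have to be made by eye. The marker strings are copied from the log; treating every "# cpu_run_time_pref:" line as one start of the app is an assumption, and the names `MARKERS` and `tally` are made up for this example.

```python
import re
from collections import Counter

# Marker lines as they appear in the quoted stderr text.
# Assumption: each "# cpu_run_time_pref:" line marks one (re)start of the app.
MARKERS = {
    "SIGSEGV": re.compile(r"^SIGSEGV: segmentation violation", re.M),
    "SIGABRT": re.compile(r"^SIGABRT: abort called", re.M),
    "glibc": re.compile(r"^\*\*\* glibc detected \*\*\*", re.M),
    "app_start": re.compile(r"^# cpu_run_time_pref:", re.M),
}

def tally(stderr_text):
    """Count how often each marker occurs in a stderr dump."""
    return Counter({name: len(p.findall(stderr_text)) for name, p in MARKERS.items()})

# Shortened excerpt of the log above (stack frames omitted).
sample = """\
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 3647667
SIGSEGV: segmentation violation
Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
*** glibc detected *** double free or corruption (fasttop): 0x0909e348 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
*** glibc detected *** corrupted double-linked list: 0x09757f20 ***
"""

for name, count in tally(sample).items():
    print(f"{name}: {count}")
```

Run against the full stderr text of a result, this gives a quick per-WU summary that can be compared across app versions.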

I really wonder about all these SIGSEGV errors. I don't think they are hardware related.
"glibc detected *** corrupted double-linked list" is most likely caused by the app itself.
System: AMD 64 X2 4400, 2 GB RAM, standard clock, 64-bit OpenSUSE 10.2, glibc-2.5-25.

cu,
Michael
3) Message boards : Number crunching : Preemption Failures on Linux (Message 47590)
Posted 10 Oct 2007 by ziegenmelker
Post:
Another crash:

2007-10-10 07:18:09 [ABC@home] Starting abc_wu_1389383660000_400000_0
2007-10-10 07:18:09 [ABC@home] Starting task abc_wu_1389383660000_400000_0 using abc-finder version 103
2007-10-10 07:18:10 [rosetta@home] Deferring communication for 1 min 0 sec
2007-10-10 07:18:10 [rosetta@home] Reason: Unrecoverable error for result BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 (process exited with code 193 (0xc1, -63))
2007-10-10 07:18:10 [rosetta@home] Computation for task BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 finished
2007-10-10 07:18:10 [rosetta@home] Output file BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0_0 for task BENCH_051207_ABRELAX_SAVE_ALL_OUT_-5croA-_BARCODE_R16_filters_2164_3727_0 absent
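
The three numbers in "code 193 (0xc1, -63)" are the same byte shown in decimal, in hex, and as a signed 8-bit value. A small sketch of that arithmetic, with a hypothetical helper name `describe_exit_code`; it only reproduces the number formatting and says nothing about what the code means to BOINC or Rosetta.

```python
def describe_exit_code(code):
    """Format an exit status as decimal, hex and signed 8-bit value."""
    low = code & 0xFF                            # exit statuses fit in one byte
    signed = low - 256 if low >= 128 else low    # two's-complement reading
    return f"{low} (0x{low:x}, {signed})"

print(describe_exit_code(193))
```

For 193 this yields the same triple as in the log line above: 193 is 0xc1 in hex, and 193 - 256 = -63.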

cu,
Michael
4) Message boards : Number crunching : Preemption Failures on Linux (Message 47588)
Posted 10 Oct 2007 by ziegenmelker
Post:
A few minutes ago I started my box, and again BOINC Manager shows just one active RAH task, but top and gnome-system-monitor show 4(!) RAH processes.
When I start this host, all apps that were active during shutdown get loaded again. No swap is in use and only 835MB of memory is used so far.
RAH is allowed to use more than 1GB of RAM, so atm there is no reason to preempt anything.
Btw. BOINC is started through init, i.e. before user apps are loaded.

I got two more successfully finished WUs, but again their logs show one SIGABRT and one SIGSEGV.

All three WUs that have succeeded so far are of the type 'BENCH_051207_ABRELAX_SAVE_ALL_OUT...', and I abort all 'sen15__RESAMPLE' WUs because they always crash.

Are all these problems maybe mostly/only related to 64-Bit systems?
Are all affected Kernel versions 2.6.8 and later? (-> NX-Bit)

My BOINC version is 5.10.8 (64-bit beta); the system is always fully patched.

Next I will start a VMWare session with 768MB RAM to see what happens. ;-)

cu,
Michael
5) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47586)
Posted 10 Oct 2007 by ziegenmelker
Post:
According to this post I kill every 'sen15_RESAMPLE_BOINC_MFR_ABRELAX_...' WU.

cu,
Michael
6) Message boards : Number crunching : Preemption Failures on Linux (Message 47548)
Posted 9 Oct 2007 by ziegenmelker
Post:
My system: AMD 64 X2 4400+, 2GB RAM, 64-Bit OpenSuse 10.2

General Preferences (default):

Use at most 75% of page file (swap space)
Use at most 50% of memory when computer is in use
Use at most 90% of memory when computer is idle

At least if I don't start any VMWare sessions, there should be plenty of RAM left.
But anyway, I have raised the 'mem in use' value to 60%.
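
As a rough check of what these percentages allow on this box, here is a small calculation. It assumes that "2GB RAM" means 2048 MB and that each limit is simply that percentage of total RAM; how BOINC actually measures and enforces the limits is not covered here. `TOTAL_RAM_MB` and `limit_mb` are names chosen for the example.

```python
TOTAL_RAM_MB = 2048  # assumption: "2GB RAM" taken as 2048 MB

def limit_mb(percent, total_mb=TOTAL_RAM_MB):
    """Memory allowance in MB for a 'use at most N% of memory' preference."""
    return total_mb * percent / 100.0

for label, pct in [("in use, default", 50), ("in use, raised", 60), ("idle", 90)]:
    print(f"{label} ({pct}%): {limit_mb(pct):.0f} MB")
```

Under these assumptions the default 'in use' setting corresponds to 1024 MB, and raising it to 60% adds roughly 200 MB.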

But the host belongs to location 'home', where no preferences were defined yet. Now I've defined the same preferences for 'home' as for 'default'.

At the moment, there is just one Rosetta task running and one ABC task.
Top says there are 5 Rosetta processes, and the amounts of reserved and shared memory they use change simultaneously (e.g. 153/75MB).

Gnome system monitor sees only one Rosetta parent process, one child, and this child has another two children. Today I didn't start any memory-hungry apps like VMWare. Both cores run at almost 100%, 1.3GB of memory is used, swap is 528kB.

I'm proud: today this host delivered its first and only successfully finished WU! But this WU still has a couple of SIGSEGVs in the log file. Maybe the NX-Bit is a reason for this?

cu,
Michael
7) Message boards : Number crunching : Problems with Rosetta stable version 5.69 and beta version 5.77 (Message 47546)
Posted 9 Oct 2007 by ziegenmelker
Post:
Please add your comments to the Linux thread as you study it further.


Thanks, I'll do so.

cu,
Michael
8) Message boards : Number crunching : Problems with Rosetta stable version 5.69 and beta version 5.77 (Message 47535)
Posted 9 Oct 2007 by ziegenmelker
Post:
This host is really trying hard to get a valid WU. :-(
Btw. when I shut down the machine yesterday, there were afair 9(!) instances of Rosetta@home in memory, each using 79MB of RAM. At that time one WU was active, and another one was partly done and waiting to run again.

<core_client_version>5.10.8</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 2692519
pure virtual method called
SIGSEGV: segmentation violation
Stack trace (9 frames):
[0x8cdfe17]
[0x8cdac0c]
[0xffffe500]
[0x8d65433]
[0x8d4b794]
[0x8cdc897]
[0x8cddeb5]
[0x8cd6ea5]
[0x8d777fa]

Exiting...
terminate called without an active exception
SIGABRT: abort called
Stack trace (19 frames):
[0x8cdfe17]
[0x8cdac0c]
[0xffffe500]
[0x8d4b224]
[0x8d38b0e]
[0x8d35e9d]
[0x8d35ed2]
[0x8d355b5]
[0x8be23b3]
[0x8bea61d]
[0x8b50074]
[0x8c31c58]
[0x849a8a1]
[0x80dad6d]
[0x85c5a97]
[0x86eda4f]
[0x86edafa]
[0x8d44164]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (13 frames):
[0x8cdfe17]
[0x8cdac0c]
[0xffffe500]
[0x8c4a1db]
[0x8b51266]
[0x8c31c58]
[0x849a87c]
[0x80dad6d]
[0x85c5a97]
[0x86eda4f]
[0x86edafa]
[0x8d44164]
[0x8048111]

Exiting...
SIGSEGV: segmentation violation

</stderr_txt>
]]>


Right now two WUs are waiting to run:
Rosetta Beta 5.80: 1ubi__BOINC_ABRELAX_SHORTREL... 85.247 %
Rosetta 5.69: CNTRL_01ABRELAX_SAVE_ALL_OU... 9.768 %

Nothing related to Rosetta is in memory. If this changes, I will report here.

cu,
Michael

9) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47498)
Posted 7 Oct 2007 by ziegenmelker
Post:
I think it's not because of the app, but because of the WU type. All my wingmen got errors too. My next WU is a different type (1ubi__BOINC_ABRELAX_SHORTRELAX_SAVE_ALL_OUT-1ubi_-frags83__2162...), so I'm curious whether this one will crash too.

cu,
Michael
10) Message boards : Number crunching : Problems with Rosetta version 5.80 (Message 47492)
Posted 7 Oct 2007 by ziegenmelker
Post:
Two compute errors with 5.80 (64-Bit Linux on X2 4400 no oc).

First one:

###BEGIN############################################################
<core_client_version>5.10.8</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 3979598
*** glibc detected *** corrupted double-linked list: 0x09647f08 ***
SIGABRT: abort called
Stack trace (19 frames):
[0x8d7cf2f]
[0x8d77d1c]
[0xffffe500]
[0x8de8234]
[0x8dfd0ce]
[0x8e01ae2]
[0x8e02774]
[0x8e04045]
[0x8dd24b7]
[0x8dd3f51]
[0x8b1c308]
[0x8ccedcd]
[0x84b7f90]
[0x80d82b5]
[0x85f6c37]
[0x87320a7]
[0x8732152]
[0x8de10f4]
[0x8048121]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
*** glibc detected *** corrupted double-linked list: 0x099a9ea0 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 14009.8 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>sen15_RESAMPLE_BOINC_MFR_ABRELAX_PICKED_2155_403_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
###END########################################################################

And the second one:

###BEGIN######################################################################
<core_client_version>5.10.8</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 3979904
*** glibc detected *** corrupted double-linked list: 0x0991f6b8 ***
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13128 cpu seconds
This process generated 9 decoys from 9 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>sen15_RESAMPLE_BOINC_MFR_ABRELAX_PICKED_2155_97_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
###END######################################################################

Maybe I should reduce the crunching time to one hour until the problems are solved?
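
To get a feel for what a one-hour target would mean, the sketch below reads the "DONE" summary lines from the two logs above and estimates CPU seconds per decoy. It assumes the summary always has this wording, that time per decoy stays roughly constant for a given WU type, and that a one-hour preference would appear as 3600 in "cpu_run_time_pref"; how the app actually decides when to stop is not known from these logs. The function names are hypothetical.

```python
import re

# Patterns match the summary lines exactly as they appear in the logs above.
DONE_RE = re.compile(r"DONE :: \d+ starting structures\s+([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"This process generated (\d+) decoys from \d+ attempts")

def seconds_per_decoy(stderr_text):
    """Return CPU seconds per decoy, or None if the summary is missing."""
    done = DONE_RE.search(stderr_text)
    decoys = DECOY_RE.search(stderr_text)
    if not done or not decoys or int(decoys.group(1)) == 0:
        return None
    return float(done.group(1)) / int(decoys.group(1))

def decoys_within(budget_seconds, per_decoy):
    """Whole decoys that fit into a CPU-time budget at a constant rate."""
    return int(budget_seconds // per_decoy)

# Summary lines copied from the two results above.
first = ("DONE :: 1 starting structures 14009.8 cpu seconds\n"
         "This process generated 10 decoys from 10 attempts\n")
second = ("DONE :: 1 starting structures 13128 cpu seconds\n"
          "This process generated 9 decoys from 9 attempts\n")

for text in (first, second):
    rate = seconds_per_decoy(text)
    print(f"{rate:.1f} s per decoy, about {decoys_within(3600, rate)} decoys in 1 h")
```

Both results come out at roughly 1400 to 1460 seconds per decoy, so under these assumptions a one-hour run of this WU type would complete about two decoys.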

cu,
Michael
