7 WUs returned in error by computer 131283

Message boards : Rosetta@home Science : 7 WUs returned in error by computer 131283

Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10335 - Posted: 2 Feb 2006, 7:51:58 UTC
Last modified: 2 Feb 2006, 7:55:08 UTC

These 7 WUs exited with error status 2 yesterday

While looking at my results to learn more
about that error, I found that none of them had been sent
to anyone else!

The error on the computer named in the subject was caused
by my inadvertently booting TWO OSes at the same time,
which wrote simultaneously to the same hard
disk space, without a CPU interlock mechanism
for each independent OS write -:(

So, Please re-issue these 7 WUs again -:)

No error on any of them ... only hard disk corruption

Thanks
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 10380 - Posted: 2 Feb 2006, 21:21:12 UTC - in response to Message 10335.  
Last modified: 2 Feb 2006, 21:31:25 UTC

These 7 WUs exited with error status 2 yesterday
...

So, Please re-issue these 7 WUs again -:)

No error on any of them ... only hard disk corruption

Thanks


Thanks for the thought - but this will happen automatically anyway.

When a WU is returned with an error, or if it is not returned at all before the deadline, the server automatically sends it out again (up to a maximum of three tries, at present). This means that work would only be skipped if three different people had spurious error reports.

Each project selects how many times BOINC should retry before giving up.

Three tries was chosen as a sensible balance between trying too many times when there really is something wrong with the WU, and not trying enough when it is a non-WU error like this one.

Actually I am not sure if it is three tries in total, or three re-tries making four in total, but you get the idea either way. We will all be able to see which it is by following this WU over the next few days, because it has already errored three times now.
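
The reissue rule described above can be sketched roughly as follows. This is only an illustration and not BOINC's actual server code: `WorkUnit`, `max_tries`, and `record_failure` are hypothetical names, and the limit is assumed to mean three tries in total.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    max_tries: int = 3  # assumed per-project limit (three tries in total)
    failures: int = 0   # error results or missed deadlines so far


def record_failure(wu: WorkUnit) -> bool:
    """Record one failed attempt and report whether to send the WU out again."""
    wu.failures += 1
    # Reissue only while the total number of tries is still under the limit.
    return wu.failures < wu.max_tries
```

Under the other reading (three re-tries, so four tries in total) the comparison would be `<=` instead of `<`.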

With thousands of participants, mistakes like this are bound to happen most days to *somebody* on the project, and just by bad luck it was your turn that day. And it is by good design to expect such errors and work round them automatically.

If you keep an eye on those WU, you will see them reissued to someone else in due course - or you can just forget about them and trust the system to do its thing.

River~~
Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10409 - Posted: 3 Feb 2006, 15:08:38 UTC


Profile Bok

Joined: 17 Sep 05
Posts: 54
Credit: 3,514,973
RAC: 0
Message 10410 - Posted: 3 Feb 2006, 15:16:52 UTC

Carlos,

the difference in versions is, I believe, just a fix for a Windows-only bug. There was no need to release this version for Linux.

Most of my machines run Rosetta under Linux, a mixture of Gentoo and Red Hat, with zero problems.

Bok
Free-DC

Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10450 - Posted: 4 Feb 2006, 8:43:03 UTC

Which "windows only" bug ?

Unhandled exception ? Division by zero ?
Profile Dimitris Hatzopoulos

Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10469 - Posted: 4 Feb 2006, 21:15:42 UTC
Last modified: 4 Feb 2006, 21:19:02 UTC

Carlos, a quick note: normally you shouldn't need to use kill -9, a simple kill (implies "15", i.e. SIGTERM) instead should be fine 99.9% of the time (unless the app itself blocks it).

A normal SIGTERM kill (no argument, or -15) is the equivalent of shutting down Windows gracefully via Start->Shutdown (funny, like the BIOS error "<Beep!> Keyboard not found. Press any key to continue"). It lets the app write its stuff to disk etc.

SIGKILL (-9) is like pulling the power plug.
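
A minimal sketch of that advice in Python, assuming the process was started from the same script with `subprocess.Popen` on a POSIX system. `stop_gracefully` and `grace_seconds` are names made up for this illustration: try SIGTERM first, and only fall back to SIGKILL if the process has not exited after a grace period.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask a child process to exit with SIGTERM; fall back to SIGKILL.

    Returns the exit status from Popen.wait(). On POSIX a negative
    value -N means the child was terminated by signal N.
    """
    proc.terminate()  # sends SIGTERM on POSIX, like a plain `kill <pid>`
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or blocked SIGTERM: last resort, like `kill -9`.
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()
```

For example, `stop_gracefully(subprocess.Popen(["sleep", "60"]))` should return as soon as the child has handled SIGTERM, without ever reaching the SIGKILL branch.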

PS: I also had several Rosetta 4.80 WUs freeze on Linux 2.4.27 (Debian Sarge), which I had to kill (and BOINC restarted them). I've never had a WU freeze under WinXP so far. I've never had a Rosetta 4.21 (WCG/HPF) freeze so far (Linux or Win).
Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10624 - Posted: 10 Feb 2006, 14:46:30 UTC

I am getting this error on some WUs under Linux 2.4.x, and Rosetta stays frozen in RAM
until I pkill -9 boinc
and then restart it manually

*A Linux only bug ???

crobertp [/home/boinc/BOINC] > ./boinc -redirectio -allow_remote_gui_rpc -return_results_immediately &
[1] 16353
crobertp [/home/boinc/BOINC] > ssh think@matrix.cp3
ssh: connect to host matrix.cp3 port 22: Connection timed out
crobertp [/home/boinc/BOINC] > w
11:48am up 14 days, 36 min, 1 user, load average: 0.67, 0.47, 0.22
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
boinc pts/3 200.149.245.172 10:45am 0.00s 2:33 0.01s w
crobertp [/home/boinc/BOINC] > *** glibc detected *** corrupted double-linked list: 0x093eb028 ***

crobertp [/home/boinc/BOINC] >

Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10625 - Posted: 10 Feb 2006, 14:49:10 UTC

crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 14309 0.0 1.3 5444 3308 ? S 04:48 0:04 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 16088 0.0 0.8 7200 2144 ? S 10:45 0:00 /usr/sbin/sshd
boinc 16089 0.0 0.8 3484 2232 pts/3 S 10:45 0:00 -bash
boinc 16137 24.1 23.1 109600 57416 ? SN 10:46 13:18 rosetta_4.80_i686-pc-linux-gnu cc 1fna _ -abrelax -stringent_r
boinc 16138 0.0 23.1 109600 57416 ? SN 10:46 0:00 rosetta_4.80_i686-pc-linux-gnu cc 1fna _ -abrelax -stringent_r
boinc 16139 0.0 23.1 109600 57416 ? SN 10:46 0:00 rosetta_4.80_i686-pc-linux-gnu cc 1fna _ -abrelax -stringent_r
boinc 16345 0.0 0.2 2544 664 pts/3 R 11:41 0:00 ps xu
crobertp [/home/boinc/BOINC] > pkill boinc
crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 16088 0.0 0.8 7200 2144 ? S 10:45 0:00 /usr/sbin/sshd
boinc 16089 0.0 0.8 3484 2232 pts/3 S 10:45 0:00 -bash
boinc 16350 0.0 0.2 2532 652 pts/3 R 11:42 0:00 ps xu
crobertp [/home/boinc/BOINC] > ./boinc -redirectio -allow_remote_gui_rpc -return_results_immediately &
[1] 16353
crobertp [/home/boinc/BOINC] > ssh think@matrix.cp3
ssh: connect to host matrix.cp3 port 22: Connection timed out
crobertp [/home/boinc/BOINC] > w
11:48am up 14 days, 36 min, 1 user, load average: 0.67, 0.47, 0.22
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
boinc pts/3 200.149.245.172 10:45am 0.00s 2:33 0.01s w
crobertp [/home/boinc/BOINC] > *** glibc detected *** corrupted double-linked list: 0x093eb028 ***

crobertp [/home/boinc/BOINC] > pkill boinc
crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 16088 0.0 0.9 7216 2276 ? S 10:45 0:00 /usr/sbin/sshd
boinc 16089 0.0 0.7 3484 1952 pts/3 S 10:45 0:00 -bash
boinc 16549 0.0 0.2 2532 652 pts/3 R 12:39 0:00 ps xu
[1]+ Done ./boinc -redirectio -allow_remote_gui_rpc -return_results_immediately
crobertp [/home/boinc/BOINC] > ./boinc -redirectio -allow_remote_gui_rpc -return_results_immediately &
[1] 16551
crobertp [/home/boinc/BOINC] > ps xu
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
boinc 16088 0.0 0.8 7220 2196 ? S 10:45 0:00 /usr/sbin/sshd
boinc 16089 0.0 0.7 3484 1856 pts/3 S 10:45 0:00 -bash
boinc 16551 0.0 1.0 4972 2592 pts/3 S 12:39 0:00 ./boinc -redirectio -allow_remote_gui_rpc -return_results_imme
boinc 16553 99.7 25.1 156424 62348 pts/3 RN 12:39 8:36 rosetta_4.80_i686-pc-linux-gnu cc 1cc8 A -abrelax -stringent_r
boinc 16554 0.0 25.1 156424 62348 pts/3 SN 12:39 0:00 rosetta_4.80_i686-pc-linux-gnu cc 1cc8 A -abrelax -stringent_r
boinc 16555 0.0 25.1 156424 62348 pts/3 SN 12:39 0:00 rosetta_4.80_i686-pc-linux-gnu cc 1cc8 A -abrelax -stringent_r
boinc 16602 0.0 0.2 2544 664 pts/3 R 12:48 0:00 ps xu
crobertp [/home/boinc/BOINC] >

Profile Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 10626 - Posted: 10 Feb 2006, 14:52:14 UTC

crobertp [/home/boinc/BOINC] > lsof -u boinc
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
sshd 16088 boinc cwd DIR 3,9 4096 2 /
sshd 16088 boinc rtd DIR 3,9 4096 2 /
sshd 16088 boinc txt REG 3,9 306504 456268 /usr/sbin/sshd
sshd 16088 boinc mem REG 3,9 99663 232524 /lib/ld-2.3.2.so
sshd 16088 boinc mem REG 3,9 31592 228843 /lib/libnss_nis-2.3.2.so
sshd 16088 boinc mem REG 3,9 194800 277040 /usr/lib/libopensc.so.0.0.6
sshd 16088 boinc mem REG 3,9 33748 228098 /lib/libpam.so.0.75
sshd 16088 boinc mem REG 3,9 9064 228276 /lib/libdl-2.3.2.so
sshd 16088 boinc mem REG 3,9 56684 228848 /lib/libresolv-2.3.2.so
sshd 16088 boinc mem REG 3,9 7700 228070 /lib/libutil-2.3.2.so
sshd 16088 boinc mem REG 3,9 51356 276901 /usr/lib/libz.so.1.1.4
sshd 16088 boinc mem REG 3,9 69132 228372 /lib/libnsl-2.3.2.so
sshd 16088 boinc mem REG 3,9 968116 277204 /usr/lib/libcrypto.so.0.9.7
sshd 16088 boinc mem REG 3,9 422088 586497 /usr/lib/krb5/libkrb5.so.3.1
sshd 16088 boinc mem REG 3,9 76724 586489 /usr/lib/krb5/libk5crypto.so.3.0
sshd 16088 boinc mem REG 3,9 5468 228084 /lib/libcom_err.so.2.0
sshd 16088 boinc mem REG 3,9 1230116 228237 /lib/libc-2.3.2.so
sshd 16088 boinc mem REG 3,9 55692 277030 /usr/lib/libpcsclite.so.0.0.1
sshd 16088 boinc mem REG 3,9 20276 277046 /usr/lib/libscconf.so.0.0.0
sshd 16088 boinc mem REG 3,9 101347 228847 /lib/libpthread-0.10.so
sshd 16088 boinc mem REG 3,9 4132 586479 /usr/lib/krb5/libcom_err.so.3.0
sshd 16088 boinc mem CHR 1,5 162944 /dev/zero
sshd 16088 boinc mem REG 3,9 4616 586434 /lib/security/pam_nologin.so
sshd 16088 boinc mem REG 3,9 13616 586420 /lib/security/pam_cracklib.so
sshd 16088 boinc mem REG 3,9 27480 276952 /usr/lib/libcrack.so.2.7
sshd 16088 boinc mem REG 3,9 13140 586428 /lib/security/pam_limits.so
sshd 16088 boinc mem REG 3,9 42628 228839 /lib/libnss_files-2.3.2.so
sshd 16088 boinc mem REG 3,9 39952 228844 /lib/libnss_nisplus-2.3.2.so
sshd 16088 boinc mem REG 3,9 53296 586446 /lib/security/pam_unix.so
sshd 16088 boinc mem REG 3,9 4260 586452 /lib/security/pam_warn.so
sshd 16088 boinc mem REG 3,9 3220 586421 /lib/security/pam_deny.so
sshd 16088 boinc mem REG 3,9 18184 228238 /lib/libcrypt-2.3.2.so
sshd 16088 boinc mem CHR 1,5 162944 /dev/zero
sshd 16088 boinc 0u CHR 1,3 162942 /dev/null
sshd 16088 boinc 1u CHR 1,3 162942 /dev/null
sshd 16088 boinc 2u CHR 1,3 162942 /dev/null
sshd 16088 boinc 3u unix 0xc5de90c0 2153261 socket
sshd 16088 boinc 4r FIFO 0,4 2153263 pipe
sshd 16088 boinc 5w FIFO 0,4 2153263 pipe
sshd 16088 boinc 6u IPv4 2153210 TCP 212247.rjo.virtua.com.br:ssh->200.149.245.172:4864 (ESTABLISHED)
sshd 16088 boinc 7r FIFO 0,4 2504 pipe
sshd 16088 boinc 8w FIFO 0,4 2504 pipe
sshd 16088 boinc 9u CHR 5,2 164136 /dev/ptmx
sshd 16088 boinc 10u CHR 5,2 164136 /dev/ptmx
sshd 16088 boinc 11u CHR 5,2 164136 /dev/ptmx
sshd 16088 boinc 21w CHR 1,3 162942 /dev/null
bash 16089 boinc cwd DIR 3,9 4096 505732 /home/boinc/BOINC
bash 16089 boinc rtd DIR 3,9 4096 2 /
bash 16089 boinc txt REG 3,9 626348 16294 /bin/bash
bash 16089 boinc mem REG 3,9 99663 232524 /lib/ld-2.3.2.so
bash 16089 boinc mem REG 3,9 370 537779 /usr/lib/locale/en_US/LC_IDENTIFICATION
bash 16089 boinc mem REG 3,9 28 538365 /usr/lib/locale/en_US/LC_MEASUREMENT
bash 16089 boinc mem REG 3,9 64 538362 /usr/lib/locale/en_US/LC_TELEPHONE
bash 16089 boinc mem REG 3,9 160 538366 /usr/lib/locale/en_US/LC_ADDRESS
bash 16089 boinc mem REG 3,9 82 538363 /usr/lib/locale/en_US/LC_NAME
bash 16089 boinc mem REG 3,9 39 538338 /usr/lib/locale/en_US/LC_PAPER
bash 16089 boinc mem REG 3,9 57 765681 /usr/lib/locale/en_US/LC_MESSAGES/SYS_LC_MESSAGES
bash 16089 boinc mem REG 3,9 291 538364 /usr/lib/locale/en_US/LC_MONETARY
bash 16089 boinc mem REG 3,9 21499 248830 /usr/lib/locale/en_US/LC_COLLATE
bash 16089 boinc mem REG 3,9 2456 538361 /usr/lib/locale/en_US/LC_TIME
bash 16089 boinc mem REG 3,9 59 244374 /usr/lib/locale/en_US/LC_NUMERIC
bash 16089 boinc mem REG 3,9 5500 977493 /usr/lib/gconv/ISO8859-1.so
bash 16089 boinc mem REG 3,9 252784 228080 /lib/libncurses.so.5.2
bash 16089 boinc mem REG 3,9 9064 228276 /lib/libdl-2.3.2.so
bash 16089 boinc mem REG 3,9 1230116 228237 /lib/libc-2.3.2.so
bash 16089 boinc mem REG 3,9 178468 245165 /usr/lib/locale/en_US/LC_CTYPE
bash 16089 boinc 0u CHR 136,3 5 /dev/pts/3
bash 16089 boinc 1u CHR 136,3 5 /dev/pts/3
bash 16089 boinc 2u CHR 136,3 5 /dev/pts/3
bash 16089 boinc 255u CHR 136,3 5 /dev/pts/3
boinc 16551 boinc cwd DIR 3,9 4096 505732 /home/boinc/BOINC
boinc 16551 boinc rtd DIR 3,9 4096 2 /
boinc 16551 boinc txt REG 3,9 2468096 505776 /home/boinc/BOINC/boinc
boinc 16551 boinc mem DEL 0,3 31555584 /SYSV000933d7
boinc 16551 boinc mem REG 3,9 42628 228839 /lib/libnss_files-2.3.2.so
boinc 16551 boinc mem REG 3,9 1230116 228237 /lib/libc-2.3.2.so
boinc 16551 boinc mem REG 3,9 99663 232524 /lib/ld-2.3.2.so
boinc 16551 boinc 0u CHR 136,3 5 /dev/pts/3
boinc 16551 boinc 1w REG 3,9 2044434 505771 /home/boinc/BOINC/stdoutdae.txt
boinc 16551 boinc 2w REG 3,9 11791 505769 /home/boinc/BOINC/stderrdae.txt
boinc 16551 boinc 3wW REG 3,9 0 505746 /home/boinc/BOINC/lockfile
boinc 16551 boinc 4r DIR 3,9 4096 930768 /home/boinc/BOINC/slots
boinc 16551 boinc 5u IPv4 2169693 TCP *:1043 (LISTEN)
boinc 16551 boinc 6u IPv4 2169701 TCP 212247.rjo.virtua.com.br:1043->200.149.245.172:1550 (ESTABLISHED)
rosetta_4 16553 boinc cwd DIR 3,9 4096 930773 /home/boinc/BOINC/slots/0
rosetta_4 16553 boinc rtd DIR 3,9 4096 2 /
rosetta_4 16553 boinc txt REG 3,9 8323696 963718 /home/boinc/BOINC/projects/boinc.bakerlab.org_rosetta/rosetta_4.80_i686-pc-linux-gnu
rosetta_4 16553 boinc mem DEL 0,3 31555584 /SYSV000933d7
rosetta_4 16553 boinc 0u CHR 136,3 5 /dev/pts/3
rosetta_4 16553 boinc 1w REG 3,9 153458 932310 /home/boinc/BOINC/slots/0/stdout.txt
rosetta_4 16553 boinc 2w REG 3,9 587 932309 /home/boinc/BOINC/slots/0/stderr.txt
rosetta_4 16553 boinc 3w REG 3,9 0 505746 /home/boinc/BOINC/lockfile
rosetta_4 16553 boinc 4wW REG 3,9 0 932311 /home/boinc/BOINC/slots/0/boinc_lockfile
rosetta_4 16553 boinc 5u IPv4 2169693 TCP *:1043 (LISTEN)
rosetta_4 16553 boinc 6r FIFO 0,4 2169698 pipe
rosetta_4 16553 boinc 7w FIFO 0,4 2169698 pipe
rosetta_4 16554 boinc cwd DIR 3,9 4096 930773 /home/boinc/BOINC/slots/0
rosetta_4 16554 boinc rtd DIR 3,9 4096 2 /
rosetta_4 16554 boinc txt REG 3,9 8323696 963718 /home/boinc/BOINC/projects/boinc.bakerlab.org_rosetta/rosetta_4.80_i686-pc-linux-gnu
rosetta_4 16554 boinc mem DEL 0,3 31555584 /SYSV000933d7
rosetta_4 16554 boinc 0u CHR 136,3 5 /dev/pts/3
rosetta_4 16554 boinc 1w REG 3,9 153458 932310 /home/boinc/BOINC/slots/0/stdout.txt
rosetta_4 16554 boinc 2w REG 3,9 587 932309 /home/boinc/BOINC/slots/0/stderr.txt
rosetta_4 16554 boinc 3w REG 3,9 0 505746 /home/boinc/BOINC/lockfile
rosetta_4 16554 boinc 4w REG 3,9 0 932311 /home/boinc/BOINC/slots/0/boinc_lockfile
rosetta_4 16554 boinc 5u IPv4 2169693 TCP *:1043 (LISTEN)
rosetta_4 16554 boinc 6r FIFO 0,4 2169698 pipe
rosetta_4 16554 boinc 7w FIFO 0,4 2169698 pipe
rosetta_4 16555 boinc cwd DIR 3,9 4096 930773 /home/boinc/BOINC/slots/0
rosetta_4 16555 boinc rtd DIR 3,9 4096 2 /
rosetta_4 16555 boinc txt REG 3,9 8323696 963718 /home/boinc/BOINC/projects/boinc.bakerlab.org_rosetta/rosetta_4.80_i686-pc-linux-gnu
rosetta_4 16555 boinc mem DEL 0,3 31555584 /SYSV000933d7
rosetta_4 16555 boinc 0u CHR 136,3 5 /dev/pts/3
rosetta_4 16555 boinc 1w REG 3,9 153458 932310 /home/boinc/BOINC/slots/0/stdout.txt
rosetta_4 16555 boinc 2w REG 3,9 587 932309 /home/boinc/BOINC/slots/0/stderr.txt
rosetta_4 16555 boinc 3w REG 3,9 0 505746 /home/boinc/BOINC/lockfile
rosetta_4 16555 boinc 4w REG 3,9 0 932311 /home/boinc/BOINC/slots/0/boinc_lockfile
rosetta_4 16555 boinc 5u IPv4 2169693 TCP *:1043 (LISTEN)
rosetta_4 16555 boinc 6r FIFO 0,4 2169698 pipe
rosetta_4 16555 boinc 7w FIFO 0,4 2169698 pipe
crobertp [/home/boinc/BOINC] >





©2024 University of Washington
https://www.bakerlab.org