Minirosetta 3.14

Alan J Rodger
Joined: 16 Oct 05
Posts: 7
Credit: 32,282
RAC: 0
Message 70731 - Posted: 15 Jul 2011, 16:19:04 UTC

Still getting hangups on Mini 3.14 - is there a plan to fix this or stop issuing 3.14 WUs?

darkestkhan
Joined: 16 Nov 09
Posts: 2
Credit: 4,886
RAC: 0
Message 70732 - Posted: 15 Jul 2011, 23:01:46 UTC

I just got yet another SIGSEGV on up-to-date Debian GNU/Linux sid/exp:

*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu: double free or corruption (fasttop): 0x0c4ee8a8 ***
======= Backtrace: =========
[0xa449b81]
[0xa44d69b]
[0xa411111]
[0xa427a5d]
[0xa38b0ca]
[0xa38b50a]
[0xf776d400]
[0x80501d0]
[0xa45bafc]
[0x817b9ff]
[0x8049480]
[0xa4602de]
======= Memory map: ========
08048000-0a999000 r-xp 00000000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a999000-0a9a0000 rwxp 02950000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a9a0000-0ab5c000 rwxp 00000000 00:00 0
0c402000-17cf0000 rwxp 00000000 00:00 0 [heap]
ef900000-ef980000 rwxp 00000000 00:00 0
ef980000-efa00000 ---p 00000000 00:00 0
efa85000-f070a000 rwxp 00000000 00:00 0
f08ce000-f5a2e000 rwxp 00000000 00:00 0
f5a2e000-f5a2f000 ---p 00000000 00:00 0
f5a2f000-f622e000 rwxp 00000000 00:00 0
f622e000-f7542000 rwxs 00000000 fe:02 1089644 /home/darkestkhan/BOINC/slots/0/boinc_minirosetta_0
f7542000-f7543000 ---p 00000000 00:00 0
f7543000-f7546000 rwxp 00000000 00:00 0
f7546000-f7548000 rwxs 00000000 fe:02 1089639 /home/darkestkhan/BOINC/slots/0/boinc_mmap_file
f7548000-f776d000 rwxp 00000000 00:00 0
f776d000-f776e000 r-xp 00000000 00:00 0 [vdso]
ff97c000-ff9bc000 rw-p 00000000 00:00 0 [stack]
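
For anyone reading a dump like this one, a quick way to see where the addresses fall is to match them against the memory map. The following is only a minimal sketch: the helper names `parse_maps` and `locate` are made up for illustration, and the sample contains just three of the map lines from the post above.

```python
import re

# start-end perms offset dev inode [name], as in the "Memory map" section above
MAP_LINE = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")

def parse_maps(text):
    """Turn memory-map lines into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        m = MAP_LINE.match(line.strip())
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            regions.append((start, end, m.group(3), m.group(4) or "(anonymous)"))
    return regions

def locate(addr, regions):
    """Return the name of the region containing addr, or None."""
    for start, end, _perms, name in regions:
        if start <= addr < end:
            return name
    return None

# Three lines copied from the post above (hypothetical subset for the example)
SAMPLE = """\
08048000-0a999000 r-xp 00000000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0c402000-17cf0000 rwxp 00000000 00:00 0 [heap]
f776d000-f776e000 r-xp 00000000 00:00 0 [vdso]
"""
REGIONS = parse_maps(SAMPLE)

for addr in (0x0A449B81, 0x0C4EE8A8, 0xF776D400):
    print(hex(addr), "->", locate(addr, REGIONS))
```

On these three lines, the first backtrace entry lands inside the minirosetta binary, the pointer named in the "double free or corruption" message lands in [heap], and 0xf776d400 lands in [vdso]. That only tells you which mapping each address belongs to, not which function was running.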

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,031
RAC: 728
Message 70733 - Posted: 15 Jul 2011, 23:45:48 UTC

This has probably been noticed already, as it failed on both crunchers, but in case it's a Mac-only problem (and thus somewhat less likely to be spotted)...

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_18042_1


ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Best,
Snags

[AF>france>pas-de-calais]symaski62
Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 70762 - Posted: 21 Jul 2011, 22:40:25 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=437733634


Exit status -1073741819 (0xc0000005)

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00B26465 read attempt to address 0x00000008

Engaging BOINC Windows Runtime Debugger...
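
A side note on the two numbers in the exit status line: they are the same 32-bit value, shown once as a signed integer and once in hex. 0xC0000005 is the Windows access-violation status code. A small sketch of the conversion (the function name `to_ntstatus` is just for illustration):

```python
def to_ntstatus(exit_status):
    """Reinterpret a signed 32-bit exit status as an unsigned hex code."""
    return f"0x{exit_status & 0xFFFFFFFF:08x}"

print(to_ntstatus(-1073741819))  # 0xc0000005
```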

Old man
Joined: 10 Nov 07
Posts: 25
Credit: 1,122,372
RAC: 1
Message 70791 - Posted: 26 Jul 2011, 14:29:48 UTC

Task ID 438352266
Name T610_bn_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_29826_110_0
Workunit 400124231
Created 24 Jul 2011 19:47:54 UTC
Sent 24 Jul 2011 19:55:52 UTC
Received 26 Jul 2011 14:33:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 1443602
Report deadline 3 Aug 2011 19:55:52 UTC
CPU time 31199.31

stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
[2011- 7-26 5:28:19:] :: BOINC:: Initializing ... ok.
[2011- 7-26 5:28:19:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/T610_bn_rs_stg0_lrlxMultiCst_t000__casp9.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x3FF00000 read attempt to address 0x3FF00000

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 07/26/11 15:06:02
Install Directory : C:Program FilesBOINC
Data Directory : C:Documents and SettingsAll UsersApplication DataBOINCalphatesti
Project Symstore :
Loaded Library : C:Program FilesBOINC\dbghelp.dll
Loaded Library : C:Program FilesBOINC\symsrv.dll
Loaded Library : C:Program FilesBOINC\srcsrv.dll
LoadLibraryA( C:Program FilesBOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:Documents and SettingsAll UsersApplication DataBOINCalphatestislots;C:Documents and SettingsAll UsersApplication DataBOINCalphatestiprojectsboinc.bakerlab.org_rosetta;srv*C:Documents and SettingsAll UsersApplication DataBOINCalphatestiprojectsboinc.bakerlab.org_rosettasymbols*http://msdl.microsoft.com/download/symbols;srv*C:Documents and SettingsAll UsersApplication DataBOINCalphatestiprojectsboinc.bakerlab.org_rosettasymbols*http://boinc.berkeley.edu/symstore


ModLoad: 00400000 00ffd000 C:Documents and SettingsAll UsersApplication DataBOINCalphatestiprojectsboinc.bakerlab.org_rosettaminirosetta_3.14_windows_intelx86.exe (-exported- Symbols Loaded)
Linked PDB Filename : D:boinc_buildminirosetta_beta_3.14miniideVisualStudioBoincReleaseminirosetta_beta_3.14_windows_intelx86.pdb

ModLoad: 7c900000 000b2000 C:WINDOWSsystem32ntdll.dll (5.1.2600.6055) (PDB Symbols Loaded)
Linked PDB Filename : ntdll.pdb
File Version : 5.1.2600.6055 (xpsp_sp3_gdr.101209-1647)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6055

ModLoad: 7c800000 000f6000 C:WINDOWSsystem32kernel32.dll (5.1.2600.5781) (PDB Symbols Loaded)
Linked PDB Filename : kernel32.pdb
File Version : 5.1.2600.5781 (xpsp_sp3_gdr.090321-1317)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5781

ModLoad: 7e410000 00091000 C:WINDOWSsystem32USER32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : user32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77f10000 00049000 C:WINDOWSsystem32GDI32.dll (5.1.2600.5698) (PDB Symbols Loaded)
Linked PDB Filename : gdi32.pdb
File Version : 5.1.2600.5698 (xpsp_sp3_gdr.081022-1932)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5698

ModLoad: 77dd0000 0009b000 C:WINDOWSsystem32ADVAPI32.dll (5.1.2600.5755) (PDB Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 5.1.2600.5755 (xpsp_sp3_gdr.090206-1234)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5755

ModLoad: 77e70000 00093000 C:WINDOWSsystem32RPCRT4.dll (5.1.2600.6022) (PDB Symbols Loaded)
Linked PDB Filename : rpcrt4.pdb
File Version : 5.1.2600.6022 (xpsp_sp3_gdr.100813-1643)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6022

ModLoad: 77fe0000 00011000 C:WINDOWSsystem32Secur32.dll (5.1.2600.5834) (PDB Symbols Loaded)
Linked PDB Filename : secur32.pdb
File Version : 5.1.2600.5834 (xpsp_sp3_gdr.090624-1305)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5834

ModLoad: 76390000 0001d000 C:WINDOWSsystem32IMM32.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : imm32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77690000 00021000 C:WINDOWSsystem32NTMARTA.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77c10000 00058000 C:WINDOWSsystem32msvcrt.dll (7.0.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.2600.5512

ModLoad: 774e0000 0013e000 C:WINDOWSsystem32ole32.dll (5.1.2600.6010) (PDB Symbols Loaded)
Linked PDB Filename : ole32.pdb
File Version : 5.1.2600.6010 (xpsp_sp3_gdr.100712-1633)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6010

ModLoad: 71bf0000 00013000 C:WINDOWSsystem32SAMLIB.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : samlib.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 76f60000 0002c000 C:WINDOWSsystem32WLDAP32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : wldap32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 12950000 00115000 C:Program FilesBOINCdbghelp.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 134e0000 00048000 C:Program FilesBOINCsymsrv.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : symsrv.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 13530000 0003b000 C:Program FilesBOINCsrcsrv.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : srcsrv.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 77c00000 00008000 C:WINDOWSsystem32version.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 20052, Write: 0, Other 200421

- I/O Transfers Counters -
Read: 0, Write: 346596, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 102500, QuotaPeakPagedPoolUsage: 102500
QuotaNonPagedPoolUsage: 3392, QuotaPeakNonPagedPoolUsage: 4520

- Virtual Memory Usage -
VirtualSize: 468652032, PeakVirtualSize: 495939584

- Pagefile Usage -
PagefileUsage: 243654656, PeakPagefileUsage: 274866176

- Working Set Size -
WorkingSetSize: 251064320, PeakWorkingSetSize: 282066944, PageFaultCount: 44854399

*** Dump of thread ID 2736 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 1461874944.000000, User Time: 310538764288.000000, Wait Time: 8795556.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x3FF00000 read attempt to address 0x3FF00000

- Registers -
eax=143daa00 ebx=0ad6a830 ecx=0a337ac0 edx=00000060 esi=00000060 edi=0ad800c0
eip=3ff00000 esp=01d8957c ebp=00000000
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246

- Callstack -
ChildEBP RetAddr Args to Child
01d89578 a881c0c9 01d8eb80 14057e00 1833e438 fffffffe !+0x0 SymFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '3ff00000'
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'a881c0c9'

*** Dump of thread ID 2788 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 468750.000000, User Time: 625000.000000, Wait Time: 8795557.000000

- Registers -
eax=03cffb44 ebx=00000000 ecx=a9595549 edx=e0000000 esi=00000000 edi=03cfff70
eip=7c90e514 esp=03cfff40 ebp=03cfff98
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
03cfff3c 7c90d21a 7c8023f1 00000000 03cfff70 00000000 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
03cfff40 7c8023f1 00000000 03cfff70 00000000 7c802446 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
03cfff98 7c802455 00000064 00000000 03cfffec 004088db kernel32!_SleepEx@8+0x0
03cfffa8 004088db 00000064 00000000 7c80b729 00000000 kernel32!_Sleep@4+0x0
03cfffec 00000000 004088d0 00000000 00000000 6a510000 minirosetta_3.14_windows_intelx!+0x0

*** Dump of thread ID 2280 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 8795497.000000

- Registers -
eax=0d56fe28 ebx=0a4adb01 ecx=0d56e734 edx=0000439e esi=00000000 edi=0d56fdf8
eip=7c90e514 esp=0d56fdc8 ebp=0d56fe20
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
0d56fdc4 7c90d21a 7c8023f1 00000000 0d56fdf8 000000d2 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0d56fdc8 7c8023f1 00000000 0d56fdf8 000000d2 00015180 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
0d56fe20 7c802455 000007d0 00000000 00003840 0060c4d2 kernel32!_SleepEx@8+0x0
0d56fe30 0060c4d2 000007d0 a40fab09 ffffffff 0a4adbf8 kernel32!_Sleep@4+0x0
0d56fe38 a40fab09 ffffffff 0a4adbf8 0d56ff6c 0a4adbf8 minirosetta_3.14_windows_intelx!cppdb::backend::driver::connect+0x0
0d56fe3c ffffffff 0a4adbf8 0d56ff6c 0a4adbf8 00000001 minirosetta_3.14_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'a40fab09'
0d56ff3c 7c917c51 7c917d08 7c800000 0d56ff7c 00000000 minirosetta_3.14_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'ffffffff'
0d56ffe0 7c80b72f 00000000 00000000 00000000 00414e52 ntdll!_LdrpGetProcedureAddress@20+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c917c51'
0d56ffe4 00000000 00000000 00000000 00414e52 0a4adbf8 kernel32!_BaseThreadStart@8+0x0 FPO: [0,0,0] SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c80b72f'


*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 177.651213902221
Granted credit 0
application version 3.14

Why did my task fail?
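
One observation on the dump above, offered as a guess and not a diagnosis: the registers show eip=3ff00000, so the crash happened while trying to execute at address 0x3FF00000, and the debugger could not resolve that address to any loaded module. The same bit pattern is the upper 32 bits of the floating-point value 1.0. That would be consistent with a return address or function pointer having been overwritten by numeric data, but it could also be a coincidence. The helper name below is made up for the example.

```python
import struct

def high_word_of_double(x):
    """Return the upper 32 bits of the IEEE-754 binary64 encoding of x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 32

print(hex(high_word_of_double(1.0)))  # 0x3ff00000
```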

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1860
Credit: 8,154,097
RAC: 8,001
Message 70860 - Posted: 2 Aug 2011, 19:00:01 UTC

439628254

Tag::read - parse error, printing backtrace.

Tag::read - parse error - file:istream line:5 column:1 - </SFXN5>
Tag::read - parse error - file:istream line:5 column:1 - ^

Tag::read - parse error - file:istream line:6 column:1 - </SCOREFXNS>
Tag::read - parse error - file:istream line:6 column:1 - ^

Tag::read - parse error - file:istream line:9 column:1 - </FILTERS>
Tag::read - parse error - file:istream line:9 column:1 - ^

Tag::read - parse error - file:istream line:13 column:1 - </TASKOPERATIONS>
Tag::read - parse error - file:istream line:13 column:1 - ^

Tag::read - parse error - file:istream line:15 column:1 - <FlxbbDesign name=flxbb ncycles=2 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 blueprint="master.blueprint" constraints_NtoC=1.0 />
Tag::read - parse error - file:istream line:15 column:1 - ^

Tag::read - parse error - file:istream line:14 column:1 - <MOVERS>
Tag::read - parse error - file:istream line:14 column:1 - ^

Tag::read - parse error - file:istream line:1 column:1 - <dock_design>
Tag::read - parse error - file:istream line:1 column:1 - ^


ERROR: false
ERROR:: Exit from: ......srcutilitytagTag.cc line: 387
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
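
Looking at the FlxbbDesign line in that output, two of its tokens do not look like a plain key=value attribute. The sketch below is a naive whitespace split written for this post (it is not how Tag.cc parses, and it would mishandle quoted values containing spaces); `suspicious_tokens` is a made-up name.

```python
def suspicious_tokens(tag_line):
    """Return attribute tokens that are not a single key=value pair."""
    body = tag_line.strip().lstrip("<").rstrip(">").rstrip("/").strip()
    tokens = body.split()[1:]  # drop the tag name itself
    return [t for t in tokens if t.count("=") != 1]

# The tag line copied from the parse error above
LINE = ('<FlxbbDesign name=flxbb ncycles=2 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 '
        'clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 '
        'blueprint="master.blueprint" constraints_NtoC=1.0 />')

print(suspicious_tokens(LINE))
# ['SFXN5', 'task_operations=limitchi2,layer_allclear_all_residues=0']
```

The bare SFXN5 has no key, and the task_operations token appears to have a second attribute run into it without a space. Either could plausibly trigger the parse error, though only the project can confirm that.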

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 70896 - Posted: 4 Aug 2011, 7:48:05 UTC
Last modified: 4 Aug 2011, 8:03:57 UTC

Hi.

These tasks are finishing after 16 minutes & 1 decoy; is this what you wanted/expected?

I've had two do it so far, & counting.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

ps / Good to see some work. :)

Edit // Seems like they are all getting validate errors now. :(
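
For anyone who wants to tally these across many tasks, the DONE block can be pulled out of the stderr text with a couple of regular expressions. This is a rough sketch that assumes the two lines always have the wording shown above; `summarize` and the dictionary keys are names invented for the example.

```python
import re

DONE_RE = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"This process generated\s+(\d+) decoys from\s+(\d+) attempts")

def summarize(stderr_text):
    """Extract the figures from the DONE block, or None if it is missing."""
    done = DONE_RE.search(stderr_text)
    decoys = DECOY_RE.search(stderr_text)
    if not (done and decoys):
        return None
    return {
        "starting_structures": int(done.group(1)),
        "cpu_seconds": float(done.group(2)),
        "decoys": int(decoys.group(1)),
        "attempts": int(decoys.group(2)),
    }

# Lines copied from the stderr excerpt above
SAMPLE = """\
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
"""
print(summarize(SAMPLE))
```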

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 70927 - Posted: 6 Aug 2011, 2:54:02 UTC
Last modified: 6 Aug 2011, 2:56:20 UTC

Hi.

Is anyone looking at these tasks? I've got another 4 of the same, with the same problem I reported earlier.

Am I wasting my time reporting this?

All are getting validate errors.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_8638_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70928 - Posted: 6 Aug 2011, 7:05:38 UTC

I've also had a bunch of them - so far I count 27 of them - same basic task name. Same validate error. Same claim that watchdog nailed them after 1201 seconds - although in most cases the task list shows they only ran somewhere between 750 and 950 seconds.

They have all been sent out to someone else for a "second try" - I've been good all week so maybe fate will smile on me and they will end up on Sid's bucket.

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,844,503
RAC: 1,768
Message 70934 - Posted: 6 Aug 2011, 21:14:30 UTC - in response to Message 70896.  
Last modified: 6 Aug 2011, 21:19:38 UTC

Hi.

These tasks are finishing after 16 minutes & 1 decoy; is this what you wanted/expected?

I've had two do it so far, & counting.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

ps / Good to see some work. :)

Edit // Seems like they are all getting validate errors now. :(


For some of the previous workunits, decoy 1 was basically a test with an already known result for checking if your computer did the calculations properly. Could the validator still be assuming that any workunits where only one decoy was completed would not give any results not already known?

For those with this type of problem, I'd suggest mentioning how long you have set BOINC to allow workunits to continue before switching to some other project's workunits, whether you have any other BOINC projects enabled so it will actually try this switching, and whether you have BOINC set to keep workunits in memory when they are suspended. Also, you could mention which version of BOINC you're using in case it's one of those that did not initialize certain variables used in calculating how long workunits should be allowed to run.


On another subject, minirosetta 3.14 workunits usually failed to complete properly on both of my computers (see earlier in this thread if you want details). Therefore, I've had Rosetta@Home set to No New Tasks for a few weeks now, while waiting for a new minirosetta version likely to have this problem fixed. When is a new version likely? I haven't seen one being tested on RALPH@Home yet.

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,031
RAC: 728
Message 70936 - Posted: 6 Aug 2011, 23:41:12 UTC
Last modified: 6 Aug 2011, 23:42:48 UTC

This is not the first batch of work that behaved this way. I recall at least one other instance in which the workunit was ended after a single model was produced and, if memory serves, recorded a discrepancy in cpu time used.

I see 5 "flxdsgn" workunits in my tasks list:

1. flxdsgn_Ploop_2x3_no_sheet_constraint_11_29940_568_0
1 model completed, insignificant discrepancy in cpu time recorded, valid (3781.76, 3781.84)

2. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_04_29965_5656_0
1 model completed, insignificant discrepancy in cpu time recorded, valid (1340.49, 1341.349)

3. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_11_29965_5023_1
1 model completed, odd discrepancy, invalid (1201, 1172.478), 2nd copy invalid (1201, 825.2765)

4. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_02_29965_7348_0
1 model, odd discrepancy, this copy invalid (1201 / 1120.343), 2nd valid (2119.97 / 2920.167)

5. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_03_29965_7238_1
1 model, insignificant discrepancy, this copy valid (1249.71 / 1250.505), 1st copy invalid (1201 / 739.8815)

Notes on the cpu time:

The top section of the task details page is essentially identical for every project and includes a record of the cpu time used. The information in the last section of the page, the stderr output, is unique to each project and can even vary with each type of workunit within a project. It appears that each project decides what it wishes to record here and writes the code to include in the application. For rosetta workunits the cpu time used is recorded within the stderr out in addition to the standard location. The number in the stderr out is recorded alongside the number of models completed, and must be recorded before the standard method common to all projects (let's call it the BOINC method) records the figure shown further up the details page. Thus the number in the stderr output is always fractions of a second smaller than the BOINC number.

In the 3rd, 4th and 5th examples of flxdsgn workunits shown above, for all 4 invalid copies of the workunits, the cpu time used is recorded as 1201 in the stderr out. The BOINC recording is always considerably less, from 28.522 seconds less to 461.1185 seconds less.
Is it this, the BOINC time being less than the rosetta time, which causes the validator to mark the workunit invalid?
Or is it a discrepancy over a certain amount which triggers an invalidation?
Where does the 1201 come from?
Is it a default number for when the application has lost track of the time?
Is it an intentional signal designed to alert the project to a particular event within the workunit? (If x happens record the cpu time as 1201). Is the consequent invalidation with additional workunit creation needed or not?
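
To make the arithmetic behind those figures explicit, here is a small sketch that recomputes the gaps from the eight (stderr time, BOINC time) pairs in the list above. The variable and function names are made up for the example.

```python
# (stderr cpu seconds, BOINC cpu seconds), copied from the five workunits listed above
INVALID = [(1201, 1172.478), (1201, 825.2765), (1201, 1120.343), (1201, 739.8815)]
VALID = [(3781.76, 3781.84), (1340.49, 1341.349), (2119.97, 2920.167), (1249.71, 1250.505)]

def gaps(pairs):
    """stderr time minus BOINC time for each pair, rounded to 4 decimals."""
    return [round(stderr - boinc, 4) for stderr, boinc in pairs]

print("invalid:", gaps(INVALID))
print("valid:  ", gaps(VALID))
```

In this small sample every invalid copy has a positive gap (stderr time above the BOINC time, by between 28.522 and 461.1185 seconds) and every valid copy has a negative one. Eight data points cannot say which of the questions above is the right one, only that the sign of the gap lines up with the validation result here.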

I could go on but I'll spare you.

It may well be that from the project's perspective this batch of workunits has proceeded as expected, behaved as designed, efficiently and without waste of resources. But from our perspective things look off. We would be a lot more helpful in notifying the project of problems (as we've been asked and are clearly willing to do) if the project kept us better informed about the behaviour we should expect to see.


Best,
Snags

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,844,503
RAC: 1,768
Message 70937 - Posted: 7 Aug 2011, 1:38:54 UTC - in response to Message 70936.  
Last modified: 7 Aug 2011, 1:57:27 UTC

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.


An idea to check: If a wingman is validated, check how many decoys the wingman produced.

Edited: A failed idea. All the workunits of that type that I was able to look at the log file for produced only one decoy, regardless of whether they validated or not.

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,031
RAC: 728
Message 70938 - Posted: 7 Aug 2011, 2:11:54 UTC - in response to Message 70937.  
Last modified: 7 Aug 2011, 2:14:56 UTC

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.


An idea to check: If a wingman is validated, check how many decoys the wingman produced.


In every instance of the flxdsgn workunits I listed above only one model was produced. The single model appears to be a feature of this type of workunit and unrelated to the 1201 issue or the validation errors.

Perhaps model doesn't mean the same thing for this type of workunit as it does for most (all?) other rosetta workunits. What if the application is ended not by time limits or model number limits but by something else, the occurrence of some other event? The model number would then be created for credit granting purposes only.


Best,
Snags

edited to say: Hey, Robert, just saw your edit : )

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,844,503
RAC: 1,768
Message 70939 - Posted: 7 Aug 2011, 2:55:25 UTC
Last modified: 7 Aug 2011, 3:18:11 UTC

I've now looked at output files for this type of workunit for several users that made their outputs public enough that I could.

Each output file seems to show two different CPU times - one for the entire workunit, and a different one for decoy 1 alone. The validator appears to have accepted all cases where both were greater than 1201 seconds, and rejected all cases where the time reported for decoy 1 was 1201 seconds and the total CPU time for the entire workunit was less than that. I did not find any other cases for flxdsgn_Ploop workunits, even after looking at dozens of log files.

Therefore, this looks likely to be a case where the time reported on the Done :: line is often 1201 seconds, but never less, even if this value is incorrect. It appears that the developers need to add debugging for the calculations of this value for any future workunits of this type. Also, they should check whether the value on this line should have been ignored by the validator if it happens to be 1201, and if so, update the validator so it will - and also run the updated validator on all the validation failed workunits of this type to see if more credit should be awarded to those computers.
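
The pattern described here can be written down as a one-line rule and checked against the figures Snags posted earlier in the thread. This is only a restatement of the observation, not the validator's actual logic (which none of us can see); `predicted_valid` is a made-up name.

```python
def predicted_valid(done_line_seconds, total_cpu_seconds):
    """Observed pattern: rejected when the Done :: line says 1201 but the total is lower."""
    return not (done_line_seconds == 1201 and total_cpu_seconds < 1201)

# ((Done :: line seconds, total cpu seconds), was it validated?) from Snags's list
OBSERVED = [
    ((3781.76, 3781.84), True),
    ((1340.49, 1341.349), True),
    ((1201, 1172.478), False),
    ((1201, 825.2765), False),
    ((1201, 1120.343), False),
    ((2119.97, 2920.167), True),
    ((1249.71, 1250.505), True),
    ((1201, 739.8815), False),
]

mismatches = [(pair, ok) for pair, ok in OBSERVED if predicted_valid(*pair) != ok]
print(mismatches)  # []
```

An empty list means the rule agrees with all eight of those results. It says nothing about cases nobody has reported, such as a Done :: time of 1201 with a larger total.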


I saw no significant differences in this for which CPU or which OS was used, or even which BOINC version was used.

I also noted that the values of WS_max shown a few lines below the Done :: line look very random, but saw no particular pattern of whether this affects validation.

Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70940 - Posted: 7 Aug 2011, 4:04:25 UTC

BINGO Robert - good eye.

I just screened about 100 of these tasks on two of my hosts (129350 & 1300412) and in each and every case your hypothesis was correct. Not only in the case of my hosts, but also for those of my "wingmen" whose systems are fast enough to complete the decoy in less than 1201 seconds.

I did not spot a single case where a system got a clean validation when the decoy was produced in less than 1201 seconds.

So I guess the bad news is that I have a few hundred of these tasks on the four hosts I currently have dedicated to Rosetta. The good news is they blow through pretty quick.

Now I guess it's time to see if I can spot these in the queue and abort them before they start with some sort of cron script.
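
A rough sketch of what such a script might do, as a dry run that only builds the commands and does not execute anything. It assumes that `boinccmd --get_tasks` prints "name:" and "project URL:" lines for each task and that `boinccmd --task URL NAME abort` is available in the installed client; check both against your own BOINC version before relying on it. The sample text, the project URL and the function names are invented for the example, and the sketch does not check whether a task has already started.

```python
def tasks_to_abort(get_tasks_output, prefix="flxdsgn_Ploop"):
    """Return (project_url, task_name) for tasks whose name starts with prefix."""
    name = None
    found = []
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("project URL:") and name is not None:
            url = line.split(":", 1)[1].strip()
            if name.startswith(prefix):
                found.append((url, name))
            name = None
    return found

def abort_commands(pairs):
    """Build (but do not run) one boinccmd invocation per task."""
    return [["boinccmd", "--task", url, name, "abort"] for url, name in pairs]

# Made-up sample in the assumed output format; task names are taken from this thread
SAMPLE_OUTPUT = """\
1) -----------
   name: flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0
   WU name: flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412
   project URL: https://boinc.bakerlab.org/rosetta/
2) -----------
   name: T610_bn_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_29826_110_0
   WU name: T610_bn_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_29826_110
   project URL: https://boinc.bakerlab.org/rosetta/
"""

for cmd in abort_commands(tasks_to_abort(SAMPLE_OUTPUT)):
    print(" ".join(cmd))
```

Printing the commands first and reading them over before wiring anything into cron seems the safer order, since an over-broad prefix would throw away good work.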

Thanks

Ed
Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70941 - Posted: 7 Aug 2011, 4:23:39 UTC
Last modified: 7 Aug 2011, 4:24:58 UTC

Being new here I don't fully understand this conversation so I hope you don't mind my asking questions.

Is one of you guys developing the work units, or are you all crunchers?

Assuming you are all crunchers, why would you abort work units like that, assuming they are not stuck? If the guys developing work units want to see how they run, aren't you working against them?

Sorry if these are dumb questions.

We seem to be out of Rosetta work so my system is just happy crunching away on Seti. I have two Astropulse WU running.
ID: 70941 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70942 - Posted: 7 Aug 2011, 4:42:54 UTC

@Ed - in my case I am not a developer for BOINC or Rosetta. My reason for being here is twofold. First, I believe the work being done by the Rosetta project is important. Second, BOINC / Rosetta provides a nice solid testbed for the optimized Linux kernels I build in another life. The amount of credit granted by Rosetta is consistent enough that after running for a week to ten days I am able to evaluate the value of the optimizations I am testing.

Aborting a task is not a normal thing, and I have never attempted to automate an abort in the past. However, it is becoming clear that if your machine is fast enough to complete a flxdsgn_Ploop task in less than 1201 seconds, you are not going to get a clean validation, and the completed task will be sent to someone else for a second attempt.

It appears that there is a bug in these routines and if you have a fairly fast machine, you may be predestined to have the task complete with a validation error.

By the way welcome to the project - hope you enjoy associating with some weird (or diverse) folks.
ID: 70942 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70943 - Posted: 7 Aug 2011, 4:53:01 UTC
Last modified: 7 Aug 2011, 4:53:39 UTC

Ed, one more thing about your comment on "working against the developers": I understand exactly what you mean, but in this case I think the developers probably have a thousand or more of these validation failures to evaluate.

Further, one of the real downsides to the Rosetta project is that there is almost no communication between the sysadmins and developers and those doing the crunching.

If the developer/scientist/student responsible for these tasks were to come out and state that they needed to look at a few more failure cases I would be pleased as punch to provide them.

However, the way it is around here, there is a fairly good chance that the developer is not even aware of the issue, or that he is aware, already has a fix in hand, and is just letting the "broke" tasks flow through the system.

It is not likely you will ever see a post here from the project explaining what happened. Sad but true.
ID: 70943 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,844,503
RAC: 1,768
Message 70944 - Posted: 7 Aug 2011, 5:14:46 UTC - in response to Message 70941.  
Last modified: 7 Aug 2011, 5:29:19 UTC

Being new here I don't fully understand this conversation so I hope you don't mind my asking questions.

Is one of you guys developing the work units, or are you all crunchers?

Assuming you are all crunchers, why would you abort work units like that, assuming they are not stuck? If the guys developing work units want to see how they run, aren't you working against them?

Sorry if these are dumb questions.

We seem to be out of Rosetta work so my system is just happy crunching away on Seti. I have two Astropulse WU running.


I'm no developer either, only a cruncher with experience on most of the BOINC projects related to medical research. I do have years of experience fixing some types of software bugs, but never for BOINC projects. The best I can do seems to be pointing the developers toward where to look for the bugs.

I've seen enough problems with minirosetta 3.14 on my computers that I've set it to No New Tasks for now, and plan to leave it that way until some later version is available. I've already sent rather long posts on what I saw in my failed 3.14 workunits, and see no reason to try any more until the developers offer some response to those posts. RALPH@Home is still enabled on my desktop, though, since that's where they'll probably try the initial testing of any fixes.

If you look at the left column of this thread, you should see which users are labelled developers. Scroll to the top of the thread to see an example. Note that none of the developers have posted anything to this thread since this latest problem was reported, so let's hope that they're all busy trying to fix it, rather than, for example, all on vacation. Running out of workunits is a sign that SOMEONE at Rosetta@Home has recognized the problem, even if that someone was not a developer and cannot do much to fix the problem.
ID: 70944 · Rating: 0
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,811,031
RAC: 728
Message 70950 - Posted: 7 Aug 2011, 14:39:39 UTC

The project runs a script, nightly I think, to grant credit for workunits that have failed validation. The granted credit doesn't show up on your task list or workunit pages; you have to scroll to the bottom of the task details page to see it. All the invalid flxdsgn tasks I've looked at have received credit (after a day or so).

Validate errors are not client errors and don't necessarily mean the workunits have failed and the results are useless. I see no reason to abort them. I would, though, very much like the project to chime in here and tell us what invalidation means in this particular case.

Best,
Snags, just another volunteer cruncher
ID: 70950 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org