Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1729
Credit: 18,490,957
RAC: 20,862
Message 96433 - Posted: 13 May 2020, 7:03:02 UTC - in response to Message 96431.  
Last modified: 13 May 2020, 7:39:51 UTC

Over the last couple of days I've noticed a number of failed tasks. They all start with 3cl in the name; here is an example:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1057932888
The tasks run for about 8-15 seconds, then return a compute error. The system is a Ryzen 5 3600 with 16 GB of RAM.
Hope this helps with troubleshooting!

Try limiting the number of cores you use to process work (or add more system RAM) - it's a memory issue.
I used to get the same errors when I was running all 6c/12t with only 16 GB of RAM. Since I upgraded my system to 32 GB of RAM, I've had no such errors.

You generally need to allow for 1.3 GB of RAM per running Task. Many Tasks use a lot less; quite a few Tasks use a hell of a lot more. If a Task requires more RAM, it should gracefully suspend until there's enough RAM to continue, but that isn't always the case.
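
As a rough illustration of the rule of thumb above, the sketch below estimates how many tasks fit in a given amount of RAM. The 1.3 GB per-task figure is the estimate from this post; the 2 GB set aside for the operating system and the function name are assumptions made for the example, not project guidance.

```python
def max_concurrent_tasks(total_ram_gb, threads, per_task_gb=1.3, reserved_gb=2.0):
    """Estimate how many tasks can run at once without exhausting RAM.

    per_task_gb is the rule-of-thumb figure from the post; reserved_gb is an
    assumed allowance for the operating system and other programs.
    """
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    by_memory = int(usable_gb // per_task_gb)
    # Never suggest more tasks than there are hardware threads.
    return min(threads, by_memory)


# A 6-core/12-thread machine with 16 GB versus 32 GB of RAM.
print(max_concurrent_tasks(16, 12))
print(max_concurrent_tasks(32, 12))
```

Under these assumptions, 16 GB is a little short of what all 12 threads would need, while 32 GB leaves headroom. Tasks that use far more than the average will still break the estimate.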



Edit - having said that, I just had one of those WUs do the same thing on my system, even though it was processed OK on another system and I've processed several others of the same type with no problems.

3cl_7aa_6lu7_modified_AVLstub_relaxed_renumbered_0074_110_extract_B_SAVE_ALL_OUT_927956_74_0

[pre]<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @3cl_7aa_6lu7_modified_AVLstub_relaxed_renumbered_0074_110_extract_B.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2443625
Using database: database_357d5d93529_n_methylminirosetta_database


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF63B7D1D48

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.9.0


Dump Timestamp : 05/12/20 22:11:33
Install Directory : C:\Program Files\BOINC
Data Directory : C:\ProgramData\BOINC
Project Symstore : https://boinc.bakerlab.org/rosetta/symstore
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
Loaded Library : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\ProgramData\BOINC\slots\5;C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*https://boinc.bakerlab.org/rosetta/symstore


ModLoad: 0000000037c20000 00000000057ef000 C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\rosetta_4.20_windows_x86_64.exe (-exported- Symbols Loaded)
Linked PDB Filename : C:cygwin64homeboinc4.17RosettamainsourceideVisualStudiox64BoincReleaserosetta_4.20_windows_x86_64.pdb

ModLoad: 00000000f3140000 00000000001f0000 C:\WINDOWS\SYSTEM32\ntdll.dll (6.2.18362.719) (-exported- Symbols Loaded)
Linked PDB Filename : ntdll.pdb
File Version : 10.0.18362.329 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.329

ModLoad: 00000000f1500000 00000000000b2000 C:\WINDOWS\System32\KERNEL32.DLL (6.2.18362.329) (-exported- Symbols Loaded)
Linked PDB Filename : kernel32.pdb
File Version : 10.0.18362.329 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.329

ModLoad: 00000000f06e0000 00000000002a3000 C:\WINDOWS\System32\KERNELBASE.dll (6.2.18362.719) (-exported- Symbols Loaded)
Linked PDB Filename : kernelbase.pdb
File Version : 10.0.18362.329 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.329

ModLoad: 00000000f18a0000 000000000006f000 C:\WINDOWS\System32\WS2_32.dll (6.2.18362.387) (-exported- Symbols Loaded)
Linked PDB Filename : ws2_32.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f1710000 0000000000120000 C:\WINDOWS\System32\RPCRT4.dll (6.2.18362.628) (-exported- Symbols Loaded)
Linked PDB Filename : rpcrt4.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f1b20000 0000000000194000 C:\WINDOWS\System32\USER32.dll (6.2.18362.719) (-exported- Symbols Loaded)
Linked PDB Filename : user32.pdb
File Version : 10.0.17134.343 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.17134.343

ModLoad: 00000000f00d0000 0000000000021000 C:\WINDOWS\System32\win32u.dll (6.2.18362.719) (-exported- Symbols Loaded)
Linked PDB Filename : win32u.pdb
File Version : 10.0.18362.719 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.719

ModLoad: 00000000f28e0000 0000000000026000 C:\WINDOWS\System32\GDI32.dll (6.2.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : gdi32.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f0100000 0000000000194000 C:\WINDOWS\System32\gdi32full.dll (6.2.18362.719) (-exported- Symbols Loaded)
Linked PDB Filename : gdi32full.pdb
File Version : 10.0.18362.719 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.719

ModLoad: 00000000f04f0000 000000000009e000 C:\WINDOWS\System32\msvcp_win.dll (6.2.18362.387) (-exported- Symbols Loaded)
Linked PDB Filename : msvcp_win.pdb
File Version : 10.0.18362.387 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.387

ModLoad: 00000000f05e0000 00000000000fa000 C:\WINDOWS\System32\ucrtbase.dll (6.2.18362.387) (-exported- Symbols Loaded)
Linked PDB Filename : ucrtbase.pdb
File Version : 10.0.18362.387 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.387

ModLoad: 00000000f15c0000 00000000000a3000 C:\WINDOWS\System32\ADVAPI32.dll (6.2.18362.329) (-exported- Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f1250000 000000000009e000 C:\WINDOWS\System32\msvcrt.dll (7.0.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.18362.1

ModLoad: 00000000f1670000 0000000000097000 C:\WINDOWS\System32\sechost.dll (6.2.18362.693) (-exported- Symbols Loaded)
Linked PDB Filename : sechost.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f2440000 000000000002e000 C:\WINDOWS\System32\IMM32.DLL (6.2.18362.387) (-exported- Symbols Loaded)
Linked PDB Filename : imm32.pdb
File Version : 10.0.18362.387 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.387

ModLoad: 00000000f00b0000 0000000000011000 C:\WINDOWS\System32\kernel.appcore.dll (6.2.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : Kernel.Appcore.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000ef060000 0000000000031000 C:\WINDOWS\SYSTEM32\ntmarta.dll (6.2.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000e1640000 00000000001f4000 C:\WINDOWS\SYSTEM32\dbghelp.dll (6.2.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1

ModLoad: 00000000f1110000 0000000000080000 C:\WINDOWS\System32\bcryptPrimitives.dll (6.2.18362.295) (-exported- Symbols Loaded)
Linked PDB Filename : bcryptprimitives.pdb
File Version : 10.0.18362.295 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.295

ModLoad: 00000000eafb0000 000000000000a000 C:\WINDOWS\SYSTEM32\version.dll (6.2.18362.1) (-exported- Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 10.0.18362.1 (WinBuild.160101.0800)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 10.0.18362.1



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 5000, Write: 662, Other 13721

- I/O Transfers Counters -
Read: 14491204, Write: 9881, Other 6728

- Paged Pool Usage -
QuotaPagedPoolUsage: 317448, QuotaPeakPagedPoolUsage: 317752
QuotaNonPagedPoolUsage: 6792, QuotaPeakNonPagedPoolUsage: 7352

- Virtual Memory Usage -
VirtualSize: 83140608, PeakVirtualSize: 895655936

- Pagefile Usage -
PagefileUsage: 83140608, PeakPagefileUsage: 83140608

- Working Set Size -
WorkingSetSize: 103694336, PeakWorkingSetSize: 103698432, PageFaultCount: 25722

*** Dump of thread ID 9108 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF63B7D1D48

- Registers -
rax=000000000000003a rbx=0000000060d95750 rcx=00000000617a2ac0 rdx=0000000061882bf8 rsi=000000000000000b rdi=00000000617a2ac0
r8=000000000000003a r9=0000000000000421 r10=000000003b7c6e80 r11=00000000b6545480 r12=0000000037c20000 r13=00000000b655fba0
r14=00000000b6545bc0 r15=000000000048b215 rip=000000003b7d1d48 rsp=00000000b65454f8 rbp=0000000000000000
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010202

- Callstack -
ChildEBP RetAddr Args to Child
b65454f0 380f831c 00000000 3b7c6d60 3b7c6e80 3b7abe78 rosetta_4.20_windows_x86_64!xmlValidateNotationDecl+0x0
b6545520 380b935d 60d95750 b65455c0 b6545d40 380a355d rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0
b6545550 3b227f10 3c110150 b655fba0 00000000 380a3265 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0
b6545580 380a39e8 b6546230 e0000000 b6545b88 b6545c10 rosetta_4.20_windows_x86_64!xmlValidateNotationDecl+0x0
b65455f0 f31e121f 00000000 b6545b70 b6546230 b6546230 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0
b6545620 f31aa289 00000001 37c20000 00000000 3d1ba32c ntdll!__chkstk+0x0
b6545d30 f31dfe8e 00000030 b6545e09 3b89a450 f317ba17 ntdll!RtlRaiseException+0x0
b65464c0 38363e2b fffffffe 65103d58 ffffffff 383718c5 ntdll!KiUserExceptionDispatcher+0x0
b6546510 38373690 3b89a3a0 65103ab0 3b89a3a0 b6546609 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0
b6546640 38489ee8 64a9d698 65479d00 65103ab0 65479d00 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0
b65471f0 38424b6c 6556ac10 f317ba17 60cd0000 00000000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b65473f0 3842488e b65474d8 00000000 b65476c0 00000000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6547550 38383da1 b65476c8 00000000 60d95410 b6547790 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6547910 38389f08 b6547c60 b6547c60 b6547c60 00000000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6547f60 383884db 613cac90 b6547fc0 613bb580 613bb580 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b65480c0 382f1fb7 00000000 b65481d0 613bb580 b65483d0 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6548230 382f57a6 00000005 38095190 612dfb10 612dfb10 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0
b65482a0 382f56cc b65485a8 b6548419 b65485a8 613bb580 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0
b6548350 383bb6f5 b65485a8 b6548841 00000000 380b75e8 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0
b6548470 383ba592 00000005 b65485a8 b6548780 00000000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6548540 383bad06 00000000 00000000 b6548e60 60cd0000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b65486e0 388171a3 b6548780 b6548e60 ffffff01 380a3e73 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b65489d0 38819d09 00000000 00000001 b6548ae0 b6548e60 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6548d60 38812f8a b6548da0 b6548e60 6459df80 613720c0 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6548dc0 38a2cc70 b6548e60 b6549588 612dfb10 00000000 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6549550 38a2c6e4 654196d0 654755c0 3d095cc0 380975a6 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b65495b0 38a3603e b65496a0 65419400 b65496c0 b6549e10 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6549d30 38a356d4 911ecceb 911ecbfb 3d007f70 38a56cb4 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6549dc0 38a3578e 00000005 b654a368 613720c0 00000001 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b6549f60 380a081d 61673820 61673820 613720c0 60d96d01 rosetta_4.20_windows_x86_64!cppdb::backend::statement::cache+0x0
b655fb90 380ab215 00000000 00000000 3cfcccf8 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0
b655fbd0 f1517bd4 00000000 00000000 00000000 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0
b655fc00 f31aced1 00000000 00000000 00000000 00000000 KERNEL32!BaseThreadInitThunk+0x0
b655fc80 00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0

*** Dump of thread ID 32764 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 6.000000, User Time: 0.000000, Wait Time: 2744734720.000000

- Registers -
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000

- Callstack -
ChildEBP RetAddr Args to Child
(-nosymbols- PC == 0)
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0

*** Dump of thread ID 30812250 (state: Unknown): ***

- Information -
Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 17179869184.000000, User Time: 21474836480.000000, Wait Time: 0.000000

- Registers -
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000

- Callstack -
ChildEBP RetAddr Args to Child
(-nosymbols- PC == 0)
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0


*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>[/pre]
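
For anyone puzzled by the two forms of the exit code in the dump above, here is a small sketch of how they relate: the negative decimal value is the same 32-bit number as the hex one, read as signed. The lookup table is illustrative and deliberately incomplete, and the helper name is made up for the example. Note that an access violation points to an invalid memory access, which is not by itself proof that the machine ran out of RAM.

```python
# A few well-known NTSTATUS values; this table is illustrative, not complete.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000017: "STATUS_NO_MEMORY",
}


def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned NTSTATUS value."""
    return exit_code & 0xFFFFFFFF


status = to_ntstatus(-1073741819)
print(hex(status), KNOWN_NTSTATUS.get(status, "unknown"))
```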
Grant
Darwin NT
ID: 96433 · Rating: 0
Sven
Joined: 7 Feb 16
Posts: 8
Credit: 222,005
RAC: 0
Message 96434 - Posted: 13 May 2020, 7:15:51 UTC - in response to Message 96416.  
Last modified: 13 May 2020, 7:16:58 UTC

Hi all,

I just found an old outsourced computer which would work pretty well just for crunching Rosetta tasks.
Unfortunately the computation stops after a few seconds with "computation error".

Does anyone have an idea where the problem could be? See the event log below:

[snip]

You're looking at the wrong log file to see much about THIS problem.

To see the one with more information for this problem:

If you're using the simple view of the BOINC Manager, find View near the top line, click on it, then click on Advanced View.

Click on Projects in one of the top lines, then Rosetta@home, then Your tasks.

In the Status column, find one of the failed tasks. Ignore those shown as In progress - they don't have the other log file yet. Ignore those shown as Completed and validated for now unless you want to see one without errors for comparison purposes.

When you find one, click on the number in this line, but in the Task column. This gives the log file specific to that task.

Scroll down as needed. In this case, look at the line starting with unzip, which shows the error that triggered all the other errors for this task.

I've looked at this file for some of your failed tasks, and have thought of three possibilities:

1. The task was built improperly, and the list of files it needs left out one or two of the zip files. It may have assumed, incorrectly, that some previous task had downloaded it or them.

2. Your antivirus program hid one or both of those files. You might check the log file of your antivirus program, if it has one.

3. These tasks used version v4.21 of the Rosetta application, which I have not seen mentioned before. That version might have the wrong built-in names for the files it sends to unzip.

For 1, about all you can do is wait for more tasks that fix this problem.

For 2, you might have to tell your antivirus program not to scan the directories for BOINC.

For 3, you might watch the forums for mentions of that version, to see if similar problems are reported by others.

On another subject, I also looked at the specs for that computer, which runs Windows XP. You might look for threads on whether the current versions of Rosetta are compatible with Windows XP.

The current tasks take up to 2 GB of memory each and sometimes more, so you may have problems with the tasks running out of memory once you can get them to run longer, if you allow one task each for the 4 virtual CPU cores on that computer. You may have to limit it to only one task at a time. That computer has only 4 GB of main memory, and some of it is reserved for the Windows operating system.



Here we are, one of the failed tasks with its specific log file:
I don't know where to get the minirosetta database from if it isn't downloaded by BOINC itself.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (Incorrect function.)
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.21_windows_intelx86.exe -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_groove_design_boinc_v1_mod.xml @flags_covid_groove2 -in:file:silent Mini_Protein_binds_COVID-19_groove_design1_8_SAVE_ALL_OUT_IGNORE_THE_REST_4gv6ne9i.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Mini_Protein_binds_COVID-19_groove_design1_8_SAVE_ALL_OUT_IGNORE_THE_REST_4gv6ne9i.zip @Mini_Protein_binds_COVID-19_groove_design1_8_SAVE_ALL_OUT_IGNORE_THE_REST_4gv6ne9i.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3074316
Extracting in slot directory: minirosetta_database.zip
unzip: cannot find either ../../projects/boinc.bakerlab.org_rosetta/database_357d5d93529_n_methyl.zip or ../../projects/boinc.bakerlab.org_rosetta/database_357d5d93529_n_methyl.zip.zip.
Using database: minirosetta_database
Cannot find database: minirosetta_database

</stderr_txt>
]]>
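
Since the advice quoted above is to look for the line that triggers all the later errors, here is a minimal sketch of scanning a task's stderr text for the first such line. The patterns are hypothetical, picked only to match the two logs quoted in this thread, and the sample text is an abbreviated copy of the log above with the directory paths shortened.

```python
import re

# Hypothetical patterns chosen to match the two logs quoted in this thread.
ERROR_PATTERNS = [
    re.compile(r"cannot find", re.IGNORECASE),
    re.compile(r"Unhandled Exception", re.IGNORECASE),
    re.compile(r"SIGSEGV|signal 11", re.IGNORECASE),
]


def first_error_line(stderr_text):
    """Return the first line matching a known error pattern, or None."""
    for line in stderr_text.splitlines():
        if any(p.search(line) for p in ERROR_PATTERNS):
            return line.strip()
    return None


# Abbreviated sample: directory paths from the real log are shortened here.
sample = """Extracting in slot directory: minirosetta_database.zip
unzip: cannot find either database_357d5d93529_n_methyl.zip or database_357d5d93529_n_methyl.zip.zip.
Using database: minirosetta_database
Cannot find database: minirosetta_database
"""
print(first_error_line(sample))
```

The later "Cannot find database" line is only a consequence; the unzip line is the one worth reporting.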
ID: 96434 · Rating: 0
Sven
Joined: 7 Feb 16
Posts: 8
Credit: 222,005
RAC: 0
Message 96435 - Posted: 13 May 2020, 7:31:46 UTC - in response to Message 96434.  

I guess it would be worth trying to install a Linux distribution alongside Windows on this machine.

Adjusting the antivirus settings didn't help at all, and running on Windows XP seems to be generally problematic.

The lack of memory would still be unsolved.
ID: 96435 · Rating: 0
MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 96449 - Posted: 13 May 2020, 16:34:33 UTC

I also have lots of failed WUs.

53 just yesterday and today! Lots of them with wasted CPU time.
What the hell is wrong again?

All of them have the error code "139 (0x0000008B) Unknown error code", and all of them failed on my system with a Ryzen 5 3600X and 32 GB of RAM @ 3200 MHz (standard).
The other system (R5 3600, without the X, with 32 GB @ 3000 MHz (standard)) is running fine.
Neither system has an overclocked CPU, and both have XMP turned on.
No Game Boost or anything else; fans at 100%.
Temperatures are fine. Both are water-cooled in one loop with two 280 mm radiators.

some examples:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178373908
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178374184
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178329397
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178275970
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178221426
https://boinc.bakerlab.org/rosetta/result.php?resultid=1178221294

Does anybody have any ideas?
It's frustrating to see the time wasted with no results and no points.
ID: 96449 · Rating: 0
funkydude
Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 96453 - Posted: 13 May 2020, 17:27:57 UTC - in response to Message 96449.  
Last modified: 13 May 2020, 17:28:12 UTC

You're not the only person experiencing signal 11. Please post in the other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13941
ID: 96453 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96455 - Posted: 13 May 2020, 18:29:47 UTC - in response to Message 96449.  

Each of your examples ran under Linux. If you scroll down to the stderr output of each, you will see that each of them gave signal 11.

Under Linux, signal 11 is SIGSEGV, a segmentation fault - in other words, the program tried to access memory that it was not allowed to access.

Error code 139 came from a higher level; it matches the common convention of reporting 128 plus the signal number (128 + 11 = 139) for a process killed by a signal.
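
To make the link between the two numbers concrete, here is a short sketch that decodes an exit status using the usual "128 + signal number" convention. It assumes that convention is what produced the 139 in these results; the function name is made up for the example.

```python
import signal


def describe_exit_code(code):
    """Interpret a POSIX-style exit status.

    Shells and many launchers report a process killed by signal N as 128 + N.
    """
    if code > 128:
        signum = code - 128
        try:
            return f"killed by signal {signum} ({signal.Signals(signum).name})"
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            return f"killed by signal {signum} (unknown signal)"
    return f"exited with status {code}"


print(describe_exit_code(139))
```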

This means that there's an error somewhere in the Linux version of Rosetta 4.20. You can't fix that. You can only wait for tasks that either use a corrected version of the program, or have the input files adjusted so that they don't try to use the part of the program that triggers the error.
ID: 96455 · Rating: 0
MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 96459 - Posted: 13 May 2020, 20:27:01 UTC - in response to Message 96455.  

Thank you.

Any ideas for solving this problem? Maybe reinstall Ubuntu or something? Yesterday I did the update to 20.04; I think that's the problem.
But both systems are identical (software). Both have the same version and I did the same update on both.
The first system is running into this problem; the second is working fine.
ID: 96459 · Rating: 0
Rob R
Joined: 20 May 14
Posts: 2
Credit: 2,785,677
RAC: 970
Message 96466 - Posted: 13 May 2020, 22:50:08 UTC - in response to Message 96433.  

Yeah, my settings were set to a max of 50% of RAM. I upped it to 75%, but I don't think that's the problem. At the time it failed, there was only a total of 6.5 GB of RAM in use, and the 50% limit would have allowed 8 GB. The absolute limit is set to 10 GB. Several of the 3cl* tasks have completed just fine; it seems to be just a select few that have some kind of problem.
ID: 96466 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96469 - Posted: 13 May 2020, 23:48:00 UTC - in response to Message 96459.  

[snip]

I doubt if you can fix it by doing anything on that computer. You probably have to wait for the project people to do it for you.
ID: 96469 · Rating: 0
yoerik
Joined: 24 Mar 20
Posts: 128
Credit: 169,525
RAC: 0
Message 96470 - Posted: 13 May 2020, 23:54:23 UTC

This WU:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1177441830 has only progressed 2 minutes in the last 12 hours. My phone is running other WUs just fine - having reported 4 valid WUs today for Rosetta, and more for smaller projects like WCG. It just isn't running.

I'll provide any additional information as requested, as soon as possible. Its deadline is tomorrow, and it has been stuck at 16% (4 hrs 10 minutes, now up to 12 minutes) for ages. What should I do?
ID: 96470 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96471 - Posted: 14 May 2020, 0:11:58 UTC - in response to Message 96470.  
Last modified: 14 May 2020, 0:12:54 UTC

For tasks like that, I'd let it try to run for about twice the time limit set for your Rosetta@home tasks, then abort it. This is in case the problem is only in reporting the progress.

Aborting it is in case it is now waiting for something that will never happen.

In the meantime, you might try making more memory available for it, by blocking any other tasks from starting but allowing any tasks that have started to continue until they finish.
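
The "twice the time limit" rule above is easy to write down as a check. This is only a sketch: the function name is invented, the times are wall-clock seconds, and the 28800-second target is borrowed from the -cpu_run_time value in the logs earlier in this thread, so substitute whatever run time your own preferences use.

```python
def should_abort(elapsed_seconds, target_run_seconds, factor=2.0):
    """Return True once a task has run past factor times its target run time."""
    return elapsed_seconds > factor * target_run_seconds


# 28800 s (8 hours) is an assumed target, taken from the logs in this thread.
print(should_abort(4 * 3600 + 12 * 60, 28800))  # a bit over 4 hours in
print(should_abort(17 * 3600, 28800))           # 17 hours in
```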
ID: 96471 · Rating: 0
yoerik
Joined: 24 Mar 20
Posts: 128
Credit: 169,525
RAC: 0
Message 96474 - Posted: 14 May 2020, 4:08:27 UTC - in response to Message 96471.  

Hmm - I paused one or two other WUs - and it started working again. Rip. I'll have to babysit Rosetta WUs on that device to make sure no more than 3 of them are running at once. At least it's a phone - so it won't be as much of a bother. Thanks so much, Robert.
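
A limit like "no more than 3 at once" could in principle be set in the client instead of by hand. Below is a minimal sketch, assuming the BOINC client reads an app_config.xml file from the project's directory and honours a project_max_concurrent setting in it. That is how desktop clients behave as far as I know; whether the Android app lets you place such a file is not confirmed anywhere in this thread. The helper name build_app_config is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_concurrent):
    """Return app_config.xml text that caps concurrent tasks for one project."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 3 is the limit mentioned in the post above.
    print(build_app_config(3))
```

The generated text would still have to be saved as app_config.xml in the Rosetta project folder and the client told to re-read its config files, so check that your client supports this before relying on it.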
ID: 96474
slowbook
Joined: 10 Mar 20
Posts: 4
Credit: 20,089
RAC: 1
Message 96500 - Posted: 14 May 2020, 20:10:06 UTC

I posted a question over in the Q&A Android forum about a large number of error 11 returns on my Android 5.1.1 phone (Galaxy J3 2016). The phone does return some valid work units, and the work units seem to complete when they are sent to someone else's computer, so I suspect my phone is acting up, but I don't really have any good leads. Does anyone have any thoughts?

Thanks!
ID: 96500
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96501 - Posted: 14 May 2020, 20:41:19 UTC - in response to Message 96500.  
Last modified: 14 May 2020, 20:42:40 UTC

See my reply to MeeeK further down this thread.

Are you sure that the tasks that were successful when sent to someone else used exactly the same version as on your Galaxy?

For example, were they also using the Android Linux version of 4.20, or some other variety of 4.20 instead?

The only failed task I could find for you has another task sent to someone else, but not finished yet.
ID: 96501
slowbook
Joined: 10 Mar 20
Posts: 4
Credit: 20,089
RAC: 1
Message 96507 - Posted: 15 May 2020, 2:33:46 UTC - in response to Message 96501.  

Dear Robert,

They've fallen out of my history already (they were earlier this week; my phone doesn't get consistent WU's from Rosetta even though I have it overbalanced towards Rosetta). I do think most of the ones completed successfully by other hosts were on 4.20. Some of them were Windows or Linux, but I do recall seeing at least one successful one done by ARM on 4.20. Would it make sense to dial up the BOINC log level towards 5 and monitor for more WU's?

Thanks!
ID: 96507
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96509 - Posted: 15 May 2020, 2:55:03 UTC - in response to Message 96507.  

[snip]

Dear Robert,

They've fallen out of my history already (they were earlier this week; my phone doesn't get consistent WU's from Rosetta even though I have it overbalanced towards Rosetta). I do think most of the ones completed successfully by other hosts were on 4.20. Some of them were Windows or Linux, but I do recall seeing at least one successful one done by ARM on 4.20. Would it make sense to dial up the BOINC log level towards 5 and monitor for more WU's?

Thanks!

I was talking about the several different versions of 4.20 for different operating systems.

As for the BOINC log level, I've never used that, so I don't know whether it will be useful or not.
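
On the log level question: as far as I know, the desktop BOINC client has no single numeric log level and instead takes individual log flags in a cc_config.xml file. Here is a minimal sketch that builds such a file. It assumes the flags task_debug and mem_usage_debug are the ones of interest here and that the Android client accepts the same file; neither is confirmed in this thread. build_cc_config is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is its own element; "1" turns it on.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Flag choice is an assumption, not advice from the thread.
    print(build_cc_config(["task_debug", "mem_usage_debug"]))
```

Extra logging only helps if a failing WU actually arrives while it is switched on, so it would need to stay enabled until the next error 11.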
ID: 96509
Sid Celery
Joined: 11 Feb 08
Posts: 2145
Credit: 41,555,266
RAC: 8,961
Message 96511 - Posted: 15 May 2020, 6:39:55 UTC - in response to Message 96449.  

its frustrating to see the wasted time without result and no points

If you check, you'll find they are being awarded points retrospectively according to their runtime.
A clean-up job periodically runs and awards credit, as it's not your fault.
ID: 96511
JP
Joined: 20 Mar 20
Posts: 2
Credit: 102,246
RAC: 6
Message 96732 - Posted: 22 May 2020, 19:22:46 UTC
Last modified: 22 May 2020, 19:26:43 UTC

Hi,

Today all my R@H jobs terminated with the error "RSA key check failed", directly after BOINC started.
3 tasks had been running yesterday; the other 4 had not been started yet.
The job names were different: "hgfp3_xx", "rep212_xx", "rep1153_xx" and "Junior_HalfRoid_xx".
The error code and error message are always the same:
-185 (0xFFFFFF47) ERR_RESULT_START
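
The two forms of the code on the line above are the same number: 0xFFFFFF47 is -185 read as a signed 32-bit value. Here is a small Python sketch of the conversion (the helper names are made up for illustration). The same arithmetic turns the 0xc0000005 seen earlier in the thread into -1073741819.

```python
def to_signed32(value):
    """Interpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    # If the top bit is set, the value represents a negative number.
    return value - 0x100000000 if value & 0x80000000 else value


def to_hex32(value):
    """Format a (possibly negative) integer as 8-digit 32-bit hex."""
    return f"0x{value & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(to_signed32(0xFFFFFF47))
    print(to_hex32(-185))
```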

Here is the complete message from one example job that had already run before:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1186603301
<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>database_357d5d93529_n_methyl.zip</file_name>
  <error_code>-120 (RSA key check failed for file)</error_code>
  <error_message>signature verification failed</error_message>
</file_xfer_error>
</message>
]]>

Today new jobs were downloaded and ran smoothly, so I think there is no general problem.

Any idea what went wrong?
ID: 96732
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 96733 - Posted: 22 May 2020, 20:40:15 UTC - in response to Message 96732.  

It looks like all of the successful tasks extracted part of the input from one database zip file, and all of the failed tasks tried (and failed) to extract something similar from a different database zip file.

This looks likely to mean that the database zip file with the longer name downloaded with an error, and therefore caused all of the tasks trying to use it to fail.

If someone can tell you how to download a replacement for the file with the error, this should fix the problem. I'm sorry that I can't do that.

The cause MIGHT be an overly aggressive antivirus program somewhere along the path from the download server to you chopping off the end of the file, but it's hard to be sure unless you can compare the copy with the error to a corrected version.
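
If a second copy of the zip file can be obtained, the comparison could be sketched in Python as below. This is not BOINC's own signature check, only a plain size and SHA-256 comparison plus a basic zip integrity test; the function names are made up for illustration.

```python
import hashlib
import zipfile
from pathlib import Path


def file_fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large database files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return Path(path).stat().st_size, digest.hexdigest()


def zip_looks_intact(path):
    """Return True if the file opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def same_file(path_a, path_b):
    """Return True if both files have the same size and SHA-256 digest."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

A file with its end chopped off would normally fail zip_looks_intact on its own, since the zip directory sits at the end of the file, so that check can be tried even before a second copy is available.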
ID: 96733
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 96747 - Posted: 23 May 2020, 14:12:37 UTC

The databases are all in the root of the downloads directory:
https://boinc.bakerlab.org/rosetta/download/database_357d5d93529_n_methyl.zip
Rosetta Moderator: Mod.Sense
ID: 96747

©2024 University of Washington
https://www.bakerlab.org