Problems and Technical Issues with Rosetta@home


AuthorMessage
mrhastyrib

Send message
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 101218 - Posted: 10 Apr 2021, 21:52:56 UTC - in response to Message 101208.  

My PCs have 8, 8, 8, 16, 36, 36, and 64GB.

ID: 101218 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mrhastyrib

Send message
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 101219 - Posted: 10 Apr 2021, 21:59:21 UTC - in response to Message 101211.  

What? It's a valid point. Why would you think a biologist knows about the code?

All the cool kids who work on projects like this one, using proprietary software, know how to code, grandpa.
I disagree. I'm sure you can become a brilliant biologist without knowing the first thing about coding.

You reframed the argument from talking about "any" biologist to all of them.

I've noticed that you use that move when you start losing in an argument (that is to say, all of them) and it's annoying.

Also, your new version of the argument is kind of a "Captain Obvious" eyeroller.
ID: 101219 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101222 - Posted: 10 Apr 2021, 23:18:36 UTC - in response to Message 101217.  
Last modified: 10 Apr 2021, 23:20:12 UTC

It sometimes helps to shut down BOINC, then restart BOINC and then the task.

Progress seeming to freeze near the end of a task is often a sign that the task was created with a severe underestimate of how long the task would run.
OK, thank you. I just tried that (rebooted in between as well), and it reset to ~16% complete... Maybe I just kill it if it's going to run forever?
I've noticed that some of the latest Tasks aren't checkpointing properly, so if you interrupt them they will revert to the last successful checkpoint.
Next time, just let it run. The default time is 8 hours, and there is a 10 hour watchdog timer in case it's not done within 8 hours. If it's still going after 20 hours, then you might want to kill it off.
Grant
Darwin NT
ID: 101222 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101223 - Posted: 10 Apr 2021, 23:38:18 UTC - in response to Message 101206.  

Over the course of this afternoon I’ve had 6 segv errors, all on files starting miniprotein.

Anyone else? Or do I start checking my hardware?

I got a signal 11, but nothing about a segv.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1365882560
I've had about a 50% failure rate for the miniprotein_relax8_ Tasks so far. 4 completed & Validated, 5 errored out.

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 3221225477 (0xc0000005)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_4sp2hc8l.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_4sp2hc8l.zip @miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_4sp2hc8l.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3662000
Using database: database_357d5d93529_n_methylminirosetta_database


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF620BF8316 read attempt to address 0xFFFFFFFF

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 7.9.0


Dump Timestamp    : 04/11/21 06:44:57
Install Directory : C:\Program Files\BOINC
Data Directory    : C:\ProgramData\BOINC
Project Symstore  : https://boinc.bakerlab.org/rosetta/symstore
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
Loaded Library    : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library    : version.dll
Debugger Engine   : 4.0.5.0
Symbol Search Path: C:\ProgramData\BOINC\slots\8;C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\symbols*https://boinc.bakerlab.org/rosetta/symstore


ModLoad: 0000000020720000 00000000057ef000 C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\rosetta_4.20_windows_x86_64.exe (-exported- Symbols Loaded)
    Linked PDB Filename   : C:cygwin64homeboinc4.17RosettamainsourceideVisualStudiox64BoincReleaserosetta_4.20_windows_x86_64.pdb

ModLoad: 0000000028070000 00000000001f5000 C:\WINDOWS\SYSTEM32\ntdll.dll (6.2.19041.844) (-exported- Symbols Loaded)
    Linked PDB Filename   : ntdll.pdb
    File Version          : 10.0.19041.804 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.804

ModLoad: 0000000027c10000 00000000000bd000 C:\WINDOWS\System32\KERNEL32.DLL (6.2.19041.804) (-exported- Symbols Loaded)
    Linked PDB Filename   : kernel32.pdb
    File Version          : 10.0.19041.804 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.804

ModLoad: 0000000025c50000 00000000002c9000 C:\WINDOWS\System32\KERNELBASE.dll (6.2.19041.804) (-exported- Symbols Loaded)
    Linked PDB Filename   : kernelbase.pdb
    File Version          : 10.0.19041.804 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.804

ModLoad: 0000000027de0000 000000000006b000 C:\WINDOWS\System32\WS2_32.dll (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : ws2_32.pdb
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.1

ModLoad: 0000000027a80000 000000000012b000 C:\WINDOWS\System32\RPCRT4.dll (6.2.19041.746) (-exported- Symbols Loaded)
    Linked PDB Filename   : rpcrt4.pdb
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.1

ModLoad: 0000000027250000 00000000001a0000 C:\WINDOWS\System32\USER32.dll (6.2.19041.746) (-exported- Symbols Loaded)
    Linked PDB Filename   : user32.pdb
    File Version          : 10.0.19038.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19038.1

ModLoad: 00000000257b0000 0000000000022000 C:\WINDOWS\System32\win32u.dll (6.2.19041.867) (-exported- Symbols Loaded)
    Linked PDB Filename   : win32u.pdb
    File Version          : 10.0.19041.867 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.867

ModLoad: 0000000027520000 000000000002a000 C:\WINDOWS\System32\GDI32.dll (6.2.19041.746) (-exported- Symbols Loaded)
    Linked PDB Filename   : gdi32.pdb
    File Version          : 10.0.19041.746 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.746

ModLoad: 0000000025f20000 000000000010b000 C:\WINDOWS\System32\gdi32full.dll (6.2.19041.746) (-exported- Symbols Loaded)
    Linked PDB Filename   : gdi32full.pdb
    File Version          : 10.0.19041.746 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.746

ModLoad: 0000000026030000 000000000009d000 C:\WINDOWS\System32\msvcp_win.dll (6.2.19041.789) (-exported- Symbols Loaded)
    Linked PDB Filename   : msvcp_win.pdb
    File Version          : 10.0.19041.789 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.789

ModLoad: 0000000025910000 0000000000100000 C:\WINDOWS\System32\ucrtbase.dll (6.2.19041.789) (-exported- Symbols Loaded)
    Linked PDB Filename   : ucrtbase.pdb
    File Version          : 10.0.19041.789 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.789

ModLoad: 0000000026ef0000 00000000000ac000 C:\WINDOWS\System32\ADVAPI32.dll (6.2.19041.610) (-exported- Symbols Loaded)
    Linked PDB Filename   : advapi32.pdb
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.1

ModLoad: 00000000260d0000 000000000009e000 C:\WINDOWS\System32\msvcrt.dll (7.0.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : msvcrt.pdb
    File Version          : 7.0.19041.546 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 7.0.19041.546

ModLoad: 0000000027680000 000000000009c000 C:\WINDOWS\System32\sechost.dll (6.2.19041.789) (-exported- Symbols Loaded)
    Linked PDB Filename   : sechost.pdb
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.1

ModLoad: 0000000026700000 0000000000030000 C:\WINDOWS\System32\IMM32.DLL (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : imm32.pdb
    File Version          : 10.0.19041.546 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.546

ModLoad: 0000000023730000 0000000000012000 C:\WINDOWS\SYSTEM32\kernel.appcore.dll (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : Kernel.Appcore.pdb
    File Version          : 10.0.19041.546 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.546

ModLoad: 0000000024500000 0000000000033000 C:\WINDOWS\SYSTEM32\ntmarta.dll (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : ntmarta.pdb
    File Version          : 10.0.19041.1 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.1

ModLoad: 0000000025170000 000000000000c000 C:\WINDOWS\SYSTEM32\CRYPTBASE.DLL (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : cryptbase.pdb
    File Version          : 10.0.19041.546 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.546

ModLoad: 0000000025a10000 0000000000080000 C:\WINDOWS\System32\bcryptPrimitives.dll (6.2.19041.662) (-exported- Symbols Loaded)
    Linked PDB Filename   : bcryptprimitives.pdb
    File Version          : 10.0.19041.662 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.662

ModLoad: 0000000020110000 00000000001e4000 C:\WINDOWS\SYSTEM32\dbghelp.dll (6.2.19041.867) (-exported- Symbols Loaded)
    Linked PDB Filename   : dbghelp.pdb
    File Version          : 10.0.19041.867 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.867

ModLoad: 00000000206e0000 000000000000a000 C:\WINDOWS\SYSTEM32\version.dll (6.2.19041.546) (-exported- Symbols Loaded)
    Linked PDB Filename   : version.pdb
    File Version          : 10.0.19041.546 (WinBuild.160101.0800)
    Company Name          : Microsoft Corporation
    Product Name          : Microsoft® Windows® Operating System
    Product Version       : 10.0.19041.546



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 11302, Write: 7110, Other 38617

- I/O Transfers Counters -
Read: 37800841, Write: 25327698, Other 29684

- Paged Pool Usage -
QuotaPagedPoolUsage: 318232, QuotaPeakPagedPoolUsage: 318408
QuotaNonPagedPoolUsage: 26784, QuotaPeakNonPagedPoolUsage: 28280

- Virtual Memory Usage -
VirtualSize: 689004544, PeakVirtualSize: -1985282048

- Pagefile Usage -
PagefileUsage: 689004544, PeakPagefileUsage: 1726263296

- Working Set Size -
WorkingSetSize: 707022848, PeakWorkingSetSize: 1742659584, PageFaultCount: 3912026

*** Dump of thread ID 6932 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Normal, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 0.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF620BF8316 read attempt to address 0xFFFFFFFF

- Registers -
rax=000000000000003a rbx=000000000eaa4760 rcx=000000000f4a4ac0 rdx=000000000f584bf8 rsi=000000000000000b rdi=000000000f4a4ac0
r8=000000000000003a r9=0000000000000421 r10=00000000242c6e80 r11=0000000020746600 r12=0000000020720000 r13=000000002075f960
r14=0000000020746d40 r15=000000000048b215 rip=0000000020bf8316 rsp=0000000020746680 rbp=0000000000000000
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206

- Callstack -
ChildEBP RetAddr  Args to Child
207466a0 20bb935d 0eaa4760 20746740 20bab215 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0 
207466d0 23d27f10 24c10150 2075f960 00000000 00000001 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0 
20746700 20ba39e8 25cba32c 20720000 207467f0 280a0e7b rosetta_4.20_windows_x86_64!xmlValidateNotationDecl+0x0 
20746770 28111f6f 00000000 20746cf0 207473b0 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0 
207467a0 280c1454 00000000 20746cf0 207473b0 00000000 ntdll!__chkstk+0x0 
20746eb0 28110a9e 20746fd0 20df2be9 207476f0 207472f0 ntdll!RtlRaiseException+0x0 
20747650 20dc2938 00000000 00000001 207476f0 20ba3615 ntdll!KiUserExceptionDispatcher+0x0 
20747690 2162c536 00000001 207477a0 2437b498 00000000 rosetta_4.20_windows_x86_64!cppdb::session::is_open+0x0 
20747890 2341b619 20747a40 20748e40 2595fad0 20747730 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20747b70 23415c06 20748330 0f375760 0f375760 20747c80 rosetta_4.20_windows_x86_64!cppdb::mutex::~mutex+0x0 
20748910 2341d5f3 37c78970 20748a18 00000000 00000000 rosetta_4.20_windows_x86_64!cppdb::mutex::~mutex+0x0 
20749080 215fff1c fffffffe 00000000 39fdca78 971b56f6 rosetta_4.20_windows_x86_64!cppdb::mutex::~mutex+0x0 
20749150 215fbb43 25dfbc90 69766163 207491b9 00000006 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20749210 215fa41f 0f7acd90 ffffffff 39fdcdd8 ffffffff rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
207494d0 215a2dc0 25b96ec0 0f03af40 25b3e9f0 0f03af40 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20749a70 215a00f8 0f03af40 0f03af40 20749b80 20749b68 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20749c40 2160a4d1 20749cc8 0f03af40 20749cc8 20749e50 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20749c90 215f006d 0eec8600 20749cc8 0eec8600 0eaa5d01 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
20749d20 20ba0dfd 0f4675a0 20749e50 00000000 0eaa5d01 rosetta_4.20_windows_x86_64!xmlCheckHTTPInput+0x0 
2075f950 20bab215 00000000 00000000 25acccf8 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0 
2075f990 27c27034 00000000 00000000 00000000 00000000 rosetta_4.20_windows_x86_64!xmlParserInputRead+0x0 
2075f9c0 280c2651 00000000 00000000 00000000 00000000 KERNEL32!BaseThreadInitThunk+0x0 
2075fa40 00000000 00000000 00000000 00000000 00000000 ntdll!RtlUserThreadStart+0x0 

*** Dump of thread ID 32762 (state: Initialized): ***

- Information -
Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 6.000000, User Time: 0.000000, Wait Time: 775459008.000000

- Registers -
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

- Callstack -
ChildEBP RetAddr  Args to Child
(-nosymbols- PC == 0)
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0 

*** Dump of thread ID 30879279 (state: Unknown): ***

- Information -
Status: Base Priority: Normal, Priority: Unknown, , Kernel Time: 17179869184.000000, User Time: 21474963456.000000, Wait Time: 0.000000

- Registers -
rax=0000000000000000 rbx=0000000000000000 rcx=0000000000000000 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000 r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000 rip=0000000000000000 rsp=0000000000000000 rbp=0000000000000000
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

- Callstack -
ChildEBP RetAddr  Args to Child
(-nosymbols- PC == 0)
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0 


*** Debug Message Dump ****


*** Foreground Window Data ***
    Window Name      : 
    Window Class     : 
    Window Process ID: 0
    Window Thread ID : 0

Exiting...

</stderr_txt>
]]>
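
For anyone trying to match the decimal exit code in the message to the hex value in the exception record, they are the same number. A minimal Python sketch of the conversion follows; the helper name and the one-entry lookup table are just for illustration, not part of BOINC or Rosetta. 0xC0000005 is the Windows access violation status, which is the same class of fault that Linux hosts report as signal 11 (SIGSEGV).

```python
# Illustrative helper (hypothetical name, not part of BOINC): turn a
# Windows process exit code into the hex form used in the crash dump.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",  # invalid memory read/write
}


def describe_exit_code(code: int) -> str:
    # Exit codes are 32-bit; mask so a signed (negative) value from
    # another tool maps to the same unsigned number.
    code &= 0xFFFFFFFF
    name = KNOWN_STATUS.get(code, "unknown")
    return f"0x{code:08x} ({name})"


print(describe_exit_code(3221225477))  # 0xc0000005 (STATUS_ACCESS_VIOLATION)
```
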

Grant
Darwin NT
ID: 101223 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim Martin

Send message
Joined: 9 Oct 05
Posts: 23
Credit: 1,443,682
RAC: 240
Message 101224 - Posted: 11 Apr 2021, 0:13:21 UTC - in response to Message 101158.  

Brian -- The first Rosetta@home work units have arrived, which appear to be working well. They are:

TMWFYIU

pre-helical-bundles-round1 (I did not use the underscore, because a red line resulted.)

miniprotein-relax8

They are in the mid-40% range of completion.

Here's hoping. . .

jm
ID: 101224 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Send message
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 101226 - Posted: 11 Apr 2021, 0:37:44 UTC - in response to Message 101208.  

I allocate 28Gb from 32Gb total
They don't need the RAM. If they run, they generally use 300Mb, not 5 or 6Gb each. It's more than a bit crackers.
Once you've started them, I wonder if Boinc adjusts to what they're actually using, or leaves the requested 6.5GB there just in case? You could see what happens if you try to run only Rosettas.

I've speculated that's the case too. Demand a maximum amount to start, then let the task decide once it's running.
I cleared all the problem tasks on my laptop before by setting NNT to Rosetta & WCG and only when all other tasks had completed did my problem task start.

That was a while ago.

Now I've got another problem Rosetta task. I've suspended all unstarted Rosetta tasks, set NNT again, and kept my last few WCG tasks so that as each of the 3 running Rosetta tasks finishes, it switches to less-demanding WCG tasks to see if enough RAM is freed up.
If it isn't, let the WCG tasks complete as well until there's only the one problem Rosetta task.
If it runs, fine. If it doesn't, abort it.
What a performance...
If it's too big, I run it on a bigger machine. My PCs have 8, 8, 8, 16, 36, 36, and 64GB.

I wasn't clear. My main PC has 28Gb allocated out of 32Gb
My recent problems have been on my laptop with 6.7Gb free out of 8Gb

As it turned out, the problem task eventually ran ok, I unchecked NNT and downloaded fresh Rosetta tasks and they started up straight away on all cores.
Not sure if that confirms the initial RAM req't is only large before the task starts running, then reduces to only what it uses once started. Kind of seems that way though.
ID: 101226 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Jim Martin

Send message
Joined: 9 Oct 05
Posts: 23
Credit: 1,443,682
RAC: 240
Message 101227 - Posted: 11 Apr 2021, 0:41:42 UTC - in response to Message 101224.  

A reply, here, to Robert, Brian, if I may. Robert -- I, too, wondered if that wasn't an indication of a coding glitch.

jm
ID: 101227 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101235 - Posted: 11 Apr 2021, 16:46:32 UTC - in response to Message 101215.  

I just suspended a task that has been running for 16:19:35 (stuck at 00:10:15 remaining). Any ideas how to get this one to finish and get credit for it?
I saw an Android task run for exactly 24 hours (3 times the normal 8). Maybe they're meant to since Androids are slower than PCs?
ID: 101235 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101236 - Posted: 11 Apr 2021, 16:47:41 UTC - in response to Message 101216.  

I just suspended a task that has been running for 16:19:35 (stuck at 00:10:15 remaining). Any ideas how to get this one to finish and get credit for it?
It sometimes helps to shut down BOINC, then restart BOINC and then the task.

Progress seeming to freeze near the end of a task is often a sign that the task was created with a severe underestimate of how long the task would run.
It can be. But I just had an LHC Atlas task running for 24 hours instead of about 3, and the CPU usage was zero. Mind you those use virtualbox, which is a pain in the ass.
ID: 101236 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101237 - Posted: 11 Apr 2021, 16:48:44 UTC - in response to Message 101218.  

My PCs have 8, 8, 8, 16, 36, 36, and 64GB.

What an absurd analogy. So you must think everyone who owns a decent car has a small appendage?
ID: 101237 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101238 - Posted: 11 Apr 2021, 16:51:01 UTC - in response to Message 101219.  

What? It's a valid point. Why would you think a biologist knows about the code?

All the cool kids who work on projects like this one, using proprietary software, know how to code, grandpa.
I disagree. I'm sure you can become a brilliant biologist without knowing the first thing about coding.

You reframed the argument from talking about "any" biologist to all of them.

I've noticed that you use that move when you start losing in an argument (that is to say, all of them) and it's annoying.

Also, your new version of the argument is kind of a "Captain Obvious" eyeroller.
No idea what you think I've changed, we're all talking about biologists that don't have to be programmers. Just like hairdressers don't have to be racing drivers.
ID: 101238 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101239 - Posted: 11 Apr 2021, 16:53:17 UTC - in response to Message 101226.  

If it's too big, I run it on a bigger machine. My PCs have 8, 8, 8, 16, 36, 36, and 64GB.
I wasn't clear. My main PC has 28Gb allocated out of 32Gb
My recent problems have been on my laptop with 6.7Gb free out of 8Gb

As it turned out, the problem task eventually ran ok, I unchecked NNT and downloaded fresh Rosetta tasks and they started up straight away on all cores.
Not sure if that confirms the initial RAM req't is only large before the task starts running, then reduces to only what it uses once started. Kind of seems that way though.
I understood you perfectly. If your laptop can't handle the big tasks, get Rosetta to only run on the big machine. Or just don't get so upset when something queues up.
ID: 101239 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
DizzyD

Send message
Joined: 23 Nov 20
Posts: 6
Credit: 1,438,330
RAC: 0
Message 101240 - Posted: 11 Apr 2021, 18:44:35 UTC - in response to Message 101222.  

I've noticed that some of the latest Tasks aren't checkpointing properly, so if you interrupt them they will revert to the last successful checkpoint.
Next time, just let it run. The default time is 8 hours, and there is a 10 hour watchdog timer in case it's not done within 8 hours. If it's still going after 20 hours, then you might want to kill it off.


Grant, thank you for your reply. I don't quite understand your "20 hours" comment. I let the task run for 16 hours. If there is a watchdog timer at 10 hours, what is the difference between anything over 10 hours (e.g. 11 hours, 16 hours and 20 hours) not completing? Isn't it just stuck at that point?
ID: 101240 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Falconet

Send message
Joined: 9 Mar 09
Posts: 354
Credit: 1,276,393
RAC: 828
Message 101241 - Posted: 11 Apr 2021, 18:59:36 UTC - in response to Message 101240.  
Last modified: 11 Apr 2021, 19:01:26 UTC

I've noticed that some of the latest Tasks aren't checkpointing properly, so if you interrupt them they will revert to the last successful checkpoint.
Next time, just let it run. The default time is 8 hours, and there is a 10 hour watchdog timer in case it's not done within 8 hours. If it's still going after 20 hours, then you might want to kill it off.


Grant, thank you for your reply. I don't quite understand your "20 hours" comment. I let the task run for 16 hours. If there is a watchdog timer at 10 hours, what is the difference between anything over 10 hours (e.g. 11 hours, 16 hours and 20 hours) not completing? Isn't it just stuck at that point?



The watchdog isn't at 10 hours. It's 10 hours AFTER whatever the CPU runtime setting is at. So, if you are running with the default setting, which is 8 CPU hours, then the watchdog will only kick in at 18 hours.

What Grant meant is that considering the watchdog should kick in at 18 hours, if the task is still running at 20 hours, you might want to abort it.
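
As a rough illustration of that arithmetic, here is a small Python sketch. The defaults are the `-cpu_run_time 28800` and `-boinc::cpu_run_timeout 36000` values from the command line in the crash report earlier in this thread; the function name is made up, and adding the two values is simply the rule described above, not something taken from the Rosetta source.

```python
# Sketch only: when should the watchdog step in, given the target CPU
# run time and the extra timeout (both in seconds)?
def watchdog_deadline_hours(cpu_run_time_s: int = 28800,
                            cpu_run_timeout_s: int = 36000) -> float:
    # Target run time plus the watchdog allowance, converted to hours.
    return (cpu_run_time_s + cpu_run_timeout_s) / 3600


print(watchdog_deadline_hours())           # 18.0 (default 8 h target)
print(watchdog_deadline_hours(22 * 3600))  # 32.0 (22 h target)
```

So a task stuck at 16 hours on the default setting has not yet reached the point where the watchdog would act.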
ID: 101241 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mrhastyrib

Send message
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 101245 - Posted: 11 Apr 2021, 22:43:24 UTC - in response to Message 101238.  

No idea what you think I've changed

I know. It's that damned Dunning-Kruger thingy.
ID: 101245 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
jsm

Send message
Joined: 4 Apr 20
Posts: 3
Credit: 77,838,477
RAC: 15,529
Message 101247 - Posted: 12 Apr 2021, 6:50:00 UTC - in response to Message 101049.  

Running at 22 hours has substantially reduced the bandwidth hog but detailed checking has turned up a query. All the computers are asking the scheduler every minute or so for new tasks, only to be told 'no can do you have plenty' (I paraphrase). This is clearly putting an unnecessary load on the scheduler and contributing to my bandwidth loss. Is there a way to instruct the preferences only to seek additional work every so often, e.g. 1 hour?
capt
ID: 101247 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grant (SSSF)

Send message
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101248 - Posted: 12 Apr 2021, 7:51:08 UTC - in response to Message 101247.  
Last modified: 12 Apr 2021, 7:53:16 UTC

Running at 22 hours has substantially reduced the bandwidth hog but detailed checking has turned up a query. All the computers are asking the scheduler every minute or so for new tasks, only to be told 'no can do you have plenty' (I paraphrase). This is clearly putting an unnecessary load on the scheduler and contributing to my bandwidth loss. Is there a way to instruct the preferences only to seek additional work every so often, e.g. 1 hour?
capt
How often it asks for work depends on the number of cores/threads you have, the amount of time the system is actually able to process work, and, most importantly, on your cache settings.
The fact that many of your Tasks time out before you even return them due to missed deadlines indicates your cache setting is way, way, way, way too large. The estimated completion time for all Tasks is 8 hours, regardless of what your Target CPU time is set to.
So having a multi-day cache, combined with a Target CPU time longer than the default 8 hours, is going to result in endless requests for work, and huge numbers of Tasks missing their deadlines.

In your computing preferences, under Other, set:
    Store at least 0.01 days of work
    Store up to an additional 0.01 days of work
Your machines will then stop trashing Work Units due to missed deadlines, and stop continually asking for more work.
If you go back to the default 8 hours in the future, you could then bump up the "Store at least 0.01 days of work" to something like 0.2 to maintain a reasonable buffer that won't result in missed deadlines when things change.
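
A back-of-the-envelope sketch in Python of why a multi-day cache plus a long Target CPU time ends in missed deadlines. Only the 8 hour estimate and the 22 hour runtime come from this thread; the function name, the 8-core host, the 2-day cache and the 3-day deadline are assumptions picked for illustration. It is a simplified model, not the BOINC client's actual work-fetch logic.

```python
# Simplified model (assumed, not BOINC's real scheduler): the client asks
# for enough *estimated* work to fill the cache, but each task really
# runs for the Target CPU time.
def cache_outlook(cores: int, cache_days: float, actual_hours_per_task: float,
                  estimated_hours_per_task: float = 8.0,
                  deadline_days: float = 3.0):
    # Tasks requested so every core has cache_days of estimated work.
    tasks = int(cores * cache_days * 24 / estimated_hours_per_task)
    # Days really needed to clear that queue across all cores.
    days_to_finish = tasks * actual_hours_per_task / cores / 24
    return tasks, days_to_finish, days_to_finish > deadline_days


# Hypothetical host: 8 cores, 2-day cache, 22 hour tasks.
print(cache_outlook(8, 2, 22))  # (48, 5.5, True): the queue overruns the deadline
```
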
Grant
Darwin NT
ID: 101248 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101249 - Posted: 12 Apr 2021, 9:51:39 UTC - in response to Message 101245.  

No idea what you think I've changed
I know. It's that damned Dunning-Kruger thingy.
No context, no conversation.
ID: 101249 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Avatar

Send message
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101250 - Posted: 12 Apr 2021, 9:54:49 UTC

The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran.

[1] Who has 8GB on a machine they actually interact with? You could maybe load Windows 10 and 1 application. But dare to play a game, or use email and a photo editor at once and it'll grind to a halt. Another example of modern shoddy lazy bloated programming. I can boot Linux off a 1GB flash drive. Yet Windows is 20 times bigger.
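
A toy Python check of why the memory percentage matters on an 8GB machine, for what it's worth. This is a deliberately simplified model, not BOINC's actual logic: it assumes the client compares the task's requested memory plus what is already in use against the allowed share of RAM. The function name and the percentages are illustrative; the 6.5GB request and the roughly 300MB per running task are the figures mentioned in this thread.

```python
# Toy model (assumption, not the real client code): may a task start,
# given total RAM, the "use at most N% of memory" preference, the task's
# requested memory, and memory already used by running tasks?
def can_start(total_ram_gb: float, max_mem_pct: float,
              task_request_gb: float, in_use_gb: float = 0.0) -> bool:
    allowed_gb = total_ram_gb * max_mem_pct / 100
    return in_use_gb + task_request_gb <= allowed_gb


print(can_start(8, 50, 6.5))        # False: only 4GB allowed
print(can_start(8, 90, 6.5, 0.9))   # False: three 300MB tasks already running
print(can_start(8, 100, 6.5, 0.9))  # True: fits once 100% is allowed
```

Under this model the big task waits until either the other tasks finish or the limit is raised, which matches what was reported above.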
ID: 101250 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mrhastyrib

Send message
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 101252 - Posted: 12 Apr 2021, 12:10:13 UTC - in response to Message 101249.  

No idea what you think I've changed
I know. It's that damned Dunning-Kruger thingy.
No context, no conversation.

Unless you are a relative -- which you are not -- it's not my duty to compensate for your inability to keep up with a conversation due to age-related infirmities. I counsel making use of Google.
ID: 101252 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org