R@H works, but all COVID-19 tasks fail

Questions and Answers : Unix/Linux : R@H works, but all COVID-19 tasks fail

sspseudoo

Joined: 4 Mar 20
Posts: 7
Credit: 23,843
RAC: 0
Message 92322 - Posted: 26 Mar 2020, 9:28:09 UTC
Last modified: 26 Mar 2020, 9:31:31 UTC

Hello everyone,
I use Rosetta@home on Fedora 31 and it works. See https://boinc.bakerlab.org/rosetta/results.php?userid=2083373

My problem is that every COVID-19 task fails shortly after it starts. This is the output on the command line:
[...]
26-Mar-2020 10:09:04 [Rosetta@home] Starting task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0
[... nothing regarding COVID-19]
26-Mar-2020 10:10:28 [Rosetta@home] Computation for task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0 finished
26-Mar-2020 10:10:28 [Rosetta@home] Output file 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0_r336008625_0 for task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0 absent
[...]
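
For anyone who wants to spot this pattern in a longer client log, here is a minimal Python sketch. It assumes the log lines keep the format shown in the excerpt above; the names summarize, TASK and LOG are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Task name and the three log lines are copied from the excerpt above.
TASK = "2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0"
LOG = (
    f"26-Mar-2020 10:09:04 [Rosetta@home] Starting task {TASK}\n"
    f"26-Mar-2020 10:10:28 [Rosetta@home] Computation for task {TASK} finished\n"
    f"26-Mar-2020 10:10:28 [Rosetta@home] Output file {TASK}_r336008625_0 for task {TASK} absent\n"
)

# Timestamp, bracketed project name, then the message text.
LINE = re.compile(r"^(\d{1,2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]*\] (.*)$")
STARTED = re.compile(r"^Starting task (\S+)$")
FINISHED = re.compile(r"^Computation for task (\S+) finished$")
ABSENT = re.compile(r"^Output file \S+ for task (\S+) absent$")


def summarize(log_text):
    """Map task name -> wall-clock seconds and whether its output file was absent."""
    tasks = {}
    start_times = {}
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if match is None:
            continue  # skips "[...]" markers and lines in any other format
        when = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
        message = match.group(2)
        for pattern, kind in ((STARTED, "start"), (FINISHED, "finish"), (ABSENT, "absent")):
            hit = pattern.match(message)
            if hit is None:
                continue
            name = hit.group(1)
            entry = tasks.setdefault(name, {"seconds": None, "output_absent": False})
            if kind == "start":
                start_times[name] = when
            elif kind == "finish" and name in start_times:
                entry["seconds"] = int((when - start_times[name]).total_seconds())
            elif kind == "absent":
                entry["output_absent"] = True
    return tasks


for name, info in summarize(LOG).items():
    print(name, info)
```

For the three lines above, this reports 84 seconds of wall-clock time between start and finish and flags the output file as absent.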


Command line output when I start BOINC:
$ boinc
26-Mar-2020 10:08:31 [---] Starting BOINC client version 7.16.1 for x86_64-pc-linux-gnu
26-Mar-2020 10:08:31 [---] log flags: file_xfer, sched_ops, task
26-Mar-2020 10:08:31 [---] Libraries: libcurl/7.66.0 OpenSSL/1.1.1d-fips zlib/1.2.11 brotli/1.0.7 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0
26-Mar-2020 10:08:31 [---] Data directory: /home/x
26-Mar-2020 10:08:32 [---] OpenCL CPU: pthread-AMD Athlon(tm) II X4 620 Processor (OpenCL driver vendor: The pocl project, driver version 1.5-pre, device version OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-amdfam10)
26-Mar-2020 10:08:32 [---] No usable GPUs found
26-Mar-2020 10:08:32 [---] [libc detection] gathered: 2.30, GNU libc
26-Mar-2020 10:08:32 [---] Host name: x-2017-1.local
26-Mar-2020 10:08:32 [---] Processor: 4 AuthenticAMD AMD Athlon(tm) II X4 620 Processor [Family 16 Model 5 Stepping 2]
26-Mar-2020 10:08:32 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
26-Mar-2020 10:08:32 [---] OS: Linux Fedora: Fedora release 31 (Thirty One) [5.5.10-200.fc31.x86_64|libc 2.30 (GNU libc)]
26-Mar-2020 10:08:32 [---] Memory: 3.59 GB physical, 3.74 GB virtual
26-Mar-2020 10:08:32 [---] Disk: 82.67 GB total, 3.70 GB free
26-Mar-2020 10:08:32 [---] Local time is UTC +1 hours
26-Mar-2020 10:08:32 [---] No general preferences found - using defaults
26-Mar-2020 10:08:32 [---] Reading preferences override file
26-Mar-2020 10:08:32 [---] Preferences:
26-Mar-2020 10:08:32 [---]    max memory usage when active: 1837.91 MB
26-Mar-2020 10:08:32 [---]    max memory usage when idle: 3308.23 MB
26-Mar-2020 10:08:32 [---]    max disk usage: 6.90 GB
26-Mar-2020 10:08:32 [---]    don't use GPU while active
26-Mar-2020 10:08:32 [---]    suspend work if non-BOINC CPU load exceeds 25%
26-Mar-2020 10:08:32 [---]    (to change preferences, visit a project web site or select Preferences in the Manager)
26-Mar-2020 10:08:32 [---] Setting up project and slot directories
26-Mar-2020 10:08:32 [---] Checking active tasks
26-Mar-2020 10:08:32 [Rosetta@home] URL https://boinc.bakerlab.org/rosetta/; Computer ID 3780697; resource share 100
26-Mar-2020 10:08:32 [---] Setting up GUI RPC socket
26-Mar-2020 10:08:32 [---] Checking presence of 29 project files
26-Mar-2020 10:08:32 Initialization completed
26-Mar-2020 10:08:40 [Rosetta@home] project resumed by user
[...]


What can I do to debug this problem? Which log flags should I enable? https://boinc.berkeley.edu/wiki/Client_configuration#Logging_flags

Thanks in advance!
ID: 92322
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92332 - Posted: 26 Mar 2020, 13:44:10 UTC
Last modified: 27 Mar 2020, 15:12:14 UTC

The report from one of your WUs, such as this one, shows:
Stderr output

<core_client_version>7.16.1</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 2uc6gr8g_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 2uc6gr8g_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1372451
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>

Are you overclocking this machine? Is the memory stable in this machine?
Given that it runs for less than a minute, I wouldn't think that is enough time for the task to have filled its own memory space.
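
As a small aid for reading such reports, the sketch below pulls the signal number out of the "process got signal" line and looks up its name with Python's standard signal module. The helper name crash_signal and the STDERR sample string are only illustrative, and the name lookup reflects the platform the script runs on, so it is meaningful only if that matches the host that crashed.

```python
import re
import signal

# The relevant fragment of the stderr report quoted above.
STDERR = "<message>\nprocess got signal 11</message>"


def crash_signal(stderr_text):
    """Return (number, name) for the reported signal, or None if no signal is reported."""
    match = re.search(r"process got signal (\d+)", stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # signal.Signals maps a number to the local platform's signal name.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


print(crash_signal(STDERR))
```

On Linux, signal 11 is SIGSEGV (segmentation fault). That only says the process touched memory it should not have. It does not by itself distinguish faulty hardware from a bug in the application.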
Rosetta Moderator: Mod.Sense
ID: 92332
sspseudoo

Joined: 4 Mar 20
Posts: 7
Credit: 23,843
RAC: 0
Message 92350 - Posted: 26 Mar 2020, 17:13:50 UTC

Thank you for the fast answer! I don't overclock and the machine is stable: it usually doesn't crash, and most other tasks complete without problems. But there is only 4 GB of RAM in total, and I'm also running Firefox with many tabs. There is also not much disk space left on my home partition (only 4 GB, on an old but good Samsung SSD). Could that be a problem? Is there a way to find out whether RAM or disk space is the limiting factor? Would logging flags help?
ID: 92350
Alexey Vazhnov

Joined: 26 Mar 20
Posts: 2
Credit: 0
RAC: 0
Message 92369 - Posted: 26 Mar 2020, 22:03:37 UTC
Last modified: 26 Mar 2020, 22:53:13 UTC

Hello!
I have the same problem: all tasks work fine except tasks like 3xu2pj0j_jhr_design1_COVID-19_SAVE_ALL_OUT_903165_1, which fail with "Computation error".

Xubuntu 19.10 amd64
Linux kernel 5.3.0-42
BOINC 7.16.3 installed from the official Ubuntu repository
CPU: AMD Phenom II X6 1055T
RAM: 8 GB
Space for BOINC: 20 GB


Tasks here: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3825844

Message is the same:
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 4il2au3a_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 4il2au3a_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3254684
Starting watchdog...
Watchdog active.
ID: 92369
ToKamaK

Joined: 27 Mar 20
Posts: 4
Credit: 5,621,041
RAC: 2,964
Message 92437 - Posted: 28 Mar 2020, 9:08:50 UTC

Greetings,

I saw similar crashes after joining the project yesterday evening. My machine has an AMD Phenom(tm) II X4 945 Processor, which interestingly is also of the amdfam10 architecture. I'm running Debian Sid, which currently ships glibc 2.30, if that matters.

Here are my observations:
* The affected jobs were of type COVID-19, running with Rosetta engine version 4.08 and crashing at around 1% execution;
* I have a COVID-19 job that is currently past 20% execution, but it is running with Rosetta 4.07;
* I saw non-COVID-19 jobs in the task list that ran successfully with Rosetta 4.08;
* All other kinds of jobs are apparently running successfully.

Hope this helps
ID: 92437
sspseudoo

Joined: 4 Mar 20
Posts: 7
Credit: 23,843
RAC: 0
Message 92443 - Posted: 28 Mar 2020, 11:27:49 UTC

I just upgraded my RAM to 12 GB and freed up some space on my SSD (now 39 GB free), but the COVID-19 tasks still crash. I checked my CPU with the mprime torture test and my RAM with memtest86; both ran without problems, so everything seems stable. Apart from those COVID-19 tasks, nothing crashes when I use the computer.
[...]
27-Mar-2020 15:51:51 [---] [libc detection] gathered: 2.30, GNU libc
27-Mar-2020 15:51:51 [---] Host name: x-2017-1.local
27-Mar-2020 15:51:51 [---] Processor: 4 AuthenticAMD AMD Athlon(tm) II X4 620 Processor [Family 16 Model 5 Stepping 2]
27-Mar-2020 15:51:51 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
27-Mar-2020 15:51:51 [---] OS: Linux Fedora: Fedora release 31 (Thirty One) [5.5.10-200.fc31.x86_64|libc 2.30 (GNU libc)]
27-Mar-2020 15:51:51 [---] Memory: 11.45 GB physical, 3.74 GB virtual
27-Mar-2020 15:51:51 [---] Disk: 82.67 GB total, 39.07 GB free
[...]

Is it possible to make a backtrace of the crash and would it be helpful?
Thanks in advance!
ID: 92443
sspseudoo

Joined: 4 Mar 20
Posts: 7
Credit: 23,843
RAC: 0
Message 92614 - Posted: 30 Mar 2020, 12:15:11 UTC

Is this the same problem?

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658

And is it a good idea to add ralph@home in the BOINC client now to help testing the new adjusted binaries, when they are available?
https://ralph.bakerlab.org/
ID: 92614
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92621 - Posted: 30 Mar 2020, 13:36:31 UTC - in response to Message 92614.  
Last modified: 1 Apr 2020, 1:03:22 UTC

Is this the same problem?

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658

And is it a good idea to add ralph@home in the BOINC client now to help testing the new adjusted binaries, when they are available?
https://ralph.bakerlab.org/


That is certainly one cause of a COVID-19 task failing. But there seem to be others, such as machines that run for a longer period of time and then report out-of-memory errors.

Yes, it is a good time to have Ralph active. The work will only be available once in a while, but just let BOINC keep trying to get work.
Rosetta Moderator: Mod.Sense
ID: 92621
ToKamaK

Joined: 27 Mar 20
Posts: 4
Credit: 5,621,041
RAC: 2,964
Message 92650 - Posted: 30 Mar 2020, 20:08:41 UTC - in response to Message 92621.  
Last modified: 30 Mar 2020, 20:14:46 UTC

Good day,

I can confirm that SSSE3 is not supported, at least by the AMD Phenom II. Assembling and executing the following code triggers an Illegal instruction error:
$ cat ssse3-test.S
        .globl main
main:
        pshufb %xmm1,%xmm0
        ret
$ gcc -o ssse3-test ssse3-test.S
$ ./ssse3-test
Illegal instruction


I registered with ralph@home in the hope that this helps with further testing.

Kind Regards

Edited to add: on a CPU that does support SSSE3, this small program might still crash with a Segmentation fault. If the CPU triggers that error instead of Illegal instruction, it supports SSSE3.
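
A less invasive check, sketched here under the assumption of Linux on x86, is to look for the ssse3 token in the CPU feature list. The FEATURES string is copied from the "Processor features" line of the Athlon II X4 620 log posted earlier in this thread; the helper names has_feature and local_cpu_flags are hypothetical.

```python
# "Processor features" as reported by BOINC for the Athlon II X4 620 earlier in this thread.
FEATURES = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm "
    "3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni "
    "monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a "
    "misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv "
    "svm_lock nrip_save"
)


def has_feature(features, name):
    """Whole-token match, so that e.g. "sse" does not match "sse2" or "sse4a"."""
    return name in features.split()


def local_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU listed, or "" if they cannot be read."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return ""


# "pni" is the Linux name for SSE3; SSSE3 is a separate extension listed as "ssse3".
print("Athlon II X4 620: pni =", has_feature(FEATURES, "pni"),
      "| ssse3 =", has_feature(FEATURES, "ssse3"))

flags = local_cpu_flags()
if flags:
    print("This machine: ssse3 =", has_feature(flags, "ssse3"))
```

The posted feature list contains pni and sse4a but no ssse3 entry, which is consistent with the Illegal instruction result above.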
ID: 92650
Alexey Vazhnov

Joined: 26 Mar 20
Posts: 2
Credit: 0
RAC: 0
Message 92660 - Posted: 30 Mar 2020, 21:36:11 UTC - in response to Message 92650.  

@ToKamaK, thank you very much for the investigation!
ID: 92660
ToKamaK

Joined: 27 Mar 20
Posts: 4
Credit: 5,621,041
RAC: 2,964
Message 92784 - Posted: 31 Mar 2020, 17:50:35 UTC - in response to Message 92660.  

It is ShimmerFairy who deserves the thanks; identifying this particular issue took quite some patience. I only double-checked this particular CPU family with the hardware I had at hand. :)
ID: 92784
ToKamaK

Joined: 27 Mar 20
Posts: 4
Credit: 5,621,041
RAC: 2,964
Message 93473 - Posted: 5 Apr 2020, 9:49:25 UTC - in response to Message 92784.  

Greetings,
I just wanted to confirm that my first batch of Rosetta 4.12 jobs is reported to have completed and validated successfully.
Kind Regards.
ID: 93473


©2024 University of Washington
https://www.bakerlab.org