Rosetta 4.0+

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88341 - Posted: 22 Feb 2018, 15:59:43 UTC - in response to Message 88337.  

[quote]I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons.[/quote]

I got a high percentage of errors with my Ryzen 1700, and noticed that a lot of other people with AMD chips have a high error rate too.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87833#87833

Turning off SMT in the BIOS eliminated the errors, but that of course reduces the output, and the credit per work unit still was considerably less than with my Intel chips (i7-3770 and i7-4770) of comparable speed.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874

Apparently something needs to be fixed (recompiled?) for the AMD chips, but no one at Rosetta has made any comment on the issue yet. I don't use my Ryzen here anymore. It works great on LHC, WCG, and all the other projects I have tried it on.
ID: 88341 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1845
Credit: 7,987,219
RAC: 8,801
Message 88343 - Posted: 22 Feb 2018, 17:30:37 UTC - in response to Message 88337.  

[quote]I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. Some WUs with this name run successfully, and the ones that fail all exceed the target CPU time by four hours before failing.[/quote]


3.78 or 4.x version?
ID: 88343 · Rating: 0
LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88344 - Posted: 22 Feb 2018, 19:33:31 UTC - in response to Message 88343.  

The errors are 4.x.

One thing I noticed is that the successful one created only 1 decoy. I have the target CPU time set at 12 hours and they fail at 16 hours.
Is there a chance that the watchdog is terminating these jobs as an overrun if they go 4 hours over the target time?
ID: 88344 · Rating: 0
LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88345 - Posted: 22 Feb 2018, 20:35:00 UTC - in response to Message 88341.  

[quote]I got a high percentage of errors with my Ryzen 1700, and noticed that a lot of other people with AMD chips have a high error rate too.[/quote]

Yeah, I saw where some Ryzens were posting segfaults, but the output from them (at least the ones I saw) had different errors from what I'm getting.
The Opterons are 61xx and don't do threads, so that rules out any threading issues.

I have an AMD FX machine on this project, and it hasn't had any problems, but that may be sheer chance since it does only a fraction of the work that the Opteron box does.
ID: 88345 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88347 - Posted: 22 Feb 2018, 23:29:35 UTC - in response to Message 88345.  
Last modified: 22 Feb 2018, 23:31:34 UTC

[quote]Yeah, I saw where some Ryzens were posting segfaults, but the output from them (at least the ones I saw) had different errors from what I'm getting.
The Opterons are 61xx and don't do threads, so that rules out any threading issues.

I have an AMD FX machine on this project, and it hasn't had any problems, but that may be sheer chance since it does only a fraction of the work that the Opteron box does.[/quote]

It has nothing to do with segfaults. I have the "fixed" version which avoids those. It is something specific to Rosetta, as it does not have unusual errors on any of the other projects I do. I don't really know what other AMD chips are affected, except that as a class they have a higher error rate than the Intels here.

And for some reason, my i7-4790 does better than expected against my i7-3770s or even my i7-4770, as exemplified in this comparison, which seems reasonably accurate from what I see:
https://boinc.bakerlab.org/rosetta/cpu_list.php
Note that the Ryzen 1700 does not do well in that list, though on LHC, WCG and all the others I have used it on it does at least as well, and usually better than even the Haswells (i7-4770, i7-4790). So it appears to me that Rosetta has room for considerable optimization if they wish to do so.
ID: 88347 · Rating: 0
LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 88348 - Posted: 23 Feb 2018, 0:02:13 UTC - in response to Message 88347.  

[quote]It has nothing to do with segfaults. I have the "fixed" version which avoids those. It is something specific to Rosetta, as it does not have unusual errors on any of the other projects I do. I don't really know what other AMD chips are affected, except that as a class they have a higher error rate than the Intels here.[/quote]

I see your point. Looking at another Opteron machine, I found the same job with the same error, except his blew at a little over 12 hours, which is likely because his target time is 8 hours.
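
The two data points so far (a 12-hour target failing at 16 hours, an 8-hour target failing at a little over 12) would fit a fixed four-hour grace period on top of the target CPU time. A minimal Python sketch of that guess follows. The four-hour constant and the function name are assumptions taken only from the numbers in this thread, not from the Rosetta source.

```python
from datetime import timedelta

# Assumed grace period, inferred only from the two observations in this thread.
ASSUMED_GRACE = timedelta(hours=4)


def expected_cutoff(target_cpu_time: timedelta) -> timedelta:
    """Return the CPU time at which a task would be stopped under this guess."""
    return target_cpu_time + ASSUMED_GRACE


# Targets of 8 h and 12 h match the reported failures; 16 h is the new setting.
for hours in (8, 12, 16):
    cutoff = expected_cutoff(timedelta(hours=hours))
    print(f"target {hours:2d} h -> cutoff {cutoff.total_seconds() / 3600:.0f} h")
```

If the guess holds, a task that needs more than the target plus four hours would fail no matter how the target is set, and raising the target only moves the cutoff.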

Some of these really do work when they take well over 12 hours, so I bumped my target time to 16 hours.

If they fail then, it appears that I have some decisions to make.

Thanks for your help.
ID: 88348 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1845
Credit: 7,987,219
RAC: 8,801
Message 88349 - Posted: 23 Feb 2018, 6:57:27 UTC - in response to Message 88347.  
Last modified: 23 Feb 2018, 6:57:50 UTC

[quote]Note that the Ryzen 1700 does not do well in that list, though on LHC, WCG and all the others I have used it on it does at least as well, and usually better than even the Haswells (i7-4770, i7-4790). So it appears to me that Rosetta has room for considerable optimization if they wish to do so.[/quote]


Could it be the compiler? Some time ago, in the optimization thread, we saw that they were using a very old version of GCC. I don't know whether they are using an updated version now.
GCC has some improvements and bug fixes for Ryzen from version 6.4 onwards.
ID: 88349 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88350 - Posted: 23 Feb 2018, 13:13:16 UTC - in response to Message 88349.  

[quote]GCC has some improvements and bug fixes for Ryzen from version 6.4 onwards.[/quote]

I was wondering about that, but don't know much about compiler versions. But I am wondering why they have not seen the difference themselves in their own statistics. It should be easy for them to monitor the performance of their apps and how well AMD compares to Intel. Maybe they just monitor the total output, and if it is enough, they don't worry about it further.
ID: 88350 · Rating: 0
Dadx
Joined: 11 Dec 07
Posts: 2
Credit: 36,830
RAC: 0
Message 88352 - Posted: 23 Feb 2018, 18:27:47 UTC

WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
Running on a Kindle HD 8, many tasks have dozens and sometimes hundreds of these messages.

The more of these I see in a stderr file, the longer the task takes to complete, though the elapsed time is much longer than the displayed and/or reported run time.
I've taken to rebooting the device to see if that gooses them into continuing (it works sometimes). If there is still no progress after 12-24, I abort them so they quickly get a chance to be run by someone else.

What do the messages mean, why would they be happening, and how can I reduce the likelihood of them occurring?

Below is an example of a task that was reported as successfully completed, with what I consider a small to moderate number of the warnings.

Regards,
DadX

<core_client_version>7.4.53</core_client_version>
<![CDATA[
<stderr_txt>
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu -use_filters true -silent_gz -mute all -abinitio::fastrelax -abinitio::rg_reweight 0.5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -in:file:native ab_12_01__vall_2011__1aiuA.pdb -in::file::fasta ab_12_01__vall_2011__1aiuA.fasta -psipred_ss2 ab_12_01__vall_2011__1aiuA.psipred_ss2 -kill_hairpins ab_12_01__vall_2011__1aiuA.nobuformat.psipred_ss2 -frag3 ab_12_01__vall_2011__1aiuA.200.3mers.index -fragA ab_12_01__vall_2011__1aiuA.200.9mers.index -fragB ab_12_01__vall_2011__1aiuA.200.3mers.index -nstruct 10000 -cpu_run_time 14400 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1841840
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 1488.94 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0
called boinc_finish(0)

</stderr_txt>
]]>
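
For anyone wanting to count these warnings across many stderr files, here is a rough Python sketch. The two type values match the `DT_VERNEED` and `DT_VERNEEDNUM` tags defined in `elf.h` (symbol-versioning entries). As far as I know, older Android dynamic linkers print this warning for dynamic-section entries they ignore, but whether the warnings have anything to do with the slow tasks is not established here. The function name and the shortened sample paths are made up for illustration.

```python
import re
from collections import Counter

# Dynamic-section tag values as defined in elf.h (GNU symbol versioning).
DT_NAMES = {
    0x6FFFFFFE: "DT_VERNEED",
    0x6FFFFFFF: "DT_VERNEEDNUM",
}

WARNING_RE = re.compile(
    r"unused DT entry: type (0x[0-9a-fA-F]+) arg (0x[0-9a-fA-F]+)"
)


def tally_dt_warnings(stderr_text: str) -> Counter:
    """Count 'unused DT entry' warnings per decoded tag name."""
    counts = Counter()
    for match in WARNING_RE.finditer(stderr_text):
        tag = int(match.group(1), 16)
        # Fall back to the raw hex value for tags not in the small table above.
        counts[DT_NAMES.get(tag, hex(tag))] += 1
    return counts


# Shortened, illustrative sample; real lines carry the full executable path.
sample = """\
WARNING: linker: rosetta_android: unused DT entry: type 0x6ffffffe arg 0x2810
WARNING: linker: rosetta_android: unused DT entry: type 0x6fffffff arg 0x2
"""
print(tally_dt_warnings(sample))
```

Running it over the stderr of a slow task and a fast task would at least show whether the warning count really tracks run time.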
ID: 88352 · Rating: 0
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 88395 - Posted: 1 Mar 2018, 6:22:59 UTC

Re: Android device 3182472, WU 880684086, task #976715445
Name ab_12_01__vall_2011_1eyvA_vall_2011_9mers_3mers_535141_9722_0
Workunit 880684086
Created 27 Feb 2018, 6:21:08 UTC
Sent 28 Feb 2018, 5:04:50 UTC
Report deadline 8 Mar 2018, 5:04:50 UTC
Received 28 Feb 2018, 13:34:05 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 0 (0x00000000)

Application version Rosetta for Android v4.07 arm-android-linux-gnu

<core_client_version>7.4.53</core_client_version>
<![CDATA[<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.07_arm-android-linux-gnu -use_filters true -silent_gz -mute all -abinitio::fastrelax -abinitio::rg_reweight 0.5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -in:file:native ab_12_01__vall_2011__1eyvA.pdb -in::file::fasta ab_12_01__vall_2011__1eyvA.fasta -psipred_ss2 ab_12_01__vall_2011__1eyvA.psipred_ss2 -kill_hairpins ab_12_01__vall_2011__1eyvA.nobuformat.psipred_ss2 -frag3 ab_12_01__vall_2011__1eyvA.200.3mers.index -fragA ab_12_01__vall_2011__1eyvA.200.9mers.index -fragB ab_12_01__vall_2011__1eyvA.200.3mers.index -nstruct 10000 -cpu_run_time 14400 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1839701
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 1387.59 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0
called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>ab_12_01__vall_2011_1eyvA_vall_2011_9mers_3mers_535141_9722_0_r1220379814_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
ID: 88395 · Rating: 0
Richard Bertrand
Joined: 11 Feb 09
Posts: 1
Credit: 229,640
RAC: 0
Message 88420 - Posted: 4 Mar 2018, 22:14:02 UTC

Something completely different: just now, I got a second warning from Malwarebytes Antimalware that the Windows executable contains ransomware code.
Malwarebytes deemed it so serious that the file was quarantined.

I got the first notice (for rosetta_4.07_windows_intelx86.exe) on March 1st; just now I got the one for rosetta_4.07_windows_x86_64.exe.

I let Rosetta "repair" itself because I didn't want to reboot the machine in order to restore the file from quarantine.

Has anyone else had the same experience?
Working on Windows 10 1709 64bit.
ID: 88420 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,160,504
RAC: 9,210
Message 88433 - Posted: 6 Mar 2018, 8:14:01 UTC - in response to Message 88420.  

[quote]Something completely different: just now, I got a second warning from Malwarebytes Antimalware that the Windows executable contains ransomware code.
Malwarebytes deemed it so serious that the file was quarantined.

I got the first notice (for rosetta_4.07_windows_intelx86.exe) on March 1st; just now I got the one for rosetta_4.07_windows_x86_64.exe.

I let Rosetta "repair" itself because I didn't want to reboot the machine in order to restore the file from quarantine.

Has anyone else had the same experience?
Working on Windows 10 1709 64bit.[/quote]

No, but I'd whitelist it.
I know Malwarebytes is supposed to be a good program, but I've found it more trouble than it's worth, only coming up with false positives on my machine.
ID: 88433 · Rating: 0
Dr Who Fan
Joined: 28 May 06
Posts: 62
Credit: 229,047
RAC: 85
Message 88446 - Posted: 8 Mar 2018, 21:00:13 UTC

Task CRASHED upon startup: Task# 978918569

Name: PF06353.11_bnd_aivan_SAVE_ALL_OUT_03_09_549478_551_0

Stderr output
<core_client_version>7.9.2</core_client_version>
<![CDATA[
<message>
couldn't start app: CreateProcess() failed - (unknown error)</message>
]]>


ID: 88446 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,160,504
RAC: 9,210
Message 88466 - Posted: 13 Mar 2018, 1:06:21 UTC - in response to Message 88188.  
Last modified: 13 Mar 2018, 1:07:04 UTC

Another week goes by, another 7 PF* tasks coming up with the same "nan" error after running to apparent completion

And 7 more in the last few days - all basically the same, except this time they all quote a path before the error and didn't run to completion. Looks like a minor coding error

PF12224.7_bnd_aivan_SAVE_ALL_OUT_03_09_549478_1191_0
PF10009.8_bnd_aivan_SAVE_ALL_OUT_03_09_549478_3911_0
PF10009.8_bnd_aivan_SAVE_ALL_OUT_03_09_549478_3915_0
PF05975.11_bnd_aivan_SAVE_ALL_OUT_03_09_549479_3925_0
PF02010.14_bnd_aivan_SAVE_ALL_OUT_03_09_549478_5257_0
PF05982.11_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1803_0

Application version Rosetta v4.07 windows_x86_64
File: C:\cygwin64\home\boinc\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: nan


PF11832.7_bnd_aivan_SAVE_ALL_OUT_03_09_549479_4185_0
Application version Rosetta v4.07 windows_intelx86
File: C:\cygwin\home\boinc\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: nan
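
For anyone wondering why a "nan" trips a plain range check: every ordered comparison against NaN is false, so NaN is never "between" two bounds. A small Python sketch of that behaviour follows. `check_chi` is a hypothetical stand-in that only mirrors the error text above; it is not the code in SingleResidueDunbrackLibrary.hh.

```python
import math


def check_chi(chi_degrees: float) -> float:
    """Hypothetical range check that mirrors the error message in the log."""
    # Every ordered comparison involving NaN is False, so NaN fails this test.
    if not (-180.0 <= chi_degrees <= 180.0):
        raise ValueError(
            f"chi angle must be between -180 and 180: {chi_degrees}"
        )
    return chi_degrees


print(check_chi(179.5))
try:
    check_chi(math.nan)
except ValueError as err:
    print(err)
```

So the check itself is doing its job; the open question is which earlier calculation produced the NaN in the first place.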

ID: 88466 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1845
Credit: 7,987,219
RAC: 8,801
Message 88503 - Posted: 21 Mar 2018, 11:06:17 UTC

981578612

<message>
(unknown error) - exit code -529697949 (0xe06d7363)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe -run:protocol jd2_scripting @flags_avb6_mouse_t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_avb6_mouse_t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3051551
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x775408F2
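
The negative exit code and the hex value are the same 32-bit number, once read as signed and once as unsigned. The short Python sketch below shows the conversion; the helper name is made up. 0xE06D7363 is the exception code the Microsoft Visual C++ runtime uses for C++ exceptions, and its low three bytes spell "msc" in ASCII.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


exit_code = -529697949
unsigned = to_unsigned32(exit_code)
print(hex(unsigned))

# The low three bytes of the code are ASCII characters.
tag = (unsigned & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")
print(tag)
```

The code only says that a C++ exception went unhandled; the "Out Of Memory" reason in the log is what narrows it down.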

ID: 88503 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88505 - Posted: 21 Mar 2018, 14:55:55 UTC - in response to Message 88503.  
Last modified: 21 Mar 2018, 14:58:45 UTC

delete
ID: 88505 · Rating: 0
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 88506 - Posted: 21 Mar 2018, 14:57:06 UTC - in response to Message 88503.  
Last modified: 21 Mar 2018, 15:00:54 UTC

[quote]981578612

<message>
(unknown error) - exit code -529697949 (0xe06d7363)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe[/quote]

It looks like the "AMD problem" (or one of them).
I was hoping that it did not happen on 4.07, which is apparently quite different than 3.78, but apparently it does.

I am seeing practically no errors on my i7-4790 (Ubuntu 16.04).
ID: 88506 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,028,883
RAC: 16,807
Message 88508 - Posted: 22 Mar 2018, 2:51:45 UTC - in response to Message 88506.  

[quote][quote]981578612

<message>
(unknown error) - exit code -529697949 (0xe06d7363)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe[/quote]

It looks like the "AMD problem" (or one of them).
I was hoping that it did not happen on 4.07, which is apparently quite different than 3.78, but apparently it does.

I am seeing practically no errors on my i7-4790 (Ubuntu 16.04).[/quote]



Someone should remind the Rosetta staff/developers that they have a BETA test site (Ralph) to shake down a new application for function and compatibility. It seems to be unused.

My first thought was that the 4.07 errors were AMD compatibility problems. The developers could examine the population of errors and see if there is a correlation between machines and the errors. I saw some on Intel machines too. The issue may also be from overclocking. When someone overclocks their machine, they run a small test until it fails and then back off the frequency. This just tells them about problems in the instruction set they use to test with. An overclocked CPU may actually fail earlier on a different instruction and then cause program failure.

Rosetta strips symbols, which makes it a little harder to tell how they build the 4.07 app. It looks like they have heavily reworked the source code to process four packed 32-bit single-precision variables in parallel (xmm registers). Version 3.78 uses only x87 80-bit extended-precision numbers. The 80-bit floating-point format has a 64-bit mantissa, which is cut to 52 bits when a value is stored to 64-bit memory. The single-precision mantissa is only 23 bits, so losing another 29 bits on the way to single precision introduces errors much more rapidly.
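
To get a feel for the size of that loss, here is a small Python sketch that rounds a double to single precision and back using the standard `struct` module. Python floats are IEEE 754 doubles, so the x87 80-bit format cannot be shown this way, and the helper name is made up.

```python
import struct


def through_float32(x: float) -> float:
    """Round a Python float (IEEE 754 double) to single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 0.1
as_single = through_float32(value)
print(f"double: {value!r}")
print(f"single: {as_single!r}")
print(f"relative error: {abs(as_single - value) / value:.2e}")
```

One rounding like this is tiny, but a long chain of single-precision operations accumulates error much sooner than the same chain in double precision.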

If they can get it working, it should make a pretty big difference. Looking at the 3.78 and 4.07 credits on the same machine, the 3.78 results are fairly stable. All the 3.78 jobs run the full time slot and yield the same credits. The 4.07 jobs seem to finish early, and their credits fall all over the place. 30,000 seconds (max) of 3.78 results in about 250 to 300 credits. 4.07 jobs run 15,000 to 30,000 (max) seconds and give 95 to 600 credits for a 2x range of run times.
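
Turning those figures into credits per hour makes the spread easier to see. The Python sketch below uses only the numbers quoted above. Which 4.07 credit goes with which run time is not stated, so the 4.07 figures are outer bounds (lowest credit over longest run, highest credit over shortest run), which is an assumption.

```python
def credits_per_hour(credits: float, seconds: float) -> float:
    """Convert a credit award and CPU seconds into credits per hour."""
    return credits * 3600 / seconds


# 3.78: about 250 to 300 credits for a full 30,000-second run.
low_378 = credits_per_hour(250, 30_000)
high_378 = credits_per_hour(300, 30_000)

# 4.07: outer bounds only, since the pairing of credit and run time is unknown.
low_407 = credits_per_hour(95, 30_000)
high_407 = credits_per_hour(600, 15_000)

print(f"3.78: {low_378:.1f} to {high_378:.1f} credits/hour")
print(f"4.07: {low_407:.1f} to {high_407:.1f} credits/hour")
```

On these bounds the 3.78 rate varies by a factor of 1.2, while the 4.07 rate could vary by more than a factor of 12.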
ID: 88508 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1845
Credit: 7,987,219
RAC: 8,801
Message 88510 - Posted: 22 Mar 2018, 7:53:51 UTC - in response to Message 88508.  

[quote]Someone should remind the Rosetta staff/developers that they have a BETA test site (Ralph) to shake down a new application for function and compatibility. It seems to be unused.[/quote]

+1.

[quote]My first thought was that the 4.07 errors were AMD compatibility problems. The developers could examine the population of errors and see if there is a correlation between machines and the errors. I saw some on Intel machines too. The issue may also be from overclocking.[/quote]

That seems strange to me.
1 - I've no Ryzen. My cpu is an old FX6300.
2 - I don't overclock.
ID: 88510 · Rating: 0
Darrell
Joined: 28 Sep 06
Posts: 25
Credit: 51,934,631
RAC: 0
Message 88533 - Posted: 26 Mar 2018, 3:50:36 UTC

I just found Rosetta 4.07 used 2,111,242,240 bytes (1.97 GIGAbytes) before my system crashed (i7-4770K, 8GB). This seems to be just a bit more than expected, so please take a look and fix the problem.
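
For reference, the conversion behind that figure, as a quick Python check. It only restates the byte count above in binary and decimal units.

```python
used_bytes = 2_111_242_240

gib = used_bytes / 2**30   # binary gigabytes (GiB)
gb = used_bytes / 10**9    # decimal gigabytes (GB)

print(f"{gib:.2f} GiB, {gb:.2f} GB")
```

The 1.97 figure is therefore in binary units; in decimal units the same usage is a little over 2.1 GB.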

I run SETI, EINSTEIN, and LHC in addition to Rosetta, so Rosetta can't have the whole machine!
ID: 88533 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org