Message boards : Number crunching : Rosetta 4.0+
Author | Message |
---|---|
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. I got a high percentage of errors with my Ryzen 1700, and noticed that a lot of other people with AMD chips have a high error rate too. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87833#87833 Turning off SMT in the BIOS eliminated the errors, but that of course reduces the output, and the credit per work unit was still considerably lower than with my Intel chips (i7-3770 and i7-4770) of comparable speed. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6777&postid=87874#87874 Apparently something needs to be fixed (recompiled?) for the AMD chips, but no one at Rosetta has commented on the issue yet. I don't use my Ryzen here anymore. It works great on LHC, WCG, and all the other projects I have tried it on. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2002 Credit: 9,790,281 RAC: 3,640 |
I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. Some WUs with this name run successfully, and the ones that fail all exceed the target CPU time by four hours before failing. Is this the 3.78 or the 4.x version? |
LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
The errors are 4.x. One thing I noticed is that the successful one created only 1 decoy. I have the target CPU time set at 12 hours and they fail at 16 hours. Is there a chance that the watchdog is terminating these jobs as an overrun if they go 4 hours over the target time? |
LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
i got a high percentage of errors with my Ryzen 1700, and noticed that a lot of other people with AMD chips have a high error rate too. Yeah, I saw where some Ryzens were posting segfaults, but the output from them (at least the ones I saw) had different errors from what I'm getting. The Opterons are 61xx and don't do threads, so that rules out any threading issues. I have an AMD FX machine on this project, and it hasn't had any problems, but that may be sheer chance since it does only a fraction of the work that the Opteron box does. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Yeah, I saw where some Ryzens were posting segfaults, but the output from them (at least the ones I saw) had different errors from what I'm getting. It has nothing to do with segfaults. I have the "fixed" version which avoids those. It is something specific to Rosetta, as it does not have unusual errors on any of the other projects I do. I don't really know what other AMD chips are affected, except that as a class they have a higher error rate than the Intels here. And for some reason, my i7-4790 does better than expected against my i7-3770s or even my i7-4770, as exemplified in this comparison, which seems reasonably accurate from what I see: https://boinc.bakerlab.org/rosetta/cpu_list.php Note that the Ryzen 1700 does not do well in that list, though on LHC, WCG and all the others I have used it on it does at least as well, and usually better than even the Haswells (i7-4770, i7-4790). So it appears to me that Rosetta has room for considerable optimization if they wish to do so. |
LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
It has nothing to do with segfaults. I have the "fixed" version which avoids those. It is something specific to Rosetta, as it does not have unusual errors on any of the other projects I do. I don't really know what other AMD chips are affected, except that as a class they have a higher error rate than the Intels here. I see your point. Looking at another Opteron machine, I found the same job with the same error, except his blew at a little over 12 hours, which is likely because his target time is 8 hours. Some of these really do work when they take well over 12 hours, so I bumped my target time to 16 hours. If they fail then, it appears that I have some decisions to make. Thanks for your help. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2002 Credit: 9,790,281 RAC: 3,640 |
Note that the Ryzen 1700 does not do well in that list, though on LHC, WCG and all the others I have used it on it does at least as well, and usually better than even the Haswells (i7-4770, i7-4790). So it appears to me that Rosetta has room for considerable optimization if they wish to do so. Could it be the compiler? Some time ago, in the optimization thread, we saw that they were using a very old version of GCC. I don't know whether they are using an updated version now. GCC has some improvements and bug fixes for Ryzen from version 6.4 onwards. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Gcc has some improvements and bugfix for Ryzen from 6.4 version onwards. I was wondering about that, but don't know much about compiler versions. But I am wondering why they have not seen the difference themselves in their own statistics. It should be easy for them to monitor the performance of their apps and how well AMD compares to Intel. Maybe they just monitor the total output, and if it is enough, they don't worry about it further. |
Dadx Send message Joined: 11 Dec 07 Posts: 2 Credit: 36,905 RAC: 0 |
WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 Running on a Kindle HD 8, many tasks have dozens and sometimes hundreds of these messages. The more of these I see in a stderr file, the longer the task takes to complete, though the elapsed time is much longer than the displayed and/or reported run time. I've taken to rebooting the device to see if that gooses them into continuing (works sometimes). After 12-24 hours of no progress I abort them so they quickly get a chance to be run by someone else. What do the messages mean, why would they be happening, and how can I reduce the likelihood of them occurring? Below is an example of a task that was reported as successfully completed with what I consider a small to moderate number of the warnings. Regards, DadX <core_client_version>7.4.53</core_client_version> <![CDATA[ <stderr_txt> WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu -use_filters true -silent_gz -mute all -abinitio::fastrelax -abinitio::rg_reweight 0.5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -in:file:native ab_12_01__vall_2011__1aiuA.pdb -in::file::fasta ab_12_01__vall_2011__1aiuA.fasta -psipred_ss2 ab_12_01__vall_2011__1aiuA.psipred_ss2 -kill_hairpins ab_12_01__vall_2011__1aiuA.nobuformat.psipred_ss2 -frag3 ab_12_01__vall_2011__1aiuA.200.3mers.index -fragA ab_12_01__vall_2011__1aiuA.200.9mers.index -fragB ab_12_01__vall_2011__1aiuA.200.3mers.index -nstruct 10000 -cpu_run_time 14400 -watchdog 
-boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1841840 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: 
../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: 
type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: 
../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6ffffffe arg 0x2810 WARNING: linker: ../../projects/boinc.bakerlab.org_rosetta/rosetta_android_4.06_arm-android-linux-gnu: unused DT entry: type 0x6fffffff arg 0x2 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 1488.94 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 called boinc_finish(0) </stderr_txt> ]]> |
James W Send message Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Re: Android device 3182472, WU 880684086, task #976715445 Name ab_12_01__vall_2011_1eyvA_vall_2011_9mers_3mers_535141_9722_0 Application version Rosetta for Android v4.07 arm-android-linux-gnu <core_client_version>7.4.53</core_client_version> |
Richard Bertrand Send message Joined: 11 Feb 09 Posts: 1 Credit: 229,640 RAC: 0 |
Something completely different: just now, I've got the second warning from Malwarebytes Antimalware that the Windows executable contains ransomware code. Malwarebytes deemed it so serious that it quarantined the file. I got the first notice (for rosetta_4.07_windows_intelx86.exe) on March 1st; just now I got the one for rosetta_4.07_windows_x86_64.exe. I let Rosetta "repair" itself because I didn't want to reboot the machine in order to restore the file from quarantine. Has anyone else had the same experience? I'm running Windows 10 1709 64-bit. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2145 Credit: 41,560,787 RAC: 8,098 |
Something completely different: just now, I've got the second warning from Malwarebytes Antimalware that de windows executable has ransomware code.... No, but I'd whitelist it. I know Malwarebytes is supposed to be a good program, but I've found it more trouble than it's worth, only coming up with false positives on my machine. |
Dr Who Fan Send message Joined: 28 May 06 Posts: 79 Credit: 273,880 RAC: 121 |
Task CRASHED upon startup: Task# 978918569 Name: PF06353.11_bnd_aivan_SAVE_ALL_OUT_03_09_549478_551_0 Stderr output |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2145 Credit: 41,560,787 RAC: 8,098 |
Another week goes by, another 7 PF* tasks coming up with the same "nan" error after running to apparent completion And 7 more in the last few days - all basically the same, except this time they all quote a path before the error and didn't run to completion. Looks like a minor coding error PF12224.7_bnd_aivan_SAVE_ALL_OUT_03_09_549478_1191_0 PF10009.8_bnd_aivan_SAVE_ALL_OUT_03_09_549478_3911_0 PF10009.8_bnd_aivan_SAVE_ALL_OUT_03_09_549478_3915_0 PF05975.11_bnd_aivan_SAVE_ALL_OUT_03_09_549479_3925_0 PF02010.14_bnd_aivan_SAVE_ALL_OUT_03_09_549478_5257_0 PF05982.11_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1803_0 Application version Rosetta v4.07 windows_x86_64 PF11832.7_bnd_aivan_SAVE_ALL_OUT_03_09_549479_4185_0 Application version Rosetta v4.07 windows_intelx86 |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2002 Credit: 9,790,281 RAC: 3,640 |
981578612 <message> |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
delete |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
981578612 It looks like the "AMD problem" (or one of them). I was hoping that it did not happen on 4.07, which is apparently quite different than 3.78, but apparently it does. I am seeing practically no errors on my i7-4790 (Ubuntu 16.04). |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,229,863 RAC: 3,372 |
981578612 Someone should remind the Rosetta staff/developers that they have a BETA test site (Ralph) to shake down a new application for function and compatibility. It seems to be unused. My first thought was that the 4.07 errors were AMD compatibility problems. The developers could examine the population of errors and see if there is a correlation between machines and the errors. I also saw some on Intel machines. The issue may also be from overclocking. When someone overclocks their machine, they run a small test until it fails and then back off the frequency. This only tells them about problems in the instructions their test uses. An overclocked CPU may actually fail earlier on a different instruction and then cause program failure. Rosetta strips symbols, which makes it a little harder to tell how they build the 4.07 app. It looks like they have heavily reworked the source code to process 4 PACKED, 32-bit single precision variables in parallel (xmm registers). Version 3.78 uses only x87 80-bit extended precision numbers. The 80-bit floating point format allows a 64-bit mantissa, which is truncated to 52 bits when stored to 64-bit memory. The single precision mantissa is only 23 bits, which introduces errors much more rapidly when you lose the 29 bits going to single precision. If they can get it working, it should make a pretty big difference. Looking at the 3.78 and 4.07 credits on the same machine, the 3.78 results are fairly stable. All the 3.78 jobs run the full time slot and yield the same credits. The 4.07 jobs seem to finish early and the credits fall all over the place. 30,000 seconds (max) of 3.78 results in about 250 to 300 credits. 4.07 jobs that run 15,000 to 30,000 (max) seconds give 95 to 600 credits for a 2x range of run times. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2002 Credit: 9,790,281 RAC: 3,640 |
Someone should remind the Rosetta staff/developers that they have a BETA test site (Ralph) to shake down a new application for function and compatibility. It seems to be unused. +1. My first thought was that the 4.07 errors were AMD compatibility problems. The developers could examine the population of errors and see if there is a correlation between machines and the errors. I also saw some on some Intel machines too. The issue may also be from overclocking. That seems strange to me. 1 - I've no Ryzen. My cpu is an old FX6300. 2 - I don't overclock. |
Darrell Send message Joined: 28 Sep 06 Posts: 25 Credit: 51,934,631 RAC: 0 |
I just found Rosetta 4.07 used 2,111,242,240 bytes (1.97 GIGAbytes) before my system crashed (i7-4770K, 8GB). This seems to be just a bit more than expected, so please take a look and fix the problem. I run SETI, EINSTEIN, and LHC in addition to Rosetta, so Rosetta can't have the whole machine! |
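
LarryMajor's two reports in this thread (a 12-hour target failing at 16 hours, and another machine with an 8-hour target failing at a little over 12) would both fit a fixed four-hour grace period on top of the target CPU time. The following is only a minimal sketch of that hypothesis as arithmetic. The function names, the half-hour tolerance, and the value 12.2 standing in for "a little over 12" are illustrative assumptions, and nothing here is taken from the Rosetta or BOINC source.

```python
# Hypothesis from the thread: a task is terminated once it runs
# GRACE_HOURS past the user's target CPU time. The constant is inferred
# from two forum reports only; it is an assumption, not a documented value.
GRACE_HOURS = 4.0


def expected_cutoff_hours(target_hours, grace_hours=GRACE_HOURS):
    """Run time (hours) at which a task would be killed under the hypothesis."""
    return target_hours + grace_hours


def looks_like_overrun(target_hours, failed_at_hours, tolerance_hours=0.5):
    """True if the failure time lies within tolerance of the hypothesised cutoff."""
    cutoff = expected_cutoff_hours(target_hours)
    return abs(failed_at_hours - cutoff) <= tolerance_hours


if __name__ == "__main__":
    # (target, observed failure time); 12.2 is a stand-in for "a little over 12".
    reports = [(12, 16.0), (8, 12.2)]
    for target, failed_at in reports:
        print(target, failed_at, looks_like_overrun(target, failed_at))
```

If a task with a 16-hour target later failed near 20 hours, that would be a third data point for the same pattern; a failure at some unrelated time would count against it.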
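
On DadX's linker warnings: in the ELF format the dynamic-section tags 0x6ffffffe and 0x6fffffff are DT_VERNEED and DT_VERNEEDNUM (symbol-versioning entries), and the warning says the Android linker does not use them. The messages come from the loader, so a plausible reading is that each pair is printed once per launch of the executable, which would make the number of pairs a rough count of restarts and would fit the "Too many restarts with no progress" line in the same log. That reading is an assumption, not something confirmed in this thread. A small sketch for counting the warnings in a saved stderr dump, with hypothetical function names and a shortened sample path:

```python
import re
from collections import Counter

# Matches one "unused DT entry" warning and captures path, type and arg.
WARNING_RE = re.compile(
    r"WARNING: linker: (\S+): unused DT entry: "
    r"type (0x[0-9a-fA-F]+) arg (0x[0-9a-fA-F]+)"
)


def count_linker_warnings(stderr_text):
    """Return a Counter mapping DT type (hex string) to number of warnings."""
    return Counter(m.group(2).lower() for m in WARNING_RE.finditer(stderr_text))


def estimated_launches(stderr_text):
    """Rough launch count, assuming each launch prints each warning once."""
    counts = count_linker_warnings(stderr_text)
    return max(counts.values(), default=0)


if __name__ == "__main__":
    # Shortened, made-up sample in the same shape as the log in the thread.
    sample = (
        "WARNING: linker: ./rosetta_android: unused DT entry: type 0x6ffffffe arg 0x2810 "
        "WARNING: linker: ./rosetta_android: unused DT entry: type 0x6fffffff arg 0x2 "
        "command: ./rosetta_android -watchdog "
        "WARNING: linker: ./rosetta_android: unused DT entry: type 0x6ffffffe arg 0x2810 "
        "WARNING: linker: ./rosetta_android: unused DT entry: type 0x6fffffff arg 0x2 "
    )
    print(count_linker_warnings(sample))
    print(estimated_launches(sample))
```

Comparing this count across tasks would show whether the slow tasks really are the ones that were restarted most often.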
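
rjs5's point about mantissa width (23 stored bits in single precision versus 52 in double, a difference of 29 bits) can be illustrated with a generic accumulation test. This sketch only shows the general rounding effect; it says nothing about what the Rosetta code actually computes, and the step size, iteration count, and function names are arbitrary choices for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit double) to the nearest 32-bit single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step, n, single):
    """Add `step` to a running total n times.

    With single=True the total is rounded to 32-bit precision after every
    addition, mimicking a single-precision accumulator.
    """
    total = 0.0
    for _ in range(n):
        total = total + step
        if single:
            total = to_float32(total)
    return total


if __name__ == "__main__":
    n = 100_000
    exact = n * 0.1  # nominal result, about 10000
    double_sum = accumulate(0.1, n, single=False)
    single_sum = accumulate(0.1, n, single=True)
    print("double error:", abs(double_sum - exact))
    print("single error:", abs(single_sum - exact))
```

The single-precision total drifts far more than the double-precision one because every addition is rounded to a much coarser grid once the total grows large.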
©2024 University of Washington
https://www.bakerlab.org