Rosetta x86 on AMD CPU

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 92933 - Posted: 1 Apr 2020, 15:41:00 UTC - in response to Message 92930.

Quoting Message 92930: "OK, sorry, I didn't have a chance to look up the specs. Please point out the WUs you described where you felt they had problems, but had no errors."

This WU failed in the manner that shimmerfairy described with a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650

These work units fail a little over 6 hours into their intended 8-hour run. There is nothing in the stderr.txt file to indicate why.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152718
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136159909
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136160035
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136161098
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136152216

That should be enough for you to look at.

Mod.Sense (Volunteer moderator) · Joined: 22 Aug 06 · Posts: 4018 · Credit: 0 · RAC: 0
Message 92944 - Posted: 1 Apr 2020, 16:04:04 UTC

OK, so signal 11 is the reported problem, in non-COVID tasks. Some fail almost immediately, others after many hours. At this point, I think it best to wait for the new application version and see what any new symptoms may look like. The new version will have a number of issues addressed, but I don't have further detail to be more specific.

The fact that the tasks don't seem to report any completed models implies each was still on its first model at the time of the failure.

Please see Admin bcov's post about the new application and creation of work units.

Rosetta Moderator: Mod.Sense

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 92948 - Posted: 1 Apr 2020, 16:14:34 UTC - in response to Message 92944.

OK, I think I will just go ahead and dump the work I was sent. I was sent WAY too much on the first scheduler connection, more than I can possibly finish before the deadline. All the currently running tasks are in EDF (earliest deadline first) mode. I was going to just let the excess expire naturally and be resent. But if there is little chance that most of the work will complete properly and award credit, I might as well wait for the new applications that parse the CPU features correctly.

William Albert · Joined: 22 Mar 20 · Posts: 28 · Credit: 2,241,799 · RAC: 308
Message 92950 - Posted: 1 Apr 2020, 16:22:33 UTC - in response to Message 92933.

Keith Myers wrote: "This WU failed in the manner that shimmerfairy described with a segfault. https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650"

The issue that shimmerfairy described is a compatibility issue specific to missing SSSE3 instructions on AMD K10 CPUs, and wouldn't occur on your Ryzen machine.

Your work units are segfaulting. While in the case of K10 the cause of the segfault is an invalid CPU instruction, WUs can also segfault as a result of malfunctioning hardware. If you haven't already done so, I would run some hardware stress tests to verify that the hardware is actually stable.

Mod.Sense (Volunteer moderator) · Joined: 22 Aug 06 · Posts: 4018 · Credit: 0 · RAC: 0
Message 92951 - Posted: 1 Apr 2020, 16:23:27 UTC - in response to Message 92948.

Ya, sorry, but that sounds like the best course ATM.

Rosetta Moderator: Mod.Sense

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 92958 - Posted: 1 Apr 2020, 17:34:00 UTC - in response to Message 92950. Last modified: 1 Apr 2020, 17:36:32 UTC

William Albert wrote: "The issue that shimmerfairy described is a compatibility issue specific to missing SSSE3 instructions on AMD K10 CPUs, and wouldn't occur on your Ryzen machine. Your work units are segfaulting. While in the case of K10 the cause of the segfault is an invalid CPU instruction, WUs can also segfault as a result of malfunctioning hardware. If you haven't already done so, I would run some hardware stress tests to verify that the hardware is actually stable."

I always run stress tests on all my machines to test for stability: many hours of stressapptest for the memory, and many hours of y-cruncher and Prime95 for both CPU and memory. No issues found or errors detected. I run all my other projects without errors on this host, as well as on all my other Ryzen hosts.

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 92959 - Posted: 1 Apr 2020, 17:35:34 UTC - in response to Message 92950.

William Albert wrote: "Your work units are seg faulting."

How can you state that?? I have had only one work unit fail with a segfault. All the other errors show no reason for the error.

William Albert · Joined: 22 Mar 20 · Posts: 28 · Credit: 2,241,799 · RAC: 308
Message 92961 - Posted: 1 Apr 2020, 17:51:48 UTC - in response to Message 92959. Last modified: 1 Apr 2020, 17:52:02 UTC

Looking at one of your work units as an example:

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: <snipped for brevity>
Starting watchdog...
Watchdog active.
</stderr_txt>
]]>

The error is right near the top:

<message> process got signal 11</message>

A "signal 11" is a segfault (SIGSEGV).
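For anyone who wants to check the signal number themselves, here is a minimal Python sketch that maps a number such as the 11 in "process got signal 11" to its symbolic name. The helper name `describe_signal` is made up for illustration; it relies only on the standard-library `signal` module, and the numbering shown is the one used on common x86 Linux systems.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number (hypothetical helper)."""
    try:
        # signal.Signals is an enum of the signals known on this platform.
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return f"unknown signal {signum}"


if __name__ == "__main__":
    # On Linux, signal 11 is expected to resolve to SIGSEGV.
    print(describe_signal(11))
```

This only translates the number. It says nothing about why the task received the signal, which could be a bad instruction, a software bug, or unstable hardware.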

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 92967 - Posted: 1 Apr 2020, 18:30:28 UTC - in response to Message 92961.

I was not aware that a signal 11 is a segfault. I have never received that on any of my other projects. When those projects error on a segfault, they state so explicitly in the stderr.txt output. That was also the case for the only Rosetta task I have had fail with what I know to be a segfault.
https://boinc.bakerlab.org/rosetta/result.php?resultid=1136163650

SIGSEGV: segmentation violation
Stack trace (18 frames):
[0xde75dcf] [0xf7f64b70] [0xd7cf60a] [0xc50e485] [0xc4f28d6] [0xc4fd52f] [0xc96437e] [0xc5f7215] [0xb265724] [0xb2a83b6] [0xb2af655] [0x8a87605] [0x8a88a7c] [0x8a4b3be] [0x8a555d0] [0x80548d3] [0xdf0cfd8] [0x8048131]
Exiting...
</stderr_txt>
]]>

Mad_Max · Joined: 31 Dec 09 · Posts: 209 · Credit: 29,616,294 · RAC: 4,638
Message 93156 - Posted: 3 Apr 2020, 4:19:57 UTC - in response to Message 92550. Last modified: 3 Apr 2020, 4:21:06 UTC

Quoting Message 92550: "Does Rosetta obey a max_concurrent statement in an app_config? I am having out-of-memory issues that prevent my GPU tasks from running, and I am not able to control this well using just the %cpu setting in Preferences."

Yes, it works fine. I have been using it for ~1.5 months already, ever since huge COVID WUs started to pop up and hog RAM.

But there is a little trick: R@H has 2 different application lines (rosetta and rosetta mini), so you need to set rules for both apps, or use the <project_max_concurrent> option instead of just max_concurrent to set the restriction at the whole-project level.

Reference for anyone who does not know how to use app_config for such things:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13644&postid=93152#93152
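To make the two options above concrete, here is a rough Python sketch that generates an app_config.xml either with a per-app max_concurrent rule for each application or with a single project-wide project_max_concurrent limit. The app names `rosetta` and `minirosetta` and the limit of 4 are assumptions for illustration only; check the actual app names in your client_state.xml before using anything like this. The function name `build_app_config` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, max_concurrent=None, project_max_concurrent=None):
    """Build app_config.xml text (illustrative sketch, not an official tool)."""
    root = ET.Element("app_config")
    if max_concurrent is not None:
        # One <app> block per application line, each with its own limit.
        for name in app_names:
            app = ET.SubElement(root, "app")
            ET.SubElement(app, "name").text = name
            ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if project_max_concurrent is not None:
        # A single limit covering every app of the project.
        ET.SubElement(root, "project_max_concurrent").text = str(project_max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed app names; verify them against client_state.xml.
    print(build_app_config(["rosetta", "minirosetta"], max_concurrent=4))
    print(build_app_config([], project_max_concurrent=4))
```

The resulting file goes in the project's folder inside the BOINC data directory, and the client needs to re-read its config files (or be restarted) to pick it up. Note that two per-app limits of 4 allow up to 8 Rosetta tasks in total, whereas the project-wide limit caps the sum.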

Ivailo Bonev · Joined: 9 May 07 · Posts: 16 · Credit: 6,196,220 · RAC: 1
Message 93174 - Posted: 3 Apr 2020, 7:44:07 UTC Last modified: 3 Apr 2020, 8:05:36 UTC

I think something is off with the CPU utilization of the new 4.12 app on the new Ryzen 3k systems. I have a 3800X, and with 4.07 I had consistent 79-80C temps under full load (CPU package power 105-107W). Now, with the 4.12 app, temps are 65-70C under full load (CPU package power 80-85W). What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs.

Keith Myers · Joined: 29 Mar 20 · Posts: 97 · Credit: 343,893 · RAC: 0
Message 93281 - Posted: 3 Apr 2020, 20:37:38 UTC

I'd like to find the post that explained the optimized changes in the new app. All I've seen is that the app is targeted at COVID-19.

entity · Joined: 8 May 18 · Posts: 23 · Credit: 10,249,932 · RAC: 0
Message 93286 - Posted: 3 Apr 2020, 21:27:19 UTC - in response to Message 93174.

Ivailo Bonev wrote: "I think something is off with the CPU utilization of the new 4.12 app on the new Ryzen 3k systems. I have a 3800X, and with 4.07 I had consistent 79-80C temps under full load (CPU package power 105-107W). Now, with the 4.12 app, temps are 65-70C under full load (CPU package power 80-85W). What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs."

We were seeing this same behavior over at WCG on the MIP project, which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago):

"The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime.

We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises. Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!"

It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop.
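As a back-of-the-envelope illustration of the cache contention described above, the Python sketch below divides a shared L3 cache evenly among the running instances. The 32 MB total is an illustrative figure, and the even split is a deliberate simplification: real caches are not partitioned this way, and the function name `l3_share_mb` is made up for this example.

```python
def l3_share_mb(l3_total_mb: float, instances: int) -> float:
    """Crude estimate of L3 cache per instance, assuming an even split."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return l3_total_mb / instances


if __name__ == "__main__":
    # Illustrative 32 MB L3 shared by a growing number of Rosetta instances.
    for n in (1, 4, 8, 16):
        print(f"{n:2d} instances -> {l3_share_mb(32, n):.1f} MB of L3 each")
```

The more instances run at once, the smaller each one's share becomes, so a working set that fit comfortably with one instance may start spilling to main memory. That would be consistent with the lower power draw reported above, though the sketch does not prove it.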

Ivailo Bonev · Joined: 9 May 07 · Posts: 16 · Credit: 6,196,220 · RAC: 1
Message 93352 - Posted: 4 Apr 2020, 6:19:02 UTC - in response to Message 93286.

entity wrote: "It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop."

Thank you for the answer and explanation. I now see the usual behavior from the CPU; maybe the data in the first batch with 4.12 was quite different and more "hungry for L3 cache".