R@H Scientists/Coders: An analysis of the Rosetta binaries...

Author	Message
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78273 - Posted: 5 Jun 2015, 15:11:48 UTC

This thread is to bring exposure to the findings made by user *rjs5* back in the thread "Rosetta@home using AVX / AVX2 ?":

The executing code seems to be compiled for an i386 and uses the 387 floating-point 8-register stack model. The code (on my machine) spends about 5% of the time waiting for the "fmul st0,st1" (marked "====" below) to complete.

minirosetta_3.54_windows_x86_64.exe
Rosetta instruction clip ...

    address     instruction
    0x6b3d82    add ebx, ecx
    0x6b3d84    lea ebx, ptr [edi+ebx*8]
    0x6b3d87    fld st0, qword ptr [edi+eax*8]
    0x6b3d8a    mov eax, dword ptr [ebp-0x20]
    0x6b3d8d    mov edi, dword ptr [ebp-0x14]
    0x6b3d90    fmul st0, st1
    0x6b3d92    inc ecx
    =========================
    0x6b3d93    add eax, 0x8
    0x6b3d96    fsubr st0, qword ptr [ebx]
    0x6b3d98    add edx, 0x8

All CPUs since the Pentium 4 (November 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to be made this century but would use the SSE registers. The 16 directly addressable registers (16 in 64-bit builds; a 32-bit build sees 8 XMM registers) would reduce register spills to the stack and ease code scheduling (less shuffling of data around and more computation). A simple recompile should make a noticeable difference without any side effects.

If you target anything newer than SSE2, or GPUs, you have to start worrying about and managing the population of target machines you deliver workloads to. Beyond that, the developers would need to look more closely at the code.

Sure. Be happy to. There are many similar tools for both Windows and Linux. At the time I only had access to my Windows machine, so I got the output from the Intel VTune sampling profiler. For Linux environments I usually use "perf". Both will annotate the disassembly with the source code if you have the source, and with symbols if you have them. That makes tracking back to the specific source line easy.

They use the CPU EVENT counters and you can set the time or event domain to trigger on. I just used the default CYCLES and INSTRUCTION COMPLETIONS to find where the program was burning clocks. That tells you where you will get the biggest return for your effort.

Sample Haswell event list: https://code.google.com/p/likwid/wiki/Haswell

EXAMPLE: I was running 8 copies of minirosetta on my Ubuntu 64 machine and profiled ALL CPUs with the command:

    sudo perf record -a -- sleep 10

Run as sudo and record what all ("-a") the CPUs are doing. After the 10-second sleep, it dumps the samples, and then:

    sudo perf report --demangle

to process the counts and demangle the C++ symbols to a more readable form.

7.6% of the time was spent in the _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv function. Drilling down into that function ....

    7.60%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv
    4.95%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
    4.46%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK7numeric9xyzVectorIdE16distance_squaredERKS1_
    2.96%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
    2.60%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core7scoring7vdwaals10VDW_Energy19residue_pair_energyERKNS_12conformation7Re
    2.10%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core7scoring4elec13FA_ElecEnergy15score_atom_pairERKNS_12conformation7Residu
    2.05%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] memcpy
    1.94%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _int_malloc
    1.61%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _int_free
    1.34%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core10kinematics4tree10BondedAtom17update_xyz_coordsERNS0_4StubE
    1.33%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] malloc
    1.20%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core12conformation7Residue3xyzEj
    1.17%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core12conformation7Residue15atom_type_indexEj

Perf will open up a disassembly display with the HOT SPOT highlighted, which I marked with "=====".

Even though the file has x86_64 in its name, it is still a 32-bit application.

    file boinc.bakerlab.org_rosetta/gnu
    boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped

_ZN7numeric10MathMatrixIdE21inverse_square_matrixEv /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu

    0.07 │        add    $0x8,%ebx
    0.17 │        fmul   %st(1),%st
    0.44 │        fsubrl (%edx)
    0.56 │        fstpl  (%edx)
    0.21 │        add    $0x8,%edx
    0.08 │        cmp    %esi,-0xf4(%ebp)
         │      ↓ je     7eb
    0.75 │ 778:   fldl   (%eax)
    0.93 │        fmul   %st(1),%st
    3.35 │        mov    -0xf0(%ebp),%edi
    0.89 │        addl   $0x4,-0xf4(%ebp)
    1.63 │        fsubrl (%ecx)
    4.62 │        fstpl  (%ecx)
    1.80 │        fldl   (%ebx)
    0.30 │        fmul   %st(1),%st
    2.57 │        fsubrl (%edx)
    6.25 │        fstpl  (%edx)
    ===================
    1.70 │        fldl   0x8(%eax)
    0.09 │        fmul   %st(1),%st
    1.20 │        fsubrl 0x8(%ecx)
    4.50 │        fstpl  0x8(%ecx)
    1.87 │        fldl   0x8(%ebx)
    0.12 │        fmul   %st(1),%st
    0.84 │        fsubrl 0x8(%edx)
    4.62 │        fstpl  0x8(%edx)
    1.98 │        fldl   0x10(%eax)
    0.09 │        fmul   %st(1),%st
    0.95 │        fsubrl 0x10(%ecx)
    4.81 │        fstpl  0x10(%ecx)

I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H.
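The hot loop in the perf annotation repeats one pattern: load a double, multiply it by a factor held in st(1), subtract the product from a second value in memory, and store the result back. A minimal C sketch of a loop that could compile to that pattern follows. The function name `row_update` and its parameters are hypothetical. The actual source of `inverse_square_matrix` is not shown in this thread, so this only illustrates the instruction pattern and is not the Rosetta code.

```c
#include <stddef.h>

/* Hypothetical row update, as used in Gaussian elimination:
 *   dst[j] = dst[j] - factor * src[j]   for each column j.
 * In an x87 build each iteration becomes fld / fmul / fsubr / fstp.
 * With SSE2 enabled the compiler can instead use mulsd / subsd
 * (or the packed mulpd / subpd forms) on XMM registers. */
void row_update(double *dst, const double *src, double factor, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        dst[j] -= factor * src[j];
    }
}
```

Compiling a file like this once with `-m32 -mfpmath=387` and once with `-msse2 -mfpmath=sse`, then comparing the `objdump -d` output, is one way to see the difference rjs5 describes.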

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2203 Credit: 13,720,774 RAC: 27	Message 78275 - Posted: 5 Jun 2015, 15:44:11 UTC - in response to Message 78273.

"I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H."

This is the documentation of Rosetta Commons. I don't know if the code is the same as R@H's.

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 78277 - Posted: 5 Jun 2015, 18:05:54 UTC

To summarize, rjs5 used some software (the Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of GCC. In short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 78299 - Posted: 13 Jun 2015, 19:14:16 UTC - in response to Message 78277.

"To summarize, rjs5 used some software (Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of the GCC, and in short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC."

The tools needed on Linux are available to all Linux users. Just start up a bunch of R@H tasks, use "perf" to monitor all the system CPUs for your time period, and use "perf" to display the results. I used "objdump" to disassemble the binary and find the "perf" program counter address in the objdump output. If you have SOURCE, objdump will add the source code to the dump.

The equally good stuff on Windows seems to be mostly retail stuff.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2203 Credit: 13,720,774 RAC: 27	Message 78327 - Posted: 18 Jun 2015, 10:21:42 UTC

It's a pity there are no admins in this thread...

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78343 - Posted: 24 Jun 2015, 18:41:26 UTC

I just rebuilt the Windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to Ralph soon. Thanks for the helpful input and suggestions for optimizations etc.

Dirk Broer Joined: 16 Nov 05 Posts: 22 Credit: 3,922,682 RAC: 0	Message 78344 - Posted: 24 Jun 2015, 19:14:12 UTC - in response to Message 78343.

"I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon. Thanks for the helpful input and suggestions for optimizations etc."

As SSE2 has been around since the Pentium 4 (2001), can we expect new versions with SSE3 (2004), SSSE3 (2006), SSE4 (2006), AES (2008), AVX (2008), F16C (AMD: 2009/Intel: 2001), and/or FMA instructions (2011-2013) at Ralph soon too?

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78345 - Posted: 24 Jun 2015, 20:28:24 UTC

For the immediate future, I can test whatever optimizations are possible given the version of Visual Studio we currently have, which is 2010.

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 78346 - Posted: 24 Jun 2015, 20:46:03 UTC - in response to Message 78345.

"For the immediate future, I can test whatever optimizations are possible given the version of visual studio we currently have which is 2010."

Looks like auto-vectorization is only supported by newer versions of Visual Studio, e.g. from 2012 onwards. That means no AVX2. I think SSE2 is the best we get. There's no way other than updating your compiler infrastructure.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2203 Credit: 13,720,774 RAC: 27	Message 78347 - Posted: 24 Jun 2015, 20:57:54 UTC - in response to Message 78343.

"I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon."

Great!!

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78348 - Posted: 24 Jun 2015, 21:28:16 UTC

I'll also look into a VS upgrade.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2203 Credit: 13,720,774 RAC: 27	Message 78349 - Posted: 25 Jun 2015, 6:56:12 UTC - in response to Message 78348.

"I'll also look into a VS upgrade."

According to this source, MS will release VS2015 during this summer... :-)

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 78351 - Posted: 25 Jun 2015, 18:58:05 UTC

rjs5 did his testing on Linux, and I believe he said that the version of GCC used for compiling the Linux binaries was also incredibly outdated (for some reason I want to say he mentioned something like it being 8+ versions behind) and that upgrading GCC on Linux would also yield some easy performance improvement without any change to the code base.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 78353 - Posted: 26 Jun 2015, 16:09:16 UTC

I'm not too sure if the apps can be made available as 'additional binaries', i.e. we could have a 'lowest common denominator' build made available to the general cohort, and there could be specific binaries optimised for the newer chips, which for that matter may not even run on chips a generation earlier. Those binaries would probably not be downloaded automatically, but those who are keen could optionally install them by following some instructions.

-------

On another note, I found that the Rosetta Commons code is apparently available at 'no charge' only under an 'academic license', while a commercial license costs some 40k per site. This would probably limit the feasibility of, say, a member of the public building '3rd party binaries' that could be used with Rosetta@home.

https://c4c.uwc4c.com/express_license_technologies/rosetta

Just 2 cents.

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 78354 - Posted: 26 Jun 2015, 16:31:38 UTC - in response to Message 78353.

"I'm not too sure if the apps can be made available as 'additional binaries' i.e. we can have a 'lowest common denominator' made available to the general cohort."

Actually, all modern compilers support multiple code paths built into a single binary and handle this type of fall-back automatically. No need for all the complexity to be handled on the BOINC side.

My comment was simply that not only should SSE2 be enabled in VS2010, but also that the Linux versions should be recompiled with an updated version of GCC rather than the very old version they appear to be built with.

Cheers!
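As an illustration of several code paths living in one binary, here is a minimal sketch of explicit runtime dispatch using the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` together with the `target` function attribute. The function names (`dot`, `dot_baseline`, `dot_avx2`) are hypothetical, and the dispatch here is written by hand rather than generated automatically. How much a given compiler does on its own varies, so treat this as one possible approach rather than what any Rosetta build does.

```c
#include <stddef.h>

/* Baseline version: runs on any CPU the binary supports. */
static double dot_baseline(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
/* Same source, but the compiler is allowed to emit AVX2 code here. */
__attribute__((target("avx2")))
static double dot_avx2(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
#define HAVE_AVX2_PATH 1
#else
#define HAVE_AVX2_PATH 0
#endif

/* Picks a code path at run time and falls back to the baseline version. */
double dot(const double *a, const double *b, size_t n)
{
#if HAVE_AVX2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return dot_avx2(a, b, n);
#endif
    return dot_baseline(a, b, n);
}
```

Callers only ever see `dot`, so a single download can serve both old and new CPUs. The cost is that every hot function needs such a wrapper, or the compiler's own multi-versioning support where available.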

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78355 - Posted: 26 Jun 2015, 18:19:48 UTC

I built a 64-bit Linux version with gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) with the "-msse4.2" option. Suggestions for the best optimization options would be greatly appreciated, since this is all a bit of voodoo to me, keeping in mind general compatibility without having to have specific BOINC builds other than 32- and 64-bit versions. Also keep in mind that the Rosetta code will likely not gain much from vectorization optimizations, but any gain is good if it's just a matter of updating compiler options. Thanks for all your input!
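One way to confirm which instruction set a given set of flags actually selected is to check the macros that GCC and Clang predefine from their `-m` options (for example `__SSE4_2__` when `-msse4.2` is in effect). The sketch below is only an illustration. The helper names `compiled_isa` and `print_compiled_isa` are made up for this example and are not part of Rosetta or BOINC.

```c
#include <stdio.h>

/* Returns the highest x86 vector extension this translation unit was
 * compiled for, judged by macros GCC and Clang predefine from -m flags. */
const char *compiled_isa(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline (no SSE2 macro defined)";
#endif
}

/* Convenience wrapper for logging the result at program start. */
void print_compiled_isa(void)
{
    printf("compiled for: %s\n", compiled_isa());
}
```

Logging this string at startup would make it easy to tell from a task's output which build variant a volunteer's machine actually ran.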

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 78356 - Posted: 26 Jun 2015, 19:43:58 UTC

I'm not too sure about the compiler flags either, as aggressive optimization can also break things. I still recommend a newer version of gcc. The one you have is ancient. You should probably check out the "mtune" option. -O2 should be the maximum.

Flags for gcc

For fun and profit (if you own a Haswell CPU and a newer compiler) you could try compiling the source code with the -march=native flag and compare results with a non-AVX2 version.

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78357 - Posted: 26 Jun 2015, 20:48:14 UTC

It would not just be for fun. These optimizations, which our scientists and developers aren't that familiar with, can have great benefits if the speedup is significant. I'll give that a try and see how things improve. The Linux 64-bit build with SSE4.2 and gcc 4.4.7 does seem to show a more significant improvement than our Windows SSE2 version, around 12% on my quick test. I need to do more thorough tests though, particularly for the Windows builds, but judging from this Linux improvement, it may be worthwhile to upgrade to VS 2015 when it comes out.

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 78358 - Posted: 27 Jun 2015, 5:28:08 UTC - in response to Message 78355.

"Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions."

If you can, I'd really suggest reaching out to user rjs5 via a private message. He seems to have a strong technical understanding of the various compiler options, more than most of us talking heads on these forums tend to, and he seemed very willing to help.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 78361 - Posted: 28 Jun 2015, 20:32:36 UTC - in response to Message 78358. Last modified: 28 Jun 2015, 20:35:58 UTC

"Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions."

"If you can, I'd really suggest reaching out to user rjs5 via a private message. He seems to have a strong technical understanding of the various compiler options more than most of us talking heads on these forums tend to, and he seemed very willing to help."

You (Timo) pinged me with a message through the board but I infrequently stop to pick them up.

I think running the BETA program through RALPH is dumb. They could/should simply define a NEW "Beta OPT IN" project OPTION on this Rosetta board and build upon their current contributors, including me and my Haswell machine. Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.

There will be a couple of barriers on the way to real performance improvement. I think that getting an optimized version for both Windows and Linux is a challenge that most people overlook. I have downloaded a number of other project sources to poke around, but I have never really built a version because I did not want to mess up my running projects. For this one, I would work with David E. K. to see if I could help.

You have to be careful about determining how much improvement you "got". Windows is VERY AGGRESSIVE about using TURBO mode, and when you start optimizing Windows code, the CPU will heat up faster as you start switching more transistors and will drop out of TURBO mode earlier. Your code gets faster but it overheats your system. Linux systems are FAR LESS AGGRESSIVE about using TURBO, and the performance benefits of improving code are "more visible" to the person with the stop watch.

I would watch the Windows CPU temperature and frequency with one of the many TOOLS available. I tend to use SPECCY, which has been good to me: https://www.piriform.com/speccy
I loaded XSensors on my Linux system to monitor CPU temperature.

If you want to watch the CPU temperature go NUTS, run prime95 while watching the temperature and frequency. Even PrimeGrid apps stress my liquid-cooled machine:

    Name:           Intel Core i7 5930K
    Cores:          6
    Threads:        12
    Code Name:      Haswell-E/EP
    Package:        Socket 2011 LGA
    Technology:     22nm
    Specification:  Intel Core i7-5930K CPU @ 3.50GHz

The stages of performance improvement.

1. SSE2: The first will be to migrate to 64-bit floating point from the old x87 80-bit floating point. x87 80-bit was supported by Intel but not by any of their RISC competitors during the "RISC vs CISC" wars. x87 80-bit register values are rounded to 64 bits when stored to memory as doubles, so depending on the code, the 80-bit FP values can be narrowed to 64-bit values at various points in the computation, which lets error creep into calculations at different rates. If they are able to get satisfactory results with the SSE2 options, which TURN OFF the x87, then all other options are open.

2. VECTOR INSTRUCTIONS: The second level of optimization will then be to make sure their algorithms are written so the compiler can VECTORIZE them. The SSE2 instructions operate on 128-bit XMM registers that can do 2 64-bit FP operations, 4 32-bit operations, or 8 16-bit operations in a similar number of clocks. If they do 64-bit FP operations in a loop where 32-bit operations are OK, then they are losing 50% of the performance while executing the code. The performance loss percentage gets bigger as the size of the vector register increases.

There are a number of things that can be done in the program to "encourage" the compiler to vectorize the code automatically, and the compilers are getting better. The developer can also use intrinsic statements to force the compiler to use vector instructions. Intel has an Intrinsics Guide online at https://software.intel.com/sites/landingpage/IntrinsicsGuide/

There are also hand-optimized libraries supported by Intel and open source groups (with the help of Intel) that developers can include in their code. Intel MKL and IPP libraries are, I think, available to educational institutions for distribution.

3. VECTOR SIZE: SSE2 operates on the 128-bit XMM registers. AVX introduced the 256-bit YMM registers for floating-point operations, and AVX2 added 256-bit INTEGER vector instructions. AVX-512 (sometimes called AVX3; Skylake Xeon and Xeon Phi) will operate on 512-bit vector registers.

When using the VECTOR operations, the compiler will choose either the SCALAR (x87-like, do it one at a time) operations or the PARALLEL (PACKED) operations that do multiple operations at once. The goal of the developer is to code the algorithm so it uses the PARALLEL or PACKED operations.

    Parallel    Scalar
    ADDPS       ADDSS    - Adds operands
    SUBPS       SUBSS    - Subtracts operands
    MULPS       MULSS    - Multiplies operands
    DIVPS       DIVSS    - Divides operands

You want to write your code so it uses PACKED or PARALLEL operations. Scalar code will give you a few percent performance improvement. PARALLEL will give you MULTIPLE times performance improvement.
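To make the scalar versus packed distinction concrete, here is a small sketch using SSE2 intrinsics from `<emmintrin.h>`. It uses doubles, so the packed instructions involved are MULPD/ADDPD rather than the single-precision ADDPS/MULPS forms listed in the post. The function names `scale_add_scalar` and `scale_add_packed` are hypothetical and the loop is a generic example, not code taken from Rosetta.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar: one multiply and one add per loop iteration. */
void scale_add_scalar(double *y, const double *x, double a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Packed: two doubles per 128-bit XMM register per iteration. */
void scale_add_packed(double *y, const double *x, double a, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);           /* broadcast a to both lanes */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);        /* unaligned load of x[i], x[i+1] */
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx)); /* y += a * x, two lanes at once */
        _mm_storeu_pd(y + i, vy);
    }
#endif
    for (; i < n; ++i)   /* remainder, or the whole loop without SSE2 */
        y[i] += a * x[i];
}
```

With a recent compiler and optimization enabled, the scalar version is often auto-vectorized into much the same instructions, which is the cheaper route when it works. Intrinsics are the fallback for loops the compiler declines to vectorize.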