Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
Author | Message |
---|---|
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,487,487 RAC: 846 |
Sure. Be happy to. There are many similar tools for both Windows and Linux. At the time I only had access to my Windows machine, so I got the output from the Intel VTune sampling profiler. For Linux environments I usually use "perf". Both will annotate the disassembly with the source code if you have the source and symbols. That makes tracking back to the specific source line easy. They use the CPU EVENT counters, and you can set the time or event domain to trigger on. I just used the default CYCLES and INSTRUCTION COMPLETIONS to find where the program was burning clocks. That tells you where you will get the biggest return for your effort.

Sample Haswell event list: https://code.google.com/p/likwid/wiki/Haswell

EXAMPLE: I was running 8 copies of minirosetta on my 64-bit Ubuntu machine and profiled ALL CPUs with the command:

    sudo perf record -a -- sleep 10

Run as sudo and record what all the CPUs ("-a") are doing. After the 10 second sleep, it dumps the samples. Then:

    sudo perf report --demangle

processes the counts and demangles the C++ symbols to a more readable form. 7.6% of the time was spent in the numeric10MathMatrixIdE21inverse_square_matrixEv function. Drilling down into that function ....

    7.60% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv
    4.95% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
    4.46% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK7numeric9xyzVectorIdE16distance_squaredERKS1_
    2.96% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
    2.60% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core7scoring7vdwaals10VDW_Energy19residue_pair_energyERKNS_12conformation7Re
    2.10% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core7scoring4elec13FA_ElecEnergy15score_atom_pairERKNS_12conformation7Residu
    2.05% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] memcpy
    1.94% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _int_malloc
    1.61% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _int_free
    1.34% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core10kinematics4tree10BondedAtom17update_xyz_coordsERNS0_4StubE
    1.33% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] malloc
    1.20% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core12conformation7Residue3xyzEj
    1.17% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core12conformation7Residue15atom_type_indexEj

Perf will open a disassembly display with the HOT SPOT highlighted; I marked it with "=====". Even though the file has x86_64 in its name, it is still a 32-bit application.

    file boinc.bakerlab.org_rosetta/*gnu
    boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped

    _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu
    0.07 │ add $0x8,%ebx
    0.17 │ fmul %st(1),%st
    0.44 │ fsubrl (%edx)
    0.56 │ fstpl (%edx)
    0.21 │ add $0x8,%edx
    0.08 │ cmp %esi,-0xf4(%ebp)
         │ ↓ je 7eb
    0.75 │ 778: fldl (%eax)
    0.93 │ fmul %st(1),%st
    3.35 │ mov -0xf0(%ebp),%edi
    0.89 │ addl $0x4,-0xf4(%ebp)
    1.63 │ fsubrl (%ecx)
    4.62 │ fstpl (%ecx)
    1.80 │ fldl (%ebx)
    0.30 │ fmul %st(1),%st
    2.57 │ fsubrl (%edx)
    6.25 │ fstpl (%edx) ===================
    1.70 │ fldl 0x8(%eax)
    0.09 │ fmul %st(1),%st
    1.20 │ fsubrl 0x8(%ecx)
    4.50 │ fstpl 0x8(%ecx)
    1.87 │ fldl 0x8(%ebx)
    0.12 │ fmul %st(1),%st
    0.84 │ fsubrl 0x8(%edx)
    4.62 │ fstpl 0x8(%edx)
    1.98 │ fldl 0x10(%eax)
    0.09 │ fmul %st(1),%st
    0.95 │ fsubrl 0x10(%ecx)
    4.81 │ fstpl 0x10(%ecx) |
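
The repeated fldl / fmul / fsubrl / fstpl groups in the listing above are consistent with a row-elimination step of a Gauss-Jordan style inversion, where one factor is applied both to a row of the matrix and to the matching row of the inverse being built. The C sketch below is only a guess at the shape of that loop, not the Rosetta source; the function name `eliminate_row` and all parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical reconstruction of the hot loop: subtract factor * pivot row
 * from a target row, for the matrix and for the inverse under construction.
 * Names are illustrative only and are not taken from Rosetta.
 */
void eliminate_row(double *a_row, const double *a_pivot,
                   double *inv_row, const double *inv_pivot,
                   double factor, size_t n)
{
    assert(a_row != NULL && a_pivot != NULL);
    assert(inv_row != NULL && inv_pivot != NULL);

    for (size_t j = 0; j < n; j++) {
        /* x87 form: fldl (load), fmul (scale), fsubrl (mem - st), fstpl (store) */
        a_row[j]   -= factor * a_pivot[j];
        inv_row[j] -= factor * inv_pivot[j];
    }
}
```

In a 32-bit x87 build each element goes through the FP stack as a load, a multiply, a reverse subtract and a store, which matches the four-instruction groups in the profile. Built for SSE2 or AVX, a compiler can often handle several elements of a loop like this per instruction, provided it can establish that the rows do not overlap.
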
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
All post-Pentium 4 CPUs (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to be made this century but would use the SSE registers. Who cares about 15-year-old CPUs??? A single modern 16-core Xeon is hundreds of times more powerful. Beyond that, the developers would need to look more closely at the code. I hope the Rosetta admins read your post. SSE2 may be a first step. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,487,487 RAC: 846 |
All post-Pentium 4 CPUs (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to be made this century but would use the SSE registers.

The applications seem to be built on Red Hat RHEL4, which is not too old and still in corporate use. It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention.

By comparison, World Community Grid Mapping Cancer Markers is built with gcc 4.4.7. A similar clip of its code shows a similar bottleneck in a similar USE-AFTER-COMPUTATION, where the result of a divide operation (divsd xmm2, qword ptr [rsi]) is used in the next instruction (xorpd xmm2, xmm6). The density of COMPUTE instructions compared to data spill/fill memory writes/reads is much higher. You see references to "r8d", "rdx" ... 64-bit registers, which you don't see with Rosetta. A 64-bit recompile would allow the compilers to use both the extra registers and the SSE2 instructions, which dramatically reduces the excess memory operations.

A VTune clip of World Community Grid Mapping Cancer Markers ....

    Address      Assembly
    0x140058e2e  sub r8, 0x1
    0x140058e32  mulsd xmm0, qword ptr [rdx-0x8]
    0x140058e37  addsd xmm2, xmm0
    0x140058e3b  jnz 0x140058e22 <Block 12>
    0x140058e3d  Block 13:
    0x140058e3d  movsd xmm0, qword ptr [rsi]
    0x140058e41  xor r8d, r8d
    0x140058e44  mulsd xmm0, qword ptr [r10+rbx*1]
    0x140058e4a  subsd xmm2, xmm0
    0x140058e4e  divsd xmm2, qword ptr [rsi]
    0x140058e52  xorpd xmm2, xmm6 ======================= stall on waiting for xmm2 results
    0x140058e56  comisd xmm13, xmm2
    0x140058e5b  movsd qword ptr [r10], xmm2
    0x140058e60  jbe 0x140058e65 <Block 15>
    0x140058e62  Block 14:
    0x140058e62  mov qword ptr [r10], r8
    0x140058e65  Block 15:
    0x140058e65  movsd xmm1, qword ptr [r10]
    0x140058e6a  movapd xmm0, xmm1
    0x140058e6e  subsd xmm0, qword ptr [r10+rbx*1]
    0x140058e74  andpd xmm0, xmm9 |
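
Whether a build evaluates floating point on the x87 stack or in the SSE registers can be checked from inside a C program. The sketch below uses only standard C: `FLT_EVAL_METHOD` from `<float.h>` and the size of a pointer. The helper names are hypothetical. With GCC, a 32-bit x86 build typically reports 2 (x87, evaluation as long double) unless it is compiled with options such as `-msse2 -mfpmath=sse`, while an x86_64 build typically reports 0; treat those values as typical rather than guaranteed.

```c
#include <float.h>
#include <limits.h>
#include <stddef.h>

/* Width of a pointer in bits: 32 for an i386 build, 64 for an x86_64 build. */
int pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/*
 * FLT_EVAL_METHOD (C99, <float.h>):
 *   0 -> expressions are evaluated in their declared type
 *   1 -> float is evaluated as double
 *   2 -> float and double are evaluated as long double
 *   anything else is indeterminate or implementation-defined
 */
int fp_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}

/* Human-readable summary of the value above. */
const char *fp_eval_description(void)
{
    switch (fp_eval_method()) {
    case 0:  return "declared type (typical for SSE2 scalar math)";
    case 1:  return "float evaluated as double";
    case 2:  return "long double (typical for x87 math)";
    default: return "indeterminate or implementation-defined";
    }
}
```

Printing `pointer_bits()` and `fp_eval_description()` from a test program built with the same flags as the application gives a quick view of which register model the compiler was asked to target.
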
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention. GCC 4.1.2 is from February 2007.... I understand that it's difficult to constantly update the software (GCC is now at 5.1), but 8 years is too much. Please, admins, use the ralph@home server to test these updates (eventually). |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
Didn't know R@H was that unoptimized. We do not know whether the admins read this thread, and we do not even know whether they are interested... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,487,487 RAC: 846 |
It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention. I had to go see what "RALPH@HOME" was. You mentioned it several times. It gave me a good chuckle. They could have just put another CHECKBOX on the Rosetta@Home "Edit Rosetta@home preferences" options page. They would not have had to build the duplicate project. Few people will add RALPH just to run Rosetta ALPHA versions. Many more would click the opt-in option on the preferences page. Much of any performance increase beyond compiling for SSE2 comes from understanding the program and avoiding subtle coding problems. Keeping tight control on the data type sizes is one commonly overlooked problem. |
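
One reading of "keeping tight control on the data type sizes" is avoiding accidental mixing of `float` and `double` in hot loops. In C an unsuffixed literal such as `1.1` is a `double`, so multiplying a `float` by it is specified as a widen, multiply, narrow sequence. The sketch below illustrates that general C behaviour with hypothetical function names; it is not taken from Rosetta, and rjs5 may have had other cases in mind.

```c
#include <stddef.h>

/*
 * Mixed widths: 1.1 is a double literal, so each float element is
 * converted to double, multiplied, and converted back on the store.
 */
void scale_mixed(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * 1.1;
}

/* Consistent widths: 1.1f keeps the whole expression in float. */
void scale_float(float *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * 1.1f;
}
```

The two functions can differ in the last bit of the result, and the mixed version carries extra conversion work per element and halves the number of values that fit in a vector register.
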
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
Few people will add RALPH just to run Rosetta ALPHA versions. Many more would click the opt-in option on the preferences page. I don't agree with you. 1) Rosetta and Ralph are on 2 different servers with the same SW version/configuration. If the admins want to try some updates/upgrades, it's better to test them in the alpha config than on the production server. 2) Do not underestimate Ralph's volunteers. :-) Much of any performance increase beyond compiling for SSE2 comes from understanding the program and avoiding subtle coding problems. Keeping tight control on the data type sizes is one commonly overlooked problem. +1 (obviously, it would be better if the Rosetta admins had updated tools/servers/debuggers/etc) |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
I doubt this thread is being read by any admins at this point given that until rjs5 brought some tangible analysis, this thread was otherwise mostly hot air and knee-jerk suggestions about adding support for different technologies which may be good ideas but may also be too technically challenging or time consuming to come onto the roadmap any time soon. ... but this thing about upgrading the compiler from the *ancient* version they currently build with is truly low hanging fruit! Thus, I would encourage rjs5 to consider spinning up a new thread strictly showcasing the details he/she was able to pull together. Lay it out as a new thread, call it 'An analysis of the Rosetta binaries...' or something that shows that rjs5 has already done the legwork, and in the first post break down what was found, maybe even add some charts or graphs. I have to do 'business cases' like this at work often, let me know if you need any help! |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
I doubt this thread is being read by any admins at this point given that until rjs5 brought some tangible analysis, this thread was otherwise mostly hot air and knee-jerk suggestions about adding support for different technologies which may be good ideas but may also be too technically challenging or time consuming to come onto the roadmap any time soon. Done! |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
I doubt this thread is being read by any admins :-( I opened a similar thread on Ralph@home, without any reply from the admins.... ... but this thing about upgrading the compiler from the *ancient* version they currently build with is truly low hanging fruit! +1 |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,487,487 RAC: 846 |
yup i'd think AVX / AVX2 is a good thing, actually this is very similar (or of the same nature) to the GPU request threads, i.e. to exploit vectorized CPU or GPU functionality to significantly accelerate computations

Generating an AVX/AVX2/AVX512 binary does not necessarily mean a "rewrite". All you have to do is TURN ON the compile time option for the target feature you want to enable and JUST recompile. The benefit, HOWEVER, will depend on the problem and on HOW the coder encoded the algorithm to solve that problem. The Intel compiler can even generate a startup section of code that tests the CPU for those features and branches to the correct code for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX since Intel also contributes to GCC.

The AVX vector operations VECTORIZE loops in the code where the results of one operation are not needed for the next. For many cases, this would "fold" the Amdahl sections of code and the code owners could estimate the speed up.

The PRIMARY problem they would have, and the FIRST one I would look at, would be the conversion from 80-bit floating point i387 register value computations to the smaller 64-bit floating point computations. The old i387 had 80-bit internal registers and would truncate/round/... to 64-bit when storing to 64-bit memory data types, but a sequence of 80-bit float operations will get a slightly different answer than a sequence of 64-bit float operations. Those pesky extra 16 bits keep data that is lost when just using 64-bit operations. Once they confirm that they can use 64-bit operations, then move on. This is probably the barrier that worries the Rosetta people.

Simple illustration: Suppose I had a 1024 vector of floating point numbers that I wanted to ADD together.

    for (i=0;i<1024;i++) sum += vec[i];

i386 code would:
    load next value of 1024 vector from memory to FP register
    ADD

SSE2/AVX/AVX-512 code would operate on strided sub-vectors in the array (2/4/8 values at a time for 64-bit doubles):
    ADD next 2/4/8 values of the 1024 vector from memory to FP register
    horizontal ADD to accumulate the subtotals

There might be some tail processing time IF the coder did not let the compiler know the length of the vector and just used an arbitrary vector length (instead of a multiple of 2/4/8). It is hard to know what the speedup would be without some more analysis, but modernizing Rosetta will multiply the data that the Rosetta people get from the current CPU cycles donated to their project. If they are satisfied with their poverty, then I am OK with it too. |
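
The illustration above can be written out in C. This sketch assumes four `double` lanes, as in a 256-bit AVX register, and uses plain C rather than intrinsics, so the compiler is still free to choose the actual instructions. `sum_scalar`, `sum_lanes` and `LANES` are hypothetical names.

```c
#include <stddef.h>

/* Scalar version: one load and one dependent ADD per element. */
double sum_scalar(const double *vec, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += vec[i];
    return sum;
}

#define LANES 4 /* assumption: 4 doubles per 256-bit AVX register */

/* Strided version: LANES independent partial sums, then a horizontal add. */
double sum_lanes(const double *vec, size_t n)
{
    double partial[LANES] = {0.0};
    size_t i = 0;

    /* Main loop over whole groups of LANES elements. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            partial[k] += vec[i + k];

    /* Horizontal add of the subtotals. */
    double sum = 0.0;
    for (size_t k = 0; k < LANES; k++)
        sum += partial[k];

    /* Tail processing when n is not a multiple of LANES. */
    for (; i < n; i++)
        sum += vec[i];

    return sum;
}
```

Because the strided version adds the elements in a different order, its rounding can differ slightly from the scalar loop for general floating-point data. That is the same kind of result drift described above for the move from 80-bit to 64-bit arithmetic, and it is why GCC vectorizes a floating-point reduction like this only when reassociation is allowed, for example with `-ffast-math`.
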
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
The Intel compiler can even generate a startup section of code that tests the CPU for those features and branches to the correct code for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX since Intel also contributes to GCC. I hope these improvements will also benefit AMD CPUs.... Intel cripple Amd cpu |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
http://code.compeng.uni-frankfurt.de/projects/vc Now they are on GitHub https://github.com/VcDevel/Vc |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,487,487 RAC: 846 |
The Intel compiler can even generate a startup section of code that tests the CPU for those features and branches to the correct code for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX since Intel also contributes to GCC. You will certainly get what you hope for, as detailed in your January 2010 article. The ICC compiler now just checks the CPUID FEATURE support bits and, if the feature is supported, ICC will generate the optimized code. ICC can 100% trust the CPUID FEATURE bits. The problem is now for AMD and for vendors developing software for multiple target CPUs. For example, when AMD had AVX problems with Bulldozer/Interlagos, AMD recommended compiling with -mssse3 and AVOIDING AVX. Since the CPUID FEATURE bit was on, vendors wanting to support AVX had problems with Bulldozer silicon. Now you have people trying to report Intel ICC bugs because their code did not run on the AMD transistors. Intel is now prohibited by court order from generating separate bits for Intel and AMD. Avoid -mavx on Interlagos/Bulldozer (middle of page) The Rosetta people will need to deal with these decisions during their optimization effort. Unless they can vectorize their code, there is limited upside to pushing beyond 64-bit SSE2/3/4. Lots of fun. |
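
The start-up feature check described above can be approximated by hand. The sketch below uses the GCC/Clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, which report what the CPUID feature bits advertise; `best_simd_level` is a hypothetical helper name, and this is not a description of how ICC implements its dispatcher.

```c
#include <stddef.h>

/*
 * Return the widest SIMD level the running CPU advertises.
 * The builtins are only available with GCC or Clang on x86, so every
 * other compiler or architecture falls through to "baseline".
 */
const char *best_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx"))  return "avx";
    if (__builtin_cpu_supports("sse2")) return "sse2";
#endif
    return "baseline";
}
```

A program would typically call this once at start-up and store a function pointer to the matching kernel. A feature bit only says the instructions exist, not that they are fast on that silicon, which is the Bulldozer situation described in the post.
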
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
Intel is now prohibited by court order from generating separate bits for Intel and AMD. Not so fun |
xdarma Send message Joined: 20 Jan 08 Posts: 5 Credit: 4,878,545 RAC: 0 |
You will certainly get what you hope for, as detailed in your January 2010 article. Sorry for re-posting, but this article is dated November 2014: Intel finally agrees to pay $15 to Pentium 4 owners over AMD Athlon benchmarking shenanigans The ICC compiler now just checks the CPUID FEATURE support bits and, if the feature is supported, ICC will generate the optimized code. ICC can 100% trust the CPUID FEATURE bits. For sure ICC must check CPUID, otherwise it can't cripple non-Intel CPUs. The author has applied a patch to fool the software created with ICC. According to the previous article, the gain (or the loss?) seems to be around 8-12%. And that is still the case these days. The problem is now for AMD and for vendors developing software for multiple target CPUs. For example, when AMD had AVX problems with Bulldozer/Interlagos, AMD recommended compiling with -mssse3 and AVOIDING AVX. Since the CPUID FEATURE bit was on, vendors wanting to support AVX had problems with Bulldozer silicon. Which people? Can you elaborate? I can't find anything useful. Thanks. Now you have people trying to report Intel ICC bugs because their code did not run on the AMD transistors. Intel is now prohibited by court order from generating separate bits for Intel and AMD. IMO, the only way is to separate ICC from Intel. Then ICC would have to be fair to other CPU makers. Maybe your job could be affected by this. But that's another story. Avoid -mavx on Interlagos/Bulldozer (middle of page) Avoid it if you use ICC. If you use gcc you can optimize with -mprefer-avx-128 (on the previous page of your link). Lots of fun. Another reason not to buy Intel CPUs. |
Betting Slip Send message Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
Good to hear they are thinking of updating their server code... because it is ANCIENT. Well, if it was ANCIENT in October 2014 but they are thinking of updating it that's good news. Oh wait, it's now October 2015. Oh well maybe they haven't done that much thinking in 12 months. They are very BUSY you know. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
Very interesting analysis about optimization by rjs5 here |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1965 Credit: 9,099,488 RAC: 453 |
From Rjs5 Any performance improvement will only make progress when someone on the Rosetta team wants to. I have a couple unanswered messages to developers volunteering time and expertise. If anyone has interested contacts, please pass me along to them. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance. :-( |
©2024 University of Washington
https://www.bakerlab.org