R@H Scientists/Coders: An analysis of the Rosetta binaries...

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 78273 - Posted: 5 Jun 2015, 15:11:48 UTC

This thread is to bring exposure to the findings made by user rjs5 in the thread "Rosetta@home using AVX / AVX2 ?":

The executing code seems to be compiled for an i386 and uses the 387 floating point 8-register stack model. The code (on my machine) spends about 5% of the time waiting for the "fmul st0,st1" ("====" below) to complete.

minirosetta_3.54_windows_x86_64.exe

Rosetta instruction clip ...

address instruction
0x6b3d82 add ebx, ecx
0x6b3d84 lea ebx, ptr [edi+ebx*8]
0x6b3d87 fld st0, qword ptr [edi+eax*8]
0x6b3d8a mov eax, dword ptr [ebp-0x20]
0x6b3d8d mov edi, dword ptr [ebp-0x14]
0x6b3d90 fmul st0, st1
0x6b3d92 inc ecx =========================
0x6b3d93 add eax, 0x8
0x6b3d96 fsubr st0, qword ptr [ebx]
0x6b3d98 add edx, 0x8


Every CPU since the Pentium 4 (introduced Nov. 2000) supports the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to have been made this century, but the code would then use the SSE registers. The directly addressable XMM registers (8 in 32-bit mode, 16 in 64-bit mode) would reduce register spills to the stack and ease code scheduling (less shuffling of data around and more computation).

A simple recompile should make a noticeable difference without any side effects. If you compile for anything newer than SSE2, or for GPUs, you have to start worrying about and managing the population of target machines you deliver workloads to.

Beyond that, the developers would need to look more closely at the code.
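
For anyone who wants to check which floating-point model a given build actually uses, here is a minimal C++ sketch. It relies on the standard FLT_EVAL_METHOD macro from <cfloat> and on __SSE2_MATH__, which GCC and Clang predefine when scalar math is done in SSE registers (an assumption about the toolchain; other compilers may not define it). The function name fp_model is made up for illustration.

```cpp
#include <cfloat>
#include <string>

// Reports which floating-point model this translation unit was compiled for.
// FLT_EVAL_METHOD is standard; __SSE2_MATH__ is a GCC/Clang predefined macro
// (assumption: other compilers may not provide it).
std::string fp_model() {
#if defined(__SSE2_MATH__)
    return "SSE2 math in XMM registers";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 2
    return "x87 stack, intermediates kept as long double";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 0
    return "each operation evaluated in its own type";
#else
    return "unknown or implementation-defined";
#endif
}
```

With GCC, a 32-bit build typically reports the x87 model unless something like -msse2 -mfpmath=sse is added, while a 64-bit build uses SSE2 math by default.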



Sure. Be happy to.

There are many similar tools for both Windows and Linux. At the time I only had access to my Windows machine, so I got the output from the Intel VTune sampling profiler.

For Linux environments I usually use "perf".

Both will annotate the disassembly with the source code if you have the source, and with symbols if you have them. That makes tracking back to the specific source line easy.

They use the CPU EVENT counters and you can set the time or event domain to trigger on. I just used the default CYCLES and INSTRUCTION COMPLETIONS to find where the program was burning clocks. That tells you where you will get the biggest return for your effort.


Sample Haswell event list:
https://code.google.com/p/likwid/wiki/Haswell


EXAMPLE:
I was running 8 copies of minirosetta on my Ubuntu 64 machine and profiled ALL CPUs with the command:
sudo perf record -a -- sleep 10

Run as sudo and record what all the CPUs ("-a") are doing.

After the 10 second sleep, it dumps the samples and then:

sudo perf report --demangle

to process the counts and demangle the C++ symbols to a more readable form.

7.6% of the time was spent in the _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv function (numeric::MathMatrix<double>::inverse_square_matrix() once demangled). Drilling down into that function ....

7.60% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv
4.95% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
4.46% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK7numeric9xyzVectorIdE16distance_squaredERKS1_
2.96% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
2.60% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core7scoring7vdwaals10VDW_Energy19residue_pair_energyERKNS_12conformation7Re
2.10% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core7scoring4elec13FA_ElecEnergy15score_atom_pairERKNS_12conformation7Residu
2.05% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] memcpy
1.94% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _int_malloc
1.61% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _int_free
1.34% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZN4core10kinematics4tree10BondedAtom17update_xyz_coordsERNS0_4StubE
1.33% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] malloc
1.20% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core12conformation7Residue3xyzEj
1.17% minirosetta_3.5 minirosetta_3.54_x86_64-pc-linux-gnu [.] _ZNK4core12conformation7Residue15atom_type_indexEj
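
The symbols in the listing above are still in their mangled form. As a minimal sketch (assuming a GCC or Clang toolchain, which ships <cxxabi.h>), a symbol can be demangled from C++ with abi::__cxa_demangle. The helper name demangle is made up for illustration.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (Itanium C++ ABI, GCC/Clang)
#include <cstdlib>    // std::free
#include <string>

// Returns the demangled form of an Itanium-ABI symbol, or the input
// unchanged when demangling fails (for example on a truncated symbol).
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);   // the buffer is malloc-allocated by __cxa_demangle
    return result;
}
```

Calling demangle("_ZN7numeric10MathMatrixIdE21inverse_square_matrixEv") gives numeric::MathMatrix<double>::inverse_square_matrix(). Several entries in the listing are cut off at the column width, so those would not demangle as printed.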



Perf will open up a disassembly display with the HOT SPOT highlighted; I marked it with "=====".

Even though the file has x86_64 in its name, it is still a 32-bit application.

file boinc.bakerlab.org_rosetta/*gnu
boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped

_ZN7numeric10MathMatrixIdE21inverse_square_matrixEv /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu
0.07 │ add $0x8,%ebx
0.17 │ fmul %st(1),%st
0.44 │ fsubrl (%edx)
0.56 │ fstpl (%edx)
0.21 │ add $0x8,%edx
0.08 │ cmp %esi,-0xf4(%ebp)
│ ↓ je 7eb
0.75 │ 778: fldl (%eax)
0.93 │ fmul %st(1),%st
3.35 │ mov -0xf0(%ebp),%edi
0.89 │ addl $0x4,-0xf4(%ebp)
1.63 │ fsubrl (%ecx)
4.62 │ fstpl (%ecx)
1.80 │ fldl (%ebx)
0.30 │ fmul %st(1),%st
2.57 │ fsubrl (%edx)
6.25 │ fstpl (%edx) ===================
1.70 │ fldl 0x8(%eax)
0.09 │ fmul %st(1),%st
1.20 │ fsubrl 0x8(%ecx)
4.50 │ fstpl 0x8(%ecx)
1.87 │ fldl 0x8(%ebx)
0.12 │ fmul %st(1),%st
0.84 │ fsubrl 0x8(%edx)
4.62 │ fstpl 0x8(%edx)
1.98 │ fldl 0x10(%eax)
0.09 │ fmul %st(1),%st
0.95 │ fsubrl 0x10(%ecx)
4.81 │ fstpl 0x10(%ecx)
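
Each fldl / fmul / fsubrl / fstpl group in the listing above loads a value, multiplies it by a factor held in st(1), subtracts the product from a second value in memory and stores the result back. That pattern is consistent with the row-elimination step of a matrix inversion. The following C++ is only a guess at the shape of the source loop, not the actual Rosetta code; the name eliminate_row and its parameters are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reconstruction: subtract factor * pivot_row from target_row,
// element by element. One loop iteration corresponds to one
// fldl / fmul / fsubrl / fstpl group in the disassembly.
void eliminate_row(std::vector<double>& target_row,
                   const std::vector<double>& pivot_row,
                   double factor) {
    const std::size_t n = target_row.size() < pivot_row.size()
                              ? target_row.size()
                              : pivot_row.size();
    for (std::size_t j = 0; j < n; ++j) {
        target_row[j] -= factor * pivot_row[j];
    }
}
```

Every iteration is independent of the others, which is the kind of loop a compiler targeting SSE2 or newer could process two or more doubles at a time.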


I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,622,253
RAC: 9,523
Message 78275 - Posted: 5 Jun 2015, 15:44:11 UTC - in response to Message 78273.  

I believe R@H uses the Rosetta Commons code, therefore I do not know precisely who really codes the Rosetta Software, but this should at least be looked at by someone working for R@H.


This is the documentation of Rosetta Commons.
I don't know if the code is the same as R@H's.

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 78277 - Posted: 5 Jun 2015, 18:05:54 UTC

To summarize, rjs5 used some software (the Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of GCC, and in short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 78299 - Posted: 13 Jun 2015, 19:14:16 UTC - in response to Message 78277.  

To summarize, rjs5 used some software (the Intel VTune sampling profiler) to examine the binaries of the minirosetta core and discovered that they are being compiled using a very outdated version of GCC, and in short, simply updating the compiler would introduce some optimizations and resolve some known bottlenecks that show up in any program built with the older version of GCC.


The tools needed on Linux are available to all Linux users. Just start up a bunch of R@H tasks, use "perf" to monitor all the system CPUs for your time period and use "perf" to display the results. I used "objdump" to disassemble the binary and find the "perf" program counter address in the objdump output. If you have SOURCE, objdump will add the source code to the dump.

The equally good stuff on Windows seems to be mostly retail stuff.

r

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,622,253
RAC: 9,523
Message 78327 - Posted: 18 Jun 2015, 10:21:42 UTC

It's a pity there are no admins in this thread...

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78343 - Posted: 24 Jun 2015, 18:41:26 UTC

I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon. Thanks for the helpful input and suggestions for optimizations etc.

Dirk Broer
Joined: 16 Nov 05
Posts: 22
Credit: 3,345,533
RAC: 1,720
Message 78344 - Posted: 24 Jun 2015, 19:14:12 UTC - in response to Message 78343.  

I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon. Thanks for the helpful input and suggestions for optimizations etc.


As SSE2 has been around since the Pentium 4 (2001), can we expect new versions with SSE3 (2004), SSSE3 (2006), SSE4 (2006), AES (2008), AVX (2008),
F16C (AMD: 2009/Intel: 2001), and/or FMA instructions (2011-2013) at Ralph soon too?


David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78345 - Posted: 24 Jun 2015, 20:28:24 UTC

For the immediate future, I can test whatever optimizations are possible given the version of visual studio we currently have which is 2010.

Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 78346 - Posted: 24 Jun 2015, 20:46:03 UTC - in response to Message 78345.  

For the immediate future, I can test whatever optimizations are possible given the version of visual studio we currently have which is 2010.


Looks like auto-vectorization is only supported by newer versions of Visual Studio, e.g. from 2012 onwards. That means no AVX2.

I think SSE2 is the best we get. There's no way other than updating your compiler infrastructure.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,622,253
RAC: 9,523
Message 78347 - Posted: 24 Jun 2015, 20:57:54 UTC - in response to Message 78343.  

I just rebuilt the windows version with SSE2 and see a minor improvement (1.6%) from a really quick test. I'll push it out to ralph soon.


Great!!


David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78348 - Posted: 24 Jun 2015, 21:28:16 UTC

I'll also look into a VS upgrade.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,622,253
RAC: 9,523
Message 78349 - Posted: 25 Jun 2015, 6:56:12 UTC - in response to Message 78348.  

I'll also look into a VS upgrade.


According to this source, MS will release VS2015 during this summer... :-)


Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 78351 - Posted: 25 Jun 2015, 18:58:05 UTC

RJS5 did his testing on Linux, and I believe he said that the version of GCC used for compiling the Linux binaries was also incredibly outdated (for some reason I want to say he mentioned something like it being 8+ versions behind) and that upgrading the GCC compiler on Linux would also render some easy performance improvement without any change to the code base.

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 78353 - Posted: 26 Jun 2015, 16:09:16 UTC

I'm not too sure if the apps can be made available as 'additional binaries' i.e. we can have a 'lowest common denominator' made available to the general cohort.

& there could be specific binaries optimised for the newer chips, which for that matter may not even run on chips even a generation earlier.

those binaries would probably not be automatically downloaded, but for those keen they can optionally install the binaries following some instructions.
-------
on another note, i found that rosetta commons code is apparently available at 'no charge' only under an 'academic license', while a commercial license costs some 40k per site. this would probably limit the feasibility of, say, a member of the public building '3rd party binaries' that could be used with rosetta@home

https://c4c.uwc4c.com/express_license_technologies/rosetta

just 2 cents

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 78354 - Posted: 26 Jun 2015, 16:31:38 UTC - in response to Message 78353.  

I'm not too sure if the apps can be made available as 'additional binaries' i.e. we can have a 'lowest common denominator' made available to the general cohort.


Actually, all modern compilers support multiple code paths built into a single binary and handle this type of fall-back automatically. No need for all the complexity to be handled on the BOINC side.
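
As a rough illustration of what several code paths in one binary can look like, here is a minimal C++ sketch for GCC or Clang on x86. With these compilers the dispatch usually has to be requested explicitly rather than happening on its own; the sketch uses the __builtin_cpu_supports builtin and the target function attribute. The function names sum_baseline, sum_avx2 and select_sum are hypothetical, and on other compilers or architectures the sketch simply falls back to the baseline.

```cpp
#include <cstddef>

// Baseline kernel: compiled with the default target flags of the build.
double sum_baseline(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
// Same source, but the compiler is allowed to use AVX2 in this function only.
__attribute__((target("avx2")))
double sum_avx2(const double* x, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif

using SumFn = double (*)(const double*, std::size_t);

// Picks the best kernel the running CPU supports, once, at run time.
SumFn select_sum() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();                      // harmless if already initialised
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;                       // lowest common denominator
}
```

A caller would store the result of select_sum() in a function pointer at start-up and call through it, so a machine without AVX2 never executes the AVX2 path.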

My comment was simply that not only should SSE2 be enabled in VS2010, but also that the Linux versions should be recompiled with an updated version of GCC rather than the very old version they appear to be built with. Cheers!

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78355 - Posted: 26 Jun 2015, 18:19:48 UTC

I built a 64bit linux version with gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) with the "-msse4.2" option. Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions. Also keep in mind that the Rosetta code will likely not gain much from vectorization optimizations but any gain is good if it's just a matter of updating compiler options. Thanks for all your input!

Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 78356 - Posted: 26 Jun 2015, 19:43:58 UTC

I'm not too sure about the compiler flags either as aggressive optimization can also break things.

I still recommend a newer version of gcc. The one you have is ancient.

You should probably check out the "mtune" option. -O2 should be the maximum.

Flags for gcc

For fun and profit (if you own a Haswell CPU and a newer compiler) you could try compiling the source code with the -march=native flag and compare results with a non-AVX2 version.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78357 - Posted: 26 Jun 2015, 20:48:14 UTC

It would not just be for fun, these optimizations which our scientists and developers aren't that familiar with can have great benefits if the speed up is significant. I'll give that a try and see how things improve. The linux 64bit build with sse4.2 and gcc 4.4.7 does seem to have a more significant improvement than our windows sse2 version, around a 12% improvement on my quick test. I need to do more thorough tests though, particularly for the windows builds but judging from this linux improvement, it may be worthwhile to upgrade to VS 2015 when it comes out.

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 78358 - Posted: 27 Jun 2015, 5:28:08 UTC - in response to Message 78355.  

Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions.


If you can, I'd really suggest reaching out to user rjs5 via a private message. He seems to have a strong technical understanding of the various compiler options more than most of us talking heads on these forums tend to, and he seemed very willing to help.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,051,657
RAC: 8,071
Message 78361 - Posted: 28 Jun 2015, 20:32:36 UTC - in response to Message 78358.  
Last modified: 28 Jun 2015, 20:35:58 UTC

Suggestions for the best optimization options would be greatly appreciated since this is all a bit of voodoo to me keeping in mind general compatibility without having to have specific boinc builds other than 32 and 64bit versions.


If you can, I'd really suggest reaching out to user rjs5 via a private message. He seems to have a strong technical understanding of the various compiler options more than most of us talking heads on these forums tend to, and he seemed very willing to help.



You (Timo) pinged me with a message through the board but I infrequently stop to pick them up. I think running the BETA program through RALPH is dumb. They could/should simply define a NEW "Beta OPT IN" project OPTION on this Rosetta board and build upon their current contributors. Including me and my Haswell machine. Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.


There will be a couple of barriers to get to real performance improvement. I think that getting an optimized version for Windows and Linux versions is a challenge that most people overlook. I have downloaded a number of other project sources to poke around but I have never really built a version because I did not want to mess up my running projects. For this one, I would work with David E. K. to see if I could help.



You have to be careful about determining how much improvement you "got".

Windows is VERY AGGRESSIVE about using TURBO mode and when you start optimizing Windows code, the CPU will heat up faster when you start switching more transistors and drop out of TURBO mode earlier. Your code gets faster but it overheats your system. Linux systems are FAR LESS AGGRESSIVE about using TURBO and the performance benefits of improving code are "more visible" to the person with the stop watch.
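
Because clock frequency can drift during a run, a single stopwatch reading is easy to misread. One common mitigation, sketched here in C++ under the assumption that the workload can be repeated, is to time several runs and compare medians. It does not replace watching temperature and frequency. The names median and median_seconds are made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of a set of samples (taken by value so the caller's data is untouched).
double median(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 != 0) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// Runs `work` `repeats` times and returns the median wall-clock time in seconds.
template <typename F>
double median_seconds(F&& work, int repeats) {
    std::vector<double> samples;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return median(samples);
}
```

Comparing the median of the old build against the median of the new build, measured back to back on the same machine, is less sensitive to one run that happened to start cold or throttled.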

I would watch the Windows CPU temperature and frequency with one of the number of TOOLS available. I tend to use SPECCY which has been good to me.
https://www.piriform.com/speccy
I loaded XSensors on my Linux system to monitor CPU temperature.

If you want to watch the CPU temperature go NUTS, run prime95 while watching the temp and frequency. Even PrimeGrid apps stress my liquid-cooled machine:
Intel Core i7 5930K
Cores 6
Threads 12
Name Intel Core i7 5930K
Code Name Haswell-E/EP
Package Socket 2011 LGA
Technology 22nm
Specification Intel Core i7-5930K CPU @ 3.50GHz


The stages of performance improvement.

1. SSE2: The first will be to migrate to 64-bit floating point from the old x87 80-bit floating point. x87 80-bit was supported by Intel but not by any of their RISC competitors during the "RISC vs CISC" wars. x87 80-bit registers were rounded to 64 bits when stored to memory, so depending on the code, the 80-bit FP values could be cut down to 64-bit values at various points in the computation, leading to error variation creeping into calculations at different rates.

If they are able to get satisfactory results with the SSE2 options which TURN OFF the x87, then all other options are open.



2. VECTOR INSTRUCTIONS: The second level of optimization will then be to make sure their algorithms are written so the compiler can VECTORIZE them. The SSE2 instructions operate on 128-bit XMM registers that can do 2 64-bit FP operations or 4 32-bit operations or 8 16-bit operations in a similar number of clocks. If they do 64-bit FP operations in a loop where 32-bit operations are OK, then they are losing 50% performance while executing the code. The performance loss percentage gets bigger as the size of the vector register increases.

There are a number of things that can be done in the program to "encourage" the compiler to make the decision to vectorize the code automatically, and the compilers are getting better. The developer can also use intrinsic statements to force the compiler to use vector instructions. Intel has an Intrinsics Guide online at
https://software.intel.com/sites/landingpage/IntrinsicsGuide/

There are also hand optimized libraries supported by Intel and open source groups (with the help of Intel) that developers can include in their code. Intel MKL and IPP libraries are, I think, available to educational institutions for distribution.


3. VECTOR SIZE:
SSE2 will operate on 128-bit XMM registers.
AVX will operate on the 256-bit YMM registers for floating point, and AVX2 added 256-bit INTEGER vector instructions.
AVX3 (AVX-512; Skylake Xeon, Xeon Phi) will operate on 512-bit ZMM registers.
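
The precision point in stage 1 can be seen in a few lines. This is a minimal C++ sketch, assuming a compiler where long double is the 80-bit x87 format (GCC and Clang on x86; on other platforms long double may be plain 64-bit or something wider). The function names are hypothetical. The volatile qualifiers force every intermediate value out to memory in its declared type.

```cpp
#include <cfloat>

// Adds half an ulp to 1.0 and subtracts 1.0 again, with every intermediate
// stored as a 64-bit double. The sum rounds back to 1.0, so the result is 0.
double roundtrip_double() {
    volatile double one = 1.0;
    volatile double half_ulp = DBL_EPSILON / 2;   // 2^-53
    volatile double sum = one + half_ulp;
    return sum - one;
}

// Same computation in long double. With the 80-bit x87 format the extra
// mantissa bits keep the small term, so the result is 2^-53 instead of 0.
long double roundtrip_long_double() {
    volatile long double one = 1.0L;
    volatile long double half_ulp = DBL_EPSILON / 2;
    volatile long double sum = one + half_ulp;
    return sum - one;
}
```

An x87 build can behave like either function depending on when the compiler happens to spill a register to memory, which is the source of the run-to-run variation described in stage 1. An SSE2 build always behaves like the first function.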




When using the VECTOR operations, the compiler will choose the SCALAR (x87-like, do it one at a time) operations OR the PARALLEL or PACKED operations that do multiple operations in parallel. The goal of the developer is to code the algorithm to use the PARALLEL or PACKED operations.

Packed (parallel)   Scalar
ADDPS               ADDSS   - Adds operands
SUBPS               SUBSS   - Subtracts operands
MULPS               MULSS   - Multiplies operands
DIVPS               DIVSS   - Divides operands

You want to write your code so it uses PACKED or PARALLEL operations. Scalar code will give you a few percent performance improvement. PARALLEL will give you MULTIPLE times performance improvement.
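
To make the packed versus scalar distinction concrete, here is a minimal C++ sketch using the SSE intrinsics from <xmmintrin.h>. The function names add_scalar and add_packed and the HAVE_SSE macro are made up for illustration, and on a non-x86 target the packed function falls back to the plain loop.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>   // SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define HAVE_SSE 1
#endif

// Scalar version: one float addition per iteration (ADDSS-style).
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// Packed version: four float additions per ADDPS, then a scalar loop
// for the leftover elements when n is not a multiple of four.
void add_packed(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

A recent compiler at a high optimization level may well turn add_scalar into packed code by itself, since the loop has no dependencies between iterations. The intrinsics are for cases where it does not.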



©2024 University of Washington
https://www.bakerlab.org