Message boards : Number crunching : R@H Scientists/Coders: An analysis of the Rosetta binaries...
Author | Message |
---|---|
Mark Send message Joined: 10 Nov 13 Posts: 40 Credit: 397,847 RAC: 0 |
If you're going to examine this area, another option is llvm/clang, which is at http://llvm.org/. Sounds like you need input from an experienced computer scientist, if you don't mind me saying... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
It would not just be for fun; these optimizations, which our scientists and developers aren't that familiar with, can have great benefits if the speedup is significant. I'll give that a try and see how things improve. The Linux 64-bit build with SSE4.2 and gcc 4.4.7 does seem to show a more significant improvement than our Windows SSE2 version, around a 12% improvement on my quick test. I need to do more thorough tests though, particularly for the Windows builds, but judging from this Linux improvement, it may be worthwhile to upgrade to VS 2015 when it comes out. Some things will make a difference, especially in vector code. Use the smallest data type that does the job: you will pay extra for 64-bit doubles versus 32-bit floats. Define the length of arrays to help the compiler avoid generating "tail-processing" code; initialize the tail padding and process it. The compiler vendor makes a big difference too. For example, I just summed an array of 10 million floating point values using the newest gcc and the newest Intel ICC compiler: icc -v icc version 16.0.0 Beta (gcc version 4.9.2 compatibility) ... and the standard Fedora 21 gcc: gcc -v Using built-in specs. 
    COLLECT_GCC=/usr/bin/gcc
    COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.9.2/lto-wrapper
    Target: x86_64-redhat-linux
    Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.9.2-20150212/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.9.2-20150212/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
    Thread model: posix
    gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)

Using both compilers, I compiled a BRAIN DEAD program that just SUMS 10 million single precision and 10 million double precision numbers:

    cat main.c
    #include <stdio.h>
    #define BS 10 *1024*1024
    float sp[BS];
    double dp[BS];
    int main ()
    {
        unsigned long timeBegin = clock();
        unsigned long timeEnd = clock();
        unsigned long i;
        float f = 0.0;
        double d = 0.0;
        for (i=0;i<BS;i++) { sp[i] = (float) random(); }
        for (i=0;i<BS;i++) { dp[i] = (double) random(); }
        f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, timeEnd-timeBegin );
        f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, timeEnd-timeBegin );
        f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, timeEnd-timeBegin );
        printf("\n");
        d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, timeEnd-timeBegin );
        d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, timeEnd-timeBegin );
        d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, timeEnd-timeBegin );
    }

I ran them on my DELL XPS 8700 with the CPU replaced with the i7-4790K running at 4GHz.

    vendor_id : GenuineIntel
    cpu family : 6
    model : 60
    model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    stepping : 3
    microcode : 0x1c
    cpu MHz : 4038.593
    cache size : 8192 KB

The results of identical ICC and GCC compiles were as follows: 3 iterations of summing 10M numbers in single and double precision, first with ICC and then with GCC. I set the "-O2" option explicitly since that is the default "-O" level on ICC, which is slightly different from -O on gcc.

    OPT=-O2 -D__icc
    11259263208914944.000000 2559    ABOUT 2500 usec ICC single precision.
    11259263208914944.000000 2334
    11259263208914944.000000 2304
    11260575066722724.000000 4395    double that time with ICC double
    11260575066722724.000000 5392
    11260575066722724.000000 4506
    OPT=-O2 -D__gcc
    11258919611531264.000000 7296    gcc single is slower than icc double
    11258919611531264.000000 7430
    11258919611531264.000000 7840
    11260575066721352.000000 7690    gcc double takes roughly 1.6x to 1.75x as long as icc double.
    11260575066721352.000000 8669
    11260575066721352.000000 7927

The script I used to compile the brain dead add test:

    cat foo
    #!/bin/sh
    OPT="-O2 -D__icc"
    echo OPT=$OPT
    icc $OPT main.c
    ./a.out
    objdump -dCS a.out > a.out.icc.od
    OPT="-O2 -D__gcc"
    echo OPT=$OPT
    gcc $OPT main.c
    ./a.out
    objdump -dCS a.out > a.out.gcc.od

The DOUBLE PRECISION LOOP created by gcc just iterates over buffer entries. 
    4006e7: e8 e4 fd ff ff          callq 4004d0 <clock@plt>
    4006ec: 66 0f ef c0             pxor %xmm0,%xmm0
    4006f0: 48 63 d8                movslq %eax,%rbx
    4006f3: 31 c0                   xor %eax,%eax
    4006f5: 0f 1f 00                nopl (%rax)
    4006f8: f2 0f 58 04 c5 c0 10    addsd 0x2e010c0(,%rax,8),%xmm0   <<<< THE SUMMING
    4006ff: e0 02
    400701: 48 83 c0 01             add $0x1,%rax
    400705: 48 3d 00 00 a0 00       cmp $0xa00000,%rax
    40070b: 75 eb                   jne 4006f8 <main+0x1d8>
    40070d: 31 c0                   xor %eax,%eax
    40070f: f2 0f 11 44 24 08       movsd %xmm0,0x8(%rsp)
    400715: e8 b6 fd ff ff          callq 4004d0 <clock@plt>

The ICC code at -O2 (the listing below is the single precision loop, as the packed addps instructions show) uses 8 xmm registers to keep 8 packed sub-totals of 4 floats each and then, after the loop exits, ACCUMULATES the 8 sub-totals into the SUM.

    400b83: e8 98 fd ff ff          callq 400920 <clock@plt>
    400b88: 49 89 c4                mov %rax,%r12
    400b8b: 33 c0                   xor %eax,%eax
    400b8d: 66 0f ef f6             pxor %xmm6,%xmm6
    400b91: 0f 29 34 24             movaps %xmm6,(%rsp)
    400b95: 0f 28 ee                movaps %xmm6,%xmm5
    400b98: 0f 29 74 24 10          movaps %xmm6,0x10(%rsp)
    400b9d: 0f 28 e6                movaps %xmm6,%xmm4
    400ba0: 0f 28 fd                movaps %xmm5,%xmm7
    400ba3: 0f 28 de                movaps %xmm6,%xmm3
    400ba6: 0f 28 d6                movaps %xmm6,%xmm2
    400ba9: 0f 28 ce                movaps %xmm6,%xmm1
    400bac: 0f 28 c6                movaps %xmm6,%xmm0
    400baf: 90                      nop
    400bb0: 0f 58 3c 85 c0 46 60    addps 0x6046c0(,%rax,4),%xmm7
    400bb7: 00
    400bb8: 0f 58 34 85 d0 46 60    addps 0x6046d0(,%rax,4),%xmm6
    400bbf: 00
    400bc0: 0f 58 2c 85 e0 46 60    addps 0x6046e0(,%rax,4),%xmm5
    400bc7: 00
    400bc8: 0f 58 24 85 f0 46 60    addps 0x6046f0(,%rax,4),%xmm4
    400bcf: 00
    400bd0: 0f 58 1c 85 00 47 60    addps 0x604700(,%rax,4),%xmm3
    400bd7: 00
    400bd8: 0f 58 14 85 10 47 60    addps 0x604710(,%rax,4),%xmm2
    400bdf: 00
    400be0: 0f 58 0c 85 20 47 60    addps 0x604720(,%rax,4),%xmm1
    400be7: 00
    400be8: 0f 58 04 85 30 47 60    addps 0x604730(,%rax,4),%xmm0
    400bef: 00
    400bf0: 48 83 c0 20             add $0x20,%rax
    400bf4: 48 3d 00 00 a0 00       cmp $0xa00000,%rax
    400bfa: 72 b4                   jb 400bb0 <main+0xb0>
    400bfc: 0f 29 3c 24             movaps %xmm7,(%rsp)
    400c00: 0f 58 ec                addps %xmm4,%xmm5
    400c03: 0f 58 da                addps %xmm2,%xmm3
    400c06: 0f 58 c8                addps %xmm0,%xmm1
    400c09: 0f 58 d9                addps %xmm1,%xmm3
    400c0c: 0f 58 fe                addps %xmm6,%xmm7
    400c0f: 0f 58 fd                addps %xmm5,%xmm7
    400c12: 0f 58 fb                addps %xmm3,%xmm7   <<<<< THE FINAL SUMMING
    400c15: 0f 29 3c 24             movaps %xmm7,(%rsp)
    400c19: e8 02 fd ff ff          callq 400920 <clock@plt> |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
If you're going to examine this area, another option is llvm/clang which is at http://llvm.org/.
Good idea. I had clang on my machine for some earlier hashing work. I had to add time.h and unistd.h includes to the source, and the clang results for the same -O2 compile and run are below.

    clang -v
    clang version 3.5.0 (tags/RELEASE_350/final)
    Target: x86_64-redhat-linux-gnu
    Thread model: posix
    Found candidate GCC installation: /usr/bin/../lib/gcc/i686-redhat-linux/4.9.2
    Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/4.9.2
    Found candidate GCC installation: /usr/lib/gcc/i686-redhat-linux/4.9.2
    Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.9.2
    Selected GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/4.9.2
    Candidate multilib: .;@m64
    Candidate multilib: 32;@m32
    Selected multilib: .;@m64

    ./foo
    OPT=-O2 -D__icc
    11259263208914944.000000 2531
    11259263208914944.000000 2289
    11259263208914944.000000 2263
    11260575066722724.000000 4403
    11260575066722724.000000 5437
    11260575066722724.000000 4650
    OPT=-O2 -D__gcc
    11258919611531264.000000 7268
    11258919611531264.000000 7257
    11258919611531264.000000 7972
    11260575066721352.000000 7503
    11260575066721352.000000 7442
    11260575066721352.000000 7839
    OPT=-O2 -D__clang
    11258919611531264.000000 7473
    11258919611531264.000000 7620
    11258919611531264.000000 8232
    11260575066721352.000000 8002
    11260575066721352.000000 8479
    11260575066721352.000000 8073

    head main.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define BS 10 *1024*1024

In the clang code, the "0x66" byte is an operand-size override prefix, and I am not sure why clang repeats it 5 times (it appears only in the padding nopw ahead of the loop). The clang compiler generates more aggressive code than gcc (the loop is unrolled 4 times) but less aggressive than icc, yet the resulting performance does not get harvested for some reason (note that all four addss still accumulate into the single register xmm0). clang disassembly of the loop:

    4006f7: e8 d4 fd ff ff          callq 4004d0 <clock@plt>
    4006fc: 0f 57 c0                xorps %xmm0,%xmm0
    4006ff: 49 89 c6                mov %rax,%r14
    400702: 66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
    400709: 1f 84 00 00 00 00 00
    400710: f3 0f 58 83 60 10 60    addss 0x601060(%rbx),%xmm0
    400717: 00
    400718: f3 0f 58 83 64 10 60    addss 0x601064(%rbx),%xmm0
    40071f: 00
    400720: f3 0f 58 83 68 10 60    addss 0x601068(%rbx),%xmm0
    400727: 00
    400728: f3 0f 58 83 6c 10 60    addss 0x60106c(%rbx),%xmm0
    40072f: 00
    400730: 48 83 c3 10             add $0x10,%rbx
    400734: 48 81 fb 00 00 80 02    cmp $0x2800000,%rbx
    40073b: 75 d3                   jne 400710 <main+0xf0>
    40073d: f3 0f 11 04 24          movss %xmm0,(%rsp)
    400742: e8 89 fd ff ff          callq 4004d0 <clock@plt> |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
If I turn on AVX2, it uses the YMM 256-bit registers BUT still generates 64-bit scalar code. You see some minor improvement.

    ./foo
    OPT=-xCORE-AVX2 -O2 -D__icc
    11259265356398592.000000 2587
    11259265356398592.000000 2392
    11259265356398592.000000 2401
    11260575066722724.000000 4596
    11260575066722724.000000 4946
    11260575066722724.000000 4626
    OPT=-O2 -D__icc
    11259263208914944.000000 2611
    11259263208914944.000000 2342
    11259263208914944.000000 2304
    11260575066722724.000000 4469
    11260575066722724.000000 5443
    11260575066722724.000000 4785
    OPT=-O2 -D__gcc
    11258919611531264.000000 7266
    11258919611531264.000000 7258
    11258919611531264.000000 8146
    11260575066721352.000000 7800
    11260575066721352.000000 7827
    11260575066721352.000000 9051
    OPT=-O2 -D__clang
    11258919611531264.000000 7261
    11258919611531264.000000 7269
    11258919611531264.000000 7750
    11260575066721352.000000 8100
    11260575066721352.000000 8421
    11260575066721352.000000 7982

    400b86: e8 95 fd ff ff          callq 400920 <clock@plt>
    400b8b: 49 89 c4                mov %rax,%r12
    400b8e: 33 c0                   xor %eax,%eax
    400b90: c5 c4 57 ff             vxorps %ymm7,%ymm7,%ymm7
    400b94: c5 fc 11 7c 24 20       vmovups %ymm7,0x20(%rsp)
    400b9a: c5 fd 6f f7             vmovdqa %ymm7,%ymm6
    400b9e: c5 fd 6f ef             vmovdqa %ymm7,%ymm5
    400ba2: c5 fd 6f e7             vmovdqa %ymm7,%ymm4
    400ba6: c5 fd 6f df             vmovdqa %ymm7,%ymm3
    400baa: c5 fd 6f d7             vmovdqa %ymm7,%ymm2
    400bae: c5 fd 6f cf             vmovdqa %ymm7,%ymm1
    400bb2: c5 fc 28 c7             vmovaps %ymm7,%ymm0
    400bb6: 0f 1f 00                nopl (%rax)
    400bb9: 0f 1f 80 00 00 00 00    nopl 0x0(%rax)
    400bc0: c5 c4 58 3c 85 c0 46    vaddps 0x6046c0(,%rax,4),%ymm7,%ymm7
    400bc7: 60 00
    400bc9: c5 cc 58 34 85 e0 46    vaddps 0x6046e0(,%rax,4),%ymm6,%ymm6
    400bd0: 60 00
    400bd2: c5 d4 58 2c 85 00 47    vaddps 0x604700(,%rax,4),%ymm5,%ymm5
    400bd9: 60 00
    400bdb: c5 dc 58 24 85 20 47    vaddps 0x604720(,%rax,4),%ymm4,%ymm4
    400be2: 60 00
    400be4: c5 e4 58 1c 85 40 47    vaddps 0x604740(,%rax,4),%ymm3,%ymm3
    400beb: 60 00
    400bed: c5 ec 58 14 85 60 47    vaddps 0x604760(,%rax,4),%ymm2,%ymm2
    400bf4: 60 00
    400bf6: c5 f4 58 0c 85 80 47    vaddps 0x604780(,%rax,4),%ymm1,%ymm1
    400bfd: 60 00
    400bff: c5 fc 58 04 85 a0 47    vaddps 0x6047a0(,%rax,4),%ymm0,%ymm0
    400c06: 60 00
    400c08: 48 83 c0 40             add $0x40,%rax
    400c0c: 48 3d 00 00 a0 00       cmp $0xa00000,%rax
    400c12: 72 ac                   jb 400bc0 <main+0xc0>
    400c14: c5 44 58 c6             vaddps %ymm6,%ymm7,%ymm8
    400c18: c5 54 58 cc             vaddps %ymm4,%ymm5,%ymm9
    400c1c: c5 64 58 d2             vaddps %ymm2,%ymm3,%ymm10
    400c20: c5 f4 58 c0             vaddps %ymm0,%ymm1,%ymm0
    400c24: c4 41 3c 58 d9          vaddps %ymm9,%ymm8,%ymm11
    400c29: c5 ac 58 c8             vaddps %ymm0,%ymm10,%ymm1
    400c2d: c5 a4 58 d1             vaddps %ymm1,%ymm11,%ymm2
    400c31: c5 fc 11 14 24          vmovups %ymm2,(%rsp)
    400c36: c5 f8 77                vzeroupper
    400c39: e8 e2 fc ff ff          callq 400920 <clock@plt> |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder. All "big" projects, if they can, have a beta side: seti@home beta, albert@home (beta of Einstein), Poem@Test, Ralph, etc. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder. Hmmmm. I stand corrected. This ignorant is now more enlightened .... thanks. My comment was overly harsh. I understand the need for a BETA program but don't understand any need for the separate project. The model I prefer is like the World Community Grid model where they have a BETA project as a preference option where you can opt-in. I run those beta workloads and would run beta binaries in projects I am crunching. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,452,852 RAC: 11,025 |
I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder. I would prefer beta as a per-computer opt-in option. The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though! Is there any room for making better use of AMD HSA APUs (not that I have any!)? My understanding is that any other type of GPU optimisation isn't suitable for rosetta because even if bits of the code were GPU optimised, the overhead for switching from CPU RAM to GPU RAM would massively outweigh any benefits. HSA might change that but my far-from-expert understanding is that any GPU optimisations would require significant rewrite. Is that correct, or are there any low-hanging compiler options to make HSA useful (in this context)? |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
I understand the need for a BETA program but don't understand any need for the separate project. For example, if you want to update the server (and Rosetta, I think, needs it) with this command and you want to try it before it goes into production (to avoid disappointments). WCG is a little different from other projects: they use a wrapper for BOINC, so their servers "go their own way". |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though! Don't underestimate ralph@home's computational power. Sometimes, in the past, Rosetta's admins have used it to crunch WUs rapidly for CASP (because of the WUs' near deadlines). The people who participate in Ralph know that it is a beta project and accept the risk. |
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,452,852 RAC: 11,025 |
The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though! What I meant is that lots of people might tick a 'beta' box when installing, without understanding that they're signing up for a project that might be unstable. The way it's set up at the moment you'd struggle to accidentally sign up for ralph as well as rosetta. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder. For every target environment, there is some effort required to make it work in that environment. If you are the developer/researcher, the question to ask is "How many systems are going to use this new feature, and will it pay back the researcher's effort for the port?" The Rosetta researchers have an idea about what the machine distribution looks like. I don't know if the number of AMD HSA APUs is sufficient to warrant the effort. Timo indicated in another thread: "Setting a longer run-time actually means that each WU you crunch will run more decoys (more searches through different paths of conformation-space) for the same core model/protein/data set. Many thousands of decoys are needed for obtaining an accurate prediction of a single protein." This would make me think that there is COARSE-GRAIN PARALLELISM at the DECOY level, and Rosetta would be a candidate for using multiple CPUs, processing one DECOY per logical CPU simultaneously on the same data. Depending on the size of the data set being operated on, there may be some benefit from having common data cached. The researchers know the data set size and how it compares with the L2/L3 cache sizes of machines. I also expect there are some loops with fine-grain parallelism where vectors can be used. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
This would make me think that there is COARSE-GRAIN PARALLELISM at the DECOY level, and Rosetta would be a candidate for using multiple CPUs, processing one DECOY per logical CPU simultaneously on the same data. Yes, that works in theory. At present, R@h is not a multithreaded application, and so the same results are generally achieved by running multiple WUs at the same time. The project needs tens of thousands of decoys, but a given WU generally runs dozens or hundreds. So multiplying that by the number of CPUs and running a single WU on all CPUs of a machine sort of puts all of your eggs in one basket. If there were a failure of some kind, more work would be lost. Having different WUs per CPU randomizes how much CPU time is lost when the system powers off, etc. Rosetta Moderator: Mod.Sense |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
This optimization work flow will take quite a bit of time and I am and will be definitely discussing it with others in the Rosetta developer community, including computer scientists. We do have clang and icc build options also. But have limited developer/testing/optimization time. If there are easy to implement compiler optimization options that we can benefit from significantly, that would be best in the short term. As for the linux test comparing our 32bit distributed app vs a 64bit sse4.2 and gcc 4.4.7 build, I see almost a 13% improvement in average run time per model. That is great! |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi David. I have a Haswell R Xeon running Ubuntu 14.04lts x64, I could put it on to Ralph for any testing you might do if it's any help. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
This optimization work flow will take quite a bit of time and I am and will be definitely discussing it with others in the Rosetta developer community, including computer scientists. We do have clang and icc build options also. But have limited developer/testing/optimization time. If there are easy to implement compiler optimization options that we can benefit from significantly, that would be best in the short term. 13% is not a surprise, and with a simple recompile you have increased the volunteer donation by 13%. BUT!!! The important question is "Did you get the right answer faster?" 8-) There should be very few applications that do not benefit from a 64-bit version. There are more general-purpose registers (16 vs 8) available in the 64-bit model, so an application spills to temporary variables on the stack less often. If you have access to the Intel compiler, look at the option: -ax<CODE> The "-a" part tells the ICC compiler to generate multiple, feature-specific, auto-dispatch code paths for Intel processors if there is a performance benefit. Since you are looking at all machines, I would set "<CODE>" for the newest machine .... Haswell. The "fat binary" dispatcher checks the CPU type at runtime and configures itself to use the code paths that it thinks are best suited to your CPU. All the Rosetta volunteers should see improvements. There are several ICC-specific libraries that will have to be redistributed or statically linked with the Rosetta binary. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
I have a Haswell R Xeon running Ubuntu 14.04lts x64, I could put it on to Ralph for any testing you might do if it's any help. My virtual linux machines are ready! |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
I'll push it out to ralph soon. Will you release only the new optimized version (3.60, I suppose) or continue with two versions of 3.59 (optimized and non-optimized)? |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
In order to fully utilize the potency of the AVX2 (and other) instruction sets, I still recommend switching to a newer gcc version, e.g. 4.9. This article is quite interesting as it highlights some pitfalls while programming: Auto-vectorization with gcc 4.7. "At an even higher level is auto-vectorization. There, code isn't re-written at all. It remains portable C, and the compiler automagically determines how to vectorize it. Since ideally, no work needs to be done, one can simply recompile with a new compiler and get all the speed advantages of vectorization with very little effort required. The big question though, is how much code can gcc vectorize? If very few loops can be, then this feature isn't too useful. Conversely, if gcc is very smart, then the lower level techniques aren't necessary any more." and "On the other hand, gcc will still attempt to vectorize code which hasn't had changes done to it at all. It just won't be able to get nearly as much of a performance improvement as you might hope." In the long run I seriously doubt that GPGPU will survive homogeneous computing through AVX2 / AVX512 and successors... |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
In order to fully utilize the potency of the AVX2 (and other instruction) sets I still recommend to switch to a newer gcc version, e.g. 4.9. Or 5.1 In the long run I seriously doubt that GPGPU will survive homogeneous computing through AVX2 / AVX512 and successors... I don't think so. Gpu computational power (if sw is ok) outclasses cpu |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
I don't think so. Gpu computational power (if sw is ok) outclasses cpu We'll see how Knights Landing will perform with its AVX512 support and many more cores than a regular desktop CPU. The logical future: AVX1024 and TSX to handle all those cores efficiently. |
©2024 University of Washington
https://www.bakerlab.org