R@H Scientists/Coders: An analysis of the Rosetta binaries...

Message boards : Number crunching : R@H Scientists/Coders: An analysis of the Rosetta binaries...

Mark
Joined: 10 Nov 13
Posts: 40
Credit: 397,847
RAC: 0
Message 78362 - Posted: 28 Jun 2015, 21:55:31 UTC

If you're going to examine this area, another option is llvm/clang which is at http://llvm.org/.

Sounds like you need input from an experienced computer scientist, if you don't mind me saying...

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78363 - Posted: 28 Jun 2015, 22:24:45 UTC - in response to Message 78357.  

It would not just be for fun, these optimizations which our scientists and developers aren't that familiar with can have great benefits if the speed up is significant. I'll give that a try and see how things improve. The linux 64bit build with sse4.2 and gcc 4.4.7 does seem to have a more significant improvement than our windows sse2 version, around a 12% improvement on my quick test. I need to do more thorough tests though, particularly for the windows builds but judging from this linux improvement, it may be worthwhile to upgrade to VS 2015 when it comes out.


Some things that will make a difference, especially in vector code. Use the smallest data type that does the job. You will pay extra for 64-bit doubles versus 32-bit floats.

Make array lengths a known multiple of the vector width to help the compilers avoid generating "tail-processing" code: pad the array, initialize the tail padding, and process it along with the real data.
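
A minimal sketch of the padding idea, assuming a vector width of 8 floats (one 256-bit register). VEC, N, PADDED_N, fill_padded and sum_padded are hypothetical names for illustration, not Rosetta code. The array is rounded up to a multiple of the vector width and the padding is zeroed, so the trip count is a compile-time multiple of 8 and the extra entries do not change the sum.

```c
#include <stddef.h>

#define VEC 8                                   /* floats per 256-bit vector (assumed) */
#define N 1003                                  /* real element count (illustrative)   */
#define PADDED_N (((N + VEC - 1) / VEC) * VEC)  /* N rounded up to a multiple of VEC   */

float padded[PADDED_N];

/* Copy n values (n <= N) and zero the tail so the padding is safe to process. */
void fill_padded(const float *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        padded[i] = src[i];
    for (; i < PADDED_N; i++)
        padded[i] = 0.0f;
}

/* The trip count is a constant multiple of VEC, so no remainder loop is needed. */
float sum_padded(void)
{
    float s = 0.0f;
    for (size_t i = 0; i < PADDED_N; i++)
        s += padded[i];
    return s;
}
```

The zero padding is harmless for a sum; for other operations (a product, a minimum) the padding value would have to be the identity for that operation instead.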



The compiler vendor makes a big difference too. For example, I just summed an array of 10 million floating point values using the newest gcc and the newest Intel ICC compiler:
icc -v
icc version 16.0.0 Beta (gcc version 4.9.2 compatibility)


... and the standard Fedora 21 gcc

gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.9.2/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.9.2-20150212/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.9.2-20150212/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC)




Using both compilers, I compiled a BRAIN DEAD program that just SUMS 10 million single precision and 10 million double precision numbers:


cat main.c

#include <stdio.h>
#include <stdlib.h>   /* random() */
#include <time.h>     /* clock() */
#define BS 10 *1024*1024

float sp[BS];
double dp[BS];

int main ()
{
clock_t timeBegin, timeEnd;
unsigned long i;
float f = 0.0;
double d = 0.0;

/* fill both arrays with pseudo-random values */
for (i=0;i<BS;i++) { sp[i] = (float) random(); }
for (i=0;i<BS;i++) { dp[i] = (double) random(); }
/* three timed single precision sums; times are clock() ticks */
f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, (long)(timeEnd-timeBegin) );
f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, (long)(timeEnd-timeBegin) );
f=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { f+=sp[i]; } timeEnd = clock(); printf("%f %ld \n", f, (long)(timeEnd-timeBegin) );
printf("\n");
/* three timed double precision sums */
d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, (long)(timeEnd-timeBegin) );
d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, (long)(timeEnd-timeBegin) );
d=0.0; timeBegin = clock(); for (i=0;i<BS;i++) { d+=dp[i]; } timeEnd = clock(); printf("%f %ld \n", d, (long)(timeEnd-timeBegin) );

return 0;
}



I ran them on my DELL XPS 8700 with the CPU replaced with the i7-4790K running at 4GHz.
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
stepping : 3
microcode : 0x1c
cpu MHz : 4038.593
cache size : 8192 KB



The results of identical ICC and GCC compiles were:
3 iterations of summing 10M numbers, single and then double precision, first with ICC and then with GCC. Both use the "-O2" option, since that is what the default "-O" means on ICC, which is slightly different from "-O" on gcc. The second column is the elapsed time in clock() ticks (microseconds here).

OPT=-O2 -D__icc
11259263208914944.000000 2559 ABOUT 2500 usec ICC single precision.
11259263208914944.000000 2334
11259263208914944.000000 2304

11260575066722724.000000 4395 double that time with ICC double
11260575066722724.000000 5392
11260575066722724.000000 4506

OPT=-O2 -D__gcc
11258919611531264.000000 7296 gcc single is slower than icc double
11258919611531264.000000 7430
11258919611531264.000000 7840

11260575066721352.000000 7690 gcc double takes about 1.7x as long as icc double.
11260575066721352.000000 8669
11260575066721352.000000 7927


The script I used to compile the brain dead add test:
cat foo

#!/bin/sh


OPT="-O2 -D__icc"
echo OPT=$OPT
icc $OPT main.c
./a.out
objdump -dCS a.out > a.out.icc.od

OPT="-O2 -D__gcc"
echo OPT=$OPT
gcc $OPT main.c
./a.out
objdump -dCS a.out > a.out.gcc.od





The DOUBLE PRECISION LOOP created by gcc just iterates over the buffer entries one at a time, adding each one to a single scalar accumulator (addsd).

4006e7: e8 e4 fd ff ff callq 4004d0 <clock@plt>
4006ec: 66 0f ef c0 pxor %xmm0,%xmm0
4006f0: 48 63 d8 movslq %eax,%rbx
4006f3: 31 c0 xor %eax,%eax
4006f5: 0f 1f 00 nopl (%rax)
4006f8: f2 0f 58 04 c5 c0 10 addsd 0x2e010c0(,%rax,8),%xmm0 <<<< THE SUMMING
4006ff: e0 02
400701: 48 83 c0 01 add $0x1,%rax
400705: 48 3d 00 00 a0 00 cmp $0xa00000,%rax
40070b: 75 eb jne 4006f8 <main+0x1d8>
40070d: 31 c0 xor %eax,%eax
40070f: f2 0f 11 44 24 08 movsd %xmm0,0x8(%rsp)
400715: e8 b6 fd ff ff callq 4004d0 <clock@plt>


The SINGLE PRECISION loop compiled by ICC at -O2 is vectorized: it uses 8 xmm registers, each holding 4 packed floats (addps), to build 8 vectors of sub-totals, 32 floats per iteration, and then AFTER the loop ACCUMULATES the 8 sub-totals into the SUM.

400b83: e8 98 fd ff ff callq 400920 <clock@plt>
400b88: 49 89 c4 mov %rax,%r12
400b8b: 33 c0 xor %eax,%eax
400b8d: 66 0f ef f6 pxor %xmm6,%xmm6
400b91: 0f 29 34 24 movaps %xmm6,(%rsp)
400b95: 0f 28 ee movaps %xmm6,%xmm5
400b98: 0f 29 74 24 10 movaps %xmm6,0x10(%rsp)
400b9d: 0f 28 e6 movaps %xmm6,%xmm4
400ba0: 0f 28 fd movaps %xmm5,%xmm7
400ba3: 0f 28 de movaps %xmm6,%xmm3
400ba6: 0f 28 d6 movaps %xmm6,%xmm2
400ba9: 0f 28 ce movaps %xmm6,%xmm1
400bac: 0f 28 c6 movaps %xmm6,%xmm0
400baf: 90 nop
400bb0: 0f 58 3c 85 c0 46 60 addps 0x6046c0(,%rax,4),%xmm7
400bb7: 00
400bb8: 0f 58 34 85 d0 46 60 addps 0x6046d0(,%rax,4),%xmm6
400bbf: 00
400bc0: 0f 58 2c 85 e0 46 60 addps 0x6046e0(,%rax,4),%xmm5
400bc7: 00
400bc8: 0f 58 24 85 f0 46 60 addps 0x6046f0(,%rax,4),%xmm4
400bcf: 00
400bd0: 0f 58 1c 85 00 47 60 addps 0x604700(,%rax,4),%xmm3
400bd7: 00
400bd8: 0f 58 14 85 10 47 60 addps 0x604710(,%rax,4),%xmm2
400bdf: 00
400be0: 0f 58 0c 85 20 47 60 addps 0x604720(,%rax,4),%xmm1
400be7: 00
400be8: 0f 58 04 85 30 47 60 addps 0x604730(,%rax,4),%xmm0
400bef: 00
400bf0: 48 83 c0 20 add $0x20,%rax
400bf4: 48 3d 00 00 a0 00 cmp $0xa00000,%rax
400bfa: 72 b4 jb 400bb0 <main+0xb0>
400bfc: 0f 29 3c 24 movaps %xmm7,(%rsp)
400c00: 0f 58 ec addps %xmm4,%xmm5
400c03: 0f 58 da addps %xmm2,%xmm3
400c06: 0f 58 c8 addps %xmm0,%xmm1
400c09: 0f 58 d9 addps %xmm1,%xmm3
400c0c: 0f 58 fe addps %xmm6,%xmm7
400c0f: 0f 58 fd addps %xmm5,%xmm7
400c12: 0f 58 fb addps %xmm3,%xmm7 <<<<< THE FINAL SUMMING
400c15: 0f 29 3c 24 movaps %xmm7,(%rsp)
400c19: e8 02 fd ff ff callq 400920 <clock@plt>
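
In C terms, what ICC did to the single precision loop is roughly the following. This is a hand-written sketch, not compiler output: sum_simple and sum_8way are hypothetical names, and it keeps 8 scalar partial sums where ICC keeps 8 registers of 4 floats each. Because float addition is not associative, the two orders can round differently, which is presumably why the ICC and gcc single precision totals above do not match.

```c
#include <stddef.h>

/* One accumulator: every add depends on the previous one (the gcc loop). */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Eight independent partial sums, combined after the loop (the ICC idea).
   n is assumed to be a multiple of 8. */
float sum_8way(const float *a, size_t n)
{
    float p[8] = {0};
    for (size_t i = 0; i < n; i += 8)
        for (int k = 0; k < 8; k++)
            p[k] += a[i + k];
    /* pairwise reduction of the partial sums */
    return ((p[0] + p[1]) + (p[2] + p[3])) + ((p[4] + p[5]) + (p[6] + p[7]));
}
```

A compiler is only allowed to make this change on its own when it is permitted to reorder floating point operations, so the single-accumulator form is what a strict compile has to produce.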

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78364 - Posted: 28 Jun 2015, 22:48:34 UTC - in response to Message 78362.  

If you're going to examine this area, another option is llvm/clang which is at http://llvm.org/.

Sounds like you need input from an experienced computer scientist, if you don't mind me saying...


Good idea. I had clang on my machine for some earlier hashing work. For clang, the source needs the stdlib.h and time.h includes (see the head of main.c below). The clang results for the same -O2 compile and run are below.

clang -v
clang version 3.5.0 (tags/RELEASE_350/final)
Target: x86_64-redhat-linux-gnu
Thread model: posix
Found candidate GCC installation: /usr/bin/../lib/gcc/i686-redhat-linux/4.9.2
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/4.9.2
Found candidate GCC installation: /usr/lib/gcc/i686-redhat-linux/4.9.2
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/4.9.2
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/4.9.2
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64


./foo
OPT=-O2 -D__icc
11259263208914944.000000 2531
11259263208914944.000000 2289
11259263208914944.000000 2263

11260575066722724.000000 4403
11260575066722724.000000 5437
11260575066722724.000000 4650

OPT=-O2 -D__gcc
11258919611531264.000000 7268
11258919611531264.000000 7257
11258919611531264.000000 7972

11260575066721352.000000 7503
11260575066721352.000000 7442
11260575066721352.000000 7839

OPT=-O2 -D__clang
11258919611531264.000000 7473
11258919611531264.000000 7620
11258919611531264.000000 8232

11260575066721352.000000 8002
11260575066721352.000000 8479
11260575066721352.000000 8073



head main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BS 10 *1024*1024



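The clang code for the single precision loop is below. The repeated "0x66" bytes are operand-size override prefixes; here they only pad a long NOP so that the loop starts on a 16-byte boundary (0x400710). The clang compiler generates more aggressive code than gcc, though not icc: it unrolls the loop 4 times. But all four addss instructions accumulate into the same xmm0 register, so each add still has to wait for the previous one, and the unrolling does not show up in the timings.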

clang disassembly of loop ....


4006f7: e8 d4 fd ff ff callq 4004d0 <clock@plt>
4006fc: 0f 57 c0 xorps %xmm0,%xmm0
4006ff: 49 89 c6 mov %rax,%r14
400702: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
400709: 1f 84 00 00 00 00 00
400710: f3 0f 58 83 60 10 60 addss 0x601060(%rbx),%xmm0
400717: 00
400718: f3 0f 58 83 64 10 60 addss 0x601064(%rbx),%xmm0
40071f: 00
400720: f3 0f 58 83 68 10 60 addss 0x601068(%rbx),%xmm0
400727: 00
400728: f3 0f 58 83 6c 10 60 addss 0x60106c(%rbx),%xmm0
40072f: 00
400730: 48 83 c3 10 add $0x10,%rbx
400734: 48 81 fb 00 00 80 02 cmp $0x2800000,%rbx
40073b: 75 d3 jne 400710 <main+0xf0>
40073d: f3 0f 11 04 24 movss %xmm0,(%rsp)
400742: e8 89 fd ff ff callq 4004d0 <clock@plt>

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78365 - Posted: 28 Jun 2015, 23:08:15 UTC - in response to Message 78364.  

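If I turn on AVX2 in ICC (-xCORE-AVX2), the single precision loop uses the 256-bit YMM registers, 8 packed floats per vaddps and 64 floats per iteration, BUT the timings barely change. That suggests the loop is probably limited by memory bandwidth rather than by the adds.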


./foo
OPT=-xCORE-AVX2 -O2 -D__icc
11259265356398592.000000 2587
11259265356398592.000000 2392
11259265356398592.000000 2401

11260575066722724.000000 4596
11260575066722724.000000 4946
11260575066722724.000000 4626

OPT=-O2 -D__icc
11259263208914944.000000 2611
11259263208914944.000000 2342
11259263208914944.000000 2304

11260575066722724.000000 4469
11260575066722724.000000 5443
11260575066722724.000000 4785

OPT=-O2 -D__gcc
11258919611531264.000000 7266
11258919611531264.000000 7258
11258919611531264.000000 8146

11260575066721352.000000 7800
11260575066721352.000000 7827
11260575066721352.000000 9051

OPT=-O2 -D__clang
11258919611531264.000000 7261
11258919611531264.000000 7269
11258919611531264.000000 7750

11260575066721352.000000 8100
11260575066721352.000000 8421
11260575066721352.000000 7982


400b86: e8 95 fd ff ff callq 400920 <clock@plt>
400b8b: 49 89 c4 mov %rax,%r12
400b8e: 33 c0 xor %eax,%eax
400b90: c5 c4 57 ff vxorps %ymm7,%ymm7,%ymm7
400b94: c5 fc 11 7c 24 20 vmovups %ymm7,0x20(%rsp)
400b9a: c5 fd 6f f7 vmovdqa %ymm7,%ymm6
400b9e: c5 fd 6f ef vmovdqa %ymm7,%ymm5
400ba2: c5 fd 6f e7 vmovdqa %ymm7,%ymm4
400ba6: c5 fd 6f df vmovdqa %ymm7,%ymm3
400baa: c5 fd 6f d7 vmovdqa %ymm7,%ymm2
400bae: c5 fd 6f cf vmovdqa %ymm7,%ymm1
400bb2: c5 fc 28 c7 vmovaps %ymm7,%ymm0
400bb6: 0f 1f 00 nopl (%rax)
400bb9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
400bc0: c5 c4 58 3c 85 c0 46 vaddps 0x6046c0(,%rax,4),%ymm7,%ymm7
400bc7: 60 00
400bc9: c5 cc 58 34 85 e0 46 vaddps 0x6046e0(,%rax,4),%ymm6,%ymm6
400bd0: 60 00
400bd2: c5 d4 58 2c 85 00 47 vaddps 0x604700(,%rax,4),%ymm5,%ymm5
400bd9: 60 00
400bdb: c5 dc 58 24 85 20 47 vaddps 0x604720(,%rax,4),%ymm4,%ymm4
400be2: 60 00
400be4: c5 e4 58 1c 85 40 47 vaddps 0x604740(,%rax,4),%ymm3,%ymm3
400beb: 60 00
400bed: c5 ec 58 14 85 60 47 vaddps 0x604760(,%rax,4),%ymm2,%ymm2
400bf4: 60 00
400bf6: c5 f4 58 0c 85 80 47 vaddps 0x604780(,%rax,4),%ymm1,%ymm1
400bfd: 60 00
400bff: c5 fc 58 04 85 a0 47 vaddps 0x6047a0(,%rax,4),%ymm0,%ymm0
400c06: 60 00
400c08: 48 83 c0 40 add $0x40,%rax
400c0c: 48 3d 00 00 a0 00 cmp $0xa00000,%rax
400c12: 72 ac jb 400bc0 <main+0xc0>
400c14: c5 44 58 c6 vaddps %ymm6,%ymm7,%ymm8
400c18: c5 54 58 cc vaddps %ymm4,%ymm5,%ymm9
400c1c: c5 64 58 d2 vaddps %ymm2,%ymm3,%ymm10
400c20: c5 f4 58 c0 vaddps %ymm0,%ymm1,%ymm0
400c24: c4 41 3c 58 d9 vaddps %ymm9,%ymm8,%ymm11
400c29: c5 ac 58 c8 vaddps %ymm0,%ymm10,%ymm1
400c2d: c5 a4 58 d1 vaddps %ymm1,%ymm11,%ymm2
400c31: c5 fc 11 14 24 vmovups %ymm2,(%rsp)
400c36: c5 f8 77 vzeroupper
400c39: e8 e2 fc ff ff callq 400920 <clock@plt>

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78367 - Posted: 29 Jun 2015, 7:54:48 UTC - in response to Message 78361.  
Last modified: 29 Jun 2015, 7:55:21 UTC

I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.


All "big" projects, if they can, have beta-side: seti@home beta, albert@home (beta of Einstein), Poem@Test, Ralph, etc.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78368 - Posted: 29 Jun 2015, 13:30:44 UTC - in response to Message 78367.  

I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.


All "big" projects, if they can, have beta-side: seti@home beta, albert@home (beta of Einstein), Poem@Test, Ralph, etc.


Hmmmm. I stand corrected. This ignorant is now more enlightened .... thanks. My comment was overly harsh.

I understand the need for a BETA program but don't understand any need for the separate project.

The model I prefer is like the World Community Grid model where they have a BETA project as a preference option where you can opt-in. I run those beta workloads and would run beta binaries in projects I am crunching.

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,779,194
RAC: 59,243
Message 78369 - Posted: 29 Jun 2015, 14:05:09 UTC - in response to Message 78368.  

I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.


All "big" projects, if they can, have beta-side: seti@home beta, albert@home (beta of Einstein), Poem@Test, Ralph, etc.


Hmmmm. I stand corrected. This ignorant is now more enlightened .... thanks. My comment was overly harsh.

I understand the need for a BETA program but don't understand any need for the separate project.

The model I prefer is like the World Community Grid model where they have a BETA project as a preference option where you can opt-in. I run those beta workloads and would run beta binaries in projects I am crunching.


I would prefer beta as a per-computer opt-in option. The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though!

Is there any room for making better use of AMD HSA APUs (not that I have any!)? My understanding is that any other type of GPU optimisation isn't suitable for rosetta because even if bits of the code were GPU optimised, the overhead for switching from CPU RAM to GPU RAM would massively outweigh any benefits. HSA might change that but my far-from-expert understanding is that any GPU optimisations would require significant rewrite. Is that correct, or are there any low-hanging compiler options to make HSA useful (in this context)?

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78370 - Posted: 29 Jun 2015, 14:47:33 UTC - in response to Message 78368.  
Last modified: 29 Jun 2015, 14:51:42 UTC

I understand the need for a BETA program but don't understand any need for the separate project.


For example, if you want to update the server (and Rosetta, I think, needs it) with this command, you want to try it before production (to avoid disappointments).

WCG is a little different from other projects: they use a wrapper for BOINC, so their servers "go their own way".

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78371 - Posted: 29 Jun 2015, 14:57:35 UTC - in response to Message 78369.  

The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though!


Don't underestimate the ralph@home computational power.
Sometimes, in the past, Rosetta's admins have used it to crunch WUs for CASP quickly (because of the WUs' tight deadlines).
The people who participate in Ralph know that it is a beta project and accept the risk.


dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,779,194
RAC: 59,243
Message 78372 - Posted: 29 Jun 2015, 15:56:07 UTC - in response to Message 78371.  

The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though!


Don't underestimate the ralph@home computational power.
Sometimes, in the past, Rosetta's admins have used it to crunch WUs for CASP quickly (because of the WUs' tight deadlines).
The people who participate in Ralph know that it is a beta project and accept the risk.

What I meant is that lots of people might tick a 'beta' box when installing, without understanding that they're signing up for a project that might be unstable. The way it's set up at the moment you'd struggle to accidentally sign up for ralph as well as rosetta.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78373 - Posted: 29 Jun 2015, 17:32:15 UTC - in response to Message 78369.  

I think running the BETA program through RALPH is dumb.......Using RALPH just splits the Rosetta volunteer resources and makes everyone's job harder.


All "big" projects, if they can, have beta-side: seti@home beta, albert@home (beta of Einstein), Poem@Test, Ralph, etc.


Hmmmm. I stand corrected. This ignorant is now more enlightened .... thanks. My comment was overly harsh.

I understand the need for a BETA program but don't understand any need for the separate project.

The model I prefer is like the World Community Grid model where they have a BETA project as a preference option where you can opt-in. I run those beta workloads and would run beta binaries in projects I am crunching.


I would prefer beta as a per-computer opt-in option. The current set-up probably reduces the number of people who sign up for beta without realising that it will potentially lead to issues. I'd prefer to see any spare capacity invested in this compiler optimisation rather than any other changes though!

Is there any room for making better use of AMD HSA APUs (not that I have any!)? My understanding is that any other type of GPU optimisation isn't suitable for rosetta because even if bits of the code were GPU optimised, the overhead for switching from CPU RAM to GPU RAM would massively outweigh any benefits. HSA might change that but my far-from-expert understanding is that any GPU optimisations would require significant rewrite. Is that correct, or are there any low-hanging compiler options to make HSA useful (in this context)?



For every target environment, there is some effort required to make it work in that environment. If you are the developer/researcher, the question they ask is "How many systems are going to use this new feature and will it pay back the researcher effort for the port?" The Rosetta researchers have an idea about what the machine distribution looks like. I don't know if the number of AMD HSA APUs is sufficient to warrant the effort.

Timo indicated in another thread:
"Setting a longer run-time actually means that each WU you crunch will run more decoys (more searches through different paths of conformation-space) for the same core model/protein/data set. Many thousands of decoys are needed for obtaining an accurate prediction of a single protein."


This makes me think there is COARSE-GRAIN PARALLELISM at the DECOY level, and that Rosetta would be a candidate for using multiple CPUs, processing one DECOY per logical CPU simultaneously on the same data. Depending on the size of the data set being operated on, there may be some benefit from having common data cached. The researchers know the data set size and how it compares with the L2/L3 cache sizes of the machines.

I also expect there are some fine-grain parallel loops where vectors can be used.
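
A rough sketch of what decoy-level parallelism could look like, assuming each decoy depends only on shared read-only input plus its own random seed. Everything here is hypothetical: run_decoy is a stand-in scoring function, not Rosetta code, and the OpenMP pragma is just one possible way to spread the iterations over cores (without -fopenmp the pragma is ignored and the loop runs serially).

```c
#include <stdint.h>

/* Small self-contained generator so each decoy has private random state. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;   /* LCG constants from Numerical Recipes */
    return *state;
}

/* Hypothetical stand-in for one decoy: a short random search that
   returns the lowest "energy" it found. */
double run_decoy(const double *shared_input, int n_input, uint32_t seed)
{
    uint32_t state = seed;
    double best = 1e300;
    for (int step = 0; step < 1000; step++) {
        uint32_t pick = (next_rand(&state) >> 16) % (uint32_t)n_input;
        double noise = (double)(next_rand(&state) >> 8) / 16777216.0;  /* in [0, 1) */
        double e = shared_input[pick] + noise;
        if (e < best)
            best = e;
    }
    return best;
}

/* Decoys are independent, so the loop can be split across logical CPUs
   while all threads read the same input data. */
double best_of_decoys(const double *shared_input, int n_input, int n_decoys)
{
    double best = 1e300;
    #pragma omp parallel for reduction(min:best)
    for (int d = 0; d < n_decoys; d++) {
        double e = run_decoy(shared_input, n_input, (uint32_t)d + 1u);
        if (e < best)
            best = e;
    }
    return best;
}
```

Whether sharing the input between threads actually helps depends on how the working set compares with the cache sizes, which is the open question in the paragraph above.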









Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 78374 - Posted: 29 Jun 2015, 20:57:04 UTC - in response to Message 78373.  

This makes me think there is COARSE-GRAIN PARALLELISM at the DECOY level, and that Rosetta would be a candidate for using multiple CPUs, processing one DECOY per logical CPU simultaneously on the same data.


Yes, that works in theory. At present, R@h is not a multithreaded application, so the same result is generally achieved by running multiple WUs at the same time. The project needs tens of thousands of decoys, but a given WU generally runs dozens or hundreds. So multiplying that by the number of CPUs and running a single WU on all CPUs of a machine sort of puts all of your eggs in one basket. If there were a failure of some kind, more work would be lost. Having a different WU per CPU randomizes how much CPU time is lost when the system powers off, etc.
Rosetta Moderator: Mod.Sense

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 78377 - Posted: 30 Jun 2015, 0:59:03 UTC

This optimization workflow will take quite a bit of time, and I am discussing it, and will keep discussing it, with others in the Rosetta developer community, including computer scientists. We do have clang and icc build options as well, but we have limited developer/testing/optimization time. If there are easy-to-implement compiler optimization options that we can benefit from significantly, that would be best in the short term.

As for the linux test comparing our 32bit distributed app vs a 64bit sse4.2 and gcc 4.4.7 build, I see almost a 13% improvement in average run time per model. That is great!


P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 78378 - Posted: 30 Jun 2015, 3:26:51 UTC

Hi David.

I have a Haswell R Xeon running Ubuntu 14.04lts x64, I could put it on to Ralph for any testing you might do if it's any help.


rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,432,528
RAC: 15,783
Message 78380 - Posted: 30 Jun 2015, 13:35:01 UTC - in response to Message 78377.  
Last modified: 30 Jun 2015, 13:45:12 UTC

This optimization workflow will take quite a bit of time, and I am discussing it, and will keep discussing it, with others in the Rosetta developer community, including computer scientists. We do have clang and icc build options as well, but we have limited developer/testing/optimization time. If there are easy-to-implement compiler optimization options that we can benefit from significantly, that would be best in the short term.

As for the linux test comparing our 32bit distributed app vs a 64bit sse4.2 and gcc 4.4.7 build, I see almost a 13% improvement in average run time per model. That is great!



13% is not a surprise and by a simple recompile, you have increased the volunteer donation by 13%. BUT!!! The important question is "Did you get the right answer faster?" 8-)
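
One hedged way to frame the "right answer" check is a relative-difference comparison between the old and the new build. The helper below is only a sketch: rel_diff is a hypothetical name, and the acceptable tolerance is something only the researchers can choose. For scale, the single precision sums from the gcc and ICC runs earlier in this thread differ by about 3e-5 relative, and the double precision sums by about 1e-13.

```c
/* Relative difference between two results, scaled by the larger magnitude.
   Returns 0 when both inputs are exactly zero. */
double rel_diff(double a, double b)
{
    double abs_a = a < 0.0 ? -a : a;
    double abs_b = b < 0.0 ? -b : b;
    double scale = abs_a > abs_b ? abs_a : abs_b;
    double diff  = a > b ? a - b : b - a;
    return scale == 0.0 ? 0.0 : diff / scale;
}
```

A validation run would compare the energies of the same decoys (same seeds) from both binaries and flag any pair whose rel_diff exceeds the chosen tolerance.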

There should be very few applications that do not benefit from a 64-bit version. There are more registers available in the 64-bit model (16 vs 8), so an application has to spill values to temporary variables on the stack less often. If you have access to the Intel compiler, look at the option:

-ax<CODE>

The "-a" part tells the ICC compiler to generate multiple, feature-specific, auto-dispatch code paths for Intel processors if there is a performance benefit. Since you are looking at all machines, I would set "<CODE>" for the newest machine .... Haswell.

The "fat binary" dispatcher checks the CPU type at runtime and configures itself to use code paths that it thinks are best suited to your CPU. All the Rosetta volunteers should see improvements.

There are several ICC specific libraries that will have to be redistributed or statically linked with the Rosetta binary.
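
The dispatch idea can be sketched by hand in C. This is only an illustration of what a CPU-dispatching binary does, not what ICC actually emits: sum_generic, sum_avx2_path and pick_sum are hypothetical names, and __builtin_cpu_supports is a gcc/clang builtin for x86, not an ICC feature.

```c
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Baseline path: would be built for the oldest supported CPU. */
float sum_generic(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Placeholder for a path built with AVX2 enabled; in a real build this
   function would live in a file compiled with different options. */
float sum_avx2_path(const float *a, size_t n)
{
    return sum_generic(a, n);
}

/* Check the CPU once at run time and return the best available path. */
sum_fn pick_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2_path;
#endif
    return sum_generic;
}
```

The caller asks pick_sum() for a function pointer once at startup and uses it everywhere, so the CPU check is not repeated in the hot loop.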

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78386 - Posted: 30 Jun 2015, 17:52:54 UTC - in response to Message 78378.  

I have a Haswell R Xeon running Ubuntu 14.04lts x64, I could put it on to Ralph for any testing you might do if it's any help.


My virtual linux machines are ready!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78396 - Posted: 2 Jul 2015, 9:42:07 UTC - in response to Message 78343.  

I'll push it out to ralph soon.


Will you release only the new optimized version (3.60, I suppose) or continue with two versions of 3.59 (optimized and non-optimized)?


Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 78397 - Posted: 2 Jul 2015, 19:49:00 UTC - in response to Message 78378.  


I have a Haswell R Xeon running Ubuntu 14.04lts x64, I could put it on to Ralph for any testing you might do if it's any help.


In order to fully utilize the potential of AVX2 (and other instruction sets), I still recommend switching to a newer gcc version, e.g. 4.9.

This article is quite interesting as it highlights some pitfalls while programming.

Auto-vectorization with gcc 4.7

At an even higher level is auto-vectorization. There, code isn't re-written at all. It remains portable C, and the compiler automagically determines how to vectorize it. Since ideally, no work needs to be done, one can simply recompile with a new compiler and get all the speed advantages of vectorization with very little effort required. The big question though, is how much code can gcc vectorize? If very few loops can be, then this feature isn't too useful. Conversely, if gcc is very smart, then the lower level techniques aren't necessary any more.


and

On the other hand, gcc will still attempt to vectorize code which hasn't had changes done to it at all. It just won't be able to get nearly as much of a performance improvement as you might hope.

However, as time passes, more inner loop patterns will be added to the vectorizable list. Thus if you are using later versions of gcc, don't take the above results for granted. Check the output of the compiler yourself to see if it is behaving as you might expect. You might be pleasantly surprised by what it can do.
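
As a sketch of the kind of loop the article is talking about, the function below is written so that an auto-vectorizer has an easy job: unit-stride accesses, no branches in the body, and restrict to promise that the arrays do not overlap. saxpy is an illustrative name, not Rosetta code. With gcc, vectorization is enabled at -O3 (or with -ftree-vectorize), and recent gcc versions can report which loops were vectorized with -fopt-info-vec; whether this particular loop is vectorized still depends on the compiler version and target flags.

```c
#include <stddef.h>

/* y[i] += a * x[i]: independent iterations, contiguous memory, no aliasing. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Checking the compiler report, or the disassembly as done earlier in this thread, is the only reliable way to know whether a given loop was vectorized.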


In the long run I seriously doubt that GPGPU will survive homogeneous computing through AVX2 / AVX512 and successors...

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,675
RAC: 7,690
Message 78399 - Posted: 3 Jul 2015, 6:51:27 UTC - in response to Message 78397.  

In order to fully utilize the potential of AVX2 (and other instruction sets), I still recommend switching to a newer gcc version, e.g. 4.9.

Or 5.1

In the long run I seriously doubt that GPGPU will survive homogeneous computing through AVX2 / AVX512 and successors...

I don't think so. GPU computational power (if the software is OK) outclasses the CPU.

Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 78400 - Posted: 3 Jul 2015, 7:59:23 UTC - in response to Message 78399.  

I don't think so. GPU computational power (if the software is OK) outclasses the CPU.

We'll see how Knights Landing performs with its AVX512 support and many more cores than a regular desktop CPU.

The logical future: AVX1024 and TSX to handle all those cores efficiently.



©2024 University of Washington
https://www.bakerlab.org