GPU Potential

Message boards : Number crunching : GPU Potential

SuperSluether

Joined: 7 Jul 14
Posts: 10
Credit: 1,357,990
RAC: 0
Message 77632 - Posted: 8 Nov 2014, 23:36:03 UTC

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?

Mark
Joined: 10 Nov 13
Posts: 40
Credit: 397,847
RAC: 0
Message 77633 - Posted: 9 Nov 2014, 0:58:34 UTC - in response to Message 77632.  

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?


This is a question that gets asked regularly. The answer: it would be very hard, and it is not at the top of the priority list.

Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 77720 - Posted: 4 Dec 2014, 18:23:54 UTC - in response to Message 77632.  
Last modified: 4 Dec 2014, 18:24:40 UTC

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?

The way Rosetta@home folds proteins is extremely serial: each step of creating a decoy (a guess as to what the folded protein would be shaped like) feeds upon the previous step. The only steps that do not feed upon each other are those of creating a new decoy after the previous decoy is finished. There is not much to parallelize in this program, so a GPU would fail in this job.
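
The serial structure described above can be sketched in a few lines of Python. The move and score functions here are toy stand-ins for Rosetta's real fragment moves and score function, and all names are illustrative:

```python
import random

def perturb(conformation, rng):
    # Hypothetical move: jitter one backbone angle (a stand-in for a Rosetta fragment move).
    i = rng.randrange(len(conformation))
    new = list(conformation)
    new[i] += rng.uniform(-10.0, 10.0)
    return new

def energy(conformation):
    # Toy score function: sum of squared angles (Rosetta uses a physical score).
    return sum(a * a for a in conformation)

def make_decoy(n_angles=10, n_steps=1000, seed=0):
    """One Monte Carlo trajectory: every step consumes the result of the
    previous step, so the steps inside one decoy cannot run in parallel."""
    rng = random.Random(seed)
    best = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    for _ in range(n_steps):
        candidate = perturb(best, rng)   # depends on the current 'best'
        if energy(candidate) < energy(best):
            best = candidate             # the next step depends on this update
    return energy(best)

# Separate decoys are mutually independent, so *they* can run in parallel;
# that is coarse-grained CPU parallelism, not the fine-grained kind GPUs need.
scores = [make_decoy(seed=s) for s in range(4)]
```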

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77721 - Posted: 5 Dec 2014, 10:02:03 UTC - in response to Message 77720.  

There is not much to parallelize in this program, so a GPU would fail in this job.


The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.
This is why I hope for the AVX/AVX2 extensions, which are probably easier to implement and would give an immediate gain (look at the first 100 CPUs in Statistics).

alex
Joined: 21 Dec 14
Posts: 8
Credit: 2,668,966
RAC: 3
Message 77766 - Posted: 26 Dec 2014, 22:05:48 UTC - in response to Message 77721.  


The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.
This is why I hope for the AVX/AVX2 extensions, which are probably easier to implement and would give an immediate gain (look at the first 100 CPUs in Statistics).


Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77769 - Posted: 27 Dec 2014, 14:47:28 UTC - in response to Message 77766.  

Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!


But:
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.


Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 77778 - Posted: 29 Dec 2014, 19:17:18 UTC - in response to Message 77769.  

Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!


But:
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.

AMD's latest CPUs can support both FMA3 and FMA4.

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,641,236
RAC: 111
Message 77805 - Posted: 3 Jan 2015, 2:31:50 UTC

In other news... Red Hat Engineer Improves Math Performance of Glibc - this might mean some increased performance is just a compiler upgrade away (no idea who does the actual builds of the Rosetta core or what compilers are currently in use; interesting optimizations to be had anyway)

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77822 - Posted: 13 Jan 2015, 6:51:05 UTC - in response to Message 77721.  

The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.


Gerasim@home, for example, doesn't use CUDA or OpenCL, but C++ AMP.
In a few days they went from less than 30 GFlops to over 300 GFlops...


sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77824 - Posted: 13 Jan 2015, 8:40:43 UTC - in response to Message 77822.  
Last modified: 13 Jan 2015, 9:31:26 UTC

The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.


Gerasim@home, for example, doesn't use CUDA or OpenCL, but C++ AMP.
In a few days they went from less than 30 GFlops to over 300 GFlops...


i'm not too sure if those who're 'blessed' with the *very expensive* Intel MKL could relink rosetta against it (and only if rosetta happens to use specific calls and algorithms that MKL optimizes), and i'd also think perhaps only those who own a recent cpu (e.g. ivy bridge or haswell) may see the benefits. there's also ACML from amd, which is presumably biased to benefit amd platforms better

but i'd guess doing so could be deemed too 'biased', as all the rest who don't benefit from that hardware would complain they are 'left in the cold' :o :p lol

i'd guess that while the benefits are there, there are simply many varied platforms to support (this is true even for cuda/opencl), and not all platforms (i.e. GPUs) support the features necessary for accelerated computation. cuda/opencl cards are known for drastically reduced *double precision* floating point performance vs single precision (some cut it to as little as 1/8 of the single precision rate), and the GPUs that handle *accelerated double precision floating point* well are probably *very expensive*. let alone the fact that it may require significant rework (using totally different methods) just to get the performance gains.
http://www.cs.virginia.edu/~mwb7w/cuda_support/double.html
On the GTX 280 & 260, while a multiprocessor has eight single-precision floating point ALUs (one per core), it has only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating point computations, switching from single-precision to double-precision will increase runtime by a factor of approximately eight.



i'd guess that for *complex problems* with a lot of *iterative dependencies* (e.g. where the next iteration depends on the results of a prior iteration, and those results cannot be predicted), there is also a significant limit: it could be impossible to parallelize, or the parallelized results may lead to wrong answers
http://en.wikipedia.org/wiki/Amdahl%27s_law
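
The point above can be quantified. Amdahl's law caps the speedup by the serial fraction of the work, no matter how many GPU cores you throw at it; a quick Python sketch:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: overall speedup when only parallel_fraction of the
    runtime can be spread across n_workers; the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# A 90%-serial workload barely moves even with thousands of GPU cores:
print(round(amdahl_speedup(0.10, 2048), 3))   # 1.111
# Even a 95%-parallel workload is capped near 1/0.05 = 20x:
print(round(amdahl_speedup(0.95, 2048), 2))   # 19.82
```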

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77837 - Posted: 17 Jan 2015, 2:41:08 UTC
Last modified: 17 Jan 2015, 2:57:00 UTC

this is a somewhat 'techie/geeky' post, ignore it if you find it that way :o lol

apparently AMD / Nvidia etc have pushed the envelope of 'old school' GPU technology. the higher end AMD / Nvidia GPU is effectively today's *vector CPU* (along the notions of the earlier vector supercomputers, e.g. cray)

this is most apparent from the Vector ALU instructions in this AMD technical document, which i'd think would be utilised by technologies such as OpenCL / CUDA:
http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf
AMD SOUTHERN ISLANDS SERIES TECHNOLOGY

Table 6.2 Vector ALU Instruction Set (excerpt)
V_ADD_F64, V_MUL_F64, V_MIN_F64, V_MAX_F64, V_LDEXP_F64,
V_MUL_{LO,HI}_{I32,U32}, V_LSHL_B64, V_LSHR_B64, V_ASHR_I64,
V_MAC_F32, V_MADMK_F32, V_MADAK_F32, V_BCNT_U32_B32,
V_MBCNT_LO_U32_B32, V_MBCNT_HI_U32_B32, V_ADDC_U32, V_SUBB_U32,
V_SUBBREV_U32, V_RSQ_{F32,F64}, V_RSQ_LEGACY_F32, V_SQRT_{F32,F64},
V_SIN_F32, V_COS_F32, V_NOT_B32, V_XOR_B32, V_BFM_B32, V_BFREV_B32,
V_FFBH_U32, V_FFBL_B32, V_FFBH_I32, V_MSAD_U8
... lots more


That's significantly more elaborate than what used to be understood as a 'GPU' (graphics processing unit).
In effect the GPU packs the punch of parallel (SIMD, single instruction multiple data) computation into what's traditionally done on the CPU, and today's GPUs apparently pack dozens to even hundreds of such vector ALU cores per GPU for vector accelerated computation.
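
A toy Python model of the SIMD idea: one "instruction" operates on every lane of a wavefront at once, in the spirit of a V_MAC_F32-style multiply-accumulate. The lane count matches GCN's 64-wide wavefront; everything else here is illustrative, not the real hardware interface:

```python
LANES = 64  # a GCN wavefront is 64 work-items wide

def v_mac_f32(acc, a, b):
    """Toy vector multiply-accumulate: acc[i] += a[i] * b[i] for every lane,
    conceptually issued as a single instruction across the whole wavefront."""
    return [acc_i + a_i * b_i for acc_i, a_i, b_i in zip(acc, a, b)]

a   = [float(i) for i in range(LANES)]
b   = [2.0] * LANES
acc = [0.0] * LANES

acc = v_mac_f32(acc, a, b)   # one "instruction", 64 multiply-adds
```

A scalar CPU would need a 64-iteration loop for the same work, which is the whole pitch of vector hardware.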

i.e. to leverage today's AMD/Nvidia platforms, in particular the 'GPU' vector ALU instructions/technologies, it is apparently necessary to re-write / redesign programs to use these technologies (e.g. OpenCL / CUDA)

this is unfortunate, as much of today's software uses 'traditional' intel x86 style instructions and much less is designed for vector computation. the other thing is that GPUs often run at lower frequencies (e.g. 1 GHz) compared to today's CPUs at say 3-4 GHz.

thus on many of the 'benchmark' web sites that pit Intel vs AMD CPUs, the apparent lack of prowess of AMD CPUs may simply be that the benchmark programs compare x86 instruction prowess, which probably puts AMD at a 'disadvantage', as AMD's designs seem more optimized towards vector computation.

if vectorized OpenCL computation were compared against Intel CPU tasks, with the program optimised for each platform (GPU vs CPU) and intended to produce the same results, i'd guess the GPU vs CPU benchmarks, say in terms of GFlops of processing prowess, would be much closer, and the *high end* GPUs would likely exceed the CPU's prowess

unfortunately *high end* is needed as a qualifier here, as a lot of 'lower end' GPUs use software emulation for double precision float vector computation, which cripples performance to say 1/8 of that possible for single precision floats.
and the 'high end' GPUs are likely the *expensive* GPUs today

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77838 - Posted: 17 Jan 2015, 3:17:31 UTC
Last modified: 17 Jan 2015, 3:27:17 UTC

just for curiosity's sake, today's home desktop teraflop (trying hard to be petaflop) vector supercomputer? :o lol
http://en.wikipedia.org/wiki/Radeon_HD_9000_Series
Radeon R9 295X2 ("Vesuvius", GCN 1.1, launched Apr 8, 2014): 2× 2816:176:64 shaders at 1018 MHz, 2× 4096 MB GDDR5 (1250 MHz, effective 5000), PCIe 3.0 ×16
11466.752 GFlops single precision float
1433.344 GFlops double precision float
TDP 500 W

http://www.pcworld.com/article/2140581/amd-radeon-r9-290x2-review-if-you-have-the-cash-amd-has-the-compute-power.html
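
As a sanity check, the headline figures above follow from shader count × clock × 2 FLOPs per clock (a fused multiply-add counts as two operations), with double precision at 1/8 the single precision rate on this card:

```python
def peak_gflops(shaders, clock_mhz, flops_per_clock=2):
    # Peak throughput = shader count x clock (MHz) x FLOPs per clock,
    # divided by 1000 to convert MFlops to GFlops.
    return shaders * clock_mhz * flops_per_clock / 1000.0

# Radeon R9 295X2: two GPUs of 2816 shaders each, at 1018 MHz.
sp = peak_gflops(2 * 2816, 1018)
dp = sp / 8          # double precision runs at 1/8 the SP rate on this card

print(sp)  # 11466.752
print(dp)  # 1433.344
```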

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77841 - Posted: 17 Jan 2015, 7:58:35 UTC - in response to Message 77824.  
Last modified: 17 Jan 2015, 7:59:07 UTC

i'm not too sure if those who're 'blessed' with the *very expensive* Intel MKL could relink rosetta against it (and only if rosetta happens to use specific calls and algorithms that MKL optimizes), and i'd also think perhaps only those who own a recent cpu (e.g. ivy bridge or haswell) may see the benefits. there's also ACML from amd, which is presumably biased to benefit amd platforms better

That's for sure.
But, for example, the SSEx extensions are cross-platform AMD/Intel (like AVX).
Some projects use these (Seti, Poem, Einstein, etc.).
http://boinc.berkeley.edu/trac/wiki/AppPlan

But only the admins can answer our questions.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77909 - Posted: 10 Feb 2015, 9:55:49 UTC

Now GPUGRID also has its OpenCL client (besides Poem@home).

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 78053 - Posted: 22 Mar 2015, 22:06:31 UTC

The new (provisional) version of OpenCL (2.1) supports C++ and, thanks to the new SPIR, a lot of languages.

Khronos OpenCL

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 78064 - Posted: 26 Mar 2015, 0:19:05 UTC

I think Rosetta might be able to harness GPUs once this whole "unified memory" thing comes out from nVidia and AMD.

AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine, a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and CPU should be able to access VRAM directly.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 78186 - Posted: 6 May 2015, 15:35:58 UTC - in response to Message 78064.  

AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine, a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and CPU should be able to access VRAM directly.


The AMD Carrizo APU (the first HSA-compliant CPU) may be a first solution.
"Unified memory" is now, I think, a beta technology, but in the future...




©2024 University of Washington
https://www.bakerlab.org