GPU Potential

Message boards : Number crunching : GPU Potential

SuperSluether

Joined: 7 Jul 14
Posts: 10
Credit: 1,357,990
RAC: 0
Message 77632 - Posted: 8 Nov 2014, 23:36:03 UTC

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?

Mark
Joined: 10 Nov 13
Posts: 40
Credit: 397,847
RAC: 0
Message 77633 - Posted: 9 Nov 2014, 0:58:34 UTC - in response to Message 77632.  

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?


This is a question that gets asked regularly. The answer: it would be very hard, and it is not at the top of the priority list.

Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 77720 - Posted: 4 Dec 2014, 18:23:54 UTC - in response to Message 77632.  
Last modified: 4 Dec 2014, 18:24:40 UTC

From my standpoint, this project has potential with a GPU app. BOINC regularly polls this project trying to find tasks for my Nvidia, even though I don't see any GPU apps. From what I know, the way data is analyzed and processed, I think a GPU would be able to process tasks much faster than a CPU. There's plenty of RAM to hold the process, and the multiple cores would allow for more data to be sent through. How hard would it be to implement a GPU version of Rosetta@home?

The way Rosetta@home folds proteins is extremely serial: each step of creating a decoy (a guess as to what the folded protein would be shaped like) feeds upon the previous step. The only steps that do not feed upon each other are those of creating a new decoy after the previous decoy is finished. There is not much to parallelize in this program, so a GPU would fail in this job.
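
The serial structure described above can be sketched in a few lines of Python. The move and score functions here are toy stand-ins for Rosetta's real fragment moves and score function, and all names are illustrative:

```python
import random

def perturb(conformation, rng):
    # Hypothetical move: jitter one backbone angle (a stand-in for a Rosetta fragment move).
    i = rng.randrange(len(conformation))
    new = list(conformation)
    new[i] += rng.uniform(-10.0, 10.0)
    return new

def energy(conformation):
    # Toy score function: sum of squared angles (Rosetta uses a physical score).
    return sum(a * a for a in conformation)

def make_decoy(n_angles=10, n_steps=1000, seed=0):
    """One Monte Carlo trajectory: every step consumes the result of the
    previous step, so the steps inside one decoy cannot run in parallel."""
    rng = random.Random(seed)
    best = [rng.uniform(-180.0, 180.0) for _ in range(n_angles)]
    for _ in range(n_steps):
        candidate = perturb(best, rng)   # depends on the current 'best'
        if energy(candidate) < energy(best):
            best = candidate             # the next step depends on this update
    return energy(best)

# Separate decoys are mutually independent, so *they* can run in parallel;
# that is coarse-grained CPU parallelism, not the fine-grained kind GPUs need.
scores = [make_decoy(seed=s) for s in range(4)]
```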

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77721 - Posted: 5 Dec 2014, 10:02:03 UTC - in response to Message 77720.  

There is not much to parallelize in this program, so a GPU would fail in this job.


The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.
This is why I hope for the AVX/AVX2 extensions, which are probably easier to implement and would give an immediate gain (look at the first 100 CPUs in Statistics).

alex
Joined: 21 Dec 14
Posts: 8
Credit: 2,668,966
RAC: 3
Message 77766 - Posted: 26 Dec 2014, 22:05:48 UTC - in response to Message 77721.  


The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.
This is why I hope for the AVX/AVX2 extensions, which are probably easier to implement and would give an immediate gain (look at the first 100 CPUs in Statistics).


Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77769 - Posted: 27 Dec 2014, 14:47:28 UTC - in response to Message 77766.  

Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!


But:
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.


Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 77778 - Posted: 29 Dec 2014, 19:17:18 UTC - in response to Message 77769.  

Please do not forget the FMA3 / FMA4 capabilities of the AMD CPUs. Crunch3r's FMA4 implementation @ Asteroids@Home is extremely efficient!


But:
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.

AMD's latest CPUs can support both FMA3 and FMA4.

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,641,236
RAC: 111
Message 77805 - Posted: 3 Jan 2015, 2:31:50 UTC

In other news... Red Hat Engineer Improves Math Performance of Glibc - this might mean some increased performance is just a compiler upgrade away (no idea who does the actual builds of the Rosetta core or what compilers are currently in use; interesting optimizations to be had anyway)

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77822 - Posted: 13 Jan 2015, 6:51:05 UTC - in response to Message 77721.  

The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.


Gerasim@home, for example, doesn't use CUDA or OpenCL, but C++ AMP.
In a few days they went from less than 30 GFlops to over 300 GFlops...


sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77824 - Posted: 13 Jan 2015, 8:40:43 UTC - in response to Message 77822.  
Last modified: 13 Jan 2015, 9:31:26 UTC

The Rosetta admins know the power of the GPU (see, for example, the Poem@home project), but you are right, it's VERY difficult to implement.


Gerasim@home, for example, doesn't use CUDA or OpenCL, but C++ AMP.
In a few days they went from less than 30 GFlops to over 300 GFlops...


i'm not too sure if those who're 'blessed' with the *very expensive* Intel MKL could relink rosetta against it (and only if rosetta happens to use specific calls and algorithms that MKL optimizes), and i'd also think perhaps only those who own a recent cpu (e.g. ivy bridge or haswell) may see the benefits. there's also ACML from amd, which is presumably biased to benefit amd platforms better

but i'd guess doing so could be deemed too 'biased', as all the rest who don't benefit from that hardware would complain they are 'left in the cold' :o :p lol

i'd guess that while the benefits are there, there are simply many varied platforms to support (this is true even for cuda/opencl), and not all platforms (i.e. GPUs) support the features necessary for accelerated computation. cuda/opencl cards are known for drastically reduced *double precision* floating point performance vs single precision (some cut it to as little as 1/8 of the single precision rate), and the GPUs that handle *accelerated double precision floating point* well are probably *very expensive*. let alone the fact that it may require significant rework (using totally different methods) just to get the performance gains.
http://www.cs.virginia.edu/~mwb7w/cuda_support/double.html
On the GTX 280 & 260, while a multiprocessor has eight single-precision floating point ALUs (one per core), it has only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating point computations, switching from single-precision to double-precision will increase runtime by a factor of approximately eight.



i'd guess that for *complex problems* with a lot of *iterative dependencies* (e.g. where the next iteration depends on the results of a prior iteration, and those results cannot be predicted), there is also a significant limit: it could be impossible to parallelize, or the parallelized results may lead to wrong answers
http://en.wikipedia.org/wiki/Amdahl%27s_law
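
The point above can be quantified. Amdahl's law caps the speedup by the serial fraction of the work, no matter how many GPU cores you throw at it; a quick Python sketch:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: overall speedup when only parallel_fraction of the
    runtime can be spread across n_workers; the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# A 90%-serial workload barely moves even with thousands of GPU cores:
print(round(amdahl_speedup(0.10, 2048), 3))   # 1.111
# Even a 95%-parallel workload is capped near 1/0.05 = 20x:
print(round(amdahl_speedup(0.95, 2048), 2))   # 19.82
```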

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77837 - Posted: 17 Jan 2015, 2:41:08 UTC
Last modified: 17 Jan 2015, 2:57:00 UTC

this is a somewhat 'techie/geeky' post, ignore it if you find it that way :o lol

apparently AMD / Nvidia etc have pushed the envelope of 'old school' GPU technology. the higher end AMD / Nvidia GPU is effectively today's *vector CPU* (along the notions of the earlier vector supercomputers, e.g. cray)

this is most apparent from the Vector ALU instructions in this AMD technical document, which i'd think would be utilised by technologies such as OpenCL / CUDA:
http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf
AMD SOUTHERN ISLANDS SERIES TECHNOLOGY

Table 6.2 Vector ALU Instruction Set (excerpt)
V_ADD_F64, V_MUL_F64, V_MIN_F64, V_MAX_F64, V_LDEXP_F64,
V_MUL_{LO,HI}_{I32,U32}, V_LSHL_B64, V_LSHR_B64, V_ASHR_I64,
V_MAC_F32, V_MADMK_F32, V_MADAK_F32, V_BCNT_U32_B32,
V_MBCNT_LO_U32_B32, V_MBCNT_HI_U32_B32, V_ADDC_U32, V_SUBB_U32,
V_SUBBREV_U32, V_RSQ_{F32,F64}, V_RSQ_LEGACY_F32, V_SQRT_{F32,F64},
V_SIN_F32, V_COS_F32, V_NOT_B32, V_XOR_B32, V_BFM_B32, V_BFREV_B32,
V_FFBH_U32, V_FFBL_B32, V_FFBH_I32, V_MSAD_U8
... lots more


That's significantly more elaborate than what used to be understood as a 'GPU' (graphics processing unit).
In effect the GPU packs the punch of parallel (SIMD, single instruction multiple data) computation into what's traditionally done on the CPU, and today's GPUs apparently pack dozens to even hundreds of such vector ALU cores per GPU for vector accelerated computation.
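
A toy Python model of the SIMD idea: one "instruction" operates on every lane of a wavefront at once, in the spirit of a V_MAC_F32-style multiply-accumulate. The lane count matches GCN's 64-wide wavefront; everything else here is illustrative, not the real hardware interface:

```python
LANES = 64  # a GCN wavefront is 64 work-items wide

def v_mac_f32(acc, a, b):
    """Toy vector multiply-accumulate: acc[i] += a[i] * b[i] for every lane,
    conceptually issued as a single instruction across the whole wavefront."""
    return [acc_i + a_i * b_i for acc_i, a_i, b_i in zip(acc, a, b)]

a   = [float(i) for i in range(LANES)]
b   = [2.0] * LANES
acc = [0.0] * LANES

acc = v_mac_f32(acc, a, b)   # one "instruction", 64 multiply-adds
```

A scalar CPU would need a 64-iteration loop for the same work, which is the whole pitch of vector hardware.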

i.e. to leverage today's AMD/Nvidia platforms, in particular the 'GPU' vector ALU instructions/technologies, it is apparently necessary to re-write / redesign programs to use these technologies (e.g. OpenCL / CUDA)

this is unfortunate, as much of today's software uses 'traditional' intel x86 style instructions and much less is designed for vector computation. the other thing is that GPUs often run at lower frequencies (e.g. 1 GHz) compared to today's CPUs at say 3-4 GHz.

thus on many of the 'benchmark' web sites that pit Intel vs AMD CPUs, the apparent lack of prowess of AMD CPUs may simply be that the benchmark programs compare x86 instruction prowess, which probably puts AMD at a 'disadvantage', as AMD's designs seem more optimized towards vector computation.

if vectorized OpenCL computation were compared against Intel CPU tasks, with the program optimised for each platform (GPU vs CPU) and intended to produce the same results, i'd guess the GPU vs CPU benchmarks, say in terms of GFlops of processing prowess, would be much closer, and the *high end* GPUs would likely exceed the CPU's prowess

unfortunately *high end* is needed as a qualifier here, as a lot of 'lower end' GPUs use software emulation for double precision float vector computation, which cripples performance to say 1/8 of that possible for single precision floats.
and the 'high end' GPUs are likely the *expensive* GPUs today

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77838 - Posted: 17 Jan 2015, 3:17:31 UTC
Last modified: 17 Jan 2015, 3:27:17 UTC

just for curiosity's sake, today's home desktop teraflop (trying hard to be petaflop) vector supercomputer? :o lol
http://en.wikipedia.org/wiki/Radeon_HD_9000_Series
Radeon R9 295X2 ("Vesuvius", GCN 1.1, launched Apr 8, 2014): 2× 2816:176:64 shaders at 1018 MHz, 2× 4096 MB GDDR5 (1250 MHz, effective 5000), PCIe 3.0 ×16
11466.752 GFlops single precision float
1433.344 GFlops double precision float
TDP 500 W

http://www.pcworld.com/article/2140581/amd-radeon-r9-290x2-review-if-you-have-the-cash-amd-has-the-compute-power.html
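
As a sanity check, the headline figures above follow from shader count × clock × 2 FLOPs per clock (a fused multiply-add counts as two operations), with double precision at 1/8 the single precision rate on this card:

```python
def peak_gflops(shaders, clock_mhz, flops_per_clock=2):
    # Peak throughput = shader count x clock (MHz) x FLOPs per clock,
    # divided by 1000 to convert MFlops to GFlops.
    return shaders * clock_mhz * flops_per_clock / 1000.0

# Radeon R9 295X2: two GPUs of 2816 shaders each, at 1018 MHz.
sp = peak_gflops(2 * 2816, 1018)
dp = sp / 8          # double precision runs at 1/8 the SP rate on this card

print(sp)  # 11466.752
print(dp)  # 1433.344
```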

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77841 - Posted: 17 Jan 2015, 7:58:35 UTC - in response to Message 77824.  
Last modified: 17 Jan 2015, 7:59:07 UTC

i'm not too sure if those who're 'blessed' with the *very expensive* Intel MKL could relink rosetta against it (and only if rosetta happens to use specific calls and algorithms that MKL optimizes), and i'd also think perhaps only those who own a recent cpu (e.g. ivy bridge or haswell) may see the benefits. there's also ACML from amd, which is presumably biased to benefit amd platforms better

That's for sure.
But, for example, the SSEx extensions are cross-platform AMD/Intel (like AVX).
Some projects use these (Seti, Poem, Einstein, etc.).
http://boinc.berkeley.edu/trac/wiki/AppPlan

But only the admins can answer our questions.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 77909 - Posted: 10 Feb 2015, 9:55:49 UTC

Now GPUGRID also has its OpenCL client (besides Poem@home).

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 78053 - Posted: 22 Mar 2015, 22:06:31 UTC

The new (provisional) version of OpenCL (2.1) supports C++ and, thanks to the new SPIR, a lot of languages.

Khronos OpenCL

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 78064 - Posted: 26 Mar 2015, 0:19:05 UTC

I think Rosetta might be able to harness GPUs once this whole "unified memory" thing comes out from nVidia and AMD.

AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine, a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and CPU should be able to access VRAM directly.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 78186 - Posted: 6 May 2015, 15:35:58 UTC - in response to Message 78064.  

AFAIK, the main problem with running Rosetta on a GPU is RAM. Just imagine, a single WU uses about 0.5 GB. But with unified memory, the GPU should be able to access system RAM and CPU should be able to access VRAM directly.


The AMD Carrizo APU (the first HSA-compliant CPU) may be a first solution.
"Unified memory" is now, I think, a beta technology, but in the future...




©2024 University of Washington
https://www.bakerlab.org