GPU computing

Message boards : Number crunching : GPU computing



Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80709 - Posted: 5 Oct 2016, 23:50:01 UTC - in response to Message 80706.  


Is there any progress worth speaking of?


Plenty of scientific progress!

https://www.bakerlab.org/wp-content/uploads/2016/09/HuangBoyken_DeNovoDesign_Nature2016.pdf

https://www.bakerlab.org/wp-content/uploads/2016/09/Bhardwaj_Nature_2016.pdf

ID: 80709
Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,436,301
RAC: 14,059
Message 80710 - Posted: 6 Oct 2016, 1:12:19 UTC - in response to Message 80709.  


Is there any progress worth speaking of?

Plenty of scientific progress!

https://www.bakerlab.org/wp-content/uploads/2016/09/HuangBoyken_DeNovoDesign_Nature2016.pdf

https://www.bakerlab.org/wp-content/uploads/2016/09/Bhardwaj_Nature_2016.pdf

I think the question was about re-coding to take advantage of newer protocols. But as for these papers from some weeks ago, these are the sort of things that should be posted up in the Science forum when they're available.
ID: 80710
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80711 - Posted: 6 Oct 2016, 5:23:31 UTC - in response to Message 80710.  


Is there any progress worth speaking of?

Plenty of scientific progress!

https://www.bakerlab.org/wp-content/uploads/2016/09/HuangBoyken_DeNovoDesign_Nature2016.pdf

https://www.bakerlab.org/wp-content/uploads/2016/09/Bhardwaj_Nature_2016.pdf

I think the question was about re-coding to take advantage of newer protocols. But as for these papers from some weeks ago, these are the sort of things that should be posted up in the Science forum when they're available.


They have been posted and tweeted. Lots of cool science happening recently.
ID: 80711
Profile Dr. Merkwürdigliebe

Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80712 - Posted: 6 Oct 2016, 14:14:45 UTC - in response to Message 80708.  


There is really no way that scientists should have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.


I hope not.
If it is a "kludge", why are you here?


Because I want to help. That's why.

One of the problems is the heterogeneous architecture of rosetta@home: There are PCs, Macs and tablets/smartphones (seriously?). Why not an internet-connected dual-core toaster?

There are a lot of issues with these devices and this is IMHO a waste of developer resources.

A homogeneous architecture based on AVXx would alleviate all those problems while yielding a higher performance.

The distributed nature of Rosetta also introduces latencies: preparing work, zipping it, sending it, and collecting the results back over a WAN. Being forced to deal with ultra-lame ancient CPUs and the like is another problem.
ID: 80712
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80713 - Posted: 6 Oct 2016, 18:11:27 UTC - in response to Message 80712.  


There is really no way that scientists should have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.


I hope not.
If it is a "kludge", why are you here?


Because I want to help. That's why.

One of the problems is the heterogeneous architecture of rosetta@home: There are PCs, Macs and tablets/smartphones (seriously?). Why not an internet-connected dual-core toaster?

There are a lot of issues with these devices and this is IMHO a waste of developer resources.

A homogeneous architecture based on AVXx would alleviate all those problems while yielding a higher performance.

The distributed nature of Rosetta also introduces latencies: preparing work, zipping it, sending it, and collecting the results back over a WAN. Being forced to deal with ultra-lame ancient CPUs and the like is another problem.



We do use local and UW hosted clusters.

https://itconnect.uw.edu/service/shared-scalable-compute-cluster-for-research-hyak/

We also have been given time on cloud computing resources.

We also have had many, many compute years awarded on supercomputing resources like Blue Gene, etc.

Specific questions and concerns about code development and optimizations are more of a Rosetta Commons issue. They have hired developers to tackle such issues. Keep in mind, we are a research lab whose main priority is research.

One overlooked benefit to distributed computing is getting people familiar with science and allowing them to be directly a part of it.
ID: 80713
Profile robertmiles

Joined: 16 Jun 08
Posts: 1224
Credit: 13,842,027
RAC: 1,584
Message 80721 - Posted: 9 Oct 2016, 4:59:26 UTC - in response to Message 80082.  

I've just been looking at the performance of the new GTX1080 and for DOUBLE precision calculations it does 4 Tflops!!!! For comparison a relatively high performance chip like an overclocked 5820K will do maybe 350GFlops. So we are talking an order of magnitude difference. In addition the Tesla HPC version will probably be double that at 8 TFlops. (Edit: Looks like it is actually 5.3TFlops) The Volta version of the gtx1080 (next gen on, due in about 18 months time) is rumoured to be 7TFlops FP64 in the consumer version.

There is no way that conventional processors can keep up with that level of calculation. How large does the gap between serial CPU and parallel GPU have to be before the project leaders decide they cannot afford NOT to invest in recoding for parallel processing? Because in 2 years' time, HPC GPUs will be around 35 times faster than CPUs. How much would it cost to rewrite the code, $100-150K maybe?? Isn't that worth paying for such a huge step up?

With that kind of performance increase, you can make the calculations more accurate. You no longer have to use approximations like LJ potentials; you can calculate the energy accurately and get a better answer more quickly than now. What's not to like?

It seems like so many projects, everyone is comfortable with what they are doing now. Revolution has been forsaken for evolution. Understandable, but the best way to do things?

Be bold and take the leap!

More computing performance is not a good answer if the limit comes from available memory rather than from computing speed. Rosetta@Home has already looked into GPU versions, and found that they would require about 6 GB of graphics memory per GPU to get the expected 10 times the performance of the CPU version. The GPU version would run each workunit at about the same speed as the CPU version, and would therefore need to run 10 workunits at the same time, using 10 times as much memory, to get 10 times as much performance.

Rather few of the high-end graphics boards have that much memory.
ID: 80721
Profile robertmiles

Joined: 16 Jun 08
Posts: 1224
Credit: 13,842,027
RAC: 1,584
Message 80722 - Posted: 9 Oct 2016, 5:20:25 UTC - in response to Message 80106.  
Last modified: 9 Oct 2016, 5:27:08 UTC

I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO.


If I remember correctly, the public test of rosy on GPU was with an old version of pycl

This is the post one developer wrote about this test. It's a pity that the PDFs are no longer available

I've used Fortran for several years, and have taken classes in C++ and CUDA since then. Is any help needed for translating any remaining Fortran code to C++?

I would not be able to travel for this.

I'm still looking for an online OpenCL class aimed at GPUs rather than FPGAs. A CUDA version would work on most Nvidia GPUs, but not on other brands. An OpenCL version should work on other brands of GPUs.

A GPU version REQUIRES that most of the application allow many threads to run in any order, or even at the same time, since they don't use anything produced by the other threads. If this is not satisfied, the GPU version may be as slow as a quarter of the speed of the CPU version.
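
A rough illustration of that requirement (plain C++, not Rosetta code; the loop bodies are invented for the example):

#include <cstdio>
#include <vector>

// Each iteration touches only its own element, so a GPU could run every
// iteration as a separate thread, in any order.
void independent(std::vector<double>& e) {
    for (std::size_t i = 0; i < e.size(); ++i)
        e[i] = e[i] * 0.5 + 1.0;               // depends only on e[i]
}

// Iteration i needs the result of iteration i-1, so the iterations cannot be
// handed to separate threads; this part stays serial.
void dependent(std::vector<double>& e) {
    for (std::size_t i = 1; i < e.size(); ++i)
        e[i] = e[i] + e[i - 1];
}

int main() {
    std::vector<double> a(8, 1.0), b(8, 1.0);
    independent(a);                             // GPU-friendly pattern
    dependent(b);                               // inherently serial pattern
    std::printf("%g %g\n", a.back(), b.back()); // prints 1.5 8
}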
ID: 80722
Profile robertmiles

Joined: 16 Jun 08
Posts: 1224
Credit: 13,842,027
RAC: 1,584
Message 80723 - Posted: 9 Oct 2016, 5:38:19 UTC - in response to Message 80712.  
Last modified: 9 Oct 2016, 5:39:35 UTC


There is really no way that scientists should have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.


I hope not.
If it is a "kludge", why are you here?


Because I want to help. That's why.

One of the problems is the heterogeneous architecture of rosetta@home: There are PCs, Macs and tablets/smartphones (seriously?). Why not an internet-connected dual-core toaster?

There are a lot of issues with these devices and this is IMHO a waste of developer resources.

A homogeneous architecture based on AVXx would alleviate all those problems while yielding a higher performance.

The distributed nature of Rosetta also introduces latencies: preparing work, zipping it, sending it, and collecting the results back over a WAN. Being forced to deal with ultra-lame ancient CPUs and the like is another problem.

So you want far fewer processors to be used? None of my computers use a CPU that even has AVXx, and not enough money is available to replace all the computers contributing through BOINC with equivalents that do. It would be possible, though, to produce separate builds of the application for computers with and without AVXx, and add a wrapper program that tests what the CPU supports, then starts the version of the application best suited to the current CPU.
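
A minimal sketch of what such a wrapper could look like on x86 with GCC or Clang; the binary names "rosetta_avx2" and "rosetta_sse2" are hypothetical stand-ins, not real Rosetta builds:

// Minimal launcher sketch (GCC/Clang on x86): start an AVX2 build if the CPU
// supports it, otherwise fall back to a plain baseline build.
#include <unistd.h>
#include <cstdio>

int main(int, char** argv) {
    __builtin_cpu_init();                      // populate the CPU feature info
    const char* app = __builtin_cpu_supports("avx2")
                          ? "./rosetta_avx2"   // hypothetical AVX2 build
                          : "./rosetta_sse2";  // hypothetical baseline build
    std::printf("launching %s\n", app);
    execv(app, argv);                          // replace the launcher with the chosen app
    std::perror("execv failed");               // only reached if execv fails
    return 1;
}

BOINC's plan-class mechanism can, if I understand it correctly, achieve much the same thing server-side by sending different application versions to hosts with different CPU features.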
ID: 80723
Profile Dr. Merkwürdigliebe

Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80726 - Posted: 9 Oct 2016, 10:40:30 UTC - in response to Message 80723.  


So you want far fewer processors to be used? None of my computers use a CPU that even has AVXx, and not enough money is available to replace all the computers contributing through BOINC with equivalents that do.


Your computer does support AVX.

AVX2, however, was introduced with the Haswell CPU generation. AVX-512 will be featured on Skylake-EP CPUs.


It would be possible, though, to produce separate builds of the application for computers with and without AVXx, and add a wrapper program that tests what the CPU supports, then starts the version of the application best suited to the current CPU.


Yes, I know. We've already had that discussion here on this board. We're just waiting for results.
ID: 80726
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80788 - Posted: 26 Oct 2016, 17:42:55 UTC
Last modified: 26 Oct 2016, 18:02:32 UTC

IMHO, GPUs may be thought of as collections of very simple ALUs with thousands of 'registers', on which those ALUs perform SIMD (single instruction, multiple data) execution.
Typical GPUs have hundreds to thousands of 'GPU' (e.g. CUDA) 'cores', and they benefit from a specific class of problem: the whole array or matrix is loaded into the GPU as 'registers', and SIMD instructions run the algorithm in a highly *vectorized* fashion. This means, among other things, that the problem needs to be *vectorizable*, *large*, and able to *run completely in the 'registers' without needing to access memory*. It is useless if we are trying to solve 2x2 matrices over and over again, where the next iteration depends on the previous one; the whole of the rest of the GPU then sits *unused* except for a few transistors.

In addition, adapting algorithms to GPUs is often a significantly *difficult* software task. It isn't as simple as 'compiling' a program to optimise it for the GPU. Quite often the algorithms at hand *cannot make use of the GPU's vectorized* infrastructure; this at times requires a *complete redoing* of the entire design, and even completely different algorithms and approaches.

While I'd not want to discourage users who have invested in GPUs, the above are real software challenges to really 'make it work'. As I personally don't use software that exploits these aspects of GPUs, I've refrained from getting one and basically make do with a fairly recent Intel i7 CPU.

I would think that similar challenges confront the Rosetta research team, and I tend to agree that functional needs are the higher priority versus trying to redo all the algorithms just to make them use GPUs, as the functional needs are themselves complex, and spending overwhelming effort on 'GPU' algorithms could compromise the original research objectives.
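
A back-of-envelope sketch of the "2x2 matrices over and over" point (plain C++; the kernel-launch latency and PCIe bandwidth below are round assumptions, not measurements):

#include <cstdio>

int main() {
    const double launch_s = 10e-6;   // assumed ~10 microseconds per kernel launch
    const double pcie_bw  = 12e9;    // assumed ~12 GB/s effective PCIe bandwidth

    // Tiny dependent problem: a 2x2 double-precision matrix is 32 bytes.
    double tiny_transfer = 2.0 * 32.0 / pcie_bw;   // copy the matrix in and out
    std::printf("2x2 step: ~%.1f us of overhead, almost all of it launch latency\n",
                (launch_s + tiny_transfer) * 1e6);

    // Large vectorizable problem: 100 million doubles (800 MB) kept on the device.
    double big_transfer = 800e6 / pcie_bw;         // one-time copy onto the GPU
    std::printf("800 MB array: ~%.2f s one-time copy, then many steps run on-device\n",
                big_transfer);
}

The tiny case pays the fixed overhead on every iteration; the large case pays it once and then keeps the data on the card, which is the shape of problem GPUs reward.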
ID: 80788
Darrell

Joined: 28 Sep 06
Posts: 25
Credit: 51,934,631
RAC: 0
Message 80793 - Posted: 27 Oct 2016, 1:26:17 UTC

As someone with 14 discrete GPU cards, I support those projects that have applications that run primarily in the GPUs (Einstein, SETI).

My five computers have fairly modern CPUs, so I also give their cycles to projects that DON'T have applications for GPUs (Rosetta, LHC).

This works for me. Keeps both GPUs and CPUs busy.


ID: 80793
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1860
Credit: 8,152,819
RAC: 8,197
Message 80795 - Posted: 27 Oct 2016, 10:32:50 UTC
Last modified: 27 Oct 2016, 10:33:21 UTC

Earlier in this thread, I posted the two PDFs about GPUs in the Rosetta@home project.
In those papers they said that they had created a GPU app (so it's possible) for specific simulations, but they were not satisfied with its performance. That was over 3 years ago.
Now, I don't know whether they have retried this app with recent, more powerful GPUs, recompiled it with newer compilers/libraries/etc., or abandoned it for good...
ID: 80795
AMDave

Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 80796 - Posted: 27 Oct 2016, 13:49:36 UTC

For curiosity's sake, what about incorporating open source, specifically this (second paragraph)?
ID: 80796
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1860
Credit: 8,152,819
RAC: 8,197
Message 80798 - Posted: 28 Oct 2016, 9:47:14 UTC - in response to Message 80796.  

For curiosity's sake, what about incorporating open source, specifically this (second paragraph)?


First, it's great that POEM's admins will release the code.
Second, I don't think that Rosetta can use this code. POEM, if I'm not wrong, runs homogeneous simulations, not heterogeneous ones like Rosetta (ab initio, docking, etc.).
ID: 80798
ToyMachine

Joined: 31 Oct 16
Posts: 1
Credit: 621,562
RAC: 0
Message 80821 - Posted: 2 Nov 2016, 2:19:01 UTC

Could this thread be made a "Sticky"? Right up front, separated from all the other newb questions. It might also be appropriate to add this topic to the FAQ section, and maybe a bit on the main page. "We don't utilize GPUs, and here's why." I think that would make it quicker and easier for new contributors to determine which project to add to which computer, or where to direct upgrade funds, even if they (I) are too lazy to dig into the forum. ;)
ID: 80821
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1860
Credit: 8,152,819
RAC: 8,197
Message 80868 - Posted: 24 Nov 2016, 13:13:37 UTC - in response to Message 80721.  

More computing performance is not a good answer if the limit comes from available memory rather than from computing speed. Rosetta@Home has already looked into GPU versions, and found that they would require about 6 GB of graphics memory per GPU to get the expected 10 times the performance of the CPU version.


Until yesterday, I thought that GPU memory was only a problem of "amount", not a problem of "kind"...
Matrix-vector case study
ID: 80868
mmonnin

Joined: 2 Jun 16
Posts: 54
Credit: 20,058,207
RAC: 4,375
Message 80872 - Posted: 25 Nov 2016, 18:34:05 UTC

That test basically hits the memory wall, where data can't be moved fast enough to fully utilize the processing cores. In this case it's the GPU, and HBM improves the bandwidth between the processor and memory. HMC is a similar tech for the CPU and main memory.
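
A back-of-envelope sketch of that memory wall for a double-precision matrix-vector multiply (plain C++; the bandwidth figures are round assumptions, not measured values):

#include <cstdio>

int main() {
    const double n = 10000.0;                 // hypothetical N x N matrix
    const double flops = 2.0 * n * n;         // one multiply + one add per matrix element
    const double bytes = 8.0 * n * n;         // each double-precision element is read once
    const double intensity = flops / bytes;   // ~0.25 FLOP per byte moved

    const double gddr5_bw = 300e9;            // assumed ~300 GB/s (GDDR5-class)
    const double hbm_bw   = 700e9;            // assumed ~700 GB/s (HBM-class)

    std::printf("arithmetic intensity: %.2f FLOP/byte\n", intensity);
    std::printf("bandwidth-limited rate, GDDR5-class: %.0f GFLOP/s\n",
                intensity * gddr5_bw / 1e9);  // ~75 GFLOP/s
    std::printf("bandwidth-limited rate, HBM-class:   %.0f GFLOP/s\n",
                intensity * hbm_bw / 1e9);    // ~175 GFLOP/s
}

Either way the result is far below the multi-TFLOP/s peak of the chip, which is why faster memory helps this kind of kernel more than extra cores do.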
ID: 80872
Greg Tippitt

Joined: 4 May 07
Posts: 5
Credit: 7,764,699
RAC: 10,365
Message 80901 - Posted: 14 Dec 2016, 9:00:26 UTC

There is a reason that PCs must have a CPU rather than simply a big video card that tries to run the entire operating system on a GPU. The reason is that GPUs are specialized processors for applications that can be designed for parallel computing. Huge speed improvements for some analyses when using GPUs do not mean that everything can run on a GPU more quickly.

One metaphor for understanding this is to think about a WalMart store. If they open all of the checkout lanes, then it makes it faster for you to checkout without having to wait in line. This is like a GPU using parallel computing.

Having lots of available checkout lanes will not make it faster for you to do your shopping if, for instance, you need milk, antifreeze, shampoo, a pair of sweatpants, and a bag of kitty litter. These items are normally in departments that are scattered all over the store, so it takes you lots of time to go to each. Having lots of empty checkout lanes doesn't help. If you've taken your family with you to Walmart, then you can send each person to get different items and rendezvous at the checkout. This might be thought of as analogous to having 4 CPU cores.

The repeated posts asking "Why don't they compile the code for GPUs so it will run faster?" are somewhat like asking "Why doesn't the highway department attach a snowblower to the front of a Dodge Challenger SRT Hellcat, so that they can clear all the streets really quickly instead of using those slow trucks that take forever to get about town?"

The easy solution is to run Rosetta on your CPU cores, and then run GPUGRID, or your other favorite BOINC apps, on your GPUs.
ID: 80901
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1860
Credit: 8,152,819
RAC: 8,197
Message 80903 - Posted: 14 Dec 2016, 17:00:27 UTC - in response to Message 80901.  

There is a reason that PCs must have a CPU rather than simply a big video card that tries to run the entire operating system on a GPU. The reason is that GPUs are specialized processors for applications that can be designed for parallel computing. Huge speed improvements for some analyses when using GPUs do not mean that everything can run on a GPU more quickly.


Completely agree with you. In fact, Top500 supercomputers use CPUs and GPUs TOGETHER.

The easy solution is to run Rosetta on your CPU cores, and then run GPUGRID, or your other favorite BOINC apps, on your GPUs.


A BETTER solution is to use our CPUs "deeply", for example with SSEx or AVX.
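
A minimal sketch of what that can look like with AVX intrinsics (plain C++, compile with -mavx; illustrative only, not Rosetta's actual inner loop):

// One AVX instruction processes 8 single-precision values at a time.
#include <immintrin.h>
#include <cstdio>

// c[i] = a[i] * b[i] for n floats, 8 at a time (n assumed to be a multiple of 8 here)
void mul_avx(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);              // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);              // load 8 floats from b
        _mm256_storeu_ps(c + i, _mm256_mul_ps(va, vb));  // multiply and store 8 results
    }
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    mul_avx(a, b, c, 8);
    std::printf("%g %g ... %g\n", c[0], c[1], c[7]);     // 8 14 ... 8
}

Compilers can often produce this kind of code automatically (auto-vectorization) when the loops are written cleanly, which is essentially what an SSE/AVX-targeted build would rely on.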
ID: 80903
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1860
Credit: 8,152,819
RAC: 8,197
Message 87895 - Posted: 13 Dec 2017, 13:14:59 UTC - in response to Message 80903.  

Some news on the OpenCL side (I posted these also on the Ralph@home forum):
New CodeXL 2.5.
ROCm is now at version 1.6.
Codeplay released ComputeCpp to develop SYCL apps in Visual Studio.
VC4CL brings OpenCL to the Raspberry Pi.
The Khronos Group released SYCL 1.2.1, which lets "code for heterogeneous processors to be written in a “single-source” style using completely standard modern C++" (and supports TensorFlow).
ID: 87895



