Discussion of the merits and challenges of using GPUs

Message boards : Number crunching : Discussion of the merits and challenges of using GPUs

Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 99709 - Posted: 19 Nov 2020, 16:59:34 UTC - in response to Message 99205.  

Intel oneAPI has landed, with full SYCL support.
ROCm 4 is on the way (with support for Xilinx FPGAs and for the consumer AMD GPU 68xx series).
An interesting oneAPI example.
The AMD Instinct MI100 delivers up to 11.5 TFLOPS of double precision.
ID: 99709
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 99775 - Posted: 26 Nov 2020, 7:39:31 UTC

ID: 99775
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 100486 - Posted: 24 Jan 2021, 18:20:44 UTC

ID: 100486
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 100591 - Posted: 9 Feb 2021, 14:37:59 UTC

ID: 100591
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101110 - Posted: 7 Apr 2021, 5:32:33 UTC

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

ID: 101110
Profile robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,807
Message 101136 - Posted: 7 Apr 2021, 20:19:52 UTC - in response to Message 101110.  

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

They might consider separating R@H workunits into two classes:

1. One starting point, with many steps to improve on that starting point

2. A list of starting points, with one step to do something to each of them, but the same step for all of them

The first of these is generally not a good candidate for GPU speedup, but the second is much more likely to be.
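The distinction can be sketched in a few lines (hypothetical Python, not actual R@H code): class 1 is a serial chain where each step needs the previous result, while class 2 is an independent map that could be farmed out across GPU cores.

```python
# Hypothetical illustration of the two workunit classes (not R@H code).

def improve(conformation, step):
    """Toy 'refinement' step; stands in for a real Monte Carlo move."""
    return conformation + 1.0 / (step + 1)

# Class 1: one starting point, many dependent steps.
# Each iteration needs the previous result, so the chain is inherently serial.
def refine(start, n_steps):
    c = start
    for step in range(n_steps):
        c = improve(c, step)
    return c

# Class 2: many starting points, the same single step applied to each.
# The calls are independent, so they map naturally onto GPU cores.
def batch_step(starts):
    return [improve(s, 0) for s in starts]  # each call could run in parallel

serial_result = refine(0.0, 4)               # serial: 4 dependent steps
batch_results = batch_step([0.0, 1.0, 2.0])  # independent: parallel-friendly
```

The serial chain cannot be spread across cores, because step N reads the output of step N-1; the batch has no such dependency.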
ID: 101136
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 101150 - Posted: 8 Apr 2021, 2:33:14 UTC - in response to Message 101136.  

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

They might consider separating R@H workunits into two classes:

1. One starting point, with many steps to improve on that starting point

2. A list of starting points, with one step to do something to each of them, but the same step for all of them

The first of these is generally not a good candidate for GPU speedup, but the second is much more likely to be.


Another option might be to consider only using GPUs with 8 GB or more of RAM, so that more of the task will fit into onboard RAM, much like a CPU task does now. Of course a GPU uses memory differently, so maybe the limit would be 12 GB or more, for example. The idea is that GPUs should not be discarded out of hand, and many options should be considered.

Personally I DO like the idea of splitting a GPU task into smaller parts, maybe an A part and a B part of the same task that can each be crunched by a different PC, or even by the same PC to keep things normalized, and then put back together into one full task. I would even go with a multipart task beyond 2 parts if that worked and produced provable and reliable science.

To me the problem isn't "why", the question is "why not", and WHY are we wasting the opportunity to advance this project into the petaflop range, as mentioned in this message: "Good afternoon. I have seen that the project has been increasing its computational power recently, currently standing at 835 teraflops. Let's increase its power to petaflops. Encourage friends, family and others to join the project. The more people who help, the faster the searches. Make the project reach more people, talk about it."
written by Tiago Martins Barreiros. For info, the message number is: Message 100765 - Posted: 18 Mar 2021, 17:06:10 UTC
ID: 101150
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101157 - Posted: 8 Apr 2021, 8:24:49 UTC - in response to Message 101150.  
Last modified: 8 Apr 2021, 8:26:16 UTC

Another option might be to consider only using GPUs with 8 GB or more of RAM, so that more of the task will fit into onboard RAM, much like a CPU task does now. Of course a GPU uses memory differently, so maybe the limit would be 12 GB or more, for example. The idea is that GPUs should not be discarded out of hand, and many options should be considered.


The last time they ran a "public/known" test on GPUs was years ago (if I'm not wrong, over 7 years ago), and they had problems with GPU RAM.
But at that time the top-level GPUs (like the Radeon R9 290) had at most 4 GB of RAM on board; the others had 1 or 2 GB.
Now top-level GPUs have 12/16 GB (and a different, much faster kind of memory).
Other considerations regard software for GPUs: languages (like CUDA, OpenCL, ROCm or oneAPI), frameworks and tools have changed A LOT over these years.
So hardware and software problems are certainly present, but I think the first and most important problem is the will to do it; see, for example, the idea of having CPU apps optimized (SSEx, AVX, etc.).
ID: 101157
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1681
Credit: 17,854,150
RAC: 22,647
Message 101160 - Posted: 8 Apr 2021, 8:49:13 UTC - in response to Message 101150.  

Personally I DO like the idea of splitting a GPU task into smaller parts, maybe an A part and a B part of the same task that can each be crunched by a different PC, or even by the same PC to keep things normalized, and then put back together into one full task. I would even go with a multipart task beyond 2 parts if that worked and produced provable and reliable science.
A CPU Task and a GPU Task should be the same, to keep things simple for the project: a Work Unit can be processed on a GPU using the GPU application, or it can be processed by the CPU using the CPU application.

As for splitting up a Task: that is how Seti was able to use GPUs to process the same data as a CPU.
The GPU application broke the Work Unit into multiple blocks, processed each block as necessary, and the results of each block were then re-combined to give the final result, producing the same result as if it had been processed on the CPU.
However, instead of taking an hour or more, a high-end GPU of the time could do it in 25 seconds or so.
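That split/process/recombine pattern can be sketched like this (a hypothetical Python stand-in, not the actual Seti code):

```python
# Hypothetical sketch of split/process/recombine (not actual Seti code).

def process(samples):
    """Toy per-sample computation; stands in for real signal analysis."""
    return [x * x for x in samples]

def process_in_blocks(samples, block_size):
    # Break the work unit into independent blocks...
    blocks = [samples[i:i + block_size]
              for i in range(0, len(samples), block_size)]
    # ...process each block (on a GPU these would run concurrently)...
    partials = [process(b) for b in blocks]
    # ...and recombine the partial results into the final answer.
    return [y for part in partials for y in part]

data = list(range(10))
blocked = process_in_blocks(data, 4)
```

On a real GPU the per-block calls would run concurrently; the point is that recombining the partial results reproduces the serial answer exactly.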
Grant
Darwin NT
ID: 101160
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101161 - Posted: 8 Apr 2021, 8:56:34 UTC - in response to Message 101160.  

As for splitting up a Task: that is how Seti was able to use GPUs to process the same data as a CPU.
The GPU application broke the Work Unit into multiple blocks, processed each block as necessary, and the results of each block were then re-combined to give the final result, producing the same result as if it had been processed on the CPU.
However, instead of taking an hour or more, a high-end GPU of the time could do it in 25 seconds or so.


The OpenPandemics team does this: the new app runs the same simulations, but a single CPU WU has from 1 to 5 simulations inside (and takes 2 hours on a modern CPU), while the GPU app has from 30 to 70 simulations (and runs in 15 minutes on an old GPU, a GTX 750).
ID: 101161
Profile robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,807
Message 101171 - Posted: 8 Apr 2021, 19:56:53 UTC - in response to Message 101110.  

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

Looks like you're overestimating the speedup, based on some incorrect assumptions about the GPUs.

1. GPUs have a different clock speed from the CPU cores: their clock speed is typically about a quarter of the CPU's, at least for Nvidia and AMD GPUs. Therefore, a GPU core can do about a quarter as much as a CPU core in the same amount of time. I haven't seen similar information about Intel GPUs.

2. CPU cores fetch their instructions independently; each CPU core has a register containing the memory address it gets its instruction from, and goes on to the next memory address unless the instruction makes it load a new address into this register. GPU cores (at least Nvidia and AMD) come in groups with an instruction unit that sends the same instruction to every member of the group, plus a mask that determines which cores in the group execute that instruction. The other cores in the group do nothing while this happens. If there is an if ... then ... else ..., the then branch and the else branch cannot execute simultaneously for cores within the same group. For Nvidia GPUs, each group is called a warp and has 32 GPU cores within it. This means that work doing the same operations on multiple sets of data is more compatible with the hardware, regardless of which computer language is used.

As a result, the maximum possible speed of a GPU application, divided by the CPU speed, is about the number of GPU cores divided by 4. Achieving this speed is very rare; more typical values are 10 to 20 times the speed of the CPU application.

It's even possible for the GPU application to run at only a quarter of the speed of the CPU application, but BOINC projects seldom release a GPU application that doesn't run at least 10 times as fast as the CPU application.
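Point 2 (masked execution within a group) can be modelled on the CPU. This is a hypothetical Python sketch, not real GPU code: the "warp" issues one instruction stream, lanes whose mask bit is off simply idle, so a divergent if/else costs two passes instead of one.

```python
# Toy SIMT model: one instruction stream, a mask choosing which lanes act.
# Hypothetical illustration; real warps are hardware, not Python loops.

def warp_execute(values, predicate, then_op, else_op):
    mask = [predicate(v) for v in values]
    results = list(values)
    issued = 0  # count instruction issues to show the divergence cost

    # Pass 1: lanes where the mask is True run the 'then' branch; others idle.
    for i, v in enumerate(values):
        if mask[i]:
            results[i] = then_op(v)
    issued += 1

    # Pass 2: the remaining lanes run the 'else' branch; the first group idles.
    for i, v in enumerate(values):
        if not mask[i]:
            results[i] = else_op(v)
    issued += 1

    return results, issued  # 2 issues: the two branches were serialized

res, issues = warp_execute([1, 2, 3, 4], lambda v: v % 2 == 0,
                           lambda v: v * 10, lambda v: -v)
```

With no divergence (all lanes taking the same branch) one pass would be wasted but no lane would wait on useful work; with divergence, every lane pays for both branches.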
ID: 101171
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 101175 - Posted: 9 Apr 2021, 0:48:23 UTC - in response to Message 101171.  

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

Looks like you're overestimating the speedup, based on some incorrect assumptions about the GPUs.

1. GPUs have a different clock speed from the CPU cores: their clock speed is typically about a quarter of the CPU's, at least for Nvidia and AMD GPUs. Therefore, a GPU core can do about a quarter as much as a CPU core in the same amount of time. I haven't seen similar information about Intel GPUs.

2. CPU cores fetch their instructions independently; each CPU core has a register containing the memory address it gets its instruction from, and goes on to the next memory address unless the instruction makes it load a new address into this register. GPU cores (at least Nvidia and AMD) come in groups with an instruction unit that sends the same instruction to every member of the group, plus a mask that determines which cores in the group execute that instruction. The other cores in the group do nothing while this happens. If there is an if ... then ... else ..., the then branch and the else branch cannot execute simultaneously for cores within the same group. For Nvidia GPUs, each group is called a warp and has 32 GPU cores within it. This means that work doing the same operations on multiple sets of data is more compatible with the hardware, regardless of which computer language is used.

As a result, the maximum possible speed of a GPU application, divided by the CPU speed, is about the number of GPU cores divided by 4. Achieving this speed is very rare; more typical values are 10 to 20 times the speed of the CPU application.

It's even possible for the GPU application to run at only a quarter of the speed of the CPU application, but BOINC projects seldom release a GPU application that doesn't run at least 10 times as fast as the CPU application.


All that is true, and it means it's definitely worth a try on the newer GPUs, with their faster memory and much larger amounts of it. The problem could be the lack of access to new ones, and getting them into the hands of the people who can then try to make them work here at Rosetta. Since Rosetta has tried to make GPUs work in the past, tweaking the existing programming to accommodate the new GPUs shouldn't be that hard. Personally I think a call to Nvidia and/or AMD, and finding the right person in Marketing, should get one on its way, with the understanding that it goes back when they are done with it, and that if it works the company gets the necessary public accolades.
ID: 101175
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101176 - Posted: 9 Apr 2021, 6:16:48 UTC - in response to Message 101171.  

I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)...
On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core

Looks like you're overestimating the speedup based on some incorrect assumptions about the GPUs.

I'm not overestimating.
It's a message from the OpenPandemics admins.
Maybe they know their code, benchmarks and results better than you.
ID: 101176
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101178 - Posted: 9 Apr 2021, 7:07:34 UTC - in response to Message 101171.  
Last modified: 9 Apr 2021, 7:10:16 UTC

Looks like you're overestimating the speedup based on some incorrect assumptions about the GPUs.


From my previous post (it's DATA from the OpenPandemics admins):
1 to 5 simulations inside (and takes 2 hours on a modern CPU), while the GPU app has from 30 to 70


A volunteer with an RTX 2080 completes 2 GPU WUs in less than 2 minutes.
So, assuming the maximum number of simulations inside, a CPU core does 5 "steps" in 2 hours, while a GPU does 140 "steps" in 2 minutes.
Do the math.
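Taking those reported figures at face value (they come from the thread, not from any independent benchmark), the implied per-core speedup works out as:

```python
# Figures as reported in the thread (taken at face value, not verified).
cpu_steps, cpu_minutes = 5, 120    # 5 simulations in 2 hours on one CPU core
gpu_steps, gpu_minutes = 140, 2    # 2 WUs x 70 simulations in under 2 minutes

cpu_rate = cpu_steps / cpu_minutes   # steps per minute on a CPU core
gpu_rate = gpu_steps / gpu_minutes   # steps per minute on an RTX 2080

speedup = gpu_rate / cpu_rate        # about 1680x per CPU core
```

If those numbers are representative, the ratio is roughly 1680x per CPU core, far above the 10-20x typical estimate quoted earlier in the thread.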
ID: 101178
Profile robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,807
Message 101600 - Posted: 29 Apr 2021, 15:20:46 UTC
Last modified: 29 Apr 2021, 15:26:23 UTC

World Community Grid now has a GPU application for their OpenPandemics application.

OpenPandemics - COVID-19 Now Running on Machines with Graphics Processing Units

https://www.worldcommunitygrid.org/about_us/viewNewsArticle.do?articleId=693


If you already have BOINC installed, select the OpenPandemics project under World Community Grid and enable GPU use.

World Community Grid

https://join.worldcommunitygrid.org?recruiterId=480838


They are currently running a GPU stress test, so expect internet use to be especially high.

Expect a few CPU tasks at first, soon switching to GPU tasks only if you have enabled at least one non-WCG BOINC project offering only CPU tasks.
ID: 101600
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101606 - Posted: 29 Apr 2021, 19:15:09 UTC - in response to Message 101600.  

The IWOCL 2021 conference about OpenCL/SYCL/oneAPI.
ID: 101606
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101614 - Posted: 30 Apr 2021, 8:16:28 UTC

An interesting paper about GPUs and QM/MM simulations.

P.S. Amber and Rosetta are compatible through AMBRose.
ID: 101614
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,228,659
RAC: 10,982
Message 101622 - Posted: 30 Apr 2021, 12:03:21 UTC - in response to Message 101600.  
Last modified: 30 Apr 2021, 12:03:39 UTC

World Community Grid now has a GPU application for their OpenPandemics application.

OpenPandemics - COVID-19 Now Running on Machines with Graphics Processing Units

https://www.worldcommunitygrid.org/about_us/viewNewsArticle.do?articleId=693


If you already have BOINC installed, select the OpenPandemics project under World Community Grid and enable GPU use.

World Community Grid

https://join.worldcommunitygrid.org?recruiterId=480838


They are currently running a GPU stress test, so expect internet use to be especially high.

Expect a few CPU tasks at first, soon switching to GPU tasks only if you have enabled at least one non-WCG BOINC project offering only CPU tasks.

I had some come down. They're killing my PCs.
I'm only allowing them to run on PCs I'm not using, as they make everything drop to an unbearable crawl.
ID: 101622
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1994
Credit: 9,623,704
RAC: 9,591
Message 101625 - Posted: 30 Apr 2021, 13:32:45 UTC - in response to Message 101622.  

I had some come down. They're killing my PCs.
I'm only allowing them to run on PCs I'm not using, as they make everything drop to an unbearable crawl.

Uh, that's strange.
How many concurrent GPU WUs are you crunching? What is your GPU? Entry level??
ID: 101625
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,169,305
RAC: 3,857
Message 101629 - Posted: 30 Apr 2021, 13:44:38 UTC - in response to Message 101625.  

I had some come down. They're killing my PCs.
I'm only allowing them to run on PCs I'm not using, as they make everything drop to an unbearable crawl.


Uh, that's strange.
How many concurrent GPU WUs are you crunching? What is your GPU? Entry level??


I agree. I am running a laptop with an Nvidia 1660 Ti GPU, running the WCG GPU tasks one at a time, and it's working just fine; I'm typing this on it. Now, I do leave 3 HT CPU cores free for internet browsing, typing in forums, etc.
ID: 101629



©2024 University of Washington
https://www.bakerlab.org