Discussion of the merits and challenges of using GPUs

Author	Message
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 99709 - Posted: 19 Nov 2020, 16:59:34 UTC - in response to Message 99205. Intel oneAPI has landed with full SYCL support. ROCm 4 is on the way (with support for Xilinx FPGAs and the consumer AMD 68xx GPU series). An interesting oneAPI example. AMD Instinct MI100 delivers up to 11.5 TFLOPS of double precision.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 99775 - Posted: 26 Nov 2020, 7:39:31 UTC Radeon RX 6800 Series Performance Comes Out Even Faster With Newest Linux Code

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 100486 - Posted: 24 Jan 2021, 18:20:44 UTC CUDA C++ library

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 100591 - Posted: 9 Feb 2021, 14:37:59 UTC SYCL 2020 final specification

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101110 - Posted: 7 Apr 2021, 5:32:33 UTC I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)... "On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core"

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 101136 - Posted: 7 Apr 2021, 20:19:52 UTC - in response to Message 101110. They might consider separating R@H workunits into two classes:
1. One starting point, many steps to improve on that starting point
2. A list of starting points, one step each to do something on them, but the same step for all of them
The first of these are generally not good candidates for GPU speedup, but the second are much more likely to be good candidates.
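
The two workunit classes above can be sketched in a few lines of Python. This is only an illustration: `refine_one`, `step_many` and the toy `halve` step are hypothetical names, not anything taken from Rosetta's code. The point it tries to show is that class 1 carries a dependency from one step to the next, while class 2 is an independent map over the starting points.

```python
from typing import Callable, List


def refine_one(start: float, step: Callable[[float], float], n_steps: int) -> float:
    """Class 1: one starting point, many steps. Each step needs the previous result."""
    x = start
    for _ in range(n_steps):
        x = step(x)  # loop-carried dependency: cannot run steps side by side
    return x


def step_many(starts: List[float], step: Callable[[float], float]) -> List[float]:
    """Class 2: many starting points, the same single step applied to each."""
    # No element depends on any other, so the work could be spread over many cores.
    return [step(x) for x in starts]


def halve(x: float) -> float:
    """Toy stand-in for a real improvement step (assumption for illustration)."""
    return x / 2.0


serial_result = refine_one(8.0, halve, 3)
batch_result = step_many([2.0, 4.0, 6.0], halve)
```

Only the second function has the "same operation on many sets of data" shape that maps naturally onto a GPU.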

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 101150 - Posted: 8 Apr 2021, 2:33:14 UTC - in response to Message 101136. Another option might be to consider only using GPUs with 8 GB or more of RAM, so that more of the task fits into onboard RAM, much like a CPU task does now. Of course, a GPU uses memory differently, so maybe the limit would be 12 GB or more, for example. The idea is that a GPU should not be discarded out of hand and many options should be considered.
Personally I DO like the idea of splitting a GPU task into smaller parts, maybe an A part and a B part of the same task that can each be crunched by a different PC, or even the same PC to keep things normalized, and then put together for one full task. I would even go with a multipart task beyond 2 parts if that worked and provided provable and reliable science.
To me the question isn't "why" but "why not", and why are we wasting the opportunity to advance this project into the petaflop range, as mentioned in this message:
"Good afternoon. I have seen that the project is increasing the computational power in recent times, being currently with a power of 835 teraflop. We are going to increase its power for petaflops.
Encourage friends, family and others to join the project. The more people who help, the faster the searches. Make the project reach more people, talk about it."
Written by Tiago Martins Barreiros. For info, the message number is: Message 100765 - Posted: 18 Mar 2021, 17:06:10 UTC

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101157 - Posted: 8 Apr 2021, 8:24:49 UTC - in response to Message 101150. Last modified: 8 Apr 2021, 8:26:16 UTC mikey wrote: "Another option might be to consider only using GPUs with 8 GB or more of RAM, so that more of the task fits into onboard RAM, much like a CPU task does now. Of course, a GPU uses memory differently, so maybe the limit would be 12 GB or more, for example. The idea is that a GPU should not be discarded out of hand and many options should be considered."
The last time they ran a "public/known" test on GPUs was years ago (if I'm not wrong, over 7 years ago), and they had problems with GPU RAM. But at that time GPUs had at most 4 GB of RAM on board (top-level GPUs, like the Radeon R9 290); the others had 1 or 2 GB. Now top-level GPUs have 12/16 GB (and a different, much faster kind of memory).
Another consideration is software for GPUs: languages (like CUDA, OpenCL, ROCm or oneAPI), frameworks and tools have changed a lot over these years.
So hardware and software problems are certainly present, but I think the first and most important problem is the will to do it... see, for example, the idea of having an optimized CPU app (SSEx, AVX, etc.).

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101160 - Posted: 8 Apr 2021, 8:49:13 UTC - in response to Message 101150. mikey wrote: "Personally I DO like the idea of splitting a GPU task into smaller parts, maybe an A part and a B part of the same task that can each be crunched by a different PC, or even the same PC to keep things normalized, and then put together for one full task. I would even go with a multipart task beyond 2 parts if that worked and provided provable and reliable science."
A CPU task and a GPU task should be the same, to keep things simple for the project: a work unit can be processed on a GPU using the GPU application, or on the CPU using the CPU application.
As for splitting up a task, that is how Seti was able to use GPUs to process the same data as a CPU. The GPU application broke the work unit into multiple blocks, processed each block as necessary, and the results of each block were then recombined to give the final result, producing the same result as if it had been processed on the CPU. However, instead of taking an hour or more, a high-end GPU of the time could do it in 25 seconds or so.
Grant Darwin NT
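
The split, process and recombine pattern described above can be sketched as follows. This is a minimal Python illustration, not Seti's actual code: `process_block`, `process_in_blocks` and the sum-of-squares workload are assumptions chosen only because they recombine exactly.

```python
from typing import List, Sequence


def process_block(block: Sequence[int]) -> int:
    """Hypothetical per-block computation: sum of squares of the samples."""
    return sum(v * v for v in block)


def process_whole(data: Sequence[int]) -> int:
    """Reference path: handle the whole work unit in one go, as a CPU app might."""
    return process_block(data)


def process_in_blocks(data: Sequence[int], block_size: int) -> int:
    """Split the work unit into blocks, process each one, then recombine."""
    partials: List[int] = [
        process_block(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    # Each partial is independent, so the blocks could run in parallel.
    return sum(partials)


work_unit = list(range(10))
whole = process_whole(work_unit)
blocked = process_in_blocks(work_unit, block_size=4)
```

With integer data both paths agree exactly. With floating-point data the recombined result can differ in the last bits because the summation order changes, which is one reason projects validate GPU results against CPU results with a tolerance.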

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101161 - Posted: 8 Apr 2021, 8:56:34 UTC - in response to Message 101160. Grant (SSSF) wrote: "As for splitting up a task, that is how Seti was able to use GPUs to process the same data as a CPU. The GPU application broke the work unit into multiple blocks, processed each block as necessary, and the results of each block were then recombined to give the final result, producing the same result as if it had been processed on the CPU. However, instead of taking an hour or more, a high-end GPU of the time could do it in 25 seconds or so."
The OpenPandemics team does this: the new app runs the same simulations, but a single CPU WU contains 1 to 5 simulations (and takes 2 hours on a modern CPU), while a GPU WU contains 30 to 70 simulations (and runs in 15 minutes on an old GPU, a GTX 750).

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 101171 - Posted: 8 Apr 2021, 19:56:53 UTC - in response to Message 101110. boboviz wrote: "I know, I know, we will never see a GPU app on R@H, but other projects (WCG OpenPandemics)... 'On average we anticipate a ~500x average speed-up in processing current packages on the mix of GPUs and CPUs from volunteers, which include from Raspberry PIs to laptops to high-end GPUs. On more powerful GPUs, we see up to 4000x speedups overall compared to a single CPU core'"
Looks like you're overestimating the speedup based on some incorrect assumptions about the GPUs.
1. GPUs have a different clock speed from the CPU cores: their clock speed is typically about a quarter of the CPU's clock speed, at least for Nvidia and AMD GPUs. Therefore, a GPU core can do about a quarter as much as a CPU core can do in the same amount of time. I haven't seen similar information about Intel GPUs.
2. CPU cores get their instructions independently; each CPU core has a register containing the memory address it gets its instruction from, and goes on to the next memory address unless the instruction makes it load a new address into this register. GPU cores (at least Nvidia and AMD) come in groups with an instruction unit that sends the same instruction to every member of the group, plus a mask to determine which cores in the group execute that instruction. The other cores in the group do nothing while this happens. If there is an if ... then ... else ..., the then branch and the else branch cannot execute simultaneously for cores within the same group. For Nvidia GPUs, each group is called a warp and has 32 threads within it. This means that work doing the same operations on multiple sets of data is more compatible with the hardware, regardless of which computer language is used.
As a result, the maximum possible speed of a GPU application divided by the CPU speed is about the number of GPU cores divided by 4.
Achieving this speed is very rare; more typical values are 10 to 20 times the speed of the CPU application. It's even possible for the GPU application speed to be only a quarter of the speed of the CPU application, but BOINC projects seldom release a GPU application that doesn't run at least 10 times as fast as the CPU application.
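
The two points in the post above can be put into a small Python sketch. It takes the post's own figures at face value (a GPU core at roughly a quarter of the CPU clock) and is a rule-of-thumb model, not a hardware specification. `rough_speedup_bound` and `issue_slots` are hypothetical helper names, and the group of 8 lanes is an arbitrary size chosen to keep the example small.

```python
from typing import Sequence


def rough_speedup_bound(gpu_cores: int, clock_ratio: float = 0.25) -> float:
    """Rule of thumb from the post: core count times the GPU/CPU clock ratio."""
    return gpu_cores * clock_ratio


def issue_slots(takes_then: Sequence[bool], then_cost: int, else_cost: int) -> int:
    """Instruction slots one lockstep group spends on an if/else.

    Lanes that did not take a branch are masked off while it runs, so the
    group pays for every branch that at least one of its lanes takes.
    """
    slots = 0
    if any(takes_then):
        slots += then_cost   # some lane needs the 'then' branch
    if not all(takes_then):
        slots += else_cost   # some lane needs the 'else' branch
    return slots


bound = rough_speedup_bound(2560)
uniform = issue_slots([True] * 8, then_cost=10, else_cost=6)
divergent = issue_slots([True] * 4 + [False] * 4, then_cost=10, else_cost=6)
```

For a hypothetical 2560-core card the rule of thumb gives a ceiling of 640x. The divergent group spends 16 slots where the uniform one spends 10, which is why branch-heavy code tends to land well below that ceiling.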

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 101175 - Posted: 9 Apr 2021, 0:48:23 UTC - in response to Message 101171. All that is true, and it means it's definitely worth a try on the newer GPUs with their faster memory and much larger amounts of it. The problem could be the lack of access to new ones and getting them into the hands of the people who can then try to make them work here at Rosetta. Since Rosetta has tried to make GPUs work in the past, tweaking the existing programming to accommodate the new GPUs shouldn't be that hard. Personally I think a call to Nvidia and/or AMD, and finding the right person in Marketing, should get one on its way, with the understanding that it goes back when they are done with it, and that if it works the company gets the necessary public accolades.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101176 - Posted: 9 Apr 2021, 6:16:48 UTC - in response to Message 101171. robertmiles wrote: "Looks like you're overestimating the speedup based on some incorrect assumptions about the GPUs."
I'm not overestimating. It's a message from the OpenPandemics admin. Maybe they know their code, benchmarks and results better than you.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101178 - Posted: 9 Apr 2021, 7:07:34 UTC - in response to Message 101171. Last modified: 9 Apr 2021, 7:10:16 UTC robertmiles wrote: "Looks like you're overestimating the speedup based on some incorrect assumptions about the GPUs."
From my previous post (it's DATA from the OpenPandemics admin): a CPU WU contains 1 to 5 simulations (and takes 2 hours on a modern CPU), while a GPU WU contains 30 to 70.
A volunteer with an RTX 2080 completes 2 GPU WUs in less than 2 minutes. So, assuming the maximum number of simulations in each WU, a CPU core does 5 "steps" in 2 hours, while a GPU does 140 "steps" in 2 minutes. Do the math.
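
Doing that math explicitly, as a short Python sketch. It uses only the figures quoted in this thread (5 simulations in 2 hours on one CPU core, 140 simulations in 2 minutes on an RTX 2080, 70 simulations in 15 minutes on a GTX 750) and assumes every simulation costs about the same, which the thread does not confirm. `sims_per_hour` is a hypothetical helper name.

```python
def sims_per_hour(simulations: int, minutes: float) -> float:
    """Throughput in simulations per hour for a given batch and wall time."""
    return simulations * 60.0 / minutes


cpu_core = sims_per_hour(5, 120)       # one CPU WU at the maximum of 5 simulations
gpu_rtx2080 = sims_per_hour(140, 2)    # two GPU WUs at 70 simulations each
gpu_gtx750 = sims_per_hour(70, 15)     # one GPU WU on the older card

speedup_rtx2080 = gpu_rtx2080 / cpu_core
speedup_gtx750 = gpu_gtx750 / cpu_core

print(f"RTX 2080 vs one CPU core: {speedup_rtx2080:.0f}x")
print(f"GTX 750 vs one CPU core:  {speedup_gtx750:.0f}x")
```

Under those assumptions the RTX 2080 comes out at about 1680x and the GTX 750 at about 112x per CPU core. Both figures compare a whole GPU against a single core, so a multi-core CPU narrows the gap accordingly.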

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 101600 - Posted: 29 Apr 2021, 15:20:46 UTC Last modified: 29 Apr 2021, 15:26:23 UTC World Community Grid now has a GPU application for their OpenPandemics project.
OpenPandemics - COVID-19 Now Running on Machines with Graphics Processing Units https://www.worldcommunitygrid.org/about_us/viewNewsArticle.do?articleId=693
If you already have BOINC installed, select the OpenPandemics project under World Community Grid and enable GPU use.
World Community Grid https://join.worldcommunitygrid.org?recruiterId=480838
They are currently running a GPU stress test, so expect internet use to be especially high. Expect a few CPU tasks at first, soon switching to GPU tasks only if you have enabled at least one non-WCG BOINC project offering only CPU tasks.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101606 - Posted: 29 Apr 2021, 19:15:09 UTC - in response to Message 101600. IWOCL 2021 conference about OpenCL/SYCL/oneAPI

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101614 - Posted: 30 Apr 2021, 8:16:28 UTC Interesting paper about GPUs and QM/MM simulations.
P.S. Amber and Rosetta are compatible through AMBRose.

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0	Message 101622 - Posted: 30 Apr 2021, 12:03:21 UTC - in response to Message 101600. Last modified: 30 Apr 2021, 12:03:39 UTC I had some come down. Killing my PCs. I'm only allowing them to run on PCs I'm not using, as it makes everything drop to an unbearable crawl.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4	Message 101625 - Posted: 30 Apr 2021, 13:32:45 UTC - in response to Message 101622. Sid Celery wrote: "I had some come down. Killing my PCs. I'm only allowing them to run on PCs I'm not using, as it makes everything drop to an unbearable crawl."
Uh, that's strange. How many concurrent GPU WUs are you crunching? What is your GPU? Entry level?

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 0	Message 101629 - Posted: 30 Apr 2021, 13:44:38 UTC - in response to Message 101625. boboviz wrote: "Uh, that's strange. How many concurrent GPU WUs are you crunching? What is your GPU? Entry level?"
I agree. I am running a laptop with an Nvidia 1660 Ti GPU and running the WCG GPU tasks, one at a time, and it's working just fine; I'm typing this on it. Now, I do leave 3 HT CPU cores free for internet browsing, typing in forums, etc.