new version of BOINC + CUDA support

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 63815 - Posted: 25 Oct 2009, 8:20:03 UTC

The main problem with the GPU (and this was stated by the Rosetta@home people) is that the Rosetta app would run too slowly on the GPU.
I was personally intrigued by this, so I looked into it. The only explanation I can see is that the GPU is really bad at Monte Carlo simulations, due to its limited branching ability.
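
To make the branching point concrete, here is a minimal CUDA sketch (illustrative only, not Rosetta code; the kernel and names are hypothetical). GPU threads execute in 32-wide warps in lockstep, so a data-dependent branch such as a Monte Carlo acceptance test forces the hardware to run both sides of the branch one after the other for a diverged warp:

#include <curand_kernel.h>

// Hypothetical Monte Carlo step: each thread takes a data-dependent
// branch, as a Metropolis-style move-acceptance test would. When some
// threads in a 32-wide warp take the "accept" path and others the
// "reject" path, the warp diverges and the hardware executes the two
// paths serially, cutting throughput for that section of the kernel.
__global__ void mc_step(const float* energy_delta, float* state, int n,
                        unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState rng;
    curand_init(seed, i, 0, &rng);

    // Data-dependent branch: which path is taken varies per thread.
    if (curand_uniform(&rng) < expf(-energy_delta[i])) {
        state[i] += 1.0f;   // "accept" path
    } else {
        state[i] -= 1.0f;   // "reject" path
    }
}

Divergence alone rarely makes a GPU useless for Monte Carlo, though; a short branch like this one serializes only briefly, which is why the memory argument in the next post carries more weight.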
ID: 63815
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63821 - Posted: 25 Oct 2009, 15:33:30 UTC

A Monte Carlo simulation doesn't necessarily require many branching operations, so I doubt it is fair to characterize that as a GPU limitation. The main hurdle is memory. A processor (CPU, GPU, or otherwise) can only run fast if it has all of the program and data items it needs available in its L2 (or equivalent) cache memory. When something required (such as the instruction you wish to branch to) is not there, the processor waits for a memory access to load the required memory page.

Some architectures attempt to minimize the likelihood of such "cache misses" by prefetching portions of the program that you MIGHT jump to, just in case you need to do so. That way the code will either already be in the cache, or arrive sooner than if you had requested it only once execution reached it. But such prefetching must kick something else out of the cache to make room for the anticipatory fetch.

You can see that any application that does a lot of branching will have difficulty running efficiently unless most of the active areas of the program can be retained in memory, specifically in the L2 cache. So a smaller program has a better chance of running well. Such dependencies are why the answer to performance questions is always "it depends".
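
A toy CUDA kernel (hypothetical, for illustration) makes the stall visible. Each load's address depends on the previous load's result, so the hardware cannot overlap or prefetch the accesses; every cache miss costs a full round trip to memory while the thread waits, exactly the "waits for a memory access" case described above:

// Dependent-load chain: no prefetching or overlap is possible because
// the next address is unknown until the current load completes. Each
// miss stalls the thread for the full memory latency.
__global__ void pointer_chase(const int* next, int start, int steps, int* out)
{
    int idx = start;
    for (int s = 0; s < steps; ++s) {
        idx = next[idx];   // address depends on the previous load
    }
    *out = idx;            // store the result so the loop is not optimized away
}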

The new GPU chips coming out have L2 cache memory shared between the individual processor cores. So if the cores are all working on the same problem and data, they can share it more readily and get the most out of each fetch by making the page available to all of the cores. However, if memory is still overcommitted, you still end up kicking something out of the cache that another core is going to need to continue processing, and that core will then go idle waiting to fetch it back again. See the references earlier in this thread.
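
CUDA's per-block shared memory gives a small-scale picture of the same "one fetch, many consumers" idea (a sketch, assuming 256-thread blocks; software-managed shared memory is not the same thing as a hardware L2, but the amortization effect is analogous). Each thread performs one slow fetch from device memory into an on-chip tile, and after a barrier every thread in the block reuses the whole tile:

// One global-memory fetch per thread fills a shared tile; afterwards all
// 256 threads in the block read every element from fast on-chip memory,
// amortizing each slow fetch across the whole block. Launch with
// 256-thread blocks.
__global__ void reuse_tile(const float* data, float* out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? data[i] : 0.0f;  // one fetch each
    __syncthreads();                               // wait for the full tile

    float sum = 0.0f;
    for (int j = 0; j < 256; ++j) sum += tile[j];  // reuse everyone's fetch
    if (i < n) out[i] = sum;
}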
Rosetta Moderator: Mod.Sense
ID: 63821
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,217,610
RAC: 1,154
Message 63838 - Posted: 26 Oct 2009, 9:03:30 UTC - in response to Message 63821.  

A Monte Carlo simulation doesn't necessarily require many branching operations, so I doubt it is fair to characterize that as a GPU limitation. The main hurdle is memory. A processor (CPU, GPU, or otherwise) can only run fast if it has all of the program and data items it needs available in its L2 (or equivalent) cache memory. When something required (such as the instruction you wish to branch to) is not there, the processor waits for a memory access to load the required memory page.

Some architectures attempt to minimize the likelihood of such "cache misses" by prefetching portions of the program that you MIGHT jump to, just in case you need to do so. That way the code will either already be in the cache, or arrive sooner than if you had requested it only once execution reached it. But such prefetching must kick something else out of the cache to make room for the anticipatory fetch.

You can see that any application that does a lot of branching will have difficulty running efficiently unless most of the active areas of the program can be retained in memory, specifically in the L2 cache. So a smaller program has a better chance of running well. Such dependencies are why the answer to performance questions is always "it depends".

The new GPU chips coming out have L2 cache memory shared between the individual processor cores. So if the cores are all working on the same problem and data, they can share it more readily and get the most out of each fetch by making the page available to all of the cores. However, if memory is still overcommitted, you still end up kicking something out of the cache that another core is going to need to continue processing, and that core will then go idle waiting to fetch it back again. See the references earlier in this thread.


So are we coming down to the point that Rosetta is simply too big to fit in the current crop of GPU cards? This point you are making is also why HT computers can actually be slower than non-HT computers: the sharing of the memory.
ID: 63838
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63841 - Posted: 26 Oct 2009, 14:18:36 UTC

This point you are making is also why HT computers can actually be slower than non-HT computers: the sharing of the memory.


Yes. HT machines actually share other resources as well. Their design assumes that the shared resources are not the intensely utilized ones. So, "it depends". If your intended use would further over-utilize a shared resource, the new architecture may increase your overhead without offering any improvement in overall performance.

It's like having a car that can go from zero to 5x the highway speed limit in 4 seconds... and then measuring its time driving along a city street with a stop sign every block. It doesn't improve overall speed much over riding a bicycle... for that application. And you never get to run the thing at full speed. The car was designed for racing (i.e. for the specific conditions where it works well).

...in fact, it's like having 512 race cars and comparing the time to get them ALL down this city street with the time for me to ride my bike. Are you visualizing how 512 cars start to get in each other's way? And how measuring from the time the first car starts to the time the last car completes means you need a critical mass of work to gain any benefit over a single-"core" bicycle?
Rosetta Moderator: Mod.Sense
ID: 63841
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 63842 - Posted: 26 Oct 2009, 14:33:28 UTC - in response to Message 63841.  

This point you are making is also why HT computers can actually be slower than non-HT computers: the sharing of the memory.


Yes. HT machines actually share other resources as well. Their design assumes that the shared resources are not the intensely utilized ones. So, "it depends". If your intended use would further over-utilize a shared resource, the new architecture may increase your overhead without offering any improvement in overall performance.

It's like having a car that can go from zero to 5x the highway speed limit in 4 seconds... and then measuring its time driving along a city street with a stop sign every block. It doesn't improve overall speed much over riding a bicycle... for that application. And you never get to run the thing at full speed. The car was designed for racing (i.e. for the specific conditions where it works well).

...in fact, it's like having 512 race cars and comparing the time to get them ALL down this city street with the time for me to ride my bike. Are you visualizing how 512 cars start to get in each other's way? And how measuring from the time the first car starts to the time the last car completes means you need a critical mass of work to gain any benefit over a single-"core" bicycle?


I always found HT too... well, "good" to be true. I mean, you can't execute two threads without having two separate CPU cores dedicated to them. Period.
HT might come in handy when executing a single WU, since the threads sort of share the same memory base.

That's how I see it, at least.
ID: 63842