new version of BOINC + CUDA support

bruce boytler
Joined: 17 Sep 05
Posts: 68
Credit: 3,565,442
RAC: 0
Message 63815 - Posted: 25 Oct 2009, 8:20:03 UTC

The main problem with GPUs (and this was stated by the Rosetta@home people) is that the Rosetta app would run too slowly on a GPU.
I was personally intrigued by this, so I looked into it. The only thing I can see is that GPUs are really bad at Monte Carlo simulations. This is due to their lack of branching ability.
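
A rough way to picture the branching concern: on NVIDIA-style hardware, threads run in groups (warps) that share one instruction stream, so when threads in a group disagree at an if/else, the two sides run one after the other. The Python sketch below is a toy model of that effect, not a measurement of Rosetta or of any real GPU. The warp size of 32, the function names, and the accept probabilities are all assumptions made for illustration.

```python
import random

WARP_SIZE = 32  # assumed: lanes that share one instruction stream


def warp_passes(lane_takes_branch):
    """Count the passes one warp needs to get through a single if/else.

    If every lane agrees, only one side of the branch runs (1 pass).
    If lanes disagree, both sides run back to back (2 passes), with the
    lanes on the inactive side sitting idle during each pass.
    """
    return len(set(lane_takes_branch))


def average_passes(accept_probability, trials=10_000, seed=0):
    """Average passes per warp for a Monte Carlo style accept/reject test."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        lanes = [rng.random() < accept_probability for _ in range(WARP_SIZE)]
        total += warp_passes(lanes)
    return total / trials


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.5, 1.0):
        print(f"accept probability {p:4.2f}: "
              f"{average_passes(p):.2f} passes per branch")
```

In this model a branch that all lanes agree on costs one pass, and a 50/50 accept/reject test costs essentially two. Whether that matters for a given Monte Carlo code depends on how much work sits inside the branches.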
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63821 - Posted: 25 Oct 2009, 15:33:30 UTC

A Monte Carlo simulation doesn't necessarily require many branching operations, so I doubt it is fair to characterize that as a GPU limitation. The main hurdle is memory. A processor (any CPU or GPU) can only run fast if it has all of the program and data items it needs available in its L2 (or equivalent) cache memory. When something required (such as the instruction you wish to branch to) is not there, the processor waits for a memory access to load the required memory page.

Some architectures attempt to minimize the likelihood of such "cache misses" by prefetching portions of the program that you MIGHT jump to, just in case you need to do so. That way it will either already be in memory, or arrive sooner than if you had requested it only once you got there in execution. But such prefetching must kick something else out of the cache to make room for the anticipatory fetch.

You can see that any application that does a lot of branching would have difficulty running efficiently unless most of the active areas of the program can be retained in memory, specifically in the L2 cache memory. So a smaller program has a better chance of running well. Such dependencies are why the answer to performance questions is always "it depends".
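
To make the "smaller program has a better chance" point concrete, here is a minimal Python sketch of a least-recently-used (LRU) cache being walked by a loop. It is a toy model under stated assumptions: real caches are not pure LRU, and the sizes (64 cache lines, working sets of 48 and 65 lines) and the name `miss_rate` are made up for illustration.

```python
from collections import OrderedDict


def miss_rate(cache_lines, working_set, passes=20):
    """Loop over `working_set` distinct lines `passes` times through an
    LRU cache that holds `cache_lines` lines; return the miss fraction."""
    cache = OrderedDict()
    misses = accesses = 0
    for _ in range(passes):
        for line in range(working_set):
            accesses += 1
            if line in cache:
                cache.move_to_end(line)  # mark as most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return misses / accesses


if __name__ == "__main__":
    print("fits in cache:    ", miss_rate(cache_lines=64, working_set=48))
    print("one line too big: ", miss_rate(cache_lines=64, working_set=65))
```

In this model, a loop that fits in the cache only misses on its first pass, while a loop that is just one line larger than the cache misses on every single access, because each line is evicted shortly before it is needed again.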

The new GPU chips coming out have L2 cache memory shared between the individual processor cores. So if the cores are all working on the same problem and the same data, they can share it more readily and get the most out of each fetch by making the page available to all of the cores. However, if memory is still overcommitted, you still end up kicking something out of memory that another core is going to need to continue processing, and that core will then go idle waiting to fetch it back again. See references in this thread.
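
The shared-cache trade-off can be sketched with a similar toy model: several cores take turns issuing accesses through one shared LRU cache, either all reading the same data or each reading its own private copy. This is an assumption-laden illustration, not a model of any specific GPU; the round-robin scheduling, the sizes (4 cores, 48 lines each, 64 cache lines), and the function name are hypothetical.

```python
from collections import OrderedDict


def shared_cache_miss_rate(cores, lines_per_core, cache_lines,
                           same_data, passes=20):
    """Miss fraction when `cores` cores share one LRU cache.

    same_data=True:  every core reads the same `lines_per_core` lines.
    same_data=False: each core reads its own private set of lines.
    """
    cache = OrderedDict()
    misses = accesses = 0
    for _ in range(passes):
        for offset in range(lines_per_core):
            for core in range(cores):  # cores take turns, round-robin
                # Private data is tagged with the core id so that it
                # occupies separate cache lines.
                line = offset if same_data else (core, offset)
                accesses += 1
                if line in cache:
                    cache.move_to_end(line)  # mark as most recently used
                else:
                    misses += 1
                    cache[line] = True
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)  # evict the oldest line
    return misses / accesses


if __name__ == "__main__":
    for same in (True, False):
        rate = shared_cache_miss_rate(4, 48, 64, same_data=same)
        print(f"same_data={same}: miss rate {rate}")
```

With shared data, one core's fetch serves the other three and misses are rare. With private data, the combined working set (4 x 48 lines) overcommits the 64-line cache and, in this model, every access evicts a line that another core will need.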
Rosetta Moderator: Mod.Sense
mikey
Joined: 5 Jan 06
Posts: 1898
Credit: 12,724,450
RAC: 581
Message 63838 - Posted: 26 Oct 2009, 9:03:30 UTC - in response to Message 63821.  

So are we coming down to the point that Rosetta is too big to fit in the current bunch of GPU cards? This point you are making is why HT computers can actually be slower than a non HT computer, the sharing of the memory.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63841 - Posted: 26 Oct 2009, 14:18:36 UTC

This point you are making is why HT computers can actually be slower than a non HT computer, the sharing of the memory.


Yes. HT machines actually share other resources as well. Their design is for the case where the shared resources are not the ones being intensely utilized. So, "it depends". If your desired use would further overutilize a shared resource, the new architecture may increase your overhead without offering any improvement in overall performance.

It's like having a car that can go from zero to 5x the highway speed limit in 4 seconds... and then measuring the time driving along a city street with a stop sign every block. It doesn't improve overall speed much over riding a bicycle... for that application. And you never get to run the thing at full speed. The car was designed for racing (i.e. for specific conditions where it works well).

...in fact it's like having 512 race cars, and comparing the time to get them ALL down this city street with the time for me to ride my bike. ...are you visualizing how 512 cars start to get in each other's way? And how measuring from the time the first car starts to the time the last car finishes creates a critical mass required to gain any benefit over a single "core" bicycle?
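
The "critical mass" idea can be put into rough numbers. The Python sketch below assumes a simple model: one CPU core (the bicycle) finishes a job quickly, while each of many GPU lanes (the cars) is slower per job and the whole batch pays a fixed startup cost. All of the numbers and function names are hypothetical, and the model ignores the cars getting in each other's way, which would push the break-even point higher.

```python
import math


def cpu_time(jobs, seconds_per_job):
    """One core works through the jobs one after another."""
    return jobs * seconds_per_job


def gpu_time(jobs, lanes, seconds_per_job_per_lane, fixed_overhead):
    """Jobs run in waves of `lanes` at a time, after a fixed startup cost."""
    waves = math.ceil(jobs / lanes)
    return fixed_overhead + waves * seconds_per_job_per_lane


def break_even_jobs(lanes, cpu_s, lane_s, overhead, limit=100_000):
    """Smallest batch size at which the GPU model beats the CPU model,
    or None if that never happens up to `limit` jobs."""
    for jobs in range(1, limit + 1):
        if gpu_time(jobs, lanes, lane_s, overhead) < cpu_time(jobs, cpu_s):
            return jobs
    return None


if __name__ == "__main__":
    # Assumed numbers: 512 lanes, each 20x slower per job than the CPU
    # core, plus 5 seconds of fixed overhead per batch.
    print(break_even_jobs(lanes=512, cpu_s=1.0, lane_s=20.0, overhead=5.0))
```

With these made-up numbers a single job takes 25 seconds on the GPU model versus 1 second on the CPU model, and the GPU model only pulls ahead once there are 26 independent jobs to run side by side.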
Rosetta Moderator: Mod.Sense
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 63842 - Posted: 26 Oct 2009, 14:33:28 UTC - in response to Message 63841.  

I always found HT too... well, "good" to be true. I mean, you can't execute two threads w/o having two separate CPU cores, one dedicated to each. Period.
HT might come in handy when executing a single WU, since the threads sorta share the same mem base.

That's how I see it at least.



©2025 University of Washington
https://www.bakerlab.org