Message boards : Number crunching : cpu optimization
dasfiend Send message Joined: 11 Dec 07 Posts: 2 Credit: 5,844,394 RAC: 0 |
Does anyone know the status/philosophy R@h has with regard to optimizing for various CPUs and new instruction sets (e.g. SSE3 and SSE4)? I was reading on the Einstein@Home forums that they've had some success optimizing for the Core 2 architecture in general. I ask because I'm the proud new owner of a Core 2 Quad Extreme Edition QX9650 45nm Yorkfield/Penryn CPU. According to AnandTech and others, the SSE4 optimizations can lead to dramatic speed improvements for certain computation types. I don't know enough about what's available and what R@h needs for its computation, so I thought I'd toss this question out there. |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
Congrats on the qx9650! I'm a little jealous, to say the least, lol! My understanding is that there is no CPU optimization (à la SSE) at R@H. (And my personal guess is that the Project will refuse to even discuss this topic, the same way they refuse to discuss the possibility of utilizing GPUs and the Sony PS3.) |
dasfiend Send message Joined: 11 Dec 07 Posts: 2 Credit: 5,844,394 RAC: 0 |
congrats on the qx9650! i'm a little jealous, to say the least, lol! Thanks! It's running completely stable atm at 3850MHz (350x11) on air cooling! Pretty amazing stuff here. Yeah, I get why they wouldn't bother trying to support SSE4, seeing how few people currently have SSE4 processors. That said, there are many that have Core 2 or Athlon 64 level processors, and it seems like some broader optimizations (for platforms with larger numbers in the swarm) might be a worthwhile endeavor. *shrug* I wish I knew enough to help with that stuff, but I don't, so instead I just run the client on all my machines =) |
Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,607,712 RAC: 4,382 |
Sounds like you have plenty of CPU to offer the project - thx. Maybe the new mini project will make it easier to optimize for each processor. I understand the SSE3 and SSE4 instructions are designed for 3D calculations. I don't know if SSE3 or SSSE3 help. I think my next system will have a Q9450 with SSE4 support. Even with the low number of these devices available, it would be nice to have an optimized application take full advantage of the performance. Thx! Paul |
voyager Send message Joined: 2 Feb 08 Posts: 5 Credit: 76,584 RAC: 0 |
Do we automatically get updates? I currently see in my files V5.82 and Beta V5.99 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Do we automatically get updates? I currently see in my files V5.82 and Beta V5.99 Yes, the Rosetta programs are updated automatically. The BOINC software you have to update yourself once in a while. Rosetta Moderator: Mod.Sense |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
For those interested, it appears that Micro Center has the Q9450 on sale for about $300. 12 MB cache, and SSE4.1. Now, if only the Project (which apparently isn't interested in PS3 or GPU crunching) would at least discuss the potential benefits that might be offered by SSE4.1... |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
It is always a question of resource allocation. To do ONE of the "optimized" applications over at SAH, the programmers spend months working on the hand coding. Every time the application changes, guess what, you get to start hand coding all over again. With SAH the application is relatively stable, so the effort can pay off ... same with EAH ... Here, the application does not appear to be quite as stable, in the sense that it is not always going to process in one particular way. In this case, from a project perspective it makes no sense to try to program for the various versions of instruction sets out there. Particularly when it is "good enough" ... and "best" is the enemy of "good enough". Not being sure exactly what they are doing "under the hood", I cannot rule out that Rosetta is one of those applications for which there are few, if any, real optimizations that would bear fruit. |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
My understanding, subject to correction, is that SSE4.1 offers performance gains of between 50% and 100% for certain tasks. I'm not saying that such tasks are used in Rosie's code; I don't know. But with numbers like 50% - 100%, I will state that I believe it is worth having a discussion, even if the discussion is nothing more than "no, the nature of Rosie's code is not suited to take advantage of any of the improvements offered by SSE4.1". If I knew a project would benefit from SSE4.1, it would weigh heavily in my purchasing decision: $200 Q6600 or $300 Q9450. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Yes it would, IF ... 1) making the optimized version took less time than would be gained; 2) the population of computers with this capability is sufficiently large; 3) the Rosetta application is stable enough that the optimized sections would not have to be constantly recoded; 4) there is a programmer available who is capable of hand coding the application. Usually, the effort is not worth the payback. Optimizing compilers are pretty good at turning out an acceptably fast program. Hand coding does not double the speed of the overall application, only the affected parts. The overall increase in speed can be as low as 1-2% of the entire runtime ... with 10-20% being the high end of what is possible. Lurk over at the SAH forums and, as I said, they spend months trying to get that extra speed, and it is only about 20% faster than the "stock" application. And the last time I looked they had about 6 varieties, each hand coded for a particular architecture ... with that huge investment in time... |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
I'm not necessarily disagreeing with you. All I am suggesting is that when a new technology appears (i.e., sse4.1 didn't exist previously), that "seems" (50% - 100% performance gains for certain tasks) to have the "potential" to speed up the modeling process, it is worth asking and answering the questions you and I have posed. This, I would argue, would be time well spent. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
I'm not necessarily disagreeing with you. Um, in a word, ... no :) I am NOT trying to be difficult. However ... Optimizing code (and systems) is one of the topics that systems engineers spend a lot of time thinking about. One of the more common mistakes made by programmers is to spend time optimizing their code as they develop. Instead of writing the most robust code with the greatest chance of executing correctly, they worry about efficiency. The unfortunate consequence is that they spend enormous amounts of time making parts of the system efficient that have effectively zero impact on the whole system's total throughput. In this case the system includes the developers, the computers, the programs, and the problems. I grant you that the increase in speed seems to be a powerful advantage. But the first point, especially now, is how many systems of the total number of systems would take advantage of the improved code. Given that one of the constraints is developer time and the time it takes to hand code the application, which, if I understand the current efforts, is under revision as we speak ... it makes little sense from a systems perspective. USUALLY, and this is what I think I have understood from the other information on the site, the team is, in fact, doing the most significant form of optimization possible... changing the algorithm that processes the data. It is the classic case of why there are so many different kinds of sorting algorithms and why selecting the optimum one from the list is the best way to ensure that the data set is sorted in the most efficient manner. At SAH the only reason that so much effort can be put into these optimization efforts is that the processing algorithm and the way the search is conducted have not really changed in years. The core algorithms go back to classic days, and though the depth and breadth have changed ... well, the core mechanisms have not changed.
Yet, even so, the relatively "minor" changes can provoke months of development and testing. And each optimized application is directed at specific models of processors. Here, the technology is hardly what I would call stable: at SAH they are applying technologies that have decades if not a century of history (the core techniques in signal processing go back a LONG way), whereas here the path to a solution is not as well trod. Heck, I wish it were easy, and my G5 would still be blowing through the EAH tasks in a couple of hours ... but the later Altivec-optimized application on the G5 does not seem to be as effective as the older one I used two years ago ... back then the G5 beat my Xeon and completed 2-3 times as many models. No more; the stock Xeon application seems to be as fast as the G5 optimized one. Anyway, until and unless the RAH team can find an algorithm that they KNOW they will be using for the next 4-5 years, without significant changes, the investment in optimized applications is, from a systems perspective, not a viable or optimal solution. And were I their consultant, not that anyone has listened to me in the past, I would never recommend spending project resources on hand coding when better throughput is more likely from a better search algorithm ... That is where the real optimization will occur ... |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
As much as I like the thought of Rosetta using SSEx, I think Paul has a point. There's an opportunity cost to any change you make, and better scientific code that runs slower can probably give better bio-medical results than worse scientific code that runs faster. But I also think that Bad Penguin has a point when he says that the project could do a better job of letting us know what their plans are with regard to SSEx/GPU/PS3 stuff, and explaining why it is or is not on the roadmap for now. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
As much as I like the thought of Rosetta using SSEx, I think Paul has a point. Thank you :) There's an opportunity cost to any change you make, and better scientific code that runs slower can probably give better bio-medical results than worse scientific code that runs faster. It is not so much that the results are different as it is a matter of where you expend your resources. If you have a section of code that takes 4 hours to execute, and it is small and tight and maps to a processor extension, and that section of code is not likely to change, then it is a POTENTIAL candidate for optimization. But we are back to how many processors would be affected. In DC, a large number of processing nodes are older machines which will not have the additional instructions, thus not availing themselves of the optimized code. But I also think that Bad Penguin has a point when he says that the project could do a better job of letting us know what their plans are with regard to SSEx/GPU/PS3 stuff, and explain why it is or is not on the roadmap for now. Well, this is one of my long-standing complaints also... However, in this case, for this question, I do have a historical perspective: I know we debated this question back when the project first "stood up", and I did, as best I could, try to reiterate the discussion points. And what I would teach in classes about coding ... Selection of the appropriate algorithm is the best optimization. In new territory, well, this is where we are at ... the best approach is to try approaches and, while they are "live", look for alternatives ... then try those... The myriad of sort routines we have today, I used to have hundreds in my files, we developed those over decades ... RAH is much younger than that ... The communication issue ... yeah, but that is a whole 'nother debate ... |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
I think all that was really asked in penguin's last post was for the developers/project to chime in with a few words ;) We've been down the SIMD instruction improvements (this is the 'optimisation' talked about here) before. Even one of the Intel 'optimising' blokes was here chatting, and he knows his SSE stuff. In my opinion, at this stage in the game, all that the developers should be doing with regard to SIMD optimisation is trying compiler switches to see if it runs faster, while providing as much useful information and running as stably as the stock version. Slap them on Ralph@home and see what happens; not much time wasted, and some possibly useful information. Though BOINC would really need to implement instruction-set allocation to make it really useful. All other effort should go into improving the scientific code and the 'real' optimisation. Team mauisun.org |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
In DC, a large number of processing nodes are older machines which will not have the additional instructions, thus not availing themselves of the optimized code. SSE4 probably wouldn't be a good idea just yet (unless there's more than one version of the code available), but SSE would probably be supported by almost all machines. Those that wouldn't support it are probably so old that their loss would be more than made up for by the rest of the machines now going through the code faster. BOINC SIMAP is using SSE, but I think their code base is a lot more static than Rosetta@home's... The idea of simply using compiler flags is interesting, though. Any reason why that shouldn't be tried? |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The idea of simply using compiler flags is interesting, though. Any reason why that shouldn't be tried? The BOINC Manager does not have a good way to correctly download the "correct" application. I cannot recall if this was tried in the past. I do know, on some critical applications, even minor permutations of the processing may make the results incompatible. That is the problem LHC faced with cross platforms ... (darn ...) |
Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,607,712 RAC: 4,382 |
I will ask the stupid question: Is there a way to include both optimized and non-optimized routines? It would be ideal for the application to sense the hardware and use code optimized for that hardware. Thx! Paul |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 3,073 |
I believe they do use some compiler optimisations, but getting into hand-optimising is too time-consuming when the current aim of the project isn't to increase the speed at which it runs. The aim is to improve accuracy (which more speed helps with, but not, it seems, at the price it would cost). |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Is there a way to include both optimized and non-optimized routines? It would be ideal for the application to sense the hardware and use code optimized for that hardware. Yes there is. Note that nearly all crunching programs spend most of their time in a few key subroutines. Let's say one of those subroutines is called "do_it()" and the source code is in a file "do_it.c". One can arrange it so that the name of the subroutine is controlled by a compiler flag, so you can compile do_it.c several times with different optimizing flags and have the resulting subroutines named things like do_it_default(), do_it_athlon64(), do_it_core2(), etc. The final program can include all of these subroutines. As only a few key subroutines are being duplicated, the final executable will only be moderately bigger. The program can figure out which processor type it's running on. It can have a variable that is a pointer to a subroutine, which is then set to point to the best version of the subroutine for that processor type. All calls to do_it() will now be calls to whatever subroutine that variable is pointing at. That lets a single executable run on many processor types while using optimizations for the types that can support those optimizations. |
©2024 University of Washington
https://www.bakerlab.org