Rosetta@home using AVX / AVX2 ?

Author	Message
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80910 - Posted: 17 Dec 2016, 10:02:22 UTC - in response to Message 80471.
> rjs5 has put a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this.
......

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 80914 - Posted: 19 Dec 2016, 22:17:57 UTC - in response to Message 80910.
> rjs5 has put a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this.
> ......
I retired in June and I am decompressing from work and trying to figure out this thing called "leisure". 8-)
I haven't done much more on Rosetta and my opinion has not changed. I am going to Seattle 1/7 for several days but currently have no plans to stop by the lab. David was supportive and interested, but the "developers" were "skeptical" (even somewhat "hostile") about my performance expectations. Without their interest, not even the simple changes will happen ... and IMO, those are the only ones that make sense.
1. The Rosetta Project server/network infrastructure is "creaky" and is probably already the project bottleneck.
   a. The big problem is likely just disk IO and reliability, but it could be the network too.
   b. They could improve performance by supporting multiple MACHINE CLASSES (SSE2, AVX2, ...) and getting rid of the x87 floating point.
2. The Rosetta source code is kludgy and cumbersome, and major changes will be VERY difficult to make.
   a. Changing the compiler to ICC gives a bump.
   b. Adding the FOURTH DIMENSION to the vector coordinates will enable SSE/AVX to use VECTOR instructions instead of the SCALAR instructions the code used the last time I looked.
   c. I did not see how the current code could be modified beyond the 4th coordinate ... a GPU that has hundreds of compute elements would not have any parallel work.
3. The "performance" increase that a person will see on their machine will vary widely. Some will see big bumps. Others will see little change.
   a. I saw wide variations on small machines and little variation on big machines ... it looked like cache size and memory latency were a big factor.
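A minimal sketch of the "fourth dimension" idea from point 2b, for readers who have not seen the technique. It assumes a hypothetical coordinate type `Vec4` (not taken from the Rosetta source) that stores x, y, z plus one unused lane kept at zero, so each point fills exactly one 128-bit SSE register. Whether a given compiler actually emits vector instructions for these loops depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical padded coordinate: x, y, z plus one padding lane kept at 0.
// Four floats aligned to 16 bytes match the width of one SSE register.
struct alignas(16) Vec4 {
    float v[4];
};

static_assert(sizeof(Vec4) == 16, "padded coordinate should be 16 bytes");

// Build a 3D point; the 4th lane is zero so it never affects sums or dots.
inline Vec4 make_point(float x, float y, float z) {
    return Vec4{{x, y, z, 0.0f}};
}

// Lane-wise add over all four lanes: a fixed-length loop of 4 is the shape
// a vectorizing compiler can map to a single packed add.
inline Vec4 add(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Dot product over four lanes; the zero padding lane contributes nothing.
inline float dot(const Vec4& a, const Vec4& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) s += a.v[i] * b.v[i];
    return s;
}
```

The design cost is one wasted float per coordinate (16 bytes instead of 12), traded for loads and arithmetic that line up with the register width.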

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 80915 - Posted: 19 Dec 2016, 23:33:45 UTC - in response to Message 80914.
> I retired in June and I am decompressing from work and trying to figure out this thing called "leisure".
Its significance will become apparent when there is a snow and ice storm on the way and you realize that you don't have to go out in it. Then it is a big deal.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80916 - Posted: 20 Dec 2016, 7:29:40 UTC - in response to Message 80914.
> I retired in June and I am decompressing from work and trying to figure out this thing called "leisure". 8-)
To be clear, I'm NOT speaking about you, rjs5, you're GREAT!!! I'm referring to Rosetta's admins and their silence about this.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 80917 - Posted: 20 Dec 2016, 8:57:29 UTC - in response to Message 80916. Last modified: 20 Dec 2016, 9:00:58 UTC
>> I retired in June and I am decompressing from work and trying to figure out this thing called "leisure". 8-)
> To be clear, I'm NOT speaking about you, rjs5, you're GREAT!!! I'm referring to Rosetta's admins and their silence about this.
Yep! I got that. 8-)
My point was that the admins were good to great as volunteers. They, however, do not speak for the "developers" nor do they have much (if any) influence over the direction or tasks the developers work on. Developers get sensitive when you point out these things.
Humorously, if you doubled the compute performance of Rosetta, they would get twice the work completed for about the same network bandwidth. A 6-hour job would do 12 hours of the old work with the same network traffic. Rosetta uses the BOINC timer to kill the Rosetta job at the end of the next Rosetta compute loop.
The developers I interacted with were justifiably skeptical about my claims. I frequently had compiler developers give me a new compiler with a fancy new switch ... telling me it would improve performance by "5%". Those builds were ALWAYS slower, since the compiler developers did not understand the application I was working on.
I worked for the last 16 years as a Software Performance Engineer on a very large enterprise-sized application that was structured and behaved similarly to Rosetta. I used my knowledge of and experience with CPU/cache, memory and IO architectures to drive source code changes and compiler improvements.
The Rosetta developers know what I have recommended (and are familiar with the technique) AND THEY control their implementation. 8-) If they don't care, nothing will happen. They were not "pleased" with my candid recommendations.
The admins are OK.
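The bandwidth argument above can be worked through as simple arithmetic. This is only a sketch: the function name, the struct, and the 120 MB per-task figure in the usage note are made up for illustration, and it assumes the per-task transfer stays constant when the code gets faster.

```cpp
#include <cassert>

// Hypothetical summary of one task run under a fixed wall-clock target.
struct TaskSummary {
    double old_work_hours;        // work done, measured in hours of the old build
    double traffic_per_old_hour;  // network traffic per old-build hour of work
};

// wall_hours: runtime target enforced by the BOINC timer
// speedup:    compute speed relative to the old build (2.0 = twice as fast)
// traffic_mb: per-task transfer, assumed unchanged by the speedup
inline TaskSummary summarize(double wall_hours, double speedup, double traffic_mb) {
    const double work = wall_hours * speedup;
    return TaskSummary{work, traffic_mb / work};
}
```

For example, `summarize(6.0, 2.0, 120.0)` reports 12 old-build hours of work at 10 MB per hour of work, against 20 MB per hour of work for `summarize(6.0, 1.0, 120.0)`.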

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80918 - Posted: 20 Dec 2016, 9:51:59 UTC - in response to Message 80917. Last modified: 20 Dec 2016, 9:53:05 UTC
> My point was that the admins were good to great as volunteers. They, however, do not speak for the "developers" nor do they have much (if any) influence over the direction or tasks the developers work on. Developers get sensitive when you point out these things.
I thought admins and devs were working TOGETHER to make the project better :-P
> They were not "pleased" with my candid recommendations.
If you are interested, a volunteer on Tn-Grid (an Italian genetic map project) is working on optimization, with interesting preliminary results (linked thread: "optimize"):
> It turned out that on my SandyBridge CPUs SSE version was the fastest one (I suspect that unaligned loads kills performance of AVX version). it needs about 1 hour per WUs (original version needed about 2.5 hours)
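The Tn-Grid volunteer only suspects unaligned loads, so the following is a sketch of how one might check that suspicion rather than a diagnosis. A 256-bit AVX register is 32 bytes wide, so the check below tests whether a buffer starts on a 32-byte boundary; the helper name `is_aligned` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when pointer p sits on a multiple of `alignment` bytes.
// `alignment` is expected to be nonzero (for AVX loads, 32).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A buffer declared as `alignas(32) float buf[16];` passes `is_aligned(buf, 32)`, while `buf + 1` (4 bytes in) does not; data reached through such offset pointers is where unaligned AVX loads would come from.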

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 80919 - Posted: 20 Dec 2016, 19:55:13 UTC
Vote with your feet... Core a7 looks very promising and they have a forum that is actually alive.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 80920 - Posted: 20 Dec 2016, 21:39:09 UTC - in response to Message 80919. Last modified: 20 Dec 2016, 21:40:06 UTC
> Vote with your feet... Core a7 looks very promising and they have a forum that is actually alive.
I have done Folding for many years, but almost exclusively on GPUs. Their official party line is that anything they can do on the CPUs can also be done on the GPUs. So why bother with a7 on a CPU when you can do Core 21 on a GPU and be an order of magnitude more efficient? Of course, if your GPUs are committed to other projects, then a7 is perfectly reasonable, though you will be getting a lot more a4s for some time, and you can't select.
So I like to reserve my CPU power for the projects that have no alternative. Maybe Rosetta should have a more efficient alternative; it is annoying to think that they are overlooking the easy improvements, but they may have reasons.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80921 - Posted: 20 Dec 2016, 23:01:37 UTC - in response to Message 80914. Last modified: 20 Dec 2016, 23:07:53 UTC
> c. I did not see how the current code could be modified beyond the 4th coordinate ... GPU that has hundreds of compute elements would not have any parallel work.
> 3. "Performance" increase that a person will see on their machine will widely vary. Some will see big bumps. Others will see little change.
> a. I saw wide variations on small machines and little variations on big machines ... looked like cache size and memory latency was a big factor.
GPUs these days have thousands of vector stream processors. If all of them could be put to work on supercomputer-style vector compute as a BOINC group, R@h alone could easily top tens or hundreds of petaflops, outgunning the fastest supercomputers in the world. But that assumes everyone is running top-end GPUs like the recent Nvidia GTX 1070.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 80922 - Posted: 21 Dec 2016, 3:23:14 UTC - in response to Message 80921.
>> c. I did not see how the current code could be modified beyond the 4th coordinate ... GPU that has hundreds of compute elements would not have any parallel work.
>> 3. "Performance" increase that a person will see on their machine will widely vary. Some will see big bumps. Others will see little change.
>> a. I saw wide variations on small machines and little variations on big machines ... looked like cache size and memory latency was a big factor.
> GPUs these days have thousands of vector stream processors. If all of them could be put to work on supercomputer-style vector compute as a BOINC group, R@h alone could easily top tens or hundreds of petaflops, outgunning the fastest supercomputers in the world. But that assumes everyone is running top-end GPUs like the recent Nvidia GTX 1070.
I can tie my shoes one at a time (Rosetta today).
I can have one person help me and we can tie both in parallel (Rosetta with the extra 4th vector dimension).
If I have the help of "thousands" of people to tie my shoes, they can still only tie 2 in parallel with "thousands" idling.
A GPU is worthless for Rosetta work UNTIL the developers invest a TON of time to REDESIGN the entire software design (if even possible). With the number of machines currently crunching their work, they have near zero incentive to burn the man-years of effort.
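The shoe-tying analogy can be put in numbers. The sketch below assumes a simplified model in which the job consists of a fixed count of independent, equal-sized work items (2 shoes) and each worker handles one item per step; the function names are made up for this example and say nothing about how Rosetta itself is structured.

```cpp
#include <cassert>
#include <cstddef>

// Steps needed when `work_items` independent items are split across
// `workers` (workers must be > 0). Ceiling division: leftover items
// still cost one more full step.
inline std::size_t parallel_steps(std::size_t work_items, std::size_t workers) {
    return (work_items + workers - 1) / workers;
}

// Speedup relative to a single worker doing the items one at a time.
inline double speedup(std::size_t work_items, std::size_t workers) {
    return static_cast<double>(work_items) /
           static_cast<double>(parallel_steps(work_items, workers));
}

// Fraction of the workers that are doing useful work on average.
inline double utilization(std::size_t work_items, std::size_t workers) {
    return speedup(work_items, workers) / static_cast<double>(workers);
}
```

With 2 items, `speedup(2, 2)` and `speedup(2, 1000)` are both 2.0, while `utilization(2, 1000)` drops to 0.2%: the extra 998 helpers have nothing to tie.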

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80927 - Posted: 22 Dec 2016, 7:59:39 UTC - in response to Message 80922.
> A GPU is worthless for Rosetta work UNTIL the developers invest a TON of time to REDESIGN the entire software design (if even possible). With the number of machines currently crunching their work, they have near zero incentive to burn the man-years of effort.
Indeed, this is the thread about CPU optimizations :-P
And your words about the devs ("skeptical", "hostile" and not "pleased" with your candid recommendations) are not so encouraging :-(

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80929 - Posted: 22 Dec 2016, 17:02:44 UTC - in response to Message 80922. Last modified: 22 Dec 2016, 17:07:46 UTC
> I can tie my shoes one at a time (Rosetta today).
> I can have one person help me and we can tie both in parallel (Rosetta with the extra 4th vector dimension).
> If I have the help of "thousands" of people to tie my shoes, they can still only tie 2 in parallel with "thousands" idling.
> A GPU is worthless for Rosetta work UNTIL the developers invest a TON of time to REDESIGN the entire software design (if even possible). With the number of machines currently crunching their work, they have near zero incentive to burn the man-years of effort.
+1, agreed. Truth be told, even the physics people at CERN, whose high-energy physics computations are associated with vector supercomputers, have said that vector-parallel computation fits only a few real-world scenarios. For much of the rest, all that extremely parallel vector supercomputing horsepower, be it hundreds of petaflops, is useless: the problems can only be solved sequentially, where the next iteration depends on the prior one, and probably only 1 out of the millions of vector cores is used.
https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
On modern architecture, extrapolation based on synthetic benchmarks is mission impossible
Yup, only a tiny few problems out of the whole universe of problems can be expressed simply as a large set of linear equations. All the rest never fit that pattern.
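To illustrate "the next iteration depends on the prior one", here is a sketch contrasting a dependent chain with independent element-wise work. The update rule and its constants are arbitrary and chosen only so the arithmetic is exact in unsigned 32-bit integers; they have nothing to do with Rosetta or the CERN slides.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One update step; unsigned arithmetic wraps modulo 2^32 by definition.
inline std::uint32_t update(std::uint32_t x) {
    return 1664525u * x + 1013904223u;
}

// Dependent chain: step k+1 needs the result of step k, so as written
// the loop runs one step at a time no matter how many lanes exist.
inline std::uint32_t iterate(std::uint32_t x, int steps) {
    for (int k = 0; k < steps; ++k) x = update(x);
    return x;
}

// Independent work: every element is updated on its own, so the loop
// body can be spread across SIMD lanes or GPU threads.
inline void step_all(std::vector<std::uint32_t>& xs) {
    for (auto& x : xs) x = update(x);
}
```

`iterate` is the shape that leaves wide hardware idle; `step_all` is the shape that vector units and GPUs are built for. How much of a real workload falls into each shape is exactly what the posts above disagree about.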

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80961 - Posted: 1 Jan 2017, 7:42:51 UTC
Happy new year to all of you!!!

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 81525 - Posted: 9 May 2017, 8:08:07 UTC - in response to Message 80929.
> +1, agreed. Truth be told, even the physics people at CERN, whose high-energy physics computations are associated with vector supercomputers, have said that vector-parallel computation fits only a few real-world scenarios. For much of the rest, all that extremely parallel vector supercomputing horsepower, be it hundreds of petaflops, is useless: the problems can only be solved sequentially, where the next iteration depends on the prior one, and probably only 1 out of the millions of vector cores is used.
> https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
Yeah, but the physics people at CERN are not closed to new possibilities, like OpenCL:
https://www.hpcwire.com/2017/04/14/xeon-fpga-processor-tested-at-cern/

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 81534 - Posted: 16 May 2017, 14:47:25 UTC
TJ, a Rosetta dev, wrote in another thread:
> I am a Rosetta developer who looked at the issue Rjs5 pointed out which was using ICC rather than gcc. I also found a large speed improvement. As such we started transitioning over to using icc in compiling.
Just a thought: consider the new AOCC.
> The AOCC compiler system is a high performance, production quality code generation tool. The AOCC environment provides the developer the essential choices when building and optimizing C, C++, and Fortran applications targeting 32-bit and 64-bit Linux® platforms. The AOCC compiler system offers a high level of advanced optimizations, multi-threading, and processor support that includes global optimization, vectorization, interprocedural analyses, loop transformations, and code generation. Also highly optimized libraries, which extracts the optimal performance from each x86 processor core, are used. The AOCC Compiler Suite simplifies and accelerates development and tuning for x86, AMD64 (AMD® x86-64 Architecture), and Intel64 (Intel® x86-64 Architecture) applications

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 87161 - Posted: 29 Aug 2017, 18:25:38 UTC
> A lot has changed including increased use of C++11 from the commons developers, some of which is not yet supported by Visual C++ and has to be ported, also new dependencies, and a migration away from Boost (related to increased C++11 use). Also, the "rosetta scripts" protocols will not be backwards compatible due to new XML format rules so the next app version will be a new app named "rosetta" which is appropriate :)
These changes will help the SSEx/AVX development, I hope....

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 87193 - Posted: 3 Sep 2017, 17:55:46 UTC - in response to Message 87161.
>> A lot has changed including increased use of C++11 from the commons developers, some of which is not yet supported by Visual C++ and has to be ported, also new dependencies, and a migration away from Boost (related to increased C++11 use). Also, the "rosetta scripts" protocols will not be backwards compatible due to new XML format rules so the next app version will be a new app named "rosetta" which is appropriate :)
> These changes will help the SSEx/AVX development, I hope....
These changes will make no difference on vector computing.
The only possibility is .... since they have the chest open on the source code for major surgery, ... they could possibly make the simple but widespread changes to add vector capability.
They need to make changes to their primary TYPEDEF statements to pad to a 2^n size or no compiler will do anything other than sequential, SCALAR crunching.
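A sketch of what "pad the primary typedef to a 2^n size" could look like, with a compile-time check that guards the property. The type names `Xyz3`, `Xyz4` and `Coord` are hypothetical, not Rosetta's, and the claim that compilers fall back to scalar code without the padding is the poster's; results vary by compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// True when n is a nonzero power of two (exactly one bit set).
constexpr bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Smallest power of two that is >= n (written recursively so it is a
// valid constexpr function under C++11 as well).
constexpr std::size_t next_pow2(std::size_t n, std::size_t p = 1) {
    return p >= n ? p : next_pow2(n, p * 2);
}

struct Xyz3 { float x, y, z; };       // 12 bytes: not a power of two
struct Xyz4 { float x, y, z, pad; };  // 16 bytes: one spare lane

// Hypothetical "primary typedef" that the rest of the code would use.
typedef Xyz4 Coord;

// Fails the build if someone later changes Coord to an awkward size.
static_assert(is_power_of_two(sizeof(Coord)),
              "Coord must be padded to a power-of-two size");
```

Switching the typedef is the "simple" part; the "widespread" part is every place in the code that assumes three components, such as loops over `0..2`, serialization, and size calculations.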

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 87206 - Posted: 5 Sep 2017, 12:20:13 UTC - in response to Message 87193.
> These changes will make no difference on vector computing.
> The only possibility is .... since they have the chest open on the source code for major surgery, ... they could possibly make the simple but widespread changes to add vector capability.
> They need to make changes to their primary TYPEDEF statements to pad to a 2^n size or no compiler will do anything other than sequential, SCALAR crunching.
In Italy we say "campa cavallo" (something like "don't hold your breath"/"That'll be the day!")

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 87252 - Posted: 9 Sep 2017, 20:01:52 UTC - in response to Message 87206.
>> These changes will make no difference on vector computing.
>> The only possibility is .... since they have the chest open on the source code for major surgery, ... they could possibly make the simple but widespread changes to add vector capability.
>> They need to make changes to their primary TYPEDEF statements to pad to a 2^n size or no compiler will do anything other than sequential, SCALAR crunching.
> In Italy we say "campa cavallo" (something like "don't hold your breath"/"That'll be the day!")
When the Project managers realize that "doubling the performance of the software" is the same thing as "reducing the project operating costs by ~50%", they will begin pressing for the changes.
The changes are simple, but they require careful edits sprinkled across the source code. There are likely very few developers with enough knowledge of the code to make those changes safely. I suspect that those capable developers are not interested in that kind of work.
The developers currently appear (to me) to put very low priority on performance changes.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 87773 - Posted: 30 Nov 2017, 10:57:21 UTC - in response to Message 77541.
> Hi everybody, I just wanted to ask if there are plans to use AVX or AVX2 or possibly even the coming AVX-512 in Rosetta?
AVX-512 seems not so good (linked article: "Avx512")