Rosetta@home using AVX / AVX2?

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 77541 - Posted: 4 Oct 2014, 9:21:47 UTC Hi everybody, I just wanted to ask whether there are plans to use AVX or AVX2, or possibly even the upcoming AVX-512, in Rosetta? I've heard there is not much sense in using GPUs to crunch, but AVXx could really speed things up. It's certainly true that to gain the full speedup you would need to rewrite parts of the program, but compiling with the appropriate compiler flags should still give you some performance advantage without changing the code. It's sort of sad to see those instructions lie dormant and unused. I think Folding@home already supports AVX through GROMACS. Why not Rosetta?
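To illustrate the "just recompile" point: a plain loop like the one below needs no source changes, and whether the compiler turns it into scalar, 128-bit SSE2 or 256-bit AVX code depends on the build flags. This is a minimal C sketch assuming GCC or Clang; the function name `axpy` is illustrative and not taken from Rosetta.

```c
#include <stddef.h>

/* y[i] += a * x[i] for every element.
   Nothing here is AVX-specific: with GCC or Clang, flags such as
   "-O3 -mavx2" (or "-O3 -march=core-avx2") allow the compiler to
   vectorize the loop with 256-bit AVX registers, while a default x86-64
   build is limited to SSE2. Whether it is vectorized at all depends on
   the compiler version and optimization level. 'restrict' promises the
   compiler that x and y do not overlap, which helps vectorization. */
void axpy(double a, const double *restrict x, double *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

The catch, as the following posts discuss, is that a binary built with `-mavx2` may use AVX2 instructions anywhere, so it can no longer run on older CPUs.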

Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0	Message 77542 - Posted: 4 Oct 2014, 11:30:17 UTC - in response to Message 77541. Here is a conversation from a thread about Android development. David E K, one of the project administrators, doesn't talk about AVX, but in reply to a participant's question he says that updates to the server and the current application are the immediate priority. David E K wrote: Yes, there are definitely issues with Android and BOINC apps. The main issues now, I believe, are with the BOINC client and current Android versions, which put background processes to sleep. For now, I am not going to spend much time on our Android version until they fix this issue. The motivation for an Android ARM version has come from BOINC and their partnership with HTC Power To Give. Samsung is also interested in running R@h on their phones. [VENETO] boboviz wrote: What's next? Update server side? AVX/AVX2? :-) David E K wrote: Probably server updates, including software and hardware. Also, there have been some recent large-scale code changes/refactoring of Rosetta, so our next application update may not be trivial.

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 77543 - Posted: 4 Oct 2014, 14:16:12 UTC - in response to Message 77542. Murasaki wrote: Here is a conversation from a thread about Android development. David E K, one of the project administrators, doesn't talk about AVX, but in reply to a participant's question he says that updates to the server and the current application are the immediate priority. Hi, thanks for the info. I figure the use of AVXx would be a nice task for RALPH@home. All they need to do is provide a binary compiled with the appropriate flags. It either works or it doesn't. ;-) IMHO this should take much more precedence than getting Rosetta to work on Android.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 77551 - Posted: 6 Oct 2014, 17:56:34 UTC Good to hear they are thinking of updating their server code... because it is ANCIENT.

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 77553 - Posted: 6 Oct 2014, 18:10:48 UTC - in response to Message 77543. Dr. Merkwürdigliebe wrote: Hi, thanks for the info. I figure the use of AVXx would be a nice task for RALPH@home. All they need to do is provide a binary compiled with the appropriate flags. It either works or it doesn't. ;-) IMHO this should take much more precedence than getting Rosetta to work on Android. I'm not familiar with AVXx. I believe we'd have to upgrade our compiler versions, which isn't much of an issue (depending on how well/easily Rosetta ports). But would it crash on non-compatible machines?
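On the crash question: a program only dies with an illegal-instruction error if it actually executes an AVX instruction on a CPU that lacks it. One common way around this is runtime dispatch, where only one function is compiled for AVX and is called only after a CPU check. The C sketch below assumes GCC or Clang on x86 (`__attribute__((target("avx")))` and `__builtin_cpu_supports` are compiler extensions; MSVC would use its `__cpuid` intrinsic instead). The function names are hypothetical; on other compilers or CPUs it simply uses the scalar loop.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

/* Portable fallback: runs on any CPU. */
static double sum_scalar(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with AVX enabled; the rest of the
   binary stays baseline x86 code and runs on older CPUs. */
__attribute__((target("avx")))
static double sum_avx(const double *v, size_t n)
{
    __m256d acc = _mm256_setzero_pd();     /* 4 running partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(v + i));
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)                     /* leftover elements */
        s += v[i];
    return s;
}
#endif

/* Picks the AVX path only when the running CPU reports AVX support. */
double sum_dispatch(const double *v, size_t n)
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx"))
        return sum_avx(v, n);
#endif
    return sum_scalar(v, n);
}
```

One design caveat: the AVX path adds the numbers in a different order, so for general floating-point input its result can differ from the scalar loop in the last bits, which matters if results have to be bit-reproducible across hosts.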

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,998,581 RAC: 8,218	Message 77558 - Posted: 7 Oct 2014, 6:28:53 UTC - in response to Message 77553. Last modified: 7 Oct 2014, 6:29:42 UTC David E K wrote: I'm not familiar with AVXx. There are a lot of docs, examples, etc. on the Intel and AMD developer sites. :-) David E K wrote: But would it crash on non-compatible machines? Why would it? Other projects use SSE/AVX with a scheduler that correctly assigns work based on CPU capabilities.
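A rough sketch of the matching idea behind such a scheduler follows. It is not BOINC's actual code (BOINC does this server-side with plan classes); the build names and the flag-string format, modeled on the space-separated `flags` line of Linux `/proc/cpuinfo`, are assumptions for illustration only.

```c
#include <stdbool.h>
#include <string.h>

/* True if 'feature' appears as a whole space-separated token in 'flags',
   so that "avx" does not falsely match inside "avx2" or "avx512f". */
static bool has_flag(const char *flags, const char *feature)
{
    size_t len = strlen(feature);
    const char *p = flags;
    while ((p = strstr(p, feature)) != NULL) {
        bool starts = (p == flags) || (p[-1] == ' ');
        bool ends = (p[len] == ' ') || (p[len] == '\0');
        if (starts && ends)
            return true;
        p += len;
    }
    return false;
}

/* Hypothetical build names: hand out the newest build the host can run,
   with SSE2 as the baseline everyone gets otherwise. */
const char *pick_build(const char *flags)
{
    if (has_flag(flags, "avx2"))
        return "rosetta_avx2";
    if (has_flag(flags, "avx"))
        return "rosetta_avx";
    return "rosetta_sse2";
}
```

Because the choice is made before any work is sent, a host without AVX never receives a binary that would crash on it.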

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 77559 - Posted: 7 Oct 2014, 18:11:46 UTC - in response to Message 77553. David E K wrote: I'm not familiar with AVXx. I believe we'd have to upgrade our compiler versions, which isn't much of an issue (depending on how well/easily Rosetta ports). But would it crash on non-compatible machines? Hi there, and thanks for your reply. I have to confess I'm not really an expert on these things. I think you do have to upgrade your compiler to a fairly recent version to take advantage of the new extensions. Unless you specifically compile for a certain architecture (as in -march=core-avx2), the binary will just use a different code path, resulting in a larger binary. But then again, I'm not sure. I'm also aware that in the past there were ludicrous expectations concerning new CPU extensions, e.g. MMX and 3DNow!. But I think this time, with AVX2, it will be different. If you have some time to spare, you should read the relevant thread on AnandTech. The user Benchpress goes to some length to explain what AVX2 can do for the performance of your code: Thread about AVX2. Again, this should take much more precedence than running Rosetta@home on a tablet.
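For reference, with GCC and Clang a flag such as `-march=core-avx2` or `-mavx2` does not by itself add a fallback code path: it lets the compiler use AVX2 anywhere in the translation unit, and it announces this through predefined macros (`__AVX2__`, `__AVX__`, `__SSE2__`; MSVC sets `__AVX__`/`__AVX2__` under `/arch:AVX`/`/arch:AVX2`). The small C sketch below, with a hypothetical function name, reports which level a build was compiled for.

```c
/* Reports the highest vector extension the compiler was allowed to use
   for this translation unit. A binary built at "avx2" level may contain
   AVX2 instructions anywhere, with no runtime check. */
const char *compiled_isa(void)
{
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";        /* e.g. -mavx2 or -march=core-avx2 */
#elif defined(__AVX__)
    return "avx";         /* e.g. -mavx */
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "sse2";        /* always available on x86-64 */
#else
    return "baseline";    /* e.g. 32-bit x87-only builds, or non-x86 */
#endif
}
```

A separate fallback path has to be requested explicitly, for example with the runtime-dispatch pattern discussed above David's question.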

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,998,581 RAC: 8,218	Message 77564 - Posted: 8 Oct 2014, 15:10:05 UTC - in response to Message 77559. Dr. Merkwürdigliebe wrote: If you have some time to spare, you should read the relevant thread on AnandTech. The user Benchpress goes to some length to explain what AVX2 can do for the performance of your code. Some programs are 20% faster with AVXx, others 40% (!!), others 10%; it depends on the code... Here are some docs/tools about AVX/AVX2: First program with Avx2; Processing arrays with Avx2; CodeXL benefits; ACML. As I said, there are a lot of tools, docs, and examples.
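A minimal example of processing arrays with AVX2, in the spirit of those tutorials but not taken from them: 256-bit integer addition (`_mm256_add_epi32`) is an AVX2 instruction (first-generation AVX only widened floating-point operations to 256 bits), so the sketch checks for AVX2 at run time. It assumes GCC or Clang on x86 and falls back to a scalar loop elsewhere; `add_i32` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define ADD_X86 1

/* out[i] = a[i] + b[i], eight 32-bit ints per 256-bit register. */
__attribute__((target("avx2")))
static void add_i32_avx2(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));  /* unaligned loads */
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_add_epi32(va, vb));
    }
    for (; i < n; ++i)   /* scalar tail for the last n % 8 elements */
        out[i] = a[i] + b[i];
}
#endif

void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
#ifdef ADD_X86
    if (__builtin_cpu_supports("avx2")) {
        add_i32_avx2(a, b, out, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The scalar tail loop is the usual way to handle array lengths that are not a multiple of the vector width.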

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 77835 - Posted: 17 Jan 2015, 1:50:15 UTC Last modified: 17 Jan 2015, 2:05:23 UTC yup, i'd think AVX / AVX2 is a good thing. actually this is very similar in nature to the GPU request threads, i.e. exploiting vectorized CPU or GPU functionality to significantly accelerate computations. the thing is that it may involve some code rewrites, which it seems have been deemed 'hard to do'? :o lol AVX / AVX2 can process 4 x 64-bit double-precision floats with a single instruction; on a naive basis, against non-vectorized code, that would imply up to 4 times the speedup per CPU core. but in practice i'd think the speedup may not really reach that scale, as many of today's CPUs are superscalar (they feature instruction-level parallelism for non-vector code) and it's likely that not all pieces of code can be parallelized: http://en.wikipedia.org/wiki/Amdahl%27s_law as for GPUs, the very high-end / expensive cards are said to be able to process many times that. (unfortunately GPUs are not consistent in this respect: many GPUs run double-precision floating-point computation at a much lower rate than single precision, which cuts their prowess to 1/8 or even less.) note also that desktop GPUs are normally clocked at about 1 GHz, which is some 1/3 of today's CPU clock frequencies (e.g. 3-4 GHz). link to GPU thread discussion: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6549
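The Amdahl's-law point can be made concrete. If a fraction p of the run time is vectorized and sped up by a factor s (at most 4 for 64-bit doubles in 256-bit AVX registers), the overall speedup is 1 / ((1 - p) + p / s). A one-line C helper, with a hypothetical name:

```c
/* Amdahl's law: overall speedup when a fraction p (0..1) of the run time
   is accelerated by a factor s and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, vectorizing half of the run time with a perfect 4x gain gives only 1 / (0.5 + 0.125) = 1.6x overall, which is why even a modest per-core gain depends heavily on how much of the code vectorizes.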

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,998,581 RAC: 8,218	Message 77840 - Posted: 17 Jan 2015, 7:26:09 UTC - in response to Message 77835. Last modified: 17 Jan 2015, 7:27:00 UTC sgaboinc wrote: the thing is that it may involve some code rewrites, which it seems have been deemed 'hard to do'? :o lol I know that Rosetta's admins aren't trying to use the AVX extensions. I know they tried Android and it was a waste of time. So why not try AVX? sgaboinc wrote: AVX / AVX2 can process 4 x 64-bit double-precision floats with a single instruction; on a naive basis, against non-vectorized code, that would imply up to 4 times the speedup per CPU core. but in practice i'd think the speedup may not really reach that scale. A simple 10% gain per core is a BIG gain!! :-)

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 77842 - Posted: 18 Jan 2015, 4:21:40 UTC - in response to Message 77840. Last modified: 18 Jan 2015, 4:59:44 UTC [VENETO] boboviz wrote: I know that Rosetta's admins aren't trying to use the AVX extensions. I know they tried Android and it was a waste of time. So why not try AVX? [VENETO] boboviz wrote: A simple 10% gain per core is a BIG gain!! :-) actually that's almost the same as optimizing the programs for GPUs: a common technology based on 'higher-level' languages that's optimized for vector computation, be it AVX/AVX2 on CPUs or vector GPU cores, is OpenCL (CUDA does the same for Nvidia GPUs only). https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf the thing is that parts of the Rosetta Commons code would need to be rewritten / redesigned to use OpenCL. in addition, the compiled target binaries would certainly be platform specific (i.e. differ between Intel, AMD and Nvidia CPU/GPU platforms). however, OpenCL apparently uses just-in-time compilation, where the kernels are basically stored as source text and compiled at run time for the specific platform. the other issue is that there are specific bindings / libraries / SDKs for each platform, so it may mean quite a lot more maintenance, as there would at least be a need to target the different OpenCL runtime platforms (and even the underlying CPU/GPU hardware - they are different after all); it may mean maintaining multiple versions of the Rosetta code even if OpenCL is used.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,998,581 RAC: 8,218	Message 77856 - Posted: 26 Jan 2015, 10:47:12 UTC http://code.compeng.uni-frankfurt.de/projects/vc

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78060 - Posted: 24 Mar 2015, 11:03:08 UTC Any news on this? 200 TFLOPS (which is probably a bad estimate) is starting to look a bit low!

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78061 - Posted: 24 Mar 2015, 17:54:13 UTC - in response to Message 77840. [VENETO] boboviz wrote: I know that Rosetta's admins aren't trying to use the AVX extensions. I know they tried Android and it was a waste of time. So why not try AVX? We do have a somewhat stable Android build, but Android 5 threw me a curveball with its requirement of PIE, and unfortunately it's not so easy to build Rosetta with PIE, even though they say it just requires the -fPIE / -pie compile/link flags etc. Yes, it compiles and links, but it segfaults, and debugging has been tough. Such is sometimes the case: things are said to be easy, but in practice it can be a different story. It has been on the back burner, as has AVX etc., due to other research-related priorities; for example, we have been invited to write 3 papers for the CASP11 meeting, and I'm also in the process of making the builds based on the current Rosetta source.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78063 - Posted: 25 Mar 2015, 21:10:50 UTC - in response to Message 78061. Last modified: 25 Mar 2015, 21:11:55 UTC David E K wrote: We do have a somewhat stable Android build, but Android 5 threw me a curveball with its requirement of PIE, and unfortunately it's not so easy to build Rosetta with PIE, even though they say it just requires the -fPIE / -pie compile/link flags etc. Yes, it compiles and links, but it segfaults, and debugging has been tough. Such is sometimes the case: things are said to be easy, but in practice it can be a different story. It has been on the back burner, as has AVX etc., due to other research-related priorities; for example, we have been invited to write 3 papers for the CASP11 meeting, and I'm also in the process of making the builds based on the current Rosetta source. I was just watching the video posted on the front page; you're aging really well! (compared to the Rosetta@home promo video). Is it possible to release the code as open source and have two versions of it (one proprietary and one open source)? Open-source development could really help with things like this, especially when you're short on coders and/or time. EDIT: Profile pictures are not loading when updated :( (for example, mine)

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 78068 - Posted: 27 Mar 2015, 19:48:58 UTC - in response to Message 78063. Chilean wrote: I was just watching the video posted on the front page; you're aging really well! (compared to the Rosetta@home promo video). Is it possible to release the code as open source and have two versions of it (one proprietary and one open source)? Open-source development could really help with things like this, especially when you're short on coders and/or time. EDIT: Profile pictures are not loading when updated :( (for example, mine) I'll check up on this profile picture bug. Don't know why that's happening. This is David Kim, not David Baker :) The Rosetta source is freely available to academics. Source development, however, is limited to RosettaCommons developers/researchers; institutions/groups can join if they agree to the UW RosettaCommons terms and align with the same research interests, I believe. You can check out the rosettacommons.org site for more info.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78086 - Posted: 2 Apr 2015, 19:59:23 UTC - in response to Message 78068. David E K wrote: I'll check up on this profile picture bug. Don't know why that's happening. This is David Kim, not David Baker :) The Rosetta source is freely available to academics. Source development, however, is limited to RosettaCommons developers/researchers; institutions/groups can join if they agree to the UW RosettaCommons terms and align with the same research interests, I believe. You can check out the rosettacommons.org site for more info. Ah, well, sorry for the mix-up. It was just an idea to help boost R@h's FLOPS. I doubt it's that easy to implement AVX just like that, though.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2155 Credit: 12,998,581 RAC: 8,218	Message 78165 - Posted: 28 Apr 2015, 12:44:55 UTC - in response to Message 78086. Chilean wrote: I doubt it's that easy to implement AVX just like that, though. Yep, not easy, but there are tools/documentation that help, like this: Intel Intrinsics Guide
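As a small example of the kind of operation the Intel Intrinsics Guide documents, here is a dot product using the fused multiply-add intrinsic `_mm256_fmadd_pd`, which computes a*b + c on four doubles at once. It is a sketch assuming GCC or Clang on x86 (on Intel CPUs, FMA arrived with Haswell alongside AVX2), with a scalar fallback; the names are hypothetical.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define DOT_X86 1

__attribute__((target("avx,fma")))
static double dot_fma(const double *x, const double *y, size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)   /* acc = x[i..i+3] * y[i..i+3] + acc */
        acc = _mm256_fmadd_pd(_mm256_loadu_pd(x + i), _mm256_loadu_pd(y + i), acc);
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    double s = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; ++i)           /* remaining elements */
        s += x[i] * y[i];
    return s;
}
#endif

double dot(const double *x, const double *y, size_t n)
{
#ifdef DOT_X86
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        return dot_fma(x, y, n);
#endif
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```

As with any vectorized reduction, the FMA path rounds differently from the scalar loop, so results may differ slightly for non-integer data.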

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 78198 - Posted: 15 May 2015, 22:42:25 UTC - in response to Message 78165. [VENETO] boboviz wrote: Yep, not easy, but there are tools/documentation that help, like this: Intel Intrinsics Guide The executing code seems to be compiled for an i386 and uses the 387 floating-point 8-register stack model. The code (on my machine) spends about 5% of its time waiting for the "fmul st0, st1" (marked "====" below) to complete.

minirosetta_3.54_windows_x86_64.exe
Rosetta instruction clip ...
    address    instruction
    0x6b3d82   add   ebx, ecx
    0x6b3d84   lea   ebx, ptr [edi+ebx*8]
    0x6b3d87   fld   st0, qword ptr [edi+eax*8]
    0x6b3d8a   mov   eax, dword ptr [ebp-0x20]
    0x6b3d8d   mov   edi, dword ptr [ebp-0x14]
    0x6b3d90   fmul  st0, st1
    0x6b3d92   inc   ecx
    =========================
    0x6b3d93   add   eax, 0x8
    0x6b3d96   fsubr st0, qword ptr [ebx]
    0x6b3d98   add   edx, 0x8

All x86 CPUs from the Pentium 4 (Nov. 2000) onward support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to have been made this century, but would use the SSE registers. The 16 directly addressable registers (in 64-bit mode; 8 in 32-bit mode) would reduce register spills to the stack and ease code scheduling (less shuffling of data around and more computation). A simple recompile should make a noticeable difference without any side effects. If you compile for anything newer than SSE2, or for GPUs, you have to start worrying about and managing the population of target machines you deliver workloads to. Beyond that, the developers would need to look more closely at the code.
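To check which floating-point model a build actually uses, C99's `FLT_EVAL_METHOD` from `<float.h>` is a portable hint: with GCC it is typically 2 for 32-bit x87 code (intermediates kept in 80-bit long double) and 0 for SSE2 code. With GCC, 32-bit SSE2 math is requested with `-msse2 -mfpmath=sse`; MSVC's `/arch:SSE2` plays a similar role, and x64 builds use SSE2 anyway. A minimal sketch, with a hypothetical function name:

```c
#include <float.h>

/* FLT_EVAL_METHOD (C99) says how floating-point expressions are evaluated:
     0  -> in the declared type (typical of SSE2 code and non-x86 targets)
     2  -> in long double (typical of 32-bit x87 code, e.g. gcc -m32 with
           the default -mfpmath=387)
    other values (e.g. -1) -> indeterminate or implementation-specific */
const char *fp_eval_model(void)
{
#if FLT_EVAL_METHOD == 0
    return "declared type";
#elif FLT_EVAL_METHOD == 2
    return "long double (x87-style)";
#else
    return "other/indeterminate";
#endif
}
```

Besides speed, this is one reason x87 and SSE2 builds of the same source can give slightly different numerical results.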

Mark Joined: 10 Nov 13 Posts: 40 Credit: 397,847 RAC: 0	Message 78200 - Posted: 15 May 2015, 23:25:23 UTC - in response to Message 78198. rjs5 wrote: The executing code seems to be compiled for an i386 and uses the 387 floating-point 8-register stack model. Interesting. Which tool did you use to get that info, may I ask?