Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
Author | Message |
---|---|
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Hi everybody, I just wanted to ask if there are plans to use AVX, AVX2, or possibly even the upcoming AVX-512 in Rosetta. I heard there is not much sense in using GPUs to crunch, but AVXx could really speed things up. It's certainly true that to gain the full speedup you would need to rewrite parts of the program, but compiling with the appropriate compiler flags should still give you some performance advantage without changing the code. It's sort of sad to see those instructions lie dormant and unused. I think Folding@Home already supports AVX through GROMACS. Why not Rosetta? |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Here is a conversation from a thread about Android development. David E K, one of the project administrators, doesn't talk about AVX but replies to a participant's query that updates to the server and the current application are the immediate priority. David E K wrote: "Yes, there are definitely issues with Android and BOINC apps. The main issues now, I believe, are with the BOINC client and current Android versions, which put background processes to sleep. For now, I am not going to spend much time on our Android version until they fix this issue. The motivation for an Android ARM version has come from BOINC and their partnership with HTC Power To Give. Samsung is also interested in running R@h on their phones." [VENETO] boboviz wrote: "What's next? Update server side? Avx/Avx2? :-)" David E K wrote: "Probably server updates including software and hardware. Also, there's been some recent large scale code changes/refactoring of Rosetta so our next application update may not be trivial." |
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
"Here is a conversation from a thread about Android development. David E K, one of the project administrators, doesn't talk about AVX but replies to a participant's query that updates to the server and the current application are the immediate priority." Hi, thanks for the info. I figure the use of AVXx would be a nice task for ralph@home. All they need to do is provide a binary compiled with the appropriate flags. It either works or it doesn't. ;-) IMHO this should take much more precedence than getting Rosetta to work on Android. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
"Here is a conversation from a thread about Android development. David E K, one of the project administrators, doesn't talk about AVX but replies to a participant's query that updates to the server and the current application are the immediate priority." I'm not familiar with AVXx. I believe we'd have to upgrade our compiler versions, which isn't much of an issue (depending on how well/easily Rosetta ports). But would it crash on non-compatible machines? |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
"I'm not familiar with AVXx." The Intel and AMD developer sites have a lot of docs, examples, etc. :-) "But would it crash on non-compatible machines?" Why would it? Other projects use SSE/AVX with a scheduler that assigns work correctly based on each host's CPU capabilities. |
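The idea in the post above, a scheduler that only hands an AVX build to hosts that report AVX support, can be sketched in a few lines. This is a minimal illustration, not BOINC's or Rosetta@home's actual scheduler code: the `AppVersion` class, the build names, the `rank` values, and `pick_version` are all hypothetical, and the feature strings are assumed to look like the lowercase flags an operating system typically reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppVersion:
    """One hypothetical build of the application."""
    name: str            # illustrative build label, not a real Rosetta binary
    required: frozenset  # CPU feature flags the build needs in order to run
    rank: int            # assumed preference order: higher is tried first


# Hypothetical set of builds; the generic build requires nothing,
# so every host always has at least one usable version.
VERSIONS = [
    AppVersion("generic", frozenset(), 0),
    AppVersion("sse2", frozenset({"sse2"}), 1),
    AppVersion("avx", frozenset({"sse2", "avx"}), 2),
    AppVersion("avx2", frozenset({"sse2", "avx", "avx2"}), 3),
]


def pick_version(host_features, versions=VERSIONS):
    """Return the highest-ranked build whose requirements the host satisfies."""
    host = {flag.lower() for flag in host_features}
    usable = [v for v in versions if v.required <= host]
    return max(usable, key=lambda v: v.rank)


if __name__ == "__main__":
    for flags in (["sse2"], ["sse2", "avx"], ["sse2", "avx", "avx2", "fma"], []):
        print(flags, "->", pick_version(flags).name)
```

Because a host is never offered a build whose required flags it does not report, an older machine falls back to the generic build instead of crashing on an unsupported instruction.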
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Hi there and thanks for your reply. I have to confess I'm not really an expert on these things. I think you do have to upgrade your compiler to a fairly recent version in order to take advantage of the new extensions. Unless you specifically compile for a certain architecture (as in -march=core-avx2), the binary will just use a different code path, resulting in a larger binary. But then again, I'm not sure. I'm also aware that in the past there were ludicrous expectations concerning new CPU extensions such as MMX and 3DNow!, but I think this time with AVX2 it will be different. If you have some time to spare, you should read the relevant thread on Anandtech. The user Benchpress goes to some length to explain what the use of AVX2 can do to the performance of your code. Thread about AVX2 Again, this should take much more precedence than running rosetta@home on a tablet. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
"If you have some time to spare, you should read the relevant thread on Anandtech. The user Benchpress goes to some length to explain what the use of AVX2 can do to the performance of your code." Some programs are 20% faster with AVXx, others 40% (!!), others 10%; it depends on the code... Here are some docs/tools about Avx/Avx2: First program with Avx2; Processing arrays with Avx2; CodeXL benefits; ACML. As I said, there are a lot of tools, docs, and examples. |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
Yup, I'd think AVX / AVX2 is a good thing. Actually this is very similar (or of the same nature) to the GPU request threads, i.e. exploiting vectorized CPU or GPU functionality to significantly accelerate computations. The thing is that it may involve some code rewrites, which it seems have been deemed 'hard to do'? :o lol AVX / AVX2 can process 4 x 64-bit double precision floats in a single instruction, so on a naive basis against non-vectorized code it would imply up to 4 times the speedup per CPU core. But in practice I'd think the speedup may not really reach that scale, as many of today's CPUs are superscalar (they feature instruction-level parallelism for non-vector code) and it's likely that not all pieces of code can be parallelized: http://en.wikipedia.org/wiki/Amdahl%27s_law As for GPUs, the very *high end / expensive* cards are said to be able to process many times that. (Unfortunately GPUs are not consistent in this respect: a lot of GPUs use software emulation for double precision float computation, which cuts that GPU prowess to 1/8 or less.) Note also that desktop GPUs are normally clocked at about 1 GHz, which is some 1/3 of today's CPU clock frequencies (e.g. 3-4 GHz). Link to GPU thread discussion: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6549 |
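The Amdahl's law argument in the post above can be made concrete with a short calculation. This is a sketch under stated assumptions: `amdahl_speedup` is a hypothetical helper, the vectorizable fractions are made-up inputs rather than measurements of Rosetta, and the model ignores superscalar effects, memory bandwidth, and clock changes under vector load.

```python
def amdahl_speedup(vector_fraction, lanes):
    """Overall speedup when `vector_fraction` of the runtime speeds up `lanes` times.

    Amdahl's law: S = 1 / ((1 - p) + p / n), with p the fraction of runtime
    that benefits and n the speedup of that fraction (4 for AVX with doubles).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / lanes)


if __name__ == "__main__":
    # Illustrative fractions only; the real vectorizable share of Rosetta is unknown here.
    for p in (0.1, 0.25, 0.5, 0.9, 1.0):
        print(f"vectorizable fraction {p:.2f}: speedup {amdahl_speedup(p, 4):.2f}x")
```

For instance, if half of the runtime vectorizes perfectly across 4 lanes, the overall speedup is 1 / (0.5 + 0.5 / 4) = 1.6x, well short of the naive 4x; the full 4x appears only when the whole runtime is vectorizable.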
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
"the thing is that it may involve some code rewrites, which it seemed has been deemed 'hard to do'? :o lol" I know that Rosetta's admins haven't tried to use the AVX extensions. I know they tried Android and it was a waste of time. So why not try AVX? "AVX / AVX2 can process 4 x 64bit double precision floats in a single clock cycle, on a naive basis against non-vectorized codes, it would imply up to 4 times the speedup per cpu core. but in practice i'd think the speedup may not really reach the that scale" A simple 10% gain per core is a BIG gain!! :-) |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
"the thing is that it may involve some code rewrites, which it seemed has been deemed 'hard to do'? :o lol" Actually that's *almost* the same as optimizing the programs for GPUs, as the common technologies based on 'higher level' languages that are optimised for vector computation, be it AVX/AVX2 on CPUs or vector GPU cores, are OpenCL and CUDA. https://software.intel.com/sites/default/files/m/d/4/1/d/8/Writing_Optimal_OpenCL_28tm_29_Code_with_Intel_28R_29_OpenCL_SDK.pdf The thing is that part of the Rosetta Commons code would need to be rewritten / redesigned to use OpenCL. In addition, the *compiled* target binaries would certainly be *platform specific* (i.e. differ between Intel, AMD, and Nvidia platforms). However, apparently OpenCL uses just-in-time methods where the code is basically stored as text and compiled at run time for the specific platform. Note the other issue is that there are specific bindings / libraries / SDKs for each platform, so it may mean quite a lot more maintenance: there would at least be a need to target the different OpenCL runtime platforms (and even the underlying CPU/GPU hardware platforms, which are different after all), and it may mean maintaining multiple versions of the Rosetta code even if OpenCL is used. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
|
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
We do have a somewhat stable Android build, but Android 5 threw me a curveball with the requirement of PIE, and unfortunately it's not so easy to build Rosetta with PIE even though they say it just requires -PIE -fpie compile/link commands etc... Yes, it compiles and links, but it seg faults and debugging has been tough. Such is the case sometimes when things are said to be easy but in practice it can be a different story. It has been on the back burner, as with AVX etc., due to other research-related priorities. For example, we have been invited to write 3 papers for the CASP11 meeting and I'm also in the process of making the builds based on current Rosetta source. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
I was just watching the video posted on the front page, you're aging really well! (compared to the Rosetta@home promo video). Is it possible to release the code as open source and have two versions of it (one proprietary and one open source)? Open-source development could really help with things like this, especially when you're short on coders and/or time. EDIT: Profile pictures are not loading when updated :( (for example, mine) |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I'll check up on this profile picture bug. Don't know why that's happening. This is David Kim, not David Baker :) The Rosetta source is freely available to academics. Source development, however, is limited to RosettaCommons developers/researchers; institutions/groups can join if they agree to the UW RosettaCommons terms and align with the same research interests, I believe. You can check out the rosettacommons.org site for more info. |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Ah, well, sorry for the mix up. It was just an idea to help boost R@H's FLOPS. Doubt it's that easy to implement AVX just like that though. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1991 Credit: 9,501,324 RAC: 12,596 |
"Doubt it's that easy to implement AVX just like that though." Yep, not easy, but there are tools/documentation that help, like this: Intel Intrinsics Guide |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 22,962,883 RAC: 17,460 |
"Doubt it's that easy to implement AVX just like that though." The executing code seems to be compiled for an i386 and uses the 387 floating point 8-register stack model. The code (on my machine) spends about 5% of the time waiting for the "fmul st0,st1" ("====" below) to complete. Rosetta instruction clip from minirosetta_3.54_windows_x86_64.exe (address, instruction): 0x6b3d82 add ebx, ecx; 0x6b3d84 lea ebx, ptr [edi+ebx*8]; 0x6b3d87 fld st0, qword ptr [edi+eax*8]; 0x6b3d8a mov eax, dword ptr [ebp-0x20]; 0x6b3d8d mov edi, dword ptr [ebp-0x14]; 0x6b3d90 fmul st0, st1; 0x6b3d92 inc ecx; =========================; 0x6b3d93 add eax, 0x8; 0x6b3d96 fsubr st0, qword ptr [ebx]; 0x6b3d98 add edx, 0x8. All CPUs since the Pentium 4 (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to have been made this century, but the code would then use the SSE registers. The 16 directly addressable registers would reduce register stores to the stack and ease code scheduling (less shuffling of data around and more computation). A simple recompile should make a noticeable difference without any side effects. If you compile for anything newer than SSE2, or for GPUs, you have to start worrying about and managing the population of target machines you deliver workloads to. Beyond that, the developers would need to look more closely at the code. |
Mark Joined: 10 Nov 13 Posts: 40 Credit: 397,847 RAC: 0 |
"The executing code seems to be compiled for a i386 and uses the 387 floating point 8-register stack model. The code (on my machine) spends about 5% of the time waiting for the "fmul st0,st1" ("====" below) to complete." Interesting. Which tool did you use to get that info, may I ask? |
©2024 University of Washington
https://www.bakerlab.org