Message boards : Number crunching : R@H Scientists/Coders: An analysis of the Rosetta binaries...
Author | Message |
---|---|
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,624,317 RAC: 7,073 |
Hi, as you may have noticed we're only talking to ourselves here. :-( The latest post by an admin on this thread was on 17 Jul 2015. I hope they read the forum and are "taking inspiration" from other projects for app optimization. |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
Crunch3r and Sesef seem to almost enjoy optimizing code. If the Rosetta admins could just send them the code for "academic purposes", let them play with it (keeping correct results in mind, of course), then have it sent back to double-check its validity, we could actually be getting somewhere. In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals. They are amazed by it: "DENIS Optimized app". Note that Sesef did the Windows version, while Crunch3r added versions for Linux and OSX (and an alternative Windows version). But that is a newer project, and whether the same tricks work for the Rosetta code, which has been around for a while, is another question. |
Betting Slip Send message Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0 |
In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals. The last time David replied to this thread was in July; that in itself speaks volumes. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,624,317 RAC: 7,073 |
The last time David replied to this thread was in July; that in itself speaks volumes. But he replied on Ralph's forum 3 days ago: "I've been too busy to look into optimizations. We do have one volunteer helping us out however. I'll keep you all posted if anything develops." |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,624,317 RAC: 7,073 |
IDK if I already mentioned it, but it seems that David is doing all the work... Science/BOINC/Public Relations. That's strange, because there is a large "ecosystem" around Rosetta@home, like BakerLabs, Rosettacommons and Rosetta Design Group... but I don't know how they are involved in the BOINC application's development. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,624,317 RAC: 7,073 |
He also just updated the Linux binary to 64-bit. Is it really 64-bit or, like the Windows one, a "simple rename"? It's a pity this thread is almost dead |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
He also just updated the Linux binary to 64-bit. On Linux there is normally little room for bluffing, because the 'file' command is available in almost every distribution: # file minirosetta_graphics_3.71_x86_64-pc-linux-gnu minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped |
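For readers without the 'file' utility at hand, the same 32-bit versus 64-bit question can be answered from the first 20 bytes of the binary. The sketch below is only an illustration: the function names `describe_elf_header` and `describe_elf_file` are made up for this example, and the lookup tables cover just the two x86 cases discussed in this thread. It relies on the standard ELF header layout (class byte at offset 4, byte order at offset 5, machine type at offset 18).

```python
import struct

ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF specification.
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}
# Only the two e_machine values relevant here; other values print as hex.
ELF_MACHINES = {0x03: "Intel 80386", 0x3E: "x86-64"}


def describe_elf_header(header: bytes) -> str:
    """Summarize the class and machine fields of an ELF header."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    elf_class = ELF_CLASSES.get(header[4], "unknown class")
    # EI_DATA: 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return f"ELF {elf_class}, {ELF_MACHINES.get(machine, hex(machine))}"


def describe_elf_file(path: str) -> str:
    """Read just enough of a file to describe its ELF header."""
    with open(path, "rb") as handle:
        return describe_elf_header(handle.read(20))
```

Calling `describe_elf_file` on the downloaded binary should agree with the first part of the 'file' output quoted above. It says nothing about which instruction set extensions the code inside actually uses.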
Sebastian M. Bobrecki Send message Joined: 9 Oct 05 Posts: 4 Credit: 6,286,377 RAC: 0 |
... And uses SSE2: ...
491a1a: f2 0f 10 54 24 30 movsd 0x30(%rsp),%xmm2
491a20: f2 44 0f 10 5c 24 40 movsd 0x40(%rsp),%xmm11
491a27: f2 0f 5c e8 subsd %xmm0,%xmm5
491a2b: f2 0f 58 74 24 50 addsd 0x50(%rsp),%xmm6
491a31: f2 44 0f 59 fa mulsd %xmm2,%xmm15
491a36: f2 44 0f 59 c2 mulsd %xmm2,%xmm8
491a3b: f2 45 0f 5c dc subsd %xmm12,%xmm11
491a40: f2 0f 59 da mulsd %xmm2,%xmm3
491a44: f2 0f 59 d1 mulsd %xmm1,%xmm2
491a48: f2 0f 11 7c 24 48 movsd %xmm7,0x48(%rsp)
491a4e: f2 44 0f 11 7c 24 68 movsd %xmm15,0x68(%rsp)
... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
... It may be SSE2+ code but it is pretty ugly stuff. The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers. 4 of the 11 instructions are reading and writing (movsd (%rsp) ) temporaries from/to the stack. Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries. |
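To make the counts in the post above easy to check, here is a small sketch that tallies the quoted disassembly fragment. The parser is a simplification written for this example (the names `parse_instructions` and `summarize` are not from any tool) and it assumes objdump-style AT&T syntax lines of the form "address: opcode bytes mnemonic operands".

```python
import string

# The fragment quoted earlier in the thread, one instruction per line.
LISTING = """\
491a1a: f2 0f 10 54 24 30 movsd 0x30(%rsp),%xmm2
491a20: f2 44 0f 10 5c 24 40 movsd 0x40(%rsp),%xmm11
491a27: f2 0f 5c e8 subsd %xmm0,%xmm5
491a2b: f2 0f 58 74 24 50 addsd 0x50(%rsp),%xmm6
491a31: f2 44 0f 59 fa mulsd %xmm2,%xmm15
491a36: f2 44 0f 59 c2 mulsd %xmm2,%xmm8
491a3b: f2 45 0f 5c dc subsd %xmm12,%xmm11
491a40: f2 0f 59 da mulsd %xmm2,%xmm3
491a44: f2 0f 59 d1 mulsd %xmm1,%xmm2
491a48: f2 0f 11 7c 24 48 movsd %xmm7,0x48(%rsp)
491a4e: f2 44 0f 11 7c 24 68 movsd %xmm15,0x68(%rsp)
"""


def parse_instructions(listing):
    """Return (mnemonic, operands) pairs from objdump-style lines."""
    instructions = []
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].endswith(":"):
            continue
        rest = tokens[1:]
        # Skip the raw opcode bytes: two-character hexadecimal tokens.
        while rest and len(rest[0]) == 2 and all(
            char in string.hexdigits for char in rest[0]
        ):
            rest = rest[1:]
        if rest:
            operands = rest[1] if len(rest) > 1 else ""
            instructions.append((rest[0], operands))
    return instructions


def summarize(instructions):
    """Count scalar-double instructions and stack accesses."""
    return {
        "total": len(instructions),
        "scalar_double": sum(1 for m, _ in instructions if m.endswith("sd")),
        "stack_moves": sum(
            1 for m, ops in instructions if m == "movsd" and "(%rsp)" in ops
        ),
        "stack_operands": sum(1 for _, ops in instructions if "(%rsp)" in ops),
    }


if __name__ == "__main__":
    print(summarize(parse_instructions(LISTING)))
```

The tally matches the post: 11 instructions, all scalar double precision, 4 of them movsd loads or stores against the stack. A fifth instruction (the addsd) also reads a stack slot as its memory operand. The sketch only counts instructions; it does not measure how long each one takes.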
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,653,907 RAC: 11,163 |
Is it easy to improve that code? |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
Yes. A newer version of the compiler and changed compile-time options should give a 30%-40% improvement without much trouble. Beyond that, it probably means digging deeper into the Rosetta algorithms. I am still trying to figure out how to tell what a "better" Rosetta run looks like. If I solve the same Rosetta "problem" with two different binaries, I get different results (not unusual for floating point code) ... so which "answer" is "better"? 8-) The Rosetta algorithms seem unstable and I am seeing a 10%-20% variation in performance depending on the initial seed value that is applied. David gave me a test problem. I am building the non-BOINC standalone static binary version. If I vary the -jran seed from 12345, 12346, 12347, 12348, ... the compute times vary noticeably on the same idling computer. There may be some opportunity to SIEVE problems: run quick tests to "qualify" or "eliminate" work units and spend compute cycles on the most promising candidates. I am currently using 2 compilers to generate static binaries: 1.) gcc 4.9.1 from the devtools-3 Linux Software Collections group repository and 2.) Intel icc 2016. David has been very good at finding time to look at results. I would be happy to work with other developers too and share my findings. Rosetta uploaded a new source tree and I am just starting to look at it. Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options: -O3 -mtune=generic -march=core2. The current gcc binaries with the complex compile-time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds. It is easy to "over tune" because few developers have the courage to remove "optimizations", since it is hard to verify that they are no longer needed. I always start by removing them. 8-) |
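The seed experiment described in the post above could be scripted roughly as follows. This is a sketch under stated assumptions, not the poster's actual setup: only the -jran flag comes from the post, while the helper names, the binary path, and any extra arguments are placeholders that a real run would have to supply.

```python
import statistics
import subprocess
import time
from typing import Callable, Dict, Iterable, List


def build_command(binary: str, seed: int, extra_args: Iterable[str] = ()) -> List[str]:
    """Build a command line that ends with the -jran seed from the post."""
    # extra_args stands in for whatever the test problem needs (assumption).
    return [binary, *extra_args, "-jran", str(seed)]


def time_command(command: List[str]) -> float:
    """Run one command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        command,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def time_seeds(run_once: Callable[[int], float], seeds: Iterable[int]) -> Dict[int, float]:
    """Map each seed to the run time reported by run_once(seed)."""
    return {seed: run_once(seed) for seed in seeds}


def relative_spread(times: Dict[int, float]) -> float:
    """Return (slowest - fastest) / mean as a rough measure of variation."""
    values = list(times.values())
    if not values:
        raise ValueError("no timings supplied")
    return (max(values) - min(values)) / statistics.mean(values)
```

A run over the four seeds from the post might then look like `time_seeds(lambda s: time_command(build_command("./rosetta_static", s)), range(12345, 12349))`, where `./rosetta_static` is a hypothetical path. Keeping `run_once` as a parameter makes it easy to swap in a different binary while reusing the same seeds.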
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
@ RJS5 - Was looking through the IPD's YouTube channel and found this video lecture that introduces a bunch of the idiosyncrasies of the Rosetta code stack. A lot of it is really basic but there may be some nuggets of useful info in here: Video link here: https://www.youtube.com/watch?v=Cyk6W6YtWUQ |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
rjs5: Glad you are making some headway. The seed, and the specific value used for the first model computed, determine how it runs. So you would need the same seed and value for each run to expect comparable results. That is part of what makes issuing credits difficult for R@h. The same protein with a different run value may take more periods of diving deeper and take longer to run. I should think the best solutions are the ones that keep thinking there may be more to milk out of the model it is working on at the time. To draw an analogy, just based on my outside understanding... If I'm navigating through a maze of one-way and dead-end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs of exactly the same starting point. Rosetta Moderator: Mod.Sense |
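The advice above, to compare only runs that share a starting point, amounts to a paired comparison. A minimal sketch, assuming each binary's timings are available as a mapping from seed to seconds (the function names are invented for this example and no real measurements are included):

```python
import math
from typing import Dict


def paired_speedups(baseline: Dict[int, float], candidate: Dict[int, float]) -> Dict[int, float]:
    """Return baseline/candidate time ratios for seeds present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("the two runs have no seed in common")
    # A ratio above 1.0 means the candidate binary was faster for that seed.
    return {seed: baseline[seed] / candidate[seed] for seed in shared}


def geometric_mean(values) -> float:
    """Average ratios multiplicatively so 2x and 0.5x cancel out."""
    values = list(values)
    if not values:
        raise ValueError("no values supplied")
    return math.exp(sum(math.log(value) for value in values) / len(values))
```

Seeds that were run with only one of the two binaries are ignored, so a slow seed cannot make one build look worse merely because the other build never ran it. This compares speed only; whether the two binaries produce results of equal quality is a separate question.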
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,624,317 RAC: 7,073 |
If I'm navigating through a maze of one-way and dead-end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs of exactly the same starting point. This is the reason for Fold.it: humans are better at finding solutions to this kind of problem. |
sgaboinc Send message Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
"rjs5" wrote:
I wonder whether it may have something to do with the on-chip L1 and L2 caches. 32-bit code is smaller, and if something fits well within the cache lines, it may be significantly faster than if everything is fetched from memory. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 6,536 |
"rjs5" wrote: I am still playing and I have a several compute years invested in the process. Faster performance will not matter if the quality of the result is low. Waiting on David for the good/bad news. The video link that Timo posted was encouraging. (THANKS!) Since Rosetta is well structured, they may be able to "encourage" SSE/AVX parallel operation by redefining the low level vectors. They frequently use an XYZ spacial coordinate. They might be able to modify that XYZ vector definition to a larger XYZd (d=don't care) where the processing of XYZ values individually could be changed to "pairs" (SSE/AVX) or "quad" (AVX2). I have posted code clips where the 3 scalar XYZ loads, operation, store are conducted serially. If they added a bogus "4th dimension" onto their vector, the compiler might be able to generate SSE/AVX/AVX2 loads/operations/stores that would be faster. |
Computing for Humanity (Account) Send message Joined: 8 Jan 16 Posts: 2 Credit: 480,191,183 RAC: 50,869 |
Would anyone want to, or know someone who would want to, help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license but we can also provide it to individuals under the same license agreement. We might be able to help. It is perhaps better to discuss via PM, with some details under NDA. |
©2024 University of Washington
https://www.bakerlab.org