Message boards : Number crunching : 64-Bit Rosetta?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Whilst x87 indeed does all internal calculations to 80-bit precision on Linux, Windows actually sets the FPU to round to 64-bit by default. Not that this matters at the moment, as most of the calculations in Rosetta are done with 32-bit single-precision floats. And indeed, if the internal calculation is done at a higher precision and then stored at a lower precision, some calculations will end up different than if you do the same calculation entirely in the lower precision. A typical example would be:

    b = 1.00000;
    c = 1.00001;
    a = (b + 100.0) - (c + 100.0);
    d = 100.0 / a;

If you haven't got enough precision, the tiny difference between b and c is lost when 100.0 is added (32-bit may be enough in this case, but I can't really be bothered to figure out exactly how many zeros you need to make it go wrong) [and I'm also assuming the compiler doesn't remove the redundant +100.0 terms that cancel each other].

One of the main points of using SSE would be to get the floating point calculations done in single precision. At least on Athlon/Opteron processors, SSE double-precision calculations perform nearly identically to x87 double precision - mainly because they go through the same units in the processor, which can't do this any faster with SSE than with x87. So if we can't make parts of the code run in SSE single precision, it's probably not going to run much faster with the SSE instructions.

I think your code would work just fine, but the performance benefit isn't there unless it's in single precision. Also, I found this bug: you'd need to convert your single-precision incoming data to double precision before you can subtract, so you need to replace the shufps with cvtps2pd, and use the MOVQ instruction to load the data into the low 64 bits of the xmm register. You can then use MOVD to load the last word, and cvtss2sd to convert it from 32- to 64-bit. But I think this should be avoided unless it's proven to be necessary!

MMX is 64-bit integer operations that can be split into 2 x 32, 4 x 16 or 8 x 8 bit operations. So, for example, you can add two 8 x 8-bit vectors in a single instruction, and there will be no carry from one 8-bit operand into the next, as there would be in a normal 32- or 64-bit add (add 0x00FF00FF to 0x00010001 and you end up with 0x01000100 in "normal" math, whilst the same operation in MMX would end up with zeros in all those bytes [except an MMX number would be twice as long - I'm just too lazy to write down 64-bit numbers in examples]). AMD invented 3DNow!, which uses the MMX register set for 32-bit floating point calculations; it's similar to the SSE instructions, except it's 2 x 32-bit rather than 4 x 32-bit. Both SSE and SSE2 are 128-bit: SSE uses single-precision floats (32-bit each), whilst SSE2 allows 64-bit double-precision floats.

So, you're arguing for improving Rosetta to use 64-bit, but you don't actually have a processor to support it, then - as all AMD and Intel 64-bit processors also have SSE2... ;-) [I'm not having a go at you, just finding it a bit funny that this discussion started on the subject of "Why isn't there a 64-bit version of Rosetta", and now it turns out that one of the people arguing ardently for such a development couldn't make use of it anyway... ;-)]
-- Mats
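[Editor's note: to make the cancellation example above concrete, here is a minimal standalone C sketch - an illustration written for this thread, not code from Rosetta. Compile with SSE math (e.g. gcc -O0 -msse2 -mfpmath=sse) so the float lines really are 32-bit arithmetic rather than x87 extended precision:]

    #include <stdio.h>

    int main(void)
    {
        /* The same arithmetic carried out in single and in double precision. */
        float  bf = 1.00000f, cf = 1.00001f;
        double bd = 1.00000,  cd = 1.00001;

        /* Mathematically the +100.0 terms cancel, but adding them pushes the
           tiny difference between b and c out of the bottom of the mantissa. */
        float  af = (bf + 100.0f) - (cf + 100.0f);
        double ad = (bd + 100.0)  - (cd + 100.0);

        printf("float : a = %g, d = %g\n", af, 100.0f / af);
        printf("double: a = %g, d = %g\n", ad, 100.0 / ad);
        return 0;
    }

The exact float output depends on rounding, but it will not match the double result - and if the subtraction cancels completely, d becomes infinity.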
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
Yes, I agree completely. Thank you for posting the correction to my code snippet; it is most appreciated. I just realized, too, that Rosetta@home should be okay using single-precision values. I mean, if they have been running their code unoptimized for this CASP7 project, almost all the calculations in the application are being truncated back into single precision from the wider format anyway, because of all those excessive load and store instructions the compiler they use is generating. lol

I do not know why I did not think of the above yesterday, sorry. =) I guess the developers know the results could* change when they decide to turn on optimizations on their compilers one day. =) I remember a post saying Rosetta@home was being run lately in some form of debug mode or something, to try and diagnose the recent bugs, too?

[edit] I checked the Rosetta@home application's Windows build, and the x87 is set to the default precision of 64 bit, like you stated in the previous post. So apparently Rosetta@home is not doing 80-bit calculations, but only 64-bit internally, and performing load and store operations in 32-bit - with their code constantly loading and storing after almost every math operation - so using SSE should be *okay*.
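[Editor's note: for anyone wanting to check the x87 precision-control setting on their own machine, here is a sketch using the MSVC-specific _controlfp() from <float.h>. This is illustrative only - it is not how Rosetta was inspected, and it applies to 32-bit builds (the x64 CRT doesn't support changing the precision-control field):]

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* _controlfp(0, 0) reads the current FPU control word without
           modifying it; _MCW_PC masks out the precision-control bits. */
        unsigned int pc = _controlfp(0, 0) & _MCW_PC;

        if (pc == _PC_24)
            printf("x87 precision: 24-bit mantissa (single)\n");
        else if (pc == _PC_53)
            printf("x87 precision: 53-bit mantissa (double) - Windows default\n");
        else if (pc == _PC_64)
            printf("x87 precision: 64-bit mantissa (extended)\n");
        return 0;
    }

(The "64-bit" mode the posts above refer to is the 64-bit double format, i.e. the 53-bit mantissa setting; the 80-bit extended format is the 64-bit mantissa setting.)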
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
You're welcome [see, I'm not a complete idiot ;-)]

Almost all Windows apps run with the 64-bit truncation set; very few (if any) use the longer 80-bit mode. However, don't ask me why MS has decided to set things up this way... I haven't got a clue, and I'm pretty darn sure they haven't really got a good answer either...

However, whether the app uses 64- or 80-bit FPU intermediates is not as important as the fact that the intermediate results are bigger than 32-bit, which means the intermediates can hold more precision than the final result - very useful when adding/subtracting small numbers and large numbers in the same calculation. The reason is that a floating point value has only so many bits to store the "mantissa", i.e. the actual digits of the number. When you're adding large numbers to small numbers, the operands first have to be aligned to the same exponent, so for example 1.0E3 + 1.0E-3 must both be expressed as xE3, which leads to:

    1.0E3 + 0.000001E3

Of course, if we go to further extremes, say 1.0E6 + 1.0E-6:

    1.0E6 + 0.000000000001E6

Now, if we haven't got a large enough floating point storage, adding such a small number to a large number ends up adding nothing at all. Some calculations do this sort of thing and expect there to be a difference [typically some kind of iterative search for an equation's result, where a smaller and smaller difference is added to a variable - if the difference becomes small enough to vanish, the result is the same every time, and the loop may never terminate, because the result never gets precise enough...]

And I repeat: the main gain from using SSE for 32-bit results comes from performing the calculations as 32-bit - doing a 64-bit intermediate calculation will, almost certainly, make it as slow as the FPU version.

They have added a "symbol store" to the distribution for Windows to improve debuggability - I don't know if this also included changes to the compile options (say, reducing the optimisation level). The symbol store itself is really not going to affect anything; it's just a way to correlate an address within the application with the symbol (function) it relates to - and perhaps also an indication of which source line in which file it belongs to, depending on the details stored in the symbol store...
-- Mats
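[Editor's note: a tiny C illustration of the absorption effect described above - again a sketch, not Rosetta code. Float has a 24-bit mantissa, so near 1.0E6 its representable values are about 0.06 apart and adding 1.0E-6 changes nothing, whereas double's 53-bit mantissa keeps the change:]

    #include <stdio.h>

    int main(void)
    {
        float  f = 1.0e6f;
        double d = 1.0e6;

        f += 1.0e-6f;  /* far below one float ULP at 1.0e6: silently absorbed */
        d += 1.0e-6;   /* double has enough mantissa bits to keep the change  */

        printf("float : %.10f\n", f);  /* prints 1000000.0000000000         */
        printf("double: %.10f\n", d);  /* prints roughly 1000000.0000010000 */
        return 0;
    }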
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
For reference, I did some quick benchmarks on array addition (summing two arrays into a single sum), and posted them here (as a response to some rather optimistic expectations of how much better performance one can expect from SSE/3DNow! optimisation).
-- Mats
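[Editor's note: the thread doesn't include Mats' actual benchmark source; below is a minimal reconstruction of the kind of loop described, where the array size, repeat count and timing method are all assumptions. Note that an optimising compiler may vectorise or fold the loop away, so results need a careful eye:]

    #include <stdio.h>
    #include <time.h>

    #define N    (1 << 20)   /* assumed: one million floats per array */
    #define REPS 100

    static float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = (float)(N - i);
        }

        clock_t t0 = clock();
        float sum = 0.0f;
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                sum += a[i] + b[i];   /* the "sum of two arrays" inner loop */
        clock_t t1 = clock();

        printf("sum = %g, time = %.3f s\n", sum,
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }

Swapping float for double in the same loop gives the comparison Mats describes in the next post.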
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Further to the previous post, and related to discussions in this thread: performing 64-bit float operations instead of 32-bit floats is definitely not a great idea... The loop that runs in about 400 kcycles takes 600 kcycles with double-precision calculations (this is partly due to the necessary conversions, and partly because the FPU can only do one 64-bit operation per arithmetic unit, where it can do two 32-bit operations in parallel in one unit).
-- Mats
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
I agree with your findings, because I came up with similar results. I gave up on hand-optimizing that one function. It was taking way too long, and it would be difficult to confirm the results as accurate in case I made a mistake. That further reinforces the fact that it is a difficult task to hand-optimize something by reverse engineering it. I am not saying I could not do it, nor anyone else, but it is not an easy task by any means, and I am most likely not going to spend that much time when I have no way to verify the results - and with the possibility of many* other functions existing that need the same optimization... whew... lots of work. =)

I did do some tests where I took a randomly generated single-precision value, loaded it into a 64-bit register and performed an operation on it, then loaded it into a 32-bit register and performed the same operation on it. I did this many times. Of course you lose precision... However, Rosetta performs an fst after almost every math operation on that 64-bit register, truncating it to a single-precision value. I know it is not exactly the same, but in a lot of cases it yielded exactly the same result as doing the operation completely in 32-bit. So maybe, depending on Rosetta's algorithms, the margin of precision error is small enough. When my tests included the truncation of the 64-bit result, it yielded an exact match - thus no precision lost?

([32BIT] = [64BIT] <*+/-> [64BIT]) might be so close to ([32BIT] = [32BIT] <*+/-> [32BIT]) that for Rosetta's algorithms it would cause no problems, thus enabling the usage of SSE in single-precision mode. I have seen at *least* one case where two operations were performed on a double-precision value by the x87 in Rosetta, increasing the margin of precision error by an amount unknown to me, given my limited knowledge of floating-point internals - and this by itself could rule out the usage of SSE.

However, on the above paragraph: the Rosetta application would also suffer precision changes if the compiler flags were changed and the generated code thus changed, producing different results. I feel this is a potential clue that the Rosetta algorithms may not be bothered by the ([32BIT] = [64BIT] <*+/-> [64BIT]) vs ([32BIT] = [32BIT] <*+/-> [32BIT]) problem, because I would imagine the developers have already considered the penalties to their precision when building the application, most likely by examining the machine code output.

So, I guess the application needs to be hand-optimized. Then run the un-hand-optimized version vs the hand-optimized version on a work unit with the exact same random seed, to see if there is a result difference?

PS: I really appreciate you taking the time to help me ?try? to solve this very difficult question!
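[Editor's note: Leonard's experiment can be reproduced with a short C sketch - hypothetical code written for this discussion; the operation chain and iteration count are arbitrary choices. It compares a two-operation chain kept in double precision and truncated once at the end against the same chain rounded to float after every step. Compile with -mfpmath=sse (the default on 64-bit targets) so the pure-float line really rounds after each step:]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int mismatches = 0;

        for (int i = 0; i < 1000000; i++) {
            float x = (float)rand() / RAND_MAX;
            float y = (float)rand() / RAND_MAX;
            float z = (float)rand() / RAND_MAX;

            /* x87 style: intermediate kept wide, truncated once at the end */
            float via_double = (float)((double)x * (double)y + (double)z);

            /* SSE single-precision style: rounded to float at every step */
            float pure_float = x * y + z;

            if (via_double != pure_float)
                mismatches++;
        }

        printf("mismatches: %d of 1000000\n", mismatches);
        return 0;
    }

For a single multiply or add the two paths always agree (the exact result of one float operation fits in a double, so the final rounding to float matches direct float arithmetic) - consistent with Leonard seeing exact matches "in a lot of cases". Differences only creep in once intermediates are chained, as in the two-operation example above.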
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Yes, hand-optimizing without source is nigh-on impossible, particularly if the code base is large... You can perhaps figure out where it's spending its time and what type of optimization is needed, but modifying the code is a different story...

Often it's a case of finding the right function to optimize, and with the source code available, it's often possible to replace inline functions or macros such that the code in many places gets improved all at once. Sometimes that's not the case tho' - you just have to rewrite large chunks of C or Fortran into assembler - but at least with source code available, you:
1. Have something to get an idea of what the actual thought behind the code was.
2. Have something you can use to compare results.

I have often done the same calculations with two versions of the code for some test case, and then compared that the result is the same from the optimized code and the non-optimized code. Something like this:

    #if DEBUG_OPTIMIZED
    some_type _res;
    {
        some_type _a = a, _b = b, _c = c;
        _res = some_func_optimized(_a, _b, _c);
    }
    #endif
    res = some_func(a, b, c);
    #if DEBUG_OPTIMIZED
    if (res != _res)
        printf("Bad result, expected %f, got %f\n", res, _res);
    #endif

This type of checking is more efficient than checking the end result, because you get to know when it goes wrong and WHERE, rather than at the very end, where it may have been calculated wrong for MANY MANY thousands of lines of code and thousands of iterations - not nice to debug that... ;-)

I _DO_ believe that translating the vast majority of the floating point calculations to SSE would work without problems. It's only extreme corner cases that 32-bit float causes a problem for - but of course, someone pointing out the pitfalls of using SSE will point this out... because it is a possible pitfall... However, having the source code is necessary to do this work effectively...
-- Mats
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
We have a solution, and an agreement. I have never in my entire life run into this situation on an internet message board... I am proud!
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
But, you know, even if SSE had to use doubles, I think an AMD64 would perform better using SSE doubles than x87 doubles, because of the reduced code size at least. I do not know about other processors; I just remember reading that it was recommended to use SSE instead of x87. I think an executable compiled as 32-bit causes the processor to switch into compatibility mode, and 64-bit long mode removed some instructions that some programs may use when written as 32-bit. So that SSE performance gain might only apply when the executable is a native 64-bit exe under a 64-bit operating system.

Oh, I forgot to mention: somewhere the Rosetta application is using SSE on double-precision values to do some small calculations? (I could have sworn I saw it in the disassembly.) I do have an AMD64 3500+ with Win XP64 Pro installed.
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
The instructions removed in 64-bit mode (32-bit code still works in compatibility mode, by the way - otherwise all existing code would have to be recompiled) are all integer instructions that had duplicated opcodes to do the same thing: 0x40..0x4F were the short one-byte INC/DEC forms for all the registers (reclaimed as REX prefixes), and there's an 0xFF /x opcode to do the same thing. AMD also took the opportunity to cull the rarely used AAM, AAD, AAA and AAS instructions and some others like that (BOUND and PUSHA/POPA are others).

In the Linux version of Rosetta, there's not a single reference to "xmm", so there's no SSE code in Rosetta (for Linux - I can't say for anything else). If they had SSE code in there, it would not run on older machines, which I'm sure it does... [unless it checks which architecture it is and then does different things depending on the architecture - but I very much doubt it: why do that for a tiny bit of code, when the rest of it doesn't?]

The main difference when running 32- or 64-bit code would be the number of registers available, and particularly for math operations, SSE is easier to use, since you don't have to swap the top-of-stack to get at values calculated earlier.

If you have an AMD64 3500+, then you have SSE2. The only SSE version it doesn't support is SSE3 - since the processor was designed before Intel released their SSE3 processors...

I guess I should be a little bit proud too, that we have an agreement... ;-)
-- Mats
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion: 32 bits are better than 64. I borrowed the title from a story on The Inquirer that linked to the SUSE Linux review.
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion.

There are certainly cases where 32-bit is faster than 64-bit. There are also cases where 64-bit is faster than 32-bit - it depends on the application. However, since this is about Rosetta, I can say with 99% confidence that a pure recompile to 64-bit would gain less than a 0.5% performance difference. The reasons are:
1. Rosetta is to a very large extent limited by the processor's floating point capacity.
2. Rosetta doesn't use linked lists or other indirect data structures where the size of pointers is critical to performance.
3. Rosetta doesn't use 64-bit integers for any purpose, and thus will not benefit from "large integers".

This has been discussed several times before, and the outcome is still the same: it's actually very hard to improve Rosetta's performance with trivial measures...
-- Mats
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion.

Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?
Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ? Yes, and no, I don't think so. -- Mats |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?

So they couldn't represent the molecule in more detail in a 64-bit matrix/grid (if they do that)? I guess it would take an explanation of how the program actually does its stuff (something more for Ralph, since that's the development project). I would have thought that in the docking part of the program a more detailed description would give them more detailed interactions? ... That's the main reason people move to 64-bit. I know that if I were to represent a laser beam in a 64-bit time-space mesh, it would show far more detail and possible subtleties during interactions with things (lenses, materials etc.)...
Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?

I think I can fairly tell you, without breaking the "NDA", that the internal representation of the model is all done in single-precision floating point - and the floating point format doesn't change when moving to 64-bit.

It's conceivable, but not likely, that it could be done in 32.32 or 48.16 fixed-point notation instead - I haven't looked at the number range to see if that's feasible, and more importantly, whether it would actually gain anything. Most likely not, as the FPU itself has pretty good throughput, and fixed-point multiplies (of which there are plenty) are a bit more complicated than the basic add/sub operations (which are faster than the FPU versions) - so it's unlikely that we'd gain much from such a modification. Not to mention that ALL of the code would be affected, unless only some code used this technique - which would mean conversions to and from fixed-point format in some places, and that's not "free" either...

By far the most likely candidate for gains is to use 32-bit SSE instructions, but that is difficult because it requires data reorganization: the values are currently not laid out the right way to gain from SSE instructions (you can't easily load up four values in an SSE register and just operate on them, as the values you need aren't "next to each other"). Unfortunately, such a data reorg is either costly locally, or requires big reorganization of the overall source code, which isn't nice from a work perspective, particularly if you only gain a few percent... [And I've got plenty of work to do that I get paid for, so progress isn't that great ;-)]
-- Mats
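[Editor's note: to illustrate the layout problem Mats describes, here is a hypothetical sketch - these are not Rosetta's actual data structures or function names. With an array-of-structures layout, the four x values you want are not adjacent in memory; a structure-of-arrays layout lets one SSE load fetch four at once:]

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Array-of-structures: consecutive x values are 12 bytes apart,
       so they can't be fetched with a single 128-bit load. */
    struct atom { float x, y, z; };

    /* Structure-of-arrays: x[0..3] are contiguous and SSE-friendly. */
    struct atoms_soa { float *x, *y, *z; };

    /* Shift n x-coordinates by dx, four at a time.  Assumes n is a
       multiple of 4 and the array is 16-byte aligned (my assumptions). */
    void shift_x(struct atoms_soa *a, int n, float dx)
    {
        __m128 vdx = _mm_set1_ps(dx);           /* broadcast dx to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&a->x[i]);  /* load 4 floats at once   */
            _mm_store_ps(&a->x[i], _mm_add_ps(vx, vdx));
        }
    }

With the struct atom layout, the same operation would need shuffles or four scalar loads per vector, which eats most of the win - hence the "big re-orgs" Mats mentions.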
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
By the way, I think it's also safe to say that 222,000 lines of code doesn't exactly make it a "small and simple" piece of code. Compare that to Seti, with 15,000 lines of source code [as of the latest nightly tar-ball], and you quickly realize why optimizing Rosetta is not as trivial a task as optimizing Seti.
-- Mats