Message boards : Number crunching : 64-Bit Rosetta?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Whilst x87 indeed does all internal calculations to 80-bit precision on Linux, Windows actually sets the FPU to round to 64-bit by default. Not that this matters at the moment, as most of the calculations in Rosetta are done with 32-bit single-precision floats. And indeed, if the internal calculation is done at a higher precision and then stored at a lower precision, some calculations will end up different than if you do the same calculation entirely in the lower precision. A typical example would be:

    b = 1.00000;
    c = 1.00001;
    a = (b + 100.0) - (c + 100.0);
    d = 100.0 / a;

If you haven't got enough precision, the tiny difference between b and c is lost when 100.0 is added (32-bit may be enough in this case, but I can't really be bothered to figure out exactly how many zeros you need to make it go wrong) [and I'm also assuming the compiler doesn't remove the redundant +100.0 terms that cancel each other].

One of the main points of using SSE would be to get the floating point calculations done in single precision. At least on Athlon/Opteron processors, SSE double-precision calculations perform nearly identically to x87 double precision - mainly because they go through the same units in the processor, which can't do this any faster with SSE than with x87. So if we can't make parts of the code run in SSE single precision, it's probably not going to run much faster with the SSE instructions.

I think your code would work just fine, but the performance benefit isn't there unless it's in single precision. Also, I found this bug: you'd need to convert your single-precision incoming data to double precision before you can subtract, so you need to replace the shufps with cvtps2pd, and use the MOVQ instruction to load the data into the low 64 bits of the xmm register. You can then use MOVD to load the last word, and cvtss2sd to convert it from 32- to 64-bit. But I think this should be avoided unless it's proven to be necessary!

MMX is 64-bit integer operations that can be split into 2 x 32, 4 x 16 or 8 x 8 bit operations. So, for example, you can add two 8 x 8-bit vectors in a single instruction, and there will be no carry from one 8-bit operand into the next, as there would be in a normal 32- or 64-bit add (add 0x00FF00FF to 0x00010001 and you end up with 0x01000100 in "normal" math, whilst the same operation in MMX would end up with zeros in all those bytes [except an MMX number would be twice as long - I'm just too lazy to write down 64-bit numbers in examples]). AMD invented 3DNow!, which uses the MMX register set for 32-bit floating point calculations; it's similar to the SSE instructions, except it's 2 x 32-bit rather than 4 x 32-bit. Both SSE and SSE2 are 128-bit: SSE uses single-precision floats (32-bit each), whilst SSE2 allows 64-bit double-precision floats.

So, you're arguing for improving Rosetta to use 64-bit, but you don't actually have a processor to support it, then - as all AMD and Intel 64-bit processors also have SSE2... ;-) [I'm not having a go at you, just finding it a bit funny that this discussion started on the subject of "Why isn't there a 64-bit version of Rosetta", and now it turns out that one of the people arguing ardently for such a development couldn't make use of it anyway... ;-)]
-- Mats
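[Editor's note: to make the cancellation example above concrete, here is a minimal standalone C sketch - an illustration written for this thread, not code from Rosetta. Compile with SSE math (e.g. gcc -O0 -msse2 -mfpmath=sse) so the float lines really are 32-bit arithmetic rather than x87 extended precision:]

    #include <stdio.h>

    int main(void)
    {
        /* The same arithmetic carried out in single and in double precision. */
        float  bf = 1.00000f, cf = 1.00001f;
        double bd = 1.00000,  cd = 1.00001;

        /* Mathematically the +100.0 terms cancel, but adding them pushes the
           tiny difference between b and c out of the bottom of the mantissa. */
        float  af = (bf + 100.0f) - (cf + 100.0f);
        double ad = (bd + 100.0)  - (cd + 100.0);

        printf("float : a = %g, d = %g\n", af, 100.0f / af);
        printf("double: a = %g, d = %g\n", ad, 100.0 / ad);
        return 0;
    }

The exact float output depends on rounding, but it will not match the double result - and if the subtraction cancels completely, d becomes infinity.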
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
Yes, I agree completely. Thank you for posting the correction to my code snippet; it is most appreciated. I just realized, too, that Rosetta@home should be okay using single-precision values. I mean, if they have been running their code unoptimized for this CASP7 project, almost all the calculations in the application are being truncated back into single precision from the wider format anyway, because of all those excessive load and store instructions the compiler they use is generating. lol

I do not know why I did not think of the above yesterday, sorry. =) I guess the developers know the results could* change when they decide to turn on optimizations on their compilers one day. =) I remember a post saying Rosetta@home was being run lately in some form of debug mode or something, to try and diagnose the recent bugs, too?

[edit] I checked the Rosetta@home application's Windows build, and the x87 is set to the default precision of 64 bit, like you stated in the previous post. So apparently Rosetta@home is not doing 80-bit calculations, but only 64-bit internally, and performing load and store operations in 32-bit - with their code constantly loading and storing after almost every math operation - so using SSE should be *okay*.
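[Editor's note: for anyone wanting to check the x87 precision-control setting on their own machine, here is a sketch using the MSVC-specific _controlfp() from <float.h>. This is illustrative only - it is not how Rosetta was inspected, and it applies to 32-bit builds (the x64 CRT doesn't support changing the precision-control field):]

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* _controlfp(0, 0) reads the current FPU control word without
           modifying it; _MCW_PC masks out the precision-control bits. */
        unsigned int pc = _controlfp(0, 0) & _MCW_PC;

        if (pc == _PC_24)
            printf("x87 precision: 24-bit mantissa (single)\n");
        else if (pc == _PC_53)
            printf("x87 precision: 53-bit mantissa (double) - Windows default\n");
        else if (pc == _PC_64)
            printf("x87 precision: 64-bit mantissa (extended)\n");
        return 0;
    }

(The "64-bit" mode the posts above refer to is the 64-bit double format, i.e. the 53-bit mantissa setting; the 80-bit extended format is the 64-bit mantissa setting.)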
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
You're welcome [see, I'm not a complete idiot ;-)]

Almost all Windows apps run with the 64-bit truncation set; very few (if any) use the longer 80-bit mode. However, don't ask me why MS has decided to set things up this way... I haven't got a clue, and I'm pretty darn sure they haven't really got a good answer either...

However, whether the app uses 64- or 80-bit FPU intermediates is not as important as the fact that the intermediate results are bigger than 32-bit, which means the intermediates can hold more precision than the final result - very useful when adding/subtracting small numbers and large numbers in the same calculation. The reason is that a floating point value has only so many bits to store the "mantissa", i.e. the actual digits of the number. When you're adding large numbers to small numbers, the operands first have to be aligned to the same exponent, so for example 1.0E3 + 1.0E-3 must both be expressed as xE3, which leads to:

    1.0E3 + 0.000001E3

Of course, if we go to further extremes, say 1.0E6 + 1.0E-6:

    1.0E6 + 0.000000000001E6

Now, if we haven't got a large enough floating point storage, adding such a small number to a large number ends up adding nothing at all. Some calculations do this sort of thing and expect there to be a difference [typically some kind of iterative search for an equation's result, where a smaller and smaller difference is added to a variable - if the difference becomes small enough to vanish, the result is the same every time, and the loop may never terminate, because the result never gets precise enough...]

And I repeat: the main gain from using SSE for 32-bit results comes from performing the calculations as 32-bit - doing a 64-bit intermediate calculation will, almost certainly, make it as slow as the FPU version.

They have added a "symbol store" to the distribution for Windows to improve debuggability - I don't know if this also included changes to the compile options (say, reducing the optimisation level). The symbol store itself is really not going to affect anything; it's just a way to correlate an address within the application with the symbol (function) it relates to - and perhaps also an indication of which source line in which file it belongs to, depending on the details stored in the symbol store...
-- Mats
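[Editor's note: a tiny C illustration of the absorption effect described above - again a sketch, not Rosetta code. Float has a 24-bit mantissa, so near 1.0E6 its representable values are about 0.06 apart and adding 1.0E-6 changes nothing, whereas double's 53-bit mantissa keeps the change:]

    #include <stdio.h>

    int main(void)
    {
        float  f = 1.0e6f;
        double d = 1.0e6;

        f += 1.0e-6f;  /* far below one float ULP at 1.0e6: silently absorbed */
        d += 1.0e-6;   /* double has enough mantissa bits to keep the change  */

        printf("float : %.10f\n", f);  /* prints 1000000.0000000000         */
        printf("double: %.10f\n", d);  /* prints roughly 1000000.0000010000 */
        return 0;
    }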
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
For reference, I did some quick benchmarks on array addition (summing two arrays into a single sum), and posted them here (as a response to some rather optimistic expectations of how much better performance one can expect from SSE/3DNow! optimisation).
-- Mats
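[Editor's note: the thread doesn't include Mats' actual benchmark source; below is a minimal reconstruction of the kind of loop described, where the array size, repeat count and timing method are all assumptions. Note that an optimising compiler may vectorise or fold the loop away, so results need a careful eye:]

    #include <stdio.h>
    #include <time.h>

    #define N    (1 << 20)   /* assumed: one million floats per array */
    #define REPS 100

    static float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = (float)i;
            b[i] = (float)(N - i);
        }

        clock_t t0 = clock();
        float sum = 0.0f;
        for (int r = 0; r < REPS; r++)
            for (int i = 0; i < N; i++)
                sum += a[i] + b[i];   /* the "sum of two arrays" inner loop */
        clock_t t1 = clock();

        printf("sum = %g, time = %.3f s\n", sum,
               (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }

Swapping float for double in the same loop gives the comparison Mats describes in the next post.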
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Further to the previous post, and related to discussions in this thread: performing 64-bit float operations instead of 32-bit floats is definitely not a great idea... The loop that runs in about 400 kcycles takes 600 kcycles with double-precision calculations (this is partly due to the necessary conversions, and partly because the FPU can only do one 64-bit operation per arithmetic unit, where it can do two 32-bit operations in parallel in one unit).
-- Mats
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
I agree with your findings, because I came up with similar results. I gave up on hand-optimizing that one function. It was taking way too long, and it would be difficult to confirm the results as accurate in case I made a mistake. That further reinforces the fact that it is a difficult task to hand-optimize something by reverse engineering it. I am not saying I could not do it, nor anyone else, but it is not an easy task by any means, and I am most likely not going to spend that much time when I have no way to verify the results - and with the possibility of many* other functions existing that need the same optimization... whew... lots of work. =)

I did do some tests where I took a randomly generated single-precision value, loaded it into a 64-bit register and performed an operation on it, then loaded it into a 32-bit register and performed the same operation on it. I did this many times. Of course you lose precision... However, Rosetta performs an fst after almost every math operation on that 64-bit register, truncating it to a single-precision value. I know it is not exactly the same, but in a lot of cases it yielded exactly the same result as doing the operation completely in 32-bit. So maybe, depending on Rosetta's algorithms, the margin of precision error is small enough. When my tests included the truncation of the 64-bit result, it yielded an exact match - thus no precision lost?

([32BIT] = [64BIT] <*+/-> [64BIT]) might be so close to ([32BIT] = [32BIT] <*+/-> [32BIT]) that for Rosetta's algorithms it would cause no problems, thus enabling the usage of SSE in single-precision mode. I have seen at *least* one case where two operations were performed on a double-precision value by the x87 in Rosetta, increasing the margin of precision error by an amount unknown to me, given my limited knowledge of floating-point internals - and this by itself could rule out the usage of SSE.

However, on the above paragraph: the Rosetta application would also suffer precision changes if the compiler flags were changed and the generated code thus changed, producing different results. I feel this is a potential clue that the Rosetta algorithms may not be bothered by the ([32BIT] = [64BIT] <*+/-> [64BIT]) vs ([32BIT] = [32BIT] <*+/-> [32BIT]) problem, because I would imagine the developers have already considered the penalties to their precision when building the application, most likely by examining the machine code output.

So, I guess the application needs to be hand-optimized. Then run the un-hand-optimized version vs the hand-optimized version on a work unit with the exact same random seed, to see if there is a result difference?

PS: I really appreciate you taking the time to help me ?try? to solve this very difficult question!
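[Editor's note: Leonard's experiment can be reproduced with a short C sketch - hypothetical code written for this discussion; the operation chain and iteration count are arbitrary choices. It compares a two-operation chain kept in double precision and truncated once at the end against the same chain rounded to float after every step. Compile with -mfpmath=sse (the default on 64-bit targets) so the pure-float line really rounds after each step:]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int mismatches = 0;

        for (int i = 0; i < 1000000; i++) {
            float x = (float)rand() / RAND_MAX;
            float y = (float)rand() / RAND_MAX;
            float z = (float)rand() / RAND_MAX;

            /* x87 style: intermediate kept wide, truncated once at the end */
            float via_double = (float)((double)x * (double)y + (double)z);

            /* SSE single-precision style: rounded to float at every step */
            float pure_float = x * y + z;

            if (via_double != pure_float)
                mismatches++;
        }

        printf("mismatches: %d of 1000000\n", mismatches);
        return 0;
    }

For a single multiply or add the two paths always agree (the exact result of one float operation fits in a double, so the final rounding to float matches direct float arithmetic) - consistent with Leonard seeing exact matches "in a lot of cases". Differences only creep in once intermediates are chained, as in the two-operation example above.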
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Yes, hand-optimizing without source is nigh-on impossible, particularly if the code base is large... You can perhaps figure out where it's spending its time and what type of optimization is needed, but modifying the code is a different story...

Often it's a case of finding the right function to optimize, and with the source code available, it's often possible to replace inline functions or macros such that the code in many places gets improved all at once. Sometimes that's not the case tho' - you just have to rewrite large chunks of C or Fortran into assembler - but at least with source code available, you:
1. Have something to get an idea of what the actual thought behind the code was.
2. Have something you can use to compare results.

I have often done the same calculations with two versions of the code for some test case, and then compared that the result is the same from the optimized code and the non-optimized code. Something like this:

    #if DEBUG_OPTIMIZED
    some_type _res;
    {
        some_type _a = a, _b = b, _c = c;
        _res = some_func_optimized(_a, _b, _c);
    }
    #endif
    res = some_func(a, b, c);
    #if DEBUG_OPTIMIZED
    if (res != _res)
        printf("Bad result, expected %f, got %f\n", res, _res);
    #endif

This type of checking is more efficient than checking the end result, because you get to know when it goes wrong and WHERE, rather than at the very end, where it may have been calculated wrong for MANY MANY thousands of lines of code and thousands of iterations - not nice to debug that... ;-)

I _DO_ believe that translating the vast majority of the floating point calculations to SSE would work without problems. It's only extreme corner cases that 32-bit float causes a problem for - but of course, someone pointing out the pitfalls of using SSE will point this out... because it is a possible pitfall... However, having the source code is necessary to do this work effectively...
-- Mats
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
We have a solution, and an agreement. I have never in my entire life run into this situation on an internet message board... I am proud!
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0 |
But, you know, even if SSE had to use doubles, I think an AMD64 would perform better using SSE doubles than x87 doubles, because of the reduced code size at least. I do not know about other processors; I just remember reading that it was recommended to use SSE instead of x87. I think an executable compiled as 32-bit causes the processor to switch into compatibility mode, and 64-bit long mode removed some instructions that some programs may use when written as 32-bit. So that SSE performance gain might only apply when the executable is a native 64-bit exe under a 64-bit operating system.

Oh, I forgot to mention: somewhere the Rosetta application is using SSE on double-precision values to do some small calculations? (I could have sworn I saw it in the disassembly.) I do have an AMD64 3500+ with Win XP64 Pro installed.
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
The instructions removed in 64-bit mode (32-bit code still works in compatibility mode, by the way - otherwise all existing code would have to be recompiled) are all integer instructions that had duplicated opcodes to do the same thing: 0x40..0x4F were the short one-byte INC/DEC forms for all the registers (reclaimed as REX prefixes), and there's an 0xFF /x opcode to do the same thing. AMD also took the opportunity to cull the rarely used AAM, AAD, AAA and AAS instructions and some others like that (BOUND and PUSHA/POPA are others).

In the Linux version of Rosetta, there's not a single reference to "xmm", so there's no SSE code in Rosetta (for Linux - I can't say for anything else). If they had SSE code in there, it would not run on older machines, which I'm sure it does... [unless it checks which architecture it is and then does different things depending on the architecture - but I very much doubt it: why do that for a tiny bit of code, when the rest of it doesn't?]

The main difference when running 32- or 64-bit code would be the number of registers available, and particularly for math operations, SSE is easier to use, since you don't have to swap the top-of-stack to get at values calculated earlier.

If you have an AMD64 3500+, then you have SSE2. The only SSE version it doesn't support is SSE3 - since the processor was designed before Intel released their SSE3 processors...

I guess I should be a little bit proud too, that we have an agreement... ;-)
-- Mats
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion: 32 bits are better than 64. I borrowed the title from a story on The Inquirer that linked to the SUSE Linux review.
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion.

There are certainly cases where 32-bit is faster than 64-bit. There are also cases where 64-bit is faster than 32-bit - it depends on the application. However, since this is about Rosetta, I can say with 99% confidence that a pure recompile to 64-bit would gain less than a 0.5% performance difference. The reasons are:
1. Rosetta is to a very large extent limited by the processor's floating point capacity.
2. Rosetta doesn't use linked lists or other indirect data structures where the size of pointers is critical to performance.
3. Rosetta doesn't use 64-bit integers for any purpose, and thus will not benefit from "large integers".

This has been discussed several times before, and the outcome is still the same: it's actually very hard to improve Rosetta's performance with trivial measures...
-- Mats
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
This may not be quite the topic being discussed, but it reminded me of this discussion.

Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?
Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ? Yes, and no, I don't think so. -- Mats |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?

So they couldn't represent the molecule in more detail in a 64-bit matrix/grid (if they do that)? I guess it would take an explanation of how the program actually does its stuff (something more for Ralph, since that's the development project). I would have thought that in the docking part of the program a more detailed description would give them more detailed interactions? ... That's the main reason people move to 64-bit. I know that if I were to represent a laser beam in a 64-bit time-space mesh, it would show far more detail and possible subtleties during interactions with things (lenses, materials etc.)...
Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Mats, are you one of the ones looking at the code? Would changing to 64-bit improve Rosetta, though - such as accuracy/speed etc.? I don't mean a compile to 64-bit, but a change to 64 bits?

I think I can fairly tell you, without breaking the "NDA", that the internal representation of the model is all done in single-precision floating point - and the floating point format doesn't change when moving to 64-bit.

It's conceivable, but not likely, that it could be done in 32.32 or 48.16 fixed-point notation instead - I haven't looked at the number range to see if that's feasible, and more importantly, whether it would actually gain anything. Most likely not, as the FPU itself has pretty good throughput, and fixed-point multiplies (of which there are plenty) are a bit more complicated than the basic add/sub operations (which are faster than the FPU versions) - so it's unlikely that we'd gain much from such a modification. Not to mention that ALL of the code would be affected, unless only some code used this technique - which would mean conversions to and from fixed-point format in some places, and that's not "free" either...

By far the most likely candidate for gains is to use 32-bit SSE instructions, but that is difficult because it requires data reorganization: the values are currently not laid out the right way to gain from SSE instructions (you can't easily load up four values in an SSE register and just operate on them, as the values you need aren't "next to each other"). Unfortunately, such a data reorg is either costly locally, or requires big reorganization of the overall source code, which isn't nice from a work perspective, particularly if you only gain a few percent... [And I've got plenty of work to do that I get paid for, so progress isn't that great ;-)]
-- Mats
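[Editor's note: to illustrate the layout problem Mats describes, here is a hypothetical sketch - these are not Rosetta's actual data structures or function names. With an array-of-structures layout, the four x values you want are not adjacent in memory; a structure-of-arrays layout lets one SSE load fetch four at once:]

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Array-of-structures: consecutive x values are 12 bytes apart,
       so they can't be fetched with a single 128-bit load. */
    struct atom { float x, y, z; };

    /* Structure-of-arrays: x[0..3] are contiguous and SSE-friendly. */
    struct atoms_soa { float *x, *y, *z; };

    /* Shift n x-coordinates by dx, four at a time.  Assumes n is a
       multiple of 4 and the array is 16-byte aligned (my assumptions). */
    void shift_x(struct atoms_soa *a, int n, float dx)
    {
        __m128 vdx = _mm_set1_ps(dx);           /* broadcast dx to 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&a->x[i]);  /* load 4 floats at once   */
            _mm_store_ps(&a->x[i], _mm_add_ps(vx, vdx));
        }
    }

With the struct atom layout, the same operation would need shuffles or four scalar loads per vector, which eats most of the win - hence the "big re-orgs" Mats mentions.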
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
By the way, I think it's also safe to say that 222,000 lines of code doesn't exactly make it a "small and simple" piece of code. Compare that to Seti, with 15,000 lines of source code [as of the latest nightly tar-ball], and you quickly realize why optimizing Rosetta is not as trivial a task as optimizing Seti.
-- Mats