64-Bit Rosetta?

Message boards : Number crunching : 64-Bit Rosetta?



Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19547 - Posted: 30 Jun 2006, 10:26:16 UTC

Whilst x87 on Linux indeed does all internal calculations to 80-bit precision, Windows actually sets the FPU to round to 64-bit by default. Not that this matters much at the moment, as most of the calculations in Rosetta are done with 32-bit single-precision floats.

And indeed, if the internal calculation is done in high precision and then stored in a lower precision, some calculations will end up different from doing the same calculation entirely in the lower precision. A typical example would be:

b = 1.00000;
c = 1.00001;
a = (b + 100.0) - (c + 100.0);
d = 100.0 / a;

If you haven't got enough precision, part of the tiny difference between b and c is rounded away when 100.0 is added, so a - and therefore d - comes out noticeably different (32-bit may be enough in this case, but I can't really be bothered to figure out exactly how many zeros you need to make it work right vs. wrong). [And I'm also assuming the compiler doesn't remove the redundant +100.0 terms that actually cancel each other.]

One of the main points of using SSE would be to get the floating point calculations done in single precision. At least on Athlon/Opteron processors, SSE double-precision calculations are near-identical in performance to x87 double precision - mainly because they go through the same units in the processor, which can't do the work any faster via SSE than via x87. So if we can't make parts of the code run with SSE single precision, it's probably not going to run much faster with the SSE instructions.

I think your code would work just fine, but the performance benefit isn't there, unless it's in single precision.

Also, I found this bug: you'd need to convert your single-precision incoming data to double precision before you can subtract, so you need to replace the SHUFPS with CVTPS2PD, and use the MOVQ instruction to load the data into the low 64 bits of the xmm register. You can then use MOVD to load the last word, and CVTSS2SD to convert it from 32- to 64-bit.

But I think this should be avoided unless it's proven to be necessary!

MMX is 64-bit integer operations that can be split into 2 x 32-, 4 x 16- or 8 x 8-bit operations. So, for example, you can add two vectors of eight 8-bit values in a single instruction, and there will be no overflow from one 8-bit operand into the next, like there would be in a normal 32- or 64-bit add: add 0x00FF00FF to 0x00010001 and you end up with 0x01000100 in "normal" math, whilst the MMX version of the same operation would end up with zeros in all those bytes. [The real MMX operand would be twice as long - I'm just too lazy to write out 64-bit numbers in examples.]

AMD invented 3DNow!, which uses the MMX register set for 32-bit floating point calculations. It's similar to the SSE instructions, except it's 2 x 32-bit rather than 4 x 32-bit.

Both SSE and SSE2 are 128-bit. SSE is using single precision floats (32-bit each), whilst SSE2 allows 64-bit double precision floats.

So, you're arguing for improving Rosetta to use 64-bit, but you don't actually have a processor to support it, then - as all AMD and Intel 64-bit processors also have SSE2... ;-) [I'm not having a go at you, just finding it a bit funny that this discussion started on the subject of "Why isn't there a 64-bit version of Rosetta", and it now turns out that one of the people arguing ardently for such a development couldn't make use of it anyway... ;-)]

--
Mats

ID: 19547
Profile Leonard Kevin Mcguire Jr.

Send message
Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19595 - Posted: 30 Jun 2006, 21:05:47 UTC
Last modified: 30 Jun 2006, 22:04:14 UTC

Yes, I agree completely. Thank you for posting the correction to my code snip, it is most appreciated.

I just realized too that Rosetta@home should be okay using single-precision values. If they have been running their code unoptimized for this CASP7 project, almost all the calculations in the application are being truncated back into single precision from 80-bit anyway, because of all the excessive load and store instructions the compiler they use is generating. lol

I do not know why I did not think of that yesterday, sorry. =) I guess the developers know the results could* change when they decide to turn on optimizations on their compilers one day. =)

I remember a post saying Rosetta@Home was lately being run in some form of debug mode, too, to try and diagnose the recent bugs?

[edit]
I checked the Rosetta@Home application's Windows build, and the x87 is set to the default precision of 64-bit like you stated in the previous post. So apparently Rosetta@Home is not doing 80-bit calculations, but only 64-bit internally, performing load and store operations in 32-bit - and in conjunction with their code constantly loading and storing after almost every math operation, using SSE should be *okay*.
ID: 19595
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19728 - Posted: 3 Jul 2006, 11:11:20 UTC

You're welcome [see, I'm not a complete retard ;-)]

Almost all Windows apps run with the 64-bit truncation set, very few (if any) will use the longer 80-bit mode. However, don't ask me why MS has decided to set things up this way... I haven't got a clue, and I'm pretty darn sure they haven't really got a good answer either...

However, whether the app uses 64- or 80-bit FPU intermediates matters less than the fact that the intermediate results are wider than 32-bit, which means they can hold more precision than the final result - very useful when adding/subtracting small numbers and large numbers in the same calculation. The reason is that a floating point value has only so many bits to store the "mantissa", i.e. the actual number. When you're adding large numbers to small numbers, the numbers first have to be aligned to the same exponent, so for example:

1.0E3 + 1.0E-3 must both be expressed as xE3, which leads to:
1.0E3 + 0.000001E3 [and that is the right number of zeros].

Of course, if we go to further extremes, say:
1.0E6 + 1.0E-6:
1.0E6 + 0.000000000001E6
Now, if we haven't got a large enough floating point storage, the addition of such a small number to a large number would end up not adding anything. Some calculations do this sort of thing and expect there to be a difference [typically some kind of iterative search for an equation's result, where a smaller and smaller difference is added to a variable. If the difference becomes small enough, the result will be the same every time, but the loop may never terminate, because the result is not precise enough...]

And I repeat that the main gain from using SSE for 32-bit results comes from performing the calculations as 32-bit - doing a 64-bit intermediate calculation will, with near certainty, make it as slow as the FPU version.

They have added a "Symbol store" to the distribution for Windows to add debuggability - I don't know if this also included changes to the compile options (say, reducing the optimisation level). The symbol store itself is really not going to affect anything; it's just a way to correlate an address within the application with the symbol (function) it relates to - and perhaps also an indication of which source line in which file it belongs to, depending on the details stored in the symbol store...

--
Mats
ID: 19728
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19738 - Posted: 3 Jul 2006, 18:33:58 UTC

For reference, I did some quick benchmarks on array addition (sum of two arrays into a single sum), and posted them here (as a response to somewhat optimistic expectations of how much better performance one can expect from SSE/3DNow! optimisation).

--
Mats
ID: 19738
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19742 - Posted: 3 Jul 2006, 20:39:52 UTC

Further to the previous post, and related to discussions in this post:

Performing 64-bit-float operations instead of 32-bit floats is definitely not a great idea...

The loop that runs in about 400 kcycles takes 600 kcycles with double-precision calculations (this is partly due to the necessary conversions, and partly because the FPU can only do one 64-bit operation per arithmetic unit, where it can do two 32-bit operations in parallel in one unit).

--
Mats
ID: 19742
Profile Leonard Kevin Mcguire Jr.

Send message
Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19774 - Posted: 4 Jul 2006, 18:18:51 UTC
Last modified: 4 Jul 2006, 18:26:52 UTC

I agree with your findings, because I came up with similar results. I gave up on hand-optimizing that one function. It was taking way too long, and it would be a little difficult to confirm the results as accurate in case I made a mistake. That further reinforces the fact that it is a difficult task to hand-optimize something by reverse engineering it.

I am not saying I could not do it, nor anyone else, but it is not an easy task by any means, and I am most likely not going to spend that much time when I have no way to verify the results - and there may be many* other functions that need the same optimization.. whew.. lots of work. =

I did do some tests where I took a randomly generated single-precision value, loaded it into a 64-bit register and performed an operation on it. Then I loaded it into a 32-bit register and performed the same operation on it. I did this many times. Of course you lose precision..

However, Rosetta performs an fst after almost every math operation on that 64-bit register, truncating it to a single-precision value. I know it is not exactly the same, but in a lot of cases it yielded exactly the same results as doing the operation completely in 32-bit. So maybe, depending on Rosetta's algorithms, the margin of precision error is small enough. When my tests included the truncation of the 64-bit result, it yielded an exact match - thus no precision lost?

([32BIT] = [64BIT] <*+/-> [64BIT]) might be so close to:
([32BIT] = [32BIT] <*+/-> [32BIT]) that for Rosetta's algorithms it would cause no problems, thus enabling the usage of SSE in single-precision mode.

I have seen at *least* one case where two operations were performed on a double-precision value by the x87 in Rosetta, thus increasing the margin of precision error by an amount unknown to me (because of my limited knowledge of floating-point internals) - and this in itself could rule out the usage of SSE.

However, regarding the above paragraph: the Rosetta application would also suffer precision changes if the compiler flags were changed, since the generated code would change and thus produce different results. And of course, I feel this is a potential clue that the Rosetta algorithms may not be bothered by the

([32BIT] = [64BIT] <*+/-> [64BIT]) vs ([32BIT] = [32BIT] <*+/-> [32BIT]) problem,

because I would imagine the developers already realized the precision penalties for their algorithms when building the application, most likely by examining the machine code output.

So, I guess the application needs to be hand-optimized. Then run the un-optimized version vs. the hand-optimized version on a work unit with the exact same random seed, to see if there is a difference in the results?

PS: I really appreciate you taking the time to help me try to solve this very difficult question!
ID: 19774
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19778 - Posted: 4 Jul 2006, 20:24:42 UTC

Yes, hand-optimizing without source is nigh on impossible, particularly if the code base is large... You can perhaps figure out where it's spending its time and what type of optimization is needed, but modifying the code is a different story...

Often it's a case of finding the right function to optimize, and with the source code available, it's often possible to replace inline functions or macros such that the code in many places gets improved all at once. Sometimes that's not the case tho' - you just have to rewrite large chunks of C or Fortran into assembler - but at least with source code available, you have:
1. Something to give you an idea of what the actual thought behind the code was.
2. Something you can use to compare results.

I have often done the same calculation with two versions of the code for some test case, and then checked that the optimized code and the non-optimized code give the same result. Something like this:

#if DEBUG_OPTIMIZED
   some_type _res;
   { some_type _a = a, _b = b, _c = c;
     _res = some_func_optimized(_a, _b, _c);
   }
#endif
   res = some_func(a, b, c);
#if DEBUG_OPTIMIZED
   if (res != _res)
      printf("Bad result, expected %f, got %f\n", res, _res);
#endif


This type of checking is more efficient than checking the end result, because you get to know when it goes wrong and WHERE, rather than at the very end, where it may have been calculated wrong for MANY MANY thousands of lines of code and thousands of iterations - not nice to debug that... ;-)

I _DO_ believe that translating a vast majority of the floating point calculations to SSE will work without problems. It's only extreme corner cases that 32-bit float causes a problem for - but of course, someone pointing out the pitfalls of using SSE will point this out... Because it is a possible pitfall...

However, having the source-code to be able to effectively do this work is necessary...

--
Mats
ID: 19778
Profile Leonard Kevin Mcguire Jr.

Send message
Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19784 - Posted: 5 Jul 2006, 0:42:10 UTC

We have a solution, and an agreement. I have never in my entire life run into this situation on an internet message board.. I am proud!
ID: 19784
Profile Leonard Kevin Mcguire Jr.

Send message
Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19786 - Posted: 5 Jul 2006, 0:51:14 UTC
Last modified: 5 Jul 2006, 1:03:19 UTC

But, you know - even if SSE had to use doubles, I think an AMD64 would perform better using SSE doubles than x87 doubles, because of the reduced code size at least. I do not know about other processors; I just remember reading that it was recommended to use SSE instead of x87. I think an executable compiled as 32-bit causes the processor to switch into compatibility mode, and 64-bit long mode removed some instructions that 32-bit programs may use. So that SSE performance gain might only apply when the executable is a native 64-bit exe under a 64-bit operating system.

Oh, I forgot to mention: somewhere the Rosetta application is using SSE for double-precision values to do some small calculations? (I could have sworn I saw it in the disassembly.)

I do have an AMD64 3500+ with Win XP64 Pro installed.
ID: 19786
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19791 - Posted: 5 Jul 2006, 10:45:04 UTC

The instructions removed in 64-bit mode (32-bit code still works in compatibility mode, by the way - otherwise all existing code would have to be recompiled) are mostly one-byte opcodes that duplicated longer encodings: 0x40..0x4F were the one-byte INC/DEC instructions, which became the REX prefixes - the 0xFF /0 and /1 forms of INC/DEC do the same thing. AMD also took the opportunity to cull the rarely used AAM, AAD, AAA and AAS instructions, and a few others like DAA, DAS, BOUND and PUSHA/POPA.

In the Linux version of Rosetta, there's not a single reference to "xmm", so there's no SSE-code in Rosetta (for Linux - I can't say for anything else). If they had some SSE-code in there, it would not run on older machines, which I'm sure it does... [Unless it checks to see which architecture it is and then does different things depending on architecture - but I very much doubt it - why do it for a tiny bit of code, when the rest of it isn't].

The main difference when running 32 or 64-bit code would be the number of registers available, and particularly for math-operations, SSE is easier to use, since you don't have to swap the top-of-stack to get to values that have been calculated earlier.

If you have an AMD64 3500+, then you have SSE2. The only SSE version it doesn't support is SSE3 - since that processor was designed before Intel released their SSE3 processors...

I guess I should be a little bit proud too, that we have an agreement... ;-)

--
Mats
ID: 19791
BennyRop

Send message
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 28227 - Posted: 23 Sep 2006, 1:40:13 UTC

This may not be quite the topic being discussed, but it reminded me of this discussion.
32 bits are better than 64
Borrowed the title from story on The Inquirer that linked to the Suse Linux review.
ID: 28227
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 28469 - Posted: 25 Sep 2006, 8:29:04 UTC - in response to Message 28227.  

This may not be quite the topic being discussed, but it reminded me of this discussion.
32 bits are better than 64
Borrowed the title from story on The Inquirer that linked to the Suse Linux review.



There are certainly cases where 32-bit is faster than 64-bit. There are also cases where 64-bit is faster than 32-bit - it depends on the application.

However, since this is about Rosetta, I can with 99% confidence say that a pure recompile to 64-bit would gain less than 0.5% in performance. The reasons are:
1. Rosetta is to a very large extent limited by the processor's floating point capacity.
2. Rosetta doesn't use linked lists or other indirect data structures where the size of pointers is critical to performance.
3. Rosetta doesn't use 64-bit integers for any purpose, and thus will not benefit from "large integers".

This has been discussed several times before, and the outcome is still the same:
It's actually very hard to improve Rosetta's performance with trivial measures...

--
Mats

ID: 28469
FluffyChicken
Avatar

Send message
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 28489 - Posted: 25 Sep 2006, 12:57:04 UTC - in response to Message 28469.  

This may not be quite the topic being discussed, but it reminded me of this discussion.
32 bits are better than 64
Borrowed the title from story on The Inquirer that linked to the Suse Linux review.



There are certainly cases where 32-bit is faster than 64-bit. There are also cases where 64-bit is faster than 32-bit - it depends on the application.

However, since this is about Rosetta, I can with 99% confidence say that a pure recompile to 64-bit would gain less than 0.5% in performance. The reasons are:
1. Rosetta is to a very large extent limited by the processor's floating point capacity.
2. Rosetta doesn't use linked lists or other indirect data structures where the size of pointers is critical to performance.
3. Rosetta doesn't use 64-bit integers for any purpose, and thus will not benefit from "large integers".

This has been discussed several times before, and the outcome is still the same:
It's actually very hard to improve Rosetta's performance with trivial measures...

--
Mats


Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ?
Team mauisun.org
ID: 28489
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 28495 - Posted: 25 Sep 2006, 13:54:43 UTC - in response to Message 28489.  

Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ?


Yes, and no, I don't think so.

--
Mats

ID: 28495
FluffyChicken
Avatar

Send message
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 28507 - Posted: 25 Sep 2006, 15:42:48 UTC - in response to Message 28495.  

Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ?


Yes, and no, I don't think so.

--
Mats



So they couldn't represent the molecule in more detail in a 64-bit matrix/grid (if they do it that way). I guess it would take an explanation of how the program actually does its stuff. (Something more for Ralph, since that's the development project.)
I would have thought that in the docking part of the program, a more detailed description would get them more detailed interactions? ... That's the main reason people move to 64-bit. I know if I were to represent a laser beam in a 64-bit time-space mesh, it would show far more detail and possible subtleties during interactions with things (lenses, materials etc.)..
Team mauisun.org
ID: 28507
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 28510 - Posted: 25 Sep 2006, 15:57:55 UTC - in response to Message 28507.  

Mats are you one of the ones looking at the code? Would changing to 64bit improve Rosetta though? such as accuracy/speed etc.. I don't mean a compile to 64bit but a change to 64bits ?


Yes, and no, I don't think so.

--
Mats



So they couldn't represent the molecule in more detail in a 64-bit matrix/grid (if they do it that way). I guess it would take an explanation of how the program actually does its stuff. (Something more for Ralph, since that's the development project.)
I would have thought that in the docking part of the program, a more detailed description would get them more detailed interactions? ... That's the main reason people move to 64-bit. I know if I were to represent a laser beam in a 64-bit time-space mesh, it would show far more detail and possible subtleties during interactions with things (lenses, materials etc.)..



I think I can fairly tell you, without breaking the "NDA", that the internal representation of the model is all done in single precision floating point - and the floating point format doesn't change when moving to 64-bit.

It's conceivable, but not likely, that it could be done in 32.32 or 48.16 fixed-point notation instead - I haven't looked at the number range to see if that's feasible, and more importantly, whether it would actually gain anything. Most likely not: the FPU itself has pretty good throughput, and fixed-point multiplies (of which there are plenty) are a bit more complicated than the basic add/sub operations (which are faster than the FPU versions) - so it's unlikely that we'd gain much from such a modification. Not to mention that ALL of the code would be affected, unless only some code used this technique - which would mean converting to and from fixed-point format in some places, and that's not "free" either...

By far the most likely candidate for gains is to use 32-bit SSE instructions, but that is difficult because it requires data-reorganization, as the values are currently not kept in the right way to gain from SSE instructions (you can't easily load up four values in an SSE register and just operate on it, as the values you need aren't "next to each other"). Unfortunately, such data-reorg is either costly locally, or requires big re-orgs in the overall source-code, which isn't nice from a work-perspective, particularly if you only gain a few percent... [And I've got plenty work to do that I get paid for, so progress isn't that great ;-)]

--
Mats
ID: 28510
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 28511 - Posted: 25 Sep 2006, 16:10:43 UTC

By the way, I think it's also safe to point out that 222,000 lines of code doesn't exactly make it a "small and simple" piece of code. Compare that to Seti, with 15,000 lines of source code [as of the latest nightly tarball], and you quickly realize why it's not such a trivial task to optimize Rosetta as it is with Seti.

--
Mats
ID: 28511




©2024 University of Washington
https://www.bakerlab.org