64-Bit Rosetta?




Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19333 - Posted: 26 Jun 2006, 23:39:04 UTC
Last modified: 26 Jun 2006, 23:40:22 UTC


I think you either misread or misunderstood my statement. I'm saying that in general, there is no direct correlation between the bitness of an application and/or of a CPU and the performance seen. I don't understand how that can be considered to be an "inflexible rule".


It depends on how wide a scope counts as general. More importantly, what criteria would need to be met to consider something a general "case"? By my definition of general, many general cases exist that involve routines that can see a performance gain - and that gain, no matter how small, is still a gain, unless we are talking about non-general routines such as:

int main(void)
{
    return (char)1 + (char)1;
}


Evidently this would yield no benefit from 64 BIT. We only need to store one value, but it could be run in a 64 BIT environment, not use any 64 BIT instructions, and still have a gain, as I have read somewhere... let me go find it.
mov eax, 1
add eax, 1
ret


Is that a general case? I do not know of many useful programs that do something like that. Would this be a correlation?

int sum10(int a, int b, int c, int d, int e, int f, int g, int h, int i, int k)
{
    return a + b + c + d + e + f + g + h + i + k;
}

add eax, ebx // a=a+b
add eax, ecx // a=a+c
add eax, edx // a=a+d
add eax, r8d // a=a+e
add eax, r9d // a=a+f
add eax, r10d // a=a+g
add eax, r11d // a=a+h
add eax, r12d // a=a+i
add eax, r13d // a=a+k
ret



I'm saying that in general, there is no direct correlation between the bitness of an application and/or of a CPU and the performance seen.


1. I'm saying that in general,
2. there is no direct correlation between
2A. the bitness of an application or of a CPU and the performance seen.
2B. the bitness of an application and of a CPU and the performance seen.

So for 2A we mean:
An application using totally 32-bit instructions VS one using totally 64-bit instructions. (In general? Every application in the world that is considered a general case application? How are we to scientifically define this "general"?)
Quite frankly, you don't use storage space you do not need. It would be a waste to use a 64-bit operation on a 32-bit value unless 32-bit wraparound was an unwanted effect.

An application using only 32-bit instructions VS one that can use both 32-bit and 64-bit instructions, as the 64-bit AMD processors can.

So for 2B we mean:
An application that uses only 32-bit instructions run on a 32-bit processor VS the same 32-bit application run on a 64-bit processor. The performance should be no less, in "general". =)

An application that uses 32-bit and 64-bit instructions, or just 64-bit, run on a 32-bit processor VS a 64-bit processor. It simply will not work on the former. =)

ID: 19333
Profile dgnuff

Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 19334 - Posted: 26 Jun 2006, 23:43:56 UTC - in response to Message 19198.  
Last modified: 27 Jun 2006, 0:26:16 UTC


And by god, this is not the same problem as it was 15 years ago!


Are you sure? If it's not the same problem, why, pray tell, was NT 4 on the DEC Alpha such a miserable failure? Why was Win 2K 64-bit on IA64 (Itanium) an even worse failure? And no, it's not about pointer size. I'm not sure about the Alpha, but I know for a fact that the IA64 offered a "half pointer" (i.e. 32-bit) mode to solve the 64-bit pointer overhead problem.

It's because a 64-bit processor does not automatically equate to extra speed, just as 32 bits didn't automatically equate to more speed over 16 bits, 15 years ago.

I benchmarked programs when porting to IA64. Not just "throw a 32-bit program at the IA64 compiler and recompile", but actually rearchitecting them to make use of the 64-bit nature.

Yes, some programs did run faster. Some ran a hell of a lot faster. But a good many programs did not, and some ran slower. This is exactly what BennyRop is trying to tell you. There is no hard and fast rule, and there are some programs that simply cannot be made to run faster by throwing more processor bits at them.

Please read the second post here.

That's from Keith Davis, author of THINK, the software behind the Find-A-Drug cancer project. The point he's making is that the algorithms in Think simply could not be improved by allowing the compiler to use SSE. And don't for an instant make the mistake of underestimating his ability to optimize. By working with Intel on that code, he got about a 40 to 1 speedup after Think and UD parted company.

That doesn't mean that Rosetta can't benefit from SSE (or 64 bits for that matter). But it is quite a wakeup call for the fact that sometimes those technologies just can't do any good.

ID: 19334
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19337 - Posted: 27 Jun 2006, 1:31:46 UTC
Last modified: 27 Jun 2006, 1:36:06 UTC

Oh. I am sorry, I was completely wrong. You are right, dgnuff. Silly me! I am so glad we have smart people like you around to explain to idiots like me that 64 bits does not equate to more performance; it took me forever to understand what everyone is talking about. Wow - I have been lost before, but never this lost! Thank you lord! I have been saved from a hellish future! The world has been saved!
ID: 19337
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 19343 - Posted: 27 Jun 2006, 5:55:07 UTC - in response to Message 19337.  

Leonard spoketh:
Oh. I am sorry, I was completely wrong. You are right, dgnuff. Silly me! I am so glad we have smart people like you around to explain to idiots like me that 64 bits does not equate to more performance; it took me forever to understand what everyone is talking about. Wow - I have been lost before, but never this lost! Thank you lord! I have been saved from a hellish future! The world has been saved!


It's tempting to agree with your self-assessment, Leonard. Especially after you keep changing your diagnosis of the problem with DF's 64-bit compilation.

When looking at the advantages of 64-bit apps, it's important to keep in mind the disadvantages of 64-bit apps and work around them. If you take offense at my giving detailed reports on what I've run across on the subject, then keep in mind that when the Intel fanboys get together and state how much better Intel parts are than AMD cpus because they saw a video of an AMD cpu blowing up on Tom's Hardware, I'll likewise point out my experience with customers' AMD systems that only stayed on for a few seconds and kept shutting off. (The HSF wasn't mounted correctly and wasn't making proper contact with the cpu - though newer AMD cpus and motherboards don't have that problem.) I've pointed out the benefits of the Core Duo cpus over the lesser Intel cpus and AMD parts; so I'm loyal to a balanced view - not 100% positive or negative reporting while trying to hide/ignore the other side.

Xeno's mention of "the huge performance benefit in using 64-bit processing" was not borne out by a 64-bit app that ran at 50%-75% of the speed of the 32-bit version over at DF. Having to properly optimize a 64-bit client just to get it to run at roughly the same speed as the 32-bit version would not qualify as "the huge performance benefit in using 64-bit processing."


ID: 19343
skutnar

Joined: 31 Oct 05
Posts: 8
Credit: 133,153
RAC: 0
Message 19361 - Posted: 27 Jun 2006, 17:43:17 UTC

LOL, Leonard... I'm not sure what your agenda is in trying to tear apart every word of what I wrote and give some contrived examples. You haven't demonstrated to me, at least, that my earlier statements are incorrect.
ID: 19361
skutnar

Joined: 31 Oct 05
Posts: 8
Credit: 133,153
RAC: 0
Message 19362 - Posted: 27 Jun 2006, 17:44:36 UTC - in response to Message 19317.  

Nice discussion! ;-)

Would any of you be willing to actually have a look at the Rosetta source and make an educated guess whether optimizing for 64 BIT (and for SSE etc.) could lead to a significant performance gain? That might soon be possible. If yes, please email me: joachim@iwanuschka.de

regards

Joachim


I think it's been mentioned earlier in this discussion and in other threads that the Rosetta developers had been looking at this issue.
ID: 19362
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19368 - Posted: 27 Jun 2006, 18:58:23 UTC - in response to Message 19298.  

No, DUH! Use int and get a 64 bit integer, and increase the cache needed. You should work for a news channel! =)


That is incorrect for both the GCC and Microsoft x86-64 compilers, and most likely all other compilers for this architecture - I obviously can't speak for compilers for different architectures, but I think this generally holds true. The type "int" in both of those compilers is 32-bit. "long" is what gets you a 64-bit integer under GCC (Microsoft's LLP64 compiler keeps "long" at 32-bit, so there you need "long long") - and in a 32-bit compiler it will of course still be 32-bit, with no guarantee that some programmer hasn't used "long int" when they actually just wanted a 32-bit entity.
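A quick way to check which data model a given compiler uses (a minimal C99 sketch, not from this thread - it should print 4/8/8 on LP64 systems such as x86-64 Linux/GCC, 4/4/8 on LLP64, i.e. 64-bit Windows, and 4/4/4 on a 32-bit ILP32 build):

#include <stdio.h>

int main(void)
{
    /* sizeof reports bytes: 4/8/8 = LP64, 4/4/8 = LLP64, 4/4/4 = ILP32 */
    printf("int=%zu long=%zu ptr=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
    return 0;
}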

I may not have expressed it very clearly, but I _HAVE_ quite a bit of experience doing comparative benchmarking and improving performance on x86(-64) architecture, and I _DO_ know a fair bit about tricks you can play with passing multiple arguments in a single register, etc, etc. However, those tricks assume either a compiler that can implement them - which isn't the case for either of the two I mention above - or a programmer who makes the effort to implement this by hand (using macros or similar methods, probably). Finally, a store to memory is a single cycle when the store-target is in L1 cache, which the stack will almost certainly be for any call-intensive code. Performing two single-cycle but dependent operations to merge some data into a single register doesn't make sense under those conditions. Of course, on x86-64 there will be more registers available for passing values in registers from one function to another, which helps.
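For illustration, the by-hand version of that trick might look like the following (hypothetical helpers, not something these compilers emit for you; whether it wins depends on exactly the store-to-L1 consideration above):

#include <stdint.h>

/* Pack two 32-bit values into one 64-bit register-sized argument. */
static inline uint64_t pack2(uint32_t a, uint32_t b)
{
    return ((uint64_t)b << 32) | a;
}

/* Unpack on the callee side - two dependent single-cycle operations. */
static inline uint32_t unpack_lo(uint64_t p) { return (uint32_t)p; }
static inline uint32_t unpack_hi(uint64_t p) { return (uint32_t)(p >> 32); }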

A plain (i.e. just compile it 64-bit instead of 32-bit) 64-bit compile of SOME applications will give a 30% performance boost.

Other applications may well lose the same amount, because they are cache-dependent, and bigger pointers use more cache-space. Yes, you could convince the compiler to use 32-bit pointers, but you'd have to also use a special C-library designed for this purpose [of which there are none available today AFAIK], and make sure that any calls to the OS itself are "filtered" suitably for your app, etc, etc. Which makes this variation fairly impractical - it's unlikely that such a cache-bound app would gain noticeably from running with more/bigger registers anyway, so those apps are better off running 32-bit binaries - unless you actually want the bigger memory space, in which case 32-bit pointers are a moot point.

Some applications neither gain nor lose, because the performance bottleneck is neither calling convention, register usage, nor cache usage, but simply CPU clock-cycles - everything is cached, and the code is running flat out.

Another category is where memory is the limit of performance - in which case there are some clever tricks you can play with buffered data-handling, assuming there's some linear pattern in which the data can be fetched. Contrast that with a telephone switch, where each access is pretty much random: from the switch's standpoint, every call goes to/from different telephone numbers, so each access is completely uncached and in no way predictable [in fact, it's sometimes better to never cache this data, since it will just throw other data out of the cache - though that does depend on the amount of processing needed for each call]. Again, these applications will perform almost identically in 64-bit as in 32-bit, since it's some other factor that determines performance.

My point is that you can not, without understanding quite a bit about the application, say whether it will run faster, slower or at the same speed when moving from 32-bit to 64-bit.

I also haven't looked at the math in Rosetta - it may all be "linear", which would mean that no great gains can be had from optimizing it with SSE instructions. SSE instructions are very useful for doing multiple calculations in parallel, but for something like:

for (i = 1; i < large_number; i++)
{
    x[i] = x[i-1] + y[i];
}
where the result of one calculation is dependent on the previous one, SSE doesn't buy you anything, because you can't perform this in parallel. SSE instructions aren't faster, per se, than FPU instructions (on AMD64, at least - it may be different with Intel processors; I haven't done performance tuning on one of those since before AMD had SSE instructions, so I can't say).
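By contrast, a loop whose iterations are independent is exactly what SSE is good at - a minimal sketch (hypothetical function; once vectorized it processes four floats per packed instruction):

/* No iteration depends on another, so this can be done with addps. */
void vec_add(float *z, const float *x, const float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}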

--
Mats
ID: 19368
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 19369 - Posted: 27 Jun 2006, 19:03:18 UTC - in response to Message 19362.  


I think it's been mentioned earlier in this discussion and in other threads that the Rosetta developers had been looking at this issue.


I couldn't find any mention of this in all the threads. To my knowledge they do try to optimize Rosetta and use compiler optimization settings, to be sure, but they never looked specifically into the possibilities of SSE/3DNow and 64 BIT. What I would like to do is let some knowledgeable people fathom the possibilities of these specific optimizations.

If I understand all those posts correctly, SSE and 64 BIT _might_ offer significant optimization potential, but not necessarily for all applications. It's fruitless to debate the optimization potential for Rosetta without really looking at the code.
ID: 19369
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19379 - Posted: 27 Jun 2006, 21:54:11 UTC
Last modified: 27 Jun 2006, 22:02:40 UTC

You are a retard. I have written an operating system kernel from scratch! I should know! Yes!


Other applications may well lose the same amount, because they are cache-dependent, and bigger pointers use more cache-space. Yes, you could convince the compiler to use 32-bit pointers, but you'd have to also use a special C-library designed for this purpose [of which there are none available today AFAIK], and make sure that any calls to the OS itself are "filtered" suitably for your app, etc, etc. Which makes this variation fairly impractical - it's unlikely that such a cache-bound app would gain noticeably from running with more/bigger registers anyway, so those apps are better off running 32-bit binaries - unless you actually want the bigger memory space, in which case 32-bit pointers are a moot point.


I am so tired of this crap.

64-bit Data Models
Prior to the introduction of 64-bit platforms, it was generally believed that the introduction of 64-bit UNIX operating systems would naturally use the ILP64 data model. However, this view was too simplistic and overlooked optimizations that could be obtained by choosing a different data model.
Unfortunately, the C programming language does not provide a mechanism for adding new fundamental data types. Thus, providing 64-bit addressing and integer arithmetic capabilities involves changing the bindings or mappings of the existing data types or adding new data types to the language.
ISO/IEC 9899:1990, Programming Languages - C (ISO C) left the definition of the short int, the int, the long int, and the pointer deliberately vague to avoid artificially constraining hardware architectures that might benefit from defining these data types independent from the other. The only constraints were that ints must be no smaller than shorts, and longs must be no smaller than ints, and size_t must represent the largest unsigned type supported by an implementation. It is possible, for instance, to define a short as 16 bits, an int as 32 bits, a long as 64 bits and a pointer as 128 bits. The relationship between the fundamental data types can be expressed as:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) = sizeof(size_t)
Ignoring non-standard types, all three of the following 64-bit pointer data models satisfy the above relationship:
LP64 (also known as 4/8/8)
ILP64 (also known as 8/8/8)
LLP64 (also known as 4/4/8).
The difference between the three models lies in the non-pointer data types. The table below details the data types for the above three data models and includes LP32 and ILP32 for comparison purposes (widths in bits):

Datatype    LP32  ILP32  LP64  ILP64  LLP64
char           8      8     8      8      8
short         16     16    16     16     16
int           16     32    32     64     32
long          32     32    64     64     32
long long                                64
pointer       32     32    64     64     64


http://www.unix.org/whitepapers/64bit.html

All I have to say is: all the "fake" stupid idiots that rant and rave about crap they don't know for sure can kiss my white butt. Optimize the Rosetta@Home application yourself, since it seems you people don't want to attempt to talk about something constructive upon helping it - instead every single post (almost) was a negative bunch of crap.


I may not have expressed it very clearly, but I _HAVE_ quite a bit of experience doing comparative benchmarking and improving performance on x86(-64) architecture, and I _DO_ know a fair bit about tricks you can play with passing multiple arguments in a single register, etc, etc.

You wish you knew something. I can not prove it, but I know it.

I am so mad, I do not care. So if you post any other CRAP above this, just know I am going to read it and laugh at you.
ID: 19379
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19384 - Posted: 27 Jun 2006, 23:52:53 UTC - in response to Message 19379.  

I guess you can do with a laugh, so here we go...

You are a retard. I have written an operating system kernel from scratch! I should know! Yes!

As if that matters - so have I... It's not hard if you have read a book or two on the subject... Roughly:
after an interrupt, look at the task-list and run the highest-priority task. If nothing is runnable, wait for another interrupt.

Of course, it gets a bit more complicated when you start involving paging, floating-point lazy-saving, and other advanced features. But it's still just a matter of programming, just like anything else. It's not that hard - as is obvious from the fact that a retard like me can manage it...



Other applications may well lose the same amount, because they are cache-dependent, and bigger pointers use more cache-space. Yes, you could convince the compiler to use 32-bit pointers, but you'd have to also use a special C-library designed for this purpose [of which there are none available today AFAIK], and make sure that any calls to the OS itself are "filtered" suitably for your app, etc, etc. Which makes this variation fairly impractical - it's unlikely that such a cache-bound app would gain noticeably from running with more/bigger registers anyway, so those apps are better off running 32-bit binaries - unless you actually want the bigger memory space, in which case 32-bit pointers are a moot point.


I am so tired of this crap.

64-bit Data Models
Prior to the introduction of 64-bit platforms, it was generally believed that the introduction of 64-bit UNIX operating systems would naturally use the ILP64 data model. However, this view was too simplistic and overlooked optimizations that could be obtained by choosing a different data model.
Unfortunately, the C programming language does not provide a mechanism for adding new fundamental data types. Thus, providing 64-bit addressing and integer arithmetic capabilities involves changing the bindings or mappings of the existing data types or adding new data types to the language.
ISO/IEC 9899:1990, Programming Languages - C (ISO C) left the definition of the short int, the int, the long int, and the pointer deliberately vague to avoid artificially constraining hardware architectures that might benefit from defining these data types independent from the other. The only constraints were that ints must be no smaller than shorts, and longs must be no smaller than ints, and size_t must represent the largest unsigned type supported by an implementation. It is possible, for instance, to define a short as 16 bits, an int as 32 bits, a long as 64 bits and a pointer as 128 bits. The relationship between the fundamental data types can be expressed as:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) = sizeof(size_t)
Ignoring non-standard types, all three of the following 64-bit pointer data models satisfy the above relationship:
LP64 (also known as 4/8/8)
ILP64 (also known as 8/8/8)
LLP64 (also known as 4/4/8).
The difference between the three models lies in the non-pointer data types. The table below details the data types for the above three data models and includes LP32 and ILP32 for comparison purposes (widths in bits):

Datatype    LP32  ILP32  LP64  ILP64  LLP64
char           8      8     8      8      8
short         16     16    16     16     16
int           16     32    32     64     32
long          32     32    64     64     32
long long                                64
pointer       32     32    64     64     64


http://www.unix.org/whitepapers/64bit.html

And this proves exactly what? According to the table, all the 64-bit designs have 64-bit pointers, which means that a task that uses pointers heavily will be running slower on a 64-bit machine than on a 32-bit machine.


All I have to say is: all the "fake" stupid idiots that rant and rave about crap they don't know for sure can kiss my white butt. Optimize the Rosetta@Home application yourself, since it seems you people don't want to attempt to talk about something constructive upon helping it - instead every single post (almost) was a negative bunch of crap.


I may not have expressed it very clearly, but I _HAVE_ quite a bit of experience doing comparative benchmarking and improving performance on x86(-64) architecture, and I _DO_ know a fair bit about tricks you can play with passing multiple arguments in a single register, etc, etc.

You wish you knew something. I can not prove it, but I know it.

I am so mad, I do not care. So if you post any other CRAP above this, just know I am going to read it and laugh at you.


As far as I'm concerned, what I stated is true.

I've also had a look at the profile of Rosetta, using oprofile. It's quite clear that it spends most of its time doing floating point operations. It's not using a huge amount of memory whilst doing these calculations, as it's not getting cache-misses very often (roughly one per 2-3K instructions executed, which is definitely (a lot) better than some other code I've looked at).

Some instruction counts: 15000 fmul, 12500 fadd, 7000 fsub, 1200 fdiv, 37000 fld and 36000 fst instructions in the code.

call instruction: 180000
call to 0x888f250: 8325
call to 0x8898bec: 7399
call to 0x888d640: 4577
call to 0x8890b26: 4596
call to 0x8054454: 3932
call to 0x883f8b0: 3494
That's just the functions that are called more than 3000 times in the code. I haven't looked at all these functions (or any of the others) to understand what of this could potentially be optimized, or by what method.

I've seen some short functions that are called a few hundred times that could perhaps be inlined - but it may not give ANY benefit whatsoever, since they may only be called on error-paths or in code that isn't repeated very much.

It's probably fair to say that this code follows the regular 90:10 rule - 90% of the work is done in 10% of the code - and I'm not in a position where I can say exactly which part of this huge piece of code is actually executing a lot.

I also looked at some of the floating point code, and at least parts look like they could be optimized to some extent with SSE instructions - but again, I have no idea whether the code I looked at is what executes a lot or not, since that would require a bit more analysis of the code, and I'm not usually at work at ten to one in the morning to do research.

Now it's your turn to come up with something USEFUL, I should think.

--
Mats
ID: 19384
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19389 - Posted: 28 Jun 2006, 2:21:21 UTC
Last modified: 28 Jun 2006, 2:59:51 UTC

Alright. Here is a function - exactly what it is, I don't know.

Samples 	Address  	Code Bytes         	Instruction                 	Symbol 	CPU0 Event:0x41 	
        	0x494a50 	0x 56              	push esi                    	       	                	
        	0x494a51 	0x 57              	push edi                    	       	                	
        	0x494a52 	0x 8B 7C 24 0C     	mov edi,[esp+0ch]           	       	                	
        	0x494a56 	0x 8B F1           	mov esi,ecx                 	       	                	
        	0x494a58 	0x 3B F7           	cmp esi,edi                 	       	                	
        	0x494a5a 	0x 74 2D           	jz $+2fh (0x494a89)         	       	                	
        	0x494a5c 	0x 57              	push edi                    	       	                	
        	0x494a5d 	0x E8 7E F4 FF FF  	call $-00000b7dh (0x493ee0) 	       	                	
        	0x494a62 	0x 84 C0           	test al,al                  	       	                	
        	0x494a64 	0x 75 08           	jnz $+0ah (0x494a6e)        	       	                	
        	0x494a66 	0x 57              	push edi                    	       	                	
        	0x494a67 	0x 8B CE           	mov ecx,esi                 	       	                	
        	0x494a69 	0x E8 92 FE FF FF  	call $-00000169h (0x494900) 	       	                	
        	0x494a6e 	0x 33 C0           	xor eax,eax                 	       	                	
        	0x494a70 	0x 39 46 0C        	cmp [esi+0ch],eax           	       	                	
        	0x494a73 	0x 76 14           	jbe $+16h (0x494a89)       
23      	0x494a75 	0x 8B 4F 08        	mov ecx,[edi+08h]           	       	23              	
34      	0x494a78 	0x D9 04 81        	fld dword ptr [ecx+eax*4]   	       	34              	
988     	0x494a7b 	0x 8B 56 08        	mov edx,[esi+08h]           	       	988             	
        	0x494a7e 	0x D9 1C 82        	fstp dword ptr [edx+eax*4]  	       	                	
995     	0x494a81 	0x 83 C0 01        	add eax,01h                 	       	995             	
1       	0x494a84 	0x 3B 46 0C        	cmp eax,[esi+0ch]           	       	1               	
        	0x494a87 	0x 72 EC           	jb $-12h (0x494a75)
        	0x494a89 	0x 5F              	pop edi                     	       	                	
        	0x494a8a 	0x 8B C6           	mov eax,esi                 	       	                	
        	0x494a8c 	0x 5E              	pop esi                     	       	                	
6       	0x494a8d 	0x C2 04 00        	ret 0004h                   	       	6  


This is apparently where a lot of the cache misses are occurring. Just a thought: it is copying a large array, as in the case when I broke into the thread. The cache misses might just be too bad for such a large array, but on an AMD64-capable processor, would the below - even though it doesn't erase the cache misses - be beneficial?

cmp eax, [esi+0ch]
0x494a84 *((unsigned long*)(esi + 0xC)) = 0x11D

In this instance it was moving (285 * 4) bytes with 285 loops.
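For reference, the hot loop above (0x494a75-0x494a87) seems to decompile to roughly the following - the structure layout is guessed from the offsets (+0x8 looks like a data pointer, +0xc an element count), and all names are hypothetical:

/* Rough C reconstruction of the hot loop, guessed from the disassembly. */
struct fvec {
    int      unknown0, unknown1; /* offsets 0x0/0x4: untouched by the loop */
    float   *data;               /* offset 0x8 */
    unsigned count;              /* offset 0xc - 0x11D = 285 here */
};

void fvec_copy(struct fvec *dst, const struct fvec *src)
{
    unsigned i;
    for (i = 0; i < dst->count; i++)
        dst->data[i] = src->data[i]; /* the fld/fstp pair, once per element */
}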

EAX starts at 0x0, so it's definitely looping. Could a 64 BIT move be faster than using the floating point unit to perform a data move? Apparently the floating point unit can move it faster than a 32-bit move, as the code is using the x87? (I do not know.) [The x87 might be slower; maybe the compiler just did not optimize it correctly?]

My WIN32 executable's .text section's virtual address is 0x401000. That could help pinpoint where the function is, since I'm pretty sure a Linux build's CRT might offset it differently?

[edit:] I actually looked at the darn code some more and figured it couldn't be optimized at all, since it repeatedly loads ecx and edx with the same value.

You know, I just noticed from your results that it does a lot of FP store and load operations - is there potential for MMX to reduce transfers between main memory and FP registers? I do think all AMD 64 processors support the MMX instructions (I do not know a lot about Intel).. I am just guessing here.

And in reply to SSE being slower than the x87 instructions: I have read that SSE is actually slower per instruction, and the only performance gain comes from using it to perform multiple operations at once.
ID: 19389
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19405 - Posted: 28 Jun 2006, 12:25:27 UTC - in response to Message 19389.  
Last modified: 28 Jun 2006, 12:55:42 UTC

Now, this is USEFUL discussion...

Alright. Here is a function, unknown to me exactly what it is.
[snip code-segment]
This is apparently where a lot of the cache misses are occurring. Just a thought: it is copying a large array, as in the case when I broke into the thread. The cache misses might just be too bad for such a large array, but on an AMD64-capable processor, would the below - even though it doesn't erase the cache misses - be beneficial?

cmp eax, [esi+0ch]
0x494a84 *((unsigned long*)(esi + 0xC)) = 0x11D

In this instance it was moving (285 * 4) bytes with 285 loops.

EAX starts at 0x0, so it's definitely looping. Could a 64 BIT move be faster than using the floating point unit to perform a data move? Apparently the floating point unit can move it faster than a 32-bit move, as the code is using the x87? (I do not know.) [The x87 might be slower; maybe the compiler just did not optimize it correctly?]

My WIN32 executable's .text section's virtual address is 0x401000. That could help pinpoint where the function is, since I'm pretty sure a Linux build's CRT might offset it differently?

[edit:] I actually looked at the darn code some more and figured it couldn't be optimized at all, since it repeatedly loads ecx and edx with the same value.

You know, I just noticed from your results that it does a lot of FP store and load operations - is there potential for MMX to reduce transfers between main memory and FP registers? I do think all AMD 64 processors support the MMX instructions (I do not know a lot about Intel).. I am just guessing here.

And in reply to SSE being slower than the x87 instructions: I have read that SSE is actually slower per instruction, and the only performance gain comes from using it to perform multiple operations at once.


This is typical for a loop like:

for (i = 0; i < something; i++)
    a[i] = b[i];

where what you really would want to do is
memcpy(a, b, sizeof(a[0]) * something);

It is also, as you say, completely unnecessary to use floating point load/store operations for this. Although I'm not sure it makes a whole lot of difference.

The reason you get cache-misses in this piece of code is probably that every time a and b are copied, they are (let's say) segments/chunks at different locations, so the current location isn't in cache when this happens - or it's because other code runs in between that fills the cache with something else before the next copy of this data is taken.

Is this also the place where you see a lot of hits if you run profiling in timer-mode?

Unfortunately, cache-misses aren't going to be solved by using 64-bit or 128-bit (SSE) loads/stores. However, there are only so many load/store "slots" in the processor, which means that when the processor is waiting for X outstanding loads or stores, it's stopped from executing anything else. Loads/stores can be performed speculatively and out of order, so there's no reason for the processor to stall until these slots are filled up. So using bigger units (64- or 128-bit) can be helpful to get more data copied before the slots stack up.

As the loop is quite short, it's unlikely that it will benefit from prefetch instructions - that usually works on bigger copies, but small ones don't really benefit much.

I did a little experiment where I copy 256 floats at a time, over a 16MB array, using different methods:
1. float 25000 cycles (100.0%)
2. int32 24600 cycles (98.0%)
3. int64 21600 cycles (86.0%)
4. sse128 21500 cycles (85.6%)
The numbers are slightly rounded, but as you can see, it's the size of the copy-item that matters more than the actual register type used for the copy. Each copy method is run 15 times, with the 13 "middle" values averaged to get a good figure. And the machine is running all sorts of things in the background, as I can't just shut things down for this benchmark - but that's the same for all of the tests, and I can run the test several times and it only varies a hundred or so cycles up or down for each test.

By the way, I'm using gcc for this, and I had to actually write the loop by hand for the float-copy method, as the compiler would automatically convert the a[i] = b[i]; copy operation to an integer transaction. There may be a way to avoid this, but I didn't look for it.
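The harness for this kind of measurement might look roughly like the following - a sketch for illustration, not the actual benchmark code (assumes GCC on x86; run each method 15 times and average the middle values as described above):

#include <stdint.h>
#include <stddef.h>

/* Read the CPU's time-stamp counter (GCC inline asm). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Cycles for one float-by-float copy of n elements.  (As noted above,
   the compiler may rewrite the loop body as an integer copy.) */
uint64_t time_float_copy(float *dst, const float *src, size_t n)
{
    uint64_t start = rdtsc();
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return rdtsc() - start;
}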

However, I also did a version that uses the C-library memcpy() function:
5. memcpy 3600 cycles (14.4%)

Now, that's what I call a saving... 85+% faster than the original code...

The C-library memcpy function is highly optimized and uses a range of different methods to copy data, depending on the size and alignment of the source and destination.

So, at least for Linux, it would definitely make sense to copy blocks of data, not with a loop, but with a memcpy() call... I doubt it's slower in the Windows case either...

Edit: I'm wondering, though, whether this may not give ANY speed change to the application, because although this particular section is the one getting cache-misses, cache-misses aren't a large problem in Rosetta, since there's only one cache-miss for every 2-3K instructions or so.

--
Mats
ID: 19405
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19443 - Posted: 28 Jun 2006, 23:56:10 UTC

You were right about the data move routines; they stayed right about the same for each one, independent of the register size used. I must meditate on this.
ID: 19443
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19445 - Posted: 29 Jun 2006, 0:38:00 UTC
Last modified: 29 Jun 2006, 0:40:17 UTC

Hmmm.. So it seems you were right, and I know why, and I am not mad. It could benefit more from SSE than from 64 BIT instructions, primarily because most operations work in cache when moving data around, and the rest is FP operations - and where that's not so, it's rarely-called functions with cache misses that move a lot of data around, so no big deal.

I came to this conclusion after trying to figure out a way to optimize the function at file address 0x204050 in the Windows PE32 build. I used a timer, and found that this function gets called a lot. It seems to take about 3 arguments - looks just like a vector subtraction at the start, anyway. After looking the function over and over, I could find nowhere to optimize it favorably >:) with 64-bit instructions.

I am not saying there is absolutely no performance to be gained, but it's probably not going to be in the core of the application, per the 90:10 rule you stated in an earlier post.

However, I did notice a few load/store operations that were kind of redundant looking, unless the rounding effect was intentional (80-bit [fp reg] -> 32-bit [mem]... "fstp fld").

I will look at it some more, in a direction towards SSE =). At least this sorta stops the 64-bit fanboy arguments!

ID: 19445
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19447 - Posted: 29 Jun 2006, 2:38:43 UTC

I found an interesting function that could use SSE:
00604050  sub         esp,0Ch 
00604053  push        esi  
00604054  mov         esi,dword ptr [esp+1Ch] 
00604058  fld         dword ptr [esi] 
0060405A  push        edi  
0060405B  mov         edi,dword ptr [esp+4Ch] 
0060405F  fsub        dword ptr [edi] 
00604061  mov         eax,dword ptr [esp+88h] 
00604068  fstp        dword ptr [esp+0Ch] 
0060406C  fld         dword ptr [esi+4] 
0060406F  fsub        dword ptr [edi+4] 
00604072  fstp        dword ptr [esp+8] 
00604076  fld         dword ptr [esi+8] 
00604079  fsub        dword ptr [edi+8] 
0060407C  fstp        dword ptr [esp+10h]
00604080  fld         dword ptr [esp+8] 
00604084  fld         dword ptr [esp+0Ch] 
00604088  fld         dword ptr [esp+10h] 
....


I replaced the x87 instructions with SSE.
00604050 83 EC 0C         sub         esp,0Ch 
00604053 56               push        esi  
00604054 57               push        edi  
00604055 8B 74 24 20      mov         esi,dword ptr [esp+20h] 
00604059 8B 7C 24 4C      mov         edi,dword ptr [esp+4Ch] 
0060405D 0F 10 06         movups      xmm0,xmmword ptr [esi] 
00604060 0F 10 0F         movups      xmm1,xmmword ptr [edi] 
00604063 0F 5C C1         subps       xmm0,xmm1 
00604066 0F C6 C1 4B      shufps      xmm0,xmm1,4Bh 
0060406A 8B 44 24 10      mov         eax,dword ptr [esp+10h] 
0060406E 0F 11 4C 24 04   movups      xmmword ptr [esp+4],xmm1 
00604073 89 44 24 10      mov         dword ptr [esp+10h],eax 
00604077 8B 84 24 88 00 00 00 mov         eax,dword ptr [esp+88h] 
0060407E 90               nop              
0060407F 90               nop              
00604080 D9 44 24 08      fld         dword ptr [esp+8] 
00604084 D9 44 24 0C      fld         dword ptr [esp+0Ch] 
00604088 D9 44 24 10      fld         dword ptr [esp+10h] 
...


Do you think SSE will give a significant performance gain with something like that? I had to use a shufps, because I thought/think it was swapping the first two FPs.

r[1] = a[0] - b[0]
r[0] = a[1] - b[1]
r[2] = a[2] - b[2]

The fourth lane overwrites other stuff on the stack, so I preserve it using eax - maybe all the extra overhead just kills any gain?
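For comparison, the same three-component subtraction written with compiler intrinsics rather than hand-patched machine code - a sketch with a hypothetical helper name; the compiler then does the stack bookkeeping for you:

#include <xmmintrin.h> /* SSE intrinsics */

/* r = a - b, three components.  Like the movups in the patched code,
   this assumes 16 readable bytes at each pointer; lane 3 is junk and
   is simply ignored by callers. */
static void vsub3(float r[4], const float a[4], const float b[4])
{
    __m128 d = _mm_sub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(r, d);
}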
ID: 19447
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19465 - Posted: 29 Jun 2006, 12:17:24 UTC

I don't know - there would probably be a little benefit from using SSE here, but it would be even better if you could continue with the value in SSE further down.

The code at 604080 is loading what you've just calculated into xmm1 - which, by the way, you're storing one dword too high up on the stack at 60406E; you should store at [esp+8].

Having source to be able to modify would help a whole lot here...

Question: Is this one of the top-hit functions?

--
Mats
ID: 19465
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19466 - Posted: 29 Jun 2006, 12:35:30 UTC

A few more thoughts...

It's pretty much meaningless to round by saving the value and then reading it back out again - it's most likely just the compiler being stupid.

From what I can see, the code doesn't look hyper-optimized by any means. It's probably not "no optimization", but it's not "-Ox" either (that's "enable almost all optimizations"). Of course, enabling more optimization sometimes makes the code break - either because the code is written badly, or because the compiler is buggy, or both. With a huge project (which Rosetta certainly is), it can be hard to find the nasty bugs that are caused by high levels of optimization - particularly since the compiled code is much less readable when you turn on more optimization: the compiler often shuffles instructions to places that are dozens or more instructions away from the original source-code line that you wrote, and may well shuffle conditional code so that you can't really recognize where it came from....

But someone else said that the code is compiled with "the highest available optimization", which doesn't seem to be true.

Of course, that's not to say that the code will necessarily run noticeably faster with more optimization - when I tried a higher optimization level on my memory-copy test, it ran slower (only a little bit, but slower)... :-( [The lowest level of optimization was quite a lot slower, tho']

--
Mats
ID: 19466
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 19476 - Posted: 29 Jun 2006, 17:28:16 UTC

By the way, you should be able to make more space on the stack by another 4 bytes, and thus not overwrite anything ...

--
Mats
ID: 19476
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19515 - Posted: 29 Jun 2006, 23:33:24 UTC
Last modified: 29 Jun 2006, 23:48:18 UTC

I was doing some reading, and came across:
http://www.gamedev.net/community/forums/topic.asp?topic_id=363892

That was not much, but then I read:
http://72.14.209.104/search?q=cache:Z7_1qzYnxqQJ:asci-training.lanl.gov/BProc/SSE.html+sse+%22single+scalar%22&hl=en&gl=us&ct=clnk&cd=2

A lot of people have always said, without quotes, that SSE is only good for doing multiple calculations with one instruction, like the name implies. However, SSE supports some scalar instructions that perform only one operation per instruction, sort of mimicking the x87. According to the last link, using SSE with single-precision scalars is faster and more efficient than the x87, but using it for double-precision scalars is about equal to the x87. That sounds like good news - or is there more to this? You would have to read that last link to see what I read near the top of the page.

Unfortunately, I also read this:
http://72.14.209.104/search?q=cache:w4T3-RFklbYJ:www.ddj.com/184406205+x87+instructions&hl=en&gl=us&ct=clnk&cd=10&client=firefox-a

That describes the fact that the x87 by default provides 80-bit precision. According to that, even though Rosetta@Home only uses single-precision (32-bit) floats, that only covers how values are written/read to/from the x87 unit. Potentially a lot of Rosetta's calculations have become dependent on the 80-bit intermediate precision, and may not reach the required accuracy using SSE?
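A minimal sketch of the effect being described (assuming GCC, where -mfpmath=387 vs -mfpmath=sse selects the unit; the values are chosen so the single-precision intermediate overflows):

#include <stdio.h>

int main(void)
{
    volatile float x = 1e30f; /* volatile blocks constant folding */
    float r = x * x / x;      /* x*x survives in an 80-bit x87 register,
                                 but overflows to inf in 32-bit SSE math */
    printf("%g\n", r);        /* "1e+30" via x87, "inf" via scalar SSE */
    return 0;
}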

However, the article also states workarounds, such as using double precision with SSE to alleviate or fix the 80-bit precision problem. I am not sure if 64-bit floats would do enough justice? Nevertheless, apparently that would still be about equal to x87 performance on modern-day processors. AMD seems to recommend using SSE instead of x87 on the AMD64, even with double-precision (64-bit) floats - I am not pulling a fanboy episode, just throwing out there what is going on?

Let me retype that: "I *think* the article recommends using double-precision SSE instructions over x87 on an AMD64".
ID: 19515
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 19519 - Posted: 30 Jun 2006, 0:23:22 UTC
Last modified: 30 Jun 2006, 0:24:31 UTC

So essentially, by using SSE I trade the 80-bit precision for a choice of 32-bit or 64-bit precision in the calculations, even though SSE uses 128-bit registers. MMX, I think, only offers 64-bit integer operations, so I do not know if those would give any boost over the x87. I think it is SSE2 that allows the 128-bit double-precision operations, although my processor does not support it.

I ended up doing the first part of the other function like so, to get double precision:
		sub         esp, 0Ch 
		push        esi  
		push        edi
		mov         esi,dword ptr [esp+20h]
		mov         edi,dword ptr [esp+4Ch]
		// todo:fix: unaligned move, data potentially allocated unaligned.
		// load two double floats from esi and edi
		movups xmm0, xmmword ptr [esi]
		shufps xmm0, xmm0, 050h
		movups xmm1, xmmword ptr [edi]
		shufps xmm1, xmm1, 050h
		// load remaining double float from esi and edi
		movups xmm2, xmmword ptr [esi]
		shufps xmm2, xmm2, 0FAh;
		movups xmm3, xmmword ptr [edi]
		shufps xmm3, xmm3, 0FAh;
		// vector subtraction
		subpd xmm0, xmm1
		subsd xmm2, xmm3

ID: 19519



