Posts by Leonard Kevin Mcguire Jr.

1) Message boards : Number crunching : 64-Bit Rosetta? (Message 19786)
Posted 5 Jul 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
But even if SSE had to use doubles, I think an AMD64 would perform better using SSE doubles than x87 doubles, if only because of the reduced code size. I do not know about other processors; I just remember reading that it was recommended to use SSE instead of x87. I think an executable compiled as 32-bit causes the processor to switch into compatibility mode, and 64-bit long mode removed some instructions that programs written as 32-bit code may use. So the SSE performance gain might only apply when the executable is a native 64-bit exe under a 64-bit operating system.
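For illustration, here is a minimal sketch (my own, not Rosetta code) of the choice being discussed: the same scalar double math can be compiled either to x87 or to SSE2 scalar instructions. With gcc, for example, a plain 32-bit build defaults to x87, while "-m32 -msse2 -mfpmath=sse", or any native x86-64 build, uses SSE2 scalar doubles; the exact output depends on the compiler, so treat the noted instructions as typical rather than guaranteed.

double madd(double a, double b, double c)
{
    // Typical codegen: the x87 path emits fmul/fadd through the FP stack,
    // while the SSE2 path emits mulsd/addsd on xmm registers.
    return a * b + c;
}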

Oh, I forgot to mention: somewhere the Rosetta application is already using SSE for double precision values to do some small calculations? (I could have sworn I saw it in the disassembly.)

I do have an AMD64 3500+ with Windows XP Pro x64 installed.
2) Message boards : Number crunching : 64-Bit Rosetta? (Message 19784)
Posted 5 Jul 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
We have a solution, and an agreement. I have never in my entire life run into this situation on an internet message board. I am proud!
3) Message boards : Number crunching : 64-Bit Rosetta? (Message 19774)
Posted 4 Jul 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
I agree with your findings, because I came up with similar results. I gave up on hand-optimizing that one function. It was taking way too long, and it would be difficult to confirm the results as accurate in case I made a mistake. That further reinforces the fact that it is a difficult task to hand-optimize something by reverse engineering it.

I am not saying that I, or anyone else, could not do it, but it is not an easy task by any means, and I am most likely not going to spend that much time when I have no way to verify the results and there may be many other functions that need the same optimization. Whew, lots of work. =

I did do some tests where I took a randomly generated single precision value, loaded it into a 64-bit register, and performed an operation on it. Then I loaded it into a 32-bit register and performed the same operation on it. I did this many times. Of course you lose precision.

However, Rosetta is performing an fstp after almost every math operation on that 64-bit register, truncating the value back to single precision. I know it is not exactly the same, but in a lot of cases it yielded exactly the same results as doing the operation completely in 32-bit. So maybe, depending on Rosetta's algorithms, the margin of precision error is small enough: when my tests included the truncation of the 64-bit result, they yielded an exact match, and thus no precision was lost?

([32BIT] = [64BIT] <*+/-> [64BIT]) might be so close to:
([32BIT] = [32BIT] <*+/-> [32BIT]) that for Rosetta's algorithms it would cause no problems, thus enabling the usage of SSE in single precision mode.
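Something like the sketch below (my own test harness, not Rosetta code) captures the comparison above: compute the result once through a double intermediate that is rounded back to a float, and once purely in single precision, then count mismatches.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int mismatches = 0;
    for (int i = 0; i < 1000000; i++) {
        float a = (float)rand() / (float)RAND_MAX;
        float b = (float)rand() / (float)RAND_MAX;
        // [32BIT] = [64BIT] <op> [64BIT]: widen, operate, round back to single.
        float via_double = (float)((double)a - (double)b);
        // [32BIT] = [32BIT] <op> [32BIT]: stay in single precision.
        float via_single = a - b;
        if (via_double != via_single)
            mismatches++;
    }
    printf("mismatches: %d\n", mismatches);
    return 0;
}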

I have seen at *least* one case where two operations were performed on a double precision value by the x87 in Rosetta, increasing the margin of precision error by an amount I cannot quantify given my limited knowledge of floating-point internals, and this in itself could rule out the usage of SSE.

However, regarding the above paragraph: the Rosetta application would also see precision changes if the compiler flags were changed and the generated code changed, producing different results. I feel this is a potential clue that the Rosetta algorithms may not be bothered by the


([32BIT] = [64BIT] <*+/-> [64BIT]) vs ([32BIT] = [32BIT] <*+/-> [32BIT]) problem.


because I would imagine the developers have already considered the precision penalties for their algorithms when building the application, most likely by examining their machine code output.

So, I guess the application needs to be hand-optimized, and then the un-optimized version should be run against the hand-optimized version on a work unit with the exact same random seed, to see if there is a difference in the results?

PS: I really appreciate you taking the time to help me (try to) solve this very difficult question!
4) Message boards : Rosetta@home Science : DISCUSSION of Rosetta@home Journal (2) (Message 19739)
Posted 3 Jul 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

Hmm, I was thinking about the same thing a few weeks ago. But I've read some posts from Akos F., and he explained that just compiling with 3DNow! or SSE2 may increase the speed by only 3%. If you want a bigger increase you have to do some low-level programming in assembler. This is the "magic" that can give you a 600% increase in speed. Of course 3DNow! and SSE2 can be strong tools, but not by themselves. They have to be used the right way.

If this is correct, then taking into consideration the point below by Mats Petersson, I agree with Akos, and I agree with Mats on this point:

It is a daunting task to hand-optimize assembler to use SSEx instructions, and it is compounded when these routines change often. However, the only people who know how much the routines change, and which routines change a lot, are the developers. So what would completely put the issue to rest is their input; if there is even a defined way for them to communicate this, it could provide the information needed to make the final determination.
5) Message boards : Number crunching : 64-Bit Rosetta? (Message 19595)
Posted 30 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
Yes, I agree completely. Thank you for posting the correction to my code snippet; it is most appreciated.

I also just realized that Rosetta@home should be okay using single precision values. I mean, if they have been running their code unoptimized for this CASP7 project, almost all the calculations in the application are being truncated back into single precision from 80-bit anyway, because of all those excessive load and store instructions the compiler they use is generating. lol

I do not know why I did not think of the above yesterday, sorry. =) I guess the developers know the results could change when they decide to turn on their compiler's optimizations one day. =)

I remember a post saying Rosetta@home has lately been run in some form of debug mode, or something like that, to try to diagnose the recent bugs, too?

[edit]
I checked the Rosetta@home application's Windows build, and the x87 is set to the default precision of 64 bits, like you stated in the previous post. So apparently Rosetta@home is not doing 80-bit calculations, but only 64-bit internally, and it performs its loads and stores in 32-bit. Combined with the code constantly loading and storing after almost every math operation, using SSE should be *okay*.
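For reference, a minimal sketch (assuming the MSVC runtime, which provides _controlfp in <float.h>) of how the x87 precision-control setting can be checked from code; _PC_53 corresponds to the 64-bit (53-bit mantissa) default mentioned above.

#include <float.h>
#include <stdio.h>

int main(void)
{
    unsigned int cw = _controlfp(0, 0);      // read the control word without changing it
    unsigned int pc = cw & _MCW_PC;          // keep only the precision-control bits
    if (pc == _PC_24)
        printf("x87 precision: 24-bit mantissa (single)\n");
    else if (pc == _PC_53)
        printf("x87 precision: 53-bit mantissa (double)\n");
    else if (pc == _PC_64)
        printf("x87 precision: 64-bit mantissa (extended)\n");
    return 0;
}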
6) Message boards : Number crunching : 64-Bit Rosetta? (Message 19519)
Posted 30 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
So essentially, by using SSE I trade the 80-bit precision for a choice of 32-bit or 64-bit precision in the calculations, even though SSE uses 128-bit registers. MMX, I think, offers only 64-bit registers and integer operations, so I do not know whether it would give any performance boost over the x87? I think SSE2 allows 128-bit operations on doubles, although my processor does not support it.

I ended up writing the first part of the other function like this to get double precision:
		sub         esp, 0Ch
		push        esi
		push        edi
		mov         esi, dword ptr [esp+20h]
		mov         edi, dword ptr [esp+4Ch]
		// todo:fix: unaligned moves, data potentially allocated unaligned.
		// load the four single floats starting at esi and at edi
		movups      xmm0, xmmword ptr [esi]
		movups      xmm1, xmmword ptr [edi]
		// widen the low two singles of each vector to doubles (SSE2)
		cvtps2pd    xmm2, xmm0
		cvtps2pd    xmm3, xmm1
		// bring the third element down and widen it as well
		movhlps     xmm0, xmm0
		movhlps     xmm1, xmm1
		cvtss2sd    xmm4, xmm0
		cvtss2sd    xmm5, xmm1
		// vector subtraction in double precision
		subpd       xmm2, xmm3
		subsd       xmm4, xmm5
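The same widen-and-subtract can be written with SSE2 compiler intrinsics. A sketch (my own, with a made-up function name; it assumes a fourth float past each 3-element vector is safely readable, since the unaligned 16-byte loads read it just like movups above):

#include <emmintrin.h>   // SSE2 intrinsics

static void vec3_sub_double(const float *a, const float *b, double *r)
{
    __m128 va = _mm_loadu_ps(a);   // unaligned load of a[0..3]
    __m128 vb = _mm_loadu_ps(b);
    // widen the low two singles of each vector to doubles and subtract
    __m128d lo = _mm_sub_pd(_mm_cvtps_pd(va), _mm_cvtps_pd(vb));
    _mm_storeu_pd(r, lo);          // r[0], r[1]
    // scalar tail for the third element
    r[2] = (double)a[2] - (double)b[2];
}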
7) Message boards : Number crunching : 64-Bit Rosetta? (Message 19515)
Posted 29 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
I was doing some reading and came across:
http://www.gamedev.net/community/forums/topic.asp?topic_id=363892

That was not much, but then I read:
http://72.14.209.104/search?q=cache:Z7_1qzYnxqQJ:asci-training.lanl.gov/BProc/SSE.html+sse+%22single+scalar%22&hl=en&gl=us&ct=clnk&cd=2

A lot of people have always said, paraphrasing, that SSE is only good for doing multiple calculations with one instruction, like the name implies. However, SSE also supports scalar instructions that perform one operation per instruction, sort of mimicking the x87. According to the last link, using SSE with single precision scalars is faster and more efficient than the x87, but using it for double precision scalars is about equal to the x87. That sounds like good news, or is there more to this? You would have to read that last link to see what I read near the top of the page.
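As an illustration of those scalar instructions (my own sketch, not from either link): with the SSE1 intrinsics in <xmmintrin.h>, a single subtraction compiles to a subss instead of an x87 fsub, while the packed form of the same instruction family operates on four floats at once.

#include <xmmintrin.h>   // SSE1 intrinsics

float scalar_sub(float a, float b)
{
    float r;
    __m128 va = _mm_set_ss(a);               // value in the low lane, zeros in the rest
    __m128 vb = _mm_set_ss(b);
    _mm_store_ss(&r, _mm_sub_ss(va, vb));    // one subss, mimicking the x87 fsub
    return r;
}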

Unfortunately, I also read this:
http://72.14.209.104/search?q=cache:w4T3-RFklbYJ:www.ddj.com/184406205+x87+instructions&hl=en&gl=us&ct=clnk&cd=10&client=firefox-a

That describes the fact that the x87 provides 80-bit precision by default. So even though Rosetta@home only uses single precision (32-bit) floats, that only refers to how the values are written to and read from the x87 unit. Potentially a lot of Rosetta's calculations have become dependent on the 80-bit precision and might not reach the required accuracy using SSE?

However, the article also gives a workaround: use double precision with SSE to alleviate or fix the 80-bit precision problem. I am not sure whether 64-bit floats would do it justice? Nevertheless, apparently that would still be roughly equal to x87 performance on modern processors. Using SSE instead of the x87 seems to be recommended for the AMD64 even with double precision (64-bit) floats; I am not pulling a fanboy episode, just throwing this out there as to what is going on.

Let me rephrase that: "I *think* the article recommends using double precision SSE instructions over the x87 on an AMD64."
8) Message boards : Number crunching : 64-Bit Rosetta? (Message 19447)
Posted 29 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
I found an interesting function that could use SSE:
00604050  sub         esp,0Ch 
00604053  push        esi  
00604054  mov         esi,dword ptr [esp+1Ch] 
00604058  fld         dword ptr [esi] 
0060405A  push        edi  
0060405B  mov         edi,dword ptr [esp+4Ch] 
0060405F  fsub        dword ptr [edi] 
00604061  mov         eax,dword ptr [esp+88h] 
00604068  fstp        dword ptr [esp+0Ch] 
0060406C  fld         dword ptr [esi+4] 
0060406F  fsub        dword ptr [edi+4] 
00604072  fstp        dword ptr [esp+8] 
00604076  fld         dword ptr [esi+8] 
00604079  fsub        dword ptr [edi+8] 
0060407C  fstp        dword ptr [esp+10h]
00604080  fld         dword ptr [esp+8] 
00604084  fld         dword ptr [esp+0Ch] 
00604088  fld         dword ptr [esp+10h] 
....


I replaced the x87 instructions with SSE.
00604050 83 EC 0C         sub         esp,0Ch 
00604053 56               push        esi  
00604054 57               push        edi  
00604055 8B 74 24 20      mov         esi,dword ptr [esp+20h] 
00604059 8B 7C 24 4C      mov         edi,dword ptr [esp+4Ch] 
0060405D 0F 10 06         movups      xmm0,xmmword ptr [esi] 
00604060 0F 10 0F         movups      xmm1,xmmword ptr [edi] 
00604063 0F 5C C1         subps       xmm0,xmm1 
00604066 0F C6 C1 4B      shufps      xmm0,xmm1,4Bh 
0060406A 8B 44 24 10      mov         eax,dword ptr [esp+10h] 
0060406E 0F 11 4C 24 04   movups      xmmword ptr [esp+4],xmm1 
00604073 89 44 24 10      mov         dword ptr [esp+10h],eax 
00604077 8B 84 24 88 00 00 00 mov         eax,dword ptr [esp+88h] 
0060407E 90               nop              
0060407F 90               nop              
00604080 D9 44 24 08      fld         dword ptr [esp+8] 
00604084 D9 44 24 0C      fld         dword ptr [esp+0Ch] 
00604088 D9 44 24 10      fld         dword ptr [esp+10h] 
...


Do you think SSE will give a significant performance gain with something like that? I had to use a shufps, because I thought/think the original was swapping the first two FP values:

r[1] = a[0] - b[0]
r[0] = a[1] - b[1]
r[2] = a[2] - b[2]

The fourth element over-writes other stuff on the stack, so I preserve that memory using eax; maybe all the extra overhead just kills any gain?
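In C intrinsics the same idea looks like the sketch below (my own, with a made-up name). It spills all four differences to a temporary so nothing past r[2] gets clobbered, and it assumes a fourth float after a[2]/b[2] is safely readable, since the 16-byte loads read it just like the movups above.

#include <xmmintrin.h>

static void vec3_sub_swapped(const float *a, const float *b, float *r)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 d  = _mm_sub_ps(va, vb);   // one subps does all the work
    float tmp[4];
    _mm_storeu_ps(tmp, d);            // spill, then scatter in the swapped order
    r[1] = tmp[0];                    // r[1] = a[0] - b[0]
    r[0] = tmp[1];                    // r[0] = a[1] - b[1]
    r[2] = tmp[2];                    // r[2] = a[2] - b[2]; tmp[3] is discarded
}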
9) Message boards : Number crunching : 64-Bit Rosetta? (Message 19445)
Posted 29 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
Hmmm, so it seems you were right, and I know why, and I am not mad. Rosetta could benefit more from SSE than from 64-bit instructions, primarily because most operations work in cache when moving data around and the rest are FP operations; where that is not the case, it is rarely called functions that take cache misses while moving a lot of data around, so no big deal.

I came to this conclusion after trying to figure out a way to optimize the function at file address 0x204050 in the Windows PE32 build. I used a timer and found that this function gets called a lot. It seems to take about 3 arguments and looks just like a vector subtraction at the start. Anyway, after looking the function over and over, I could find nowhere to optimize it favorably >:) with 64-bit instructions.

I am not saying there is absolutely no performance to be gained, but it is probably not going to be in the core of the application, per the 90:10 rule you stated in an earlier post.

However, I did notice a few load/store operations that looked kind of redundant, unless the rounding effect was intentional (80-bit [FP reg] -> 32-bit [mem]: "fstp fld").

I will look at it some more, leaning toward SSE. =) At least this sort of stops the 64-bit fanboy arguments!
10) Message boards : Number crunching : 64-Bit Rosetta? (Message 19443)
Posted 28 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
You were right about the data move routines; they stayed right about the same for each one, independent of the register size used. I must meditate on this.
11) Message boards : Number crunching : 64-Bit Rosetta? (Message 19389)
Posted 28 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
Alright, here is a function; I do not know exactly what it is.

Samples 	Address  	Code Bytes         	Instruction                 	Symbol 	CPU0 Event:0x41 	
        	0x494a50 	0x 56              	push esi                    	       	                	
        	0x494a51 	0x 57              	push edi                    	       	                	
        	0x494a52 	0x 8B 7C 24 0C     	mov edi,[esp+0ch]           	       	                	
        	0x494a56 	0x 8B F1           	mov esi,ecx                 	       	                	
        	0x494a58 	0x 3B F7           	cmp esi,edi                 	       	                	
        	0x494a5a 	0x 74 2D           	jz $+2fh (0x494a89)         	       	                	
        	0x494a5c 	0x 57              	push edi                    	       	                	
        	0x494a5d 	0x E8 7E F4 FF FF  	call $-00000b7dh (0x493ee0) 	       	                	
        	0x494a62 	0x 84 C0           	test al,al                  	       	                	
        	0x494a64 	0x 75 08           	jnz $+0ah (0x494a6e)        	       	                	
        	0x494a66 	0x 57              	push edi                    	       	                	
        	0x494a67 	0x 8B CE           	mov ecx,esi                 	       	                	
        	0x494a69 	0x E8 92 FE FF FF  	call $-00000169h (0x494900) 	       	                	
        	0x494a6e 	0x 33 C0           	xor eax,eax                 	       	                	
        	0x494a70 	0x 39 46 0C        	cmp [esi+0ch],eax           	       	                	
        	0x494a73 	0x 76 14           	jbe $+16h (0x494a89)       
[b] 	       	                	
23      	0x494a75 	0x 8B 4F 08        	mov ecx,[edi+08h]           	       	23              	
34      	0x494a78 	0x D9 04 81        	fld dword ptr [ecx+eax*4]   	       	34              	
988     	0x494a7b 	0x 8B 56 08        	mov edx,[esi+08h]           	       	988             	
        	0x494a7e 	0x D9 1C 82        	fstp dword ptr [edx+eax*4]  	       	                	
995     	0x494a81 	0x 83 C0 01        	add eax,01h                 	       	995             	
1       	0x494a84 	0x 3B 46 0C        	cmp eax,[esi+0ch]           	       	1               	
        	0x494a87 	0x 72 EC           	jb $-12h (0x494a75)
[/b]         	       	                	
        	0x494a89 	0x 5F              	pop edi                     	       	                	
        	0x494a8a 	0x 8B C6           	mov eax,esi                 	       	                	
        	0x494a8c 	0x 5E              	pop esi                     	       	                	
6       	0x494a8d 	0x C2 04 00        	ret 0004h                   	       	6  


This is apparently where a lot of the cache misses are occurring. Just a thought: if it is copying a large array, as was the case when I broke into the thread, the cache misses might just be too bad for such a large array. But on an AMD64-capable processor, would the change discussed below be beneficial, even though it does not eliminate the cache misses?

cmp eax, [esi+0ch]
0x494a84 *((unsigned long*)(esi + 0xC)) = 0x11D

In this instance it was moving (285 * 4) bytes with 285 loops.

EAX starts at 0x0, so it is definitely looping. Could a 64-bit move be faster than using the floating point unit to perform the data move? Apparently the floating point unit can move it faster than a 32-bit move, since the compiler chose the x87? (I do not know. The x87 might be slower; maybe the compiler just did not optimize it well.)
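To make the question concrete, here is a sketch (my own, hypothetical names; the count and data pointers correspond to [esi+0ch], [esi+08h], and [edi+08h] in the dump above) of the per-float copy the x87 loop performs, and of the 64-bit idea of moving two packed floats per iteration through one 64-bit load/store. A real version would have to respect alignment (or just call memcpy); the cast is only there to show the two-floats-per-move idea.

#include <stdint.h>

// what the fld/fstp loop does: one 32-bit float per iteration
static void copy_floats(float *dst, const float *src, unsigned count)
{
    for (unsigned i = 0; i < count; i++)
        dst[i] = src[i];
}

// the 64-bit idea: two floats per iteration through a single 64-bit GPR
static void copy_floats_u64(float *dst, const float *src, unsigned count)
{
    uint64_t *d = (uint64_t *)dst;
    const uint64_t *s = (const uint64_t *)src;
    for (unsigned i = 0; i < count / 2; i++)
        d[i] = s[i];
    if (count & 1)                      // odd element left over
        dst[count - 1] = src[count - 1];
}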

My Win32 executable's .text section has a virtual address of 0x401000. That could help pinpoint where the function is, since I am pretty sure a Linux build's CRT might offset it differently?

[edit:] I actually looked at the darn code some more and figured it could not be optimized much at all, since it repeatedly loads ecx and edx with the same values.

You know, I just noticed from your results that it does a lot of FP store and load operations; is there potential for MMX to reduce transfers between main memory and FP registers? I do think all AMD64 processors support the MMX instructions (I do not know a lot about Intel). I am just guessing here.

And in reply to SSE being slower than the x87 instructions: I have read that SSE is actually slower, and that the only performance gain comes from using it to perform multiple operations at once.
12) Message boards : Number crunching : 64-Bit Rosetta? (Message 19379)
Posted 27 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
You are a retard. I have written an operating system kernel from scratch! I should know! Yes!


Other applications may well lose the same amount, because they are cache-dependent, and bigger pointers use more cache space. Yes, you could convince the compiler to use 32-bit pointers, but you'd also have to use a special C library designed for this purpose [of which there are none available today AFAIK], and make sure that any calls to the OS itself are "filtered" suitably for your app, etc., etc. Which makes this variation fairly impractical - it's unlikely that such a cache-bound app would gain noticeably from running with more/bigger registers anyway, so those apps are better off running 32-bit binaries - unless you actually want a bigger memory space, in which case 32-bit pointers are a moot point.


I am so tired of this crap.

64-bit Data Models
Prior to the introduction of 64-bit platforms, it was generally believed that the introduction of 64-bit UNIX operating systems would naturally use the ILP64 data model. However, this view was too simplistic and overlooked optimizations that could be obtained by choosing a different data model.
Unfortunately, the C programming language does not provide a mechanism for adding new fundamental data types. Thus, providing 64-bit addressing and integer arithmetic capabilities involves changing the bindings or mappings of the existing data types or adding new data types to the language.
ISO/IEC 9899:1990, Programming Languages - C (ISO C) left the definition of the short int, the int, the long int, and the pointer deliberately vague to avoid artificially constraining hardware architectures that might benefit from defining these data types independent from the other. The only constraints were that ints must be no smaller than shorts, and longs must be no smaller than ints, and size_t must represent the largest unsigned type supported by an implementation. It is possible, for instance, to define a short as 16 bits, an int as 32 bits, a long as 64 bits and a pointer as 128 bits. The relationship between the fundamental data types can be expressed as:
sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) = sizeof(size_t)
Ignoring non-standard types, all three of the following 64-bit pointer data models satisfy the above relationship:
LP64 (also known as 4/8/8)
ILP64 (also known as 8/8/8)
LLP64 (also known as 4/4/8).
The differences between the three models lie in the non-pointer data types. The table below details the data types for the above three data models and includes LP32 and ILP32 for comparison purposes (sizes in bits):

Data type   LP32   ILP32   LP64   ILP64   LLP64
char           8       8      8       8       8
short         16      16     16      16      16
int           16      32     32      64      32
long          32      32     64      64      32
pointer       32      32     64      64      64


http://www.unix.org/whitepapers/64bit.html
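For reference, here is a quick way (my own sketch) to see which data model a given compiler and target actually use: an LP64 system prints 4/8/8, ILP64 prints 8/8/8, and LLP64 (64-bit Windows) prints 4/4/8.

#include <stdio.h>

int main(void)
{
    printf("int=%u long=%u pointer=%u\n",
           (unsigned)sizeof(int), (unsigned)sizeof(long), (unsigned)sizeof(void *));
    return 0;
}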

All I have to say is that all the "fake" stupid idiots who rant and rave about crap they do not know for sure can kiss my white butt. Optimize the Rosetta@home application yourself, since it seems you people do not want to attempt to talk constructively about whether it would help; instead, almost every single post was a negative bunch of crap.


I may not have expressed it very clearly, but I _HAVE_ quite a bit of experience doing comparative benchmarking and improving performance on x86(-64) architecture, and I _DO_ know a fair bit about tricks you can play with passing multiple arguments in a single register, etc, etc.

You wish you knew something. I cannot prove it, but I know it.

I am so mad, I do not care. So if you post any other CRAP above this, just know I am going to read it and laugh at you.
13) Message boards : Number crunching : 64-Bit Rosetta? (Message 19337)
Posted 27 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:
Oh, I am sorry, I was completely wrong. You are right, dgnuff. Silly me! I am so glad we have smart people like you around to explain to idiots like me that 64 bits does not equate to more performance; it took me forever to understand what everyone is talking about. Wow, I have been lost before, but never this lost! Thank you, lord! I have been saved from a hellish future! The world has been saved!
14) Message boards : Number crunching : 64-Bit Rosetta? (Message 19333)
Posted 26 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

I think you either misread or misunderstood my statement. I'm saying that in general, there is no direct correlation between the bitness of an application and/or of a CPU and the performance seen. I don't understand how that can be considered to be an "inflexible rule".


It depends on what scope is considered general. More importantly, what criteria would need to be met to consider something a general "case"? Many general cases exist, by my definition of general, that involve routines that can see a performance gain; and this gain, no matter how small, is still a gain, unless we are talking about non-general routines, which might look like:

int main(void)
{
    return (char)1 + (char)1;
}


Evidently this would get no benefit from 64-bit. We only need to store one value, but it could be run in a 64-bit environment, not use 64-bit instructions, and still have a gain, as I have read somewhere... let me go find it.
mov eax, 1
add eax, 1
ret


Is that a general case? I do not know of many useful programs that do something like that. Would this be a correlation?

int main(int a, int b, int c, int d, int e, int f, int g, int h, int i, int k)
{
    return a + b + c + d + e + f + g + h + i + k;
}

// This assumes a calling convention that passes all ten arguments in
// registers; no standard x86-64 ABI passes that many, so treat it purely as
// an illustration of having enough GPRs available.
add eax, ebx  // a = a + b
add eax, ecx  // a = a + c
add eax, edx  // a = a + d
add eax, r8d  // a = a + e
add eax, r9d  // a = a + f
add eax, r10d // a = a + g
add eax, r11d // a = a + h
add eax, r12d // a = a + i
add eax, r13d // a = a + k
ret



I'm saying that in general, there is no direct correlation between the bitness of an application and/or of a CPU and the performance seen.


1. I'm saying that in general,
2. there is no direct correlation between
2A. the bitness of an application or of a CPU and the performance seen.
2B. the bitness of an application and of a CPU and the performance seen.

So for 2A we mean:
An application using totally 32-bit instructions vs. one using totally 64-bit instructions. (In general? Every application in the world that is considered a general-case application? How are we to scientifically define this "general"?)
Quite frankly, you do not use storage space you do not need. It would be a waste to use a 64-bit operation on a 32-bit value unless wraparound was an unwanted effect.

An application using only 32-bit instructions vs. one that can use both 32-bit and 64-bit instructions, as the 64-bit AMD processors can.

So for 2B we mean:
An application that uses only 32-bit instructions run on a 32-bit processor vs. the same 32-bit application run on a 64-bit processor. The performance should be no worse, in "general". =)

An application that uses 32-bit and 64-bit instructions, or just 64-bit, run on a 32-bit processor vs. a 64-bit processor. On the 32-bit processor it simply will not work. =)
15) Message boards : Number crunching : 64-Bit Rosetta? (Message 19298)
Posted 25 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

The point being that while it is possible to hand optimize a client and possibly improve the performance of that client - the effort in maintaining that client may not be justified by the performance gain. In which case, Rosetta would be better off putting 32 bit clients in the 64 bit client directories.


Yes it is possible to hand optimize a client, and possibly improve the performance. Yes, the effort may not be justified by the performance gain. In which case Rosetta would be better off putting 32 bit clients in 64 bit directories. I agree.


Prove that there's more than a 10% performance increase in moving to a 64 bit client and get Rosetta to produce a 64 bit client from the current code, and I'll start dual booting again.

No problem. You are welcome to set whatever threshold of performance increase is needed to deem it worthy of use. =)


Regardless of which was faster - my coding was a waste of effort for that project. Good for education, but a waste of effort, especially as I never used that screensaver again. :)

The screensaver and Rosetta@home serve two distinct goals in science. The first was for fun, and I have no idea what role it played in advancing science.


It wasn't optimized for 64 bit.. it was merely compiled in 64 bit. But we don't have optimized Rosetta clients yet, either.

I still remember this quote. If you understood the implications of merely compiling something in 64-bit that was designed for a 32-bit environment, why did you even post to this thread, when you knew the entire topic is whether a 64-bit build could have performance improvements worthy of use? That is why it seemed like a political ploy: a reader who is not technically minded would just rummage through and go, "Oh, hmm, so 64-bit is not better." And of course, above that you gave information relating to the statistics. That sounded kind of like a news channel releasing a presidential approval rating (yet they really got their statistics/percentages from a biased group).

The original BennyRop post:

Back at DF (Distributed Folding) they released a 64 bit client for a 64 bit flavor of a SUN OS. DF ran at a fairly consistent speed - i.e. 24 hours of crunching would produce roughly the same amount of results on Wednesday as they did on Tuesday. Stats whores tried out the 64 bit client, noticed a dramatic decline in results per day - so gave up and moved back to the 32 bit client.

It wasn't optimized for 64 bit.. it was merely compiled in 64 bit. But we don't have optimized Rosetta clients yet, either.


No, duh! Use int, get a 64-bit integer, and increase the cache needed. You should work for a news channel! =)
16) Message boards : Number crunching : 64-Bit Rosetta? (Message 19266)
Posted 25 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

So from what I gather from reading that and what I already know, it seems right to assume there is no real benefit from making a 64-bit binary for Rosetta given what the program does, as SSE/2 can fully accommodate that in 32-bit mode as it is. Right? :)


The FPU and the GPRs are two separate things. The extra GPRs are a gain that x64 provides; the x87 FPU, as far as I know, remains almost the same in long mode. The processor stores data in three places, ordered by speed in the list below. The x87 in long mode should behave exactly as it did in protected mode as far as its behavior goes; about its performance I do not know for sure.

I have also read somewhere that x87 instructions in long mode should be avoided? Has anyone else read this?


1. REGISTERS
2. CACHE
3. MEMORY

x86 or 32-bit code has 8 general purpose registers. Programs try to keep as much as possible in these registers while processing. However, due to the large number of variables and/or the size of the data structures, that often becomes impossible. The compiler then generates instructions for spilling data from registers back out to memory or the program stack while it loads other data for the current processing. Giving the program access to 8 more general purpose registers means the processor does not have to swap as much between its registers and memory. The registers are faster than cache! The cache is almost always controlled by the CPU, to my knowledge, although instructions exist to manipulate it. The AMD64 Athlon has a prefetch instruction that allows more detailed control of the cache, but I do not know much about it.

In the case of large loops that perform many function calls in the loop body, a performance gain could come from functions that are currently forced to place arguments on the stack because not enough registers are free, or from the core code itself making use of the extra registers, possibly without even changing the functions' calling convention.

There should be no worldly reason why any application in the universe could not benefit from x64 in some way, no MATTER HOW SMALL, at the VERY LEAST.
I put those words in capital letters for those that can not READ. The conservatives!


You've already stated that there is no speed benefit from using 64 bit pointers in such cases and the pointers would have to be hard coded to 32 bit.

Yes, I did say that.


Where is the benefit of hand optimization to 64-bit - which the developers had neither the experience nor the time for - going to come from? The point is not that all 64-bit apps are slower than 32-bit apps, but that 64 bits DOES NOT speed up every application, as the 64-bit fanboys keep claiming.

Now, I am going to stretch the limits of sanity regarding how much performance could be gained from using long mode by telling you that any program will run faster in x64 mode, and the easiest way to prove that is to take one function call in the core code of the application that pushes one or more arguments onto the stack and let it use one of the extra GPRs instead of pushing an argument onto the stack. The AMD64 documentation already states that it is faster to perform a move than a push or pop.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf Section 3.10.9
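As a sketch of that point (hypothetical function name; the commented assembly is typical compiler output for each target, not Rosetta's actual code): the same call pushes its arguments on the stack in a 32-bit cdecl build, but travels entirely in registers in an x86-64 System V build.

extern int score_pair(int a, int b);    // hypothetical hot function

int call_it(int a, int b)
{
    return score_pair(a, b);
    // 32-bit cdecl (typical):        x86-64 System V (typical):
    //   push  dword ptr [b]            mov  esi, <b>
    //   push  dword ptr [a]            mov  edi, <a>
    //   call  score_pair               call score_pair
    //   add   esp, 8
}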



And if I was using this as a penile enhancement motivation as you seem to imply - then I could probably move a couple of my systems from another project to enhance my score here.

I am sorry; even ignoring the penile enhancement motivation part, what you said does not prove my "conspiracy theory" false. What does being able to move a couple of systems to this project prove? lol


Let's can the conspiracy theories - and move back to the frightening world of reality, shall we?

Alright, I am back to reality! Sorry for skipping out; what do you need?


Get ahold of the older Rosetta code that's been mentioned; get ahold of the 64 bit version of the compiler that the Rosetta team is using; hand optimize it for 64 bit mode, and then run each client for 24 hours 3 times on the same WU. We'll see the variability of the same client on the same hardware running the same WU - and be able to see if there's a dramatic improvement between the 64 bit version and the 32 bit version.

Well, you just told me to come back to reality, and now you are saying go back to insanity, because there is a chance you were out of reality the entire time?

I want to clarify that I am not proposing x64 or long mode to speed up floating point operations; rather, it can speed up basic data handling and storage operations. In case some people are becoming confused. =
17) Message boards : Number crunching : 64-Bit Rosetta? (Message 19242)
Posted 24 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

Back at DF (Distributed Folding) they released a 64 bit client for a 64 bit flavor of a SUN OS. DF ran at a fairly consistent speed - i.e. 24 hours of crunching would produce roughly the same amount of results on Wednesday as they did on Tuesday. Stats whores tried out the 64 bit client, noticed a dramatic decline in results per day - so gave up and moved back to the 32 bit client.

It wasn't optimized for 64 bit.. it was merely compiled in 64 bit. But we don't have optimized Rosetta clients yet, either.



(I am taking a guess due to the fact it is stated - 'it was merely compiled in 64 bit.')

Alright, here is another indication of misleading statistics for a 64-bit processor. The real question is WHY this happened, not that it should be automatically accepted that, for some reason, the 64-bit processor is just slower.

If a program uses the default int type and the compiler produces 64-bit code, the int may become 64-bit; the int type is platform specific. I could understand a speed decrease when you start making excess reads/writes and doing processing you do not need.

Just because you have 64 bits does not mean you have to use them, but it also does not mean you have no use for the extra processing or storage power in other relevant, nearby areas.


In my own opinion, I think some people may post negative and incorrect information to in some way politically play down any benefits/gains that could come from a 64-bit processor. Honestly, I would hate for 64-bit machines to start crunching more and push me further down the ranks if I did not have a 64-bit machine. =)
18) Message boards : Number crunching : 64-Bit Rosetta? (Message 19198)
Posted 24 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

From a general standpoint, the "bitness" of the application does not equate to performance improvement or degradation. The thing that really matters is the source code, compiler/assembler, and linker taking advantage of the architecture. In some cases, 64-bit will not be faster or slower, but could take up almost twice the memory footprint. Only the developers, who have access to the source code and higher-level algorithms and know how to work with the architecture, can tell if an application will see significant improvements.


"The 'bitness' of an application does not equate to performance improvement or degradation" is incorrect, because you are setting an inflexible rule that can be disproved with:

// Passing two 32-bit values to a function in a single 64-bit register.
// I want to pass two values (here already in ecx and edx) in rax:
mov eax, ecx        // zero-extend the first value into the low half of rax
shl rax, 32         // shift it up into the upper half
mov r8d, edx        // zero-extend the second value into r8
or  rax, r8         // rax now carries both 32-bit values

Just because a register is considered a whole object does not mean you cannot manipulate it to store more information while keeping the pieces separate. People think, "Oh, a 64-bit register... hmm, I will never store a value over the 32-bit limit, oh well, no use for the extra 32 bits." =
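The same packing trick written in C (a sketch, not Rosetta code): two 32-bit values travel in one 64-bit integer and are split apart again on the other side.

#include <stdint.h>

static uint64_t pack2(uint32_t first, uint32_t second)
{
    return ((uint64_t)first << 32) | second;   // first in the upper half, second in the lower
}

static void unpack2(uint64_t both, uint32_t *first, uint32_t *second)
{
    *first  = (uint32_t)(both >> 32);
    *second = (uint32_t)both;
}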


However, bigger registers can also be a drawback - particularly for pointers. A pointer is now 8 bytes (64-bit) instead of 4 bytes (32-bit), so the cache can hold only half as many pointers. So for pointer-intensive code (such as binary trees or linked lists) where lots of pointers are being read/written, the cache becomes full much sooner.


That is why you have 32-bit instructions for pointer access? I mean, duh? Huh?
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
Another rigid rule that makes whoever reads your post think that 64-bit is not better - or did you just word it wrong? wtf
You do not use a 64-bit pointer if you do not need to. = Now, I bet a lot of people are jumping around going, "Oh no, you are wrong! OS calls use 64-bit pointers, and the heap may even allocate things out of reach of a 32-bit pointer." I mean, these are all trivially fixable things. Writing a simple heap algorithm takes only a little time and effort. =


I would suspect that the biggest gain for Rosetta would be to compile it to use SSE instructions, rather than to make it 64-bit. This is because it's a primarily floating-point-intensive application, and it uses 32-bit single precision floats, so SSE would make sense.

I suspect you have also realized that Rosetta@home is a heavily data-processing application too, right? Data storage inside the processor makes a big difference.


Note also that very few compilers generate decent code for SSE in actual super-scalar mode. There's been a vectorization project for GCC, but it's still not ready for prime-time use. So it would probably require a bit of hand-coding to get anywhere close to ideal performance.

And this was exactly why I tried to argue, in another post of mine, about just setting a compiler flag. =|

And by god, this is not the same problem as it was 15 years ago!


19) Message boards : Number crunching : 64-Bit Rosetta? (Message 19022)
Posted 20 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:



You know, I still cannot understand how everyone rants and raves about why not make a 64-bit version, yet the only reply is whether we can really use quad-word registers and more than 4 gigabytes of memory.

It is almost like there is a stigma that a 64-bit processor is only good for having 32 extra bits of processing width?

Does anyone know, or has anyone who did know forgotten, that a 64-bit chip has 8 more general purpose registers? This means almost all function calls that use integral data types can be made without placing arguments on the stack.

It also means fewer RAM reads/writes, and any value in a register is almost instantly accessible for read/write by the processor.

A 32-bit processor can provide 32 bytes at most using all of its general purpose registers (8 registers x 4 bytes).

A 64-bit processor provides 64 bytes in the 8 additional registers, with no loss of flexibility due to register usage, plus the original 8 registers, which now also provide 64 bytes of space.

That is 32 bytes vs. 128 bytes - four times larger - which makes it beside the point that Rosetta@home only uses single precision 32-bit floating point values and less than 2 gigabytes of memory. Who cares? That has to be a performance gain.


I know there have been lots of discussions regarding this, and even claims that there is no performance gain. I have even read that an application got slower? Who did the benchmarking? I want to see the source code, because something was not done correctly! It should at least equal the 32-bit version, and never fall to half the speed, even if the port does not take advantage of any 64-bit features. That is just plain and simple logic. Someone is talking a lot of bull. I mean, at least post some references to technical information about why it was slower. If you do not know why, you do not know whether the porting was done correctly.

I checked Ralph@home, did a quick Google search of that site and this one, and only found a small remark about releasing the source code - and that was it?
20) Message boards : Number crunching : Is this possible? (Message 19020)
Posted 20 Jun 2006 by Profile Leonard Kevin Mcguire Jr.
Post:

One of the things I would like to be able to do is designate a single machine as the "host" whereby all internet traffic goes through that one machine (uploading/downloading WU's etc..)

You did not state whether you needed the "host" to actually handle BOINC's protocol, only that you needed to channel "internet traffic" through one machine, which I took to mean TCP/IP traffic.


You could use a SOCKS proxy, which BOINC supports internally. Look in the Advanced menu, then the Options sub-item, and click the SOCKS tab on the dialog.
You can find quite a few software packages that provide a SOCKS proxy; install one on the computer you want the internet traffic routed through. Also, configuring a SOCKS daemon is very easy; just make sure, by testing from a friend's computer, that someone outside of your private network cannot use the daemon, which would otherwise be a security problem.

