Who? (Joined: 2 Apr 06, Posts: 213, Credit: 1,366,981, RAC: 0)
I upgraded my V8 (eight-core) computer to Windows Vista. No slowdown :)
The monster running Vista

It is getting back to the number one position; it will beat my overclocked quad core overnight, I think.

Still no saturation of the front-side buses... that does not happen, Mister Ruiz!
(Yep! I have two FSBs on this motherboard! The snoop filters are more efficient than HyperTransport's aging protocol.)

who?

And again, Rosetta isn't going to saturate any bus on a machine with decent L2 cache, so why are you going on about the FSB/HyperTransport - it's quite clear that you DO understand that this is not an issue from previous posts. It may be an issue for some other applications, but not for Rosetta, so Rosetta makes a very poor example for comparing these things, right?

--
Mats

Because some people keep saying that my front-side bus is saturating, and that is not true either... Got the point?

With the prefetcher, the FSB NEVER saturates; whoever says otherwise is a liar (except in a few known cases that are not realistic, because they are synthetic benchmarks).

who?
Who?
 
I just VTuned the workloads I was crunching on my quad core, and your statement is actually wrong: 10% of the time is spent on memory loads and stores. It does not cost anything on Core 2 because the prefetcher's success rate is high, and everything gets into the cache before the load unit needs it.
I know that some Rosetta workloads vary a lot; I guess you misunderstood part of the algorithm. There is some pointer chasing going on that requires the FSB or old HyperTransport...

Remember, each Rosetta workload can be dramatically different, depending on the kind of structure you are processing.

whowho?
Mats Petersson (Joined: 29 Sep 05, Posts: 225, Credit: 951,788, RAC: 0)
 
Yeah, ok, there may be some memory traffic, but I've yet to see a single case where Rosetta is even close to saturating the memory traffic, which is why I stated that Rosetta is a poor benchmark for whether the bus is "efficient" or not. Can we agree on that?
 
-- 
Mats 
Who?
 
Well, by getting the cache lines into the load units in time, as Core 2 does, you get a 10% performance improvement. 10% is not a small number. HyperTransport inefficiency probably costs the K8 8 to 10% on Rosetta, and that is not negligible. I am not even talking about the cost of crossing the NUMA bridge, which slows things down even more. CPU1 fetching data through the memory controller of CPU2 because of thread migration is the worst design case I have seen. The K8 + HyperTransport crosslink has good bandwidth, but we learned with the Pentium 4 that bandwidth is not the only thing that matters: predicting and feeding with low latency is what matters.

Got the point?

who?
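The NUMA-bridge argument above can be made concrete with a toy latency model. All numbers below (local and remote latencies, remote-access fraction) are illustrative assumptions, not measurements from this thread:

```python
# Toy model: how a fraction of remote (cross-socket, over-HyperTransport)
# memory accesses inflates average latency on a two-socket NUMA system.
# LOCAL_NS, REMOTE_NS and remote_fraction are assumed round numbers.

LOCAL_NS = 60           # assumed latency to the local memory controller
REMOTE_NS = 110         # assumed latency via the other socket's controller
remote_fraction = 0.25  # assumed share of accesses after thread migration

avg_ns = (1 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS
penalty = avg_ns / LOCAL_NS - 1.0  # fractional slowdown versus local-only

print(f"average latency {avg_ns:.1f} ns ({penalty:.0%} worse than local-only)")
```

With these made-up numbers the average access gets about 21% slower, which is why keeping a thread (and its pages) on one socket matters.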
Mats Petersson
 
 
1. Automatic memory prefetching has been in K7/K8 processors since the introduction of the Athlon MP some 5 or 6 years ago.
2. I still don't see enough cache misses/read/write requests in my (current) sample to motivate a 10% improvement in Rosetta... Have you actually compared with a K8-based system?
 
-- 
Mats
 
Who?
 
 
Are you saying that the Rosetta workload fits into 1 MB? I don't think so.

who?
Mats Petersson
 
At least sufficiently so as not to notice any delays [assuming you mean 1 MB; Mb = megabit, MB = megabyte in my way of writing things].

I tried oprofile on "BU_FILL", which is essentially misses in both the L1 and L2 caches, and I got around 10 interrupts per second on that [so about 13M fills per second; one fill = 64 bytes => ~800 MB/s, which is about 10% of the bus capacity].

If we get around 10% usage on the bus (per core), I don't see how it can be improved by 10% - that would be an "infinite" improvement, which is usually not achievable in real-world scenarios.
 
-- 
Mats 
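The back-of-the-envelope bandwidth figure above can be reproduced in a few lines. The sample count of 1.3M events per oprofile interrupt is an assumption inferred from the stated numbers (10 interrupts/s vs. ~13M fills/s), and the 8 GB/s bus capacity is likewise a nominal assumed figure:

```python
# Reproduce the cache-fill bandwidth estimate from the post.
SAMPLE_COUNT = 1_300_000    # assumed BU_FILL events per oprofile interrupt
INTERRUPTS_PER_SEC = 10     # measured, per the post
CACHE_LINE_BYTES = 64       # one L2 fill moves one cache line

fills_per_sec = INTERRUPTS_PER_SEC * SAMPLE_COUNT         # 13,000,000
bandwidth_mb_s = fills_per_sec * CACHE_LINE_BYTES / 1e6   # ~832 MB/s

BUS_CAPACITY_MB_S = 8000    # assumed nominal front-side-bus capacity
utilisation = bandwidth_mb_s / BUS_CAPACITY_MB_S          # ~10%

print(f"{fills_per_sec:,} fills/s -> {bandwidth_mb_s:.0f} MB/s "
      f"({utilisation:.0%} of the assumed bus capacity)")
```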
Who?
 
I am very impressed; it sounds like HyperTransport has a compression algorithm or something, because it uses less memory traffic than any other processor I have used.

You probably want to review your figures again: either your measurement tool is broken, or you forgot to count the prefetched cache lines. (Most measurement tools exclude them; AMD's tool does, at least...)

Now HyperTransport is magic... cool :) That explains why Voodoo liked it so much before...

who?
Mats Petersson
 
Ehm, first of all, did I even mention anything about HyperTransport in my post? Not so. If the memory management works right (Linux's does), the process should allocate memory on the local processor, which means that HyperTransport doesn't come into the question. The memory controller used is the local one, so HyperTransport shouldn't get involved in that. [Yes, there are obviously snoop messages for each cache line, but those should be fairly short compared to a cache line.] Of course, there is indeed no guarantee that the process is kept on the same processor, in which case cross-processor traffic starts to affect things.

I did not find any (different) way to measure the actual memory transfers, so you may be correct that there are prefetches there...

Just out of curiosity, what sort of memory transfer rates are you seeing on Rosetta (in MB/s, for example)?

By the way, the working set where Rosetta is most active is an array of 300 x 6 x 4 bytes (7,200 bytes). It does reference a whole bunch of other variables in the function, but that's by far the largest one. 300 x 6 x 4 is MUCH smaller than 1 MB.

There is lots of other data, but it's very rarely accessed in general.
 
-- 
Mats 
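The working-set arithmetic above is easy to check. Only the 300 x 6 x 4 array comes from the post; 4 bytes per element (single-precision values) is the implied assumption:

```python
# Hot working set of Rosetta's busiest function versus a 1 MB L2 cache.
elements = 300 * 6        # array dimensions quoted in the post
bytes_per_element = 4     # assumed 32-bit (single-precision) values
hot_set_bytes = elements * bytes_per_element   # 7,200 bytes

L2_BYTES = 1 * 1024 * 1024  # the 1 MB cache size asked about

print(f"hot set: {hot_set_bytes} bytes, "
      f"{hot_set_bytes / L2_BYTES:.2%} of a 1 MB L2")
```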
Who?
 
So, I guess, the 155 MB around it are there for fun (yep... the executable allocates 140 to 150 MB...).

I see a different story than you do. I am in the process of getting the source code; I'll let you know later.

who?
Mats Petersson
 
Yes, my memory usage is in the 100-150 MB range too. But a lot of that is rarely touched. The binary alone uses 18 MB of static memory, and there's surely a bunch of dynamically allocated data as well. But as I say, much of that is never touched. When I call that array the most active area, I mean that I measured about 60% of the overall time spent there. There are several other functions that take a fair number of cycles, but they aren't really spending much time on memory fetches...
 
-- 
Mats 
Who?
            
 
In your opinion, what makes the K8 so slow on Rosetta?
In the meantime, when memory matters "for sure", HyperTransport does not improve anything: on SETI, the first AMD system with 4 sockets (and 4 memory controllers) is 79th on the top list.
Nobody can dispute that the SETI workload uses more than the cache size... changing your system's memory timings changes the RAC dramatically.
Conclusion: even when memory matters, HyperTransport is hyper-useless.

Who?
Mats Petersson
            
 
 
 
I haven't looked at SETI at all; it is not a project whose inner workings I took much interest in, and I'm not participating there any more.

When I compared machines of the same speed and architecture with different numbers of CPUs, the performance per MHz was very similar.

In my view, most of the time is spent waiting for the math processing to finish, so if the K8 had a faster math unit, that would help performance. Obviously, there may be other parts that are important too. [And the actual behaviour may vary depending on the type of calculations performed, as some types of workunits perform different tasks and run different bits of code.]

Of course, if you change the architecture, there's no doubt that a different architecture will differ in performance (assuming it's not just a straight copy of an older architecture). There's no denying that the new Core 2 technology is good - I have never said otherwise. Compared to my 6.3 +/- 5% average credits per MHz per hour, your quad core is getting around 7.1 credits per MHz per hour (assuming it's running at 2.66 GHz). The dual core is also close, as it hits 7.0, assuming that you're still running at 4.0 GHz.

That's a good 10% better performance per GHz.
 
-- 
Mats
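The per-clock comparison above works out as follows; the 6.3, 7.1 and 7.0 figures come from the post, and only the ratios matter for the conclusion:

```python
# Credits per MHz per hour, as quoted in the post.
k8 = 6.3          # Mats's K8 machine (+/- 5%)
core2_quad = 7.1  # the quad core at 2.66 GHz
core2_dual = 7.0  # the dual core at 4.0 GHz

quad_gain = core2_quad / k8 - 1.0  # ~12.7% better per clock
dual_gain = core2_dual / k8 - 1.0  # ~11.1% better per clock

print(f"quad: +{quad_gain:.1%}, dual: +{dual_gain:.1%} per clock vs K8")
```

Both ratios land just above 10%, consistent with the "a good 10% per GHz" conclusion.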
 
Who?
Yep... and my machine just passed RAC = 2800... without HyperTransport ;-) :-P

who?
 
 
The_Bad_Penguin (Joined: 5 Jun 06, Posts: 2751, Credit: 4,276,053, RAC: 0)
Well... with or without HTT, glad you're on Team Rosetta!
zombie67 [MM] (Joined: 11 Feb 06, Posts: 316, Credit: 6,621,003, RAC: 0)
@who?: When are you gonna try out your 8-way monster on SETI? I am very interested in seeing what it will do with an optimized app. Speaking of optimized apps... any news on the one you were working on?

Reno, NV
Team: SETI.USA
Who?
I am done with my optimization for SETI; I am waiting for the library I wrote and use to become "officially" public... You know... license stuff, to ensure I get my cute little license and transfer the right to use it to Berkeley.
(Due to my position, I have to do this :( )
Intellectual property is a headache, especially when you want to authorize "anybody" to use it, because some usages are not OK...
The SSSE3 version is up and running; if you dig, you could figure out that I did test it :D and it is screaming fast.
The SSE4 version for next summer is ready too. Get ready for another exciting ride on that one :)

I am afraid to go back to SETI, because The Inquirer will blast again "an Intel guy is looking for aliens". Even if I feel comfortable with this, I can understand that investors could see a little craziness in it, so I don't want my company to get hit by a tabloid-effect phenomenon where people get misled about it. My employer is not responsible for my hobbies, but it could be hurt if I am not careful.
So I am doing the proper license work. Thanks to some coworkers for helping me with the license, even though it is not work related.

who?
You'll get the SSSE3 version around Jan 2007.
zombie67 [MM]
Thanks for all the info! Are you referring to Intel® Math Kernel Library 9.0? I think that is out now. It's all Greek to me, so maybe you are talking about something else.

I understand your PR concerns. I am sure someone will be glad to use your application, once you release it, to show off what a dual quad can really do.

Reno, NV
Team: SETI.USA
Who?
 
I was referring to my pattern-matching algo in SSSE3 (how SETI matches the subharmonics of the FFT).

who?
Michael G.R. (Joined: 11 Nov 05, Posts: 264, Credit: 11,247,510, RAC: 0)
            Rosetta needs you more than SETI ;) 