Who?
Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0
I upgraded my V8 (8-core) computer to Windows Vista. No slowdown :)
The monster running Vista
It is getting back to the number one position; it will beat my overclocked quad core overnight, I think.
Still no saturation of the front-side buses ... that does not happen, Mister Ruiz!
(Yep! I have two FSBs on this motherboard! The snoop filters are more efficient than HyperTransport's aging protocol.)
who?
And again, Rosetta isn't going to saturate any bus on a machine with a decent L2 cache, so why are you going on about the FSB/HyperTransport? It's quite clear from your previous posts that you DO understand this is not an issue. It may be an issue for some other applications, but not for Rosetta, so Rosetta makes a very poor example for comparing these things, right?
--
Mats
Because some people are still saying that my front-side bus is saturating, and that is not true either ... Got the point?
With the prefetcher, the FSB NEVER saturates; whoever says otherwise is a liar (except in a few known cases that are not realistic, because they are synthetic benchmarks).
who?
Who?
I just ran VTune on the workloads I was crunching on my quad core, and your statement is actually wrong: 10% of the time is spent on memory loads and stores. It costs nothing on Core 2, because the prefetcher's success rate is high and everything gets into the cache before the load unit needs it.
I know that Rosetta workloads vary a lot; I guess you misunderstood part of the algorithm. There is some pointer chasing going on that requires the FSB or old HyperTransport ...
Remember, each Rosetta workload can be dramatically different, depending on the kind of structure you are processing.
whowho?
Mats Petersson
Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Yeah, OK, there may be some memory traffic, but I've yet to see a single case where Rosetta comes even close to saturating the memory bus, which is why I said Rosetta is a poor benchmark for whether the bus is "efficient" or not. Can we agree on that?
--
Mats
Who?
Well, by getting memory cache lines into the load units in time, as Core 2 does, you get a 10% performance improvement, and 10% is not a small number. HyperTransport's inefficiency probably costs the K8 8 to 10% on Rosetta; that is not negligible. I'm not even talking about the cost of crossing the NUMA bridge, which slows things down even more: CPU1 fetching data through CPU2's memory controller because of thread migration is the worst design case I've seen. The K8 + HyperTransport cross-link has good bandwidth, but we learned with the Pentium 4 that bandwidth is not the only important point; predicting and feeding with low latency is what matters.
Got the point?
who?
Mats Petersson
1. Automatic memory prefetching has been in K7/K8 processors since the introduction of the Athlon MP, some 5 or 6 years ago.
2. I still don't see enough cache misses / read/write requests in my (current sample) to account for a 10% improvement in Rosetta... Have you actually compared with a K8-based system?
--
Mats
Who?
Are you saying that Rosetta workloads fit into 1MB? I don't think so.
who?
Mats Petersson
At least sufficiently not to notice any delays [assuming you mean 1MB (Mb = megabit, MB = megabyte, in my way of writing things)].
I tried oprofile on "BU_FILL", which is essentially misses in both the L1 and L2 caches, and I got around 10 interrupts per second on that [so about 13M fills per second; one fill = 64 bytes => 800MB/s, which is about 10% of the bus capacity].
If we only use around 10% of the bus (per core), I don't see how removing that traffic could buy a 10% speedup - that would amount to an "infinite" improvement, which is usually not achievable in real-world scenarios.
--
Mats
Who?
I am very impressed; it sounds like HyperTransport has a compression algorithm or something, because it uses less memory bandwidth than any other processor I have used.
You probably want to review your figures: either your measurement tool is broken, or you forgot to count the prefetched cache lines. (Most measurement tools exclude them; AMD's tool does, at least ...)
So now HyperTransport is magic ... cool :) That explains why Voodoo liked it so much ...
who?
Mats Petersson
Ehm, first of all, did I even mention anything about HyperTransport in my post? No. If the memory management works right (Linux's does), the process should allocate memory on the local processor, which means HyperTransport doesn't come into it: the memory controller used is the local one, so HyperTransport shouldn't get involved. [Yes, there are obviously snoop messages for each cache line, but those should be fairly short compared to a cache line.] Of course, there is indeed no guarantee that the process stays on the same processor, in which case cross-processor traffic starts to affect things.
I did not find any (different) way to measure the actual memory transfers, so you may be correct that there are prefetches there...
Just out of curiosity, what sort of memory transfers are you seeing on Rosetta (in MB/s for example)?
By the way, the working set where Rosetta is most active is an array of 300 x 6 x 4 bytes. The function does reference a whole bunch of other variables, but that array is by far the largest one, and 300 x 6 x 4 bytes (about 7KB) is MUCH smaller than 1MB.
There is lots of other data, but in general it is very rarely accessed.
--
Mats
Who?
So I guess the 155 megs around it are there for fun (yep ... the executable allocates 140 to 150 megs ...).
I see a different story than yours. I am in the process of getting the source code; I'll let you know later.
who?
Mats Petersson
Yes, my memory usage is in the 100-150MB range too, but a lot of that is rarely touched. The binary alone uses 18MB of static memory, and there's surely a bunch of dynamically allocated data as well; much of it, though, is never touched. As for the most active area, I measured about 60% of the overall time spent there. Several other functions also take a fair number of cycles, but they aren't really spending much time on memory fetches...
--
Mats
Who?
In your opinion, what makes the K8 so slow on Rosetta?
Meanwhile, in a case where memory matters "for sure", HyperTransport does not improve anything: on SETI, the first AMD system with 4 sockets (and 4 memory controllers) is 79th on the top list.
Nobody can deny that the SETI workload uses more than the cache size ... changing your system's memory timings changes the RAC dramatically.
Conclusion: even when memory matters, HyperTransport is hyper-useless.
Who?
Mats Petersson
I haven't looked at SETI at all - it's not really a project whose inner workings I took much interest in, and I'm not participating there any more.
When I compared machines of the same speed and architecture but with different numbers of CPUs, the performance per MHz was very similar.
In my view, most of the time is spent waiting for the math processing to finish, so if the K8 had a faster math unit, that would help its performance. Obviously, other parts may be important too. [And the actual behaviour may vary with the type of calculation performed, as some types of workunits perform different tasks and run different bits of code.]
Of course, if you change the architecture, there's no doubt that performance will differ (assuming it's not just a straight copy of an older architecture). There's no denying that the new Core 2 technology is good - I have never said otherwise. Compared to my 6.3 +/- 5% credits per MHz per hour, your quad core is getting around 7.1 credits per MHz per hour (assuming it's running at 2.66GHz). The dual core is also close, hitting 7.0, assuming you're still running at 4.0GHz.
That's a good 10% better performance per GHz.
--
Mats
Who?
Yep ... and my machine just passed RAC = 2800 ... without HyperTransport ;-) :-P
who?
The_Bad_Penguin

Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
Well... with or without HTT, glad you're on Team Rosetta!
zombie67 [MM]

Joined: 11 Feb 06 Posts: 316 Credit: 6,621,003 RAC: 0
@who?: When are you gonna try out your 8-way monster on SETI? I am very interested in seeing what it will do with an optimized app. Speaking of optimized apps... any news on the one you were working on?
Reno, NV
Team: SETI.USA
Who?
I am done with my optimization for SETI; I am waiting for the library I wrote and use to become "officially" public ... You know ... license stuff, to ensure I get my cute little license and transfer the right to use it to Berkeley.
(Due to my position, I have to do this :( )
Intellectual property is a headache, especially when you want to authorize "anybody" to use it, because some usages are not OK ...
The SSSE3 version is up and running; if you dig, you could figure out that I did test it :d and it is screaming fast.
The SSE4 version for next summer is ready too. Get ready for another exciting ride on that one :)
I am afraid to go back to SETI, because The Inquirer will blast again that "an Intel guy is looking for aliens". Even if I feel comfortable with that, I can understand that investors might see a little craziness in it, and I don't want my company to get hit by a tabloid effect where people are misled about it. My employer is not responsible for my hobbies, but it can't hurt to be careful.
So I am doing the proper license work; thanks to some coworkers for helping me with the license, even though it is not work related.
who?
You'll get the SSSE3 version around Jan 2007.
zombie67 [MM]

Thanks for all the info! Are you referring to the Intel® Math Kernel Library 9.0? I think that is out now. It's all Greek to me, so maybe you are talking about something else.
I understand your PR concerns. I am sure someone will be glad to use your application, once you release it, to show off what a dual quad can really do.
Who?
I was referring to my pattern-matching algorithm in SSSE3 (how SETI matches the sub-harmonics of the FFT).
who?
Michael G.R.
Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0
Rosetta needs you more than SETI ;)