1)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94600)
Posted 16 Apr 2020 by bitsonic Post: I have a wide variety of servers with a lot of different configs. My servers have 8 up to 128 cores (and almost every config and Intel generation in between). All servers have hyperthreading enabled. I would say start with the CPUs based on the Skylake architecture first, such as the Xeon(R) Gold 5122, Xeon(R) Platinum 8160, or i7-6600U. Then you can compare your results with those from my machine and fuzzydice555's machine, since our machines are Skylake-based.
2)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94432)
Posted 14 Apr 2020 by bitsonic Post:
That's correct. In almost every case, total throughput is higher when you use all cores than when you use fewer. It is just like a highway: when there are very few cars and only 1-2 lanes are occupied, every car may run at 70 miles/hour. When there are many cars and all lanes are occupied, each car may only run at 60 miles/hour. However, the second case still has higher total throughput.
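The highway comparison can be put into numbers. This is only a rough sketch: the speeds (70 and 60 miles/hour) and the 2 occupied lanes come from the post, while the total of 4 lanes and the function name `total_throughput` are assumptions made up for illustration.

```python
def total_throughput(active_lanes: int, speed_mph: float) -> float:
    """Aggregate throughput: lanes in use multiplied by the speed of each car."""
    return active_lanes * speed_mph


# Few cars: only 2 lanes occupied, every car runs at full speed.
light_traffic = total_throughput(active_lanes=2, speed_mph=70)

# Busy road: all lanes occupied (4 lanes is an assumed figure), each car is slower.
full_traffic = total_throughput(active_lanes=4, speed_mph=60)

print(f"2 lanes at 70 mph: {light_traffic:.0f} lane-miles/hour")
print(f"4 lanes at 60 mph: {full_traffic:.0f} lane-miles/hour")
print(f"Each car is {1 - 60 / 70:.1%} slower, "
      f"but the road carries {full_traffic / light_traffic - 1:.1%} more")
```

The same reasoning applies to cores: each Rosetta instance may run somewhat slower when every core is busy, yet the machine as a whole completes more work.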
3)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94431)
Posted 14 Apr 2020 by bitsonic Post:
I assume ES = Engineering Sample, right? An ES is likely still a work-in-progress product, and not every feature may function correctly. Your test result from the i5-8550U looks more normal.
4)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94245)
Posted 12 Apr 2020 by bitsonic Post: According to this cache miss is pretty bad on my Xeon 2628v4, no matter how many threads I run. It seems no matter how many threads I run, all 30MB cache will be allocated to the active threads. That looks too low and doesn't sound normal. Which version of PCM are you using? You must use version 2.11 or above for Xeon E5 v4 support; a lower version will report wrong information. If you are using 2.11, try running another program and see whether the cache hit ratio is still so low. If it is still low, then something must be wrong.
5)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94074)
Posted 10 Apr 2020 by bitsonic Post: 1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Did they look into other factors and confirm that those do not contribute to the problem? It is true that the higher the concurrency, the more intensely threads compete with each other for cache resources, which may lead to a lower average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example: Guess I should use a more extreme example huh :)
6)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 94035)
Posted 10 Apr 2020 by bitsonic Post:
1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Did they look into other factors and confirm that those do not contribute to the problem? It is true that the higher the concurrency, the more intensely threads compete with each other for cache resources, which may lead to a lower average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:
- The more cores your CPU has, the more likely it is that RAM becomes the bottleneck. Think about dual-channel DDR 2400 serving a dual-core CPU vs. serving a quad-core CPU (assuming each core has the same specification). Of course the memory queue will be longer for the quad-core CPU, because 4 cores are competing for the same memory bandwidth instead of 2, and each core has a higher chance of waiting longer before memory serves it. Imagine 2 people queuing at a ticket office vs. 4 people: the people in the latter queue will wait longer. While a CPU core is waiting, it does no useful work. Now think about a 16-core CPU and a 32-core CPU: do you increase memory bandwidth 8 times and 16 times compared to a dual-core CPU? If not, then you will definitely see each instance run slower.
- The design of the OS kernel also affects performance and the L3 hit ratio. I saw a slightly higher L3 hit ratio after I upgraded from Win10 1709 to 1909, testing with the same program. What about Linux? What about comparing the efficiency of Windows and Linux when dealing with 32 threads? They could be different. But this is an OS problem, not Rosetta's problem.
- The design of the CPU microarchitecture. This is easy to understand. Even with the same cache size, the hit ratio of an Intel CPU will not be the same as that of an AMD CPU, and the hit ratio differs between generations of the same brand too.
So, again, we need to be specific about what happens when running a large number of concurrent instances. Hopefully someone can run the CPU performance counter monitor on a 32-core CPU and post the result here.
2. 56% utilization is a very, very normal number. Even with a 100% L3 hit ratio, utilization will not be 100%. When a cache read misses, the core sits idle while it waits. The response time differs for each cache level. On my computer, the latency of L1 cache is around 1ns, L2 around 3ns, L3 around 15ns, and RAM around 60-70ns. If L1 misses, the CPU waits a few cycles to fetch data from L2. If L2 misses, it waits more cycles for L3. If L3 misses, it waits hundreds of cycles for RAM. All these waits contribute to lower utilization. Even if we cached everything in L1, this number would not be 100%. The highest number I have ever seen is from the AIDA64 CPU stability test (stress FPU only). Essentially it is a very small piece of code, highly optimized to push the CPU to its limit. It only reaches around 80% utilization. Read the Intel article I quoted in my last post and you will understand what I am saying.
3. Memory read/write there refers to how many GB of data have been read from or written to main memory (RAM). What I am talking about here is memory bandwidth utilization. It is true that L3 refills result in data being read from RAM. In fact, modern CPUs always try to prefetch data from memory into cache so that it is already there when the CPU needs it. But in general, running the same code, the higher the cache hit rate, the lower the memory read/write traffic. I have occasionally seen, when running the performance capture at 10-second intervals, the L3 hit ratio drop to 60% while memory bandwidth utilization rises quite a lot. So I think my point is valid in this specific case.
4. It could be. So I ran the capture again for 10,000 seconds, long enough I believe. I got 82% on my i3, very close to my first few tests. Actually, I have run this test a few times in the last 2 days and saw similar numbers. Again, this is for Rosetta 4.12 crunching Covid-19 WUs. I ran out of non-4.12 WUs on my i3, so I did not test how it looks with other Rosetta versions.
5. Again, I am using the report from the performance counter monitor. It is reported by the CPU cores, not by the OS. Basically these are physical counters built into the CPU cores, so the numbers should be very accurate. We all know that CPU utilization reported by the OS does not reflect reality. This is the reason Intel developed that software: to help technicians find the real performance bottleneck.
To clarify, when I said "I don't know whether Rosetta's algorithm itself is optimized", I meant that maybe there is room to improve the protein design/folding simulation algorithm so that it completes the same simulation in fewer computation cycles. When I said "looks like it is pretty effective in keeping CPU busy", I meant that the code itself can push CPU utilization pretty high compared to other applications.
Lastly, I agree with you that I should not draw a broad conclusion from a small amount of testing on my own computers. I should say my conclusion only applies to my machines running Rosetta 4.12 crunching Covid-19 WUs. But at least it proves that on some machines, L3 cache is not the bottleneck. Other people may see different results on their machines. Use it as a reference only.
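Point 2 of the post above (each cache miss costs progressively more waiting) can be illustrated with a simple average-latency calculation. This is a rough sketch, not a model of any real CPU: the latencies (1, 3, 15 and 60-70 ns, here 65 ns as a midpoint) and the 82% and 60% L3 hit ratios come from the post, while the L1 and L2 hit ratios (0.95 and 0.80) and the function name are assumptions chosen only for illustration. Each latency is treated as the full time to obtain data found at that level.

```python
def average_access_time_ns(l1_hit: float, l2_hit: float, l3_hit: float,
                           l1_ns: float = 1.0, l2_ns: float = 3.0,
                           l3_ns: float = 15.0, ram_ns: float = 65.0) -> float:
    """Expected latency of one memory access in nanoseconds.

    Hit ratios are local: the fraction of accesses reaching a level that
    are served by that level. Whatever misses L3 goes to RAM.
    """
    served_by_l1 = l1_hit
    served_by_l2 = (1 - l1_hit) * l2_hit
    served_by_l3 = (1 - l1_hit) * (1 - l2_hit) * l3_hit
    served_by_ram = (1 - l1_hit) * (1 - l2_hit) * (1 - l3_hit)
    return (served_by_l1 * l1_ns + served_by_l2 * l2_ns
            + served_by_l3 * l3_ns + served_by_ram * ram_ns)


# L1 and L2 hit ratios are assumed; the two L3 hit ratios are from the post.
typical = average_access_time_ns(l1_hit=0.95, l2_hit=0.80, l3_hit=0.82)
bad_interval = average_access_time_ns(l1_hit=0.95, l2_hit=0.80, l3_hit=0.60)

print(f"L3 hit ratio 82%: {typical:.2f} ns per access on average")
print(f"L3 hit ratio 60%: {bad_interval:.2f} ns per access on average")
```

Under these assumed ratios the average stays above the 1 ns L1 latency even with a good L3 hit ratio, which is the point being made: some waiting is unavoidable, so utilization never reaches 100%.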
7)
Message boards :
News :
Help in the fight against COVID-19!
(Message 93962)
Posted 9 Apr 2020 by bitsonic Post: Dear Brian Coventry et alia, Hi, I did some testing, and the results show that L3 cache is not a problem. I put more details in this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13764#93961
8)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 93961)
Posted 9 Apr 2020 by bitsonic Post: I know that this has been discussed quite a few times in this forum. People say that the Rosetta application is not optimized and needs 5MB of L3 cache for each running instance (though I don't really know where they got this number). But it looks to me like no one has ever shown any data to prove it. So I decided to figure it out. Intel used to provide software called Performance Counter Monitor (see this article: https://software.intel.com/en-us/articles/intel-performance-counter-monitor). Basically, this software reads hardware-level counters inside the CPU and provides performance insight, including the cache hit rate. So I did a test on my desktop:
CPU: i3-7100, 2 cores 4 threads, 3.9 GHz, 3MB L3 cache
RAM: 16GB DDR4-2400, dual channel
OS: Win10 1909
I ran 4 instances of Rosetta 4.12 with Covid-19 WUs. I closed all other background applications that might affect the result. Then I enabled the monitor and captured for 100 seconds. Here is the result. Column L3HIT is the L3 cache hit rate; L2HIT is the L2 cache hit rate.
------------------------------------------------------------------------------------------------
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 (read) cache misses
L2MISS: L2 (read) cache misses (including other core's L2 cache *hits*)
L3HIT : L3 (read) cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3MPI : number of L3 (read) cache misses per instruction
L2MPI : number of L2 (read) cache misses per instruction
READ : bytes read from main memory controller (in GBytes)
WRITE : bytes written to main memory controller (in GBytes)
IO : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
energy: Energy in Joules

Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP
   0    0     1.07   1.07   0.99   1.00     79 M    436 M    0.82    0.78    0.00    0.00    29
   1    0     1.09   1.10   0.99   1.00     69 M    435 M    0.84    0.78    0.00    0.00    29
   2    0     1.17   1.17   0.99   1.00     55 M    352 M    0.84    0.82    0.00    0.00    30
   3    0     1.18   1.19   1.00   1.00     43 M    321 M    0.86    0.83    0.00    0.00    30
------------------------------------------------------------------------------------------------
  SKT   0     1.13   1.13   0.99   1.00    247 M   1545 M    0.84    0.80    0.00    0.00    29
------------------------------------------------------------------------------------------------
 TOTAL  *     1.13   1.13   0.99   1.00    247 M   1545 M    0.84    0.80    0.00    0.00    N/A

Instructions retired: 1764 G ; Active cycles: 1556 G ; Time (TSC): 391 Gticks ; C0 (active,non-halted) core residency: 99.73 %
C1 core residency: 0.27 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;
PHYSICAL CORE IPC : 2.27 => corresponds to 56.70 % utilization for cores in active state
Instructions per nominal CPU cycle: 2.26 => corresponds to 56.38 % core utilization over time interval
SMI count: 0
------------------------------------------------------------------------------------------------
MEM (GB)->|  READ  | WRITE |  IO  | CPU energy |
------------------------------------------------------------------------------------------------
 SKT   0    181.87   37.36   2.18    2912.05
------------------------------------------------------------------------------------------------
As you can see, the L3 cache hit rate is 84% on average over the 100 seconds. I repeated the test a few times. The result varies around 80%; the lowest number I saw was 75% and the highest was 89%. This is actually a very high hit rate based on my observation. I can say most other BOINC applications have a lower L3 hit rate than this. For example, I used to run SETI@home (optimized version), and its L3 hit rate was around 70%; Milkyway@home is a little higher but still below 80%. The only project with a higher hit rate is Collatz Conjecture. I am not surprised by this, as Collatz's algorithm is very simple (basically calculating 3*N+1 and N/2), so the code footprint should be very small and may even fit into the L2 cache. Another number that supports this is the memory traffic (see the bottom of the output above). It shows 181.87GB read and 37.36GB written in 100 seconds. That's around 2.19GB/second in total. This is a very low number considering the CPU is running at full speed.
Remember that dual-channel DDR4-2400 can provide close to 40GB/s of bandwidth, so memory utilization is very low. This also shows that most memory reads/writes were served by the CPU caches.
I also tested my workstation at work:
CPU: i7 9700K, 8 cores 8 threads, OC to 4.2 GHz, 12MB L3 cache
RAM: 16GB DDR4 2666 Dual Channel
OS: Win10 1909
Running 8 Rosetta instances, I got an average L3 hit rate of 77%. Pretty close.
Conclusion: According to the findings above, it doesn't seem that L3 cache is a bottleneck for Rosetta, at least not for 4.12. I don't know whether Rosetta's algorithm itself is optimized, but it looks like it is pretty effective at keeping the CPU busy.
Lastly, BTW, for Intel's current CPU microarchitectures (i.e. gen 6, 7, 8, 9), each physical core can only use up to 2MB of L3 cache. So even if you run only one Rosetta instance, it will not benefit from an L3 cache larger than 2MB.
I hope you find this helpful.
Michael Wang
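The memory-traffic arithmetic in the post above can be reproduced with a short script. Treat it as a sketch under stated assumptions: the READ/WRITE totals, the 100-second capture and the IPC of 2.27 are copied from the PCM output in the post, whereas the peak-bandwidth formula (8 bytes per transfer per channel, 2400 million transfers per second, decimal gigabytes) and the maximum of 4 instructions per cycle behind the utilization figure are assumptions, the latter inferred from the ratio PCM printed. All function and constant names are made up for illustration.

```python
# Figures copied from the PCM output in the post (100-second capture, i3-7100).
READ_GB = 181.87
WRITE_GB = 37.36
CAPTURE_SECONDS = 100
PHYSICAL_CORE_IPC = 2.27

# Assumptions, not stated in the post:
# each DDR4-2400 channel moves 8 bytes per transfer at 2400e6 transfers/s,
# and 4 instructions per cycle counts as 100 % core utilization.
TRANSFERS_PER_SECOND = 2400e6
BYTES_PER_TRANSFER = 8
CHANNELS = 2
ASSUMED_MAX_IPC = 4.0


def memory_traffic_gb_per_s(read_gb: float, write_gb: float, seconds: float) -> float:
    """Average traffic to main memory over the capture window."""
    return (read_gb + write_gb) / seconds


def peak_bandwidth_gb_per_s(transfers_per_s: float, bytes_per_transfer: int,
                            channels: int) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s."""
    return transfers_per_s * bytes_per_transfer * channels / 1e9


def core_utilization(ipc: float, max_ipc: float) -> float:
    """Fraction of the assumed maximum instructions per cycle actually retired."""
    return ipc / max_ipc


traffic = memory_traffic_gb_per_s(READ_GB, WRITE_GB, CAPTURE_SECONDS)
peak = peak_bandwidth_gb_per_s(TRANSFERS_PER_SECOND, BYTES_PER_TRANSFER, CHANNELS)

print(f"Average memory traffic: {traffic:.2f} GB/s")
print(f"Theoretical peak:       {peak:.1f} GB/s")
print(f"Bandwidth utilization:  {traffic / peak:.1%}")
print(f"Core utilization:       {core_utilization(PHYSICAL_CORE_IPC, ASSUMED_MAX_IPC):.1%}")
```

With these inputs the traffic works out to about 2.19 GB/s against a theoretical 38.4 GB/s, i.e. under 6% of the available bandwidth, which matches the "very low" reading in the post. The utilization line lands close to the 56.70 % that PCM printed; any small difference presumably comes from PCM using the unrounded IPC.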
9)
Message boards :
News :
Help in the fight against COVID-19!
(Message 93870)
Posted 8 Apr 2020 by bitsonic Post: Greetings from China, and a big thank you to the team! I have been with SETI@home since 1999. Now I am putting my computers to work on another meaningful cause here. My country has been suffering from this virus. I am donating all my computing power here. I hope we will eventually help scientists find a vaccine.