Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it
Author | Message |
---|---|
bitsonic Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
I know that this has been discussed quite a few times in this forum. People have said that the Rosetta application is not optimized and needs 5MB of L3 cache for each running instance (though I don't really know where they got this number). But it looks to me like no one has ever shown any data to prove it. So I decided to figure it out. Intel used to provide a tool called Performance Counter Monitor (see this article: https://software.intel.com/en-us/articles/intel-performance-counter-monitor). Basically, this software reads hardware-level counters inside the CPU and provides performance insight, including cache hit rates. So I did a test on my desktop: CPU: i3-7100, 2 cores 4 threads, 3.9GHz, 3MB L3 cache; RAM: 16GB DDR 2400 dual channel; OS: Win10 1909. I ran 4 instances of Rosetta 4.12 with Covid-19 WUs and closed all other background applications that might affect the result. Then I enabled the monitor and captured for 100 seconds. Here is the result. Column L3HIT is the L3 cache hit rate; L2HIT is the L2 cache hit rate.

- EXEC : instructions per nominal CPU cycle
- IPC : instructions per CPU cycle
- FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
- AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
- L3MISS: L3 (read) cache misses
- L2MISS: L2 (read) cache misses (including other core's L2 cache *hits*)
- L3HIT : L3 (read) cache hit ratio (0.00-1.00)
- L2HIT : L2 cache hit ratio (0.00-1.00)
- L3MPI : number of L3 (read) cache misses per instruction
- L2MPI : number of L2 (read) cache misses per instruction
- READ : bytes read from main memory controller (in GBytes)
- WRITE : bytes written to main memory controller (in GBytes)
- IO : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
- TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
- energy: Energy in Joules

| Core | SKT | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.07 | 1.07 | 0.99 | 1.00 | 79 M | 436 M | 0.82 | 0.78 | 0.00 | 0.00 | 29 |
| 1 | 0 | 1.09 | 1.10 | 0.99 | 1.00 | 69 M | 435 M | 0.84 | 0.78 | 0.00 | 0.00 | 29 |
| 2 | 0 | 1.17 | 1.17 | 0.99 | 1.00 | 55 M | 352 M | 0.84 | 0.82 | 0.00 | 0.00 | 30 |
| 3 | 0 | 1.18 | 1.19 | 1.00 | 1.00 | 43 M | 321 M | 0.86 | 0.83 | 0.00 | 0.00 | 30 |
| SKT | 0 | 1.13 | 1.13 | 0.99 | 1.00 | 247 M | 1545 M | 0.84 | 0.80 | 0.00 | 0.00 | 29 |
| TOTAL | * | 1.13 | 1.13 | 0.99 | 1.00 | 247 M | 1545 M | 0.84 | 0.80 | 0.00 | 0.00 | N/A |

Instructions retired: 1764 G ; Active cycles: 1556 G ; Time (TSC): 391 Gticks ; C0 (active,non-halted) core residency: 99.73 %

C1 core residency: 0.27 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;

C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;

PHYSICAL CORE IPC : 2.27 => corresponds to 56.70 % utilization for cores in active state

Instructions per nominal CPU cycle: 2.26 => corresponds to 56.38 % core utilization over time interval

SMI count: 0

| MEM (GB) | READ | WRITE | IO | CPU energy |
|---|---|---|---|---|
| SKT 0 | 181.87 | 37.36 | 2.18 | 2912.05 |

As you can see, the L3 cache hit rate averages 84% over the 100 seconds. I repeated the test a few times. The result varies around 80%; the lowest number I saw was 75% and the highest was 89%. This is actually a very high hit rate based on my observations. Most other BOINC applications I have measured have a lower L3 hit rate than this. For example, I used to run SETI@home (optimized version), and its L3 hit rate was around 70%; Milkyway@home is a little higher but still below 80%. The only project with a higher hit rate is Collatz Conjecture. I am not surprised by this, as Collatz's algorithm is very simple (basically calculating 3*N+1 and N/2), so the code footprint should be very small and may even fit into L2 cache. Another number that supports this is the memory traffic (see the bottom of the output above). It shows 181.87GB read and 37.36GB written in 100 seconds. That's around 2.19GB/second in total. This is a very low number considering the CPU is running at full speed.
Remember that dual-channel DDR2400 can provide close to 40GB/s of bandwidth. So memory bandwidth utilization is very low. This also indicates that most memory reads and writes are served by the CPU caches. I also tested my workstation at work: CPU: i7 9700K, 8 cores 8 threads, OC to 4.2GHz, 12MB L3 cache; RAM: 16GB DDR4 2666 dual channel; OS: Win10 1909. Running 8 Rosetta instances, I got an average L3 hit rate of 77%. Pretty close. Conclusion: According to the findings above, it doesn't seem that L3 cache is a bottleneck for Rosetta, at least not for 4.12. I don't know whether Rosetta's algorithm itself is optimized, but it looks like it is pretty effective at keeping the CPU busy. Lastly, BTW, on Intel's current CPU microarchitectures (i.e. gen 6, 7, 8, 9), each physical core can only use up to 2MB of L3 cache. So even if you run only one Rosetta instance, it will not benefit from an L3 cache larger than 2MB. I hope you find this helpful. Michael Wang |
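The bandwidth arithmetic in the post above can be reproduced with a short sketch. The read/write totals, the 100-second capture and the TSC count come from the PCM output. The peak figure assumes DDR4 at 2400 MT/s with two 64-bit channels, which the post does not spell out, and the function names are illustrative only.

```python
# Figures copied from the PCM capture quoted in the post.
READ_GB = 181.87          # GB read from main memory during the capture
WRITE_GB = 37.36          # GB written to main memory during the capture
CAPTURE_SECONDS = 100.0   # capture length stated in the post
TSC_GTICKS = 391.0        # "Time (TSC)" from the PCM summary
NOMINAL_GHZ = 3.9         # i3-7100 clock stated in the post

# Assumptions (not stated in the post): DDR4 at 2400 MT/s,
# 8 bytes per transfer per channel, two channels.
MEGATRANSFERS_PER_SECOND = 2400
BYTES_PER_TRANSFER = 8
CHANNELS = 2


def observed_bandwidth_gbs(read_gb, write_gb, seconds):
    """Average main-memory traffic in GB/s over the capture."""
    return (read_gb + write_gb) / seconds


def peak_bandwidth_gbs(mega_transfers, bytes_per_transfer, channels):
    """Theoretical peak bandwidth in GB/s for the assumed memory setup."""
    return mega_transfers * bytes_per_transfer * channels / 1000.0


def capture_seconds_from_tsc(gticks, nominal_ghz):
    """Capture length implied by the TSC count, assuming it ticks at nominal clock."""
    return gticks / nominal_ghz


observed = observed_bandwidth_gbs(READ_GB, WRITE_GB, CAPTURE_SECONDS)
peak = peak_bandwidth_gbs(MEGATRANSFERS_PER_SECOND, BYTES_PER_TRANSFER, CHANNELS)
utilization = observed / peak

print(f"observed traffic : {observed:.2f} GB/s")
print(f"theoretical peak : {peak:.1f} GB/s")
print(f"utilization      : {utilization:.1%}")
print(f"capture from TSC : {capture_seconds_from_tsc(TSC_GTICKS, NOMINAL_GHZ):.1f} s")
```

Under these assumptions the observed traffic is a small fraction of the theoretical peak, which is the point the post makes. Real sustained bandwidth is lower than the theoretical peak, so the true utilization would be somewhat higher than this estimate.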
entity Joined: 8 May 18 Posts: 19 Credit: 5,744,699 RAC: 12,807 |
"I know that this has been discussed quite a few times in this forum. People have said that the Rosetta application is not optimized and needs 5MB of L3 cache for each running instance (though I don't really know where they got this number). But it looks to me like no one has ever shown any data to prove it. So I decided to figure it out."

Unfortunately, statistics are always open to interpretation. "If you torture the data long enough, it will confess" -- Economist Ronald H. Coase

1. Many contributors have reported that if they run a small number of Rosetta instances concurrently they see less of a problem, but as the number of concurrent instances increases the problem becomes more pronounced. This observation seems to be backed by your data, where the L3 hit ratio drops between 4 concurrent and 8 concurrent instances. What happens when concurrency jumps to 16 threads, 32 threads, etc.?
2. The PHYSICAL CORE IPC data indicates ~56% utilization. It seems like it should be closer to 100%. Why so low? Does it indicate that the cores are waiting significantly?
3. The memory READ/WRITE numbers need more context. Where is that measured? Additionally, I assume those are main memory stats and not cache memory stats, based on the size of the numbers. I would conclude that if the L3 cache needed to be refreshed with new data due to misses, it would result in high read rates, just as your data indicates. Memory transfer rates are not indicative of anything other than that data is being transferred, and they definitely don't indicate memory utilization. I would suggest that they more rightly indicate memory channel utilization. I submit that in a highly optimized L3 cache utilization situation there would be little to no memory transfer activity; however, just because the transfer rate isn't at full utilization doesn't support the claim that there isn't an L3 cache problem.
4. Are you making the assumption that the Rosetta program has a reasonably static processing profile? Could the program have higher L3 cache demands at various points in the processing stream, and you just happened to catch the program at a low point in the demand?
5. You make the statement: "I don't know whether Rosetta's algorithm itself is optimized, but looks like it is pretty effective in keeping CPU busy." I ask, based on what? If you are using the CPU utilization reported by the OS, I contend that isn't a reliable number in this case. That number is based on the OS wait bits, which never get set in this case. The thread is dispatched to the CPU and, as far as the OS is concerned, it is active on the processor. If the core is waiting on memory transfers between cache and main memory, the OS doesn't see that. The only way the OS knows about a thread waiting is if the thread gives up the CPU voluntarily or is interrupted by a higher-priority task. The only way the OS knows the CPU is waiting is if the wait bit for that processor is set.

I'm not saying the data isn't useful or pertinent. Just be careful about drawing conclusions from such a small amount of data. In other words, it's hard to run a program for 100 seconds and then say everything is OK. |
bitsonic Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Did they look into other factors and confirm that those do not contribute to the problem? It is true that the more concurrency there is, the more intensely threads compete with each other for cache resources, which may lower the average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:
- The more cores your CPU has, the more likely it is that RAM latency becomes the bottleneck. Think about dual-channel DDR 2400 serving a dual-core CPU vs. serving a quad-core CPU (assuming each core has the same specification). Of course the memory queue will be longer for the quad-core CPU, because 4 cores are competing for the same memory bandwidth instead of 2, and each core has a higher chance of waiting longer before memory serves it. It is like 2 people queuing at a ticket office vs. 4 people: the people in the latter queue will wait longer. While a CPU core is waiting, it is effectively idle. Now think about a 16-core CPU and a 32-core CPU: do you increase memory bandwidth 8 times and 16 times compared to a dual-core CPU? If not, then you will definitely see each instance run slower.
- The design of the OS kernel also affects performance and the L3 hit ratio. I saw a slightly higher L3 hit ratio after I upgraded from Win10 1709 to 1909, testing with the same program. What about Linux? What about comparing efficiency between Windows and Linux when dealing with 32 threads? They could be different. But this is an OS problem, not Rosetta's problem.
- The design of the CPU microarchitecture. This is easy to understand. Even with the same cache size, the hit ratio of an Intel CPU will not be the same as that of an AMD CPU, and the hit ratio will differ between generations of the same brand too.

So, again, we need to be specific about what happens when running a large number of concurrent instances. Hopefully someone can run the CPU performance counter monitor on a 32-core CPU and post the result here.

2. 56% utilization is a very normal number. Even if we had a 100% L3 hit ratio, it would not be 100% utilization. If a cache read misses, the core stalls while it waits. The response time of each cache level is different. On my computer, the latency of L1 cache is around 1ns, L2 around 3ns, L3 around 15ns, and RAM around 60-70ns. If L1 misses, the CPU waits a few cycles to fetch the data from L2. If L2 misses, it waits more cycles for L3. If L3 misses, it waits hundreds of cycles for RAM. All these waits contribute to lower utilization. Even if we cached everything in L1, this number would not be 100%. The highest number I have ever seen is from the AIDA64 CPU stability test (stressing the FPU only). Essentially it is a very small piece of highly optimized code that pushes the CPU to its limit, and it only reaches around 80% utilization. Read the Intel article I linked in my last post and you will understand what I am saying.

3. Memory read/write there refers to how many GB of data have been read from or written to main memory (RAM). What I am talking about here is memory bandwidth utilization. It is true that refreshing L3 results in data being read from RAM. In fact, modern CPUs always try to prefetch data from memory into cache so that when the CPU needs the data, it is already in cache. But in general, when running the same code, the higher the cache hit rate, the lower the memory read/write traffic. Occasionally, when I ran the performance capture at 10-second intervals, I saw the L3 hit ratio drop to 60% and memory bandwidth utilization rise quite a lot. So I think my point is valid in this specific case.

4. It could be. So I ran the capture again for 10,000 seconds, which I believe is long enough. I got 82% on my i3, very close to my first few tests. I have repeated this test a few times over the last 2 days and saw similar numbers. Again, this is for Rosetta 4.12 crunching Covid-19 WUs. I ran out of non-4.12 WUs on my i3, so I did not test how it looks with other Rosetta versions.

5. Again, I am using the report from Performance Counter Monitor. It is reported by the CPU itself, not by the OS. Basically, those are physical counters built into the CPU cores, so the numbers should be very accurate. We all know that the CPU utilization reported by the OS does not reflect reality. This is the reason Intel developed that software: to help technicians reveal where the real performance bottleneck is. To clarify, when I said "I don't know whether Rosetta's algorithm itself is optimized", I meant that maybe there is room to improve the protein design/folding simulation algorithm so that it completes the same simulation in fewer computation cycles. When I said "looks like it is pretty effective in keeping CPU busy", I meant that the code itself can push CPU utilization pretty high compared to other applications.

Lastly, I agree with you that I should not draw a broad conclusion from a small amount of testing on my own computers. My conclusion applies only to my machines running Rosetta 4.12 crunching Covid-19 WUs. But at least it shows that on some machines L3 cache is not the bottleneck. Other people may see different results on their machines. Use it as a reference only. |
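The effect of the L3 hit ratio on stall time, described in point 2 of the reply above, can be illustrated with a deliberately simple model. It uses the latencies quoted there (L3 around 15ns, RAM 60-70ns) and the hit ratios mentioned in the thread (84% typical, 60% in the worst 10-second window). The 65ns midpoint, the 3.9GHz clock used for the cycle conversion, and the model itself (each L2 miss pays either the L3 or the RAM latency, with no prefetching or overlap) are simplifying assumptions, not measurements.

```python
# Latencies quoted in the post; RAM uses the midpoint of the 60-70 ns range.
L3_LATENCY_NS = 15.0
RAM_LATENCY_NS = 65.0   # assumption: midpoint of "60-70ns"
CLOCK_GHZ = 3.9         # assumption: the i3-7100 clock from the first post


def l2_miss_cost_ns(l3_hit_ratio, l3_ns=L3_LATENCY_NS, ram_ns=RAM_LATENCY_NS):
    """Expected latency of a load that missed L2, in nanoseconds.

    Simplified model: the load is served by L3 with probability
    l3_hit_ratio, otherwise by RAM. Prefetching and overlapping of
    outstanding misses are ignored.
    """
    if not 0.0 <= l3_hit_ratio <= 1.0:
        raise ValueError("hit ratio must be between 0 and 1")
    return l3_hit_ratio * l3_ns + (1.0 - l3_hit_ratio) * ram_ns


def ns_to_cycles(ns, clock_ghz=CLOCK_GHZ):
    """Convert nanoseconds to core clock cycles."""
    return ns * clock_ghz


for hit in (0.60, 0.84, 1.00):
    cost = l2_miss_cost_ns(hit)
    print(f"L3 hit {hit:.2f}: ~{cost:.1f} ns (~{ns_to_cycles(cost):.0f} cycles) per L2 miss")
```

In this model, going from a 60% to an 84% L3 hit ratio cuts the average cost of an L2 miss by roughly a third. Even a perfect L3 hit ratio still leaves a stall of tens of cycles per L2 miss, which fits the argument that utilization cannot reach 100%.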
Sid Celery Joined: 11 Feb 08 Posts: 2105 Credit: 40,926,259 RAC: 18,158 |
"1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Did they look into other factors and confirm that those do not contribute to the problem? It is true that the more concurrency there is, the more intensely threads compete with each other for cache resources, which may lower the average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:" I don't (think I) suffer the problem myself (8 threads), but the complaint came from people with 32+ threads, so I'm not sure your results based on a 2-core/4-thread CPU help answer their question. |
bitsonic Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
"1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Did they look into other factors and confirm that those do not contribute to the problem? It is true that the more concurrency there is, the more intensely threads compete with each other for cache resources, which may lower the average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:" Guess I should use a more extreme example, huh :) |
Sid Celery Joined: 11 Feb 08 Posts: 2105 Credit: 40,926,259 RAC: 18,158 |
"I don't (think I) suffer the problem myself (8 threads), but the complaint came from people with 32+ threads, so I'm not sure your results based on a 2-core/4-thread CPU help answer their question." It seems so. The whole subject has come up before, and I take note, but I'm not in the league of people with these huge machines, so I only make a mental note for when I need or can afford something bigger/better. |
fuzzydice555 Joined: 31 Oct 17 Posts: 5 Credit: 2,786,716 RAC: 4,394 |
Now this is interesting! I wanted to test this, but didn't know how yet. The performance metrics provided by Rosetta don't answer the question, but I'll share my results nevertheless:

- R7 1700: 16MB cache, 16 threads
- R7 2700: 16MB cache, 16 threads
- R7 3700X: 32MB cache, 16 threads
- Xeon 2628v4: 30MB cache, 24 threads

| Average processing rate (GFLOPS) | 1700 | 2700 | 3700X | 2628v4 |
|---|---|---|---|---|
| Rosetta Mini 3.78 x86_64-pc-linux-gnu | 2,840 | 2,840 | 3,100 | 2,930 |
| Rosetta Mini 3.78 i686-pc-linux-gnu | 2,900 | 2,870 | 3,040 | 2,870 |
| Rosetta 4.07 i686-pc-linux-gnu | 2,780 | 2,780 | 3,200 | 2,920 |
| Rosetta 4.08 x86_64-pc-linux-gnu | 3,000 | 3,010 | 2,980 | 3,060 |
| Rosetta 4.12 i686-pc-linux-gnu | 1,830 | 1,850 | 2,180 | 1,830 |
| Rosetta 4.12 x86_64-pc-linux-gnu | 3,060 | 2,830 | 3,440 | 2,910 |

The 3700X has twice as much cache as the 1700 and 2700, but it also has better IPC and faster clock speeds. The difference between the 1700/2700 and the 3700X could be due to these improvements. However, the 2628v4 should be MUCH slower than the Ryzens (1.4 GHz vs 3.4 GHz clock speed), yet I don't see a difference in performance. I will check Performance Counter Monitor to see if I can find any difference in cache misses. |
fuzzydice555 Joined: 31 Oct 17 Posts: 5 Credit: 2,786,716 RAC: 4,394 |
According to this, the cache miss rate is pretty bad on my Xeon 2628v4, no matter how many threads I run. It seems that no matter how many threads I run, all 30MB of cache is allocated to the active threads. I'm not sure how to interpret these results.

24 threads:

| Core | SKT | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1.93 | 1.70 | 1.13 | 1.13 | 2391 K | 2732 K | 0.12 | 0.63 | 0.00 | 0.00 | 192 | 8 | 0 | 33 |
| 1 | 0 | 1.15 | 1.02 | 1.13 | 1.13 | 465 K | 578 K | 0.20 | 0.73 | 0.00 | 0.00 | 576 | 19 | 0 | 32 |
| 2 | 0 | 1.13 | 0.99 | 1.13 | 1.13 | 21 M | 28 M | 0.23 | 0.24 | 0.01 | 0.02 | 2112 | 79 | 0 | 33 |
| 3 | 0 | 0.93 | 0.82 | 1.13 | 1.13 | 26 M | 35 M | 0.27 | 0.12 | 0.02 | 0.03 | 3504 | 220 | 0 | 34 |
| 4 | 0 | 1.95 | 1.72 | 1.13 | 1.13 | 2473 K | 2818 K | 0.12 | 0.72 | 0.00 | 0.00 | 48 | 20 | 0 | 32 |
| 5 | 0 | 1.60 | 1.41 | 1.13 | 1.13 | 3790 K | 5603 K | 0.32 | 0.81 | 0.00 | 0.00 | 384 | 31 | 0 | 33 |
| 6 | 0 | 0.85 | 0.75 | 1.13 | 1.13 | 18 M | 28 M | 0.34 | 0.21 | 0.01 | 0.02 | 2640 | 81 | 0 | 34 |
| 7 | 0 | 1.64 | 1.45 | 1.13 | 1.13 | 3895 K | 4715 K | 0.17 | 0.43 | 0.00 | 0.00 | 48 | 58 | 0 | 32 |
| 8 | 0 | 1.78 | 1.57 | 1.13 | 1.13 | 1696 K | 1923 K | 0.12 | 0.38 | 0.00 | 0.00 | 96 | 41 | 0 | 32 |
| 9 | 0 | 1.19 | 1.05 | 1.13 | 1.13 | 20 M | 27 M | 0.27 | 0.16 | 0.01 | 0.02 | 528 | 135 | 0 | 31 |
| 10 | 0 | 1.61 | 1.43 | 1.13 | 1.13 | 1198 K | 1394 K | 0.14 | 0.91 | 0.00 | 0.00 | 4992 | 42 | 0 | 33 |
| 11 | 0 | 1.27 | 1.12 | 1.13 | 1.13 | 4696 K | 7883 K | 0.40 | 0.59 | 0.00 | 0.00 | 480 | 32 | 0 | 32 |
| 12 | 0 | 1.09 | 0.96 | 1.13 | 1.13 | 16 M | 21 M | 0.24 | 0.34 | 0.01 | 0.01 | 2976 | 447 | 0 | 33 |
| 13 | 0 | 1.52 | 1.36 | 1.12 | 1.13 | 1288 K | 1347 K | 0.04 | 0.40 | 0.00 | 0.00 | 48 | 73 | 0 | 32 |
| 14 | 0 | 1.48 | 1.31 | 1.13 | 1.13 | 11 M | 13 M | 0.12 | 0.43 | 0.01 | 0.01 | 960 | 104 | 0 | 33 |
| 15 | 0 | 0.96 | 0.84 | 1.13 | 1.13 | 9561 K | 13 M | 0.28 | 0.28 | 0.01 | 0.01 | 576 | 228 | 0 | 34 |
| 16 | 0 | 0.98 | 0.86 | 1.13 | 1.13 | 25 M | 33 M | 0.26 | 0.22 | 0.02 | 0.02 | 2064 | 24 | 0 | 32 |
| 17 | 0 | 1.31 | 1.16 | 1.13 | 1.13 | 3227 K | 5229 K | 0.38 | 0.73 | 0.00 | 0.00 | 240 | 16 | 0 | 33 |
| 18 | 0 | 0.97 | 0.86 | 1.13 | 1.13 | 8905 K | 12 M | 0.26 | 0.34 | 0.01 | 0.01 | 192 | 267 | 0 | 34 |
| 19 | 0 | 1.05 | 0.93 | 1.13 | 1.13 | 12 M | 17 M | 0.29 | 0.16 | 0.01 | 0.01 | 1872 | 435 | 0 | 32 |
| 20 | 0 | 1.05 | 0.92 | 1.13 | 1.13 | 11 M | 16 M | 0.29 | 0.18 | 0.01 | 0.01 | 3024 | 477 | 0 | 32 |
| 21 | 0 | 1.20 | 1.09 | 1.10 | 1.13 | 7781 K | 10 M | 0.23 | 0.24 | 0.00 | 0.01 | 2160 | 91 | 0 | 31 |
| 22 | 0 | 1.75 | 1.54 | 1.13 | 1.13 | 1120 K | 1322 K | 0.15 | 0.91 | 0.00 | 0.00 | 96 | 32 | 0 | 33 |
| 23 | 0 | 1.47 | 1.30 | 1.13 | 1.13 | 20 M | 21 M | 0.08 | 0.35 | 0.01 | 0.01 | 384 | 57 | 0 | 32 |
| SKT | 0 | 1.33 | 1.17 | 1.13 | 1.13 | 236 M | 315 M | 0.25 | 0.37 | 0.00 | 0.01 | 30192 | 3017 | 0 | 30 |

12 threads:

| Core | SKT | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1.53 | 1.35 | 1.13 | 1.13 | 16 M | 22 M | 0.28 | 0.11 | 0.01 | 0.01 | 3600 | 578 | 0 | 33 |
| 2 | 0 | 2.71 | 2.39 | 1.13 | 1.13 | 3471 K | 5262 K | 0.34 | 0.81 | 0.00 | 0.00 | 912 | 92 | 0 | 34 |
| 3 | 0 | 1.38 | 1.22 | 1.13 | 1.13 | 8782 K | 11 M | 0.23 | 0.34 | 0.00 | 0.01 | 2784 | 339 | 0 | 34 |
| 5 | 0 | 1.55 | 1.36 | 1.13 | 1.13 | 16 M | 22 M | 0.28 | 0.11 | 0.01 | 0.01 | 3504 | 569 | 0 | 34 |
| 8 | 0 | 2.02 | 1.78 | 1.13 | 1.13 | 20 M | 26 M | 0.22 | 0.26 | 0.01 | 0.01 | 6432 | 764 | 0 | 33 |
| 9 | 0 | 2.23 | 1.97 | 1.13 | 1.13 | 23 M | 26 M | 0.10 | 0.38 | 0.01 | 0.01 | 864 | 120 | 0 | 32 |
| 10 | 0 | 1.39 | 1.22 | 1.13 | 1.13 | 9271 K | 11 M | 0.22 | 0.32 | 0.00 | 0.01 | 2544 | 355 | 0 | 33 |
| 11 | 0 | 1.50 | 1.33 | 1.13 | 1.13 | 35 M | 47 M | 0.26 | 0.18 | 0.02 | 0.02 | 3168 | 46 | 0 | 33 |
| 12 | 0 | 2.42 | 2.14 | 1.13 | 1.13 | 32 M | 33 M | 0.04 | 0.28 | 0.01 | 0.01 | 624 | 68 | 0 | 34 |
| 16 | 0 | 1.45 | 1.28 | 1.13 | 1.13 | 36 M | 50 M | 0.27 | 0.14 | 0.02 | 0.02 | 4032 | 70 | 0 | 34 |
| 18 | 0 | 1.37 | 1.21 | 1.13 | 1.13 | 8721 K | 11 M | 0.22 | 0.34 | 0.00 | 0.01 | 1920 | 318 | 0 | 34 |
| 19 | 0 | 2.25 | 1.98 | 1.13 | 1.13 | 5706 K | 7573 K | 0.25 | 0.59 | 0.00 | 0.00 | 576 | 171 | 0 | 33 |
| SKT | 0 | 0.91 | 1.60 | 0.57 | 1.13 | 219 M | 279 M | 0.22 | 0.29 | 0.01 | 0.01 | 31968 | 3516 | 0 | 30 |

6 threads:

| Core | SKT | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0 | 2.35 | 2.01 | 1.17 | 1.18 | 26 M | 35 M | 0.27 | 0.21 | 0.01 | 0.01 | 8784 | 503 | 0 | 36 |
| 3 | 0 | 1.44 | 1.23 | 1.17 | 1.18 | 9198 K | 11 M | 0.22 | 0.34 | 0.00 | 0.01 | 7392 | 253 | 0 | 36 |
| 4 | 0 | 2.72 | 2.33 | 1.17 | 1.18 | 24 M | 25 M | 0.04 | 0.32 | 0.01 | 0.01 | 2160 | 186 | 0 | 36 |
| 7 | 0 | 2.31 | 1.98 | 1.17 | 1.18 | 48 M | 50 M | 0.03 | 0.22 | 0.01 | 0.01 | 1536 | 36 | 0 | 36 |
| 12 | 0 | 2.32 | 1.99 | 1.17 | 1.18 | 28 M | 39 M | 0.28 | 0.19 | 0.01 | 0.01 | 8016 | 453 | 0 | 36 |
| 18 | 0 | 1.08 | 0.93 | 1.17 | 1.18 | 964 K | 1033 K | 0.07 | 0.67 | 0.00 | 0.00 | 1056 | 14 | 0 | 37 |
| SKT | 0 | 0.52 | 1.73 | 0.30 | 1.17 | 142 M | 169 M | 0.16 | 0.24 | 0.01 | 0.01 | 30480 | 1517 | 0 | 32 |

3 threads:

| Core | SKT | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 2.16 | 1.62 | 1.33 | 1.37 | 23 M | 29 M | 0.21 | 0.29 | 0.01 | 0.01 | 22560 | 356 | 1 | 36 |
| 7 | 0 | 3.20 | 2.40 | 1.33 | 1.37 | 3171 K | 4887 K | 0.35 | 0.86 | 0.00 | 0.00 | 3744 | 28 | 0 | 35 |
| 8 | 0 | 1.41 | 1.06 | 1.33 | 1.37 | 1644 K | 1816 K | 0.09 | 0.57 | 0.00 | 0.00 | 2112 | 22 | 0 | 35 |
| SKT | 0 | 0.29 | 1.67 | 0.17 | 1.34 | 32 M | 41 M | 0.21 | 0.53 | 0.00 | 0.00 | 30768 | 453 | 1 | 34 |

|
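One way to sanity-check the pasted tables is to compare the reported L3HIT column against the miss counters. In the socket totals above, L3HIT tracks 1 - L3MISS/L2MISS closely. That relationship is inferred from the numbers in this thread and is an assumption about how this PCM version derives the column, not a documented definition. The sketch below runs the check on the four socket-total rows.

```python
# Socket totals copied from the four PCM captures in the post:
# (active threads, L3MISS in millions, L2MISS in millions, reported L3HIT)
SOCKET_TOTALS = [
    (24, 236, 315, 0.25),
    (12, 219, 279, 0.22),
    (6, 142, 169, 0.16),
    (3, 32, 41, 0.21),
]


def implied_l3_hit(l3_miss, l2_miss):
    """L3 hit ratio implied by the counters, assuming every L2 miss
    is looked up in L3 (an assumption inferred from the data)."""
    return 1.0 - l3_miss / l2_miss


def max_deviation(rows):
    """Largest gap between the implied and the reported L3 hit ratio."""
    return max(abs(implied_l3_hit(l3, l2) - reported)
               for _, l3, l2, reported in rows)


for threads, l3, l2, reported in SOCKET_TOTALS:
    implied = implied_l3_hit(l3, l2)
    print(f"{threads:>2} threads: implied {implied:.2f}, reported {reported:.2f}")

print(f"largest deviation: {max_deviation(SOCKET_TOTALS):.3f}")
```

The small deviations are what rounding of the counters to whole millions would produce, so the tables are at least internally consistent. The check says nothing about whether the counters themselves are being read correctly on this CPU.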
bitsonic Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
"According to this, the cache miss rate is pretty bad on my Xeon 2628v4, no matter how many threads I run. It seems that no matter how many threads I run, all 30MB of cache is allocated to the active threads." That looks too low and doesn't sound normal. Which version of PCM are you using? You must use version 2.11 or above for Xeon E5 v4 support. A lower version will report wrong information. If you are using 2.11, try running another program and see if the cache hit rate is still so low. If it is still low, then something must be wrong. |
Aurum Joined: 12 Jul 17 Posts: 32 Credit: 38,158,977 RAC: 0 |
"People have said that the Rosetta application is not optimized and needs 5MB of L3 cache for each running instance (though I don't really know where they got this number)." We observed it running MIP at WCG. When we dedicated all threads to MIP, fewer results were completed in the same time. We verified this repeatedly for months. Then the project made a statement explaining: "The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers, is pretty hungry when it comes to cache. A single instance of the program fits well in to a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime." https://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,40374_offset,50#569786 I doubt it's been fixed unless Rosetta coders consciously went in and rewrote the offending code. Why won't Baker respond to this question??? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,318 RAC: 24,446 |
"I doubt it's been fixed unless Rosetta coders consciously went in and rewrote the offending code. Why won't Baker respond to this question???" Given that RAM usage can be as high as 1.3GB, I can't see how an optimisation of L3 cache usage would result in anything other than a slight improvement. Look at benchmarks that fit inside a cache vs. benchmarks that require the use of main memory vs. benchmarks that require the use of almost all of the main memory. For the ones that fit in the cache, optimisation of cache usage has a big impact. For benchmarks that make use of main memory, cache optimisations help, but only by a percent or two (if that). By contrast, an SSSE3, SSE4.1 or, better yet, AVX application would result in a huge boost in performance. So there are plenty of other things that need answering that would benefit the project so much more than the few percentage points of improvement that an optimisation for L3 would provide. Grant Darwin NT |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
To say an application is "pretty hungry when it comes to cache", and acknowledge that "This behavior is common for programs that have larger memory requirements" does not seem to paint a path to a specific optimization strategy. Nor does it sound like an assertion that the application is flawed in some way. L2 and L3 cache are limited, shared resources. The more contention for those resources exists in a system, the lower the overall efficiency is going to be. Such is the nature of limited, shared resources in computing environments. How many threads should one optimize for? How much cache per active thread should one presume exists? Which cache management algorithm should one presume the CPU is using? How far ahead should one presume the instruction pipelines are looking? Etc. etc. In general, applications have neither visibility into nor control over how cache memory is utilized in a system at runtime. Rosetta Moderator: Mod.Sense |
fuzzydice555 Joined: 31 Oct 17 Posts: 5 Credit: 2,786,716 RAC: 4,394 |
I am using the latest PCM, downloaded from the git repo. On the 2628v4, all other projects report this same cache hit rate. This processor is an ES version, so it's very possible there's something physically wrong with the CPU. There's a huge amount of memory traffic, which could correspond to the cache misses. I'm also seeing low core utilization:

- Instructions per nominal CPU cycle: 2.24 => corresponds to 56.05 % core utilization over time interval
- Mem read: 5.7 GB/s

Compared with 24 World Community Grid SCC work units:

- Instructions per nominal CPU cycle: 3.31 => corresponds to 82.75 % core utilization over time interval
- Mem read: 0.57 GB/s

Unfortunately, I don't have a non-ES processor to test; I can only check my i5-8550U laptop. It has 8MB of cache, so the cache issue should be much more pronounced. On the 8550U everything seems OK when running 8 threads of Rosetta:

- L3 hit rate: 0.86
- Instructions per nominal CPU cycle: 3.69 => corresponds to 92.21 % core utilization over time interval
- Mem read: 0.00 GB/s (???) |
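The utilization percentages quoted above are consistent with dividing the reported instructions per nominal cycle by a peak of 4 instructions per cycle per physical core. The sketch below checks that against the three figures in the post. The peak of 4 is inferred from the numbers themselves and is labeled as an assumption. The small differences are what rounding of the displayed IPC would produce.

```python
# Assumption inferred from the posted figures: PCM scales utilization
# against a peak of 4 instructions per cycle per physical core.
ASSUMED_PEAK_IPC = 4.0

# (label, instructions per nominal CPU cycle, utilization % reported by PCM)
SAMPLES = [
    ("Xeon 2628v4, Rosetta", 2.24, 56.05),
    ("Xeon 2628v4, WCG SCC", 3.31, 82.75),
    ("i5-8550U, Rosetta", 3.69, 92.21),
]


def core_utilization_percent(ipc, peak_ipc=ASSUMED_PEAK_IPC):
    """Utilization implied by an IPC figure under the assumed peak."""
    return 100.0 * ipc / peak_ipc


for label, ipc, reported in SAMPLES:
    implied = core_utilization_percent(ipc)
    print(f"{label:<22} implied {implied:6.2f} %  reported {reported:6.2f} %")
```

Read this way, "56% utilization" means the physical cores retire a little over half of their theoretical maximum instruction rate, not that they sit idle 44% of the time as an OS utilization figure would suggest.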
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,692,082 RAC: 15,825 |
@bitsonic Nice data digging and testing. I did similar testing about a year ago and also found no significant performance impact from running multiple R@H WUs. I did not look specifically at the use of L3 cache, but at the general effect of the number of simultaneously running R@H tasks on the speed of calculations. There is an impact, of course, but it is very minor. As a benchmark I used average credits per day per CPU core (credit is granted in proportion to successfully calculated decoys), averaging over at least 20-30 WUs for each measurement to smooth out the different work profiles of different WUs. Running 6 WUs in parallel on a 6-core CPU, or 8 WUs on an 8-core CPU, or even 16 WUs on an 8-core CPU gives only about a 10-25% performance hit: e.g. 6 WUs running on a 6-core CPU with 6 MB L3 yield about 5 times more useful work than 1 WU running on the same 6-core CPU with 6 MB L3. All this L3 fuss, with statements like "running only 1 WU is faster compared to running WUs on all cores", sounds like pure nonsense to me. All the data I have seen on my computers and from a few team-mates show the opposite: there are some performance hits from high WU concurrency, but they are minor, and running more threads always increases total CPU throughput as long as it is not constrained by the amount of RAM (no active swapping). Even running WUs on virtual cores (HT/SMT threads) gives some total speed boost: it decreases the speed of each thread significantly but still increases the total CPU throughput of all running threads combined. For example, running 16 R@H WUs on my Ryzen 2700 (8 cores + SMT, 16 MB L3) is ~17 percent faster (more total throughput/more decoys generated) than running 8 R@H WUs on the same system. Yes, there are some problems on heavily loaded machines running 16-32 or more R@H WUs in parallel. But those problems come from very high RAM usage and sometimes very high disk (I/O) load while multiple WUs start up on such systems. They are not related to any issues with CPU caches. |
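The two throughput examples in the post above can be turned into per-task figures with a small calculation. The inputs are the numbers from the post (about 5x total work going from 1 to 6 tasks, and about 17% more total work going from 8 to 16 tasks). The helper name is illustrative, and the calculation assumes every task in a run progresses at the same speed.

```python
def per_task_speed(total_speedup, tasks_before, tasks_after):
    """Speed of one task after the change, relative to one task before it.

    total_speedup is the ratio of total throughput after vs. before.
    Assumes all tasks in a run progress at the same speed.
    """
    return total_speedup * tasks_before / tasks_after


# Figures quoted in the post.
six_core = per_task_speed(total_speedup=5.0, tasks_before=1, tasks_after=6)
smt = per_task_speed(total_speedup=1.17, tasks_before=8, tasks_after=16)

print(f"6 tasks vs 1 on a 6-core CPU : each task runs at {six_core:.0%} "
      f"of single-task speed ({1 - six_core:.0%} hit)")
print(f"16 tasks vs 8 on 8 cores+SMT : each task runs at {smt:.0%} "
      f"of its previous speed, total throughput still up 17%")
```

Both cases show the same pattern described in the thread: each individual task slows down as concurrency rises, while the machine as a whole completes more work.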
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,692,082 RAC: 15,825 |
The average processing rate calculated by BOINC is not relevant for R@H. These numbers are useless because R@H WUs have no fixed length but do have a fixed FLOPS count set on the BOINC server: 80,000 GFLOPs. The rate is only useful for the fixed-work WUs used in most other projects, not in R@H. ANY CPU running R@H with the default 8-hour target runtime will show about 2.5-3 GFLOPS as its average processing rate: 80000/(8*60*60), regardless of the real CPU speed. A Pentium 4 and a Core i9 have similar values because the FLOPS count is fixed and the runtime is fixed too. If you change the target CPU time, you will get a significant change in the "average processing rate" reported by BOINC. Any other differences are random variations; they say nothing about actual CPU speed. Comparing real computation speed for R@H is a hard task. You need to count runtimes and the number of decoys or credits generated during those runtimes while selecting WUs of the same type, or average over a large number of different WUs to smooth out the differences between them. |
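The arithmetic in the post above can be written out as a short sketch. It uses the 80,000 GFLOPs estimate and the 8-hour default target runtime quoted there, and shows how the apparent rate depends only on the target runtime. The function name is illustrative, and the model assumes a task runs for exactly its target time.

```python
FPOPS_EST_GFLOP = 80_000  # fixed per-WU estimate quoted in the post


def apparent_apr_gflops(target_hours, fpops_est_gflop=FPOPS_EST_GFLOP):
    """Average processing rate implied by a fixed FLOPs estimate and a
    fixed runtime. CPU speed does not appear anywhere in the formula."""
    return fpops_est_gflop / (target_hours * 3600)


for hours in (4, 8, 16):
    print(f"target runtime {hours:>2} h -> apparent APR "
          f"{apparent_apr_gflops(hours):.2f} GFLOPS")
```

The 8-hour case gives about 2.78 GFLOPS, which matches the 2.5-3 GFLOPS range in the post and the values in the processing-rate table posted earlier in the thread.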
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,318 RAC: 24,446 |
"Average processing rate calculated by BOINC is not relevant for R@H. These numbers are useless because R@H WUs have no fixed length but have fixed FLOPS count set on the BOINC server = 80 000 GFLOPs."
Guess what? APR is used to determine Credit (amongst a raft of other things as well).
"These numbers are useless because R@H WUs have no fixed length but have fixed FLOPS count set on the BOINC server = 80 000 GFLOPs"
If that is the case, it would explain some of the Credit, APR & Estimated time issues. If an 8 hour Target CPU time Task has a wu.fpops_est of 80,000 GFLOPs, then a 4 hour Target Runtime Task should have a wu.fpops_est of 40,000 GFLOPs, a 16 hour Task 160,000 GFLOPs, etc. Of course more powerful CPUs will do more actual FLOPs during that period than a less powerful CPU, but that's why the CPU benchmarks are used in calculating the work done, to work out APR & Credit, and Estimated completion times.
"You need to count run-times and numbers of decoys or credits generated during that run-times while selecting WUs of a same type."
And Runtime is no good as Runtimes are fixed. Credit is no good as it is based on APR. And are Decoys even a useful indicator for the same type of Task? Say that for a given CPU a 2hr Target CPU time produces 10 Decoys. Would a 4 hour runtime produce 20? 8hrs 40? And for a different CPU on the very same Task, if a 2hr runtime produced 20 Decoys, would a 4 hour runtime produce 40, etc? Is this what happens? Then you've got different Tasks that might produce 10 times as many, or 10 times fewer, Decoys for the same Runtime. Hence why the number of FLOPs done was used to determine the work done by a CPU (although of course not all FLOPs are equal; some have larger or smaller execution overheads than others, so often some sort of scaling factor is required to smooth those things out).
"ANY CPU running R@H with default 8 hours target runtime will give about 2.5-3 GFLOPS as average processing rate: 80000/(8*60*60). Regardless of real CPU speed. Pentium 4 and Core i9 have similar values because FLOPS count is fixed and runtime is fixed too."
Which would explain some of the APR values I've seen on some systems. If the FLOPs values used for the wu.fpops_est were set to proportionally match the Target CPU runtime (e.g. 2hr Runtime: wu.fpops_est, 4hr Runtime: wu.fpops_est * 2, 8hr Runtime: wu.fpops_est * 4, 36hr Runtime: wu.fpops_est * 18) then the APRs would be more representative of computation done, as would the Credit awarded. Tasks that run longer or shorter than the Target CPU time will still cause variations. But the Credit awarded & APR would be a lot more representative of the processing a given CPU has done, and initial Estimated completion times for new Tasks, and particularly new applications, shouldn't be nearly as far out as they presently are.
Grant Darwin NT |
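The arithmetic discussed in this post can be sketched in a few lines of Python. This is only an illustration: it assumes the fixed 80,000 GFLOPs estimate quoted in the thread and assumes the apparent processing rate is simply wu.fpops_est divided by runtime. The function names are hypothetical and are not part of BOINC or Rosetta.

```python
# Illustrative sketch only; names are hypothetical, not BOINC APIs.
FIXED_FPOPS_EST_GFLOP = 80_000  # fixed estimate quoted in the thread (GFLOPs)
DEFAULT_TARGET_HOURS = 8        # default R@H target runtime


def apparent_apr(fpops_est_gflop: float, runtime_hours: float) -> float:
    """Apparent processing rate in GFLOPS: estimated work / elapsed seconds."""
    return fpops_est_gflop / (runtime_hours * 3600)


def scaled_fpops_est(target_hours: float) -> float:
    """Estimate scaled in proportion to the target runtime (the proposal in the post)."""
    return FIXED_FPOPS_EST_GFLOP * target_hours / DEFAULT_TARGET_HOURS


if __name__ == "__main__":
    for hours in (2, 4, 8, 16, 36):
        fixed = apparent_apr(FIXED_FPOPS_EST_GFLOP, hours)
        scaled = apparent_apr(scaled_fpops_est(hours), hours)
        print(f"{hours:>2} h target: fixed estimate -> {fixed:6.2f} GFLOPS, "
              f"scaled estimate -> {scaled:6.2f} GFLOPS")
```

Under these assumptions the fixed estimate produces an apparent rate that changes with the chosen target runtime, while the proportionally scaled estimate gives the same figure for every target runtime.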
bitsonic Send message Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
I assume ES = Engineering Sample, right? An ES is likely still a work-in-progress product, so not every feature may function correctly. Your test result from i5-8550U looks more normal. |
bitsonic Send message Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0 |
That's correct. It is true, in almost every case, that total throughput is higher when you utilize all cores than when you use fewer cores. It is just like a highway: when there are very few cars and only 1-2 lanes are occupied, every car may run at 70 miles/hour. When there are many cars and all lanes are occupied, each car may only run at 60 miles/hour. However, the second case still has higher total throughput. |
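The highway analogy can be put into numbers with a minimal sketch. The speeds come from the post; the lane counts (2 in use versus 4 in use) are assumed purely for illustration, and the function name is hypothetical.

```python
# Hypothetical numbers based on the highway analogy; lane counts are assumed.
def total_throughput(lanes_in_use: int, speed_mph: float) -> float:
    """Aggregate flow, expressed as lane-miles covered per hour."""
    return lanes_in_use * speed_mph


light_traffic = total_throughput(2, 70)  # few cars, each one fast
heavy_traffic = total_throughput(4, 60)  # all lanes busy, each car slower

print(f"light traffic: {light_traffic:.0f} lane-miles/hour")
print(f"heavy traffic: {heavy_traffic:.0f} lane-miles/hour")
```

The same reasoning applies to CPU threads: each thread may run somewhat slower when all of them are busy, yet the sum across threads can still be higher.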
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,384,475 RAC: 9,496 |
The Haswell and Broadwell CPUs with L4 cache (intended for the Iris Pro GPU) might be interesting to look at for this too. I believe the Haswell parts with Iris Pro have direct counterparts without Iris Pro, which makes for a relatively easy comparison. |
HPE Belgium Send message Joined: 27 Mar 20 Posts: 16 Credit: 367,648,439 RAC: 0 |
I have a wide variety of servers with a lot of different configs. My servers have from 8 up to 128 cores (and almost every config and Intel generation in between). All servers have hyperthreading enabled. I will see if I can find time in the next couple of days to run some tests with PCM on the servers that have Windows installed, if you think that can be of any help. My servers are viewable, so if anyone wants to pick some test servers out of them, just let me know which ones I would need to test. |
©2024 University of Washington
https://www.bakerlab.org