Posts by bitsonic

1) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94600)
Posted 16 Apr 2020 by Profile bitsonic
Post:
I have a wide variety of servers with a lot of different configs. My servers have 8 up to 128 cores (and almost every config and Intel generation in between). All servers have hyperthreading enabled.
I will see if I can find time in the next couple of days to run some tests with PCM on the servers that have Windows installed, if you think that would be of any help.

My servers are viewable, so if anyone wants to pick some test servers from the list, just let me know which ones I should test.


I would say start with the CPUs based on the Skylake architecture first, such as the Xeon(R) Gold 5122, Xeon(R) Platinum 8160, and i7-6600U. Then you can compare the results with those from my machine and fuzzydice555's machine. Our machines are Skylake-based.
2) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94432)
Posted 14 Apr 2020 by Profile bitsonic
Post:


All this L3 fuss, with statements like "running only 1 WU is faster compared to running WUs on all cores", sounds like pure nonsense to me. All the data I have seen on my computers and from a few team-mates proves the opposite: there are some performance hits from high WU concurrency, but they are minor, and running more threads always increases total CPU throughput as long as it is not constrained by RAM capacity (no active swapping). Even running WUs on virtual cores (HT / SMT threads) gives some total speed boost: it decreases single-thread speed significantly but still increases the total CPU throughput of all running threads combined.



That's correct. In almost every case, total throughput is higher when you utilize all cores than when you use fewer cores. It is just like a highway: when there are very few cars and only 1-2 lanes are occupied, every car may run at 70 miles/hour. When there are many cars and all lanes are occupied, each car may only run at 60 miles/hour. However, the second case still has higher total throughput.
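
The highway analogy can be put into numbers. The sketch below is only illustrative: the two speeds (70 and 60 miles/hour) come from the post, while the lane counts (2 of 4 lanes in use vs. all 4) are assumptions made up for the example.

```python
def total_throughput(lanes_in_use: int, speed_mph: float) -> float:
    """Aggregate flow in arbitrary units, modelled simply as lanes x speed."""
    return lanes_in_use * speed_mph


# Assumed: a 4-lane highway. The speeds are the ones quoted in the post.
light_traffic = total_throughput(lanes_in_use=2, speed_mph=70)  # few cars, each fast
heavy_traffic = total_throughput(lanes_in_use=4, speed_mph=60)  # all lanes, each slower

print(f"light traffic: {light_traffic}, heavy traffic: {heavy_traffic}")
```

Each car (thread) is slower in the second case, but the road (CPU) as a whole moves more traffic.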
3) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94431)
Posted 14 Apr 2020 by Profile bitsonic
Post:


I am using the latest PCM, downloaded from the git repo. On the 2628v4 all other projects report this same cache hit rate. This processor is an ES version, so it's very possible there's something physically wrong with the CPU.

There's a huge amount of memory traffic, which could correspond to the cache misses. I'm also seeing low core utilization:

Instructions per nominal CPU cycle: 2.24 => corresponds to 56.05 % core utilization over time interval
Mem read: 5.7 GB/s

Compared with 24 world community grid - SCC work units:

Instructions per nominal CPU cycle: 3.31 => corresponds to 82.75 % core utilization over time interval
Mem read: 0.57 GB/s

Unfortunately I don't have a non ES processor to test, I can only check my i5-8550U laptop. It has 8MB cache, so the cache issue should be much more pronounced. On the 8550u everything seems OK when running 8 threads of Rosetta:

L3 hit rate: 0.86
Instructions per nominal CPU cycle: 3.69 => corresponds to 92.21 % core utilization over time interval
Mem read: 0.00 GB/s (???)



I assume ES=Engineering Sample, right? An ES is likely still a work-in-progress product, and not every feature may function correctly. Your test result from the i5-8550U looks more normal.
4) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94245)
Posted 12 Apr 2020 by Profile bitsonic
Post:
According to this, the cache miss rate is pretty bad on my Xeon 2628v4, no matter how many threads I run. It also seems that, regardless of the thread count, all 30MB of cache is allocated to the active threads.
I'm not sure how to interpret these results.

24 threads:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP

0 0 1.93 1.70 1.13 1.13 2391 K 2732 K 0.12 0.63 0.00 0.00 192 8 0 33
1 0 1.15 1.02 1.13 1.13 465 K 578 K 0.20 0.73 0.00 0.00 576 19 0 32
2 0 1.13 0.99 1.13 1.13 21 M 28 M 0.23 0.24 0.01 0.02 2112 79 0 33
3 0 0.93 0.82 1.13 1.13 26 M 35 M 0.27 0.12 0.02 0.03 3504 220 0 34
4 0 1.95 1.72 1.13 1.13 2473 K 2818 K 0.12 0.72 0.00 0.00 48 20 0 32
5 0 1.60 1.41 1.13 1.13 3790 K 5603 K 0.32 0.81 0.00 0.00 384 31 0 33
6 0 0.85 0.75 1.13 1.13 18 M 28 M 0.34 0.21 0.01 0.02 2640 81 0 34
7 0 1.64 1.45 1.13 1.13 3895 K 4715 K 0.17 0.43 0.00 0.00 48 58 0 32
8 0 1.78 1.57 1.13 1.13 1696 K 1923 K 0.12 0.38 0.00 0.00 96 41 0 32
9 0 1.19 1.05 1.13 1.13 20 M 27 M 0.27 0.16 0.01 0.02 528 135 0 31
10 0 1.61 1.43 1.13 1.13 1198 K 1394 K 0.14 0.91 0.00 0.00 4992 42 0 33
11 0 1.27 1.12 1.13 1.13 4696 K 7883 K 0.40 0.59 0.00 0.00 480 32 0 32
12 0 1.09 0.96 1.13 1.13 16 M 21 M 0.24 0.34 0.01 0.01 2976 447 0 33
13 0 1.52 1.36 1.12 1.13 1288 K 1347 K 0.04 0.40 0.00 0.00 48 73 0 32
14 0 1.48 1.31 1.13 1.13 11 M 13 M 0.12 0.43 0.01 0.01 960 104 0 33
15 0 0.96 0.84 1.13 1.13 9561 K 13 M 0.28 0.28 0.01 0.01 576 228 0 34
16 0 0.98 0.86 1.13 1.13 25 M 33 M 0.26 0.22 0.02 0.02 2064 24 0 32
17 0 1.31 1.16 1.13 1.13 3227 K 5229 K 0.38 0.73 0.00 0.00 240 16 0 33
18 0 0.97 0.86 1.13 1.13 8905 K 12 M 0.26 0.34 0.01 0.01 192 267 0 34
19 0 1.05 0.93 1.13 1.13 12 M 17 M 0.29 0.16 0.01 0.01 1872 435 0 32
20 0 1.05 0.92 1.13 1.13 11 M 16 M 0.29 0.18 0.01 0.01 3024 477 0 32
21 0 1.20 1.09 1.10 1.13 7781 K 10 M 0.23 0.24 0.00 0.01 2160 91 0 31
22 0 1.75 1.54 1.13 1.13 1120 K 1322 K 0.15 0.91 0.00 0.00 96 32 0 33
23 0 1.47 1.30 1.13 1.13 20 M 21 M 0.08 0.35 0.01 0.01 384 57 0 32
---------------------------------------------------------------------------------------------------------------
SKT 0 1.33 1.17 1.13 1.13 236 M 315 M 0.25 0.37 0.00 0.01 30192 3017 0 30

12 threads:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP

1 0 1.53 1.35 1.13 1.13 16 M 22 M 0.28 0.11 0.01 0.01 3600 578 0 33
2 0 2.71 2.39 1.13 1.13 3471 K 5262 K 0.34 0.81 0.00 0.00 912 92 0 34
3 0 1.38 1.22 1.13 1.13 8782 K 11 M 0.23 0.34 0.00 0.01 2784 339 0 34
5 0 1.55 1.36 1.13 1.13 16 M 22 M 0.28 0.11 0.01 0.01 3504 569 0 34
8 0 2.02 1.78 1.13 1.13 20 M 26 M 0.22 0.26 0.01 0.01 6432 764 0 33
9 0 2.23 1.97 1.13 1.13 23 M 26 M 0.10 0.38 0.01 0.01 864 120 0 32
10 0 1.39 1.22 1.13 1.13 9271 K 11 M 0.22 0.32 0.00 0.01 2544 355 0 33
11 0 1.50 1.33 1.13 1.13 35 M 47 M 0.26 0.18 0.02 0.02 3168 46 0 33
12 0 2.42 2.14 1.13 1.13 32 M 33 M 0.04 0.28 0.01 0.01 624 68 0 34
16 0 1.45 1.28 1.13 1.13 36 M 50 M 0.27 0.14 0.02 0.02 4032 70 0 34
18 0 1.37 1.21 1.13 1.13 8721 K 11 M 0.22 0.34 0.00 0.01 1920 318 0 34
19 0 2.25 1.98 1.13 1.13 5706 K 7573 K 0.25 0.59 0.00 0.00 576 171 0 33
---------------------------------------------------------------------------------------------------------------
SKT 0 0.91 1.60 0.57 1.13 219 M 279 M 0.22 0.29 0.01 0.01 31968 3516 0 30

6 threads:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP

2 0 2.35 2.01 1.17 1.18 26 M 35 M 0.27 0.21 0.01 0.01 8784 503 0 36
3 0 1.44 1.23 1.17 1.18 9198 K 11 M 0.22 0.34 0.00 0.01 7392 253 0 36
4 0 2.72 2.33 1.17 1.18 24 M 25 M 0.04 0.32 0.01 0.01 2160 186 0 36
7 0 2.31 1.98 1.17 1.18 48 M 50 M 0.03 0.22 0.01 0.01 1536 36 0 36
12 0 2.32 1.99 1.17 1.18 28 M 39 M 0.28 0.19 0.01 0.01 8016 453 0 36
18 0 1.08 0.93 1.17 1.18 964 K 1033 K 0.07 0.67 0.00 0.00 1056 14 0 37
---------------------------------------------------------------------------------------------------------------
SKT 0 0.52 1.73 0.30 1.17 142 M 169 M 0.16 0.24 0.01 0.01 30480 1517 0 32
---------------------------------------------------------------------------------------------------------------

3 threads:

Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | L3OCC | LMB | RMB | TEMP

4 0 2.16 1.62 1.33 1.37 23 M 29 M 0.21 0.29 0.01 0.01 22560 356 1 36
7 0 3.20 2.40 1.33 1.37 3171 K 4887 K 0.35 0.86 0.00 0.00 3744 28 0 35
8 0 1.41 1.06 1.33 1.37 1644 K 1816 K 0.09 0.57 0.00 0.00 2112 22 0 35
---------------------------------------------------------------------------------------------------------------
SKT 0 0.29 1.67 0.17 1.34 32 M 41 M 0.21 0.53 0.00 0.00 30768 453 1 34


That looks too low and doesn't sound normal. Which version of PCM are you using? You must use version 2.11 or above to support the Xeon E5 v4; a lower version will report wrong information. If you are using 2.11, try running another program and see whether the cache hit rate is still so low. If it is, then something must be wrong.
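
One way to check whether a pasted table like the one above is at least internally consistent is to recompute the hit ratio from the miss counters. The sketch below is a rough illustration, not part of PCM: `parse_row` and `implied_l3_hit` are hypothetical helpers, and the relation L3HIT ≈ 1 − L3MISS/L2MISS is an assumption inferred from the posted numbers, not taken from PCM documentation. The three rows are copied from the "3 threads" table.

```python
COLUMNS = ["Core", "SKT", "EXEC", "IPC", "FREQ", "AFREQ", "L3MISS", "L2MISS",
           "L3HIT", "L2HIT", "L3MPI", "L2MPI", "L3OCC", "LMB", "RMB", "TEMP"]
SUFFIX = {"K": 1e3, "M": 1e6, "G": 1e9}


def parse_row(line: str) -> dict:
    """Parse one flattened per-core row into a dict keyed by column name."""
    values = []
    for token in line.split():
        if token in SUFFIX:
            values[-1] *= SUFFIX[token]  # "23 M" -> 23000000.0
        else:
            values.append(float(token))
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def implied_l3_hit(row: dict) -> float:
    """Assumed relation: L2 misses that did not miss L3 count as L3 hits."""
    return 1.0 - row["L3MISS"] / row["L2MISS"]


three_thread_rows = [
    "4 0 2.16 1.62 1.33 1.37 23 M 29 M 0.21 0.29 0.01 0.01 22560 356 1 36",
    "7 0 3.20 2.40 1.33 1.37 3171 K 4887 K 0.35 0.86 0.00 0.00 3744 28 0 35",
    "8 0 1.41 1.06 1.33 1.37 1644 K 1816 K 0.09 0.57 0.00 0.00 2112 22 0 35",
]

parsed = [parse_row(line) for line in three_thread_rows]
for row in parsed:
    print(f"core {int(row['Core'])}: reported L3HIT {row['L3HIT']:.2f}, "
          f"implied {implied_l3_hit(row):.2f}")
```

If the reported and implied values agree, the low hit ratio is at least what the miss counters say; it does not tell you whether the counters themselves are trustworthy on this CPU.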
5) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94074)
Posted 10 Apr 2020 by Profile bitsonic
Post:
1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Have they looked into other factors and confirmed that those do not contribute to the problem? It is true that the higher the concurrency, the more intensely threads compete with each other for cache resources, which may lead to a lower average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:
- The more cores your CPU has, the more likely it is that RAM latency becomes a bottleneck. Think about dual-channel DDR 2400 serving a dual-core CPU vs. serving a quad-core CPU (assuming each core has the same specification). Of course the memory queue will be longer for the quad-core CPU, because 4 cores are competing for the same memory bandwidth instead of 2, and each core is likely to wait longer before memory serves it. It is like 2 people queuing at a ticket office vs. 4 people: people in the latter queue will wait longer. While a CPU core is waiting, it is effectively idle. Now think about a 16-core CPU and a 32-core CPU: do you increase memory bandwidth 8 times and 16 times compared to a dual-core CPU? If not, then you will definitely see each instance run slower.

I don't (think I) suffer the problem myself (8 threads), but the complaint came from people with 32+ threads, so I'm not sure your results based on a 2-core, 4-thread CPU help answer their question.


Guess I should use a more extreme example huh :)
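
A more extreme version of the example is easy to tabulate. This is a back-of-the-envelope sketch under stated assumptions: "DDR 2400" is taken to mean 2400 mega-transfers per second over 64-bit (8-byte) channels, the result is a theoretical peak, and splitting it evenly between cores is a simplification (real contention depends on each thread's access pattern). The function names are made up for this illustration.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (assumes 64-bit channels)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def per_core_share_gbs(total_gbs: float, cores: int) -> float:
    """Even split of the total bandwidth between cores (a simplification)."""
    return total_gbs / cores


dual_channel_peak = peak_bandwidth_gbs(2400, channels=2)
print(f"dual-channel peak: {dual_channel_peak:.1f} GB/s")
for cores in (2, 4, 16, 32):
    share = per_core_share_gbs(dual_channel_peak, cores)
    print(f"{cores:2d} cores: {share:5.2f} GB/s per core")
```

With the memory configuration held fixed, the per-core share shrinks in proportion to the core count, which is the point of the ticket-office analogy.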
6) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 94035)
Posted 10 Apr 2020 by Profile bitsonic
Post:

Unfortunately, statistics are always open to interpretation. "If you torture the data long enough, it will confess" -- Economist Ronald H Coase

1. Many contributors have reported that if they run small numbers of Rosetta instances concurrently they see less of a problem but as the number of concurrent instances increases the problem becomes more pronounced. This observation seems to be backed by your data where the L3 hit ratio drops between 4 concurrent and 8 concurrent. What happens when concurrency jumps to 16 threads, 32 threads, etc?

2. PHYSICAL CORE IPC data indicates ~56% utilization. Seems like it should be closer to 100%. Why so low? Does it indicate that the cores are waiting significantly?

3. Memory READ/WRITE numbers need more context. Where is that measured? Additionally, I assume those are main memory stats and not cache memory stats based on the size of the numbers. I would conclude that if the L3 cache needed to be refreshed with new data due to misses, it would result in high read rates just as your data indicates. Memory transfer rates are not indicative of anything other than that data is being transferred, and they definitely don't indicate memory utilization. I would suggest that they more rightly indicate memory channel utilization. I submit that in a highly optimized L3 cache utilization situation, there would be little to no memory transfer activity; however, just because the transfer rate isn't at full utilization doesn't support the claim that there isn't an L3 cache problem.

4. Are you making the assumption that the Rosetta program has a reasonably static processing profile? Could the program have higher L3 cache demands at various points in the processing stream and you just happened to catch the program at a low point in the demand?

5. You make the statement: "I don't know whether Rosetta's algorithm itself is optimized, but looks like it is pretty effective in keeping CPU busy." I ask, based on what? If you are using CPU utilization reported from the OS, I contend that isn't a reliable number in this case. That number is based off the OS wait bits that never get set in this case. The thread is dispatched to the CPU and as far as the OS is concerned, it is active on the processor. If the core is waiting on memory transfers between cache or main memory, the OS doesn't see that. Only way the OS knows about a thread waiting is if it gives up the CPU voluntarily or is interrupted by a higher priority task. Only way the OS knows if the CPU is waiting is if the wait bit for that processor is set.

Not saying the data isn't useful or pertinent. Just be careful about drawing conclusions from such a small amount of data. In other words, it's hard to run a program for 100 seconds and then say everything is OK



1. When you say people report that the problem becomes more pronounced as the number of concurrent instances increases, what exactly do they see inside the system? Do they have evidence showing that L3 is the major culprit? Have they looked into other factors and confirmed that those do not contribute to the problem? It is true that the higher the concurrency, the more intensely threads compete with each other for cache resources, which may lead to a lower average L3 hit ratio. This is true for any application, not just Rosetta. However, there are multiple other factors that can also affect performance when you increase concurrency. For example:
- The more cores your CPU has, the more likely it is that RAM latency becomes a bottleneck. Think about dual-channel DDR 2400 serving a dual-core CPU vs. serving a quad-core CPU (assuming each core has the same specification). Of course the memory queue will be longer for the quad-core CPU, because 4 cores are competing for the same memory bandwidth instead of 2, and each core is likely to wait longer before memory serves it. It is like 2 people queuing at a ticket office vs. 4 people: people in the latter queue will wait longer. While a CPU core is waiting, it is effectively idle. Now think about a 16-core CPU and a 32-core CPU: do you increase memory bandwidth 8 times and 16 times compared to a dual-core CPU? If not, then you will definitely see each instance run slower.
- The design of the OS kernel also affects performance and the L3 hit ratio. I saw a slightly higher L3 hit ratio after I upgraded from Win10 1709 to 1909, testing with the same program. What about Linux? What about comparing the efficiency of Windows and Linux when dealing with 32 threads? They could be different. But this is an OS problem, not Rosetta's problem.
- The design of the CPU microarchitecture. This is easy to understand: even with the same cache size, the hit ratio of an Intel CPU will not be the same as that of an AMD CPU, and the hit ratios of different generations of the same brand will differ too.
So, again, we need to be specific about what happens when running a large number of concurrent instances. Hopefully someone can run the CPU performance counter monitor on a 32-core CPU and post the result here.

2. A utilization of 56% is a very normal number. Even if we had a 100% L3 hit ratio, utilization would not be 100%. On a cache miss, the core stalls while it waits. The response time of each cache level is different: on my computer, the latency of L1 cache is around 1ns, L2 around 3ns, L3 around 15ns, and RAM around 60-70ns. If L1 misses, the CPU waits a few cycles to fetch data from L2. If L2 misses, it waits more cycles for L3. If L3 misses, it waits hundreds of cycles for RAM. All these waits contribute to lower utilization. Even if we cached everything in L1, this number would not be 100%. The highest number I have ever seen is from the CPU stability test of AIDA64 (stress FPU only). Essentially it is a very small piece of code, highly optimized to push the CPU to its limit, and it only reaches around 80% utilization. Read the Intel article I quoted in my last post and you will understand what I am saying.

3. Memory read/write there refers to how many GB of data have been read from and written to main memory (RAM). What I am talking about here is memory bandwidth utilization. It is true that an L3 refresh results in data being read from RAM. In fact, modern CPUs always try to prefetch data from memory into cache so that when the CPU needs the data, it is already there. But in general, running the same code, the higher the cache hit rate, the lower the memory read/write traffic. Occasionally, when I ran the performance capture at 10-second intervals, I saw the L3 hit ratio drop to 60% and memory bandwidth utilization rise quite a lot. So I think my point is valid in this specific case.

4. It could be. So I ran the capture again for 10,000 seconds, which I believe is long enough. I got 82% on my i3, very close to my first few tests. Actually, I have run this test a few times in the last 2 days and saw similar numbers. Again, this is for Rosetta 4.12 crunching Covid-19 WUs. I ran out of non-4.12 WUs on my i3, so I did not test how it looks with other Rosetta versions.

5. Again, I am using the report from the performance counter monitor. It comes from the CPU cores, not from the OS: basically, those are physical counters built into the CPU cores, so the numbers should be very accurate. We all know that the CPU utilization reported by the OS does not reflect reality. This is the reason Intel developed that software: to help technicians reveal where the real performance bottleneck is. To clarify, when I said "I don't know whether Rosetta's algorithm itself is optimized", I meant that maybe there is room to improve the protein design/folding simulation algorithm so that it completes the same simulation with fewer computation cycles. When I said "looks like it is pretty effective in keeping CPU busy", I meant that the code itself can push CPU utilization pretty high compared to other applications.

Lastly, I agree with you that I should not draw a broad conclusion from a small amount of testing on my own computers only. I should say my conclusion applies only to my machines running Rosetta 4.12 crunching Covid-19 WUs. But at least it proves that on some machines, L3 cache is not the bottleneck. Other people may see different results on their machines. Use it as a reference only.
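
The utilization and latency figures from point 2 can be reproduced with a little arithmetic. This is a sketch under assumptions: the ceiling of 4 instructions per cycle is inferred from the PCM lines quoted in this thread (for example 3.31 IPC shown as 82.75 %), the 3.9 GHz clock is the i3-7100 from the first post, and the function names are made up for the illustration.

```python
MAX_IPC = 4  # assumed per-core ceiling, inferred from the quoted PCM output


def utilization_pct(ipc: float, max_ipc: float = MAX_IPC) -> float:
    """Core utilization as the percentage of the assumed maximum IPC."""
    return 100.0 * ipc / max_ipc


def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds into core clock cycles."""
    return latency_ns * clock_ghz


# PCM prints 56.70 % for the rounded value 2.27 because it uses the unrounded IPC.
print(f"IPC 2.27 -> {utilization_pct(2.27):.2f} % utilization")

# Latencies quoted in the post, converted at an assumed 3.9 GHz clock.
for level, ns in (("L1", 1), ("L2", 3), ("L3", 15), ("RAM", 60), ("RAM", 70)):
    print(f"{level:>3}: {ns:3d} ns is about {latency_in_cycles(ns, 3.9):.0f} cycles")
```

At that clock, 60-70ns of RAM latency works out to well over two hundred cycles per miss, which matches the "hundreds of cycles" described above.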
7) Message boards : News : Help in the fight against COVID-19! (Message 93962)
Posted 9 Apr 2020 by Profile bitsonic
Post:
Dear Brian Coventry et alia,
1. Your custom protein binding the Spike protein looks like it will put neutralizing IgGs to shame!!! Very impressive!!! {As an aside each dose of this protein therapeutic will have to be injected as it could never survive a transit of the stomach and upper GI tract with the low pH and proteolytic enzymes that would reduce it to amino acids.}

2. I'm laboring under the assumption that Rosetta code incorrectly programmed the use of the L3 cache and will bottleneck if too many Rosetta WUs are running simultaneously. I limited my use to one WU per 5 MB of L3 cache. Exceeding that limitation slows the entire CPU over 60%. This means that my fleet of Xeon E5s are running Rosetta at less than 20% capacity. Has this been fixed and is it safe to run full force now???

3. I believe that your statement implies that all WUs are contributing to Covid-19 research regardless of whether you include "Covid-19" in the WU name and that minirosetta is doing Covid-19 work as well. Is that what you mean???

4. I don't know if I'm running ARM64 Rosetta apps or not. I see from your Applications page there are numerous versions, but when I look at the Properties of a given WU I'm running, it says the version is 3.76, 4.07, 4.08 or 4.12. What are the Linux-ARM platforms???

5. I suggest you tether these blocking proteins together like an IgM. This would more readily mark them for destruction. Also, it appears that your blocking protein is much larger than an IgG so when it binds a Spike it also blocks neighboring Spikes (like an umbrella) from binding ACE2 and being invaginated.

{Please use black text on white background for your forum to maximize contrast. This faded gray text really strains my old eyes so I never read your forums. I would not have seen this announcement except that you were nice enough to send out a BOINC Notice. Thank you in advance :-}



Hi, I did some testing and the results show that L3 cache is not a problem. I put more details in this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13764#93961
8) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 93961)
Posted 9 Apr 2020 by Profile bitsonic
Post:
I know that this has been discussed quite a few times in this forum. People have said that the Rosetta application is not optimized and needs 5MB of L3 cache for each running instance (though I don't really know where they got this number). But it looks to me like no one has ever shown any data to prove it. So I decided to figure it out.

Intel used to provide software called Performance Counter Monitor (see this article: https://software.intel.com/en-us/articles/intel-performance-counter-monitor). Basically, this software can read some hardware-level counters inside the CPU and provide performance insights, including the cache hit rate. So I did a test on my desktop:
CPU: i3-7100, 2 cores 4 threads, 3.9GHz, 3MB L3 cache
RAM: 16GB DDR 2400 Dual Channel
OS: Win10 1909

I ran 4 instances of Rosetta 4.12 with Covid-19 WUs. I closed all other background applications that might affect the result. Then I enabled the monitor and captured for 100 seconds. Here is the result. Column L3HIT is the L3 cache hit rate; L2HIT is the L2 cache hit rate.

---------------------------------------------------------------------------------------------------------------

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 (read) cache misses
L2MISS: L2 (read) cache misses (including other core's L2 cache *hits*)
L3HIT : L3 (read) cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3MPI : number of L3 (read) cache misses per instruction
L2MPI : number of L2 (read) cache misses per instruction
READ : bytes read from main memory controller (in GBytes)
WRITE : bytes written to main memory controller (in GBytes)
IO : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
energy: Energy in Joules


Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

0 0 1.07 1.07 0.99 1.00 79 M 436 M 0.82 0.78 0.00 0.00 29
1 0 1.09 1.10 0.99 1.00 69 M 435 M 0.84 0.78 0.00 0.00 29
2 0 1.17 1.17 0.99 1.00 55 M 352 M 0.84 0.82 0.00 0.00 30
3 0 1.18 1.19 1.00 1.00 43 M 321 M 0.86 0.83 0.00 0.00 30
---------------------------------------------------------------------------------------------------------------
SKT 0 1.13 1.13 0.99 1.00 247 M 1545 M 0.84 0.80 0.00 0.00 29
---------------------------------------------------------------------------------------------------------------
TOTAL * 1.13 1.13 0.99 1.00 247 M 1545 M 0.84 0.80 0.00 0.00 N/A

Instructions retired: 1764 G ; Active cycles: 1556 G ; Time (TSC): 391 Gticks ; C0 (active,non-halted) core residency: 99.73 %

C1 core residency: 0.27 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;

PHYSICAL CORE IPC : 2.27 => corresponds to 56.70 % utilization for cores in active state
Instructions per nominal CPU cycle: 2.26 => corresponds to 56.38 % core utilization over time interval
SMI count: 0
---------------------------------------------------------------------------------------------------------------
MEM (GB)->| READ | WRITE | IO | CPU energy |
---------------------------------------------------------------------------------------------------------------
SKT 0 181.87 37.36 2.18 2912.05
---------------------------------------------------------------------------------------------------------------

As you can see, the L3 cache hit rate is 84% on average over 100 seconds. I repeated the test a few times. The result varies around 80%; the lowest number I saw is 75% and the highest is 89%. This is actually a very high hit rate based on my observation. I can say that most other BOINC applications have a lower L3 hit rate than this. For example, I used to run SETI@home (optimized version), and its L3 hit rate is around 70%; Milkyway@home is a little higher but still below 80%. The only project that has a higher hit rate is Collatz Conjecture. I am not surprised by this, as Collatz's algorithm is very simple (basically the calculation of 3*N+1 and N/2). So the code footprint should be very small and may even fit into L2 cache.
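
To show how small the core of the Collatz calculation mentioned above is, here is a minimal sketch of the 3*N+1 and N/2 iteration. It is written only for illustration; it is not the Collatz Conjecture project's actual code, and the name `collatz_steps` is made up here.

```python
def collatz_steps(n: int) -> int:
    """Count the iterations needed to reach 1 from n under the 3*N+1 / N/2 rule."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Odd numbers go to 3*N+1, even numbers are halved.
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps


print(collatz_steps(27))
```

The whole loop is a handful of integer operations on one variable, so both the code and its working data are tiny.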

Another number that proves this is the memory traffic (see the bottom of the output above). It shows 181.87GB read & 37.36GB written in 100 seconds. That's around 2.19GB/second in total. This is a very low number considering the CPU is running at full speed. Remember that dual-channel DDR 2400 can provide close to 40GB/s of bandwidth. So basically the memory bandwidth utilization is very low. This also proves that most memory reads/writes have been served by the CPU caches.
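
The memory traffic arithmetic can be written out as follows. The read and write totals and the 100-second interval come from the PCM output above; the peak figure is an assumption (2400 mega-transfers per second, 8 bytes per transfer, 2 channels), and the variable names are made up for the illustration.

```python
read_gb = 181.87    # READ column from the PCM output
write_gb = 37.36    # WRITE column from the PCM output
interval_s = 100    # capture length in seconds

traffic_gbs = (read_gb + write_gb) / interval_s

# Assumed theoretical peak for dual-channel DDR 2400 with 64-bit channels.
peak_gbs = 2400e6 * 8 * 2 / 1e9

bandwidth_utilization_pct = 100.0 * traffic_gbs / peak_gbs

print(f"traffic: {traffic_gbs:.2f} GB/s of about {peak_gbs:.1f} GB/s peak "
      f"({bandwidth_utilization_pct:.1f} % of peak)")
```

Under these assumptions the measured traffic is only a few percent of the theoretical peak.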

I also tested my workstation at work:
CPU: i7 9700K, 8 cores 8 threads, OC to 4.2GHz, 12MB L3 cache
RAM: 16GB DDR4 2666 Dual Channel
OS: Win10 1909

Running 8 Rosetta instances, I got an average L3 hit rate of 77%. Pretty close.

Conclusion: According to the findings above, it doesn't seem that L3 cache is a bottleneck for Rosetta, at least not for 4.12. I don't know whether Rosetta's algorithm itself is optimized, but looks like it is pretty effective in keeping CPU busy.

Lastly, BTW, for Intel's current CPU microarchitectures (i.e. gen 6, 7, 8, 9), each physical core can only use up to 2MB of L3 cache. So even if you run only one Rosetta instance, it will not benefit from an L3 cache larger than 2MB.

I hope you find this helpful.

Michael Wang
9) Message boards : News : Help in the fight against COVID-19! (Message 93870)
Posted 8 Apr 2020 by Profile bitsonic
Post:
Greetings from China and a big thank you to the team! I have been with SETI@home since 1999. Now I am putting my computers to work on another meaningful thing here. My country has been suffering from this virus. I am donating all my computing power here. I hope we will eventually help scientists find a vaccine.






©2024 University of Washington
https://www.bakerlab.org