L3 Cache is not a problem, CPU Performance Counter proves it

Profile bitsonic

Joined: 21 Mar 20
Posts: 9
Credit: 1,680,354
RAC: 0
Message 94600 - Posted: 16 Apr 2020, 10:16:22 UTC - in response to Message 94578.  

I have a wide variety of servers with a lot of different configs. My servers have 8 to 128 cores (and almost every config and Intel generation in between). All servers have hyperthreading enabled.
I will see if I can find time to run some tests with PCM in the next couple of days on the servers that have Windows installed, if you think that could be of any help.

My servers are viewable, so if anyone wants to pick some test servers out of the list, just let me know which ones I would need to test.


I would say start with the CPUs with the Skylake architecture first, such as the Xeon(R) Gold 5122, Xeon(R) Platinum 8160 and i7-6600U. Then you can compare with the results from my machine and fuzzydice555's machine; our machines are Skylake based.
ID: 94600
fuzzydice555

Joined: 31 Oct 17
Posts: 5
Credit: 2,786,716
RAC: 4,394
Message 94601 - Posted: 16 Apr 2020, 10:23:03 UTC - in response to Message 94600.  

Yep, one machine with a lot of cores. On low core count systems the L3 hit rate seems pretty good. One addition: I've tested on Linux, but I doubt this will influence the results.
ID: 94601
bkil

Joined: 11 Jan 20
Posts: 97
Credit: 4,433,288
RAC: 0
Message 94641 - Posted: 16 Apr 2020, 22:57:09 UTC - in response to Message 94415.  

Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.
ID: 94641
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,581
RAC: 24,454
Message 94645 - Posted: 17 Apr 2020, 0:34:28 UTC - in response to Message 94641.  
Last modified: 17 Apr 2020, 0:40:47 UTC

Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.
I've got 2 almost identical i7-8700K systems, one with Hyper-Threading on, the other with it off (the HT-off system uses the iGPU to drive the monitor).

CPUID HWMonitor reports (peak power; averages are about 3 W less):

             HT on      HT off
  Package    83.26 W    68.74 W
  IA cores   71.60 W    57.47 W
  Uncore     11.86 W    13.62 W


The HT-off system has been running for about half the time of the other system, but at the 10 day mark the RACs for each system were:

HT on: 6,000
HT off: 4,400
Grant
Darwin NT
ID: 94645
bkil

Joined: 11 Jan 20
Posts: 97
Credit: 4,433,288
RAC: 0
Message 94655 - Posted: 17 Apr 2020, 5:59:47 UTC - in response to Message 94645.  

Thank you for the data point. Disabling HT in the BIOS may or may not also be doing something else to reduce power that much (maybe some higher-order logic?). So that boils down to a 36% credit gain with HT and at least a 13% credit/watt gain on your i7-8700K (even more if we consider total system power).
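
For reference, a quick back-of-the-envelope check of those percentages, using only the figures Grant posted above (RAC 6,000 vs 4,400 and package power 83.26 W vs 68.74 W); a minimal Python sketch, nothing else assumed:

# Grant's i7-8700K pair from message 94645: HT on vs HT off
rac_on, rac_off = 6000, 4400
pkg_on, pkg_off = 83.26, 68.74

credit_gain = rac_on / rac_off - 1                       # ~0.36 -> ~36% more credit with HT
power_gain = pkg_on / pkg_off - 1                        # ~0.21 -> ~21% more package power
cpw_gain = (rac_on / pkg_on) / (rac_off / pkg_off) - 1   # ~0.13 -> ~13% more credit per watt
print(f"{credit_gain:.0%} credit, {power_gain:.0%} power, {cpw_gain:.0%} credit/W")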

Interestingly, I've noticed elevated temperatures on a Haswell as well when enabling HT, so I'd need to do some measurements too. At first I thought it was some kind of anomaly.

Also, I usually assume a 30-50% HT gain in my approximations, but maybe I need to revise that formula a bit, probably depending on the architecture.
ID: 94655
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,581
RAC: 24,454
Message 94656 - Posted: 17 Apr 2020, 6:31:32 UTC - in response to Message 94655.  

Interestingly, I've noticed elevated temperatures on a Haswell as well when enabling HT, so I'd need to do some measurements too. At first I thought it was some kind of anomaly.
It's working harder, so it's using more current, so it gets hotter.
Grant
Darwin NT
ID: 94656
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,692,082
RAC: 15,825
Message 97689 - Posted: 27 Jun 2020, 16:24:19 UTC - in response to Message 94418.  
Last modified: 27 Jun 2020, 17:10:25 UTC

And Runtime is no good as Runtimes are fixed. Credit is no good as it is based on APR.
And are Decoys even a useful indicator for the same type of Task? For a given CPU, a 2 hr Target CPU time produces 10 Decoys. Would a 4 hour runtime produce 20? 8 hrs, 40?
And for a different CPU on the very same Task, if a 2 hr runtime produced 20 Decoys, would a 4 hour runtime produce 40, etc.? Is this what happens?

Then you've got different Tasks that might produce 10 times as many, or 10 times fewer, Decoys for the same Runtime. Hence why the number of FLOPs done was used to determine the work done by a CPU (although of course not all FLOPs are equal; some have larger or smaller execution overheads than others, so some sort of scaling factor is often required to smooth those things out).


Sorry for the late reply (I forgot to subscribe to the thread initially).

Runtimes are NOT fixed. The target CPU time is fixed in the settings, yes, but actual run times can vary significantly from it. Some tasks end prematurely, usually if there are errors during processing or the WU hits its internal "max decoy limit" (each WU carries an instruction to stop processing and send its results to the server once a set number of decoys has been generated, even if it has not reached the target CPU time).
In the other direction, some WUs exceed the target CPU time significantly. This usually happens when a WU works on really hard/big models, where generating one decoy can take a few hours of CPU work. The target CPU time is only checked between decoys; the CPU time trigger does not interrupt a decoy that has already started until it fully finishes, or until the watchdog kicks in (usually set to the target CPU time + 4 hours) and aborts the task.
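
A rough sketch of that control flow in Python, just to illustrate what I described above (the helpers and values are illustrative, not actual Rosetta code; in the real client the watchdog runs in parallel and can abort mid-decoy):

import random

def generate_one_decoy():
    # Stand-in for the real work: pretend one decoy costs 0.5-3 CPU-hours.
    return random.uniform(0.5, 3.0) * 3600

def run_work_unit(target_cpu_time, max_decoys, watchdog_margin=4 * 3600):
    cpu_time, decoys = 0.0, 0
    while cpu_time < target_cpu_time and decoys < max_decoys:
        # The target time is only checked *between* decoys, so one hard model
        # can push cpu_time well past target_cpu_time before anything notices.
        cpu_time += generate_one_decoy()
        decoys += 1
        if cpu_time > target_cpu_time + watchdog_margin:
            return "aborted by watchdog", decoys
    return "reported", decoys   # credit then follows the reported decoy count

print(run_work_unit(target_cpu_time=8 * 3600, max_decoys=99))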

That is why I count actual CPU time: take some (more is better) completed WUs, sum up all the CPU time used by them, and sum up all the credit granted. Divide the sum of credit by the sum of CPU time consumed and you get a fairly accurate estimate of real host performance, without waiting a LONG time for the average indicator (RAC) to stabilize.
I usually grab (copy & paste) all recent WUs from the result tables into an Excel/Calc spreadsheet.
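
The same calculation as a minimal Python sketch instead of a spreadsheet (the (CPU seconds, credit) pairs below are made-up placeholders, not real results):

# Each entry: (CPU seconds consumed, credit granted) for one completed WU,
# copied from the host's result table on the website. Placeholder values.
completed_wus = [(28750, 312.4), (29110, 298.7), (27980, 305.1)]

total_cpu_s = sum(cpu for cpu, _ in completed_wus)
total_credit = sum(credit for _, credit in completed_wus)

# Credit per CPU-hour: a quick performance estimate that does not require
# waiting weeks for RAC to stabilize. More WUs -> a more reliable number.
print(round(total_credit / (total_cpu_s / 3600), 1), "credit per CPU-hour")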

And about decoys: based on my observations, yes, there is an almost linear relation. E.g. double the CPU runtime of a WU and it will produce about twice the number of decoys (with the same type of WU and the same hardware, of course).
Moreover, the number of decoys generated is the main factor in calculating the credit for a successfully completed task on the server after reporting. It follows a simple formula:
Credit granted = decoy count in the reported WU x "price" of one decoy.
Host CPU benchmarks and APR are used to determine that "decoy price", but the server uses average values collected from the many hosts (not sure how many, probably all) contributing to the same target/work type. It is NOT based on BOINC WU "wingmen" (that scheme is used in WCG, for example). For R@H it is all hosts receiving WUs of the same type/batch (usually hundreds or even a few thousand hosts on large batches), so any "anomalies" in benchmarks and APR are smoothed out by the large-scale averaging.

But for a particular host, only the number of successfully generated decoys determines how much credit it receives for its completed WUs.
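
Put as a tiny sketch (the decoy "price" below is a purely illustrative number; the server derives the real one from the benchmark/APR averaging over the whole batch described above):

# Credit for one reported WU: decoy count x per-decoy "price".
decoy_price = 1.5                      # illustrative; set server-side per target/batch
decoys_reported = 40                   # decoys this host generated in this WU
print(decoys_reported * decoy_price)   # 60.0 credits granted for this WU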
ID: 97689
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,692,082
RAC: 15,825
Message 97690 - Posted: 27 Jun 2020, 16:54:11 UTC - in response to Message 94418.  


ANY CPU running R@H with the default 8-hour target runtime will show about 2.5-3 GFLOPS as its average processing rate: 80,000/(8*60*60), regardless of real CPU speed. A Pentium 4 and a Core i9 have similar values because the FLOPs count is fixed and the runtime is fixed too.
If you change the target CPU time, you will get a significant change in the "average processing rate" reported by BOINC.
Which would explain some of the APR values I've seen on some systems.

If the FLOPs values used for wu.fpops_est were set to proportionally match the Target CPU runtime (e.g. 2 hr Runtime: wu.fpops_est, 4 hr Runtime: wu.fpops_est * 2, 8 hr Runtime: wu.fpops_est * 4, 36 hr Runtime: wu.fpops_est * 18), then the APRs would be more representative of the computation done, as would the Credit awarded. Tasks that run longer or shorter than the Target CPU time would still cause variations.
But the Credit awarded & APR would be a lot more representative of the processing a given CPU has done, and the initial estimated completion times for new Tasks, and particularly new applications, shouldn't be nearly as far out as they presently are.

Yes, I agree this should be fixed, at least with a simple multiplier as a fast/simple solution: e.g. 10,000 GFLOPs per hour of target CPU time, to be in line with the current baseline of 80,000 GFLOPs for the default 8 hr target runtime.

It would both improve credit calculation accuracy and help the BOINC client a LOT in adapting estimated completion times and queue sizes faster when the user changes the target CPU time setting. Without it, BOINC corrects these values slowly, and only after some WUs have finished, as it is not aware that the WUs have suddenly become longer or shorter.
With wu.fpops_est scaled in proportion to the target CPU time, the BOINC client would know in advance that new WUs will be shorter or longer: right after downloading them, before even starting to process the first one.
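
A minimal sketch of that multiplier (assuming the 80,000 GFLOPs / 8 hr baseline quoted above, i.e. 10,000 GFLOPs per target hour; the function name is just for illustration, not BOINC/R@H code):

# Scale wu.fpops_est with the user's target CPU time instead of keeping it fixed.
BASE_GFLOPS_PER_HOUR = 80_000 / 8            # 10,000 GFLOPs per target hour

def fpops_est(target_cpu_hours):
    return BASE_GFLOPS_PER_HOUR * target_cpu_hours * 1e9    # result in FLOPs

print(fpops_est(2) / 1e9, "GFLOPs")      # 2 h target  ->  20,000 GFLOPs
print(fpops_est(36) / 1e9, "GFLOPs")     # 36 h target -> 360,000 GFLOPs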
ID: 97690
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,692,082
RAC: 15,825
Message 97694 - Posted: 27 Jun 2020, 17:46:59 UTC - in response to Message 94641.  
Last modified: 27 Jun 2020, 18:01:07 UTC

Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.

An interesting and simple test indeed.
Here are the results. I don't have external hardware tools to measure power consumption right now, so I used the internal CPU monitoring: CPU package power (SMU) values collected by HWiNFO64.

With 16 WUs running in parallel it shows 72 W average CPU package power (I waited a couple of minutes to collect the average values).
With only 8 WUs running in parallel it drops to 60 W.
The CPU temperature also slowly decreased by about 3 degrees, confirming the power monitoring values.

CPU frequencies and voltages stayed the same during the comparison (3.34 GHz and 1.02 V, the stock values for this CPU, no manual tuning).
SMT was not actually disabled; I just reduced the number of WUs running without rebooting the system, so the additional 8 SMT threads were still present, just not used for computation.

So "there is no magic" (c) - more real work is done, more energy is consumed by CPU. Given the CPU stays the same.

And it is an almost perfectly linear correlation in my case: 16 WUs running on 8 cores produce about 17% more total computation throughput and consume roughly the same 17% more power (about 20% actually: 72/60 = ~1.2).
So the credit/watt ratio stays about the same, but only CPU-wise. There is some additional power consumption (RAM, disk, motherboard components) which should not be affected, or only negligibly, so the total system is a bit more energy efficient when it runs all 16 WUs on all threads with SMT than with just 8 WUs.
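
The same numbers as a quick Python check (the throughput gain is from my earlier SMT test, the power figures from today's measurement; both are already given above):

# Ryzen, 16 WUs (SMT) vs 8 WUs (one per core)
throughput_gain = 0.17            # measured a few months ago
power_gain = 72 / 60 - 1          # ~0.20, measured today
print(round((1 + throughput_gain) / (1 + power_gain), 2))   # ~0.97-0.98 -> CPU-level credit/W about the same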

And it is definitely more efficient from a "credit per $ cost of the system" point of view, as using SMT costs nothing.

P.S.
Note: the +17% performance gain from SMT was measured a few months ago, while the power comparison was made today. The current SMT boost may be slightly different due to different tasks being processed in the R@H queue. It's still the same Rosetta and BOINC, but different tasks/protein targets alter the work/load profiles slightly.

For a direct comparison and an accurate energy-efficiency calculation, two new performance tests would be needed (with 8 and 16 WUs running), but that takes a lot of time and I don't have enough to spare at the moment.
ID: 97694
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 22,848,992
RAC: 15,249
Message 97814 - Posted: 29 Jun 2020, 12:55:01 UTC - in response to Message 93961.  

Intel now offers a free, forum-supported version of VTune that you can download. You can run your tests, capture a system-wide sample, and then drill down to see what is happening.
https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler/choose-download.html

The Rosetta binaries don't have symbols, but you can still see where the program is burning its time. The last time I looked at the source code build instructions, Rosetta used inline compile optimizations very, very aggressively. They build Rosetta on top of tiny BOOST math library routines that they repeatedly inline. That makes for a huge code footprint, icache misses and ITLB table walks. I think of the Rosetta binary as the "Swiss Army Knife" of programs: give it the correct command line and that one binary will do just about anything. Few developers look at the impact of the size of the code footprint. IMO, they made Rosetta more "flexible" at the cost of performance.

Other than "over-optimizing" the inlining, there was not much useful performance optimization. I was able to build a Rosetta binary, simply turn off the inlining optimizations, and get a 20% to 50% improvement.

Because they use the BOOST library calls, vector optimization is VERY TOUGH: the compiler cannot depend on correct data alignment or locality. If you find any SSE or AVX code being executed, it will be scalar, not packed/parallel. Developers, in general, do not take kindly to "suggestions" about their code, and with a huge number of hosts crunching Rosetta and a number of corporations commercially licensing the software, there is zero incentive for the developers to change anything.

Try using VTune to look at the running code.
ID: 97814