Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it
bitsonic Joined: 21 Mar 20 Posts: 9 Credit: 1,680,354 RAC: 0
I have a wide variety of servers with a lot of different configs. My servers have from 8 up to 128 cores (and almost every config and Intel generation in between). All servers have hyperthreading enabled. I would say start with the CPUs with Skylake architecture first, such as the Xeon(R) Gold 5122, Xeon(R) Platinum 8160, and i7-6600U. Then you can compare with results from my machine and fuzzydice555's machine. Our machines are Skylake based.
fuzzydice555 Joined: 31 Oct 17 Posts: 5 Credit: 2,786,716 RAC: 4,394
Yep, one machine with a lot of cores. On low core count systems the L3 hit rate seems pretty good. One addition: I've tested on Linux, but I doubt this will influence the results.
bkil Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0
Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,581 RAC: 24,454
Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.

I've got 2 almost identical i7-8700K systems, one with HyperThreading on, the other off (the system with HT off uses the iGPU to drive the monitor). CPUID HWMonitor reports the following peak package power (averages are about 3 W less):

              HT on     HT off
    Package   83.26 W   68.74 W
    IA cores  71.60 W   57.47 W
    Uncore    11.86 W   13.62 W

The HT off system has been running for about half the time of the other one, but at the 10 day mark the RACs for each system were: HT on 6,000, HT off 4,400.
Grant
Darwin NT
bkil Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0
Thank you for the data point. Although disabling HT in the BIOS may also do something else to reduce power that much (perhaps it powers down some higher order logic?). So that boils down to a 36% credit gain with HT and at least a 13% credit/Watt gain on your i7-8700K (even more if we considered total system power). Interestingly, I've noticed elevated temperatures on a Haswell as well when enabling HT, so I'd need to do some measurements too; I first thought it was some kind of anomaly. Also, I usually assume a 30-50% HT gain in my approximations, but maybe I need to revise that formula a bit - probably depending on architecture.
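As a rough sanity check of those percentages, here is a minimal Python sketch that reproduces the 36% and ~13% figures from Grant's reported 10-day RACs and peak package power; the variable names are mine, not anything from BOINC or Rosetta:

```python
# Grant's figures: 10-day RAC and CPUID HWMonitor package power.
rac_ht_on, rac_ht_off = 6000, 4400          # credits/day
pkg_w_ht_on, pkg_w_ht_off = 83.26, 68.74    # CPU package power, W

credit_gain = rac_ht_on / rac_ht_off - 1
eff_gain = (rac_ht_on / pkg_w_ht_on) / (rac_ht_off / pkg_w_ht_off) - 1

print(f"credit gain with HT:      {credit_gain:.0%}")   # ~36%
print(f"credit/W gain (CPU only): {eff_gain:.0%}")      # ~13%
```

Total system power would only raise the credit/Watt figure further, since the fixed overhead (RAM, disk, motherboard) is shared by both configurations.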
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,581 RAC: 24,454
Interestingly, I've noticed elevated temperatures on a Haswell as well when enabling HT, so I'd need to do some measurements too; I first thought it was some kind of anomaly.

It's working harder, so it's using more current, so it gets hotter.
Grant
Darwin NT
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,692,082 RAC: 15,825
And Runtime is no good as Runtimes are fixed. Credit is no good as it is based on APR.

Sorry for the late reply (I forgot to subscribe to the thread initially). Runtimes are NOT fixed. The target CPU time is fixed in the settings, yes, but actual run times can vary significantly from it. Some tasks end prematurely - usually if there are errors during processing, or the WU hits its internal "max decoy limit" (each WU carries an instruction to stop processing once a set number of decoys has been generated and to send the results to the server, ignoring the fact that it did not reach the target CPU time). In the other direction, some WUs exceed the target CPU time significantly - usually when the WU works on really hard/big models, where generating a single decoy can take a few hours of CPU work. The target CPU time is checked only between decoys; the CPU time trigger does not interrupt the calculation of an already started decoy until it is fully finished or the watchdog kicks in (usually set to target CPU time + 4 hours) and aborts the task.

That is why I count actual CPU time: take some (more is better) completed WUs, sum up all the CPU time they used, sum up all the credit they generated, and divide the sum of the credit by the sum of the CPU time. That gives you a fairly accurate estimate of real host performance without waiting a LONG time for the average indicator (RAC) to stabilize. I usually grab (copy & paste) all recent WUs from the results tables into an Excel/Calc spreadsheet.

And about decoys - based on my observations, yes, there is an almost linear relation: e.g. double the CPU runtime of a WU and it will produce about twice the number of decoys, with the same type of WU and the same hardware of course. Moreover, the number of decoys generated is the main factor in calculating credit for a successfully completed task on the server after reporting. It is a simple formula: credit granted = decoy count in the reported WU x "price" of one decoy. Host CPU benchmarks and APR are used to determine that "decoy price", but the server uses average values collected from many hosts (not sure how many - probably all) contributing to the same target/work type. It is NOT based on BOINC WU "wingmen" (such a scheme is used in WCG, for example); for R@H it is all hosts getting WUs of the same type/batch (usually hundreds or even a few thousand hosts for large batches). So all "anomalies" in benchmarks and APR are smoothed out by large-scale averaging, but for a particular host only the number of successfully generated decoys determines how much credit it receives for completed WUs.
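A minimal sketch of the credit-per-CPU-hour calculation Mad_Max describes above, in Python; the list values are made-up placeholders, not real results:

```python
# Sum credit over summed CPU time for a batch of recently completed WUs.
# Paste the "CPU time" and "Credit" columns from your own results pages;
# the numbers below are placeholders for illustration only.
cpu_seconds = [28750, 29120, 31400, 27980]
credits     = [612.4, 633.1, 702.9, 598.0]

credit_per_cpu_hour = sum(credits) / sum(cpu_seconds) * 3600
print(f"{credit_per_cpu_hour:.1f} credits per CPU-hour")
```

The more completed WUs you include, the less any single short or overlong task skews the estimate.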
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,692,082 RAC: 15,825
Yes, I agree this should be fixed, at least with a simple multiplier as a fast/simple solution: e.g. 10,000 GFLOPs per hour of target CPU time, to be in line with the current baseline of 80,000 GFLOPs for the default 8 hr target runtime. It would both improve the accuracy of the credit calculation and help the BOINC client a LOT to adapt estimated completion times and queue size faster if the user changes the target CPU time setting. Without it, BOINC corrects these values slowly and only after some WUs have finished, as it is not aware that the WUs have suddenly become longer or shorter. With wu.fpops_est scaled in proportion to the target CPU time, the BOINC client would know that new WUs will be shorter or longer in advance: right after downloading them and before even starting to process the first one.
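A sketch of the proportional scaling proposed above, assuming the 80,000 GFLOPs / 8 hr baseline quoted in the post; the helper function is hypothetical, not an existing BOINC or Rosetta API:

```python
# Hypothetical helper illustrating the proposal: wu.fpops_est scales linearly
# with the target CPU time, anchored to the current default of 80,000 GFLOPs
# for an 8-hour target (i.e. 10,000 GFLOPs per target hour).
GFLOPS_PER_TARGET_HOUR = 80_000 / 8   # = 10,000 GFLOPs per hour of target CPU time

def scaled_fpops_est(target_cpu_hours):
    """Return the FLOP estimate to report for a WU with the given target runtime."""
    return GFLOPS_PER_TARGET_HOUR * target_cpu_hours * 1e9   # GFLOPs -> FLOPs

print(scaled_fpops_est(8))   # 8e+13 FLOPs (80,000 GFLOPs), matches the current default
print(scaled_fpops_est(4))   # half that for a 4-hour target CPU time
```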
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,692,082 RAC: 15,825
Could you perhaps also measure the power consumption difference between running 8 WUs vs. 16 WUs on your Ryzen? My theory is that if the CPU is blocked for longer, it may consume a bit less energy. That would then increase the credit/watt ratio by more than 17%.

An interesting and simple test indeed. Here are the results. I do not have external hardware tools to measure power consumption right now, so I used internal CPU monitoring: CPU package power (SMU) values collected by HWiNFO64. With 16 WUs running in parallel it shows 72 W average CPU package power (I waited a couple of minutes to collect average values). With only 8 WUs running in parallel it drops down to 60 W. CPU temperature also slowly decreased by about 3 degrees, confirming the power monitoring values. CPU frequencies and voltages stayed the same during the comparison (3.34 GHz and 1.02 V - the stock values for this CPU, no manual tuning). SMT was not entirely disabled; I just reduced the number of WUs running without a system reboot, so the additional 8 SMT threads were still present in the system, just not used for computation.

So "there is no magic" (c) - the more real work is done, the more energy the CPU consumes, given the CPU stays the same. And it is an almost perfectly linear correlation in my case: 16 WUs running on 8 cores produce about 17% more total computation throughput and consume about 17% more power (about 20% actually: 72/60 = ~1.2), so the credit/watt ratio stays about the same - but only CPU-wise. There is some additional power consumption (RAM, disk, motherboard components) which should not be affected, or affected only negligibly, so the total system is a bit more energy efficient running all 16 WUs on all threads with SMT than running just 8 WUs. And it is definitely more efficient from a "credit per $ cost of the system" point of view, as using SMT costs nothing.

P.S. Note: the +17% performance gain from SMT was measured a few months ago, while the power comparison was made today. The current SMT boost may be slightly different due to different tasks being processed in the R@H queue - it is still the same Rosetta and BOINC, but different tasks/protein targets alter the work/load profiles slightly. For a direct comparison and an accurate energy efficiency calculation, 2 new performance tests would be needed (with 8 and 16 WUs running), but that takes a lot of time and I do not have enough to spare currently.
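For reference, a small Python sketch of the efficiency comparison above, using only the numbers quoted in the post (the earlier +17% throughput measurement and today's package-power readings):

```python
# Back-of-the-envelope check: does SMT change CPU-only credit per watt?
throughput_gain = 0.17                    # 16 WUs vs 8 WUs, measured earlier
power_16wu, power_8wu = 72.0, 60.0        # W, CPU package power (SMU) via HWiNFO64

power_gain = power_16wu / power_8wu - 1   # ~0.20
cpu_eff_ratio = (1 + throughput_gain) / (1 + power_gain)

print(f"extra CPU power with SMT: {power_gain:.0%}")            # ~20%
print(f"CPU-only credit/W ratio (SMT vs no SMT): {cpu_eff_ratio:.2f}")  # ~0.98, roughly unchanged
```

At the whole-system level the fixed overhead (RAM, disk, motherboard) is shared, so the ratio tips slightly in favour of running all 16 threads.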
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 22,848,992 RAC: 15,249
Intel now offers a free, forum-supported version of VTune that you can download. You can run your tests, capture a system-wide sample, and then drill down to see what is happening. https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler/choose-download.html The Rosetta binaries don't have symbols, but you can still see where the program is burning its time.

The last time I looked at the source code build instructions, Rosetta very, very, very aggressively used inline compile optimizations. They build Rosetta on top of tiny BOOST math library routines that they repeatedly inlined. That makes for a huge code footprint, icache misses, and ITLB table walks. I think of the Rosetta binary as the "Swiss Army Knife" of programs: give it the correct command line and that one binary will do just about anything. Few developers look at the impact of the size of the code footprint. IMO, they made Rosetta more "flexible" at the cost of performance.

Other than "over-optimizing" the inlining, there was not much useful performance optimization. I was able to build a Rosetta binary, simply turn off the inlining optimizations, and get a 20% to 50% improvement. Because they use the BOOST library calls, vector optimization is VERY TOUGH: the compiler cannot depend on correct data alignment or locality. If you find any SSE or AVX code being executed, it will be scalar, not parallel.

Developers, in general, do not take kindly to "suggestions" about their code. With a huge number of hosts crunching Rosetta and a number of corporations commercially licensing the software, there is zero incentive for the developers to change anything. Try using VTune to look at the running code.