Posts by entity

1) Message boards : Number crunching : L3 Cache is not a problem, CPU Performance Counter proves it (Message 93979)
Posted 9 Apr 2020 by entity
Post:
I know that this has been discussed quite a few times in this forum. People say that the Rosetta application is not optimized and needs 5MB of L3 cache for each instance running (though I don't really know where they got this number), but it looks to me like no one has ever shown any data to prove it. So I decided to figure it out.

Intel used to provide software called Performance Counter Monitor (see this article: https://software.intel.com/en-us/articles/intel-performance-counter-monitor). Basically, this software can read hardware-level counters inside the CPU and provide performance insight, including cache hit rates. So I ran a test on my desktop:
CPU: i3-7100, 2 cores / 4 threads, 3.9 GHz, 3MB L3 cache
RAM: 16GB DDR 2400 Dual Channel
OS: Win10 1909

I ran 4 instances of Rosetta 4.12 with COVID-19 WUs. I closed all other background applications that might affect the result. Then I enabled the monitor and captured for 100 seconds. Here is the result. Column L3HIT is the L3 cache hit rate; L2HIT is the L2 cache hit rate.

---------------------------------------------------------------------------------------------------------------

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 (read) cache misses
L2MISS: L2 (read) cache misses (including other core's L2 cache *hits*)
L3HIT : L3 (read) cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3MPI : number of L3 (read) cache misses per instruction
L2MPI : number of L2 (read) cache misses per instruction
READ : bytes read from main memory controller (in GBytes)
WRITE : bytes written to main memory controller (in GBytes)
IO : bytes read/written due to IO requests to memory controller (in GBytes); this may be an over estimate due to same-cache-line partial requests
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
energy: Energy in Joules


 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

   0    0     1.07   1.07   0.99   1.00     79 M    436 M    0.82    0.78    0.00    0.00    29
   1    0     1.09   1.10   0.99   1.00     69 M    435 M    0.84    0.78    0.00    0.00    29
   2    0     1.17   1.17   0.99   1.00     55 M    352 M    0.84    0.82    0.00    0.00    30
   3    0     1.18   1.19   1.00   1.00     43 M    321 M    0.86    0.83    0.00    0.00    30
---------------------------------------------------------------------------------------------------------------
 SKT  0       1.13   1.13   0.99   1.00    247 M   1545 M    0.84    0.80    0.00    0.00    29
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     1.13   1.13   0.99   1.00    247 M   1545 M    0.84    0.80    0.00    0.00    N/A

Instructions retired: 1764 G ; Active cycles: 1556 G ; Time (TSC): 391 Gticks ; C0 (active,non-halted) core residency: 99.73 %

C1 core residency: 0.27 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %;
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %; C8 package residency: 0.00 %; C9 package residency: 0.00 %; C10 package residency: 0.00 %;

PHYSICAL CORE IPC : 2.27 => corresponds to 56.70 % utilization for cores in active state
Instructions per nominal CPU cycle: 2.26 => corresponds to 56.38 % core utilization over time interval
SMI count: 0
---------------------------------------------------------------------------------------------------------------
MEM (GB)->| READ | WRITE | IO | CPU energy |
---------------------------------------------------------------------------------------------------------------
SKT 0 181.87 37.36 2.18 2912.05
---------------------------------------------------------------------------------------------------------------
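As a quick sanity check on the output above (my assumption here: essentially every L3 lookup is an L2 read miss, so L3 accesses ≈ L2MISS), the reported ratios can be reproduced from the raw socket-level counters:

```python
# Reproduce the PCM-reported ratios from the raw socket totals above.
# Assumption: L3 lookups ~= L2 read misses, so L3HIT ~= 1 - L3MISS / L2MISS.

l3_miss = 247e6          # L3MISS, socket total
l2_miss = 1545e6         # L2MISS, socket total
instructions = 1764e9    # instructions retired

l3_hit = 1 - l3_miss / l2_miss       # read hit ratio of the L3
l2_mpi = l2_miss / instructions      # L2 misses per instruction

print(f"L3HIT ~ {l3_hit:.2f}")       # ~0.84, matching the table
print(f"L2MPI ~ {l2_mpi:.4f}")       # ~0.0009, which rounds to the table's 0.00
```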

As you can see, the L3 cache hit rate is 84% on average over the 100 seconds. I repeated the test a few times; results vary around 80%, with the lowest I saw being 75% and the highest 89%. This is actually a very high hit rate based on my observation. I can say most other BOINC applications have a lower L3 hit rate than this. For example, I used to run SETI@home (optimized version), and its L3 hit rate is around 70%; Milkyway@home is a little higher but still below 80%. The only project with a higher hit rate is Collatz Conjecture. I am not surprised by this, as Collatz's algorithm is very simple (basically calculating 3*N+1 and N/2), so the code footprint should be very small and may even fit into the L2 cache.
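For reference, the Collatz step really is that small. A toy sketch of the core loop (not Collatz@home's actual implementation, which uses optimized big-integer GPU code) shows why its working set can live entirely in the upper cache levels:

```python
def collatz_steps(n: int) -> int:
    """Count steps until n reaches 1 under the 3n+1 / n/2 rule."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print(collatz_steps(27))  # 111
```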

Another number that supports this is the memory traffic (see the bottom of the figure above). It shows 181.87GB read and 37.36GB written in 100 seconds. That's around 2.19GB/s in total. This is a very low number considering the CPU is running at full speed. Remember that dual-channel DDR 2400 can provide close to 40GB/s of bandwidth, so memory bandwidth utilization is very low. This also indicates that most memory reads/writes are being served by the CPU caches.
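The arithmetic behind those figures (assuming a 64-bit bus per channel and two channels at 2400 MT/s):

```python
read_gb, write_gb, seconds = 181.87, 37.36, 100

achieved = (read_gb + write_gb) / seconds   # ~2.19 GB/s total traffic
# Theoretical peak: 2400e6 transfers/s * 8 bytes/transfer * 2 channels
peak = 2400e6 * 8 * 2 / 1e9                 # 38.4 GB/s

print(f"achieved ~ {achieved:.2f} GB/s, ~{achieved / peak:.0%} of peak")
```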

I also tested my workstation at work:
CPU: i7-9700K, 8 cores / 8 threads, OC'd to 4.2 GHz, 12MB L3 cache
RAM: 16GB DDR4 2666 Dual Channel
OS: Win10 1909

Running 8 Rosetta instances and I got average L3 hit rate of 77%. Pretty close.

Conclusion: According to the findings above, it doesn't seem that L3 cache is a bottleneck for Rosetta, at least not for 4.12. I don't know whether Rosetta's algorithm itself is optimized, but it looks like it is pretty effective at keeping the CPU busy.

Lastly, BTW, for Intel's current CPU microarchitectures (i.e., gen 6, 7, 8, 9), each physical core can only use up to 2MB of L3 cache. So even if you run only one Rosetta instance, it will not benefit from an L3 cache larger than 2MB.

I hope you find this helpful.

Michael Wang

Unfortunately, statistics are always open to interpretation. "If you torture the data long enough, it will confess" -- Economist Ronald H Coase

1. Many contributors have reported that if they run small numbers of Rosetta instances concurrently they see less of a problem, but as the number of concurrent instances increases the problem becomes more pronounced. This observation seems to be backed by your data, where the L3 hit ratio drops between 4 concurrent and 8 concurrent instances. What happens when concurrency jumps to 16 threads, 32 threads, etc.?

2. PHYSICAL CORE IPC data indicates ~56% utilization. Seems like it should be closer to 100%. Why so low? Does it indicate that the cores are waiting significantly?

3. The memory READ/WRITE numbers need more context. Where are they measured? Additionally, I assume those are main-memory stats and not cache stats, based on the size of the numbers. I would conclude that if the L3 cache needed to be refilled with new data due to misses, it would result in high read rates, just as your data indicates. Memory transfer rates are not indicative of anything other than that data is being transferred, and they definitely don't indicate memory utilization; I would suggest they more rightly indicate memory channel utilization. I submit that in a highly optimized L3 cache situation there would be little to no memory transfer activity; however, just because the transfer rate isn't at full utilization doesn't support the claim that there isn't an L3 cache problem.

4. Are you making the assumption that the Rosetta program has a reasonably static processing profile? Could the program have higher L3 cache demands at various points in the processing stream and you just happened to catch the program at a low point in the demand?

5. You make the statement: "I don't know whether Rosetta's algorithm itself is optimized, but looks like it is pretty effective in keeping CPU busy." I ask: based on what? If you are using the CPU utilization reported by the OS, I contend that isn't a reliable number in this case. That number is based on the OS wait bits, which never get set here. The thread is dispatched to the CPU, and as far as the OS is concerned it is active on the processor. If the core is waiting on memory transfers between cache and main memory, the OS doesn't see that. The only way the OS knows about a thread waiting is if it gives up the CPU voluntarily or is interrupted by a higher-priority task, and the only way the OS knows the CPU is waiting is if the wait bit for that processor is set.

I'm not saying the data isn't useful or pertinent. Just be careful about drawing conclusions from such a small amount of data. In other words, it's hard to run a program for 100 seconds and then say everything is OK.
2) Message boards : Number crunching : Limited thread usage under Linux? (Message 93778)
Posted 7 Apr 2020 by entity
Post:
I would stop and restart boinc, then look at the log to see how many CPUs BOINC detected and also look for any messages as to why it is being limited. I would also check to see if there are any local preferences overriding the global preferences from the website.
3) Message boards : Number crunching : Rosetta 4.1+ and 4.2+ (Message 93574)
Posted 5 Apr 2020 by entity
Post:
Downloaded 1135 WUs and working through the last 500 right now. All Linux machines with different distributions...
4) Message boards : Number crunching : Rosetta 4.1+ and 4.2+ (Message 93289)
Posted 3 Apr 2020 by entity
Post:
I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about misuse of the L3 cache causing issues he was noticing on Xeon E5's (he didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per pair of physical cores (4MB for the 7th gen Core i5, and a 12MB SmartCache for the 9th gen Core i7). Maybe it's coincidence, but I find it curious. Hope this helps someone track down what's going on. Cheers!

I've been seeing this in several threads recently. This is a response we got at WCG from the MIP project developers a couple of years ago:

"The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses, and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel. Nothing seemed slower for us because we are always running in that regime.
We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++, and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises.

Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!"
5) Message boards : Number crunching : Rosetta x86 on AMD CPU (Message 93286)
Posted 3 Apr 2020 by entity
Post:
I think something is off with the CPU utilization by the new 4.12 app on the new Ryzen 3000 systems.
I have a 3800X, and with 4.07 it had consistent 79-80C temps under full load (CPU package power 105-107W); now with the 4.12 app, temps are 65-70C under full load (CPU package power is 80-85W).
What was changed? I was under the impression that the new app would be somewhat more optimized for the new CPUs.

We were seeing this same behavior over at WCG on the MIP project which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago):

"The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses, and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel. Nothing seemed slower for us because we are always running in that regime.
We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++, and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises.

Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!"

It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop.
6) Message boards : Number crunching : Rosetta 4.1+ and 4.2+ (Message 93108)
Posted 2 Apr 2020 by entity
Post:
I always try to suggest ways people can run the way they like, without purchase of additional hardware. So another approach would be to add another BOINC project that has less memory required to run. I usually suggest World Community Grid. Their tasks often run in less than 100MB. Running both projects with same resource share should allow you to use all of your CPUs, and afford enough memory for the larger R@h tasks.

Be careful at WCG. Their ARP, FAH2, and MIP work units are quite large. MIP is running Rosetta, so... HST is quite small but also quite rare. SCC might be a good choice. MCM might also be a good choice but is over 100MB, probably closer to 300MB or 400MB.
7) Message boards : Number crunching : 0 new tasks, Rosetta? (Message 93098)
Posted 2 Apr 2020 by entity
Post:
I was at WCG for almost 16 years and decided to come here as it seemed to be a more useful project. I guess I have been VERY fortunate in the ability of my machines to stay busy. I have had a steady group of 1600 tasks since I came on board a couple of days ago. I have never seen it below 1000 and it's been staying steady at about 1560 the past 24 hours. I looked at WCG's potential Pandemic project and I'm skeptical as to how useful it could be. Admittedly, the description and announcements are all I can go on but it doesn't seem to address anything specific. I might have to wait for some announcement from someone at the Scripps Institute for more information.
8) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 93080)
Posted 2 Apr 2020 by entity
Post:
Oh you're right.
I just looked at my task list.
Time per WU has jumped from 8 hours to 16 hours!
The cores are running cooler than the last version too, suggests a bottleneck.
Note 2, I just noticed that the most recent few are fast again.
Maybe there was just a run of WU for a harder problem.

This is a known problem in Rosetta that the developers have acknowledged but probably haven't fixed yet. They indicated that it would take a major rewrite of the code. The L3 cache tends to become over-utilized and the CPU waits for data to make the trip from main memory, hence the CPU runs cooler (more waiting). There was a post by a developer in another project that suggested limiting the number of tasks run concurrently; they indicated that each task uses about 4MB of L3 cache. Concerning the run time, I noticed that the run parameters include something like cpu_seconds=57500. That is about 16 hours. They are ignoring the Target CPU runtime setting.
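For reference, the conversion behind that run parameter (cpu_seconds is my reading of the parameter name as it appeared in the task; the 8-hour target would correspond to 28800):

```python
cpu_seconds = 57500                    # value observed in the task's run parameters
print(f"{cpu_seconds / 3600:.1f} h")   # ~16.0 hours, vs. the 8 h (28800 s) target
```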
9) Message boards : News : Rosetta's role in fighting coronavirus (Message 92907)
Posted 1 Apr 2020 by entity
Post:
A few things (I just recently came back to R@H):
(1) Noticed that I downloaded 4.12 of Rosetta last night. (2) Downloaded "a bunch" of conducting_fibre_*_*_*_* work units that have already run 10 hours and are only 63% complete (I'm using the 8 hour default) and are due Apr 3. A lot are not going to make it if they run the full 16 to 20 hours. (3) Is there maintenance that happens about 0000 UTC? I noticed that the scheduler was not running and many WUs were "stuck" in upload and download. It seemed to clear about 0300 UTC. Is this an every-night thing?
10) Message boards : News : Rosetta's role in fighting coronavirus (Message 92762)
Posted 31 Mar 2020 by entity
Post:
I would not recommend F@H on Linux. I just left there because of issues concerning software dependencies. As much as F@H likes to state their software will run most anywhere (which is why they suggest ignoring dependencies during install), I, and others, have found that not to be the case. I never could get FAHControl running on 3 different distributions without having to go back and install deprecated software. If one is running a Long Term Support release of a distribution (which tends to be back-leveled), then the chances are greater that the install will work. It wasn't worth fiddling with. Just one man's opinion, of course.

I don't know about the other distributions, but it works OK on Ubuntu LTS (16.04 and 18.04). The main problem for me has not been the distributions, but the idiosyncratic FAH default settings. They are OK on Windows, but screwball (no other term) on Linux. It is not that we haven't told them about it in years past, it is just that we have been ignored.

However, for what it is worth, I put in my 2 cents again recently:
https://foldingforum.org/viewtopic.php?f=17&t=32124&start=30#p311633
https://foldingforum.org/viewtopic.php?f=17&t=32124&start=30#p311670

I have it on 9 Ubuntu 18.04.4 machines, which I manage remotely over the LAN with HFM.Net running on my Windows machine.

I think that's the trick: to run only LTS releases, which I don't do. The Ubuntu 20.04 LTS release is out in a couple of weeks and it's going to be interesting to see if it still works. They may have a new client out by then.
I was able to get the client to run by using the config.xml file to define my slots, but it wasn't easy and took a lot of trial and error. Apologies for hijacking this thread to discuss another project. This is my last post on this particular topic. Back to R@H.
11) Message boards : News : Rosetta's role in fighting coronavirus (Message 92747)
Posted 31 Mar 2020 by entity
Post:
There has been discussion that Folding@home is likely to produce a BOINC version soon. Not ready yet, though.

Is that right? That's a project I've always wanted to run - especially now they're doing COVID work - but I've also been put off by it not being available via Boinc

The discussions are very preliminary; I have been a part of them. But the developers have to do the real work, and I would guess that it won't come in the midst of the current rush.

But the Folding client works well enough once you get past its idiosyncrasies setting it up; look to their forum for help - especially on Linux.
I run my GPUs on Folding (by deleting the CPU slot when setting it up), and then run my CPUs on BOINC. It works great.

I would not recommend F@H on Linux. I just left there because of issues concerning software dependencies. As much as F@H likes to state their software will run most anywhere (which is why they suggest ignoring dependencies during install), I, and others, have found that not to be the case. I never could get FAHControl running on 3 different distributions without having to go back and install deprecated software. If one is running a Long Term Support release of a distribution (which tends to be back-leveled), then the chances are greater that the install will work. It wasn't worth fiddling with. Just one man's opinion, of course.
12) Message boards : News : Rosetta's role in fighting coronavirus (Message 92645)
Posted 30 Mar 2020 by entity
Post:
I was draining down my 24-thread 32GB machines and had a mix of Rosetta and other work. 2 of the 3 machines were well into the swap file, but I was able to catch them before they ran out. I trimmed back some work and they are now working just fine. Once the other work drains off I will open them up again. The other machine was out of swap and totally thrashing (disk light on continuously and I couldn't log on). I rebooted and trimmed back some work like the others. On my 128-thread 256GB machine, it filled up the root filesystem, as I only had 50GB allocated to it, and when Rosetta started up with about 500MB per slot, BOINC died. I was still able to log on, so I shrank some LVs and extended the root LV. All good now. Everything is running as expected.
13) Message boards : News : Rosetta's role in fighting coronavirus (Message 92568)
Posted 29 Mar 2020 by entity
Post:
Just added my 384 threads after being gone for 2 years.... 11 servers 24/7. I don't care about the names. I'll run anything that comes down the network link... First units should be coming back in about 4 hours from the faster machines. Slowest machines are about 8 hours. All 1600 WUs downloaded should be back in about 24 to 28 hours

©2021 University of Washington
https://www.bakerlab.org