Message boards : Number crunching : Finish a workunit in < 1 minute: What would it take?
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
If I could offer Roadrunner, then Baker would offer the WUs. --=Exhibit A=-- Michael Join Team Zenwalk
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Which processor offers the largest L2 cache? L3?

From the workunits I've looked at, it appears that 1MB of L2 cache is plenty. Of course, future workunits may use more memory and thus need bigger caches. But I specifically ran "oprofile" to figure out whether there were cache misses on the processor, and my Opteron 840s didn't show any... -- Mats
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
Which processor offers the largest L2 cache? L3?

Would the shared-cache architecture in Intel dual cores benefit from running two similar tasks, since they may be able to use the same cached data? Has anyone seen any increase (yes, I know it's hard to evaluate) when running two similar tasks compared to, say, a standard one and a docking one?

Would this also benefit AMD processors if they use cache snooping (since they currently use separate caches)? Though it would probably take an architecture engineer from both camps to tell you whether it really does or doesn't, since we have no benchmark units to test theories. (Well, other than telling you theirs is better, of course ;-) Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Which processor offers the largest L2 cache? L3?

Caches work on the physical address of the element in the cache, which means that whether the work is similar or not doesn't really matter. Since Rosetta (for very good reasons) is statically linked, i.e. it doesn't rely on .so/.DLL files that may or may not exist (or be the right version), there is no sharing of library files either.

I'm not entirely sure whether Windows and/or Linux allows sharing of the actual executable image - that would make sense (with a COW [copy-on-write] scheme in case the code is self-modifying - not that Rosetta is). But the large data sections in Rosetta are definitely not shared, as they are constantly being modified by the application (that's the entire goal, right?), so we can't share those, whether they are similar, the same, or different: they need to be two separate sets [unless the application is specifically written to be multithreaded - we've had that discussion elsewhere; Rosetta isn't multi-threaded, it runs two apps on a dual-core processor]. -- Mats
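A minimal C sketch (illustrative, not from the thread) of the copy-on-write behaviour described above: after fork(), parent and child see the same virtual address for a variable, but the first write gives each process its own physical page - which is why two separate Rosetta instances cannot share their data in cache:

    /* cow_demo.c - same virtual address, different physical pages
     * after a copy-on-write fork.
     * Build: gcc cow_demo.c -o cow_demo */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int value = 100;

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            return 1;                   /* fork failed */
        if (pid == 0) {                 /* child: this write triggers COW */
            value = 200;
            printf("child:  &value=%p value=%d\n", (void *)&value, value);
            return 0;
        }
        wait(NULL);                     /* let the child print first */
        printf("parent: &value=%p value=%d\n", (void *)&value, value);
        return 0;
    }

Both processes print the same address for value, yet the contents differ - the pages behind that address are distinct, so the two copies occupy separate cache lines.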
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
That's how traditional caches work, but Intel's Smart Cache is supposed to be able to share the data. Gumf from their advertising media follows - the data-sharing part is the relevant bit.

Intel® Smart Cache Performance Features

The first clear benefit of Intel® Smart Cache is that both cores have full access to the entire 2MB L2 cache. Therefore, even if one core is idle, the other core can use the full 2MB. But many times one core will not be idle, so it's important to dynamically allocate the cache in an intelligent way. Dynamic cache allocation efficiently apportions the cache memory according to what is needed by each execution core. With dynamic allocation, the Intel Core Duo processor enables increased cache utilization, and this optimizes performance.

The Intel Core Duo processor also efficiently shares data needed by both execution cores. For example, data can be modified by one execution core and be subsequently utilized by the other execution core, without the need to first write the data to DRAM. Furthermore, with Intel Smart Cache, L2 data is shared by both execution cores, meaning data is stored in only one cache location rather than in multiple locations as with separate L2 caches. This minimizes front side bus traffic (reads to the system memory) and reduces cache coherency complexity.

It's also important to maximize cache utilization for both cores, so the processor is supported with enhanced Data Pre-fetch Logic. This improves efficiency by speculatively fetching data to the L2 cache even before cache requests occur. This results in reduced bus cycle penalties and, again, higher efficiency. In addition, the Write Order Buffer depth is enhanced to help reduce the write-back latency to further increase performance.

AMD are going to be (maybe already are?) using 'snooping', where one core peeks across at the other core's cache (since they are currently separate) to see if it can use that copy instead. I'm a bit vague on how it all actually works (I need time to read about it more), but that's the gist I get from it. Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Data has two attributes: "address" and "content". The address is the location in memory where the data is stored. Whether we have a shared or an unshared cache, the cache is organized based on the "address" part of the data, and the "content" is stored there.

Let's make a simple example (and I'll ignore the complication of virtual-to-physical address translation; let's just pretend everything is mapped 1:1). An array declared as:

    int x[1000];

We have two cores, both running separate threads in one application.

Thread one (for simplicity, running on the first core):

    for(i = 0; i < 1000; i++) {
        x[i] = i;
    }
    while(x[100] == 100)
        /* do nothing */ ;

Thread two (for simplicity, running on the second core):

    while(1) {
        x[100]++;
    }

Now, on a shared-cache processor, the value of x[100] (or any other part of the x array) would be visible to both cores within the L2 cache. On a "separate cache per core" architecture, the value of the x array would first appear in the first core's cache, then have to be evicted when the second core writes to it (because the first core's copy no longer holds the valid value), and then be fetched from the second core's cache (or from DRAM) back into the first core's cache when it tries to read it again (in the while(x[100] ...) loop).

Snooping from one processor to another has been around for as long as SMP has - it's the only way you can keep caches coherent between multiple cores. The only other solution is to not allow caching of shared data at all, which is bad if you have frequent-read/rare-write data - something many operating systems and applications use quite often for all sorts of data structures.

However, if we instead run two separate copies of the same application, we get two copies of the array x[] above. Each copy resides at the same virtual address but a different physical address [since they belong to distinct processes], and as such they are not cached together. In this case there just ends up being competition for the cache space (obviously no problem with 1000 integers, but say the data is much larger, several megabytes), where one core "throws out" some of the other core's data to make space for its own. [I think the Intel solution actually takes this into account and allows each core a dedicated amount of the cache, to avoid one core hogging all of it.] -- Mats
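For the curious, the two-thread example above can be turned into a small runnable program - a sketch using POSIX threads, with thread two doing a single increment so the program terminates. The 'volatile' only stops the compiler from keeping x[100] in a register; the visibility of the store comes from the hardware coherency protocol, whether that's a shared L2 or snooping between separate caches:

    /* shared_array.c - runnable version of the two-thread example.
     * Build: gcc -O2 -pthread shared_array.c -o shared_array */
    #include <pthread.h>
    #include <stdio.h>

    static volatile int x[1000];

    static void *writer(void *arg)      /* "thread two", second core */
    {
        (void)arg;
        x[100]++;                       /* dirties the line holding x[100] */
        return NULL;
    }

    int main(void)
    {
        int i;
        pthread_t t;

        for (i = 0; i < 1000; i++)      /* "thread one", first core */
            x[i] = i;

        pthread_create(&t, NULL, writer, NULL);

        while (x[100] == 100)           /* spin until the store is visible: */
            ;                           /* shared L2 keeps the line in place; */
                                        /* separate caches must migrate it */

        pthread_join(t, NULL);
        printf("x[100] is now %d\n", x[100]);
        return 0;
    }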
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
Mats, is oprofile available for Windows?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Mats, is oprofile available for Windows?

I used it under Linux - not sure if there's a version for Windows... -- Mats
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
Thanks. Where do I get oprofile - is it included with the Linux distributions?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
It's included in Fedora Core 4/5, but I think it's available in other distros too (though you may have to install it separately - my SuSE 9.x installation has an rpm for it, and I'm sure there is a GUI install method, of course). Or you can get it from the 'net at http://oprofile.sourceforge.net/download/

It is a kernel module plus some user-mode parts to set it up and get the results back out. oprof_start is a GUI tool for selecting the different performance counters in the processor (such as DATA_CACHE_MISSES). -- Mats
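For anyone who wants to repeat the experiment, a typical session with oprofile's opcontrol command-line interface looks roughly like this - a sketch only: it needs root, and DATA_CACHE_MISSES is the AMD K8 event name mentioned above (other CPUs use different event names; run "ophelp" to list them):

    opcontrol --init                                  # load the oprofile kernel module
    opcontrol --no-vmlinux --event=DATA_CACHE_MISSES:100000
    opcontrol --start                                 # begin system-wide sampling
    # ...let Rosetta crunch for a while...
    opcontrol --stop
    opreport --symbols                                # per-symbol sample counts
    opcontrol --shutdown                              # unload the module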