Message boards : Number crunching : Finish a workunit in < 1 minute: What would it take?
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
If I could offer Roadrunner, then Baker would offer the WUs. --=Exhibit A=-- Michael Join Team Zenwalk
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Which processor offers the largest L2 cache? L3?

From the workunits I've looked at, it appears that 1MB of L2 cache is plenty. Of course, future workunits may use more memory and thus need bigger caches. But I specifically ran "oprofile" to figure out whether there were cache misses on the processor, and my Opteron 840s didn't show any... -- Mats
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
Which processor offers the largest L2 cache? L3?

Would the shared-cache architecture in Intel dual cores benefit from running two similar tasks, since they may be able to use the same cached data? Has anyone seen any increase (yes, I know it's hard to evaluate) when running two similar tasks compared to, say, a standard one and a docking one?

Would this also benefit AMD processors if they use cache snooping (since they currently use separate caches)? Though it would probably take an architecture engineer from both camps to tell you whether it really does or doesn't, since we have no benchmark units to test theories. (Well, other than telling you theirs is better, of course ;-) Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Which processor offers the largest L2 cache? L3?

Caches work on the physical address of the element in the cache, which means that whether the work is similar or not doesn't really matter. Since Rosetta (for very good reasons) is statically linked, i.e. it doesn't rely on .so/.DLL files that may or may not exist (or be the right version), there is no sharing of library files either.

I'm not entirely sure whether Windows and/or Linux allows sharing of the actual executable image - that would make sense (with a COW [copy-on-write] scheme in case the code is self-modifying - not that Rosetta is). But the large data sections in Rosetta are definitely not shared, as they are constantly being modified by the application (that's the entire goal, right?), so we can't share those, whether they are similar, the same, or different: they need to be two separate sets [unless the application is specifically written to be multithreaded - we've had that discussion elsewhere; Rosetta isn't multi-threaded, it runs two apps on a dual-core processor]. -- Mats
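A minimal C sketch (illustrative, not from the thread) of the copy-on-write behaviour described above: after fork(), parent and child see the same virtual address for a variable, but the first write gives each process its own physical page - which is why two separate Rosetta instances cannot share their data in cache:

    /* cow_demo.c - same virtual address, different physical pages
     * after a copy-on-write fork.
     * Build: gcc cow_demo.c -o cow_demo */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int value = 100;

    int main(void)
    {
        pid_t pid = fork();
        if (pid < 0)
            return 1;                   /* fork failed */
        if (pid == 0) {                 /* child: this write triggers COW */
            value = 200;
            printf("child:  &value=%p value=%d\n", (void *)&value, value);
            return 0;
        }
        wait(NULL);                     /* let the child print first */
        printf("parent: &value=%p value=%d\n", (void *)&value, value);
        return 0;
    }

Both processes print the same address for value, yet the contents differ - the pages behind that address are distinct, so the two copies occupy separate cache lines.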
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
That's how traditional caches work, but Intel's Smart Cache is supposed to be able to share the data. Gumf from their advertising media follows - the data-sharing part is the relevant bit.

Intel® Smart Cache Performance Features

The first clear benefit of Intel® Smart Cache is that both cores have full access to the entire 2MB L2 cache. Therefore, even if one core is idle, the other core can use the full 2MB. But many times one core will not be idle, so it's important to dynamically allocate the cache in an intelligent way. Dynamic cache allocation efficiently apportions the cache memory according to what is needed by each execution core. With dynamic allocation, the Intel Core Duo processor enables increased cache utilization, and this optimizes performance.

The Intel Core Duo processor also efficiently shares data needed by both execution cores. For example, data can be modified by one execution core and be subsequently utilized by the other execution core, without the need to first write the data to DRAM. Furthermore, with Intel Smart Cache, L2 data is shared by both execution cores, meaning data is stored in only one cache location rather than in multiple locations as with separate L2 caches. This minimizes front side bus traffic (reads to the system memory) and reduces cache coherency complexity.

It's also important to maximize cache utilization for both cores, so the processor is supported with enhanced Data Pre-fetch Logic. This improves efficiency by speculatively fetching data to the L2 cache even before cache requests occur. This results in reduced bus cycle penalties and, again, higher efficiency. In addition, the Write Order Buffer depth is enhanced to help reduce the write-back latency to further increase performance.

AMD are going to be (maybe already are?) using 'snooping', where one core peeks across at the other core's cache (since they are currently separate) to see if it can use that copy instead. I'm a bit vague on how it all actually works (I need time to read about it more), but that's the gist I get from it. Team mauisun.org
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Data has two attributes: "address" and "content". The address is the location in memory where the data is stored. Whether we have a shared or an unshared cache, the cache is organized based on the "address" part of the data, and the "content" is stored there.

Let's make a simple example (and I'll ignore the complication of virtual-to-physical address translation; let's just pretend everything is mapped 1:1). An array declared as:

    int x[1000];

We have two cores, both running separate threads in one application.

Thread one (for simplicity, running on the first core):

    for(i = 0; i < 1000; i++) {
        x[i] = i;
    }
    while(x[100] == 100)
        /* do nothing */ ;

Thread two (for simplicity, running on the second core):

    while(1) {
        x[100]++;
    }

Now, on a shared-cache processor, the value of x[100] (or any other part of the x array) would be visible to both cores within the L2 cache. On a "separate cache per core" architecture, the value of the x array would first appear in the first core's cache, then have to be evicted when the second core writes to it (because the first core's copy no longer holds the valid value), and then be fetched from the second core's cache (or from DRAM) back into the first core's cache when it tries to read it again (in the while(x[100] ...) loop).

Snooping from one processor to another has been around for as long as SMP has - it's the only way you can keep caches coherent between multiple cores. The only other solution is to not allow caching of shared data at all, which is bad if you have frequent-read/rare-write data - something many operating systems and applications use quite often for all sorts of data structures.

However, if we instead run two separate copies of the same application, we get two copies of the array x[] above. Each copy resides at the same virtual address but a different physical address [since they belong to distinct processes], and as such they are not cached together. In this case there just ends up being competition for the cache space (obviously no problem with 1000 integers, but say the data is much larger, several megabytes), where one core "throws out" some of the other core's data to make space for its own. [I think the Intel solution actually takes this into account and allows each core a dedicated amount of the cache, to avoid one core hogging all of it.] -- Mats
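For the curious, the two-thread example above can be turned into a small runnable program - a sketch using POSIX threads, with thread two doing a single increment so the program terminates. The 'volatile' only stops the compiler from keeping x[100] in a register; the visibility of the store comes from the hardware coherency protocol, whether that's a shared L2 or snooping between separate caches:

    /* shared_array.c - runnable version of the two-thread example.
     * Build: gcc -O2 -pthread shared_array.c -o shared_array */
    #include <pthread.h>
    #include <stdio.h>

    static volatile int x[1000];

    static void *writer(void *arg)      /* "thread two", second core */
    {
        (void)arg;
        x[100]++;                       /* dirties the line holding x[100] */
        return NULL;
    }

    int main(void)
    {
        int i;
        pthread_t t;

        for (i = 0; i < 1000; i++)      /* "thread one", first core */
            x[i] = i;

        pthread_create(&t, NULL, writer, NULL);

        while (x[100] == 100)           /* spin until the store is visible: */
            ;                           /* shared L2 keeps the line in place; */
                                        /* separate caches must migrate it */

        pthread_join(t, NULL);
        printf("x[100] is now %d\n", x[100]);
        return 0;
    }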
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
Mats, is oprofile available for Windows?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
Mats, is oprofile available for Windows?

I used it under Linux - not sure if there's a version for Windows... -- Mats
Michael Send message Joined: 12 Oct 06 Posts: 16 Credit: 51,712 RAC: 0
Thanks. Where do I get oprofile - is it included with the Linux distributions?
Mats Petersson Send message Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0
It's included in Fedora Core 4/5, but I think it's available in other distros too (though you may have to install it separately - my SuSE 9.x installation has an rpm for it, and I'm sure there is a GUI install method, of course). Or you can get it from the 'net at http://oprofile.sourceforge.net/download/

It is a kernel module plus some user-mode parts to set it up and get the results back out. oprof_start is a GUI tool for selecting the different performance counters in the processor (such as DATA_CACHE_MISSES). -- Mats
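For anyone who wants to repeat the experiment, a typical session with oprofile's opcontrol command-line interface looks roughly like this - a sketch only: it needs root, and DATA_CACHE_MISSES is the AMD K8 event name mentioned above (other CPUs use different event names; run "ophelp" to list them):

    opcontrol --init                                  # load the oprofile kernel module
    opcontrol --no-vmlinux --event=DATA_CACHE_MISSES:100000
    opcontrol --start                                 # begin system-wide sampling
    # ...let Rosetta crunch for a while...
    opcontrol --stop
    opreport --symbols                                # per-symbol sample counts
    opcontrol --shutdown                              # unload the module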