Posts by Mats Petersson

1) Message boards : Number crunching : 8-core cruncher? (Message 35486)
Posted 25 Jan 2007 by Mats Petersson
Post:
I'm probably also too close to the work of these things to actually answer the question to Who?, but if we just look at it realistically, it takes roughly two to three years to go from "architecture specification" to "finished chip" at any chip manufacturer.

Today, there are dual and quad-core chips in existence from AMD and Intel.

To make an 80-core chip, you'd have to completely re-architect the chip (or use a die roughly twenty times the current size, which is unreasonable even if we ignore the large caches that take up some 50-70% of the die-size of current dual/quad-core chips).

If we re-architect the chip to such an extent that the core-size is SIGNIFICANTLY smaller, it would most likely impact the software model that the processor can use, so it would not be compatible with the current architecture, and thus wouldn't allow you to use current software.

2011 is only 4 years away - that's not so long that the software of today (and the next couple of years) will be completely useless.

Intel already tried to push a new architecture to the market (Itanium), and I'd say that this has not been the success that Intel wished for. For a new architecture to succeed, it needs to be sufficiently better for new applications, and at the same time comparable for current/older applications. If it's not compatible (enough), it won't succeed. Many manufacturers have come out with "better" products, but unless they are also compatible with the major markets of the business, they are not successful.

The 80-core architecture is probably a fictional device. My mum has a book from the late 50's/early 60's that I used to read when I was little. It showed the future of our lives, including space-travel and personal aircraft, in the next couple of decades. If you look out the windows (now more than 4 decades later), you'll notice the personal aircraft are not flying around in the towns or countryside outside... ;-) It just goes to show how hard it is to foresee how fast the development is going to be in the decades ahead....

--
Mats
2) Message boards : Number crunching : Does it support freebsd? Yes. [solved] (Message 35405)
Posted 23 Jan 2007 by Mats Petersson
Post:
Whilst I understand that owning a machine without support for Rosetta is disappointing, I also understand the developers' point of view on supporting more architectures.

Building an application for a different architecture isn't too difficult. However, verifying that it's still calculating the exact same result can be. Different software and hardware architectures have different math-libraries, that in turn have different sets of bugs. Rosetta isn't a simple application either, it's MANY thousand lines of code, and with some pretty hairy bits of code (my personal opinion) to try to optimize the compiled result without resorting to writing assembler code or some such. This makes it more fragile than your average "written to be portable" code.

And every time the source code is updated, it needs to be tested on all platforms, to verify that all of them are working correctly.
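As a minimal sketch of why "the exact same result" is hard to guarantee: floating-point addition is not associative, so a compiler or math library that merely sums in a different order can change the last digit. This is only an illustration of the general problem; it makes no claim about which operations Rosetta actually performs.

```python
# Illustration only: the same three numbers summed in two different orders.
# A different compiler, optimisation level or math library may pick either
# order, and the two results are not bit-identical.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact comparison fails even though both sums are "correct".
bitwise_equal = left_to_right == right_to_left

# A tolerance-based comparison is what cross-platform checks usually need.
# The tolerance 1e-12 is an arbitrary value chosen for this example.
close_enough = abs(left_to_right - right_to_left) < 1e-12

print(bitwise_equal, close_enough)
```

Choosing a tolerance that catches real bugs without flagging harmless rounding differences is part of what makes per-platform verification time-consuming.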

--
Mats
3) Message boards : Number crunching : What benchmarking program is most similar to Rosetta? (Message 35404)
Posted 23 Jan 2007 by Mats Petersson
Post:
I agree with DCDC, that the most likely scenario is just that your machine is running various work-units, some of which are "easier" and others that are "harder" for your particular machine. My "system average credit" looks a bit like a cross-section of a mountain range, and I think that's perfectly natural.

Of course, running memory out of sync can add delays to the memory reads, so it may not give you any improvement.

--
Mats
4) Message boards : Number crunching : Does it support freebsd? Yes. [solved] (Message 35068)
Posted 19 Jan 2007 by Mats Petersson
Post:
Well, it may be possible to trick FreeBSD into running a Linux image (for example - I doubt that a Windows image will work at all!), but I don't know if that's going to work well.

As far as I know, the only Unices that are supported are:
Linux x86, 32-bit.
MacOS X x86, ppc, 32-bit.

--
Mats
5) Message boards : Number crunching : Does it support freebsd? Yes. [solved] (Message 35030)
Posted 18 Jan 2007 by Mats Petersson
Post:
This should probably go into the number crunching section of the forum, but the answer is: No, *BSD is not currently supported by Rosetta - no way to say if it will be in the future or not.

Perhaps a moderator can move the thread.

--
Mats

6) Message boards : Cafe Rosetta : Personal Milestones (Message 35012)
Posted 18 Jan 2007 by Mats Petersson
Post:
I just got to number 20 in amd users based on rac


I guess I should join that team, as all my machines are AMD-based... ;-)

--
Mats
7) Message boards : Cafe Rosetta : Personal Milestones (Message 34948)
Posted 17 Jan 2007 by Mats Petersson
Post:
I feel rather good about passing 600K for Rosetta, and in the process getting into the teens of UK crunchers as well as passing the 200 mark in world-wide users...

--
Mats
8) Message boards : Number crunching : Lots of jobs in error (Message 33044)
Posted 21 Dec 2006 by Mats Petersson
Post:
I run R@H on four machines. I get lots of errors on two of them that lock up the computer. Sometimes they are cleared by ctl-alt-del and halting the rosetta task, other times it requires a reboot. Only the two machines with hyperthreading get errors, the others (athlon 64 and a PIII have never had an error. All four machines also run other clients (Seti and Einstein) with no problems.

I have tried keeping the job in memory and will see if that works, next I will try to disable the screensaver, but these do not explain why other CPUs are not affected. Maybe the size of the job.


it is believed to be a syncronisation error and happens (or seems to happen) more often than not on a computer running more than one boinc project at the same time (Hyperthreading technology or multicore processors do this)


Hence the PIII, A64 generally would not see this.


Synchronisation issues are far more likely to be a problem on systems that have multiple execution units that run different threads at the same time (so SMP and HT/multicore systems), as those would technically be able to get things into an "unsynched" state much more easily by accessing the data in parallel (with the data being in an inconsistent state because one thread is half-way through some update while the other one reads the "half-baked" data).

It's of course possible to get this to happen on a single processor system as well, but the likelihood of actually hitting the failure point is lower.
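The "half-baked data" situation can be sketched as follows. This is a hypothetical example, not Rosetta or BOINC code: the `Pair` class and `run` function are made up for illustration. Two fields must always be equal, and a lock makes sure a reader never sees the state between the two assignments. (In CPython the interpreter lock makes the unprotected race harder to trigger than on real parallel hardware, so the sketch shows the protected pattern rather than trying to provoke the failure.)

```python
import threading


class Pair:
    """Two fields that must always be equal (the invariant)."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.lock = threading.Lock()

    def update(self):
        # Writer: without the lock, a reader could run between these two
        # assignments and observe a != b (the "half-baked" state).
        with self.lock:
            self.a += 1
            self.b += 1

    def read(self):
        # Reader: takes the same lock, so it only sees finished updates.
        with self.lock:
            return self.a, self.b


def run(iterations=20000):
    """Run one writer thread against a reading main thread."""
    pair = Pair()
    inconsistent = 0

    def writer():
        for _ in range(iterations):
            pair.update()

    thread = threading.Thread(target=writer)
    thread.start()
    for _ in range(iterations):
        a, b = pair.read()
        if a != b:
            inconsistent += 1
    thread.join()
    return inconsistent, pair.read()


if __name__ == "__main__":
    print(run())
```

With the lock held on both sides, the count of inconsistent reads stays at zero no matter how many cores run the two threads.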

--
Mats
9) Message boards : Number crunching : My PC RAC > 3000 (Message 32855)
Posted 18 Dec 2006 by Mats Petersson
Post:
Ethan - I think that one's @4GHz



Ethan, form other post, it's chilled and running at 4GHz, not the default 2.66GHz ;-)


and I do like the ... only getting 700/day ... part.


I thought that was the single socket machine at pos 2 in the list.

--
Mats
10) Message boards : Number crunching : My PC RAC > 3000 (Message 32854)
Posted 18 Dec 2006 by Mats Petersson
Post:
I just thought I'd congratulate Who? on the score of 3000 per day - all my machines together are getting around 3000 per day... :-(

--
Mats
11) Message boards : Cafe Rosetta : Personal Milestones (Message 32703)
Posted 15 Dec 2006 by Mats Petersson
Post:
I find it a good distraction from life's general boredoms to see the credit accumulate past certain "milestones". Particularly when they are "big ones" like I just passed half a million! Wohoo!

--
Mats
12) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32694)
Posted 15 Dec 2006 by Mats Petersson
Post:


Yes, my memory usage is in the 100-150 MB range too. But a lot of that is rarely touched. Just the binary uses 18MB of static memory, and there's surely a bunch of dynamically allocated data as well. But as I say, much of that is never touched. When I say the most active area, I measured about 60% of the overall time spent there. There's several other functions that take a fair amount of cycles, but they aren't really spending much time on memory fetches...

--
Mats


in your opinion,what do you think make the K8 so slow on Rosetta?
in the mean time, when memory matter "for sure", Hypertransport does not improve anything: on SETI, the 1st AMD system with 4 sockets (and 4 memory controllers) is 79 on the top list.
nobody can argue that seti work load uses more than the cache size ... changing the memory timing of your system changes dramatically the RAC.
conclusion: even when memory matter, Hypertransport is hyper-useless.

Who?



I haven't looked at Seti at all - not really a project that I took much interest in how it works, and I'm not participating there at all any more.

When I compared machines of the same speed and architecture with different numbers of CPUs, the performance per MHz is very similar.

In my view, most of the time is spent waiting for the math processing to finish, so if K8 had a faster math unit, it would help performance. Obviously, there may be other parts that are important too. [And the actual behaviour may vary depending on the actual type of calculations performed, as some types of workunits perform different tasks and run different bits of code].

Of course, if you change the architecture, there's no doubt that a different architecture will be different in performance (assuming it's not just a simple straight copy of an older architecture). There's no denying that the new Core2 technology is good - I have never said otherwise. Compared to my 6.3 +/- 5% credits average per MHz per hour, your quad core is getting around 7.1 credits per MHz per hour (assuming it's running at 2.66GHz). The dual core is also close, as it hits 7.0 assuming that you're still running at 4.0GHz.

That's a good 10% per GHz better performance.
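For anyone who wants to check the comparison, here is a small sketch using only the three figures quoted above. The function name `advantage_percent` is made up for this example.

```python
def advantage_percent(new, baseline):
    """Relative advantage of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# Figures from the post: credits per clock unit per hour.
k8_average = 6.3
core2_quad = 7.1
core2_dual = 7.0

quad_gain = advantage_percent(core2_quad, k8_average)
dual_gain = advantage_percent(core2_dual, k8_average)

print(round(quad_gain, 1), round(dual_gain, 1))
```

Both figures land a little above 10%, and the +/- 5% spread on the 6.3 average means the true gap could be a few points either side.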

--
Mats

13) Message boards : Cafe Rosetta : Supercomputer joins fight against AIDS (Message 32637)
Posted 14 Dec 2006 by Mats Petersson
Post:
Supercomputer joins fight against AIDS

That level of power can generate about one billion mathematical equations per second -- saving hundreds of hours in human calculations and precious time for millions of Africans suffering from fatal disease in need of innovative solutions.


I'd say it's a pretty quick human if s/he can beat a teraflop machine in "hundreds of hours" (without "cheating")....

It sounds from the description like it's a 64 CPU (or perhaps 64 x 2 CPU) rack cluster, which is indeed a nice system, but not really comparable to a REAL super computer. I can only surmise that South Africa aren't quite as used to clusters as we are here in Europe.

--
Mats
14) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32633)
Posted 14 Dec 2006 by Mats Petersson
Post:

well, by getting the memory cache lines in time in the load units, as Core 2 does, you get 10% performance improvement. 10% is not a small number. Hypertransport innefficency probably cost 8 to 10% to the K8 on rosetta, that is not negligeable.

got the point?

who?


1. Automatic memory prefetching has been in K7/K8 processors since the introduction of Athlon MP some 5 or 6 years ago.
2. I still don't see enough cache-misses/read/write requests on my (current sample) to motivate a 10% improvement in Rosetta... Have you actually compared with a K8-based system?


--
Mats



Are you saying that Rosetta work load are fitting into 1Mb ? i don t think so

who?


At least sufficiently to not notice any delays [assuming you mean 1MB (Mb = Megabit, MB = Megabyte in my way of writing things)].

I tried oprofile for "BU_FILL", which is essentially misses in both L1 and L2 caches, and I got around 10 interrupts per second on that [so about 13M fills per second, one fill = 64 bytes => 800MB/s], which is about 10% of the bus capacity.

If we get around 10% usage on the bus (per core), I don't see how it can be improved by 10% - that's an "infinite" improvement, which is usually not achievable in real world scenarios.
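The bandwidth estimate above works out as in the sketch below. The fill rate and line size are the figures from the post; the bus capacity of roughly 8 GB/s is not stated anywhere and is only inferred from "800MB/s is about 10%", so treat it as an assumption.

```python
# Figures quoted in the post.
fills_per_second = 13_000_000   # cache-line fills per second (from oprofile)
bytes_per_fill = 64             # one cache line

# Assumption: inferred from "800MB/s is about 10% of the bus capacity".
assumed_bus_capacity = 8_000_000_000  # bytes per second

bandwidth = fills_per_second * bytes_per_fill
utilisation = bandwidth / assumed_bus_capacity

print(bandwidth / 1_000_000, "MB/s")
print(round(utilisation * 100, 1), "% of assumed bus capacity")
```

The exact product is 832 MB/s, which the post rounds down to 800MB/s.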

--
Mats



I am very impressed, it sound like Hyperthrans has a compression algorythm or something, because it uses less memory space than any other processor i used.

you probably want to review your figures again, your measurement tool is broken, or you forgot to count the prefected cache lines. (Most of the measurement tool exclude them, AMD tool does at least ...)

now, hypertrans is magic ... cool :) that explain why Voodoo like it so much before ...

who?


Ehm, first of all, did I even mention anything about Hypertransport in my post? Not so. If the memory management works right (Linux does), the process should allocate memory from the local processor, which means that hypertransport doesn't come into the question. Memory controller is the local memory controller, so hypertransport shouldn't get involved for that. [Yes, there's obviously snoop messages for each cache-line, but that should be fairly short compared to a cache-line]. Of course, there is indeed no guarantee that the process is kept in the same processor, in which case cross-processor traffic starts to affect things.

I did not find any (different) way to measure the actual memory transfers, so you may be correct that there are prefetches there...

Just out of curiosity, what sort of memory transfers are you seeing on Rosetta (in MB/s for example)?

By the way, the working set where Rosetta is most active is an array of 300 x 6 x 4 bytes. It does reference a whole bunch of other variables in the function, but that's by far the largest one. 300 x 6 x 4 is MUCH smaller than 1MB.

There is lots of other data, but it's very rarely accessed in general.
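To put a number on "MUCH smaller", here is a quick sketch of the size of that array against a 1MB cache. The dimensions come from the post; taking 1MB as 1024 x 1024 bytes is an assumption about which definition is meant.

```python
# Dimensions of the hot array as described in the post.
elements = 300 * 6
bytes_per_element = 4

working_set_bytes = elements * bytes_per_element
one_mb = 1024 * 1024  # assumption: binary megabyte

fraction_of_cache = working_set_bytes / one_mb

print(working_set_bytes, "bytes")
print(round(fraction_of_cache * 100, 2), "% of 1MB")
```

The array alone takes well under 1% of a 1MB cache, which leaves plenty of room for the other variables the function touches.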

--
Mats


So, I guess, the 155Megs around it are there for fun (yep ... the executable allocate 140 to 150Megs ...)

i see a different story than yours, I am in the process of getting the source code, I ll let you know later.


who?


Yes, my memory usage is in the 100-150 MB range too. But a lot of that is rarely touched. Just the binary uses 18MB of static memory, and there's surely a bunch of dynamically allocated data as well. But as I say, much of that is never touched. When I say the most active area, I measured about 60% of the overall time spent there. There's several other functions that take a fair amount of cycles, but they aren't really spending much time on memory fetches...

--
Mats
15) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32595)
Posted 13 Dec 2006 by Mats Petersson
Post:

well, by getting the memory cache lines in time in the load units, as Core 2 does, you get 10% performance improvement. 10% is not a small number. Hypertransport innefficency probably cost 8 to 10% to the K8 on rosetta, that is not negligeable.

got the point?

who?


1. Automatic memory prefetching has been in K7/K8 processors since the introduction of Athlon MP some 5 or 6 years ago.
2. I still don't see enough cache-misses/read/write requests on my (current sample) to motivate a 10% improvement in Rosetta... Have you actually compared with a K8-based system?


--
Mats



Are you saying that Rosetta work load are fitting into 1Mb ? i don t think so

who?


At least sufficiently to not notice any delays [assuming you mean 1MB (Mb = Megabit, MB = Megabyte in my way of writing things)].

I tried oprofile for "BU_FILL", which is essentially misses in both L1 and L2 caches, and I got around 10 interrupts per second on that [so about 13M fills per second, one fill = 64 bytes => 800MB/s], which is about 10% of the bus capacity.

If we get around 10% usage on the bus (per core), I don't see how it can be improved by 10% - that's an "infinite" improvement, which is usually not achievable in real world scenarios.

--
Mats



I am very impressed, it sound like Hyperthrans has a compression algorythm or something, because it uses less memory space than any other processor i used.

you probably want to review your figures again, your measurement tool is broken, or you forgot to count the prefected cache lines. (Most of the measurement tool exclude them, AMD tool does at least ...)

now, hypertrans is magic ... cool :) that explain why Voodoo like it so much before ...

who?


Ehm, first of all, did I even mention anything about Hypertransport in my post? Not so. If the memory management works right (Linux does), the process should allocate memory from the local processor, which means that hypertransport doesn't come into the question. Memory controller is the local memory controller, so hypertransport shouldn't get involved for that. [Yes, there's obviously snoop messages for each cache-line, but that should be fairly short compared to a cache-line]. Of course, there is indeed no guarantee that the process is kept in the same processor, in which case cross-processor traffic starts to affect things.

I did not find any (different) way to measure the actual memory transfers, so you may be correct that there are prefetches there...

Just out of curiosity, what sort of memory transfers are you seeing on Rosetta (in MB/s for example)?

By the way, the working set where Rosetta is most active is an array of 300 x 6 x 4 bytes. It does reference a whole bunch of other variables in the function, but that's by far the largest one. 300 x 6 x 4 is MUCH smaller than 1MB.

There is lots of other data, but it's very rarely accessed in general.

--
Mats
16) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32588)
Posted 13 Dec 2006 by Mats Petersson
Post:

well, by getting the memory cache lines in time in the load units, as Core 2 does, you get 10% performance improvement. 10% is not a small number. Hypertransport innefficency probably cost 8 to 10% to the K8 on rosetta, that is not negligeable.

got the point?

who?


1. Automatic memory prefetching has been in K7/K8 processors since the introduction of Athlon MP some 5 or 6 years ago.
2. I still don't see enough cache-misses/read/write requests on my (current sample) to motivate a 10% improvement in Rosetta... Have you actually compared with a K8-based system?


--
Mats



Are you saying that Rosetta work load are fitting into 1Mb ? i don t think so

who?


At least sufficiently to not notice any delays [assuming you mean 1MB (Mb = Megabit, MB = Megabyte in my way of writing things)].

I tried oprofile for "BU_FILL", which is essentially misses in both L1 and L2 caches, and I got around 10 interrupts per second on that [so about 13M fills per second, one fill = 64 bytes => 800MB/s], which is about 10% of the bus capacity.

If we get around 10% usage on the bus (per core), I don't see how it can be improved by 10% - that's an "infinite" improvement, which is usually not achievable in real world scenarios.

--
Mats
17) Message boards : Number crunching : Lots of jobs in error (Message 32586)
Posted 13 Dec 2006 by Mats Petersson
Post:
The error code shown in your example of job in error is a 0xC0000005, which is the code for "Access Violation", which essentially means that the process in question was trying to access memory that it wasn't supposed to access.

From the call-stack, it seems like you're in an NVidia graphics driver that has gone into some sort of recursive call - but that could just be me misunderstanding the crash-dump... Or that the crash dump isn't very clever with certain types of stack-patterns.
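When reading error reports like this, a small lookup can help translate the code. The sketch below is a hypothetical helper, not part of BOINC or Rosetta, and the table is a hand-picked subset of Windows NTSTATUS values rather than a complete list. It also handles the case where the same code is reported as a negative signed 32-bit number.

```python
# A few well-known Windows NTSTATUS exception codes (subset only).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
}


def describe_exit_code(code):
    """Return a readable name for a Windows exception/exit code."""
    # Exit codes are sometimes printed as signed 32-bit integers,
    # so fold them back into the unsigned range first.
    code &= 0xFFFFFFFF
    return NTSTATUS_NAMES.get(code, "unknown (0x%08X)" % code)


print(describe_exit_code(0xC0000005))
print(describe_exit_code(-1073741819))  # same code, signed form
```

Both calls name the same access violation, since -1073741819 is just 0xC0000005 read as a signed value.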

--
Mats
18) Message boards : Number crunching : Negative Credits ? (Message 32585)
Posted 13 Dec 2006 by Mats Petersson
Post:
It's not a bug in BOINCstats, it happens when the user's CPID keeps changing because they haven't got it to sync up between all their projects.

The user "Xanthochroid" has two CPIDs, the other one is here. The negative scores on one ID match the positive scores on the other.



Well, ok, so it's kind of a bug that BOINCstats shows negative credits, but the real cause of that is that there are two CPIDs, right?

--
Mats
19) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32534)
Posted 12 Dec 2006 by Mats Petersson
Post:

well, by getting the memory cache lines in time in the load units, as Core 2 does, you get 10% performance improvement. 10% is not a small number. Hypertransport innefficency probably cost 8 to 10% to the K8 on rosetta, that is not negligeable.

got the point?

who?


1. Automatic memory prefetching has been in K7/K8 processors since the introduction of Athlon MP some 5 or 6 years ago.
2. I still don't see enough cache-misses/read/write requests on my (current sample) to motivate a 10% improvement in Rosetta... Have you actually compared with a K8-based system?


--
Mats
20) Message boards : Number crunching : QX6700 or XEONs What should you choose for crunching (Message 32530)
Posted 12 Dec 2006 by Mats Petersson
Post:
I upgraded my V8 (cores) computer to Windows VISTA. no slow down :)

The monster running Vista

it is getting back to number one position, it will beat my overclocked quad core over night i think.

still no saturation of the front side buses ... that does not happen mister Ruiz!
(yep! I got 2 FSB on this motherboard! the snooping filters are more efficent than Hypertransport aging protocole)


who?


And again, Rosetta isn't going to saturate any bus on a machine with decent L2 cache, so why are you going on about the FSB/HyperTransport - it's quite clear that you DO understand that this is not an issue from previous posts. It may be an issue for some other applications, but not for Rosetta, so Rosetta makes a very poor example for comparing these things, right?

--
Mats


i just vTuned the work loads i was crunching on my quad core, your statement is actually wrong, there is 10% of the time spent on memory load and store, it does not cost on Core 2 because the success rate of the prefetcher is high, and every thing gets into the cache before the load unit needs it.
I know that some work load of Rosetta varies a lot, i guess, you miss understood some part of the algorythm, there is few pointer chassing going on that requires FSB or old hypertransport ...

remember, each workload of rosetta can be dramatically different, based on the kind of structure you are processing.

whowho?



Yeah, ok, there may be some memory traffic, but I've yet to see a single case where Rosetta is even close to saturating the memory traffic, which is why I stated that Rosetta is a poor benchmark for whether the bus is "efficient" or not. Can we agree on that?

--
Mats




