FPU/SSE2 performance

Message boards : Number crunching : FPU/SSE2 performance

XS_STEvil

Joined: 30 Dec 05
Posts: 9
Credit: 189,013
RAC: 0
Message 21105 - Posted: 25 Jul 2006, 5:53:10 UTC
Last modified: 25 Jul 2006, 6:09:44 UTC

http://www.xtremesystems.org/forums/showthread.php?t=103305

edit - also see this thread. It will have BOINC numbers and mandelbrot .53 numbers. http://www.xtremesystems.org/forums/showthread.php?p=1601705#post1601705

We know the current benchmark method is flawed, and we know a new one is coming which should hopefully solve this mess (thanks, Rosetta team!!!!), so please don't bring that subject up; it really has been beaten to a pulp and then some ;)

So, out of curiosity, can you guys run this and give us your output? I want to see what some of these give in relation to MMCIASTRO's (though I don't think they are going to be easily comparable).
This signature was annoying.
ID: 21105
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 21107 - Posted: 25 Jul 2006, 6:57:43 UTC

I'm confused as to why you are using SSE2 for performance measurements. Is it just out of interest, since it has no relevance to Rosetta (it does not use SSE2)?

The only relevance it would have would be to anyone using Crunch3r's BOINC client compiled for SSE2. I'll check your threads later when I get some time, though :-)) Just a note that not all of Crunch3r's clients use SSE2; there are MMX and SSE ones as well.


In effect you should just compare them to the benchmarks of SiSoft Sandra or similar programs, as they are the same Whetstone/Dhrystone benchmarks.
Team mauisun.org
ID: 21107
MikeMarsUK

Joined: 15 Jan 06
Posts: 121
Credit: 2,637,872
RAC: 0
Message 21109 - Posted: 25 Jul 2006, 7:28:53 UTC
Last modified: 25 Jul 2006, 7:37:39 UTC

For Rosetta, before they introduce their new system, it may be worth doing this as a way of comparing systems:

* Look at the tasks recently completed on the box you want to benchmark, and try to find other hosts which have completed the same protein and algorithm in the same duration work unit (or adjust by duration of work unit).

* Make a note of the number of decoys processed on each machine.

Now repeat with different proteins and algorithms, until you have a decent sized data set. Calculate the average ratio.

The easiest way to find a single comparable result is via the work unit link, but that only gives you one comparable result per work unit. A comparison using only a single work unit isn't very reliable...

This gives you a measure of the comparative Rosetta performance between the machines, and hence the nearest 'fair' comparison available at the moment. From what people have said, it'll also give you an indication of how much credit you'd get on the two machines under the new scheme.

Effectively this allows you to compare systems based on how good the machine is at running Rosetta.
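The comparison described above can be sketched in a few lines (the host names and decoy counts below are invented for illustration, not real results):

```python
# Rough sketch of the host-to-host comparison described above: for each
# protein/algorithm that both hosts completed (same work-unit duration),
# take the ratio of decoys produced, then average the ratios.
# All host results below are invented for illustration.

def relative_performance(results_a, results_b):
    """results_a and results_b map work-unit name -> decoys produced."""
    shared = set(results_a) & set(results_b)
    if not shared:
        raise ValueError("no common work units to compare")
    ratios = [results_a[wu] / results_b[wu] for wu in shared]
    return sum(ratios) / len(ratios)  # mean ratio; >1 means host A is faster

host_a = {"PROTEIN_X_ABRELAX": 60, "PROTEIN_Y_JUMPING": 24, "PROTEIN_Z": 10}
host_b = {"PROTEIN_X_ABRELAX": 40, "PROTEIN_Y_JUMPING": 16, "PROTEIN_Z": 8}

print(relative_performance(host_a, host_b))  # ≈ 1.417: host A ~42% faster
```

The more proteins and algorithms in the sample, the less any single lucky or unlucky WU skews the average.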

If you're adjusting the credit via some custom client, then this could be a guideline as to what is a fair adjustment and what is excessive (although personally I'd stick to a standard client).

Doesn't answer your question about FPU/SSE2 performance, of course, but if you wanted a Rosetta-specific benchmark...

ID: 21109
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21160 - Posted: 25 Jul 2006, 19:38:11 UTC

OK, an independent benchmark, I can see the value in that. My machines were shown in Tony's list. I ran this on one of them. Is this the output you were looking for?

Host 172896 SSE2 327.730, FPU 187.109
GenuineIntel Intel(R) Pentium(R) 4 CPU 3.00GHz

One question, with respect mind you; I just want to confirm we're on the same page. The readme says the benchmark doesn't really do any memory access, and so does not depend upon the L2 cache in the CPU nor memory speed. I guess it ALSO says it's "...totally focused on detecting the double precision floating point CPU capabilities", so I guess that's a given.

I just wanted to point out that for an application that's heavily accessing memory, memory is going to become the bottleneck to greater throughput. And so it's possible, in fact very likely, that no single benchmark will scale the same way Rosetta does. I mean, you may have one machine show 2x the results on this benchmark, but its difference in throughput crunching Rosetta models will not necessarily correlate to this 2x. For that matter, I wouldn't expect it to scale accurately onto any project's application. The runtime environment has too many dependent factors to account for. There's always going to be more to it than just the speed of double floats.

This is why benchmarking and credits are such a difficult nut to crack. And I think it's these subtle complexities that lead to the disputes seen elsewhere, because almost no one truly understands all of what's going on in that CPU to make the models crunch. I certainly don't claim to. Please educate me.

I'm also not clear what you plan to do with the results as compared to Tony's numbers. These are two different benchmarks, so yes, their relative scales will be different... but neither is representative of how Rosetta uses the CPU, and so neither is going to measure the whole story and reflect how Rosetta throughput will scale across these various systems. And this is why the approach of establishing a scale based on actually crunching a given WU, which Rosetta has proposed, seems to make a lot of sense. It's a way to fully account for how Rosetta, in its entirety, uses the various speeds and resources in the system. And if your box has the speeds and resources that Rosetta needs to crunch the model, then it's achieving more useful science work than a box which becomes bottlenecked as it processes the work.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 21160 · Rating: 1
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21176 - Posted: 26 Jul 2006, 0:04:54 UTC

Here's another host of mine, host 181314

It clocked in at SSE2 261.139, FPU 109.988

It's reported as an identical CPU, but doesn't run dual-core. Yet it has about the same RAC as the first one I reported. I never understood why one runs as a dual core and the other does not. Is that BIOS level? I looked in the settings, and both seem to be configured the same way. I forget what you call it when you view the settings as it boots up?
ID: 21176
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 21185 - Posted: 26 Jul 2006, 1:59:36 UTC

Hrmph. Just wrote a long reply and then lost it...

You should be able to enable/disable HyperThreading in the BIOS [don't ask me where, tho' - it varies]. Of course, Rosetta uses nearly all of the processor's floating point capacity 99% of the time, so whether you run one or two threads at the same time probably makes little difference - except that with two threads you run two models at once (each doing half the number of decoys).

The Whetstone benchmark is quite "silly" - in fact, with just a single-line modification (changing the line that checks the result to do nothing), I can make Whetstone report that my machine is around 25 times faster than it really is - because it's then just doing an empty loop with no floating point calculations at all... Much of the original calculation disappears if you optimize even a little, so the reported result isn't necessarily right.

Also, I believe that the Whetstone result of SiSoft isn't calculated the same way as the "regular" Whetstone [I know this is the case with Dhrystone - not 100% sure about Whetstone].

I agree, crunching a fixed work unit is the best way to get a good guesstimate of how well a machine performs running Rosetta (and likewise for any other application, of course - SETI, Einstein, Predictor, etc., etc.). I suspect that running a work unit for a minute and seeing how many decoys we can generate will be one way to do it... A minute should be more than enough to establish how fast the machine is, I should think.

--
Mats
ID: 21185
XS_STEvil

Joined: 30 Dec 05
Posts: 9
Credit: 189,013
RAC: 0
Message 21187 - Posted: 26 Jul 2006, 2:40:04 UTC
Last modified: 26 Jul 2006, 2:41:01 UTC

Feet1st: the point is to attempt to gather a few numbers which might show whether a dual Xeon at 3.2GHz is faster than, or the same performance as, an Opteron 165, and if not, at what point (clock speed) they are most similar (only using these CPUs as examples).

I agree that the mandelbrot benchmark is not the only way to find this, but it will show which CPU delivers the best FPU/SSE2 performance and give us a relative starting point.

To tell you the truth, I don't believe any "benchmark" does a very good job of showing CPU performance, as some are heavily weighted toward specific instructions or instruction sets, while another CPU may have incredible gobs of memory and/or internal cache bandwidth at low latency, which will allow it to put out high production numbers but low benchmark scores.

It's kind of like nitrous oxide vs. superchargers in drag racing: one gives you crazy numbers but doesn't take you far; the other kicks your butt all the way down the strip and then takes you home at the end of the night ;)

Maybe we should discuss what a CPU benchmark should encompass and how it should be done "properly"?

Don't take this to mean I don't like the new benchmark system coming for Rosetta@home, as I truly do. It will measure CPU performance for Rosetta perfectly, but it will not measure overall performance, which is more in line with what this thread is about.
ID: 21187
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21188 - Posted: 26 Jul 2006, 2:50:03 UTC - in response to Message 21185.  

Hrmph. Just wrote a long reply and then lost it...

I HATE it when that happens! "Always right-click people's handles to see their computers" - that's a rule that's saved me from repeating the error. It seems invariably I get half done with a reply and then say "gee, I wonder if he's got..." ZAP!

A minute should be more than enough to establish how fast the machine is, I should think.


Actually, I don't think it would be. I say this because there are different phases in the analysis. For example, there is the ab initio phase and the full-atom relax. Also, while I don't pretend to know anything about the actual Rosetta code, I do know something about how expert systems work, and I believe the concepts are likely similar. Basically, you have to build up a large decision tree and then navigate it. So the size of the tree you build up will affect the speed at which you navigate it, and the relative amount of time spent in navigation vs. construction.

What I'm trying to say is that, while one might devise a workload that would simulate functions performed in Rosetta, there are wide variances between different proteins and WU types (Jumping, Fixed, Ignore... I don't know what all of it means, but I can see it can dramatically affect the number of models per hour I crunch on a given-sized protein). So even a benchmark designed to mimic Rosetta functions would not scale properly when compared against the new techniques they are devising all the time to study protein structure.

These factors are what make a "credit tailored to each WU released" type of system the most truly representative benchmark. ...and even that is flawed, because I've seen two WUs run at the same time with the same name, just running from different random number seeds, and after 10 or 15 hours of runtime on each, sometimes one is 10 models ahead of the other. So the luck of the random draw apparently affects the size of the decision tree that is built, or in some other way some models take longer to crunch than others.

If Rosetta is 99% immersed in floating point, then the other factors I am citing must only comprise the other 1%, so perhaps my concerns are too broad. I'm still puzzled how one WU can crunch for 15 hrs and produce 65 models, while another copy of the SAME WU can crunch on the SAME machine at the SAME TIME for 15 hrs and only produce 55 models... is it possible that one of my split cores randomly manages to monopolize the FP unit over the other? And if it did, so that the other is stuck spending a much higher percentage of its time waiting for FP operations to complete, would they still both report the same number of CPU seconds used?

ID: 21188
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 21189 - Posted: 26 Jul 2006, 3:11:43 UTC

If you're talking about an Intel P4 with hyperthreading, then you're getting the kind of results I'd expect. For some of the other DC projects I've taken part in, the HT users mentioned that they got a small performance increase over the system without HT turned on (5-10%), at the cost of the two work units each taking almost twice as long to produce. (Others claimed a 20% boost in speed with HT turned on.) If the two tasks you're running on an HT system are different, they can potentially get a substantial speedup - although it won't reach a 100% boost in speed; but if you're running two identical tasks, then you basically have everyone driving down Main Street, stopping at the stop sign, and turning onto 1st Street. It's bottlenecked.
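As a rough sanity check of those hyperthreading figures (the per-task slowdown factors below are assumptions chosen to match the reported numbers, not measurements):

```python
# With HyperThreading, two threads share one physical core's FP unit. If each
# task takes s times as long as it would running alone, total throughput is
# 2/s tasks per task-time. The slowdown factors below are assumptions chosen
# to match the 5-10% and ~20% figures people reported, not measurements.

def ht_throughput_gain(per_task_slowdown):
    """Relative throughput gain of 2 HT threads vs. 1 thread, as a fraction."""
    return 2.0 / per_task_slowdown - 1.0

print(f"{ht_throughput_gain(1.9):.0%}")   # two near-identical FP-bound tasks -> ~5%
print(f"{ht_throughput_gain(1.67):.0%}")  # more dissimilar tasks -> ~20%
```

So "each task takes almost twice as long" and "a few percent more total throughput" are the same observation stated two ways.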

Testing on a CPU that has two actual cores (Pentium D, Core Duo, Athlon X2, dual-core Opteron) - or a dual-CPU system (Opteron, Xeon) - would be a much fairer test of how repeatable the WUs are on identical hardware with identical settings.



ID: 21189
Ethan
Volunteer moderator

Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 21190 - Posted: 26 Jul 2006, 3:13:38 UTC - in response to Message 21188.  

Without any knowledge of the plan being devised:

Wouldn't the new credit system make Rosetta the best project credit-wise? Rather than a project with a quorum of 2 or 3, it will take hundreds of results and average them together.

If you have a fast processor, great! You'll crunch it much faster than average and get the same credit. If you have quad core uberness, you'll get 4x the credits of a single core.

At what point would wu to wu differences be an issue? Since everything is averaged out, everyone has the same chance of getting a 'short' or 'long' wu. Is that better or worse than the current situation where a person can set their credit modifier?

-E

ID: 21190
MikeMarsUK

Joined: 15 Jan 06
Posts: 121
Credit: 2,637,872
RAC: 0
Message 21196 - Posted: 26 Jul 2006, 7:46:35 UTC - in response to Message 21190.  

...

At what point would wu to wu differences be an issue? Since everything is averaged out, everyone has the same chance of getting a 'short' or 'long' wu. Is that better or worse than the current situation where a person can set their credit modifier?
...


It should be possible for the project to know roughly how 'big' a WU is without running it (number of bases versus which algorithms are being used). So a statistical approach should mean that there won't be huge differences between WUs credit-per-hour in any case.

I'm looking forward to the new system; it should be a big step forward for the project in terms of fairness (no difference in science, although I'm sure there could be a stats or CompSci publication in there somewhere for a postgrad).

ID: 21196
XS_DDTUNG

Joined: 3 Jan 06
Posts: 9
Credit: 26,087,357
RAC: 0
Message 21204 - Posted: 26 Jul 2006, 10:13:07 UTC - in response to Message 21196.  

...

At what point would wu to wu differences be an issue? Since everything is averaged out, everyone has the same chance of getting a 'short' or 'long' wu. Is that better or worse than the current situation where a person can set their credit modifier?
...


It should be possible for the project to know roughly how 'big' a WU is without running it (number of bases versus which algorithms are being used). So a statistical approach should mean that there won't be huge differences between WUs credit-per-hour in any case.

I'm looking forward to the new system, should be a big step forwards for the project in terms of fairness (no difference in science, although I'm sure there could be a stats or CompSci publication in there somewhere for a postgrad).


It appears from the information available that Rosetta will be moving towards a F@H type of credit system which is known to favor certain processors. Not complaining, just reiterating my belief that there is no perfect credit system.

Whatever happens, let's just crunch on for the science.

DDTUNG
ID: 21204
Mats Petersson

Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 21206 - Posted: 26 Jul 2006, 10:31:54 UTC

I was trying to think of how a benchmark could favour one processor over another if it's running the same code as the "real" code. And it turns out that if you have a benchmark WU that runs almost entirely in the cache, but a real WU that is much larger and therefore needs to access memory, then the processor with the slower memory will get a higher score relative to its actual performance (since, relatively speaking, it's going to take longer to complete the real WU).

It is very difficult to make a really, absolutely fair, credit system.

I agree: Let's crunch for science.

--
Mats
ID: 21206
MikeMarsUK

Joined: 15 Jan 06
Posts: 121
Credit: 2,637,872
RAC: 0
Message 21216 - Posted: 26 Jul 2006, 13:33:07 UTC

I think DDTUNG is referring to the (known) fact that different science applications perform differently on different processors, rather than the benchmark not mirroring the science application.

So if Rosetta gets 20% more science work done on Intel, then it'll give 20% more credit. To me that would be the ideal outcome, since credit would match science work.

The CPDN example would be that SAP runs particularly well on Intel, while the coupled model runs particularly well on AMD. Hence I match processors to workloads in order to get as much science work done as possible.

ID: 21216
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21220 - Posted: 26 Jul 2006, 14:37:16 UTC - in response to Message 21190.  

Ethan
Without any knowledge of the plan being devised:
... it will take hundreds of results and average them together.

...At what point would wu to wu differences be an issue?


I believe the plan is to use the data gathered for a given WU on Ralph (where credits are not an issue) and devise a credit value per model for each WU. So, each WU is sent to a few dozen machines, and each machine crunches a few dozen models, Ralph confirms the WUs are configured properly for release on Rosetta, and they determine the credit value per model crunched for that specific WU.

Yes, WU-to-WU differences can vary widely. Some proteins I can crunch a model every 6 minutes; others take 1.5 hrs per model. It depends upon the approach being attempted and the relative size of the protein.

But it would mean the credit value per model would be predetermined when the WUs are released on Rosetta. So you would still receive immediate credit upon reporting results back to Rosetta.
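A minimal sketch of how such a per-WU credit value might be derived from Ralph results, assuming the simple rule guessed at here (the project's actual formula hasn't been published, and all numbers below are invented):

```python
# Hypothetical sketch of deriving a fixed credit-per-model for one WU from
# Ralph test results. Each entry is (claimed_credit, models_produced) for one
# test host; taking the median claim per model resists outliers. The project's
# real formula is not public, so treat this purely as an illustration.
from statistics import median

def credit_per_model(ralph_results):
    per_model = [credit / models for credit, models in ralph_results if models > 0]
    return median(per_model)

ralph_results = [(30.0, 12), (55.0, 20), (24.0, 10), (80.0, 25)]
print(credit_per_model(ralph_results))  # 2.625 credits per model for this WU
```

Once fixed, that value would be attached to the WU when it is released on Rosetta, so credit could still be granted the moment a result is reported.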

DDTUNG
...credit system which is known to favor certain processors. Not complaining, just reiterating my belief that there is no perfect credit system.

Whatever happens, let's just crunch on for the science.


I think at that point it would be more correct to say that the system configuration is ideally suited to the Rosetta workload. If Rosetta is floating-point bound, and one machine does floating point twice as fast, Rosetta gets more science done and that PC is going to get more credit. To say Rosetta then "favors" that machine is kinda looking at it backwards.

In the future, the science Rosetta uses to crunch will evolve. It might become memory-bound instead, and run best on systems with large caches and fast memory. The dynamics will change. And so by using Ralph to prototype each WU, the established credit value for it, relative to some arbitrary benchmark, would automatically adjust. The only issue then would be if your arbitrary benchmark machine was an environment that was great at float but had slow memory... and the dynamics of the actual Rosetta app changed away from that initial benchmark. At that point, yes, some environment would be favored; the one that ran the initial benchmark the best would forever be favored. But it gives you a stake in the ground. A frame of reference that everything is compared to.

I wholeheartedly agree. Crunch more Rosetta! And thanks for clearing up some of my confusion about my two "identical" CPUs :)
ID: 21220
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 21221 - Posted: 26 Jul 2006, 14:45:31 UTC - in response to Message 21206.  

...if you have a benchmark WU that is running almost entirely in the cache, but a real WU that is much larger, and therefore needs to access memory, then the processor with the slower memory performance will get a higher score (since it's going to take, relatively speaking, longer to complete the WU).


Since the proposed system would value credit on a per-model-crunched basis (that's my understanding, anyway), and the slow-memory system would take longer to crunch each model, it would get LESS credit per hour. But the same credit per model, as everyone gets the same credit per model.

Remember, the WUs are kinda virtual in Rosetta. You crunch it as long as you like. You just pick a different random number and have at it again and again. Each pass is called a model. You see in the graphic how many models you have crunched so far on the given WU. But the value to the science is based on the number of models crunched. Run the WU for 24hrs and crunch 120 models, and you are doing more science than someone that runs for 3hrs and does 15 models. If they can do 17 models in the three hours, then they've done more science per hour.
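The arithmetic in that example, written out:

```python
# Credit per model is the same for everyone on a given WU, so science done
# (and credit earned per hour) scales with models per hour. Numbers are the
# ones from the example above.

def models_per_hour(models, hours):
    return models / hours

print(models_per_hour(120, 24))  # 5.0 models/hr over a 24-hour run
print(models_per_hour(15, 3))    # 5.0 models/hr: same rate, shorter run
print(models_per_hour(17, 3))    # ~5.67 models/hr: more science per hour
```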

It is very difficult to make a really, absolutely fair, credit system.

THAT's for sure!
ID: 21221
XS_STEvil

Joined: 30 Dec 05
Posts: 9
Credit: 189,013
RAC: 0
Message 21251 - Posted: 27 Jul 2006, 2:13:18 UTC

Why are the longer WUs larger in size to download, then?

I'm on dial-up that is not connected 24/7, so used bandwidth comes at a premium :(


BTW - thanks for keeping the discussion clean so far guys! :D
ID: 21251
Keith Akins

Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 21252 - Posted: 27 Jul 2006, 5:22:08 UTC

Sorry I can't be much help on the dialup. I'm on low speed broadband.

As far as the new crediting system goes, I think we'll be OK as long as the credits awarded then don't vary too much from what they show now.

Keep er Clean, Git Ur Dun.

ID: 21252
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 21253 - Posted: 27 Jul 2006, 5:40:57 UTC

XS_STEvil:

We're straying off topic, but this is my understanding: a tiny 43-amino-acid protein has a small file that describes the interactions between those 43 amino acids. For most smaller proteins, there are fewer interactions to check, and we move through them quickly. The smaller proteins tend to have both smaller files and faster run times.

The 156-amino-acid proteins have a much larger file describing the interactions and characteristics of those 156 amino acids. Even if the program limits the testing of interactions between the amino acids to just those that are "close," tripling the size of the protein makes the system run much more than 3 times slower.
Say the 43 AA protein had an average of 10 amino acids that were "close" to each amino acid being tested. The 156 AA protein would have a much higher number - perhaps averaging 50. The amount of work doesn't increase linearly - and if it increases as the square of that count, that would be about (50/10)² = 25 times as much work as for the 43 AA protein.
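BennyRop's back-of-the-envelope scaling, written out (the neighbour counts of 10 and 50 are his illustrative guesses, not measured values):

```python
# If per-residue work grows roughly as the square of the average number of
# "close" neighbours, then going from ~10 neighbours (43 AA protein) to ~50
# neighbours (156 AA protein) costs about (50/10)^2 = 25x the work.
# The neighbour counts are the illustrative guesses from the post above.

def relative_work(neighbors_large, neighbors_small):
    return (neighbors_large / neighbors_small) ** 2

print(relative_work(50, 10))  # 25.0
```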

(Perhaps these last two messages can be moved and a Rosetta team member can correct what I've posted.)




ID: 21253
[B^S] thierry@home

Joined: 17 Sep 05
Posts: 182
Credit: 281,902
RAC: 0
Message 21257 - Posted: 27 Jul 2006, 8:46:07 UTC
Last modified: 27 Jul 2006, 8:57:03 UTC

I'm trying to follow:

- How many models are available for a given WU?
- If I run a WU for three hours and make, let's say, 10 models, will another user receive the same WU but for other models?

I did a test. Normally I crunch a WU in 3 hours. I changed my preferences from the default to 6 hours. I had a WU running, and when it reached a checkpoint the 'Completion' dropped from 70% to 48%, which is correct because now 'calculation time' + 'to completion' = 6 hours.

But in the end, doing 10 models per WU every 3 hours or 20 models per WU every 6 hours is the same. I mean the contribution is the same, isn't it?

Is one a better way to crunch? Or is crunching just crunching, whatever the preferences?


ID: 21257



©2024 University of Washington
https://www.bakerlab.org