Strange problem with dual Xeon machine

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52228 - Posted: 4 Apr 2008, 12:14:09 UTC
Last modified: 4 Apr 2008, 12:15:00 UTC

Has anyone had an issue where rebooting a Xeon computer makes it restart all 8 Rosetta work processes at zero again?

Also, that same machine gives a projected credit of about 35 per work process, but the actual credit awarded is about 10. Why the large disparity?

Thanks!

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52229 - Posted: 4 Apr 2008, 13:46:13 UTC

I presume this is the host you are referring to??

I suspect that the tasks that were active at the time you rebooted the machine had not reached a checkpoint yet, and therefore none of their work had been preserved. Some of the tasks being worked on lately are not able to reach a checkpoint very often, and so this further increases the odds of you seeing all 8 starting over at the same time. The good news is that as long as the machine is eventually able to run BOINC long enough, it will recover and continue normally with no intervention. And if this should happen to keep up due to how you use that machine, Rosetta will abort the tasks after 5 restarts in a row with no progress being made. And so then perhaps the new work you get will be able to checkpoint more frequently.
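
The checkpoint behaviour described above can be sketched roughly as follows. This is only a toy model, not Rosetta's actual code: the function name and the checkpoint intervals are hypothetical, and it assumes a task simply resumes from the last checkpoint it managed to write.

```python
def preserved_progress(run_seconds, checkpoint_interval):
    """Return the seconds of work that survive a restart.

    Hypothetical model: a checkpoint is written every `checkpoint_interval`
    seconds of run time, and a restart resumes from the last one written.
    """
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    completed_checkpoints = run_seconds // checkpoint_interval
    return completed_checkpoints * checkpoint_interval


# A task rebooted after 40 minutes of work; the intervals are illustrative only.
ran_for = 40 * 60
print(preserved_progress(ran_for, 15 * 60))  # 1800: resumes from the 30-minute mark
print(preserved_progress(ran_for, 60 * 60))  # 0: no checkpoint yet, starts over
```

With infrequent checkpoints, any reboot inside the first interval sends the task back to zero, which is consistent with all 8 tasks restarting together.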

It looks like that host has a very very small L2 cache on the CPU. Indeed it's reporting in with 122K... not Meg, but K! In a nutshell, Rosetta seems to run better on processors with more L2 cache memory. And so your benchmark results are not in line with the actual results you are able to produce when crunching the Rosetta WUs.

This sort of thing is part of why you see so much talk on the internet about one processor being more powerful than another. It's not as simple as comparing GHz, and actual performance is very dependent on the type of work you measure the processor doing.
Rosetta Moderator: Mod.Sense

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 52230 - Posted: 4 Apr 2008, 15:10:46 UTC

I am not sure if your machine is a "true" 8 processor or, like mine, uses Hyper-Threading to "simulate" multiple processors. In my case I have a dual Xeon with HT and it "thinks" that there are 4 CPUs.

My experience is that running a "mix" of projects improves performance in situations like this. In effect, you get less contention for the same resource.

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 116,018,183
RAC: 64,830
Message 52231 - Posted: 4 Apr 2008, 15:19:27 UTC

If it is that machine (E5430 Xeon), then it has 12MB of L2 cache, which it looks like BOINC can't read/report correctly. It doesn't support HT - it's a true 2x4 core machine.

There is something wrong with the credit though - it should be claiming and receiving far more than it is. Is speedstep kicking in?

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52232 - Posted: 4 Apr 2008, 15:44:29 UTC - in response to Message 52229.  

Yes, that is the computer. They are two quad-core 45nm Hi-K processors with 12MB of L2 cache, so I'm really not sure why they are being reported as only having 122K of cache. I'm running them on an ASUS DSEB-D16/SAS Enterprise Server Mobo, which has the Intel 5400 chipset. I'm running 16 Gig of RAM on that Mobo, which is probably overkill and doesn't benefit BOINC much. Also, I'm using the latest (5.10.45) version of BOINC. Perhaps there is something in the BIOS that I'm not seeing that is improperly reporting the CPU L2 cache.

Thanks for the information on "checkpoints." I wasn't aware of those. I've rebooted the system several times since getting it online, so that must have been why it kept resetting the processes to zero. Thanks for the clarification!

Mark

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52233 - Posted: 4 Apr 2008, 15:45:24 UTC - in response to Message 52231.  

I'm not sure what "speedstep" is. Where would I see that reported as 'kicking in?'

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52234 - Posted: 4 Apr 2008, 15:53:04 UTC - in response to Message 52229.  

Checking the CMOS, the CPUs are being reported as having a 128k L1 Cache and a 12MB L2 cache. Perhaps that's why BOINC is reporting the system as having a 122K cache?

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 116,018,183
RAC: 64,830
Message 52235 - Posted: 4 Apr 2008, 16:22:31 UTC - in response to Message 52233.  
Last modified: 4 Apr 2008, 16:23:17 UTC

I'm not sure what "speedstep" is. Where would I see that reported as 'kicking in?'

Speedstep turns down the speed of your CPUs when they're not in use. An idle-priority task (Rosetta) might not qualify as 'in use', so Speedstep might be kicking in. The easiest way to tell would be to run CPU-Z (http://www.cpuid.com/cpuz.php), which will show you the speed of your CPUs in real time. If it varies depending on what you're doing, then Speedstep is enabled and might be the cause of the low scores/throughput. You'll be able to change the settings for that in the BIOS, but I think there's a way to control it from Windows too.
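
If you jot down a few clock readings from CPU-Z while Rosetta is running, a small check like the one below shows the idea. It is only a sketch: the function name, the 50MHz tolerance, and the sample readings are made up for illustration and do not come from CPU-Z or BOINC.

```python
def clock_is_varying(samples_mhz, tolerance_mhz=50.0):
    """Return True if the readings differ by more than `tolerance_mhz`.

    `samples_mhz` is a list of core-clock readings noted by hand; the
    tolerance is an arbitrary allowance for normal jitter in the display.
    """
    if not samples_mhz:
        raise ValueError("need at least one reading")
    return max(samples_mhz) - min(samples_mhz) > tolerance_mhz


steady = [2660.1, 2659.8, 2660.3]      # hypothetical readings, clock held
throttled = [2660.0, 1995.0, 2660.2]   # hypothetical readings, clock dropping
print(clock_is_varying(steady))     # False
print(clock_is_varying(throttled))  # True
```

A steady reading under load suggests the clock is not being stepped down; a large swing means something (Speedstep or thermal throttling) is changing it.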

You're almost certainly right about the L1/L2 issue too - it looks like BOINC is reporting the wrong number!

HTH

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52236 - Posted: 4 Apr 2008, 16:24:25 UTC - in response to Message 52231.  

Regarding "Speedstep," I checked my CMOS, and Speedstep is currently disabled. Do you recommend I enable it?

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52237 - Posted: 4 Apr 2008, 16:27:38 UTC - in response to Message 52235.  

Thanks for the clarification! Since Speedstep is off in the BIOS, it isn't the problem. It's nice to know what it's for, however. There isn't much help in the Mobo manual about BIOS functionality (which is more often the case than not).
Mark

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 52239 - Posted: 4 Apr 2008, 20:42:56 UTC

Other things to check are that you have the most up to date drivers for the motherboard, video card and OS...

You can get a brand new MB that ships with a really old driver/BIOS ...

A pain, but if this continues that needs to be added to the checklist.

Also check that heat is within tolerances, as most processors have "self-help" that will slow the processor if it overheats.

There may be other settings in the BIOS that can cause variable processor speeds.

Just for ha-has you could also try another project to see if the same kind of thing is happening there ... sometimes you have to turn over a lot of rocks to isolate an odd problem here and there ...

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52241 - Posted: 4 Apr 2008, 22:06:13 UTC - in response to Message 52239.  

Great suggestions! I checked the ASUS website, and I have the most current MB BIOS. None of the drivers that came with the Mobo were for XP Pro, just Win Server 2003, so I emailed their tech support and he gave me links to XP drivers for it. The video is on the motherboard; a measly 32 meg. I've ordered a PCI Express 2.0 card for it, but will that make much of a difference since I don't use the screensaver anyway?

Anyway, thanks for the great suggestions! I hope to figure something out. I just can't understand why the system anticipates I'll earn 30+ credits, but I only end up with about 10. I'll keep tinkering!

Mark

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52243 - Posted: 4 Apr 2008, 22:24:48 UTC - in response to Message 52237.  

Thanks for the link to CPUID! Running that, it says the CPU is running at 2.66GHz, but I was surprised that the max bandwidth of the FBDIMM PC2-5300 DDR2 667 RAM is only 333MHz. I guess they split the bandwidth for each CPU? I have one 4-Gig RAM chip in Slot 0 of each of 4 banks. This RAM was recommended by ASUS as being compatible, but I just couldn't afford faster RAM.

I'll work on bringing temps down, but since the CPU Step function is disabled, at least I know that isn't the problem. Thanks again!

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 52244 - Posted: 4 Apr 2008, 22:28:57 UTC

The video on the motherboard is very much an issue, because ANY update to video uses the main memory bus to make changes to video memory. So off-board memory is the way to go, even with a cheap $29 card.

So, if you are using the system, even doing something like surfing, it can make a big difference.

This is why a slightly more expensive MB can make all the difference in the world.

I once changed nothing more than the MB (same CPU, memory, video card, HDD) and saw a 20%+ increase in speed. Part of it was the way memory was accessed, single vs dual channel ... but it was just a better MB and, well, it was on sale ...

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52247 - Posted: 4 Apr 2008, 23:10:51 UTC - in response to Message 52244.  

I see! This wasn't a cheap board by any means. I went with this one because I hadn't seen very great reviews of the Intel 5400X Dual LGA 771 board (although I'm normally a real proponent of genuine Intel). I'll have the video card by next week. I just finished putting in a Fan controller, since the CPU fans are running high (but not max). I think I need to upgrade the fans in this Thermaltake Armour case as well.

Thanks again!

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 116,018,183
RAC: 64,830
Message 52248 - Posted: 5 Apr 2008, 0:07:48 UTC

A graphics card won't make any difference to Rosetta on that machine, but it might improve surfing etc. if you use it for that.

There's definitely something wrong, but I can't see what. I don't recommend this often, but maybe a project reset is required...

Mod.Sense - can you have a look at this as the scores don't look right. Any ideas?

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52249 - Posted: 5 Apr 2008, 0:34:30 UTC - in response to Message 52248.  
Last modified: 5 Apr 2008, 0:36:24 UTC

Thanks. I haven't been using the system at all; just running Rosetta on it.

Comparing the stats in Rosetta of my 2.4GHz QuadCore CPU to the 2.6GHz dual Quad Xeon machine is interesting. The 2.6GHz dual Xeon machine's floating point speed (FPS) is 2623.61 and its integer speed is 4498.38, while the QuadCore measures 2355.9 FPS and 5410.25 integer speed. And the Xeon processors have a 12MB L2 cache. Hmmmmmm. Don't know what to make of that slower integer speed.
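
To put those two benchmark pairs side by side, here is a rough sketch of a benchmark-based credit claim per CPU hour. It assumes the classic cobblestone definition (100 credits per day of CPU time on a reference machine scoring 1,000 on both benchmarks); that formula is an assumption about how the claim was computed at the time, the function name is hypothetical, and it says nothing about how Rosetta decides the credit it actually grants.

```python
SECONDS_PER_DAY = 86_400
REFERENCE_MOPS = 1000.0            # assumed reference score on each benchmark
CREDITS_PER_REFERENCE_DAY = 100.0  # assumed cobblestone definition


def claimed_credit(cpu_seconds, float_mops, integer_mops):
    """Estimate a benchmark-based credit claim for one task (assumed formula)."""
    average_mops = (float_mops + integer_mops) / 2
    days = cpu_seconds / SECONDS_PER_DAY
    return days * (average_mops / REFERENCE_MOPS) * CREDITS_PER_REFERENCE_DAY


# One CPU hour on each machine, using the figures quoted in the post.
xeon = claimed_credit(3600, 2623.61, 4498.38)
quad = claimed_credit(3600, 2355.9, 5410.25)
print(f"{xeon:.1f} {quad:.1f}")  # 14.8 16.2
```

Under that assumption the lower integer score alone would only account for a claim roughly 8% lower per hour, far short of the gap between 30+ and 10.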

Further, comparing it to another 2.66GHz dual-CPU Xeon X5355 machine on BOINC (owned by ROBiie, and not the exact same model) showed an FPS of 2531.02 but an astounding 8193.16 integer speed!

That's the performance I was looking for--especially since my Xeon processors are the 45nm Hi-K models Intel's been ranting about. Could the difference be because he's using XP PRO/64 while I'm using Pro/32?

Thanks again!

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 116,018,183
RAC: 64,830
Message 52252 - Posted: 5 Apr 2008, 1:11:26 UTC - in response to Message 52249.  

Could the difference be because he's using XP PRO/64 while I'm using Pro/32?

Nice machine :D

No - 64 bit isn't an advantage for BOINC/Rosetta at the moment (might be in the future). Even with those benchmarks you should be getting higher scores for your granted credit... What are your preferences set as - i.e. do you have 'use at least 8 CPUs' and use 100% of CPU?
You could download Sandra Lite and run a few benchmarks, but it sounds like it's running fine. Are there 8 Rosetta processes each using ~12% CPU utilisation? (The easiest way to check is to get Task Manager up, show the 'CPU time' column, and sort by that column so the longest-running processes are at the top.)
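
The ~12% figure is just 100% split across 8 cores. A quick sketch of that check, with hypothetical function names and a made-up 1% slack for Task Manager's rounding:

```python
def expected_share_percent(logical_cpus):
    """Share of total CPU one fully busy single-threaded process should show."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return 100.0 / logical_cpus


def all_cores_busy(observed_percent, logical_cpus, slack=1.0):
    """True if there is one process per core and each is near its full share."""
    expected = expected_share_percent(logical_cpus)
    return (len(observed_percent) == logical_cpus
            and all(p >= expected - slack for p in observed_percent))


print(expected_share_percent(8))                # 12.5
print(all_cores_busy([12] * 8, 8))              # True
print(all_cores_busy([12] * 4 + [3] * 4, 8))    # False
```

Eight processes each showing about 12% means every core is fully occupied, so a low score would have to come from somewhere other than idle cores.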

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 116,018,183
RAC: 64,830
Message 52254 - Posted: 5 Apr 2008, 1:27:15 UTC - in response to Message 52243.  

Thanks for the link to CPUID! Running that, it says the CPU is running at 2.66GHz, but I was surprised that the max bandwidth of the FBDIMM PC2-5300 DDR2 667 RAM is only 333MHz. I guess they split the bandwidth for each CPU? I have one 4-Gig RAM chip in Slot 0 of each of 4 banks. This RAM was recommended by ASUS as being compatible, but I just couldn't afford faster RAM.


Your RAM is fast enough. CPU-Z reports 333MHz because it's DDR (double data rate): the memory transfers data twice per clock, so a 333MHz clock is correct for DDR2 667. It will have very little effect anyway, as it's pretty quick, and the large cache on those CPUs makes a big difference too.
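
The relationship between the 333, 667 and 5300 figures can be worked out in a few lines. This sketch assumes a standard 64-bit DDR2 data path on the memory chips themselves; the function name is hypothetical, and the buffered FB-DIMM link to the chipset works differently, so treat the result as the DRAM-side peak only.

```python
def ddr_figures(clock_mhz, bus_width_bits=64):
    """Return (transfers in MT/s, peak bandwidth in MB/s) for a DDR clock.

    DDR moves data on both clock edges, so transfers per second are twice
    the clock; bandwidth is transfers times the bus width in bytes.
    """
    transfers = clock_mhz * 2
    bandwidth = transfers * bus_width_bits / 8
    return transfers, bandwidth


transfers, bandwidth = ddr_figures(333.33)
print(round(transfers), round(bandwidth))  # 667 5333
```

So the 333MHz reading, the "DDR2 667" name and the "PC2-5300" rating all describe the same speed.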

I'd leave CPU-Z running for a while and monitor the CPU speed just in case it is dropping. It will do that if the CPUs get too hot too.

HTH
Danny

Dusty
Joined: 1 Mar 08
Posts: 41
Credit: 2,667,354
RAC: 0
Message 52255 - Posted: 5 Apr 2008, 1:59:42 UTC - in response to Message 52252.  

I just checked. It's set to the default 16 processors and 100% CPU time. In Task Manager, it shows each process is getting 12%. Thanks for the suggestions!

©2024 University of Washington
https://www.bakerlab.org