why is this machine failing so much?

Message boards : Number crunching : why is this machine failing so much?

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29807 - Posted: 22 Oct 2006, 8:53:48 UTC

https://boinc.bakerlab.org/rosetta/results.php?hostid=336493

Most, but not all error out. Why?
Reno, NV
Team: SETI.USA
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29810 - Posted: 22 Oct 2006, 9:41:16 UTC - in response to Message 29807.  

https://boinc.bakerlab.org/rosetta/results.php?hostid=336493

Most, but not all error out. Why?



The 131 error means a file size is too big. More detail on this says an output file was bigger than max_nbytes:
http://boinc-wiki.ath.cx/index.php?title=Error_Code

So I would guess something is wrong ;-) , lol
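
For anyone who wants to check the file-size explanation by hand, here is a minimal sketch that lists files larger than a given limit. The directory, the file names and the limit of 100 bytes are all made up for the demo; the real max_nbytes value for a workunit is set by the project and is not given in this thread.

```python
from pathlib import Path
import tempfile


def oversized_files(directory, max_nbytes):
    """Return (path, size) for every regular file larger than max_nbytes."""
    result = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > max_nbytes:
                result.append((path, size))
    return result


# Demo on a throwaway directory (hypothetical file names and limit).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.out").write_bytes(b"x" * 10)
    (Path(tmp) / "big.out").write_bytes(b"x" * 1000)
    for path, size in oversized_files(tmp, max_nbytes=100):
        print(path.name, size)  # prints: big.out 1000
```

Pointing oversized_files at a real working directory would show whether any output file is unexpectedly large, but it cannot say why the file grew.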
Team mauisun.org
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29818 - Posted: 22 Oct 2006, 12:44:43 UTC

Have you tested the hardware with Memcheck86+ and a HD diagnostic from the manufacturer of the HD? (The whole collection of current manufacturers' HD diagnostics is on the Ultimate Boot CD.)

And have you tried temporarily setting up a clean install of the OS on another HD, installing BOINC and letting it load a fresh copy of Rosetta, to see if the problem disappears with a clean install?
Keith Jillings

Joined: 26 Sep 06
Posts: 7
Credit: 536,631
RAC: 0
Message 29834 - Posted: 22 Oct 2006, 21:22:17 UTC

Mine's the same. The last several units from this machine have been rejected as "Compute Error" and "Client Error". After some days of computer time and power, that's not welcome.

SETI just plugs on with never an error. I'm disconnecting this machine from Rosetta.
zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29837 - Posted: 22 Oct 2006, 23:54:25 UTC

It is an old PII Inspiron 7000 laptop. I have reinstalled the OS several times, the most recent reinstall was Friday evening. It does not error out for Docking or SETI. So it is something unique to Rosetta.

I will try running the test referenced anyway, probably tomorrow when I get a chance. Where does one get Memcheck86+? A quick google didn't turn anything up. This is a linux box, will it run on linux?
Reno, NV
Team: SETI.USA
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 29838 - Posted: 23 Oct 2006, 0:20:51 UTC - in response to Message 29837.  

Hi Zombie, Keith, others:

The overall error rate for all recent workunits is similar to what it's been in the past... so this definitely looks like something specific to your clients. It's useful to know that the same faults aren't occurring for Docking or SETI, and that it might be a large file size -- however, nothing in these workunits should result in a large file size, to my knowledge. Could I possibly bother you to attach your client to RALPH? It's our test server, and we get back more detailed info from those results. Thanks!

James Thompson

Joined: 13 Oct 05
Posts: 46
Credit: 186,109
RAC: 0
Message 29839 - Posted: 23 Oct 2006, 0:48:55 UTC - in response to Message 29837.  

It is an old PII Inspiron 7000 laptop. I have reinstalled the OS several times, the most recent reinstall was Friday evening. It does not error out for Docking or SETI. So it is something unique to Rosetta.

I will try running the test referenced anyway, probably tomorrow when I get a chance. Where does one get Memcheck86+? A quick google didn't turn anything up. This is a linux box, will it run on linux?


Hi Zombie,

I don't know much about memcheck86, but I have used memtest86 many times in the past. I would run it from a Linux LiveCD (such as Knoppix, http://www.knoppix.org), as that won't rely on you installing any programs on your laptop. Here's a link to an article that describes the process in detail:

http://software.newsforge.com/software/06/06/27/206209.shtml?tid=91&tid=132

casio7131

Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 29841 - Posted: 23 Oct 2006, 2:41:35 UTC

you get memtest86+ from here: http://www.memtest.org/
Tiago

Joined: 11 Jul 06
Posts: 55
Credit: 2,538,721
RAC: 0
Message 29859 - Posted: 23 Oct 2006, 9:13:26 UTC

I'm getting the same problem on some computers. I think this is related either to the BOINC version or to the WU.
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29862 - Posted: 23 Oct 2006, 10:45:36 UTC - in response to Message 29837.  

Where does one get Memcheck86+? [...]This is a linux box, will it run on linux?

Argh. It would help if I used the right name for the program. (Memtest86+). I've always used the iso from the linked web page for bootable cds, or floppies. So it doesn't care what OS you have installed on your HD. It's also included on The Ultimate Boot CD 3.4 (although there's probably a newer version out by now.)



FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29864 - Posted: 23 Oct 2006, 10:54:16 UTC - in response to Message 29862.  

Hiren's BootCD also contains many memory test programs.
There is also the Microsoft test program, http://oca.microsoft.com/en/windiag.asp, which is also accessible on every Vista DVD during the initial DVD boot.


Team mauisun.org
zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29879 - Posted: 23 Oct 2006, 16:16:47 UTC

I have detached this machine from Rosetta, and attached it to RALPH.

I am running the memory test now. Looks like it will take some time to complete.


Reno, NV
Team: SETI.USA
zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29887 - Posted: 23 Oct 2006, 18:19:52 UTC - in response to Message 29879.  

How many cycles should I let the test run?

FYI, there are no jobs on RALPH to run.
Reno, NV
Team: SETI.USA
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29893 - Posted: 23 Oct 2006, 20:18:45 UTC

Once should be enough to catch the memory errors; but I leave it running overnight to verify that there aren't any intermittent errors. With the number of errors your WUs are having in less than 3 hours, 1 pass should be enough.


zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29894 - Posted: 23 Oct 2006, 20:28:39 UTC

Okay, I stopped it sometime during the 4th pass.

It accumulated 15 failures:

1x for test 3
13x for test 4
1x for test 7

The rest of the tests passed without failure. The location of the failures is "000ba73624 - 186.1mb".

So...what now? Is this what is causing my WUs to fail so often? And why is it not also happening for SETI?

I can try swapping out the DIMMs to find the problem. Hopefully it is not the 64mb of on-board memory.

It had 2x 64mb DIMMs, which I replaced with 2x 128mb DIMMs. If one of those is bad, it would drop me down to 64mb on-board + 128mb + 64mb = 256mb. Is that enough to run rosetta (or SETI)?
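
As an aside on the numbers above: the failing address can be converted to megabytes and compared against the installed modules. This is only a rough sketch. It assumes the on-board 64mb is mapped first and the two DIMMs follow in order, which is a guess (the real mapping depends on the chipset), and the labels "DIMM A" and "DIMM B" are made up for illustration.

```python
MIB = 1024 * 1024

# Hypothetical layout: (label, size in MiB), in ASSUMED address order.
LAYOUT = [("on-board", 64), ("DIMM A", 128), ("DIMM B", 128)]


def addr_to_mib(addr):
    """Convert a byte address to mebibytes."""
    return addr / MIB


def bank_for(addr, layout=LAYOUT):
    """Return the label of the module whose assumed range contains addr."""
    start = 0
    for label, size_mib in layout:
        end = start + size_mib * MIB
        if start <= addr < end:
            return label
        start = end
    raise ValueError("address is beyond the installed memory")


# Address as reported by the memory test in the post above.
failing = int("000ba73624", 16)
print(f"{addr_to_mib(failing):.1f} MiB falls in: {bank_for(failing)}")

# Total if one 128mb DIMM is swapped back for a 64mb one.
fallback = [("on-board", 64), ("DIMM", 128), ("DIMM", 64)]
print(sum(size for _, size in fallback), "MiB total")
```

Under that assumed layout the address, at roughly 186mb, falls inside the first 128mb DIMM. The last two lines just confirm the 64 + 128 + 64 = 256 total.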
Reno, NV
Team: SETI.USA
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 29895 - Posted: 23 Oct 2006, 20:35:50 UTC - in response to Message 29887.  
Last modified: 23 Oct 2006, 20:43:22 UTC

How many cycles should I let the test run?


Unless you suspect intermittent problems, once is enough.

If you do suspect intermittent memory problems, run it for as many days as you think is necessary to eliminate that possibility - for example if the symptom happens twice a week then you would need to run memtest for something like 5 days to be sure :(

edit-added: I agree with Benny - once would be OK in your case, overnight to be sure.
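
The rule of thumb about intermittent faults can be made a bit more concrete. This is a minimal sketch, assuming the fault shows up at random with a constant average rate (a Poisson model) and that memtest triggers it as readily as normal use does; both are assumptions, not something established in this thread.

```python
import math


def p_at_least_one(failures_per_week, test_days):
    """Chance of seeing at least one failure during test_days of testing."""
    rate_per_day = failures_per_week / 7.0
    return 1.0 - math.exp(-rate_per_day * test_days)


def days_needed(failures_per_week, confidence):
    """Test duration (days) needed to see a failure with the given probability."""
    rate_per_day = failures_per_week / 7.0
    return -math.log(1.0 - confidence) / rate_per_day


# A symptom that happens about twice a week, tested for various durations.
for days in (1, 5, 10):
    print(days, "days:", round(p_at_least_one(2, days), 2))

print("days for 95%:", round(days_needed(2, 0.95), 1))
```

Under these assumptions, 5 days at two faults per week gives roughly a 3-in-4 chance of seeing at least one failure, so a clean 5-day run is good evidence rather than proof.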

By the way, going back to how to run memtest (asked earlier in the thread).

All the well known Linux distros let you include memtest86 as a boot option, try specifying it as a package during the Linux install, or adding it as a package later. The package manager should do all the necessary changes for you, so that you simply see it on the boot menu every time you boot. That is my favoured way of running it, no hunting for CDs or floppies and let the package manager figure out where to download it from.

Adding it to a bootable usb stick should be possible to anyone who can already get Linux booting from usb - hint: treat memtest86 as another operating system. You'd use Linux to create the stick, but you don't need Linux (or any OS at all) to run it.

River~~
dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 116,625,945
RAC: 77,660
Message 29899 - Posted: 23 Oct 2006, 21:37:49 UTC - in response to Message 29894.  

It had 2x 64mb DIMMs, which I replaced with 2x 128mb DIMMs. If one of those is bad, it would drop me down to 64mb on-board + 128mb + 64mb = 256mb. Is that enough to run rosetta (or SETI)?

You can run Rosetta on 256MB - one of the other threads here suggests that there's a bug in how the memory requirements are handled, which may leave you with down periods, but I think if you run with a decent cache of jobs (>1 day) this shouldn't affect you.
EW-3

Joined: 1 Sep 06
Posts: 27
Credit: 2,561,427
RAC: 0
Message 29900 - Posted: 23 Oct 2006, 21:46:30 UTC


I have also started to get more failing WUs.
Is there a running log kept of failures to identify a pattern in the making?

BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29904 - Posted: 23 Oct 2006, 22:48:40 UTC

Remove the RAM and retest one stick at a time. Stop on the first failure. That module is bad on that particular motherboard. Test any remaining sticks.

SETI, etc. must have a smaller memory footprint, and so never touch the memory area that is bad.

We ran into problems like this when people started working with memory hog software after using their system problem free for a year.
zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 74
Message 29908 - Posted: 24 Oct 2006, 0:05:36 UTC

Thanks for all the help everyone. The bad 128mb DIMM is in the RMA loop now. I will be running with 256mb until it returns. I tested the 64mb DIMM just to be sure all the memory is good.

I have also reattached it to rosetta, to confirm the issue is resolved. I run 4 hour jobs, so I should have several to look at in the morning.

THANKS AGAIN!
Reno, NV
Team: SETI.USA



©2024 University of Washington
https://www.bakerlab.org