Many Problems

Message boards : Number crunching : Many Problems

murky

Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29036 - Posted: 8 Oct 2006, 15:02:35 UTC

It appears that I will have to cease work on Rosetta tasks, as the PC described below is not doing well with Rosetta. This project is the only work assigned to the PC. Rosetta is version 5.25 and BOINC is version 5.4.11.
These are the messages that I was able to retrieve; I don't know whether they are of any help in diagnosing why this PC is causing problems. If any other information would help resolve these problems, please advise me.

This PC is an AMD Athlon 64 3700+
NOT OVERCLOCKED
Motherboard is ASUS A8N-E
RAM is 1 Gigabyte OCZ PC4200 (263 MHz) running at 200 MHz
I am using very relaxed timings of 2.5, 4, 4, and 8.
The hard drive is a Western Digital Raptor (SATA)
Processor temperature is around 43 degrees C.
There have also been several BSODs.

One was "PAGE_FAULT_IN_NONPAGED_AREA":
Stop: 0x00000050
Win32.sys – address BF8028A7 base at BF800000, DateStamp 43446a58
There have been several instances of the system locking up, requiring a reboot to recover.
This PC has worked on Folding at Home for several months with no problems.
I have set BOINC to not take any new work and will run memtest86 and Prime95 to see if they indicate problems. Neither indicated any when the PC was first built and run at an FSB of 220 MHz.
I see no point in messing up the project with Client and Compute errors. My Intel Pentium 4 at 2.4 GHz is working well / stable.

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0086009A write attempt to address 0x11B8E760

Engaging BOINC Windows Runtime Debugger

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0060AA0A read attempt to address 0x8A34AED4

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004C9C46 read attempt to address 0xA15A7174

stderr out
<core_client_version>5.4.11</core_client_version>
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 3084267
# cpu_run_time_pref: 28800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 952.206 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .xx1vie.out
# cpu_run_time_pref: 28800
ERROR:: Exit at: .initialize.cc line:1618
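
For anyone comparing several failed results, the "Reason:" lines above can be pulled apart mechanically. The following is a minimal sketch, assuming the lines keep exactly the wording shown in this post; the function name and the pattern are made up for illustration and are not part of BOINC or Rosetta. Faults scattered across unrelated addresses are sometimes read as a hint of unstable hardware rather than one repeatable bug, but that is a heuristic, not a diagnosis.

```python
import re

# Pattern for the "Reason:" lines as they appear in the post above.
# The wording is taken from this thread; other versions may differ.
RECORD = re.compile(
    r"Reason: (?P<reason>.+?) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"at address (?P<pc>0x[0-9a-fA-F]+) "
    r"(?P<access>read|write) attempt to address (?P<target>0x[0-9a-fA-F]+)"
)


def parse_exception_records(text):
    """Return one dict per access-violation line found in text."""
    return [m.groupdict() for m in RECORD.finditer(text)]


log = """\
Reason: Access Violation (0xc0000005) at address 0x0086009A write attempt to address 0x11B8E760
Reason: Access Violation (0xc0000005) at address 0x0060AA0A read attempt to address 0x8A34AED4
Reason: Access Violation (0xc0000005) at address 0x004C9C46 read attempt to address 0xA15A7174
"""

records = parse_exception_records(log)
distinct_pcs = {r["pc"] for r in records}
for r in records:
    print(r["access"], "at", r["pc"], "->", r["target"])
print(len(distinct_pcs), "distinct faulting addresses out of", len(records), "records")
```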

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29037 - Posted: 8 Oct 2006, 15:16:51 UTC
Last modified: 8 Oct 2006, 15:19:31 UTC

Eek, I'm not sure what to say.

I have an AMD64 3700 San Diego, OCed 10%, with 1M OCZ Gold, an Asus A8N-E mobo, but a Hitachi Deskstar 250G SATA II drive, and it works well with Rosetta 5.25.

From the BSOD errors, I'd have to guess your BIOS settings are good enough for general work, but when tasked by Rosetta your memory or something else goofs up. Try setting the BIOS timings back to auto. Have you run memtest86+ and Prime95?

tony

Oh yeah, I'm a BOINC alpha tester so I'm using 5.6.5. I'm having a hard time thinking it's BOINC related, but I could direct you to the latest "Alpha" client.

murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29038 - Posted: 8 Oct 2006, 15:29:45 UTC - in response to Message 29037.  

Eek, I'm not sure what to say.
Thanks for the reply.
The memory timings I am using are what Auto would select if I were running the memory at an FSB of 263. If I use Auto at 200 MHz it sets them faster, to 2.5, 3, 3, 7.
I have raised the RAM voltage and CPU voltage....all the OC'ing tricks, but without success. As I stated, Prime95 and memtest86 were solid when the box was built and with an FSB of 220 MHz. I will run both tests again for at least 24 hours when the current task finishes. It's at 70% now.
murky / Bob
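
To see why the 2.5, 4, 4, 8 timings count as relaxed at 200 MHz, it can help to convert cycles to nanoseconds. This is a minimal sketch; it assumes the four numbers are in the usual CAS, tRCD, tRP, tRAS order, which the post does not state, and the labels are made up for illustration.

```python
def latency_ns(cycles, clock_mhz):
    """Convert a timing in clock cycles to nanoseconds at the given memory clock."""
    return cycles * 1000.0 / clock_mhz


# Timings quoted in the thread; the CAS, tRCD, tRP, tRAS order is an assumption.
settings = {
    "Auto at 263 MHz": (263, (2.5, 4, 4, 8)),
    "manual at 200 MHz": (200, (2.5, 4, 4, 8)),
    "Auto at 200 MHz": (200, (2.5, 3, 3, 7)),
}

for label, (mhz, timings) in settings.items():
    ns = [round(latency_ns(t, mhz), 1) for t in timings]
    print(f"{label}: {ns} ns")
```

At 200 MHz a 4-cycle delay is 20 ns, against 15 ns for the 3 cycles that Auto would pick, so the manual settings give the memory more time per operation.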

From the BSOD errors, I'd have to guess your BIOS settings are good enough for general work, but when tasked by Rosetta your memory or something else goofs up. Try setting the BIOS timings back to auto. Have you run memtest86+ and Prime95?

tony


Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29039 - Posted: 8 Oct 2006, 15:32:05 UTC

I have the Asus 6200 (cheap) PCI Express video card. Have you tried turning off the graphics? I see the 0x1 error, and one task was terminated by the watchdog timer.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29040 - Posted: 8 Oct 2006, 15:35:19 UTC

You might go to "rosetta prefs" and set your "cpu run time" to 1 hour, just for testing. If any task shoots past 1 hour by more than 5-10 minutes, then it is having a problem.
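
The check described above is easy to apply to a list of reported CPU times. A minimal sketch, assuming a 1 hour target and 10 minutes of slack; the function name and the sample times are hypothetical, for illustration only.

```python
def overruns(cpu_times_s, target_s=3600, slack_s=600):
    """Return the CPU times that exceed the target run time by more than slack_s."""
    return [t for t in cpu_times_s if t > target_s + slack_s]


# Hypothetical CPU times in seconds, not taken from any real result.
sample = [3540, 3720, 4100, 5400]
print(overruns(sample))
```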

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29042 - Posted: 8 Oct 2006, 15:39:30 UTC

My OCZ is guaranteed to 3 V +/-. I now find that if I go below 2.9 V it just locks up. I think the default was 2.65 V, but there's no way I can get that low. Quite frankly, I got it to 10% and got tired of trying to get more out of it, so I left it there. I'll reboot and look at my settings.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 29043 - Posted: 8 Oct 2006, 15:56:36 UTC
Last modified: 8 Oct 2006, 15:58:18 UTC

2530 MHz

CPU config
dram config (Auto)
400 MHz
2.5T
7T
2T
2T
11T
14T
4T
3T
2T
both remapping enabled
HT 4X
CnQ disabled

Jumperfree
Overclocking (manual)
cpu freq 230
PCIe clock 100 MHz
DDR volt 3.0
cpu mult X11
cpu volt 1.5
pci clock sync auto
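
As a quick cross-check of the settings listed above: on this platform the core clock is the reference clock times the CPU multiplier, and the HyperTransport link clock is the reference clock times the HT multiplier. A minimal sketch, with a function name made up for illustration:

```python
def derived_clock_mhz(reference_mhz, multiplier):
    """Clock derived from the reference clock and a multiplier."""
    return reference_mhz * multiplier


reference = 230      # CPU frequency setting from the list above
cpu_multiplier = 11  # CPU multiplier from the list above
ht_multiplier = 4    # HT multiplier from the list above

cpu_clock = derived_clock_mhz(reference, cpu_multiplier)
ht_clock = derived_clock_mhz(reference, ht_multiplier)
print(f"CPU core: {cpu_clock} MHz")  # agrees with the figure at the top of the list
print(f"HT link:  {ht_clock} MHz")
```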

I have to run out, but will return

murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29047 - Posted: 8 Oct 2006, 16:52:41 UTC - in response to Message 29039.  

I have the Asus 6200 (cheap) PCI Express video card. Have you tried turning off the graphics? I see the 0x1 error, and one task was terminated by the watchdog timer.

Thanks for all your thoughts on this problem. The current task has just over an hour to completion, so I'll run the diagnostics for the next 36 hours or so. I occasionally turn on the graphics to see what is happening, but seldom for more than a couple of minutes. The card is also a low-priced PCI Express card using the NVidia GeForce 7300GS chipset. The CPU run time is currently 8 hours; some tasks complete without a glitch and some don't. Memory was at 2.6 V and I raised it to 2.75 V, but there was no change in stability. Is there a way I can provide a link to the work this PC has done (its ID is 313507) and gain a little more insight? I will see what happens after I try memtest and Prime95. If they show stability I will give Rosetta another try.
Thanks....murky / Bob

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 118,192,822
RAC: 33,094
Message 29048 - Posted: 8 Oct 2006, 17:04:47 UTC - in response to Message 29047.  
Last modified: 8 Oct 2006, 17:08:05 UTC

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=313507
;-)

As a temporary measure you could reduce your run times so that you lose less work if it does error out on you.

murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29049 - Posted: 8 Oct 2006, 18:03:12 UTC - in response to Message 29048.  

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=313507
;-)

As a temporary measure you could reduce your run times so that you lose less work if it does error out on you.


Thanks for the heads up on the computer id url.
In that vein I will provide a link to the 4 Tasks that errored out.

https://boinc.bakerlab.org/rosetta/result.php?resultid=41043803
https://boinc.bakerlab.org/rosetta/result.php?resultid=40834653
https://boinc.bakerlab.org/rosetta/result.php?resultid=40668898
https://boinc.bakerlab.org/rosetta/result.php?resultid=40606247
If someone can find a common thread in these that would indicate why I'm having these problems that would be great.
Thanks.....murky / Bob



FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29050 - Posted: 8 Oct 2006, 20:21:28 UTC

The BSOD on its own would say there was a problem. This (the Page_Fault error) could well be bad memory OR a bad hard drive (among many, many other things ;-( )

So yes, check your memory and use the default timings; hell, run it overclocked and see if that is still OK. Will it still run F@H fine?
Also check your hard drive (chkdsk and a full check).
Other things to try: reseat everything (take 'em out, pray to the god of electronics, put 'em back in). That's memory, cables, and the CPU with a nice new thermal interface. Give it an overhaul and don't forget your drivers......

For F@H, did you use the console version or the graphical version?


Team mauisun.org

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29052 - Posted: 8 Oct 2006, 20:27:52 UTC

BTW, it could also be bad tasks/work units; there have been a few reports of client failures recently. Since you are completing valid work, keep going. Most of the invalid ones are giving debug info, so they should see that at their end.
(that is, if they actually look ;-)
Team mauisun.org

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 118,192,822
RAC: 33,094
Message 29053 - Posted: 8 Oct 2006, 20:30:19 UTC

Is the Asus AI Booster thing that you're running doing dynamic overclocking? If so, I'd suggest disabling it. If you're running DC then you don't need any dynamic OCing: just OC it as far as you want/it'll go and you're done! I think the idea of dynamic OCing is to increase the clock rate when under load, but with DC it's always under load.

Could it be that it's not overclocking when you're running Prime95 etc., but is when running Rosetta because it's low priority? (Although I think Prime95 is low priority too...)

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 29056 - Posted: 8 Oct 2006, 21:04:26 UTC

Murky:
How were you using F@H? With -advmethods and -bigWUs (or whatever the big switch is)? If F@H was being run with default flags, it'd use less RAM than Rosetta, which might explain why the problem is popping up here so frequently.

You plan on ruling out RAM problems, but you should also test the HD. If you don't have a non-destructive HD test like SpinRite, there are SATA HD tests on Western Digital's site; verify there that the tool works with SATA drives.
WD's download page

A second possibility is software problems. Does HijackThis show any programs running that you don't recognize? The error log from the 386 second error result showed a number of applications running from the C:\cygwin directory and from an F: drive. Error result (Can you temporarily disable them?)
Or can you set up a new HD, format it, do a clean install of WinXP on it, and then try Rosetta for a week on that? (That would show whether Windows is corrupted or some program running on the original drive conflicts with Rosetta.)



murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29059 - Posted: 8 Oct 2006, 21:37:42 UTC - in response to Message 29050.  
Last modified: 8 Oct 2006, 21:46:36 UTC

The BSOD on its own would say there was a problem. This (the Page_Fault error) could well be bad memory OR a bad hard drive (among many, many other things ;-( )

FluffyChicken, dcdc and BennyRop: thanks

I will try to address all the advice in one reply :)
I am running Prime95 now....2 1/2 hours so far.
I will run chkdsk after I stop Prime95, reseat the memory, etc.
I thought that there might be a few bad WUs, but I don't see very many people remarking on this, so I assume my system has a problem.
I was using the console version of F@H, with adv methods and large work units.
I do not use the software to overclock....ASUS AI was uninstalled after I uninstalled BOINC, but it was only in the Start menu and was always closed.
Thanks for the link to Western Digital. I will definitely check that out after this run of Prime95.
This system has been dedicated to F@H since I built it in early spring. No surfing, no email, no virus, or adware. Using a firewall behind a router.
I will have to look into the cygwin directory and the F drive reference.
At this time I can not find a cygwin directory on either PC.
There are only 3 drives: C, D: (CD-RW) and E: (CD- read only)
I have to get back to Talladega for the end of the race :)
murky


murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29062 - Posted: 8 Oct 2006, 23:03:49 UTC - in response to Message 29056.  
Last modified: 8 Oct 2006, 23:05:36 UTC

BennyRop:

A second possibility is software problems. Does HijackThis show any programs running that you don't recognize? The error log from the 386 second error result showed a number of applications running from the C:\cygwin directory and from an F: drive. Error result (Can you temporarily disable them?)

BennyRop: with regard to the c:\cygwin...etc. from resultid=41043803......
Looking through all that information, I was able to determine that it is not from my C drive, nor is the reference to the F drive. Those are at the Baker Labs! There is a reference to a "jack schonbrun"; I Googled that name and it is associated with Baker Labs and Rosetta. There is a reference to f:\rtm\vctools\crt_bld: now I am starting to wonder if this is a part of my problem. I studied my Windows event logs and have 3 occurrences of errors. Quote:
Source: SideBySide, Type: Error, Event ID: 59
Resolve Partial Assembly failed for Microsoft.VC80.CRT. (I see a reference to vctools and crt in the f: directory above.)
Continuing:
Resolve Partial Assembly failed for Microsoft.VC80.CRT. Reference error message: The referenced assembly is not installed on your system.
Explanation: A component or manifest could not be activated.
Possible causes include: The component or manifest depends on another program or a component that is not installed.
The manifest contains XML content that is not valid.
The user does not have the correct permissions.
I may be way out in left field, but I think there is some connection between the errors in the event log and the failed tasks.
But this is way over my head :)
Regards....murky

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 118,192,822
RAC: 33,094
Message 29063 - Posted: 8 Oct 2006, 23:28:01 UTC

When I was running Filemon yesterday I saw quite a few attempts to access f: (that computer also had no f:) and files with 'jack schonbrun' in the path. I assume these references should have been taken out before compilation?

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 29074 - Posted: 9 Oct 2006, 7:35:14 UTC - in response to Message 29063.  

Jack is the person who first set up the graphics/screensaver in the Rosetta@home program.

So it sounds like bad jobs (compilations). I think it may be something about not including the manifest file; well, that's what a Google search says. I also get the side-by-side errors on my computers, though I've not had any bad work.

Team mauisun.org

murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29082 - Posted: 9 Oct 2006, 12:48:23 UTC - in response to Message 29074.  

Jack is the person who first set up the graphics/screensaver in the Rosetta@home program.

So it sounds like bad jobs (compilations). I think it may be something about not including the manifest file; well, that's what a Google search says. I also get the side-by-side errors on my computers, though I've not had any bad work.


Thanks to everyone for their input. Prime95 ran with no errors for 17 hours and that is good enough for me :) I have just started memtest86 from a bootable CD and will give it a long run of all tests, as this is not a conditioning exercise. I will look further into this SideBySide, version 5.2,
Symbolic Name: "MSG_SXS_Function_Call_Fail"
If memtest runs without error I may give Rosetta another try. I'm having no problems on the other system.
murky

murky
Joined: 24 Sep 06
Posts: 9
Credit: 214,896
RAC: 0
Message 29117 - Posted: 10 Oct 2006, 15:28:42 UTC - in response to Message 29082.  

If memtest runs without error I may give Rosetta another try. I'm having no problems on the other system

memtest86 ran for 21 hours before I shut it down: 0 errors.
This system will go back to F@H for Team Helix and the Pentium 4 box will work on Rosetta.

Thanks again for the input.
murky

©2024 University of Washington
https://www.bakerlab.org