Client Errors

Message boards : Number crunching : Client Errors

wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72604 - Posted: 26 Mar 2012, 15:36:43 UTC - in response to Message 72599.  




I want to extend a heartfelt thanks to everyone on the forums who is helping with troubleshooting the issue, especially those (like In Memory of Kimsey M Fowler Sr) who have gone above and beyond in diagnosing things. Thanks to all of your efforts, I think we can be relatively confident that the issue is directly related to NVidia GPU drivers on Windows 7.



Just out of curiosity, do we know for an affirmative fact that the same problem DOESN'T affect GPU users with ATI cards?
ID: 72604 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Rayburner

Send message
Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 0
Message 72605 - Posted: 26 Mar 2012, 15:58:03 UTC - in response to Message 72604.  




I want to extend a heartfelt thanks to everyone on the forums who is helping with troubleshooting the issue, especially those (like In Memory of Kimsey M Fowler Sr) who have gone above and beyond in diagnosing things. Thanks to all of your efforts, I think we can be relatively confident that the issue is directly related to NVidia GPU drivers on Windows 7.



Just out of curiosity, do we know for an affirmative fact that the same problem DOESN'T affect GPU users with ATI cards?


I can only say that I have an i7 with an AMD Radeon 6970 on driver 12.1 running Rosetta just fine (on Win7 x64).
ID: 72605 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sky King

Send message
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72609 - Posted: 26 Mar 2012, 22:38:26 UTC - in response to Message 72559.  

It would be useful to hear from William Blakemore, Alpha Laser, Sky King, and Digital Savior if they are running EVGA's, how many, what model, and if they have been running Folding@Home.


Interestingly, I do have an EVGA GTX 560 in my rig. I used to do GPU folding on other GPUs, but not on this GPU and this particular Windows 7 instance. On this machine I was doing only SMP folding, and I stopped all folding activities prior to switching over to BOINC.

The EVGA driver package is 8.17.12.8562 and the NVIDIA control panel is 3.9.731.0


ID: 72609 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mod.Sense
Volunteer moderator

Send message
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72612 - Posted: 28 Mar 2012, 2:48:07 UTC

Is it possible that the BOINC APIs, etc., are actually functioning differently in differing hardware environments? That would certainly make it difficult to track down, because the problem would not be directly in your code. Making the same set of API calls to assemble the results on two different machines, with one working and one not, is exactly what one might see in such a case.
Rosetta Moderator: Mod.Sense
ID: 72612 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72618 - Posted: 28 Mar 2012, 22:38:27 UTC - in response to Message 72612.  
Last modified: 28 Mar 2012, 22:41:37 UTC

Is it possible that the BOINC APIs, etc., are actually functioning differently in differing hardware environments? That would certainly make it difficult to track down, because the problem would not be directly in your code. Making the same set of API calls to assemble the results on two different machines, with one working and one not, is exactly what one might see in such a case.


The testing by In Memory of Kimsey M Fowler Sr conclusively proves otherwise. He was able to reproduce both successful runs and failing ones with the identical hardware configuration.

It's distressing that you would make such a comment. When a user goes to significant effort to debug your software for you, you should at least do him the courtesy of paying attention to his results.
ID: 72618 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sky King

Send message
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72627 - Posted: 29 Mar 2012, 15:26:42 UTC

Well, for now, I have decided to just suspend Rosetta folding until this gets resolved. As a 10 million F@H point contributor, I really wanted all my computing horsepower to go specifically to Rosetta, but I'm going to do World Community Grid for a while.

I am curious: when the F@H -bigadv units came out, they ran much faster on the Linux cores than on the Windows cores, so much so that running a Linux virtual machine appliance, VM overhead included, was still faster than running native Windows.

Are there canned BOINC VM appliances available that could run under VMware Player and bypass this problem?
ID: 72627 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mod.Sense
Volunteer moderator

Send message
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72645 - Posted: 2 Apr 2012, 23:32:46 UTC - in response to Message 72618.  

It's distressing that you would make such a comment. When a user goes to significant effort to debug your software for you, you should at least do him the courtesy of paying attention to his results.


My apologies. Please keep in mind that I am an at-home volunteer, and not a Rosetta developer. As such, I was not following all details of the described testing, because I know those on the Project Team are already doing so. I was simply trying to offer another point of view on the situation. Sometimes that sparks ideas that lead to solutions.
Rosetta Moderator: Mod.Sense
ID: 72645 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sky King

Send message
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72647 - Posted: 3 Apr 2012, 3:07:41 UTC - in response to Message 72645.  

Please keep in mind that I am an at-home volunteer, and not a Rosetta developer.


I was thinking the same thing when I read the earlier post... Most community discussion groups are moderated by volunteers that have no official connection to the underlying developers.
ID: 72647 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72653 - Posted: 3 Apr 2012, 20:19:50 UTC - in response to Message 72645.  

It's distressing that you would make such a comment. When a user goes to significant effort to debug your software for you, you should at least do him the courtesy of paying attention to his results.


My apologies. Please keep in mind that I am an at-home volunteer, and not a Rosetta developer. As such, I was not following all details of the described testing, because I know those on the Project Team are already doing so. I was simply trying to offer another point of view on the situation. Sometimes that sparks ideas that lead to solutions.


I'm sorry if you found my comment to be overly harsh -- that certainly wasn't my intention.

Having said that, though, I will also note that I am not a Rosetta developer, either, but an at-home volunteer such as yourself. I don't think it's asking too much of either of us to be aware of what's been previously posted, before adding additional posts to this thread. The alternative is merely adding additional FUD to what is turning out to be a difficult problem to solve.
ID: 72653 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile In Memory of Kimsey M Fowler Sr

Send message
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72654 - Posted: 3 Apr 2012, 21:30:58 UTC

Hi everyone... I don't have any great progress to report, but as we are all anxious for a way to resolve this, I want to give you an update. Rocco Moretti from Rosetta requested some sets of data from my machine collected both while running properly and while running with the client error condition. The data was sent near the end of last week, and this morning he reported that nothing has been found. Analysis is continuing.

I'm trying real hard to come up with some ideas of what to do to try to get it running properly, test any theories, software/hardware configurations, etc. The problem boils down to whether or not the NVIDIA GPU driver is installed versus the Windows default driver. Understanding the problem is complicated by the fact that I have two machines running with the same versions of software on the same versions of motherboards and processors, but one machine has never experienced the problem and the other has.

Did this summary of the problem help you think of anything? It did for me, so now I have a few simple little experiments to try this evening.
ID: 72654 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sky King

Send message
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72656 - Posted: 4 Apr 2012, 1:59:51 UTC

Let me know how I can help... I currently have both the EVGA NVIDIA 560 and an ATI 4850 in my rig... I am not too jazzed about too much tampering, but I can run off the 4850 and submit some before/after work as well.
ID: 72656 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72661 - Posted: 4 Apr 2012, 18:05:29 UTC - in response to Message 72654.  
Last modified: 4 Apr 2012, 18:05:56 UTC

Hi everyone... I don't have any great progress to report, but as we are all anxious for a way to resolve this, I want to give you an update. Rocco Moretti from Rosetta requested some sets of data from my machine collected both while running properly and while running with the client error condition. The data was sent near the end of last week, and this morning he reported that nothing has been found. Analysis is continuing.

I'm trying real hard to come up with some ideas of what to do to try to get it running properly, test any theories, software/hardware configurations, etc. The problem boils down to whether or not the NVIDIA GPU driver is installed versus the Windows default driver. Understanding the problem is complicated by the fact that I have two machines running with the same versions of software on the same versions of motherboards and processors, but one machine has never experienced the problem and the other has.

Did this summary of the problem help you think of anything? It did for me, so now I have a few simple little experiments to try this evening.


Umm ... check software versions for the Rosetta app, as well as CUDA apps (and .dll's) for other projects? There has to be SOMETHING different :)
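
For anyone who wants to hunt for that "SOMETHING different" systematically, here is a rough sketch, not an official tool. It assumes you have copied the relevant folders (project apps, .dll's, and so on) from the good machine and from the problem machine into two directories of your choosing; the function names are made up for illustration. It hashes every file and reports what exists on only one side and what differs.

```python
import hashlib
from pathlib import Path


def inventory(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def compare_inventories(good, bad):
    """Return (only on good, only on bad, present on both but different)."""
    only_good = sorted(set(good) - set(bad))
    only_bad = sorted(set(bad) - set(good))
    changed = sorted(
        name for name in set(good) & set(bad) if good[name] != bad[name]
    )
    return only_good, only_bad, changed
```

Call `compare_inventories(inventory(good_dir), inventory(bad_dir))` with your own two folders. An empty result would only mean the copied files match; it says nothing about drivers or registry settings.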
ID: 72661 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sky King

Send message
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72663 - Posted: 4 Apr 2012, 20:28:33 UTC - in response to Message 72654.  

Rocco Moretti from Rosetta requested some sets of data from my machine collected both while running properly and while running with the client error condition.


I am wondering if it is possible that Rosetta can "manually" issue us work packets to be completed... The "gold standard" for analysis purposes would be for us to run several work packets with nvidia driver, and then run the IDENTICAL work packets WITHOUT the nvidia driver and then diff the resulting work product.
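
The "diff" step could look something like the sketch below. It is only an illustration under some big assumptions: that the output of each work packet ends up as a text file, that both runs use the same file names, and that the outputs are deterministic enough for a line-by-line comparison to mean anything. The directory layout and the function name are hypothetical.

```python
import difflib
from pathlib import Path


def diff_results(with_driver_dir, without_driver_dir):
    """Yield (filename, diff_lines) for paired output files that differ."""
    a_dir, b_dir = Path(with_driver_dir), Path(without_driver_dir)
    for a_file in sorted(a_dir.iterdir()):
        b_file = b_dir / a_file.name
        if not a_file.is_file() or not b_file.is_file():
            continue  # only compare work packets present in both runs
        a_lines = a_file.read_text(errors="replace").splitlines()
        b_lines = b_file.read_text(errors="replace").splitlines()
        delta = list(
            difflib.unified_diff(
                a_lines,
                b_lines,
                "with_driver/" + a_file.name,
                "without_driver/" + a_file.name,
                lineterm="",
            )
        )
        if delta:
            yield a_file.name, delta
```

Printing each returned diff would show exactly where the two runs diverge, if they diverge at all.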
ID: 72663 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72690 - Posted: 6 Apr 2012, 18:49:57 UTC - in response to Message 72663.  

Rocco Moretti from Rosetta requested some sets of data from my machine collected both while running properly and while running with the client error condition.


I am wondering if it is possible that Rosetta can "manually" issue us work packets to be completed... The "gold standard" for analysis purposes would be for us to run several work packets with nvidia driver, and then run the IDENTICAL work packets WITHOUT the nvidia driver and then diff the resulting work product.


As a follow-up question to this, would there be any benefit for those of us having this problem to head over to RALPH, to serve as testers for any fixes in the works?
ID: 72690 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Rocco Moretti

Send message
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 72692 - Posted: 7 Apr 2012, 0:23:32 UTC - in response to Message 72690.  

As a follow-up question to this, would there be any benefit for those of us having this problem to head over to RALPH, to serve as testers for any fixes in the works?


Interesting point. The version of the application is currently the same on both Ralph and Rosetta@home, and since we just released 3.26, it'd likely be a while before we'll test a new version.

That said, I just double checked Ralph, and it doesn't look like anyone there is experiencing the same sort of symptoms. I can't say if that's just because there's no one on Ralph that has the type of system that's experiencing problems, or that there's some difference between Ralph and Rosetta@home that causes the problem to disappear - though I doubt it's the latter.

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.


Certainly if we ever do figure out what the problem is, we would want to test it on Ralph first, prior to releasing it to Rosetta@home in general.
ID: 72692 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
AlphaLaser

Send message
Joined: 19 Aug 06
Posts: 52
Credit: 3,327,939
RAC: 0
Message 72693 - Posted: 7 Apr 2012, 3:37:01 UTC - in response to Message 72692.  

As a follow-up question to this, would there be any benefit for those of us having this problem to head over to RALPH, to serve as testers for any fixes in the works?


Interesting point. The version of the application is currently the same on both Ralph and Rosetta@home, and since we just released 3.26, it'd likely be a while before we'll test a new version.

That said, I just double checked Ralph, and it doesn't look like anyone there is experiencing the same sort of symptoms. I can't say if that's just because there's no one on Ralph that has the type of system that's experiencing problems, or that there's some difference between Ralph and Rosetta@home that causes the problem to disappear - though I doubt it's the latter.

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.


Certainly if we ever do figure out what the problem is, we would want to test it on Ralph first, prior to releasing it to Rosetta@home in general.


I've attached my host to Ralph. Unfortunately I haven't received any Ralph tasks yet, but I'm still trying. Host page is here.
ID: 72693 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
wbblakemore

Send message
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72694 - Posted: 7 Apr 2012, 7:08:15 UTC - in response to Message 72692.  

As a follow-up question to this, would there be any benefit for those of us having this problem to head over to RALPH, to serve as testers for any fixes in the works?


Interesting point. The version of the application is currently the same on both Ralph and Rosetta@home, and since we just released 3.26, it'd likely be a while before we'll test a new version.

That said, I just double checked Ralph, and it doesn't look like anyone there is experiencing the same sort of symptoms. I can't say if that's just because there's no one on Ralph that has the type of system that's experiencing problems, or that there's some difference between Ralph and Rosetta@home that causes the problem to disappear - though I doubt it's the latter.

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.


Certainly if we ever do figure out what the problem is, we would want to test it on Ralph first, prior to releasing it to Rosetta@home in general.


OK, I created an account on Ralph as well ... I turned on NNT until the developers can figure out whether my presence (with 100% error rate on production software) is a help or a hindrance.

ID: 72694 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile In Memory of Kimsey M Fowler Sr

Send message
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72700 - Posted: 8 Apr 2012, 20:48:23 UTC
Last modified: 8 Apr 2012, 20:55:16 UTC

This is an update on my troubleshooting activities since my last post five days ago. I've been experimenting under the assumption that there are no problems with Rosetta/BOINC software. My activities have assumed there is a driver or software configuration problem causing "client errors". I've learned from various websites that there can be a complex relationship among the motherboard chipset drivers, the GPU drivers (a.k.a. the "display adapter drivers" in Windows 7) and the video monitor drivers. All drivers associated with putting something out there for your eyes to see must work together as a team. Another issue is that, from a driver perspective, VGA drivers are more forgiving, while DVI drivers are less forgiving and less interchangeable; DVI monitor drivers are more specific.

Here are some of the things I've tested:

1) Plugged the monitor into each of four different DVI ports. This may seem silly, but I read reports of the DVI connector closest to the motherboard being preferred. My GPU's don't have VGA ports, and I am using a VGA cable with an adapter. (Are any of you using a real DVI cable???)

2) I installed the DVI driver from the CD included with the HP S2031 flat panel 20" monitor I bought last month at Fry's for only $89. I wondered whether HP was dumping them at a low price for a reason.

3) Next I got rid of the flat panel, and replaced it with a 20-year-old Dell VGA monitor extracted from the attic.

4) I reinstalled the chipset, GPU, and monitor drivers in different orders.

These tests were time consuming, because I had to wait for tasks to complete to determine the results. None of the above solved the problem.

The next idea was to determine if the problem could be pinned to hardware or software. As mentioned above, I have two nearly identical systems with the same motherboards, CPUs, & OS Win 7 64-bit. The main difference is that the good computer has two unbridged EVGA 560 Ti's, and the problem computer has two unbridged EVGA 580's. I wondered what would happen if a mirror image of the hard drive from the good computer was put in place into the problem computer. I tested this using the following procedure:

1) Used Win 7 OS to make an emergency boot recovery CD for the good machine.
2) Used Win 7 OS to generate a disk image of the entire hard drive (SSD) from the good machine onto an external hard drive plugged in via USB.
3) Used that emergency boot disk to boot the problem machine.
4) Went to Windows Recovery in the Control Panel of the problem machine and installed the image from the external hard drive. This process completely wiped all existing files from the problem machine's boot drive.
5) Disconnected the network cable from the wall so the problem computer could not communicate with BOINC, Rosetta, Folding@Home, or Microsoft pirate hunter.
6) Using built-in Win 7 OS capabilities, I proceeded to change the computer name, the user name, and Win 7 product key to those of the prior installation of the problem computer. BOINC, Rosetta, and Folding@Home were uninstalled, and their data directories deleted.
7) The network cable was plugged in, the machine rebooted, and Win 7 was re-validated with Microsoft.
8) BOINC and Rosetta were reinstalled from a fresh copy.

At this point my machine was configured for testing the old hardware with a known good software configuration. The driver for the EVGA GeForce 560's and 580's is the same.

I looked under Device Manager to check the display adapter driver and found that only the one EVGA 580 that the monitor was attached to was present, but it was using the correct NVIDIA driver. Without fiddling with anything else, I immediately launched Rosetta. The WU's returned from this configuration were reported correctly. Sixteen WU's ran successfully before anything was changed. This might seem like a conclusive test, but to be fair, I must point out that all of the new WU's used Rosetta 3.26, whereas both the good and problem machines had been running 3.24. This introduced an unwanted variable into the test.

The next step was to get the second EVGA 580 to be recognized by the computer. I found many, many reports on the web of a second video card not being recognized, and many of those reports mentioned an ASUS motherboard, the same as my installation. Next the NVIDIA drivers were reinstalled using a freshly downloaded file from EVGA. Upon rebooting the machine, device manager reported that both GPU's were now recognized and installed. Subsequent WU's completed by Rosetta reported client errors.... ouch!

"Device Manager" was used to uninstall one GPU, and again WU's completed successfully.

Some Notes of Interest:

a) I found that reinstallation of the NVIDIA driver can knock out the HP S2031 flat panel monitor driver.

b) If a WU begins on a machine that's in a good state, but finishes while the machine is in a bad state, the WU fails with client errors. If a WU's progress reaches 100% while the machine is in a bad state, the WU fails with client errors. If a WU has finished in a bad state, and is "Ready to Report" and the machine is changed into a good state before the report occurs, the WU is reported with client errors. Therefore, I would conclude that the WU only fails with client errors (the type addressed by this forum anyway) if the platform upon which it is running is in a bad state at the moment when its state of progress reaches 100%, and that its success is independent of the good/bad state at any time other than the 100% mark.
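
To make the conclusion in note b) concrete, here is a toy model of it. This is only my reading of the observations written as code, not anything taken from Rosetta or BOINC; the function name and the "good"/"bad" labels are invented for illustration.

```python
def predicted_outcome(state_timeline):
    """Predict a WU's outcome from the machine state at each progress sample.

    state_timeline is an ordered list of (progress_percent, machine_state)
    pairs, where machine_state is "good" or "bad". Under the hypothesis in
    note b), only the state at the moment progress first reaches 100%
    matters; states before that, and states between finishing and
    reporting, are ignored.
    """
    for progress, state in state_timeline:
        if progress >= 100:
            return "client error" if state == "bad" else "success"
    return "unfinished"


# Started in a good state, finished in a bad state.
print(predicted_outcome([(0, "good"), (50, "good"), (100, "bad")]))
# Finished in a bad state, machine fixed before the report went out.
print(predicted_outcome([(0, "bad"), (100, "bad"), (100, "good")]))
```

Both examples print "client error", matching the cases described above. A timeline that is bad early but good at the 100% mark would come out as "success", which is the part of the hypothesis still worth testing.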

I have more tests in progress now, and will try to make another report tonight. In the meantime, can you guys with the problem please respond to these questions:

A) Anyone using the HP S1931, 2031, 2231, 2331 monitor series?
B) Anyone have their monitor plugged in with a true DVI cable without DVI-to-VGA adapter?
C) Is everyone running with two GPU's installed?
D) Has anyone else discovered that if the second GPU is "uninstalled" via Device Manager, the problem goes away???
E) Anyone using an SLI bridge between multiple GPU's?

Keep looking for the golden Easter egg!
ID: 72700 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
A.M.

Send message
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72701 - Posted: 8 Apr 2012, 22:20:51 UTC - in response to Message 72692.  

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.



Can do.
ID: 72701 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
A.M.

Send message
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72702 - Posted: 8 Apr 2012, 22:26:45 UTC - in response to Message 72700.  

I have more tests in progress now, and will try to make another report tonight. In the meantime, can you guys with the problem please respond to these questions:

A) Anyone using the HP S1931, 2031, 2231, 2331 monitor series?


Negative. I'm using an ASUS monitor, which is being reported as a 'Generic PnP Monitor' by Windows.

B) Anyone have their monitor plugged in with a true DVI cable without DVI-to-VGA adapter?


Yes I do.

C) Is everyone running with two GPU's installed?


I am not.

D) Has anyone else discovered that if the second GPU is "uninstalled" via Device Manager, the problem goes away???
E) Anyone using an SLI bridge between multiple GPU's?

ID: 72702 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org