Client Errors

A.M.
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72703 - Posted: 8 Apr 2012, 22:29:18 UTC - in response to Message 72701.  

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.



Can do.


Actually, going to try a WU from the new Rosetta 3.26 first. Then we'll see about Ralph.
ID: 72703 · Rating: 0

In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72704 - Posted: 9 Apr 2012, 0:23:07 UTC - in response to Message 72703.  

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.


I was going to try Ralph@Home, but when I went to add it as a BOINC project, it wasn't in the list. Is there a trick to finding it, or has it been temporarily removed because Rosetta 3.26 was just released and there's nothing to test?
ID: 72704 · Rating: 0

AlphaLaser
Joined: 19 Aug 06
Posts: 52
Credit: 3,327,939
RAC: 0
Message 72705 - Posted: 9 Apr 2012, 4:36:24 UTC - in response to Message 72704.  
Last modified: 9 Apr 2012, 4:51:28 UTC

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.


I was going to try Ralph@Home, but when I went to add it as a BOINC project, it wasn't in the list. Is there a trick to finding it, or has it been temporarily removed because Rosetta 3.26 was just released and there's nothing to test?


You can attach to it manually by entering the URL in the box after clicking Attach to project: http://ralph.bakerlab.org/

However they do not have very much work right now.
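
For hosts that are managed from a command line rather than the BOINC Manager, the same manual attach can be sketched with BOINC's `boinccmd` tool and its `--project_attach URL auth` option. This is only a sketch: it assumes `boinccmd` is on the PATH, `ACCOUNT_KEY` is a placeholder for your own Ralph account key, and the helper names are hypothetical.

```python
import subprocess

RALPH_URL = "http://ralph.bakerlab.org/"


def build_attach_command(project_url, account_key):
    """Return the boinccmd argument list for attaching to a project."""
    return ["boinccmd", "--project_attach", project_url, account_key]


def attach(project_url, account_key):
    """Run boinccmd; raises CalledProcessError if the command fails."""
    subprocess.run(build_attach_command(project_url, account_key), check=True)


if __name__ == "__main__":
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(" ".join(build_attach_command(RALPH_URL, "ACCOUNT_KEY")))
```

Calling `attach(RALPH_URL, your_key)` would perform the actual attach on a machine where the BOINC client is running.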
ID: 72705 · Rating: 0

A.M.
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72706 - Posted: 9 Apr 2012, 4:50:19 UTC - in response to Message 72703.  

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.



Can do.


Actually, going to try a WU from the new Rosetta 3.26 first. Then we'll see about Ralph.



Sigh. Same crap as before. https://boinc.bakerlab.org/rosetta/result.php?resultid=497264507
ID: 72706 · Rating: 0

AlphaLaser
Joined: 19 Aug 06
Posts: 52
Credit: 3,327,939
RAC: 0
Message 72707 - Posted: 9 Apr 2012, 5:07:17 UTC - in response to Message 72700.  

A) Anyone using the HP S1931, 2031, 2231, 2331 monitor series?
B) Anyone have their monitor plugged in with a true DVI cable without DVI-to-VGA adapter?


My affected host is a laptop (Dell XPS). Every once in a while I connect a second monitor (actually a Samsung TV) via HDMI, but I'm pretty sure I get errors without it connected.


C) Is everyone running with two GPUs installed?
D) Has anyone else discovered that if the second GPU is "uninstalled" via Device Manager, the problem goes away???
E) Anyone using an SLI bridge between multiple GPUs?


Nope, only one discrete GPU inside the laptop, a GeForce 435M.

ID: 72707 · Rating: 0

wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72708 - Posted: 9 Apr 2012, 6:40:10 UTC - in response to Message 72700.  
Last modified: 9 Apr 2012, 6:49:02 UTC




A) Anyone using the HP S1931, 2031, 2231, 2331 monitor series?
B) Anyone have their monitor plugged in with a true DVI cable without DVI-to-VGA adapter?
C) Is everyone running with two GPUs installed?
D) Has anyone else discovered that if the second GPU is "uninstalled" via Device Manager, the problem goes away???
E) Anyone using an SLI bridge between multiple GPUs?

Keep looking for the golden Easter egg!


My full rig description from the factory invoice is as follows:

INTEL, Core™ i7-2700K Quad-Core 3.5 - 3.9GHz TB, HD Graphics 3000, LGA1155, 8MB L3 Cache, 32nm, 95W, EM64T EIST HT VT-x XD, Retail
INNOVATION COOLING, Diamond 7 Carat Thermal Compound, Electrically Non-Conductive
CORSAIR, H100 Hydro CPU Liquid Cooling System, Socket LGA2011/1155/1156/1366/775/AM3/AM2, Retail
ASUS, Maximus IV Extreme-Z, LGA1155, Intel® Z68, DDR3-2200 (O.C.) 32GB /4, PCIe x16 SLI CF /1+2*, SATA 3Gb/s RAID 5 /4, 6Gb/s /4, USB 3.0 /8, HDA, BT, GbLAN /2, FW /2, ATX, Retail
G.SKILL, 16GB (4 x 4GB) Ripjaws PC3-10600 DDR3 1333MHz CL7 (7-7-7-21) 1.5V SDRAM DIMM, Non-ECC
EVGA, GeForce® GTX 560 Ti 822MHz, 2GB GDDR5 4000MHz, PCIe x16 SLI, 2x DVI + mini-HDMI, Retail
CREATIVE, Sound Blaster® X-Fi Titanium Fatal1ty Champion, 7.1 channels, 24-bit 192KHz, I/O Module, PCIe x1
WESTERN DIGITAL, 2TB WD Caviar® Black™ (WD2002FAEX), SATA 6 Gb/s, 7200 RPM, 64MB Cache
CRUCIAL, 64GB M4 SSD, MLC Marvell 88SS9174, 500/95 MB/s, 2.5-Inch, SATA 6 Gb/s, Retail
CRUCIAL, 64GB M4 SSD, MLC Marvell 88SS9174, 500/95 MB/s, 2.5-Inch, SATA 6 Gb/s, Retail
RAID, No RAID, Independent HDD Drives
PLEXTOR, PX-B320SA Black 8x/16x/48x BD/DVD/CD, Blu-ray Disk™ Combo Drive, SATA, Retail
PLEXTOR, PX-LB950SA 12x/16x/48x BD/DVD/CD Blu-ray Disc™ Burner w/ Lightscribe, SATA, Retail
COOLER MASTER, HAF X (RC-942-KKN1) Black Tower Case w/ Window, EATX, 9 Slots, No PSU, Steel/Plastic
CUSTOM WIRING, Standard Wiring with Round Cables
CORSAIR, CMPSU-1200AX Gold AX1200 Power Supply w/ Modular Cables, 1200W, 80 PLUS®, 24-pin ATX12V v2.31 EPS12V 2.92, Multi-GPU Ready
MICROSOFT, Windows 7 Professional 64-bit Edition w/ SP1, OEM


A. Nope, I'm using an LG flat panel, which W7 called a "generic PnP monitor".
B. Yes, I'm using a true DVI connection to my monitor.
C. No, I have a single EVGA NVIDIA 560 Ti graphics card installed.
D. n/a
E. n/a
ID: 72708 · Rating: 0

AlphaLaser
Joined: 19 Aug 06
Posts: 52
Credit: 3,327,939
RAC: 0
Message 72709 - Posted: 9 Apr 2012, 7:03:03 UTC - in response to Message 72706.  

If people are willing, signing up to Ralph@home with computers which are experiencing the problem would probably be worth trying.



Can do.


Actually, going to try a WU from the new Rosetta 3.26 first. Then we'll see about Ralph.



Sigh. Same crap as before. https://boinc.bakerlab.org/rosetta/result.php?resultid=497264507


I just got the same: https://boinc.bakerlab.org/result.php?resultid=497320429
ID: 72709 · Rating: 0

Rayburner
Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 0
Message 72712 - Posted: 9 Apr 2012, 20:17:17 UTC - in response to Message 72692.  

I have received one WU at ralph and crunched it successfully!!

http://ralph.bakerlab.org/result.php?resultid=2647221

Regards,
Rayburner

ID: 72712 · Rating: 0

In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72714 - Posted: 9 Apr 2012, 21:47:28 UTC

Is there anyone with our problem who does not have all of the following attributes:

1) one or more NVIDIA GPUs, &
2) running Win7 64-bit, &
3) Intel Core i7 processor?

ID: 72714 · Rating: 0

In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72715 - Posted: 9 Apr 2012, 21:53:48 UTC

Some additional tests have shown that when my machine was generating good work units with "only one NVIDIA installed", only a subset of the NVIDIA drivers was really installed. Noteworthy among the missing was the CUDA driver. I'm not saying this driver is at the root of the problem, only that the full complement of drivers was not present when WUs completed successfully.
ID: 72715 · Rating: 0

In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72716 - Posted: 9 Apr 2012, 22:00:31 UTC - in response to Message 72712.  

I have received one WU at ralph and crunched it successfully!!
http://ralph.bakerlab.org/result.php?resultid=2647221
Regards,
Rayburner


Hi Rayburner: Thank you for some positive-sounding news! I wonder if you have fiddled with the drivers or anything else since you last had a WU ending in client error? Perhaps you could give Rosetta another try and see what happens. Thus far I have not tried Ralph, but will give it a shot. Kimsey, Jr.
ID: 72716 · Rating: 0

Rayburner
Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 0
Message 72717 - Posted: 9 Apr 2012, 22:26:34 UTC - in response to Message 72716.  

I have received one WU at ralph and crunched it successfully!!
http://ralph.bakerlab.org/result.php?resultid=2647221
Regards,
Rayburner


Hi Rayburner: Thank you for some positive-sounding news! I wonder if you have fiddled with the drivers or anything else since you last had a WU ending in client error? Perhaps you could give Rosetta another try and see what happens. Thus far I have not tried Ralph, but will give it a shot. Kimsey, Jr.


Nothing was changed on my side. After the successful run on Ralph I tried Rosetta again. Unfortunately, it still ended with the known client error.

That leads me to assume there must be a difference between Ralph and Rosetta. Maybe the project admins can look at possible differences on the server side.

Regards,
Rayburner
ID: 72717 · Rating: 0

In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72718 - Posted: 9 Apr 2012, 23:16:06 UTC - in response to Message 72700.  


b) If a WU begins on a machine that's in a good state, but finishes while the machine is in a bad state, the WU fails with client errors. If a WU's progress reaches 100% while the machine is in a bad state, the WU fails with client errors. If a WU has finished in a bad state, and is "Ready to Report" and the machine is changed into a good state before the report occurs, the WU is reported with client errors. Therefore, I would conclude that the WU only fails with client errors (the type addressed by this forum anyway) if the platform upon which it is running is in a bad state at the moment when its state of progress reaches 100%, and that its success is independent of the good/bad state at any time other than the 100% mark.



Questions for Rocco or other Rosetta Staff Members:

I am planning a new test for which I would appreciate any tips that might be helpful for planning and execution.

First of all, I think we have all been assuming there is a problem in the client-side output from Rosetta or BOINC. What if, for example, one of the output files uploaded to the server at the completion of a WU has a slightly different format, say an unexpected space or extra character caused by an NVIDIA driver? When the server processes the uploaded files, a read statement in the processing software chokes on the extra character, and we never see the Rosetta version number for the WU in "Task Details". Am I correct that the server uses our uploaded files as input to generate the "Task Details" HTML files that we view on the web for each WU? If so, do we know whether the "Exit status = 0 (0x0)" line already present in the preliminary version of the file appears by default at the end of a failed WU, or did our "client error" WUs really complete successfully, with part of the server-side software failing to generate the HTML output?

Enough rambling about my theories. The goal of the test is to run the same WU twice, first with the machine in a bad state, and a second time with the machine configured for success. Here are the proposed test steps:

1) Determine from Rosetta staff exactly which files are uploaded to the server following execution of a WU and where exactly they reside while waiting to upload.
2) Configure my problem machine so that it will successfully complete WUs without client error.
3) Pick a particular work unit and when it is at about the 50% completion mark, terminate BOINC/Rosetta to force it to be saved at a breakpoint.
4) Copy and save the entire /ProgramData/BOINC directory.
5) Unplug the network cable, restart BOINC/Rosetta, and let the chosen WU complete successfully.
6) Terminate BOINC/Rosetta, again forcing existing WUs into a second breakpoint condition.
7) Copy & save the output files for the chosen WU (note that it cannot upload because the network cable is unplugged).
8) Configure the machine so that subsequent WUs will end with client errors.
9) Restore the entire BOINC directory to the state it was in at the first breakpoint.
10) Let the same WU chosen earlier complete again, but with client errors.
11) Collect the same set of output files before they are uploaded to the server.
12) Use file comparison software to identify differences.

Once the differences are identified, it should be much easier to find a solution.
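
The final comparison step can be sketched as follows. This is a minimal example, not a description of any Rosetta tooling: the function names are hypothetical, and it assumes the saved good and bad output files may or may not be plain text, so it reports the first differing byte offset as well as a line diff.

```python
import difflib
from pathlib import Path


def first_byte_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One file is a prefix of the other; they diverge where the shorter ends.
        return min(len(a), len(b))
    return None


def text_diff(good_text: str, bad_text: str):
    """Return unified-diff lines between the good and the bad output."""
    return list(difflib.unified_diff(
        good_text.splitlines(), bad_text.splitlines(),
        fromfile="good", tofile="bad", lineterm=""))


def compare_outputs(good_path, bad_path):
    """Compare two saved output files; returns (offset, diff_lines)."""
    good = Path(good_path).read_bytes()
    bad = Path(bad_path).read_bytes()
    offset = first_byte_difference(good, bad)
    if offset is None:
        return None, []
    # errors="replace" keeps the diff usable even if a file is not plain text.
    return offset, text_diff(good.decode("utf-8", errors="replace"),
                             bad.decode("utf-8", errors="replace"))
```

A stray trailing space of the kind hypothesized above would show up as a `-`/`+` pair of otherwise identical lines in the diff.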

Comments/suggestions please? KMF, Jr.
ID: 72718 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72719 - Posted: 10 Apr 2012, 0:40:43 UTC

I like the sound of your plan. You have described just about the only way to ensure you actually run exactly the same task, version, host ID, and random number seed (they are embedded with the task) more than once.

My only suggestion would be that, if you are going to go to all of that trouble, you go ahead and do what you've outlined with several tasks at the same time. That might mean you would want to increase the number of days between network connections configured in your BOINC network preferences.

You can go ahead and back up BOINC either before tasks start or, as you described, at a checkpoint. Either way, when you restore, the status of the task will be as it was when you did the backup.
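
The backup and restore described here can be sketched with Python's standard `shutil` module. This is only an illustration under stated assumptions: the BOINC client must be fully stopped before either call, and the paths passed in are whatever your own data and backup directories happen to be. The helper names are hypothetical.

```python
import shutil
from pathlib import Path


def snapshot(data_dir, backup_dir):
    """Copy the whole data directory to backup_dir (which must not exist yet)."""
    shutil.copytree(data_dir, backup_dir)


def restore(backup_dir, data_dir):
    """Replace data_dir with the contents of an earlier snapshot."""
    data_dir = Path(data_dir)
    if data_dir.exists():
        # Remove the current state first so no newer files survive the restore.
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

Removing the directory before copying matters for this test: files created after the snapshot would otherwise remain and the rerun would not start from an identical state.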

You can see what the output file name (just one for Rosetta, I believe) for a task will be by reviewing the client_state.xml file in your BOINC data directory (which is shown at the top of the event log as BOINC starts up). The file name is described in two parts: you will see a task which identifies a "<result_name>", and then you will see that result name identified later with "<file_info>". The files with those names should be under your BOINC data directory when the task completes. I don't recall if they remain in the slots directory of the task that produced them, or if they all move to a higher-level directory. I think they occupy one of the slots directories until they are uploaded and confirmed via update project (which BOINC does for you periodically, or you can do manually from the projects tab).
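
The lookup described above can be sketched as a small script. Treat the layout as an assumption: it supposes that client_state.xml parses as well-formed XML, that result names appear in `<result_name>` elements, that file names appear in `<name>` elements under `<file_info>`, and that an output file's name begins with its result name. The sample document and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET


def output_files_by_result(xml_text):
    """Map each result name to the file_info names that start with it."""
    root = ET.fromstring(xml_text)
    result_names = [e.text.strip() for e in root.iter("result_name")
                    if e.text and e.text.strip()]
    file_names = [e.text.strip() for e in root.iterfind(".//file_info/name")
                  if e.text and e.text.strip()]
    return {r: [f for f in file_names if f.startswith(r)] for r in result_names}


# Hypothetical, heavily trimmed stand-in for a real client_state.xml.
SAMPLE = """<client_state>
  <file_info><name>example_task_123_0_0</name></file_info>
  <file_info><name>unrelated_input.zip</name></file_info>
  <active_task><result_name>example_task_123_0</result_name></active_task>
</client_state>"""

if __name__ == "__main__":
    print(output_files_by_result(SAMPLE))
```

If the real file does not parse cleanly, a plain text search for the result name would serve the same purpose.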
Rosetta Moderator: Mod.Sense
ID: 72719 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72720 - Posted: 10 Apr 2012, 0:43:47 UTC

Further, I'd suggest taking a backup of the entire BOINC data directory at the point where the tasks have completed crunching. That way you could actually compare everything else as well, not just the output file. Some differences would be expected: wherever the event log is stored, for example, all of the timestamps on messages would be different. But perhaps you will turn up a difference in a configuration file of some kind, or something that reflects the detected hardware on the machine.
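
A whole-directory comparison along these lines can be sketched with the standard `filecmp` module. The function name is hypothetical, and the `ignore` argument is where the caller would list file names that are expected to differ, such as the event log; which names those are on a given installation is an assumption left to the reader.

```python
import filecmp
from pathlib import Path


def compare_trees(good_dir, bad_dir, ignore=()):
    """Return sorted report lines for files that differ or exist on one side only."""
    good_dir, bad_dir = Path(good_dir), Path(bad_dir)
    report = []

    def walk(rel):
        cmp = filecmp.dircmp(good_dir / rel, bad_dir / rel, ignore=list(ignore))
        report.extend(f"only in good: {rel / n}" for n in cmp.left_only)
        report.extend(f"only in bad: {rel / n}" for n in cmp.right_only)
        # shallow=False forces a byte-by-byte check instead of trusting size/mtime.
        _, mismatch, errors = filecmp.cmpfiles(
            good_dir / rel, bad_dir / rel, cmp.common_files, shallow=False)
        report.extend(f"differs: {rel / n}" for n in mismatch + errors)
        for sub in cmp.common_dirs:
            walk(rel / sub)

    walk(Path("."))
    return sorted(report)
```

Each line in the returned list names one file worth opening in a diff tool; an empty list means the two snapshots are byte-for-byte identical apart from the ignored names.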
Rosetta Moderator: Mod.Sense
ID: 72720 · Rating: 0

A.M.
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72721 - Posted: 10 Apr 2012, 4:23:06 UTC
Last modified: 10 Apr 2012, 4:39:00 UTC

Well. Ralph ran 9 WUs (so far) to completion. Successfully.

One thing I did notice, and am looking into now, is that on Ralph, I'm using the default run-time target, whereas I was using different times on Rosetta. In the interest of eliminating another variable, I'm attempting another Rosetta task with the run-time set to the default.
ID: 72721 · Rating: 0

wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72726 - Posted: 10 Apr 2012, 20:43:00 UTC - in response to Message 72721.  
Last modified: 10 Apr 2012, 20:43:34 UTC

Well. Ralph ran 9 WUs (so far) to completion. Successfully.

One thing I did notice, and am looking into now, is that on Ralph, I'm using the default run-time target, whereas I was using different times on Rosetta. In the interest of eliminating another variable, I'm attempting another Rosetta task with the run-time set to the default.


Just to make sure I'm understanding this correctly ... the same version of the client software (3.26?) is currently being run on both Rosetta and Ralph. On Rosetta, WUs error out if GPU processing is enabled, but on Ralph, they don't. If I've correctly stated the situation, then it sounds like the problem isn't with either the Rosetta client or the NVIDIA drivers; it's with the Rosetta server.

Maybe it's time to look at the Rosetta server software?
ID: 72726 · Rating: 0

A.M.
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72727 - Posted: 10 Apr 2012, 21:57:26 UTC - in response to Message 72726.  

Well. Ralph ran 9 WUs (so far) to completion. Successfully.

One thing I did notice, and am looking into now, is that on Ralph, I'm using the default run-time target, whereas I was using different times on Rosetta. In the interest of eliminating another variable, I'm attempting another Rosetta task with the run-time set to the default.


Just to make sure I'm understanding this correctly ... the same version of the client software (3.26?) is currently being run on both Rosetta and Ralph. On Rosetta, WUs error out if GPU processing is enabled, but on Ralph, they don't. If I've correctly stated the situation, then it sounds like the problem isn't with either the Rosetta client or the NVIDIA drivers; it's with the Rosetta server.

Maybe it's time to look at the Rosetta server software?



That does seem to be the essence of the situation, yes.

I have now successfully completed 18 WUs from Ralph, while 3 more from Rosetta have failed.
ID: 72727 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 72728 - Posted: 10 Apr 2012, 22:15:51 UTC - in response to Message 72727.  

Well. Ralph ran 9 WUs (so far) to completion. Successfully.

One thing I did notice, and am looking into now, is that on Ralph, I'm using the default run-time target, whereas I was using different times on Rosetta. In the interest of eliminating another variable, I'm attempting another Rosetta task with the run-time set to the default.


Just to make sure I'm understanding this correctly ... the same version of the client software (3.26?) is currently being run on both Rosetta and Ralph. On Rosetta, WUs error out if GPU processing is enabled, but on Ralph, they don't. If I've correctly stated the situation, then it sounds like the problem isn't with either the Rosetta client or the NVIDIA drivers; it's with the Rosetta server.

Maybe it's time to look at the Rosetta server software?



That does seem to be the essence of the situation, yes.

I have now successfully completed 18 WUs from Ralph, while 3 more from Rosetta have failed.



Checked your one computer with 4 tasks completed.
It's weird, you have this status at the end:
DONE :: 2 starting structures 7788.38 cpu seconds
This process generated 2 decoys from 2 attempts
Which is good.
But why they are invalid is weird. They ran ok. They shut down at the end of your time limit.

Also weird is that a Mac completes the same task OK, but your Win7 machine does not, at least according to Rosetta's computer system.
ID: 72728 · Rating: 0

A.M.
Joined: 13 Jun 06
Posts: 12
Credit: 954,586
RAC: 0
Message 72731 - Posted: 11 Apr 2012, 5:08:16 UTC - in response to Message 72728.  

Checked your one computer with 4 tasks completed.
It's weird, you have this status at the end:
DONE :: 2 starting structures 7788.38 cpu seconds
This process generated 2 decoys from 2 attempts
Which is good.
But why they are invalid is weird. They ran ok. They shut down at the end of your time limit.

Also weird is that a Mac completes the same task OK, but your Win7 machine does not, at least according to Rosetta's computer system.



Yes, weird. We all agree.
ID: 72731 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org