Client Errors

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72733 - Posted: 11 Apr 2012, 17:17:26 UTC

Yes, server configuration may play a role and should be reviewed. Ralph may also be sending a different type of work unit from the ones that fail on Rosetta. One reason Ralph may send work even when the version is already released on Rosetta is that new types of work units are being tested. So perhaps something within the work units has been improved that lets them work, or perhaps the type of work differs from what fails in your environment.
Rosetta Moderator: Mod.Sense
ID: 72733 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,196,541
RAC: 1,506
Message 72734 - Posted: 11 Apr 2012, 18:16:24 UTC

Hi people,

I don't have time to read everything that's been written since my previous post, but I just wanted to let you know that after installing the new stable BOINC 7.0.25 on my iMac I resumed Rosetta to give it another try, and WUs are running fine now.

They still credit some very small amounts, but it works fine ;)

(and I don't do it for credits of course :) )

I also realize I should mention that I upgraded from Snow Leopard to Lion just after I posted (on 24/03), so that may be related too...
ID: 72734 · Rating: 0
Rocco Moretti
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 72736 - Posted: 11 Apr 2012, 18:47:59 UTC - in response to Message 72726.  

Well. Ralph ran 9 WUs (so far) to completion. Successfully.


?!?!?

Did not expect that. Ralph and Rosetta@home are intended to be basically the same. Ralph just gets applications and new jobs slightly before Rosetta does, so we can hopefully avoid pushing bad jobs/applications to Rosetta@home. The whole point of Ralph is that things which would give errors on Rosetta@home should show errors on Ralph first.

Maybe it's time to look at the Rosetta server software?


From what I understand, the Ralph@home and Rosetta@home server back-ends are running the same version of the software (there are differences in the versions of the web page software, but that shouldn't affect the result reporting). That doesn't mean there isn't some slight difference in configuration which could be causing this interaction. We'll take a look at the servers and see if we can figure out what the difference is.
ID: 72736 · Rating: 0
wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72737 - Posted: 11 Apr 2012, 19:20:32 UTC - in response to Message 72736.  
Last modified: 11 Apr 2012, 19:23:55 UTC




From what I understand, the Ralph@home and Rosetta@home server back-ends are running the same version of the software (there are differences in the versions of the web page software, but that shouldn't affect the result reporting). That doesn't mean there isn't some slight difference in configuration which could be causing this interaction. We'll take a look at the servers and see if we can figure out what the difference is.


Sounds like the most promising lead we've had so far ...
ID: 72737 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72739 - Posted: 12 Apr 2012, 18:58:51 UTC

The big picture sounds like result files are being corrupted, in potentially trivial ways, somewhere between the client and the validation process.

Would it be possible to have the wrap-up processing in the Rosetta client calculate a checksum (an MD5, say) and store it in the result file? That way, if the file does not validate properly, the checksum of the server's current copy of the file (excluding, of course, the added checksum itself) can be compared against the stored checksum to confirm the data.

That would essentially prove whether some change to the file occurs. From that point it is a matter of tracking down WHERE that change occurs.
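
To make the idea concrete, here is a minimal Python sketch of the checksum scheme described above. It is only an illustration: the `#MD5 ` trailer line and both function names are invented here, and nothing in this thread says the Rosetta client or server works this way.

```python
import hashlib

# Hypothetical trailer marker; not part of any real Rosetta file format.
TRAILER_PREFIX = b"#MD5 "


def append_checksum(payload: bytes) -> bytes:
    """Client side: return payload plus a final line holding its MD5."""
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return payload + b"\n" + TRAILER_PREFIX + digest + b"\n"


def verify_checksum(stamped: bytes) -> bool:
    """Server side: hash everything before the trailer and compare."""
    body, sep, stored = stamped.rstrip(b"\n").rpartition(b"\n" + TRAILER_PREFIX)
    if not sep:
        return False  # no trailer present, so nothing to compare against
    return hashlib.md5(body).hexdigest().encode("ascii") == stored
```

A file that arrives with a mismatching trailer would show that the bytes changed somewhere after wrap-up, which is the question being asked here; a file that still matches would point away from corruption in transit.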
Rosetta Moderator: Mod.Sense
ID: 72739 · Rating: 0
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,812,690
RAC: 812
Message 72742 - Posted: 12 Apr 2012, 21:34:15 UTC

Is this the same problem they were having at Einstein@home? Described here


Best,
Snags
ID: 72742 · Rating: 0
wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72743 - Posted: 12 Apr 2012, 23:04:27 UTC - in response to Message 72742.  

Is this the same problem they were having at Einstein@home? Described here


Best,
Snags


No, it's a different one. Ours involves the WU running to completion, only to fail on a validation error from the server. It's only been observed on machines which are also running GPU applications from other projects, using an NVIDIA graphics card.
ID: 72743 · Rating: 0
Sky King
Joined: 28 Feb 12
Posts: 11
Credit: 15,912
RAC: 0
Message 72744 - Posted: 13 Apr 2012, 5:14:36 UTC - in response to Message 72743.  

Is this the same problem they were having at Einstein@home?

It's only been observed on machines which are also running GPU applications from other projects, using an NVIDIA graphics card.


I do want to clarify one thing... a previous post said that the error only occurs when GPU processing is enabled, and the quoted post implies that it only happens if you are using your GPU for other projects.

I don't believe either of those is true. Like most people having this problem, I have an i7, 64-bit Windows 7, and an EVGA-branded NVIDIA 560... However, neither the 560 nor the new instance of Windows 7 on my machine has been used for any projects, not even F@H or other non-BOINC projects, since installation.

I believe the determinant is simply whether or not you have the NVIDIA driver installed... Whether you are using it for anything more than just a display adapter appears to be immaterial.
ID: 72744 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,285
RAC: 10,641
Message 72746 - Posted: 13 Apr 2012, 12:31:31 UTC - in response to Message 72743.  

Is this the same problem they were having at Einstein@home? Described here


Best,
Snags


No, it's a different one. Ours involves the WU running to completion, only to fail on a validation error from the server. It's only been observed on machines which are also running GPU applications from other projects, using an NVIDIA graphics card.


It is ALSO happening with AMD cards. That is what I have, and every machine I have a card in gives errors; the ONE machine without a crunching GPU in it returned units just fine.
ID: 72746 · Rating: 0
In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72748 - Posted: 13 Apr 2012, 16:41:36 UTC - in response to Message 72744.  


I believe the determinant is simply whether or not you have the NVIDIA driver installed... Whether you are using it for anything more than just a display adapter appears to be immaterial.


I agree with this statement. I cannot speak for AMD cards. Sorry for the delay with the test I proposed several days ago, but it's tax time in the US. The test runs are nearly complete, and I hope to analyze the results this weekend.
ID: 72748 · Rating: 0
Rocco Moretti
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 72750 - Posted: 13 Apr 2012, 19:51:40 UTC - in response to Message 72742.  

Is this the same problem they were having at Einstein@home?


As wbblakemore says, the symptoms are different, but there might be some commonalities. If the "client writes trash to important files" bit is happening to the results, the Rosetta@home server (but apparently not the Ralph@home server) might be choking on that trash, leading to the result being tagged as a compute error. The only issue is that people running 6.10.58 (as well as 7.0.20) have reported the error, and the examples I received from Kimsey Fowler didn't show any signs of corruption. It's probably unrelated, though I've been surprised before.

BTW, I've taken a look at the difference between the Ralph and Rosetta@home servers, and can't see anything which would obviously cause a difference, but there is a slight compiler setting difference, so I'm looking into whether that might have some bearing on the issue.

It is ALSO happening with AMD cards. That is what I have, and every machine I have a card in gives errors


Was that the same error issue? (All workunits consistently get listed with a Client Error outcome and have a missing application version but, according to the stderr output, exit successfully with an exit status of 0.) If I remember correctly, you were having a slightly different issue, primarily with the CASP9/hybridize workunits. (For what it's worth, to the best of my knowledge the bug which caused those errors has been fixed with the 3.26 release.)
ID: 72750 · Rating: 0
wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72751 - Posted: 13 Apr 2012, 21:03:26 UTC - in response to Message 72750.  
Last modified: 13 Apr 2012, 21:14:29 UTC

If the "client writes trash to important files" bit is happening to the results, the Rosetta@home server (but apparently not the Ralph@home server) might be choking on that trash, leading to the result being tagged as a compute error. The only issue is that people running 6.10.58 (as well as 7.0.20) have reported the error, and the examples I received from Kimsey Fowler didn't show any signs of corruption. It's probably unrelated, though I've been surprised before.

BTW, I've taken a look at the difference between the Ralph and Rosetta@home servers, and can't see anything which would obviously cause a difference, but there is a slight compiler setting difference, so I'm looking into whether that might have some bearing on the issue.


I'm sorry, but I keep coming back to one basic thought -- if the CLIENT is somehow corrupting data files (and the client software is presumably identical across servers, including compiler options), why is one server processing the data files correctly and the other server isn't?

I'd want to be looking at the server environment, especially any dynamic link libraries, to compare versions and possible changes -- and if the servers are Windows based, whether they are both running with the same update for .NET. Microsoft is infamous for breaking things with updates.

As an alternative, is there a possibility that the client compiled on different servers is somehow being linked with different versions of system routines and/or toolboxes?
ID: 72751 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,285
RAC: 10,641
Message 72756 - Posted: 14 Apr 2012, 12:14:19 UTC - in response to Message 72750.  

It is ALSO happening with AMD cards. That is what I have, and every machine I have a card in gives errors


Was that the same error issue? (All workunits consistently get listed with a Client Error outcome and have a missing application version but, according to the stderr output, exit successfully with an exit status of 0.) If I remember correctly, you were having a slightly different issue, primarily with the CASP9/hybridize workunits. (For what it's worth, to the best of my knowledge the bug which caused those errors has been fixed with the 3.26 release.)


I have not crunched here for a couple of weeks, but I think version 3.26 was what I was using when all but one of my machines had their problems! It should be in my stats; for me they have been archived, but I am sure you can check them manually.
ID: 72756 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,591,169
RAC: 60,092
Message 72759 - Posted: 14 Apr 2012, 20:37:13 UTC - in response to Message 72756.  


I have not crunched here for a couple of weeks, but I think version 3.26 was what I was using when all but one of my machines had their problems! It should be in my stats; for me they have been archived, but I am sure you can check them manually.


3.26 was released last week (on the 5th according to the 3.26 thread) so that might have fixed your problems ;)
ID: 72759 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,285
RAC: 10,641
Message 72769 - Posted: 15 Apr 2012, 14:30:31 UTC - in response to Message 72759.  

I have not crunched here for a couple of weeks, but I think version 3.26 was what I was using when all but one of my machines had their problems! It should be in my stats; for me they have been archived, but I am sure you can check them manually.


3.26 was released last week (on the 5th according to the 3.26 thread) so that might have fixed your problems ;)


I will try it again then as soon as I reach my goal on another project I am currently working on.
ID: 72769 · Rating: 0
AlphaLaser
Joined: 19 Aug 06
Posts: 52
Credit: 3,327,939
RAC: 0
Message 72773 - Posted: 15 Apr 2012, 22:53:22 UTC

Just got a batch of Ralph WUs and I can confirm that my host has been able to successfully complete Ralph while failing here at Rosetta. I set the runtimes to be 1 hr on both projects.

Host at Rosetta: https://boinc.bakerlab.org/results.php?hostid=1455479
Host at Ralph: http://ralph.bakerlab.org/results.php?hostid=27840
ID: 72773 · Rating: 0
In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72774 - Posted: 15 Apr 2012, 23:01:46 UTC - in response to Message 72718.  
Last modified: 15 Apr 2012, 23:02:10 UTC

Here are the proposed test steps:

1) Determine from Rosetta staff exactly which files are uploaded to the server following execution of a WU and exactly where they reside while waiting to upload.
2) Configure my problem machine so that it will successfully complete WUs without client error.
3) Pick a particular work unit and, when it is at about the 50% completion mark, terminate BOINC/Rosetta to force it to be saved at a breakpoint.
4) Copy and save the entire /ProgramData/BOINC directory.
5) Unplug the network cable, restart BOINC/Rosetta, and let the chosen WU complete successfully.
6) Terminate BOINC/Rosetta, again forcing existing WUs into a second breakpoint condition.
7) Copy and save the output files for the chosen WU (note that it cannot upload because the network cable is unplugged).
8) Configure the machine so that subsequent WUs will end with client errors.
9) Restore the entire BOINC directory to the state it was in at the first breakpoint.
10) Let the same WU chosen earlier complete again, but with client errors.
11) Collect the same set of output files before they are uploaded to the server.
12) Use file comparison software to identify differences.
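
For anyone who wants to repeat step 12 without dedicated comparison software, the following is a rough Python sketch that hashes every file in two saved snapshots and lists what differs. The function names and the "good"/"bad" snapshot labels are made up for this example, and it assumes both snapshots are ordinary directory copies such as the ones saved in the steps above.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the MD5 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(good: Path, bad: Path) -> dict[str, list[str]]:
    """Report files that differ or that exist in only one snapshot."""
    a, b = tree_digests(good), tree_digests(bad)
    return {
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
        "only_in_good": sorted(a.keys() - b.keys()),
        "only_in_bad": sorted(b.keys() - a.keys()),
    }
```

This only says which files changed, not how. Files on the "changed" list still need a text diff to tell a harmless timing difference from real damage.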


The test to run a WU to both a failed and a successful state of completion is done. I generally followed the procedure, but with modifications suggested by Mod.Sense (thank you for the very useful info). The primary WU data files generated by my computer look about the same, apart from some "time description" differences (delta times, not clock times). I don't know if that is significant. The file named BOINC/client_state.xml had generally the same information for both the failed and successful runs. The file size and MD5 differed, but only because of the differences in the time description values.

I looked at other file types in the collection, but couldn't find anything interesting. Rocco's earlier findings from looking at my saved BOINC directory didn't pick up any irregularities either.

Most of the run details you see reported on the web for each work unit are extracted from the client_state.xml file. There's only one of these files in the BOINC directory, and it contains details for all of the different WUs you are running or have waiting in the queue. The raw data file for each WU lives in the directory BOINC\projects\boinc.bakerlab.org_rosetta and is named with the long (alphanumeric) name of the WU. It has no file extension, so you need to add ".GZ" if you want to unzip it to see the data. The file is usually deleted once it uploads, soon after the WU completes.
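
As an alternative to renaming the file, here is a small Python sketch that decompresses it directly. It assumes, as described above, that the extensionless raw data file is gzip-compressed; the function name and both path arguments are placeholders for this example.

```python
import gzip
import shutil
from pathlib import Path


def unpack_result(src: Path, dest: Path) -> Path:
    """Decompress a gzip-compressed result file that has no extension."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

If the file is not valid gzip data, the read raises an `OSError`, which would itself be a hint that the file was damaged before upload.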

The point of this is that what you see on the WU web page primarily comes from the client_state.xml file and not from the raw data file. This raises the questions:
1) is the raw data file getting uploaded at all?
2) is it being modified/damaged during transfer?
3) will failing the MD5 check cause client errors?
4) what server-side software writes the phrase "client error" and what triggers that error?
5) where does the WU web page get the Rosetta version number?

If you are interested in looking at the details of the analysis, the pertinent information for one WU I looked at can be found in this zip file. If you want additional data files from the BOINC directory, please let me know. ---KMF, Jr.
ID: 72774 · Rating: 0
In Memory of Kimsey M Fowler Sr
Joined: 10 Mar 12
Posts: 26
Credit: 39,033,222
RAC: 0
Message 72775 - Posted: 15 Apr 2012, 23:05:25 UTC - in response to Message 72773.  

Just got a batch of Ralph WUs and I can confirm that my host has been able to successfully complete Ralph while failing here at Rosetta. I set the runtimes to be 1 hr on both projects.


I installed Ralph@Home several days ago, but it won't give me any tasks. I've tried the PROJECT, UPDATE button many times. Any suggestions? ---KMF, Jr.
ID: 72775 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 72776 - Posted: 15 Apr 2012, 23:26:05 UTC - in response to Message 72775.  

I installed Ralph@Home several days ago, but it won't give me any tasks. I've tried the PROJECT, UPDATE button many times. Any suggestions? ---KMF, Jr.


Right, Ralph only issues work periodically when a new version or new work units need testing. Just have to wait for some to be available. BOINC retries periodically for you.
Rosetta Moderator: Mod.Sense
ID: 72776 · Rating: 0
wbblakemore
Joined: 18 Dec 07
Posts: 33
Credit: 4,181
RAC: 0
Message 72777 - Posted: 15 Apr 2012, 23:47:13 UTC - in response to Message 72774.  


If you are interested in looking at the details of the analysis, the pertinent information for one WU I looked at can be found in this zip file. If you want additional data files from the BOINC directory, please let me know. ---KMF, Jr.


I just wanted to tell you how much your hard work is appreciated by all of the rest of us who are suffering from this bug.

It's really above and beyond the call of duty, and I take my hat off to you for doing it.

ID: 72777 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org