Message boards : Number crunching : Client Errors
Author | Message |
---|---|
Nuadormrac Send message Joined: 27 Sep 05 Posts: 37 Credit: 202,469 RAC: 0 |
I too have seen some access violation error messages. No indication of benchmarks from what I've seen. I'm also seeing 10/2/2005 12:01:04 AM|rosetta@home|Result 1pvaA_abrelax_no_cst_17565_0 exited with zero status but no 'finished' file 10/2/2005 12:01:04 AM|rosetta@home|If this happens repeatedly you may need to reset the project. from time to time, though I've seen this error a coupla times with results that have returned successfully and given credit as well. Anyhow, the results I'm seeing Access Violations on: https://boinc.bakerlab.org/rosetta/result.php?resultid=127887 https://boinc.bakerlab.org/rosetta/result.php?resultid=79885 Ironically, a moment ago, my results page showed a result to be done which came up "server status unsent, client status initial", and clicking on it came back to this WU, the unsent copy of this same one. That seems to have cleared itself up on the results page as of the time of this post, however. BTW, this box only has 1 CPU, so we're not just looking at errors related to multi-CPU computers with these, essentially what used to be referred to (back in Win 3.1 days) as a general protection fault. |
JimB Send message Joined: 17 Sep 05 Posts: 19 Credit: 228,111 RAC: 0 |
I tracked one of my "exited with zero status" wu's out of curiosity, and it does get credit; it is also documented in the Wiki - "most of the time the best thing to do is to do nothing" : 2005-09-20 06:40:59 [SETI@home] Starting result 08no03aa.2694.26145.129826.90_1 using setiathome version 4.18 2005-09-20 07:33:14 [SETI@home] Result 08no03aa.2694.26145.129826.90_1 exited with zero status but no 'finished' file 2005-09-20 07:33:14 [SETI@home] Restarting result 08no03aa.2694.26145.129826.90_1 using setiathome version 4.18 2005-09-20 08:16:48 [SETI@home] Computation for result 08no03aa.2694.26145.129826.90_1 finished 2005-09-20 08:16:49 [SETI@home] Started upload of 08no03aa.2694.26145.129826.90_1_0 2005-09-20 08:16:51 [SETI@home] Finished upload of 08no03aa.2694.26145.129826.90_1_0 115364537 1140844 19 Sep 2005 21:31:33 UTC 20 Sep 2005 12:16:57 UTC Over Success Done 5,356.25 9.62 29.98 Result ID 115364537 "Be all that you can be...considering." Harold Green |
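A minimal sketch of the check JimB describes doing by hand: scan client log lines for the "exited with zero status" message and see whether the same result later reports as finished. The line shapes are assumed from the excerpts quoted in this thread, and the function name `classify_results` is hypothetical; this is not part of BOINC.

```python
import re

# Message wording is copied from the log excerpts in this thread; anything
# else about the client's log format is an assumption.
NO_FINISHED = re.compile(r"Result (\S+) exited with zero status but no 'finished' file")
FINISHED = re.compile(r"Computation for result (\S+) finished")


def classify_results(log_lines):
    """Map each result that hit the message to 'recovered' or 'unresolved'."""
    status = {}
    for line in log_lines:
        hit = NO_FINISHED.search(line)
        if hit:
            # First sighting: assume unresolved until a 'finished' line shows up.
            status.setdefault(hit.group(1), "unresolved")
            continue
        done = FINISHED.search(line)
        if done and done.group(1) in status:
            status[done.group(1)] = "recovered"
    return status


if __name__ == "__main__":
    sample = [
        "2005-09-20 07:33:14 [SETI@home] Result 08no03aa.2694.26145.129826.90_1 "
        "exited with zero status but no 'finished' file",
        "2005-09-20 08:16:48 [SETI@home] Computation for result "
        "08no03aa.2694.26145.129826.90_1 finished",
    ]
    print(classify_results(sample))
```

A result that stays "unresolved" after a long stretch of log would be the case worth reporting; a "recovered" one matches the Wiki's "do nothing" advice.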
Pconfig Send message Joined: 26 Sep 05 Posts: 6 Credit: 56,254 RAC: 0 |
Benchmarking errors out over here: 1-10-2005 23:00:47||Suspending computation and network activity - running CPU benchmarks 1-10-2005 23:00:47|rosetta@home|Pausing result 1btn__abrelax_no_cst_16723_0 (removed from memory) 1-10-2005 23:00:47|rosetta@home|Pausing result 1btn__abrelax_no_cst_16971_0 (removed from memory) 1-10-2005 23:00:48|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16971_0 ( - exit code -1073741819 (0xc0000005)) 1-10-2005 23:00:48||request_reschedule_cpus: process exited 1-10-2005 23:00:49|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16723_0 ( - exit code -1073741819 (0xc0000005)) 1-10-2005 23:00:49||request_reschedule_cpus: process exited 1-10-2005 23:01:47|rosetta@home|Computation for result 1btn__abrelax_no_cst_16723_0 finished 1-10-2005 23:01:47|rosetta@home|resume_or_start(): unexpected process state 2 (going to leave it in mem while preempted) i think it's strange that a wu isn't aborted when boinc gets restarted, only when boinc tries to benchmark... Proud member of the Dutch Power Cows |
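For readers wondering how the two numbers in that log relate: the negative exit code and the hex value in parentheses are the same 32-bit quantity, and 0xc0000005 is the standard Windows status code for an access violation. The sketch below shows the conversion; the helper name `describe_exit_code` and the small lookup table are illustrative assumptions, not anything taken from the BOINC client.

```python
# A deliberately tiny table of standard Windows NTSTATUS values.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
}


def describe_exit_code(signed_code):
    """Reinterpret a signed 32-bit exit code as unsigned hex and name it if known."""
    unsigned = signed_code & 0xFFFFFFFF  # two's-complement reinterpretation
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({name})"


if __name__ == "__main__":
    print(describe_exit_code(-1073741819))
```

So both failed results in the log above ended the same way, with an access violation.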
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The "Client Errors" appear to occur when the Rosetta@Home Science Application is suspended and removed from memory. I do not recall the exact mechanics of the operation, but if the API is not called in the proper order, that could be part of it. This is one of the reasons I have suggested that the official BOINC documentation is lacking. It is also why I am trying to add more information to the Wiki about development issues, such as where the messages are created. At some point, I am going to have to start actually doing development work to "test" the documentation as it stands. But from my review as I added the pages, being an old and tired systems engineer, I found it woefully inadequate. It assumes much knowledge on the part of the developer and does not seem to trace the logical development path well ... The good news is that I have had a couple of project developers add some of what they learned when burned ... @JimB There are a couple of causes of that message, if it was not clear from the Wiki. Fundamentally, there are timing issues between the Science Application and the BOINC Daemon, and this message is the result. Later versions of the BOINC Client Software, when compiled into the Science Application, should "cure" this ... or reduce its frequency. One of the more interesting "clues" is that if you adjust the system clock, or when it adjusts itself on synchronization with an Internet update, that also outputs the message .... |
JimB Send message Joined: 17 Sep 05 Posts: 19 Credit: 228,111 RAC: 0 |
These errors are interesting. I've done a *very* quick summary list of errors I recall seeing in this forum (feel free to add to or ignore):
"Be all that you can be...considering." Harold Green |
Nuadormrac Send message Joined: 27 Sep 05 Posts: 37 Credit: 202,469 RAC: 0 |
Oddly, in my case all my projects have been set to leave applications in memory pretty much since I signed up for my first BOINC project. Unless the benchmark was auto-invoked, I hadn't requested one, and Rosetta was the last project I signed up for/connected to. I haven't seen a message in the log about one. I do wonder about these Access Violations though, which I have seen on 2 units, and which some others have reported seeing. A slight blurb on this. Basically, when one of our PC processors is running in 32-bit protected mode, there's a degree of protection in place against client software accessing memory it isn't entitled to (hence "protected mode"). This restriction is enforced by the CPU itself, in hardware, when the code is running in user mode (ring 3). The OS code (the system kernel) runs at ring 0, along with various device drivers, etc., and is entitled to access any memory in the system, while application software is launched in user mode (and yes, Windows XP launches things this way). The application is only entitled to access memory that belongs to it, or to make various API calls to request that the operating system perform some needed function for it. If it attempts to directly read from or write to memory that doesn't belong to that process, it generates an Access Violation. On a programming level, Rosetta would be attempting to read (in my case the error message specified a read) from memory it wasn't entitled to. Either the memory was de-allocated and still referenced, or there's a bad pointer or something that crops up from time to time and attempts to dereference memory which the program has no valid access to. I don't envy the project devs who might have to hunt down the bit of code that might be causing this. |
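The "de-allocated and still referenced" case described above can be illustrated with a toy model. This is only a sketch under heavy assumptions: real access violations are raised by the CPU and operating system, not by application code, and every name here (`ToyAddressSpace`, `AccessViolation`) is hypothetical. It just shows why a read through a stale pointer fails once the memory no longer belongs to the process.

```python
class AccessViolation(Exception):
    """Stand-in for the hardware fault; the real one comes from the CPU/OS."""


class ToyAddressSpace:
    """Tracks which address ranges a pretend process currently owns."""

    def __init__(self):
        self._owned = {}          # base address -> bytearray backing that block
        self._next_base = 0x1000

    def allocate(self, size):
        base = self._next_base
        self._owned[base] = bytearray(size)
        # Leave an unowned gap after each block so overruns are caught too.
        self._next_base += max(size, 1) + 0x1000
        return base

    def free(self, base):
        del self._owned[base]

    def read(self, address):
        for base, block in self._owned.items():
            if base <= address < base + len(block):
                return block[address - base]
        raise AccessViolation(f"read of address 0x{address:08x}")


if __name__ == "__main__":
    space = ToyAddressSpace()
    pointer = space.allocate(16)
    print(space.read(pointer))      # valid: the block is still owned
    space.free(pointer)
    try:
        space.read(pointer)         # stale pointer: block was de-allocated
    except AccessViolation as fault:
        print("fault:", fault)
```

On real hardware a stale pointer does not always fault, since the freed memory may still be mapped into the process, which is one reason such bugs only crop up from time to time.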
The Pirate Send message Joined: 22 Sep 05 Posts: 20 Credit: 7,090,933 RAC: 0 |
|
Joe Send message Joined: 26 Sep 05 Posts: 3 Credit: 590,689 RAC: 1,164 |
I just had 6 results on a P4 come back invalid, exiting with an access violation. I think I'll abort the rest of the units on that machine.... |
petrusbroder Send message Joined: 23 Sep 05 Posts: 9 Credit: 2,111,764 RAC: 0 |
Oh, BTW, I have not had any errors on PCs for the last 8 hours, and only 2 for the last 3 days. OTOH, I have no dual-core CPUs and my newest CPU is an Athlon 64 3200+ ... However, my Macs (a minimac with a G4 at 1.40 GHz and a PowerMac with 2 x G5 @ 2 GHz) report 37 WUs with "client error" and a code etc. in a row. I looked at the details and was surprised to see that on some WUs there were three computers reporting client error while a fourth computer got a result correctly. for example: application Rosetta The report looks like this (for the Mac): Result ID 178811 I have checked 8 of the failed WUs for the Mac and the <stderr_txt> sections all look the same. For the dual CPU PC the report looks like this: Result ID 166737 It seems that the successful CPUs were single-core and "old" PIII or P4 below 2 GHz or older Athlons. There were also some very fast single-core CPUs running the failed WUs ... |
petrusbroder Send message Joined: 23 Sep 05 Posts: 9 Credit: 2,111,764 RAC: 0 |
I have to add some clarifications: The PowerMac has had 2 WUs with error codes but has during the same period produced 58 WUs with correct results. The minimac has processed 251 WUs with error such as the one noted in my previous post. It is interesting to see that the processor makes such a difference - because the minimac has enough RAM - 512 MBytes. Makes one think ... |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
The following error is due to our OS X Rosetta application currently supporting only 10.4+: dyld: rosetta_4.76_powerpc-apple-darwin Undefined symbols: rosetta_4.76_powerpc-apple-darwin undefined reference to _floorl expected to be defined in /usr/lib/libmx.A.dylib rosetta_4.76_powerpc-apple-darwin undefined reference to _log10l expected to be defined in /usr/lib/libmx.A.dylib rosetta_4.76_powerpc-apple-darwin undefined reference to _statvfs expected to be defined in /usr/lib/libSystem.B.dylib |
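A minimal sketch for anyone comparing their own `<stderr_txt>` against the error above: it pulls the missing symbols out of the dyld output and groups them by the library each was expected in. The line shape is assumed from the single excerpt quoted in this post, and the function name `missing_symbols_by_library` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern assumed from the dyld lines quoted above, e.g.
# "... undefined reference to _floorl expected to be defined in /usr/lib/libmx.A.dylib"
UNDEFINED_REF = re.compile(
    r"undefined reference to (\S+) expected to be defined in (\S+)"
)


def missing_symbols_by_library(dyld_output):
    """Group undefined symbols by the library dyld expected to find them in."""
    grouped = defaultdict(list)
    for symbol, library in UNDEFINED_REF.findall(dyld_output):
        grouped[library].append(symbol)
    return dict(grouped)


if __name__ == "__main__":
    stderr_text = (
        "dyld: rosetta_4.76_powerpc-apple-darwin Undefined symbols: "
        "rosetta_4.76_powerpc-apple-darwin undefined reference to _floorl "
        "expected to be defined in /usr/lib/libmx.A.dylib "
        "rosetta_4.76_powerpc-apple-darwin undefined reference to _statvfs "
        "expected to be defined in /usr/lib/libSystem.B.dylib"
    )
    for library, symbols in missing_symbols_by_library(stderr_text).items():
        print(library, symbols)
```

If a failed result shows the same symbols as the post above, the OS X version requirement is the likely cause rather than a problem with the work unit.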
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Is there some reason we cannot do the fix similar to what had to be done for CPDN? Almost the same error set ... See ... http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2635 |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I don't know if that will work but regardless I would rather make a build that is compatible with OSX versions prior to 10.4. The problem is that anything built with Xcode2's gcc 4 will not run on anything prior to 10.3.9 (and even to support 10.3.9, the Cross-Development SDK has to be used). There may be a performance trade-off. I haven't yet had the time to look into this. |
Shaktai Send message Joined: 21 Sep 05 Posts: 56 Credit: 575,419 RAC: 0 |
David E K wrote: "There may be a performance trade-off. I haven't yet had the time to look into this." Typically we haven't seen a noticeable performance trade-off on other projects. The 10.3.x compiles have generally been very successful. However, there has been no success with compiling BOINC for 10.2.x or earlier. The cross-development SDK has usually worked well. If you create a cross-development compile, I think our team can scrounge up some testers. Just curious, have you tried Xcode2's gcc 4 auto-vectorization function? It has helped sometimes with G4 and G5 processors that have AltiVec capabilities. Of course, I am not a coder, just a user. A 10.3.9-compatible version will draw a lot of new Mac users though. Upgrades from earlier 10.3.x versions to 10.3.9 are free, so most users have upgraded or will upgrade. Team MacNN - The best Macintosh team ever. |
petrusbroder Send message Joined: 23 Sep 05 Posts: 9 Credit: 2,111,764 RAC: 0 |
Oh, sorry - never realised that - got to update ... /blushing/ |
UBT - Halifax--lad Send message Joined: 17 Sep 05 Posts: 157 Credit: 2,687 RAC: 0 |
Are all the problems with WUs getting errors gone yet? I would like to get back to the project but don't want to waste any time or WUs if they are going to abort. Join us in Chat (see the forum) Click the Sig Join UBT |
devn Send message Joined: 17 Sep 05 Posts: 18 Credit: 2,063 RAC: 0 |
i'm having far more success with wus since upgrading to cc4.72; also, now running only 1 cpu with rosetta even though i have HT. benchmarks no longer cause a problem either. |
devn Send message Joined: 17 Sep 05 Posts: 18 Credit: 2,063 RAC: 0 |
update to previous post: benchmarks just ran and caused "unrecoverable error." why this time and not last, have no idea, nothing else going on. |
©2024 University of Washington
https://www.bakerlab.org