Message boards : Number crunching : Client Errors
Author | Message |
---|---|
AlphaLaser Joined: 19 Aug 06 Posts: 52 Credit: 3,327,939 RAC: 0 |
I do have a host here that does not have a discrete GPU; it uses only integrated graphics. It has been running Rosetta without errors. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
This is a report on my troubleshooting activities. Last night on the problem machine I ran 8 Rosetta WU's with the following concurrent changes to the system: Yes, I'm running a single EVGA board -- GTX 560 ti, with 2GB RAM onboard. My CPU is also overclocked to 4.00 GHz. I don't crunch for Folding@Home. My other projects are Seti@Home and WCG. No corresponding problems at either place. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
At the risk of making a silly observation ... more than anything else, this feels like a client/server communications problem. I'm not sure how useful it is to explore hardware issues or solutions. If I were to make a SWAG, I'd be looking for a change made to the BOINC manager software. Perhaps they made a change to accommodate newer NVidia GPU processing that your server people didn't pick up because they weren't running GPU WU's. I know that the Seti folks had to upgrade their NVidia GPU apps to handle newer hardware. |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Troubleshooting Update: Before reaching home last night I had decided the problem was most likely a software problem of some sort. Therefore I did not uninstall the EVGA graphics cards. Instead I formatted the C: drive and reinstalled Windows. Here are my installation steps: 1) installed Win7 64-bit, which already included SP1; 2) turned off Windows Update to keep Windows at stock configuration; 3) installed the motherboard drivers and utilities from the ASUS CD/DVD; 4) did NOT install the EVGA driver/utilities (used only the default video driver); 5) did NOT install Folding@Home or any other s/w to run on the GPUs; 6) installed BOINC and Rosetta@Home, then ran for about 7 hours. Results: Success! Rosetta reported valid WUs and successfully reported the 'application version' as 3.24. This morning at approximately 6 AM (local/Seattle time), or 13:00 UTC, I installed the EVGA graphics driver version 285.62 that has been running successfully with R@H on my "old machine" since November. I did NOT install any of the EVGA utilities, only the driver. All tasks that have completed since that time have failed: client error, invalid, 'application version' not reported. Testing will continue tonight, and I have slowed down the processing so there will be some tasks left in the queue to play with when I get home. I'm thinking the next step is to uninstall the current EVGA driver, see what happens for a few hours, then install the most recent EVGA driver. Based on driver version information reported by others here, I expect the new driver to cause failure also. 'Thank you' to William Blakemore for the interesting comments. Does anyone have other ideas for troubleshooting? |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
To Rocco Moretti - can you tell, by looking at the uploaded results of a failed task, whether the data computed/returned is good and the task is failing only because of some more trivial absence of information, like the missing application version? Is the upload being prematurely terminated? What exactly is causing the task to be marked as invalid? Thanks. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Troubleshooting Update: Just for grins ... you might try reinstalling ALL the suspect hardware drivers (and all the bells and whistles), with the exception of any GPU apps from any BOINC project. If everything still runs, that's telling us that something about the GPU app software and/or interface is somehow being inadvertently corrupted by the GPU app process. |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Troubleshooting Progress Report: At the end of the previous troubleshooting session I had shown that BOINC/Rosetta would work fine until the EVGA GPU driver was installed. Last night I continued testing by uninstalling the EVGA driver and going back to the Windows generic display adapter driver. Once again, with the Windows generic driver installed, BOINC/Rosetta began working correctly. I tested a second EVGA driver several hours later. Per suggestion I also installed all other useful applications from the EVGA disk, along with all Windows 7 (64-bit) updates. Again BOINC/Rosetta became unable to deliver valid tasks without client errors. Is it time to get the BOINC people involved? I'm fresh out of ideas here. The "Help Wanted" sign is hanging in the window! |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Troubleshooting Progress Report: IIRC, the EVGA display driver is really from the NVidia website. What you're really saying is that something about that driver (which includes the CUDA driver) is breaking the upload. The obvious question is, if the CUDA driver is flawed, why isn't it breaking any of the other CPU apps at other projects? |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
Hi, for information: I have had the "no finished file" error since at least 15/03 (all my results are erroring; I hadn't noticed) on an iMac 27 (2010 / i7 / 16 GB) with Mac OS X 10.6.8. I reset the project yesterday, with no success. Event log: Ven 23 mar 09:41:17 2012 | rosetta@home | Task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 exited with zero status but no 'finished' file Ven 23 mar 09:41:17 2012 | rosetta@home | If this happens repeatedly you may need to reset the project. Ven 23 mar 09:41:17 2012 | rosetta@home | Restarting task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 using minirosetta version 324 in slot 1 Ven 23 mar 09:41:18 2012 | rosetta@home | Task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 exited with zero status but no 'finished' file Ven 23 mar 09:41:18 2012 | rosetta@home | If this happens repeatedly you may need to reset the project. Ven 23 mar 09:41:18 2012 | rosetta@home | Restarting task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 using minirosetta version 324 in slot 1 Ven 23 mar 09:41:19 2012 | rosetta@home | Computation for task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 finished Ven 23 mar 09:41:19 2012 | rosetta@home | Output file rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0_0 for task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 absent Ven 23 mar 09:41:21 2012 | rosetta@home | Scheduler request completed: got 0 new tasks Ven 23 mar 09:41:21 2012 | rosetta@home | No work sent Ven 23 mar 09:41:21 2012 | rosetta@home | (reached daily quota of 8 results) BOINC startup messages: Mer 21 mar 17:27:44 2012 | | Starting BOINC client version 7.0.20 for x86_64-apple-darwin Mer 21 mar 17:27:44 2012 | | log flags: file_xfer, sched_ops, task Mer 21 mar 17:27:44 2012 | | Libraries: libcurl/7.21.7 OpenSSL/0.9.7l zlib/1.2.3 c-ares/1.7.4 Mer 21 mar 17:27:44 2012 | | Running as a daemon Mer 21 mar 17:27:44 2012 | | Data directory: /Library/Application Support/BOINC Data Mer 21 mar 17:27:44 2012 | | Processor: 8 GenuineIntel Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz [x86 Family 6 Model 30 Stepping 5] Mer 21 mar 17:27:44 2012 | | Processor features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 POPCNT Mer 21 mar 17:27:44 2012 | | OS: Mac OS X 10.6.8 (Darwin 10.8.0) Mer 21 mar 17:27:44 2012 | | Memory: 16.00 GB physical, 829.07 GB virtual Mer 21 mar 17:27:44 2012 | | Disk: 1.82 TB total, 828.83 GB free Mer 21 mar 17:27:44 2012 | | Local time is UTC +1 hours Mer 21 mar 17:27:44 2012 | | VirtualBox version: 4.1.10 Mer 21 mar 17:27:44 2012 | | WARNING: get_ati_mem_size_from_opengl failed to create PixelFormat Mer 21 mar 17:27:44 2012 | | OpenCL: ATI GPU 0: Radeon HD 4850 (driver version 1.0, device version OpenCL 1.0, 512MB, 512MB available) Mer 21 mar 17:27:44 2012 | | Config: report completed tasks immediately Mer 21 mar 17:27:44 2012 | | Config: GUI RPC allowed from any host |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Hi Jerome - I looked at one of your failed WU's, and it's a different problem than we are working on here. Up above in this same thread there is a mention of the same mechanism of failure that you are experiencing:
I see you are using BOINC 7 and have errno=13. You might want to go back to the previous version of BOINC. The latest is not necessarily the greatest! |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
OK, thanks for the info. Yes, I know about the potential instability. I'm trying version 7 because we are looking into GPU support under Mac OS X: BOINC started recognizing the GPU for us with v7, but no GPU project works on my iMac so far. It seems I'd have to upgrade to Lion, which I don't want to do for the moment. Some of my Alliance Francophone fellows have better results (with Lion), but afaik it's not working 100% yet. For information, I have quite a high number of projects running on my machine, and so far this is happening only with Rosetta. I'll see what I do regarding my version, thanks. |
In Memory of Kimsey M Fowler Sr Joined: 10 Mar 12 Posts: 26 Credit: 39,033,222 RAC: 0 |
Troubleshooting Report: A recent version of the GPU driver was uninstalled and the machine reverted to the Windows default driver. Rosetta once again started working correctly. Next, a much older July 2011 version of the NVIDIA driver was installed. Rosetta reported client errors again. One thing of note: if the driver is switched while a Rosetta WU is paused/suspended, the final result (success or failure) depends only on the driver loaded at the time the WU completes and uploads. The current recommended version of BOINC is 6.12.34; it has been in place since July 2011. A glaring characteristic of our problem is that it was first reported three days after Rosetta 3.22 was released. With older tasks quickly falling off the bottom of contributors' task lists, it's a little late to do a detailed analysis of what happened when 3.22 replaced the previous version of Rosetta. Spring break is almost over at the University of Washington, so hopefully next week we can get some support. In the meantime, does anyone know if there is a way to set up a test running a pre-Rosetta 3.22 task? If such a task completes normally with an NVIDIA driver installed for the GPU, some change that went into Rosetta 3.22 would likely be the source of the problem. Markus Elfring's suggestion above is noteworthy, and I read his additional comments on the other thread. |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
The problem already existed with Rosetta version 3.19. However, I seemed to be the only one to report it. I started this thread (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5875) back then. Regards, Rayburner |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
Yes, that definitely sounds like what the rest of us are experiencing - are you also running NVidia-based GPU apps for other projects? |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
Yes, I do. GPUGrid and PrimeGrid |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
OK, to sum up ... it looks like we have a bug, first reported in version 3.19, that's causing all Rosetta WU's to error out, for any users running NVidia GPU apps elsewhere. Speaking for only myself, while I enjoy getting credit like everyone else, my primary reason for being here is the science. I don't see any benefit in running WU's that just error out, to no one's benefit, so I stopped crunching here until the problem is resolved (and no, turning off my GPU is NOT something I'm inclined to do). If others feel the same way I do, then Rosetta is losing some pretty capable machines. Umm ... support people? |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,070,747 RAC: 5,595 |
OK, to sum up ... it looks like we have a bug, first reported in version 3.19, that's causing all Rosetta WU's to error out, for any users running NVidia GPU apps elsewhere. I moved my pc's, all of them, to other projects a couple of weeks ago!! Poem seems to be enjoying the benefit of my @34 cpu cores right now!! |
Rocco Moretti Joined: 18 May 10 Posts: 66 Credit: 585,745 RAC: 0 |
Umm ... support people? I wish I had more success to report. We're looking into it, and are in contact with the BOINC people about it. Unfortunately, we don't have a solution for the problem yet. Frankly speaking, it's embarrassing on our end that it's taking this long to solve this problem. I want to extend a heartfelt thanks to everyone on the forums who is helping with troubleshooting the issue, especially those (like In Memory of Kimsey M Fowler Sr) who have gone above and beyond in diagnosing things. Thanks to all of your efforts, I think we can be relatively confident that the issue is directly related to NVidia GPU drivers on Windows 7. I realize it sounds trite, but I'll say it anyway: Thanks again for your patience. We really do appreciate you volunteering your computers, and for putting up with us in our off moments. |
wbblakemore Joined: 18 Dec 07 Posts: 33 Credit: 4,181 RAC: 0 |
I'm in the computer business myself and I do understand that some bugs are devilishly hard to find. Perhaps instead of talking to the BOINC people, you'd have better luck talking to other projects that have simultaneously working CPU and GPU apps. Possibly, they could point out something that you're missing. But damn ... since 3.19??? |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,070,747 RAC: 5,595 |
Umm ... support people? I believe you are barking up the wrong tree but since you are the one doing the barking feel free. I have ONE Nvidia gpu but over 8 AMD gpu's and I do not believe my pc with the Nvidia gpu in it was crunching for Rosetta at the time of my departure, yet ALL but one of my pc's had the same problems that others are seeing! The one pc that did not have any problems was a pc that did not have a gpu in it that crunched! I believe your problem is in how you are handling, or maybe NOT handling, the gpu drivers, both Nvidia AND AMD, and how they relate to Boinc in general. BUT I am NOT a programmer, I am just a guy with some disposable income and also a guy that 'does' pc's as a hobby. |
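
For anyone who wants to check their own event log for the failure Jerome_C2005 posted earlier in the thread ("exited with zero status but no 'finished' file"), here is a minimal sketch. It assumes each log line has the three-field shape seen in that excerpt (timestamp, project, message, separated by `|`) and that the message wording matches the excerpt exactly. The names `count_no_finished_file`, `NO_FINISHED_FILE`, and `SAMPLE` are made up for illustration; nothing here uses a BOINC API, it only scans text.

```python
import re
from collections import Counter

# Message wording copied from the log excerpt in this thread (an assumption:
# other BOINC versions may phrase it differently).
NO_FINISHED_FILE = re.compile(
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_no_finished_file(lines):
    """Return a Counter mapping task name -> number of 'no finished file' exits."""
    counts = Counter()
    for line in lines:
        # Lines are assumed to look like "<timestamp> | <project> | <message>".
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field log line
        _timestamp, _project, message = parts
        match = NO_FINISHED_FILE.search(message)
        if match:
            counts[match.group("task")] += 1
    return counts


# A few lines taken from the excerpt posted above, used as sample input.
SAMPLE = [
    "Ven 23 mar 09:41:17 2012 | rosetta@home | Task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 exited with zero status but no 'finished' file",
    "Ven 23 mar 09:41:17 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.",
    "Ven 23 mar 09:41:18 2012 | rosetta@home | Task rb_03_22_29991_60688__t000__SAVE_ALL_OUT_IGNORE_THE_REST_45429_2203_0 exited with zero status but no 'finished' file",
    "Ven 23 mar 09:41:21 2012 | rosetta@home | No work sent",
]

if __name__ == "__main__":
    for task, n in count_no_finished_file(SAMPLE).items():
        print(f"{task}: {n} exit(s) without a 'finished' file")
```

To run it against a real log, pass the lines of a saved copy of the event log instead of `SAMPLE`. A task that shows up repeatedly is the restart loop described in the excerpt, which is a different failure from the NVIDIA-driver client errors discussed in the rest of this thread.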
©2024 University of Washington
https://www.bakerlab.org