Message boards : Number crunching : Report problems with Rosetta version 5.36
Author | Message |
---|---|
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
These two EDIT three computers A + B + C have produced these seven EDIT ten hung tasks in the last few days, the majority today: A1 + A2 + A3 + B1 + B2 + B3 + B4 + C1 + C2 + C3 Interestingly, it is only these two EDIT three hosts out of 10 currently running Rosetta, but I am coming to expect a yellow stripe across my BoincView display showing me these two boxes have stopped again. The most recent two to stop, A3 and B4, were both on the same protein, but that may well be coincidence as the other 5 are not for that protein. It was weird to see two tasks failing at the same time and with the same protein under investigation. This is not a screensaver issue as none of my BOINC clients run graphics. The boxes A and B are two of my three slowest boxes, but interestingly these two boxes have 368Mb RAM, whereas the other that is equally slow has only 256Mb and has not had this issue (yet). I had wondered if all the failed tasks are in the larger than 256Mb category - EDIT: until I spotted the same problem had occurred on box C, which has a faster (ahem, not quite so slow) cpu but only 256Mb RAM, so it does not seem to be a memory issue either. R~~ |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I had wondered if all the failed tasks are in the larger than 256Mb category" Yes, I wonder. What exactly is the requirement on the "large" WUs? >256MB? Or >256MB per core? Or is it <= 512MB? Rosetta Moderator: Mod.Sense |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
>> Result https://boinc.bakerlab.org/rosetta/result.php?resultid=45844369 has a validate error; it got stuck and was killed by the watchdog. The preferred run time was 21600 but it was killed at 26708.58. The cc_config.xml did not trap anything and I wonder if it even works on my version 5.2.13? This is only the second workunit run since turning the screensaver back on, and it has failed. I have double-checked the syntax of 'cc_config.xml' and it appears correct as per FluffyChicken's and the Boinc site's instructions. Will continue to monitor. Thanks for the reply Rhiju, haven't given up yet. I have not had any Ralph work units on this machine for a few days, so I can't check whether it is still a problem with Ralph as well. |
genes Joined: 8 Oct 05 Posts: 60 Credit: 694,386 RAC: 1,772 |
Thanks FluffyChicken, I have set this up, currently I have a Ralph 5.40 WU so we'll see how it goes. |
Mike Gelvin Joined: 7 Oct 05 Posts: 65 Credit: 10,612,039 RAC: 0 |
I have a 5.36 work unit on a remote system that appears to have hung for 8 days with no additional CPU time past the first 2 hours 31 minutes. This system does not run a screensaver, and has BOINC installed as a service. https://boinc.bakerlab.org/rosetta/result.php?resultid=44919383 Why hasn't the watchdog killed this work unit? |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
> This workunit failed with debug data https://boinc.bakerlab.org/rosetta/result.php?resultid=46039396 I don't think this one is related to the screensaver as the debug data does not mention it? |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
> This workunit failed with debug data Also this one https://boinc.bakerlab.org/rosetta/result.php?resultid=46133984 CPU time 4755.046875 stderr out <core_client_version>5.2.13</core_client_version> <message>The system cannot find the path specified. (0x3) - exit code 3 (0x3) </message> <stderr_txt> # random seed: 3450426 # cpu_run_time_pref: 21600 This one did stop with the screensaver on. The screen was not updating and the processor had dropped to idle; I did not get any debug information. 3 workunits processed so far today on this machine and 2 have failed. Without the Boinc screensaver I had no failures. Will keep debugging. |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
I think they used a different cc_config type setup for the older clients. You should be using 5.4.9/11 anyway if you are having problems with earlier configurations. They fixed some screensaver code among other parts. Though I do not know how well the logging works in 5.4.9/11 as none of my setups use it (that I know of ;-)) All have 5.6.4/5 or 5.7.2 installed, where the logging works well. Unfortunately BOINC developers have a habit of updating the website to reflect the current test versions of the client, so if the logging gets altered (I do remember a change in file name/convention some time back, but when I don't know) older client users are stuffed! But then they don't continue to develop the client for nothing ;-) Side note: according to BOINCStats, 5.4.9/11 are the most commonly used clients. Team mauisun.org |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
45723693 & 45755534 were both running on my hyperthreaded machine. BOINC seemed to lose contact with the running threads (title bar did not show localhost, tasks tab empty), and retrying communications failed. I exited and restarted BOINC. Both WUs ended prematurely: the time preference was 24 hrs, but they only ran for 13 and 10.5 hrs. Both show "No heartbeat from core client for 31 sec - exiting". I'm running BOINC 5.4.9 on Windows. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
scsimodo Joined: 17 Sep 05 Posts: 93 Credit: 946,359 RAC: 0 |
|
scsimodo Joined: 17 Sep 05 Posts: 93 Credit: 946,359 RAC: 0 |
|
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"I have a 5.36 work unit on a remote system that appears to have hung for 8 days with no additional CPU time past the first 2 hours 31 minutes. This system does not run a screensaver, and has BOINC installed as a service." The watchdog can't kill an app that has already died for any other reason. We call these "stopped clock" errors, or "cpu frozen", etc. What has really happened is that the app has gone to meet its maker but has been nailed to its perch by the client, which has not noticed its death, early demise, etc. Perhaps we should call this the Norwegian Blue app ;-) Here you have two bugs at once. The first is whatever bug in the app caused the access violation that made Windows stop the app running. You can see that something did this if you look at your result on the website, now it has been reported. The second bug is that the client does not notice when one of its daughter processes has exited. This counts as a bug in the client, imo, and is not down to Rosetta but to the BOINC people to sort out. Failure to accrue cpu time is possible for the app if a user task grabs 100% cpu for a prolonged time, so the client cannot assume from a stopped clock that there is definitely something wrong. It should however, imho, ask the operating system at this point if the app is still alive. In the meantime, we need to intervene when we notice this. A clock that is stopped for more than a couple of minutes is usually a sign that the task has ended abruptly. Suspend/resume (of the task, not of the project) is usually enough to get the app to either pick up again, or more likely to begin to upload (whether as an error or success). If it freezes a second time round, give it at least two minutes to get going again; then suspending, aborting and resuming the task should force it to go into upload as an error. Rosetta usually grants credit for errored results in both these cases, for the work the app did before it froze. What we don't get is credit for the time the box was idle -- but getting anything at all for an errored app is one up on the other projects. R~~ |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
>> Ok, all the following workunits have failed with the screensaver on, and in fact were on the screen when I noticed a couple of them had failed. The others had gone into a 'not responding' mode according to Task Manager and the processor had dropped to idle. Only about 2 workunits have worked since I turned the screensaver back on. https://boinc.bakerlab.org/rosetta/result.php?resultid=46243254 https://boinc.bakerlab.org/rosetta/result.php?resultid=46243277 These 2 had error code "exit code 1073807364". https://boinc.bakerlab.org/rosetta/result.php?resultid=46243290 https://boinc.bakerlab.org/rosetta/result.php?resultid=46243291 https://boinc.bakerlab.org/rosetta/result.php?resultid=46243301 (failed at 34 sec) https://boinc.bakerlab.org/rosetta/result.php?resultid=46243303 (failed at 610 sec) https://boinc.bakerlab.org/rosetta/result.php?resultid=46243310 These last 5 all errored out as "stuck" or running too long and killed by the watchdog, or had access violations. > I have now updated to client version 5.4.11 and will leave the screensaver on to see if I can trap some debug information or see if the problem happens any more. I cannot confirm if Ralph is having the same problem as I have had no workunits for a few days now. Will keep you posted. |
Jim Joined: 15 Oct 06 Posts: 22 Credit: 5,410,546 RAC: 0 |
I am also running version 5.36 with client version 5.4.11 on an AMD 3000+ and Windows XP. I opened the Show Graphics window and the machine locked up when I went to close the graphics window. I had to exit it using Task Manager. The workunit ended at that point with the exit code of 1073807364 (0x40010004). <core_client_version>5.4.11</core_client_version> <message> - exit code 1073807364 (0x40010004) </message> <stderr_txt> # random seed: 3382412 # cpu_run_time_pref: 10800 </stderr_txt> Before I opened the Show Graphics window everything appeared to be processing normally. Jim |
RuDiablo Joined: 11 Nov 05 Posts: 2 Credit: 463,636 RAC: 0 |
My teammate has a stdout.txt of ~30MB. The file contains the line: DANGER:: 0-overlap chainbreak score does not match the derivative!!!!!!!!!!!!!! Who is Phil? :) |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Conan, is 5.4.11 showing the debug info? Any time you open and close the Graphics it will make a log entry. It would have been nice if BOINC had brought out a 5.6.6 version and made it gold since the logging certainly works properly with 5.6.x+ Though it would be even better if Rosetta@home uploaded the *.pdb debug file like they did during the initial stages so people having problems could grab it and give real debug information back to Rosetta@home. (Mods Admin ???? have they thought of doing this again ?) Also another added bonus of the 5.6.x+ series is they can log what the actual graphics cards being used are, which means they may see a trend if a particular graphics card causes a particular crash (currently suspected to be ATI and sometimes integrated Intels. No one has mentioned Nvidia cards having screensaver troubles... of course not all of these are screensaver problems though). Guess we will not get most of this till 5.8.x comes out which will probably be a while as there are a lot of changes to test. Team mauisun.org |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
"Conan, is 5.4.11 showing the debug info ?" Thanks FluffyChicken, no, I can't see anything happening in the 'stdoutdae' file (where Boinc says it is going) indicating any debug information. I may consider going the whole hog and put on the latest version, but my experience with 5.4.11 was not good as it keeps trying to do a completely fresh install and wipe out the current version's data files (it does not detect Boinc version 5.2.13 that is already there). This has already happened once and added another computer to my list, so I just added the main components (boinc manager, boinc client, boinc command, boinc.dll). This may be why the debug is not working. The only thing I have found is that an error message now tells me Seti has the wrong url since changing to 5.4.11. So something is happening; I will have to detach and reattach to see if that problem fixes itself. A big plus is that since changing to 5.4.11 I have had no lock-ups yet, with the current WU over 3 hours now. The last ones were failing anywhere from a minute to 2 hours in, so I might just wait a while and see if any more fail first before upgrading again. Just checked and 5.4.11 is the latest recommended version. So why does it wipe previous versions off the map? Not a good update if you ask me. |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
[quote]...I may consider going the whole hog and put on the latest version, but my experience with 5.4.11 was not good as it keeps trying to do a completely fresh install and wipe out the current version's data files (it does not detect Boinc version 5.2.13 that is already there). This has already happened once and added another computer to my list, so I just added the main components (boinc manager, boinc client, boinc command, boinc.dll). This may be why the debug is not working. The only thing I have found is that an error message now tells me Seti has the wrong url since changing to 5.4.11. So something is happening; I will have to detach and reattach to see if that problem fixes itself. A big plus is that since changing to 5.4.11 I have had no lock-ups yet, with the current WU over 3 hours now. The last ones were failing anywhere from a minute to 2 hours in, so I might just wait a while and see if any more fail first before upgrading again. Just checked and 5.4.11 is the latest recommended version. So why does it wipe previous versions off the map? Not a good update if you ask me.[/quote] It didn't for me. Strange, though it is a long time since I tested it. You can always 'merge' the computers in your profile if you want to keep them together and neat. They did change some URL project code (for security) but I don't remember seti changing their URL either... Team mauisun.org |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 3,332 |
> Everything was working fine, graphics not locking up and WU up to 5 hours 28 minutes and nearing completion. I moved the mouse, which stopped the screensaver, and whilst looking at my task list on the manager I saw the Rosetta WU die. https://boinc.bakerlab.org/result.php?resultid=46405086 This is the new error I received (at least it is a different one for me) CPU time 19728.5 stderr out <core_client_version>5.4.11</core_client_version> <message> Maximum disk usage exceeded </message> <stderr_txt> # random seed: 3295364 # cpu_run_time_pref: 21600 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x7C901230 Engaging BOINC Windows Runtime Debugger... Have not seen this one for a long time. I have a 250 GB HD and 2 GB of RAM, so disk or memory should not be a problem. The Boinc message said that the WU was aborted. This was done by either Rosetta, which I doubt, or by the Boinc client/manager. It could also be a bit of both. |
Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0 |
<core_client_version>5.4.11</core_client_version> 46421301 <core_client_version>5.4.9</core_client_version> <message> Maximum disk usage exceeded </message> <stderr_txt> Graphics are disabled due to configuration... # random seed: 3277006 # cpu_run_time_pref: 28800 SIGSEGV: segmentation violation ...another "disk space exceeded" error and it's the same WU type as the one Conan reported (my first error in ages, btw ;-). Team betterhumans.com - discuss and celebrate the future - hoelder1in.org |
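
A side note on the exit codes quoted in the posts above: 1073807364 and 0x40010004 are the same number written in decimal and in hex, and Windows status codes of this kind pack a severity, a facility and a code into 32 bits. The sketch below is only an illustration, assuming the standard NTSTATUS bit layout (severity in the top two bits, facility in bits 16 to 27, code in the low 16 bits). `decode_ntstatus` is a hypothetical helper name, not part of BOINC or Rosetta.

```python
# Hypothetical helper: split a Windows exit/status code into its bit fields.
# Assumes the standard NTSTATUS layout; not taken from BOINC or Rosetta code.
SEVERITIES = ("success", "informational", "warning", "error")


def decode_ntstatus(value: int) -> dict:
    """Return the hex form and bit fields of a 32-bit status code."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return {
        "hex": f"0x{value:08X}",
        "severity": SEVERITIES[value >> 30],   # top two bits
        "facility": (value >> 16) & 0xFFF,     # bits 16..27
        "code": value & 0xFFFF,                # low 16 bits
    }


if __name__ == "__main__":
    # The two status values that appear in this thread.
    for raw in (1073807364, 0x80000003):
        print(raw, decode_ntstatus(raw))
```

Under that assumption, `decode_ntstatus(1073807364)` reports hex 0x40010004 with informational severity, while the 0x80000003 value from the "Breakpoint Encountered" report decodes as a warning.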
©2024 University of Washington
https://www.bakerlab.org