Report stuck & aborted WU here please

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 14146 - Posted: 19 Apr 2006, 23:04:07 UTC - in response to Message 14110.

[quote]You obviously missed the post from "Rhiju" where he said in part: "...The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why." The full text of the post is Here.

As a personal aside, I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?[/quote]

I have seen the above post. But you obviously missed the post from me where I said the auto abort at 48 Hrs is not working:

"...Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 Hrs. It did not self abort as you said it would and send itself in. IT JUST RESTARTED, and it most likely was the 3rd time it restarted. That is 142 Hrs of wasted CPU time."

I think I aborted it at 7.5 Hrs. But I was watching all the way through 48 Hrs, and then the clock went to 00:00. So I know for a fact that it had at least 55 Hrs that I know of, and it might have done the loop 5 times. But you only grant credit for the 7.5 Hrs recorded after the loop reset. So not only is the 48 Hrs auto abort not working the way you want it to, the granting of credit is not working the way it should either.

And also this post where I said:

"...I'm sorry, I can not. I looked for it but could not find it. Your system for tracking WUs is hard for me to use. It might work OK for me if I had only a few nodes working this. But I have over 50 nodes working this project, and jobs get lost with so many pages of WUs. It might help if you put in page numbers 1 to 10, 20, 30, 40, 50 instead of just NEXT PAGE."

If you want to tell me how to get the info you want, I will try to retrieve it for you. Is there a file in my BOINC folder that I can open to get this info? I looked through hundreds of pages of returned WUs, and that was only 3 days' worth (VERY HARD to work with for a power user). I still do not know how all the other posters are getting and posting the links they are. So you tell me what you want me to do, or to retrieve from my network. Just remember I am NOT in the IT field, so please be kind.

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 14154 - Posted: 20 Apr 2006, 4:09:01 UTC - in response to Message 14110.

[quote]As a personal aside, I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?[/quote]

OK, after spending the last 1.5 Hrs looking, I found it. Note the time sent and the time sent back: over 104 Hrs. That means this job was on its 3rd loop (48 + 48 + 7). Now look at the granted points: 8.25156419314919 for a job that ran for 104 Hrs. Now you can see why I was saying your auto abort and your granting of due credit are not working, and the real need to find a way to purge BAD WUs from the Rosetta servers and the members' clients. It is unfair for Rosetta to make us members pay the bills to purge the system of your bad WUs.

Result ID: 17079919
Name: FA_RLXpt_hom002_1ptq__361_380_3
Workunit: 11796498
Created: 12 Apr 2006 9:46:52 UTC
Sent: 12 Apr 2006 11:29:20 UTC
Received: 16 Apr 2006 19:13:28 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: -197 (0xffffff3b)
Computer ID: 178772
Report deadline: 26 Apr 2006 11:29:20 UTC
CPU time: 1716.296875
stderr out:
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# random seed: 2485491
# random seed: 2485491
# random seed: 2485491
# random seed: 2485491
</stderr_txt>
Validate state: Invalid
Claimed credit: 8.25156419314919
Granted credit: 8.25156419314919
application version: 4.98

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------
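The gap described in this post can be checked directly from the posted result fields. The short sketch below only does the arithmetic on the Sent, Received and CPU time values shown above; note that the window between Sent and Received also includes any time the job sat in the queue, so it is an upper bound on run time, not a measurement of it.

```python
from datetime import datetime

# Timestamps and CPU time copied from the result page quoted in the post (UTC).
FMT = "%d %b %Y %H:%M:%S"
sent = datetime.strptime("12 Apr 2006 11:29:20", FMT)
received = datetime.strptime("16 Apr 2006 19:13:28", FMT)
cpu_time_s = 1716.296875

# Wall-clock window in which the job was out on the client.
wall_hours = (received - sent).total_seconds() / 3600
# CPU time the server actually recorded for the result.
cpu_hours = cpu_time_s / 3600

print(f"sent -> received: {wall_hours:.1f} h")
print(f"recorded CPU time: {cpu_hours:.2f} h")
```

The recorded CPU time covers well under an hour of a window of roughly 104 hours, which is consistent with the complaint that only the time since the last restart was counted.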

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0
Message 14156 - Posted: 20 Apr 2006, 4:38:22 UTC

Laurenu2, assuming you know which computer had the WU, you can find it by:
- click on "Participants" on any Rosetta@home web page
- click on "View computers"
- click on the computer ID for the computer that ran the WU
- click on the number on the "Results" line

You will now have a list of results for just that computer.

That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past deadline. Then you got it. At least now it's had too many errors and won't be sent out any more.
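Once the result ID is known, the links other posters include in this thread follow one pattern: the `result.php` page with a `resultid` query parameter. A minimal sketch, assuming that URL pattern (taken from links posted in this thread) stays valid; the helper name `result_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Base address as it appears in result links posted in this thread.
BASE = "https://boinc.bakerlab.org/rosetta/result.php"

def result_url(result_id: int) -> str:
    """Build the link to a single result page from its result ID."""
    return f"{BASE}?{urlencode({'resultid': result_id})}"

# Result ID of the work unit discussed above.
print(result_url(17079919))
```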

Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0
Message 14157 - Posted: 20 Apr 2006, 4:51:03 UTC - in response to Message 14154. Last modified: 20 Apr 2006, 4:51:33 UTC

[quote][quote]As a personal aside, I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?[/quote]

OK, after spending the last 1.5 Hrs looking, I found it. Note the time sent and the time sent back: over 104 Hrs. That means this job was on its 3rd loop (48 + 48 + 7). Now look at the granted points: 8.25156419314919 for a job that ran for 104 Hrs. Now you can see why I was saying your auto abort and your granting of due credit are not working, and the real need to find a way to purge BAD WUs from the Rosetta servers and the members' clients. It is unfair for Rosetta to make us members pay the bills to purge the system of your bad WUs ...[/quote]

What is unfair is for people to bash the developers and the people who are trying to help you and others with the issues. The errors are frustrating for everyone, most of all the project people. While I am sure it may be difficult for you to manage certain aspects of your farm if things aren't perfect, you built the farm. The work and expense are just part of what is required if you plan to run it.

While you have made it very clear that the 30 or so credits you may have lost on this WU should be the single most important thing in the universe to the project, just try to get ANY of the credits for a failed work unit on ANY of the other projects. (And yes, ALL projects have failed WUs.) The Rosy people are doing what they can on the credit issue, and it is a lot more than what you get from other venues. But there are other things going on right now, so they only do it once a week.

Why don't you just post the error, with sufficient information to help them fix the problem, without all the bashing of the project people? It has already been explained that BOINC does not allow the project to remove WUs from your system, and why. The only project that does this at all is CPDN, and they are using a highly modified version of the BOINC client to do it. This project uses the standard BOINC package for everything, including the results pages. If you are having a problem with BOINC, you should post the issue on the BOINC development site.

As for looking up the links to your results, have you tried bookmarking a separate page in your browser for each machine? It is very easy to do. That way you can isolate the results by machine, and monitor them individually if something comes up. That should help you find things. Also, if you reduce your connection interval and increase your time setting, you will have fewer WUs to look through.

Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0
Message 14170 - Posted: 20 Apr 2006, 10:00:41 UTC - in response to Message 14131.

[quote]Based on the great advice from this forum, I coded a "watchdog" thread for Rosetta@home. It will output any data and abort work units that haven't changed their score in thirty minutes -- a pretty good indicator that the job is stuck! I'll be testing this over on RALPH over the next couple days. I'm also thinking of putting in an abort if the CPU time is more than twice the maximum time for the workunit (typically 4 hours by default these days, or whatever the client's preference)... that's another sign that the workunit is not compatible with the client. Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.[/quote]

Thanks! :-)

A function to kill a WU that's stuck because the time for swapping between the projects is too low for creating a new checkpoint (e.g. 1 hour), so the WU starts all over again and again, would be nice also. This could help people not waste endless time on one WU for nothing, since keeping WUs in memory while preempted is not necessary anymore.

[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]
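The two abort conditions described in the quoted post (no score change for thirty minutes, or CPU time beyond twice the work unit's target) can be sketched as below. This is not Rosetta's watchdog code, which is not shown in the thread; the class name, field names and the polling interface are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

STALL_LIMIT_S = 30 * 60  # thirty minutes without a score change

@dataclass
class Watchdog:
    """Hypothetical watchdog state; polled periodically by the job."""
    max_cpu_s: float                     # target CPU time for the work unit
    last_score: Optional[float] = None   # most recent score seen
    last_change_s: float = 0.0           # time at which the score last changed

    def check(self, now_s: float, cpu_s: float, score: float) -> Optional[str]:
        """Return a reason to abort, or None to keep running."""
        if self.last_score is None or score != self.last_score:
            self.last_score = score
            self.last_change_s = now_s
        if now_s - self.last_change_s >= STALL_LIMIT_S:
            return "score unchanged for 30 minutes"
        if cpu_s > 2 * self.max_cpu_s:
            return "CPU time over twice the target"
        return None

# Simulated polls: the score moves once, then stays flat for thirty minutes.
wd = Watchdog(max_cpu_s=4 * 3600)
print(wd.check(now_s=0, cpu_s=0, score=-10.0))
print(wd.check(now_s=600, cpu_s=600, score=-12.5))
print(wd.check(now_s=2400, cpu_s=2400, score=-12.5))
```

In a real client the check would run on its own thread and, as the post says, output whatever data exists before stopping; the sketch only covers the decision itself.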

Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0
Message 14171 - Posted: 20 Apr 2006, 10:16:26 UTC

Some errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=16591648
https://boinc.bakerlab.org/rosetta/result.php?resultid=16619607

Member of Dutch Power Cows

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 14172 - Posted: 20 Apr 2006, 11:00:55 UTC

Snake Doctor, I am sorry if I am a little slower than most people here like you. I have my problems, which I will not go into right now. But if you think me trying to explain a bug that I found (endless loop) is bashing, and posting the long letter below, I think you are the one doing the bashing at me. WHY?

It may be true that I am a little frustrated, and perhaps it may show through, but I was not bashing. I worked hard for Hrs trying to find that WU and learn how to find the info they wanted, and I did post it. Is that bashing? NO.

AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.

30 points, hahaha. I make over 17,000 points a day; do you really think I am concerned with a mere 30 points? You need to get a reality check. I am quite sure I have had around 1200 Hrs of CPU time stuck on these; all the wasted time is what concerns me.

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0
Message 14184 - Posted: 20 Apr 2006, 17:14:20 UTC Last modified: 20 Apr 2006, 17:15:29 UTC

After more than 30 hours runtime, and stuck for hours at the same percentage, I aborted the job:
https://boinc.bakerlab.org/rosetta/result.php?resultid=17454155
260 credits lost...

Member of Dutch Power Cows

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
Message 14186 - Posted: 20 Apr 2006, 17:41:56 UTC - in response to Message 14131.

[quote]Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.[/quote]

What about when the deadline is approaching? Sometimes people crank their preference straight from 4 hrs to 24 hrs, and all of a sudden 10 WUs cannot be completed before deadline. So, if the deadline is "near" (how near?), then just finish the current model and end this WU so it can be reported in time. I don't know if you would call that an "abort". It's more of a normal end, in advance of the target runtime.

Great progress!

Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
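The deadline suggestion amounts to one comparison made between models: is there still room for another model before the report deadline? A rough sketch follows, in which the function name, the per-model time estimate and the one-hour safety margin are all assumptions; the post itself leaves "how near?" open.

```python
def should_finish_early(now_s: float,
                        deadline_s: float,
                        est_model_s: float,
                        margin_s: float = 3600.0) -> bool:
    """True if starting another model risks missing the report deadline.

    margin_s is a guessed allowance for uploading and reporting the result.
    """
    return now_s + est_model_s + margin_s > deadline_s

# A model takes about 2 h; 2.5 h remain before the deadline.
hours = 3600.0
print(should_finish_early(now_s=0.0, deadline_s=2.5 * hours, est_model_s=2 * hours))
# Ten hours remain: plenty of room for another model.
print(should_finish_early(now_s=0.0, deadline_s=10 * hours, est_model_s=2 * hours))
```

When the check returns true, the job would write out the models completed so far and end normally, which matches the "normal end, in advance of the target runtime" described above.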

Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0
Message 14188 - Posted: 20 Apr 2006, 18:03:56 UTC - in response to Message 14172. Last modified: 20 Apr 2006, 18:13:58 UTC

[quote]AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.[/quote]

Lauren, I just read in the other threads, in posts by Rhiju, Bin Qian and David Kim, that:

[quote]1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on ralph since yesterday.
2. We've coded up rosetta to do more frequent checkpointing in the modeling process. Now for the large jobs, we are expecting less than 30 minutes for the time between two checkpoints. This code has been tested locally, and will be tested on ralph within a couple of days.
3. Rhiju has coded a watchdog thread for rosetta which will terminate the stuck jobs and return the intermediate results. See his post at this thread. This will be tested on ralph within a couple of days too.
4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory.[/quote]

If I understand the new features quoted above correctly, #3 and #4 should ensure that no WUs get stuck forever, until a human operator aborts them via BOINC.

Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512M mem, BigWU etc), so that big WUs are sent only to PCs which are capable / willing to process them.

PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves.

Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity
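Item 4 in the quoted list, the limit of 5 restarts per WU, can be pictured as a small counter that survives restarts. The sketch below is an illustration only: the dictionary stands in for whatever state the application persists on disk, and the function and key names are invented here, not taken from Rosetta.

```python
MAX_RESTARTS = 5  # limit quoted in the post

def on_restart(state: dict) -> str:
    """Called each time the job is started again from disk.

    `state` is a stand-in for a small file kept next to the checkpoint.
    """
    state["restarts"] = state.get("restarts", 0) + 1
    if state["restarts"] >= MAX_RESTARTS:
        # Stop here and output whatever result exists at this point.
        return "stop_and_report"
    return "resume_from_checkpoint"

# A job that is restarted again and again without being left in memory.
state = {}
actions = [on_restart(state) for _ in range(5)]
print(actions)
```

Combined with checkpoints less than 30 minutes apart (item 2), a job that keeps getting preempted either makes progress between restarts or is stopped after a bounded number of attempts.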

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Message 14190 - Posted: 20 Apr 2006, 18:29:47 UTC - in response to Message 14156.

[quote]That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past deadline. Then you got it. At least now it's had too many errors and won't be sent out any more.[/quote]

Perhaps it is a good idea to turn off the mechanism of resending failed WUs at the moment (i.e. by setting the number of failures until cancellation of that WU to 1). With such a setting, bad WUs would only bother one participant.
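The effect of the proposed setting can be estimated from the numbers in the quoted post: 57 CPU hours on the first host, 14 on the second, and none on the third, where the job expired in the queue. The sketch below assumes each of those three outcomes counts as one failure toward the limit; the parameter name `max_errors` is hypothetical, and the fourth host is left out because its CPU time is disputed in this thread.

```python
def wasted_hours(cpu_hours_per_host, max_errors):
    """CPU hours burned before the work unit is cancelled.

    Assumes the WU fails on every host and is retired after `max_errors`
    failures (a hypothetical name for the server-side limit).
    """
    return sum(cpu_hours_per_host[:max_errors])

# CPU hours reported for the first three hosts that received the WU.
history = [57, 14, 0]

print(wasted_hours(history, max_errors=1))
print(wasted_hours(history, max_errors=3))
```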

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0
Message 14192 - Posted: 20 Apr 2006, 18:55:07 UTC

It'd still be nice to have WUs that are only causing problems with, say, the Windows clients only be sent out to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved.

And if Boinc/Rosetta sends any data back and forth during the network connections that could be piggybacked (when we're looking to see if we need new work, returning a finished WU, or checking to see if there's a Rosetta update), so that the file would be left on the machine until the next connection to the server, it would be nice to have a list of problem WUs sent out that should be nuked.

Although, between the mentioned changes and hopefully better pre-testing on Ralph, we can hope that the problem will not crop up any more. :)

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 14193 - Posted: 20 Apr 2006, 19:38:01 UTC - in response to Message 14192.

[quote]It'd still be nice to have WUs that are only causing problems with, say, the Windows clients only be sent out to the Linux/Darwin users, instead of requiring 3 failures on Windows clients before they're shelved.

And if Boinc/Rosetta sends any data back and forth during the network connections that could be piggybacked (when we're looking to see if we need new work, returning a finished WU, or checking to see if there's a Rosetta update), so that the file would be left on the machine until the next connection to the server, it would be nice to have a list of problem WUs sent out that should be nuked.

Although, between the mentioned changes and hopefully better pre-testing on Ralph, we can hope that the problem will not crop up any more. :)[/quote]

Or even a user-selected option for the client to report back to the servers every 3 to 6 Hrs. It could give them a lot of alpha info to see what works better and how any upgrades are working.

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 14207 - Posted: 21 Apr 2006, 0:21:01 UTC - in response to Message 14188.

I haven't yet got the watchdog thread into Rosetta 5.01, but we have very high hopes for it! It was a great idea from this message board. It should go into the next update, probably early next week, if the Windows build cooperates. (We're trying not to do updates during the weekend -- we seem to have had bad luck in the past!)

I'm paying attention to the ideas about reverse trickle, keeping contact between client and server, etc. -- these are nice suggestions. As I explained below, those will likely require some changes in the BOINC code, and we'll need help from the BOINC crew. They've been pretty occupied with their upcoming release.

I like the idea below of not passing on bad jobs to another client when they fail -- so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!

One final note: we just went through and granted credits to errored jobs in our database. I'm trying to code the watchdog so that it will gracefully abort, including the valid output of data, so that the job will automatically get credit (but will be tagged for us as a premature abort).

[quote][quote]AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work on servers and clients. To think this will not happen again is unwise.[/quote]

Lauren, I just read in the other threads, in posts by Rhiju, Bin Qian and David Kim, that:

[quote]1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on ralph since yesterday.
2. We've coded up rosetta to do more frequent checkpointing in the modeling process. Now for the large jobs, we are expecting less than 30 minutes for the time between two checkpoints. This code has been tested locally, and will be tested on ralph within a couple of days.
3. Rhiju has coded a watchdog thread for rosetta which will terminate the stuck jobs and return the intermediate results. See his post at this thread. This will be tested on ralph within a couple of days too.
4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without leaving in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap project without leaving in memory.[/quote]

If I understand the new features quoted above correctly, #3 and #4 should ensure that no WUs get stuck forever, until a human operator aborts them via BOINC.

Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512M mem, BigWU etc), so that big WUs are sent only to PCs which are capable / willing to process them.

PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves.[/quote]

Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0
Message 14224 - Posted: 21 Apr 2006, 3:58:30 UTC

Thank you, Rhiju, for listening to our needs and taking steps to fix or improve a very frustrating problem. If any of my words were at all harsh, please forgive me. It was not my intent. I just want to get my point across, and words do not come easily to me.

I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end. Again, thank you.

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0
Message 14227 - Posted: 21 Apr 2006, 5:45:07 UTC - in response to Message 14224.

Your comments have been really helpful -- please continue to make suggestions. Hopefully by next week we can ensure that these stupid stuck-at-1.04% jobs never show up again on your computers. Thanks for hanging in there!

[quote]Thank you, Rhiju, for listening to our needs and taking steps to fix or improve a very frustrating problem. If any of my words were at all harsh, please forgive me. It was not my intent. I just want to get my point across, and words do not come easily to me.

I have checked all my nodes and not one is on 1.4, so I guess the bad ones are at an end. Again, thank you.[/quote]

Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0
Message 14253 - Posted: 21 Apr 2006, 10:24:04 UTC Last modified: 21 Apr 2006, 10:55:05 UTC

Another HUGE amount of CPU time wasted!!!!

https://boinc.bakerlab.org/rosetta/result.php?resultid=17734977
CPU time: 42670.640625
Claimed credit: 145.838794071523

I had to abort this one as it was caught in a loop. Action done around 6 AM AST.

stderr out:
<core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1509912
# cpu_run_time_pref: 21600
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263
</stderr_txt>

What irks me is that I was the second computer to receive this WU. I just hope that the third one that receives it is wise enough and aborts it before a lot of his CPU time is wasted.

So don't gang up on me when I say ARGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH!!!!!!

PS: Ah, at least the new version doesn't wait too long to go the error way. On that one I will report in the 5.01 thread :(

"This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato
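The stderr output in this post shows the loop directly: `attempts` climbs from 7 to 9 while `num_decoys` stays at 6 and CPU time keeps accumulating. A small sketch of how such output could be scanned for that pattern is given below; the regular expression is fitted to the three lines posted here, and the "stuck" rule (same decoy count, rising attempts) is an assumption about what counts as a loop, not a project-defined test.

```python
import re

# Progress lines copied from the stderr output in the post above.
STDERR = """\
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:7 cpu_run_time:30500.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:8 cpu_run_time:33366.1
# Exception caught in nstruct loop ii=1 i=7
# num_decoys:6 attempts:9 cpu_run_time:34263
"""

LINE = re.compile(r"num_decoys:(\d+)\s+attempts:(\d+)\s+cpu_run_time:([\d.]+)")

def parse_progress(text):
    """Return (num_decoys, attempts, cpu_run_time) for each progress line."""
    return [(int(d), int(a), float(t)) for d, a, t in LINE.findall(text)]

def looks_stuck(progress):
    """True if attempts rise while the number of finished decoys does not."""
    if len(progress) < 2:
        return False
    decoys = {d for d, _, _ in progress}
    return len(decoys) == 1 and progress[-1][1] > progress[0][1]

progress = parse_progress(STDERR)
print(progress)
print(looks_stuck(progress))
```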

Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0
Message 14267 - Posted: 21 Apr 2006, 14:59:49 UTC - in response to Message 14253.

[quote]Another HUGE amount of CPU time wasted!!!! ...[/quote]

Jose, your time is not wasted. Look at This post. From this statement, the results are used and you will be granted credit. So perhaps not so much ARGH but more like AHHH!

Regards
Phil

Steven Purvis Joined: 17 Sep 05 Posts: 1 Credit: 8,194,677 RAC: 0
Message 14280 - Posted: 21 Apr 2006, 17:21:33 UTC

I've just aborted about 6 work units for rosetta 4.98 with names starting 7486_largescale_large_full_atom_relax_XXXXXXXXXXXX. They all seemed to be stuck, getting to about 1.4% but no higher. I have the "don't remove workunits from memory" option enabled, so that shouldn't cause a problem.

The work unit results were:
17191225
17191227
17191336
17191339
17191352
17191374

Hope this is useful in some way.

[DPC]FOKschaap~_mcintosh_ Joined: 4 Dec 05 Posts: 5 Credit: 118,303 RAC: 0
Message 14318 - Posted: 21 Apr 2006, 23:10:15 UTC

PROD_ABINITIO_FAST_1tul__447_32515

That one got aborted by BOINC. Claimed credit 251, hope to see that one day ;)
