Report stuck & aborted WU here please - II

Message boards : Number crunching : Report stuck & aborted WU here please - II

Delk

Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 14097 - Posted: 19 Apr 2006, 8:52:23 UTC
Last modified: 19 Apr 2006, 8:53:54 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17550826

Stuck, so I've aborted.

The truncate_* units were removed due to a problem, so why not the FA_* units? The only FA units still going around are broken from what I've seen, so it's really annoying since they don't time out at all.

If you want more people on the project, quality needs to be improved to reduce churn. I'm not just talking about workunit or app issues (a la 4.97), but also things like the down-for-maintenance notice today: what about at least 24 or 48 hours' notice? This really isn't anything new; change-control processes (which include client communication) are something a lot of us deal with every working day.

My 2 cents.

ID: 14097 · Rating: 0
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14108 - Posted: 19 Apr 2006, 13:56:58 UTC - in response to Message 13799.  

But what about the removal of bad WUs from your servers? You must set up a way to stop resending the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WUs on the client side. To let bad WUs run on your systems or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.


Well David, it seems you could not or did not REMOVE the bad WUs; I and others are still getting them. I just found this one, TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0, which WASTED another 28 hours. This is not good. I am nearing the end of my patience with these BAD jobs and the THOUSANDS of hours of wasted work time that you will not give points for.
David, I AM VERY UPSET ABOUT THIS.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14108 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 14109 - Posted: 19 Apr 2006, 14:10:19 UTC
Last modified: 19 Apr 2006, 14:10:35 UTC

Aborted at 1.04%.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17045924

ID: 14109 · Rating: 0
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 14113 - Posted: 19 Apr 2006, 15:11:57 UTC - in response to Message 14112.  
Last modified: 19 Apr 2006, 15:14:55 UTC

It looks like maybe "Rhiju's" error trap worked and terminated the Work Unit. If so, it should claim some credit.

Well, my account states that for that work unit the following credit was claimed but not granted:

Result 17504849 · WU 14375734 · sent 18 Apr 2006 9:59:32 UTC · received 18 Apr 2006 21:53:49 UTC · server state: Over · outcome: Client error · client state: Computing · CPU time 32,858.34 · claimed credit 101.95 · granted credit ---

See, the issue for me goes past the credit stuff (although I would be dishonest if I didn't admit I want all the credits possible added to my team totals, as we are facing a vicious stampede by some very annoying cows; LOL, yes, I have a sense of humor): it is seeing all that precious computing time not generating useful work that worries me.

BTW, my life partner is considering suing Rosetta@home for loss of consortium... Partner claims I am addicted to the screen saver and that I am becoming slightly nuttier than when we met. :P

"This and no other is the root from which a Tyrant springs; when he first appears he is a protector."
Plato
ID: 14113 · Rating: 0
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 14114 - Posted: 19 Apr 2006, 15:23:36 UTC
Last modified: 19 Apr 2006, 15:25:12 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11846000
FA_RLXpt_hom006_1ptq__361_479

1.04%, 17+ hours... takes a lickin' and keeps on tickin'!

This will be a good test to see if credit will be eventually awarded as was stated elsewhere in this thread.

dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 14114 · Rating: 0
CremionisD

Joined: 10 Mar 06
Posts: 9
Credit: 37,604,006
RAC: 0
Message 14117 - Posted: 19 Apr 2006, 17:58:26 UTC

Workunit aborted manually.

"Truncate_termini_fullrelax_1b3a_433_628_0" - Model 1, step 241723, at 1.04%
CPU time ~24:30:00

Result ID = 17022725, (Workunit = 13954214)
ID: 14117 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14125 - Posted: 19 Apr 2006, 18:58:57 UTC - in response to Message 13985.  


Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).


Such a feature exists and was recently employed by cpdn.org to reset the faulty models they sent out. It's called a "reverse trickle" or "killer trickle". But it still needs a contact from the client in order to respond with the "killer trickle"; however, any contact will do.
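
For the curious, the general shape of such a scheme could look roughly like the sketch below. This is illustrative only, not CPDN's or BOINC's actual trickle code, and every name in it is invented: the server keeps a list of results it has flagged as bad, and whenever a client makes contact, the reply carries the names of any of those results the client is holding so the client can abort them locally.

    // Illustrative only: the general shape of a "cancel on next contact" scheme.
    // Not CPDN's or BOINC's real protocol; all names here are invented.
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Server side: result names flagged as bad after they were sent out.
    std::set<std::string> cancelled_results = {
        "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0"
    };

    // Called on any contact from a client; the reply piggybacks the names of
    // tasks that client should abort.
    std::vector<std::string> build_abort_list(const std::vector<std::string>& tasks_on_client) {
        std::vector<std::string> to_abort;
        for (const auto& name : tasks_on_client)
            if (cancelled_results.count(name))
                to_abort.push_back(name);
        return to_abort;
    }

    int main() {
        // Client side: on receiving the reply, abort the named tasks locally.
        std::vector<std::string> my_tasks = {
            "FA_RLXpt_hom006_1ptq__361_479",
            "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0"
        };
        for (const auto& name : build_abort_list(my_tasks))
            std::cout << "aborting " << name << "\n";  // stand-in for the real abort
    }

The important property, as noted above, is that nothing happens until the client contacts the server, but any contact at all is enough to carry the abort list.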

ID: 14125 · Rating: 0
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 14131 - Posted: 19 Apr 2006, 19:59:25 UTC - in response to Message 14125.  

Based on the great advice from this forum, I coded a "watchdog" thread for Rosetta@home. It will output any data and abort work units that haven't changed their score in thirty minutes -- a pretty good indicator that the job is stuck! I'll be testing this over on RALPH over the next couple of days. I'm also thinking of putting in an abort if the CPU time is more than twice the maximum time for the workunit (typically 4 hours by default these days, or whatever the client's preference is)... that's another sign that the workunit is not compatible with the client. Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.
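
A minimal sketch of what such a watchdog could look like (a rough illustration only, not the actual Rosetta code; the shared variables, abort_and_report() and the 4-hour figure are just stand-ins):

    #include <atomic>
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <thread>

    // Shared state that the main search loop would update as it works.
    std::atomic<double> g_current_score{0.0};
    std::atomic<double> g_cpu_time_hours{0.0};

    void abort_and_report(const char* reason) {
        std::cout << "watchdog: " << reason
                  << "; writing out partial results and ending the work unit\n";
        std::exit(0);
    }

    // Checks every 30 minutes for the two "stuck" conditions described above.
    void watchdog(double target_runtime_hours) {
        double last_score = g_current_score;
        for (;;) {
            std::this_thread::sleep_for(std::chrono::minutes(30));
            double score = g_current_score;
            if (score == last_score)
                abort_and_report("score unchanged for thirty minutes");
            if (g_cpu_time_hours > 2.0 * target_runtime_hours)
                abort_and_report("CPU time exceeded twice the target runtime");
            last_score = score;
        }
    }

    int main() {
        std::thread wd(watchdog, /*target_runtime_hours=*/4.0);
        wd.detach();
        // The real search loop would run here, updating g_current_score and
        // g_cpu_time_hours as models are produced.
        std::this_thread::sleep_for(std::chrono::hours(99));
    }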


Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).


Such a feature exists and was recently employed by cpdn.org to reset the faulty models they sent out. It's called a "reverse trickle" or "killer trickle". But it still needs a contact from the client in order to respond with the "killer trickle"; however, any contact will do.


ID: 14131 · Rating: 1
Jose

Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 14134 - Posted: 19 Apr 2006, 20:22:58 UTC - in response to Message 14123.  
Last modified: 19 Apr 2006, 20:23:49 UTC

My apologies if I sounded like I was bitching. I'd better take a break from the screen... but drat... those amino acid chains dancing all over the screen are so addictive :)

Peace and ty for all your effort to make this project an efficient one.
"This and no other is the root from which a Tyrant springs; when he first appears he is a protector."
Plato
ID: 14134 · Rating: 1
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14146 - Posted: 19 Apr 2006, 23:04:07 UTC - in response to Message 14110.  


You obviously missed the post from "Rhiju" where he said in part-

"...The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if its not flagged to get credit, I can see why."

The full text of the post is Here.
As a personal aside I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?



I have seen the above post, but you obviously missed the post from me where I said the auto-abort at 48 hours is not working:

"...Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hours; it did not self-abort and send itself in as you said it would, IT JUST RESTARTED, and it was most likely the 3rd time it had restarted. That is 142 hours of wasted CPU time."


I think I aborted it at 7.5 hours, but I was watching all the way through 48 hours, and then the clock went back to 00:00. So I know for a fact that it had at least 55 hours that I know of, and it might have done the loop 5 times, but you only grant credit for the 7.5 hours recorded after the loop reset.
So not only is the 48-hour auto-abort not working the way you want it to, the granting of credit is not working the way it should either.

And also this post, where I said:

"...I'm sorry, I cannot; I looked for it but could not find it. Your system for tracking WUs is hard for me to use. It might work OK if I had only a few nodes working on this, but I have over 50 nodes on this project, and jobs get lost with so many pages of WUs. It might help if you put in page numbers (1 to 10, 20, 30, 40, 50) instead of just NEXT PAGE."


If you want to tell me how to get the info you want, I will try to retrieve it for you. Is there a file in my BOINC folder that I can open to get this info?
I looked through hundreds of pages of returned WUs, and that was only 3 days' worth (VERY HARD to work with for a power user).
I still do not know how all the other posters are getting and posting the links they are.

So you tell me what you want me to do, or what to retrieve from my network. Just remember I am NOT in the IT field, so please be kind.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14146 · Rating: 0
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14154 - Posted: 20 Apr 2006, 4:09:01 UTC - in response to Message 14110.  

As a personal aside I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?


OK, after spending the last 1.5 hours looking, I found it.
Note the time sent and the time it was sent back: over 104 hours. That means this job was on its 3rd loop (48 + 48 + 7). Now look at the granted points: 8.25156419314919 for a job that ran for 104 hours.
Now you can see why I was saying your auto-abort and your granting of due credit are not working, and why there is a real need to find a way to purge BAD WUs from the Rosetta servers and the members' clients.
It is unfair for Rosetta to make us members pay the bills to purge the system of your bad WUs.
Result ID 17079919
Name FA_RLXpt_hom002_1ptq__361_380_3
Workunit 11796498
Created 12 Apr 2006 9:46:52 UTC
Sent 12 Apr 2006 11:29:20 UTC
Received 16 Apr 2006 19:13:28 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -197 (0xffffff3b)
Computer ID 178772
Report deadline 26 Apr 2006 11:29:20 UTC
CPU time 1716.296875
stderr out <core_client_version>5.2.13</core_client_version>
<message>aborted via GUI RPC
</message>
<stderr_txt>
# random seed: 2485491
# random seed: 2485491
# random seed: 2485491
# random seed: 2485491

</stderr_txt>


Validate state Invalid
Claimed credit 8.25156419314919
Granted credit 8.25156419314919
application version 4.98


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14154 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 14156 - Posted: 20 Apr 2006, 4:38:22 UTC

Laurenu2, assuming you know which computer had the WU, you can find it by:

click on "Participants" on any Rosetta@home web page
click on "View computers"
click on the computer ID for the computer that ran the WU
click on the number on the "Results" line
You will now have a list of results for just that computer.

That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past the deadline. Then you got it. At least now it's had too many errors and won't be sent out any more.
ID: 14156 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 14157 - Posted: 20 Apr 2006, 4:51:03 UTC - in response to Message 14154.  
Last modified: 20 Apr 2006, 4:51:33 UTC

As a personal aside I might point out that the usefulness of your posting is directly related to the information you provide concerning the problem. Could you at least provide a link to the result ID so the problem can be examined?


OK, after spending the last 1.5 hours looking, I found it.
Note the time sent and the time it was sent back: over 104 hours. That means this job was on its 3rd loop (48 + 48 + 7). Now look at the granted points: 8.25156419314919 for a job that ran for 104 hours.
Now you can see why I was saying your auto-abort and your granting of due credit are not working, and why there is a real need to find a way to purge BAD WUs from the Rosetta servers and the members' clients.
It is unfair for Rosetta to make us members pay the bills to purge the system of your bad WUs.
...

What is unfair is for people to bash the developers and the people who are trying to help you and others with the issues. The errors are frustrating for everyone, most of all the project people. While I am sure it may be difficult for you to manage certain aspects of your farm if things aren't perfect, you built the farm. The work and expense are just part of what is required if you plan to run it.

While you have made it very clear that the 30 or so credits you may have lost on this WU should be the single most important thing in the universe to the project, just try to get ANY of the credits for a failed work unit on ANY of the other projects (and yes, ALL projects have failed WUs). The Rosy people are doing what they can on the credit issue, and it is a lot more than what you get from other venues. But there are other things going on right now, so they only do it once a week.

Why don't you just post the error, with sufficient information to help them fix the problem, without all the bashing of the project people? It has already been explained that BOINC does not allow the project to remove WUs from your system, and why. The only project that does this at all is CPDN, and they are using a highly modified version of the BOINC client to do it. This project uses the standard BOINC package for everything, including the results pages. If you are having a problem with BOINC you should post the issue on the BOINC development site.

As for looking up the links to your results, have you tried bookmarking a separate page in your browser for each machine? It is very easy to do. That way you can isolate the results by machine, and monitor them individually if something comes up. That should help you find things. Also, if you reduce your connection interval and increase your time setting you will have fewer WUs to look through.
ID: 14157 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 14170 - Posted: 20 Apr 2006, 10:00:41 UTC - in response to Message 14131.  

Based on the great advice from this forum, I coded a "watchdog" thread for Rosetta@home. It will output any data and abort work units that haven't changed their score in thirty minutes -- a pretty good indicator that the job is stuck! I'll be testing this over on RALPH over the next couple of days. I'm also thinking of putting in an abort if the CPU time is more than twice the maximum time for the workunit (typically 4 hours by default these days, or whatever the client's preference is)... that's another sign that the workunit is not compatible with the client. Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.



Thanks! :-)

A function to kill a WU that's stuck because the time between project switches is too short for it to reach a new checkpoint (e.g. 1 hour), so the WU starts over again and again, would also be nice. This could help people avoid wasting endless time on one WU for nothing, since keeping WUs in memory while preempted is no longer necessary.
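
To see why such a WU never finishes, here is a tiny numeric illustration (the numbers are assumptions, not measurements from any real task):

    // Assumed numbers: checkpoints come every 90 minutes of CPU time, but BOINC
    // switches projects every 60 minutes and the task is not left in memory, so
    // every run resumes from the same (empty) checkpoint and progress stays at zero.
    #include <iostream>

    int main() {
        const double checkpoint_interval_min = 90.0;  // CPU time needed to reach a checkpoint
        const double switch_interval_min     = 60.0;  // BOINC project-switch period
        double saved_progress_min = 0.0;              // progress held in the checkpoint file

        for (int run = 1; run <= 5; ++run) {
            double run_progress = saved_progress_min;     // resume from last checkpoint
            run_progress += switch_interval_min;          // work until preempted
            if (run_progress >= checkpoint_interval_min)
                saved_progress_min = run_progress;        // checkpoint reached: keep it
            // otherwise the work since the last checkpoint is lost on preemption
            std::cout << "after run " << run << ": saved progress = "
                      << saved_progress_min << " min\n";  // stays at 0 every time
        }
    }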


[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

ID: 14170 · Rating: 0
Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 14171 - Posted: 20 Apr 2006, 10:16:26 UTC

Some errors:

https://boinc.bakerlab.org/rosetta/result.php?resultid=16591648
https://boinc.bakerlab.org/rosetta/result.php?resultid=16619607

Member of Dutch Power Cows
ID: 14171 · Rating: 0
Laurenu2

Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 14172 - Posted: 20 Apr 2006, 11:00:55 UTC

Snake Doctor,
I am sorry if I am a little slower than most people here like you. I have my problems, which I will not go into right now.
But if you think that me trying to explain a bug that I found (an endless loop) is bashing, and then post the long letter below, I think you are the one doing the bashing at me. WHY?
It may be true that I am a little frustrated, and perhaps it may show through, but I was not bashing. I worked hard for hours trying to find that WU and learn how to find the info they wanted, and I did post it. Is that bashing? NO.


AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work from servers and clients. To think this will not happen again is unwise.

30 points, hahaha. I make over 17,000 points a day; do you really think I am concerned with a mere 30 points? You need to get a reality check.

I am quite sure I have had around 1,200 hours of CPU time stuck on these; all the wasted time is what concerns me.


If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 14172 · Rating: 0
Cureseekers~Kristof
Joined: 5 Nov 05
Posts: 80
Credit: 689,603
RAC: 0
Message 14184 - Posted: 20 Apr 2006, 17:14:20 UTC
Last modified: 20 Apr 2006, 17:15:29 UTC

After more than 30 hours of runtime, stuck for hours at the same percentage, I aborted the job:
https://boinc.bakerlab.org/rosetta/result.php?resultid=17454155

260 credits lost...
Member of Dutch Power Cows
ID: 14184 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14186 - Posted: 20 Apr 2006, 17:41:56 UTC - in response to Message 14131.  

Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.


What about when the deadline is approaching? Sometimes people crank their preference straight from 4 hrs to 24 hrs, and all of a sudden 10 WUs cannot be completed before the deadline. So, if the deadline is "near" (how near?), just finish the current model and end the WU so it can be reported in time. I don't know if you would call that an "abort"; it's more of a normal end, in advance of the target runtime.
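
A rough sketch of that check, run after each completed model (an illustration under assumed names and a one-day margin, not project code):

    // Sketch of "end early when the deadline is close". The one-day margin and
    // the names report_deadline / deadline_is_near() are assumptions.
    #include <ctime>
    #include <iostream>

    bool deadline_is_near(std::time_t report_deadline, double margin_seconds = 24 * 3600) {
        return std::difftime(report_deadline, std::time(nullptr)) < margin_seconds;
    }

    int main() {
        std::time_t report_deadline = std::time(nullptr) + 12 * 3600;  // pretend: 12 h away
        for (int model = 1; model <= 100; ++model) {
            // ... compute one model here ...
            if (deadline_is_near(report_deadline)) {
                std::cout << "deadline near after model " << model
                          << ": finishing normally so the result can be reported in time\n";
                break;  // a normal end, ahead of the target runtime
            }
        }
    }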

Great progress!

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 14186 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 14188 - Posted: 20 Apr 2006, 18:03:56 UTC - in response to Message 14172.  
Last modified: 20 Apr 2006, 18:13:58 UTC

AGAIN, my point here is that the Rosetta system needs to have in place a method to remove bad work from servers and clients. To think this will not happen again is unwise.


Lauren,

I just read, in the other threads, posts by Rhiju, Bin Qian and David Kim saying that:


1. We've tracked down the bugs which were causing some jobs to get stuck at 1.04% and have been testing the fixes on RALPH since yesterday.

2. We've coded up Rosetta to do more frequent checkpointing in the modeling process. Now, for the large jobs, we expect less than 30 minutes between two checkpoints. This code has been tested locally and will be tested on RALPH within a couple of days.

3. Rhiju has coded a watchdog thread for Rosetta which will terminate the stuck jobs and return the intermediate results; see his post in this thread. This will be tested on RALPH within a couple of days too.

4. One important change I forgot to mention: on top of the more frequent checkpointing, David Kim has added a limit of 5 restarts for each WU. So if a WU restarts 5 times without being left in memory, Rosetta will stop the WU and output the result at that point. This will help machines that frequently reboot or swap projects without leaving tasks in memory.
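
A minimal sketch of how the restart limit in point 4 might work (an assumption for illustration, not the actual implementation; the counter file name and the limit handling are invented): keep a counter in a small file that survives restarts, and once it reaches 5, write out whatever has been computed instead of starting over.

    // Illustrative sketch of a persistent restart counter with a limit of 5.
    // "restart_count.txt" and the surrounding logic are assumptions.
    #include <fstream>
    #include <iostream>

    int read_restart_count(const char* path) {
        std::ifstream in(path);
        int n = 0;
        in >> n;          // stays 0 if the file does not exist yet
        return n;
    }

    int main() {
        const char* path = "restart_count.txt";
        const int kMaxRestarts = 5;

        int restarts = read_restart_count(path);
        if (restarts >= kMaxRestarts) {
            std::cout << "restarted " << restarts
                      << " times: writing out whatever has been computed and finishing\n";
            return 0;  // report the partial result instead of looping forever
        }
        std::ofstream(path) << restarts + 1;  // remember this start for next time

        // ... normal work; a clean finish could delete or reset the counter ...
    }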


If I understand the new features quoted above correctly, #3 and #4 should ensure that no WU stays stuck forever waiting for a human operator to abort it via BOINC.

Hopefully, the BOINC server code (developed by Berkeley Univ as open source) will eventually implement the other features, most notably the capability / preference flags (e.g. >512 MB memory, BigWU, etc.), so that big WUs are sent only to PCs which are capable of and willing to process them.

PS: So, in light of recent changes, I'd say that a method to cancel bad WUs from volunteer PCs is much less of an issue, as bad WUs will take care of themselves.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 14188 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14190 - Posted: 20 Apr 2006, 18:29:47 UTC - in response to Message 14156.  

That WU was a nasty one. The first person who got it wasted 57 hours of CPU on it before noticing and aborting it, then the second person wasted another 14 hours. The third person let it sit in the queue until it went past the deadline. Then you got it. At least now it's had too many errors and won't be sent out any more.


Perhaps it is a good idea to turn off the mechanism of resending failed WUs for the moment (i.e., by setting the number of failures until cancellation of a WU to 1). With such a setting, bad WUs would only bother one participant.
ID: 14190 · Rating: 0