Report stuck & aborted WU here please

Kurre (Joined: 12 Apr 06 · Posts: 9 · Credit: 69,240 · RAC: 0)
Message 13968 - Posted: 17 Apr 2006, 19:33:06 UTC

Result IDs 17407773 and 17407809 aborted.

Kurre (Joined: 12 Apr 06 · Posts: 9 · Credit: 69,240 · RAC: 0)
Message 13971 - Posted: 17 Apr 2006, 19:44:18 UTC

Result ID 17188258: the server was down and the transfer of the result failed. It seems there is some work to do on the client to handle recovery from this kind of situation.

Laurenu2 (Joined: 6 Nov 05 · Posts: 57 · Credit: 3,818,778 · RAC: 0)
Message 13982 - Posted: 17 Apr 2006, 22:24:35 UTC - in response to Message 13964.

[quote]Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why.[/quote]

I'm sorry, I cannot. I looked for it but could not find it. Your system for tracking WUs is hard for me to use. It might work OK if I had only a few nodes working this, but I have over 50 nodes on this project, and jobs get lost among so many pages of WUs. It might help if you put in page numbers (1 to 10, 20, 30, 40, 50) instead of just NEXT PAGE.

But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WUs in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client side. Doing this would flush all the WUs on the client. I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just get a fraction of the points for the total hours spent.

If You Want The Best You Must forget The Rest ---------------And Join Free-DC----------------

Dimitris Hatzopoulos (Joined: 5 Jan 06 · Posts: 336 · Credit: 80,939 · RAC: 0)
Message 13985 - Posted: 17 Apr 2006, 22:52:29 UTC - in response to Message 13982.

[quote]But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WUs in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client side. Doing this would flush all the WUs on the client. I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just get a fraction of the points for the total hours spent.[/quote]

Lauren, I think what you say makes a lot of sense. The truth is that people like you, with 20+ nodes, bring a lot of TeraFLOPS to any project at very little "expense" (tech support).

I don't know how well the Time-To-Live timer works (after which a WU self-destructs). This needs to (ideally) be handled either with a "watchdog" thread for the Rosetta executable, or otherwise by the BOINC client itself.

Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).

Another badly needed feature would be the ability of BOINC's scheduler/feeder to allocate WUs to eligible PCs based on capability and/or preference "flags" (e.g. >512MB, BigWU / BigMem flag, 24/7 operation, leave-in-mem = yes, etc.).

Finally, an optional but still useful feature would be to 7zip (or bzip2) the files, which would halve the overall bandwidth per WU.

Unfortunately, a lot of the features mentioned above are almost "unique" to Rosetta. One would hope that in this whole wide world, some code wizard would roll up his sleeves and code these things (at least in BOINC's open source) in a few afternoons for fun, like the Hungarian akosf did with Einstein's executables, but maybe we're asking too much.

PS: I have to admit that I'm disappointed wrt the big WUs sent out indiscriminately lately, despite the results being easy to anticipate... :-(

Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity
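A rough sketch of the capability-based allocation suggested in this post, written in Python. Everything in it is hypothetical: the `Host` and `WorkUnit` classes, their field names, and the 512 MB figure (borrowed from the post's example) only illustrate the idea and are not BOINC's scheduler API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical description of a volunteer PC and its preferences."""
    mem_mb: int
    accepts_big_wu: bool = False  # the "BigWU / BigMem" preference flag


@dataclass
class WorkUnit:
    """Hypothetical requirements attached to a work unit."""
    name: str
    min_mem_mb: int = 0
    big_wu: bool = False


def eligible(host: Host, wu: WorkUnit) -> bool:
    """Return True if the host meets every requirement of the work unit."""
    if host.mem_mb < wu.min_mem_mb:
        return False
    if wu.big_wu and not host.accepts_big_wu:
        return False
    return True


def pick_work(host: Host, queue: List[WorkUnit]) -> List[WorkUnit]:
    """Filter the queue down to the work units this host may receive."""
    return [wu for wu in queue if eligible(host, wu)]
```

With this kind of filter, a 256 MB machine would simply never be offered a work unit marked as needing 512 MB, instead of receiving it and failing part-way through.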

AMD_is_logical (Joined: 20 Dec 05 · Posts: 299 · Credit: 31,460,681 · RAC: 0)
Message 13988 - Posted: 18 Apr 2006, 0:01:58 UTC - in response to Message 13932.

[quote]Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hrs. It did not self-abort as you said it would ...[/quote]

The FA_* WUs are old ones from mid-March, so they don't have the timeout enabled. As far as I know, these FA_* WUs were never cancelled, so they pop up now and then and get sent out again until 4 people have rejected them.

Laurenu2 (Joined: 6 Nov 05 · Posts: 57 · Credit: 3,818,778 · RAC: 0)
Message 13997 - Posted: 18 Apr 2006, 1:14:19 UTC - in response to Message 13988.

[quote][quote]Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hrs. It did not self-abort as you said it would ...[/quote]
The FA_* WUs are old ones from mid-March, so they don't have the timeout enabled. As far as I know, these FA_* WUs were never cancelled, so they pop up now and then and get sent out again until 4 people have rejected them.[/quote]

Are you telling me this WU was not running for the 100 to 150 hrs I thought it was, but instead was running for 1,000+ hrs in an endless loop? (Grrrr)

Rhiju (Volunteer moderator · Joined: 8 Jan 06 · Posts: 223 · Credit: 3,546 · RAC: 0)
Message 13998 - Posted: 18 Apr 2006, 1:24:29 UTC - in response to Message 13985.

The "Time-to-Live" thread is a great idea, and something we can implement within Rosetta. I'll look into it. As for the other stuff, we've made similar suggestions to the BOINC programmers. They're currently concentrating on their next stable release, but might be able to help us out after that. You're right about these being BOINC issues that mainly are a hassle for this Rosetta project. More than other BOINC applications, we're trying to improve our code continually, and thus we stumble upon more glitches when we try to make these scientific improvements.

[quote][quote]But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WUs in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client side. Doing this would flush all the WUs on the client. I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just get a fraction of the points for the total hours spent.[/quote]
Lauren, I think what you say makes a lot of sense. The truth is that people like you, with 20+ nodes, bring a lot of TeraFLOPS to any project at very little "expense" (tech support).

I don't know how well the Time-To-Live timer works (after which a WU self-destructs). This needs to (ideally) be handled either with a "watchdog" thread for the Rosetta executable, or otherwise by the BOINC client itself.

Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).

Another badly needed feature would be the ability of BOINC's scheduler/feeder to allocate WUs to eligible PCs based on capability and/or preference "flags" (e.g. >512MB, BigWU / BigMem flag, 24/7 operation, leave-in-mem = yes, etc.).

Finally, an optional but still useful feature would be to 7zip (or bzip2) the files, which would halve the overall bandwidth per WU.

Unfortunately, a lot of the features mentioned above are almost "unique" to Rosetta. One would hope that in this whole wide world, some code wizard would roll up his sleeves and code these things (at least in BOINC's open source) in a few afternoons for fun, like the Hungarian akosf did with Einstein's executables, but maybe we're asking too much.

PS: I have to admit that I'm disappointed wrt the big WUs sent out indiscriminately lately, despite the results being easy to anticipate... :-([/quote]

AMD_is_logical (Joined: 20 Dec 05 · Posts: 299 · Credit: 31,460,681 · RAC: 0)
Message 14003 - Posted: 18 Apr 2006, 2:59:56 UTC - in response to Message 13997.

[quote]Are you telling me this WU was not running for the 100 to 150 hrs I thought it was, but instead was running for 1,000+ hrs in an endless loop? (Grrrr)[/quote]

It probably sat in someone else's queue for a few weeks until it timed out. Then it may have been sent to someone else who then aborted it. Then it was sent to you and wasted 100 to 150 hrs of your CPU time. If it had been cancelled it would not have been sent out again.
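The resend behaviour described in this post can be sketched roughly as follows. This is only an illustration in Python: the class, the field names and the `should_resend` helper are hypothetical, the limit of 4 is taken from the earlier remark that a WU is re-sent "until 4 people have rejected them", and counting a timeout the same as an abort is an assumption.

```python
from dataclasses import dataclass

MAX_REJECTIONS = 4  # "until 4 people have rejected them" (from the thread)


@dataclass
class WorkUnitState:
    """Hypothetical server-side bookkeeping for one work unit."""
    name: str
    rejections: int = 0      # assumption: timeouts and aborts both count
    cancelled: bool = False  # set by the project once a WU is known to be bad


def should_resend(wu: WorkUnitState) -> bool:
    """Re-issue a failed work unit unless it is cancelled or over the limit."""
    return not wu.cancelled and wu.rejections < MAX_REJECTIONS


wu = WorkUnitState("FA_RLXpt_hom002_1ptq__361_380")
wu.rejections += 1   # sat in a queue until it timed out
wu.rejections += 1   # next host aborted it
still_circulating = should_resend(wu)   # third host receives it

wu.cancelled = True  # what cancelling on the server would have done
after_cancel = should_resend(wu)
```

In this sketch the uncancelled work unit keeps circulating after two failures, while setting the cancelled flag stops it immediately, which is the difference the post points out.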

TCU Computer Science (Joined: 7 Dec 05 · Posts: 28 · Credit: 12,861,977 · RAC: 0)
Message 14012 - Posted: 18 Apr 2006, 5:04:00 UTC - in response to Message 13964.

[quote]Let me know what the result number is -- if it's not flagged to get credit, I can see why.[/quote]

Here is one that ran for 88 hours. I aborted it on 28 March: https://boinc.bakerlab.org/rosetta/result.php?resultid=14764800

Interboy (Joined: 28 Sep 05 · Posts: 3 · Credit: 743,819 · RAC: 0)
Message 14019 - Posted: 18 Apr 2006, 8:06:48 UTC · Last modified: 18 Apr 2006, 8:07:37 UTC

Here is another one: https://boinc.bakerlab.org/rosetta/result.php?resultid=17156765

Branislav (Joined: 23 Mar 06 · Posts: 6 · Credit: 450,417 · RAC: 0)
Message 14079 - Posted: 18 Apr 2006, 21:29:03 UTC - in response to Message 13331.

Work unit aborted at 1.04% - CPU time used ~1 hour 30 minutes.
WU Name "VP_PRODUCTION_1qgtA_442_6769"
Application "Rosetta"
Workunit = 14313267; Result ID = 17432540
System = GenuineIntel Intel(R) Pentium(R) III Mobile CPU 866MHz, Microsoft Windows Millennium 04.90.3000.00
The workunit was aborted manually.

Jose (Joined: 28 Mar 06 · Posts: 820 · Credit: 48,297 · RAC: 0)
Message 14091 - Posted: 18 Apr 2006, 22:42:58 UTC

Argh... this terminated on me prematurely after a lot of computing time was spent.

Result ID 17504849
Name PROD_ABINITIO_9FULLSTRANDBAR_1tul__447_7718_0
Workunit 14375734
Created 18 Apr 2006 2:00:18 UTC
Sent 18 Apr 2006 9:59:32 UTC
Received 18 Apr 2006 21:53:49 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741819 (0xc0000005)
Computer ID 198415
Report deadline 2 May 2006 9:59:32 UTC
CPU time 32858.34375
stderr out
<core_client_version>5.2.13</core_client_version>
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 57600
# random seed: 2632343
# cpu_run_time_pref: 57600
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:9 cpu_run_time:27054.7
# cpu_run_time_pref: 57600
# cpu_run_time_pref: 57600
# cpu_run_time_pref: 57600
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:9 cpu_run_time:32026.2
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:10 cpu_run_time:32601.3
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:11 cpu_run_time:32856.4
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:12 cpu_run_time:32857.3
# Exception caught in nstruct loop ii=1 i=9
# num_decoys:8 attempts:13 cpu_run_time:32858.3
# Max exceptions (5) reached!
# Terminating run prematurely.
*UNHANDLED EXCEPTION**
Reason: Access Violation (0xc0000005) at address 0x004583A7 read attempt to address 0x00013892
</stderr_txt>

What happened?

“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.” Plato
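A small aside on the exit status in this report: -1073741819 and 0xc0000005 are the same 32-bit value, printed once as a signed integer and once in hexadecimal. On Windows, 0xC0000005 is the access-violation status code, which matches the "Access Violation" line at the end of the log. The Python sketch below shows the conversion; the two helper names are made up for illustration.

```python
def to_unsigned32(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


def to_signed32(code: int) -> int:
    """Reinterpret an unsigned 32-bit status code as a signed value."""
    code &= 0xFFFFFFFF
    return code - 0x100000000 if code & 0x80000000 else code


exit_status = -1073741819                 # as shown in the result page
as_hex = hex(to_unsigned32(exit_status))  # same value, hexadecimal form
```

The same conversion works in reverse, so a hexadecimal code seen in a log can be matched against the negative number shown on a result page.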

Laurenu2 (Joined: 6 Nov 05 · Posts: 57 · Credit: 3,818,778 · RAC: 0)
Message 14108 - Posted: 19 Apr 2006, 13:56:58 UTC - in response to Message 13799.

[quote][quote]But what about the removal of bad WUs from your servers? You must set up a way to stop the resending of the BAD WUs. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WUs on the client side. To let bad WUs run on your system or ours is a BAD THING.[/quote]
The bad WUs are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WUs at all so this won't be a problem anymore.[/quote]

Well David, it seems you cannot or did not REMOVE the bad WUs. I and others are still getting them. I just found this one, TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_126_0, that WASTED another 28 hrs. This is not good. I am nearing the end of my patience with these BAD jobs and the THOUSANDS+ of hours of wasted work time that you will not give points for. David, I AM VERY UPSET ABOUT THIS.

Mike Gelvin (Joined: 7 Oct 05 · Posts: 65 · Credit: 10,612,039 · RAC: 0)
Message 14109 - Posted: 19 Apr 2006, 14:10:19 UTC · Last modified: 19 Apr 2006, 14:10:35 UTC

Aborted at 1.04%: https://boinc.bakerlab.org/rosetta/result.php?resultid=17045924

Jose (Joined: 28 Mar 06 · Posts: 820 · Credit: 48,297 · RAC: 0)
Message 14113 - Posted: 19 Apr 2006, 15:11:57 UTC - in response to Message 14112. · Last modified: 19 Apr 2006, 15:14:55 UTC

[quote]It looks like maybe "Rhiju's" error trap worked and terminated the Work Unit. If so it should claim some credit.[/quote]

Well, my account states that for that work unit the following credits were claimed but not granted:

Result ID 17504849
Workunit 14375734
Sent 18 Apr 2006 9:59:32 UTC
Received 18 Apr 2006 21:53:49 UTC
Server state Over
Outcome Client error
Client state Computing
CPU time 32,858.34
Claimed credit 101.95
Granted credit ---

See, the issue for me goes past the credit stuff [although I would be dishonest if I didn't admit I want all the credits possible added to my team totals, as we are facing a vicious stampede by some very annoying cows (LOL LOL LOL... yes, I have a sense of humor)]: it is seeing all that precious computing time not generating useful work that is worrying me.

BTW my life partner is considering suing Rosetta@home for loss of consortium... Partner claims I am addicted to the screen saver and that I am becoming slightly nuttier than when we met. :P

dag (Joined: 16 Dec 05 · Posts: 106 · Credit: 1,000,020 · RAC: 0)
Message 14114 - Posted: 19 Apr 2006, 15:23:36 UTC · Last modified: 19 Apr 2006, 15:25:12 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11846000
FA_RLXpt_hom006_1ptq__361_479
1.04%, 17+ hours... takes a lickin' and keeps on tickin'!

This will be a good test to see if credit will eventually be awarded, as was stated elsewhere in this thread.

dag

dag --Finding aliens is cool, but understanding the structure of proteins is useful.

CremionisD (Joined: 10 Mar 06 · Posts: 9 · Credit: 37,604,006 · RAC: 0)
Message 14117 - Posted: 19 Apr 2006, 17:58:26 UTC

Workunit aborted manually.
"Truncate_termini_fullrelax_1b3a_433_628_0" - Model 1, step 241723, at 1.04%
CPU time ~24:30:00
Result ID = 17022725 (Workunit = 13954214)

tralala (Joined: 8 Apr 06 · Posts: 376 · Credit: 581,806 · RAC: 0)
Message 14125 - Posted: 19 Apr 2006, 18:58:57 UTC - in response to Message 13985.

[quote]Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).[/quote]

Such a feature exists and was recently employed by cpdn.org to reset the faulty models they sent out. It's called "reverse trickle" or "killer trickle". But it still needs a contact from the client, so that the server can respond with a "killer trickle". However, any contact should do.
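The point that a "killer trickle" can only be delivered when the client contacts the server can be sketched like this. The Python below is a hypothetical illustration of the idea, not the BOINC trickle API: `abort_list`, the set of cancelled IDs and the sample result IDs are all invented.

```python
from typing import Iterable, List, Set


def abort_list(in_progress: Iterable[int], cancelled: Set[int]) -> List[int]:
    """Result IDs the client should abort, given what it reports as running.

    The server cannot push this list; it can only return it as part of the
    reply when the client makes contact for any reason.
    """
    return sorted(rid for rid in set(in_progress) if rid in cancelled)


# Made-up IDs: the project has cancelled results 101 and 205.
cancelled_results = {101, 205}

# A client contacts the server and reports three results in progress.
reply = abort_list([101, 150, 300], cancelled_results)
```

Here the reply tells the client to abort result 101 only. A host that never contacts the server never receives the instruction, which is the limitation noted in the post.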

Rhiju (Volunteer moderator · Joined: 8 Jan 06 · Posts: 223 · Credit: 3,546 · RAC: 0)
Message 14131 - Posted: 19 Apr 2006, 19:59:25 UTC - in response to Message 14125.

Based on the great advice from this forum, I coded a "watchdog" thread for Rosetta@home. It will output any data and abort work units that haven't changed their score in thirty minutes -- a pretty good indicator that the job is stuck! I'll be testing this over on RALPH over the next couple of days.

I'm also thinking of putting in an abort if the CPU time is more than twice the maximum time for the workunit (typically 4 hours by default these days, or whatever the client's preference)... that's another sign that the workunit is not compatible with the client.

Are there any other conditions under which you think we should abort? Looking forward to some more advice. I think we're on the way to finally bringing an end to these stuck jobs.

[quote][quote]Also, the BOINC server should provide a mechanism for projects to cancel jobs after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).[/quote]
Such a feature exists and was recently employed by cpdn.org to reset the faulty models they sent out. It's called "reverse trickle" or "killer trickle". But it still needs a contact from the client, so that the server can respond with a "killer trickle". However, any contact should do.[/quote]
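The two abort conditions described in this post could look roughly like the Python sketch below. It is not the actual Rosetta watchdog: the `Watchdog` class and its `check` method are hypothetical, the thirty-minute and "twice the maximum" limits come from the post, the name `cpu_run_time_pref` is borrowed from the stderr output quoted earlier in the thread, and treating 4 hours as the default preferred run time is an assumption. The threading itself is left out; a real watchdog thread would call `check` periodically.

```python
from dataclasses import dataclass
from typing import Optional

STALL_LIMIT_S = 30 * 60  # thirty minutes without a score change (from the post)
CPU_FACTOR = 2.0         # abort past twice the preferred run time (from the post)


@dataclass
class Watchdog:
    cpu_run_time_pref: float            # preferred CPU seconds for the work unit
    last_score: Optional[float] = None  # most recent score seen
    last_change: float = 0.0            # time at which the score last changed

    def check(self, now: float, score: float, cpu_time: float) -> Optional[str]:
        """Return an abort reason, or None if the job still looks healthy."""
        if self.last_score is None or score != self.last_score:
            self.last_score = score
            self.last_change = now
        if now - self.last_change >= STALL_LIMIT_S:
            return "score unchanged for 30 minutes"
        if cpu_time > CPU_FACTOR * self.cpu_run_time_pref:
            return "CPU time exceeded twice the preferred run time"
        return None


wd = Watchdog(cpu_run_time_pref=4 * 3600)  # assumed 4-hour default
first = wd.check(now=0, score=-10.0, cpu_time=0)
stuck = wd.check(now=1800, score=-10.0, cpu_time=1800)
```

In this run the first check passes and the second reports a stall, because the score has not moved for thirty minutes. Keeping the decision in a plain method like this makes the abort rules easy to test separately from whatever thread or timer drives them.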

Jose (Joined: 28 Mar 06 · Posts: 820 · Credit: 48,297 · RAC: 0)
Message 14134 - Posted: 19 Apr 2006, 20:22:58 UTC - in response to Message 14123. · Last modified: 19 Apr 2006, 20:23:49 UTC

My apologies if I sounded like I was bitching. I'd better take a break from the screen... but drat... those amino acid chains dancing all over the screen are so addictive :)

Peace, and ty for all your effort to make this project an efficient one.

Report stuck & aborted WU here please - II