CAPRI14?

Author	Message
rsubler Joined: 24 Jun 07 Posts: 8 Credit: 172,618 RAC: 0	Message 46076 - Posted: 12 Sep 2007, 17:31:08 UTC
I have run four results this week. Two completed successfully; two were killed by the Rosetta application after four and nine hours, with 20 credits granted. Both of the failed results were labelled CAPRI14. I also see from the troubles with 5.78 thread that others are having the same problem.
1. What's wrong?
2. When will this be fixed?
For now, I am aborting any CAPRI results and will continue to do so until this is clarified.
Thanks, Ron

Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0	Message 46084 - Posted: 12 Sep 2007, 19:20:11 UTC
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?

rsubler Joined: 24 Jun 07 Posts: 8 Credit: 172,618 RAC: 0	Message 46085 - Posted: 12 Sep 2007, 19:29:51 UTC Last modified: 12 Sep 2007, 19:30:34 UTC
Sure,
W/U ID 94798422 - 16,550 seconds
W/U ID 94963714 - 34,122 seconds
W/U ID 94968990 - aborted.
These were on an AMD X2 3800+, which normally yields 10 credits per core per hour on Rosetta, 13 on Einstein. To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.
Happy hunting, Ron
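For scale, the figures in this post can be turned into a rough estimate of what the two killed runs would have earned at the poster's usual rate. This is only a back-of-the-envelope sketch: it assumes credit scales linearly with CPU time at 10 credits per core per hour, and that the 20 credits mentioned earlier were granted per killed result. The function name is made up for illustration and is not part of BOINC or Rosetta.

```python
# Figures quoted in the posts above.
CREDITS_PER_CORE_HOUR = 10  # poster's usual Rosetta rate on the AMD X2 3800+
GRANTED_ON_FAILURE = 20     # assumption: 20 credits granted per killed result


def expected_credits(cpu_seconds: float, rate_per_hour: float = CREDITS_PER_CORE_HOUR) -> float:
    """Credits a run would earn at the usual hourly rate (assumes linear scaling)."""
    return cpu_seconds / 3600 * rate_per_hour


if __name__ == "__main__":
    for wu_id, seconds in [(94798422, 16_550), (94963714, 34_122)]:
        estimate = expected_credits(seconds)
        print(f"W/U {wu_id}: about {estimate:.0f} credits expected, "
              f"{GRANTED_ON_FAILURE} granted")
```

Under those assumptions the two runs come out at roughly 46 and 95 credits, against 20 granted each.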

Resnick_MEDIC_Lab Joined: 5 Jun 06 Posts: 2751 Credit: 4,276,053 RAC: 0	Message 46096 - Posted: 12 Sep 2007, 22:25:01 UTC Last modified: 12 Sep 2007, 22:28:53 UTC
Some more may be found here, from greg_be and myself: watchdog ending runs. My preset runtime is 3 hours (10800 seconds). Capri14 ran for 14594 seconds before the watchdog ended it.
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 46103 - Posted: 13 Sep 2007, 0:15:56 UTC - in response to Message 46085.
rsubler wrote: To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.
In science, even apparent failures are progress. You now know something more about the subject than you did before you failed. Everyone is working to avoid failing tasks, indeed. But it does happen on occasion as things are modified over time. Yet without making modifications, it cannot improve. So, unfortunately, it's all part of the process. As always, thanks to all who persevere and work through it.
Rosetta Moderator: Mod.Sense

Jim Joined: 15 Oct 06 Posts: 22 Credit: 5,410,546 RAC: 0	Message 46107 - Posted: 13 Sep 2007, 2:14:40 UTC - in response to Message 46084.
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
WU: 95285894
WU: 94876951
WU: 94840468
All 3 are CAPRI14 units and were ended by the watchdog timer. My runtime is set for 4 hours (14400 seconds).
cheers . . . jhf

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 46123 - Posted: 13 Sep 2007, 14:19:51 UTC - in response to Message 46084. Last modified: 13 Sep 2007, 14:25:50 UTC
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
Here are a couple of links:
WU 95336686
WU 95336073
Rosie, Rosie, she's our gal, If she can't do it, no one shall!

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 46158 - Posted: 13 Sep 2007, 23:05:01 UTC - in response to Message 46084.
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
Here is another Capri workunit killed by the watchdog:
WU 95336625
Rosie, Rosie, she's our gal, If she can't do it, no one shall!

Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0	Message 46161 - Posted: 14 Sep 2007, 0:43:29 UTC Last modified: 14 Sep 2007, 0:51:36 UTC
WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into.
Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After it had been going for an hour I restarted BOINC, but it simply got stuck again, so I aborted it after 20 mins. Now someone else has it; we will see for how long. I'm watching the rest closely to abort any suspicious units.
EDIT: It seems the 1he8 and the 1g4u ones have the most problems; maybe remove these from the system until this is resolved.
BBLounge - Broadband and Technology forum

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 46163 - Posted: 14 Sep 2007, 1:14:19 UTC - in response to Message 46161. Last modified: 14 Sep 2007, 1:19:25 UTC
Prom wrote: WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into.
I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement. Most WU types will report the completed models and the failure, show "success", and issue credit accordingly. But I'm not certain if these are proving the exception to the rule.
Prom wrote: Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After going for an hour I restarted boinc but it simply got stuck again so I aborted it after 20 mins. Now someone else has it, will see for how long.
If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come and the watchdog knows that. More than an hour for a single step... ya, that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the stuck after 900 seconds condition. Do you know? Was it still using CPU after that hour? There have been issues where BOINC shows a status of "running", but no CPU is being used.
Rosetta Moderator: Mod.Sense
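The two conditions described in this post can be sketched as simple checks. This is a minimal illustration, not the Rosetta watchdog itself: only the 900-second figure comes from the thread, the function names and inputs are hypothetical, and the real watchdog is said to be more tolerant when progress is legitimately slow, which is not modelled here.

```python
STALL_LIMIT_SECONDS = 900  # "stuck after 900 seconds", i.e. 15 minutes with no progress


def watchdog_should_end(seconds_since_progress: float,
                        limit: float = STALL_LIMIT_SECONDS) -> bool:
    """True when a task has shown no progress for at least the stall limit."""
    return seconds_since_progress >= limit


def running_but_idle(reported_status: str, cpu_seconds_since_check: float) -> bool:
    """True when the client reports 'running' but no CPU time has been used."""
    return reported_status == "running" and cpu_seconds_since_check == 0


if __name__ == "__main__":
    # A task idle for 15 minutes trips the stall check.
    print(watchdog_should_end(15 * 60))
    # A task shown as running with no CPU use since the last check.
    print(running_but_idle("running", 0.0))
```

Checking CPU use alongside the step counter is what separates a slow step (still using CPU) from a task that is shown as running but doing nothing.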

j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0	Message 46164 - Posted: 14 Sep 2007, 1:27:07 UTC - in response to Message 46103.
rsubler wrote: To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.
Mod.Sense wrote: In science, even apparent failures are progress. You now know something more about the subject than you did before you failed. Everyone is working to avoid failing tasks, indeed. But it does happen on occasion as things are modified over time. Yet without making modifications, it cannot improve. So, unfortunately, it's all part of the process. As always, thanks to all who persevere and work through it.
I thought the purpose of Ralph was to "test" these mods.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 46167 - Posted: 14 Sep 2007, 3:43:24 UTC - in response to Message 46164.
j2satx wrote: I thought the purpose of Ralph was to "test" these mods.
No testing is perfect. ...just ask your favorite software vendor.
Rosetta Moderator: Mod.Sense

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 46185 - Posted: 14 Sep 2007, 11:48:02 UTC - in response to Message 46084. Last modified: 14 Sep 2007, 11:50:37 UTC
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
Here are two more Capri/watchdog casualties:
WU 95336658
WU 95336696
Rosie, Rosie, she's our gal, If she can't do it, no one shall!

Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0	Message 46192 - Posted: 14 Sep 2007, 12:55:50 UTC - in response to Message 46163.
Mod.Sense wrote: I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement. Most WU types will report the completed models and the failure, show "success", and issue credit accordingly. But I'm not certain if these are proving the exception to the rule.
I didn't see it while it was stuck, but when I viewed it a couple of times before that, it was putting out results about every 5~10 minutes, so there should have been at least 40 decoys by that time. The problem is that the system seems to give 20 credits for effort because of the error rather than giving credits for the reported results. Luckily mine only claimed about 40, but others have claimed anything from 4 to a couple hundred and they all got 20. Whether it's set up like this or is a bug only surfacing now, I think the developers should look at it.
Mod.Sense wrote: If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come and the watchdog knows that. More than an hour for a single step... ya, that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the stuck after 900 seconds condition. Do you know? Was it still using CPU after that hour? There have been issues where BOINC shows a status of "running", but no CPU is being used.
It went to 500 steps in a few seconds and then just seemed to stay there indefinitely at the refinement stage. So I restarted the app after an hour, and it seemed to stay at those same values before I ended it at 20 mins. Unfortunately I didn't look at the CPU usage but will do so if it happens again. Strange that the watchdog didn't end it. Maybe the progress was so small that it didn't show up on screen, but I don't know how likely that is. Hmm... seems somebody else completed it... maybe starting with a different random number?
I also have to wonder why this didn't show up at Ralph, as there seem to be so many issues with so many workunits. The problem is that these all went out shortly after the blackout, so I suspect there wasn't any testing done on them.
BBLounge - Broadband and Technology forum

j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0	Message 46233 - Posted: 14 Sep 2007, 23:07:02 UTC - in response to Message 46167.
j2satx wrote: I thought the purpose of Ralph was to "test" these mods.
Mod.Sense wrote: No testing is perfect. ...just ask your favorite software vendor.
How many test WUs are run on Ralph before being released to Rosetta?

Resnick_MEDIC_Lab Joined: 5 Jun 06 Posts: 2751 Credit: 4,276,053 RAC: 0	Message 46235 - Posted: 14 Sep 2007, 23:25:10 UTC
Not certain which threads are being monitored, so just a quick message that I posted this.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 46240 - Posted: 15 Sep 2007, 1:07:23 UTC
...as a general rule, threads opened by the Project Team are monitored by the Project Team. So, posting to the "Problems with..." thread, as you did, was probably the best place to mention it.
Rosetta Moderator: Mod.Sense

Jim Joined: 15 Oct 06 Posts: 22 Credit: 5,410,546 RAC: 0	Message 46250 - Posted: 15 Sep 2007, 4:48:46 UTC - in response to Message 46084.
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
I have now done several CAPRI14 WUs with 5.80. Only one has a problem:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=95775480
with this result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=105522388
I don't have a clue as to why it failed to validate. It seems to have run the full length of time I have set for processing.
Thanks, Jim

Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0	Message 46298 - Posted: 15 Sep 2007, 19:01:53 UTC - in response to Message 46250.
Rhiju wrote: This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
Jim wrote: I have now done several CAPRI14 WUs with 5.80. Only one has a problem: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=95775480 with this result: https://boinc.bakerlab.org/rosetta/result.php?resultid=105522388 I don't have a clue as to why it failed to validate. It seems to have run the full length of time I have set for processing. Thanks, Jim
Ok, one of mine has now failed to validate as well. Rosetta owes me 132.92 for 11h57m47.5s of work and 177 decoys.
BBLounge - Broadband and Technology forum

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 46324 - Posted: 16 Sep 2007, 4:06:55 UTC
I am quite curious as to the relative success rate of the Capri14 WUs versus WUs of all other types. My personal experience is that a Capri14 WU is nearly doomed to failure, no matter the Rosetta version on which it is running, while nearly all other WUs will succeed... I sincerely hope that the CPU time that (to my layman's perspective) I seem to be wasting on Capri14 is not representative of the experience of the general population of Rosetta crunchers.
Respectfully, David Emigh
Rosie, Rosie, she's our gal, If she can't do it, no one shall!