Message boards : Number crunching : CAPRI14?
Author | Message |
---|---|
rsubler Joined: 24 Jun 07 Posts: 8 Credit: 172,618 RAC: 0 |
I have run four results this week. Two have completed successfully; two have been killed by the Rosetta application after four and nine hours, with 20 credits granted. Both of the failed results were labelled CAPRI14. I also see from the "troubles with 5.78" thread that others are having the same problem. 1. What's wrong? 2. When will this be fixed? For now, I am aborting any CAPRI results and will continue to do so until this is clarified. Thanks, Ron |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? |
rsubler Joined: 24 Jun 07 Posts: 8 Credit: 172,618 RAC: 0 |
Sure: W/U ID 94798422 - 16,550 seconds; W/U ID 94963714 - 34,122 seconds; W/U ID 94968990 - aborted. These were on an AMD X2 3800+, which normally yields 10 credits per core per hour on Rosetta, 13 on Einstein. To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output. Happy hunting, Ron |
The_Bad_Penguin Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
Some more may be found here, from greg_be and myself: watchdog ending runs. My preset runtime is 3 hours (10800 seconds). Capri14 ran for 14594 seconds before the watchdog ended it. This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
In science, even apparent failures are progress. You now know something more about the subject than you did before you failed. Everyone is working to avoid failing tasks, indeed. But it does happen on occasion as things are modified over time. Yet without making modifications, it cannot improve. So, unfortunately, it's all part of the process. As always, thanks to all that persevere and work through it. Rosetta Moderator: Mod.Sense |
Jim Joined: 15 Oct 06 Posts: 22 Credit: 5,410,546 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? WU: 95285894, WU: 94876951, WU: 94840468. All 3 are CAPRI14 units and were ended by the watchdog timer. My runtime is set for 4 hours (14400 seconds). cheers . . . jhf |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? Here are a couple of links: WU 95336686, WU 95336073. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? Here is another Capri workunit killed by the watchdog: WU 95336625. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0 |
WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into. Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After going for an hour I restarted boinc but it simply got stuck again so I aborted it after 20 mins. Now someone else has it, will see for how long. I'm watching the rest closely to abort any suspicious units. EDIT: It seems the 1he8 and the 1g4u ones have the most problems, maybe remove these from the system until this is resolved. BBLounge - Broadband and Technology forum |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into. I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement. Most WU types will report the completed models and the failure and show "success", and issue credit accordingly. But I'm not certain if these are proving the exception to the rule. Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After going for an hour I restarted boinc but it simply got stuck again so I aborted it after 20 mins. Now someone else has it, will see for how long. If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come and the watchdog knows that. More than an hour for a single step... ya that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the "stuck after 900 seconds" condition. Do you know? Was it still using CPU after that hour? There have been issues where BOINC shows a status of "running", but no CPU is being used. Rosetta Moderator: Mod.Sense |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
I thought the purpose of Ralph was to "test" these mods. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I thought the purpose of Ralph was to "test" these mods. No testing is perfect. ...just ask your favorite software vendor. Rosetta Moderator: Mod.Sense |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? Here are two more Capri/watchdog casualties: WU 95336658, WU 95336696. Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0 |
I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement. I didn't see it while it was stuck, but when I viewed it a couple of times before, it was putting out results about every 5-10 minutes, so there should have been at least 40 decoys by that time. The problem is that the system seems to give 20 credits for effort because of the error rather than giving credits for the reported results. Luckily mine only claimed about 40, but others have claimed anything from 4 to a couple hundred and they all got 20. Whether it's set up like this or is a bug only surfacing now, I think the developers should look at it. If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come and the watchdog knows that. More than an hour for a single step... ya that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the "stuck after 900 seconds" condition. It went to 500 steps in a few seconds and then just seemed to stay there indefinitely at the refinement stage. So I restarted the app after an hour and it seemed to stay at those same values before I ended it at 20 mins. Unfortunately I didn't look at the CPU usage but will do so if it happens again. Strange that the watchdog didn't end it. Maybe the progress was so small that it didn't show up on screen, but I don't know how likely that is. Hmm... seems somebody else completed it... maybe starting with a different random number?
I also have to wonder why this didn't show up on Ralph, as there seem to be so many issues with so many workunits. The problem is that these all went out shortly after the blackout so I suspect there wasn't any testing done on them. BBLounge - Broadband and Technology forum |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
I thought the purpose of Ralph was to "test" these mods. How many test WUs are run on Ralph before being released to Rosetta? |
The_Bad_Penguin Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
Not certain which threads are being monitored, so just a quick message that I posted this. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
...as a general rule, threads opened by the Project Team are monitored by the Project Team. So, posting to the "Problems with..." thread, as you did, was probably the best place to mention it. Rosetta Moderator: Mod.Sense |
Jim Joined: 15 Oct 06 Posts: 22 Credit: 5,410,546 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? I have now done several CAPRI14 WUs with 5.80. Only one has a problem: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=95775480 with this result: https://boinc.bakerlab.org/rosetta/result.php?resultid=105522388 I don't have a clue as to why it failed to validate. It seems to have run the full length of time I have set for processing. Thanks, Jim |
Prom Joined: 21 Jun 06 Posts: 23 Credit: 931,604 RAC: 0 |
This is puzzling -- those jobs aren't taking long. We'll look into it. In the meantime, can you or others post links to the appropriate results that were killed? Ok, one of mine has now failed to validate as well. Rosetta owes me 132.92 for 11h57m47.5s of work and 177 decoys. BBLounge - Broadband and Technology forum |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
I am quite curious as to the relative success rate of the Capri14 WUs versus WUs of all other types. My personal experience is that a Capri14 WU is nearly doomed to failure, no matter the Rosetta version on which it is running, while nearly all other WUs will succeed... I sincerely hope that the CPU time that (to my layman's perspective) I seem to be wasting on Capri14 is not representative of the experience of the general population of Rosetta crunchers. Respectfully, David Emigh Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
©2024 University of Washington
https://www.bakerlab.org