CAPRI14?

rsubler

Joined: 24 Jun 07
Posts: 8
Credit: 172,618
RAC: 0
Message 46076 - Posted: 12 Sep 2007, 17:31:08 UTC

I have run four results this week. Two have completed successfully, two have been killed by the Rosetta application after four and nine hours, with 20 credits granted.

Both of the failed results were labelled CAPRI14. I also see from the troubles with 5.78 thread that others are having the same problem.

1. What's wrong?
2. When will this be fixed?

For now, I am aborting any CAPRI results and will continue to do so until this is clarified.

Thanks,
Ron
ID: 46076 · Rating: 0

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 46084 - Posted: 12 Sep 2007, 19:20:11 UTC

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
ID: 46084 · Rating: 0

rsubler
Joined: 24 Jun 07
Posts: 8
Credit: 172,618
RAC: 0
Message 46085 - Posted: 12 Sep 2007, 19:29:51 UTC
Last modified: 12 Sep 2007, 19:30:34 UTC

Sure,

W/U ID 94798422 - 16,550 seconds
W/U ID 94963714 - 34,122 seconds
W/U ID 94968990 - aborted.

These were on an AMD X2 3800+, which normally yields 10 credits per core per hour on Rosetta, 13 on Einstein.
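
As a rough check on the figures above, the sketch below estimates what the two run times would normally earn at the quoted rate of 10 credits per core per hour. The function name and the assumption that credit scales linearly with CPU time are illustrative only; this is not how the project actually computes credit.

```python
# Rough estimate only: assumes credit scales linearly with CPU time
# at the rate quoted in the post (10 credits per core per hour).
CREDITS_PER_CORE_HOUR = 10.0


def expected_credit(cpu_seconds, rate=CREDITS_PER_CORE_HOUR):
    """Return the credit a task would earn at a flat hourly rate."""
    return cpu_seconds / 3600.0 * rate


# Work unit IDs and run times as listed in the post.
for wu_id, seconds in [(94798422, 16550), (94963714, 34122)]:
    estimate = expected_credit(seconds)
    print(f"W/U {wu_id}: {seconds} s -> about {estimate:.1f} credits")
```

At that rate the two runs would come to roughly 46 and 95 credits, against the 20 credits reported as granted.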

To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.

Happy hunting,
Ron
ID: 46085 · Rating: 0

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46096 - Posted: 12 Sep 2007, 22:25:01 UTC
Last modified: 12 Sep 2007, 22:28:53 UTC

Some more may be found here from greg_be and myself: the watchdog is ending runs.

My preset runtime is 3 hours (10,800 seconds). A CAPRI14 task ran for 14,594 seconds before the watchdog ended it.

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?
ID: 46096 · Rating: 1

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46103 - Posted: 13 Sep 2007, 0:15:56 UTC - in response to Message 46085.  


To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.


In science, even apparent failures are progress. You now know something more about the subject than you did before you failed.

Everyone is working to avoid failing tasks, indeed. But it does happen on occasion as things are modified over time. Yet without making modifications, it cannot improve. So, unfortunately, it's all part of the process.

As always, thanks to all that persevere and work through it.
Rosetta Moderator: Mod.Sense
ID: 46103 · Rating: 0

Jim
Joined: 15 Oct 06
Posts: 22
Credit: 5,410,546
RAC: 0
Message 46107 - Posted: 13 Sep 2007, 2:14:40 UTC - in response to Message 46084.  

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?



WU: 95285894
WU: 94876951
WU: 94840468

All 3 are CAPRI14 units and ended by the watchdog timer. My runtime is set for 4 hours (14400 seconds).

cheers . . . jhf
ID: 46107 · Rating: 0

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46123 - Posted: 13 Sep 2007, 14:19:51 UTC - in response to Message 46084.  
Last modified: 13 Sep 2007, 14:25:50 UTC

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?


Here are a couple of links:

WU 95336686
WU 95336073

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 46123 · Rating: 0

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46158 - Posted: 13 Sep 2007, 23:05:01 UTC - in response to Message 46084.  

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?


Here is another Capri workunit killed by the watchdog:

WU 95336625

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 46158 · Rating: 0

Prom
Joined: 21 Jun 06
Posts: 23
Credit: 931,604
RAC: 0
Message 46161 - Posted: 14 Sep 2007, 0:43:29 UTC
Last modified: 14 Sep 2007, 0:51:36 UTC

WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into.

Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After going for an hour I restarted boinc but it simply got stuck again so I aborted it after 20 mins. Now someone else has it, will see for how long.

I'm watching the rest closely to abort any suspicious units.

EDIT: It seems the 1he8 and the 1g4u ones have the most problems, maybe remove these from the system until this is resolved.
BBLounge - Broadband and Technology forum
ID: 46161 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46163 - Posted: 14 Sep 2007, 1:14:19 UTC - in response to Message 46161.  
Last modified: 14 Sep 2007, 1:19:25 UTC

WU 94754759 was killed after being idle for 15 min. I only got 20 credits for it. I'm not upset at this but would have expected about 40 after doing about 3.5 hours of work with a lot of results. What happened to the results that were valid? Just something else I think should be looked into.


I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5 hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement.

Most WU types will report the completed models and the failure and show "success", and issue credit accordingly. But I'm not certain if these are proving the exception to the rule.

Now WU 94757175 goes up to 500 steps and stops at the refinement. This time the watchdog didn't shut it down. After going for an hour I restarted boinc but it simply got stuck again so I aborted it after 20 mins. Now someone else has it, will see for how long.


If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come, and the watchdog knows that. More than an hour for a single step... ya, that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the stuck-after-900-seconds condition.

Do you know? Was it still using CPU after that hour? There have been issues where BOINC shows a status of "running", but no CPU is being used.
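
The "stuck for 900 seconds" rule mentioned above can be pictured with a minimal sketch. This is not Rosetta's actual watchdog code; the class name, the use of a step counter as the progress signal, and the injectable clock are all assumptions made for illustration.

```python
import time

STALL_LIMIT_SECONDS = 900  # threshold mentioned in the post


class Watchdog:
    """Hypothetical watchdog: flags a task whose progress marker stops changing."""

    def __init__(self, stall_limit=STALL_LIMIT_SECONDS, clock=time.monotonic):
        self.stall_limit = stall_limit
        self.clock = clock            # injectable so the logic can be tested
        self.last_progress = None
        self.last_change = clock()

    def report(self, progress):
        """Record the latest progress marker (for example, a step count)."""
        if progress != self.last_progress:
            self.last_progress = progress
            self.last_change = self.clock()

    def is_stuck(self):
        """True once the marker has been unchanged for the full stall limit."""
        return self.clock() - self.last_change >= self.stall_limit
```

Under a check like this, a task that keeps changing its marker, however slowly, would never be flagged, which is one possible reason a run could sit at one stage for an hour without being ended.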
Rosetta Moderator: Mod.Sense
ID: 46163 · Rating: 0

j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 46164 - Posted: 14 Sep 2007, 1:27:07 UTC - in response to Message 46103.  


To clarify, I am not upset, merely frustrated at wasting computer time to produce useless output.


In science, even apparent failures are progress. You now know something more about the subject than you did before you failed.

Everyone is working to avoid failing tasks, indeed. But it does happen on occasion as things are modified over time. Yet without making modifications, it cannot improve. So, unfortunately, it's all part of the process.

As always, thanks to all that persevere and work through it.


I thought the purpose of Ralph was to "test" these mods.
ID: 46164 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46167 - Posted: 14 Sep 2007, 3:43:24 UTC - in response to Message 46164.  

I thought the purpose of Ralph was to "test" these mods.


No testing is perfect. ...just ask your favorite software vendor.
Rosetta Moderator: Mod.Sense
ID: 46167 · Rating: 0

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46185 - Posted: 14 Sep 2007, 11:48:02 UTC - in response to Message 46084.  
Last modified: 14 Sep 2007, 11:50:37 UTC

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?


Here are two more CAPRI/watchdog casualties:

WU 95336658
WU 95336696

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 46185 · Rating: 0

Prom
Joined: 21 Jun 06
Posts: 23
Credit: 931,604
RAC: 0
Message 46192 - Posted: 14 Sep 2007, 12:55:50 UTC - in response to Message 46163.  

I agree. Did you happen to view the graphic prior to it getting ended by the watchdog? I'm wondering if it was one of those that took the whole 3.5 hrs on that first model, and therefore never completed one and had no completed results to report... or if it was one of those running 6 models an hour, and so it had completed 20 models or whatever prior to the problem. If the latter, then definitely there is room for improvement.

Most WU types will report the completed models and the failure and show "success", and issue credit accordingly. But I'm not certain if these are proving the exception to the rule.

I didn't see it while it was stuck, but when I viewed it a couple of times before that, it was putting out results about every 5~10 minutes, so there should have been at least 40 decoys by that time. The problem is that the system seems to give 20 credits for effort because of the error rather than giving credits for the reported results. Luckily mine only claimed about 40, but others have claimed anything from 4 to a couple hundred and they all got 20. Whether it's set up like this or is a bug only surfacing now, I think the developers should look at it.

If it were really stuck, the watchdog should detect that and end it. But there are cases where progress is slow to come, and the watchdog knows that. More than an hour for a single step... ya, that sounds a bit high. One of Rhiju's questions was "did you happen to notice if the screen looked totally stuck before the crash?" ... but then THAT one didn't "crash" with the stuck-after-900-seconds condition.

Do you know? Was it still using CPU after that hour? There have been issues where BOINC shows a status of "running", but no CPU is being used.

It went to 500 steps in a few seconds and then just seemed to stay there indefinitely at the refinement stage. So I restarted the app after an hour and it seemed to stay at those same values before I ended it at 20 mins. Unfortunately I didn't look at the cpu usage but will do so if it happens again. Strange that the watchdog didn't end it. Maybe there could have been progress so little that it didn't show up on screen but I don't know how likely that is. Hmm... seems somebody else completed it... maybe starting with a different random number?

I also have to wonder why this didn't show up at Ralph, as there seem to be so many issues with so many workunits. The problem is that these all went out shortly after the blackout, so I suspect there wasn't any testing done on them.
BBLounge - Broadband and Technology forum
ID: 46192 · Rating: 0

j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 46233 - Posted: 14 Sep 2007, 23:07:02 UTC - in response to Message 46167.  

I thought the purpose of Ralph was to "test" these mods.


No testing is perfect. ...just ask your favorite software vendor.


How many test WUs are run on Ralph before being released to Rosetta?

ID: 46233 · Rating: 0

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46235 - Posted: 14 Sep 2007, 23:25:10 UTC

Not certain which threads are being monitored, so just a quick message that I posted this.
ID: 46235 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46240 - Posted: 15 Sep 2007, 1:07:23 UTC

...as a general rule, threads opened by the Project Team are monitored by the Project Team. So, posting to the "Problems with..." thread, as you did, was probably the best place to mention it.
Rosetta Moderator: Mod.Sense
ID: 46240 · Rating: 0

Jim
Joined: 15 Oct 06
Posts: 22
Credit: 5,410,546
RAC: 0
Message 46250 - Posted: 15 Sep 2007, 4:48:46 UTC - in response to Message 46084.  

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?


I have now done several CAPRI14 WUs with 5.80. Only one has a problem:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=95775480
with this result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=105522388
I don't have a clue as to why it failed to validate. It seems to have run the full length of time I have set for processing.

Thanks,
Jim
ID: 46250 · Rating: 0

Prom
Joined: 21 Jun 06
Posts: 23
Credit: 931,604
RAC: 0
Message 46298 - Posted: 15 Sep 2007, 19:01:53 UTC - in response to Message 46250.  

This is puzzling -- those jobs aren't taking long. We'll look into it. In the meanwhile, can you or others post links to the appropriate results that were killed?


I have now done several CAPRI14 WUs with 5.80. Only one has a problem:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=95775480
with this result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=105522388
I don't have a clue as to why it failed to validate. It seems to have run the full length of time I have set for processing.

Thanks,
Jim

OK, one of mine has now failed to validate as well. Rosetta owes me 132.92 for 11h57m47.5s of work and 177 decoys.
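
For reference, the figures in that claim work out to the rates computed below. The helper is only a sketch for the arithmetic; its name is made up here, and the numbers are taken directly from the post.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s run time to seconds."""
    return hours * 3600 + minutes * 60 + seconds


claimed_credit = 132.92                  # credit claimed in the post
decoys = 177                             # decoys reported in the post
run_seconds = to_seconds(11, 57, 47.5)   # 11h57m47.5s of work

credits_per_hour = claimed_credit / (run_seconds / 3600)
credits_per_decoy = claimed_credit / decoys
seconds_per_decoy = run_seconds / decoys

print(f"run time:          {run_seconds:.1f} s")
print(f"credits per hour:  {credits_per_hour:.2f}")
print(f"credits per decoy: {credits_per_decoy:.3f}")
print(f"seconds per decoy: {seconds_per_decoy:.1f}")
```

That comes to roughly 11 credits per hour and about four minutes per decoy.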
BBLounge - Broadband and Technology forum
ID: 46298 · Rating: 0

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46324 - Posted: 16 Sep 2007, 4:06:55 UTC

I am quite curious as to the relative success rate of the Capri14 WUs versus WUs of all other types.

My personal experience is that a Capri14 WU is nearly doomed to failure, no matter the Rosetta version on which it is running, while nearly all other WUs will succeed...

I sincerely hope that the CPU time that (to my layman's perspective) I seem to be wasting on Capri14 is not representative of the experience of the general population of Rosetta crunchers.

Respectfully,
David Emigh
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 46324 · Rating: 0


©2024 University of Washington
https://www.bakerlab.org