Posts by Warren B. Rogers

1) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57037)
Posted 18 Nov 2008 by Warren B. Rogers
Post:
Actually, the problem is that when the WU gets to 10ish minutes to go it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had it get stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. And not all have this problem. I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a Beta have problems getting stuck.

The description by Sid Celery sounds very accurate to me. I've currently got a Mini work unit at 6 hours (3 hours default) and as he says it's still on the first model and ticking up nicely. No problem at all.

If you've got lock file errors I'd hazard a guess that it's not ticking up at all on the CPU Time side. That's the issue. Forget anything to do with the remaining time because that's only ever a complete guess - as likely to be wrong as right.

When you get other errors, the WU falls back to its last save position or the start of the current model within the WU. Maybe it sorts itself out by doing that and that's why it completes quickly after that.

Just my 2 cents


Also, BOINC will restart the WU if it gets stuck for too long and will go back to almost the beginning of the WU. Then the WU completes in approximately 2 1/2 hours, like a WU that doesn't have any problems. The thing that sucks about that is I only get credit for the time it took to complete the WU, 2 1/2 hours, and the other 7 to 16 hours that my computer was stuck don't get credited. I don't have a problem with working on WUs that take a long time to complete, as most of the projects that I do work for take multiple hours, and my longest is ClimatePrediction.net, which at the moment has been working for 339 hours and still has about 7 hours to go. I just don't like having a WU take up CPU cycles from another WU when it isn't necessary.
2) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57032)
Posted 17 Nov 2008 by Warren B. Rogers
Post:
Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs.

I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will.

If the estimate is 3 hours then even if the WU ends up running 3 hours exactly the countdown still stops with 10 minutes to go. It ends when the model it's working on ends, then drops to zero as the WU finishes altogether. If the 1st model ends at 1h 31m then the WU ends because it'll assume the next model will take the same time and go over the 3 hours. If 2 models complete at 2h 1m it'll do the same, assuming another 1h 0m 30s for the next model. And so on for 3 models at 2h 16m, 4 models at 2h 25m etc.
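
The arithmetic in those examples can be checked with a small Python sketch. It assumes the rule is "expect the next model to take the average time of the models finished so far, and stop if that would pass the target". That rule is inferred from the numbers in the post, not taken from the Rosetta source, and the function name `should_stop` is made up for illustration.

```python
def should_stop(elapsed_min: float, models_done: int, target_min: float) -> bool:
    """Return True if starting another model is expected to overrun the target.

    Assumption (inferred from the post's examples, not from Rosetta's code):
    the next model is expected to take the average time of the models
    completed so far.
    """
    if models_done == 0:
        return False  # no model finished yet, so there is nothing to average
    avg_model_min = elapsed_min / models_done
    return elapsed_min + avg_model_min > target_min


TARGET_MIN = 180  # the 3-hour estimate used in the post

# (models completed, elapsed minutes), taken from the examples in the post
for models_done, elapsed in [(1, 91), (2, 121), (3, 136), (4, 145)]:
    projected = elapsed + elapsed / models_done
    print(models_done, elapsed, round(projected, 2),
          should_stop(elapsed, models_done, TARGET_MIN))
```

In each of the four cases the projected finish lands a little past 180 minutes, which is why the WU would end there instead of starting another model.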

To see how many models have been done, click "Show Graphics" in the BOINC Manager. It's shown at the bottom right.

An estimate is an estimate. It's not a set time frame. Don't expect it to be cast in stone because it's not.

Same with all the long-running WUs. They don't end earlier because the first model hasn't even been completed. Don't look at the clock ticking down. As long as the CPU time is clicking up then it's running just fine. If you abort the WU while CPU time is running then it's your look-out. I think my record is about 14 hours.
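
As a rough illustration of that "watch the CPU time, not the countdown" check, here is a minimal Python sketch. It assumes you note down the CPU time shown for the task by hand every few minutes. The function name and the one-second threshold are hypothetical, and nothing here reads values from BOINC itself.

```python
def cpu_time_is_advancing(samples: list[float], min_gain_s: float = 1.0) -> bool:
    """Decide whether a task is still doing work.

    samples: CPU-time readings in seconds, oldest first, noted down a few
    minutes apart (hypothetical manual readings, not a BOINC API).
    """
    if len(samples) < 2:
        raise ValueError("need at least two readings to compare")
    # Only the CPU time matters here; the remaining-time estimate is ignored.
    return samples[-1] - samples[0] >= min_gain_s


print(cpu_time_is_advancing([3600.0, 3890.0]))          # CPU time went up
print(cpu_time_is_advancing([3600.0, 3600.0, 3600.0]))  # CPU time is flat
```

If the readings are flat over several minutes while the task shows as running, that points to a genuinely stuck WU and not just a stalled countdown.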


Actually, the problem is that when the WU gets to 10ish minutes to go it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had it get stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. And not all have this problem. I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a Beta have problems getting stuck.

Thanks for the info though,

Warren
3) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 56830)
Posted 11 Nov 2008 by Warren B. Rogers
Post:
Hello everyone,

I've also had trouble with this version of Minirosetta. The WU will get to about 98% completion and show approximately 9 minutes to completion, and then it seems to get stuck at that point. I've stopped the WU and let other projects get a chance to complete, and when BOINC returns to the WU it will start from the beginning and sometimes complete in approximately 2 hours, or it will do the same thing and get stuck at 98% and run for over 6 hours. I've had 2 end with Compute Errors and 1 with a Validate Error. And I've seen that even the WUs that complete are getting shut down by the watchdog because of too many restarts. I hope this information helps.


Warren Rogers
4) Message boards : Number crunching : Problems with Rosetta stable version 5.69 and beta version 5.77 (Message 45748)
Posted 4 Sep 2007 by Warren B. Rogers
Post:

Good day all. I've also noticed this problem with version 5.77, but it doesn't happen all of the time, only occasionally. When I notice that this is happening I usually suspend that WU and let something else run. The one time I just let it run it was at 09:57 for about 3 hours and the countdown timer was not moving. Also, I saw the amount of work being done drop to a 1000th of a percent/sec, which was considerably slower than it was for the first 96% of the WU. I didn't want to abort the WU and just hoped that if it did something else for a while and worked its way back to the WU it would finish it properly. Well, it did take about 1 hour to finish, but it was much better than the rate it was moving at. I hope my experience is helpful.

Warren Rogers


Was it that WU which took your computer about 19K seconds? In that case there was nothing wrong, in my opinion. It is just one of those work units which need a large amount of time to generate even one model. And since at least one model needs to be generated, this one model may exceed your pre-set runtime. Maybe the low model count also causes the inaccuracies in the estimated time remaining.


Yes, it did take about 19K seconds to complete, and I had another one that took 20K to complete as well. The problem is that it is taking about 1 1/2 to 2 hours to get to the 10 minute mark, then it sort of hangs up there for about 4 hours unless I suspend the project and let something else run and let the BOINC manager work its way back to the WU.

Thanks,

Warren
5) Message boards : Number crunching : Problems with Rosetta stable version 5.69 and beta version 5.77 (Message 45726)
Posted 3 Sep 2007 by Warren B. Rogers
Post:
Something goes wrong with 5.77 on my machine. It gets down to saying 00:09:57 and then stays there. So for the second time I am about to abort a task. Suspect this old PC just isn't capable or something. Been chugging along with RAH for over a year, I guess, but maybe it's time to quit.


Jerry, ever tried just letting them run? They are probably doing just fine. Rosetta has no way to know ahead of time exactly how long it will take to crunch a given model, so the estimate is... well... just an estimate. Once it gets down to within 10 minutes of your target runtime, it starts to reduce the time remaining exponentially, by less and less, with the idea being that the time to completion will generally still be going down.
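
To picture what that kind of countdown looks like, here is a Python sketch under stated assumptions. The exponential form and the `tau_s` time constant are guesses for illustration only. The real formula Rosetta uses is not given in this thread, and `displayed_remaining` is a made-up name.

```python
import math

THRESHOLD_S = 600.0  # the 10-minute mark mentioned in the post


def displayed_remaining(target_s: float, cpu_s: float, tau_s: float = 3600.0) -> float:
    """Hypothetical countdown: linear until 10 minutes are left, then decaying.

    tau_s is an assumed time constant, not a value taken from Rosetta.
    """
    true_remaining = target_s - cpu_s
    if true_remaining > THRESHOLD_S:
        return true_remaining  # normal countdown
    # Seconds of CPU time spent since the countdown reached the 10-minute mark.
    past_mark = THRESHOLD_S - true_remaining
    # Shrinks by less and less, always going down but never reaching zero.
    return THRESHOLD_S * math.exp(-past_mark / tau_s)


# A 3-hour target: readings well before, at, and long after the 10-minute mark.
for cpu_s in (0.0, 10200.0, 10800.0, 20000.0):
    print(cpu_s, round(displayed_remaining(10800.0, cpu_s), 1))
```

With a curve like this the display keeps creeping down for as long as the model keeps running, and the steps become small enough that it can look frozen.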

The watchdog is always there looking over your tasks, and prepared to abort them if it deems necessary.


Good day all. I've also noticed this problem with version 5.77, but it doesn't happen all of the time, only occasionally. When I notice that this is happening I usually suspend that WU and let something else run. The one time I just let it run it was at 09:57 for about 3 hours and the countdown timer was not moving. Also, I saw the amount of work being done drop to a 1000th of a percent/sec, which was considerably slower than it was for the first 96% of the WU. I didn't want to abort the WU and just hoped that if it did something else for a while and worked its way back to the WU it would finish it properly. Well, it did take about 1 hour to finish, but it was much better than the rate it was moving at. I hope my experience is helpful.

Warren Rogers






©2024 University of Washington
https://www.bakerlab.org