Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Hello Mike Tyka, Thanks for your reaction and for rerunning this WU. I have had more errors on this new laptop (restarting and "can't acquire lockfile"). These errors might be caused by the throttling I need to keep the fans “silent”. I've changed some settings to keep the CPU running at a constant frequency, and downgraded to BOINC 5.10.45, just in case. Now this machine seems to crunch better: it occasionally restarts, but the WUs are valid (so far). Have a nice day, Path7. |
Cobra Joined: 9 Nov 05 Posts: 7 Credit: 16,461,654 RAC: 435 |
Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP. Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs. |
Sid Celery Joined: 11 Feb 08 Posts: 2082 Credit: 40,621,050 RAC: 4,944 |
Quoting Cobra: "Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP." I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will. If the estimate is 3 hours, then even if the WU ends up running exactly 3 hours, the countdown still stops with 10 minutes to go. The WU ends when the model it's working on ends, and the countdown then drops to zero as the WU finishes altogether. If the 1st model ends at 1h 31m then the WU ends, because it'll assume the next model will take the same time and go over the 3 hours. If 2 models complete at 2h 1m it'll do the same, assuming another 1h 0m 30s for the next model. And so on for 3 models at 2h 16m, 4 models at 2h 25m etc. To see how many models have been done, click "Show Graphics" in the BOINC Manager. It's shown at the bottom right. An estimate is an estimate. It's not a set time frame. Don't expect it to be cast in stone because it's not. Same with all the long-running WUs. They don't end earlier because the first model hasn't even been completed. Don't look at the clock ticking down. As long as the CPU time is ticking up then it's running just fine. If you abort the WU while CPU time is running then it's your look-out. I think my record is about 14 hours. |
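The stopping rule described in this post can be sketched as follows. This is only an illustration of the arithmetic in the examples above, assuming the next model is expected to take the average time of the models finished so far. The names `should_start_next_model` and `hms` are hypothetical and are not taken from the Rosetta source.

```python
def should_start_next_model(elapsed_s: float, models_done: int, target_s: float) -> bool:
    """Return True if another model is expected to fit in the target run time.

    Assumption (not from the Rosetta source): the next model is expected to
    take the average time of the models completed so far.
    """
    if models_done == 0:
        # Nothing has finished yet, so there is no estimate to compare against;
        # the first model always runs to completion.
        return True
    expected_next_s = elapsed_s / models_done
    return elapsed_s + expected_next_s <= target_s


def hms(hours: int, minutes: int = 0, seconds: int = 0) -> int:
    """Convert hours/minutes/seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


target = hms(3)
# The four cases from the post; in each one the next model would overrun 3 hours.
for done, elapsed in [(1, hms(1, 31)), (2, hms(2, 1)), (3, hms(2, 16)), (4, hms(2, 25))]:
    print(done, should_start_next_model(elapsed, done, target))
```

Under this assumption all four cases return `False`, so the WU stops after the current model. A WU whose first model has not finished never reaches the check, which is consistent with the remark that long-running WUs do not end early.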
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 2 |
Quoting an earlier post: "Rosetta Mini doesn't always respect BOINC's 'Snooze' setting on making projects suspend. The weird thing is I had 2 Minis running and when I hit 'Snooze' 1 suspended and 1 continued." Yes, I've had the same problem on occasion with Rosetta Mini 1.40. I understand there are times when the program is "right in the middle of something", but it should perform callback checks to the BOINC API so that it suspends or resumes within a few seconds of the API command. |
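The kind of periodic check asked for here can be sketched as a cooperative polling loop. This is a plain Python stand-in under stated assumptions, not the BOINC API and not how minirosetta is implemented: `run_with_suspend_checks`, `is_suspended` and `wait` are hypothetical names, and the suspend state is scripted rather than read from a real client.

```python
from typing import Callable, Iterable


def run_with_suspend_checks(
    steps: Iterable[Callable[[], None]],
    is_suspended: Callable[[], bool],
    wait: Callable[[], None],
) -> int:
    """Run short work steps, polling a suspend flag before each one."""
    completed = 0
    for step in steps:
        # Poll before every short step, so a suspend request is honoured
        # after at most one step's worth of work.
        while is_suspended():
            wait()
        step()
        completed += 1
    return completed


# Scripted demo: the (hypothetical) client reports "suspended" twice
# before the second step is allowed to run.
states = iter([False, True, True, False, False])
log = []
waits = []
done = run_with_suspend_checks(
    steps=[lambda i=i: log.append(i) for i in range(3)],
    is_suspended=lambda: next(states),
    wait=lambda: waits.append("paused"),
)
print(done, log, len(waits))
```

The point of the design is that the delay before a suspend takes effect is bounded by the length of one step, so keeping steps short keeps the response within a few seconds.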
Warren B. Rogers Joined: 3 Oct 05 Posts: 5 Credit: 1,127,824 RAC: 0 |
Quoting Cobra: "Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP." Actually, the problem is that when the WU gets to 10ish minutes to go it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. And not all WUs have this problem. I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a Beta have problems getting stuck. Thanks for the info though, Warren |
DALTON Joined: 9 Jun 08 Posts: 1 Credit: 250,510 RAC: 0 |
Quoting Warren B. Rogers: "Actually, the problem is that when the WU gets to 10ish minutes to go it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. And not all WUs have this problem. I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a Beta have problems getting stuck." The description by Sid Celery sounds very accurate to me. I've currently got a Mini work unit at 6 hours (3 hours default) and as he says it's still on the first model and ticking up nicely. No problem at all. If you've got lock file errors I'd hazard a guess that it's not ticking up at all on the CPU Time side. That's the issue. Forget anything to do with the remaining time because that's only ever a complete guess - as likely to be wrong as right. When you get other errors, the WU falls back to its last save position or the start of the current model within the WU. Maybe it sorts itself out by doing that and that's why it completes quickly after that. Just my 2 cents |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
I wrote: Hi, as I promised, I have come back, increased the memory amount and started to crunch again. To my surprise, the process suddenly finished with a "success". The log says: 2008-11-17 21:54:33|rosetta@home|Restarting task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 using minirosetta version 140 2008-11-17 21:56:12|rosetta@home|Computation for task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 finished As I wrote in the posts above, it is impossible to finish this task in such a short time. Last time I needed two and a half "physical" hours just to crash, probably due to memory limits that were too low. I would like to notify you that this unit has not been computed properly and is probably worth another try. I made a snapshot just before the crash and this protein looks far better (lower energy, RAC) than any I have seen so far from the old-fashioned ab initio process. Frankly speaking, I would be more than happy to compute it myself; unfortunately the client has sent it back. :( If you could send it to me manually, that would be nice. :) If not, please consider a recomputation of this unit. I wish you the best of luck with these units, as the one I have seen so far signals a true breakthrough... a.m.@Poland |
Warren B. Rogers Joined: 3 Oct 05 Posts: 5 Credit: 1,127,824 RAC: 0 |
Quoting my earlier post: "Actually, the problem is that when the WU gets to 10ish minutes to go it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. And not all WUs have this problem. I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a Beta have problems getting stuck." Also, BOINC will restart the WU if it gets stuck for too long and will go back to almost the beginning of the WU. Then the WU completes in approximately 2 1/2 hours, like a WU that doesn't have any problems. The thing that sucks about that is I only get credit for the time it took to complete the WU, 2 1/2 hours, and the other 7 to 16 hours that my computer was stuck don't get credited. I don't have a problem with working on WUs that take a long time to complete, as most of the projects that I do work for take multiple hours and my longest is ClimatePrediction.net, which at the moment has been working for 339 hours and still has about 7 hours to go. I just don't like having a WU take up CPU cycles from another WU when it isn't necessary. |
Speedy Joined: 25 Sep 05 Posts: 163 Credit: 808,098 RAC: 0 |
Task ID 207716389, a 1d0qA model, does not display any graphics, either by clicking "Show Graphics" or in the screen saver. It has just finished and it's valid. I hope this is of help. Have a crunching good day!! |
Greg_BE Joined: 30 May 06 Posts: 5690 Credit: 5,859,226 RAC: 8 |
Mod or team... what is this "recovering checkpoint" thing that is showing up in some tasks? See my thread further down the list showing 4 tasks that completed OK, but gave checkpoint messages. Speedy's task showed the same thing: it completed OK, but gives a recovering checkpoint message. |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,124,428 RAC: 2,489 |
This task ran for 20 hours and was terminated by the watchdog. I know that Rosetta is a low-paying project at the best of times, so I will be satisfied (I have to, don't I?) with the 80 credits I received (4 cr/hr). # cpu_run_time_pref: 21600 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 71942.5 seconds. Greater than 3X preferred time: 21600 seconds ********************************************************************** called boinc_finish |
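The numbers in this log can be checked with a few lines of arithmetic. The sketch below assumes the rule is exactly what the message states (end the run once CPU time exceeds 3 times the preferred run time); `watchdog_should_end` is a hypothetical name, and how often the real watchdog performs the check is not known from the log.

```python
def watchdog_should_end(cpu_time_s: float, preferred_s: float, factor: float = 3.0) -> bool:
    """True once CPU time exceeds `factor` times the preferred run time."""
    return cpu_time_s > factor * preferred_s


preferred_s = 21600   # cpu_run_time_pref from the log (6 hours)
cpu_time_s = 71942.5  # CPU time reported when the watchdog fired
credits = 80          # credit the poster reports receiving

limit_s = 3 * preferred_s
print(limit_s)                                    # 64800, i.e. 18 hours
print(watchdog_should_end(cpu_time_s, preferred_s))
print(round(cpu_time_s / 3600, 2), "hours of CPU time")
print(round(credits / (cpu_time_s / 3600), 2), "credits per hour")
```

With the figures from the log, the run is just under 20 hours against an 18-hour limit, and 80 credits works out to roughly the 4 cr/hr quoted in the post.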
mikylinux Joined: 25 Jul 07 Posts: 3 Credit: 73,155 RAC: 0 |
The tasks cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_flua_olange_4728_19390_0 and 1bm8__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1bm8_-_4768_9_0 do not stop. Running for 14 hours now; this usually takes 4 hours. Interrupting the work... |
(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=206369194 |
ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
I observed on my MacBook this morning (it's working by itself in peace and quietude) that the cs_jumping WUs appear to complete normally, but seem (according to the message window) to restart once or twice in the computing process without an obvious explanation. |
(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=206938154 https://boinc.bakerlab.org/rosetta/result.php?resultid=207138889 https://boinc.bakerlab.org/rosetta/result.php?resultid=207121456 https://boinc.bakerlab.org/rosetta/result.php?resultid=207114809 https://boinc.bakerlab.org/rosetta/result.php?resultid=206990578 https://boinc.bakerlab.org/rosetta/result.php?resultid=206946754 https://boinc.bakerlab.org/rosetta/result.php?resultid=206944736 https://boinc.bakerlab.org/rosetta/result.php?resultid=206831871 |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Here are some more "NANs in hbonding" errors from h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25 WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=208041354 https://boinc.bakerlab.org/rosetta/result.php?resultid=207922933 https://boinc.bakerlab.org/rosetta/result.php?resultid=207915448 https://boinc.bakerlab.org/rosetta/result.php?resultid=207873078 |
Alec Rosa Joined: 11 Nov 08 Posts: 18 Credit: 2,635 RAC: 0 |
Hello, new here, sorry guys, I come to bitch. Having to manually abort every Rosetta Mini 1.40 task, so that I'm not wasting CPU time and energy, is a bitch. Just sayin'. |
robertmiles Joined: 16 Jun 08 Posts: 1230 Credit: 14,172,067 RAC: 737 |
Quoting AMD_is_logical: "Here's some more NANs in hbonding errors from h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25 WUS:" I bet you'd like it if, in addition to reporting the error for the tag with the error, v1.41 also had the capability of reporting the good results for the previous tags, with separate credit calculations for each tag. That would, however, probably require adding a new outcome state indicating partially successful. |
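The suggestion in this post amounts to recording a result per tag and adding a third outcome state. A minimal sketch of that idea follows. Everything in it is hypothetical: the `Outcome` and `TagResult` types, the tag names and the CPU times are made up for illustration and do not describe the BOINC server's actual result states or any Rosetta version.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"  # the proposed new state
    ERROR = "error"


@dataclass
class TagResult:
    tag: str
    cpu_seconds: float
    error: Optional[str] = None  # None means the tag finished cleanly


def summarize(results: List[TagResult]) -> Outcome:
    """Classify a run from its per-tag results."""
    if not results:
        return Outcome.ERROR
    good = [r for r in results if r.error is None]
    if len(good) == len(results):
        return Outcome.SUCCESS
    return Outcome.PARTIAL_SUCCESS if good else Outcome.ERROR


def creditable_seconds(results: List[TagResult]) -> float:
    """CPU time spent on tags that finished without an error."""
    return sum(r.cpu_seconds for r in results if r.error is None)


# Made-up example: two good tags followed by one that hit the hbonding error.
run = [
    TagResult("tag_0001", 1800.0),
    TagResult("tag_0002", 1750.0),
    TagResult("tag_0003", 120.0, error="NANs in hbonding"),
]
print(summarize(run).value, creditable_seconds(run))
```

Keeping the error on the individual tag rather than on the whole run is what lets the good tags still be reported and credited separately.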
Alec Rosa Joined: 11 Nov 08 Posts: 18 Credit: 2,635 RAC: 0 |
P.S.: 19/11/2008 02:40:10|rosetta@home|Starting foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 Again. I should suspend the Rosetta project altogether until this stops happening, right? |
Cobra Joined: 9 Nov 05 Posts: 7 Credit: 16,461,654 RAC: 435 |
Quoting my earlier post: "Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP." To Sid Celery: I have not shared your experience. I've run Rosetta@Home for a couple of years, and have happened to catch a few WUs counting down their final couple of minutes, so I disagree with your "always has" comment. I have also become accustomed over the years to workunits wrapping up in ~2.5-3 hrs nearly 100% of the time. The combination of a "stuck" countdown timer and WUs running ~3-4 times longer than I'm used to was outside my experience and seemed to indicate a problem, so I posted. |
©2024 University of Washington
https://www.bakerlab.org