Minirosetta v1.40 bug thread

Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 57015 - Posted: 17 Nov 2008, 0:42:50 UTC - in response to Message 57011.  
Last modified: 17 Nov 2008, 0:44:27 UTC


I tried re-running this here locally in the lab and it runs just fine, so I'm not sure what went wrong there, I'm afraid :(
Thanks for posting anyway!

Hello Mike Tyka,

Thanks for your reaction, and rerunning this WU.

Have had more errors on this new laptop (restarting & can't acquire lockfile).
These errors might occur due to throttling which I need to keep the fans “silent”.
I've changed some settings to keep the CPU running at a constant frequency.
Downgraded to BOINC 5.10.45, just in case.
Now this machine seems to crunch better: it still restarts occasionally, but the WUs are valid (so far).

Have a nice day,
Path7.
ID: 57015 · Rating: 0
Cobra

Joined: 9 Nov 05
Posts: 7
Credit: 16,461,654
RAC: 530
Message 57018 - Posted: 17 Nov 2008, 2:07:34 UTC

Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs.
ID: 57018 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 57027 - Posted: 17 Nov 2008, 16:38:58 UTC - in response to Message 57018.  

Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs.

I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will.

If the estimate is 3 hours then, even if the WU ends up running exactly 3 hours, the countdown still stops with 10 minutes to go. The WU ends when the model it's working on ends, and the countdown then drops to zero as the WU finishes altogether. If the 1st model ends at 1h 31m then the WU ends, because it'll assume the next model will take the same time and go over the 3 hours. If 2 models complete at 2h 1m it'll do the same, assuming another 1h 0m 30s for the next model. And so on for 3 models at 2h 16m, 4 models at 2h 25m, etc.
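For what it's worth, the rule described above can be written down in a few lines of Python. This is only a sketch, assuming the application extrapolates the average time per completed model and refuses to start a model that would overshoot the target; the function name and the exact comparison are illustrative and not taken from the Rosetta source.

```python
def should_start_another_model(elapsed_min, models_done, target_min):
    """Return True if one more model is expected to fit in the target runtime.

    Hypothetical helper: assumes the next model takes the average time
    of the models completed so far.
    """
    if models_done == 0:
        # Nothing to extrapolate from yet: the first model always runs to the end.
        return True
    avg_per_model = elapsed_min / models_done
    return elapsed_min + avg_per_model <= target_min


TARGET_MIN = 180  # the 3 hour estimate from the post

# (models completed, elapsed minutes) for the four cases in the post
cases = [(1, 91), (2, 121), (3, 136), (4, 145)]
for models_done, elapsed in cases:
    decision = should_start_another_model(elapsed, models_done, TARGET_MIN)
    print(models_done, elapsed, "start another" if decision else "stop here")
```

In all four cases the projected finish lands a little over 180 minutes, so the WU stops after the current model, which matches the numbers in the post.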

To see how many models have been done, click "Show Graphics" in the Boinc Manager. It's shown at the bottom right.

An estimate is an estimate. It's not a set time frame. Don't expect it to be cast in stone because it's not.

Same with all the long-running WUs. They don't end earlier because the first model hasn't even been completed. Don't look at the clock ticking down. As long as the CPU time is clicking up then it's running just fine. If you abort the WU while CPU time is running then it's your look-out. I think my record is about 14 hours.
ID: 57027 · Rating: 0
DJStarfox

Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 2
Message 57028 - Posted: 17 Nov 2008, 17:11:13 UTC - in response to Message 56923.  

Rosetta Mini doesn't always respect BOINC's "Snooze" setting for suspending projects. The weird thing is I had 2 Minis running and when I hit "Snooze", 1 suspended and 1 continued.


Yes, I've had the same problem on occasion with Rosetta Mini 1.40. I understand there are times when the program is "right in the middle of something", but it should perform callback checks to the BOINC API so that it suspends or resumes within a few seconds of the API command.
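To make the suggestion concrete, here is a rough sketch of a worker loop that checks a suspend flag between short chunks of work. Everything in it is hypothetical: `is_suspended` stands in for whatever status query the real application makes through the BOINC API, and the chunking is an assumption about how the work could be split, not a description of how minirosetta is written.

```python
import time


def run_with_suspend_checks(work_chunks, is_suspended, poll_seconds=1.0, sleep=time.sleep):
    """Run each chunk of work, pausing while the (hypothetical) suspend flag is set.

    work_chunks  -- iterable of zero-argument callables, each a short piece of work
    is_suspended -- zero-argument callable returning True while the client wants us paused
    sleep        -- injectable so the loop can be exercised without real delays
    """
    completed = []
    for chunk in work_chunks:
        # Check before every chunk, so a suspend request is honoured
        # within one chunk's duration plus one poll interval.
        while is_suspended():
            sleep(poll_seconds)
        completed.append(chunk())
    return completed


if __name__ == "__main__":
    # Never suspended: both chunks run straight through.
    print(run_with_suspend_checks([lambda: "a", lambda: "b"], lambda: False))
```

The shorter each chunk is, the faster the task reacts to "Snooze"; a task that is "right in the middle of something" corresponds to one long chunk with no check inside it.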
ID: 57028 · Rating: 0
Warren B. Rogers

Joined: 3 Oct 05
Posts: 5
Credit: 1,127,824
RAC: 0
Message 57032 - Posted: 17 Nov 2008, 19:35:54 UTC - in response to Message 57027.  

Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs.

I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will.

If the estimate is 3 hours then, even if the WU ends up running exactly 3 hours, the countdown still stops with 10 minutes to go. The WU ends when the model it's working on ends, and the countdown then drops to zero as the WU finishes altogether. If the 1st model ends at 1h 31m then the WU ends, because it'll assume the next model will take the same time and go over the 3 hours. If 2 models complete at 2h 1m it'll do the same, assuming another 1h 0m 30s for the next model. And so on for 3 models at 2h 16m, 4 models at 2h 25m, etc.

To see how many models have been done, click "Show Graphics" in the Boinc Manager. It's shown at the bottom right.

An estimate is an estimate. It's not a set time frame. Don't expect it to be cast in stone because it's not.

Same with all the long-running WUs. They don't end earlier because the first model hasn't even been completed. Don't look at the clock ticking down. As long as the CPU time is clicking up then it's running just fine. If you abort the WU while CPU time is running then it's your look-out. I think my record is about 14 hours.


Actually, the problem is that when the WU gets to 10ish minutes to go, it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. Not all WUs have this problem: I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a beta have problems getting stuck.

Thanks for the info though,

Warren
ID: 57032 · Rating: 0
DALTON

Joined: 9 Jun 08
Posts: 1
Credit: 250,510
RAC: 0
Message 57034 - Posted: 17 Nov 2008, 21:18:29 UTC - in response to Message 57032.  

Actually, the problem is that when the WU gets to 10ish minutes to go, it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. Not all WUs have this problem: I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a beta have problems getting stuck.

The description by Sid Celery sounds very accurate to me. I've currently got a Mini work unit at 6 hours (3 hours default) and as he says it's still on the first model and ticking up nicely. No problem at all.

If you've got lock file errors I'd hazard a guess that it's not ticking up at all on the CPU Time side. That's the issue. Forget anything to do with the remaining time because that's only ever a complete guess - as likely to be wrong as right.

When you get other errors, the WU falls back to its last save position or the start of the current model within the WU. Maybe it sorts itself out by doing that and that's why it completes quickly after that.

Just my 2 cents
ID: 57034 · Rating: 0
Aegis Maelstrom

Joined: 29 Oct 08
Posts: 61
Credit: 2,137,555
RAC: 0
Message 57035 - Posted: 17 Nov 2008, 21:24:48 UTC - in response to Message 56782.  
Last modified: 17 Nov 2008, 21:26:08 UTC

I wrote:

Task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1

restarted twice so far, now processing:

(...)

Now I am waiting to check if this workunit is endable.


The workunit restarted a third time, seemingly in the same place as the previous time (the percentage "completed" was higher, but I had checked a couple of minutes earlier and it was once again at step 10000 then, so now it was probably at 11000).

The WU started for the fourth time, now with 24%, but I guess it was the same moment as before. When I restarted the WU after temporarily halting it once again, it went back to 17%. Now I can see 18.23% and step 523.

Now I am halting this task and my business with Rosetta.

When BOINC tried to download a different task, I got the following log:
2008-11-09 14:29:23|rosetta@home|Message from server: No work sent
2008-11-09 14:29:23|rosetta@home|Message from server: Your preferences limit memory usage to 452 MB, and 488 MB is needed

The problem seems to be higher memory usage, although one of the mods recently assured us that there is no increase in memory requirements.
I could increase the amount of memory dedicated to BOINC; however, I would like to have this problem explained and ironed out.


Hi,

As I promised, I have come back, increased the memory amount and started to crunch again.

To my surprise, the process has suddenly finished with a "success". The log says:
2008-11-17 21:54:33|rosetta@home|Restarting task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 using minirosetta version 140
2008-11-17 21:56:12|rosetta@home|Computation for task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 finished

As I wrote in the posts above, it is impossible for this task to finish in such a short time. Last time I needed two and a half "physical" hours just to crash, probably due to memory limits that were too low.
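To put a number on "such a short time", the two log lines above can be compared directly. This is just a small sketch; the helper names are made up for illustration, and the timestamp format is assumed to be the `YYYY-MM-DD HH:MM:SS` form shown in this log.

```python
from datetime import datetime

TASK = "IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1"


def parse_log_line(line):
    """Split a 'timestamp|project|message' line and parse the timestamp."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), project, message


def seconds_between(first_line, second_line):
    """Wall-clock seconds elapsed between two log lines."""
    t0, _, _ = parse_log_line(first_line)
    t1, _, _ = parse_log_line(second_line)
    return (t1 - t0).total_seconds()


restart = f"2008-11-17 21:54:33|rosetta@home|Restarting task {TASK} using minirosetta version 140"
finished = f"2008-11-17 21:56:12|rosetta@home|Computation for task {TASK} finished"

print(seconds_between(restart, finished))  # 99.0
```

So the task reported itself finished 99 seconds after the restart, compared with the two and a half hours it previously ran without finishing.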

I would like to notify you that this unit has not been computed properly and that it is probably worth another try. I made a snapshot just before the crash, and this protein looks far better (lower energy, RAC) than any I have seen so far from the old-fashioned ab initio process.

Frankly speaking, I would be more than happy to compute it by myself; unfortunately the client has sent it back. :(

If you could send it to me manually, that would be nice. :) If not, please consider a recomputation of this unit.

I wish you the best of luck with these units, as the one I have seen so far signals a true breakthrough...

a.m.@Poland
ID: 57035 · Rating: 0
Warren B. Rogers

Joined: 3 Oct 05
Posts: 5
Credit: 1,127,824
RAC: 0
Message 57037 - Posted: 18 Nov 2008, 1:13:13 UTC - in response to Message 57034.  

Actually, the problem is that when the WU gets to 10ish minutes to go, it isn't doing any work; it is just stuck. I've had a WU get to the 9 minute 56 second mark and just get stuck there. The longest I've had one stuck is close to 18 hours, and if BOINC gets restarted it will go all the way back to 45 minutes and then only take about 2 1/2 hours to complete after that. And there are always a lot of lock file errors or watchdog reset errors. Not all WUs have this problem: I've had several Rosetta Mini WUs finish without getting stuck at just under 10 minutes, and I can't remember seeing a beta have problems getting stuck.

The description by Sid Celery sounds very accurate to me. I've currently got a Mini work unit at 6 hours (3 hours default) and as he says it's still on the first model and ticking up nicely. No problem at all.

If you've got lock file errors I'd hazard a guess that it's not ticking up at all on the CPU Time side. That's the issue. Forget anything to do with the remaining time because that's only ever a complete guess - as likely to be wrong as right.

When you get other errors, the WU falls back to its last save position or the start of the current model within the WU. Maybe it sorts itself out by doing that and that's why it completes quickly after that.

Just my 2 cents


Also, BOINC will restart the WU if it gets stuck for too long, and the WU will go back to almost the beginning. Then the WU completes in approximately 2 1/2 hours, like a WU that doesn't have any problems. The thing that sucks about that is I only get credit for the time it took to complete the WU, 2 1/2 hours, and the other 7 to 16 hours that my computer was stuck don't get credited. I don't have a problem with working on WUs that take a long time to complete, as most of the projects that I do work for take multiple hours, and my longest is ClimatePrediction.net, which at the moment has been working for 339 hours and still has about 7 hours to go. I just don't like having a WU take up CPU cycles from another WU when it isn't necessary.
ID: 57037 · Rating: 0
Speedy

Joined: 25 Sep 05
Posts: 163
Credit: 808,098
RAC: 0
Message 57038 - Posted: 18 Nov 2008, 5:29:05 UTC
Last modified: 18 Nov 2008, 5:30:35 UTC

Task ID 207716389, a 1d0qA model, does not display any graphics when clicking "Show graphics" or in the screen saver. It has just finished and it's valid. I hope this is of help.
Have a crunching good day!!
ID: 57038 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 10
Message 57039 - Posted: 18 Nov 2008, 8:15:37 UTC

Mod or team... what is this "recovering checkpoint" thing that is showing up in some tasks? See my thread further down the list showing 4 tasks that completed OK but gave checkpoint messages. Speedy's task also showed the same thing: it completed OK but gives a recovering checkpoint message.
ID: 57039 · Rating: 0
Conan

Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 3,034
Message 57040 - Posted: 18 Nov 2008, 10:46:16 UTC

This task ran for 20 hours and was terminated by the BOINC watchdog.
I know that Rosetta is a low-paying project at the best of times, so I will be satisfied (I have to, don't I?) with the 80 credits I received (4 cr/hr).

# cpu_run_time_pref: 21600
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 71942.5 seconds. Greater than 3X preferred time: 21600 seconds
**********************************************************************
called boinc_finish
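The numbers in this log can be checked with a few lines of Python. This is a sketch that assumes the watchdog rule is exactly what the message states (CPU time greater than 3 times the preferred run time); the function name is hypothetical, and the real watchdog may apply additional conditions.

```python
def watchdog_should_end(cpu_seconds, preferred_seconds, factor=3):
    """Hypothetical check: has the task used more than `factor` times its preferred CPU time?"""
    return cpu_seconds > factor * preferred_seconds


cpu_seconds = 71942.5      # "CPU time" from the log
preferred_seconds = 21600  # cpu_run_time_pref from the log (6 hours)

print(3 * preferred_seconds)                                # 64800
print(watchdog_should_end(cpu_seconds, preferred_seconds))  # True

hours = cpu_seconds / 3600
print(round(hours, 1))       # 20.0
print(round(80 / hours, 1))  # 4.0
```

The task had passed the 64800 second limit, and 80 credits over roughly 20 hours works out to the 4 cr/hr quoted above.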
ID: 57040 · Rating: 0
mikylinux

Joined: 25 Jul 07
Posts: 3
Credit: 73,155
RAC: 0
Message 57042 - Posted: 18 Nov 2008, 11:30:19 UTC


The tasks

cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_flua_olange_4728_19390_0

and

1bm8__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1bm8_-_4768_9_0

do not stop. They have been running for 14 hours; a task usually takes 4 hours.
Aborting the work...

ID: 57042 · Rating: 0
(_KoDAk_)

Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 57043 - Posted: 18 Nov 2008, 12:36:16 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=206369194
ID: 57043 · Rating: 0
ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 57044 - Posted: 18 Nov 2008, 13:07:28 UTC - in response to Message 57042.  


The tasks

cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_flua_olange_4728_19390_0

and

1bm8__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1bm8_-_4768_9_0

do not stop. They have been running for 14 hours; a task usually takes 4 hours.
Aborting the work...



I observed on my MacBook this morning (it's working by itself in peace and quiet) that the cs_jumping WUs appear to complete normally, but seem (according to the message window) to restart once or twice during the computation without an obvious explanation.
ID: 57044 · Rating: 0
(_KoDAk_)

Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 57046 - Posted: 18 Nov 2008, 21:36:58 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=206938154
https://boinc.bakerlab.org/rosetta/result.php?resultid=207138889
https://boinc.bakerlab.org/rosetta/result.php?resultid=207121456
https://boinc.bakerlab.org/rosetta/result.php?resultid=207114809
https://boinc.bakerlab.org/rosetta/result.php?resultid=206990578
https://boinc.bakerlab.org/rosetta/result.php?resultid=206946754
https://boinc.bakerlab.org/rosetta/result.php?resultid=206944736
https://boinc.bakerlab.org/rosetta/result.php?resultid=206831871

ID: 57046 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57049 - Posted: 19 Nov 2008, 0:14:20 UTC

ID: 57049 · Rating: 0
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57050 - Posted: 19 Nov 2008, 0:31:49 UTC

Hello, new here, sorry guys, I come to bitch.

Having to manually abort every Rosetta Mini 1.40 task, so that I'm not wasting CPU time and energy, is a bitch.

Just sayin'.
ID: 57050 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1229
Credit: 14,172,067
RAC: 898
Message 57051 - Posted: 19 Nov 2008, 1:31:46 UTC - in response to Message 57049.  
Last modified: 19 Nov 2008, 1:38:35 UTC

Here are some more NANs in hbonding errors from h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25 WUs:

https://boinc.bakerlab.org/rosetta/result.php?resultid=208041354
https://boinc.bakerlab.org/rosetta/result.php?resultid=207922933
https://boinc.bakerlab.org/rosetta/result.php?resultid=207915448
https://boinc.bakerlab.org/rosetta/result.php?resultid=207873078


I bet you'd like it if, in addition to reporting the error for the tag with the error, v1.41 also had the capability of reporting the good results for the previous tags, with separate credit calculations for each tag.

That would, however, probably require adding a new outcome state indicating partially successful.
ID: 57051 · Rating: 0
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57053 - Posted: 19 Nov 2008, 3:26:31 UTC
Last modified: 19 Nov 2008, 3:31:16 UTC

P.S.:
19/11/2008 02:40:10|rosetta@home|Starting foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0
19/11/2008 02:40:14|rosetta@home|Starting task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 using minirosetta version 140
19/11/2008 02:56:51|rosetta@home|Restarting task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 using minirosetta version 140
19/11/2008 02:57:32|rosetta@home|Task foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t312__olange_IGNORE_THE_REST_1XV2A_5_4741_186_0 exited with zero status but no 'finished' file
19/11/2008 02:57:32|rosetta@home|If this happens repeatedly you may need to reset the project.
.
.
.

Again. I should suspend the Rosetta project altogether until this stops happening, right?
ID: 57053 · Rating: 0
Cobra

Joined: 9 Nov 05
Posts: 7
Credit: 16,461,654
RAC: 530
Message 57054 - Posted: 19 Nov 2008, 4:31:18 UTC - in response to Message 57027.  

Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will.

I have not shared your experience. I've run Rosetta@Home for a couple of years, and have happened to catch a few WUs counting down their final couple of minutes, so I disagree with your "always has" comment.

I have also become accustomed over the years to workunits wrapping up in ~2.5-3 hrs nearly 100% of the time. The combination of a "stuck" countdown timer and WUs going ~3-4 times longer than I'm used to was behavior outside of my experience and seemed to indicate a problem, so I posted.
ID: 57054 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org