Message boards : Number crunching : Problems with Rosetta version 5.41
Author | Message |
---|---|
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
To the many users who have been posting about graphics issues, thanks very much for the thorough reports. Chu is now able to reproduce some of these problems on our local Windows machine, and we have some good ideas to fix the problems, based on your posts. Our tentative plan is the following: (1) We want a reasonably stable release to run over Christmas. To that end, we'll be testing an app on ralph tonight that has some of the new graphics "features" turned off. These features include the ability to rotate separate conformations with the mouse, and the display of side chains. If you report fewer crashes, we'll update rosetta@home with this "simplified" version at least over the holidays. (2) In parallel, Chu and I will be testing an alternative communication protocol between rosetta and the boinc graphics manager which should hopefully be far more robust to memory faults. We'll test this new protocol in the new year, at which point we'll put the features back in! (3) Beyond that, Phil and I are testing new modes of Rosetta that involve nucleic acids -- DNA and RNA. There are some pretty cool applications, including designing proteins for gene therapy. We are developing the graphics for these modes, and will be working intensely in January to make sure they don't cause crashes. I'll post something here and on ralph asking for feedback on the temporary "simplified graphics" version of Rosetta tonight... hopefully some of you can help us confirm that it causes fewer crashes. Hi all. One more data point re graphics-related failures. |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
Is this related to the "graphics issues" that are being talked about? Problem? My end? Rosetta's end? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Problem? My end? Rosetta's end? Is this related to the "graphics issues" that are being talked about? I suspect so. The only time I see the watchdog kick in is when I'm trying to tempt fate with the screensaver. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
Verrrry interesting.... My screensaver is set to "Blank", and I very rarely display the Rosetta graphics... Although, I will note that last Friday I DID switch from integrated graphics on the Compaq sr2030nx to the $50 SlickDeals special, an XFX GeForce 7600GS 256MB PCI Express. Hmmmm........ Problem? My end? Rosetta's end? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Verrrry interesting.... ...in that case, I revert to my more standard response. The watchdog is there to protect you from a work unit spinning time away and not making progress. I've been seeing this occur on some of the WUs that fail due to the screen saver problems. But the original purpose of the watchdog still exists. There are times, due to bugs or in trying to cover all the bases, that it is possible for non-productive loops to occur. When the watchdog detects that no progress (as measured by the current Rosetta score) is being made, it ends the work unit, reporting any completed models. I think of Rosetta as being like a hound dog, sniffing out the best model. And sometimes the rabbit seems to have run around and around in a circle, and the hound dog doesn't know where to exit the circle to continue the chase. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
To all screensaver aficionados who are having problems with rosetta graphics: If you have a surefire way to crash rosetta -- say by moving the mouse a lot, or by keeping the screensaver on too long, or increasing the frame rate -- can you possibly attach your project to ralph, and let us know if it's more stable than rosetta@home? Please post comments here. Over on ralph, we have turned off some of the features that we think are causing crashes (display of sidechains and mouse rotation) until we can fix them properly. If ralph is stable we will turn off those features here at rosetta@home too. Thanks! Verrrry interesting.... |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Correct. It does not seem to be graphics-related. That is a stuck WU, and the watchdog was launched to end it and avoid further waste of your CPU time. This type of error seems to happen randomly. Verrrry interesting.... |
daniels Send message Joined: 3 Jul 06 Posts: 7 Credit: 13,439 RAC: 0 |
guys, sorry to bother u again... this time my work unit got stuck at 1 h 09 min and 31.630% ... i've kept it running for 5 days, but no progress in cpu time or perc... i think i am going to suspend the project, because it's consuming my resources and nothing happens... the unit appears to be running from time to time.... this is a grep from stdoutdae.txt : cat stdoutdae.txt | grep rosetta 2006-12-09 03:19:57 [rosetta@home] Starting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541 2006-12-09 04:20:00 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory) 2006-12-09 05:20:48 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541 2006-12-09 05:20:55 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory) 2006-12-09 06:21:15 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541 2006-12-09 07:21:15 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory) 2006-12-09 07:21:16 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (process exited with code 131 (0x83)) 2006-12-09 07:21:16 [rosetta@home] Deferring scheduler requests for 1 minutes and 0 seconds 2006-12-09 07:21:16 [rosetta@home] Computation for task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 finished 2006-12-09 09:12:52 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi 2006-12-09 09:12:52 [rosetta@home] Reason: To fetch work 2006-12-09 09:12:52 [rosetta@home] Requesting 8640 seconds of new work, and reporting 1 completed tasks 2006-12-09 09:12:57 [rosetta@home] Scheduler request succeeded 2006-12-09 09:12:59 [rosetta@home] 
Started download of file BAR_R13_R43_cc1ctf_03_05.200_v1_3.gz 2006-12-09 09:12:59 [rosetta@home] Started download of file BAR_R13_R43_cc1ctf_09_05.200_v1_3.gz 2006-12-09 09:13:03 [rosetta@home] Finished download of file BAR_R13_R43_cc1ctf_03_05.200_v1_3.gz 2006-12-09 09:13:03 [rosetta@home] Throughput 185433 bytes/sec 2006-12-09 09:13:03 [rosetta@home] Started download of file 1ctf__R13_R43_cheat.bar 2006-12-09 09:13:04 [rosetta@home] Finished download of file 1ctf__R13_R43_cheat.bar 2006-12-09 09:13:04 [rosetta@home] Throughput 142 bytes/sec 2006-12-09 09:13:07 [rosetta@home] Finished download of file BAR_R13_R43_cc1ctf_09_05.200_v1_3.gz 2006-12-09 09:13:07 [rosetta@home] Throughput 233371 bytes/sec 2006-12-09 09:28:09 [rosetta@home] Starting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 using rosetta version 541 2006-12-09 10:28:21 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-09 12:32:02 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-09 15:32:03 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-09 17:53:08 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-09 21:14:30 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-09 23:16:51 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 01:21:26 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 03:24:06 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 05:38:48 [rosetta@home] Pausing 
task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 06:53:03 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 07:53:25 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 08:02:20 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 09:03:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-10 09:13:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2 and goes like this until: 2006-12-13 00:38:11 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 00:39:12 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 01:39:44 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 01:59:36 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 03:22:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 03:22:37 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 04:23:26 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 04:27:17 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 05:28:31 [rosetta@home] Pausing task 
BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 05:28:57 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 06:31:09 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 06:32:21 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 07:32:37 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 07:33:00 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 09:33:14 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 11:35:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) bash-3.1# later edit: i have restarted boinc and now it's working again but with less cpu time spent 56 min and the same perc... i think soon i wil receive that error again... i will come back with more update 2006-12-13 11:35:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 12:59:01 [rosetta@home] URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 369002; location: ; project prefs: default 2006-12-13 13:00:03 [rosetta@home] Deferring task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 2006-12-13 13:00:03 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 using rosetta version 541 |
daniels Send message Joined: 3 Jul 06 Posts: 7 Credit: 13,439 RAC: 0 |
guys, sorry to bother u again... this time my work unit got stuck at 1 h 09 min and 31.630% ... i've kept it running for 5 days, but no progress in cpu time or perc... i think i am going to suspend the project, because it's consuming my resources and nothing happens... the unit appears to be running from time to time.... as i expected: this WU crashed too: 2006-12-13 14:00:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory) 2006-12-13 14:00:05 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (process exited with code 131 (0x83)) 2006-12-13 14:00:05 [rosetta@home] Deferring scheduler requests for 1 minutes and 0 seconds 2006-12-13 14:00:05 [rosetta@home] Computation for task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 finished bash-3.1# |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Daniels, by removing the task from memory every hour, you're throwing away a lot of good work. I don't know about that specific task, but it is not uncommon for some tasks to need more than an hour to reach a checkpoint. If no checkpoint is reached in the hour, and you remove from memory, then it would be restarting from the same point each hour. Rosetta has a "watchdog" which should have detected such an event. And if it restarts from the same point... I think it is 5 times in a row, then the watchdog will end the task for you and report it back. You will want to display the graphics for the WU and check the model number shown. Over time the model number should increment. And within each model the steps should be counting up. Faster on some tasks than others. But moving at least every minute. What do you see when you display the graphics? Suggest you go to your General Preferences and specify YES to leave in memory while preempted. This just means "keep the work you have done so far, so when you restart, we pick up where we left off". It keeps all the active information in virtual memory. If you end BOINC or turn off the PC, you still lose it, but it doesn't appear you are doing that. The other approach, if you prefer, would be to change your setting in the General Preferences for how often to switch between tasks. The default is an hour, but you would get more work done if you bump it to 3 or 4 hours. That would give enough time for Rosetta to reach a checkpoint, and ensure that even if you do remove from memory, it will be making meaningful and permanent progress each time it runs. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
daniels Send message Joined: 3 Jul 06 Posts: 7 Credit: 13,439 RAC: 0 |
Daniels, by removing the task from memory every hour, you're throwing away a lot of good work. I don't know about that specific task, but it is not uncommon for some tasks to need more than an hour to reach a checkpoint. If no checkpoint is reached in the hour, and you remove from memory, then it would be restarting from the same point each hour. Rosetta has a "watchdog" which should have detected such an event. And if it restarts from the same point... I think it is 5 times in a row, then the watchdog will end the task for you and report it back. it never happened before, but with the new version, it is happening at every work unit... i will do as u say and see what is going on... |
Jnargus Send message Joined: 4 Oct 06 Posts: 5 Credit: 7,935,482 RAC: 190 |
I too was having the same problem as Daniels. I have done what you suggested and it seems to be working much better now. The only question I have is why I had the problem only on my Linux boxes and not on my WinXP boxes. My Rosetta WUs were not restarting after they had been preempted by one of the other projects. The Boinc manager said they were running but there was no message saying they had been restarted. None of the WUs I aborted were getting stopped by the watchdog, probably because they were not getting restarted properly in the first place. For some other reason I am unable to see the graphics, probably because I don't have the right driver for the video cards. This is not a problem for me as I just let the machines slave away at their WUs. Daniels, by removing the task from memory every hour, you're throwing away a lot of good work. I don't know about that specific task, but it is not uncommon for some tasks to need more than an hour to reach a checkpoint. If no checkpoint is reached in the hour, and you remove from memory, then it would be restarting from the same point each hour. Rosetta has a "watchdog" which should have detected such an event. And if it restarts from the same point... I think it is 5 times in a row, then the watchdog will end the task for you and report it back. |
Joachim Send message Joined: 26 Nov 06 Posts: 5 Credit: 518,439 RAC: 17 |
just for test, i have set that setting to 2 hours and after it reached that period it just stopped doing anything, like the last time, and got 70% done... i will increase the period to 4 hours, to have the task completed... but i think this is not a solution... the watchdog is not restarting the application, it just keeps it in memory while it is not doing anything... the other applications are working properly... i think someone should verify this... i am not using graphics also... I've seen the same phenomenon on my computer under Linux (SuSE 10.0). TOP says the four threads of rosetta are in memory but use 0% CPU. Joachim Dinos are not dead. They are alive and well and living in data centers all around you. They speak in tongues and work strange magics with computers. Beware the dino! |
hedera Send message Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 12 |
I got curious to see whether the recurring crashes were the graphics code or the screen saver code. So, while using a different screen saver, I displayed graphics in a window and left it up. Crashed within an hour with this error: 12/16/2006 10:20:35 AM|rosetta@home|Unrecoverable error for result 1urnA_BOINC_POSE_ABRELAX_NEWRELAXFLAGS_frags83__1449_111_0 ( - exit code -1073741819 (0xc0000005)) I guess it's the graphics code itself, since the screen saver wasn't running. --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Thanks for the test, and that is consistent with what we have learned from users' reports and the local tests we have done. That is why we temporarily disabled some advanced graphics features, such as drawing side chains and zooming and rotating proteins, in the current 5.43 application, in order to at least alleviate the problems that have caused inconvenience on the client side. We will turn those features back on once we figure out a permanent solution, and hopefully that won't take too long. Thanks again for everyone's help. I got curious to see whether the recurring crashes were the graphics code or the screen saver code. So, while using a different screen saver, I displayed graphics in a window and left it up. Crashed within an hour with this error: |
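
For anyone trying to picture the watchdog Feet1st describes in the thread above, here is a minimal sketch of the general idea: check the score periodically and end the task once it has gone a number of checks without improving. This is not Rosetta's actual watchdog code. The class name, the `max_stalled_checks` threshold of 5, and the "lower score is better" convention are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Watchdog:
    """Hypothetical stall detector; NOT Rosetta's real watchdog."""
    max_stalled_checks: int = 5          # illustrative threshold, an assumption
    best_score: Optional[float] = None   # best (lowest) score seen so far
    stalled_checks: int = 0              # consecutive checks without improvement

    def check(self, score: float) -> bool:
        """Record one periodic check; return True when the task should be ended."""
        # Assumption for this sketch: a lower score means progress.
        if self.best_score is None or score < self.best_score:
            self.best_score = score
            self.stalled_checks = 0
        else:
            self.stalled_checks += 1
        return self.stalled_checks >= self.max_stalled_checks


def run_until_stalled(scores: Iterable[float], watchdog: Watchdog) -> Optional[int]:
    """Feed scores to the watchdog; return the index where it fires, or None."""
    for i, score in enumerate(scores):
        if watchdog.check(score):
            return i
    return None
```

A real watchdog would presumably also report any completed models before ending the work unit, as described above; that part is left out of the sketch.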
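
The `grep` output daniels posted above is hard to read by eye, so a small script that tallies the interesting events per task may help when reporting a stuck WU. The sketch below assumes that `stdoutdae.txt` holds one entry per line in the same shape as the excerpts quoted in this thread; the regular expressions and the `summarize` helper are hypothetical and only cover the Linux-style "process exited with code" message, not the Windows-style line hedera posted.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the excerpts quoted in this thread:
#   <date> <time> [rosetta@home] Pausing task <name> (removed from memory)
#   <date> <time> [rosetta@home] Unrecoverable error for result <name>
#       (process exited with code 131 (0x83))
PAUSE_RE = re.compile(
    r"\[rosetta@home\] Pausing task (\S+) \(removed from memory\)"
)
ERROR_RE = re.compile(
    r"\[rosetta@home\] Unrecoverable error for result (\S+) "
    r"\(process exited with code (\d+) \((0x[0-9a-fA-F]+)\)\)"
)


def summarize(lines):
    """Return (pause count per task, {task: exit code}) for the given log lines."""
    pauses = Counter()
    errors = {}
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses[match.group(1)] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors[match.group(1)] = int(match.group(2))
    return pauses, errors
```

To try it on a real log, something like `summarize(open("stdoutdae.txt"))` should work, since iterating over a file yields its lines. A task with many "removed from memory" pauses followed by an unrecoverable error matches the pattern discussed in the thread.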
©2025 University of Washington
https://www.bakerlab.org