Problems with Rosetta version 5.41

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 32473 - Posted: 11 Dec 2006, 20:53:05 UTC - in response to Message 32150.  

To the many users who have been posting about graphics issues, thanks very much for the thorough reports. Chu is now able to reproduce some of these problems on our local Windows machine, and we have some good ideas to fix the problems, based on your posts. Our tentative plan is the following:

(1) We want a reasonably stable release to run over Christmas. To that end, we'll be testing an app on ralph tonight that has some of the new graphics "features" turned off. These features include the ability to rotate separate conformations with the mouse, and the display of side chains.
If you report fewer crashes, we'll update rosetta@home with this "simplified" version at least over the holidays.

(2) In parallel, Chu and I will be testing an alternative communication protocol between rosetta and the boinc graphics manager which should hopefully be far more robust to memory faults. We'll test this new protocol in the new year, at which point we'll put back in the features!

(3) Beyond that, Phil and I are testing new modes of Rosetta that involve nucleic acids -- DNA and RNA. There are some pretty cool applications, including designing proteins for gene therapy. We are developing the graphics for these modes, and will be working intensely in January to make sure they don't cause crashes.

I'll post something here and on ralph asking for feedback on the temporary "simplified graphics" version of Rosetta tonight... hopefully some of you can help us confirm that it causes fewer crashes.

Hi all. One more data point re graphics-related failures.

ResultID 50704727

Dual AMD AthlonMP 2000+, 512 MB RAM, ATI Radeon 9250, WinXP Pro (SP2), no screen saver (blank screen)

stderr out looks like this:
<core_client_version>5.4.9</core_client_version>
<message>
- exit code 1073807364 (0x40010004)
</message>
<stderr_txt>
# random seed: 2534060
# cpu_run_time_pref: 21600

</stderr_txt>

I had popped up the graphics display and was rotating and zooming the native structure when the window stopped responding. However, BoincView still showed CPU time accumulating, so I believe the science app was still running. I forced the graphics window to close and it crashed the science app. This is the first time anything like this has happened on this computer.

Regards,

-- Tony


ID: 32473
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 32477 - Posted: 11 Dec 2006, 21:48:23 UTC - in response to Message 32466.  
Last modified: 11 Dec 2006, 21:48:51 UTC

Is this related to the "graphics issues" that are being talked about?

Problem? My end? Rosetta's end?

Result ID 51580518

CPU time 17330.796875
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
# random seed: 1782254
# cpu_run_time_pref: 21600
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 3.29644 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .cc1tig.out

</stderr_txt>


Validate state Valid
Claimed credit 62.638971651533
Granted credit 20

ID: 32477
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32479 - Posted: 11 Dec 2006, 23:31:52 UTC - in response to Message 32477.  
Last modified: 11 Dec 2006, 23:32:21 UTC

Problem? My end? Rosetta's end?

Result ID 51580518

Is this related to the "graphics issues" that are being talked about?



I suspect so. The only time I see the watchdog kick in is when I'm trying to tempt fate with the screensaver.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 32479
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 32480 - Posted: 11 Dec 2006, 23:55:25 UTC - in response to Message 32479.  
Last modified: 11 Dec 2006, 23:57:20 UTC

Verrrry interesting....

My screensaver is set to "Blank", and I very rarely display the Rosetta graphics...

Although, I will note that last Friday I DID switch from integrated graphics on the Compaq sr2030nx, to the $50 SlickDeals special, XFX GeForce 7600GS 256MB PCI Express.

Hmmmm........

ID: 32480
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32483 - Posted: 12 Dec 2006, 2:07:00 UTC - in response to Message 32480.  
Last modified: 12 Dec 2006, 2:13:23 UTC

Verrrry interesting....
My screensaver is set to "Blank"

...in that case, I revert to my more standard response. The watchdog is there to protect you from a work unit spinning time away and not making progress. I've been seeing this occur on some of the WUs that fail due to the screen saver problems. But the original purpose of the watchdog still exists. There are times, due to bugs or in trying to cover all the bases, that it is possible for non-productive loops to occur. When the watchdog detects that no progress (as measured by the current Rosetta score) is being made, it ends the work unit and reports any completed models.

I think of Rosetta as being like a hound dog. Sniffing out the best model. And sometimes the rabbit seems to have run around and around in a circle, and the hound dog doesn't know where to exit the circle to continue the chase.
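
For anyone curious how a stall check like this might work, here is a minimal sketch in Python. It is not the actual Rosetta watchdog code: the class name `ScoreWatchdog`, the `report` method, and the injectable clock are all made up for illustration. The only detail taken from this thread is the 3600-second limit quoted in the stderr output ("Stuck at score 3.29644 for 3600 seconds").

```python
import time


class ScoreWatchdog:
    """Hypothetical stall detector; NOT the real Rosetta watchdog."""

    def __init__(self, stall_limit_s=3600.0, clock=time.monotonic):
        self.stall_limit_s = stall_limit_s  # assumed limit, from the stderr message
        self.clock = clock                  # injectable so the logic can be tested
        self.last_score = None
        self.last_change = clock()

    def report(self, score):
        """Record the current score; return True if the run should be ended."""
        now = self.clock()
        if score != self.last_score:
            # The score moved, so progress is being made: reset the timer.
            self.last_score = score
            self.last_change = now
            return False
        # Same score as before: stuck only if it has not moved for too long.
        return (now - self.last_change) >= self.stall_limit_s


# Usage with a fake clock (seconds), so nothing actually waits an hour.
fake_now = [0.0]
dog = ScoreWatchdog(stall_limit_s=3600, clock=lambda: fake_now[0])
dog.report(3.29644)                # first report starts the timer
fake_now[0] = 1800.0
halfway = dog.report(3.29644)      # unchanged for 30 minutes: keep going
fake_now[0] = 3600.0
stuck = dog.report(3.29644)        # unchanged for 60 minutes: end the run
```

A real watchdog would presumably run in its own thread and write out any completed models before exiting; this sketch only shows the "no change in score for N seconds" test.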
ID: 32483
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 32486 - Posted: 12 Dec 2006, 2:59:59 UTC - in response to Message 32483.  

To all screensaver aficionados who are having problems with rosetta graphics:

If you have a surefire way to crash rosetta -- say by moving the mouse a lot, or by keeping the screensaver on too long, or increasing the frame rate -- can you possibly attach your project to ralph, and let us know if it's more stable than rosetta@home? Please post comments here. Over on ralph, we have turned off some of the features that we think are causing crashes (display of sidechains and mouse rotation) until we can fix them properly. If ralph is stable we will turn off those features here at rosetta@home too. Thanks!


ID: 32486
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 32533 - Posted: 12 Dec 2006, 19:39:26 UTC - in response to Message 32483.  

Correct. It does not seem to be graphics-related. That is a stuck WU, and the watchdog was launched to end it to avoid wasting more of your CPU time. This type of error seems to happen randomly.

ID: 32533
daniels
Joined: 3 Jul 06
Posts: 7
Credit: 13,439
RAC: 0
Message 32579 - Posted: 13 Dec 2006, 10:48:34 UTC
Last modified: 13 Dec 2006, 11:05:37 UTC

guys, sorry to bother u again... this time my work unit got stuck at 1 h 09 min and 31.630% ... i've kept it running for 5 days, but no progress in cpu time or perc... i think i am going to suspend the project, because it is consuming my resources and nothing happens... the unit appears to be running from time to time....
this is a grep from stdoutdae.txt :
cat stdoutdae.txt | grep rosetta

2006-12-09 03:19:57 [rosetta@home] Starting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541
2006-12-09 04:20:00 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory)
2006-12-09 05:20:48 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541
2006-12-09 05:20:55 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory)
2006-12-09 06:21:15 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 using rosetta version 541
2006-12-09 07:21:15 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (removed from memory)
2006-12-09 07:21:16 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 (process exited with code 131 (0x83))
2006-12-09 07:21:16 [rosetta@home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-12-09 07:21:16 [rosetta@home] Computation for task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0 finished
2006-12-09 09:12:52 [rosetta@home] Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
2006-12-09 09:12:52 [rosetta@home] Reason: To fetch work
2006-12-09 09:12:52 [rosetta@home] Requesting 8640 seconds of new work, and reporting 1 completed tasks
2006-12-09 09:12:57 [rosetta@home] Scheduler request succeeded
2006-12-09 09:12:59 [rosetta@home] Started download of file BAR_R13_R43_cc1ctf_03_05.200_v1_3.gz
2006-12-09 09:12:59 [rosetta@home] Started download of file BAR_R13_R43_cc1ctf_09_05.200_v1_3.gz
2006-12-09 09:13:03 [rosetta@home] Finished download of file BAR_R13_R43_cc1ctf_03_05.200_v1_3.gz
2006-12-09 09:13:03 [rosetta@home] Throughput 185433 bytes/sec
2006-12-09 09:13:03 [rosetta@home] Started download of file 1ctf__R13_R43_cheat.bar
2006-12-09 09:13:04 [rosetta@home] Finished download of file 1ctf__R13_R43_cheat.bar
2006-12-09 09:13:04 [rosetta@home] Throughput 142 bytes/sec
2006-12-09 09:13:07 [rosetta@home] Finished download of file BAR_R13_R43_cc1ctf_09_05.200_v1_3.gz
2006-12-09 09:13:07 [rosetta@home] Throughput 233371 bytes/sec
2006-12-09 09:28:09 [rosetta@home] Starting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 using rosetta version 541
2006-12-09 10:28:21 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-09 12:32:02 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-09 15:32:03 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-09 17:53:08 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-09 21:14:30 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-09 23:16:51 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 01:21:26 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 03:24:06 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 05:38:48 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 06:53:03 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 07:53:25 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 08:02:20 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 09:03:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-10 09:13:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2

and goes like this until:

2006-12-13 00:38:11 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 00:39:12 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 01:39:44 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 01:59:36 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 03:22:18 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 03:22:37 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 04:23:26 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 04:27:17 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 05:28:31 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 05:28:57 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 06:31:09 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 06:32:21 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 07:32:37 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 07:33:00 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 09:33:14 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 11:35:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
bash-3.1#


later edit:

i have restarted boinc and now it's working again but with less cpu time spent (56 min) and the same perc... i think soon i will receive that error again... i will come back with more updates

2006-12-13 11:35:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 12:59:01 [rosetta@home] URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 369002; location: ; project prefs: default
2006-12-13 13:00:03 [rosetta@home] Deferring task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0
2006-12-13 13:00:03 [rosetta@home] Restarting task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 using rosetta version 541
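
A log like the one above is easier to read as counts per task than line by line. The sketch below is one rough way to do that in Python. The function name `summarize` and the regular expressions are assumptions based only on the line format visible in this post (timestamp, project in square brackets, message); other BOINC client versions or platforms may word these messages differently, and the Windows-style "exit code" message quoted elsewhere in this thread would not match the error pattern here.

```python
import re
from collections import Counter

# Patterns inferred from the log lines pasted in this thread (an assumption,
# not an official description of the BOINC log format).
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[([^\]]+)\] (.*)$")
TASK_RE = re.compile(r"^(Starting|Restarting|Pausing) task (\S+)")
ERROR_RE = re.compile(
    r"^Unrecoverable error for result (\S+) \(process exited with code (\d+)"
)


def summarize(lines, project="rosetta@home"):
    """Count start/restart/pause events per task and collect exit codes."""
    events = {}      # task name -> Counter of event kinds
    exit_codes = {}  # task name -> integer exit code
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group(2) != project:
            continue  # skip other projects and truncated lines
        message = m.group(3)
        task = TASK_RE.match(message)
        if task:
            events.setdefault(task.group(2), Counter())[task.group(1)] += 1
            continue
        error = ERROR_RE.match(message)
        if error:
            exit_codes[error.group(1)] = int(error.group(2))
    return events, exit_codes


# The first task from the log above, copied verbatim.
T = "BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R38_filters_1441_67_0"
SAMPLE = [
    f"2006-12-09 03:19:57 [rosetta@home] Starting task {T} using rosetta version 541",
    f"2006-12-09 04:20:00 [rosetta@home] Pausing task {T} (removed from memory)",
    f"2006-12-09 05:20:48 [rosetta@home] Restarting task {T} using rosetta version 541",
    f"2006-12-09 05:20:55 [rosetta@home] Pausing task {T} (removed from memory)",
    f"2006-12-09 06:21:15 [rosetta@home] Restarting task {T} using rosetta version 541",
    f"2006-12-09 07:21:15 [rosetta@home] Pausing task {T} (removed from memory)",
    f"2006-12-09 07:21:16 [rosetta@home] Unrecoverable error for result {T} "
    "(process exited with code 131 (0x83))",
]
events, exit_codes = summarize(SAMPLE)
```

To run it on a real file, something like `summarize(open("stdoutdae.txt"))` should work. A task with many "Pausing" events and few or no "Restarting" events is the pattern worth reporting.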

ID: 32579
daniels
Joined: 3 Jul 06
Posts: 7
Credit: 13,439
RAC: 0
Message 32584 - Posted: 13 Dec 2006, 13:33:16 UTC - in response to Message 32579.  
Last modified: 13 Dec 2006, 13:33:40 UTC

as i expected: this WU crashed too:

2006-12-13 14:00:04 [rosetta@home] Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (removed from memory)
2006-12-13 14:00:05 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 (process exited with code 131 (0x83))
2006-12-13 14:00:05 [rosetta@home] Deferring scheduler requests for 1 minutes and 0 seconds
2006-12-13 14:00:05 [rosetta@home] Computation for task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R13_R43_filters_1441_145_0 finished
bash-3.1#



ID: 32584
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32589 - Posted: 13 Dec 2006, 15:33:52 UTC

Daniels, by removing the task from memory every hour, you're throwing away a lot of good work. I don't know about that specific task, but it is not uncommon for some tasks to need more than an hour to reach a checkpoint. If no checkpoint is reached in the hour, and you remove from memory, then it would be restarting from the same point each hour. Rosetta has a "watchdog" which should have detected such an event. And if it restarts from the same point... I think it is 5 times in a row, then the watchdog will end the task for you and report it back.

You will want to display the graphics for the WU and check the model number shown. Over time the model number should increment. And within each model the steps should be counting up. Faster on some tasks than others. But moving at least every minute. What do you see when you display the graphic?

Suggest you go to your General Preferences and specify YES to leave in memory while preempted. This just means "keep the work you have done so far, so when you restart, we pick up where we left off". It keeps all the active information in virtual memory. If you end BOINC or turn off the PC, you still lose it, but it doesn't appear you are doing that.

The other approach, if you prefer, would be to change your setting in the General Preferences for how often to switch between tasks. The default is an hour, but you would get more work done if you bump it to 3 or 4 hours. That would give enough time for Rosetta to reach a checkpoint, and assure that even if you do remove from memory, it will be making meaningful and permanent progress each time it runs.
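
To put numbers on the two options above, here is a toy model in Python. It is only a sketch: the function name `retained_minutes` is made up, and the 70-minute checkpoint interval is an assumed figure standing in for "more than an hour", not something measured from a real task. It ignores shutdowns and assumes each run restarts exactly from the last checkpoint.

```python
def retained_minutes(slice_min, checkpoint_min, slices, leave_in_memory):
    """Minutes of work still usable after `slices` run periods.

    Toy model: each time the task is preempted it has run `slice_min`
    minutes. If it stays in memory, all of that work is kept. If it is
    removed from memory, only whole checkpoint intervals completed during
    that run survive; the rest is redone next time.
    """
    kept = 0
    for _ in range(slices):
        if leave_in_memory:
            kept += slice_min
        else:
            kept += (slice_min // checkpoint_min) * checkpoint_min
    return kept


# Assumed numbers: a checkpoint every 70 minutes, five run periods.
default_switch = retained_minutes(60, 70, 5, leave_in_memory=False)   # -> 0
keep_in_memory = retained_minutes(60, 70, 5, leave_in_memory=True)    # -> 300
longer_switch = retained_minutes(240, 70, 5, leave_in_memory=False)   # -> 1050
```

With a one-hour switch interval and removal from memory, the task in this model never gets anywhere, which matches a progress percentage that does not move. Either leaving it in memory or switching every 4 hours lets work accumulate.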
ID: 32589
daniels
Joined: 3 Jul 06
Posts: 7
Credit: 13,439
RAC: 0
Message 32630 - Posted: 14 Dec 2006, 10:36:51 UTC - in response to Message 32589.  

it never happened before, but with the new version, it is happening with every work unit... i will do as u say and see what is going on...
ID: 32630
Jnargus
Joined: 4 Oct 06
Posts: 5
Credit: 6,875,640
RAC: 2,062
Message 32673 - Posted: 15 Dec 2006, 2:27:31 UTC - in response to Message 32589.  

I too was having the same problem as Daniels. I have done what you suggested and it seems to be working much better now. The only question I have is why I only had the problem on my Linux boxes and not on my WinXP boxes. My Rosetta WUs were not restarting after they had been preempted by one of the other projects. The Boinc manager said they were running but there was no message saying they had been restarted. None of the WUs I aborted were getting stopped by the watchdog, probably because they were not getting restarted properly in the first place.

For some other reason I am unable to see the graphics, probably because I don't have the right driver for the video cards. This is not a problem for me as I just let the machines slave away at their WUs.

ID: 32673
Joachim
Joined: 26 Nov 06
Posts: 5
Credit: 422,078
RAC: 142
Message 32762 - Posted: 16 Dec 2006, 18:22:27 UTC - in response to Message 32742.  

just for test, i have set that setting to 2 hours and after it reached that period it just stopped doing anything, like the last time, and got to 70% done... i will increase the period to 4 hours, to have the task completed... but i think this is not a solution... the watchdog is not restarting the application, it just keeps it in memory while it is not doing anything... the other applications are working properly... i think someone should verify this... i am not using graphics also...


I've seen the same phenomenon on my computer under Linux (SuSE 10.0).

top shows the four threads of rosetta are in memory but using 0% CPU.
Joachim
Dinos are not dead. They are alive and well and living in data centers all around you. They speak in tongues and work strange magics with computers. Beware the dino!
ID: 32762
hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,138,180
RAC: 840
Message 32764 - Posted: 16 Dec 2006, 20:07:20 UTC

I got curious to see whether the recurring crashes were the graphics code or the screen saver code. So, while using a different screen saver, I displayed graphics in a window and left it up. Crashed within an hour with this error:

12/16/2006 10:20:35 AM|rosetta@home|Unrecoverable error for result 1urnA_BOINC_POSE_ABRELAX_NEWRELAXFLAGS_frags83__1449_111_0 ( - exit code -1073741819 (0xc0000005))

I guess it's the graphics code itself; the screen saver wasn't running.
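
A side note on reading these exit codes: the client prints the same 32-bit value both as a signed decimal and as hex, which is why a negative number shows up. The short Python sketch below shows the conversion. The helper names `to_unsigned32` and `describe` are made up for illustration, and the lookup table contains only the one Windows status code I am sure of (0xC0000005 is an access violation); any other code is simply reported as unknown.

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


# Deliberately tiny table: only a code whose meaning is well established.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe(code):
    """Return (hex string, name or 'unknown') for an exit code."""
    value = to_unsigned32(code)
    return f"0x{value:08x}", KNOWN_STATUS.get(value, "unknown")


crash = describe(-1073741819)    # -> ('0xc0000005', 'STATUS_ACCESS_VIOLATION')
earlier = describe(1073807364)   # -> ('0x40010004', 'unknown')
linux = describe(131)            # -> ('0x00000083', 'unknown')
```

An access violation means the process touched memory it should not have, which fits a memory fault in the graphics code rather than a stuck work unit.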
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

ID: 32764
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 32787 - Posted: 17 Dec 2006, 5:15:45 UTC - in response to Message 32764.  

Thanks for the test; that is consistent with what we have learned from users' reports and our local tests. That is why we temporarily disabled some advanced graphics features, such as drawing side chains and zooming and rotating proteins, in the current 5.43 application, in order to at least alleviate the problems, which have caused inconvenience on the client side. We will turn those features back on once we figure out a permanent solution, and hopefully that won't take too long. Thanks again for everyone's help.
ID: 32787

©2024 University of Washington
https://www.bakerlab.org