Report Problems with Rosetta Version 5.16 I

NJMHoffmann
Joined: 17 Dec 05
Posts: 45
Credit: 45,891
RAC: 0
Message 16864 - Posted: 22 May 2006, 20:22:14 UTC - in response to Message 16795.  

You are already using the version that has had checkpoints added. Originally the checkpoints only were done at the end of a full model. Now they are every ~20 min.

It's much better now and we lose less work with this shorter checkpoint interval. But if it is possible to insert checkpoints, why not respect the user settings?

Norbert
ID: 16864 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 16865 - Posted: 22 May 2006, 20:41:16 UTC - in response to Message 16864.  

You are already using the version that has had checkpoints added. Originally the checkpoints only were done at the end of a full model. Now they are every ~20 min.

It's much better now and we lose less work with this shorter checkpoint interval. But if it is possible to insert checkpoints, why not respect the user settings?

Norbert

What settings do you feel are not being respected by the current (improved) checkpointing?

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 16865 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 16866 - Posted: 22 May 2006, 20:45:53 UTC

I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting a new monitor, keyboard, mouse or printer, so don't include those in your next voodoo ceremony. :)

Hang in there Jose!
ID: 16866 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16867 - Posted: 22 May 2006, 20:52:51 UTC - in response to Message 16848.  

Thanks Mod9 for the quick reply

But that's not what I noticed. The first 5.16 unit(s) I processed didn't show checkpoints every ~20 min.

The one I'm currently working on seems to behave nicely in the suggested way. Maybe it was a glitch in the first 5.16 units and nobody else noticed...

I'll keep an eye on it and report back if I notice anything unusual.

20 min checkpoint intervals are fine with me. I can live with that.

Thor

Thor,

Thanks for the report. I will also watch for this. I am aware that the first checkpoint is usually longer than the others, but it should still not exceed ~35 min. So what you have reported is interesting.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 16867 · Rating: 0
NJMHoffmann
Joined: 17 Dec 05
Posts: 45
Credit: 45,891
RAC: 0
Message 16873 - Posted: 22 May 2006, 21:29:51 UTC - in response to Message 16865.  

What settings do you feel are not being respected by the current (improved) checkpointing?

I would interpret the setting "write to disk at most..." as: After a checkpoint wait for x seconds before a new checkpoint and then do it as soon as possible.

Norbert
ID: 16873 · Rating: 0
Tallguy-13088
Joined: 14 Dec 05
Posts: 9
Credit: 843,378
RAC: 0
Message 16877 - Posted: 22 May 2006, 22:05:36 UTC - in response to Message 16866.  

I think you ought to think about a wristpad/mousepad too! That way he doesn't do quite as much damage pounding his head/hands on the desk <grin>.

Jose, just remember, it doesn't have to cooperate ... it's a machine!

I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :)

Hang in there Jose!


ID: 16877 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16879 - Posted: 22 May 2006, 22:14:07 UTC - in response to Message 16866.  

I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :)

Hang in there Jose!


I was able to track the offending application. ARGH a maintenance application run amok.

Hey, let's face it: without me posting my weird problems and my non-standard attempts at solving them, this thread would be boring.

I think my computer should be inducted into the Rosetta@Home Hall of Fame. Either that or a citation in the next scientific paper by Dr Baker and the team would be nice. :)

Please remember that, as the "minus inter pares" of my team, I am in charge of stat reporting and non-traditional credit production methods, so all voodoo is reserved for that and not for my personal gain.

But, should you want, I can send you the specs for the computation system of my dreams. :) It will take a lot of 5 cents. LOL LOL LOL

As to the next voodoo ceremony, it may involve a moderator or a poster being sacrificed to the team production deities (specifically the 500,000 Credit a Day deity). Numero 9 is in the sacrificial pool; want to join him? LOL LOL LOL LOL

Okies, the pain killers are working. I better go to bed.

:)
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
ID: 16879 · Rating: 0
Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 16880 - Posted: 22 May 2006, 22:15:41 UTC - in response to Message 16877.  

I think you ought to think about a wristpad/mousepad too! That way he doesn't do quite as much damage pounding his head/hands on the desk <grin>.

Jose, just remember, it doesn't have to cooperate ... it's a machine!

I'm taking up a (fictitious) collection for Jose. I'm taking pledges. For every problem he posts, I'm asking each person to donate 5 cents. This way the more he posts information, the sooner we can buy him a new PC... but Jose, beware, we're not getting new monitor, keyboard, mouse nor printer, so don't include those in your next voodoo ceremony. :)

Hang in there Jose!



Tall guys make good candidates for the sacrificial pool. Te he te he :)
ID: 16880 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 16881 - Posted: 22 May 2006, 22:15:48 UTC - in response to Message 16873.  
Last modified: 22 May 2006, 22:22:02 UTC

What settings do you feel are not being respected by the current (improved) checkpointing?

I would interpret the setting "write to disk at most..." as: After a checkpoint wait for x seconds before a new checkpoint and then do it as soon as possible.

Norbert


I believe that's exactly what they are doing. The problem is that "as soon as possible" isn't as often as many people would like. It's only about every 20 minutes that they reach a point in the model where they can checkpoint. But it depends on the protein and the CPU. A faster CPU hits that same point much faster than a slow CPU. So, what they do is... reach a point in the model where a checkpoint COULD be made, and if more than 20 min. has gone by since the last checkpoint was made, then another is made.... which I guess is your point now that I type it. Let me see if I can restate it...

"Why use the arbitrary 20 minutes number, when the user's preference might be for write to disk every 5 minutes, and my model may be hitting a checkpointable state every 5 minutes?"

It seems like that point was brought up on Ralph. The project is under maintenance at the moment, so I can't post a link.

[edit] I think it boiled down to the volume of data they have to write for the checkpoint. It was like 100+MB. And if they wrote that much data every... (I think the default is) 1 min, then your "faster" computer, which is reaching checkpointable points in the model rapidly, would be spending a considerable fraction of time writing the checkpoints rather than getting work done.
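
As a rough illustration of the rule described in this post, here is a minimal Python sketch. It is not Rosetta's actual code: the names `CheckpointPolicy`, `min_interval_s`, `simulate` and `write_overhead_fraction` are hypothetical, and the only behaviour taken from the discussion is "when a checkpointable point is reached, write a checkpoint only if enough time has passed since the last one".

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Decides whether to checkpoint once a checkpointable point is reached."""
    min_interval_s: float            # hypothetical: minimum seconds between checkpoints
    last_checkpoint_s: float = 0.0   # time of the last checkpoint (task start = 0)

    def should_checkpoint(self, now_s: float) -> bool:
        # Only meaningful when the model is at a point where state can be saved.
        return now_s - self.last_checkpoint_s >= self.min_interval_s

    def record(self, now_s: float) -> None:
        self.last_checkpoint_s = now_s


def simulate(checkpointable_times_s, min_interval_s):
    """Return the times at which checkpoints would actually be written."""
    policy = CheckpointPolicy(min_interval_s=min_interval_s)
    written = []
    for t in checkpointable_times_s:
        if policy.should_checkpoint(t):
            policy.record(t)
            written.append(t)
    return written


def write_overhead_fraction(write_time_s: float, interval_s: float) -> float:
    """Fraction of wall time spent writing if one checkpoint follows each interval."""
    return write_time_s / (interval_s + write_time_s)


# A model that is checkpointable every 5 minutes for one hour.
times = list(range(300, 3601, 300))
print(simulate(times, 1200))  # 20-minute rule: only some points are used
print(simulate(times, 300))   # 5-minute rule: every point is used
```

With the 20-minute interval, most of the checkpointable points are skipped, which is the behaviour Norbert is asking about. `write_overhead_fraction` puts a number on the trade-off mentioned in the edit: the larger the checkpoint and the shorter the interval, the bigger the share of time spent writing. The write time itself is an input you would have to measure; none is given in this thread.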
ID: 16881 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16886 - Posted: 22 May 2006, 22:48:06 UTC - in response to Message 16881.  

The write to disk parameter has no direct relationship to checkpointing. It can prevent checkpointing if the interval is set too long, but it is a disk use parameter to control disk access only. It is really there to let laptop drives spin down between write accesses. But it is in no way a setting to request more frequent checkpointing.
Moderator9
ID: 16886 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 16889 - Posted: 22 May 2006, 23:19:24 UTC
Last modified: 22 May 2006, 23:19:45 UTC

Jose, have you tried scanning for malware/adware with Ad-Aware SE, and for spyware with Spybot - Search & Destroy, in addition to your virus program? They're free.

tony
ID: 16889 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 16904 - Posted: 23 May 2006, 10:08:16 UTC
Last modified: 23 May 2006, 10:21:24 UTC

The following WU grew steadily in memory usage up to 550 MB physical RAM and about 700 MB virtual memory (I have 1 GB RAM and 1.24 GB virtual memory on that host):

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=17772949

After three and a half hours and 26 decoys I restarted BOINC and memory usage started from 36 MB but is again growing with each completed model. Seems to me like a memory leak. Btw, I never looked at the graphics.

Edit: It seems Rosetta is no longer writing to the file stdout.txt after restarting BOINC. However it is writing to the file xxt283.out. Don't know if this means anything.
ID: 16904 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 16924 - Posted: 23 May 2006, 18:05:44 UTC - in response to Message 16904.  

Tralala: Thanks for posting about this problem. I thought I had fixed this issue on this workunit, but apparently there are still problems on some clients. I am canceling these workunits now. Aborting the jobs was the right thing to do.

The following WU grew steadily in memory usage up to 550 MB physical RAM and about 700 MB virtual memory (I have 1 GB RAM and 1.24 GB virtual memory on that host):

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=17772949

After three and a half hours and 26 decoys I restarted BOINC and memory usage started from 36 MB but is again growing with each completed model. Seems to me like a memory leak. Btw, I never looked at the graphics.

Edit: It seems Rosetta is no longer writing to the file stdout.txt after restarting BOINC. However it is writing to the file xxt283.out. Don't know if this means anything.


ID: 16924 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 16928 - Posted: 23 May 2006, 18:40:31 UTC - in response to Message 16924.  

Tralala: Thanks for posting about this problem. I thought I had fixed this issue on this workunit, but apparently there are still problems on some clients. I am canceling these workunits now. Aborting the jobs was the right thing to do.


I could have aborted it the soft way by lowering the run time preference, but I was afraid it would kill one of my remote hosts with only 512 MB RAM. Fortunately that was not the case.

You can safeguard against those incidents if you specify a memory bound for all WUs. If the virtual memory exceeds this bound, the WU gets automatically aborted.
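
To make the memory-bound idea concrete, here is a small Python sketch of the check. It is only an illustration under assumptions: the function name `first_bound_violation` is hypothetical, and the sample numbers are made up to resemble steady growth, not measured from any work unit.

```python
from typing import Iterable, Optional


def first_bound_violation(samples_mb: Iterable[float], bound_mb: float) -> Optional[int]:
    """Return the index of the first memory sample above the bound, or None."""
    for index, used_mb in enumerate(samples_mb):
        if used_mb > bound_mb:
            return index  # this is where a client enforcing the bound would abort
    return None


# Illustrative numbers only: virtual memory (MB) sampled after each batch of models.
samples = [36, 120, 260, 410, 550, 700]

print(first_bound_violation(samples, bound_mb=512))   # index of the first sample over 512 MB
print(first_bound_violation(samples, bound_mb=1024))  # None: the bound is never exceeded
```

The point of the bound is that a leaking task is stopped at a known limit instead of pushing a small host (such as the 512 MB one mentioned above) into heavy swapping.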
ID: 16928 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 16932 - Posted: 23 May 2006, 19:39:08 UTC

Curious behaviour.... Two work units "exited with 0" but had no finish file. They then restarted and appear to have resumed where they left off. They are still running.

Here's the log:

5/23/2006 9:21:11 AM||Rescheduling CPU: application exited
5/23/2006 9:21:11 AM|rosetta@home|Computation for task u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0 finished
5/23/2006 9:21:11 AM|rosetta@home|Starting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 using rosetta version 516
5/23/2006 9:21:13 AM|rosetta@home|Started upload of file u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0_0
5/23/2006 9:21:19 AM|rosetta@home|Finished upload of file u287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_nterm__522_6410_0_0
5/23/2006 9:21:19 AM|rosetta@home|Throughput 29328 bytes/sec
5/23/2006 9:21:24 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/23/2006 9:21:24 AM|rosetta@home|Reason: To report completed tasks
5/23/2006 9:21:24 AM|rosetta@home|Reporting 1 tasks
5/23/2006 9:21:29 AM|rosetta@home|Scheduler request succeeded
5/23/2006 10:22:32 AM||Rescheduling CPU: application exited
5/23/2006 10:22:32 AM|rosetta@home|Computation for task b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0 finished
5/23/2006 10:22:32 AM|rosetta@home|Starting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 using rosetta version 516
5/23/2006 10:22:34 AM|rosetta@home|Started upload of file b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0_0
5/23/2006 10:22:40 AM|rosetta@home|Finished upload of file b287__CASP7_ABRELAX_SHORTRELAX_SAVE_ALL_OUT_truncate__522_6500_0_0
5/23/2006 10:22:40 AM|rosetta@home|Throughput 28853 bytes/sec
5/23/2006 10:22:45 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
5/23/2006 10:22:45 AM|rosetta@home|Reason: To report completed tasks
5/23/2006 10:22:45 AM|rosetta@home|Reporting 1 tasks
5/23/2006 10:22:50 AM|rosetta@home|Scheduler request succeeded
5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 exited with zero status but no 'finished' file
5/23/2006 11:04:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
5/23/2006 11:04:46 AM||Rescheduling CPU: application exited
5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 exited with zero status but no 'finished' file
5/23/2006 11:04:46 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
5/23/2006 11:04:46 AM||Rescheduling CPU: application exited
5/23/2006 11:04:46 AM|rosetta@home|Restarting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 using rosetta version 516
5/23/2006 11:04:46 AM|rosetta@home|Restarting task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 using rosetta version 516
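
For anyone who wants to check their own message log for the same symptom, here is a minimal Python sketch. It assumes the `timestamp|project|message` layout seen in the log above; the function name `tasks_without_finish_file` is hypothetical, and the script only reads text and does not talk to BOINC.

```python
import re
from collections import Counter

# Message text as it appears in the log above.
NO_FINISH_FILE = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def tasks_without_finish_file(log_lines):
    """Count, per task name, how often the 'no finished file' message appears."""
    counts = Counter()
    for line in log_lines:
        fields = line.rstrip("\n").split("|", 2)  # timestamp, project, message
        if len(fields) != 3:
            continue  # not in the expected layout, skip it
        match = NO_FINISH_FILE.search(fields[2])
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1046_0 exited with zero status but no 'finished' file",
    "5/23/2006 11:04:46 AM||Rescheduling CPU: application exited",
    "5/23/2006 11:04:46 AM|rosetta@home|Task v287__CASP7_ABRELAX_SAVE_ALL_OUT_cterm__527_1041_0 exited with zero status but no 'finished' file",
]

for task, count in tasks_without_finish_file(sample).items():
    print(task, count)
```

A task that shows up repeatedly is the case the client's own hint ("If this happens repeatedly you may need to reset the project") is about.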

ID: 16932 · Rating: -1
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 16941 - Posted: 24 May 2006, 0:36:22 UTC - in response to Message 16863.  
Last modified: 24 May 2006, 0:39:34 UTC

LINUX problem:
I need help with this problem: while running Rosetta on Linux server with PentiumIV HyperThreading processor, Rosetta occasionally hangs in a very strange state: everything is running except Rosetta. Boinc is running. Application on other thread (Simap@home) is running. Just Rosetta isn't.


I had encountered this particular issue back in Jan/Feb-06 (also under Linux). Overall about 5-6 times.



I have been having the same problems with my Dell SC420... The Rosetta application just sleeps (watching in BoincView I have a 0.00 cpu efficiency, and when I "top" I see the Rosetta apps in memory, but 0% cpu usage). I have gotten this quite frequently while running Rosetta on a Linux box, even through all the different versionings. Any ideas/suggestions from the Mods, Testers or Dev's?


OK, it looks to be the same issue. Rosetta "frozen" (SN=Sleeping,Nice and consuming 0% CPU) although BOINC thinks it's running. Also, for some reason BOINC won't pre-empt Rosetta after say 1hr, so effectively the whole DC queue is stuck.

Since (e.g. here) you've encountered Rosetta "hangs" recently under Linux using BOINC v5.4.9 (as I see you're using now), we can rule out the BOINC v5.2.14 possibility. You also have a different kernel, 2.6.x (both myself and Aglarond had kernel 2.4.x and BOINC v5.2.14), so we can rule that out too.

Although I reiterate that my Linux box that had this issue has been running smoothly for over 3 months, 24/7, crunching 90% Rosetta/Ralph, without a single "hung" instance. I thought it was some odd issue that was "solved" by re-compiling R with the new BOINC API, but apparently you guys still have it...

Maybe do some thinking about SIGSEGV and SIGABRT:

SIGSEGV: segmentation violation
Stack trace (11 frames):
[0x882fbb3]
Exiting...
SIGABRT: abort called
Stack trace (18 frames):
[0x882fbb3]
https://boinc.bakerlab.org/rosetta/result.php?resultid=20134206
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 16941 · Rating: 0
Steve Shedroff
Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 16946 - Posted: 24 May 2006, 2:33:49 UTC

This may be coincidence, but I just downloaded the most recent BOINC Client and all my numbers are dropping. Work per day is about 1/2 of what it was before the new client. This is true on MacX and Intel P4 systems. Is it just me?
ID: 16946 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16950 - Posted: 24 May 2006, 2:51:24 UTC - in response to Message 16946.  

This may be coincidence, but I just downloaded the most recent BOINC Client and all my numbers are dropping. Work per day is about 1/2 of what it was before the new client. This is true on MacX and Intel P4 systems. Is it just me?

What version of BOINC did you install?
Moderator9
ID: 16950 · Rating: -1
senatoralex85
Joined: 27 Sep 05
Posts: 66
Credit: 169,644
RAC: 0
Message 16955 - Posted: 24 May 2006, 5:55:22 UTC - in response to Message 16941.  

LINUX problem:
I need help with this problem: while running Rosetta on Linux server with PentiumIV HyperThreading processor, Rosetta occasionally hangs in a very strange state: everything is running except Rosetta. Boinc is running. Application on other thread (Simap@home) is running. Just Rosetta isn't.


I had encountered this particular issue back in Jan/Feb-06 (also under Linux). Overall about 5-6 times.



I have been having the same problems with my Dell SC420... The Rosetta application just sleeps (watching in BoincView I have a 0.00 cpu efficiency, and when I "top" I see the Rosetta apps in memory, but 0% cpu usage). I have gotten this quite frequently while running Rosetta on a Linux box, even through all the different versionings. Any ideas/suggestions from the Mods, Testers or Dev's?


OK, it looks to be the same issue. Rosetta "frozen" (SN=Sleeping,Nice and consuming 0% CPU) although BOINC thinks it's running. Also, for some reason BOINC won't pre-empt Rosetta after say 1hr, so effectively the whole DC queue is stuck.

-----------------------------------------------------------------------------

I am not sure, but I may have a similar problem. Once in a while I will leave my computer running for a few consecutive hours. When I come back, it seems that BOINC got stuck and stranded a workunit at "100% ready to report" status. If I hit the update button under the projects tab, it sends the workunit and simultaneously downloads another one. Why would it get stuck like that? I am running BOINC 4.45.

ID: 16955 · Rating: 1
Aglarond
Joined: 29 Jan 06
Posts: 26
Credit: 446,212
RAC: 0
Message 16974 - Posted: 24 May 2006, 13:09:42 UTC

BAD ERROR! BOINC 5.4.9 was crunching WU t283__CASP7_ABRELAX_SAVE_ALL_OUT_hom024__528_13504_0 and the screensaver appeared. Suddenly a Windows error message appeared about Rosetta@home doing an illegal operation, saying Windows had to end this process: "send report to microsoft? [send] [don't send]". You probably know that message. After closing the message, BOINC happily crunches another WU. Now it looks like it was a normal computing error, but it wasn't.
ID: 16974 · Rating: 0


©2024 University of Washington
https://www.bakerlab.org