Problems with version 5.90/5.91

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49881 - Posted: 21 Dec 2007, 17:53:54 UTC - in response to Message 49876.  

if it's so common, why wasn't the linux problem picked up on RALPH???


It appears the work actually completes normally; the progress indicator just doesn't look right along the way. So you would have to watch it run to see any problem.
Rosetta Moderator: Mod.Sense
ID: 49881 · Rating: 0
Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 49882 - Posted: 21 Dec 2007, 18:02:38 UTC - in response to Message 49881.  
Last modified: 21 Dec 2007, 18:03:40 UTC

Work might complete eventually, but it definitely doesn't complete normally. Every WU I've watched has gone past my runtime limit by hours. I suspect the only way the WUs will complete on their own is when Rosetta's internal time limit kicks in (6x the runtime limit, isn't it?). And that's assuming the built-in limit is even working.
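The concern above can be sketched as follows. This is only an illustration, not Rosetta's actual code: the function name `should_stop` is hypothetical, and the 6x multiplier is the poster's guess rather than a confirmed value. It shows why a limit keyed to the reported CPU time cannot fire while that reading is stale or stuck.

```python
def should_stop(reported_cpu_seconds, target_seconds, hard_limit_factor=6):
    """Decide whether a task should stop, based only on the reported CPU time.

    hard_limit_factor=6 is an assumption taken from the post above,
    not a confirmed Rosetta setting.
    """
    if reported_cpu_seconds >= hard_limit_factor * target_seconds:
        return "hard_limit"   # the built-in safety limit
    if reported_cpu_seconds >= target_seconds:
        return "target"       # the user's runtime preference
    return None               # keep running


target = 2 * 60 * 60  # a 2-hour runtime preference, in seconds

# A healthy reading just past the preference triggers a normal stop.
healthy = should_stop(target + 60, target)

# If the reported CPU time is stuck at zero, neither check ever fires,
# no matter how much wall-clock time has really passed.
stuck = should_stop(0, target)
```

If both checks read the same faulty counter, the built-in limit offers no extra protection, which is the scenario the post worries about.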
ID: 49882 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 49883 - Posted: 21 Dec 2007, 18:09:48 UTC - in response to Message 49876.  

All Linux users -- thanks for posting! It's quite interesting: in the past, we've seen issues that were Windows-specific, then Mac-specific, but Linux has typically been robust (especially since the app doesn't have graphics).

We're looking into the current Rosetta@home/Linux issue (I think the CPU time call must be messed up in the latest BOINC API), but it may take a few days to track it down. In the meantime, please feel free to switch to another app. Apologies... there aren't that many Linux users on RALPH -- if you're interested in helping out, we'd be grateful if some more Linux clients attached to RALPH at least part time.
ID: 49883 · Rating: 0
DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 49886 - Posted: 21 Dec 2007, 18:45:37 UTC - in response to Message 49883.  
Last modified: 21 Dec 2007, 18:47:39 UTC

Just to reply to my previous post today....

The WUs are finishing with beta 5.90, but this one went 24 minutes over my preferred run time (2 hours). The status takes almost 10 minutes to update in BOINCMGR, but the WU did finish OK. Reporting results now; here is the result that finished:
128257568
ID: 49886 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 49889 - Posted: 21 Dec 2007, 19:58:26 UTC - in response to Message 49883.  

Update: we've tracked down the problem -- it's an issue with the BOINC-provided API (I guess we happened to be unlucky in being the first to update our Linux app after the bug got introduced). Later today, we'll update the ralph and rosetta@home Linux apps and they should work.

ID: 49889 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 49917 - Posted: 21 Dec 2007, 21:59:24 UTC - in response to Message 49889.  

OK, just did the update -- this should revert the "cpu run time" and "% complete" behavior to what linux clients are used to! Please let me know if this fixes this issue (looks good locally).

Also, there were complaints about memory usage for Rosetta 5.89 -- have these problems become better?

Thanks for the continuing feedback!

ID: 49917 · Rating: 0
Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 49919 - Posted: 21 Dec 2007, 22:07:06 UTC - in response to Message 49917.  

Thanks Rhiju! I haven't had any WUs that have run to completion yet, but at least the CPU time is incrementing :-) I'll check on my clients in the morning.
ID: 49919 · Rating: 0
Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 49923 - Posted: 22 Dec 2007, 0:01:02 UTC

If 5.90 had been tested on Ralph, it never would have made it here in broken form.
Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)
ID: 49923 · Rating: 0
DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 49924 - Posted: 22 Dec 2007, 0:45:57 UTC - in response to Message 49923.  

You could criticize or you could help. Try attaching to the Ralph project with a Linux machine. Rhiju said they needed more Linux testers. I'm just thankful they responded to the issues in this thread quickly, so not much science will be lost.
ID: 49924 · Rating: 0
j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 49925 - Posted: 22 Dec 2007, 1:24:03 UTC - in response to Message 49924.  

I have 10 Linux cores on Ralph with only 3 WUs. The server status shows zero queued.

I guess you could say I criticize and "try" to help.

There is still no coherent explanation why something is tested on Ralph for only one day before it gets implemented on Rosetta.
ID: 49925 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 49927 - Posted: 22 Dec 2007, 2:29:28 UTC - in response to Message 49889.  

Since you tracked down the problem, can you please tell us how it will affect all of us running Rosetta on Linux?

We already know that those 5.90 tasks will not finish after the specified runtime. Without manual intervention, will these tasks ever end on their own, or do I have to go to each and every server and manually abort all the 5.90 tasks?

I have over 100 CPUs running Rosetta on Linux, and having to clean up this mess is not something I'm looking forward to. It especially upsets me that the lack of testing on Ralph caused the problem to appear in Rosetta. This was clearly avoidable!
Team Helix
ID: 49927 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 49936 - Posted: 22 Dec 2007, 7:47:57 UTC - in response to Message 49927.  
Last modified: 22 Dec 2007, 7:48:40 UTC

Hi: Here's a partial explanation. On ralph, nearly all the workunits for a full day had returned and come back as "successes", typically a very good sign -- but the Linux issue, as you correctly pointed out, leads to delayed responses from clients (rather than a bunch of immediate WU errors that tell us to go track down the problem). Since there are very few RALPH Linux users, we didn't notice a drop in the overall return rate of successes. The only signs that things were wrong came from a message board posting there (later bolstered by your and others' posts) and from the posts here.

So, thanks for posting -- it did help us catch the problem relatively quickly -- and please accept our apologies. We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days. If you could recruit some more Rosetta@home Linux users to give a fraction of their CPUs to ralph and occasionally post errors in the message boards, that would also help!
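The monitoring gap described above can be illustrated with a small sketch. It is not the project's actual tooling: the function name `return_rates`, the 80% threshold, and all of the counts are hypothetical, invented only to show how a small platform's failures disappear in an overall success rate but stand out when the rate is broken down per platform.

```python
from collections import defaultdict


def return_rates(results):
    """Compute overall and per-platform return rates.

    results: iterable of (platform, was_returned) pairs, one per task sent.
    """
    sent = defaultdict(int)
    returned = defaultdict(int)
    for platform, was_returned in results:
        sent[platform] += 1
        if was_returned:
            returned[platform] += 1
    rates = {p: returned[p] / sent[p] for p in sent}
    overall = sum(returned.values()) / sum(sent.values())
    return overall, rates


# Hypothetical counts, made up for illustration only.
sample = (
    [("windows", True)] * 930 + [("windows", False)] * 20
    + [("mac", True)] * 29 + [("mac", False)] * 1
    + [("linux", True)] * 4 + [("linux", False)] * 16
)

overall, rates = return_rates(sample)

# Flag any platform whose own return rate falls below an assumed threshold.
flagged = [p for p, r in rates.items() if r < 0.8]
```

With these made-up numbers the overall rate stays above 96% even though only 4 of 20 Linux tasks came back, so only the per-platform view flags the problem.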

ID: 49936 · Rating: 0
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 49943 - Posted: 22 Dec 2007, 14:23:19 UTC

I suspended some jobs to force a couple of 5.91 jobs to run. I'm happy to report they ran without problems. 5.91 seems good.
ID: 49943 · Rating: 0
Stephen Glick
Joined: 10 Dec 05
Posts: 3
Credit: 2,532,962
RAC: 20
Message 49954 - Posted: 22 Dec 2007, 22:15:55 UTC

Hi. I have been running Rosetta and SETI for a long time. Recently Rosetta is taking over my computer with some sort of animated screen saver that won't quit. When I quit the screen saver, it just starts right up again. I don't want it, but can't seem to get rid of it. I've even tried deleting it, but it just recreates itself and comes right back. How can I get rid of this thing? If I can't, I'm going to quit doing Rosetta and allocate all my processing time to SETI. My computer is a Mac G-5 2.3 dual processor. Thanks.
ID: 49954 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49955 - Posted: 22 Dec 2007, 22:28:19 UTC
Last modified: 22 Dec 2007, 22:28:42 UTC

Stephen is on a Mac, so I'm not sure of the details. On Windows, you have to specifically set your screensaver to "BOINC" to see it, but I think that is the default when you install. And when it is working on SETI, you would see the SETI screensaver as well. So, check what you have set your screensaver to.
Rosetta Moderator: Mod.Sense
ID: 49955 · Rating: 0
tng*
Joined: 28 Oct 05
Posts: 14
Credit: 5,389,798
RAC: 0
Message 49965 - Posted: 23 Dec 2007, 1:49:33 UTC - in response to Message 49936.  
Last modified: 23 Dec 2007, 1:51:19 UTC

3 boxes running CentOS now -- plan to convert more (all except the laptops, and maybe them too). How many do you need on Ralph?

ID: 49965 · Rating: 0
Paul
Joined: 29 Oct 05
Posts: 193
Credit: 66,175,411
RAC: 6,935
Message 49966 - Posted: 23 Dec 2007, 1:55:18 UTC - in response to Message 49965.  

Windows XP - Intel Q6600 with 2GB RAM

I recently aborted this WU because it was using 0% CPU. I suspended the process and resumed it twice with no change.

It was locked at exactly 43:00 cpu time. This is the second or third WU I have been forced to abort in the last few days.

13 WUs completed prior to this issue so I think my hardware is OK.

https://boinc.bakerlab.org/rosetta/result.php?resultid=128495612

Thx!

Paul

ID: 49966 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 49967 - Posted: 23 Dec 2007, 5:53:16 UTC

I have posted the steps I'm taking to recover from the 5.90 problem on my Linux systems in the Ralph forum. Perhaps this will be useful to other Linux users.


Team Helix
ID: 49967 · Rating: 0
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 49972 - Posted: 23 Dec 2007, 16:18:37 UTC
Last modified: 23 Dec 2007, 16:19:14 UTC

ID: 49972 · Rating: 0
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 49978 - Posted: 23 Dec 2007, 17:16:56 UTC

This one just errored out: 128154725

This is on a windows XP box. Rosetta asked Zone Alarm for access to the net. I gave permission and it killed itself.

Tim
ID: 49978 · Rating: 0


©2024 University of Washington
https://www.bakerlab.org