Problems with Rosetta version 5.93

Message boards : Number crunching : Problems with Rosetta version 5.93

Ingemar
Joined: 28 Feb 06
Posts: 20
Credit: 1,680
RAC: 0
Message 50335 - Posted: 5 Jan 2008, 1:39:56 UTC

Please post problems and/or bugs with rosetta 5.93. Thanks for your
support!
ID: 50335 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 50345 - Posted: 5 Jan 2008, 7:21:01 UTC - in response to Message 50335.  

Please post problems and/or bugs with rosetta 5.93.


My problem with 5.93 is that, once again, an insufficiently tested client has been released for the Rosetta project. Was the trouble that the 5.90 client caused for Linux users (and the 1zpy workunits for everyone) not severe enough to make the project developers think about what they are doing (i.e. learn from their mistakes)?
Do you really have such an excess of contributors that you can afford to drive a significant portion of them away to other projects?

There were fewer than 20 hours between the 5.93 announcement on Ralph and the same announcement on Rosetta. During that time my test machine received 0 workunits from Ralph. In fact it didn't get any 5.92 work either, and as of this post it still has not received any work from Ralph (it has already received 5.93 workunits from Rosetta).

If Rhiju hadn't said "We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days." I would have left Rosetta right then (I was already testing the Folding@Home SMP Linux client). Not that I think even 2 full days are really sufficient; it should probably be two weeks. To say that I'm disappointed about how quickly this turned into an empty promise is an understatement.

Team Helix
ID: 50345 · Rating: 0
Luuklag
Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50354 - Posted: 5 Jan 2008, 13:27:44 UTC - in response to Message 50345.  

Please post problems and/or bugs with rosetta 5.93.


My problem with 5.93 is that, once again, an insufficiently tested client has been released for the Rosetta project. Was the trouble that the 5.90 client caused for Linux users (and the 1zpy workunits for everyone) not severe enough to make the project developers think about what they are doing (i.e. learn from their mistakes)?
Do you really have such an excess of contributors that you can afford to drive a significant portion of them away to other projects?

There were fewer than 20 hours between the 5.93 announcement on Ralph and the same announcement on Rosetta. During that time my test machine received 0 workunits from Ralph. In fact it didn't get any 5.92 work either, and as of this post it still has not received any work from Ralph (it has already received 5.93 workunits from Rosetta).

If Rhiju hadn't said "We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days." I would have left Rosetta right then (I was already testing the Folding@Home SMP Linux client). Not that I think even 2 full days are really sufficient; it should probably be two weeks. To say that I'm disappointed about how quickly this turned into an empty promise is an understatement.


If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much left to test other than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get 10 to 50 results you know whether it works or not, and if the majority of those WUs error out you know you're wrong.
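
To put rough numbers on the "10 to 50 results" idea, here is a minimal Python sketch. It assumes each returned result independently shows a given failure with some fixed probability; the 2% failure rate is purely an illustrative assumption, not a measured Rosetta figure, and `detection_probability` is a made-up helper name.

```python
def detection_probability(failure_rate: float, results: int) -> float:
    """Chance that at least one of `results` independent results shows the failure.

    Assumes every result fails independently with probability `failure_rate`.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if results < 0:
        raise ValueError("results must be non-negative")
    # P(no failure seen) = (1 - p) ** n, so P(at least one) is the complement.
    return 1.0 - (1.0 - failure_rate) ** results


if __name__ == "__main__":
    # 0.02 is a hypothetical failure rate chosen only for illustration.
    for n in (10, 50, 500):
        print(n, round(detection_probability(0.02, n), 3))
```

Under those assumptions, a bug that hits 2% of workunits would show up in roughly 18% of 10-result tests and roughly 64% of 50-result tests. The model also counts only results that come back; a workunit that never finishes returns nothing to count.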
ID: 50354 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,376,131
RAC: 52,943
Message 50366 - Posted: 5 Jan 2008, 16:44:05 UTC - in response to Message 50354.  

If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much left to test other than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get 10 to 50 results you know whether it works or not, and if the majority of those WUs error out you know you're wrong.


Luuklag, that's just not true. Anyone with any programming experience will tell you the same.

Rosetta is losing people that have invested a lot of time, money and faith into the project because of Rosetta's recent instability. Most of them would have stayed if the problems were accidental and measures were in place to prevent this, but the testing on Ralph is still apparently minimal...

To the Devs: Do you need more computers on Ralph? If so, just ask and I'm sure you'll get them.
ID: 50366 · Rating: 0
Luuklag
Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50369 - Posted: 5 Jan 2008, 16:50:35 UTC

Well, on the other hand, as with 5.90 for Linux, people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect.
ID: 50369 · Rating: 0
j2satx
Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 50371 - Posted: 5 Jan 2008, 17:09:17 UTC - in response to Message 50366.  

If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much left to test other than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get 10 to 50 results you know whether it works or not, and if the majority of those WUs error out you know you're wrong.


Luuklag, that's just not true. Anyone with any programming experience will tell you the same.

Rosetta is losing people that have invested a lot of time, money and faith into the project because of Rosetta's recent instability. Most of them would have stayed if the problems were accidental and measures were in place to prevent this, but the testing on Ralph is still apparently minimal...

To the Devs: Do you need more computers on Ralph? If so, just ask and I'm sure you'll get them.


It doesn't do any good to add Linux computers to Ralph, since the WUs go to any OS.

I for one have detached all my computers from Ralph and Rosetta. Ralph doesn't do any good when the majority of testing is done on Rosetta.
ID: 50371 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,376,131
RAC: 52,943
Message 50376 - Posted: 5 Jan 2008, 17:39:32 UTC - in response to Message 50369.  

Well, on the other hand, as with 5.90 for Linux, people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect.

It shouldn't get to that point - the one to two days of fixing should be on Ralph because that is installed on computers whose owners understand what's going on and that there might be problems.

It needs testing BEFORE it gets released. People donate their computers, generally because they believe in the project, but releasing code that has had minimal testing (especially when there's a platform in place for the testing) is... poor, and it's causing people to leave.
ID: 50376 · Rating: 0
Luuklag
Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50377 - Posted: 5 Jan 2008, 18:21:18 UTC - in response to Message 50376.  
Last modified: 5 Jan 2008, 18:29:25 UTC

Well, on the other hand, as with 5.90 for Linux, people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect.

It shouldn't get to that point - the one to two days of fixing should be on Ralph because that is installed on computers whose owners understand what's going on and that there might be problems.

It needs testing BEFORE it gets released. People donate their computers, generally because they believe in the project, but releasing code that has had minimal testing (especially when there's a platform in place for the testing) is... poor, and it's causing people to leave.



OK, now I've heard enough from people. I'm going to add Ralph, sharing Rosetta/Ralph at (7/8)/(1/8), with Rosie running 4-hour WUs and Ralph 2-hour WUs.

[EDIT] It seems Ralph does have a purpose:

Jan. 3, 2008
The Ralph executable has been updated to 5.93. The recently added 5.92 contained a bug which is removed in this version. Please report bugs in this thread.

because that version never came to Rosie.
ID: 50377 · Rating: 0
Thomas Leibold
Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 50385 - Posted: 5 Jan 2008, 22:23:41 UTC - in response to Message 50354.  
Last modified: 5 Jan 2008, 22:29:35 UTC


If you have existing code that has proven to run well,

The Rosetta Linux client has known issues in the interaction between the main computation thread, the watchdog thread and the Boinc client. These have been there for a very long time and they have still not been resolved (I have no way of knowing if anybody is even attempting to resolve them). However this clearly means that the premise of starting with known good code is false.

and you know what your changes will cause

I'm an experienced software developer and I can assure you that no matter how well you think you understand the code you are changing and all the consequences of making that change, there is always the possibility of overlooking something. It is especially challenging when you change code that needs to run not only in one particular well controlled environment of your own, but at many different customer sites over which you have no control whatsoever. Following good practices while developing software is important, but no substitute for testing.

I don't think there is much left to test other than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get 10 to 50 results you know whether it works or not, and if the majority of those WUs error out you know you're wrong.


It seems that you don't understand the nature of the problem with the 5.90 Linux client or some of the 1zpy workunits: they never finish.

In the 5.90 Linux case they will run forever while remaining at near 0 CPU time accumulated. If you restart Boinc, some but not all of the workunits processed with the 5.90 client will show the correct amount of CPU time and finish, while others will restart and again run forever.

In the 1zpy workunit case (even with the 5.91 client) they will get to the point of showing 100% completed, but remain in state "running". When restarting Boinc the amount of cpu time accumulated resets to a low number and the workunit starts again.

In both of these recent cases:
- preferred runtime is ignored
- the 4 times runtime safeguard is not working
- the workunits even continue beyond the project deadline for returning the result
- even restarting Boinc does not resolve the problem and the workunit continues to be stuck
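
The first three points in the list above can be read as limits that should have tripped but did not. The following is only a sketch of what such checks might look like; `TaskSnapshot` and `violated_safeguards` are hypothetical names invented for illustration, not BOINC or Rosetta structures, and nothing here is taken from the actual client code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TaskSnapshot:
    """Hypothetical view of one running task; not a real BOINC data structure."""
    cpu_seconds: float         # CPU time accumulated so far
    preferred_runtime: float   # user's runtime preference, in seconds
    deadline: datetime         # project deadline for returning the result


def violated_safeguards(task: TaskSnapshot, now: datetime) -> list[str]:
    """Return the names of the limits this task has already exceeded."""
    problems = []
    if task.cpu_seconds > task.preferred_runtime:
        problems.append("preferred runtime exceeded")
    if task.cpu_seconds > 4 * task.preferred_runtime:
        problems.append("4x runtime safeguard exceeded")
    if now > task.deadline:
        problems.append("past project deadline")
    return problems


if __name__ == "__main__":
    # Illustrative values only: a task whose CPU time never advances.
    stuck = TaskSnapshot(
        cpu_seconds=5.0,
        preferred_runtime=4 * 3600,
        deadline=datetime(2008, 1, 15, tzinfo=timezone.utc),
    )
    print(violated_safeguards(stuck, datetime(2008, 1, 20, tzinfo=timezone.utc)))
```

One thing the sketch makes visible: if a task's accumulated CPU time stays near zero, as described for the 5.90 Linux case, any limit measured in CPU time would never be reached, and only a wall-clock check such as the deadline could catch it. Whether that is what actually happened in the client is not something this thread establishes.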

Any unattended Linux server running R@H may very well continue to run these stuck 5.90/1zpy workunits for another year or longer. It certainly doesn't help that nobody from the Rosetta Team of Developers has made any attempt to communicate the nature of the problem to the user community (especially the requirement that those workunits have to be manually cleaned up).

An 8 hour test cannot detect that some workunits get stuck and don't complete ever! The reason I believe a 2 week test period is most sensible is that there is enough time to get problem reports from folks who perhaps check their servers only once a week.
Team Helix
ID: 50385 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,376,131
RAC: 52,943
Message 50387 - Posted: 6 Jan 2008, 0:44:41 UTC - in response to Message 50385.  


An 8 hour test cannot detect that some workunits get stuck and don't complete ever! The reason I believe a 2 week test period is most sensible is that there is enough time to get problem reports from folks who perhaps check their servers only once a week.

I think two weeks is probably too long for the team to wait before their code is tested on a large scale (Rosetta) but five or six days on Ralph would give plenty of results. If Ralph needs more monitored PCs then we can do that between us, but we need to be confident that Rosetta is stable as it isn't always running on monitored PCs.
ID: 50387 · Rating: 0
Azurrio
Joined: 20 Feb 06
Posts: 8
Credit: 237,979
RAC: 0
Message 50407 - Posted: 6 Jan 2008, 19:32:46 UTC
Last modified: 6 Jan 2008, 19:33:11 UTC

Validate Error on this (whatever that means :D)
ID: 50407 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 50425 - Posted: 7 Jan 2008, 12:12:48 UTC
Last modified: 7 Jan 2008, 12:17:57 UTC

error......... https://boinc.bakerlab.org/rosetta/workunit.php?wuid=118727989
ID: 50425 · Rating: 0
Mike.Gibson
Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 50440 - Posted: 7 Jan 2008, 21:18:03 UTC

Hi, folks

I thought the same problem I was having on 5.90 was recurring, that is, that it was adding time but not finishing once it reached 10 minutes to go. But no, this time it was just the countdown that stuck. After 10 minutes stuck on 10 minutes to go, it suddenly finished.

Regards

Mike
ID: 50440 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 50446 - Posted: 7 Jan 2008, 22:58:47 UTC

Mike, it's normal for the time to basically stand still (the time remaining is actually reduced very slowly) once you get below about 12 minutes. This occurs most frequently when you have a runtime preference that is less than the time it takes to compute a single model of a complex protein.
Rosetta Moderator: Mod.Sense
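
A rough way to picture the runtime-preference point is the Python sketch below. It rests on a simplifying assumption that is not stated in this thread: that a task always completes at least one model, and otherwise runs only as many whole models as fit within the preference. `expected_runtime` and `overrun` are hypothetical helper names, and the real scheduling logic may well differ.

```python
import math


def expected_runtime(preferred_runtime: float, model_time: float) -> float:
    """Total runtime in seconds if work stops only at model boundaries.

    Simplifying assumption: at least one model is always completed, and
    otherwise only whole models that fit within the preference are run.
    """
    if preferred_runtime <= 0 or model_time <= 0:
        raise ValueError("times must be positive")
    models = max(1, math.floor(preferred_runtime / model_time))
    return models * model_time


def overrun(preferred_runtime: float, model_time: float) -> float:
    """Seconds by which the task runs past the preference (0 if it does not)."""
    return max(0.0, expected_runtime(preferred_runtime, model_time) - preferred_runtime)


if __name__ == "__main__":
    # Illustrative numbers: a 2-hour preference and a model that takes 3 hours.
    print(overrun(2 * 3600, 3 * 3600))
```

With a 2-hour preference and a 3-hour model, the sketch gives a one-hour overrun, which is the kind of period during which a countdown clamped near zero could appear to stand still.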
ID: 50446 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50448 - Posted: 8 Jan 2008, 2:23:39 UTC
Last modified: 8 Jan 2008, 2:26:07 UTC

Perhaps if this thread were called "NO Problems with Rosetta version 5.93" then there'd be more posts (knock on wood). I'm not seeing many reports of errors. Looking at my own hosts, I've not had one error with either Windows or Linux. I wonder what would happen if they fed us some 1zpy boinc twist rings WUs?

While this updated chart only shows a small number of 5.93s, I'd have thought I'd have seen some sign of trouble by now, especially if you consider the numbers for 5.90 (in Windows) and 5.91 (in Linux). I've highlighted 5.93 in red.

(My computers are dual boot, using both Windows and Linux as designated by the "l" or "w" following the number. The number is the AMD designation, so put AMD64 in front of the 2800 and 3700, and put AMD64 X2 in front of the 4800, 5200, and 6000.)

ID: 50448 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 50453 - Posted: 8 Jan 2008, 12:38:17 UTC

This one was ended by the watchdog for being stuck for 900 seconds:

https://boinc.bakerlab.org/rosetta/result.php?resultid=131544223
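
For readers unfamiliar with the idea, a stall check of this kind can be sketched in a few lines of Python. Only the 900-second figure comes from the report above; the sampling format, the notion of a single "progress value", and the function names are assumptions made for illustration and do not describe the actual Rosetta watchdog.

```python
def seconds_without_progress(samples: list[tuple[float, float]]) -> float:
    """How long the most recent progress value has stayed unchanged.

    `samples` is a list of (timestamp_seconds, progress_value), oldest first.
    """
    if not samples:
        return 0.0
    last_time, last_value = samples[-1]
    unchanged_since = last_time
    # Walk backwards until the progress value differs from the latest one.
    for timestamp, value in reversed(samples):
        if value != last_value:
            break
        unchanged_since = timestamp
    return last_time - unchanged_since


def is_stalled(samples: list[tuple[float, float]], limit: float = 900.0) -> bool:
    """True when progress has not changed for at least `limit` seconds."""
    return seconds_without_progress(samples) >= limit


if __name__ == "__main__":
    # Hypothetical samples: progress stops changing after t = 300 s.
    history = [(0.0, 0.10), (300.0, 0.12), (600.0, 0.12), (1200.0, 0.12)]
    print(is_stalled(history))
```

A check like this depends on the progress measure itself being trustworthy; a task that reports 100% while still running would need a different signal.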
ID: 50453 · Rating: 0
RoSi
Joined: 9 Nov 07
Posts: 1
Credit: 377,515
RAC: 0
Message 50462 - Posted: 8 Jan 2008, 19:09:28 UTC

I'm facing the same problem.

Sometimes when Boinc switches back to Rosetta, the Progress jumps immediately to 100% while the Status still shows running, yet a hardware tool shows that the CPU is idling.

Today I therefore stopped all other projects so that Rosetta was working alone. After 1h57min out of 20h (or so) it did the same: suddenly the Progress jumped to 100%, but the Status still showed running while the CPU was idling...

This occurs only under Linux. On my Windows client everything runs smoothly. The Linux client uses Boinc 5.10.28, Rosetta Beta 5.93 and Ubuntu 7.10.
ID: 50462 · Rating: 0
Luuklag
Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50464 - Posted: 8 Jan 2008, 20:36:06 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=131390192
ID: 50464 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 50469 - Posted: 8 Jan 2008, 21:15:25 UTC
Last modified: 8 Jan 2008, 21:16:27 UTC

ID: 50469 · Rating: 0
Ingemar
Joined: 28 Feb 06
Posts: 20
Credit: 1,680
RAC: 0
Message 50473 - Posted: 8 Jan 2008, 22:53:27 UTC

In this case it was only 2 days between the ralph and boinc updates, which I agree is on the short side. However, I just want to alert you to one issue: how long one tests the code on ralph before submitting to boinc depends on the nature of the update. In this case the latest two updates on ralph concerned only code in one scientific protocol. The rest of rosetta stayed the same, and the bulk of the jobs sent out were running on a code base identical to the previous boinc version. As for the new code, we had tested it on ralph and locally without problems, so we were confident it would not mess things up.

Still, I generally agree that more time between updates is better. I will leave more time the next time I make an update. But as I alluded to above, how much testing is necessary on ralph must be determined on a case-by-case basis. Also, compared to many other projects, we are doing more code development and probably need to update more often.

As for the latest problems: they were mainly caused by high-memory jobs that were run with higher priority over the holiday season, when the queue was not completely full. We have noted these problems and are working on solutions to avoid them in the future.

We hope for your continued support, thanks!
ID: 50473 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org