Message boards : Number crunching : Problems with Rosetta version 5.93
Author | Message |
---|---|
Ingemar Joined: 28 Feb 06 Posts: 20 Credit: 1,680 RAC: 0 |
Please post problems and/or bugs with rosetta 5.93. Thanks for your support! |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
"Please post problems and/or bugs with rosetta 5.93." My problem with 5.93 is that once again an insufficiently tested client has been released for the Rosetta project. Was the trouble that the 5.90 client caused for Linux users (and the 1zpy workunits for everyone) not severe enough to make project developers think about what they are doing (e.g. learning from their mistakes)? Do you really have such an excess of contributors that you can afford to irritate a significant portion of them away to other projects? There were less than 20 hours between the 5.93 announcement on Ralph and the same one on Rosetta. During that time my test machine has been getting 0 workunits from Ralph. In fact it didn't get any 5.92 work either and as of this post still has not received any work from Ralph (it did get 5.93 workunits from Rosetta already). If Rhiju hadn't said "We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days." I would have left Rosetta right then (I was already testing the Folding@Home SMP Linux client). Not that I think even 2 full days are really sufficient; it should probably be two weeks. To say that I'm disappointed about how quickly this turned into an empty promise is an understatement. Team Helix |
Luuklag Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
"Please post problems and/or bugs with rosetta 5.93." If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much more left to test than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get something like 10 to 50 results, you know whether it works or not. If the majority of those WUs error out, you know you're wrong. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 3,073 |
"If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much more left to test than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get something like 10 to 50 results, you know whether it works or not. If the majority of those WUs error out, you know you're wrong." Luuklag, that's just not true. Anyone with any programming experience will tell you the same. Rosetta is losing people who have invested a lot of time, money and faith in the project because of Rosetta's recent instability. Most of them would have stayed if the problems were accidental and measures were in place to prevent this, but the testing on Ralph is still apparently minimal... To the Devs: Do you need more computers on Ralph? If so, just ask and I'm sure you'll get them. |
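To put a rough number on the disagreement above, here is a minimal sketch of how little a small batch of clean results can rule out. It assumes every workunit fails independently with the same probability, which is an idealization and not something the project has stated; the function name `max_undetected_failure_rate` is made up for this illustration.

```python
def max_undetected_failure_rate(clean_results: int, confidence: float = 0.95) -> float:
    """Largest per-workunit failure rate still consistent with zero observed failures."""
    if clean_results <= 0:
        raise ValueError("need at least one result")
    # P(no failures in n runs) = (1 - p)^n; solve (1 - p)^n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_results)


for n in (10, 50, 1000):
    rate = max_undetected_failure_rate(n)
    print(f"{n:5d} clean results: failure rate could still be up to {rate:.1%}")
```

Under these assumptions, 50 error-free results still leave room for a failure rate of roughly 6%, and the estimate says nothing about failures that only show up on particular platforms or after long runtimes.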
Luuklag Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
Well, on the other hand, as with 5.90 for Linux: people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect. |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
"If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much more left to test than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get something like 10 to 50 results, you know whether it works or not. If the majority of those WUs error out, you know you're wrong." It doesn't do any good to add Linux computers to Ralph, since the WUs go to any OS. I for one have detached all my computers from Ralph and Rosetta. Ralph doesn't do any good when the majority of testing is done on Rosetta. |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 3,073 |
"Well, on the other hand, as with 5.90 for Linux: people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect." It shouldn't get to that point. The one to two days of fixing should happen on Ralph, because that is installed on computers whose owners understand what's going on and that there might be problems. It needs testing BEFORE it gets released. People donate their computers, generally because they believe in the project, but releasing code that has had minimal testing (especially when there's a platform in place for the testing) is... poor, and it's causing people to leave. |
Luuklag Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
"Well, on the other hand, as with 5.90 for Linux: people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect." OK, now I've heard enough from people; I'm now going to add Ralph, sharing with Rosetta at rosetta/ralph (7/8)/(1/8), with Rosie running 4-hour WUs and Ralph 2-hour WUs. [EDIT] It seems Ralph does have a purpose: "Jan. 3, 2008 The Ralph executable has been updated to 5.93. The recently added 5.92 contained a bug which is removed in this version. Please report bugs in this thread." That 5.92 version never came to Rosie. |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
The Rosetta Linux client has known issues in the interaction between the main computation thread, the watchdog thread and the Boinc client. These have been there for a very long time and they have still not been resolved (I have no way of knowing if anybody is even attempting to resolve them). However, this clearly means that the premise of starting with known good code is false. "and you know what your changes will cause" I'm an experienced software developer and I can assure you that no matter how well you think you understand the code you are changing and all the consequences of making that change, there is always the possibility of overlooking something. It is especially challenging when you change code that needs to run not only in one particular well-controlled environment of your own, but at many different customer sites over which you have no control whatsoever. Following good practices while developing software is important, but no substitute for testing. "I don't think there is much more left to test than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get something like 10 to 50 results, you know whether it works or not. If the majority of those WUs error out, you know you're wrong." It seems that you don't understand the nature of the problem with the 5.90 Linux client or some of the 1zpy workunits: they never finish. In the 5.90 Linux case they will run forever while remaining at near 0 cpu time accumulated. If you restart Boinc, some, but not all, of the workunits processed with 5.90 will show the correct amount of cpu time and finish, while others will restart and again continue forever. In the 1zpy workunit case (even with the 5.91 client) they will get to the point of showing 100% completed, but remain in state "running". When restarting Boinc, the amount of cpu time accumulated resets to a low number and the workunit starts again.
In both of these recent cases:
- preferred runtime is ignored
- the 4 times runtime safeguard is not working
- the workunits even continue beyond the project deadline for returning the result
- even restarting Boinc does not resolve the problem and the workunit continues to be stuck
Any unattended Linux server running R@H may very well continue to run these stuck 5.90/1zpy workunits for another year or longer. It certainly doesn't help that nobody from the Rosetta Team of Developers has made any attempt to communicate the nature of the problem to the user community (especially the requirement that those workunits have to be manually cleaned up). An 8 hour test cannot detect that some workunits get stuck and never complete! The reason I believe a 2 week test period is most sensible is that there is enough time to get problem reports from folks who perhaps check their servers only once a week. Team Helix |
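The symptoms listed above can be turned into a simple external check. The following is only a sketch under stated assumptions: `TaskSnapshot` and `stuck_reasons` are hypothetical names, the thresholds are illustrative, and nothing here reads from or talks to the Boinc client; the numbers would have to be collected some other way. It compares wall-clock time, not accumulated cpu time, because in the 5.90 case described the cpu time stays near zero and a check keyed only on cpu time would not fire.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskSnapshot:
    """Hypothetical record of one running workunit (not a Boinc structure)."""
    name: str
    wall_elapsed: timedelta       # wall-clock time since the task started
    cpu_time: timedelta           # cpu time reported for the task
    preferred_runtime: timedelta  # the user's runtime preference
    deadline: datetime            # project deadline for returning the result


def stuck_reasons(task, now, runtime_factor=4.0,
                  min_cpu_fraction=0.05, grace=timedelta(hours=1)):
    """Return the reasons this task looks stuck (empty list if none apply)."""
    reasons = []
    if task.wall_elapsed > task.preferred_runtime * runtime_factor:
        reasons.append("wall-clock time exceeds the runtime limit")
    if now > task.deadline:
        reasons.append("still running past the deadline")
    # Only judge cpu usage once the task has had time to accumulate some.
    if (task.wall_elapsed > grace
            and task.cpu_time / task.wall_elapsed < min_cpu_fraction):
        reasons.append("almost no cpu time accumulated")
    return reasons


# Made-up example values resembling the stuck 5.90 Linux workunits.
now = datetime(2008, 1, 20)
task = TaskSnapshot("example_wu", timedelta(hours=20), timedelta(minutes=2),
                    timedelta(hours=4), datetime(2008, 1, 14))
for reason in stuck_reasons(task, now):
    print(f"{task.name}: {reason}")
```

A check like this could run from cron on an unattended server and send mail when the list is non-empty; it only reports, and cleaning up the workunit would still be a manual step.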
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 3,073 |
I think two weeks is probably too long for the team to wait before their code is tested on a large scale (Rosetta) but five or six days on Ralph would give plenty of results. If Ralph needs more monitored PCs then we can do that between us, but we need to be confident that Rosetta is stable as it isn't always running on monitored PCs. |
Azurrio Joined: 20 Feb 06 Posts: 8 Credit: 237,979 RAC: 0 |
Validate Error on this (whatever that means :D) |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
error......... https://boinc.bakerlab.org/rosetta/workunit.php?wuid=118727989 |
Mike.Gibson Joined: 3 Nov 07 Posts: 19 Credit: 311,844 RAC: 0 |
Hi, folks. I thought that the same problem I was having on 5.90 was recurring. That is, it was adding time but not finishing once it reached 10 minutes to go. But no, this time it was just the countdown that stuck. After 10 minutes stuck on 10 minutes to go, it suddenly finished. Regards, Mike |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mike, it's normal for the time to basically stand still (time remaining actually is reduced very slowly) once you get below about 12 minutes. This occurs most frequently when you have a runtime preference that is less than the time it takes to compute a single model of a complex protein. Rosetta Moderator: Mod.Sense |
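A simplified model of why the countdown can stall: the sketch below assumes a task checks the runtime preference only between models and always completes at least one model. That is an assumption made for illustration, not a description of Rosetta's actual scheduling code, and `simulated_runtime` is a hypothetical name.

```python
def simulated_runtime(preference_min: float, model_min: float) -> float:
    """Minutes a task runs if it can stop only after finishing a whole model."""
    if model_min <= 0:
        raise ValueError("model time must be positive")
    elapsed = 0.0
    while True:
        elapsed += model_min  # a model in progress is never interrupted
        if elapsed >= preference_min:
            return elapsed


# A 2 hour preference with a protein that needs 200 minutes per model:
overrun = simulated_runtime(120, 200) - 120
print(f"task runs {overrun:.0f} minutes past the preference")
```

In this model, a task whose single model takes longer than the preference spends the whole overrun with the countdown near its end, which matches the "stuck on 10 minutes to go" observation.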
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Perhaps if this thread were called "NO Problems with Rosetta version 5.93" then there'd be more posts (knock on wood). I'm not seeing many reports of errors. Looking at my own hosts, I've not had one error with either Windows or Linux. I wonder what would happen if they fed us some 1zpy boinc twist rings WUs? While this updated chart only shows a small number of 5.93s, I'd have thought I'd have seen some sign of trouble by now, especially if you consider the numbers for 5.90 (in Windows) and 5.91 (in Linux). I've highlighted 5.93 in red. (My puters are dual boot, using both Windows and Linux as designated by the "l" or "w" following the number. The number is the AMD designation, so put AMD64 in front of the 2800 and 3700, and put AMD64 X2 in front of the 4800, 5200, and 6000.) |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
This one was ended by the watchdog for being stuck for 900 seconds: https://boinc.bakerlab.org/rosetta/result.php?resultid=131544223 |
RoSi Joined: 9 Nov 07 Posts: 1 Credit: 377,515 RAC: 0 |
I'm facing the same problem. Sometimes when Boinc switches back to Rosetta, the Progress jumps immediately to 100% but the Status still shows running, while a hardware tool shows that the CPU is idling. Today I therefore stopped all other projects so that Rosetta was working alone. Finally, after 1h57min out of 20h (or so), it did the same. Suddenly the Progress jumped to 100% but the Status still showed running while the CPU was idling... This occurs only under Linux. On my Windows client everything runs smoothly. The Linux client uses Boinc 5.10.28, Rosetta Beta 5.93 and Ubuntu 7.10. |
Luuklag Joined: 13 Sep 07 Posts: 262 Credit: 4,171 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=131390192 |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
Ingemar Joined: 28 Feb 06 Posts: 20 Credit: 1,680 RAC: 0 |
In this case it was only 2 days between the boinc and ralph update, which I agree is on the short side. However, I just want to alert you to one issue: how long one tests the code on ralph before submitting to boinc depends on the nature of the update. In this case the latest two updates on ralph concerned only code in one scientific protocol. The rest of rosetta stayed the same, and the bulk of the jobs sent out were running on the same code base as the previous boinc version. As for the new code, we had tested it on ralph and locally without problems, so we were confident it would not mess things up. Still, I generally agree that more time between updates is better. I will leave more time the next time I make an update. But as I alluded to above, how much testing is necessary on ralph must be determined on a case-by-case basis. Also, compared to many other projects we are doing more code development and probably need to update more often. As for the latest problems: they were mainly caused by high-memory jobs run with higher priority over the holiday season, when the queue was not completely full. We have noted these problems and are working on solutions to avoid them in the future. We hope for your continued support, thanks! |
©2024 University of Washington
https://www.bakerlab.org