21)
Message boards :
Number crunching :
Problems with Rosetta version 5.46
(Message 36789)
Posted 14 Feb 2007 by Chu Post: Have you tried resetting the project to see if it helps? Those workunits themselves seem to be fine, and if this happens all the time on a single host, my guess is that some files have become corrupted. Another possibility is a hardware problem, though this can be ruled out if the host has no problems running other programs. [quote]What could be causing these compute errors? It's only happening on one of my hosts in the last few weeks.[/quote]
22)
Message boards :
Number crunching :
Problems with Rosetta version 5.46
(Message 36637)
Posted 13 Feb 2007 by Chu Post: Please report here any problems you have observed with Rosetta version 5.46.
23)
Message boards :
Number crunching :
Rosetta Application Version Release Log
(Message 36636)
Posted 13 Feb 2007 by Chu Post: Rosetta version 5.46. In this release, we've fixed the bug observed in V5.45 which caused a high rate of "watchdog termination" for workunits, especially docking ones ("DOC..."). Please note that even with the fix, watchdog errors can still occasionally be seen, because Rosetta simulations can get stuck while searching a large, complicated energy landscape, but this should happen randomly and at a very low rate. There are also some minor modifications to the science code.
24)
Message boards :
Number crunching :
Advance copies of the soon-to-be-released executable
(Message 36630)
Posted 12 Feb 2007 by Chu Post: We will be updating Rosetta@Home to 5.46 around 6pm PST today. You can download the executables in advance here. Since we have used UPX to compress the executables in the last couple of releases, we would like to ask your opinion on whether it is still necessary to send out a release announcement in advance and, if so, whether 6 hours in advance is enough for you to download the executables beforehand. Thanks.
25)
Message boards :
Number crunching :
Validator stalled??
(Message 36622)
Posted 12 Feb 2007 by Chu Post: In your stderr output, there were two repeated blocks reporting the number of models produced, which indicates that the same workunit ran twice on your computer: it produced 8 models the first time and then added one more the second time. During the second run, it probably overwrote the output files, and therefore you returned a result file containing only one model (the 9th model). That is why the validator granted 6 credits instead of 50. Normally, the workunit should report those 8 models right away and complete the task. I am not sure why a second run was invoked. [quote]OK, I see from the server status that the validator has failed. This apparently happened sometime last night, but I've not seen any reference to why it failed or what the prognosis is.[/quote]
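The overwrite scenario suggested in the reply above can be sketched as follows. This is only an illustration of the suspected mechanism, not Rosetta's actual output code: the file name `result.out`, the one-model-per-line layout, and the helper names are all hypothetical.

```python
import os
import tempfile


def write_models(path, models, mode):
    """Write one model per line; mode "w" truncates the file, mode "a" appends."""
    with open(path, mode) as handle:
        for model in models:
            handle.write(f"{model}\n")


def count_models(path):
    """Count the non-empty lines (models) in the result file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def simulate(second_run_mode):
    """First run writes 8 models; a second run writes 1 more with the given mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "result.out")  # hypothetical file name
        write_models(path, [f"model_{i}" for i in range(1, 9)], "w")
        write_models(path, ["model_9"], second_run_mode)
        return count_models(path)


overwritten = simulate("w")  # second run truncates the earlier output
appended = simulate("a")     # second run keeps the earlier output
print(overwritten, appended)
```

If the second run truncates the file, only the 9th model survives, which would match a result file that contains a single model.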
26)
Message boards :
Number crunching :
Workunits getting stuck and aborting
(Message 36597)
Posted 12 Feb 2007 by Chu Post: If you mean the widdddddth of the board, I am wondering that too... [quote]Great, now this thread is impossible to read.[/quote]
27)
Message boards :
Number crunching :
Workunits getting stuck and aborting
(Message 36596)
Posted 12 Feb 2007 by Chu Post: Thomas, thanks for helping debug this problem and posting such detailed log output. I have never used trace before and do not know much about how processes work and communicate in linux. I will share your findings and thoughts with the other project developers tomorrow to see what this can bring us. I have run some problematic DOC workunits on our linux computers in standalone mode (without the boinc manager), and it seemed that all the watchdog terminations exited properly. In particular, I do not remember seeing any segmentation violations (I will double check this tomorrow). So I guess this will also help us narrow down whether the problem is within Rosetta or between Rosetta and the boinc manager. [quote]This is from another system, but also linux.[/quote]
28)
Message boards :
Number crunching :
Workunits getting stuck and aborting
(Message 36569)
Posted 11 Feb 2007 by Chu Post: The "watchdog" error for recent "DOC" workunits has been tracked down to a bug in the Rosetta code which was introduced in the past month. The worker thread worked properly, but it left some gaps during the simulation in which the "score" is not updated (to make it even worse, sometimes it is reset to ZERO). The "watchdog" thread works by periodically checking the "score" and comparing it against the previously recorded value. If they are the same, it concludes that the current trajectory is stuck and that it should terminate the whole process. For "DOC" workunits, the gaps can be relatively long, so the chance of this happening turns out to be high. We have fixed this problem and will test the fix in the next update on Ralph (very soon).
As mentioned in my previous post, there seem to be two separate problems. The first is why those "DOC" WUs get stuck, and we have found the cause. The second is why the watchdog thread did not terminate the process properly. This problem seems to be specific to linux platforms. When we queried our database for the problematic batch of DOC workunits, the "watchdog ending runs" message was seen across all platforms, but so far I have not seen a single case on windows or mac in which results were not returned as success. On the other hand, when this happened on linux, I saw mostly "aborted by users" outcomes, which indicates that even when the watchdog thread found the run stuck, it could not terminate the process properly, so the WU kept hanging in the system until manually killed by the user. I am not sure whether this is also true for watchdog terminations of non-DOC workunits, and we will continue to look into that.
Again, the "false watchdog terminations" should go away with the new fix, but there might be other problems which can cause a genuinely stuck trajectory. If that happens, please report back to us here. Thank you very much for the help!
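The check described in the post above (sample the score periodically and treat an unchanged value as a stuck trajectory) can be sketched like this. It is a minimal illustration under stated assumptions, not the actual Rosetta watchdog: the function name, the sample values, and the exact-equality comparison are all hypothetical, and the real check interval is not given in the post.

```python
def first_stuck_check(score_samples):
    """Return the index of the first periodic check at which the score
    equals the previously recorded value, or None if that never happens."""
    previous = None
    for index, score in enumerate(score_samples):
        if previous is not None and score == previous:
            return index  # the watchdog would terminate the process here
        previous = score
    return None


# Scores as the watchdog might see them at successive checks (made-up values).
healthy = [-10.0, -12.5, -13.1, -15.0]     # score changes between every check
update_gap = [-10.0, -12.5, -12.5, -15.0]  # worker is fine, score not updated
reset_gap = [-10.0, 0.0, 0.0, -15.0]       # score reset to zero during the gap

print(first_stuck_check(healthy))
print(first_stuck_check(update_gap))
print(first_stuck_check(reset_gap))
```

In the two gap cases the worker would have continued to a later score, but the comparison already reports a stuck run at the third check. That is the kind of false termination the post attributes to long update gaps in "DOC" workunits.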
29)
Message boards :
Number crunching :
Bug Reports for R@h Server Update to BOINC version 5.9.2
(Message 36415)
Posted 10 Feb 2007 by Chu Post: If you can post a link to any of the problematic workunits, it will be much easier for us to track down what happened. Thanks. BTW, since the server update happened less than one week ago, the problem you have experienced might be relevant only to the Rosetta application; if so, please report it here. [quote]Running boinc on a Mac (dual 2g) with plenty of memory and drive space.[/quote]
30)
Message boards :
Rosetta@home Science :
Model and step question
(Message 36399)
Posted 9 Feb 2007 by Chu Post: The number of models in each workunit depends on the cpu run time preference set by the user (the total amount of time you want to spend on each workunit) and on the type of WU (how long each WU takes to run). The number of steps in each WU varies a lot (depending on how we want to search the conformational space). Normally, "farlx" type WUs have more steps and "DOC" type WUs have fewer steps. [quote]This is just a generic question, and I don't know if there is a standard answer, but how many models are in each workunit and how many steps are in each model?[/quote]
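As a rough back-of-the-envelope sketch of the relationship described in the reply above: if a model of a given WU type takes a roughly constant time, the number of models is about the run time preference divided by the time per model. The function name, the example timings, and the assumption that at least one model is always completed are hypothetical, not taken from the Rosetta code.

```python
def estimate_models(cpu_run_time_hours, seconds_per_model):
    """Estimate how many models fit into the preferred cpu run time.

    Assumes a constant time per model and that a workunit always
    finishes at least one model, even if that model overruns the preference.
    """
    budget_seconds = cpu_run_time_hours * 3600
    return max(1, int(budget_seconds // seconds_per_model))


# A 3-hour preference with models that take 20 minutes each.
fast_type = estimate_models(3, 1200)
# A 1-hour preference with models that take 2 hours each.
slow_type = estimate_models(1, 7200)
print(fast_type, slow_type)
```

The same preference therefore yields very different model counts for WU types whose individual models take different amounts of time.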
31)
Message boards :
Number crunching :
Workunits getting stuck and aborting
(Message 36397)
Posted 9 Feb 2007 by Chu Post: This morning I also checked our local windows and mac platforms. Consistent with what has been reported here, I also saw several "Watchdog ending stuck runs" messages for "DOC" WUs. However, those stuck WUs were terminated properly by the watchdog thread (returned as success), and none of them hung in the boinc manager (where they would have to be aborted manually). So my speculation is:
1. The "DOC" WUs have some problem that makes their trajectories get stuck more frequently than the Rosetta average. We will look into this issue and come up with a fix.
2. When a stuck WU is terminated by the watchdog thread, it is not completely removed from the task list on linux platforms (but not on windows and mac platforms ???) and needs to be aborted by the user.
This speculation needs more user feedback before it can be confirmed. Please post any relevant observations on your side. Thank you for your help.
32)
Message boards :
Number crunching :
Workunits getting stuck and aborting
(Message 36376)
Posted 9 Feb 2007 by Chu Post: Thanks for reporting. We will look into that, and meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run was stuck, indicating it was not stuck in the middle of the run. From what you guys have described here, it looks like the worker has finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it is stuck. To further help us track down the problem, could you please report what kind of platform your host is? It is definitely happening on linux (from Thomas here and Conan at ralph); what about the rest of you? [quote]I've had several of these in the past day, and I see other people making isolated comments of having the same issue, so maybe if we put them all here in one place the scientists can track down the error. These are mine thus far: http://boinc.bakerlab.org/rosetta/result.php?resultid=61646685 http://boinc.bakerlab.org/rosetta/result.php?resultid=61635395 http://boinc.bakerlab.org/rosetta/result.php?resultid=61598016 http://boinc.bakerlab.org/rosetta/result.php?resultid=61597212 http://boinc.bakerlab.org/rosetta/result.php?resultid=61589791[/quote]
33)
Message boards :
Number crunching :
Odd graphics quirk, possibly
(Message 36252)
Posted 7 Feb 2007 by Chu Post: When the window is NOT maximized, we have seen a problem similar to the one you described: the "low" and "native" boxes cannot be rotated. But when the window is maximized, I can, at least on our local windows computers, rotate all four boxes without problems. [quote]With Rosetta 5.45 and the BOINC manager 5.8.8, I've noticed that if I want to move a protein around I can click either on the (A)"Searching_all_atoms" panel or the (B)"Accepted" panel and I can manipulate the protein orientation. BUT the moment I click on the smaller (C)"Low Energy" or (D)"Native" panels, I can no longer manipulate the first two (nor the latter two). Oddly, I can still manipulate the "Accepted" panel by clicking on the far left of the "Searching_all_atoms" panel, but can not manipulate the "Searching, etc.." panel at all.[/quote]
34)
Message boards :
Number crunching :
Problems with Rosetta version 5.45
(Message 36162)
Posted 5 Feb 2007 by Chu Post: Thanks for the report, River. When this happened, did you happen to see whether the cpu run time was still being incremented? I agree with you that it definitely looks like a bug somewhere, but not a graphics-related one. I am wondering whether this happens only on linux platforms or everywhere else as well. [quote]The 'stuck at 100%' bug has returned with this result here.[/quote]
35)
Message boards :
Number crunching :
Problems with Rosetta version 5.45
(Message 35926)
Posted 1 Feb 2007 by Chu Post: Thanks. We are aware of that and are looking into it right now. [quote]Just noticed that I have "Pending" granted credits. Is this new for 5.45? WU's appear to be completed successfully.[/quote]
36)
Message boards :
Rosetta@home Science :
Does the N-terminus fold first?
(Message 35887)
Posted 1 Feb 2007 by Chu Post: The N-terminus is blue and the C-terminus is red. [quote]So which end is blue in the graphic, and which is red?[/quote]
37)
Message boards :
Number crunching :
Problems with Rosetta version 5.45
(Message 35782)
Posted 31 Jan 2007 by Chu Post: Your computers are hidden. Please post a link to your error results. [quote]Seems to be a problem with running a Poweredge 6450 and Centos 4.2. Two of the four processes stopped at about 63 percent. I know this is vintage hardware, but I have another Poweredge running Windows 2003 RC2 and it runs fine.[/quote]
38)
Message boards :
Number crunching :
Errored out??
(Message 35745)
Posted 30 Jan 2007 by Chu Post: I think that is because the run was actually testing some new etable stuff. You are right that normal runs only have the number of decoys at the end of stderr.txt. [quote]OK, but the WU's I usually run don't have those messages about etables and such in it. Just the short message about the number of decoys generated.[/quote]
39)
Message boards :
Number crunching :
Errored out??
(Message 35728)
Posted 29 Jan 2007 by Chu Post: Those WUs are fine.
40)
Message boards :
Number crunching :
Ralph is now giving out 5.44 application wus
(Message 35683)
Posted 28 Jan 2007 by Chu Post: In each Rosetta-alpha update, we check out the most recent BOINC API from its CVS repository and use it to build the executables. Rosetta@Home is currently running 5.43, which was released in early December last year. The new update, 5.45, is being tested right now on Ralph, and it is compiled using an API less than one week old. Hopefully that will address your problem. [quote]Marky, I believe you are talking about the problem where the BOINC manager seems to lose contact with localhost? ...and all the tabs go blank? It seems this is a BOINC issue, and some of the later betas supposedly have a fix for that. So... no a new Rosetta version won't be expected to fix it, but "coming soon" from BOINC changes.[/quote]