Posts by Chu

21) Message boards : Number crunching : Problems with Rosetta version 5.46 (Message 36789)
Posted 14 Feb 2007 by Chu
Post:
Have you tried to reset the project to see if it helps? Those workunits themselves seem to be fine, and if this happens all the time on a single host, my guess is that some files have become corrupted. Another possibility is a hardware problem, though this can be ruled out if the host has no trouble running other programs.
What could be causing these compute errors? It's only happening on one of my hosts in the last few weeks.

http://boinc.bakerlab.org/rosetta/result.php?resultid=62506015
http://boinc.bakerlab.org/rosetta/result.php?resultid=62470017
http://boinc.bakerlab.org/rosetta/result.php?resultid=62378522
http://boinc.bakerlab.org/rosetta/result.php?resultid=62351637
http://boinc.bakerlab.org/rosetta/result.php?resultid=61390501

That host has been fine running Rosetta for ages.

22) Message boards : Number crunching : Problems with Rosetta version 5.46 (Message 36637)
Posted 13 Feb 2007 by Chu
Post:
Please report here for problems you have observed with Rosetta version 5.46.
23) Message boards : Number crunching : Rosetta Application Version Release Log (Message 36636)
Posted 13 Feb 2007 by Chu
Post:
Rosetta version 5.46

In this release, we've fixed the bug in V5.45 which caused a high rate of "watchdog termination" errors for workunits, especially docking ones ("DOC..."). Please note that even with the fix, watchdog errors can still be seen occasionally; that is because Rosetta simulations can get stuck while searching a large, complicated energy landscape, but this should happen randomly at a very low rate.

There are also some minor modifications in the science code.
24) Message boards : Number crunching : Advance copies of the soon-to-be-released executable (Message 36630)
Posted 12 Feb 2007 by Chu
Post:
We will be updating Rosetta@Home to 5.46 around 6pm PST today. You can download executables in advance here. Since we have used UPX to compress the executables in the last couple of releases, we would like to ask your opinion on whether it is still necessary to send out a release announcement in advance and, if so, whether 6 hours is enough for you to download the executables beforehand. Thanks.

25) Message boards : Number crunching : Validator stalled?? (Message 36622)
Posted 12 Feb 2007 by Chu
Post:
In your stderr output, there were two repeated blocks reporting the number of models produced, which indicates that the same workunit ran twice on your computer: it produced 8 models the first time and then added one more the second time. During the second run, it probably overwrote the output files, so you returned a result file containing only one model (the 9th). That is why the validator granted 6 credits instead of 50. Normally, the workunit should report those 8 models right away and complete the task. I am not sure why a second run was invoked.
OK, I see from the server status that the validator has failed. This apparently happened sometime last night, but I've not seen any reference to why it failed or what the prognosis is.

I'm assuming that I am not the only one who noticed this though....


Today the validator also failed in my case, because normally I get 50 points for one unit and today only 6.
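The arithmetic in the reply above fits a proportional grant: 9 models were produced but only the 9th survived the overwrite, and 50/9 rounds to 6. A toy sketch of that idea follows, with the caveat that proportional granting is purely my assumption here; the real validator logic may differ.

```python
def granted_credit(full_credit, models_returned, models_expected):
    # Assumption: credit is granted in proportion to the models actually
    # returned. This is a guess to make the 6-vs-50 numbers line up.
    return round(full_credit * models_returned / models_expected)

# 9 models were produced but only the 9th survived the overwrite:
print(granted_credit(50, 1, 9))
```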

26) Message boards : Number crunching : Workunits getting stuck and aborting (Message 36597)
Posted 12 Feb 2007 by Chu
Post:
if you mean the widdddddth of the board, I am wondering that too...
Great, now this thread is impossible to read.

27) Message boards : Number crunching : Workunits getting stuck and aborting (Message 36596)
Posted 12 Feb 2007 by Chu
Post:
Thomas, thanks for helping debug this problem and posting such detailed log output. I have never used strace before and do not know much about how processes work and communicate in Linux. I will share your findings and thoughts with the other project developers tomorrow to see where this leads us.

I have run some problematic DOC workunits on our Linux computers in stand-alone mode (without the boinc manager) and it seemed that all the watchdog terminations exited properly. In particular, I do not remember seeing any segmentation violations (I will double check this tomorrow). So I guess this will also help us narrow down whether the problem is within Rosetta or between Rosetta and the boinc manager.
This is from another system, but also linux.
After the same Rosetta workunit hung a second time, I restarted boinc with
strace: strace -ff -tt -o boinc_rosetta ./boinc

user 23795 6196 0 21:25 pts/2 00:00:01 strace -ff -tt -o /xen2/boinc_rosetta ./boinc
user 23796 23795 0 21:25 pts/2 00:00:00 ./boinc
user 23797 23796 97 21:25 pts/2 00:03:21 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23798 23797 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23799 23798 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828
user 23800 23798 0 21:25 pts/2 00:00:00 rosetta_5.45_i686-pc-linux-gnu dd 1IAI 1 -s 1IAI.rppk -dock -pose -dock_mcm -randomize1 -randomize2 -unbound_start -ex1 -ex2aro_only -dock_rtmin -unboundrot -find_disulf -norepack_disulf -dock_score_norepack -no_filters -output_all -accept_all -nstruct 10 -silent -output_silent_gz -pose_silent_out -cpu_run_time 10800 -watchdog -constant_seed -jran 2083828

PID 23795 is strace
PID 23796 is the boinc client started by strace
PID 23797 is the rosetta client started by boinc (this does all the computation)
PID 23798 is another rosetta task (2nd) started by the first one
PID 23799 is another rosetta task (3rd) started by the second one
PID 23800 is another rosetta task (4th) started by the second one (watchdog?)


21:25:52.938929 PID 23796, boinc is being executed (started by strace)
21:25:53.136532 PID 23796 forks (clone system call) and creates PID 23797
21:25:53.224175 PID 23797 rosetta is being executed (started by boinc PID 23796)
21:25:53.719005 PID 23797 creates file "boinc_lockfile"
21:25:53.724098 PID 23797 forks (clone system call) and creates PID 23798
21:25:53.725109 PID 23797 waits for signal (sigsuspend)
21:25:53.726537 PID 23798 forks (clone system call) and creates PID 23799
21:25:53.726825 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:53.726951 PID 23797 receives signal SIGRTMIN and continues
21:25:53.731726 PID 23799 starts (but never does anything interesting)
(PID 23797 writes lots of stuff to stdout.txt)
21:25:55.768784 PID 23797 waits for signal (sigsuspend)
21:25:55.769412 PID 23798 forks (clone system call) and creates PID 23800
21:25:55.769752 PID 23798 sends signal SIGRTMIN to PID 23797 (kill)
21:25:55.769875 PID 23797 receives signal SIGRTMIN and continues
21:25:55.772258 PID 23800 starts
22:26:42.220181 PID 23800 checks file "init_data.xml"
22:26:42.225455 PID 23800 writes "Rosetta score is stuck" to stdout.txt
22:26:42.226143 PID 23800 writes "Rosetta score is stuck" to stderr.txt
22:26:42.231475 PID 23800 writes "watchdog_failure: Stuck at score" to dd1IAI.out
22:26:45.470033 PID 23800 creates file "boinc_finish_called"
22:26:45.472508 PID 23800 removes file "boinc_lockfile"
22:26:45.490173 PID 23800 sends signal SIGRTMIN to PID 23797 (kill)
22:26:45.490560 PID 23797 receives signal SIGRTMIN
22:26:45.490459 PID 23800 waits for signal (sigsuspend)
22:26:45.491002 PID 23797 sends signal SIGRTMIN to PID 23800 (kill)
22:26:45.491108 PID 23800 receives signal SIGRTMIN and continues
22:26:45.502802 PID 23800 Segmentation Violation occurs!
The SIGSEGV happens just after several munmap (memory unmap) calls, so
possibly there was a reference to unmapped memory ?
22:26:45.503104 PID 23800 writes "SIGSEGV: segmentation violation" to stderr.txt
22:26:45.507844 PID 23797 waits for signal (sigsuspend)
22:26:45.509013 PID 23800 writes stack trace to stderr.txt
22:26:45.511959 PID 23800 writes "Exiting..." to stderr.txt (but it's a lie!)
22:26:45.512252 PID 23800 waits for signal (sigsuspend) that doesn't come!
22:26:45.821360 PID 23797 receives SIGALRM (timer expired ?)
22:26:45.822021 PID 23797 waits for signal (sigsuspend)
The last two lines keep repeating with PID 23797 waiting for a signal (perhaps
another SIGRTMIN from PID 23800 ?) and getting SIGALRM (timeout) instead.
The normal watchdog termination procedure seems to have been thrown off track
by the watchdog itself crashing in the process.

Left out in the sequence above is some communication between 23797 and 23798
through a pipe. I'm assuming 23797 and 23800 are communicating with shared
memory (besides signalling with SIGRTMIN), but that would not be visible in
the strace output.

Full strace logs available to anybody who is interested.
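The core pattern in the timeline above is one process blocking in sigsuspend() until another wakes it with a kill(). Here is a minimal two-process sketch of that handshake; it is a simplification (the real client uses several processes and SIGRTMIN, which is Linux-only, so SIGUSR1 stands in for it here).

```python
import os
import signal

# Block the signal first so that if the child fires before we wait,
# the signal stays pending instead of being lost (avoids the race).
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

pid = os.fork()
if pid == 0:
    # Child plays the role of PID 23800: signal the parent, then exit.
    os.kill(os.getppid(), signal.SIGUSR1)
    os._exit(0)
else:
    # Parent plays the role of PID 23797: block until the signal
    # arrives, just like the sigsuspend() calls in the strace log.
    got = signal.sigwait({signal.SIGUSR1})
    os.waitpid(pid, 0)
    print("parent woke on signal", got)
```

If the waker crashes before sending its signal (as PID 23800 did after its SIGSEGV), the waiter blocks forever and only periodic SIGALRM ticks wake it, which matches the repeating tail of the log.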

28) Message boards : Number crunching : Workunits getting stuck and aborting (Message 36569)
Posted 11 Feb 2007 by Chu
Post:
The "watchdog" error for recent "DOC" workunits has been tracked down to a bug in the Rosetta code introduced in the past month. The worker thread worked properly, but it left some gaps during the simulation in which the "score" was not updated (to make it even worse, sometimes it was reset to ZERO). The "watchdog" thread works by periodically checking the "score" and comparing it against the previously recorded value. If they are the same, it assumes the current trajectory is stuck and terminates the whole process. For "DOC" workunits, the gaps can be relatively long, so the chance of this happening turns out to be high. We have fixed this problem and will test it in the next update on Ralph (very soon).

As mentioned in my previous post, there seem to be two separate problems. The first is why those "DOC" WUs get stuck, and we have found the cause. The second is why the watchdog thread did not terminate the process properly; this problem seems to be specific to Linux platforms. When we queried our database on the problematic batch of DOC workunits, the "watchdog ending runs" message was seen across all platforms, but so far I have not seen a single case on Windows or Mac where results were not returned as success. On the other hand, when this happened on the Linux platform, I saw mostly "aborted by user" outcomes, which indicates that even though the watchdog thread found the run stuck, it could not terminate the process properly, so the WU hangs in the system until manually killed by the user. I am not sure whether this is also true for watchdog terminations of non-DOC workunits; we will continue to look into that.

Again, the "false watchdog terminations" should go away with the new fix, but there might be other problems which can cause a genuinely stuck trajectory. If that happens, please report back to us here. Thank you very much for the help!
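The stuck-run check described in the post above reduces to a score comparison between two ticks. A toy version makes the false-positive mechanism clear (the names are mine; the real watchdog lives inside the Rosetta C++ code):

```python
def looks_stuck(prev_score, cur_score):
    """The watchdog periodically records the score; if it has not
    changed since the last check, the trajectory is declared stuck."""
    return cur_score == prev_score

# The V5.45 bug: during a gap the worker stopped updating (or zeroed)
# the score, so the check fired even though the run was healthy.
assert looks_stuck(0.0, 0.0)            # false positive during a gap
assert not looks_stuck(-120.5, -130.2)  # score changed: progressing
```

This also explains why "DOC" workunits were hit hardest: their update gaps were longer, so the odds of two consecutive checks landing inside one gap were much higher.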
29) Message boards : Number crunching : Bug Reports for R@h Server Update to BOINC version 5.9.2 (Message 36415)
Posted 10 Feb 2007 by Chu
Post:
If you can post a link to any of the problematic workunits, it will be much easier for us to track down what happened. Thanks. BTW, since the server update happened less than one week ago, the problem you have experienced might only be relevant to the Rosetta application; if so, please report it here.

Running boinc on a Mac (dual 2g) with plenty of memory and drive space.

I have been getting Rosetta jobs recently (past 2-3 months) that get part way through and then just stop.

I have had to abort them to get them out of the queue, as no other manner of suspend, resume, reboot, etc seems to work.

Please let me know if there is a problem that I should be aware of or if I need to upgrade, etc.

Thanks!

30) Message boards : Rosetta@home Science : Model and step question (Message 36399)
Posted 9 Feb 2007 by Chu
Post:
The number of models in each workunit depends on the cpu run time preference set by the user (the total amount of time you want to spend on each workunit) and the type of WU (how long each WU takes to run).

The number of steps in each WU varies a lot (depending on how we want to search the conformational space). Normally, "farlx" type WUs have more steps and "DOC" type WUs have fewer steps.
This is just a generic question, and I don't know if there is a standard answer, but how many models are in each workunit and how many steps are in each model?
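The relationship in the answer above can be put into a back-of-the-envelope sketch. This is only an illustration of "models per WU = run-time preference / time per model"; the real client logic is more involved, and the 1000 s/decoy figure is borrowed from a later post in this archive.

```python
def expected_models(cpu_run_time_s, seconds_per_model):
    # The client keeps starting models until the run-time preference
    # is used up; at least one model is always completed.
    return max(1, cpu_run_time_s // seconds_per_model)

# e.g. a 3-hour preference with ~1000 s per decoy ("DOC"-like work):
print(expected_models(3 * 3600, 1000))
```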

31) Message boards : Number crunching : Workunits getting stuck and aborting (Message 36397)
Posted 9 Feb 2007 by Chu
Post:
This morning I also checked our local Windows and Mac platforms. Consistent with what has been reported here, I also saw several "Watchdog ending stuck runs" for "DOC" WUs. However, those stuck WUs were terminated by the watchdog thread properly (returned as success) and none of them hung in the boinc manager (where they would have to be aborted manually). So my speculation is:

1. The "DOC" WUs have some problem that causes their trajectories to get stuck more frequently than the Rosetta average. We will look into this issue and come up with a fix.

2. When a stuck WU is terminated by the watchdog thread, on the Linux platform (but not Windows and Mac???) there is some problem with completely removing it from the task list, and it needs to be aborted by users. This speculation has yet to be confirmed by more user feedback.

Please post any relevant observations on your side. Thank you for your help.
32) Message boards : Number crunching : Workunits getting stuck and aborting (Message 36376)
Posted 9 Feb 2007 by Chu
Post:
Thanks for reporting. We will look into that; meanwhile those WUs have been temporarily removed from the queue. I am wondering if it is because of the high memory requirement set for those "DOC" WUs. The score is ZERO when the run is stuck, indicating it was not stuck in the middle of the run. From what you have described here, it looks like the worker finished, but for some reason it did not signal the other thread properly, so the watchdog thinks it was stuck.

To further help us track down the problem, could you please report what kind of platform your host is? It is definitely happening on Linux (from Thomas here and Conan at Ralph); what about the rest of you?

[quote]I've had several of these in the past day, and I see other people making isolated comments of having the same issue, so maybe if we put them all here in one place the scientists can track down the error.

These are mine thus far:

http://boinc.bakerlab.org/rosetta/result.php?resultid=61646685
http://boinc.bakerlab.org/rosetta/result.php?resultid=61635395
http://boinc.bakerlab.org/rosetta/result.php?resultid=61598016
http://boinc.bakerlab.org/rosetta/result.php?resultid=61597212
http://boinc.bakerlab.org/rosetta/result.php?resultid=61589791[/quote]
33) Message boards : Number crunching : Odd graphics quirk, possibly (Message 36252)
Posted 7 Feb 2007 by Chu
Post:
When the window is NOT maximized, we have seen the same problem you described -- the "low" and "native" boxes are not rotatable.

But when the window is maximized, I can, at least on our local Windows computers, rotate all four boxes without problem.

With Rosetta 5.45 and the BOINC manager 5.8.8, I've noticed that if I want to move a protein around I can click either on the (A)"Searching_all_atoms" panel or the (B)"Accepted" panel and I can manipulate the protein orientation. BUT the moment I click on the smaller (C)"Low Energy" or (D)"Native" panels, I can no longer manipulate the first two (nor the latter two). Oddly, I can still manipulate the "Accepted" panel by clicking on the far left of the "Searching_all_atoms" panel, but can not manipulate the "Searching, etc.." panel at all.
However, if I maximize the Rosetta window, then I can manipulate (B), (C), and (D) but not (A). So, there seems to be a quirk there to be noted.

34) Message boards : Number crunching : Problems with Rosetta version 5.45 (Message 36162)
Posted 5 Feb 2007 by Chu
Post:
Thanks for the report, River. When this happened, did you happen to see whether the cpu run time was still being incremented? I agree with you that it definitely looks like a bug somewhere, but not graphics related. I am wondering if this only happens on Linux platforms or on other platforms as well.

The 'stuck at 100%' bug has returned with this result here.

The preferred run time had just been cut from 24hrs to 1hr to encourage Rosetta to make way for LHC (which rarely has work and which I therefore give highest priority when it does have some), but instead this result hung having reached its new completion point.

I don't know if I provoked it, or if it would have happened anyway at the end of the original run length. Either way I'd say it is a bug, though obviously a less serious one if it only occurs with a shortened run.

For others who see this, the best fix I have found is to stop BOINC and restart it, which then pushes the stuck task to start uploading.

edit add:

BTW - in response to your question in the first posting in this thread, this box has no graphics (not even an X-server) so it is not a gfx bug (unless the bug is that the windup code goes looking for the gfx...)

edit 2 add

and two more examples here and here, all different boxes, all running Linux, all stopped at 100% after run time shortened.

This is clearly relevant as it caused the watchdog message to appear, but what I still say is a bug is that the watchdog seems to make the result stick instead of ending properly.

R~~

35) Message boards : Number crunching : Problems with Rosetta version 5.45 (Message 35926)
Posted 1 Feb 2007 by Chu
Post:
Thanks. We are aware of that and are looking into it right now.
Just noticed that I have "Pending" granted credits. Is this new for 5.45? WU's appear to be completed successfully.



The validator is not running.

See this page: http://boinc.bakerlab.org/rosetta/rah_status.php

Anders n

36) Message boards : Rosetta@home Science : Does the N-terminus fold first? (Message 35887)
Posted 1 Feb 2007 by Chu
Post:
N-terminal blue and C-terminal red.
So which end is blue in the graphic, and which is red?

37) Message boards : Number crunching : Problems with Rosetta version 5.45 (Message 35782)
Posted 31 Jan 2007 by Chu
Post:
Your computers are hidden. Please post a link to your error results.
Seems to be a problem with running a Poweredge 6450 and Centos 4.2. Two of the four processes stopped at about 63 percent. I know this is vintage hardware, but I have another Poweredge running Windows 2003 RC2 and it runs fine.

Maybe an OS switch is in order?

38) Message boards : Number crunching : Errored out?? (Message 35745)
Posted 30 Jan 2007 by Chu
Post:
I think that is because the run was actually testing some new etable stuff. You are right that normal runs only have the number of decoys at the end of stderr.txt
OK, but the WU's I usually run don't have those messages about etables and such in it. Just the short message about the number of decoys generated.

Doing some math, belatedly, I understand about the shortened runtime: there was simply no time left to run another decoy at, on average, over 1000 seconds per decoy.

39) Message boards : Number crunching : Errored out?? (Message 35728)
Posted 29 Jan 2007 by Chu
Post:
Those WUs are fine.
40) Message boards : Number crunching : Ralph is now giving out 5.44 application wus (Message 35683)
Posted 28 Jan 2007 by Chu
Post:
In each Rosetta alpha update, we check out the most recent BOINC API from its CVS repository and use it to build the executables. Rosetta@Home is currently running 5.43, which was released early last December. The new update, 5.45, is being tested right now on Ralph and is compiled using an API less than one week old. Hopefully that will address your problem.

Marky, I believe you are talking about the problem where the BOINC manager seems to lose contact with localhost? ...and all the tabs go blank? It seems this is a BOINC issue, and some of the later betas supposedly have a fix for that. So... no, a new Rosetta version isn't expected to fix it; the fix is "coming soon" from BOINC changes.

No it's not that, it's the problem where rosetta 5.43 is still running, BOINC thinks it's running, but rosetta isn't using any CPU time at all. There's a thread on here about it here. The solution mentioned in that thread is that the application needs to be compiled using the latest API.

The BOINC crashing problem is something else.






©2024 University of Washington
https://www.bakerlab.org