Message boards : Number crunching : Problems with rosetta 5.48
Author | Message |
---|---|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Ahhhh! Now that's more like the information I've been looking for. Thanks guys. I see that, since I haven't been using my machine, another HINGE WU has loaded in. This confirms your information 100%. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
3/6/2007 9:55:04 PM|rosetta@home|Task 1xpv_1_NMRREF_1_1xpv_1_idid_model_05IGNORE_THE_REST_idl_1597_1519_0 exited with zero status but no 'finished' file 3/6/2007 9:55:04 PM|rosetta@home|If this happens repeatedly you may need to reset the project. 3/6/2007 9:55:04 PM|rosetta@home|Restarting task 1xpv_1_NMRREF_1_1xpv_1_idid_model_05IGNORE_THE_REST_idl_1597_1519_0 using rosetta version 548 Can you guys explain this one? It's back and running OK at the moment. The computer was not in use at the time. |
Viromancy Send message Joined: 23 Sep 06 Posts: 8 Credit: 125,713 RAC: 0 |
5.48 seems mostly stable so far, except for this odd validate error on result ID 65682300 - assuming that's down to 5.48 and not something external. Never had one of those before. |
RichardJ Send message Joined: 19 Mar 06 Posts: 8 Credit: 73,014 RAC: 0 |
Thanks for the explanation. There was a danger there that those of us with small boxes were beginning to feel not welcome at the party anymore. ... I have not got any more HINGE WUs in the queue; instead it's back to ABRELAX and NMRREF. Why is this? Is my system with 512MB RAM not powerful enough to run HINGE, or is that just luck of the draw for WUs? |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Richard, I updated my BOINC and ROH and then upped my memory usage to 90% when not in use and 75% when in use. When the box is idle it gets a HINGE WU, but if it is in use when the updater connects, it goes back to small work. (In reply to RichardJ: "Thanks for the explanation. There was a danger there that those of us with small boxes were beginning to feel not welcome at the party anymore.... I have not got any more HINGE WUs in the queue; instead it's back to ABRELAX and NMRREF. Why is this? Is my system with 512MB RAM not powerful enough to run HINGE, or is that just luck of the draw for WUs?") |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I've created a new thread to discuss memory related issues. I hope you will agree after reviewing the information there, that things are working properly with regard to memory and issuing large tasks to appropriate systems. Rosetta Moderator: Mod.Sense |
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=56688951 Success Done 45,189.06 claimed credit 264.30 !!!!!!! granted credit 20.00 ???????? why? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Kodak, your task was ended by the watchdog. It's considered a "normal" end, but isn't really completely normal either. I believe "partial success" is the term used. Maximum credit award for such tasks is currently 20 credits :( I just happened to be discussing the watchdog with Chu last week, and learned they are working on changes to make watchdog terminations report the models that were crunched prior to the watchdog termination, and to grant credit for those, plus up to 20 credits for the model that failed. I believe that improvement to the credit awarded AND some additional improvements to help eliminate watchdog terminations are both currently under test on Ralph, and coming to Rosetta very soon. So, such events will soon be a thing of the past. Rosetta Moderator: Mod.Sense |
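The credit rules described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 20-credit cap comes from the post, but the function names and the idea of a per-model credit list are hypothetical, and how the project actually computes credit per model is not stated in the thread.

```python
from typing import Sequence

# Cap quoted in the post for a task ended by the watchdog.
WATCHDOG_CAP = 20.0


def current_watchdog_credit(claimed: float) -> float:
    """Rule as described today: a watchdog-ended task earns at most 20 credits."""
    return min(claimed, WATCHDOG_CAP)


def planned_watchdog_credit(completed_model_credits: Sequence[float],
                            failed_model_claim: float) -> float:
    """Planned rule as described: full credit for models finished before the
    watchdog fired, plus up to 20 credits for the model that failed.
    Both inputs are hypothetical; the thread does not say how they are derived."""
    return sum(completed_model_credits) + min(failed_model_claim, WATCHDOG_CAP)
```

With the numbers from Kodak's result, `current_watchdog_credit(264.30)` gives the 20.00 that was granted.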
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
"I hope you will agree after reviewing the information there, that things are working properly with regard to memory and issuing large tasks to appropriate systems." Not completely. There is still the issue of the segmentation violation (SIGSEGV) and subsequent restart of the workunit (at least on the 5.48 Rosetta Client for Linux) whenever the Boinc 5.8.x client tries to stop a task because there is (temporarily) not enough free memory due to other work being performed on the system. I'm assuming the correct behavior would be for the task to end properly with a save of the checkpoint information so that it can resume where it left off when memory is available again. Team Helix |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Well I'd have to agree with you there Thomas. That would be a problem. I'd leave that discussion here in the "problems with..." thread. But I'm hoping to direct the rest of the memory discussion to this thread. Actually, BOINC cannot force the application (such as Rosetta) to take a checkpoint. If memory is no longer available to BOINC (either because the computer went from idle to active, and active has lower memory usage allowed, or because another BOINC task began using more memory) then the behavior I would expect is for the work unit to be suspended. BOINC appears to have a new status for such a suspended task called "waiting for memory". And I'd expect that if the General Preference to keep tasks in memory (they mean VIRTUAL memory on this setting) while suspended is YES, then the task would pick up right where it left off once memory is available. If it is set to NO, then I'd expect that task to suspend gracefully, and when memory becomes available, I'd expect it to start at the model and step of the last checkpoint. Either way, it shouldn't take an error due to suspension waiting for memory. Are there similar problems with suspending and resuming tasks manually? Rosetta Moderator: Mod.Sense |
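The expected suspend/resume behaviour described in this post can be written down as a tiny Python sketch. It models only what the post says it would expect, not BOINC's actual code; `TaskState`, its fields, and `resume_position` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    position: int         # hypothetical: step the task had reached when suspended
    last_checkpoint: int  # hypothetical: step saved by the most recent checkpoint


def resume_position(task: TaskState, leave_in_memory: bool) -> int:
    """Where a task suspended for 'waiting for memory' is expected to resume.

    leave_in_memory=True  -> the task stays in (virtual) memory and continues
                             exactly where it left off.
    leave_in_memory=False -> the task is unloaded and restarts from the last
                             checkpoint, losing the work done since then.
    """
    return task.position if leave_in_memory else task.last_checkpoint
```

Either branch returns a valid position, which matches the point made above: suspension should cost some work at worst, never an error.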
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
That is an interesting question and I should have thought of that myself. My expectation would have been to see the same kind of problem, yet I just manually suspended and resumed a Rosetta task 6 times and there was no apparent ill effect (no SIGSEGV in the stderr.txt file for this task). To add another piece of confusing information: This machine (dual-core, 1GB of memory) is running both Rosetta and Ralph. This means that about every 2 hours (my preference setting for the scheduling frequency) tasks get suspended and resumed by Boinc automatically. One Rosetta task (with a HINGE workunit, but that might be coincidence) that had been preempted by Boinc did have the SIGSEGV in the stderr.txt file (the task status is shown as 'waiting to run'): Graphics are disabled due to configuration... # random seed: 2748782 # cpu_run_time_pref: 28800 SIGSEGV: segmentation violation Stack trace (12 frames): [0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8c0011b] [0x8c00577] [0x8bd0c29] [0x8bd0c51] [0x8a8ded1] [0x8be67cf] [0x8b7791d] [0x8b8102d] [0x8c1292a] Exiting... This task appears to be dead, it did not become active even after suspending all other tasks. The processes for that task still exist, but do not take up any cpu time. Team Helix |
Stacey Baird Send message Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
I keep getting Rename Request errors. Rename what? 3/11/2007 1:06:50 PM||[error] Couldn't write state file: system rename 3/11/2007 1:07:50 PM|rosetta@home|Task 4ubpA_BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1505_15079_1 exited with zero status but no 'finished' file 3/11/2007 1:07:50 PM|rosetta@home|If this happens repeatedly you may need to reset the project. 3/11/2007 1:07:50 PM|rosetta@home|Restarting task 4ubpA_BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1505_15079_1 using rosetta version 548 |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This is not the first time this has happened. In fact it's about the 2nd or 3rd time in 2 days that it has happened. According to the status page everything is running normally. 3/11/2007 2:42:54 AM|rosetta@home|Sending scheduler request: To fetch work 3/11/2007 2:42:54 AM|rosetta@home|Requesting 213 seconds of new work, and reporting 1 completed tasks 3/11/2007 2:42:59 AM|rosetta@home|Scheduler RPC succeeded [server version 509] 3/11/2007 2:42:59 AM|rosetta@home|Deferring communication for 4 min 2 sec 3/11/2007 2:42:59 AM|rosetta@home|Reason: requested by project 3/11/2007 2:43:01 AM|rosetta@home|[file_xfer] Started download of file 2gzp_1_idid_model_12_idl.pdb.gz 3/11/2007 2:43:23 AM||Project communication failed: attempting access to reference site 3/11/2007 2:43:23 AM|rosetta@home|[file_xfer] Temporarily failed download of 2gzp_1_idid_model_12_idl.pdb.gz: system connect 3/11/2007 2:43:24 AM||Access to reference site succeeded - project servers may be temporarily down. 3/11/2007 2:43:24 AM|rosetta@home|[file_xfer] Started download of file 2gzp_1_idid_model_12_idl.pdb.gz 3/11/2007 2:43:26 AM|rosetta@home|[file_xfer] Finished download of file 2gzp_1_idid_model_12_idl.pdb.gz 3/11/2007 2:43:26 AM|rosetta@home|[file_xfer] Throughput 35399 bytes/sec |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I moved Stacey's post here from the Cafe. Stacey, you've got at least a couple of interesting things going on. One thing in particular I wanted to ask the Project Team to look into is this task. It reported 1.23 seconds of runtime, but crunched 30 nstructs, but zero decoys. Most of Stacey's errors reported in his tasks are "no heartbeat from core client" errors. And then ultimately the watchdog ends the task because it keeps restarting repeatedly at the same point. And then the stderr, which seems to reflect the rename error he saw in the messages tab: </stderr_txt> <message> <file_xfer_error> <file_name>4ubpA_BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1505_15079_1_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> Stacey, do you leave applications in memory while preempted? You will find this setting in your General Preferences. From the looks of it, you have it set NOT to keep applications in memory while preempted, and also have it set not to run while the computer is in use. When you combine these two things, BOINC gets interrupted every time someone sits down to use the computer. The watchdog is trying to ensure that work keeps moving along smoothly. When it notices a work unit starting and restarting at the same point in a model without making any progress, it will end it on the 5th such start. This might occur because it crunches for 30-60 minutes without reaching a checkpoint, and then someone uses the computer, thus preempting the task if that is how your settings are configured. If the application is not kept in memory while preempted, this work is lost. When the task starts again later, when the computer is not in use, it will start again at the same point it did the last time, and then you are 2 retries into the 5 count. Rosetta Moderator: Mod.Sense |
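The restart counting described in this post can be sketched roughly as follows. This is not the Rosetta watchdog's actual code: the class name, the method, and the way a "point in a model" is represented are all assumptions made for illustration. Only the limit of 5 starts at the same point comes from the post.

```python
class RestartWatchdog:
    """Hypothetical model of the 'ended on the 5th start at the same point' rule."""

    def __init__(self, limit: int = 5):
        self.limit = limit            # 5, as stated in the post
        self.last_point = None        # point at which the task last (re)started
        self.starts_at_point = 0      # consecutive starts seen at that point

    def on_start(self, point) -> bool:
        """Record a (re)start at `point`; return True if the task should be ended."""
        if point == self.last_point:
            # No progress since the previous start: count another retry.
            self.starts_at_point += 1
        else:
            # The task got further than before, so the count begins again.
            self.last_point = point
            self.starts_at_point = 1
        return self.starts_at_point >= self.limit
```

In this model, a task that is preempted before every checkpoint and not kept in memory calls `on_start` with the same point each time, so the fifth call returns `True`. Reaching a new checkpoint between preemptions resets the count.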
Nothing But Idle Time Send message Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
"This is not the first time this has happened. In fact it's about the 2nd or 3rd time in 2 days that it has happened. According to the status page everything is running normally. ..." Likewise. Every task/file download in the last 2 days has been "Temp Failed". Never happened to me before. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Latest error messages 3/12/2007 8:29:35 AM|rosetta@home|[file_xfer] Started download of file frags83_2chf_.fasta.gz 3/12/2007 8:29:35 AM|rosetta@home|[file_xfer] Started download of file frags83_2chf_.psipred_ss2.gz 3/12/2007 8:29:57 AM||Project communication failed: attempting access to reference site 3/12/2007 8:29:57 AM|rosetta@home|[file_xfer] Temporarily failed download of frags83_2chf_.fasta.gz: system connect 3/12/2007 8:29:57 AM|rosetta@home|[file_xfer] Temporarily failed download of frags83_2chf_.psipred_ss2.gz: system connect 3/12/2007 8:29:57 AM|rosetta@home|[file_xfer] Started download of file boinc_frags83_aa2chf_03_05.200_v1_3.gz 3/12/2007 8:29:57 AM|rosetta@home|[file_xfer] Started download of file boinc_frags83_aa2chf_09_05.200_v1_3.gz 3/12/2007 8:29:58 AM||Access to reference site succeeded - project servers may be temporarily down. 3/12/2007 8:30:41 AM||Project communication failed: attempting access to reference site 3/12/2007 8:30:41 AM|rosetta@home|[file_xfer] Temporarily failed download of frags83_2chf.pdb.gz: system connect 3/12/2007 8:30:42 AM||Access to reference site succeeded - project servers may be temporarily down. 
3/12/2007 8:30:42 AM|rosetta@home|[file_xfer] Started download of file frags83_2chf.pdb.gz 3/12/2007 8:30:43 AM|rosetta@home|[file_xfer] Finished download of file frags83_2chf.pdb.gz 3/12/2007 8:30:43 AM|rosetta@home|[file_xfer] Throughput 22748 bytes/sec 3/12/2007 3:31:31 PM|rosetta@home|[file_xfer] Started download of file frags83_1ail_.fasta.gz 3/12/2007 3:31:31 PM|rosetta@home|[file_xfer] Started download of file frags83_1ail_.psipred_ss2.gz 3/12/2007 3:31:53 PM||Project communication failed: attempting access to reference site 3/12/2007 3:31:53 PM|rosetta@home|[file_xfer] Temporarily failed download of frags83_1ail_.fasta.gz: system connect 3/12/2007 3:31:53 PM|rosetta@home|[file_xfer] Temporarily failed download of frags83_1ail_.psipred_ss2.gz: system connect 3/12/2007 3:31:53 PM|rosetta@home|[file_xfer] Started download of file boinc_frags83_aa1ail_03_05.200_v1_3.gz 3/12/2007 3:31:53 PM|rosetta@home|[file_xfer] Started download of file boinc_frags83_aa1ail_09_05.200_v1_3.gz 3/12/2007 3:31:54 PM||Access to reference site succeeded - project servers may be temporarily down. 3/12/2007 3:32:15 PM||Project communication failed: attempting access to reference site 3/12/2007 3:32:15 PM|rosetta@home|[file_xfer] Temporarily failed download of boinc_frags83_aa1ail_03_05.200_v1_3.gz: system connect 3/12/2007 3:32:15 PM|rosetta@home|[file_xfer] Temporarily failed download of boinc_frags83_aa1ail_09_05.200_v1_3.gz: system connect 3/12/2007 3:32:15 PM|rosetta@home|[file_xfer] Started download of file frags83_1ail.pdb.gz 3/12/2007 3:32:16 PM||Access to reference site succeeded - project servers may be temporarily down. 
3/12/2007 3:32:16 PM|rosetta@home|[file_xfer] Started download of file frags83_1ail_.fasta.gz 3/12/2007 3:32:17 PM|rosetta@home|[file_xfer] Finished download of file frags83_1ail_.fasta.gz 3/12/2007 3:32:17 PM|rosetta@home|[file_xfer] Throughput 338 bytes/sec 3/12/2007 3:32:25 PM|rosetta@home|[file_xfer] Started download of file boinc_frags83_aa1ail_09_05.200_v1_3.gz 3/12/2007 3:32:28 PM|rosetta@home|[file_xfer] Finished download of file boinc_frags83_aa1ail_09_05.200_v1_3.gz 3/12/2007 3:32:28 PM|rosetta@home|[file_xfer] Throughput 69195 bytes/sec 3/12/2007 3:32:37 PM||Project communication failed: attempting access to reference site 3/12/2007 3:32:37 PM|rosetta@home|[file_xfer] Temporarily failed download of frags83_1ail.pdb.gz: system connect 3/12/2007 3:32:38 PM||Access to reference site succeeded - project servers may be temporarily down. 3/12/2007 3:32:38 PM|rosetta@home|[file_xfer] Started download of file frags83_1ail.pdb.gz 3/12/2007 3:32:39 PM|rosetta@home|[file_xfer] Finished download of file frags83_1ail.pdb.gz 3/12/2007 3:32:39 PM|rosetta@home|[file_xfer] Throughput 17929 bytes/sec The downloads at 7pm and 10pm are OK. Uploads are OK all the time. |
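For anyone wanting to tally these failures rather than read them by eye, here is a small Python sketch that splits a pasted message log into entries and counts the "Temporarily failed download" lines per file. It assumes the `date time AM/PM|project|message` layout seen in the log above; logs with other timestamp styles would need a different pattern. The function names are made up for this example and are not part of BOINC.

```python
import re
from collections import Counter

# Each entry in the pasted log starts with a timestamp such as
# "3/12/2007 8:29:57 AM|". A zero-width lookahead splits before it.
ENTRY_START = re.compile(r"(?=\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M\|)")
FAILED = re.compile(r"Temporarily failed download of (\S+): ")


def split_entries(text: str) -> list:
    """Split a run-together log paste into one string per log entry."""
    return [e.strip() for e in ENTRY_START.split(text) if e.strip()]


def failed_downloads(text: str) -> Counter:
    """Count temporary download failures per file name."""
    counts = Counter()
    for entry in split_entries(text):
        parts = entry.split("|", 2)   # timestamp, project, message
        if len(parts) < 3:
            continue                  # not a log entry (e.g. surrounding prose)
        match = FAILED.search(parts[2])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `failed_downloads(log_text).most_common()` on a paste like the one above would list the files that failed most often first.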
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The download failures appear to be related to a server restart that was performed. Here's David Kim's post Rosetta Moderator: Mod.Sense |
brunetto2001 Send message Joined: 11 Dec 05 Posts: 3 Credit: 8,019 RAC: 0 |
Latest problem (a partial log copy): "29/03/2007 14.17.24|rosetta@home|[file_xfer] Temporarily failed download of cc1dcj_09_05.200_v1_3.gz: http error 29/03/2007 14.17.24|rosetta@home|[file_xfer] Temporarily failed download of ccfrags200.txt: http error 29/03/2007 14.17.25|rosetta@home|[file_xfer] Started download of file cc1dcj_03_05.200_v1_3.gz 29/03/2007 14.17.25|rosetta@home|[file_xfer] Started download of file cc1dcj_09_05.200_v1_3.gz 29/03/2007 14.20.25|rosetta@home|[file_xfer] Temporarily failed download of cc1dcj_03_05.200_v1_3.gz: http error 29/03/2007 14.20.25|rosetta@home|[file_xfer] Temporarily failed download of cc1dcj_09_05.200_v1_3.gz: http error 29/03/2007 14.20.25|rosetta@home|[file_xfer] Started download of file ccfrags200.txt 29/03/2007 14.20.26|rosetta@home|[file_xfer] Started download of file cc1dcj_03_05.200_v1_3.gz 29/03/2007 14.23.24|rosetta@home|[file_xfer] Temporarily failed download of ccfrags200.txt: http error 29/03/2007 14.23.25|rosetta@home|[file_xfer] Temporarily failed download of cc1dcj_03_05.200_v1_3.gz: http error 29/03/2007 14.23.25|rosetta@home|Backing off 1 min 0 sec on download of file cc1dcj_03_05.200_v1_3.gz 29/03/2007 14.23.25|rosetta@home|[file_xfer] Started download of file cc1dcj_09_05.200_v1_3.gz 29/03/2007 14.23.25|rosetta@home|[file_xfer] Started download of file ccfrags200.txt 29/03/2007 14.26.26|rosetta@home|[file_xfer] Temporarily failed download of cc1dcj_09_05.200_v1_3.gz: http error 29/03/2007 14.26.26|rosetta@home|Backing off 1 min 0 sec on download of file cc1dcj_09_05.200_v1_3.gz 29/03/2007 14.26.26|rosetta@home|[file_xfer] Temporarily failed download of ccfrags200.txt: http error 29/03/2007 14.26.26|rosetta@home|Backing off 1 min 0 sec on download of file ccfrags200.txt" |
©2024 University of Washington
https://www.bakerlab.org