lr8_combine_smooth_torsion_it00

Author	Message
Rowpie of the Scottish Boinc Team Joined: 26 Aug 06 Posts: 2 Credit: 621,176 RAC: 0	Message 64186 - Posted: 24 Nov 2009, 23:21:07 UTC Last modified: 24 Nov 2009, 23:22:10 UTC
Hi all, I don't post on project forums often but this one has me stumped. I returned to Rosetta 3 days ago and it was great until 7am this morning, when WUs started to fail. If the WU name begins with this: lr8_combine_smooth_torsion_it00_rama... I get this error within seconds: ERROR: Value of inactive option accessed: -score:dun08_dir
Link to the system: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1188270
Link to most recent error: https://boinc.bakerlab.org/rosetta/result.php?resultid=299171589
System is a Core i7 920 OC'd to 3.8 GHz with 12 GB of RAM and two GTX260 GPUs. (Hyper-Threading is on, SLI is off and all power-saving features are definitely off!) It passes Prime95, Memtest86+ and IntelBurnTest at these settings for many, many hours each. (I like stability in my overclocks as I run them at 24/7 load where possible.)
I've stopped the system polling for new WUs until I can work this out, as there is no point filling the error logs up any more than it has done already. The only other project active at the same time was SETI on the GPUs (CPU disabled in SETI options). To make sure the GPUs weren't starved I have the CPU cap set to 7 out of the 8 threads.
I think that's everything. If anyone has any ideas then I look forward to the help in sorting this.
Rowpie of TSBT (Ian)

VO Joined: 4 Nov 05 Posts: 7 Credit: 3,250,754 RAC: 0	Message 64187 - Posted: 24 Nov 2009, 23:25:39 UTC
You're not the only one; all lr8_combine_smooth_torsion_it00 tasks are erroring, so... be patient.

LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0	Message 64195 - Posted: 25 Nov 2009, 2:25:25 UTC
Ditto. I've got one running at the moment but with a lot of restarts, so I'm just going to abort it, and I see some more coming down which I'll keep my eye on too. I do have some other jobs going, so if they keep me going while someone checks and advises whether we can abort on sight, I'd appreciate it.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64199 - Posted: 25 Nov 2009, 3:54:25 UTC Last modified: 25 Nov 2009, 3:54:53 UTC
Mine failed too. So it doesn't sound like any problem with your machine. We'll just have to churn through any that are left until they clear out. Shouldn't take long at 30 seconds each.
Rosetta Moderator: Mod.Sense

LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0	Message 64204 - Posted: 25 Nov 2009, 5:12:00 UTC - in response to Message 64199.
"Shouldn't take long at 30 seconds each."
I'd hope not, but the one I had to abort had been running nearly 30 minutes (elapsed), yet its properties showed no checkpoints and just a couple of seconds of processing. It didn't take long, but these aren't immediate compute errors either. I think I'm going to abort on sight, just to hurry this all along. YMMV.

_hiVe* Joined: 17 Feb 08 Posts: 1 Credit: 2,201,590 RAC: 0	Message 64207 - Posted: 25 Nov 2009, 8:12:18 UTC
Yup, same here; it also happens on stock-clocked machines. Guess there is a fundamental problem somewhere with something.

Adam Gajdacs (Mr. Fusion) Joined: 26 Nov 05 Posts: 14 Credit: 3,253,405 RAC: 0	Message 64208 - Posted: 25 Nov 2009, 8:40:08 UTC
Also confirming this, just got all 3 of such WUs I had in my work cache bombing out on me after less than a minute of runtime:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272874405
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272875389
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272880900
Apparently I had one yesterday too but I only noticed it now:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272869210
That's 4 out of 4 for me, as far as I can tell, while other WUs process without errors.

CharlyD Joined: 1 Dec 06 Posts: 5 Credit: 135,227 RAC: 0	Message 64211 - Posted: 25 Nov 2009, 11:02:52 UTC
Same problem here. All 4 WUs I got since yesterday crashed a few seconds after start.

Yifan Song Volunteer moderator Project developer Project scientist Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0	Message 64217 - Posted: 25 Nov 2009, 16:15:46 UTC
Sorry about this error. It was a mistake on my side. It was a rather old submission for which I increased the running size, and I forgot that with the new version those running flags are obsolete.

Rowpie of the Scottish Boinc Team Joined: 26 Aug 06 Posts: 2 Credit: 621,176 RAC: 0	Message 64221 - Posted: 25 Nov 2009, 17:58:32 UTC
Thanks for the reply, Yifan Song. I take it this means it is safe to start crunching again?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64222 - Posted: 25 Nov 2009, 18:46:23 UTC Last modified: 25 Nov 2009, 18:48:08 UTC
It's been "safe" all along. The failing tasks were not causing any problems for your machine. You may still see some of the problem tasks, but churning through them will help clear things up. I just got a task with "redo" in place of "rama" in the task name: lr8_combine_smooth_torsion_it00_redo... It seems to be running fine after more than 20 CPU minutes.
[edit] Actually, the one I have running fine starts with lr5, not lr8.
Rosetta Moderator: Mod.Sense

Yifan Song Volunteer moderator Project developer Project scientist Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0	Message 64230 - Posted: 26 Nov 2009, 2:09:22 UTC
Yes, it should be safe now. The new 'redo' jobs should be good. :p

AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0	Message 64238 - Posted: 26 Nov 2009, 18:30:08 UTC - in response to Message 64230.
"Yes, it should be safe now. The new 'redo' jobs should be good. :p"
Most of the 'redo' jobs ended in SIGSEGV: segmentation violation on my computer.
Tasks: 299914416 299948625 299957772 300000282
AdeB

Nickster Joined: 27 Nov 09 Posts: 4 Credit: 8,563 RAC: 0	Message 64249 - Posted: 28 Nov 2009, 6:58:51 UTC Last modified: 28 Nov 2009, 7:00:57 UTC
Hello!
11/28/2009 1:32:38 AM rosetta@home Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file
11/28/2009 1:32:38 AM rosetta@home If this happens repeatedly you may need to reset the project.
11/28/2009 1:32:38 AM rosetta@home Restarting task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 using minirosetta version 200
11/28/2009 1:32:45 AM rosetta@home Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file
11/28/2009 1:32:45 AM rosetta@home If this happens repeatedly you may need to reset the project.
... etc.
I had a dozen of these that were in a loop for 15 minutes with no progress, so I reset the project per the message recommendation. Is there an easier approach so I don't flush my entire set of WUs next time?
Regards, Nick
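
For anyone wanting to spot this kind of restart loop without watching the message log by hand, here is a minimal sketch in Python. It assumes the log lines look exactly like the ones quoted in the post above; the function names and the threshold are made up for illustration and are not part of BOINC.

```python
import re
from collections import Counter

# Matches the client message quoted in the post above and captures the task name.
SILENT_EXIT_RE = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)

def count_silent_exits(log_lines):
    """Count, per task name, how often the silent-exit message was logged."""
    counts = Counter()
    for line in log_lines:
        match = SILENT_EXIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def looping_tasks(log_lines, threshold=2):
    """Return task names logged with a silent exit at least `threshold` times.

    The threshold is a hypothetical cut-off, not a BOINC setting.
    """
    counts = count_silent_exits(log_lines)
    return sorted(name for name, n in counts.items() if n >= threshold)

TASK = ("lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_"
        "IGNORE_THE_REST_DECOY_14886_687_1")

# The five log lines from the post above.
SAMPLE_LOG = [
    f"11/28/2009 1:32:38 AM rosetta@home Task {TASK} exited with zero status but no 'finished' file",
    "11/28/2009 1:32:38 AM rosetta@home If this happens repeatedly you may need to reset the project.",
    f"11/28/2009 1:32:38 AM rosetta@home Restarting task {TASK} using minirosetta version 200",
    f"11/28/2009 1:32:45 AM rosetta@home Task {TASK} exited with zero status but no 'finished' file",
    "11/28/2009 1:32:45 AM rosetta@home If this happens repeatedly you may need to reset the project.",
]

if __name__ == "__main__":
    for name in looping_tasks(SAMPLE_LOG):
        print("looping:", name)
```

A task that shows up here repeatedly is a candidate for aborting on its own, which avoids resetting the whole project.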

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64259 - Posted: 28 Nov 2009, 15:45:34 UTC
Sorry Nick, yes, there were problems with those tasks with '...rama...' in the name. Later they were reissued with the same names, only with 'rama' replaced by '...redo...'. Actually, BOINC would have run through them for you, reported them back and gotten more work, all without intervention. The problem for some is that it still had to download multi-MB files for each task before it could start running, only to see it fail immediately.
Resetting the project basically just aborts your current work and deletes all of the applications you've downloaded from the project. It then starts from scratch, downloading the large database and application files again in addition to a new batch of work. The problem with that approach is that the new batch of work may still be '...rama...' work units, and so they still fail.
The other approach is to abort specific tasks. Typically you will see some posts (such as this thread) by other users who have noticed the problem before you, and as soon as I have some confidence myself that there is a trend in failures, I will post on the boards. But I'm not functional 24/7, so that always depends on my observation of the activity on the boards and my personal experience. I have no access to server logs, etc., so such posts from me are, by definition, always delayed relative to the original problems.
For myself, once I realized there were consistent problems with those tasks, regardless of the specific protein being worked on, I aborted any file transfers I could catch for those specific tasks. And I have set my runtime preference to 24 hours, so the number of tasks I have is always very small and it is pretty easy to tell. The aborted transfer causes the task to fail as well, but saves me the bandwidth of downloading that task, and of resetting the project.
So, anyway, in my book, it is always best to let BOINC run itself if possible. It would have done just fine. Also, the rapid failure of so many tasks caused the project to have periods where no work was available, so it is always a good idea to configure at least one other project with steady work available, to keep your machine active (assuming that is your objective) during any shortfall specific to Rosetta. Then set resource share as desired. Even a project with a resource share of 1 will get work if your other project has none available.
Rosetta Moderator: Mod.Sense
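
To make the "abort specific tasks" approach concrete, here is a rough Python sketch that picks out only the task names matching the failing prefix reported at the top of this thread and builds one abort command per task. The helper names are hypothetical. The `boinccmd --task <project URL> <task name> abort` form is my understanding of the command-line tool, and the project URL in the usage note is an assumption, so check both against your own client before running anything. The sketch only builds the commands; it does not execute them.

```python
# Prefix of the failing work units as reported in the first post of this thread.
BAD_PREFIX = "lr8_combine_smooth_torsion_it00_rama"

def tasks_to_abort(task_names, bad_prefix=BAD_PREFIX):
    """Return only the task names that start with the known-bad prefix."""
    return [name for name in task_names if name.startswith(bad_prefix)]

def abort_commands(task_names, project_url, bad_prefix=BAD_PREFIX):
    """Build (but do not run) one boinccmd argument list per bad task.

    Assumes the `boinccmd --task URL task_name abort` syntax; verify it
    against the boinccmd shipped with your client version.
    """
    return [
        ["boinccmd", "--task", project_url, name, "abort"]
        for name in tasks_to_abort(task_names, bad_prefix)
    ]

# Two task names taken from posts in this thread.
RAMA_TASK = ("lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_"
             "IGNORE_THE_REST_DECOY_14886_687_1")
REDO_RAMA_TASK = ("lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_"
                  "SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0")
SAMPLE_TASKS = [RAMA_TASK, REDO_RAMA_TASK]

if __name__ == "__main__":
    # The URL below is an assumption; use the one your client lists for the project.
    for cmd in abort_commands(SAMPLE_TASKS, "https://boinc.bakerlab.org/rosetta/"):
        print(" ".join(cmd))
```

Matching on the full prefix rather than on the word "rama" leaves the reissued "redo_rama" tasks alone; each printed command could then be run by hand or passed to `subprocess.run`.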

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 64280 - Posted: 29 Nov 2009, 15:55:24 UTC Last modified: 29 Nov 2009, 16:02:05 UTC
What about workunits with BOTH redo and rama in the name? For example, I currently have 5 workunits whose names start with lr5_combine_smooth_torsion_it00_redo_rama. For now, I'm letting them run, in the hope that they'll do something useful.

Nickster Joined: 27 Nov 09 Posts: 4 Credit: 8,563 RAC: 0	Message 64288 - Posted: 29 Nov 2009, 23:46:16 UTC - in response to Message 64280.
"What about workunits with BOTH redo and rama in the name? For example, I currently have 5 workunits whose names start with lr5_combine_smooth_torsion_it00_redo_rama. For now, I'm letting them run, in the hope that they'll do something useful."
They should be fine - I have a lot of these and have been crunching them over the past 2 days...

l_mckeon Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0	Message 64332 - Posted: 2 Dec 2009, 7:13:32 UTC
Here's another problem with the rama...redo tasks. The tasks started from scratch when the computer was switched off and then restarted. More than 5 hours x 3 cores wasted. The tasks terminated shortly after restart.
lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0 was one of the culprits.
BTW, the tasks had been saving checkpoints according to the slots directory.

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 64347 - Posted: 3 Dec 2009, 1:52:59 UTC - in response to Message 64332.
"Here's another problem with the rama...redo tasks. The tasks started from scratch when computer was switched off then restarted. More than 5 hours X 3 cores wasted. The tasks terminated shortly after restart. lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0 was one of the culprits. BTW, the tasks had been saving checkpoints according to the slots directory."
I've found that many BOINC workunits are better able to restart after the computer is turned off if you suspend them before turning the computer off.

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 64348 - Posted: 3 Dec 2009, 2:21:21 UTC
I found that all the redo_rama workunits I got completed successfully. However, many of them went beyond the usual 100-decoy limit.

lr8_combine_smooth_torsion_it00 - All Errors?