Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors?
Author | Message |
---|---|
Rowpie of the Scottish Boinc Team Send message Joined: 26 Aug 06 Posts: 2 Credit: 621,176 RAC: 0 |
Hi all, I don't post on project forums often, but this one has me stumped. I returned to Rosetta 3 days ago and it was great until 7am this morning, when WUs started to fail. If the WU name begins with lr8_combine_smooth_torsion_it00_rama... I get this error within seconds: ERROR: Value of inactive option accessed: -score:dun08_dir Link to the system: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1188270 Link to most recent error: https://boinc.bakerlab.org/rosetta/result.php?resultid=299171589 System is a Core i7 920 OC'd to 3.8 GHz with 12 GB of RAM and two GTX260 GPUs. (Hyper-Threading is on, SLI is off and all power-saving features are definitely off!) It passes Prime95, Memtest86+ and IntelBurnTest at these settings for many, many hours each. (I like stability in my overclocks as I run them at 24/7 load where possible.) I've stopped the system polling for new WUs until I can work this out, as there is no point filling the error logs up any more than it has done already. The only other project active at the same time was SETI on the GPUs (CPU disabled in SETI options). To make sure the GPUs weren't starved I have the CPU cap set to 7 out of the 8 threads. I think that's everything. If anyone has any ideas then I look forward to the help in sorting this. Rowpie of TSBT (Ian) |
VO Send message Joined: 4 Nov 05 Posts: 7 Credit: 3,250,754 RAC: 0 |
You're not the only one: all lr8_combine_smooth_torsion_it00 tasks are erroring, so... be patient |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
Ditto. I've got one running at the moment but with a lot of restarts so I'm just going to abort it, and I see some more coming down too which I'll keep my eye on too. I do have some other jobs going, so if they keep me going while someone checks and advises whether we can abort on sight I'd appreciate it. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mine failed too. So it doesn't sound like any problem with your machine. We'll just have to churn through any that are left until they clear out. Shouldn't take long at 30 seconds each. Rosetta Moderator: Mod.Sense |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
"Shouldn't take long at 30 seconds each." I'd hope not, but the one I had to abort had been running nearly 30 minutes (elapsed) but properties had no checkpoints and just a couple of seconds of processing. Didn't take long, but not immediate compute errors either. I think I'm going to abort on sight, just to hurry this all along. YMMV. |
_hiVe* Send message Joined: 17 Feb 08 Posts: 1 Credit: 2,138,125 RAC: 0 |
Yup, same here, and it also happens on stock-clocked machines. Guess there is a fundamental problem somewhere with something. |
Adam Gajdacs (Mr. Fusion) Send message Joined: 26 Nov 05 Posts: 13 Credit: 2,876,565 RAC: 1,673 |
Also confirming this, just got all 3 of such WUs I had in my work cache bombing out on me after less than a minute of runtime: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272874405 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272875389 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272880900 Apparently I had one yesterday too but I only noticed it now: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272869210 That's 4 out of 4 for me, as far as I can tell, while other WUs process without errors. |
CharlyD Send message Joined: 1 Dec 06 Posts: 5 Credit: 135,227 RAC: 0 |
Same problem here. All 4 WUs I got since yesterday crashed a few seconds after start |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
Sorry about this error. It was a mistake on my side. It was a rather old submission for which I increased the run size, forgetting that those run flags are obsolete in the new version. |
Rowpie of the Scottish Boinc Team Send message Joined: 26 Aug 06 Posts: 2 Credit: 621,176 RAC: 0 |
Thanks for the reply Yifan Song I take it this means it is safe to start crunching again? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It's been "safe" all along. The failing tasks were not causing any problems for your machine. You may still see some of the problem tasks, but churning through them will help clear things up. I just got a task with "redo" in place of "rama" in the task name: lr8_combine_smooth_torsion_it00_redo... It seems to be running fine after more than 20 CPU minutes. [edit] Actually, the one I have running fine starts with lr5, not lr8. Rosetta Moderator: Mod.Sense |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
Yes, it should be safe now. The new 'redo' jobs should be good. :p |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
"Yes, it should be safe now." Most of the 'redo' jobs ended in SIGSEGV: segmentation violation on my computer. tasks: 299914416 299948625 299957772 300000282 AdeB |
Nickster Send message Joined: 27 Nov 09 Posts: 4 Credit: 8,563 RAC: 0 |
Hello! 11/28/2009 1:32:38 AM rosetta@home Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file 11/28/2009 1:32:38 AM rosetta@home If this happens repeatedly you may need to reset the project. 11/28/2009 1:32:38 AM rosetta@home Restarting task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 using minirosetta version 200 11/28/2009 1:32:45 AM rosetta@home Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file 11/28/2009 1:32:45 AM rosetta@home If this happens repeatedly you may need to reset the project. ... etc. I had a dozen of these that were in a loop for 15 minutes with no progress so I reset the project per the message recommendation. Is there an easier approach so I don't flush my entire set of WUs next time? Regards, Nick |
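A restart loop like the one in the log above can be spotted by counting how often each task hits the "no 'finished' file" message. The following is only a minimal sketch: the regular expression is built from the message text quoted in the post, and the function names, the `threshold` parameter, and the assumption that each log entry sits on its own line are all hypothetical, not part of BOINC.

```python
import re
from collections import Counter

# Matches the message quoted in the post; the surrounding log layout
# (timestamp and project name before the message) is an assumption.
NO_FINISHED = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_restart_loops(log_lines):
    """Map each task name to the number of 'no finished file' exits seen."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_tasks(log_lines, threshold=2):
    """Return names of tasks that hit the message at least `threshold` times."""
    counts = count_restart_loops(log_lines)
    return sorted(name for name, n in counts.items() if n >= threshold)


TASK = ("lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_"
        "IGNORE_THE_REST_DECOY_14886_687_1")

# Sample entries copied from the post, one per line.
sample_log = [
    f"11/28/2009 1:32:38 AM rosetta@home Task {TASK} exited with zero status but no 'finished' file",
    "11/28/2009 1:32:38 AM rosetta@home If this happens repeatedly you may need to reset the project.",
    f"11/28/2009 1:32:38 AM rosetta@home Restarting task {TASK} using minirosetta version 200",
    f"11/28/2009 1:32:45 AM rosetta@home Task {TASK} exited with zero status but no 'finished' file",
]

print(looping_tasks(sample_log))
```

Tasks that show up in the result are candidates for aborting individually, which avoids resetting the whole project.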
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Sorry Nick, yes there were problems with those tasks with '...rama...' in the name. Later they were reissued with the same names, except that rama is replaced with '...redo...'. Actually BOINC would have run through them for you and reported them back and gotten more work, all without intervention. The problem for some is that it still had to download multi-MB files for each task before it could start running to see it would fail immediately. Resetting the project just basically aborts your current work, and deletes all of the applications you've downloaded from the project. It then starts from scratch downloading the large database and application files again in addition to a new batch of work. The problem with that approach is that the new batch of work may still be '...rama...' work units and so they still fail. The other approach is to abort specific tasks. Typically you will see some posts (such as this thread) by other users that have noticed the problem before you, and as soon as I have some confidence myself that there is a trend in failures, I will post on the boards (but I'm not functional 24/7, so that always depends on my observation of the activity on the boards and my personal experience. I have no access to server logs etc. So such posts from me are, by definition, always delayed from the original problems.) For myself, once I realized there were consistent problems with those, regardless of the specific protein being worked on, I aborted any file transfers I could catch for those specific tasks. And I have set my runtime preference to 24 hours, so the number of tasks I have is always very small so it is pretty easy to tell. This failed download causes the task to fail as well, but saves me the bandwidth of the download of that task, and of resetting the project. So, anyway, in my book, always best to let BOINC run itself if possible. It would have done just fine.
Also, the rapid failure of so many tasks caused the project to have periods where no work was available, so it is always a good idea to configure at least one other project with steady work available, to keep your machine active (assuming that is your objective) during any shortfall specific to Rosetta. Then set resource share as desired. Even a project with a resource share of 1 will get work if your other project has none available. Rosetta Moderator: Mod.Sense |
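The resource share point can be illustrated with a little arithmetic. This is a deliberately simplified sketch, not the BOINC client's actual scheduler: it assumes processing time is split in proportion to resource share among the projects that currently have work, and the project names and share values are made up for the example.

```python
def share_fractions(shares, has_work):
    """Split processing time among projects that currently have work.

    shares:   dict of project name -> resource share (hypothetical values)
    has_work: dict of project name -> True if the project has tasks available
    Returns a dict of project name -> fraction of processing time.
    """
    active = {p: s for p, s in shares.items() if has_work.get(p, False)}
    total = sum(active.values())
    if total == 0:
        return {}  # nothing to run anywhere
    return {p: s / total for p, s in active.items()}


# Hypothetical setup: Rosetta as the main project, a backup with share 1.
shares = {"rosetta": 100, "backup": 1}

normal = share_fractions(shares, {"rosetta": True, "backup": True})
shortfall = share_fractions(shares, {"rosetta": False, "backup": True})

print(normal)
print(shortfall)
```

Under this model the backup project gets about 1% of the time while Rosetta has work, and all of it during a Rosetta shortfall.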
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
What about workunits with BOTH redo and rama in the name? For example, I currently have 5 workunits whose names start with lr5_combine_smooth_torsion_it00_redo_rama. For now, I'm letting them run, in the hope that they'll do something useful. |
Nickster Send message Joined: 27 Nov 09 Posts: 4 Credit: 8,563 RAC: 0 |
"What about workunits with BOTH redo and rama in the name?" They should be fine - I have a lot of these and have been crunching them over the past 2 days... |
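The naming rules discussed in this thread can be written down as a small check. It is only a sketch based on what posters reported: the failing batch is identified by the prefix from the first post, and any name containing `_redo` is treated as a reissue. The function name and the labels are hypothetical, and a "reissued" label does not guarantee the task will finish cleanly.

```python
# Prefix of the failing batch, as reported in the first post of the thread.
BAD_PREFIX = "lr8_combine_smooth_torsion_it00_rama"


def classify(task_name):
    """Label a task name according to the naming pattern seen in this thread."""
    if task_name.startswith(BAD_PREFIX):
        return "known-bad"   # original batch that fails within seconds
    if "_redo" in task_name:
        return "reissued"    # resubmitted batch, including redo_rama names
    return "other"


# Names taken from posts in this thread.
examples = [
    "lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1",
    "lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0",
]

for name in examples:
    print(classify(name), name)
```

Checking the prefix first matters: a name with both redo and rama never starts with the bad prefix, because redo comes before rama in the reissued names.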
l_mckeon Send message Joined: 5 Jun 07 Posts: 44 Credit: 180,717 RAC: 0 |
Here's another problem with the rama...redo tasks. The tasks started from scratch when computer was switched off then restarted. More than 5 hours X 3 cores wasted. The tasks terminated shortly after restart. lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0 was one of the culprits. BTW, the tasks had been saving checkpoints according to the slots directory. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
"Here's another problem with the rama...redo tasks. The tasks started from scratch when computer was switched off then restarted. More than 5 hours X 3 cores wasted. The tasks terminated shortly after restart." I've found that many BOINC workunits are better able to restart after the computer is turned off if you suspend them before turning the computer off. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
I found that all the redo_rama workunits I got completed successfully. However, many of them went beyond the usual 100 decoys limit. |
©2024 University of Washington
https://www.bakerlab.org