lr8_combine_smooth_torsion_it00 - All Errors?

Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors?



Rowpie of the Scottish Boinc Team

Joined: 26 Aug 06
Posts: 2
Credit: 621,176
RAC: 0
Message 64186 - Posted: 24 Nov 2009, 23:21:07 UTC
Last modified: 24 Nov 2009, 23:22:10 UTC

Hi all,

I don't post on project forums often but this one has me stumped.

I returned to Rosetta 3 days ago and it was great until 7am this morning, when WUs started to fail.

If the WU name begins with this:

lr8_combine_smooth_torsion_it00_rama...

I get this error within seconds:

ERROR: Value of inactive option accessed: -score:dun08_dir

Link to the system: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1188270

Link to most recent error: https://boinc.bakerlab.org/rosetta/result.php?resultid=299171589

System is a Core i7 920 OC'd to 3.8 GHz with 12 GB of RAM and two GTX260 GPUs. (Hyper-Threading is on, SLI is off, and all power-saving features are definitely off!)

It passes Prime95, Memtest86+ and IntelBurnTest at these settings for many, many hours each. (I like stability in my overclocks, as I run them at 24/7 load where possible.)

I've stopped the system polling for new WUs until I can work this out, as there is no point filling the error logs up any more than it has done already.

The only other project active at the same time was SETI on the GPUs (CPU disabled in the SETI options).

To make sure the GPUs weren't starved, I have the CPU cap set to 7 out of the 8 threads.

I think that's everything. If anyone has any ideas, I look forward to the help in sorting this.

Rowpie of TSBT (Ian)
VO

Joined: 4 Nov 05
Posts: 7
Credit: 3,250,754
RAC: 0
Message 64187 - Posted: 24 Nov 2009, 23:25:39 UTC

you're not the only one

all lr8_combine_smooth_torsion_it00 are errors so... be patient


LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 64195 - Posted: 25 Nov 2009, 2:25:25 UTC

Ditto. I've got one running at the moment but with a lot of restarts so I'm just going to abort it, and I see some more coming down too which I'll keep my eye on too.

I do have some other jobs going, so if they keep me going while someone checks and advises whether we can abort on sight I'd appreciate it.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64199 - Posted: 25 Nov 2009, 3:54:25 UTC
Last modified: 25 Nov 2009, 3:54:53 UTC

Mine failed too. So it doesn't sound like any problem with your machine. We'll just have to churn through any that are left until they clear out. Shouldn't take long at 30 seconds each.
Rosetta Moderator: Mod.Sense
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 64204 - Posted: 25 Nov 2009, 5:12:00 UTC - in response to Message 64199.  

Shouldn't take long at 30 seconds each.

I'd hope not, but the one I had to abort had been running nearly 30 minutes (elapsed) but properties had no checkpoints and just a couple of seconds of processing. Didn't take long, but not immediate compute errors either.

I think I'm going to abort on sight, just to hurry this all along. YMMV.
_hiVe*

Joined: 17 Feb 08
Posts: 1
Credit: 2,138,125
RAC: 0
Message 64207 - Posted: 25 Nov 2009, 8:12:18 UTC

Yup, same here, and it also happens on stock-clocked machines. Guess there is a fundamental problem somewhere with something.
Adam Gajdacs (Mr. Fusion)

Joined: 26 Nov 05
Posts: 13
Credit: 2,876,565
RAC: 1,673
Message 64208 - Posted: 25 Nov 2009, 8:40:08 UTC

Also confirming this, just got all 3 of such WUs I had in my work cache bombing out on me after less than a minute of runtime:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272874405
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272875389
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272880900

Apparently I had one yesterday too but I only noticed it now:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=272869210

That's 4 out of 4 for me, as far as I can tell, while other WUs process without errors.
CharlyD

Joined: 1 Dec 06
Posts: 5
Credit: 135,227
RAC: 0
Message 64211 - Posted: 25 Nov 2009, 11:02:52 UTC

Same problem here. All 4 WUs I got since yesterday crashed a few seconds after start.
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 64217 - Posted: 25 Nov 2009, 16:15:46 UTC

Sorry about this error. It was a mistake on my side. It was a rather old submission for which I increased the running size, and I forgot that with the new version those running flags are obsolete.
Rowpie of the Scottish Boinc Team

Joined: 26 Aug 06
Posts: 2
Credit: 621,176
RAC: 0
Message 64221 - Posted: 25 Nov 2009, 17:58:32 UTC

Thanks for the reply Yifan Song

I take it this means it is safe to start crunching again?
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64222 - Posted: 25 Nov 2009, 18:46:23 UTC
Last modified: 25 Nov 2009, 18:48:08 UTC

It's been "safe" all along. The failing tasks were not causing any problems for your machine. You may still see some of the problem tasks, but churning through them will help clear things up.

I just got a task with "redo" in place of "rama" in the task name:
lr8_combine_smooth_torsion_it00_redo...

It seems to be running fine after more than 20 CPU minutes.

[edit] Actually, the one I have running fine starts with lr5, not lr8.
Rosetta Moderator: Mod.Sense
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 64230 - Posted: 26 Nov 2009, 2:09:22 UTC

Yes, it should be safe now.
The new 'redo' jobs should be good. :p
AdeB

Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 64238 - Posted: 26 Nov 2009, 18:30:08 UTC - in response to Message 64230.  

Yes, it should be safe now.
The new 'redo' jobs should be good. :p


Most of the 'redo' jobs ended in SIGSEGV: segmentation violation on my computer.

tasks:
299914416
299948625
299957772
300000282

AdeB
Nickster

Joined: 27 Nov 09
Posts: 4
Credit: 8,563
RAC: 0
Message 64249 - Posted: 28 Nov 2009, 6:58:51 UTC
Last modified: 28 Nov 2009, 7:00:57 UTC

Hello!

11/28/2009 1:32:38 AM	rosetta@home	Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file
11/28/2009 1:32:38 AM	rosetta@home	If this happens repeatedly you may need to reset the project.
11/28/2009 1:32:38 AM	rosetta@home	Restarting task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 using minirosetta version 200
11/28/2009 1:32:45 AM	rosetta@home	Task lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1 exited with zero status but no 'finished' file
11/28/2009 1:32:45 AM	rosetta@home	If this happens repeatedly you may need to reset the project.

... etc.


I had a dozen of these that were in a loop for 15 minutes with no progress so I reset the project per the message recommendation. Is there an easier approach so I don't flush my entire set of WUs next time?

Regards, Nick
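
A minimal sketch of how the restart loop in the messages above could be spotted per task, so that only the affected tasks need attention. It assumes the client messages have been copied into plain text lines in the tab-separated layout quoted in this post. The function names `count_silent_exits` and `looping_tasks` and the default threshold of 2 are illustrative assumptions, not part of BOINC.

```python
import re
from collections import Counter

# Matches the client message quoted above; the task name is the token after "Task".
NO_FINISHED_FILE = re.compile(
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_silent_exits(log_lines):
    """Count, per task name, how often the 'no finished file' message appears."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED_FILE.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


def looping_tasks(log_lines, threshold=2):
    """Return sorted task names that hit the message at least `threshold` times.

    The threshold is an arbitrary choice for this sketch.
    """
    counts = count_silent_exits(log_lines)
    return sorted(task for task, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    task = "lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1"
    sample = [
        f"11/28/2009 1:32:38 AM\trosetta@home\tTask {task} exited with zero status but no 'finished' file",
        "11/28/2009 1:32:38 AM\trosetta@home\tIf this happens repeatedly you may need to reset the project.",
        f"11/28/2009 1:32:38 AM\trosetta@home\tRestarting task {task} using minirosetta version 200",
        f"11/28/2009 1:32:45 AM\trosetta@home\tTask {task} exited with zero status but no 'finished' file",
    ]
    for name in looping_tasks(sample):
        print(name)
```

Tasks flagged this way can be aborted individually in the BOINC Manager, which leaves the rest of the work queue untouched.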
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64259 - Posted: 28 Nov 2009, 15:45:34 UTC

Sorry Nick, yes, there were problems with those tasks with '...rama...' in the name. Later they were reissued with the same names, except that rama is replaced with '...redo...'.

Actually BOINC would have run through them for you and reported them back and gotten more work, all without intervention. The problem for some is that it still had to download multi-MB files for each task before it could start running to see it would fail immediately.

Resetting the project just basically aborts your current work, and deletes all of the applications you've downloaded from the project. It then starts from scratch downloading the large database and application files again in addition to a new batch of work. The problem with that approach is that the new batch of work may still be '...rama...' work units and so they still fail.

The other approach is to abort specific tasks. Typically you will see some posts (such as this thread) by other users who have noticed the problem before you, and as soon as I have some confidence myself that there is a trend in failures, I will post on the boards. But I'm not functional 24/7, so that always depends on my observation of the activity on the boards and my personal experience. I have no access to server logs etc., so such posts from me are, by definition, always delayed from the original problems.

For myself, once I realized there were consistent problems with those, regardless of the specific protein being worked on, I aborted any file transfers I could catch for those specific tasks. I have set my runtime preference to 24 hours, so the number of tasks I have is always very small and it is pretty easy to tell. The failed download causes the task to fail as well, but saves me the bandwidth of downloading that task and of resetting the project.

So, anyway, in my book, always best to let BOINC run itself if possible. It would have done just fine. Also, the rapid failure of so many tasks caused the project to have periods where no work was available, so it is always a good idea to configure at least one other project with steady work available, to keep your machine active (assuming that is your objective) during any shortfall specific to Rosetta. Then set resource share as desired. Even a project with a 1 resource share will get work if your other project has none available.
Rosetta Moderator: Mod.Sense
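
The name-based selection described in this post can be sketched as follows. This is only an illustration under stated assumptions: task names are available as plain strings (for example, copied from the BOINC Manager task list), and the prefix is the one reported as failing at the top of this thread. The names `BAD_PREFIX`, `is_known_bad` and `split_tasks` are hypothetical helpers, not part of BOINC or Rosetta.

```python
# Prefix of the failing batch, as reported at the start of this thread.
BAD_PREFIX = "lr8_combine_smooth_torsion_it00_rama"


def is_known_bad(task_name: str) -> bool:
    """Return True if the task name starts with the failing batch prefix."""
    return task_name.startswith(BAD_PREFIX)


def split_tasks(task_names):
    """Split task names into (abort_candidates, keep) lists, preserving order."""
    abort_candidates = [name for name in task_names if is_known_bad(name)]
    keep = [name for name in task_names if not is_known_bad(name)]
    return abort_candidates, keep


if __name__ == "__main__":
    names = [
        "lr8_combine_smooth_torsion_it00_rama01_A_rlbd_2acy_IGNORE_THE_REST_DECOY_14886_687_1",
        "lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0",
    ]
    bad, good = split_tasks(names)
    print("abort candidates:", bad)
    print("keep:", good)
```

Matching on the full prefix rather than on the substring "rama" matters here, because the reissued tasks keep "rama" later in the name (for example `..._redo_rama08_...`) and would otherwise be flagged as well.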
robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 64280 - Posted: 29 Nov 2009, 15:55:24 UTC
Last modified: 29 Nov 2009, 16:02:05 UTC

What about workunits with BOTH redo and rama in the name?

For example, I currently have 5 workunits whose names start with lr5_combine_smooth_torsion_it00_redo_rama.

For now, I'm letting them run, in the hope that they'll do something useful.
Nickster

Joined: 27 Nov 09
Posts: 4
Credit: 8,563
RAC: 0
Message 64288 - Posted: 29 Nov 2009, 23:46:16 UTC - in response to Message 64280.  

What about workunits with BOTH redo and rama in the name?

For example, I currently have 5 workunits whose names start with lr5_combine_smooth_torsion_it00_redo_rama.

For now, I'm letting them run, in the hope that they'll do something useful.


They should be fine - I have a lot of these and have been crunching them over the past 2 days...
l_mckeon

Joined: 5 Jun 07
Posts: 44
Credit: 180,717
RAC: 0
Message 64332 - Posted: 2 Dec 2009, 7:13:32 UTC

Here's another problem with the rama...redo tasks. The tasks started from scratch when the computer was switched off and then restarted. More than 5 hours X 3 cores wasted. The tasks terminated shortly after restart.

lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0 was one of the culprits.

BTW, the tasks had been saving checkpoints according to the slots directory.
robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 64347 - Posted: 3 Dec 2009, 1:52:59 UTC - in response to Message 64332.  

Here's another problem with the rama...redo tasks. The tasks started from scratch when the computer was switched off and then restarted. More than 5 hours X 3 cores wasted. The tasks terminated shortly after restart.

lr5_combine_smooth_torsion_it00_redo_rama08_A_rlbd_1c7k_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16279_992_0_0 was one of the culprits.

BTW, the tasks had been saving checkpoints according to the slots directory.


I've found that many BOINC workunits are better able to restart after the computer is turned off if you suspend them before turning the computer off.
robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 64348 - Posted: 3 Dec 2009, 2:21:21 UTC

I found that all the redo_rama workunits I got completed successfully.

However, many of them went beyond the usual 100 decoys limit.



©2024 University of Washington
https://www.bakerlab.org