Bad WU Batch - "TEMP_" all fail

Message boards : Number crunching : Bad WU Batch - "TEMP_" all fail

pterosoft
Joined: 27 Nov 09
Posts: 1
Credit: 1,332,891
RAC: 0
Message 68268 - Posted: 30 Oct 2010, 18:58:15 UTC

I ran into a bad WU batch yesterday. Each ran for almost 3 hours and then failed at the very end, with the same error. I had 9 WUs across 3 different systems fail this way. Wingmen on each WU are also failing with the same error. I finally aborted the 3 I had left from the batch. WUs from other batches are completing just fine. A lot of CPU time was wasted on this batch, since they run for 3 hours before failing.

All start with name "TEMP_"

Rosetta Mini 2.16

Error is:
Compute error
Too many error results
<error_code>-161</error_code>

Here are the bad WU I know about:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343254772 TEMP_5_control_1ctf__SAVE_ALL_OUT_22400_1
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343274715 TEMP_0.05_control_1ew4A_SAVE_ALL_OUT_22400_38
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343297411 TEMP_50_control_1b3aA_SAVE_ALL_OUT_22400_34
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343307185 TEMP_0.01_control_1cg5B_SAVE_ALL_OUT_22400_97
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343320953 TEMP_0.05_control_1tul__SAVE_ALL_OUT_22400_65
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343325313 TEMP_10_control_1urnA_SAVE_ALL_OUT_22400_73
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343335483 TEMP_5_control_1enh__SAVE_ALL_OUT_22400_92
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343322361 TEMP_1_control_1dhn__SAVE_ALL_OUT_22400_68
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343254959 TEMP_0.1_control_1c9oA_SAVE_ALL_OUT_22400_2
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343269869 TEMP_0.05_control_1vie__SAVE_ALL_OUT_22400_28
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343275297 TEMP_0.5_control_1c9oA_SAVE_ALL_OUT_22400_38
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343311485 TEMP_1_control_1enh__SAVE_ALL_OUT_22400_49
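For anyone who wants to check whether the failures cluster around one setting, the names above can be split into fields and tallied. This is only a sketch: the field names (`label`, `target`, `batch`, `index`) are guesses from the naming pattern, not something the project has documented, and only the first six names from the list are included.

```python
import re
from collections import Counter

# Assumed layout: TEMP_<label>_control_<5-char target>_SAVE_ALL_OUT_<batch>_<index>
# The field names are guesses for illustration, not project documentation.
NAME_RE = re.compile(
    r"^TEMP_(?P<label>[0-9.]+)_control_(?P<target>\w{5})"
    r"_SAVE_ALL_OUT_(?P<batch>\d+)_(?P<index>\d+)$"
)


def parse_wu_name(name):
    """Split a TEMP_ work unit name into its apparent fields, or return None."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


failed = [
    "TEMP_5_control_1ctf__SAVE_ALL_OUT_22400_1",
    "TEMP_0.05_control_1ew4A_SAVE_ALL_OUT_22400_38",
    "TEMP_50_control_1b3aA_SAVE_ALL_OUT_22400_34",
    "TEMP_0.01_control_1cg5B_SAVE_ALL_OUT_22400_97",
    "TEMP_0.05_control_1tul__SAVE_ALL_OUT_22400_65",
    "TEMP_10_control_1urnA_SAVE_ALL_OUT_22400_73",
]

parsed = [parse_wu_name(n) for n in failed]
by_label = Counter(p["label"] for p in parsed if p)   # failures per label value
batches = {p["batch"] for p in parsed if p}           # distinct batch numbers

print(by_label)
print(batches)
```

On this sample the failures span several label values and targets but share the batch number 22400, which fits the whole batch being bad rather than one particular input.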
ID: 68268 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68269 - Posted: 30 Oct 2010, 20:12:18 UTC
Last modified: 30 Oct 2010, 20:12:51 UTC

I see what you mean, and it is definitely not just you. It would appear there is a problem with the templates used to build these. I've asked the Project Team to look into these TEMP_ tasks.

This probably explains why folks are having trouble getting new tasks today.
Rosetta Moderator: Mod.Sense
ID: 68269 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 608
Credit: 9,529,402
RAC: 5,482
Message 68270 - Posted: 30 Oct 2010, 21:29:48 UTC

<message>
<file_xfer_error>
<file_name>TEMP_5_control_1tul__SAVE_ALL_OUT_22400_85_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>


https://boinc.bakerlab.org/rosetta/workunit.php?wuid=343331790

There are others, look at my results.

>>> since they run for 3 hours before failing.

3? You were lucky!

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 68270 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68271 - Posted: 30 Oct 2010, 22:37:27 UTC
Last modified: 30 Oct 2010, 22:40:03 UTC

Does anyone know if the decoys generated by these tasks contain any valid data which can be salvaged? If this is just a "shutdown and reporting" problem but good data is being generated, I'll let them run. I don't care about the loss of credits.

However, if the output is totally FUBAR and unusable, then I might as well set up a simple script to abort them before they burn five or six hours of CPU time.
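A script along these lines could do the aborting. It is a minimal sketch, assuming the `boinccmd` tool that ships with the BOINC client is on the PATH and that its `--get_tasks` listing prints a `name: <task name>` line per task; older clients may name these options differently, and the project URL must match what your client reports. The helper names (`task_names`, `abort_commands`, `abort_temp_tasks`) are made up for illustration.

```python
import subprocess

# Assumption: this must match the project URL your client lists for Rosetta.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def task_names(get_tasks_output, prefix="TEMP_"):
    """Return task names starting with `prefix` from `boinccmd --get_tasks` text."""
    names = []
    for line in get_tasks_output.splitlines():
        key, sep, value = line.strip().partition(": ")
        # Only the task's own "name:" line counts; "WU name:" lines are skipped.
        if sep and key == "name" and value.startswith(prefix):
            names.append(value)
    return names


def abort_commands(names, project_url=PROJECT_URL):
    """Build one `boinccmd --task <url> <name> abort` command per task."""
    return [["boinccmd", "--task", project_url, name, "abort"] for name in names]


def abort_temp_tasks(dry_run=True):
    """List tasks on the local client and abort those whose name starts with TEMP_."""
    listing = subprocess.run(
        ["boinccmd", "--get_tasks"], capture_output=True, text=True, check=True
    ).stdout
    for cmd in abort_commands(task_names(listing)):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `abort_temp_tasks()` only prints the commands it would run; pass `dry_run=False` to actually abort the tasks.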
ID: 68271 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68272 - Posted: 30 Oct 2010, 22:59:10 UTC
Last modified: 30 Oct 2010, 23:15:37 UTC

Judging by the name of the file, the fact that they ran to completion, and the fact that the task fails, it seems to me that the output file is not reaching the server... or is not being produced. My TEMP_ tasks are suspended, awaiting further word. But I am inclined to believe they are unusable. That output file, which is not being received on the server, is what carries your results. Without it, I see no good coming of these tasks beyond correcting the problem and adding it to a list of problems to avoid in the future.
Rosetta Moderator: Mod.Sense
ID: 68272 · Rating: 0
Robert
Joined: 10 Nov 05
Posts: 5
Credit: 9,609
RAC: 0
Message 68273 - Posted: 30 Oct 2010, 23:05:57 UTC - in response to Message 68271.  

I'm looking into this now.

The protocol appears to be producing useful structures; I think there's an issue with validation.

Definitely feel free to cancel these work units if you'd like. We may still be able to validate these results, but there's definitely no harm if you run other work units instead.
ID: 68273 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68274 - Posted: 30 Oct 2010, 23:17:22 UTC

Robert, correct me if I'm mistaken here, but you aren't getting a file to validate... are you? At least that's what the message would imply.
Rosetta Moderator: Mod.Sense
ID: 68274 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68275 - Posted: 30 Oct 2010, 23:19:35 UTC

OK - I took the middle ground: those "TEMP" tasks which are already running will be allowed to complete as best they can; however, I just filtered out the "TEMP" tasks in a "ready to start" state.

Just a little side note - during the short time I processed SETI before discovering Rosetta, I noticed that when you displayed a system's task list, you could further select by status - ready to start, awaiting validation, and ERROR.

It would be so handy at times to be able to display just the tasks which had an error instead of scrolling through page after page of tasks for a dozen computers.

ID: 68275 · Rating: 0
Bernd Schnitker
Joined: 2 Jan 09
Posts: 10
Credit: 62,009
RAC: 0
Message 68285 - Posted: 31 Oct 2010, 5:22:41 UTC

I have one of these that errored out. Hope folks will be able to fix them. Both I and my wingman had an error at the end.
ID: 68285 · Rating: 0
transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,576,481
RAC: 3,780
Message 68289 - Posted: 31 Oct 2010, 8:05:45 UTC - in response to Message 68275.  
Last modified: 31 Oct 2010, 8:06:02 UTC

Just a little side note - during the short time I processed SETI before discovering Rosetta, I noticed that when you displayed a system's task list, you could further select by status - ready to start, awaiting validation, and ERROR.

It would be so handy at times to be able to display just the tasks which had an error instead of scrolling through page after page of tasks for a dozen computers.

I agree. But that feature is, as far as I know, only available in the more recent versions of the BOINC server-software.

I understand it is not a trivial task to update the server-software. I think that is one of the reasons for not upgrading to a later version.
ID: 68289 · Rating: 0
Tom
Joined: 8 Oct 06
Posts: 8
Credit: 1,533,336
RAC: 0
Message 68308 - Posted: 1 Nov 2010, 1:18:58 UTC

Is it alright to process TEMP_ WUs? I have one in my queue and I'm wondering if I should throw it out.
ID: 68308 · Rating: 0
transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,576,481
RAC: 3,780
Message 68319 - Posted: 1 Nov 2010, 18:03:20 UTC

Robert said a few messages above in this thread that it is okay to cancel these units. See this message:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5499&nowrap=true#68273
ID: 68319 · Rating: 0
TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 27,414,815
RAC: 4,559
Message 68336 - Posted: 2 Nov 2010, 3:56:08 UTC - in response to Message 68268.  

All the WUs I ran yesterday that started with TEMP also failed, as did the wingmen for all the ones I checked.


I ran into a bad WU batch yesterday. Each ran for almost 3 hours and then failed at the very end, with the same error. I had 9 WUs across 3 different systems fail this way. Wingmen on each WU are also failing with the same error. I finally aborted the 3 I had left from the batch. WUs from other batches are completing just fine. A lot of CPU time was wasted on this batch, since they run for 3 hours before failing.

All start with name "TEMP_"

Rosetta Mini 2.16



Crunch with friends - TeAm Anandtech
ID: 68336 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 68348 - Posted: 2 Nov 2010, 16:57:32 UTC
Last modified: 2 Nov 2010, 17:06:37 UTC

...that was what I figured would happen, but wasn't positive. They cut another application version to fix the validator, but it can't correct the problem with the existing work units.

So, Tom, it is "alright", as in they don't do any harm to your machine or anything. But I don't believe there is any chance they will get credit. So I would suggest aborting the TEMP_ tasks from Rosetta version 2.16.
Rosetta Moderator: Mod.Sense
ID: 68348 · Rating: 0
Link
Joined: 4 May 07
Posts: 301
Credit: 379,182
RAC: 138
Message 68570 - Posted: 11 Nov 2010, 9:55:18 UTC - in response to Message 68348.  

I would suggest aborting the TEMP_ tasks from Rosetta version 2.16.


Rosetta v2.17 does not seem to make them any better; here is the stderr_txt from task 376369621:


<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
[2010-11- 9 5: 8:52:] :: BOINC:: Initializing ... ok.
[2010-11- 9 5: 8:52:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/control_1louA.vT.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
# cpu_run_time_pref: 86400
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007

(...)

Starting work on structure: _00095
Starting work on structure: _00096
Starting work on structure: _00097
Starting work on structure: _00098
Starting work on structure: _00099
======================================================
DONE :: 1 starting structures 40063.7 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>TEMP_50_control_1louA_SAVE_ALL_OUT_22400_84_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
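
To pull the failing file name and error code out of a report like this one without reading the whole log, a small regex is enough. This is a sketch, not a BOINC tool: the function name `xfer_errors` is made up, and a regex is used instead of an XML parser because the pasted text is a fragment rather than a complete document. The only code it names is -161, which BOINC defines as ERR_NOT_FOUND.

```python
import re

# Only -161 is named here; any other code is reported as "unknown".
KNOWN_CODES = {-161: "ERR_NOT_FOUND"}

XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def xfer_errors(stderr_text):
    """Return (file_name, code, label) for each <file_xfer_error> block found."""
    found = []
    for m in XFER_RE.finditer(stderr_text):
        code = int(m.group("code"))
        found.append((m.group("name"), code, KNOWN_CODES.get(code, "unknown")))
    return found


# Tail of the report quoted above.
report = """
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>TEMP_50_control_1louA_SAVE_ALL_OUT_22400_84_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
"""

for name, code, label in xfer_errors(report):
    print(f"{name}: {code} ({label})")
```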


.
ID: 68570 · Rating: 0
[AF>france>pas-de-calais]symaski62
Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 68573 - Posted: 11 Nov 2010, 13:10:27 UTC

http://www.boinc-wiki.info/Error_Code

ERR_NOT_FOUND || -161 || not found || inconsistent client state

:)
ID: 68573 · Rating: 0
[AF>france>pas-de-calais]symaski62
Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 68642 - Posted: 16 Nov 2010, 17:02:04 UTC - in response to Message 68573.  

http://www.boinc-wiki.info/Error_Code

ERR_NOT_FOUND || -161 || not found || inconsistent client state

:)


http://boincfaq.mundayweb.com/index.php?view=77&sessionID=a4d6f938094d567da63ac51e1a31dfb4

ERR_NOT_FOUND -161

This happens when you have an inconsistent client_state.xml file. Files aren't written to it.
Task not found would be the error message.


ID: 68642 · Rating: 0

©2021 University of Washington
https://www.bakerlab.org