compute errors

Message boards : Number crunching : compute errors


Bubba

Joined: 2 Aug 07
Posts: 2
Credit: 616,815
RAC: 0
Message 71432 - Posted: 18 Oct 2011, 22:48:55 UTC

Reinstalled BOINC and tried several setups. I am now only crunching 4 failed units a day!

Running Win7 64-bit and BOINC 64-bit.



Task ID: 455390035
Name: jsr_decoys_cst_2i6c_abrelax_34261_380_0
Workunit: 415606757
Created: 12 Oct 2011 7:41:19 UTC
Sent: 12 Oct 2011 7:52:19 UTC
Received: 12 Oct 2011 22:03:49 UTC
Server state: Over
Outcome: Client error
Client state: New
Exit status: 0 (0x0)
Computer ID: 1482956
Report deadline: 22 Oct 2011 7:52:19 UTC
CPU time: 10317.08 seconds
Stderr output:
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<stderr_txt>
[2011-10-12 2:42:15:] :: BOINC:: Initializing ... ok.
[2011-10-12 2:42:15:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/2i6cA_cst.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
Starting work on structure: _00002
Starting work on structure: _00003
======================================================
DONE :: 1 starting structures 10316.1 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: WS_max 2.91103e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>


Validate state: Invalid
Claimed credit: 85.0680325680518
Granted credit: 85.0680325680518
Application version: ---
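
The report above is internally inconsistent: the application finished its decoys and called boinc_finish with exit status 0, yet the server recorded a client error and an invalid result. As a rough sketch only, a few lines of Python can pull those facts out of a pasted report. The function name `summarize_task` and the dictionary layout are assumptions made for this illustration, not part of BOINC or Rosetta; the log excerpt is copied from the report.

```python
import re

def summarize_task(fields, stderr_text):
    """Return short observations about a pasted BOINC task report.

    `fields` is a plain dict of the summary rows and `stderr_text` is the
    stderr section; both layouts are assumptions made for this sketch.
    """
    notes = []
    # Rosetta prints a line such as "This process generated 3 decoys from 3 attempts".
    match = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    if match:
        notes.append(f"{match.group(1)} decoys from {match.group(2)} attempts")
    # A clean run ends with "called boinc_finish" and reports exit status 0.
    if "called boinc_finish" in stderr_text and fields.get("Exit status", "").startswith("0"):
        notes.append("application exited cleanly")
    # The server-side verdict is recorded separately from the exit status.
    if fields.get("Outcome") == "Client error" or fields.get("Validate state") == "Invalid":
        notes.append("server rejected the result")
    return notes

fields = {
    "Exit status": "0 (0x0)",
    "Outcome": "Client error",
    "Validate state": "Invalid",
}
stderr_text = """\
DONE :: 1 starting structures 10316.1 cpu seconds
This process generated 3 decoys from 3 attempts
BOINC :: Watchdog shutting down...
called boinc_finish
"""

for note in summarize_task(fields, stderr_text):
    print(note)
```

Seeing "application exited cleanly" next to "server rejected the result" suggests the failure happened after the science code finished, for example while the result was being reported or validated, rather than during the computation itself.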
dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 114,331,513
RAC: 51,221
Message 71444 - Posted: 19 Oct 2011, 21:45:20 UTC

Bad RAM, excessive overclock, excessive temperatures, bad BOINC/rosetta file, or faulty PSU (often very difficult to catch) would be where I'd start looking, in that order... Prime95 stress test is a good starting point.
Bubba

Joined: 2 Aug 07
Posts: 2
Credit: 616,815
RAC: 0
Message 71450 - Posted: 20 Oct 2011, 14:16:07 UTC - in response to Message 71447.  

That does not seem to be the issue when running GPUGRID and WCG. IMHO it is Rosetta!
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 71458 - Posted: 22 Oct 2011, 15:20:12 UTC

Rosetta utilizes memory more intensively than many other BOINC applications. And comparisons to an application running primarily on another processor in the machine (i.e. GPU) are really not meaningful.

Look at it the other way around: if the problem is with Rosetta, then why are there not more people having such a problem?
Rosetta Moderator: Mod.Sense
Symeon

Joined: 14 Sep 11
Posts: 1
Credit: 3,166
RAC: 0
Message 71518 - Posted: 28 Oct 2011, 5:05:05 UTC

I'm also having this error. My CPU and RAM are overclocked but pass all stress tests; I don't see why my CPU/RAM would make faulty calculations only with Rosetta and not with the stress test and Memtest86+...
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71519 - Posted: 28 Oct 2011, 7:09:36 UTC - in response to Message 71518.  
Last modified: 28 Oct 2011, 7:10:42 UTC

> I'm also having this error. My CPU and RAM are overclocked but pass all stress tests; I don't see why my CPU/RAM would make faulty calculations only with Rosetta and not with the stress test and Memtest86+...


Well, put your system back to stock and see if you still get errors; if not, then you've got your answer!

Too easy.

My 2c worth.
dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 114,331,513
RAC: 51,221
Message 71520 - Posted: 28 Oct 2011, 7:51:47 UTC - in response to Message 71518.  

> I'm also having this error. My CPU and RAM are overclocked but pass all stress tests; I don't see why my CPU/RAM would make faulty calculations only with Rosetta and not with the stress test and Memtest86+...


Your errors appear to be related to the default.out problem that's mentioned in these threads:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5833
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5835

So nothing to worry about if those are the only errors.

Danny
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 71521 - Posted: 28 Oct 2011, 11:29:18 UTC - in response to Message 71520.  

> > I'm also having this error. My CPU and RAM are overclocked but pass all stress tests; I don't see why my CPU/RAM would make faulty calculations only with Rosetta and not with the stress test and Memtest86+...
>
> Your errors appear to be related to the default.out problem that's mentioned in these threads:
>
> https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5833
> https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5835
>
> So nothing to worry about if those are the only errors.
>
> Danny


From what I've seen, the error messages referring to default.out only mean that, due to an earlier error, there was no output file named default.out.

Therefore, you need to compare the earlier error messages as well.
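
One way to follow that advice is to list the error lines in the order they appear while skipping the ones that only complain about default.out. This is a minimal sketch: the helper name `earliest_errors` is made up for this post, and the sample error lines are hypothetical placeholders, not real Rosetta messages.

```python
def earliest_errors(stderr_text, ignore="default.out"):
    """Return (line_number, text) for error lines that do not mention `ignore`.

    Matching on the word "error" is an assumption; real logs may phrase
    failures differently, so adjust the test to what the log contains.
    """
    hits = []
    for number, line in enumerate(stderr_text.splitlines(), start=1):
        lowered = line.lower()
        if "error" in lowered and ignore not in lowered:
            hits.append((number, line.strip()))
    return hits

# Hypothetical log excerpt, for illustration only.
sample = """\
Starting work on structure: _00001
ERROR: hypothetical earlier failure
ERROR: cannot open default.out
"""

print(earliest_errors(sample))
```

The first entry returned is the message worth comparing between failed tasks, since the default.out complaint is only a follow-on symptom.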
TPCBF

Joined: 29 Nov 10
Posts: 109
Credit: 4,624,446
RAC: 1,613
Message 71523 - Posted: 28 Oct 2011, 18:15:38 UTC

Got the same kind of compute errors now too, claiming problems with the .out file, which I think is a red herring. This and a number of validate errors started to show up since the update to 3.17...

Ralf
Rocco Moretti

Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 71526 - Posted: 28 Oct 2011, 21:24:55 UTC - in response to Message 71523.  

> Got the same kind of compute errors now too, claiming problems with the .out file, which I think is a red herring. This and a number of validate errors started to show up since the update to 3.17...
>
> Ralf


cmiles talked about the reason in another thread.

Basically, one of the changes that happened during 3.17 is that one of the protein movers used in protein-protein interface design changed names. (This was done to avoid a name collision with another protein mover which was added to Rosetta.)

This meant that runs which worked perfectly well during previous versions of Rosetta@Home now crash. Theoretically, these sorts of issues should be discovered when we test new versions of the client on RALPH@home, but this one happened to slip through, and wasn't discovered until a large number of jobs using the renamed mover were launched.
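
To illustrate why a rename alone is enough to crash older jobs, here is a toy sketch of a name-based mover lookup. It is not Rosetta code: the registry, the function names, and the mover names `OldInterfaceMover` and `NewInterfaceMover` are all invented for this example.

```python
# Hypothetical registry: only the new name is known after the rename.
MOVERS = {
    "NewInterfaceMover": lambda pose: pose + ["interface design step"],
}

def unknown_movers(script):
    """Return the mover names in `script` that the registry does not know."""
    return [name for name in script if name not in MOVERS]

def run_job(script, pose=None):
    """Apply each named mover in order; fail on the first unknown name."""
    pose = list(pose or [])
    for name in script:
        if name not in MOVERS:
            raise KeyError(f"unknown mover: {name}")
        pose = MOVERS[name](pose)
    return pose

old_script = ["OldInterfaceMover"]  # job description written before the rename
new_script = ["NewInterfaceMover"]  # job description using the new name

print(run_job(new_script))          # runs normally
print(unknown_movers(old_script))   # a pre-flight check flags the stale name
```

In this toy model, a job that still uses the old name fails as soon as the lookup happens, which matches the pattern of tasks that error out almost immediately. A pre-flight check like `unknown_movers` is one way such a mismatch could be caught before jobs are sent out, assuming the job scripts are available for inspection at that point.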

The fallout is that the validators and assimilators on the servers are swamped with the large number of jobs which are sent out and almost immediately come back with errors. We've killed the bad jobs with as much firepower as we can reasonably bring to bear, but unfortunately it'll take a little while for the servers to work through the backlog of bad jobs that have been sent out.

Thanks for your patience.




©2024 University of Washington
https://www.bakerlab.org