minirosetta 2.17

Message boards : Number crunching : minirosetta 2.17

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 69972 - Posted: 5 Apr 2011, 8:49:04 UTC

Two more, both newly issued overnight:
T0590_boinc_nmr_max40_rerun_abrelax_cs_frags_negative_tex_IGNORE_THE_REST_23856_3040 first copy sent 5 Apr 2011 3:07:32 UTC
T0569_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_6710 first copy sent 5 Apr 2011 5:21:21 UTC

All copies ended with:
process exited with code 1 (0x1, -255)

ERROR: ERROR: FragmentIO: could not open file cs_frags.9mers.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
ID: 69972 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 69974 - Posted: 5 Apr 2011, 11:53:28 UTC

Looks much like a problem I've seen before. If so, the version of gzip sent as part of the workunit works properly in one direction (compressing or decompressing) but not the other.
ID: 69974 · Rating: 0
Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 69982 - Posted: 6 Apr 2011, 10:35:17 UTC

Here is another corrupt work unit: 376739708

While writing this post up, I remembered that another project, Docking@home, once spewed corrupt work units all over the place because the disk drives on its work unit generation server had filled up completely. As a result, the work unit generator was creating work units that were zero bytes long. The client software tried to crunch these corrupt work units, went nowhere, and failed to declare compute errors, wasting its volunteers' time and electricity until everyone was told to abort the corrupt work units. Could one of the Rosetta@home servers generating work units have a full hard drive? I wonder whether what happened at Docking@home is happening here, except that the client code is smart enough to declare a computation error when it encounters a corrupt work unit.
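
A volunteer who wants to rule out a damaged or zero-byte input file locally could check whether a gzip file such as cs_frags.9mers.gz is present, non-empty, and decompresses cleanly. The following is a minimal Python sketch using only the standard library. It is not part of Rosetta or BOINC, and the function name `check_gzip_file` is hypothetical.

```python
import gzip
import os
import zlib


def check_gzip_file(path):
    """Return 'missing', 'empty', 'corrupt', or 'ok' for a gzip file.

    Hypothetical helper for illustration; not taken from Rosetta or BOINC.
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # A zero-byte file is what a full disk on the generating server could produce.
        return "empty"
    try:
        with gzip.open(path, "rb") as handle:
            # Read in 1 MiB chunks so a large file is never held in memory whole.
            while handle.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad header, truncated stream, or damaged compressed data.
        return "corrupt"
    return "ok"
```

Calling `check_gzip_file("cs_frags.9mers.gz")` from the directory where the work unit was unpacked would separate a file that was never delivered ("missing") from one that arrived damaged ("empty" or "corrupt"). Where the client actually places that file is an assumption here.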
ID: 69982 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 70021 - Posted: 12 Apr 2011, 16:43:45 UTC

A couple of new protein interface design tasks failing immediately with a computation error on Mac. Sample:

Task 414128879 (dck_rhoA_rhoA_2nr7_final_ProteinInterfaceDesign_11Apr2011_25012_119_0)

ERROR: Option matching -docking:no_filters not found in command line top-level context

ID: 70021 · Rating: 0
moody
Volunteer moderator
Project developer
Project scientist
Joined: 8 Jun 10
Posts: 11
Credit: 88,068
RAC: 0
Message 70030 - Posted: 13 Apr 2011, 20:08:16 UTC

In response to Message 70021:

"A couple of new protein interface design tasks failing immediately with a computation error on Mac. Sample:

Task 414128879 (dck_rhoA_rhoA_2nr7_final_ProteinInterfaceDesign_11Apr2011_25012_119_0)

ERROR: Option matching -docking:no_filters not found in command line top-level context"

This was due to a Rosetta option that was recently renamed on our end but not in the version of Rosetta currently on BOINC. It should be fixed now. We apologize for any inconvenience.

ID: 70030 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 70045 - Posted: 16 Apr 2011, 17:59:04 UTC

This task 414671655 was puzzling in a couple of respects.

First, it took about 7 hours (against a 3-hour requested run time) to complete 1 decoy. This would be explicable if the model were particularly large, but the log indicated that some other error was occurring:


Watchdog active.
BOINC:: CPU time: 25609.9s, 14400s + 10800s[2011- 4-15 1:37: 2:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 25609.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish

</stderr_txt>
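
The "CPU time" line in the log above can be checked with a little arithmetic. The Python sketch below assumes that 10800s is the 3-hour requested run time and that 14400s is an extra 4-hour allowance added on top of it. That reading of the log line is an assumption, not something the log itself states, and the function name `summarize_cpu_time` is made up for illustration.

```python
def summarize_cpu_time(cpu_seconds, allowance_seconds, target_seconds):
    """Compare reported CPU time with target + allowance (assumed meaning of the log fields)."""
    limit_seconds = allowance_seconds + target_seconds
    return {
        "cpu_hours": round(cpu_seconds / 3600, 2),
        "limit_hours": round(limit_seconds / 3600, 2),
        "over_limit_seconds": round(cpu_seconds - limit_seconds, 1),
    }


# Values copied from the log line: "CPU time: 25609.9s, 14400s + 10800s"
summary = summarize_cpu_time(25609.9, 14400, 10800)
print(summary)
```

Under that assumption the task ran for about 7.11 CPU hours, slightly past a 7-hour combined limit, which matches the "about 7 hours" observed.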
ID: 70045 · Rating: 0
bobgoblin
Joined: 15 Oct 05
Posts: 2
Credit: 1,616,056
RAC: 0
Message 70047 - Posted: 17 Apr 2011, 14:13:48 UTC
Last modified: 17 Apr 2011, 14:15:18 UTC

I've noticed that both my i7 machines have been taking 12+ hours to complete WUs. When I look in pending tasks once they are reported, they all show less than 3 hours. My i5 machine is still crunching them in less than 3 hours. So, I've disabled Rosetta on the i7s for now. Any idea what may be causing that?

All machines are running Win7, though the i7s were upgraded last December from Vista64; the i5 had Win7 installed when it was built.
ID: 70047 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70048 - Posted: 17 Apr 2011, 16:48:13 UTC - in response to Message 70047.  
Last modified: 17 Apr 2011, 16:48:31 UTC

I've noticed that both my i7 machines have been taking 12+ hours to complete WUs. When I look in pending tasks once they are reported, they all show less than 3 hours. My i5 machine is still crunching them in less than 3 hours. So, I've disabled Rosetta on the i7s for now. Any idea what may be causing that?

All machines are running Win7, though the i7s were upgraded last December from Vista64; the i5 had Win7 installed when it was built.


Your main concern (because you've "disabled Rosetta") seems to be whether things are running properly, or doing harm to your machines. Nothing you've described implies any harm. In fact, the tasks complete in the 3 hours of CPU time that you've (likely) set (or defaulted to) in your R@h preferences.

I think what you are saying is that "wall clock" time is over 12 hours, but actual CPU time is around 3 hours. So the question boils down to asking why tasks might not be receiving CPU time when they are trying to run. This could be due to other tasks on the machine demanding CPU (as BOINC runs at lowest possible priority, and will yield to other tasks).
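
The gap between wall-clock time and CPU time can be expressed as a simple ratio. Here is a small Python sketch using the roughly 12 hours of elapsed time and 3 hours of CPU time reported above as example inputs. The helper name `cpu_efficiency` is made up for illustration and is not a BOINC function.

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """Fraction of elapsed (wall-clock) time during which the task actually had the CPU."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


# Example inputs from the report: about 3 hours of CPU time over about 12 hours elapsed.
ratio = cpu_efficiency(3 * 3600, 12 * 3600)
print(f"Task had the CPU for {ratio:.0%} of the elapsed time")
```

A ratio well below 100% suggests the task spent most of its time waiting rather than computing. It does not by itself say what the task was waiting for.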

It seems fairly likely that, with one 8-core machine running in 6GB of memory and the other 8-core machine running in 8GB of memory, you would see "waiting for memory" as the status of several tasks rather than "running". This causes BOINC to stop giving the tasks CPU time until the total memory of other active tasks comes back down to within the memory preferences set in your BOINC Manager. So, when memory becomes constrained, BOINC is no longer using all of the CPUs of the machine (or all of the CPUs BOINC is configured to use).

This is likely not occurring on your 4-core machine because it has 6GB of memory (50% to 100% more per core than the other machines).
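
The per-core memory comparison can be worked out directly. This Python sketch uses the core counts and memory sizes mentioned above; the dictionary labels and the function name `gb_per_core` are only for illustration.

```python
def gb_per_core(memory_gb, cores):
    """Memory available per core if every core runs one task."""
    return memory_gb / cores


# (memory in GB, number of cores) as described in this post
machines = {
    "8-core, 6GB": (6, 8),
    "8-core, 8GB": (8, 8),
    "4-core, 6GB": (6, 4),
}

per_core = {name: gb_per_core(mem, cores) for name, (mem, cores) in machines.items()}
for name, value in per_core.items():
    print(f"{name}: {value:.2f} GB per core")
```

This gives 0.75 GB and 1.00 GB per core on the two 8-core machines and 1.50 GB per core on the 4-core machine.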

This thread has a number of ideas and descriptions of what to expect and what actions you might take to help things run better.
Rosetta Moderator: Mod.Sense
ID: 70048 · Rating: 0
BerlinTomek
Joined: 11 Mar 09
Posts: 3
Credit: 15,530,617
RAC: 15
Message 70049 - Posted: 17 Apr 2011, 20:36:35 UTC
Last modified: 17 Apr 2011, 20:37:31 UTC

What does this mean?
Can you tell me if these errors are caused by a hardware fault?

I can't believe that, because my i7 (overclocked to 3.74 GHz)
never reaches a temperature higher than 60°C.

So what's wrong?






"Task ID 415317758
Name T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_18211_0
Workunit 379356995
Created 17 Apr 2011 12:39:28 UTC
Sent 17 Apr 2011 12:42:55 UTC
Received 17 Apr 2011 12:49:46 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 1388706
Report deadline 27 Apr 2011 12:42:55 UTC
CPU time 3.296875
stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
[2011- 4-17 14:44:30:] :: BOINC:: Initializing ... ok.
[2011- 4-17 14:44:30:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex.boinc.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: ERROR: FragmentIO: could not open file cs_frags.9mers.gz
ERROR:: Exit from: ..\..\src\core\fragment\FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.022264471690409
Granted credit 0
application version 2.17
ID: 70049 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70052 - Posted: 17 Apr 2011, 22:25:00 UTC

ERROR: FragmentIO: could not open file cs_frags.9mers.gz

It sounds like there is a problem with work units starting with T0471 and T0475. When you got the error another copy was created for processing and in each case the other person got the same error that you did. So, it sounds like the work unit has a problem, not your machine.
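
The reasoning here (the same failure on independent machines points at the work unit, not the host) can be written down as a rough heuristic. This is a Python sketch under stated assumptions: the outcome strings and the function name `likely_fault` are hypothetical and do not correspond to any BOINC API.

```python
def likely_fault(outcomes):
    """Guess where a problem lies from the outcomes of all copies of one work unit.

    `outcomes` holds one string per copy, each run on a different host,
    e.g. "success" or "compute error" (hypothetical labels).
    """
    if len(outcomes) < 2:
        # A single result cannot separate a bad host from a bad work unit.
        return "undetermined"
    failures = [o for o in outcomes if o != "success"]
    if not failures:
        return "none"
    if len(failures) == len(outcomes):
        # Every independent host failed: the work unit itself is the likely cause.
        return "work unit"
    # Some hosts succeeded, so the failure may be specific to certain machines.
    return "host-specific"


print(likely_fault(["compute error", "compute error"]))
```

With two copies that both ended in a compute error, as described above, the heuristic points at the work unit.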
Rosetta Moderator: Mod.Sense
ID: 70052 · Rating: 0
BerlinTomek
Joined: 11 Mar 09
Posts: 3
Credit: 15,530,617
RAC: 15
Message 70053 - Posted: 17 Apr 2011, 23:00:11 UTC - in response to Message 70052.  

ERROR: FragmentIO: could not open file cs_frags.9mers.gz

It sounds like there is a problem with work units starting with T0471 and T0475. When you got the error another copy was created for processing and in each case the other person got the same error that you did. So, it sounds like the work unit has a problem, not your machine.



OK, thanks for the quick answer... I just thought something bad was going on with my CPU!

I hope they will fix the problem as fast as possible.
My machine hates working on useless BOINC units ;-)
ID: 70053 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 70054 - Posted: 18 Apr 2011, 4:27:07 UTC

I may have a defective workunit.

Rosetta Mini 2.17
T0533_rH_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910

Elapsed 20:56:54, Progress 20.026% and not changing, To completion 40:53:28

BOINC thinks it is running, but it's using no CPU time at all.

No error messages seen.

CPU time at last checkpoint 03:11:48
CPU time 03:12:15

I've selected workunits expected to last 12 hours.

No relevant messages in the BOINC log file since:

4/17/2011 7:40:44 AM rosetta@home Restarting task

T0533_rH_rs_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910_0 using minirosetta version 217

I've restarted BOINC to see if having it restart from the last checkpoint will help.

Now showing 05:29:07 elapsed, 20.026% progress, 17:45:40 to completion, and Waiting to run.
ID: 70054 · Rating: 0
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 26,537,115
RAC: 17,468
Message 70056 - Posted: 18 Apr 2011, 12:15:49 UTC
Last modified: 18 Apr 2011, 12:25:45 UTC

ID: 70056 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 70057 - Posted: 18 Apr 2011, 12:35:20 UTC - in response to Message 70054.  

Rosetta Mini 2.17
T0533_rH_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910

The next morning:

10:48:20 elapsed, 20.021% progress, 25:26:47 to completion.

Now aborted.
ID: 70057 · Rating: 0
Tex1954
Joined: 3 Apr 11
Posts: 9
Credit: 3,394,752
RAC: 1
Message 70058 - Posted: 18 Apr 2011, 14:55:51 UTC - in response to Message 70056.  
Last modified: 18 Apr 2011, 15:04:51 UTC

For the last few days I have been getting a large batch of "Compute error" results on tasks starting with "T0xxx_".
These tasks end with errors a few seconds after starting. Some examples:..."CUT"


Yup, I just had a batch of 8 errors from 4 computers myself (2 laptops, 2 desktops). Average: 1 error per day per computer. In fact, I just noticed in the history that my main computer had about 5 in quick succession too.

***************************************************************************
T0471_boinc_nmr_homology_max10_abrelax_cs_frags_nocst_tex_IGNORE_THE_REST_25328_918_0 00:01:21 (00:00:01) 4/16/2011 8:08:27 PM 4/16/2011 8:10:29 PM Reported: Computation error (1,)

T0471_boinc_nmr_homology_max10_abrelax_cs_frags_tex_IGNORE_THE_REST_25326_931_0 00:01:13 (00:00:01) 4/16/2011 8:06:45 PM 4/16/2011 8:10:29 PM Reported: Computation error (1,)

T0471_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_15071_1 00:01:27 (00:00:02) 4/16/2011 8:05:13 PM 4/16/2011 8:06:45 PM Reported: Computation error (1,)

T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_14479_0 00:01:42 (00:00:02) 4/16/2011 5:04:57 PM 4/16/2011 5:07:01 PM Reported: Computation error (1,)

T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_14438_1 00:01:37 (00:00:02) 4/16/2011 4:08:01 PM 4/16/2011 4:12:33 PM Reported: Computation error (1,)

***************************************************************************

It isn't OUR fault and they only waste a few seconds. I'm sure the PTB's are on it.


8-)


Tex1954
ID: 70058 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70059 - Posted: 18 Apr 2011, 15:04:22 UTC
Last modified: 18 Apr 2011, 15:05:01 UTC

For the last few days I have been getting a large batch of "Compute error" results on tasks starting with "T0xxx_". These tasks end with errors a few seconds after starting. Some examples:


To his list of those ending "in only a few seconds" with the error message "file cs_frags.9mers.gz" you can add:

415450587
415451112
415451774
415525056
415086428

415038519
415038519
415008687
415001340
415384421

415376541
415367010
415339775
415326968
415043715

415337799
415300558
415299126
415289036
415068958

415020984
414880550
415302737
415245580
415216583

415210228
415605253
415563904
415554009
415542857

415540472
415533975
415533394
415519909
415503828

415487923
415485024
415473737
415472523
415466575

415465650
415068765
415064287
415062582
415062402

415053058
415343211
415335164
415333379
415278070

This does NOT represent a COMPLETE listing of what I have seen on my systems - I just listed the FIRST 50 or so that have FAILED with this error so far TODAY. And it is still early.

These errors have been going on for AT LEAST 2 WEEKS and have been the topic of discussion in another thread on this board. These are all "fresh" tasks having been issued by the Rosetta server in the last day or two.
ID: 70059 · Rating: 0
Tex1954
Joined: 3 Apr 11
Posts: 9
Credit: 3,394,752
RAC: 1
Message 70060 - Posted: 18 Apr 2011, 15:08:10 UTC - in response to Message 70059.  
Last modified: 18 Apr 2011, 15:10:02 UTC

CUT...
These errors have been going on for AT LEAST 2 WEEKS and have been the topic of discussion in another thread on this board. These are all "fresh" tasks having been issued by the Rosetta server in the last day or two.


Any way to know WHEN this group of tasks was generated on the server? Could it be it's an old batch and we just need to work through it?

Or is it possible the errors themselves are significant in the process?

8-)

Tex1954
ID: 70060 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70061 - Posted: 18 Apr 2011, 15:24:10 UTC

Tex1954 said: Any way to know WHEN this group of tasks was generated on the server? Could it be it's an old batch and we just need to work through it?


Good question - I just looked at a few of them on my previous list and the task creation dates were 16 April and 18 April.

So this is NOT a case of just letting "old" jobs work their way through the system.
ID: 70061 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70062 - Posted: 18 Apr 2011, 15:41:53 UTC

To Mad Max's list of failing tasks (with matching wingman results) you can add those whose names have the form:

ProteinG_abinitio_SAVE_ALL_OUT_design_relax

415571046
414989706
414921629
415131368
415102869

415091934
414802441
415091930
415171797
415008017

This is not an exhaustive list of this type of error found on my systems – these were all “fresh” tasks with creation dates between 16 April and 18 April.
ID: 70062 · Rating: 0
Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 70063 - Posted: 18 Apr 2011, 15:56:18 UTC

I am seeing validate errors (with matching wingman results) on tasks whose names have the form:

T0590_boinc_nmr_homology_max10_loopbuild_threading_cst_relax_tex

A few samples would be:

414980981
414994609
414957506
414950332
415065606
ID: 70063 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org