Rosetta@home

minirosetta 2.16

Yifan Song
Forum moderator
Project administrator
Project developer
Project scientist

Joined: May 26 09
Posts: 62
ID: 318024
Credit: 7,322
RAC: 0
Message 68004 - Posted 9 Oct 2010 20:14:32 UTC

This is reverting minirosetta to 2.14 due to the recent memory problem.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68009 - Posted 9 Oct 2010 22:33:56 UTC
Last modified: 9 Oct 2010 22:46:14 UTC

These two both failed after 11 sec with the new app!

Maybe something to do with the changeover?

EDIT: I have an oct_ task running O.K. now.


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338631824

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_74504_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>


Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: bad line in file minirosetta_database/scoring/weights/membrane_highres_Menv_smooth.wts:Menv_smooth 0.5
ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 204
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

==============================================================================

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338631880

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_74521_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: bad line in file minirosetta_database/scoring/weights/membrane_highres_Menv_smooth.wts:Menv_smooth 0.5
ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 204
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
____________


P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68011 - Posted 10 Oct 2010 1:47:57 UTC
Last modified: 10 Oct 2010 2:28:39 UTC

Another one with the same problem on a different rig, 14 sec this time.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338630619

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_74213_0

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>


Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: bad line in file minirosetta_database/scoring/weights/membrane_highres_Menv_smooth.wts:Menv_smooth 0.5
ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 204
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
____________


Brian Priebe

Joined: Nov 27 09
Posts: 15
ID: 360315
Credit: 18,640,145
RAC: 17,129
Message 68014 - Posted 10 Oct 2010 5:40:04 UTC - in response to Message ID 68004.
Last modified: 10 Oct 2010 5:44:42 UTC

Same error on WU 338667991 (http://boinc.bakerlab.org/rosetta/result.php?resultid=370646341). My wingman also had the same error.

Snippet from log:

Incorrect function. (0x1) - exit code 1 (0x1)
...
ERROR: bad line in file minirosetta_database\scoring/weights/membrane_highres_Menv_smooth.wts:Menv_smooth 0.5
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunction.cc line: 204

Yifan Song
Forum moderator
Project administrator
Project developer
Project scientist

Joined: May 26 09
Posts: 62
ID: 318024
Credit: 7,322
RAC: 0
Message 68020 - Posted 10 Oct 2010 23:14:13 UTC

My apologies for the new errors. I forgot that those jobs are using an option that's associated with 2.15. I just cancelled those jobs.
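
For readers wondering how a job option tied to 2.15 can surface as a "bad line" error under the reverted app, here is a minimal Python sketch. It is not Rosetta's parser. It assumes a weights file holds one `term weight` pair per line and that an older build rejects any term name it does not recognise. `KNOWN_TERMS` and `parse_weights` are hypothetical names, and the contents of `KNOWN_TERMS` are made up for illustration; only `Menv_smooth 0.5` comes from the logs above.

```python
# Hypothetical illustration only: NOT the real minirosetta weights parser.
# Assumption: each non-empty line is "<score_term> <weight>".
KNOWN_TERMS = {"Menv", "Mpair"}  # assumed set of terms an older build understands


def parse_weights(lines, known_terms=KNOWN_TERMS):
    """Return {term: weight}; raise ValueError on a line the build cannot parse."""
    weights = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) != 2 or parts[0] not in known_terms:
            # Mirrors the shape of the message seen in the stderr output above.
            raise ValueError(f"bad line in file: {line}")
        weights[parts[0]] = float(parts[1])
    return weights
```

Under these assumptions, a file that contains `Menv_smooth 0.5` parses on a build that knows the term and aborts at startup on one that does not, which would be consistent with tasks dying after a few seconds.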

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68025 - Posted 11 Oct 2010 3:30:57 UTC

I got this one this morning; it must have been sent before you stopped them.

Same as the others, it ran for 11 sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338812571

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_85437_0

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: bad line in file minirosetta_database/scoring/weights/membrane_highres_Menv_smooth.wts:Menv_smooth 0.5
ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 204
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68026 - Posted 11 Oct 2010 6:49:02 UTC

Two more of the same old error, downloaded earlier today.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338799412

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_83642_0


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=338804158

mem_widd_run02_Menv_B_1c3w_SAVE_ALL_OUT_IGNORE_THE_REST_22293_84286_0

____________


Arif Mert Kapicioglu

Joined: Aug 23 09
Posts: 1
ID: 340204
Credit: 422,052
RAC: 0
Message 68030 - Posted 11 Oct 2010 11:51:52 UTC

Could those errors be related to overclocking (oc)? I contributed to the project in the past, have recently started contributing again, and have seen no errors on my side.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 68035 - Posted 11 Oct 2010 18:42:53 UTC - in response to Message ID 68030.

The weights error is not an oc error.
That is a program error.
If you get a Windows debugger dump with a 0xc error (I think it is), that could be due to oc or just some problem with the task and your memory.
If you get a ton of memory errors all in a row, then your oc speed is too high.
On my machine I can push Rosie up to 3.0 GHz (this is as fast as I can go without crashing Google Earth). The official clock speed on my machine is 2.5 GHz.

Could those errors be related to overclocking (oc)? I contributed to the project in the past, have recently started contributing again, and have seen no errors on my side.

mickey

Joined: Jan 11 10
Posts: 2
ID: 366358
Credit: 20,113
RAC: 0
Message 68067 - Posted 13 Oct 2010 12:32:54 UTC - in response to Message ID 68004.
Last modified: 13 Oct 2010 12:36:52 UTC

I have found one small issue in the x86_64 Linux app.
When workunits are started for the first time (i.e. from 0.000% progress), they work normally and the graphics are displayed correctly, but sometimes I find the graphics switching continuously from
Stage:
to
App suspended
without printing which stage is running. The % complete value is still increasing, as are the step counter and the CPU time, but the top graphs ("Searching...", "Accepted", "Low Energy" and "Accepted Energy") are empty. I think this problem arises when the app is stopped (for a system reboot or any other reason), and to get the app running correctly again I have to reset the project, because new workunits also end up in this state.

Here is the stderr of one corrupted execution at 17% done:


[2010-10-13 2: 3:34:] :: BOINC:: Initializing ... ok.
[2010-10-13 2: 3:34:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev38513.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/input_mem_widd_run03_centroid_A_2a9h_yfsong.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 21600
[2010-10-13 9:55: 8:] :: BOINC:: Initializing ... ok.
[2010-10-13 9:55: 8:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev38513.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/input_mem_widd_run03_centroid_A_2a9h_yfsong.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage1 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_1 ... success! # cpu_run_time_pref: 21600

Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_2 ... success!
Continuing computation from checkpoint: chk_S_00001_FragmentSampler__stage_3_iter1_3 ... success!


I don't know whether this is only a graphics issue or a real app problem, but I think it's strange behaviour.

Edit: my system config is
Fedora 13 x86_64
BOINC version 6.10.45, installed from the Fedora repositories
other computer info here

mickey

Joined: Jan 11 10
Posts: 2
ID: 366358
Credit: 20,113
RAC: 0
Message 68073 - Posted 13 Oct 2010 16:21:50 UTC - in response to Message ID 68067.

Ok guys, calm down. :)
I have just rechecked the graphics of the workunit I am crunching, and they are working again.

So I think what I posted before is only a visualization issue and not an application one.

eruda

Joined: Apr 6 06
Posts: 1
ID: 72171
Credit: 5,256,246
RAC: 0
Message 68086 - Posted 14 Oct 2010 16:22:15 UTC
Last modified: 14 Oct 2010 16:24:41 UTC

Some instances of the app still seem to become corrupted suddenly, and the BOINC manager fails to notice it. I have a task that has been running for over 60 hours and is stuck at 16%, so I have no choice but to terminate it manually... I don't know much about the technical issues; sorry for not being able to provide the error log.
____________

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 68112 - Posted 16 Oct 2010 22:20:37 UTC

NP_961412.1_boinc_boinc_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22319_804_1

The WU ended after only 1201 seconds with one model completed, despite a 43200-second run time preference, and received the outcome "validate error".

This is a resend (no reply) originally issued on the 6th, prior to 2.15 being rescinded, though obviously I crunched it with 2.16. A third copy has now been issued.


Snags

Snowfall

Joined: Dec 10 06
Posts: 2
ID: 134548
Credit: 24,963
RAC: 0
Message 68117 - Posted 17 Oct 2010 18:55:53 UTC

My computer hung after about five and a half hours of computation (with about the same amount left to go). Then, after a reset, when I restarted BOINC, it suddenly uploaded the workunit to the server instead of continuing it.

I logged in to see why it failed, but it seems that its outcome state is "Success". I don't think this is right. Is this supposed to happen?

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=339761570
____________

Murasaki

Joined: Apr 20 06
Posts: 303
ID: 78284
Credit: 382,503
RAC: 265
Message 68119 - Posted 18 Oct 2010 6:34:21 UTC - in response to Message ID 68117.

My computer hung after about five and a half hours of computation (with about the same amount left to go). Then, after a reset, when I restarted BOINC, it suddenly uploaded the workunit to the server instead of continuing it.

I logged in to see why it failed, but it seems that its outcome state is "Success". I don't think this is right. Is this supposed to happen?

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=339761570


The task log you linked to ends with an error, so no, that isn't the expected behaviour. However, you must have been able to return complete data for at least one model/decoy from the WU for the system to recognise that part of the upload as valid.


ERROR: ERROR: Unable to open silent_input file: 'chk_S_00014_FragmentSampler__rg_state.out'
ERROR:: Exit from: src/core/io/silent/SilentFileData.cc line: 86
called boinc_finish
# cpu_run_time_pref: 28800
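
When reading a task's stderr the way this reply does, the useful part is usually the `ERROR:: Exit from:` line, which names the source file and line where the app gave up. Below is a small Python sketch for pulling those out of a pasted log. `find_exit_points` is a hypothetical helper name, and the pattern is inferred only from the log lines quoted in this thread, so other message formats may not match.

```python
import re

# Pattern inferred from the stderr snippets quoted in this thread.
# \S+ also matches Windows-style paths such as ..\..\src\core\...
EXIT_RE = re.compile(r"ERROR:: Exit from: (\S+) line: (\d+)")


def find_exit_points(stderr_text):
    """Return a list of (source_file, line_number) for each exit message found."""
    return [(m.group(1), int(m.group(2))) for m in EXIT_RE.finditer(stderr_text)]
```

Running it over several failed tasks makes it easier to see whether they all stop at the same place (as the `ScoreFunction.cc line: 204` failures earlier in the thread do) or at different ones.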

Snowfall

Joined: Dec 10 06
Posts: 2
ID: 134548
Credit: 24,963
RAC: 0
Message 68122 - Posted 18 Oct 2010 18:11:28 UTC - in response to Message ID 68119.

Thank you for the quick reply. Yes, I did notice that part of the error log, which is why I was intrigued by the apparently successful outcome state. I never considered that it could be partly successful, though :).
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 68146 - Posted 20 Oct 2010 14:01:58 UTC

Linking to a post reporting that tasks with "lrlx" in the name seem to cause some problems.
____________
Rosetta Moderator: Mod.Sense

Pilgrim57

Joined: Jul 31 08
Posts: 2
ID: 271620
Credit: 1,602,807
RAC: 502
Message 68177 - Posted 23 Oct 2010 20:30:58 UTC

For over a week now I have had a lot of errors, with work units getting a computation error just after starting, halfway through, or just before completion.
See task IDs 373614961, 373614960, 373614958, 372516143, 372515901, 370066639, 371496237, 371496257.

I have altered the run time from 6 hours to 4, and now to the default 3.
My PC is overclocked, but these errors seem to have started with 2.16.

Any help appreciated.

Bikermatt

Joined: Feb 12 10
Posts: 20
ID: 369816
Credit: 6,382,906
RAC: 0
Message 68179 - Posted 23 Oct 2010 23:06:29 UTC

Has anyone else noticed PCS_ tasks running poorly in Linux? They are running longer than the default and producing fewer models than on my similarly equipped Win 7 box.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68180 - Posted 23 Oct 2010 23:25:57 UTC

This one failed after 4sec.

PCS_2RN2_v1.frag_1-41_SAVE_ALL_OUT_22378_8_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=341585851

ERROR: ERROR: FragmentIO: could not open file boinc_aafrag_1-41_09_05.200_v1_3.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

____________


Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 68182 - Posted 23 Oct 2010 23:39:04 UTC

PCS_2RN2_atensor.frag_1-100_SAVE_ALL_OUT_22378_16_0 died @ 3.5 seconds

ERROR: ERROR: FragmentIO: could not open file boinc_aafrag_1-100_09_05.200_v1_3.gz
ERROR:: Exit from: ..\..\src\core\fragment\FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


cleaner

Joined: Aug 22 10
Posts: 6
ID: 391551
Credit: 26,245
RAC: 0
Message 68189 - Posted 25 Oct 2010 7:42:45 UTC

The last two nights, when Rosetta has been in screen saver mode for about an hour, the computer has either frozen or Rosetta has stopped responding and the process has had to be terminated. Up until now 2.16 had been running fine.

MalX

Joined: Jan 18 06
Posts: 1
ID: 52113
Credit: 66,403
RAC: 0
Message 68194 - Posted 25 Oct 2010 15:46:09 UTC

I'm having the exact same problem. EVERY WU fails with a computation error and complains about an absent output file for the task.

I am using Gentoo Hardened, and even after easing off the restrictions with paxctl, I still can't get any WUs to crunch.

The paxctl command works fine with Enigma but not Rosetta. Any ideas?
____________

Ross Parlette

Joined: Nov 10 05
Posts: 24
ID: 10785
Credit: 212,112
RAC: 0
Message 68226 - Posted 28 Oct 2010 4:08:39 UTC

I'm getting the "exited with zero status" message a lot lately. For the most part, the task is restarted, completes correctly (?), and is uploaded and reported. Here is an example:

10/25/2010 9:17:46 PM rosetta@home Task mem_widd_run02_Menv_B_round02_0013_SAVE_ALL_OUT_IGNORE_THE_REST_22363_5424_0 exited with zero status but no 'finished' file
10/25/2010 9:17:46 PM rosetta@home If this happens repeatedly you may need to reset the project.
10/25/2010 9:17:46 PM rosetta@home Restarting task mem_widd_run02_Menv_B_round02_0013_SAVE_ALL_OUT_IGNORE_THE_REST_22363_5424_0 using minirosetta version 216
10/26/2010 10:20:10 PM rosetta@home Computation for task mem_widd_run02_Menv_B_round02_0013_SAVE_ALL_OUT_IGNORE_THE_REST_22363_5424_0 finished

I have examined this task in my account. It is the one which was sent on 23 Oct 2010 6:39:24 UTC and returned on the 27th. According to the account, it was successfully completed.

I have been getting multiple examples of this. What should I do? Should I reset the project? Just what does that mean?

Thanks.

Ross
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 68238 - Posted 28 Oct 2010 16:04:08 UTC

Ross, so long as other tasks are completing normally, I would not suggest taking any steps to try to resolve this. It sounds more like a problem in the task than on your machine, so there isn't much you'll be able to do about it. You might observe them as they run, though, and see if they are using excessive memory or anything like that, just so you can report additional symptoms.
____________
Rosetta Moderator: Mod.Sense
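
For anyone wanting to check how often this happens before deciding whether it is worth reporting, here is a rough Python sketch that counts the "exited with zero status" messages per task in a saved copy of the BOINC message log. `count_restarts` is a hypothetical helper name, and the pattern is based only on the message text quoted in Ross's post, so it may need adjusting for other BOINC client versions.

```python
import re
from collections import Counter

# Pattern taken from the message text quoted in this thread.
NO_FINISHED_RE = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_restarts(log_lines):
    """Count 'exited with zero status' messages per task name."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A task that shows up once and then finishes is probably not worth worrying about; one that restarts many times may be worth posting here with its name and workunit link.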

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 68245 - Posted 29 Oct 2010 3:05:38 UTC

Hi.

This task seemed to be stuck in a loop. The last checkpoint was at 51 min, and the run time was up to 4 hrs 19 min. When I looked at the graphics it was at STAGE: rb_CA_CA_07, if that helps, and had 205 models at STEP: 5800 and was not moving. I stopped it and rebooted; on restart it went back to 51 min and is now running and moving. I'll let it finish, if it does!


celldivs_LL_1de2_2oqk_ProteinInterfaceDesign_26Oct2010_22394_16_0


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=342826273

____________


SubNuke

Joined: Aug 2 08
Posts: 1
ID: 271898
Credit: 1,242,551
RAC: 0
Message 68266 - Posted 30 Oct 2010 16:19:57 UTC
Last modified: 30 Oct 2010 16:42:47 UTC

I am also seeing tasks fail with a computation error, accompanied by a message indicating that an output file is absent.

This is on Core i7-920 systems [with 9800 GTX+'s] running 64-bit Fedora 13 and BOINC 6.10.45 packages installed from the Fedora 13 repository.

If this problem has already been resolved, please point me in the direction of the solution. If I can provide some bit of info that would help to diagnose and resolve the issue, please just ask.

Thanks!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 68267 - Posted 30 Oct 2010 17:49:11 UTC

Thank you, SubNuke. The most helpful thing is links to specific tasks that are failing, along with any pattern in the task names of those that fail vs. those that complete normally.
____________
Rosetta Moderator: Mod.Sense
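
One simple way to look for the kind of task-name pattern requested above is to group results by the leading part of the task name. The Python sketch below does that. `tally_by_prefix` is a hypothetical helper, and it assumes the text before the first underscore (e.g. `PCS`, `mem`, `celldivs`) is a meaningful batch label, which is a guess based on the names posted in this thread.

```python
from collections import Counter


def tally_by_prefix(results):
    """Group (task_name, succeeded) pairs by the text before the first underscore.

    Returns {prefix: Counter({"ok": n, "failed": m})}.
    """
    counts = {}
    for name, succeeded in results:
        prefix = name.split("_", 1)[0]  # assumed batch label
        bucket = counts.setdefault(prefix, Counter())
        bucket["ok" if succeeded else "failed"] += 1
    return counts
```

For example, feeding it the two `PCS_2RN2_...` tasks reported as failing earlier in this thread, plus your own completed tasks, would show at a glance whether failures cluster under one prefix.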


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC