minirosetta 2.05

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 65480 - Posted: 7 Mar 2010, 11:42:31 UTC

My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.
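
When comparing a task against its wingmen, the figures in the stderr excerpt above can be pulled out mechanically. This is a minimal Python sketch, assuming the log lines keep exactly the wording shown in this thread. The function name `summarize_stderr` and the layout of the returned dictionary are hypothetical, not part of BOINC or Rosetta.

```python
import re

# Patterns follow the wording of the stderr excerpts in this thread. This is an
# assumption about the format; other application versions may print it differently.
PREF_RE = re.compile(r"#\s*cpu_run_time_pref:\s*(\d+)")
DONE_RE = re.compile(r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds")
DECOY_RE = re.compile(r"generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts")


def summarize_stderr(text):
    """Return the run-time preference and one entry per DONE block found."""
    pref = PREF_RE.search(text)
    done = DONE_RE.findall(text)
    decoys = DECOY_RE.findall(text)
    runs = [
        {"structures": int(n), "cpu_seconds": float(secs),
         "decoys": int(d), "attempts": int(a)}
        for (n, secs), (d, a) in zip(done, decoys)
    ]
    return {"run_time_pref": int(pref.group(1)) if pref else None, "runs": runs}


sample = """# cpu_run_time_pref: 21600
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
"""
summary = summarize_stderr(sample)
```

A task that restarted part-way through would show up as more than one entry in `runs`, which makes a large gap in decoy counts between wingmen easier to spot.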

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 65481 - Posted: 7 Mar 2010, 21:53:11 UTC - in response to Message 65480.  
Last modified: 7 Mar 2010, 22:00:19 UTC

There is nothing wrong on your end. This is a very old (and rare) bug in the BOINC server software. Take a look here.
Wait a second, the Trac item claims that the bug is fixed. Maybe it is time for Rosetta to update its server code.

AdeB

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 65482 - Posted: 7 Mar 2010, 22:49:27 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=322413556
tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
CPU time 4.4375
stderr out

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65492 - Posted: 9 Mar 2010, 0:22:41 UTC

Looks like there are still problems with this app. Same task:

It just restarted near the end and I got it in the neck. Not impressed.

tyrsim_3gbn_1c81_Protein_interface_design_25Feb2010_18415_410_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=294414088


# cpu_run_time_pref: 14400
======================================================
DONE :: 348 starting structures 14397.5 cpu seconds
This process generated 348 decoys from 348 attempts
======================================================


# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14498.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid

Claimed credit 102.297287162446

Granted credit 0.384433279143336

application version 2.05
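
The gap between claimed and granted credit can be put in per-decoy terms. This is a rough Python sketch using only the numbers in this post. It assumes, without confirmation from the project, that the granted figure covers only the 2 decoys reported by the final restart; `credit_per_decoy` is a hypothetical helper name.

```python
def credit_per_decoy(credit, decoys):
    """Average credit per decoy; raises if no decoys were reported."""
    if decoys <= 0:
        raise ValueError("decoys must be positive")
    return credit / decoys


# Figures copied from the task report above.
granted_credit = 0.384433279143336
decoys_first_run = 348   # first DONE block
decoys_final_run = 2     # DONE block after the restart

# Assumption: the grant was computed from the final run's 2 decoys only.
granted_rate = credit_per_decoy(granted_credit, decoys_final_run)

# What the first run's 348 decoys would have earned at that same rate.
projected_for_first_run = granted_rate * decoys_first_run
```

Under that assumption the rate is about 0.19 credits per decoy, so the 348 decoys from before the restart would have been worth roughly 67 credits had they been counted.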

apohawk
Joined: 13 Sep 08
Posts: 5
Credit: 30,428,003
RAC: 0
Message 65530 - Posted: 12 Mar 2010, 10:55:16 UTC

This work unit reports "success" despite ending with errors.

https://boinc.bakerlab.org/rosetta/result.php?resultid=323517090

application: minirosetta 2.05
name of work unit: ina2inaN_to_NOE__18638_5045_0
Outcome: Success
Exit status: 0 (0x0)

CPU time: 2212.594

but at the end of the result we got:
# cpu_run_time_pref: 7200

ERROR: Unrecognized edge type!
ERROR:: Exit from: ..\..\src\core\kinematics\util.cc line: 1422
called boinc_finish


CPU: Phenom II 945
OS: WinXP 64 SP2

Duzz
Joined: 14 Nov 05
Posts: 1
Credit: 13,148
RAC: 0
Message 65544 - Posted: 13 Mar 2010, 13:16:48 UTC
Last modified: 13 Mar 2010, 13:17:53 UTC

Over the last few days I have had several WUs go idle after some time of computation. The Windows XP Task Manager shows no CPU activity. If this goes unnoticed, many hours of WU processing are lost, which is very unproductive for the project.

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 65547 - Posted: 13 Mar 2010, 22:39:05 UTC

In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error:
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 397
BOINC:: Error reading and gzipping output datafile: default.out

AdeB

Mad_Max
Joined: 31 Dec 09
Posts: 207
Credit: 23,105,020
RAC: 13,072
Message 65555 - Posted: 15 Mar 2010, 3:44:52 UTC

Today I got strange validation errors: "Task was reported too late to validate"
But there are still 4 days until the deadline (19 Mar)!
Links to the tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=323161767
https://boinc.bakerlab.org/rosetta/result.php?resultid=323181972
https://boinc.bakerlab.org/rosetta/result.php?resultid=323205144

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65561 - Posted: 15 Mar 2010, 23:09:15 UTC

What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out.
Rosetta Moderator: Mod.Sense

Mad_Max
Joined: 31 Dec 09
Posts: 207
Credit: 23,105,020
RAC: 13,072
Message 65564 - Posted: 16 Mar 2010, 2:38:42 UTC - in response to Message 65560.  


I see more things happened during the weekend. I'm seeing you detached or something which might cause all incomplete units to report. Did you perhaps restore a backup which tried to continue from an earlier point?


The "detached" error is BOINC-related.
Actually, I did not detach from the project; I connected a new computer. After that the BOINC client initially went mad. First it started downloading to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+). Then at some point it thought better of it, registered the new computer on the server under a new ID, and deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".)

Note: no BOINC-related files were transferred from the old computer to the new one. The new client was a clean install from the distribution. So I do not know what caused this behavior. Maybe the fact that both computers connect to the internet from the same IP?

Hmm, now I think that, in principle, such a validation error could happen because of this. If one computer "cancels" the mistakenly downloaded tasks while the second is still working on them, could the server issue the same WUs to another volunteer's computer and shift the deadline?

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65567 - Posted: 16 Mar 2010, 16:32:04 UTC

True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found.

Certainly many users with multiple machines connect from the same IP address (I'm talking about the router's public IP address that the project servers see). Many other users come in via dynamic IPs, so their address is always different. My understanding is that BOINC uses many factors to determine whether a given machine is the same as an existing registered one, to keep it all straight and separated correctly: factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... So a fresh install should not have caused the client to "go mad" on either machine. Indeed, many users have identically configured machines at the same site coming in via the same IP.
Rosetta Moderator: Mod.Sense
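
To make the multi-factor idea concrete, here is a purely illustrative Python sketch. It is not BOINC's actual scheduler logic: the `HostRecord` fields mirror the factors listed above, while `find_matching_host`, the matching order and the example values are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HostRecord:
    """Hypothetical subset of what a server might store about a host."""
    user_id: int
    host_id: Optional[int]   # None when the client has no ID yet (fresh install)
    host_name: str
    machine_type: str
    os_name: str


def find_matching_host(request: HostRecord,
                       known: List[HostRecord]) -> Optional[HostRecord]:
    """Return the registered host that the request most plausibly belongs to."""
    # 1. An explicit host ID owned by the same user is the strongest signal.
    if request.host_id is not None:
        for host in known:
            if host.host_id == request.host_id and host.user_id == request.user_id:
                return host
    # 2. Otherwise require every descriptive field to agree, not just the name.
    for host in known:
        if (host.user_id, host.host_name, host.machine_type, host.os_name) == (
                request.user_id, request.host_name,
                request.machine_type, request.os_name):
            return host
    # 3. No match: the caller would register a new host ID.
    return None


# Example values are made up, apart from the CPU names quoted in the thread.
old_pc = HostRecord(1, 1211592, "desktop", "Athlon XP 2600+", "Windows XP")
new_pc = HostRecord(1, None, "desktop", "Athlon II X2 250", "Windows XP")
match = find_matching_host(new_pc, [old_pc])
```

In this sketch a fresh install that reuses the old machine's name but reports a different CPU finds no match, so it would be registered as a new host. That is the outcome Mod.Sense describes as the expected one.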

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65570 - Posted: 17 Mar 2010, 3:07:31 UTC

This took 8 hrs 2 min on my 3 GHz Intel, with a four-hour run time preference.

aqp9__boinc_aqp9_fast_run01_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18658_1421_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=296064742

# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_2B6OA_15_0001_Remodel__loop_1_0_0_S ... success!
BOINC:: CPU time: 28914.7s, 14400s + 14400s[2010- 3-17 13:39:17:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28914.7 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fe9420]
[0x91d6455]
[0x842671e]
[0x83e85d3]
[0x80a7840]
[0x84381fe]
[0x812a54a]
[0x812b82d]
[0x86aa16b]
[0x8243cf5]
[0x8049897]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>
]]>
Validate state Valid
Claimed credit 69.3077894676244
Granted credit 25.52312719487 (for 8 hrs)
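
The numbers in this report can be checked with a little arithmetic. The Python sketch below reads the log line `14400s + 14400s` as the run time preference plus an equally long grace period before the task is ended. That reading is an assumption, and `overrun_summary` is a hypothetical helper.

```python
def overrun_summary(cpu_seconds, run_time_pref, grace_seconds, granted_credit):
    """Compare CPU time against preference + grace and compute credit per CPU hour."""
    limit = run_time_pref + grace_seconds
    hours = cpu_seconds / 3600.0
    return {
        "limit_seconds": limit,
        "seconds_over_limit": cpu_seconds - limit,
        "cpu_hours": hours,
        "credit_per_cpu_hour": granted_credit / hours,
    }


# Figures copied from the stderr output and credit lines above.
result = overrun_summary(
    cpu_seconds=28914.7,
    run_time_pref=14400,
    grace_seconds=14400,      # assumed meaning of the second "14400s"
    granted_credit=25.52312719487,
)
```

With these inputs the task ran about 115 seconds past the combined 28800-second limit, which is a little over 8 CPU hours, and earned roughly 3.2 credits per CPU hour.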



Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 65572 - Posted: 17 Mar 2010, 3:22:49 UTC
Last modified: 17 Mar 2010, 3:24:29 UTC

On this desktop I got a Compute error Exit status -177 (0xffffff4f) in the following task:
aqp9__boinc_aqp9_fast_run01_blast_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18653_30510_0
<message>
Maximum disk usage exceeded
</message>

I did notice while it was running it was about 2 hours over my 8 hour runtime, on Model 6 Step 19051, but it reported 0 CPU time in the end.

I allow 10 GB of disk space for BOINC and have about 581 MB in use on 5 current or waiting tasks, with 9.43 GB free.

Also, on this laptop I got a validate error on the following task a few days back:
t290__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_8451_0

Mad_Max
Joined: 31 Dec 09
Posts: 207
Credit: 23,105,020
RAC: 13,072
Message 65575 - Posted: 17 Mar 2010, 14:13:21 UTC

To Mod.Sense:
Yes, it is certainly not a problem with minirosetta 2.05. It looks like some rare bug in the BOINC server, probably connected with the fact that the new computer had the same IP (not only the "external" router IP, but the internal one too) and the same network name. The new computer was a replacement for the old one, so I gave it the same name as the previous one, after first renaming the old one. Actually, this should not be a factor, because BOINC identifies hosts by an internal ID (such as 1211592) and not by Windows names. But a bug is a bug, and something did not go as intended :)
In any case, no more such errors have come up, so I think this can be forgotten.

To Sid Celery:
I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error. The ones that did not fail exceeded the target CPU time (i.e. all tasks of this type were affected one way or the other), and there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models).
So now I cancel all jobs of this type if I see them in the job queue.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65576 - Posted: 17 Mar 2010, 20:51:24 UTC

Sid, each task also has a configured maximum disk space. So that must be the limit that was hit by the task you mention. This is just one more failsafe that is in place to help assure things keep running smoothly.
Rosetta Moderator: Mod.Sense
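
On the exit status itself: -177 and 0xffffff4f are the same 32-bit value shown as a signed number and as unsigned hex. As far as I recall, BOINC's error list names -177 `ERR_RSC_LIMIT_EXCEEDED`, which would fit the "Maximum disk usage exceeded" message. A small Python sketch of the conversion follows; the helper names are hypothetical.

```python
def exit_status_hex(status):
    """Format a signed exit status as unsigned 32-bit hex, as result pages show it."""
    return format(status & 0xFFFFFFFF, "#x")


def exit_status_signed(hex_text):
    """Convert the hex form back to a signed 32-bit integer."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


disk_limit_hex = exit_status_hex(-177)
disk_limit_signed = exit_status_signed("0xffffff4f")
```

The same helpers reproduce the other statuses quoted in this thread, `0 (0x0)` and `1 (0x1)`.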

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 65578 - Posted: 17 Mar 2010, 21:03:05 UTC - in response to Message 65575.  

I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error. The ones that did not fail exceeded the target CPU time (i.e. all tasks of this type were affected one way or the other), and there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models).
So now I cancel all jobs of this type if I see them in the job queue.

It's the only error I've had in the last week on that W7 laptop, and credit was granted in the clean-up job, so I'm not worried by it - I don't understand any of these validate errors but while I was reporting the other one I thought I'd just mention it. I don't think my errors are the same as yours in that case.

I'm more surprised by the disk-usage issue on the Vista desktop which is otherwise very well behaved. I did suspect the task type, but others have gone through now with no problem at all, so maybe it just went a bit 'rogue' on me. I just thought it was worth describing seeing as I noticed it was a bit odd while running for 10 hours, yet the task details didn't indicate anything more than it failed on startup, which wasn't actually the case.

One for the backroom team to ponder.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 65655 - Posted: 28 Mar 2010, 0:24:31 UTC

Miscellaneous computation errors:

----

327069193 (v2FcInnerW_1dAl_3GM3_ProteinInterfaceDesign_15Mar2010_18672_254_0) failed on Mac OS X. Similar failure from wingman.

ERROR: f.check_fold_tree()
ERROR:: Exit from: src/protocols/docking/DockingProtocol.cc line: 405
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

----

326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 137
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

----

326721814 (tedor-cs_-tdonly-1-calbindin__18708_33_1) failed on W7. Similar failure from wingman.


ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65658 - Posted: 28 Mar 2010, 8:01:15 UTC
Last modified: 28 Mar 2010, 8:09:45 UTC

326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 137
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Add me to the list with

tedor-cs_-tdonly-1-gb3__18708_4647
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out

allenandholmes
Joined: 17 Dec 07
Posts: 1
Credit: 7,563
RAC: 0
Message 65659 - Posted: 28 Mar 2010, 8:17:07 UTC

I have been processing my current minirosetta task for 4 or 5 days now and have had a suspicion about its checkpointing capabilities. I shut my PC down each night and restart it the next morning for BOINC processing. However the elapsed time displayed resets to 0, the time to completion continues to increase all day long (and between sessions) and the processed percentage is dramatically different from a ratio of elapsed/completion times. Am I wasting my time?

Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 65663 - Posted: 28 Mar 2010, 15:14:17 UTC

One unusual error I haven't seen before - W7-64bit laptop:

Rossmann3x3_abinitio_SAVE_ALL_OUT_design_k031_001_18698_1551_0
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
[...]
<core_client_version>6.10.36</core_client_version>
[...]
# cpu_run_time_pref: 28800
Starting work on structure: _00018
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage_3_iter1_10 ... success!
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage4_kk_1 ... success!
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage4_kk_2 ... success!
std::cerr: Exception was thrown:
no success reading silent file chk_S_00000018_ClassicAbinitio__stage4_kk_3.out




©2024 University of Washington
https://www.bakerlab.org