Problems with Rosetta version 5.80

Message boards : Number crunching : Problems with Rosetta version 5.80



Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46699 - Posted: 20 Sep 2007, 19:59:32 UTC

I moved Markus' post here from Q&A boards. Sorry the post is so long.

Looks like one of the CAPRI WUs may have caused the reported msgs.

See Rhiju's post below/above. There have been some problems with these tasks on some machines and so they've stopped sending them out.

Markus, could you post links to the two specific hosts (and the specific tasks if possible) where you had problems?
Rosetta Moderator: Mod.Sense
ID: 46699 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 46700 - Posted: 20 Sep 2007, 20:12:29 UTC

Tell me there's some mistake here in the native structure shown
(linky to screenshot with a rather straight native structure shown for a t015_1_NMRREF_1_t015_1_id_model_07_idlIGNORE_THE_REST_core_2097_3599_0 task).

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 46700 · Rating: 0
Profile Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,191,010
RAC: 2,188
Message 46707 - Posted: 20 Sep 2007, 23:20:05 UTC - in response to Message 46665.  

Result ID 106047623
Name t030__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-t030_-plexinmonomer__2083_6224_0
Workunit 96261079
Created 16 Sep 2007 14:15:43 UTC
Sent 16 Sep 2007 14:34:53 UTC
Received 20 Sep 2007 3:58:44 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 26 Sep 2007 14:34:53 UTC
CPU time 23419.390625
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 3549877
# cpu_run_time_pref: 21600
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -207.662 for 1800 seconds
**********************************************************************
GZIP SILENT FILE: .xxt030.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 97.7521869214656
Granted credit 20
application version 5.80

This WU ran for 23,419 secs for only 20 credits, is this right?


Well Michael at least you got 20, mine ran for over 28,000 sec and got 5.38.
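
For anyone who wants to put numbers on a result like the one quoted above, here is a minimal sketch in Python. It only does arithmetic on the figures shown on that result page; the `summarize` helper and its field names are made up for illustration and are not part of BOINC or Rosetta. It says nothing about how the project actually decides granted credit.

```python
# Figures copied from the result page quoted above.
CPU_TIME_S = 23419.390625       # "CPU time"
RUN_TIME_PREF_S = 21600         # "cpu_run_time_pref" from stderr
CLAIMED = 97.7521869214656      # "Claimed credit"
GRANTED = 20.0                  # "Granted credit"


def summarize(cpu_time_s, run_time_pref_s, claimed, granted):
    """Return simple ratios for one result (hypothetical helper, not a BOINC API)."""
    hours = cpu_time_s / 3600.0
    return {
        # Share of the claimed credit that was actually granted.
        "granted_over_claimed": granted / claimed,
        # Credit per CPU hour, granted and claimed.
        "granted_per_cpu_hour": granted / hours,
        "claimed_per_cpu_hour": claimed / hours,
        # How far the task ran past the run time preference (0.10 = 10% over).
        "overrun_fraction": cpu_time_s / run_time_pref_s - 1.0,
    }


stats = summarize(CPU_TIME_S, RUN_TIME_PREF_S, CLAIMED, GRANTED)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

On these figures the grant works out to roughly a fifth of the claim, and the task ran about 8% past the 21600-second preference before the watchdog ended it.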

ID: 46707 · Rating: 0
Profile teemac

Joined: 18 Jul 06
Posts: 1
Credit: 192,962
RAC: 0
Message 46710 - Posted: 21 Sep 2007, 2:48:08 UTC

I have 5 machines on Rosetta at present:

3x Intel E4300 Core2Duos - Kubuntu v7.04 (64bit) 1gb ram - these machines have been locking up one core, and sometimes both cores, over the last day or so. I have aborted all WUs with the word CAPRI in them. Nearly all work units with 'IGNORE THE REST' in the unit's name are also locking and freezing cores, or completely locking up machines, with an error message saying something like 'if this keeps happening you may need to reset the project'.

1x AMD X2/4600 - Kubuntu v7.04 (64bit) 1gb ram - this machine is mostly ok - no locking but some errored WU's.

1x AMD 3200+ - Kubuntu v7.04 (32bit) 512mb ram - same as the 4600 machine above.

I currently have 2 of the E4300s locked, with no work ticking over for the last hour or so. One of the machines is totally locked and I am unable to use the OS at all; the other machine only has BOINC locked up, and I can still use the OS.


ID: 46710 · Rating: 0
hugothehermit

Joined: 26 Sep 05
Posts: 238
Credit: 314,893
RAC: 0
Message 46716 - Posted: 21 Sep 2007, 9:44:23 UTC

I noticed that the ...CAPRI14_DOCK... native looked wrong; they were too far apart to be interacting at all, compared with what I have seen before.

ID: 46716 · Rating: 0
Wits End

Joined: 16 Apr 07
Posts: 4
Credit: 29,477
RAC: 0
Message 46727 - Posted: 21 Sep 2007, 17:13:17 UTC
Last modified: 21 Sep 2007, 17:15:35 UTC

Of the eight post-CAPRI WUs that I've returned, two produced "validate errors". I received credit for the other six
but they all had "watchdog shutting down" notes, and one had "WARNING! Not sure non-ideal rotamers are compatible
with symmetry yet..." What's going on?!?

107006854: Validate error
106890130: Watchdog notice
106794699: Watchdog and Warning notices
106724332: Validate error
106613376: Watchdog notice
106550676: Watchdog notice
106521483: Watchdog notice
106514350: Watchdog notice
ID: 46727 · Rating: 0
Profile anders n

Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 46732 - Posted: 21 Sep 2007, 18:29:25 UTC - in response to Message 46727.  

Of the eight post-CAPRI WUs that I've returned, two produced "validate errors". I received credit for the other six
but they all had "watchdog shutting down" notes, and one had "WARNING! Not sure non-ideal rotamers are compatible
with symmetry yet..." What's going on?!?

107006854: Validate error
106890130: Watchdog notice
106794699: Watchdog and Warning notices
106724332: Validate error
106613376: Watchdog notice
106550676: Watchdog notice
106521483: Watchdog notice
106514350: Watchdog notice


See this post about no more Capri for now.

ID: 46732 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46733 - Posted: 21 Sep 2007, 18:53:09 UTC - in response to Message 46732.  

RE: Watchdog notice


Normal. Been that way since the watchdog was implemented. Since the watchdog runs in a separate thread, this message just confirms that the watchdog thread properly ended as the task was completed.

So it is just saying everything ended normally, including the watchdog.
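
To make the idea concrete, here is a rough Python sketch of a watchdog running in its own thread. This is not Rosetta's actual code: the `Watchdog` class, its method names and its timings are all assumptions made up for illustration, and only the two log messages echo wording seen in this thread. It shows both cases: the watchdog logging its own shutdown when the task completes normally, and the watchdog ending a run whose score has stopped changing.

```python
import threading
import time


class Watchdog:
    """Hypothetical watchdog thread that ends a run whose score stops changing."""

    def __init__(self, stuck_limit_s, poll_s=0.01):
        self.stuck_limit_s = stuck_limit_s
        self.poll_s = poll_s
        self._lock = threading.Lock()
        self._score = None
        self._last_change = time.monotonic()
        self._done = threading.Event()    # set by the task on normal completion
        self.killed = threading.Event()   # set by the watchdog if the run is stuck
        self.log = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def report_score(self, score):
        """Called by the task; only a changed score resets the stuck timer."""
        with self._lock:
            if score != self._score:
                self._score = score
                self._last_change = time.monotonic()

    def finish(self):
        """Called by the task when it is done; waits for the watchdog to exit."""
        self._done.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns True as soon as the task signals completion.
        while not self._done.wait(self.poll_s):
            with self._lock:
                stuck_for = time.monotonic() - self._last_change
                score = self._score
            if stuck_for >= self.stuck_limit_s:
                self.log.append(f"Stuck at score {score} for {stuck_for:.2f} seconds")
                self.killed.set()
                return
        self.log.append("Watchdog shutting down (task completed)")


# Normal run: the score keeps changing, then the task finishes.
ok = Watchdog(stuck_limit_s=5.0)
ok.start()
for s in (-10.0, -50.0, -207.662):
    ok.report_score(s)
ok.finish()
print(ok.log)

# Stuck run: the score never changes again, so the watchdog ends the run.
stuck = Watchdog(stuck_limit_s=0.05)
stuck.start()
stuck.report_score(-207.662)
stuck.killed.wait(timeout=2.0)
stuck.finish()
print(stuck.log)
```

In the first case the only log entry is the shutdown notice, which matches the "everything ended normally" reading above. The limits here are fractions of a second so the sketch runs quickly; the stderr quoted earlier in the thread shows 1800 seconds.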
Rosetta Moderator: Mod.Sense
ID: 46733 · Rating: 0
mdettweiler
Joined: 15 Oct 06
Posts: 33
Credit: 2,509
RAC: 0
Message 46808 - Posted: 22 Sep 2007, 20:15:10 UTC - in response to Message 46733.  

RE: Watchdog notice


Normal. Been that way since the watchdog was implemented. Since the watchdog runs in a separate thread, this message just confirms that the watchdog thread properly ended as the task was completed.

So it is just saying everything ended normally, including the watchdog.

When the watchdog has to end a task, is it of any use at all to the project scientifically, or is it practically aborted? I think I heard that the watchdog will abort a task if it goes a given amount of times longer than your preferred runtime, regardless of whether the application is showing visible progress; is this true? If so, are those terminated results useful at all?
ID: 46808 · Rating: 1
Profile Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 46813 - Posted: 22 Sep 2007, 21:38:56 UTC - in response to Message 46808.  

RE: Watchdog notice


Normal. Been that way since the watchdog was implemented. Since the watchdog runs in a separate thread, this message just confirms that the watchdog thread properly ended as the task was completed.

So it is just saying everything ended normally, including the watchdog.

When the watchdog has to end a task, is it of any use at all to the project scientifically, or is it practically aborted? I think I heard that the watchdog will abort a task if it goes a given amount of times longer than your preferred runtime, regardless of whether the application is showing visible progress; is this true? If so, are those terminated results useful at all?


oh...good question..has me wondering the same thing.
ID: 46813 · Rating: 0
M.L.

Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 46817 - Posted: 23 Sep 2007, 2:00:28 UTC
Last modified: 23 Sep 2007, 2:04:55 UTC

Result ID 106760639
Name NeT6__BOINC_SYMM_FOLD_AND_DOCK_RELAX-NeT6_-mfr__2100_7176_0
Workunit 96937615

Validate state Valid
Claimed credit 91.5795040403045
Granted credit 48.2594968464685
application version 5.80

Never seen such a big difference between claimed and granted credits, unless the WU failed in some way, but I don't see any sign of that. Anyone got any ideas?

ID: 46817 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46823 - Posted: 23 Sep 2007, 5:14:07 UTC - in response to Message 46808.  

When the watchdog has to end a task, is it of any use at all to the project scientifically, or is it practically aborted?


Results are always useful. I exchanged some EMails with Chu some time ago and collected some details on the watchdog. I'll compile them into the FAQ and post them shortly.

Even knowing that a given approach does not function as expected is important. This is why Rosetta considers all results useful and meaningful, and attempts to issue credit to participants for their assistance in making such a determination.
Rosetta Moderator: Mod.Sense
ID: 46823 · Rating: 0
Markus Schuhmacher

Joined: 29 May 06
Posts: 4
Credit: 1,455,542
RAC: 0
Message 46841 - Posted: 23 Sep 2007, 10:38:29 UTC - in response to Message 46699.  
Last modified: 23 Sep 2007, 10:42:53 UTC

I moved Markus' post here from Q&A boards. Sorry the post is so long.

Looks like one of the CAPRI WUs may have caused the reported msgs.

See Rhiju's post below/above. There have been some problems with these tasks on some machines and so they've stopped sending them out.

Markus, could you post links to the two specific hosts (and the specific tasks if possible) where you had problems?


Sorry, I've been wondering where my post had gone. The two machines are:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=603857
https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=509233

How can I figure out which workunit was in progress at the time?
ID: 46841 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46857 - Posted: 23 Sep 2007, 14:39:30 UTC

Explanation of the watchdog added to the FAQ here. Please post any comments or suggestions about it here.
Rosetta Moderator: Mod.Sense
ID: 46857 · Rating: 0
Profile Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 46924 - Posted: 24 Sep 2007, 18:20:29 UTC

serious problems with this WU from 5.80 non capri
ID: 46924 · Rating: 0
Profile Jmarks
Joined: 16 Jul 07
Posts: 132
Credit: 98,025
RAC: 0
Message 46965 - Posted: 25 Sep 2007, 11:31:59 UTC

Left over CAPRI14
https://boinc.bakerlab.org/rosetta/result.php?resultid=106850152
Jmarks
ID: 46965 · Rating: 0
Profile Andrii Muliar

Joined: 10 Nov 05
Posts: 12
Credit: 7,655,243
RAC: 0
Message 47024 - Posted: 26 Sep 2007, 14:12:17 UTC - in response to Message 47023.  

I forgot to say: I have a Core Duo processor, an ADSL connection, and Windows XP SP2 as the operating system.
ID: 47024 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 47066 - Posted: 27 Sep 2007, 1:21:40 UTC

beat__BOINC_JUMPRELAX_BARCODE2_CONSTRAINT-beat_-_1951_67075_0 ( workunit 98293407 ) stuck on 0% for >1 hour on an Intel iMac2 under OS X 10.4.10 with Boinc 5.10.20. Aborting.

ID: 47066 · Rating: 0
Paul

Joined: 29 Oct 05
Posts: 193
Credit: 66,245,890
RAC: 6,362
Message 47138 - Posted: 28 Sep 2007, 13:11:16 UTC

I noticed that all of my work units are now based on the 5.80 Beta again. The last few days, it looked like most of them were an older version of the application.

This morning, I have 5 out of 6 units with Compute Error status.

The computer ID is 43057

Can someone please look into this situation? It is very frustrating to have so much CPU time wasted. I just refocused 100% of this computer on R@H because it looked like the problems were fixed. If we are going back to compute errors, it would appear to be a better use of resources to focus this CPU on other projects until R@H is fixed.

What is the problem with all the failed WUs and 5.8?

Thx!

Paul

ID: 47138 · Rating: 0
Nothing But Idle Time

Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 47142 - Posted: 28 Sep 2007, 13:48:47 UTC

beat__BOINC_JUMPRELAX_BARCODE2_CONSTRAINT-beat_-_1951_61847_0
WU 108207857 v.5.80
Ran 21% over my specified run time preference; never saw this before.
ID: 47142 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org