Problems with Rosetta version 5.46

Message boards : Number crunching : Problems with Rosetta version 5.46

EigenState
Joined: 16 Feb 07
Posts: 4
Credit: 1,667
RAC: 0
Message 36899 - Posted: 16 Feb 2007, 21:40:07 UTC - in response to Message 36896.  
Last modified: 16 Feb 2007, 21:44:34 UTC

Yes, I do use BAM. If I use it properly is an entirely different question to which I hope the answer would be yes, but I am not certain of that.


Since you have a Rosetta@home account you may have to find it first in BAM
http://www.boincstats.com/bam/project_sign_up.php



As above, you have to attach using the host options in BAM. If you try to attach yourself and it is a project BOINC supports, then when it contacts BAM it will kick the project off (unfortunately, no questions asked).

To help you out

http://www.boincstats.com/bam/host_list.php
Link to your host list

OK, I did try to attach to Rosetta directly through the BOINC Manager, so that might explain the detachments I observed.

I also did have BAM set to attach to Rosetta, but so far nothing has actually happened. Being on dialup, I just cannot allow the connection to stand open forever. So is there a way to force the attachment through BAM to proceed?

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 36900 - Posted: 16 Feb 2007, 22:01:42 UTC - in response to Message 36899.  

Yes, I do use BAM. If I use it properly is an entirely different question to which I hope the answer would be yes, but I am not certain of that.


Since you have a Rosetta@home account you may have to find it first in BAM
http://www.boincstats.com/bam/project_sign_up.php



As above, you have to attach using the host options in BAM. If you try to attach yourself and it is a project BOINC supports, then when it contacts BAM it will kick the project off (unfortunately, no questions asked).

To help you out

http://www.boincstats.com/bam/host_list.php
Link to your host list

OK, I did try to attach to Rosetta directly through the BOINC Manager, so that might explain the detachments I observed.

I also did have BAM set to attach to Rosetta, but so far nothing has actually happened. Being on dialup, I just cannot allow the connection to stand open forever. So is there a way to force the attachment through BAM to proceed?


Under Tools in BOINC, press Synchronize with BAM.


EigenState
Joined: 16 Feb 07
Posts: 4
Credit: 1,667
RAC: 0
Message 36904 - Posted: 17 Feb 2007, 3:37:59 UTC

I have successfully attached to Rosetta, and am currently calculating a Work Unit.

Thanks to all of you for the help, and my apologies for taking this thread off topic.

Michael.L
Joined: 12 Nov 06
Posts: 67
Credit: 31,295
RAC: 0
Message 36911 - Posted: 17 Feb 2007, 12:21:23 UTC
Last modified: 17 Feb 2007, 12:24:19 UTC

Result ID 62740659. Windows XP Home, AMD 3200+.
CAPRI 12 ND 73 GLOBAL DOCKING 1562 9497 00 - Was stuck at score -199.144 for 3600 seconds.
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
# random seed: 1432284
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -199.144 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .aand73.out

</stderr_txt>
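
For anyone collecting these reports, here is a minimal Python sketch for pulling the watchdog line out of a stderr dump like the one above. The function name `parse_watchdog` is hypothetical, and the pattern is inferred only from the message text shown in this thread, not from Rosetta's source, so treat it as an assumption.

```python
import re

# Pattern inferred from the "Stuck at score ... for ... seconds" line above.
STUCK_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")

def parse_watchdog(stderr_text):
    """Return (score, seconds) if the watchdog ended the run, else None."""
    match = STUCK_RE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

sample = """# random seed: 1432284
# cpu_run_time_pref: 14400
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -199.144 for 3600 seconds
"""
print(parse_watchdog(sample))
```

A result whose stderr has no such line returns `None`, which makes it easy to separate watchdog terminations from other failures.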


Greg_BE
Joined: 30 May 06
Posts: 4875
Credit: 4,472,466
RAC: 382
Message 36913 - Posted: 17 Feb 2007, 14:06:50 UTC

I think the docking units are getting stuck on our AMD chips for some reason.
I had a similar error on this one: CAPRI_12_ND73_GLOBAL_DOCKING_1562_9081_0, and on another like it. Both were global docking. I read somewhere in here, or over in Dr. Baker's message board area, that this was likely to happen at random with these WUs.

Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 36923 - Posted: 17 Feb 2007, 18:22:50 UTC - in response to Message 36913.  

Hi all, we did find a bug that caused a very high rate of "stuck" trajectories for docking workunits, and we included the fix in the V5.46 update. Those trajectories are not actually stuck; the watchdog thread is fooled into thinking they are. Although this fix improved things a lot, we also found that it does not completely solve the problem. Compared to other "protein folding or farlx" workunits, docking workunits generally have fewer "acceptance" steps, and therefore their energy values do not change as frequently. That is believed to be the culprit, as the watchdog thread checks the energy value periodically to decide whether a run is stuck or not. We have proposed a more robust solution to this problem and plan to include it in the next scheduled update. Thank you all for the help.
i think the docking units are getting stuck on our AMD chips for some reason.
I had a similar error on this: CAPRI_12_ND73_GLOBAL_DOCKING_1562_9081_0 and a similar one. Both were global docking. I read somewhere in here or over in Dr. Bakers message board area that this was likely to happen at random with these WU's.
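
To illustrate Chu's point about the watchdog being fooled, here is a rough Python sketch of a check that declares a run stuck when the score has not changed for a fixed time. Everything in it is an assumption for illustration: the function name, the 10-minute sampling interval, and the made-up score series. Only the 3600-second limit is taken from the stderr messages in this thread, and this is not the actual Rosetta watchdog code.

```python
def watchdog_flags_stuck(samples, stuck_after=3600):
    """samples: list of (time_seconds, score) readings in time order.
    Return True if the score stays unchanged for stuck_after seconds."""
    last_change_time, last_score = samples[0]
    for t, score in samples[1:]:
        if score != last_score:
            # Score moved, so restart the "unchanged" timer.
            last_change_time, last_score = t, score
        elif t - last_change_time >= stuck_after:
            return True
    return False

# Folding-like run: the score changes at every 10-minute check.
folding = [(t, -100.0 - t / 600) for t in range(0, 7201, 600)]
# Docking-like run: healthy, but a move is accepted only every 70 minutes.
docking = [(t, -199.144 - (t // 4200)) for t in range(0, 7201, 600)]

print(watchdog_flags_stuck(folding))  # False
print(watchdog_flags_stuck(docking))  # True
```

The docking-like series is still making progress, yet it trips the check because its score happens to sit still for an hour between accepted moves.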


Viromancy
Joined: 23 Sep 06
Posts: 8
Credit: 125,713
RAC: 0
Message 36934 - Posted: 18 Feb 2007, 8:45:42 UTC - in response to Message 36898.  

Nope, it was running at the standard speed. Just for the heck of it though, I've now underclocked it 6% to see how it goes.


Has underclocking helped?


Vagelis Stefas
Joined: 27 Aug 06
Posts: 5
Credit: 118,856
RAC: 0
Message 36935 - Posted: 18 Feb 2007, 12:50:19 UTC

Problem with Rosetta 5.46

WU name: DOC_1MLC_R070216_pose_u_pert_from_farl_abs_tot_1571_1078_0

Target run time was 6 hours.

The WU was at 96.5% and stated that it wanted another 13 minutes to complete. Having run about 6 hours, that wasn't too unreasonable. After an extra hour of running it still reported that it wanted about 15 minutes to complete. So I checked the graphics for that WU and it seemed to be stuck. I paused and restarted it, only to see that an hour of processing was gone (back to 5:47). Now it seems to work fine, but I can't babysit Rosetta forever.

The computer was not overclocked, and other than Rosetta no other major program was running. Computer ID: 377933

(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 36940 - Posted: 18 Feb 2007, 16:15:10 UTC

Q: I have five work units with "Client error", but with CPU time of about 55,000.
Will I be granted credit for these work units?
https://boinc.bakerlab.org/rosetta/results.php?hostid=350614&offset=0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 36942 - Posted: 18 Feb 2007, 16:53:23 UTC

I moved Kodak's post here. He's got at least one of the three WUs that was reported more than 24 hours ago and has been granted zero credit. Is the daily credit granting script running? The other two SHOULD have SOME credit granted by the daily script. It will show on the result page. But I believe the maximum credit awarded by the script is 20 credits. Your machine is crunching at a stellar rate and claims over 300 credits for each of these tasks, and is typically granted significantly more than it claims on successful tasks.

All of your failure codes seem to be -107s; this often points to problems accessing memory. Have you done memory tests on this machine?
Rosetta Moderator: Mod.Sense

mentabelgium
Joined: 16 Dec 05
Posts: 31
Credit: 153,110
RAC: 0
Message 36943 - Posted: 18 Feb 2007, 19:55:17 UTC

Failed WU

I also had a WU, a DOC that after 1 hour of processing didn't pass step 507.
It was a huge result, an energy of -1100. Maybe that's important to know.
So I aborted it

(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 36947 - Posted: 18 Feb 2007, 21:01:36 UTC - in response to Message 36942.  
Last modified: 18 Feb 2007, 21:17:49 UTC

All of your failure codes seem to be -107s, this often points to problems accessing memory. Have you done memory tests on this machine?

I run 24-hour work units (pending for 3 days). Before that, for 5-6 days, I ran 12-hour work units, and the rest of the time I ran in AUTO (~3 hours).
I tested the machine with OCCT (ran fine for 35 min).
MB: P5B Deluxe (BIOS 1004)
CPU: E6600 (FSB 390 MHz)
RAM: Corsair Value 667 (OC @ 780 MHz, 2.0 V)
RightMark Memory Analyzer stability test, 10 min: "Fine"

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 36948 - Posted: 18 Feb 2007, 21:19:26 UTC - in response to Message 36934.  

Nope, it was running at the standard speed. Just for the heck of it though, I've now underclocked it 6% to see how it goes.


Has underclocking helped?

Looks like it - I don't think the host has failed on a WU since. I'll keep an eye on it though.

Vagelis Stefas
Joined: 27 Aug 06
Posts: 5
Credit: 118,856
RAC: 0
Message 36963 - Posted: 19 Feb 2007, 7:52:09 UTC - in response to Message 36943.  
Last modified: 19 Feb 2007, 7:56:34 UTC

Failed WU

I also had a WU, a DOC that after 1 hour of processing didn't pass step 507.
It was a huge result, an energy of -1100. Maybe that's important to know.
So I aborted it


In my case the problem in the WU appeared in model 51 or 57, step 307 or thereabouts.

Michael.L
Joined: 12 Nov 06
Posts: 67
Credit: 31,295
RAC: 0
Message 36965 - Posted: 19 Feb 2007, 12:49:18 UTC
Last modified: 19 Feb 2007, 12:54:05 UTC

Name DOC_1BRC_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_1116_0
Workunit 56364454
Created 18 Feb 2007 4:51:58 UTC
Sent 18 Feb 2007 4:57:04 UTC
Received 18 Feb 2007 21:35:33 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 416175
Report deadline 28 Feb 2007 4:57:04 UTC
CPU time 7124.953125
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
# random seed: 1079855
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -549.143 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .ii1BRC.out

</stderr_txt>


Validate state Valid
Claimed credit 22.1295493695625
Granted credit 20
application version 5.46
--AMD 3200 64bit WXP Home

mark
Joined: 3 Sep 06
Posts: 1
Credit: 633
RAC: 0
Message 36968 - Posted: 19 Feb 2007, 15:10:41 UTC - in response to Message 36637.  


I'm running the latest BOINC from Seti, and Rosetta is simply not resuming when Seti tries to switch:

Sun 18 Feb 2007 01:25:54 PM CST|SETI@home|Task 24no03aa.2465.11330.379824.3.199_1 exited with zero status but no 'finished' file
Sun 18 Feb 2007 01:25:54 PM CST|SETI@home|If this happens repeatedly you may need to reset the project.
Sun 18 Feb 2007 02:34:05 PM CST|SETI@home|Restarting task 24no03aa.2465.11330.379824.3.199_1 using setiathome_enhanced version 512
Sun 18 Feb 2007 02:34:07 PM CST|rosetta@home|Task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 exited with zero status but no 'finished' file
Sun 18 Feb 2007 02:34:07 PM CST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 18 Feb 2007 03:34:17 PM CST|rosetta@home|Restarting task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 using rosetta version 546
Sun 18 Feb 2007 03:34:22 PM CST|SETI@home|Task 24no03aa.2465.11330.379824.3.199_1 exited with zero status but no 'finished' file
Sun 18 Feb 2007 03:34:22 PM CST|SETI@home|If this happens repeatedly you may need to reset the project.
Sun 18 Feb 2007 04:01:43 PM CST||Restarting DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 - message timeout
Sun 18 Feb 2007 04:01:44 PM CST||[error] Process 6459 not found
Sun 18 Feb 2007 05:06:48 PM CST|SETI@home|Restarting task 24no03aa.2465.11330.379824.3.199_1 using setiathome_enhanced version 512
Sun 18 Feb 2007 05:06:50 PM CST|rosetta@home|Task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 exited with zero status but no 'finished' file
Sun 18 Feb 2007 05:06:50 PM CST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 18 Feb 2007 06:29:25 PM CST|rosetta@home|Restarting task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 using rosetta version 546
Sun 18 Feb 2007 06:29:27 PM CST|SETI@home|Task 24no03aa.2465.11330.379824.3.199_1 exited with zero status but no 'finished' file
Sun 18 Feb 2007 06:29:27 PM CST|SETI@home|If this happens repeatedly you may need to reset the project.
Sun 18 Feb 2007 07:56:03 PM CST||Restarting DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 - message timeout
Sun 18 Feb 2007 07:56:03 PM CST|rosetta@home|Restarting task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 using rosetta version 546
Sun 18 Feb 2007 07:56:04 PM CST||[error] Process 14659 not found
Sun 18 Feb 2007 09:23:22 PM CST|SETI@home|Restarting task 24no03aa.2465.11330.379824.3.199_1 using setiathome_enhanced version 512
Mon 19 Feb 2007 01:04:00 AM CST||Restarting 24no03aa.2465.11330.379824.3.199_1 - message timeout
Mon 19 Feb 2007 01:04:00 AM CST|SETI@home|Restarting task 24no03aa.2465.11330.379824.3.199_1 using setiathome_enhanced version 512
Mon 19 Feb 2007 01:04:02 AM CST||[error] Process 21992 not found

I have reset both projects several times. Any ideas?
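
A log like the one above is easier to read once the repeated failures are tallied per project. The following Python sketch assumes the `timestamp|project|message` layout visible in the pasted lines; the function name `count_unfinished_exits` is hypothetical and this is not a BOINC tool.

```python
from collections import Counter

def count_unfinished_exits(log_lines):
    """Count "exited with zero status" messages per project."""
    counts = Counter()
    for line in log_lines:
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        _timestamp, project, message = parts
        if "exited with zero status but no 'finished' file" in message:
            counts[project] += 1
    return counts

log = [
    "Sun 18 Feb 2007 01:25:54 PM CST|SETI@home|Task 24no03aa.2465.11330.379824.3.199_1 exited with zero status but no 'finished' file",
    "Sun 18 Feb 2007 02:34:07 PM CST|rosetta@home|Task DOC_1BVK_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_2226_0 exited with zero status but no 'finished' file",
    "Sun 18 Feb 2007 04:01:44 PM CST||[error] Process 6459 not found",
]
print(count_unfinished_exits(log))
```

If both projects show the same message at similar rates, as they do in the log above, that hints at something on the host or in the client rather than in one project's application.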

Michael.L
Joined: 12 Nov 06
Posts: 67
Credit: 31,295
RAC: 0
Message 36973 - Posted: 19 Feb 2007, 17:29:11 UTC

Result ID 63293328
Name DOC_1CSE_R070216_pose_u_pert_bbmin_from_farlx_abs_tol_1571_1911_0
Workunit 56411744
Created 18 Feb 2007 12:36:16 UTC
Sent 18 Feb 2007 12:41:57 UTC
Received 19 Feb 2007 17:13:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 416175
Report deadline 28 Feb 2007 12:41:57 UTC
CPU time 13927.140625
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
# random seed: 1039060
# cpu_run_time_pref: 14400
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 3696.27 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .ii1CSE.out

</stderr_txt>


Validate state Valid
Claimed credit 43.2566138514458
Granted credit 20
application version 5.46
--
Do we still need to report these??

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 36974 - Posted: 19 Feb 2007, 19:03:15 UTC - in response to Message 36935.  

...

Target run time was 6 hours.

The WU was currently at 96.5% and stated that it wanted another 13 minutes to complete. Having run about 6 hours that wasn't too unreasonable. After an extra hour of running it still reported that it wanted about 15 minutes to complete. ...


A general note, not specific to this WU:

Please note that Rosetta's estimates of time left to completion are only accurate immediately after they drop in value. Then they slowly increase until the next time they drop in value again. The increases can be disregarded.

The value will be approximately right providing the WU does not get stuck.

If a WU does get stuck then the estimated time to complete will go on increasing slowly forever (if the clock is still running) or will stay the same forever (if the clock has stuck also).

In brief, you cannot rely on the time left to complete when there is a risk of a stuck WU.

River~~
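
The behaviour River~~ describes can be reproduced with a very simple model. The sketch below assumes the time left is extrapolated from the elapsed time and the reported fraction done, and that the fraction done only moves when the WU makes progress. That formula is an assumption for illustration, not a statement of how BOINC or Rosetta actually compute the figure, and the name `estimated_remaining` is hypothetical.

```python
def estimated_remaining(elapsed_seconds, fraction_done):
    """Naive estimate: assume the rest proceeds at the average rate so far."""
    return elapsed_seconds * (1.0 - fraction_done) / fraction_done

# fraction_done frozen at 96.5% while the clock keeps running
for hours in (6.0, 6.5, 7.0):
    minutes_left = estimated_remaining(hours * 3600, 0.965) / 60
    print(f"{hours:.1f} h elapsed -> about {minutes_left:.0f} min left")
```

With the fraction stuck at 96.5%, this model gives about 13 minutes at 6 hours and about 15 minutes at 7 hours, which is consistent with the figures quoted above: the estimate creeps upward for as long as the WU makes no progress.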

Viromancy
Joined: 23 Sep 06
Posts: 8
Credit: 125,713
RAC: 0
Message 36980 - Posted: 19 Feb 2007, 21:18:11 UTC

Another watchdog termination in 5.46...this time after quite an impressive amount of time:

https://boinc.bakerlab.org/rosetta/result.php?resultid=63410631


TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 27,433,152
RAC: 4,249
Message 36982 - Posted: 20 Feb 2007, 1:50:56 UTC - in response to Message 36980.  
Last modified: 20 Feb 2007, 2:07:03 UTC

Message log from my Core 2 Duo, computer 286302, is below. The WU ending in 40997_0 started at 2:27am and was still unfinished at 5:31pm when I suspended it. We'll see if it resumes when a core is available. Watchdog didn't do much for me here. It should have taken less than 3 hours to complete.


2/19/2007 2:27:10 AM|rosetta@home|Starting ep10__BOINC_ABRELAX_hom001__1569_40997_0
2/19/2007 2:27:10 AM|rosetta@home|Starting task ep10__BOINC_ABRELAX_hom001__1569_40997_0 using rosetta version 546
2/19/2007 2:27:12 AM|rosetta@home|[file_xfer] Started upload of file ep10__BOINC_ABRELAX_hom001__1569_23899_0_0
2/19/2007 2:27:17 AM|rosetta@home|[file_xfer] Finished upload of file ep10__BOINC_ABRELAX_hom001__1569_23899_0_0
2/19/2007 5:30:59 PM|rosetta@home|Sending scheduler request: Requested by user
2/19/2007 5:30:59 PM|rosetta@home|Reporting 5 tasks
2/19/2007 5:31:04 PM|rosetta@home|Scheduler RPC succeeded [server version 509]
2/19/2007 5:31:04 PM|rosetta@home|Deferring communication for 4 min 2 sec
2/19/2007 5:31:04 PM|rosetta@home|Reason: requested by project
2/19/2007 5:31:16 PM|rosetta@home|Starting BAK1topH_TnC_loop_model_1561_16682_0
2/19/2007 5:31:16 PM|rosetta@home|Starting task BAK1topH_TnC_loop_model_1561_16682_0 using rosetta version 546

Crunch with friends - TeAm Anandtech

©2021 University of Washington
https://www.bakerlab.org