Rosetta@home

Problems with Rosetta version 5.98

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 942
ID: 14
Credit: 2,303,046
RAC: 485
Message 53999 - Posted 25 Jun 2008 22:44:19 UTC

Please post bugs/issues regarding version 5.98 here.

Mike Francis

Joined: Nov 24 05
Posts: 8
ID: 17484
Credit: 623,519
RAC: 0
Message 54021 - Posted 26 Jun 2008 20:48:51 UTC

6/26/2008 3:31:54 PM|rosetta@home|Starting t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_1_1
6/26/2008 3:31:54 PM|rosetta@home|Starting task t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_1_1 using rosetta_beta version 598
6/26/2008 4:16:56 PM|rosetta@home|Computation for task t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_1_1 finished
6/26/2008 4:16:56 PM|rosetta@home|Output file t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_1_1_0 for task t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_1_1 absent

____________

[KWSN]John Galt 007

Joined: Aug 4 06
Posts: 6
ID: 103245
Credit: 1,012,507
RAC: 0
Message 54025 - Posted 27 Jun 2008 2:36:06 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158620863

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158616469

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158605953

All with compute errors in the first minute.
____________

Adam

Joined: Jun 26 07
Posts: 7
ID: 186584
Credit: 487,917
RAC: 0
Message 54031 - Posted 27 Jun 2008 17:10:47 UTC

Compute error,
http://boinc.bakerlab.org/rosetta/result.php?resultid=173774049

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 54034 - Posted 28 Jun 2008 0:39:23 UTC

This one fell over on both hosts, same error.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158612212

Output file FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4126_3627_1_0 for task absent

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2404847
ERROR:: Exit from: .\loop_relax.cc line: 1745

</stderr_txt>

pete.



____________


RC

Joined: Sep 27 05
Posts: 13
ID: 1401
Credit: 245,498
RAC: 0
Message 54035 - Posted 28 Jun 2008 0:45:05 UTC - in response to Message ID 54031.

Another compute error,
http://boinc.bakerlab.org/rosetta/result.php?resultid=173797309
____________

anti-cancers

Joined: Sep 2 06
Posts: 9
ID: 109503
Credit: 173,262
RAC: 0
Message 54037 - Posted 28 Jun 2008 6:20:01 UTC
Last modified: 28 Jun 2008 6:20:48 UTC

Compute error...
____________

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 54038 - Posted 28 Jun 2008 10:33:46 UTC - in response to Message ID 54034.

[quote]This one fell over on both hosts, same error.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158612212

Output file FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4126_3627_1_0 for task absent

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2404847
ERROR:: Exit from: .\loop_relax.cc line: 1745

</stderr_txt>

pete.[/quote]




same here

Konstantin Iliev

Joined: May 22 06
Posts: 4
ID: 83901
Credit: 2,053,517
RAC: 0
Message 54053 - Posted 28 Jun 2008 16:54:45 UTC

Same errors again as with 5.96 :(

http://boinc.bakerlab.org/rosetta/result.php?resultid=173787198
http://boinc.bakerlab.org/rosetta/result.php?resultid=173807571
http://boinc.bakerlab.org/rosetta/result.php?resultid=173821223
____________

adrianxw

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 54065 - Posted 29 Jun 2008 14:56:31 UTC
Last modified: 29 Jun 2008 14:58:03 UTC

174088567
174032945
174000755

First and last: validate errors after the full job (10,000+ seconds). The middle one failed after just a few seconds (Exit from: .\loop_relax.cc line: 1745).
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Wonderwall

Joined: Mar 19 07
Posts: 1
ID: 154598
Credit: 39,192
RAC: 0
Message 54067 - Posted 29 Jun 2008 16:25:51 UTC - in response to Message ID 53999.

[quote]Please post bugs/issues regarding version 5.98 here.[/quote]

rosetta@home Rosetta Beta 5.96 1405_CaspB_IUMPAB_Type2_RES81to19... 03:51:35 00.000% ...06/28/... Running high prio...
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 54072 - Posted 29 Jun 2008 19:58:38 UTC

This WU did 1247 decoys, then was marked "invalid" for no apparent reason:

FRA_t449_CASP8_AUTO_1SNZ_1L7J_2CIQ_1_IGNORE_THE_RESTt449_1_ttttaaT0449_1L7JA_10_0001_0001_0002_4134_634

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 54082 - Posted 30 Jun 2008 14:19:45 UTC

This little bugger has been running all weekend. 47hrs on a 24hr preference.
FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4142_1913 Yet it is still getting CPU time, and the step number is still incrementing. It says it is on model 151.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

sslickerson

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 54085 - Posted 30 Jun 2008 17:47:13 UTC

Here is a very fast error running version 5.98 on Windows XP: 174404220. It looks like it failed on at least one other host in the same manner.


____________



BrnmccO1

Joined: Jun 26 07
Posts: 17
ID: 186323
Credit: 578,825
RAC: 0
Message 54089 - Posted 30 Jun 2008 20:41:29 UTC
Last modified: 30 Jun 2008 20:44:50 UTC

Bizarre problem with this WU; had an 'unhandled exception error' after approx. 50 mins CPU run time, with a lengthy Std_Out: 157316144

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2747207


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00B3C947 read attempt to address 0x000000A4

Engaging BOINC Windows Runtime Debugger...


Otherwise no other errors so far with 5.98 on both of my hosts (knocks on wood ;p)

TeAm Enterprise

Joined: Sep 28 05
Posts: 18
ID: 1546
Credit: 20,535,719
RAC: 2,662
Message 54096 - Posted 1 Jul 2008 4:56:20 UTC
Last modified: 1 Jul 2008 4:57:11 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158648907

Two validate errors after full crunch.

Rosetta needs to think about how to apply credit when the problems are obviously of project/WU source.

Jim
____________
Crunch with friends - TeAm Anandtech

dcdc

Joined: Nov 3 05
Posts: 1596
ID: 8948
Credit: 33,802,121
RAC: 17,331
Message 54097 - Posted 1 Jul 2008 8:19:57 UTC - in response to Message ID 54096.

[quote]http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158648907

Two validate errors after full crunch.

Rosetta needs to think about how to apply credit when the problems are obviously of project/WU source.

Jim[/quote]

Credit is applied to these as claimed - it doesn't show on the task's main page but does if you hit the Task ID link on the left.

HTH
Danny
____________

Virtual Boss*

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 54101 - Posted 1 Jul 2008 10:58:04 UTC
Last modified: 1 Jul 2008 11:18:21 UTC

WU FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4142_3294_1 using rosetta_beta version 598

Original estimated run time about 6 CPU Hrs

Still running at 10:10:00 CPU

Progress 98.386% and incrementing 0.001 about every 25 CPU secs

To Completion 00:09:55 (no change last 30 CPU minutes)

At the current rate of increase it will take another 11+ CPU Hrs, or, if Prog% is calculated as time done as a % of (time done + To Completion), it will run forever.

BTW Currently Model 22 Step 47795

dcdc

Joined: Nov 3 05
Posts: 1596
ID: 8948
Credit: 33,802,121
RAC: 17,331
Message 54103 - Posted 1 Jul 2008 11:24:21 UTC - in response to Message ID 54101.

[quote]WU FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4142_3294_1 using rosetta_beta version 598

Original estimated run time about 6 CPU Hrs

Still running at 10:10:00 CPU

Progress 98.386% and incrementing 0.001 about every 25 CPU secs

To Completion 00:09:55 (no change last 30 CPU minutes)

At the current rate of increase it will take another 11+ CPU Hrs, or, if Prog% is calculated as time done as a % of (time done + To Completion), it will run forever.

BTW Currently Model 22 Step 47795[/quote]

the % complete and time to completion aren't linear - they're estimates, so don't worry about them if Rosetta's CPU time is increasing in task manager.

Danny
____________

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 54105 - Posted 1 Jul 2008 14:37:53 UTC

Danny is correct about the time estimates.

But the t449 I reported earlier never did finish. I let it run for 68 hours before aborting it (my runtime preference is 24hrs so I'm sure the watchdog would have discovered it after 4x that preference, but I didn't want to waste the time).

My aborted task didn't seem to send the normal data in to the server. 150 presumably good models lost. So, I would suggest (if you have the patience) exiting and restarting BOINC 5 times, each time letting it run long enough to get itself initialized and running the problem task. Rosetta will then detect no progress after 4 or 5 restarts and more cleanly cut it off and send it in.

I'm also wishing I had saved a copy of all the slots directories. Again, if you have the time, after your first exit of BOINC, I would save all the directories under your BOINC installation path with /slots on the end of the path, and EMail it to the rosettamod.
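
For anyone who wants to script that backup, here is a minimal Python sketch. It assumes the slots directory sits directly under the BOINC installation path, as described above. The function name backup_slots and the example paths are made up for illustration, so adjust them to your own install, and exit BOINC first so the files are not changing underneath the copy.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_slots(boinc_dir, dest_root):
    """Copy <boinc_dir>/slots into a new timestamped folder under dest_root.

    Both arguments are whatever paths apply on your machine; nothing here
    is specific to a particular BOINC version.
    """
    src = Path(boinc_dir) / "slots"
    if not src.is_dir():
        raise FileNotFoundError(f"no slots directory at {src}")
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"slots-{stamp}"
    # copytree creates dest (and any missing parents) and copies every slot.
    shutil.copytree(src, dest)
    return dest


# Example call (hypothetical paths, shown as a comment so nothing runs by accident):
# backup_slots(r"C:\Program Files\BOINC", r"C:\rosetta_debug")
```

The timestamp in the folder name keeps repeated backups from overwriting each other if you save a copy after each restart.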
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

glaesum

Joined: Oct 16 06
Posts: 21
ID: 120376
Credit: 106,074
RAC: 0
Message 54109 - Posted 1 Jul 2008 15:15:31 UTC

starting to get the occasional error on 5.98:

here is a t443 {wuid=158705243} that plugged away for nearly 15hrs until it packed in with a validate error. credit was claimed and granted but never actually got issued...

there's no diagnostic on my task report but the wingman's task stopped with client error after 20mins and does have lots of diagnostics (too many restarts with no progress).

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 54111 - Posted 1 Jul 2008 16:06:14 UTC

I'm also seeing this same slowdown problem. A 4 hour task on t443 (17438540) has been going over 8 hours (according to BOINC) and longer in the real world. It appeared stuck on Model 2 Step 373221. Will try restarting BOINC several times as suggested above. Win XP, BOINC 5.10.28.
____________

Betting Slip

Joined: Sep 26 05
Posts: 71
ID: 1160
Credit: 5,702,246
RAC: 0
Message 54114 - Posted 1 Jul 2008 17:52:02 UTC - in response to Message ID 54097.

[quote][quote]http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158648907

Two validate errors after full crunch.

Rosetta needs to think about how to apply credit when the problems are obviously of project/WU source.

Jim[/quote]

Credit is applied to these as claimed - it doesn't show on the task's main page but does if you hit the Task ID link on the left.

HTH
Danny[/quote]

Not on mine they're not. On the Task ID page it states claimed and then granted = 0
____________

sslickerson

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 54129 - Posted 1 Jul 2008 23:45:19 UTC

Here is a really bad WU, even though it validated: 174615818

My runtime preference is 7200 seconds, but this one ran over 29000 seconds. Here is the kicker: Claimed credit 102.7, Granted credit 13.4.

and another with the same problem: 174641989

3-5 hours of wasted credit is a huge problem!
____________



Virtual Boss*

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 54141 - Posted 2 Jul 2008 12:00:49 UTC - in response to Message ID 54103.

[quote][quote]WU FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4142_3294_1 using rosetta_beta version 598

Original estimated run time about 6 CPU Hrs

Still running at 10:10:00 CPU

Progress 98.386% and incrementing 0.001 about every 25 CPU secs

To Completion 00:09:55 (no change last 30 CPU minutes)

At the current rate of increase it will take another 11+ CPU Hrs, or, if Prog% is calculated as time done as a % of (time done + To Completion), it will run forever.

BTW Currently Model 22 Step 47795[/quote]

the % complete and time to completion aren't linear - they're estimates, so don't worry about them if Rosetta's CPU time is increasing in task manager.

Danny[/quote]



This WU still running at 17:22:00 CPU

Progress now 99.049% and incrementing every 66 CPU secs

To completion is now 00:09:56 (no change last CPU Hr)

Currently Model 22 Step 69581

Also noticed no files in task slot have been updated since 30/06/2008 12:38 PM (56.6 Hrs ago Real Time)(approx 14.4 Hrs ago CPU Time)

I don't think it is worth keeping this WU going
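
The guess that Prog% is time done as a percentage of (time done + To Completion) can be checked against the two readings reported for this WU. Here is a minimal Python sketch, assuming that formula. It is only a hypothesis from this thread, not confirmed BOINC behaviour, and the helper names are made up.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def progress_percent(cpu_done, to_completion):
    """Hypothesised progress: time done as a % of (time done + time left)."""
    done = hms_to_seconds(cpu_done)
    left = hms_to_seconds(to_completion)
    return 100.0 * done / (done + left)


# (CPU time, To Completion, Progress % shown by BOINC) as reported in this thread
readings = [
    ("10:10:00", "00:09:55", 98.386),
    ("17:22:00", "00:09:56", 99.049),
]

for cpu_done, to_completion, shown in readings:
    predicted = progress_percent(cpu_done, to_completion)
    print(f"CPU {cpu_done}: predicted {predicted:.3f}%, shown {shown:.3f}%")
```

Both readings land within a couple of hundredths of a percentage point of what the formula predicts, which would also explain why the figure creeps toward 100% without reaching it while To Completion stays put.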

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 54142 - Posted 2 Jul 2008 13:29:00 UTC - in response to Message ID 54141.



[quote]This WU still running at 17:22:00 CPU

...

I don't think it is worth keeping this WU going[/quote]


As long as the CPU is running, the WU is alive and well.

My largest WU needed 46 hours to complete (and 12 hours default runtime of course) - the time needed depends on your computer. So if you're not too impatient...

Virtual Boss*

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 54143 - Posted 2 Jul 2008 13:34:36 UTC - in response to Message ID 54142.



[quote][quote]This WU still running at 17:22:00 CPU

...

I don't think it is worth keeping this WU going[/quote]

As long as the CPU is running, the WU is alive and well.

My largest WU needed 46 hours to complete (and 12 hours default runtime of course) - the time needed depends on your computer. So if you're not too impatient...[/quote]



OK, will leave it running a while longer

anti-cancers

Joined: Sep 2 06
Posts: 9
ID: 109503
Credit: 173,262
RAC: 0
Message 54147 - Posted 2 Jul 2008 21:19:21 UTC

Result (#174921063)

Wednesday, 2 Jul 2008, 23:13:30 CEST|rosetta@home|Computation for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2443_0 finished
Wednesday, 2 Jul 2008, 23:13:30 CEST|rosetta@home|Output file FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2443_0_0 for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2443_0 absent

adrianxw

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 54151 - Posted 3 Jul 2008 9:04:20 UTC

174952450

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2137497
======================================================
DONE :: 1 starting structures 7768.21 cpu seconds
This process generated 1 decoys from 1 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_1732_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54155 - Posted 3 Jul 2008 17:41:16 UTC

t407__CASP8_JUMPAB_SAMPLE2_newalign_SAVE_ALL_OUT_BARCODE_hom005__3659_88038_2
VALIDATE ERROR - NO CREDIT GRANTED

just wasted 4 hrs cpu time on this one...I thought these issues were taken care of?

billy ewell 1931

Joined: Mar 30 07
Posts: 10
ID: 160868
Credit: 3,008,779
RAC: 0
Message 54157 - Posted 3 Jul 2008 19:37:47 UTC

Three work units failed in sequence between 01:19 and 01:43 on 3 July UTC time. The error messages reported a "file transfer error". Apparently over 9 hours of CPU time lost. The WUIDs are: 159628332, 159620771 and 159639505. I have no idea if 5.98 is the culprit, but I doubt it, as many other units have completed OK while I have been running 4 CPUs at 100% Rosetta for 68.5 hours straight.

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 54162 - Posted 3 Jul 2008 21:43:23 UTC

FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_492 seems to be problematic. Same error as adrianxw and anti-cancers posted.

stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 0.
Maximum size: 8388608.
RLIM_INFINITY 0
# cpu_run_time_pref: 21600
# random seed: 2138737
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13919.5 cpu seconds
This process generated 1 decoys from 1 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_492_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


Thu Jul 3 16:59:01 2008|rosetta@home|Computation for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_492_1 finished
Thu Jul 3 16:59:01 2008|rosetta@home|Output file FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_492_1_0 for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_492_1 absent

kb7rzf

Joined: Oct 7 05
Posts: 16
ID: 3186
Credit: 35,427
RAC: 0
Message 54164 - Posted 4 Jul 2008 0:09:37 UTC

Well, had 1 compute error since I started crunching again here. The wu is FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1226

The info from the STDERR OUT:

stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2143013
======================================================
DONE :: 1 starting structures 20322.7 cpu seconds
This process generated 7 decoys from 7 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1226_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 54166 - Posted 4 Jul 2008 4:46:34 UTC

This errored out on me after 4hrs,50min; same error on two hosts again!

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=159591795

7/4/2008 2:29:37 PM|rosetta@home|Output file FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_144_1_0 for task absent


<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2139085
======================================================
DONE :: 1 starting structures 17427 cpu seconds
This process generated 4 decoys from 4 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_144_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>

pete.

____________


Virtual Boss*

Joined: May 10 08
Posts: 35
ID: 257766
Credit: 700,682
RAC: 148
Message 54168 - Posted 4 Jul 2008 15:44:06 UTC - in response to Message ID 54142.

[quote]As long as the CPU is running, the WU is alive and well.

My largest WU needed 46 hours to complete (and 12 hours default runtime of course) - the time needed depends on your computer. So if you're not too impatient...[/quote]

Finally finished at 24:09:10 CPU

Claimed Credit 189.06
Granted Credit 97.39

not as good credit/work as normal but thanks for the confidence to let it complete

sslickerson

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 54170 - Posted 4 Jul 2008 16:48:55 UTC - in response to Message ID 54168.



[quote][quote]As long as the CPU is running, the WU is alive and well.

My largest WU needed 46 hours to complete (and 12 hours default runtime of course) - the time needed depends on your computer. So if you're not too impatient...[/quote]

Finally finished at 24:09:10 CPU

Claimed Credit 189.06
Granted Credit 97.39

not as good credit/work as normal but thanks for the confidence to let it complete[/quote]

But as you can see the credit discrepancy is huge. This is actually a problem as far as I am concerned. The project staff should look into why this is happening.

The application would not do, say, 5 decoys, decide it had time to do one more, and then take *14 hours* to do so. This is a huge problem and I hope the project staff are looking into it.

Tim
____________



Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54175 - Posted 4 Jul 2008 18:40:59 UTC

See my post here. It would seem that your credit is harmed when the last model happens to take significantly longer than the others, because credit is based on averages and that particular model is anything but average.

As I posted at the above reference, this issue of specific long-running models is already under investigation.
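
To make the averaging effect concrete, here is a rough Python sketch. It is not the project's actual credit code. It assumes claimed credit scales with CPU time and granted credit is the number of models completed times a project-wide average credit per model. The rate, the average, and the per-model hours are all invented for illustration.

```python
def claimed_credit(cpu_hours, credit_per_cpu_hour):
    """Assumed model: what the host claims scales with CPU time spent."""
    return cpu_hours * credit_per_cpu_hour


def granted_credit(models_completed, avg_credit_per_model):
    """Assumed model: what is granted scales with models returned."""
    return models_completed * avg_credit_per_model


# Hypothetical task: five ordinary 1-hour models, then one 14-hour model.
model_hours = [1.0] * 5 + [14.0]
rate = 5.0            # invented credit per CPU hour for this host
avg_per_model = 5.0   # invented average credit per model across all hosts

claimed = claimed_credit(sum(model_hours), rate)
granted = granted_credit(len(model_hours), avg_per_model)
print(f"claimed {claimed:.1f}, granted {granted:.1f}")
```

Under these made-up numbers the single slow model accounts for most of the CPU time but counts as just one model, so the granted figure falls well below the claimed one.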
____________
Rosetta Moderator: Mod.Sense

Hypermarkup

Joined: Mar 3 06
Posts: 7
ID: 63370
Credit: 112,275
RAC: 0
Message 54176 - Posted 4 Jul 2008 19:14:04 UTC

Too many compute errors:

http://boinc.bakerlab.org/rosetta/result.php?resultid=174297995
http://boinc.bakerlab.org/rosetta/result.php?resultid=174188285
http://boinc.bakerlab.org/rosetta/result.php?resultid=173801436

____________

Hypermarkup
Fotowing

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 54179 - Posted 4 Jul 2008 21:32:09 UTC
Last modified: 4 Jul 2008 21:45:59 UTC

This one last night. Edit// Ran about 7 mins.


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=158009479


7/4/2008 5:40:10 PM|rosetta@home|Output file for task t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_4531_1 absent

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2590114
ERROR:: Exit from: .\refold.cc line: 338

</stderr_txt>

pete.
____________


P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 54180 - Posted 5 Jul 2008 3:26:47 UTC

This is starting to P... me off; it ran for 5hrs,57min. Getting more of these than with the old 5.96 app. Will credit be given for the work done?

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=159600967

7/5/2008 12:50:11 PM|rosetta@home|Output file for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_473_1 absent


<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2143766
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 21418 cpu seconds
This process generated 9 decoys from 9 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_473_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

pete.



____________


The_Bad_Penguin

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 54181 - Posted 5 Jul 2008 4:45:46 UTC

FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4126_1926_1

errors Too many error results

CPU time 36.42623
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2406548
ERROR:: Exit from: .\loop_relax.cc line: 1745

</stderr_txt>
]]>

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 54183 - Posted 5 Jul 2008 4:53:29 UTC

FRA_t449_CASP8_MANUAL_1_IGNORE_THE_RESTt449_1_ttxxxxT0449_1CHIM_0001_0001_0001_4126_3868_1

errors Too many error results

stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2404606
ERROR:: Exit from: .\loop_relax.cc line: 1745

</stderr_txt>
]]>


Validate state Invalid

kb7rzf

Joined: Oct 7 05
Posts: 16
ID: 3186
Credit: 35,427
RAC: 0
Message 54185 - Posted 5 Jul 2008 8:48:39 UTC - in response to Message ID 54164.

The 2nd result on this work unit also got the same error.

[quote]Well, had 1 compute error since I started crunching again here. The wu is FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1226

The info from the STDERR OUT:

stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2143013
======================================================
DONE :: 1 starting structures 20322.7 cpu seconds
This process generated 7 decoys from 7 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1226_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>[/quote]

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54186 - Posted 5 Jul 2008 10:03:47 UTC

rosetta@home|Task t443_FULL_h001__CASP8_LONGRANGE_JUMP_SAVE_ALL_OUT_BARCODE__4133_145188_0 exited with a DLL initialization error.
|rosetta@home|If this happens repeatedly you may need to reboot your computer.
|rosetta@home|Restarting task t443_FULL_h001__CASP8_LONGRANGE_JUMP_SAVE_ALL_OUT_BARCODE__4133_145188_0 using rosetta_beta version 598

It was at 100% and 5 hrs, and when I opened the graphics window it just went blank.
When I tried to close the window, it went into not responding.
I finally got the window to close and it reset itself to 42%.

BrnmccO1

Joined: Jun 26 07
Posts: 17
ID: 186323
Credit: 578,825
RAC: 0
Message 54188 - Posted 5 Jul 2008 16:27:27 UTC

159639723

Compute error after full run; it also failed on someone else's host. Output file missing.

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2136320
======================================================
DONE :: 1 starting structures 10289.6 cpu seconds
This process generated 1 decoys from 1 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2909_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 54189 - Posted 5 Jul 2008 16:27:42 UTC

this FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_78 WU seemed to crunch correctly, then it bombed out with:

<message><file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_78_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

This WU did the same for the other cruncher as well.

BrnmccO1

Joined: Jun 26 07
Posts: 17
ID: 186323
Credit: 578,825
RAC: 0
Message 54190 - Posted 5 Jul 2008 16:32:26 UTC - in response to Message ID 54189.

[quote]this FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_78 WU seemed to crunch correctly, then it bombed out with:

<message><file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_78_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

This WU did the same for the other cruncher as well.[/quote]


I got the same -161 Output file missing error from one of my t453's as well.
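
When comparing reports like these, a small script can pull the file name and error code out of pasted task output. Here is a minimal Python sketch, assuming the <file_xfer_error> layout seen in the output quoted in this thread. The function name find_transfer_errors is made up for illustration.

```python
import re

# Matches a <file_xfer_error> block and captures the file name and error code.
FILE_XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def find_transfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in pasted output."""
    return [
        (match.group("name"), int(match.group("code")))
        for match in FILE_XFER_RE.finditer(text)
    ]


# Sample copied from the task output posted above.
sample = """<message><file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_78_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>"""

for name, code in find_transfer_errors(sample):
    print(code, name)
```

Running it over several saved outputs makes it easy to see whether the failures all share the same code and the same t453 batch.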
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 54199 - Posted 6 Jul 2008 1:38:10 UTC

Another 8hrs wasted. Is this a BOINC, app, server or workunit problem? Someone! Anyone!

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=159608923

7/6/2008 11:19:02 AM|rosetta@home|Output file for task FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_823_1 absent

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2143416
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 29142 cpu seconds
This process generated 10 decoys from 10 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_823_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>

pete.

____________


AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 54200 - Posted 6 Jul 2008 3:09:29 UTC

Here's two more that crunched to completion, then bombed out with the -161 error for both me and the other cruncher:

FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1501
FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4163_1279

The_Bad_Penguin
Avatar

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 54245 - Posted 7 Jul 2008 16:17:39 UTC

t451_M4_grishin_IGNORE_THE_REST_renumbered_4150_1393_0

Outcome Validate error

CPU time 10275.77
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
#
</stderr_txt>
]]>


Validate state Invalid
Claimed credit 42.7165467457025
Granted credit 0
application version 5.98

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 54250 - Posted 7 Jul 2008 18:17:11 UTC

The Server Status page currently says that the validator is not running.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54264 - Posted 7 Jul 2008 23:51:47 UTC - in response to Message ID 54250.

The Server Status page currently says that the validator is not running.


I've EMailed the Project Team pointing this out. Thanks for pointing it out.
____________
Rosetta Moderator: Mod.Sense

Rhiju
Forum moderator

Joined: Jan 8 06
Posts: 223
ID: 48256
Credit: 3,546
RAC: 0
Message 54267 - Posted 8 Jul 2008 0:49:33 UTC - in response to Message ID 54264.

Thanks... we changed the validator code earlier today after testing on RALPH, but there's clearly still an issue! I've contacted DK to revert to the old code.

The Server Status page currently says that the validator is not running.


I've EMailed the Project Team pointing this out. Thanks for pointing it out.


____________

TeAm Enterprise Profile
Avatar

Joined: Sep 28 05
Posts: 18
ID: 1546
Credit: 20,535,719
RAC: 2,662
Message 54268 - Posted 8 Jul 2008 2:07:13 UTC

You have bigger problems than the validator code.

Since 5.96 I have never had more problems with errors.

I have just aborted all the T484 WUs since these don't work on my machine. Are you folks getting any science, or just problems?

Jim
____________
Crunch with friends - TeAm Anandtech

Alan Roberts

Joined: Jun 7 06
Posts: 61
ID: 93009
Credit: 5,897,995
RAC: 2,607
Message 54456 - Posted 12 Jul 2008 14:25:13 UTC

I've had so many problems with Mini (see this post) that I've had to resort to filtering it off of quite a few of my dual-core/dual-CPU machines.

This morning I walked into my home's listening room to find my recycled laptop, a low-power music server (that to date has happily consumed anything Rosetta sent its way), making excessive noise. On checking, I found this 5.98 WU stuck at 100% CPU, even though the machine's preferences were set for a maximum of 70% of CPU (BOINC 5.10.45, and the BOINC CPU setting has been honored in the past). Within BOINC, CPU time used and progress were not advancing; the job was sitting at 20-something percent progress.

Suspending the project did not suspend the job. Shutting down the BOINC service did. Ran a round of Windows updates and rebooted. The work unit restarted and ran with CPU throttling for about 10 minutes, then locked up at 100% again. This time I aborted the task ... I believe the first time across any of the machines on my team that I've had to abandon a 5.98 work unit.

The worst news for me is that the long (possibly better part of two days) non-cycling fan run seems to have put the fan into a permanent high-noise mode. I've got a spare fan assembly, but won't really enjoy the time to tear down and reassemble the unit this weekend.

I guess I'll reinstall and set up Threadmaster, since the BOINC/Rosetta combination seems to be trending towards less operational reliability.

I know everyone is busy with CASP, but I have to emphasize that this is important to me, and I assume to others who are trying to contribute with machines that are not dedicated crunchers. Most of the machines on my team are there because I committed to not loading the machine during business hours (time-of-day and when needed manual suspends) and not overheating the machine (CPU limits). If I can't reliably do this with minimal ongoing effort I'll end up having to pull machines off the project.
____________

The_Bad_Penguin
Avatar

Joined: Jun 5 06
Posts: 2747
ID: 89694
Credit: 1,859,902
RAC: 0
Message 54467 - Posted 13 Jul 2008 3:58:27 UTC

errors Too many error results

FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2851_0


CPU time 8831.965
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2136378
======================================================
DONE :: 1 starting structures 8831.64 cpu seconds
This process generated 4 decoys from 4 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>FRA_t453_CASP8_HYBRID_MANUAL_1_IGNORE_THE_RESTt451_1_axmin1_0001_4165_2851_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


Validate state Invalid

Harwood

Joined: Nov 15 05
Posts: 1
ID: 12335
Credit: 1,694,364
RAC: 39
Message 54562 - Posted 18 Jul 2008 1:14:00 UTC

I am running on an AMD Athlon in Win Server 2k8 with rosetta_beta_5.98_windows_x86_64.exe, and I noticed in the task manager that it is running in 32-bit compatibility. Could it be that a flag is thrown and this is a 64-bit app? Regardless, we are not getting the performance for the project. It's running, but it could be better.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54566 - Posted 18 Jul 2008 13:05:57 UTC - in response to Message ID 54562.

I am running on an AMD Athlon in Win Server 2k8 with rosetta_beta_5.98_windows_x86_64.exe, and I noticed in the task manager that it is running in 32-bit compatibility. Could it be that a flag is thrown and this is a 64-bit app? Regardless, we are not getting the performance for the project. It's running, but it could be better.


At this point, there is no true 64bit application. The Project Team is aware of the performance implications of that fact.
____________
Rosetta Moderator: Mod.Sense

Azurrio Profile

Joined: Feb 20 06
Posts: 8
ID: 60240
Credit: 211,446
RAC: 0
Message 54591 - Posted 21 Jul 2008 11:12:06 UTC

Compute/validate error on this
____________

Alberthuang

Joined: Dec 5 05
Posts: 6
ID: 30308
Credit: 61,178
RAC: 105
Message 54624 - Posted 23 Jul 2008 7:31:22 UTC

My computer runs Windows XP SP3 with BOINC manager version 5.10.45. It computed the workunit n004__BOINC_SYMMETRY_C4SYMM_FOLD_AND_DOCK_RELAX-n004_-t484__4207_923 with Rosetta beta version 5.98 and showed a compute error after a full run. At the same time, Windows displayed a C++ Runtime error message, and the output file n004__BOINC_SYMMETRY_C4SYMM_FOLD_AND_DOCK_RELAX-n004_-t484__4207_923_0_0 for this task was missing. The task details follow:

Task ID 176331854
Name n004__BOINC_SYMMETRY_C4SYMM_FOLD_AND_DOCK_RELAX-n004_-t484__4207_923_0
Workunit 160936951
Created 9 Jul 2008 2:51:35 UTC
Sent 9 Jul 2008 2:52:15 UTC
Received 18 Jul 2008 10:17:47 UTC
Server state Over
Outcome Client error
Client state Done
Exit status 3 (0x3)
Computer ID 224205
Report deadline 19 Jul 2008 2:52:15 UTC
CPU time 14199.22
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 1315613

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 27.2519995630333
Granted credit 0
application version 5.98

And before this workunit crashed, the BOINC manager downloaded a workunit for Rosetta beta version 5.98. At the same time, two previous Rosetta@home files were deleted when the BOINC manager got a server request from Rosetta@home! I wondered if this workunit's crash was connected with the deletion of those two files.
____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54640 - Posted 24 Jul 2008 13:27:10 UTC - in response to Message ID 54624.
Last modified: 24 Jul 2008 13:32:19 UTC

...two previous files of Rosetta@home were deleted when the BOINC manager got server request of Rosetta@home! I wondered if this workunit's crash was in connection with the deletion of two previous files.


The deleted files should be the databases used by the "mini" version of Rosetta, so if this had been a mini task, that would be likely. Others have reported that the BOINC client did not seem to recognize that existing tasks required the files. Since it was not a mini task, I do not believe the deleted files are related to the problem you ran into.

Here is a link to DK's post about that over on Ralph.
____________
Rosetta Moderator: Mod.Sense

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 54679 - Posted 27 Jul 2008 0:04:27 UTC

Hello all,

I just had a compute error twice, Exit status 193 (0xc1), on my Ubuntu 7.10 x86.
Link to results
From BOINC I received the following messages:
Sat 26 Jul 2008 19:04:38 CEST|rosetta@home|Computation for task t499__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t499_-_4244_6669_0 finished
Sat 26 Jul 2008 19:04:38 CEST|rosetta@home|Output file t499__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t499_-_4244_6669_0_0 for task t499__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t499_-_4244_6669_0 absent

Sun 27 Jul 2008 00:10:28 CEST|rosetta@home|Computation for task t498__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_13407_0 finished
Sun 27 Jul 2008 00:10:28 CEST|rosetta@home|Output file t498__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_13407_0_0 for task t498__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_13407_0 absent

Have a nice day,
Path7.

Robert Gammon Profile

Joined: Nov 9 07
Posts: 14
ID: 219551
Credit: 546,395
RAC: 627
Message 54690 - Posted 27 Jul 2008 13:06:10 UTC

BOINC Client 5.10.45
Rosetta 5.98
WinXP SP2 - intermittently connected to Internet

This problem duplicates with almost any WU.

Scenario is that the laptop is connected to internet long enough to upload completed results and to request/download new work. The laptop then disconnects from Internet to begin number crunching.

Rosetta processes the file a variable amount (I have seen 55%, 72%, 88%, and 97% completion), then for one reason or another, BOINC shuts down (XP locks up and needs a reboot, power fails, or it's time to shut down for the night).

Note that some of these BOINC shutdowns are orderly, others are not. The result is the same, regardless of how we got there. Rosetta RESTARTS AT ZERO!! The WU gets reprocessed, redoing the work of 2-4 hours compute time.

Robert Gammon Profile

Joined: Nov 9 07
Posts: 14
ID: 219551
Credit: 546,395
RAC: 627
Message 54691 - Posted 27 Jul 2008 16:40:05 UTC - in response to Message ID 54690.

BOINC Client 5.10.45
Rosetta 5.98
WinXP SP2 - intermittently connected to Internet

[snip]
The WU gets reprocessed, redoing the work of 2-4 hours compute time.


I just duplicated this again. I did an orderly shutdown to move the laptop. Rosetta was at 95.583% complete.

When BOINC restarted, Setiathome was the selected task. I let that run for about 5 minutes, then suspended Seti and allowed Rosetta to restart.

In a few moments, 0.00% complete, 3:55:20 to completion!!!

On computers with more than one project active, if this is NOT unique to my laptop, switching to other projects from Rosetta, then back to Rosetta, should show the same characteristic. Note that this is a configuration item on all project Account Info pages, interval between switching tasks. Mine is set to 3 hours.

I cannot do this as I only have access to a single computer.

Robert Gammon Profile

Joined: Nov 9 07
Posts: 14
ID: 219551
Credit: 546,395
RAC: 627
Message 54696 - Posted 27 Jul 2008 21:35:01 UTC - in response to Message ID 54691.

BOINC Client 5.10.45
Rosetta 5.98
WinXP SP2 - intermittently connected to Internet

[snip]
In a few moments, 0.00% complete, 3:55:20 to completion!!!

On computers with more than one project active, if this is NOT unique to my laptop, switching to other projects from Rosetta, then back to Rosetta, should show the same characteristic. Note that this is a configuration item on all project Account Info pages, interval between switching tasks. Mine is set to 3 hours.

I cannot do this as I only have access to a single computer.


I tried again, putting the project on Suspend, waiting 30 minutes while I did some other work, then did a Resume, and EUREKA, it WORKED: execution continued from the spot where it left off when the Suspend was issued.

So this makes it seem like the signal BOINC issues when the user EXITS the application leaves the Rosetta work unit in an unstable state, same as an abort due to power fail on the computer. SUSPEND appears to act differently and Rosetta does an orderly pause of the work.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54697 - Posted 28 Jul 2008 0:48:36 UTC

Robert, suspend, at least when you have it set to leave applications in memory while suspended, is entirely different than BOINC shutting down.

What you are seeing is normal, and not unexpected. It will differ for different types of work units. Some checkpoint more frequently than others. Some complete models more frequently than others. If you would like to discuss it further, since this is not a problem specific to v5.98, please open a new thread.
____________
Rosetta Moderator: Mod.Sense
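
To illustrate the difference between a suspend that stays in memory and a full shutdown, here is a toy Python sketch. It is only a model of the behavior described above, not how Rosetta actually checkpoints; the function name, the step counts, and the checkpoint interval are all hypothetical.

```python
def resume_point(total_steps, checkpoint_every, stop_at, keep_in_memory):
    """Simulate one interruption and return the step where work resumes.

    total_steps      -- hypothetical number of steps in the task
    checkpoint_every -- a checkpoint is written after every N steps
    stop_at          -- step at which the task is interrupted
    keep_in_memory   -- True models a suspend that leaves the app in memory;
                        False models BOINC exiting or a power failure
    """
    if not 0 <= stop_at <= total_steps:
        raise ValueError("stop_at must lie within the task")
    if keep_in_memory:
        # Nothing is lost: the process simply continues where it paused.
        return stop_at
    # Only work up to the most recent checkpoint survives a shutdown.
    return (stop_at // checkpoint_every) * checkpoint_every


# A task that never checkpointed before being stopped at 95.5%:
print(resume_point(1000, 1000, 955, keep_in_memory=True))
print(resume_point(1000, 1000, 955, keep_in_memory=False))
# The same task, if it checkpointed every 100 steps:
print(resume_point(1000, 100, 955, keep_in_memory=False))
```

In this model a work unit that has not yet reached its first checkpoint goes back to zero after a shutdown, while a suspend with the application left in memory loses nothing, which matches what Robert observed.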

Korz53 Profile
Avatar

Joined: Apr 22 07
Posts: 2
ID: 167880
Credit: 46,757
RAC: 0
Message 54711 - Posted 29 Jul 2008 0:07:57 UTC
Last modified: 29 Jul 2008 0:16:34 UTC

No graphics appear when I click Show graphics while Rosetta is running. CPU usage is high in kernel_task (36.5%) when minirosetta is running. BOINC (Not Responding) may be due to minirosetta. I need to quit BOINC and restart it to reset BOINC. BOINC not responding has been happening on and off.





Model Name: iMac
Model Identifier: iMac6,1
Processor Name: Intel Core 2 Duo
Processor Speed: 2.33 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 3 GB
Bus Speed: 667 MHz
Boot ROM Version: IM61.0093.B07
SMC Version: 1.10f2
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 54778 - Posted 31 Jul 2008 14:39:04 UTC

A couple of t498__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_ WUs ended with a segmentation violation on two different Linux computers. The stack trace looks similar in each case.

http://boinc.bakerlab.org/rosetta/result.php?resultid=180458759
http://boinc.bakerlab.org/rosetta/result.php?resultid=180399471

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54782 - Posted 31 Jul 2008 17:38:43 UTC
Last modified: 31 Jul 2008 17:40:10 UTC

t498__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_5892_0

50 minute computation and then:
Exit status -1073741819 (0xc0000005)
CPU runtime 3013.89 secs
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3449163


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0093E2E5 write attempt to address 0x1610F000

Engaging BOINC Windows Runtime Debugger...



Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C911D8F read attempt to address 0xFFFFFFF8

Engaging BOINC Windows Runtime Debugger...

# cpu_run_time_pref: 14400
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...


and a lot of other stuff, mostly PDB symbols.

Is this going to become a common theme?
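
Side note on reading these exit statuses: the result pages show the same value twice, once as a signed number and once in hex. A small Python sketch of the conversion, assuming the status is a signed 32-bit value (the function name is just for illustration):

```python
def exit_status_hex(status):
    """Format a signed 32-bit exit status as the hex shown on the result page."""
    # Masking to 32 bits turns a negative value into its unsigned form.
    return "0x%x" % (status & 0xFFFFFFFF)


# Values taken from reports in this thread.
for status in (3, 193, -1073741819):
    print(status, exit_status_hex(status))
```

So -1073741819 is the same value as 0xc0000005, which the stderr output above labels an Access Violation.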

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 54788 - Posted 31 Jul 2008 19:26:31 UTC

Had this one crash and put up the "Rosetta has encountered an error and needs to close" dialog box - not seen that for a while.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

UBT - The Prof.... Profile

Joined: Nov 5 06
Posts: 1
ID: 127434
Credit: 18,584
RAC: 0
Message 54789 - Posted 31 Jul 2008 20:36:12 UTC

Have had so many crash in the last few days with "client error", etc, that I think I am going to go crunch something else for a while till this gets properly de-bugged.
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54790 - Posted 31 Jul 2008 21:03:53 UTC - in response to Message ID 54789.

Have had so many crash in the last few days with "client error", etc, that I think I am going to go crunch something else for a while till this gets properly de-bugged.


You should post this as well so they know what's going on...
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 96.3281 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>h001__BOINC_CASP8_ABRELAX_RANGE_tvat_d2r__IGNORE_THE_REST-S25-6-S3-9--h001_-_4307_155_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


You got that how many times? 6 or so?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 54802 - Posted 1 Aug 2008 13:22:54 UTC

Too many restarts with no progress


Normally this would be due to ending BOINC before the tasks can checkpoint, or suspending the task before it can checkpoint and not leaving suspended tasks in memory... however, I don't see the output lines that indicate how many times it did start, and with what runtime preference, that I would normally expect to see if that was truly what had occurred.

UBT, does your machine run 24/7? Do you run other projects? Some background may prove helpful.
____________
Rosetta Moderator: Mod.Sense
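
A quick way to check a pasted stderr block for the start lines mentioned above is sketched below in Python. It assumes that each start of the application writes one "# cpu_run_time_pref:" line, which is how the logs in this thread read, but that is an assumption rather than something confirmed by the project. The function name is hypothetical.

```python
def runtime_prefs_per_start(stderr_text):
    """Return the runtime preference logged at each start, in order."""
    prefs = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith("# cpu_run_time_pref:"):
            # Text after the colon is the preference in seconds.
            prefs.append(int(line.split(":", 1)[1]))
    return prefs


# Shaped like the logs posted in this thread.
restarted = """
# cpu_run_time_pref: 14400
# random seed: 3423568
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
"""
no_start_lines = """
Too many restarts with no progress. Keep application in memory while preempted.
DONE :: 1 starting structures 96.3281 cpu seconds
"""

print(len(runtime_prefs_per_start(restarted)))
print(len(runtime_prefs_per_start(no_start_lines)))
```

A count of zero alongside the "Too many restarts" message is the oddity being pointed out here.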

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54805 - Posted 1 Aug 2008 16:39:49 UTC

t499__BOINC_SYMMETRY_D2SYMM_FOLD_AND_DOCK_RELAX-t499_-_4244_11023_0

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3439032
# cpu_run_time_pref: 14400
# random seed: 3439032
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
======================================================
DONE :: 1 starting structures 12695.9 cpu seconds
This process generated 7 decoys from 7 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

I LOST 7 points on this one... why would you lose points instead of breaking even?

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54806 - Posted 1 Aug 2008 16:41:33 UTC

t499__BOINC_SYMMETRY_C4SYMM_FOLD_AND_DOCK_RELAX-t499_-_4244_16487_0
14543.28
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3423568
# cpu_run_time_pref: 14400
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 14400
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
======================================================
DONE :: 1 starting structures 14543.1 cpu seconds
This process generated 7 decoys from 7 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

.40 credit LOST on this one

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54807 - Posted 1 Aug 2008 16:44:13 UTC

t498__BOINC_SYMMETRY_C3SYMM_FOLD_AND_DOCK_RELAX-t498_-_4244_17219_0
CPU time 14213.28
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3447836
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
======================================================
DONE :: 1 starting structures 14212.5 cpu seconds
This process generated 26 decoys from 26 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

gained some serious credit on this even with the errors
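
To compare runs like the three above, one option is to pull the CPU time and decoy count out of each stderr block and work out the CPU seconds per decoy. This Python sketch only does that arithmetic; it makes no claim about how credit is actually granted, and the function and key names are made up for illustration.

```python
import re

DONE_RE = re.compile(r"DONE :: \d+ starting structures ([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"This process generated (\d+) decoys from (\d+) attempts")


def summarize(stderr_text):
    """Return CPU seconds, decoys, attempts and seconds per decoy, or None."""
    done = DONE_RE.search(stderr_text)
    decoys = DECOY_RE.search(stderr_text)
    if not done or not decoys:
        return None
    cpu_seconds = float(done.group(1))
    n_decoys = int(decoys.group(1))
    attempts = int(decoys.group(2))
    # Avoid dividing by zero when a run produced no decoys at all.
    per_decoy = cpu_seconds / n_decoys if n_decoys else None
    return {
        "cpu_seconds": cpu_seconds,
        "decoys": n_decoys,
        "attempts": attempts,
        "seconds_per_decoy": per_decoy,
    }


# Figures copied from the three results posted above.
runs = [
    "DONE :: 1 starting structures 12695.9 cpu seconds\n"
    "This process generated 7 decoys from 7 attempts",
    "DONE :: 1 starting structures 14543.1 cpu seconds\n"
    "This process generated 7 decoys from 7 attempts",
    "DONE :: 1 starting structures 14212.5 cpu seconds\n"
    "This process generated 26 decoys from 26 attempts",
]
for text in runs:
    print(summarize(text))
```

The third run produced far more decoys in about the same CPU time as the other two, so the work units clearly differ a lot in cost per model.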

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54808 - Posted 1 Aug 2008 16:46:31 UTC - in response to Message ID 54802.
Last modified: 1 Aug 2008 16:47:24 UTC

Mod: check out all the errors in his profile. He has a lot of <error_code>-161</error_code>

http://boinc.bakerlab.org/rosetta/results.php?hostid=837076

Too many restarts with no progress


Normally this would be due to ending BOINC before the tasks can checkpoint, or suspending the task before it can checkpoint and not leaving suspended tasks in memory... however, I don't see the output lines that indicate how many times it did start, and with what runtime preference, that I would normally expect to see if that was truly what had occurred.

UBT, does your machine run 24/7? Do you run other projects? Some background may prove helpful.

mike

Joined: Dec 10 05
Posts: 1
ID: 33661
Credit: 77,598
RAC: 0
Message 54897 - Posted 3 Aug 2008 22:32:38 UTC

I seem to have a problem with the minirosetta application not starting. I just installed the new version of BOINC. I keep getting Windows error messages telling me that there was a problem and asking me to send an error report to Microsoft. I tried aborting the work units to get more, but it still gives me error messages. It seems that when it tries to start a work unit, it goes immediately to 100% and the error message pops up.
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 54904 - Posted 4 Aug 2008 5:46:45 UTC - in response to Message ID 54897.

I seem to have a problem with the minirosetta application not starting. I just installed the new version of BOINC. I keep getting Windows error messages telling me that there was a problem and asking me to send an error report to Microsoft. I tried aborting the work units to get more, but it still gives me error messages. It seems that when it tries to start a work unit, it goes immediately to 100% and the error message pops up.


You need to post this over in the 1.28 thread, not here in 5.98.

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 54926 - Posted 5 Aug 2008 16:00:36 UTC
Last modified: 5 Aug 2008 16:05:00 UTC

Had this one crash on me today with an unhandled exception, (while I was out of course), and it put up the "Rosetta has encountered..." message box which then leaves that core dead until I return to click the OK button. It should not do this under ANY circumstances, yet is the second time recently.

I have Rosetta on machines at remote sites which I don't visit often. If it happens out there, the core/machine is dead until I next get there.


try
{
Rosetta
}
catch (...)
{
Bomb out
}

____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 54979 - Posted 7 Aug 2008 11:45:45 UTC
Last modified: 7 Aug 2008 11:49:48 UTC

...and AGAIN! This one crashed and put the MessageBox up on my Vista system leaving a core dead until I clicked OK.

I am suspending Rosetta at my remote sites - it is clearly unreliable at the moment.

31st July...
1fe6__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1fe6_-crystal_foldanddock__3560_41915_0
Today...
1fe6__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1fe6_-crystal_foldanddock__3560_97958_0

... somewhat similar. The one 5th August was different.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Guido Platteau

Joined: Sep 11 06
Posts: 2
ID: 111809
Credit: 283,392
RAC: 0
Message 54993 - Posted 8 Aug 2008 9:09:04 UTC
Last modified: 8 Aug 2008 9:10:00 UTC

Validate errors:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167201441
http://boinc.bakerlab.org/rosetta/result.php?resultid=183051417

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167201385
http://boinc.bakerlab.org/rosetta/result.php?resultid=183051388

Client errors:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=166338183
http://boinc.bakerlab.org/rosetta/result.php?resultid=182123248

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=166333019
http://boinc.bakerlab.org/rosetta/result.php?resultid=182115595

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=166311099
http://boinc.bakerlab.org/rosetta/result.php?resultid=182091686
____________

ConflictingEmotions

Joined: Jun 5 08
Posts: 10
ID: 263026
Credit: 3,081,990
RAC: 0
Message 54996 - Posted 8 Aug 2008 16:31:06 UTC

I aborted wuid 167183863 because it hung at 100% but took cputime way beyond expected. Watchdog got the other attempt so I am pointing it out as it may expose some error with Rosetta beta.

dag Profile
Avatar

Joined: Dec 16 05
Posts: 106
ID: 38674
Credit: 1,000,020
RAC: 0
Message 54997 - Posted 8 Aug 2008 16:42:20 UTC

I'm getting multiple errors:


____________
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

One Pelican

Joined: Aug 8 08
Posts: 3
ID: 272727
Credit: 856
RAC: 0
Message 55006 - Posted 9 Aug 2008 11:35:50 UTC

New to Rosetta.
Have a task OR8C BOINC MFR RELAX PICKED 4370 510 0
runtime 06.15.0 > . progress 98% > .Runtime should be at 4 hrs.
Can see that RMSD = 0
Energy = -397
Is this task good or should I abort?

Ver 5.98 AMD 4800 2 Core. XP Ver 3.

One Pelican

Joined: Aug 8 08
Posts: 3
ID: 272727
Credit: 856
RAC: 0
Message 55007 - Posted 9 Aug 2008 11:38:21 UTC

OOPS> Sorry.
Task has just a minute ago completed.

One Pelican

Joined: Aug 8 08
Posts: 3
ID: 272727
Credit: 856
RAC: 0
Message 55010 - Posted 9 Aug 2008 19:37:44 UTC

Aug-2008 19:58:39 [rosetta@home] Computation for task OR3d__BOINC_MFR_ABRELAX_PICKED_4322_1062_1 finished
09-Aug-2008 19:58:39 [rosetta@home] Output file OR3d__BOINC_MFR_ABRELAX_PICKED_4322_1062_1_0 for task OR3d__BOINC_MFR_ABRELAX_PICKED_4322_1062_1 absent

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 55012 - Posted 10 Aug 2008 4:12:30 UTC

The following OR3d__BOINC_MFR_ABRELAX_PICKED_4322_ WUs crunched for the usual length of time and gave the normal message about the number of decoys produced, but then errored out with a -161 error. This happened for both me and the other person who crunched each WU.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167140030
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167149129
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167113344
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=167126016

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 55021 - Posted 10 Aug 2008 14:16:36 UTC
Last modified: 10 Aug 2008 14:23:11 UTC

I just installed BOINC manager 6.2.16 and downloaded new work, as the old work from 5.10.45 did not register on 6.2.

Now I get this from Beta 5.98 after getting new work downloaded.
These come from m5xx tasks:


http://boinc.bakerlab.org/rosetta/result.php?resultid=183863626
http://boinc.bakerlab.org/rosetta/result.php?resultid=183862800
http://boinc.bakerlab.org/rosetta/result.php?resultid=183862783
http://boinc.bakerlab.org/rosetta/result.php?resultid=183862782
http://boinc.bakerlab.org/rosetta/result.php?resultid=183862769
http://boinc.bakerlab.org/rosetta/result.php?resultid=183862768
After this it gets too painful to post all the error results, but needless to say, between these and the stuff from 1.32 that failed, I chewed up over 30 work units.

I repaired and restarted BOINC manager and got some new work that seems to be running OK. I hope this mess quits.



<core_client_version>6.2.16</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
trouble finding Rama_smooth_dyn.dat_ss_6.4
ERROR:: Exit from: .\read_paths.cc line: 360

</stderr_txt>
]]>


0 secs compute time

Matthew Maples Profile

Joined: Oct 19 06
Posts: 5
ID: 121836
Credit: 135,659
RAC: 0
Message 55057 - Posted 12 Aug 2008 19:54:06 UTC

Repeated crashes (Stopped Working and has been closed) and Compute Errors in Vista X64 on my desktop, both old and new BOINC versions, both in Rosetta Beta and Mini Rosetta
____________

Robert Gammon Profile

Joined: Nov 9 07
Posts: 14
ID: 219551
Credit: 546,395
RAC: 627
Message 55091 - Posted 15 Aug 2008 23:26:41 UTC

Boinc 5.10.45 on XP SP2
Rosetta Beta 5.28

Most of the time, the workunit progresses normally, executing 0.05% or so per tick (on my machine 3-5 seconds) UNTIL WE GET TO ABOUT 90%. Work then slows WAY down, executing 0.001% per tick (same schedule as before at about 3-5 seconds per tick).

This behavior is work unit specific as some complete in as little as 1.5 hours, with no hangups in the 90+% range, while others take 3+ to 4+ hours. The last one, which I will upload tomorrow is M624_BOINC_MFR_ABRELAX_PICKED_4395_9734_0. It took an exceptionally long time to complete, some 4:43:07.

DaBrat and DaBear

Joined: Aug 9 08
Posts: 16
ID: 272936
Credit: 213,180
RAC: 0
Message 55392 - Posted 29 Aug 2008 21:59:55 UTC
Last modified: 29 Aug 2008 22:04:14 UTC

I am having the same issue with 5.98. This is the second WU that has progressed to the exact time of 9:55 remaining and stayed there for over an hour. It seems to be progressing, but as the previous person said, at .001 every 30 mins. Should I even bother, or just kill it?


BTW, my preferences are set at the default three hours and it was estimated to complete in 2:46; we are now into hour 4, all crunching that same 9:55. It was crunching along as expected until it hit this time mark. I killed the last one.

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 55395 - Posted 30 Aug 2008 3:22:16 UTC - in response to Message ID 55392.

I am having the same issue with 5.98. This is the second WU that has progressed to the exact time of 9:55 remaining and stayed there for over an hour. It seems to be progressing, but as the previous person said, at .001 every 30 mins. Should I even bother, or just kill it?


BTW, my preferences are set at the default three hours and it was estimated to complete in 2:46; we are now into hour 4, all crunching that same 9:55. It was crunching along as expected until it hit this time mark. I killed the last one.


You made a similar post in the Tasks end prematurely thread. Mod.Sense and I tried to explain that what you are seeing is normal Rosetta behavior. My rather long explanation is here. It is immediately followed by an important clarifying post from Mod.Sense. Perhaps it would be helpful if you reviewed those posts and then asked for clarification on any specific points.

As a quick test of this particular task you could open the graphics window where you will see a model number and a step number. I strongly suspect that you are still working on the first model. If you watch for a few moments you should see the step number change. This tells you the app is not stuck and you should not abort it.

Snags

DaBrat and DaBear

Joined: Aug 9 08
Posts: 16
ID: 272936
Credit: 213,180
RAC: 0
Message 55411 - Posted 31 Aug 2008 11:48:18 UTC
Last modified: 31 Aug 2008 12:06:42 UTC

I believe in your previous post you called me a 'lurker'. Shame on you. I do not believe anything I have posted qualifies me as a 'lurker'... one step up from a troll. I believe that my stats, with the short period of time I have been crunching this project, qualify me as a newb.

I don't believe that your answer really addressed my question of premature task ending, since the question was why does this only seem to happen at restart. And yes, I use my comp for other things than rosie, and the last task simply completed itself when I logged back on to Windows from Linux at less than half the estimated time. Not the first time. Or are you suggesting that I just happen to need to log off when I have a short task? Those are almost lottery chances.

Further, the reason the post was made here is that, if you refer to the home page, there was mention of tasks hanging at the end of processing for this particular model, and the request was that any issues be posted 'here' in this thread.

Now this behaviour may be particular to Rosetta, but most of the time, regardless of the complexity of the model, the completion time is adjusted during crunch... say the model says 2:540 to completion, it may not count down as quickly because of a time period overrun. This is the first series of models I have ever seen that count down as normal and use the ten minutes of remaining time to crunch 4 hours of calculations. I believe that is the confusion with these models.

Not only that, but once these run into that hang problem, it defaults all remaining models to the finish time, my latest being 6+ hours, and causes Rosetta to go into panic mode, usurping any other projects you may be working on, while on the other hand taking days of crunching and returning WUs under the default to get the overall estimated completion time back to normal.

Since my normal completion time on the comp is about 2:46, when Rosetta gets the 6 hours blues and defaults them all to that completion time (instead of something mid way or an average of completion time)... my other projects are kicked off their cores. At this rate, and the time it takes for rosie to get the normal processing time back for remaining WUs, nothing else will crunch on my machine if I run into a 5.98 every two days.


BTW the 6 hour WU attempted and returned 1 decoy with no errors.

DaBrat and DaBear

Joined: Aug 9 08
Posts: 16
ID: 272936
Credit: 213,180
RAC: 0
Message 55415 - Posted 31 Aug 2008 12:34:58 UTC
Last modified: 31 Aug 2008 12:40:32 UTC

So I guess the best option would be to change my time preferences to 6 hours... that way they will all download with a 6 hour completion time and if they complete in two I'll get a new WU somewhere in that window... But wait.... on the days when the server is short of work, I will simply be crunching empty space. Nah at least CPN will get SOME crunch time. Maybe three would be a better option... oh wait that is the default.

Right now she is running in panic mode for tasks due over a week from now.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 55421 - Posted 31 Aug 2008 15:41:28 UTC - in response to Message ID 55415.

Right now she is running in panic mode for tasks due over a week from now.


This is why I suggest only changing runtime preference gradually. Sounds like you either had a lot of tasks downloaded before the new preference, or, your Rosetta resource share is low enough that the work would miss the deadline if not run in "high priority" (panic) mode. BOINC's best guess is that these tasks are in danger of missing their deadlines. But don't worry, it keeps track of the time used and pays back other projects once it gets caught up as compared to the deadlines.

If you approach the deadlines and still have too many tasks, you can set your preference back lower again. But you are right, some of them will find they complete early.

This is the first series of models I have ever seen that count down as normal and use the ten minutes of remaining time to crunch 4 hours of calculations. I believe that is the confusion with these models.


This has been the expected behavior for over a year. That is why I would prefer to discuss the issue in a separate thread from the release-specific issues. And tasks are released several times a month that have runtimes over 3 hours on most machines to complete a single model. But the number of such tasks released is usually fairly limited. The task you describe that ran with 6hr preference, that only completed one model, would have run for the same amount of time with any runtime preference lower than the time it took for that one model. And any time after the preferred runtime the client would have shown about 10 minutes remaining.
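
A minimal sketch of the behaviour described above, assuming the application checks the runtime preference only between models and starts another model only if it expects that model to fit. The function name and the stopping rule are illustrative assumptions, not the actual Rosetta code.

```python
def simulate_task(model_seconds, runtime_pref_seconds):
    """Toy model of one task: return (models_completed, cpu_seconds_used).

    Assumption: a model, once started, always runs to completion, and the
    runtime preference is consulted only when a model has just finished.
    """
    elapsed = 0
    models = 0
    while True:
        # A started model is never cut short, however long it takes.
        elapsed += model_seconds
        models += 1
        # Assumed rule: stop if one more model would overshoot the preference.
        if elapsed + model_seconds > runtime_pref_seconds:
            return models, elapsed


# One model that needs 6.5 hours: the preference makes no difference.
print(simulate_task(23400, 10800))  # 3-hour preference
print(simulate_task(23400, 21600))  # 6-hour preference

# Short 30-minute models: several fit inside a 3-hour preference.
print(simulate_task(1800, 10800))
```

Under these assumptions the first two calls return the same single model and the same CPU time, which is the point made above: a preference shorter than one model does not shorten the task.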



____________
Rosetta Moderator: Mod.Sense

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 55425 - Posted 31 Aug 2008 17:04:07 UTC - in response to Message ID 55411.

I believe in your previous post you called me a 'lurker'. Shame on you. I do not believe anything I have posted qualifies me as a 'lurker'... one step up from a troll. I believe that my stats, with the short period of time I have been crunching this project, qualify me as a newb.


A lurker is someone who reads but does not post. Since you had already posted in the thread I was clearly not referring to you. I may have even been referring to myself:) I suppose on a purely social board some people might find lurkers creepy, but here and on similar boards I imagine quite large numbers of people read the boards to gather information about the project and discover solutions to their technical problems, etc. without ever posting. They are lurkers and there is nothing disreputable or irresponsible about their behavior. In fact, many project boards have locked threads with titles such as "read here first" and "If you're new have a view". Who knows how many newbs come to the boards intending to post, find their answers in one of those threads and end up never posting at all. The project is actually encouraging newbs to behave as lurkers, not posters! (This is written in good humour and I hope you will read it with the same. At any rate, the term was neither directed at you nor used as an insult.)


I don't believe that your answer really addressed my question of premature task ending, since the question was why does this only seem to happen at restart.


You are right and I acknowledged that in my post on the other thread. In none of your previous posts however did you mention finishes immediately following restarts (though other posters did). You only mentioned the timings and it wasn't clear to me that you understood that the information you have provided about the timings is not evidence of premature exits or stalled wus or in fact a problem of any kind.

Per Mod.Sense's request I'll make any further response in the other thread.

Snags

R.L. Casey

Joined: Jun 7 06
Posts: 91
ID: 92179
Credit: 1,209,412
RAC: 73
Message 55777 - Posted 15 Sep 2008 16:17:40 UTC
Last modified: 15 Sep 2008 16:20:15 UTC

A minor anomaly...
The BOINC Manager button 'Show Graphics' is inhibited for new 'AA2A' Work Units when viewing a 'localhost', and I expect that is desired or required due to the stated large size of the proteins. However, if viewing the same task over a remote connection (e.g., via port 31413), the button is not inhibited. In this case, if the button is clicked, a graphics window is generated that is blank except for the lines separating the various sections of the graphic. No adverse effects on the remote task were noted. FYI.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 55812 - Posted 16 Sep 2008 18:24:44 UTC

AA2A_4_modeling_1_AA2A_1_AA2A_2RH1_align_4467_2456_0
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2950319
ERROR:: Exit from: .\pack.cc line: 5278

this ran 4961.234 seconds and died

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 55874 - Posted 18 Sep 2008 22:08:29 UTC

These long AA2A work units don't have checkpoints so any stop loses 3+ hours. Is this normal or is there a problem?

In the stdout file there is a checkpoint warning.

WARNING!! cant restore counts from checkpoint file:mc_checkpoint

____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 55917 - Posted 21 Sep 2008 2:50:55 UTC

Can someone from the lab have a look at this one it has problems!


t040_1_NMRREF_1_t040_1_S_00001_0000482_0IGNORE_THE_REST_core_4463_863_2


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=174466679


stderr out <core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 3420912
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 20508 cpu seconds
This process generated 7 decoys from 7 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>


Validate state Workunit error - check skipped

pete.

____________


R.L. Casey

Joined: Jun 7 06
Posts: 91
ID: 92179
Credit: 1,209,412
RAC: 73
Message 55918 - Posted 21 Sep 2008 4:01:18 UTC - in response to Message ID 55917.

Can someone from the lab have a look at this one it has problems!


t040_1_NMRREF_1_t040_1_S_00001_0000482_0IGNORE_THE_REST_core_4463_863_2


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=174466679


stderr out <core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 3420912
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 20508 cpu seconds
This process generated 7 decoys from 7 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>


Validate state Workunit error - check skipped

pete.


Pete,
Looking at the Workunit, it seems that the validator rejected your results because the maximum number of results, two, had already been received by the time your result was returned. It looks as if the Project was in error to send the WU to you, since it already had sent it out twice on September 10, the second time sent because of an error returned by the first cruncher. Then, this second cruncher took over ten days to return a result, and it just happened to be between the time it was sent to you on the 20th and the time you returned your result! Based upon the time the WU was issued to you, it appears that the project had assumed no result would be returned by the second cruncher after ten days had elapsed. If I were in charge of the world, I'd validate your WU and grant you credit, and purge the (slightly earlier) result based upon the fact that it was so late that it was assumed to be lost, but... :-)

Thanks for crunching!!

Jack

Joined: Feb 19 07
Posts: 11
ID: 148408
Credit: 328,578
RAC: 207
Message 55965 - Posted 23 Sep 2008 1:02:21 UTC

Rosetta Beta 5.98 does not share CPU fairly with other BOINC projects. My preferences were set to the default 60 minutes and I tried changing it to 70 minutes just to force an update, but Rosetta keeps running long past the time it should pause and let another project have some time.

This task: 193947855 started running at 2:45 this afternoon, used over six hours of CPU time and was still running when I suspended it so other projects could get some time. On previous recent tasks, Rosetta would run about 3 hours before pausing.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 55966 - Posted 23 Sep 2008 2:20:05 UTC

The BOINC Manager is making the fairness decisions, not Rosetta. It is also trying to optimize the use of your machine's crunch time by not losing work that is still in progress, but has not been checkpointed yet.

Either Rosetta is owed time from when other projects were crunching, or perhaps it has a long running model and has not taken a checkpoint yet. BOINC keeps track of it all. If Rosetta does run long before checkpointing, BOINC will pay back the other project(s) the time they are owed.

When you tell BOINC to "switch" every xx minutes, you aren't actually instructing it to switch. You are simply telling it how often to consider making a switch.

The manager will enforce the resource shares you have configured between your projects. But look for this to be true over the course of 100 hours, not 100 minutes.
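
A toy illustration of the points above, assuming a scheduler that at each switch point simply runs whichever project is owed the most time relative to its resource share. This is a simplified sketch for intuition, not the BOINC client's actual algorithm; it ignores checkpoints, deadlines and high-priority mode, and the function and project names are only examples.

```python
def simulate(shares, intervals, interval_minutes=60):
    """Return minutes of CPU each project received after `intervals` switch points."""
    total_share = sum(shares.values())
    used = {name: 0 for name in shares}
    elapsed = 0
    for _ in range(intervals):
        # Time owed = fair share of the elapsed time minus time already received.
        owed = {
            name: elapsed * shares[name] / total_share - used[name]
            for name in shares
        }
        # The switch interval only decides when this choice is re-evaluated.
        chosen = max(owed, key=owed.get)
        used[chosen] += interval_minutes
        elapsed += interval_minutes
    return used


# Equal shares: over 100 hours each project ends up with half the time.
print(simulate({"rosetta": 1, "einstein": 1}, 100))

# A 3:1 share: any single hour goes entirely to one project,
# but the 100-hour totals follow the configured shares.
print(simulate({"rosetta": 3, "einstein": 1}, 100))
```

In this sketch no single interval looks "fair", since one project gets the whole hour, yet the totals converge on the resource shares over a long window.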
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56007 - Posted 24 Sep 2008 17:47:28 UTC
Last modified: 24 Sep 2008 17:49:30 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=193810495
t042_1_NMRREF_1_t042_1_S_00001_0006461IGNORE_THE_REST_010000_4471_6183_0
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2556542

this pig locked up my system and it moved on to another task which locked up and also locked up einstein.


also t042_1_NMRREF_1_t042_1_S_00002_0009608IGNORE_THE_REST_050000_4471_6695_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=193877090

died shortly afterwards, my system was locked up and I had to use task manager to abort boinc mgr and then rosetta and einstein and then reboot to get control of my system again.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56010 - Posted 24 Sep 2008 20:16:20 UTC

looked up the error message, seems that perhaps my graphics driver caused a conflict? I updated it and will see what happens.

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 56013 - Posted 25 Sep 2008 4:22:10 UTC

This errored after 4 hrs 44 min; it had done 2 models just before it finished.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=177652466

9/25/2008 1:58:06 PM|rosetta@home|Output file AA2A_11_modeling_1_AA2A_1_AA2A_2VTA_align_4556_2991_0_0 for task AA2A_11_modeling_1_AA2A_1_AA2A_2VTA_align_4556_2991_0 absent

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2669370
ERROR:: Exit from: .\refold.cc line: 338

</stderr_txt>

pete.

____________


AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 56014 - Posted 25 Sep 2008 4:25:22 UTC

Had a couple of WUs exit from refold.cc

AA2A_7_modeling_1_AA2A_1_AA2A_2RH1_align_4493_43027_0
AA2A_7_modeling_1_AA2A_1_AA2A_2RH1_align_4493_40790_0

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56022 - Posted 25 Sep 2008 9:54:32 UTC

AA2A_6_modeling_1_AA2A_1_AA2A_2RH1_align_4492_13278_0

this completed ok but gave this error message or warning

CPU time 20691.98
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 3360223
# cpu_run_time_pref: 21600
# random seed: 3360223
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
# random seed: 3360223
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
# random seed: 3360223
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 20691 cpu seconds
This process generated 7 decoys from 7 attempts
0 starting pdbs were skipped
======================================================


AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,358,915
RAC: 2,105
Message 56039 - Posted 26 Sep 2008 18:19:31 UTC

This workunit is valid but stderr out is enormous:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 43200
# random seed: 2792818
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
.
/// This line is repeated 516 times ///
.
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
======================================================
DONE :: 1 starting structures 43239.7 cpu seconds
This process generated 45 decoys from 45 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56064 - Posted 27 Sep 2008 20:32:23 UTC
Last modified: 27 Sep 2008 20:33:39 UTC

HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4554_12169_1

This failed on my computer and another one: 888227
That person errors out with:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BCCCF2 write attempt to address 0x1F46E428


I error out with:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BCCCF2 write attempt to address 0x1F5063B8


get crappy credit for wasting 8541.453 seconds on this task only a big fat goose egg

Better quality control of your coding is needed.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 56067 - Posted 28 Sep 2008 2:12:42 UTC

[url=http://boinc.bakerlab.org/rosetta/result.php?resultid=194793599[Task ID 194793599[/url] gave a Compute Error with the std err:

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3424460
Can't acquire lockfile - exiting
[...]
Can't acquire lockfile - exiting

I get this frequently under MiniRosetta 1.34 but this is the first time under Beta 5.98. All other WUs have run fine though.
____________

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 56068 - Posted 28 Sep 2008 3:42:30 UTC - in response to Message ID 56067.
Last modified: 28 Sep 2008 3:46:25 UTC

Task ID 194793599 gave a Compute Error with the std err:

Sid Celery, I've fixed up the link to your result for you
____________
Have a crunching good day!!

Otto

Joined: Apr 6 07
Posts: 27
ID: 163281
Credit: 1,908,137
RAC: 1,443
Message 56080 - Posted 29 Sep 2008 13:55:39 UTC
Last modified: 29 Sep 2008 13:56:38 UTC

(I'm using Boinc 6.2.19 on Windows XP)

When I press the graphics button in the main view, no graphics will show up, but the button itself disappears, so that I'm unable to see any graphics at all. Strange bug/issue.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56089 - Posted 29 Sep 2008 20:19:12 UTC - in response to Message ID 56080.

I asked this question awhile back, and I believe the answer was that the new boinc managers do not support the graphics application of 5.98.

(I'm using Boinc 6.2.19 on Windows XP)

When I press the graphics button in the main view, no graphics will show up, but the button itself disappears, so that I'm unable to see any graphics at all. Strange bug/issue.

Pepo
Avatar

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 56091 - Posted 29 Sep 2008 21:16:32 UTC - in response to Message ID 56089.

I asked this question awhile back, and I believe the answer was that the new boinc managers do not support the graphics application of 5.98.

(I'm using Boinc 6.2.19 on Windows XP)

When I press the graphics button in the main view, no graphics will show up, but the button itself disappears, so that I'm unable to see any graphics at all. Strange bug/issue.


Nearly. They still do, but not when BOINC is installed in the protected mode (as a service).

But still, it is strange that the button was initially available for a click.

Peter

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56109 - Posted 30 Sep 2008 13:00:48 UTC - in response to Message ID 56091.

I asked this question awhile back, and I believe the answer was that the new boinc managers do not support the graphics application of 5.98.

(I'm using Boinc 6.2.19 on Windows XP)

When I press the graphics button in the main view, no graphics will show up, but the button itself disappears, so that I'm unable to see any graphics at all. Strange bug/issue.


Nearly. They still do, but not when BOINC is installed in the protected mode (as a service).

But still, it is strange that the button was initially available for a click.

Peter


You say 'protected mode', thing is I installed Boinc mgr in its standard form, I did not select anything different than the defaults. So is 'protected mode' the default? If so, how do you change it so that 5.98 will show the graphics?

Pepo
Avatar

Joined: Sep 28 05
Posts: 115
ID: 1676
Credit: 101,358
RAC: 0
Message 56112 - Posted 30 Sep 2008 14:19:17 UTC - in response to Message ID 56109.

not when BOINC is installed in the protected mode (as a service).

You say 'protected mode', thing is I installed Boinc mgr in its standard form, I did not select anything different than the defaults. So is 'protected mode' the default? If so, how do you change it so that 5.98 will show the graphics?

Yes, default form for 6.x is a service. In this mode, science apps are running under the newly added boinc_project account and have no access to your (the logged-in user) desktop. (See also How to install BOINC as a service (BOINC 6 series) on Windows?.)

Run the installation again and when on the BOINC Configuration page, press the "Advanced" button and then switch off the "Protected application execution" (a.k.a. "Service") mode checkbox. The client and applications will be then run under your account, started directly as Manager's child processes (the "good old way").

Peter

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56129 - Posted 30 Sep 2008 20:08:52 UTC - in response to Message ID 56112.

not when BOINC is installed in the protected mode (as a service).

You say 'protected mode', thing is I installed Boinc mgr in its standard form, I did not select anything different than the defaults. So is 'protected mode' the default? If so, how do you change it so that 5.98 will show the graphics?

Yes, default form for 6.x is a service. In this mode, science apps are running under the newly added boinc_project account and have no access to your (the logged-in user) desktop. (See also How to install BOINC as a service (BOINC 6 series) on Windows?.)

Run the installation again and when on the BOINC Configuration page, press the "Advanced" button and then switch off the "Protected application execution" (a.k.a. "Service") mode checkbox. The client and applications will be then run under your account, started directly as Manager's child processes (the "good old way").

Peter


thanks, I will look at that this weekend. don't have a lot of time during the week to do much here on the computer except read/write and run.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56160 - Posted 1 Oct 2008 21:45:05 UTC
Last modified: 1 Oct 2008 21:46:17 UTC

I just watched this HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_39952_0 one die on me. 1:48 computation time and then it even pops up a windows error box.

Error msg is this:
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 871217
Report deadline 8 Oct 2008 18:59:56 UTC
CPU time 6482.563
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2791509


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BCCCF2 write attempt to address 0x1F4B486C

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.1.5


Dump Timestamp : 10/01/08 23:39:03
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
Debugger Engine : 4.0.5.0
Symbol Search Path: E:\boinc\projects\slots\0;E:\boinc\projects\projects\boinc.bakerlab.org_rosetta;srv*C:\WINDOWS\TEMP\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\WINDOWS\TEMP\symbols*http://boinc.bakerlab.org/rosetta/symstore


SymGetModuleInfo(): GetLastError = 87
ModLoad: 00000000 00000000 ( Symbols Loaded) repeats 23 times in total
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 4016, Write: 0, Other 6432

- I/O Transfers Counters -
Read: 0, Write: 66417, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 51956, QuotaPeakPagedPoolUsage: 51956
QuotaNonPagedPoolUsage: 4200, QuotaPeakNonPagedPoolUsage: 5032

- Virtual Memory Usage -
VirtualSize: 252223488, PeakVirtualSize: 256581632

- Pagefile Usage -
PagefileUsage: 206942208, PeakPagefileUsage: 225918976

- Working Set Size -
WorkingSetSize: 120836096, PeakWorkingSetSize: 139542528, PageFaultCount: 2228047

*** Dump of thread ID 2180 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, ,

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00C4FCF3 read attempt to address 0xC0000000

Engaging BOINC Windows Runtime Debugger...


</stderr_txt>
]]>

NO CREDIT! My RAC is already low enough, now thanks to this it goes lower yet. GEES
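
For anyone puzzled by the two forms of the exit status in reports like this one: -1073741819 and 0xc0000005 are the same 32-bit value, printed once as a signed integer and once as unsigned hex. A short sketch of the conversion (the helper names are made up for illustration):

```python
def as_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    # The top bit set means the value is negative in two's complement.
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(as_unsigned32(-1073741819)))  # 0xc0000005
print(as_signed32(0xC0000005))          # -1073741819
```

0xc0000005 is the Windows access violation status, which matches the "Access Violation" line in the stderr output.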

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 56161 - Posted 2 Oct 2008 3:33:48 UTC - in response to Message ID 56160.

I just watched this HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_39952_0 one die on me. 1:48 computation time and then it even pops up a windows error box.

I had a HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539 die on one of my Linux nodes:

http://boinc.bakerlab.org/rosetta/result.php?resultid=195898771

Qui-Gon Jinn

Joined: Aug 10 08
Posts: 3
ID: 273100
Credit: 4,683
RAC: 0
Message 56180 - Posted 3 Oct 2008 0:43:50 UTC

Weird, I had the same problem with a similar task. This is what BOINC said.
10/2/2008 7:15:37 PM|rosetta@home|Starting HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0

10/2/2008 7:15:40 PM|rosetta@home|Starting task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0 using rosetta_beta version 598

10/2/2008 7:15:41 PM|rosetta@home|Started upload of abinitio_nohomfrag_70_A_2he4A_4482_20855_0_0

10/2/2008 7:15:46 PM|rosetta@home|Finished upload of abinitio_nohomfrag_70_A_2he4A_4482_20855_0_0

10/2/2008 7:32:34 PM|rosetta@home|Computation for task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0 finished

10/2/2008 7:32:34 PM|rosetta@home|Output file HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0_0 for task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0 absent

Note that the WU was running for only 1000 seconds. I run vista and it said that the process was malfunctioning ( and then stopped it).
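
The run time can be read straight off the two message-log lines quoted above. A small sketch, assuming the log format is always `timestamp|project|message` with US-style 12-hour timestamps as in this post (the helper names are illustrative):

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted BOINC messages (assumed fixed).
LOG_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_line(line):
    """Split one message-log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, LOG_FORMAT), project, message


def elapsed_seconds(start_line, end_line):
    """Wall-clock seconds between two message-log lines."""
    start, _, _ = parse_line(start_line)
    end, _, _ = parse_line(end_line)
    return int((end - start).total_seconds())


task = "HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_162905_0"
start = f"10/2/2008 7:15:40 PM|rosetta@home|Starting task {task} using rosetta_beta version 598"
end = f"10/2/2008 7:32:34 PM|rosetta@home|Computation for task {task} finished"

print(elapsed_seconds(start, end))  # 1014
```

That is wall-clock time between the two messages, about 17 minutes, consistent with the roughly 1000 seconds mentioned above.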

Fishead

Joined: Sep 3 08
Posts: 7
ID: 276548
Credit: 89,566
RAC: 0
Message 56185 - Posted 3 Oct 2008 8:10:47 UTC

Same problems here:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=179380707
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=177135943

One of them gave me a windows error as well, but I'm afraid I don't remember which it was.

Griiim Dave

Joined: Sep 4 08
Posts: 3
ID: 276639
Credit: 386,218
RAC: 0
Message 56213 - Posted 4 Oct 2008 12:13:32 UTC

3 workunits starting with HR19 failed on me this morning!

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 56214 - Posted 4 Oct 2008 15:19:23 UTC
Last modified: 4 Oct 2008 15:22:25 UTC

HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_122324_0

10/4/2008 10:32:44 AM|rosetta@home|Starting task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_122324_0 using rosetta_beta version 598
10/4/2008 2:28:18 PM|rosetta@home|Computation for task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_122324_0 finished
10/4/2008 2:28:18 PM|rosetta@home|Output file HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_122324_0_0 for task HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_122324_0 absent

Exit status -1073741819 (0xc0000005)
CPU time 13778.61
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2709137


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BCCCF2 write attempt to address 0x1F4ED470

Engaging BOINC Windows Runtime Debugger...

Dump Timestamp : 10/04/08 14:22:59
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
Debugger Engine : 4.0.5.0
Symbol Search Path: E:\boinc\projects\slots\0;E:\boinc\projects\projects\boinc.bakerlab.org_rosetta;srv*C:\WINDOWS\TEMP\symbols*http://msdl.microsoft.com/download/symbols;srv*C:\WINDOWS\TEMP\symbols*http://boinc.bakerlab.org/rosetta/symstore


SymGetModuleInfo(): GetLastError = 87
ModLoad: 00000000 00000000 ( Symbols Loaded)

The above message repeats 22 times like last time

No credit granted, I'm wasting my time with these types of tasks. 2 out of 3 of these tasks have died. If it goes 50% I will abort all the HR19 tasks to save my RAC.

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 56221 - Posted 4 Oct 2008 16:31:26 UTC
Last modified: 4 Oct 2008 16:32:41 UTC

HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_228065_1

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2603396


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BCCCF2 write attempt to address 0x2015E48C

Engaging BOINC Windows Runtime Debugger...
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 56263 - Posted 6 Oct 2008 23:34:03 UTC - in response to Message ID 56068.

Task ID 194793599 gave a Compute Error with the std err:

Sid Celery I've fixed up the the link to your result for you

Thanks Speedy. Finger trouble here...

I've noticed Mgr 6.2.19 is out now - updating now in the vain hope it helps my issue here or with mini 1.34 errors. This post really just a marker. Any improvement and I'll revert to 3 hour runs instead of 2 hour to help save on bandwidth.
____________

Matthias Lehmkuhl

Joined: Nov 20 05
Posts: 10
ID: 13663
Credit: 709,214
RAC: 249
Message 56291 - Posted 8 Oct 2008 6:46:55 UTC
Last modified: 8 Oct 2008 6:49:31 UTC

I got this error on one machine. It brings up a window that needs user acknowledgement, and that CPU runs no other BOINC project until I press OK.

resultid=197473964

Exit status -1073741819 (0xc0000005)

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 2595731


Unhandled Exception Detected...

application version 5.98

edit:
Name HR19__BOINC_LONGNOE_JUMPRELAX_BARCODE_SAVE_ALL_OUT_200-HR19_-_4539_235730_1
____________
Matthias

Narlanthrotep

Joined: Jan 23 06
Posts: 2
ID: 53653
Credit: 512,528
RAC: 0
Message 56501 - Posted 29 Oct 2008 1:32:30 UTC

Virus detected by NOD32 v2.7 definition 3564 (20081028)
IMON detected: "probably a variant of Win32/Statik application"
file:
hxxp://srv4.bakerlab.org/rosetta/download/rosetta_bate_5.98_windows_intelx86.exe

I've seen a similar detection before with a previous version of Rosetta Mini, but the problem was the AV software, not Rosetta itself.

____________

bzag00

Joined: Aug 17 06
Posts: 4
ID: 105305
Credit: 4,765
RAC: 0
Message 56612 - Posted 1 Nov 2008 23:08:44 UTC

My last 2 tasks completed with a Computation Error and I was not granted credit. I am using V5.98. I am not having any problems with any other BOINC serviced applications.

I am subscribing to this thread, so when a new release of Rosetta comes out please post a message in this thread and I will resume the Rosetta project locally.

Thank you.
____________

googloo
Avatar

Joined: Sep 15 06
Posts: 105
ID: 112667
Credit: 5,953,021
RAC: 7,157
Message 56614 - Posted 2 Nov 2008 0:41:43 UTC - in response to Message ID 56612.
Last modified: 2 Nov 2008 0:42:15 UTC

My last 2 tasks completed with a Computation Error and I was not granted credit. I am using V5.98. I am not having any problems with any other BOINC serviced applications.

I am subscribing to this thread, so when a new release of Rosetta comes out please post a message in this thread and I will resume the Rosetta project locally.

Thank you.


The thread to subscribe to for notifications of new Rosetta versions is supposed to be Rosetta Application Version Release Log. However, the Rosetta team apparently has trouble remembering this.

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 958,139
RAC: 127
Message 56791 - Posted 9 Nov 2008 19:20:30 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=203282028

kctipton

Joined: Nov 21 05
Posts: 2
ID: 13938
Credit: 56,082
RAC: 0
Message 56904 - Posted 13 Nov 2008 13:40:22 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=185399690 -- your server sent it out 3x and thereby created an error condition of "too many results"

Toby Broom

Joined: Oct 15 08
Posts: 7
ID: 283928
Credit: 5,360,371
RAC: 14,700
Message 57002 - Posted 16 Nov 2008 17:09:47 UTC
Last modified: 16 Nov 2008 17:12:22 UTC

I have a computer running Windows Server 2008 Core.

This edition of Windows doesn't include opengl.dll, which is causing failed workunits on my computer.

As I run BOINC as a service I assumed that the opengl.dll was only used for the screensaver and hence wouldn't be needed.

Aegis Maelstrom

Joined: Oct 29 08
Posts: 61
ID: 285843
Credit: 792,303
RAC: 34
Message 57192 - Posted 24 Nov 2008 0:11:28 UTC
Last modified: 24 Nov 2008 0:17:09 UTC

Hi, I seem to have the following problem with the client: a unit resumes, only to complete after a couple of seconds and get uploaded as successful.

Unfortunately, I had a similar problem with a minirosetta task (but then it was after 3 restarts and probably new memory requirements for the recent tasks - I say probably because nobody replied to me directly).

Message 57035


Hi,

as I have promised I have come back, increased the memory amount and started to crunch again.

To my surprise, the process has suddenly finished with a "success". The log says:
2008-11-17 21:54:33|rosetta@home|Restarting task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 using minirosetta version 140
2008-11-17 21:56:12|rosetta@home|Computation for task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 finished

As I wrote in the posts above, it is impossible to finish this task in such a short time. Last time I needed two and a half "physical" hours just to crash, probably because the memory limits were too low.
(...)


This time the process was resumed after 3 hrs of work. Log:

2008-11-24 00:49:04|rosetta@home|Restarting task LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2 using rosetta_beta version 598
2008-11-24 00:49:25|rosetta@home|Computation for task LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2 finished
2008-11-24 00:49:26|rosetta@home|Starting cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_nsp1_olange_4732_32919_1
2008-11-24 00:49:31|rosetta@home|Starting task cs_jumping_abrelax_6PNAS_proteins3_homo_bench_cs_jumping_abrelax_cs_nsp1_olange_4732_32919_1 using minirosetta version 140
2008-11-24 00:49:32|rosetta@home|Started upload of LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2_0
2008-11-24 00:49:37|rosetta@home|Finished upload of LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2_0
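
For reference, the wall-clock time between the "Restarting task" and "Computation for task ... finished" lines in the log above can be computed directly from the timestamps. This is only a minimal Python sketch; it assumes the "YYYY-MM-DD HH:MM:SS|project|message" layout shown in this particular log, and the helper name is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the log lines in this post

def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the seconds between the timestamps of two BOINC log lines."""
    start = datetime.strptime(start_line.split("|", 1)[0], TIMESTAMP_FORMAT)
    end = datetime.strptime(end_line.split("|", 1)[0], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()

restart = ("2008-11-24 00:49:04|rosetta@home|Restarting task "
           "LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2 "
           "using rosetta_beta version 598")
finished = ("2008-11-24 00:49:25|rosetta@home|Computation for task "
            "LZ21__BOINC_SYMM_FOLD_AND_DOCK_RELAX-LZ21_-foldanddock__4643_15895_2 finished")

print(elapsed_seconds(restart, finished))  # 21.0
```

So the resumed task ran for about 21 seconds of wall-clock time before it was reported as finished.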

Is there a problem with a client or a machine? Does BOINC require a reinstall, more resources or what exactly?

EDIT: Unfortunately I can't tell what percentage was completed before resuming. If it was more than 90% it could have been just a coincidence. :/

PlaNed

Joined: Sep 25 05
Posts: 3
ID: 1055
Credit: 37,334
RAC: 0
Message 57386 - Posted 1 Dec 2008 7:55:17 UTC

Today NOD32 report:

01.12.2008 09:49:22 HTTP filter file http://srv1.bakerlab.org/rosetta/download/rosetta_beta_5.98_windows_intelx86.exe probably a variant of Win32/Statik application connection terminated - quarantined

____________

ByRad Profile
Avatar

Joined: Apr 12 08
Posts: 8
ID: 252633
Credit: 8,231,131
RAC: 13,507
Message 57718 - Posted 8 Dec 2008 21:58:36 UTC

This message window occurred:

Log messages:
2008-12-08 21:22:52|rosetta@home|Starting t062_1_NMRREF_1_t062_1_id_model_04_coreIGNORE_THE_REST_idl_5434_6154_0
2008-12-08 21:22:53|rosetta@home|Starting task t062_1_NMRREF_1_t062_1_id_model_04_coreIGNORE_THE_REST_idl_5434_6154_0 using rosetta_beta version 598

2008-12-08 22:48:05|rosetta@home|Computation for task t062_1_NMRREF_1_t062_1_id_model_04_coreIGNORE_THE_REST_idl_5434_6154_0 finished
2008-12-08 22:48:05|rosetta@home|Output file t062_1_NMRREF_1_t062_1_id_model_04_coreIGNORE_THE_REST_idl_5434_6154_0_0 for task t062_1_NMRREF_1_t062_1_id_model_04_coreIGNORE_THE_REST_idl_5434_6154_0 absent

____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 57820 - Posted 12 Dec 2008 15:26:28 UTC - in response to Message ID 57192.

Hi, I seem to have the following problem with the client: a unit resumes, only to complete after a couple of seconds and get uploaded as successful.

Unfortunately, I had a similar problem with a minirosetta task (but then it was after 3 restarts and probably new memory requirements for the recent tasks - I say probably because nobody replied to me directly).

Message 57035

Hi,

as I have promised I have come back, increased the memory amount and started to crunch again.

To my surprise, the process has suddenly finished with a "success". The log says:
2008-11-17 21:54:33|rosetta@home|Restarting task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 using minirosetta version 140
2008-11-17 21:56:12|rosetta@home|Computation for task IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1wr2_4683_55_1 finished

As I wrote in the posts above, it is impossible to finish this task in such a short time. Last time I needed two and a half "physical" hours just to crash, probably because the memory limits were too low.
(...)


Is there a problem with a client or a machine? Does BOINC require a reinstall, more resources or what exactly?

EDIT: Unfortunately I can't tell what percentage was completed before resuming. If it was more than 90% it could have been just a coincidence. :/


I'm not sure what went wrong, but my machine has acted better after I increased the upper limit on how much disk space BOINC is allowed to use and also increased the fraction of the virtual memory BOINC is allowed to use. I'd consider increasing the amount of physical memory it has as well, except that I'm already at the limit of how much my motherboard can handle.

As far as I can tell, the check of whether to end processing of a workunit comes only at the beginning of a timeslice. Restarting a task also starts a new timeslice, so the check of whether to return only the results calculated so far comes soon afterwards.

Minirosetta seems to have problems showing the correct time remaining from about 10 minutes before its initial estimated completion time until it actually completes. So if you get a workunit with a bad estimate of how much time it needs, expect the estimate of remaining CPU time to change only slowly during this period. I wouldn't be surprised if Rosetta 5.98 has the same problem.

frederick corse

Joined: Oct 7 05
Posts: 10
ID: 3142
Credit: 1,496,201
RAC: 175
Message 58397 - Posted 3 Jan 2009 1:17:43 UTC

Hello, there seems to be a problem with the AT01 tasks: they all fail during download because they can't get their input files. I've had them download to my G5 and also to my Mac Pro with the same error.
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 59495 - Posted 9 Feb 2009 23:20:01 UTC

error with these two - no heartbeat problems

227327788
(second time)
lasted .42 secs

</stderr_txt>
<message>
<file_xfer_error>
<file_name>t078_1_NMRREF_1_t078_1_idS_vnl_00024494_10961IGNORE_THE_REST_431_6701_745_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

and with
227327677
(first time)
lasted .41 sec


</stderr_txt>
<message>
<file_xfer_error>
<file_name>t078_1_NMRREF_1_t078_1_idS_vnl_00024494_10961IGNORE_THE_REST_431_6701_1032_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

and this one had a validate error

227327697
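
The file name and error code can be pulled out of fragments like the ones above with a small script. This is a hedged Python sketch: it uses a regular expression rather than an XML parser because the pasted stderr is only a fragment, and the helper name is hypothetical. It does not interpret what code -161 means.

```python
import re

# One of the fragments pasted above, copied verbatim.
FRAGMENT = """\
<file_xfer_error>
<file_name>t078_1_NMRREF_1_t078_1_idS_vnl_00024494_10961IGNORE_THE_REST_431_6701_745_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
"""

# A file name followed by its error code inside a file_xfer_error element.
PATTERN = re.compile(
    r"<file_name>(.*?)</file_name>\s*<error_code>(-?\d+)</error_code>"
)

def transfer_errors(text):
    """Return (file_name, error_code) pairs found in a stderr fragment."""
    return [(name, int(code)) for name, code in PATTERN.findall(text)]

for name, code in transfer_errors(FRAGMENT):
    print(code, name)
```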

____________

(_KoDAk_) Profile

Joined: Jul 18 06
Posts: 109
ID: 100677
Credit: 1,859,263
RAC: 0
Message 59648 - Posted 18 Feb 2009 9:49:57 UTC
Last modified: 18 Feb 2009 9:50:25 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=228610191
<core_client_version>6.6.5</core_client_version><![CDATA[<message>
- exit code -1073741819 (0xc0000005)
</message><stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 3595866
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59687 - Posted 20 Feb 2009 16:51:53 UTC

Exceptions at various addresses on two different computers. Only one of these tasks had a wingman, but that person also had an exception death.

230043321 0x008BB97B read attempt to address 0x060A0000
230072124 0x008BB9A5 read attempt to address 0x0BA5A000
229964104 0x008BB97B read attempt to address 0x0A082000
229979933 0x008BB92B read attempt to address 0x0BB38000
230017643 0x008BB92B read attempt to address 0x0A23C000
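
Grouping the reports above by faulting instruction address makes it easier to see that the crashes cluster at a few nearby locations. A minimal Python sketch, assuming the "task ID, fault address, access type, target address" layout used in this list (the helper name is hypothetical):

```python
import re
from collections import defaultdict

# Crash lines copied from the list above.
REPORTS = """\
230043321 0x008BB97B read attempt to address 0x060A0000
230072124 0x008BB9A5 read attempt to address 0x0BA5A000
229964104 0x008BB97B read attempt to address 0x0A082000
229979933 0x008BB92B read attempt to address 0x0BB38000
230017643 0x008BB92B read attempt to address 0x0A23C000
"""

# Capture the task ID and the faulting instruction address of each line.
LINE = re.compile(
    r"(\d+)\s+(0x[0-9A-Fa-f]+)\s+(?:read|write) attempt to address\s+0x[0-9A-Fa-f]+"
)

def group_by_fault_address(text):
    """Map each faulting instruction address to the task IDs that hit it."""
    groups = defaultdict(list)
    for task_id, fault_address in LINE.findall(text):
        groups[fault_address].append(task_id)
    return dict(groups)

for address, tasks in sorted(group_by_fault_address(REPORTS).items()):
    print(address, len(tasks), ", ".join(tasks))
```

For these five tasks the script reports three distinct fault addresses, two of which were hit twice.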

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 59691 - Posted 20 Feb 2009 20:48:21 UTC

resultid=230017087 Wing man still to return result.
resultid=230017086 Wing man also crashed

____________
Have a crunching good day!!

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59695 - Posted 21 Feb 2009 1:20:23 UTC
Last modified: 21 Feb 2009 1:23:51 UTC

Another crash like the ones below 230072124 and 230089446

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59697 - Posted 21 Feb 2009 6:10:08 UTC

And more crashes ... you know, this was the reason I stopped RaH a couple years ago ... non-stop errors ... and some of these tasks are wasting in the region of 10-20 minutes ...

230171238
230225023

BeemerBiker Profile
Avatar

Joined: May 7 07
Posts: 14
ID: 174398
Credit: 1,394,328
RAC: 1,891
Message 59702 - Posted 21 Feb 2009 13:56:47 UTC

bunch of faults, maybe 5, windows xp, vista 64, etc. Started about 3 days ago. These faults are the type that show up over the desktop and ask if microsoft should be informed.


from xp event log
Faulting application rosetta_beta_5.98_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_beta_5.98_windows_intelx86.exe, version 0.0.0.0, fault address 0x0084fcf3.


Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59707 - Posted 21 Feb 2009 17:47:35 UTC

Yeah, that is what I am getting too ... I keep forgetting to check whether that pop-up is locking the CPU, meaning that until you dismiss it you lose the CPU ... if you get it again, let us know ... I keep making a mental note to check and keep forgetting ...

Today's list:
230306967 0x008BB92B read attempt to address 0x0BECC000
230288293 0x008BB955 read attempt to address 0x0B6D3000
230192144 0x008BB9A5 read attempt to address 0x0DD3F000
230089446 0x008BB97B read attempt to address 0x0C06A000
230018563 0x008BB92B read attempt to address 0x09D7A000


Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59744 - Posted 23 Feb 2009 5:30:45 UTC

Only one today:

230534122 0x008BB955 read attempt to address 0x0C07C000

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 59759 - Posted 23 Feb 2009 15:19:16 UTC

Yeah, that is what I am getting too ... I keep forgetting to check whether that pop-up is locking the CPU, meaning that until you dismiss it you lose the CPU ... if you get it again, let us know ... I keep making a mental note to check and keep forgetting ...

Same for me: on my quad, 2 cores were doing nothing for 3 days until today, when I pressed "don't send".....

Here they are:

http://boinc.bakerlab.org/rosetta/result.php?resultid=230037349

http://boinc.bakerlab.org/rosetta/result.php?resultid=230035649

http://boinc.bakerlab.org/rosetta/result.php?resultid=230035633

http://boinc.bakerlab.org/rosetta/result.php?resultid=230035617

so it is 2p64...

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59772 - Posted 24 Feb 2009 8:22:12 UTC

2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_18636_1
Exit status -1073741819 (0xc0000005)
CPU time 207.6563
# random seed: 2816709
Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x008BB955 read attempt to address 0x1383C000
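
The two numbers in "Exit status -1073741819 (0xc0000005)" are the same 32-bit value, shown once as a signed integer and once as unsigned hex. A minimal Python sketch of the conversion (the helper name is hypothetical):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{code & 0xFFFFFFFF:08x}"

print(exit_code_to_hex(-1073741819))  # 0xc0000005
```

This makes it easier to match the decimal exit status shown on a result page against the access-violation code in the stderr output.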

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59777 - Posted 24 Feb 2009 17:19:21 UTC

231014076 0x008BB97B read attempt to address 0x0A7ED000

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59789 - Posted 25 Feb 2009 2:01:35 UTC
Last modified: 25 Feb 2009 2:03:05 UTC

Very strange result:
http://boinc.bakerlab.org/rosetta/result.php?resultid=230979688
and respective WU:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=208222634
Why did the canonical result do 108 decoys while mine did 12? And why did I get 'Workunit error - check skipped'? And why are there three results?

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59792 - Posted 25 Feb 2009 5:38:30 UTC - in response to Message ID 59789.

Very strange result:
http://boinc.bakerlab.org/rosetta/result.php?resultid=230979688
and respective WU:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=208222634
Why did the canonical result do 108 decoys while mine did 12? And why did I get 'Workunit error - check skipped'? And why are there three results?


The difference is in the:

# cpu_run_time_pref: 86400

yours is 10,800 ...

8 times more runtime gets more decoys ...
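
The figures quoted in this exchange can be checked with a few lines of arithmetic. The Python sketch below only restates numbers from the posts above; it does not model how Rosetta actually schedules decoys.

```python
# Runtime preferences in seconds, as quoted from the two results.
canonical_pref = 86400
my_pref = 10800

# Decoy counts quoted in the question above.
canonical_decoys = 108
my_decoys = 12

runtime_ratio = canonical_pref / my_pref
decoy_ratio = canonical_decoys / my_decoys

print(f"runtime ratio: {runtime_ratio:.0f}x, decoy ratio: {decoy_ratio:.0f}x")
# runtime ratio: 8x, decoy ratio: 9x
```

The decoy count scales roughly, though not exactly, with the runtime preference.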

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59794 - Posted 25 Feb 2009 10:32:10 UTC - in response to Message ID 59792.

Very strange result:
http://boinc.bakerlab.org/rosetta/result.php?resultid=230979688
and respective WU:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=208222634
Why did the canonical result do 108 decoys while mine did 12? And why did I get 'Workunit error - check skipped'? And why are there three results?


The difference is in the:

# cpu_run_time_pref: 86400

yours is 10,800 ...

8 times more runtime gets more decoys ...


Ok, but why more runtime?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59799 - Posted 25 Feb 2009 17:11:12 UTC
Last modified: 26 Feb 2009 15:04:52 UTC

With Rosetta@home, you can configure a runtime preference. This is done via the "[Participants]" link at the top of this forum page. And then click on the Rosetta preferences. The result of increasing the runtime is less bandwidth and scheduler requests on the server, and then the tasks just run more models until the target runtime is approached.
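
The "run more models until the target runtime is approached" behaviour can be pictured with a toy loop. This is a sketch under stated assumptions, not Rosetta's actual code: the function name is hypothetical, every model is assumed to take the same fixed time, and the per-model times of 900 and 800 seconds are made-up values chosen only for illustration.

```python
def models_within_runtime(target_seconds: float, seconds_per_model: float) -> int:
    """Count how many whole models fit into the target runtime (toy model)."""
    elapsed = 0.0
    models = 0
    # Always run at least one model, then continue only while another one fits.
    while models == 0 or elapsed + seconds_per_model <= target_seconds:
        elapsed += seconds_per_model
        models += 1
    return models

# Hypothetical per-model times; only the runtime preferences come from this thread.
print(models_within_runtime(10800, 900))  # 12
print(models_within_runtime(86400, 800))  # 108
```

A longer runtime preference therefore means more models per task and fewer tasks to download and report.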

The particular WU you show suffers from the dreaded bug in the BOINC server code where a result is accepted after the deadline is reached but the task has already been reissued. There has been a BOINC trac item open to fix this for over a year.
http://boinc.berkeley.edu/trac/ticket/276
____________
Rosetta Moderator: Mod.Sense

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59803 - Posted 25 Feb 2009 19:37:41 UTC - in response to Message ID 59799.

With Rosetta@home, you can configure a runtime preference. This is done via the "[Participants]" link at the top of this forum page. And then click on the Rosetta preferences. The result of increasing the runtime is less bandwidth and scheduler requests on the server, and then the tasks just run more models until the target runtime is approached.

The particular WU you show suffers from the dreaded bug in the BOINC server code where a result is accepted after the deadline is reached but the task has already been reissued. There has been a BOINC trac item open to fix this for over a year.
http://boinc.berkeley.edu/trac/ticket/276


Thank you for the explanation!

BeemerBiker Profile
Avatar

Joined: May 7 07
Posts: 14
ID: 174398
Credit: 1,394,328
RAC: 1,891
Message 59810 - Posted 26 Feb 2009 8:21:01 UTC

2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_17767_0
got 5% done before faulting. I went and aborted it as it was hung after the app faulted.

Faulting application rosetta_beta_5.98_windows_intelx86.exe, version 0.0.0.0, faulting module rosetta_beta_5.98_windows_intelx86.exe, version 0.0.0.0, fault address 0x0084fcf3.

====

I monitor my farm using BoincView, and when it shows a yellow background I know there is a problem. When I logged in remotely, I was greeted with a fault report and was asked if I wanted to report it to Microsoft.

There have been 3 separate rosetta app faults since Feb 21. My event log goes back to 10/23/08 and there are no other app faults. Milkyway and Poem are the other 2 boinc apps besides rosetta.

Windows Xp, sp3
BM 6.2.18
Dual MP 2800

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 59990 - Posted 6 Mar 2009 10:41:15 UTC - in response to Message ID 59799.


The particular WU you show suffers from the dreaded bug in the BOINC server code where a result is accepted after the deadline is reached but the task has already been reissued. There has been a BOINC trac item open to fix this for over a year.
http://boinc.berkeley.edu/trac/ticket/276


I am still of the opinion that this is not at all a bug to be dreaded. All parties involved still seem to be compensated with appropriate credits. Interested parties might like to have a look at the following wus (while they last):

2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_8156

2 "Compute error", 1 subsequent success after "Too many error results Too many total results". Which somehow proves a point.


2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_13799

1 "No reply", 1 "Compute error", after "Too many total results" 1 apparent success with no local error messages, but the result file shows the wu erroring out also with the third cruncher:
CPU time 11808.83
stderr out

<core_client_version>6.6.12</core_client_version> [[but computed with 6.6.11]]
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 0.
Maximum size: 8388608.
RLIM_INFINITY 0
shell-init: could not get current directory: getcwd: cannot access parent directories: Permission denied [[normal error message in my situation]]
# cpu_run_time_pref: 21600
# random seed: 2821546
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -147.965 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx2p64.out

</stderr_txt>
]]>

Validate state Valid


RamonS

Joined: Jun 19 08
Posts: 3
ID: 264954
Credit: 4,015,881
RAC: 3,048
Message 60018 - Posted 7 Mar 2009 23:30:00 UTC

Also now getting errors with rosetta_beta_5.98_windows_intelx86, but only after rebooting Windoze. Running BOINC on Server 2003 32bit. So far it crashed three times in a row and appears to have given up. Sent error reports to MS.
I looked at the reports, but the gibberish in there doesn't tell me a thing. Let me know if you need more specific info other than "it crashed and seems to be broken". :)

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60021 - Posted 8 Mar 2009 3:08:03 UTC

RamonS, if you could post a link to the task that failed, that would be great.
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 60022 - Posted 8 Mar 2009 5:11:33 UTC - in response to Message ID 60021.
Last modified: 8 Mar 2009 5:16:49 UTC

RamonS, if you could post a link to the task that failed, that would be great.


Hi Mod Sense.

I'm not RamonS, but I had a look and could only find one 5.98 task that errored, and everyone who ran it failed. This rig: http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=837145

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=209585268

He also has a lot of lock file errors with mini 1.54.
This rig http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=881461

pete.
____________


svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 60770 - Posted 21 Apr 2009 22:08:10 UTC

A couple of tasks ( 245331636 and 245251599) failed on Mac in a way similar to that reported by ramostol.


Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67108864
# cpu_run_time_pref: 14400
No heartbeat from core client for 31 sec - exiting
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67108864
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 0 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>Rossmann2X2_033_11257_11463_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>


____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63447 - Posted 25 Sep 2009 3:34:05 UTC

A recent version of acemd from GPUGRID and version 5.98 of rosetta_beta from Rosetta@home may have a compatibility problem; if not, the rosetta_beta graphics portion appears to have frozen by itself.


9/24/2009 4:39:22 PM CUDA device: GeForce 9800 GT (driver version 19038, compute capability 1.1, 1024MB, est. 60GFLOPS)

9/24/2009 4:39:35 PM rosetta@home Restarting task Rossmann2X3_002_14911_14657_0 using rosetta_beta version 598
9/24/2009 4:39:38 PM GPUGRID Restarting task PMEno54-OTTO_HERG4-10-40-RND5579_0 using acemd version 671

Today, I saw the graphics portion of a rosetta_beta workunit freeze in a way that kept it from ending its screensaver function when I used the keyboard and mouse.

The log lines above show which workunits resumed after I rebooted the computer.

The rosetta_beta workunit resumed at essentially the same point shown in the frozen graphics before the reboot.

I'd like to see the rosetta_beta graphics portion modified to show the complete workunit and program names - but here's what I copied off the frozen screen:

denova design of Rossmann2X3;
70.74% Complete
CPU time: 8 hr 29 min 21 sec
Stage: Ab initio + relax
Model: 43 Step: 77427
Rosetta@home v5.98

Currently using Nvidia driver version 190.38; no word yet on whether the 190.62 version now available is likely to be more reliable.

64-bit Vista SP2
BOINC version 6.6.36

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63591 - Posted 4 Oct 2009 7:36:46 UTC

This workunit 285247863 failed on Mac OSX 10.6.1

<core_client_version>6.6.36</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 10800
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 3155889
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768

plus similar messages
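
The wording "nan is outside of [-1,+1]" is what a plain range check would report when handed a NaN, because every ordered comparison against NaN is false. A small Python illustration (the function name is hypothetical and not taken from Rosetta's source):

```python
import math

def in_sin_cos_range(x: float) -> bool:
    """Return True if x is a legal sine or cosine value."""
    return -1.0 <= x <= 1.0

print(in_sin_cos_range(0.5))       # True
print(in_sin_cos_range(1.5))       # False
print(in_sin_cos_range(math.nan))  # False: NaN fails every comparison
```

So the message most likely points at an earlier calculation that produced NaN, not at a value that was merely a little too large.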


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63595 - Posted 4 Oct 2009 14:02:43 UTC

svincent, the task indicates that the actual cause of death was too many restarts without progress. This can mean several things, including perhaps a bug in the application or model. But most often it means you restarted your machine several times in a row? Or that the task got suspended several times in a row perhaps to run other projects or if you only run when computer not in use, perhaps someone came up and used it for brief periods several times in a row.

Hence the recommendation in the message to keep tasks in memory when preempted. The "memory" in such a case ends up being the swap space. This will preserve the work (unless the machine is actually powered off, or BOINC completely exited) and let the task pick up where it left off, regardless of checkpoints. Otherwise the task has to crunch long enough to reach and complete a checkpoint, which can take over an hour for some types of work units.

Do you happen to know if all of that happened on the first start of the task? I see it only recorded a fraction of a second of CPU time. But this does not count any prior runs that were not able to checkpoint.
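
The interaction between restarts and checkpoints described above can be sketched with a toy simulation. This is not BOINC's or Rosetta's actual logic: the function name, the fixed checkpoint interval and the limit of 5 stalled restarts are all assumptions made for illustration.

```python
def run_with_interruptions(run_lengths, checkpoint_interval,
                           max_stalled_restarts=5):
    """Simulate a task that is restarted after each run in run_lengths.

    Work done since the last checkpoint is lost on restart, so a run that
    is shorter than the checkpoint interval makes no lasting progress.
    """
    checkpointed = 0.0      # CPU seconds safely saved in checkpoints
    stalled_restarts = 0    # consecutive runs that saved nothing
    for run in run_lengths:
        completed_intervals = int(run // checkpoint_interval)
        if completed_intervals > 0:
            checkpointed += completed_intervals * checkpoint_interval
            stalled_restarts = 0
        else:
            stalled_restarts += 1
            if stalled_restarts >= max_stalled_restarts:
                return "too many restarts with no progress", checkpointed
    return "still running", checkpointed

# Five 10-minute runs against an hourly checkpoint never save any work.
print(run_with_interruptions([600] * 5, 3600))
# One long run followed by a short one keeps the first checkpoint.
print(run_with_interruptions([4000, 600], 3600))
```

Keeping the task in memory while preempted avoids the loss modelled here, since the task resumes mid-run without going back to its last checkpoint.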
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63596 - Posted 4 Oct 2009 15:40:52 UTC

Thanks for the explanation. Ever since upgrading to Snow Leopard, Excel 2004 has been constantly crashing on me, causing a return to the log-in screen. After reading your explanation, I suspect that this is the cause of the failed task.
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63600 - Posted 4 Oct 2009 21:27:17 UTC

One problem with using the leave in memory option - it restricts the participation in multiple BOINC projects with high memory requirements on the same computer, especially if some of them have a memory leak. I no longer consider it a suitable option to use when including Rosetta@home and/or Ralph@home in the mix of projects.

I haven't yet found a version of BOINC that's very good at actually moving much of what's in memory into the swap file, especially when what needs to be moved is the results of minirosetta's known memory leak.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63603 - Posted 4 Oct 2009 23:17:21 UTC
Last modified: 4 Oct 2009 23:21:54 UTC

I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard.

Two Rosetta Beta 5.98 workunits could have been running on that machine at the time of the graphics freeze; not enough evidence left to tell which one was responsible for this:

Rossmann2X3_001_14908_12080_1

Rossmann2X3_027_15080_10154_0

Both had just a little less CPU time than in the frozen graphics after I rebooted.

That machine has 64-bit BOINC 6.10.3 under Vista SP2; that BOINC version is recommended if I want to continue using the GPU on that machine under GPUGRID. That version often displays the graphics for any workunits in progress, even if I don't ask for any graphics.

One of those workunits is now running again; the other one is waiting for its turn on a CPU core.

The frozen graphics showed Model 2, Step 287738, with CPU time 0:24:07.

Is there an option to disable Rosetta Beta workunits on that machine, but continue running minirosetta workunits? Or would it be better to just discontinue Rosetta@home participation at all until this 5.98 problem is fixed?

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 63611 - Posted 5 Oct 2009 15:30:54 UTC

I've now seen the same problem with a workunit from a different BOINC project - QMC@home. Also had graphics covering the whole screen. This leads me to suspect that the problem is with BOINC 6.10.3 dealing with situations where it decides to move the graphics around on the screen, and finds that the graphics don't leave any empty space to move them to.

GPUGRID now needs the newer versions of BOINC, and I don't plan to stop participating there, so I expect a number of people would also like the option to stop receiving 5.98 workunits, and a few BOINC alpha testers to want the option to receive only 5.98 workunits from Rosetta@home for a while.

At least part of the problem apparently occurs inside the Nvidia driver, though. I'm already using the newest Nvidia driver GPUGRID recommends (190.38).

bill Johnson@GMU Profile

Joined: Aug 5 09
Posts: 5
ID: 332972
Credit: 1,356,008
RAC: 0
Message 63620 - Posted 6 Oct 2009 12:51:25 UTC
Last modified: 6 Oct 2009 12:53:44 UTC

I have been getting some Rosetta Beta 5.98. They have been having problems downloading and if they do download my computer simply refuses to start work on them so they just sit there untouched. I have had to delete a few of them to make way for Rosetta Mini 1.97 work units that do actually get worked on.

Is there a problem with my preferences that is causing this or just the Rosetta Beta 5.98 work units?

The Beta work units are all Rossmann2X3 units.

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 63638 - Posted 9 Oct 2009 9:21:36 UTC - in response to Message ID 63591.

This workunit 285247863 failed on Mac OSX 10.6.1

<core_client_version>6.6.36</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 10800
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 3155889
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768

plus similar messages



All Rossmann tasks, successful or not, report these errors, for instance this task run on MacOS 10.5 on a computer working quite undisturbed by human activity:

CPU time 21761.01
stderr out

<core_client_version>6.10.11</core_client_version>
<![CDATA[
<stderr_txt>
Rosetta@home Macintosh Stack Size checker.
Original size: 8388608.
Maximum size: 0.
RLIM_INFINITY 67104768
# cpu_run_time_pref: 21600
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: nan is outside of [-1,+1] sin and cos value legal range
# random seed: 2994865
======================================================
DONE :: 1 starting structures 21760.5 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

Validate state Valid
Claimed credit 145.750455203617
Granted credit 74.7162632779638

MarcoA

Joined: Sep 2 08
Posts: 9
ID: 276404
Credit: 777,433
RAC: 0
Message 63714 - Posted 16 Oct 2009 13:05:29 UTC

Here is another Rossmann task with the same [-1,+1] error:

http://boinc.bakerlab.org/rosetta/result.php?resultid=288301200

Gen_X_Accord Profile
Avatar

Joined: Jun 5 06
Posts: 154
ID: 87850
Credit: 279,018
RAC: 0
Message 63722 - Posted 16 Oct 2009 22:54:28 UTC - in response to Message ID 63603.

I just found a Rosetta Beta workunit with frozen graphics covering the whole screen again. They wouldn't go away when I used the mouse and keyboard.

Two Rosetta Beta 5.98 workunits could have been running on that machine at the time of the graphics freeze; not enough evidence left to tell which one was responsible for this:

Rossmann2X3_001_14908_12080_1

Rossmann2X3_027_15080_10154_0

Both had just a little less CPU time than in the frozen graphics after I rebooted.

That machine has 64-bit BOINC 6.10.3 under Vista SP2; that BOINC version is recommended if I want to continue using the GPU on that machine under GPUGRID. That version often displays the graphics for any workunits in progress, even if I don't ask for any graphics.

One of those workunits is now running again; the other one is waiting for its turn on a CPU core.

The frozen graphics showed Model 2, Step 287738, with CPU time 0:24:07.

Is there an option to disable Rosetta Beta workunits on that machine, but continue running minirosetta workunits? Or would it be better to just discontinue Rosetta@home participation at all until this 5.98 problem is fixed?


It would be better to disable the graphics and not allow BOINC as your screensaver. Set your computer to no screensaver and have the video power down after 10 minutes or so, and shut the monitors off when you are done. Not only will you save a little on power, but you will no longer have a problem with frozen graphics. Rosetta doesn't need the graphics to run the work unit.
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63809 - Posted 25 Oct 2009 2:03:08 UTC

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?

------

# cpu_run_time_pref: 10800
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
# random seed: 3714901


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x755A194B

Engaging BOINC Windows Runtime Debugger...

followed by a bunch of Windows debugging info.


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63810 - Posted 25 Oct 2009 2:31:27 UTC - in response to Message ID 63809.

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?


...because it had only used 4038.959 seconds of CPU time. Your machine must have had some other higher priority work going on during that time period.

____________
Rosetta Moderator: Mod.Sense

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 63816 - Posted 25 Oct 2009 8:46:38 UTC

What is the cutoff for version 5.98? I mean, when does the watchdog in that version abort the unit?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63818 - Posted 25 Oct 2009 15:04:17 UTC

Good point transient. I believe at that time it was when you reach 4x the runtime preference. But, as I pointed out, the task wasn't getting much CPU time. The newer BOINC clients show "elapsed time" now, not CPU time.
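
The arithmetic behind that cutoff can be sketched as below. This is only an illustration, assuming the 4x multiplier recalled above and the 10800-second runtime preference shown in the task's stderr; the function names are hypothetical and are not taken from Rosetta's source.

```python
# Assumption: the watchdog compares CPU time (not elapsed time) against
# 4x the runtime preference, as recalled in this thread.
WATCHDOG_MULTIPLIER = 4


def watchdog_cutoff_seconds(cpu_run_time_pref: int,
                            multiplier: int = WATCHDOG_MULTIPLIER) -> int:
    """Return the CPU-time limit, in seconds, for a given runtime preference."""
    return cpu_run_time_pref * multiplier


def watchdog_would_abort(cpu_seconds_used: float, cpu_run_time_pref: int) -> bool:
    """True once the CPU time used reaches the cutoff."""
    return cpu_seconds_used >= watchdog_cutoff_seconds(cpu_run_time_pref)


pref = 10800      # cpu_run_time_pref from the stderr output (3 hours)
used = 4038.959   # CPU seconds reported for task 290293053

print(watchdog_cutoff_seconds(pref))     # 43200
print(watchdog_would_abort(used, pref))  # False
```

Under these assumptions the task had used less than a tenth of the CPU time needed to trip the watchdog, even though far more wall-clock time had passed.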
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63822 - Posted 25 Oct 2009 15:45:59 UTC

Task 290293053 (Rossmann2X3_060_003_15300_100_0) failed on Windows System 7. It ran for 25 hours stuck on Model 4 Step 271587 before I aborted it. How come the watchdog thread didn't stop it?


...because it had only used 4038.959 seconds of CPU time. Your machine must have had some other higher priority work going on during that time period.

____________
Rosetta Moderator: Mod.Sense



There is a mismatch between the 4,038 seconds of CPU time reported in the Task Details and the 25+ hours it actually took (I decided to let it continue running). The only other tasks going on were Rosetta tasks using the second core. Could it be a System 7 issue?

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63829 - Posted 26 Oct 2009 2:40:03 UTC

ok, so when you say it actually took 25 hrs, this information came from what source?
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63840 - Posted 26 Oct 2009 14:08:03 UTC

ok, so when you say it actually took 25 hrs, this information came from what source?


The elapsed time field in the BOINC manager.

(My run time preference is set to 3 hours)

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 63843 - Posted 26 Oct 2009 16:45:07 UTC

So the question becomes, why would 25hrs elapse, with only 4000 seconds of low priority CPU being available to BOINC? This is why I made the comment that your machine must have been busy doing something else that day.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63845 - Posted 26 Oct 2009 19:26:52 UTC

But it wasn't doing anything else that day.
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 63847 - Posted 26 Oct 2009 21:25:46 UTC

I have occasionally had work units that clock up the elapsed time while the cpu time stays stationary. The only way round it was to reboot the computer which got the work unit past the block and it finished normally.
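
The gap between elapsed time and CPU time can be reproduced in miniature. The sketch below is only an illustration of the two clocks, not of how BOINC measures them; a sleeping process stands in for a task that is stalled or waiting.

```python
import time


def measure(work):
    """Run work() and return (elapsed wall-clock seconds, CPU seconds used)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def stalled():
    # Sleeping consumes wall-clock time but almost no CPU time.
    time.sleep(0.2)


wall, cpu = measure(stalled)
print(f"elapsed {wall:.3f}s, cpu {cpu:.3f}s")
```

A task that is blocked keeps accumulating elapsed time in the same way while its CPU time stays nearly flat.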
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 63848 - Posted 26 Oct 2009 23:24:18 UTC

It seemed like my wingman successfully completed the task in spite of getting the same bunch of errors I got : e.g.

sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
etc.
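
For context, "nan" in the Mac logs and "-1.#IND000" in the Windows logs are both NaN (not a number) as printed by different C runtimes. A minimal sketch of why NaN always fails a [-1,+1] range check follows; the function name is hypothetical and this is not Rosetta's actual code.

```python
import math


def in_sin_cos_range(x: float) -> bool:
    """Check that x is a legal sine or cosine value."""
    # Every ordered comparison involving NaN is False, so NaN is never in range.
    return -1.0 <= x <= 1.0


print(in_sin_cos_range(0.5))       # True
print(in_sin_cos_range(math.nan))  # False
print(math.isnan(math.nan))        # True
```

The message by itself therefore says only that an upstream calculation produced NaN. It does not show whether the task will fail, which fits the observation that some tasks with these messages still validate.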

____________

tomba

Joined: May 29 06
Posts: 43
ID: 85179
Credit: 1,558,972
RAC: 0
Message 63943 - Posted 3 Nov 2009 20:06:58 UTC

XP Pro, Pentium D 3.20GHz, Rosetta and GPUGRID.

While the Rosetta chaos below unfolds, GPUGRID merrily rolls along.

Getting lots 'n' lots of these:



My results can be seen here.

I migrated to BOINC 6.10.17. No change. I detached/attached Rosetta. The first two WUs failed within a minute; computation error. Two more WUs are running right now and are at 15%, but I know they aren't gunna complete.

I've no idea what else to try.
____________

tomba

Joined: May 29 06
Posts: 43
ID: 85179
Credit: 1,558,972
RAC: 0
Message 63956 - Posted 4 Nov 2009 17:15:30 UTC - in response to Message ID 63943.

XP Pro, Pentium D 3.20GHz, Rosetta and GPUGRID.

While the Rosetta chaos below unfolds, GPUGRID merrily rolls along.

Getting lots 'n' lots of these:



My results can be seen here.

I migrated to BOINC 6.10.17. No change. I detached/attached Rosetta. The first two WUs failed within a minute; computation error. Two more WUs are running right now and are at 15%, but I know they aren't gunna complete.

I've no idea what else to try.


Well - gave up on Rosetta and attached to Einstein. Guess what? Immediate computation errors! I attached to Climate Prediction and that is running fine.

What is it about Rosetta and Einstein on my Pentium D that gives these errors? I have three other PCs happily running Rosetta...
____________

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 63957 - Posted 4 Nov 2009 17:29:02 UTC - in response to Message ID 63956.

What is it about Rosetta and Einstein on my Pentium D that gives these errors? I have three other PCs happily running Rosetta...


Pentium D, is it? How old is that machine?

tomba

Joined: May 29 06
Posts: 43
ID: 85179
Credit: 1,558,972
RAC: 0
Message 63958 - Posted 4 Nov 2009 18:05:56 UTC - in response to Message ID 63957.

What is it about Rosetta and Einstein on my Pentium D that gives these errors? I have three other PCs happily running Rosetta...


Pentium D, is it? How old is that machine?


3-1/4 years. It's a Dell Dimension 9100. It's been running BOINC 24/7 since new.

Last June, a month before the warranty expired, it died. Dell fitted a refurbished MB and PS.

For a year it's been running with 4 gigs of Crucial Ballistix RAM.

Tom
____________

DJStarfox

Joined: Jul 19 07
Posts: 140
ID: 191721
Credit: 560,560
RAC: 21
Message 63960 - Posted 5 Nov 2009 0:54:47 UTC - in response to Message ID 63958.

3-1/4 years. It's a Dell Dimension 9100....Last June, it died. Dell fitted a refurbished MB and PS.


There's your problem. The capacitors on the motherboard are old and the CPU is generating errors under heavy power load (computation). Time to throw away the system. Get a 2nd opinion if you want, but 'fraid that's what's going on.

I had a system just like this (custom-built). It ran great for normal tasks, but it failed under heavy CPU load. I guess you could try under-clocking the system, but I never found a reason to work at it that hard. Dells are notorious for expiring with weird symptoms.

dcdc Profile

Joined: Nov 3 05
Posts: 1596
ID: 8948
Credit: 33,802,121
RAC: 17,331
Message 63964 - Posted 5 Nov 2009 13:17:42 UTC - in response to Message ID 63960.

If possible, just undervolting might fix it (possibly only temporarily, but maybe not). The power consumption increases with the square of the voltage, so a small reduction can make a big difference. If you can't undervolt in the BIOS then you might be able to use RMClock...
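
As a rough worked example of the square-law point, assuming the usual approximation that dynamic power scales with voltage squared times clock frequency (the helper name is made up for illustration, and leakage power is ignored):

```python
def relative_dynamic_power(voltage_ratio: float, frequency_ratio: float = 1.0) -> float:
    """Dynamic power relative to stock, assuming P is proportional to V^2 * f."""
    return voltage_ratio ** 2 * frequency_ratio


# Compare a 5% and a 10% undervolt at an unchanged clock speed.
for cut in (0.05, 0.10):
    ratio = relative_dynamic_power(1.0 - cut)
    print(f"{cut:.0%} lower voltage -> about {1 - ratio:.1%} less dynamic power")
```

With this approximation, a 10% voltage reduction at the same clock gives 0.9 squared, or 0.81 of the original dynamic power.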

3-1/4 years. It's a Dell Dimension 9100....Last June, it died. Dell fitted a refurbished MB and PS.


There's your problem. The capacitors on the motherboard are old and the CPU is generating errors under heavy power load (computation). Time to throw away the system. Get a 2nd opinion if you want, but 'fraid that's what's going on.

I had a system just like this (custom-built). It ran great for normal tasks, but it failed under heavy CPU load. I guess you could try under-clocking the system, but I never found a reason to work at it that hard. Dells are notorious for expiring with weird symptoms.


____________

tomba

Joined: May 29 06
Posts: 43
ID: 85179
Credit: 1,558,972
RAC: 0
Message 64043 - Posted 14 Nov 2009 14:56:58 UTC - in response to Message ID 63960.

3-1/4 years. It's a Dell Dimension 9100....Last June, it died. Dell fitted a refurbished MB and PS.


There's your problem. The capacitors on the motherboard are old and the CPU is generating errors under heavy power load (computation). Time to throw away the system. Get a 2nd opinion if you want, but 'fraid that's what's going on.


Well, not wishing to throw it away without trying something, I focused on the RAM. I removed three 1 gig sticks and ran Rosetta with 1 gig for two days without a hiccough. Then I added a second stick and ran for two days; no problems. Then I removed the two sticks and inserted the other two of four; four days sweetness and light. Then I bit the bullet and took it to 4 gigs. All is well!

Funny...

Tom

____________

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 64439 - Posted 12 Dec 2009 2:32:03 UTC

This is no problem, just wondering if there are any tasks still running under this app?
____________
Have a crunching good day!!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64440 - Posted 12 Dec 2009 3:31:24 UTC - in response to Message ID 64439.

This is no problem, just wondering if there are any tasks still running under this app?


Every few months there is, yes. That's why I've left the sticky on it.
____________
Rosetta Moderator: Mod.Sense

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 64444 - Posted 12 Dec 2009 4:24:36 UTC

Thanks. I can't remember the last time I had a 5.98 task. I will look out for one.
____________
Have a crunching good day!!

Callum

Joined: Oct 20 09
Posts: 1
ID: 355129
Credit: 39,407
RAC: 0
Message 65384 - Posted 21 Feb 2010 22:42:30 UTC

On the subject of Rosetta Beta 5.98, I've had two WUs error out this weekend.

rossmann3x3_k006_008_18101_1902_0 & rossmann3x3_k006_007_18101_1248_0


<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: -1.#IND000 is outside of [-1,+1] sin and cos value legal range
# random seed: 1135220
ERROR:: Exit from: .\make_pdb.cc line: 244

</stderr_txt>
]]>

MarcoA

Joined: Sep 2 08
Posts: 9
ID: 276404
Credit: 777,433
RAC: 0
Message 65410 - Posted 24 Feb 2010 11:11:37 UTC
Last modified: 24 Feb 2010 11:12:27 UTC

1
2
3

Error, Error and Error.
Fortunately, these WUs took only zero seconds of my precious CPU-time :)


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC