Rosetta@home

Problems with Minirosetta 1.80

Yifan Song
Forum moderator
Project administrator
Project developer
Project scientist

Joined: May 26 09
Posts: 62
ID: 318024
Credit: 7,322
RAC: 0
Message 61886 - Posted 22 Jun 2009 19:42:37 UTC

In this version:
New protein-protein docking protocol.
New rotamer library.

nick n

Joined: Aug 26 07
Posts: 49
ID: 201050
Credit: 219,102
RAC: 0
Message 61918 - Posted 24 Jun 2009 14:24:11 UTC
Last modified: 24 Jun 2009 14:26:12 UTC

I am getting a lot of errors on my Mac. I have tried resetting, detaching, and reattaching, to no avail. Here are a few WU examples:

http://boinc.bakerlab.org/rosetta/result.php?resultid=261129500
http://boinc.bakerlab.org/rosetta/result.php?resultid=261082154
http://boinc.bakerlab.org/rosetta/result.php?resultid=261052997
http://boinc.bakerlab.org/rosetta/result.php?resultid=261042205
http://boinc.bakerlab.org/rosetta/result.php?resultid=260869175
http://boinc.bakerlab.org/rosetta/result.php?resultid=260866803
http://boinc.bakerlab.org/rosetta/result.php?resultid=260840258

Bill Hepburn

Joined: Sep 18 05
Posts: 13
ID: 380
Credit: 9,575,841
RAC: 5,924
Message 61924 - Posted 24 Jun 2009 18:08:57 UTC

I have had three now that came up with a "compute error" after they had almost finished. Don't think it is on my end. They were on two different computers (one XP Pro, one Win Server 2003). Two of them have been reissued and the second person errored out too. The last one just went out. Other 1.80 tasks run fine, other projects are running just fine.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=238330815
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=238113829
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=238093549
____________

RC

Joined: Sep 27 05
Posts: 13
ID: 1401
Credit: 245,498
RAC: 0
Message 61925 - Posted 24 Jun 2009 22:06:48 UTC - in response to Message ID 61924.

I have also had a couple of failures on a Mac. In both cases the run time was less than 10 minutes:

http://boinc.bakerlab.org/rosetta/result.php?resultid=261100252
http://boinc.bakerlab.org/rosetta/result.php?resultid=261064311

____________

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 61929 - Posted 25 Jun 2009 12:02:58 UTC

lb_cutback_all_multi_hb_t290__IGNORE_THE_REST_1LOPA_7_12941_28_0

Outcome = Success and Validate state = valid but

cpu time = 1637.58 secs and

no models appear in the stderr out but this does:

Hbond tripped: [2009- 6-25 5:28: 3:]

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 334
called boinc_finish




Snags

slamb

Joined: Oct 19 05
Posts: 2
ID: 5505
Credit: 2,044,630
RAC: 0
Message 61930 - Posted 25 Jun 2009 12:24:06 UTC

Running out of work. Can't get any more work to download.
____________

nick n

Joined: Aug 26 07
Posts: 49
ID: 201050
Credit: 219,102
RAC: 0
Message 61940 - Posted 25 Jun 2009 18:13:17 UTC
Last modified: 25 Jun 2009 18:16:55 UTC

Now just about everything is failing. I am going to leave for a while if this isn't fixed soon.....

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 61941 - Posted 25 Jun 2009 19:09:02 UTC - in response to Message ID 61930.

Running out of work. Can't get any more work to download.


It seems the work server waits until you complete, or get a long way into, your last running tasks before it downloads new work.
I have seen this happen a lot lately.
I came down to my last 2 tasks (1 per core) and was running them when I got my huge quota (current +5 days extra) of new work.

See if that is happening on your system.

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 61945 - Posted 25 Jun 2009 20:19:56 UTC - in response to Message ID 61941.

I found a "bug".

This WU earned only 84.37 credit but ran for 22,555.02 sec.

This WU earned 84.34 credit and ran for 10,652.67 sec.

So is it a bug, or is it normal that a WU running 2x longer gets the same credit?
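
To make the comparison in this post concrete, here is a minimal Python sketch that converts the two quoted results into credit per CPU hour. The function name `credit_per_hour` is hypothetical and only the figures come from the post; it says nothing about how the project actually computes credit.

```python
def credit_per_hour(credit: float, cpu_seconds: float) -> float:
    """Return granted credit per CPU hour for one finished task."""
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return credit / cpu_seconds * 3600.0


# Figures quoted in the post above (credit, CPU seconds).
long_task = credit_per_hour(84.37, 22555.02)
short_task = credit_per_hour(84.34, 10652.67)

print(f"long task:  {long_task:.2f} credit/hour")
print(f"short task: {short_task:.2f} credit/hour")
print(f"ratio:      {short_task / long_task:.2f}x")
```

Comparing tasks by credit per hour rather than by total credit makes it easier to spot a task that ran unusually long for the credit it received.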

____________
WWW of Polish National Team - Join! Crunch! Win!

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 61948 - Posted 26 Jun 2009 5:38:09 UTC

Hi.

This one ran for over ten hours on my six hour runtime then fell over, NOT GOOD.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=238272279

Fri 26 Jun 2009 14:59:27 EST|rosetta@home|Output file lb_cutback_all_multi_hb_t325__IGNORE_THE_REST_1ZZMA_12_12955_12_0_0 for task lb_cutback_all_multi_hb_t325__IGNORE_THE_REST_1ZZMA_12_12955_12_0 absent

<error_code>-161</error_code>

pete.

____________


adrianxw

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,210,812
RAC: 3,748
Message 61950 - Posted 26 Jun 2009 9:15:14 UTC
Last modified: 26 Jun 2009 9:18:33 UTC

I don't know if this is the right place, but I have set 6 hours as the target runtime, and this WU has been running for 54:05:12 now and claims to be 15.255% complete. I have suspended the task pending comment. It claims to have 88:17:24 to completion.

<edit>

Mini Rosetta 1.80, Windows XP, BOINC 6.6.20.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 61951 - Posted 26 Jun 2009 13:05:41 UTC

adrianxw, please click the task from the task list and click the properties button. Does it show more than 10 hours of CPU time as well? (because the task list now shows "elapsed time" with the new BOINC version).

If you unsuspend the task (and get it running again, perhaps by suspending other tasks for a moment), is it using CPU time?

If it has more than 10 hours of actual CPU time, I would suggest aborting the task.
____________
Rosetta Moderator: Mod.Sense

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 61952 - Posted 26 Jun 2009 13:47:17 UTC - in response to Message ID 61951.

adrianxw, please click the task from the task list and click the properties button. Does it show more than 10 hours of CPU time as well? (because the task list now shows "elapsed time" with the new BOINC version).

If you unsuspend the task (and get it running again, perhaps by suspending other tasks for a moment), is it using CPU time?

If it has more than 10 hours of actual CPU time, I would suggest aborting the task.


I also have a WU that got stuck, luckily I noticed after just 4 hours.

Here's a screenshot of the properties of that WU, as you can see that CPU time is just 1 hour + while Run time is 4 hours +



Suspending --> Resuming didn't work to "unstuck" it, until I removed the flag from "keep WU's in memory when suspended". After that, suspending --> resuming made it work again from the percentage reached before the stop (43,43%)
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 61953 - Posted 26 Jun 2009 14:18:40 UTC
Last modified: 26 Jun 2009 14:20:54 UTC

Venturini, are you allowing BOINC to use 100% of CPU? And all of the available CPUs? Is the machine busy working on other applications that are running?
____________
Rosetta Moderator: Mod.Sense

adrianxw

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,210,812
RAC: 3,748
Message 61955 - Posted 26 Jun 2009 14:44:52 UTC
Last modified: 26 Jun 2009 14:49:29 UTC

The "Properties" box shows "CPU Time" 00:58:34, the "CPU time at last checkpoint" also shows as 00:58:34 "Elapsed time" 54:05:12 and "Estimated time remaining" 88:17:24.

Resuming the task, it started running in "High priority" mode.

I think I would have noticed if it had really been sitting there for a couple of days. In the time it has taken to write this, the percentage complete has risen to 18.012% and the estimated completion dropped to 83:58:43. Something weird going on there. I'll leave it running for the moment at least.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 61956 - Posted 26 Jun 2009 15:40:14 UTC - in response to Message ID 61953.

Venturini, are you allowing BOINC to use 100% of CPU? And all of the available CPUs? Is the machine busy working on other applications that are running?


All of the cores (2) are dedicated to BOINC, both running 100%, and the only other application running is Word (I'm writing schemes for my next university exams) plus the background ones (antivirus and so on) ;)

Plus, I have only Rosetta on this PC (and WCG, but it's set to no new task).

OS is Windows Vista Home Premium, BOINC is 6.6.28, CPU is an Intel 7700.

And, btw, call me Dario, Venturini is my surname ;)

PinkPenguin

Joined: Apr 26 09
Posts: 5
ID: 313164
Credit: 280,676
RAC: 0
Message 61957 - Posted 26 Jun 2009 15:41:03 UTC

Reporting a couple of -161 errors encountered at the end of lb_cutback_all_multi_hb work units which appear to have completed OK.

On Windows Vista (Intel Core Duo 2GHz) - BOINC 6.6.36 / Rosetta 1.80:
http://boinc.bakerlab.org/rosetta/result.php?resultid=261371341

On Linux Fedora v10 (Intel Pentium 4 3.00GHz) - BOINC 6.4.7 / Rosetta 1.80:
http://boinc.bakerlab.org/rosetta/result.php?resultid=261035946
In this case the other task with the same workunit (238257150) completed without errors.

I noticed that there are similar reports earlier in this thread (see also message 61948 from P.P.L.).

This may be similar to a series of lb_thread_all_multi errors reported earlier this month.

All the best,
Richard

Chris Down

Joined: Jun 19 09
Posts: 1
ID: 322564
Credit: 11,750
RAC: 0
Message 61960 - Posted 26 Jun 2009 16:25:55 UTC

Also experiencing some compute errors and strange completion times. Seems to be ignoring my settings, too.

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 61961 - Posted 26 Jun 2009 18:00:40 UTC - in response to Message ID 61956.

Venturini, are you allowing BOINC to use 100% of CPU? And all of the available CPUs? Is the machine busy working on other applications that are running?


All of the cores (2) are dedicated to BOINC, both running 100%, and the only other application running is Word (I'm writing schemes for my next university exams) plus the background ones (antivirus and so on) ;)

Plus, I have only Rosetta on this PC (and WCG, but it's set to no new task).

OS is Windows Vista Home Premium, BOINC is 6.6.28, CPU is an Intel 7700.

And, btw, call me Dario, Venturini is my surname ;)


Here you go: completed, reported and validated successfully.

http://boinc.bakerlab.org/rosetta/result.php?resultid=261619500

Rayburner

Joined: Oct 4 05
Posts: 32
ID: 2632
Credit: 4,527,440
RAC: 444
Message 61976 - Posted 27 Jun 2009 16:24:57 UTC
Last modified: 27 Jun 2009 16:25:27 UTC

compute error after 4 hours

http://boinc.bakerlab.org/rosetta/result.php?resultid=261844121

real_core_1.5_low200_beta_low200_start_hb_t374__IGNORE_THE_REST_13119_137_0
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 61979 - Posted 28 Jun 2009 1:46:01 UTC

Hi.

Another problem task, it seemed to be in a loop going nowhere.

I aborted it after 4 hrs, along with another of the same type.

real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13186_100_0.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=238909195

Model:0

Step:52800

pete.


____________


gazzawazza

Joined: May 4 07
Posts: 28
ID: 173083
Credit: 294,873
RAC: 0
Message 61982 - Posted 28 Jun 2009 8:49:23 UTC

Hi all.

I'm still getting the odd computation error (please see previous thread: http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4933).

However, this has been the only WU failure since the 21st June 2009.

The symptoms are that a task repeatedly restarts (having exited with zero status but no 'finished' file), then when complete the output file is absent (or at least that's what's being reported in the BOINC client logs).

My other projects seem to be running without issue.

My current setup is BOINC 6.6.36 (running as a service) on vista home premium SP2 (32bit), running Rosetta 1.80.

I do have Kaspersky antivirus 2009 installed but real-time scanning was disabled for the entirety of the time that this latest WU was running for (I only mention this because I know that A/V progs have been implicated in other crunching problems e.g. files getting locked).


Regards,

Gary

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 61983 - Posted 28 Jun 2009 9:01:09 UTC

Another problem task, it seemed to be in a loop going nowhere.


I have also had a real_core going in a loop to nowhere so I have aborted it.

261825496
____________

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 61989 - Posted 28 Jun 2009 20:30:11 UTC

I was just randomly checking the tasks I have lined up and went to look at the graphics of lb_cutback_all_multi_hb_t326__IGNORE_THE_REST_2GK3A_3_12956_21_0. I found that the native structure and low energy structure windows were working fine, but none of the other windows had any structures or plots showing. On one occasion the search window showed a graphic for all of a second or two. It also says "stage unknown" for the kind of work it is doing. The line representations in the two working windows move and change position. Also, the accepted energy value is not a number but 1.#QNAN, and for accepted rmsd it shows 1.#QO.

Here is a screen shot:


ByRad

Joined: Apr 12 08
Posts: 8
ID: 252633
Credit: 9,025,840
RAC: 14,161
Message 61990 - Posted 28 Jun 2009 21:16:47 UTC

BOINC Manager message:
2009-06-28 23:07:24 rosetta@home task lr_score12_snase_run02_rlbn_yfsong_3BDC-ASN100LYS_SAVE_ALL_OUT_NATIVE_NOCON_12975_3093_0 aborted by user=

I aborted this task because after about 1.5 h of work it was still at 5.3% (normally it is about 40). I then checked the graphics for this task: model:2, step:70. I checked again about an hour later and it was still at model:2, step:70...
An infinite loop... (Normally I crunch 50 to 100 models in about 3 h!)
____________

WinterWasp

Joined: Jun 16 09
Posts: 2
ID: 321897
Credit: 11,905
RAC: 0
Message 61991 - Posted 28 Jun 2009 22:28:01 UTC

Is it normal, that a task completes successfully, gets verified as ok and grants almost double the asked credits despite the log being almost flooded with not a number and value out of range errors?
wRMSF_1_5_core_jumps_mixcst2_hb_t374__IGNORE_THE_REST_12929_921_1 is the task in question.

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 61993 - Posted 28 Jun 2009 23:04:20 UTC

dom 28 giu 2009 22:30:23 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0 absent

This WU errored out after 8 hours of crunching (supposed to be 4)

To me it seems like the "real_core" ones have a fairly high failure rate...

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 61995 - Posted 29 Jun 2009 0:06:58 UTC

I've had a number of real_core_1.5_low200_beta_low200_start_ WUs go 4 hours past my runtime and they were presumably ended by the watchdog. They all claim 1 decoy and were marked invalid.

http://boinc.bakerlab.org/rosetta/result.php?resultid=261815816
http://boinc.bakerlab.org/rosetta/result.php?resultid=261768023
http://boinc.bakerlab.org/rosetta/result.php?resultid=261765649
http://boinc.bakerlab.org/rosetta/result.php?resultid=261722487

Michael G.R.

Joined: Nov 11 05
Posts: 263
ID: 11128
Credit: 8,385,240
RAC: 115
Message 61997 - Posted 29 Jun 2009 5:08:30 UTC

Been getting errors on my Mac too with 1.80.
____________

xsc2

Joined: Jul 9 08
Posts: 4
ID: 267987
Credit: 62,354
RAC: 0
Message 62000 - Posted 29 Jun 2009 6:41:51 UTC

This WU crashed with exit status: 1 (0x1)

http://boinc.bakerlab.org/rosetta/result.php?resultid=262078278

Steve Dodd

Joined: Dec 13 05
Posts: 6
ID: 36900
Credit: 1,389,095
RAC: 70
Message 62005 - Posted 29 Jun 2009 12:38:50 UTC

Just adding to the rest of the comments here. I'm also experiencing issues with WUs that begin with "lb_cutback_all_multi...". It seems that the app is ignoring the preferences file for maximum time per WU. Mine's set at 4 hours, but these are running over 8 hrs and still going.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62006 - Posted 29 Jun 2009 12:47:51 UTC

Steve, thanks for the info. Just to clarify, the setting in the Rosetta preferences is not for maximum time per work unit. It is a target runtime. Having said that, the program checks periodically to assure the task seems to be progressing normally, and at the end of models it checks to see if the runtime would allow another model or not.

The "watchdog" should take action on any task that runs longer than the runtime preference plus 4 hours. Since it doesn't waste time checking this all of the time, it may take another 15 min. or so after that. So, your task just reached the point where the system should have taken action itself.

With all of these reports, it sounds like there are some new tasks that have lengthy models, and perhaps some new issues with the watchdog as well. Keep the details coming.
____________
Rosetta Moderator: Mod.Sense
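
The two checks described in the post above can be sketched in Python as follows. This is only an illustration of the stated rule (target runtime plus 4 hours), not the project's actual watchdog code; the function names and the way the "room for another model" check uses an average model time are assumptions.

```python
# Grace period taken from the post: "runtime preference plus 4 hours".
WATCHDOG_GRACE_SECONDS = 4 * 3600


def watchdog_should_end(cpu_seconds: float, target_runtime_seconds: float) -> bool:
    """True once a task has used more CPU time than target runtime + 4 hours."""
    return cpu_seconds > target_runtime_seconds + WATCHDOG_GRACE_SECONDS


def can_start_another_model(cpu_seconds: float,
                            target_runtime_seconds: float,
                            avg_model_seconds: float) -> bool:
    """Assumed end-of-model check: start a new model only if one more
    average-length model would still fit inside the target runtime."""
    return cpu_seconds + avg_model_seconds <= target_runtime_seconds


# A task with a 4 hour target that has used a little over 8 hours of CPU time.
print(watchdog_should_end(8.1 * 3600, 4 * 3600))
```

Because the check is described as periodic, a real task could overshoot the threshold by the length of one checking interval before it is ended.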

lusvladimir

Joined: Oct 18 05
Posts: 12
ID: 5401
Credit: 1,784,854
RAC: 0
Message 62011 - Posted 29 Jun 2009 17:04:38 UTC
Last modified: 29 Jun 2009 17:08:12 UTC

Errors for tasks: real_core_1.5_low200_beta_low200_start_hb

http://boinc.bakerlab.org/result.php?resultid=261781005
http://boinc.bakerlab.org/result.php?resultid=261750967
http://boinc.bakerlab.org/result.php?resultid=261750701
http://boinc.bakerlab.org/result.php?resultid=261750699

Ended by the watchdog. Marked invalid.
____________

AdeB

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,473,480
RAC: 1,988
Message 62013 - Posted 29 Jun 2009 19:57:54 UTC

task 262080735 ended after 16 hours, which happens to be my cpu_run_time_pref + 4 hours.
And then there was a <file_xfer_error>.

BOINC:: CPU time: 57669.2s, 14400s + 43200s[2009- 6-29 14:48:55:] :: BOINC
Output exists: default.out.gz
InternalDecoyCount: 0 (GZ)
======================================================
DONE :: 1 starting structures 57670.2 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>real_core_1.5_low200_beta_low200_start_hb_t286__IGNORE_THE_REST_13040_508_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
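
For anyone checking their own stderr output, here is a small Python sketch that pulls the numbers out of the "CPU time" line quoted above. Reading the two added terms as the 4 hour watchdog allowance plus the runtime preference is an assumption based on this post, and `parse_cpu_line` is a hypothetical helper, not part of BOINC or Rosetta.

```python
import re

# Matches lines such as:
# "BOINC:: CPU time: 57669.2s, 14400s + 43200s[...] :: BOINC"
CPU_LINE = re.compile(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s")


def parse_cpu_line(line):
    """Return (cpu_seconds, limit_seconds), or None if the line does not match."""
    match = CPU_LINE.search(line)
    if match is None:
        return None
    cpu_seconds = float(match.group(1))
    # Assumption: the two terms are the grace period and the runtime preference.
    limit_seconds = int(match.group(2)) + int(match.group(3))
    return cpu_seconds, limit_seconds


line = "BOINC:: CPU time: 57669.2s, 14400s + 43200s[2009- 6-29 14:48:55:] :: BOINC"
cpu_seconds, limit_seconds = parse_cpu_line(line)
print(f"limit: {limit_seconds / 3600:.0f} h, overrun: {cpu_seconds - limit_seconds:.1f} s")
```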

____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62014 - Posted 29 Jun 2009 21:47:54 UTC

This one only ran for 1 sec, and has errored for others.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=237150212

calbindin_BOINC_ABRELAX_4xBIN_1xCYCLES_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--calbindin-_12935_707_2

____________


Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 62016 - Posted 30 Jun 2009 7:55:49 UTC

The next task:
real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13046_407_0
Didn't switch to another application after 1 hour – ran on for over 7 hours.
Didn't stop after runtime preference of 6 hours – was ended by the watchdog after 10 hours.
Didn't checkpoint regularly – after rebooting at 9 hours of runtime, the WU restarted from 2 hours of runtime.
The good thing: Outcome: Success.

Path7.

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 62017 - Posted 30 Jun 2009 8:41:37 UTC
Last modified: 30 Jun 2009 8:44:45 UTC

2 more real_core ran far over the 4 hours boundary, both ended after 8 hours, one successful, the other one errored out:

mar 30 giu 2009 09:51:33 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0 absent

Error is always code -161

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 62018 - Posted 30 Jun 2009 9:32:22 UTC

Another real_core with a strange behaviour:

This is when I turned the PC on this morning (54% completed because it ran yesterday for some hours)



And this is 2 minutes later (5% because somehow it reset itself, including CPU time)



Btw now it's at 6% after 37 minutes, which means it will need some 16 x 37 minutes to reach 100%, which means more than 8 hours, when the target time is set at 4.
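
The estimate in the previous paragraph is a straight-line extrapolation, which can be written as a short Python sketch. It assumes progress is linear in time, which the percentage shown by the BOINC manager does not guarantee; `projected_total_minutes` is a hypothetical helper name.

```python
def projected_total_minutes(elapsed_minutes: float, percent_complete: float) -> float:
    """Extrapolate total runtime, assuming progress is linear in time."""
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be in (0, 100]")
    return elapsed_minutes * 100.0 / percent_complete


# Figures from the post: 6% complete after 37 minutes, with a 4 hour target.
total = projected_total_minutes(37, 6)
print(f"projected total: {total:.0f} min ({total / 60:.1f} h)")
print("over the 4 hour target:", total > 4 * 60)
```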

I'm having these errors both on my laptop (Core2Duo 7700, Vista Home Premium, BOINC 6.4.5) and my desktop (AMD 3800x2, Ubuntu 9.04 64bit, BOINC 6.6.28)

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62020 - Posted 30 Jun 2009 13:18:32 UTC

Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors in to account, including the speed of your machine, and time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the last several tasks runtime as a frame of reference.

If you restart a task, you should be looking at the elapsed time change as the indication of what checkpoint (if any) the task was able to restart from.
____________
Rosetta Moderator: Mod.Sense

Venturini Dario[VENETO]

Joined: May 25 07
Posts: 22
ID: 179805
Credit: 245,028
RAC: 0
Message 62021 - Posted 30 Jun 2009 14:28:32 UTC - in response to Message ID 62020.

Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors in to account, including the speed of your machine, and time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the last several tasks runtime as a frame of reference.

If you restart a task, you should be looking at the elapsed time change as the indication of what checkpoint (if any) the task was able to restart from.


Agreed with that, but I think I have enough experience to understand when there is a problem and when not.

I'll write some more elements down:

1) the WU arrived yesterday at 12.52.
2) all of my WUs are started within a few hours from their arrival because I don't have any cache and the PC is set to always connected
Therefore 3) that WU started being crunched yesterday in the middle of the afternoon
4) I turned off the PC for the night when that WU had reached 54% percentage of completion (yes I'm a nerd and I check how work is going in my PC)
5) I restarted it today and saw that WU being crunched but making no progress
6) I checked the graphic and saw nothing (see posted image #1 in my previous post)
7) I waited a few minutes and saw the WU's percentage dropping to 5%. Checked the CPU time and it said 25 minutes (while it ran for hours the day before)
8) I reported to your thread

Also

9) the WU is still running; the percentage is increasing but it is long overdue. It should have taken 4 hours, it's already at 5 1/2, and the progress bar indicates 55,22%. As you can see, I (and BOINC) made a fairly accurate prediction, because at this speed it will end in 9 hours. Of course the watchdog will kill it after 8 but hey, not that I can do anything about it.
10) I am trying to see the graphics of that WU but the window pops up without syncing to the WU. The graphics' window blocks and I have to terminate it from the task manager.

So now

11) I'm going to let that WU run until completion and hope that you will find something useful in the output, being it for medicine or for the improvement of the application.

P.S. Oh and about the checkpoint thing: the elapsed time for that WU changed from 5 hours to 25 minutes. Is it meant to be this way?
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 62022 - Posted 30 Jun 2009 17:53:19 UTC

I've now had quite a lot of WUs run for 4 hours over my run time of 12 hours, and then get ended by the watchdog. They always report one decoy being made, although, in fact, no decoys seem to have been produced. They then have a file xfer error (-161), presumably because there was no output file.

here's yet another example: http://boinc.bakerlab.org/rosetta/result.php?resultid=262096625

Note that this ran over 16 hours on a Phenom II, yet produced no output.

RC

Joined: Sep 27 05
Posts: 13
ID: 1401
Credit: 245,498
RAC: 0
Message 62024 - Posted 30 Jun 2009 20:14:55 UTC - in response to Message ID 61925.

Another one that died after almost 13 hours (my runtime preference is 8 hours):

http://boinc.bakerlab.org/rosetta/result.php?resultid=262397691
____________

Wissi

Joined: Nov 19 08
Posts: 14
ID: 288715
Credit: 396,107
RAC: 0
Message 62025 - Posted 30 Jun 2009 21:36:19 UTC
Last modified: 30 Jun 2009 21:42:05 UTC

Since getting 1.80, almost every WU I get is planned for about 4 Hours of work, but they will run at least 8 hours. So is there some miscalculation of how strong (or weak) my computer is?

It's quite annoying to see "calculation error" on almost every WU, because the runtime exceeds 8 hours, the last 3 did use more than 10 hours of work.

What's going on here?

Currently, I've got the following WU:
real_core_1.5_low200_beta_low200_start_hb_t332_IGNORE_THE_REST_13273_142
Task ID: 261849792, Work unit 238985112

The original time estimation was about 4hrs 20min, but the task now ran for 5 hours, and still there are 4hrs 10min left.

What I can see is, that the time left INCREASES. The same applies for the currently new started job:

lb_dk_ksync_withtrim2_hb_t302_IGNORE_THE_REST_13365_670
Task ID: 262152215, Work unit 239248916

The time left goes up and up, but never down...

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62026 - Posted 30 Jun 2009 21:51:37 UTC

Here's another sad story.

real_core_3.5_low50_beta_low200_hb_t303__IGNORE_THE_REST_13576_83_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=239454464

This ran for 4hrs 34min made no progress.

At 1hr 49min.
MODEL:0
STEP:46800

At 4hrs 34min.
MODEL:0
STEP:46800

ABORTED.
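
The manual check described in this post (same MODEL and STEP at two readings taken hours apart) can be expressed as a small Python sketch. The `Progress` type, the `looks_stuck` helper, and the 60 minute minimum gap are all assumptions made for illustration; they are not part of BOINC or Rosetta.

```python
from typing import NamedTuple


class Progress(NamedTuple):
    elapsed_minutes: int  # runtime when the graphics window was read
    model: int            # MODEL value shown in the graphics
    step: int             # STEP value shown in the graphics


def looks_stuck(earlier: Progress, later: Progress, min_gap_minutes: int = 60) -> bool:
    """True if MODEL and STEP are unchanged across a long enough interval."""
    gap = later.elapsed_minutes - earlier.elapsed_minutes
    unchanged = (earlier.model, earlier.step) == (later.model, later.step)
    return gap >= min_gap_minutes and unchanged


# The two readings quoted in the post above.
first = Progress(elapsed_minutes=1 * 60 + 49, model=0, step=46800)
second = Progress(elapsed_minutes=4 * 60 + 34, model=0, step=46800)
print(looks_stuck(first, second))
```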

____________


Rob Heilman [Echo Labs]

Joined: Apr 26 07
Posts: 20
ID: 169840
Credit: 2,815,410
RAC: 0
Message 62027 - Posted 1 Jul 2009 1:10:15 UTC

I am getting a ton of compute errors. I also see some ridiculous disparities at times between claimed and awarded credit, e.g.:

262177679 239266150 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Success Done 101,333.10 224.36 17.95
262177658 239266121 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Client error Compute error 101,330.80 224.36 ---

Any ideas?
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62028 - Posted 1 Jul 2009 2:04:14 UTC

Here's another real_core that was stuck.

real_core_5.0_low50_beta_low200_hb_t332__IGNORE_THE_REST_13705_64_0.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=239491013

Hadn't moved in 2hrs 12min. Got to that step then didn't move.

MODEL:0
STEP:48000

ABORTED

I think I have only had 1 of these that has run O.K.


____________


mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 62031 - Posted 1 Jul 2009 9:31:23 UTC - in response to Message ID 62027.

I am getting a ton of compute errors. I also see some ridiculous disparities at times between claimed and awarded credit, e.g.:

262177679 239266150 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Success Done 101,333.10 224.36 17.95
262177658 239266121 28 Jun 2009 23:26:41 UTC 30 Jun 2009 5:40:18 UTC Over Client error Compute error 101,330.80 224.36 ---

Any ideas?


You seem to be having two different kinds of errors: one is error code -161 and the other is something that doesn't list a code. I only looked at a few machines, but it is happening on all that I checked. Hmmm. Here is the Wiki link to the error codes for BOINC: http://www.boinc-wiki.info/Error_Code

Do you ever reboot your machines? Have you updated them lately? I see you run Linux and I know they put out updates all the time, I usually wait until there are just under a hundred to do the updates.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62033 - Posted 1 Jul 2009 13:22:56 UTC

I moved Rob and mikey's posts to this thread.

Rob, several users are reporting tasks that stop progressing. This often means that some models complete in normal time and others take considerably longer. Since credit is issued on completed models, I believe that is the reason for the large disparities between some of your claimed and granted credit.
____________
Rosetta Moderator: Mod.Sense

Rob Heilman [Echo Labs]

Joined: Apr 26 07
Posts: 20
ID: 169840
Credit: 2,815,410
RAC: 0
Message 62034 - Posted 1 Jul 2009 13:26:22 UTC
Last modified: 1 Jul 2009 13:33:42 UTC

Is there anything I can do on my end to help with the issue? It seems to have started right about when 1.80 came out.
I have tried both decreasing my run time to 3 hrs and increasing to 24 hours. Right now I am at 12 on my way back to 8 hours.

What ever is going on it is costing the project some serious computing power.
If you look at my daily credit numbers you can see that without any changes to my machines, software versions, etc. I am only completing 50-55% of what I was able to do on a daily basis over the last several weeks.

My BOINCstats

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62035 - Posted 1 Jul 2009 15:17:13 UTC

Rob, I believe the Project Team should already have the data they need to identify specific types of tasks that are causing problems. So, really can't think of anything on your end to help.

I for one have not been getting any of the tasks with names starting with "real_core", so I tend to believe there probably are not very many of them in the mix. So, your machines should return to tasks that are running well soon.
____________
Rosetta Moderator: Mod.Sense

Chilean

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,394,263
RAC: 2,476
Message 62036 - Posted 1 Jul 2009 15:58:36 UTC

No errors here beyond the occasional compute error, which happens in about 1 out of 60 WUs (2 hours per WU).
My PCs vary: single-core AMD, single Celeron, dual Athlon AMD, Core 2 Duo, all running Windows XP to 7.
Why is it that so many people have so many problems?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62039 - Posted 1 Jul 2009 18:20:13 UTC

Why is it that so many people have so many problems?


You always have to keep in mind that this is the "problems with" thread. So, by design, most of the posts here will be about problems.

Some of the 50 posts in this thread are not about specific problems in 1.80, more about BOINC general issues. I should probably be moving them elsewhere, but who has the time? So of 85,000 active hosts, you will never get every event reported, but overall the big picture is still good.

And so when you compare to about 2 million tasks completed since the creation of this thread, the number of problems is quite modest. And seems most highly correlated to some of the new task types that are being worked on. As I said, it seems these are fairly few in number, so this is the current rough ground being covered.

Not everyone monitors their machines closely, and this is why it was key to make the changes Mike made earlier this year to collect and report more data both for when things go unexpectedly and to gather better information about things that are running well (which helps you readily identify any future variations as compared to that historical data).
____________
Rosetta Moderator: Mod.Sense

alpha Profile

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 62048 - Posted 2 Jul 2009 7:48:46 UTC

Two compute errors after 101,000 seconds (28 hrs) with a preference of 24 hours run time. Only one decoy in both cases:

http://boinc.bakerlab.org/rosetta/result.php?resultid=261928706
http://boinc.bakerlab.org/rosetta/result.php?resultid=262283940

Also, two more with 101,000 seconds run time, these ones completed successfully but granted ridiculously low credit, again, only one decoy:

http://boinc.bakerlab.org/rosetta/result.php?resultid=262122318
http://boinc.bakerlab.org/rosetta/result.php?resultid=262236422
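
For anyone comparing their own run times against their preference, here is a minimal sketch of the arithmetic behind the figures above. The function name is made up for illustration; the only inputs are the 101,000 seconds and the 24 hour preference quoted in this post.

```python
def overrun_hours(run_seconds, preferred_hours):
    """Return (hours actually run, hours beyond the run-time preference)."""
    run_hours = run_seconds / 3600  # 3600 seconds per hour
    return run_hours, run_hours - preferred_hours


# Figures from the post: 101,000 seconds against a 24 hour preference.
run, over = overrun_hours(101_000, 24)
print(f"ran {run:.1f} h, {over:.1f} h past the preference")
```

So these tasks ran roughly 4 hours past the 24 hour preference before they ended.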
____________

ByRad Profile
Avatar

Joined: Apr 12 08
Posts: 8
ID: 252633
Credit: 9,025,840
RAC: 14,161
Message 62050 - Posted 2 Jul 2009 8:26:03 UTC
Last modified: 2 Jul 2009 8:27:19 UTC

I have a very odd error in the Rosetta Mini 1.80 app. You can see everything in the screenshots:




I have 4GB of RAM at 667MHz (3GB usable on my WinXP x86), CPU: C2D T5800 and GPU: GF9300M GS.

And after aborting this WU everything is back to normal...
____________

Seversen

Joined: Dec 21 07
Posts: 3
ID: 229270
Credit: 57,599
RAC: 0
Message 62057 - Posted 2 Jul 2009 13:30:25 UTC

Why did this workunit get such low credit?
real_core_1.5_low200_beta_low200_start_hb_t331__IGNORE_THE_REST_13032_83

Thanks.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62058 - Posted 2 Jul 2009 14:54:03 UTC

Lord ByRad, my translation skills are minimal, but the status shown for the Rosetta task you highlighted has the acronym RAM in it, which I take to mean that the rest of the words translate to something like "waiting for memory". So the settings for BOINC Manager are not allowing it to use enough of the large memory your system has. There are several memory settings you can adjust to allow BOINC to use more memory.

Also, since there is no Rosetta application in the task list, I take it you have it set to remove from memory when not active. Your machine will do work more efficiently if you leave tasks in memory when suspended.
____________
Rosetta Moderator: Mod.Sense

Oliver

Joined: Oct 11 07
Posts: 4
ID: 211670
Credit: 525
RAC: 0
Message 62065 - Posted 2 Jul 2009 20:23:07 UTC

Hi folks,

I checked the output of the real_core_xxx WUs and found that all of them produce good, valid results. So if you see RMSD=1 or similar oddities, that seems to be an error of the graphics rather than the actual WU. In summary, the issues seem to be around the BOINC management but not the internal quality of the results.

We are now starting to address the problems mentioned in this thread with graphics, completion time and checkpointing/resuming.

-Oliver

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62071 - Posted 3 Jul 2009 13:58:50 UTC

Oliver, the RMSD of 1 we are seeing is in the graphs of results described in this thread.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4967
Not the graphics on the client machines. So, somewhere, you have data that reports those values in your databases used to make these graphs.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 62075 - Posted 3 Jul 2009 15:38:21 UTC

Task 262972813 failed on Mac,

Watchdog active.
Hbond tripped: [2009- 7- 2 8:46:56:]

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 334
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>


____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62082 - Posted 5 Jul 2009 0:51:15 UTC

This one ran for 10 hrs on a 6hr pref.

It did 1 model when the watchdog kicked in; I guess it was incomplete.

http://boinc.bakerlab.org/rosetta/result.php?resultid=263029599

Sun 05 Jul 2009 10:20:27 EST|rosetta@home|Output file lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1_0 for task lb_cutback_all_multi_hb_t328__IGNORE_THE_REST_2CEXA_8_12958_5_1 absent



____________


Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 62093 - Posted 5 Jul 2009 13:07:37 UTC

A late report - sorry for the delay:

azurin_BOINC_ABRELAX_4xBIN_1xCYCLES_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--azurin-_12935_2849_1

Outcome Client error
Client state Compute error
Exit status 1 (0x1)
CPU time 0

<core_client_version>6.6.20</core_client_version>

ERROR: Option matching -PCS:npc_files_input not found in command line top-level context

No other errors in the last 217 WUs
____________

bruce Profile

Joined: Sep 15 07
Posts: 10
ID: 205458
Credit: 839,797
RAC: 0
Message 62095 - Posted 5 Jul 2009 15:18:20 UTC
Last modified: 5 Jul 2009 15:30:11 UTC

Hi,
I'm experiencing issues with 1.80 whereby:
1)a WU does not exit memory.
I currently have 25 minirosetta_1.80_windows_intelx86.exe processes in memory only 2 of which are using any cpu time. Memory utilization ranges from 400kb to 200mb
The fact they are not exiting, is causing my virtual memory to run out.
2)I get error messages in the BOINC client.
3)The ...\BOINC\slots folder is filling up with numbered folders where most have only three files:boinc_lockfile, stderr.txt and stdout.txt.

I've rebooted, reset the project and still continue to get these errors.

Here are some specifics about my setup and the errors:
System:
3.0ghz Pentium 4 (w/hyperthreading on)
2.0gb RAM
WinXP sp3 (32bit)
Boinc 6.6.36 (Windows 32bit)
Preferences:
switch between apps every 200minutes
use at most 100% processors
use at most 75% of CPU time
use at most 20gb HD space
use at most 50% memory when in use
use at most 90% memory when idle.
Projects: rosetta@home (Resource Share:600); seti@home (Resource Share:75)


Error from the ...\BOINC\stdoutdae.txt file (similar output on the BOINC manager Messages tab):
05-Jul-2009 07:47:45 [rosetta@home] If this happens repeatedly you may need to reset the project.
05-Jul-2009 07:47:45 [rosetta@home] Restarting task abinitio_withrelax_homfrag_129_B_1ynvA_SAVE_ALL_OUT_13795_445_0 using minirosetta version 180
05-Jul-2009 07:48:26 [rosetta@home] Task abinitio_withrelax_homfrag_129_B_1ynvA_SAVE_ALL_OUT_13795_445_0 exited with zero status but no 'finished' file
05-Jul-2009 07:48:26 [rosetta@home] If this happens repeatedly you may need to reset the project.
etc..etc..etc...


Here is some output from the stderr.txt in the slots folders (with only the three files mentioned above):
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _U9X3X_00001
...
[2009- 7- 5 7:47: 4:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:47:45:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:48:26:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:49: 8:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
...
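
A rough sketch for anyone who wants to count these restarts in their own slot folders. It only assumes the two line shapes visible in the excerpt above (a bracketed timestamp line, then the "Can't acquire lockfile" line); the function name and the embedded sample are for illustration, and nothing here is an official BOINC tool.

```python
import re
from datetime import datetime

# Matches timestamps as they appear in the excerpt, e.g. "[2009- 7- 5 7:47: 4:]".
STAMP = re.compile(r"\[(\d{4})-\s*(\d+)-\s*(\d+)\s+(\d+):\s*(\d+):\s*(\d+):\]")


def lockfile_failures(stderr_text):
    """Return the start time of each launch that exited on the lockfile message."""
    failures = []
    last_stamp = None
    for line in stderr_text.splitlines():
        match = STAMP.search(line)
        if match:
            # Remember the most recent start-up timestamp.
            last_stamp = datetime(*map(int, match.groups()))
        elif "Can't acquire lockfile" in line and last_stamp is not None:
            failures.append(last_stamp)
    return failures


# Sample copied from the stderr.txt excerpt in this post.
sample = """\
[2009- 7- 5 7:47: 4:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:47:45:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:48:26:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
[2009- 7- 5 7:49: 8:] :: BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
"""

times = lockfile_failures(sample)
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
print(len(times), "failed launches, seconds between them:", gaps)
```

On the excerpt above this gives four failed launches spaced about 41 seconds apart, which matches the restart messages in stdoutdae.txt.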


After a reboot: only two minirosetta_1.80_windows_intelx86.exe in memory, both using cpu time (one at 168mb, the other at 219mb). Much more along the lines of what I would expect to see.
After a reboot: all the 'slot' folders with the boinc_lockfile are gone save for 3, the two working rosetta@home WUs and the one Seti@home WU. (again, what I would expect to see)

What other information can I provide that might help clue in on what is causing this problem?

Thanks for your help

William T.M. Theisen Profile

Joined: Sep 11 06
Posts: 7
ID: 111799
Credit: 527,145
RAC: 0
Message 62098 - Posted 5 Jul 2009 20:39:20 UTC

lb_dk_ksync_withtrim_hb_t297__IGNORE_THE_REST_12980_1893_0 got stuck at 6.888% and has been running 29 hours so far, and its "time to completion" has gone up from 60 hours to 65 hours. I'm not sure what is going on with it; should I abort it?
____________

xsc2

Joined: Jul 9 08
Posts: 4
ID: 267987
Credit: 62,354
RAC: 0
Message 62102 - Posted 6 Jul 2009 6:54:46 UTC

Exit status: -1073741819 (0xc0000005)
http://boinc.bakerlab.org/rosetta/result.php?resultid=263200171
http://boinc.bakerlab.org/rosetta/result.php?resultid=263584567

Exit status: 1 (0x1)
http://boinc.bakerlab.org/rosetta/result.php?resultid=263381564
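
For reference, the two numbers shown for each exit status are the same value written two ways: the signed 32-bit integer and its unsigned hexadecimal form. A small sketch of the conversion follows (the function name is made up for illustration). On Windows, 0xC0000005 is the access violation status code.

```python
def exit_status_hex(status):
    """Format a signed 32-bit exit status as 'decimal (0xhex)'."""
    # Masking with 0xFFFFFFFF reinterprets a negative value as unsigned 32-bit.
    return f"{status} (0x{status & 0xFFFFFFFF:x})"


print(exit_status_hex(-1073741819))  # -1073741819 (0xc0000005)
print(exit_status_hex(1))            # 1 (0x1)
```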

[AF>france>pas-de-calais]symaski62

Joined: Sep 19 05
Posts: 47
ID: 506
Credit: 33,871
RAC: 0
Message 62107 - Posted 6 Jul 2009 19:01:10 UTC

abinitio_withrelax_nohomfrag_129_B_1shfA_SAVE_ALL_OUT_13798_612_0

http://boinc.bakerlab.org/rosetta/result.php?resultid=263840421


<![CDATA[
<stderr_txt>
[2009- 7- 6 17:41:24:] :: BOINC:: Initializing ... ok.
[2009- 7- 6 17:41:24:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev30680.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/fragments_1shf.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _U9X3X_00001
Starting work on structure: _U9X3X_00002
Starting work on structure: _U9X3X_00003
Starting work on structure: _U9X3X_00004
Starting work on structure: _U9X3X_00005
Starting work on structure: _U9X3X_00006
Starting work on structure: _U9X3X_00007
Starting work on structure: _U9X3X_00008
Starting work on structure: _U9X3X_00009
Starting work on structure: _U9X3X_00010
Starting work on structure: _U9X3X_00011
Starting work on structure: _U9X3X_00012
Starting work on structure: _U9X3X_00013
Starting work on structure: _U9X3X_00014
Starting work on structure: _U9X3X_00015
Starting work on structure: _U9X3X_00016
Starting work on structure: _U9X3X_00017
Starting work on structure: _U9X3X_00018
Starting work on structure: _U9X3X_00019
Starting work on structure: _U9X3X_00020
======================================================
DONE :: 1 starting structures 10442.9 cpu seconds
This process generated 20 decoys from 20 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish


____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 62112 - Posted 7 Jul 2009 4:57:34 UTC - in response to Message ID 62107.

abinitio_withrelax_nohomfrag_129_B_1shfA_SAVE_ALL_OUT_13798_612_0

http://boinc.bakerlab.org/rosetta/result.php?resultid=263840421


I don't see any errors, it is valid, you got credit. What is the exact problem with this one?
____________

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,500,639
RAC: 2,030
Message 62115 - Posted 7 Jul 2009 14:40:14 UTC

This one is taking 689MB of memory, peak was 986MB!
2a05_NN_DISCONTROL_BOINC_ABRELAX_SAVE_ALL_OUT_13840
It is 20hrs into a 24hr runtime on Windows XP, under BOINC 6.6.20.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 62119 - Posted 7 Jul 2009 22:49:54 UTC - in response to Message ID 62115.

This one is taking 689MB of memory, peak was 986MB!
2a05_NN_DISCONTROL_BOINC_ABRELAX_SAVE_ALL_OUT_13840
It is 20hrs into a 24hr runtime on Windows XP, under BOINC 6.6.20.


Here's a 2a05_NN_DISCONTROL_BOINC_ABRELAX_SAVE_ALL_OUT_13840 WU that ran on a single core diskless Linux node with 1GB installed. It ended with a bad_alloc error, which means the node ran out of physical memory. I've had a number of bad_alloc errors on 512MB nodes (which I no longer crunch with), but now it seems 1GB/core may no longer be enough for Rosetta.

MikeMcC3

Joined: May 13 08
Posts: 2
ID: 258469
Credit: 501,309
RAC: 0
Message 62133 - Posted 8 Jul 2009 17:50:07 UTC

I have no idea what is going on. When I look at the work that has been sent to my computer, I see about one thousand work units that I haven't received. The due dates arrive, and they get red-flagged as time-outs. I can't find any of the work units listed as sent, and there is no mention of those work units being received by my computer. What the heck is going on? Can anyone tell me whether they have had similar problems, or what may have caused this? I've been reducing data for BOINC for over 2 years now, and have never encountered any such problems.

dag Profile
Avatar

Joined: Dec 16 05
Posts: 106
ID: 38674
Credit: 1,000,020
RAC: 0
Message 62151 - Posted 9 Jul 2009 20:12:12 UTC
Last modified: 9 Jul 2009 20:12:48 UTC

I'm getting this many times per day now... never had it before this batch:

7/9/2009 10:49:34 AM|rosetta@home|Task picker-L1-sssim-1bk2A_13839_593_0 exited with a DLL initialization error.
7/9/2009 2:03:14 PM|rosetta@home|Task lr10_seq_score12_rlbd_1elw_IGNORE_THE_REST_DECOY_13841_116_0 exited with a DLL initialization error.
7/9/2009 2:05:31 PM|rosetta@home|Task 1sn6_NN_DISCONTROL_BOINC_ABRELAX_SAVE_ALL_OUT_13840_1231_0 exited with a DLL initialization error.
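
If the list gets long, a short script can pull the affected task names out of the message log. This is only a sketch that assumes the pipe-separated layout of the three lines above (time, project, message); the function name is made up for illustration.

```python
import re

# Task name followed by the message text exactly as it appears in the lines above.
TASK = re.compile(r"Task (\S+) exited with a DLL initialization error")


def dll_init_failures(lines):
    """Return (time, project, task name) for each DLL initialization error line."""
    found = []
    for line in lines:
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a "time|project|message" line
        when, project, message = parts
        match = TASK.search(message)
        if match:
            found.append((when, project, match.group(1)))
    return found


log = [
    "7/9/2009 10:49:34 AM|rosetta@home|Task picker-L1-sssim-1bk2A_13839_593_0 exited with a DLL initialization error.",
    "7/9/2009 2:03:14 PM|rosetta@home|Task lr10_seq_score12_rlbd_1elw_IGNORE_THE_REST_DECOY_13841_116_0 exited with a DLL initialization error.",
    "7/9/2009 2:05:31 PM|rosetta@home|Task 1sn6_NN_DISCONTROL_BOINC_ABRELAX_SAVE_ALL_OUT_13840_1231_0 exited with a DLL initialization error.",
]
for when, project, task in dll_init_failures(log):
    print(when, task)
```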

Rob Heilman [Echo Labs] Profile

Joined: Apr 26 07
Posts: 20
ID: 169840
Credit: 2,815,410
RAC: 0
Message 62165 - Posted 10 Jul 2009 13:33:52 UTC

I am getting a lot of compute errors on sel_core_4.5 work units. They all seem to report error code -161. Examples:

http://boinc.bakerlab.org/rosetta/result.php?resultid=264525168
http://boinc.bakerlab.org/rosetta/result.php?resultid=264520827
http://boinc.bakerlab.org/rosetta/result.php?resultid=264466943
http://boinc.bakerlab.org/rosetta/result.php?resultid=264466941

Any ideas? Seeing this on multiple Linux hosts with different kernels. They are all running the recommended 6.4.5.

____________

Rob Heilman [Echo Labs] Profile

Joined: Apr 26 07
Posts: 20
ID: 169840
Credit: 2,815,410
RAC: 0
Message 62170 - Posted 10 Jul 2009 18:18:40 UTC - in response to Message ID 62165.

I am getting a lot of compute errors on sel_core_4.5 work units. They all seem to report error code -161. Examples:

http://boinc.bakerlab.org/rosetta/result.php?resultid=264525168
http://boinc.bakerlab.org/rosetta/result.php?resultid=264520827
http://boinc.bakerlab.org/rosetta/result.php?resultid=264466943
http://boinc.bakerlab.org/rosetta/result.php?resultid=264466941

Any ideas? Seeing this on multiple Linux hosts with different kernels. They are all running the recommended 6.4.5.


This was moved into this thread by a moderator. Is this a 1.80 problem or a sel_core_4.5 problem? I did not want to assume it was 1.80 and that is why I started a new thread.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62171 - Posted 10 Jul 2009 20:11:44 UTC

Rob, certainly a valid point. But we'll resolve the question here in this thread. Often new task types are related to new code changes in a release and so the two possibilities are often highly correlated anyway.
____________
Rosetta Moderator: Mod.Sense

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 62173 - Posted 10 Jul 2009 22:45:17 UTC

I have noticed something about this thread, it seems to be displaying on my screen in wide format. I have to move the bottom scroll bar across the screen to view the whole post. In the Number crunching thread I can view posts without having to move my scroll bar. Is anyone else having this problem?
____________
Have a crunching good day!!

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 62174 - Posted 10 Jul 2009 23:04:04 UTC

Yes, it starts out normally and then changes to wide screen format.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62175 - Posted 10 Jul 2009 23:11:59 UTC

It is due to wide images posted in the thread. Depending on how long 1.80 remains current release, I may have to move the wide posts.
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 62176 - Posted 10 Jul 2009 23:16:11 UTC
Last modified: 10 Jul 2009 23:16:34 UTC

Maybe you guys could suggest some resizing software that we can use to reduce the size of our screenshots. My screenshot started this mess, and I can't edit the post to reduce the size and I cannot access the storage site I put the image on for free. Also, maybe you could suggest a file storage site that we can use to post our screenshots for free. Then this image issue wouldn't have to happen.

Of course we will need a separate thread for that...

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 62178 - Posted 11 Jul 2009 0:18:13 UTC - in response to Message ID 62175.

It is due to wide images posted in the thread. Depending on how long 1.80 remains current release, I may have to move the wide posts.

Thank you for details Mod.Sense, I never gave the screen shots a thought. I'm not sure if this is the right place to ask, is there any chance the page Quick guide to Rosetta and its graphics can be updated to what the different colors mean?
____________
Have a crunching good day!!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62179 - Posted 11 Jul 2009 0:41:44 UTC
Last modified: 11 Jul 2009 0:46:00 UTC

speedy, the colors are just rainbow spectrum blue to red. They help you see which end is which. Especially with longer proteins.

greg, I think it best to post links rather than pics, as described here. So, url tags rather than img tags. You might consider using flickr.com to host pics. I see geocities will be going away soon.
____________
Rosetta Moderator: Mod.Sense

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 62180 - Posted 11 Jul 2009 1:58:38 UTC - in response to Message ID 62179.
Last modified: 11 Jul 2009 1:59:31 UTC

speedy, the colors are just rainbow spectrum blue to red. They help you see which end is which. Especially with longer proteins.

Ok, I was talking about the colours in the accepted energy; the colors are mainly yellow & blue. I can't tell which end is which of the proteins now. When you say they help you see which end is which of the proteins, are you referring to the protein that is moving in the accepted panel of the graphics window?
____________
Have a crunching good day!!

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 62183 - Posted 11 Jul 2009 6:03:06 UTC

Hi.

This one seems to have the same type of problem as the real_core ones: it seems it got stuck in a loop, done twice.

sel_core_5.0_low200_beta_low200_start_hb_t297__IGNORE_THE_REST_14061_180_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=241330995

Model:0
Step:44400

ABORTED MINE.


____________


AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 62184 - Posted 11 Jul 2009 13:17:42 UTC

I'm having a lot of errors from sel_core WUs. They crunch over 16 hours (4 hours over my 12 hour preference), then they exit claiming 1 decoy (although I suspect they didn't produce any decoys), then they error out with code -161 (file_xfer_error, probably because no decoys were generated).

http://boinc.bakerlab.org/rosetta/result.php?resultid=264546035
http://boinc.bakerlab.org/rosetta/result.php?resultid=264511109
http://boinc.bakerlab.org/rosetta/result.php?resultid=264503575
http://boinc.bakerlab.org/rosetta/result.php?resultid=264490536
http://boinc.bakerlab.org/rosetta/result.php?resultid=264476503
http://boinc.bakerlab.org/rosetta/result.php?resultid=264456236
http://boinc.bakerlab.org/rosetta/result.php?resultid=264403527
http://boinc.bakerlab.org/rosetta/result.php?resultid=264394564

Jimmy McNulty

Joined: Nov 13 05
Posts: 2
ID: 11819
Credit: 74,396
RAC: 0
Message 62196 - Posted 12 Jul 2009 3:48:25 UTC

Just came back to this project after a few months break because I was previously having problems with every single WU. I've run a couple dozen in the past week or so with no problems, then got an error with WU lb_alnmatrix_within_2_hb_t370__IGNORE_THE_REST_1DNLA_12_13913_6_0

Ran 31 hours with preference set for 8, didn't budge past 69.725%

Additionally, I don't dare click show graphics on any work unit since I've returned to the project, because 3 times I've gotten an error message for minirosetta 1.80 and progress stops. My computer keeps trying to crunch it but I'm forced to abort; however, that was not the case with the WU I mentioned above.

pandem

Joined: Nov 12 08
Posts: 2
ID: 287674
Credit: 111,130
RAC: 0
Message 62216 - Posted 14 Jul 2009 1:46:10 UTC
Last modified: 14 Jul 2009 1:46:58 UTC

In reference to message 62095
[quote]Hi,
I'm experiencing issues with 1.80 where by:
1)a WU does not exit memory.
I currently have 25 minirosetta_1.80_windows_intelx86.exe processes in memory only 2 of which are using any cpu time. Memory utilization ranges from 400kb to 200mb
The fact they are not exiting, is causing my virtual memory to run out.
2)I get error messages in the BOINC client.
3)The ...\BOINC\slots folder is filling up with numbered folders where most have only three files:boinc_lockfile, stderr.txt and stdout.txt.[/quote]

- Has anyone else noted this type of issue as I have? It is very annoying and I have resorted to suspending the project. There is a setting in the global_prefs_override that says remove from memory when in use. It seems this setting has no effect on this project. minirosetta 1.81/1.82 (dualcore Intel,2.66G 2Gbyte, xp sp3, boinc 6.6.36)
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62219 - Posted 14 Jul 2009 3:23:29 UTC - in response to Message ID 62216.

In reference to message 62095
[quote]Hi,
I'm experiencing issues with 1.80 where by:
1)a WU does not exit memory.
I currently have 25 minirosetta_1.80_windows_intelx86.exe processes in memory only 2 of which are using any cpu time. Memory utilization ranges from 400kb to 200mb
The fact they are not exiting, is causing my virtual memory to run out.
2)I get error messages in the BOINC client.
3)The ...\BOINC\slots folder is filling up with numbered folders where most have only three files:boinc_lockfile, stderr.txt and stdout.txt.[/quote]

- Has anyone else noted this type of issue as I have? It is very annoying and I have resorted to suspending the project. There is a setting in the global_prefs_override that says remove from memory when in use. It seems this setting has no effect on this project. minirosetta 1.81/1.82 (dualcore Intel,2.66G 2Gbyte, xp sp3, boinc 6.6.36)



1.81 and 1.82??? These are on Ralph.

What you are reporting is new. Sounds like tasks are not completing properly.

____________
Rosetta Moderator: Mod.Sense

bruce Profile

Joined: Sep 15 07
Posts: 10
ID: 205458
Credit: 839,797
RAC: 0
Message 62222 - Posted 14 Jul 2009 9:49:09 UTC - in response to Message ID 62219.

In reference to message 62095
[quote]Hi,
I'm experiencing issues with 1.80 where by:
1)a WU does not exit memory.
I currently have 25 minirosetta_1.80_windows_intelx86.exe processes in memory only 2 of which are using any cpu time. Memory utilization ranges from 400kb to 200mb
The fact they are not exiting, is causing my virtual memory to run out.
2)I get error messages in the BOINC client.
3)The ...\BOINC\slots folder is filling up with numbered folders where most have only three files:boinc_lockfile, stderr.txt and stdout.txt.[/quote]

- Has anyone else noted this type of issue as I have? It is very annoying and I have resorted to suspending the project. There is a setting in the global_prefs_override that says remove from memory when in use. It seems this setting has no effect on this project. minirosetta 1.81/1.82 (dualcore Intel,2.66G 2Gbyte, xp sp3, boinc 6.6.36)



1.81 and 1.82??? These are on Ralph.

What you are reporting is new. Sounds like tasks are not completing properly.


I've been experiencing this issue since before 1.67, so, new?, not to me... but perhaps something not seen before by most. I haven't tried RALPH, so I couldn't report any issues there.
I've been experiencing this on 3 separate machines. All running into the same basic problem, where the Minirosetta application does not exit memory.
I have a plethora of errored WUs.
http://boinc.bakerlab.org/rosetta/results.php?userid=205458&offset=40

For the three machines I've been experiencing this on, because it exhausts memory and I begin getting messages about running out of virtual memory, I've suspended the project on those machines until I see some forward motion toward a resolution.
I do continue to have R@H running on one machine that does not seem to have this same issue.

Any ideas on what information I can supply that may help work towards a resolution?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62223 - Posted 14 Jul 2009 11:16:07 UTC

Any ideas on what information I can supply that may help work towards a resolution?


BOINC version, Rosetta version (which is shown in the tasks tab), computing platform (Windows edition, Linux, Mac), any information you might have about when the problem does and does not occur, whether you display the graphic, and whether you use BOINC as your screensaver, did the tasks complete normally and report back valid results with reasonable credit?

Those are my general questions I'd always ask. Bruce, in your case, is there anything unique about these 3 machines as compared to any others that you have experience with that might explain why they see the problem and others do not? I'm guessing BOINC version perhaps?

Just so others are clear, in general, you would *prefer* that BOINC leave tasks in memory while preempted. It runs more efficiently that way. This is set up in the preferences. But what is being described here is tasks that are completed (i.e. not preempted) and are not leaving memory. And regardless of your preference, the program should free up all memory and BOINC slots when a task completes.
____________
Rosetta Moderator: Mod.Sense

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 62225 - Posted 14 Jul 2009 17:10:32 UTC

1.82 is up.
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 62229 - Posted 14 Jul 2009 18:45:17 UTC
Last modified: 14 Jul 2009 18:45:29 UTC

I'll keep this thread sticky until the remaining 1.80 work reaches the 10 day deadline.
____________
Rosetta Moderator: Mod.Sense

bruce Profile

Joined: Sep 15 07
Posts: 10
ID: 205458
Credit: 839,797
RAC: 0
Message 62234 - Posted 15 Jul 2009 5:32:00 UTC - in response to Message ID 62223.

Any ideas on what information I can supply that may help work towards a resolution?


BOINC version, Rosetta version (which is shown in the tasks tab), computing platform (Windows edition, Linux, Mac), any information you might have about when the problem does and does not occur, whether you display the graphic, and whether you use BOINC as your screensaver, did the tasks complete normally and report back valid results with reasonable credit?

Those are my general questions I'd always ask. Bruce, in your case, is there anything unique about these 3 machines as compared to any others that you have experience with that might explain why they see the problem and others do not? I'm guessing BOINC version perhaps?

Just so others are clear, in general, you would *prefer* that BOINC leave tasks in memory while preempted. It runs more efficiently that way. This is set up in the preferences. But what is being described here is tasks that are completed (i.e. not preempted) and are not leaving memory. And regardless of your preference, the program should free up all memory and BOINC slots when a task completes.


Hi,
Here's some additional information on my situation, and while I describe the situation on one computer, the same situation exists on two others.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4953&nowrap=true#62095


These 4 computers are capable of running only two tasks at a time, and I've had no problem with the applications staying in memory during their normal processing or while waiting for resources/pausing during usage. But there really isn't any reason why 25 WUs should remain in memory (yes, I've observed upwards of 25 jobs in memory; typically I begin to notice when around 10 rosetta jobs are in memory). My observation over the last few years has been that they do exit when there was an error or when they've completed running, and at any given time I would only see two jobs running. I'm not referring to remaining in memory during the normal course of waiting for resources. This behavior is not normal. That being said, here is the information requested:

Some tasks complete normally soon after a reboot, but within a day or so I begin seeing errored WUs. No credit is given for the errored WUs.

Common to all 4 computers:
BOINC: 6.6.36; Rosetta 1.80
All computers have all latest service packs and patches on operating systems and applications.
switch between apps every 200 minutes
use at most 100% processors
use at most 75% of CPU time
use at most 20gb HD space
use at most 50% memory when in use
use at most 90% memory when idle
Projects: rosetta@home (Resource Share:600); seti@home (Resource Share:75)


Computer 1 (no observed errors)
2 AMD Opteron 250 processors
8 gb RAM
Windows 7 RC (64bit)
4 146gb HDs (scsi)
No screen saver active
(Light usage:email, internet, etc)

Computer 2
3.0ghz Pentium 4 (w/hyperthreading on)
2.0gb RAM
WinXP Pro (32bit)
1 76gb HD(sata), 1 160gb HD (SATA)
no screen saver active
(almost no usage (runs boinc and tomcat only))

Computer 3
Dell Latitude D830
T7500 Duo Core2
2gb RAM
Windows XP Pro (32bit)
1 80gb HD
Boinc screen saver active and displays graphics
(heavy usage)

Computer 4
Compaq/HP CQ60-215DX
AMD Athlon Dual-Core QL-62 2.0ghz
2gb RAM
Windows Vista Home Premium (32bit)
250gb HD
non-boinc screen saver - no boinc/rosetta graphics
(light usage: internet, email)
____________

alpha

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 62246 - Posted 16 Jul 2009 8:39:38 UTC

Compute error after 101,115 seconds (1 decoy):

http://boinc.bakerlab.org/rosetta/result.php?resultid=265225509

<file_xfer_error>
<file_name>sel_core_4.5_low200_beta_low200_start_hb_t374__IGNORE_THE_REST_14057_526_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
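
A fragment like the one above can be picked apart mechanically when comparing many failed results. The following is only a minimal sketch and is not part of BOINC or Rosetta: the function name parse_file_xfer_errors and the KNOWN_CODES table are hypothetical, and the label given for -161 (ERR_NOT_FOUND, meaning the expected output file was not found) is an assumption that should be checked against the BOINC source before relying on it.

```python
import re

# Assumption: -161 corresponds to ERR_NOT_FOUND in BOINC's error numbering.
# Any other code is reported as "unknown" rather than guessed.
KNOWN_CODES = {-161: "ERR_NOT_FOUND"}

# Matches one <file_xfer_error> block as it appears in a result's stderr output.
_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)</error_code>\s*"
    r"</file_xfer_error>",
    re.DOTALL,
)


def parse_file_xfer_errors(text):
    """Return a list of (file_name, error_code, label) tuples found in text."""
    results = []
    for name, code in _XFER_ERROR.findall(text):
        code = int(code)
        results.append((name.strip(), code, KNOWN_CODES.get(code, "unknown")))
    return results


sample = """
<file_xfer_error>
<file_name>sel_core_4.5_low200_beta_low200_start_hb_t374__IGNORE_THE_REST_14057_526_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
"""

for file_name, error_code, label in parse_file_xfer_errors(sample):
    print(file_name, error_code, label)
```

If the assumption about -161 holds, the task ran to the end but the client could not find the output file it was supposed to upload, which would be consistent with a compute error reported only after the full run time.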
____________

rochester new york

Joined: Jul 2 06
Posts: 2572
ID: 98229
Credit: 1,017,229
RAC: 1,281
Message 62268 - Posted 17 Jul 2009 18:29:44 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=240483389

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 62474 - Posted 26 Jul 2009 18:33:53 UTC - in response to Message ID 62234.

Any ideas on what information I can supply that may help work towards a resolution?


BOINC version, Rosetta version (which is shown in the tasks tab), computing platform (Windows edition, Linux, Mac), any information you might have about when the problem does and does not occur, whether you display the graphic, and whether you use BOINC as your screensaver. Did the tasks complete normally and report back valid results with reasonable credit?

Those are the general questions I'd always ask. Bruce, in your case, is there anything unique about these 3 machines, compared to any others that you have experience with, that might explain why they see the problem and others do not? I'm guessing BOINC version perhaps?

Just so others are clear, in general, you would *prefer* that BOINC leave tasks in memory while preempted. It runs more efficiently that way. This is set up in the preferences. But what is being described here is tasks that are completed (i.e. not preempted) and are not leaving memory. And regardless of your preference, the program should free up all memory and BOINC slots when a task completes.


Hi,
Here's some additional information on my situation. While I describe the situation on one computer, the same situation exists on two others.
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4953&nowrap=true#62095


These 4 computers are capable of running only two tasks at a time, and I've had no problem with the applications staying in memory during their normal processing or while waiting for resources or pausing during usage. But there really isn't any reason why 25 WUs should remain in memory (yes, I've observed upwards of 25 jobs in memory; typically I begin to notice when around 10 Rosetta jobs are in memory). My observation over the last few years has been that tasks do exit when there was an error or when they've completed running, and at any given time I would only see two jobs running. I'm not referring to tasks remaining in memory during the normal course of waiting for resources. This behavior is not normal. That being said, here is the information requested:

Some tasks complete normally soon after a reboot, but within a day or so I begin seeing errored WUs. No credit is given for the errored WUs.

Common to all 4 computers:
BOINC: 6.6.36; Rosetta 1.80
All computers have all latest service packs and patches on operating systems and applications.
switch between apps every 200 minutes
use at most 100% processors
use at most 75% of CPU time
use at most 20gb HD space
use at most 50% memory when in use
use at most 90% memory when idle
Projects: rosetta@home (Resource Share:600); seti@home (Resource Share:75)


Bruce, Rosetta@home is known to have problems running properly when the CPU time percentage is set to less than 100%. It usually shows up as a lockfile problem.

Both my desktop computers seem to run properly with the CPU percentage set to 100%, since programs you start yourself normally get higher priority than BOINC workunits. However, I currently have one of them set to use only 95%, to help track down the lockfile problems.

Laptops and some computers with poor cooling cannot use 100% without overheating, though.
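
For anyone who keeps local preferences in BOINC's global_prefs_override.xml, the CPU time setting can be checked without opening the manager. This is only a rough sketch under stated assumptions: the helper names cpu_time_limit and is_throttled are made up for illustration, the sample XML simply mirrors the preferences Bruce listed, and preferences set on the project web site live elsewhere and would not be seen by this check.

```python
import xml.etree.ElementTree as ET

# Sample override file mirroring the preferences quoted above
# (200-minute switch interval, 100% of processors, 75% CPU time, etc.).
SAMPLE_PREFS = """
<global_preferences>
    <cpu_scheduling_period_minutes>200</cpu_scheduling_period_minutes>
    <max_ncpus_pct>100</max_ncpus_pct>
    <cpu_usage_limit>75</cpu_usage_limit>
    <disk_max_used_gb>20</disk_max_used_gb>
    <ram_max_used_busy_pct>50</ram_max_used_busy_pct>
    <ram_max_used_idle_pct>90</ram_max_used_idle_pct>
</global_preferences>
"""


def cpu_time_limit(prefs_xml):
    """Return the 'use at most N% of CPU time' value, or None if it is not set."""
    root = ET.fromstring(prefs_xml)
    node = root.find("cpu_usage_limit")
    if node is None or not (node.text or "").strip():
        return None
    return float(node.text)


def is_throttled(prefs_xml):
    """True when CPU time is limited below 100%, the setting suspected above."""
    limit = cpu_time_limit(prefs_xml)
    return limit is not None and limit < 100.0


if is_throttled(SAMPLE_PREFS):
    print("CPU time is limited to", cpu_time_limit(SAMPLE_PREFS), "percent")
```

To try it on a real machine, read the contents of global_prefs_override.xml from the BOINC data directory and pass the text to is_throttled in place of SAMPLE_PREFS.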


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC