Problems with Minirosetta 1.80

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 61979 - Posted: 28 Jun 2009, 1:46:01 UTC

Hi.

Another problem task, it seemed to be in a loop going nowhere.

I aborted it after 4 hrs, along with another of the same type.

real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13186_100_0.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=238909195

Model:0

Step:52800

pete.


ID: 61979 · Rating: 0
gazzawazza
Joined: 4 May 07
Posts: 28
Credit: 297,648
RAC: 0
Message 61982 - Posted: 28 Jun 2009, 8:49:23 UTC

Hi all.

I'm still getting the odd computation error (please see previous thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4933).

However, this has been the only WU failure since the 21st June 2009.

The symptoms are that a task repeatedly restarts (having exited with zero status but no 'finished' file), then when complete the output file is absent (or at least that's what's being reported in the BOINC client logs).

My other projects seem to be running without issue.

My current setup is BOINC 6.6.36 (running as a service) on Vista Home Premium SP2 (32-bit), running Rosetta 1.80.

I do have Kaspersky antivirus 2009 installed but real-time scanning was disabled for the entirety of the time that this latest WU was running for (I only mention this because I know that A/V progs have been implicated in other crunching problems e.g. files getting locked).


Regards,

Gary
ID: 61982 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 61983 - Posted: 28 Jun 2009, 9:01:09 UTC

Another problem task, it seemed to be in a loop going nowhere.


I have also had a real_core going in a loop to nowhere so I have aborted it.

261825496
ID: 61983 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,701,869
RAC: 2,154
Message 61989 - Posted: 28 Jun 2009, 20:30:11 UTC

I was randomly checking the tasks I have lined up and looked at the graphics for lb_cutback_all_multi_hb_t326__IGNORE_THE_REST_2GK3A_3_12956_21_0. The native structure and low energy structure windows were working fine, but none of the other windows showed any structures or plots. On one occasion the search window showed a graphic for a second or two. The display also says "stage unknown" for the kind of work it is doing. The line representations in the two working windows do move and change position. Also, the accepted energy value is not a number but 1.#QNAN, and the accepted RMSD shows 1.#QO.

Here is a screen shot:


ID: 61989 · Rating: 0
ByRad
Joined: 12 Apr 08
Posts: 8
Credit: 15,681,825
RAC: 194
Message 61990 - Posted: 28 Jun 2009, 21:16:47 UTC

BOINC Manager message:
2009-06-28 23:07:24 rosetta@home task lr_score12_snase_run02_rlbn_yfsong_3BDC-ASN100LYS_SAVE_ALL_OUT_NATIVE_NOCON_12975_3093_0 aborted by user

I aborted this task because after about 1.5 h of work it was still at 5.3% (normally it would be at about 40%). I then checked the graphics for this task: model:2, step:70. When I checked again about an hour later, it was still at model:2, step:70...
An infinite loop... (Normally I crunch 50 to 100 models in about 3 h!)
ID: 61990 · Rating: 0
WinterWasp
Joined: 16 Jun 09
Posts: 2
Credit: 11,905
RAC: 0
Message 61991 - Posted: 28 Jun 2009, 22:28:01 UTC

Is it normal that a task completes successfully, gets verified as OK, and is granted almost double the claimed credit even though its log is flooded with "not a number" and "value out of range" errors?
wRMSF_1_5_core_jumps_mixcst2_hb_t374__IGNORE_THE_REST_12929_921_1 is the task in question.
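
For readers wondering why "not a number" values are easy to miss: a NaN compares unequal to everything, including itself, so an ordinary range check does not catch it. The snippet below is only a minimal Python sketch of an explicit finiteness check. The function name `is_valid_energy` is hypothetical and is not taken from the Rosetta code.

```python
import math

def is_valid_energy(value):
    """Return True only for finite numbers (rejects NaN and +/- infinity).

    Hypothetical helper for illustration; not part of Rosetta or BOINC.
    """
    return math.isfinite(value)

nan = float("nan")

# NaN is never equal to itself, so "x == x" is a classic NaN test.
print(nan == nan)               # False
# Ordinary comparisons against NaN are all False, so range checks pass silently.
print(nan < 0.0, nan > 0.0)     # False False
print(is_valid_energy(nan))     # False
print(is_valid_energy(-123.4))  # True
```

A check like this would have to run on every accepted energy or RMSD value before it is logged or plotted; whether Minirosetta does anything similar is not stated in this thread.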
ID: 61991 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 61993 - Posted: 28 Jun 2009, 23:04:20 UTC

dom 28 giu 2009 22:30:23 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0 absent

This WU errored out after 8 hours of crunching (it was supposed to take 4).

To me it seems like the "real_core" ones have a fairly high failure rate...
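
To count how often this happens, the "Output file ... absent" lines can be pulled out of a saved copy of the client messages. The following is a rough Python sketch that assumes the pipe-separated layout of the line quoted above (timestamp, project, message); other BOINC versions may format messages differently, and the function name is made up for this example.

```python
import re

# Assumed message wording, copied from the log line quoted in this post.
ABSENT_RE = re.compile(r"^Output file (?P<file>\S+) for task (?P<task>\S+) absent$")

def parse_absent_output(line):
    """Return (project, task, output_file) if the line reports an absent
    output file, otherwise None. Assumes "timestamp|project|message"."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    _timestamp, project, message = parts
    match = ABSENT_RE.match(message.strip())
    if match is None:
        return None
    return project, match.group("task"), match.group("file")

line = (
    "dom 28 giu 2009 22:30:23 CEST|rosetta@home|Output file "
    "real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0_0 "
    "for task real_core_1.5_low200_beta_low200_start_hb_t322__IGNORE_THE_REST_13290_313_0 absent"
)
print(parse_absent_output(line))
```

Because the timestamp is ignored, the locale of the date (Italian here) does not matter to the parser.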
ID: 61993 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 61995 - Posted: 29 Jun 2009, 0:06:58 UTC

I've had a number of real_core_1.5_low200_beta_low200_start_ WUs go 4 hours past my runtime and they were presumably ended by the watchdog. They all claim 1 decoy and were marked invalid.

https://boinc.bakerlab.org/rosetta/result.php?resultid=261815816
https://boinc.bakerlab.org/rosetta/result.php?resultid=261768023
https://boinc.bakerlab.org/rosetta/result.php?resultid=261765649
https://boinc.bakerlab.org/rosetta/result.php?resultid=261722487
ID: 61995 · Rating: 0
Michael G.R.
Joined: 11 Nov 05
Posts: 264
Credit: 11,247,510
RAC: 0
Message 61997 - Posted: 29 Jun 2009, 5:08:30 UTC

Been getting errors on my Mac too with 1.80.
ID: 61997 · Rating: 0
xsc2
Joined: 9 Jul 08
Posts: 4
Credit: 62,354
RAC: 0
Message 62000 - Posted: 29 Jun 2009, 6:41:51 UTC

ID: 62000 · Rating: 0
Steve Dodd
Joined: 13 Dec 05
Posts: 7
Credit: 3,649,060
RAC: 0
Message 62005 - Posted: 29 Jun 2009, 12:38:50 UTC

Just adding to the rest of the comments here. I'm also experiencing issues with WUs that begin with "lb_cutback_all_multi...". It seems that the app is ignoring the preferences setting for maximum time per WU. Mine is set to 4 hours, but these are running over 8 hours and still going.
ID: 62005 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62006 - Posted: 29 Jun 2009, 12:47:51 UTC

Steve, thanks for the info. Just to clarify, the setting in the Rosetta preferences is not a maximum time per work unit. It is a target runtime. Having said that, the program checks periodically to ensure the task seems to be progressing normally, and at the end of each model it checks whether the runtime would allow another model or not.

The "watchdog" should take action on any task that runs longer than the runtime preference plus 4 hours. Since it doesn't waste time checking this constantly, it may take another 15 min. or so after that. So, your task just reached the point where the system should have taken action itself.

With all of these reports, it sounds like there are some new tasks that have lengthy models, and perhaps some new issues with the watchdog as well. Keep the details coming.
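
The two checks described above can be sketched roughly as follows. This is only an illustration of the policy as stated in this post, not the actual Minirosetta watchdog code: the function names are invented, the 15-minute polling interval is taken from the "15 min. or so" remark, and using the average model time to decide whether another model fits is an assumption.

```python
WATCHDOG_GRACE_S = 4 * 3600    # "runtime preference plus 4 hours"
WATCHDOG_INTERVAL_S = 15 * 60  # assumed polling interval ("15 min. or so")

def should_start_another_model(elapsed_s, target_runtime_s, avg_model_s):
    """End-of-model check: start a new model only if it is expected to
    finish within the target runtime (average model time is an assumption)."""
    return elapsed_s + avg_model_s <= target_runtime_s

def watchdog_should_stop(elapsed_s, target_runtime_s):
    """Periodic check: stop a task that has run past target + 4 hours."""
    return elapsed_s > target_runtime_s + WATCHDOG_GRACE_S

def latest_watchdog_action_s(target_runtime_s):
    """Worst-case runtime before the watchdog acts, given its polling interval."""
    return target_runtime_s + WATCHDOG_GRACE_S + WATCHDOG_INTERVAL_S

target = 4 * 3600  # a 4-hour runtime preference, as in Steve's report
print(should_start_another_model(3.5 * 3600, target, 1800))  # True
print(should_start_another_model(3.8 * 3600, target, 1800))  # False
print(watchdog_should_stop(8 * 3600 + 1, target))             # True
print(latest_watchdog_action_s(target) / 3600)                # 8.25
```

Under these assumptions, a task with a 4-hour preference that is still running well after 8 hours 15 minutes suggests the watchdog itself is not firing. A single very long model, by contrast, can push a task past the target without tripping either check until the 8-hour mark.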
Rosetta Moderator: Mod.Sense
ID: 62006 · Rating: 0
lusvladimir
Joined: 18 Oct 05
Posts: 12
Credit: 1,784,854
RAC: 0
Message 62011 - Posted: 29 Jun 2009, 17:04:38 UTC
Last modified: 29 Jun 2009, 17:08:12 UTC

ID: 62011 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 62013 - Posted: 29 Jun 2009, 19:57:54 UTC

task 262080735 ended after 16 hours, which happens to be my cpu_run_time_pref + 4 hours.
And then there was a <file_xfer_error>.

BOINC:: CPU time: 57669.2s, 14400s + 43200s[2009- 6-29 14:48:55:] :: BOINC 
Output exists: default.out.gz
InternalDecoyCount: 0 (GZ)
======================================================
DONE ::     1 starting structures  57670.2 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
  <file_name>real_core_1.5_low200_beta_low200_start_hb_t286__IGNORE_THE_REST_13040_508_0_0</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
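
The numbers in the "BOINC:: CPU time" line above are consistent with the watchdog rule. Reading 14400s as the 4-hour grace period and 43200s as a 12-hour runtime preference is an interpretation of the log, not something it states; under that reading the arithmetic works out as in this short Python check.

```python
cpu_time_s = 57669.2      # "CPU time: 57669.2s" from the log
grace_s = 14400           # assumed: the 4-hour watchdog grace period
run_time_pref_s = 43200   # assumed: a 12-hour cpu_run_time_pref

limit_s = grace_s + run_time_pref_s
overrun_s = cpu_time_s - limit_s

print(limit_s / 3600)       # 16.0 hours, matching "ended after 16 hours"
print(round(overrun_s, 1))  # 69.2 seconds past the limit when it was stopped
```

So the task appears to have been stopped about a minute after crossing the limit, and the -161 file transfer error came afterwards.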

ID: 62013 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 62014 - Posted: 29 Jun 2009, 21:47:54 UTC

This one only ran for 1 sec, and has errored for others.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=237150212

calbindin_BOINC_ABRELAX_4xBIN_1xCYCLES_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--calbindin-_12935_707_2

ID: 62014 · Rating: 0
Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 62016 - Posted: 30 Jun 2009, 7:55:49 UTC

The next task:
real_core_1.5_low200_beta_low200_start_hb_t308__IGNORE_THE_REST_13046_407_0
Didn't switch to another application after 1 hour – ran on for over 7 hours.
Didn't stop after the runtime preference of 6 hours – was ended by the watchdog after 10 hours.
Didn't checkpoint regularly – on rebooting after 9 hours of runtime, the WU restarted from 2 hours of runtime.
The good thing: Outcome: Success.

Path7.
ID: 62016 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 62017 - Posted: 30 Jun 2009, 8:41:37 UTC
Last modified: 30 Jun 2009, 8:44:45 UTC

Two more real_core tasks ran far past the 4-hour target. Both ended after 8 hours, one successfully and the other with an error:

mar 30 giu 2009 09:51:33 CEST|rosetta@home|Output file real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0_0 for task real_core_1.5_low200_beta_low200_start_hb_t368__IGNORE_THE_REST_13036_638_0 absent

Error is always code -161
ID: 62017 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 62018 - Posted: 30 Jun 2009, 9:32:22 UTC

Another real_core with strange behaviour:

This is when I turned the PC on this morning (54% completed because it ran yesterday for some hours)



And this is 2 minutes later (5%, because it somehow reset itself, including the CPU time)



By the way, it's now at 6% after 37 minutes, which means it will need roughly 16 x 37 minutes to reach 100%. That is more than 8 hours, when the target time is set to 4.
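
The estimate above is a straight-line extrapolation from the progress bar. As a small Python sketch, it assumes progress grows linearly with time, which the BOINC progress figure does not guarantee; the function name is made up for this example.

```python
def estimate_total_minutes(elapsed_min, percent_done):
    """Linear extrapolation: total time = elapsed / fraction done.

    Assumes progress is proportional to elapsed time, which may not hold.
    """
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_min * 100.0 / percent_done

total_min = estimate_total_minutes(37, 6)
print(round(total_min))          # 617 minutes
print(round(total_min / 60, 1))  # 10.3 hours
```

The factor 100 / 6 is about 16.7, which is where the "16 x 37 minutes" figure comes from.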

I'm getting these errors on both my laptop (Core2Duo 7700, Vista Home Premium, BOINC 6.4.5) and my desktop (AMD 3800x2, Ubuntu 9.04 64-bit, BOINC 6.6.28).

ID: 62018 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 62020 - Posted: 30 Jun 2009, 13:18:32 UTC

Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors into account, including the speed of your machine and the time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the runtimes of the last several tasks as a frame of reference.

If you restart a task, you should be looking at the elapsed time change as the indication of what checkpoint (if any) the task was able to restart from.
Rosetta Moderator: Mod.Sense
ID: 62020 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 62021 - Posted: 30 Jun 2009, 14:28:32 UTC - in response to Message 62020.  

Folks, please do not take % complete and time to completion as any indication of a Rosetta problem. It is simply an estimate that your BOINC manager is making. This takes a number of factors into account, including the speed of your machine and the time it took your last task to complete. So if your last task ran long, the % on the next task MAY (or MAY NOT) reflect that, or part of that information. BOINC tries not to presume all tasks are the same and sometimes looks at the runtimes of the last several tasks as a frame of reference.

If you restart a task, you should be looking at the elapsed time change as the indication of what checkpoint (if any) the task was able to restart from.


Agreed, but I think I have enough experience to tell when there is a problem and when there is not.

I'll lay out some more details:

1) The WU arrived yesterday at 12.52.
2) All of my WUs are started within a few hours of their arrival, because I don't keep any cache and the PC is set to always connected.
3) Therefore, that WU started being crunched yesterday in the middle of the afternoon.
4) I turned off the PC for the night when that WU had reached 54% completion (yes, I'm a nerd and I check how work is going on my PC).
5) I restarted it today and saw that WU being crunched but making no progress.
6) I checked the graphics and saw nothing (see image #1 in my previous post).
7) I waited a few minutes and saw the WU's percentage drop to 5%. I checked the CPU time and it said 25 minutes (while it had run for hours the day before).
8) I reported it in this thread.

Also

9) The WU is still running and the percentage is increasing, but it is long overdue. It should have taken 4 hours; it's already at 5 1/2 and the progress bar indicates 55.22%. As you can see, I (and BOINC) made a fairly accurate prediction, because at this speed it will end in 9 hours. Of course the watchdog will kill it after 8, but hey, not that I can do anything about it.
10) I am trying to view the graphics for that WU, but the window pops up without syncing to the WU. The graphics window hangs and I have to terminate it from the Task Manager.

So now

11) I'm going to let that WU run to completion and hope that you will find something useful in the output, be it for medicine or for the improvement of the application.

P.S. Oh and about the checkpoint thing: the elapsed time for that WU changed from 5 hours to 25 minutes. Is it meant to be this way?
ID: 62021 · Rating: 0