Report long-running models here

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58532 - Posted: 5 Jan 2009, 18:17:43 UTC

AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it... unless you have no other tasks to work on while

Idan, what is the normal runtime preference for the host that is running that task? (note to self, why hasn't watchdog ended it? If pref. is <24hrs)
Rosetta Moderator: Mod.Sense

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 58543 - Posted: 5 Jan 2009, 21:06:44 UTC - in response to Message 58532.  

AdeB, since that single model has been running longer than 6 hours, I would suggest you abort it...

After another crash the task has been aborted.

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58557 - Posted: 6 Jan 2009, 8:22:08 UTC

The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :(

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58558 - Posted: 6 Jan 2009, 8:23:54 UTC

The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :(

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 58608 - Posted: 7 Jan 2009, 11:25:16 UTC - in response to Message 58558.  

The WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 finally finished after more than 66 hours only to get 80 credits... quite disappointing... :(


Well, this is the old problem of the dog that did nothing in the nighttime, as documented in the watchdog message:

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 238970 seconds. Greater than 3X preferred time: 10800 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>

3X 10800 seconds = 32400 seconds, why did the watchdog need 238970 seconds to interfere?
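
The arithmetic behind that question can be checked with a short sketch. The helper name `watchdog_overrun` and its default factor are illustrative assumptions based only on the "3X preferred time" wording in the log above; this is not Rosetta's actual watchdog code.

```python
def watchdog_overrun(cpu_seconds, preferred_seconds, factor=3):
    """Return (limit, overrun ratio) for a watchdog expected to fire at
    factor * preferred_seconds. Hypothetical helper, not Rosetta code."""
    limit = factor * preferred_seconds
    return limit, cpu_seconds / limit

# Values copied from the stderr output quoted above.
limit, ratio = watchdog_overrun(238970, 10800)
hours = 238970 / 3600
print(f"limit: {limit} s, actual: {hours:.1f} h, {ratio:.1f}x the limit")
```

Under that reading of the log, the task ran roughly seven times longer than the point at which the watchdog should have stepped in.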

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58612 - Posted: 7 Jan 2009, 13:01:22 UTC - in response to Message 58532.  
Last modified: 7 Jan 2009, 13:06:50 UTC

Idan, what is the normal runtime preference for the host that is running that task? (note to self, why hasn't watchdog ended it? If pref. is <24hrs)


The computer's working on BOINC 24/7... better watch that watchdog... :)
I have another WU with 40+ hours running right now, I'll let it finish to see if it gets the same watchdog message...

Whoops, looks like it just finished, as you can see: HERE.
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 183132 seconds. Greater than 3X preferred time: 10800 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>

Same watchdog being late again, now only after 50 hours... gave me 80 credits... which is 1.6 credits per hour... I think you should change your credit system a bit... :P
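
For anyone who wants to pull these numbers out of a result page automatically, here is a minimal sketch. It assumes the watchdog line always has the exact wording shown in the stderr output above; the function name `parse_watchdog` is made up for illustration, and the credit figure is the one reported in this post.

```python
import re

# Pattern for the watchdog line as it appears in the stderr output above.
WATCHDOG_RE = re.compile(
    r"CPU time: (\d+) seconds\. Greater than (\d+)X preferred time: (\d+) seconds"
)

def parse_watchdog(stderr_text):
    """Return (cpu_seconds, factor, preferred_seconds), or None if absent."""
    match = WATCHDOG_RE.search(stderr_text)
    if match is None:
        return None
    cpu, factor, preferred = (int(group) for group in match.groups())
    return cpu, factor, preferred

log = """Rosetta is going too long. Watchdog is ending the run!
CPU time: 183132 seconds. Greater than 3X preferred time: 10800 seconds"""

cpu, factor, preferred = parse_watchdog(log)
cpu_hours = cpu / 3600
credits = 80  # granted credit as reported in the post
print(f"{cpu_hours:.1f} h of CPU time, {credits / cpu_hours:.2f} credits per hour")
```

The result is in line with the rough "1.6 credits per hour" estimate given above.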

pramo
Joined: 21 Oct 05
Posts: 4
Credit: 303,087
RAC: 0
Message 58617 - Posted: 7 Jan 2009, 14:50:40 UTC


https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198269130

70+ hours, 9+ hours to go, deadline in 7 hours

I killed it

pramo
Joined: 21 Oct 05
Posts: 4
Credit: 303,087
RAC: 0
Message 58618 - Posted: 7 Jan 2009, 14:56:50 UTC - in response to Message 58617.  


https://boinc.bakerlab.org/rosetta/workunit.php?wuid=198269130

70+ hours, 9+ hours to go, deadline in 7 hours

I killed it


don't know if this helps, or what you might want to look at on my end. as you can see, a long time member and first time poster:)

stderr.txt was a bunch of this-
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600

stdout.txt was this-
[2008-12-29 6:19:24:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
.
.
.
<snip>
.
.
.
[2009- 1- 7 5:23:12:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
[2009- 1- 7 8:40: 5:] :: BOINC :: boinc_init()
Created shared memory segment
Created semaphore
Starting watchdog...
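
Each `boinc_init()` line presumably marks the application starting up again, so counting them gives a rough idea of how often the task was restarted. A minimal sketch, assuming the timestamp layout seen in the stdout.txt excerpt above (the name `find_starts` is hypothetical, and only the three entries visible in the post are used as sample data):

```python
import re
from datetime import datetime

# Matches the timestamped boinc_init() lines from stdout.txt. Fields may be
# space-padded, as in "[2009- 1- 7 8:40: 5:]".
INIT_RE = re.compile(
    r"\[(\d{4})-\s*(\d+)-\s*(\d+)\s+(\d+):\s*(\d+):\s*(\d+):\]"
    r" :: BOINC :: boinc_init\(\)"
)

def find_starts(stdout_text):
    """Return one datetime per boinc_init() line, in file order."""
    return [
        datetime(*(int(field) for field in match.groups()))
        for match in INIT_RE.finditer(stdout_text)
    ]

# Only the three entries visible in the post; the snipped ones are omitted.
sample = """[2008-12-29 6:19:24:] :: BOINC :: boinc_init()
Created shared memory segment
[2009- 1- 7 5:23:12:] :: BOINC :: boinc_init()
Starting watchdog...
[2009- 1- 7 8:40: 5:] :: BOINC :: boinc_init()
"""

starts = find_starts(sample)
print(f"{len(starts)} starts between {starts[0]} and {starts[-1]}")
print(f"wall-clock span: {starts[-1] - starts[0]}")
```

A large number of starts relative to the CPU time accumulated would suggest the task kept restarting before reaching a checkpoint.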

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 58661 - Posted: 7 Jan 2009, 23:11:12 UTC

I aborted lr5_score12_rlbd_2fls_IGNORE_THE_REST_DECOY_5559_1293_1 after running for more than 30 hours.

AdeB

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58665 - Posted: 8 Jan 2009, 10:05:42 UTC

I would really like to know if it would be worth it to let the WU finish by itself or should we just abort it after 25-30 hours?

Do you look at the problems reported by the long-running models? Does it contribute anything to let them run, or not? If it does contribute, I'll happily keep running them to the bitter end, but if not, I'd like to do some real science-helping work... :)

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 58669 - Posted: 8 Jan 2009, 11:39:36 UTC - in response to Message 58665.  
Last modified: 8 Jan 2009, 11:39:54 UTC

I would really like to know if it would be worth it to let the WU finish by itself or should we just abort it after 25-30 hours?

Do you look at the problems reported by the long-running models? Does it contribute anything to let them run, or not? If it does contribute, I'll happily keep running them to the bitter end, but if not, I'd like to do some real science-helping work... :)


not sure what your run time is, but if it is way over the maximum time, you might as well abort it, as you won't get all that much credit for it anyway. you will regain your loss with the better running tasks.

be sure to post a link to the specific task, so the team knows which task it was that ran so long.

Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58672 - Posted: 8 Jan 2009, 11:57:24 UTC

Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points.

I would like to know if it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or does it not matter to you whatsoever?

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 58674 - Posted: 8 Jan 2009, 12:03:39 UTC - in response to Message 58672.  

Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points.

I would like to know if it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or does it not matter to you whatsoever?


If the task is done, but ran over your time settings by more than 3x, be sure to post the link here so that the team can have a look at the task. Same thing if you aborted the task. They will take any results that have been completed before the abort or the watchdog termination, if the task uploads.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58678 - Posted: 8 Jan 2009, 14:04:59 UTC - in response to Message 58672.  

Thanks for the reply, I was talking about the scientific value of letting the WU run for a long time until it aborts itself, not about the points.

I would like to know if it is important for you guys to see that the watchdog stopped the application later than 3 times the time it was supposed to run, or does it not matter to you whatsoever?


The main benefit to the science is going to be in understanding how to eliminate the long-running models. Not specifically in the result that yours produces.

It is certainly useful for the Project Team to see the completed result (and/or a report in this thread). Otherwise it just looks like many other "user aborted" tasks, and does not lead one to enter "track down problem" mode. Or to suspect there are specific reasons people aborted the task.

At this point, due to the reports in this thread, some of the causes for long-running models have been identified and coding changes are in the works to resolve them.

So at this point, I'd suggest following the guidelines as originally posted in this thread: abort the task, report its behavior, and continue on.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 58687 - Posted: 8 Jan 2009, 16:51:05 UTC

From my "run time is set for 4 hrs but tasks are running 6 hrs" thread:

My settings are for 4 hrs run times, but these specific tasks ran 6 hrs without me touching a thing.

abinitio_norelax_homfrag_129_B_4icbA_SAVE_ALL_OUT_4626_4561_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675744

abinitio_norelax_homfrag_129_B_5croA_SAVE_ALL_OUT_4626_4561_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675745

abinitio_norelax_homfrag_129_B_1a8oA_SAVE_ALL_OUT_4626_4562_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=218675747

credit was granted, 2 out of 3 cases were higher than claimed and the other was lower than claimed. all completed ok.

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 58688 - Posted: 8 Jan 2009, 19:53:35 UTC

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 58734 - Posted: 11 Jan 2009, 19:29:42 UTC - in response to Message 58688.  
Last modified: 11 Jan 2009, 19:31:03 UTC

robertmiles
Joined: 16 Jun 08
Posts: 1222
Credit: 13,725,055
RAC: 3,084
Message 58745 - Posted: 12 Jan 2009, 1:18:18 UTC - in response to Message 58734.  
Last modified: 12 Jan 2009, 1:21:59 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=219050690

https://boinc.bakerlab.org/rosetta/results.php?hostid=267483


Looks like you've got a combination of problems.

1. Using a machine where the time between interrupts for other work is less than the time between checkpoints, without also using the leave in memory option (which may also need an increase in the upper limit of disk space allowed and the percentage of the swap space allowed to be effective without causing problems).

2. A recently completed workunit (although with an error) which took such a short time that the software overestimates how many workunits your machine can complete within the queue length you have selected.

3. A queue length so long relative to the deadline length that your machine has problems catching up when you get too many jobs at once.

4. A group of workunits from a recent batch with poor estimates of how long they will run.

I'm not sure what to do about it except to abort some of the workunits in the queue that your machine hasn't started working on yet, before they reach the deadline.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58749 - Posted: 12 Jan 2009, 3:46:07 UTC

You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started.

But this makes the rest of future scheduling a mess. So, aborting a few may be easier. But it looks like you've got plenty of time to complete them. There are only three days of work there if your machine is crunching Rosetta all the time.

I think that particular task must have had a long running model without checkpoint. Or perhaps the machine was restarted several times in a row to install service packs etc. while this task was in progress.
Rosetta Moderator: Mod.Sense

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 58755 - Posted: 12 Jan 2009, 13:34:53 UTC - in response to Message 58749.  

You can also reduce your runtime preference and then update to the project. The new preference is applied to your existing tasks, even if they've already started.


Is it still true that running WUs use the new runtime? I recently changed my runtime, and it seemed to me that the rosetta 5.98 WUs that were running did indeed use the new runtime, but the running minirosetta WUs did not. They seemed to use the preference that was set when they started.

I didn't try stopping and restarting BOINC. That might have forced minirosetta to read the new runtime when it restarted.




©2024 University of Washington
https://www.bakerlab.org