When is a wu stuck?

Message boards : Number crunching : When is a wu stuck?

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,662,550
RAC: 720
Message 38492 - Posted: 27 Mar 2007, 18:38:45 UTC
Last modified: 27 Mar 2007, 18:46:02 UTC

I have this wu running on one of my machines, and 2 others of the same type running on other machines which I have no access to at the moment. All 3 machines have a 3-hour preference set.

On this machine, the wu is still at 1% complete after >4.5 hours. I am well aware that the program will run at least 1 model, and that as a result, sometimes a wu will run longer than 3 hours.

The thing is, I have no indication at all that this wu is doing anything. There are no files in the project directory getting updated. I am quite happy to let it run if it is doing anything positive of course.

How long is reasonable to let this run?

*** EDIT ***

Sod's law! It finished a couple of minutes after I typed the above, after 4:36:00!

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<stderr_txt>
# random seed: 3342067
# cpu_run_time_pref: 10800
======================================================
DONE :: 1 starting structures built 30 (nstruct) times
This process generated 1 decoys from 1 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>


Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 38500 - Posted: 27 Mar 2007, 20:19:37 UTC

I have a similar one running now. It has been going for about 2 hours 43 min and is at step 34000.
Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,708,840
RAC: 1,920
Message 38501 - Posted: 27 Mar 2007, 20:36:47 UTC
Last modified: 27 Mar 2007, 20:38:17 UTC

i've got that one queued in my system, but i will move it up ahead of 3 other wu's to see what it does. it should start running some time after 10am CET tomorrow.
B-Roy

Joined: 26 Sep 05
Posts: 26
Credit: 46,121
RAC: 24
Message 38520 - Posted: 28 Mar 2007, 12:48:32 UTC

i just have one at 4:11 with Model: 1 and Step: 326500 (still showing 1%).
I think that for slow crunchers like me, this is a potential problem considering that the wu does not checkpoint; due to this I lost 2h of crunching yesterday, when I turned off my PC with the same wu.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38524 - Posted: 28 Mar 2007, 13:05:15 UTC
Last modified: 28 Mar 2007, 13:08:48 UTC

B-roy, yes if your machine is on the slow side, it can take considerable time to reach the completion of that first model, at which time the task will have reportable results. Rosetta will then evaluate your runtime preference, and decide that is all your machine should crunch on that task. It will then skip to 100% completed and report in.
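As a rough illustration of the behaviour described above, here is a minimal Python sketch of how such a decision could work. The function name, its parameters, and the exact rule (start another model only if the average time per model so far still fits within the preferred run time) are assumptions made for illustration, not Rosetta's actual code.

```python
def should_start_another_model(elapsed_s, models_done, pref_s):
    """Hypothetical rule, not Rosetta's real logic.

    elapsed_s   -- CPU seconds spent on the task so far
    models_done -- number of models (decoys) completed so far
    pref_s      -- the user's preferred run time in seconds
    """
    # At least one model is always run, whatever the preference says.
    if models_done == 0:
        return True
    # Assumption: estimate the next model's cost from the average so far.
    average_per_model = elapsed_s / models_done
    return elapsed_s + average_per_model <= pref_s
```

With a 3-hour preference (10800 s) and a first model that took 4:36:00 (16560 s), this rule would stop after one model, which matches the "1 decoys from 1 attempts" line in the stderr output quoted earlier in the thread.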

I don't believe it is accurate to say that there are no checkpoints. However, your point about how it is possible to lose 2hrs of crunching is clear, and illustrates the need for more checkpoints. Rhiju posted just this weekend that they are evaluating how to best address this concern.

There are other cases where improved checkpointing will help preserve completed work and increase the project TFLOPs. One such case is when someone runs several projects, as you are probably doing as well.
Rosetta Moderator: Mod.Sense
B-Roy

Joined: 26 Sep 05
Posts: 26
Credit: 46,121
RAC: 24
Message 38526 - Posted: 28 Mar 2007, 14:34:27 UTC

thanks for the quick reply. Is there actually a fixed number of steps for each model? I am at 446000 and counting, so I wonder whether I could estimate when it will end, before having to shut down the computer for the night again.


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 38530 - Posted: 28 Mar 2007, 16:22:23 UTC

I cannot tell you an exact number of steps. It varies. 400,000 is in line with normal for many tasks. In fact, it is generally very near the end.

The main point is not to let the 1% indication throw you. It's not completed the first model yet, so it doesn't have the % complete calculated.

Also, just be aware that if you didn't reach a checkpoint, and power off your machine for the day and have the same situation again, where it doesn't reach a checkpoint before you must power off... if this task does that 5 times, then Rosetta will end it and get another. Most tasks take less time than that to complete each model, and so the next task will then run better with how you are using your machine. So, it's all built in to detect such a situation and to resolve it for you.
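The restart limit described above can be sketched as a simple counter. This is a minimal Python illustration under stated assumptions: the class and function names are hypothetical, and resetting the counter whenever a checkpoint is written is a guess; only the limit of 5 comes from the post.

```python
from dataclasses import dataclass

# Limit taken from the post above; everything else here is illustrative.
MAX_RESTARTS_WITHOUT_CHECKPOINT = 5


@dataclass
class TaskState:
    restarts_without_checkpoint: int = 0


def record_checkpoint(state):
    """Assumption: writing a checkpoint clears the restart counter."""
    state.restarts_without_checkpoint = 0


def record_restart(state):
    """Call when the task restarts without having saved any new progress.

    Returns "abort" once the limit is reached, otherwise "run".
    """
    state.restarts_without_checkpoint += 1
    if state.restarts_without_checkpoint >= MAX_RESTARTS_WITHOUT_CHECKPOINT:
        return "abort"
    return "run"
```

In this sketch a task that is powered off before its first checkpoint on five separate occasions is abandoned on the fifth restart, so the client can fetch a different task.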
Rosetta Moderator: Mod.Sense
Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,708,840
RAC: 1,920
Message 38532 - Posted: 28 Mar 2007, 18:06:02 UTC - in response to Message 38492.  
Last modified: 28 Mar 2007, 18:07:38 UTC

see here for a similarly named work unit. It completed on my computer with 3 decoys and 30 nstruct

I had mine run 8 hours which i do for all WU's
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 38537 - Posted: 28 Mar 2007, 20:44:27 UTC - in response to Message 38532.  
Last modified: 28 Mar 2007, 20:45:37 UTC

Hi everyone:

Over on ralph, we are testing a new app (5.55) that updates the "percentage complete" in a more reasonable way. Basically, we are following suggestions posted by the users and forum moderators -- we are incrementing the percentage complete every second by an amount scaled so that 100% would correspond to 4 times the user's preferred CPU run time (this is the max time allowed for any workunit). Once each decoy is completed, the % complete is updated (usually jumps up) to a more accurate value! Hopefully this will help prevent some of the confusion for new users!
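The time-based part of that scheme can be written down directly. The Python sketch below is only an illustration: the function name and parameters are hypothetical, and the jump to a "more accurate value" after each decoy is left out because the post does not say how that value is computed.

```python
def time_based_fraction_done(elapsed_s, pref_s, max_factor=4):
    """Fraction complete while a decoy is still running.

    elapsed_s  -- CPU seconds used so far
    pref_s     -- the user's preferred CPU run time in seconds
    max_factor -- 100% corresponds to max_factor * pref_s (4 in the post)
    """
    fraction = elapsed_s / (max_factor * pref_s)
    # Never report more than 100%.
    return min(fraction, 1.0)


# 4.5 hours into a task with a 3-hour (10800 s) preference:
print(time_based_fraction_done(16200, 10800))  # prints 0.375
```

Under this scheme, the work unit from the first post would have shown roughly 38% when it finished at 4:36:00 (16560 s) instead of sitting at 1%.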

[We are also working on more frequent checkpointing, but this turns out to be more challenging -- expect progress over the next two weeks.]

Moderators, can you spread the news to the other threads where this question is being discussed?






©2024 University of Washington
https://www.bakerlab.org