Problem with WU?

tng*

Joined: 28 Oct 05
Posts: 14
Credit: 5,389,798
RAC: 0
Message 2081 - Posted: 2 Nov 2005, 19:08:01 UTC
Last modified: 2 Nov 2005, 19:21:53 UTC

I happened to notice that a WU has used almost 5 hours of CPU time (as opposed to a previous maximum of about 2) and has spent at least half an hour at 100% complete. On inspection, I see that this one was reissued to me after another machine missed the deadline.

I tried exiting BOINC and restarting it, but it's still the same.

Here's the WU:

141080

Machine is a 1 GHz P3, XP SP2, BOINC 5.2.2.

I'm new to this project, but from what I can tell this isn't normal, and the fact that somebody else has already missed the deadline on this one causes me some concern (although that host has only completed 1 WU, the user's host list leads me to believe that one of his systems could be stuck on the same WU for a month and be overlooked).

If it's like the other WU issued to that machine at the same time, this one will take longer than the others my machine has run, but I'm still concerned.

I would appreciate feedback on whether this sort of behavior is normal, and if not, should I just kill that WU or let it go so somebody can investigate?

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 2094 - Posted: 2 Nov 2005, 20:21:43 UTC

See this thread

tng*
Message 2095 - Posted: 2 Nov 2005, 20:23:47 UTC - in response to Message 2094.  

See this thread


Leave in memory is set.

David E K
Message 2096 - Posted: 2 Nov 2005, 20:34:13 UTC

This is one of the larger WUs. Give it another hour, and if it is still stuck, email the stdout.txt file to me at dekim at u.washington.edu

tng*
Message 2105 - Posted: 2 Nov 2005, 22:54:36 UTC - in response to Message 2096.  

This is one of the larger WUs. Give it another hour, and if it is still stuck, email the stdout.txt file to me at dekim at u.washington.edu


It finally finished (6:16:47 CPU, at least 1:45 at 100%). Next time I'll know to be more patient.

David E K
Message 2107 - Posted: 3 Nov 2005, 0:15:08 UTC - in response to Message 2105.  

It finally finished (6:16:47 CPU, at least 1:45 at 100%). Next time I'll know to be more patient.


It normally shouldn't get stuck at 100% for that long, but there is a bug in the app that may cause this if the job is restarted when it is almost finished.

Tern

Joined: 25 Oct 05
Posts: 576
Credit: 4,695,359
RAC: 10
Message 2109 - Posted: 3 Nov 2005, 0:39:31 UTC - in response to Message 2107.  
Last modified: 3 Nov 2005, 0:43:26 UTC

It normally shouldn't get stuck at 100% for that long, but there is a bug in the app that may cause this if the job is restarted when it is almost finished.


I have one that was at 14 hours or so (slow G3 iBook) this morning, and at 100%. Last night it was at 91% and 10 hours. I come back now and it's still at 100%, showing 17 hours, preempted by SETI at the moment. Apps are left in memory, so it shouldn't be getting reset, and "switch between" is an hour. I've suspended SETI, and I'll give it another couple of hours... it's "1btn__abrelax_10472_1" if that helps. WU id 44502, which also was not returned by another user.

Tern
Message 2115 - Posted: 3 Nov 2005, 3:48:58 UTC - in response to Message 2109.  

it's still at 100%, showing 17 hours


Now at 19:42, still at 100%... stdout has several "warnings" in it, tons of data, nothing meaningful or particularly enlightening.

Tern
Message 2128 - Posted: 3 Nov 2005, 7:42:53 UTC - in response to Message 2115.  

it's still at 100%, showing 17 hours


Now at 19:42, still at 100%... stdout has several "warnings" in it, tons of data, nothing meaningful or particularly enlightening.


Okay... this WU started on 10/29, is now at 23 hours, and has been showing 100% for the last 9 hours! The stdout file is still being written to, so it still seems to be actually accomplishing something, but this is getting ridiculous. For the last four hours it had the CPU all to itself; I've now resumed SETI, so at least that computer can be accomplishing something. I hate to abort the Rosetta result and lose a full day's work... is there any way to tell if/when this thing will ever finish?

The Windows box crunches along happily... only the two Macs have had problems...

David E K
Message 2134 - Posted: 3 Nov 2005, 8:29:41 UTC

Can you email me the stdout.txt file? dekim at u.washington.edu

Tern
Message 2141 - Posted: 3 Nov 2005, 13:11:22 UTC - in response to Message 2134.  

Can you email me the stdout.txt file?


It's on its way!

Tern
Message 2307 - Posted: 5 Nov 2005, 0:28:02 UTC

Just to report the final outcome - David K took a look and said it was still progressing, so I let it run, and WU 44502 finally completed after 114,203.89 seconds (31.7 hours, the last half of which it spent showing 100%) and got 122.97 credits...
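
For anyone who wants to double-check the conversion, here is a minimal Python sketch that uses only the two figures reported above (114,203.89 CPU seconds and 122.97 credits). The function names are made up for illustration and are not part of BOINC or the Rosetta app.

```python
# Hypothetical helpers for illustration only; not a BOINC or Rosetta API.

def cpu_seconds_to_hours(seconds: float) -> float:
    """Convert CPU time in seconds to hours."""
    return seconds / 3600.0


def credits_per_cpu_hour(credits: float, seconds: float) -> float:
    """Granted credit divided by CPU hours spent."""
    return credits / cpu_seconds_to_hours(seconds)


cpu_seconds = 114_203.89   # CPU time reported for WU 44502
granted_credit = 122.97    # credit reported for WU 44502

hours = cpu_seconds_to_hours(cpu_seconds)
rate = credits_per_cpu_hour(granted_credit, cpu_seconds)

print(f"{hours:.1f} hours")                 # 31.7 hours
print(f"{rate:.2f} credits per CPU hour")   # 3.88 credits per CPU hour
```

That works out to a little under 4 credits per CPU hour on this machine for this particular result; it says nothing about how credit is normally granted.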




©2024 University of Washington
https://www.bakerlab.org