Never-ending WU?

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9351 - Posted: 19 Jan 2006, 15:49:56 UTC
Last modified: 19 Jan 2006, 15:50:31 UTC

I've been crunching this WU (BARCODE_FRAG_30_1dtj_234_976_0) for over 10 hours on a G4 @ 867MHz. I just checked BOINC Manager to see how far it had gotten, and the CPU time it now reports is only 8 hours. Throughout all ten hours, the "to completion" column has been reading "0:50:00" and then increasing steadily to "1:45:00" over a period of two hours.

A few questions:
(1) Will this WU never end?
(2) Can anyone explain the rollback on the CPU time?
(3) Should I send this WU to meet its binary maker?
(4) What's the usual runtime?

TIA

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 9352 - Posted: 19 Jan 2006, 15:52:15 UTC

(2) That can happen if it was pre-empted by another project and you have not set applications to remain in memory, or if you rebooted...

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 9353 - Posted: 19 Jan 2006, 15:53:12 UTC

...also, what is the % complete figure?

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9358 - Posted: 19 Jan 2006, 16:13:49 UTC - in response to Message 9353.  
Last modified: 19 Jan 2006, 16:15:39 UTC

Currently 08:36:45 at 90%... where it's been for at least an hour or so... maybe two?...

I'm polling client_state.xml every five min for %done via cron... gimmie a few minutes and I'll post the contents.
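
For anyone who wants to do the same kind of polling, here is a minimal sketch in Python of pulling those three values out of client_state.xml. It assumes the file keeps each running task in an `<active_task>` element with `<result_name>`, `<checkpoint_cpu_time>`, `<current_cpu_time>` and `<fraction_done>` children; check your own file, since the layout may differ between BOINC versions. The sample XML is hand-made for illustration (the numbers are the last row of the log posted further down), and `read_active_tasks` is a hypothetical helper name, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def read_active_tasks(xml_text):
    """Return one dict per <active_task> element found in the XML text."""
    root = ET.fromstring(xml_text)
    tasks = []
    for task in root.iter("active_task"):

        def num(tag):
            # Missing tags become None instead of raising an error.
            text = task.findtext(tag)
            return float(text) if text is not None else None

        tasks.append({
            "result_name": task.findtext("result_name"),
            "checkpoint_cpu_time": num("checkpoint_cpu_time"),
            "current_cpu_time": num("current_cpu_time"),
            "fraction_done": num("fraction_done"),
        })
    return tasks


# Hand-made sample (assumed layout), not a real client_state.xml.
SAMPLE = """<client_state>
  <active_task_set>
    <active_task>
      <result_name>BARCODE_FRAG_30_1dtj_234_976_0</result_name>
      <checkpoint_cpu_time>30243.11149</checkpoint_cpu_time>
      <fraction_done>0.9</fraction_done>
      <current_cpu_time>32374.22245</current_cpu_time>
    </active_task>
  </active_task_set>
</client_state>"""

for t in read_active_tasks(SAMPLE):
    print(t["result_name"], t["checkpoint_cpu_time"],
          t["current_cpu_time"], t["fraction_done"])
```

From cron you would read the real file into a string and pass it to the function in place of `SAMPLE`; where that file lives depends on your installation.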

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 9359 - Posted: 19 Jan 2006, 16:19:40 UTC

I had a WU today that took over nine hours on my AMD 2800+, so there are some big ones out there

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9368 - Posted: 19 Jan 2006, 17:08:18 UTC - in response to Message 9359.  

Sorry for the delay in the response... It turns out I've got far more data than I had anticipated to sift through.

This is the log I've got from Rosetta crunching the WU I mentioned earlier. The columns are the date and time of the log entry, the checkpoint CPU time (seconds since the WU began?), the current CPU time, and the "frac_done" (a fraction, so 0.9 is the 90% that BOINC Manager shows).

Date       Time     Checkpt.       CPU Time       frac_done

2006/01/18 21:09:00 18863.91001    21622.17204    0.7
2006/01/18 21:20:00 18863.91001    21622.17204    0.7
2006/01/18 22:21:00 18863.91001    20520.94194    0.7
2006/01/18 22:50:00 18863.91001    20520.94194    0.7
2006/01/18 23:50:01 18863.91001    18863.91001    0

2006/01/18 23:59:00 18863.91001    21755.28197    0.7
2006/01/19 00:00:01 18863.91001    21755.28197    0.7
2006/01/19 00:24:00 18863.91001    23356.1166     0.7
2006/01/19 01:10:01 18863.91001    23356.1166     0.7
2006/01/19 01:11:01 18863.91001    25988.00896    0.7

2006/01/19 01:12:00 18863.91001    18863.91001    0
2006/01/19 01:30:01 18863.91001    18869.16166    0.7
2006/01/19 01:32:00 18863.91001    18869.16166    0.7
2006/01/19 01:33:01 18863.91001    18867.14215    0.7
2006/01/19 01:42:00 18863.91001    18867.14215    0.7

2006/01/19 01:43:00 18863.91001    19236.79489    0.7
2006/01/19 01:44:01 18863.91001    19311.62809    0.7
2006/01/19 01:50:01 18863.91001    19311.62809    0.7
2006/01/19 01:51:00 19638.72387    19638.84059    0.8
2006/01/19 01:58:01 19638.72387    19638.84059    0.8

2006/01/19 01:59:00 19638.72387    20062.38786    0.8
2006/01/19 02:13:00 19638.72387    20062.38786    0.8
2006/01/19 02:14:01 19638.72387    20801.38322    0.8
2006/01/19 02:28:01 19638.72387    20801.38322    0.8
2006/01/19 02:29:00 19638.72387    21596.48618    0.8

2006/01/19 02:43:01 19638.72387    21596.48618    0.8
2006/01/19 02:44:00 19638.72387    22421.50753    0.8
2006/01/19 02:58:01 19638.72387    22421.50753    0.8
2006/01/19 02:59:00 19638.72387    23169.07735    0.8
2006/01/19 03:13:00 19638.72387    23169.07735    0.8

2006/01/19 03:14:00 19638.72387    23951.3318     0.8
2006/01/19 03:28:01 19638.72387    23951.3318     0.8
2006/01/19 03:29:00 19638.72387    24788.52825    0.8
2006/01/19 03:43:01 19638.72387    24788.52825    0.8
2006/01/19 03:44:00 19638.72387    25549.733      0.8

2006/01/19 03:58:00 19638.72387    25549.733      0.8
2006/01/19 03:59:00 19638.72387    26136.9324     0.8
2006/01/19 04:13:00 19638.72387    26136.9324     0.8
2006/01/19 04:14:00 19638.72387    26958.85906    0.8
2006/01/19 04:28:00 19638.72387    26958.85906    0.8

2006/01/19 04:29:00 19638.72387    27578.8381     0.8
2006/01/19 04:43:01 19638.72387    27578.8381     0.8
2006/01/19 04:44:00 19638.72387    28236.44895    0.8
2006/01/19 04:58:00 19638.72387    28236.44895    0.8
2006/01/19 04:59:01 19638.72387    28957.29703    0.8

2006/01/19 05:13:00 19638.72387    28957.29703    0.8
2006/01/19 05:14:00 19638.72387    29774.80083    0.8
2006/01/19 05:22:01 19638.72387    29774.80083    0.8
2006/01/19 05:23:00 30243.11149    30243.11136    0.9
2006/01/19 05:28:01 30243.11149    30243.11136    0.9

2006/01/19 05:29:00 30243.11149    30560.37531    0.9
2006/01/19 05:43:00 30243.11149    30560.37531    0.9
2006/01/19 05:44:01 30243.11149    31368.43024    0.9
2006/01/19 05:58:01 30243.11149    31368.43024    0.9
2006/01/19 05:59:00 30243.11149    31968.63966    0.9

2006/01/19 06:13:00 30243.11149    31968.63966    0.9
2006/01/19 06:14:00 30243.11149    32536.2561     0.9
2006/01/19 06:28:00 30243.11149    32536.2561     0.9
2006/01/19 06:29:00 30243.11149    33360.69078    0.9
2006/01/19 06:43:00 30243.11149    33360.69078    0.9

2006/01/19 06:44:00 30243.11149    34151.01588    0.9
2006/01/19 06:59:00 30243.11149    34151.01588    0.9
2006/01/19 07:00:01 30243.11149    34982.87669    0.9
2006/01/19 07:14:00 30243.11149    34982.87669    0.9
2006/01/19 07:15:00 30243.11149    35554.93576    0.9

2006/01/19 07:29:00 30243.11149    35554.93576    0.9
2006/01/19 07:30:01 30243.11149    35939.82315    0.9
2006/01/19 07:44:00 30243.11149    35939.82315    0.9
2006/01/19 07:45:00 30243.11149    36361.87355    0.9
2006/01/19 07:59:00 30243.11149    36361.87355    0.9

2006/01/19 08:00:00 30243.11149    36886.76837    0.9
2006/01/19 08:14:01 30243.11149    36886.76837    0.9
2006/01/19 08:15:00 30243.11149    37368.20149    0.9
2006/01/19 08:29:01 30243.11149    37368.20149    0.9
2006/01/19 08:30:00 30243.11149    37904.4584     0.9

2006/01/19 08:34:00 30243.11149    37904.4584     0.9
2006/01/19 08:35:01 30243.11149    38163.10014    0.9
2006/01/19 08:49:00 30243.11149    38163.10014    0.9
2006/01/19 08:50:01 30243.11149    30243.11149    0
2006/01/19 09:05:00 30243.11149    30703.9906     0.9

2006/01/19 09:19:00 30243.11149    30703.9906     0.9
2006/01/19 09:35:01 30243.11149    30664.69227    0.9
2006/01/19 09:49:00 30243.11149    30664.69227    0.9
2006/01/19 10:05:00 30243.11149    30919.71975    0.9
2006/01/19 10:19:01 30243.11149    30919.71975    0.9

2006/01/19 10:35:01 30243.11149    30617.40544    0.9
2006/01/19 10:42:00 30243.11149    30617.40544    0.9
2006/01/19 10:57:05 30243.11149    30243.11149    0
2006/01/19 10:58:01 30243.11149    30806.81106    0.9
2006/01/19 11:12:01 30243.11149    30806.81106    0.9

2006/01/19 11:13:00 30243.11149    31537.93352    0.9
2006/01/19 11:27:01 30243.11149    31537.93352    0.9
2006/01/19 11:28:00 30243.11149    32374.22245    0.9
2006/01/19 11:39:00 30243.11149    32374.22245    0.9

Date       Time     Checkpt.       CPU Time       frac_done
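
The rollbacks in a log like this can be picked out mechanically. Below is a rough sketch in Python that flags every entry where the CPU time is lower than in the previous entry and reports how much CPU time was thrown away. `parse` and `find_rollbacks` are hypothetical helper names, and the four sample rows are copied from the table above.

```python
def parse(log_text):
    """Turn log lines into (timestamp, checkpoint_cpu, current_cpu, frac_done)."""
    rows = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skips blank lines and the column header
        date, time, ckpt, cpu, frac = parts
        rows.append((f"{date} {time}", float(ckpt), float(cpu), float(frac)))
    return rows


def find_rollbacks(rows):
    """Return (timestamp, seconds_lost) wherever CPU time went backwards."""
    events = []
    for prev, cur in zip(rows, rows[1:]):
        if cur[2] < prev[2]:
            events.append((cur[0], prev[2] - cur[2]))
    return events


# Four rows copied from the log above.
LOG = """\
2006/01/19 08:35:01 30243.11149    38163.10014    0.9
2006/01/19 08:49:00 30243.11149    38163.10014    0.9
2006/01/19 08:50:01 30243.11149    30243.11149    0
2006/01/19 09:05:00 30243.11149    30703.9906     0.9
"""

for when, lost in find_rollbacks(parse(LOG)):
    print(f"{when}: rolled back, {lost:.0f} s of CPU time lost")
```

On these four rows it flags the 08:50:01 entry, where the CPU time falls from 38163.10014 back to the checkpoint value of 30243.11149, a loss of about 7,920 seconds (roughly 2.2 hours).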


As for the WU sizes, I hadn't come across a behemoth like this one before. The last two were under four hours. I had to ditch one in order to keep up with a SETI Enhanced WU deadline, but you wouldn't know that because it's still sitting in BOINC Manager saying "Aborted by user"... groan

At least I know something's working right...

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9369 - Posted: 19 Jan 2006, 17:15:25 UTC

One last thing before I head out: Current CPU Time is 9:39:45, progress 90% (still), and "to completion" is 01:00:15 (up 15:15 from an hour ago).

Thanks for the help, Scribe!

carl.h
Joined: 28 Dec 05
Posts: 555
Credit: 183,449
RAC: 0
Message 9374 - Posted: 19 Jan 2006, 18:48:50 UTC
Last modified: 19 Jan 2006, 18:49:10 UTC

It appears we are seeing work units with much longer run times, 8 hours plus... let's hope none of these get to 7 hours plus and then error out...
Not all Czech`s bounce but I`d like to try with Barbar ;-)

Make no mistake This IS the TEDDIES TEAM.

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9375 - Posted: 19 Jan 2006, 19:38:11 UTC - in response to Message 9374.  

It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch.

Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it.

Many thanks to all!

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 9379 - Posted: 19 Jan 2006, 20:51:19 UTC - in response to Message 9375.  

Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it.


It looks okay.

Guido Waldenmeier
Joined: 7 Jan 06
Posts: 11
Credit: 2,670
RAC: 0
Message 9413 - Posted: 20 Jan 2006, 1:13:14 UTC - in response to Message 9379.  

Many thanks!

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9432 - Posted: 20 Jan 2006, 6:43:56 UTC - in response to Message 9375.  

It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch.

Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it.

Many thanks to all!


While lately these longer ones have been failing for taking too long, it is not impossible to see some WUs run for over 20 hours. It is important to set the "leave applications in memory" preference to YES. As for the last 10% not looking very busy, this is also common. A few months ago the WUs would rush to 90% in just a few hours, and the last 10% would take 2 or more hours for a WU that only took 5 in total. A lot of work is often done in that last stretch, but the application will not produce a checkpoint during that time. If you stop BOINC, or have keep in memory set to NO, the WU has to start over at 90% each time it is interrupted. This may explain some of your delayed processing. Most people set the switch time between applications to at least TWO hours to help with this, in addition to setting keep in memory to YES.
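
To put rough numbers on that, here is a deliberately simplified model in Python, not how BOINC's scheduler actually decides things. It assumes the final stage writes no checkpoint at all and that the task is preempted at every switch interval; `cpu_time_to_finish` and its parameters are made up for illustration.

```python
def cpu_time_to_finish(stage_seconds, switch_seconds, keep_in_memory,
                       max_slices=1000):
    """CPU seconds burned before a checkpoint-free stage completes.

    Returns None if the stage still has not finished after max_slices runs.
    """
    progress = 0.0  # work done since the last checkpoint
    spent = 0.0     # total CPU time used, including redone work
    for _ in range(max_slices):
        run = min(switch_seconds, stage_seconds - progress)
        spent += run
        progress += run
        if progress >= stage_seconds:
            return spent
        if not keep_in_memory:
            progress = 0.0  # preempted and unloaded: back to the checkpoint
    return None


three_hours = 3 * 3600
print(cpu_time_to_finish(three_hours, 3600, keep_in_memory=False))
print(cpu_time_to_finish(three_hours, 3600, keep_in_memory=True))
print(cpu_time_to_finish(three_hours, 4 * 3600, keep_in_memory=False))
```

Under these assumptions, a three-hour stage with a 60-minute switch interval and keep in memory off never completes, because every slice starts again from the checkpoint. With keep in memory on, or with a switch interval longer than the stage, it completes in three hours of CPU time.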

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,081,660
RAC: 566
Message 9461 - Posted: 20 Jan 2006, 17:13:09 UTC - in response to Message 9375.  

It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch.

Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it.

Many thanks to all!


I believe you mentioned cron, so I'll assume you're on a Linux or UNIX system. There is another way to watch the progress of a WU with a little more granularity. Go to the slots directory and find the directory there that has the RAH stuff in it. Go into that directory. You should find a file there called stdout.txt. Run a "tail -f stdout.txt" to watch the activity as data is written to the file. I do this all the time.

Hope this helps.

-Charlie

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 9610 - Posted: 23 Jan 2006, 0:37:11 UTC - in response to Message 9413.  
Last modified: 23 Jan 2006, 0:42:21 UTC

Many thanks!


It's very important in Rosetta@Home that you set the WUs to stay in memory while preempted in your preferences.

I did write this to you in a mail more than a week ago. Why oh why will men never listen?

But the crunching times for Rosetta WUs differ from each other a lot. I think my longest took about 5 1/2 - 6 hours, and it stayed in memory while preempted.

Another good piece of advice is to set the time between switching between applications to at least 120 min (default 60 min).

Happy crunching. :-)


"I'm trying to maintain a shred of dignity in this world." - Me




©2024 University of Washington
https://www.bakerlab.org