Message boards : Number crunching : Never-ending WU?
Author | Message |
---|---|
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
I've been crunching this WU (BARCODE_FRAG_30_1dtj_234_976_0) for over 10 hours on a G4 @ 867MHz. I just checked in BOINC Manager to see how far it had gotten, and the CPU time it's now reporting is 8 hours. All throughout the time - all ten hours - the "to completion" column has been reading "0:50:00" and increasing steadily to "1:45:00" over a period of two hours. A few questions: (1) Will this WU never end? (2) Can anyone explain the rollback on the CPU time? (3) Should I send this WU to meet its binary maker? (4) What's the usual runtime? TIA |
Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
(2) If it was pre-empted by another and you have not got it set to remain in memory, or you rebooted...... |
Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
....also what is the %complete figure? |
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
Currently 08:36:45 at 90%... where it's been for at least an hour or so... maybe two?... I'm polling client_state.xml every five min for %done via cron... gimmie a few minutes and I'll post the contents. |
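For anyone wanting to reproduce this kind of polling, here is a minimal sketch; it is not Guido's actual script. It assumes client_state.xml contains `<active_task>` elements with `<result_name>`, `<checkpoint_cpu_time>`, `<current_cpu_time>` and `<fraction_done>` children. Element names can differ between BOINC client versions (the thread itself refers to "frac_done"), so check your own file first. The function names and the sample XML are made up for illustration; the numbers in the sample are taken from the log posted in this thread.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Illustrative stand-in for a real client_state.xml (layout is an assumption).
SAMPLE_XML = """
<client_state>
  <active_task_set>
    <active_task>
      <result_name>BARCODE_FRAG_30_1dtj_234_976_0</result_name>
      <checkpoint_cpu_time>30243.11149</checkpoint_cpu_time>
      <current_cpu_time>32374.22245</current_cpu_time>
      <fraction_done>0.9</fraction_done>
    </active_task>
  </active_task_set>
</client_state>
"""


def _num(task, tag):
    """Return the child element's text as a float, or None if it is absent."""
    text = task.findtext(tag)
    return float(text) if text else None


def read_active_tasks(xml_text):
    """Return one dict per <active_task> element found in the XML text."""
    root = ET.fromstring(xml_text)
    return [
        {
            "result_name": task.findtext("result_name", default="?"),
            "checkpoint_cpu_time": _num(task, "checkpoint_cpu_time"),
            "current_cpu_time": _num(task, "current_cpu_time"),
            "fraction_done": _num(task, "fraction_done"),
        }
        for task in root.iter("active_task")
    ]


def format_log_line(task, now=None):
    """Format one sample as: date time checkpoint cpu_time fraction_done."""
    now = now or datetime.now()
    fields = ("checkpoint_cpu_time", "current_cpu_time", "fraction_done")
    values = ["n/a" if task[f] is None else repr(task[f]) for f in fields]
    return " ".join([now.strftime("%Y/%m/%d %H:%M:%S")] + values)


if __name__ == "__main__":
    for t in read_active_tasks(SAMPLE_XML):
        print(t["result_name"], format_log_line(t))
```

To poll a live client, read the real file instead of `SAMPLE_XML` and append each line to a log from a cron job that runs every five minutes; the path to client_state.xml depends on where BOINC is installed.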
Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
I had a WU today that took over nine hours on my AMD 2800+, so there are some big ones out there |
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
Sorry for the delay in the response... It turns out I've got far more data than I had anticipated to sift through. This is the log I've got from rosetta crunching the WU I mentioned earlier. The chart is date and time of the log entry, the checkpoint time (seconds since the WU began?), the current cpu time, and the "frac_done".

Date | Time | Checkpt. | CPU Time | frac_done |
---|---|---|---|---|
2006/01/18 | 21:09:00 | 18863.91001 | 21622.17204 | 0.7 |
2006/01/18 | 21:20:00 | 18863.91001 | 21622.17204 | 0.7 |
2006/01/18 | 22:21:00 | 18863.91001 | 20520.94194 | 0.7 |
2006/01/18 | 22:50:00 | 18863.91001 | 20520.94194 | 0.7 |
2006/01/18 | 23:50:01 | 18863.91001 | 18863.91001 | 0 |
2006/01/18 | 23:59:00 | 18863.91001 | 21755.28197 | 0.7 |
2006/01/19 | 00:00:01 | 18863.91001 | 21755.28197 | 0.7 |
2006/01/19 | 00:24:00 | 18863.91001 | 23356.1166 | 0.7 |
2006/01/19 | 01:10:01 | 18863.91001 | 23356.1166 | 0.7 |
2006/01/19 | 01:11:01 | 18863.91001 | 25988.00896 | 0.7 |
2006/01/19 | 01:12:00 | 18863.91001 | 18863.91001 | 0 |
2006/01/19 | 01:30:01 | 18863.91001 | 18869.16166 | 0.7 |
2006/01/19 | 01:32:00 | 18863.91001 | 18869.16166 | 0.7 |
2006/01/19 | 01:33:01 | 18863.91001 | 18867.14215 | 0.7 |
2006/01/19 | 01:42:00 | 18863.91001 | 18867.14215 | 0.7 |
2006/01/19 | 01:43:00 | 18863.91001 | 19236.79489 | 0.7 |
2006/01/19 | 01:44:01 | 18863.91001 | 19311.62809 | 0.7 |
2006/01/19 | 01:50:01 | 18863.91001 | 19311.62809 | 0.7 |
2006/01/19 | 01:51:00 | 19638.72387 | 19638.84059 | 0.8 |
2006/01/19 | 01:58:01 | 19638.72387 | 19638.84059 | 0.8 |
2006/01/19 | 01:59:00 | 19638.72387 | 20062.38786 | 0.8 |
2006/01/19 | 02:13:00 | 19638.72387 | 20062.38786 | 0.8 |
2006/01/19 | 02:14:01 | 19638.72387 | 20801.38322 | 0.8 |
2006/01/19 | 02:28:01 | 19638.72387 | 20801.38322 | 0.8 |
2006/01/19 | 02:29:00 | 19638.72387 | 21596.48618 | 0.8 |
2006/01/19 | 02:43:01 | 19638.72387 | 21596.48618 | 0.8 |
2006/01/19 | 02:44:00 | 19638.72387 | 22421.50753 | 0.8 |
2006/01/19 | 02:58:01 | 19638.72387 | 22421.50753 | 0.8 |
2006/01/19 | 02:59:00 | 19638.72387 | 23169.07735 | 0.8 |
2006/01/19 | 03:13:00 | 19638.72387 | 23169.07735 | 0.8 |
2006/01/19 | 03:14:00 | 19638.72387 | 23951.3318 | 0.8 |
2006/01/19 | 03:28:01 | 19638.72387 | 23951.3318 | 0.8 |
2006/01/19 | 03:29:00 | 19638.72387 | 24788.52825 | 0.8 |
2006/01/19 | 03:43:01 | 19638.72387 | 24788.52825 | 0.8 |
2006/01/19 | 03:44:00 | 19638.72387 | 25549.733 | 0.8 |
2006/01/19 | 03:58:00 | 19638.72387 | 25549.733 | 0.8 |
2006/01/19 | 03:59:00 | 19638.72387 | 26136.9324 | 0.8 |
2006/01/19 | 04:13:00 | 19638.72387 | 26136.9324 | 0.8 |
2006/01/19 | 04:14:00 | 19638.72387 | 26958.85906 | 0.8 |
2006/01/19 | 04:28:00 | 19638.72387 | 26958.85906 | 0.8 |
2006/01/19 | 04:29:00 | 19638.72387 | 27578.8381 | 0.8 |
2006/01/19 | 04:43:01 | 19638.72387 | 27578.8381 | 0.8 |
2006/01/19 | 04:44:00 | 19638.72387 | 28236.44895 | 0.8 |
2006/01/19 | 04:58:00 | 19638.72387 | 28236.44895 | 0.8 |
2006/01/19 | 04:59:01 | 19638.72387 | 28957.29703 | 0.8 |
2006/01/19 | 05:13:00 | 19638.72387 | 28957.29703 | 0.8 |
2006/01/19 | 05:14:00 | 19638.72387 | 29774.80083 | 0.8 |
2006/01/19 | 05:22:01 | 19638.72387 | 29774.80083 | 0.8 |
2006/01/19 | 05:23:00 | 30243.11149 | 30243.11136 | 0.9 |
2006/01/19 | 05:28:01 | 30243.11149 | 30243.11136 | 0.9 |
2006/01/19 | 05:29:00 | 30243.11149 | 30560.37531 | 0.9 |
2006/01/19 | 05:43:00 | 30243.11149 | 30560.37531 | 0.9 |
2006/01/19 | 05:44:01 | 30243.11149 | 31368.43024 | 0.9 |
2006/01/19 | 05:58:01 | 30243.11149 | 31368.43024 | 0.9 |
2006/01/19 | 05:59:00 | 30243.11149 | 31968.63966 | 0.9 |
2006/01/19 | 06:13:00 | 30243.11149 | 31968.63966 | 0.9 |
2006/01/19 | 06:14:00 | 30243.11149 | 32536.2561 | 0.9 |
2006/01/19 | 06:28:00 | 30243.11149 | 32536.2561 | 0.9 |
2006/01/19 | 06:29:00 | 30243.11149 | 33360.69078 | 0.9 |
2006/01/19 | 06:43:00 | 30243.11149 | 33360.69078 | 0.9 |
2006/01/19 | 06:44:00 | 30243.11149 | 34151.01588 | 0.9 |
2006/01/19 | 06:59:00 | 30243.11149 | 34151.01588 | 0.9 |
2006/01/19 | 07:00:01 | 30243.11149 | 34982.87669 | 0.9 |
2006/01/19 | 07:14:00 | 30243.11149 | 34982.87669 | 0.9 |
2006/01/19 | 07:15:00 | 30243.11149 | 35554.93576 | 0.9 |
2006/01/19 | 07:29:00 | 30243.11149 | 35554.93576 | 0.9 |
2006/01/19 | 07:30:01 | 30243.11149 | 35939.82315 | 0.9 |
2006/01/19 | 07:44:00 | 30243.11149 | 35939.82315 | 0.9 |
2006/01/19 | 07:45:00 | 30243.11149 | 36361.87355 | 0.9 |
2006/01/19 | 07:59:00 | 30243.11149 | 36361.87355 | 0.9 |
2006/01/19 | 08:00:00 | 30243.11149 | 36886.76837 | 0.9 |
2006/01/19 | 08:14:01 | 30243.11149 | 36886.76837 | 0.9 |
2006/01/19 | 08:15:00 | 30243.11149 | 37368.20149 | 0.9 |
2006/01/19 | 08:29:01 | 30243.11149 | 37368.20149 | 0.9 |
2006/01/19 | 08:30:00 | 30243.11149 | 37904.4584 | 0.9 |
2006/01/19 | 08:34:00 | 30243.11149 | 37904.4584 | 0.9 |
2006/01/19 | 08:35:01 | 30243.11149 | 38163.10014 | 0.9 |
2006/01/19 | 08:49:00 | 30243.11149 | 38163.10014 | 0.9 |
2006/01/19 | 08:50:01 | 30243.11149 | 30243.11149 | 0 |
2006/01/19 | 09:05:00 | 30243.11149 | 30703.9906 | 0.9 |
2006/01/19 | 09:19:00 | 30243.11149 | 30703.9906 | 0.9 |
2006/01/19 | 09:35:01 | 30243.11149 | 30664.69227 | 0.9 |
2006/01/19 | 09:49:00 | 30243.11149 | 30664.69227 | 0.9 |
2006/01/19 | 10:05:00 | 30243.11149 | 30919.71975 | 0.9 |
2006/01/19 | 10:19:01 | 30243.11149 | 30919.71975 | 0.9 |
2006/01/19 | 10:35:01 | 30243.11149 | 30617.40544 | 0.9 |
2006/01/19 | 10:42:00 | 30243.11149 | 30617.40544 | 0.9 |
2006/01/19 | 10:57:05 | 30243.11149 | 30243.11149 | 0 |
2006/01/19 | 10:58:01 | 30243.11149 | 30806.81106 | 0.9 |
2006/01/19 | 11:12:01 | 30243.11149 | 30806.81106 | 0.9 |
2006/01/19 | 11:13:00 | 30243.11149 | 31537.93352 | 0.9 |
2006/01/19 | 11:27:01 | 30243.11149 | 31537.93352 | 0.9 |
2006/01/19 | 11:28:00 | 30243.11149 | 32374.22245 | 0.9 |
2006/01/19 | 11:39:00 | 30243.11149 | 32374.22245 | 0.9 |

As for the WU sizes, I hadn't come across a behemoth like this one before - The last two were under four hours and I had to ditch one in order to keep up with a SETI Enhanced WU deadline, but you wouldn't know that because it's still sitting in BOINC Manager saying "Aborted by user"... groan At least I know something's working right... |
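The CPU-time rollbacks in that log can be picked out mechanically. The sketch below is one way to do it, assuming each sample is a whitespace-separated line of date, time, checkpoint time, CPU time and fraction done, as in the posted data. The names `Sample`, `parse_log` and `find_rollbacks` are invented for this example, and the excerpt holds only five rows copied from the log above.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    stamp: str         # "YYYY/MM/DD HH:MM:SS"
    checkpoint: float  # checkpoint CPU time, seconds
    cpu: float         # current CPU time, seconds
    frac_done: float   # fraction done, 0..1


# Five rows copied from the log posted above.
EXCERPT = """
2006/01/18 21:20:00 18863.91001 21622.17204 0.7
2006/01/18 22:21:00 18863.91001 20520.94194 0.7
2006/01/19 01:11:01 18863.91001 25988.00896 0.7
2006/01/19 01:12:00 18863.91001 18863.91001 0
2006/01/19 01:30:01 18863.91001 18869.16166 0.7
"""


def parse_log(text: str) -> List[Sample]:
    """Parse whitespace-separated log rows, skipping headers and blanks."""
    samples = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 5 or not parts[0][:4].isdigit():
            continue  # not a data row
        date, time, ckpt, cpu, frac = parts
        samples.append(Sample(f"{date} {time}", float(ckpt), float(cpu), float(frac)))
    return samples


def find_rollbacks(samples: List[Sample]) -> List[Tuple[str, float]]:
    """Return (timestamp, seconds lost) wherever CPU time went backwards."""
    rollbacks = []
    for prev, cur in zip(samples, samples[1:]):
        if cur.cpu < prev.cpu:
            rollbacks.append((cur.stamp, prev.cpu - cur.cpu))
    return rollbacks


if __name__ == "__main__":
    for stamp, lost in find_rollbacks(parse_log(EXCERPT)):
        print(f"{stamp}: CPU time fell by {lost:.0f} s")
```

A drop in CPU time between two consecutive samples is treated as a restart here; that is an interpretation of the data, not something the BOINC client reports directly.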
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
One last thing before I head out: Current CPU Time is 9:39:45, progress 90% (still), and "to completion" is 01:00:15 (up 15:15 from an hour ago). Thanks for the help, Scribe! |
carl.h Send message Joined: 28 Dec 05 Posts: 555 Credit: 183,449 RAC: 0 |
It appears we are seeing work units with much longer running times, 8 hours plus.... let's hope none of these get to 7 hours plus and then hit errors... Not all Czech's bounce but I'd like to try with Barbar ;-) Make no mistake This IS the TEDDIES TEAM. |
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch. Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it. Many thanks to all! |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
"Anyway, can someone eyeball the result and let me know if it's in line with other users' results? I'd greatly appreciate it." It looks okay. |
Guido Waldenmeier Send message Joined: 7 Jan 06 Posts: 11 Credit: 2,670 RAC: 0 |
Many thanks! |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch." While lately these longer ones have been failing for taking too long, it is not impossible to see some WUs run for over 20 hours. It is important to set the "leave applications in memory" during swaps option to "YES". As for the last 10% not looking very busy, this is also common. A few months ago the WUs would rush to 90% in just a few hours, and the last 10% would take 2 or more hours for a WU that only took 5 in total. It is not uncommon for a lot of things to be done in those last two hours, but the application will not produce a checkpoint during that time. If you stop BOINC or have keep in memory set to OFF, then the WU has to start over at 90% each time it is interrupted. This may explain some of your delayed processing. Most people set the switch time for applications to at least TWO hours to help with this situation, in addition to setting keep in memory to YES. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
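The effect described above can be illustrated with a toy model. This is a sketch under strong assumptions, not how the BOINC scheduler actually works: the work unit has a final stretch with no checkpoints, the client preempts it after every full switch interval (which in practice only happens when another project has work waiting), and a preempted task that is not kept in memory loses everything since its last checkpoint. The function name and parameters are hypothetical.

```python
def cpu_needed_for_uncheckpointed_stretch(stretch_s, switch_s, keep_in_memory,
                                          max_turns=100):
    """CPU seconds spent finishing a checkpoint-free stretch of work.

    Returns None if the stretch is still unfinished after max_turns turns.
    """
    spent = 0.0      # total CPU seconds consumed so far
    progress = 0.0   # seconds of the stretch currently completed
    for _ in range(max_turns):
        remaining = stretch_s - progress
        if remaining <= switch_s:
            return spent + remaining  # finishes within this turn
        spent += switch_s
        # Preempted: progress survives only if the task stays in memory.
        progress = progress + switch_s if keep_in_memory else 0.0
    return None


if __name__ == "__main__":
    three_hours = 3 * 3600
    for switch_min in (60, 120, 240):
        for keep in (False, True):
            result = cpu_needed_for_uncheckpointed_stretch(
                three_hours, switch_min * 60, keep)
            print(f"switch={switch_min} min, keep_in_memory={keep}: {result}")
```

In this model a three-hour checkpoint-free stretch never completes with a shorter switch interval unless the task stays in memory, which is why both settings are usually recommended together. The real work unit in this thread did finish, so actual preemption was clearly less regular than the model assumes.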
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 421 |
"It finally ended: 40,524.74 seconds (~11hr 15min) - I'll check the logs later on, but I'll wager that there wasn't any data committed to disk during the last three hours of the crunch." I believe you mentioned cron so I'll assume you're on a Linux or UNIX system. There is another way to watch the progress of a WU with a little more granularity. Go to the slots directory and find the directory there that has the RAH stuff in it. Go into that directory. You should find a file there called stdout.txt. Run a "tail -f stdout.txt" to watch the activity as data is written to the file. I do this all the time. Hope this helps. -Charlie |
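If you would rather take a snapshot from a script than leave "tail -f" running, something like the following sketch prints the last few lines of every stdout.txt under the slots directory. It assumes the layout Charlie describes (a slots directory with one subdirectory per running task, each holding a stdout.txt); the BOINC directory path and the function names are placeholders, not anything defined by BOINC.

```python
from collections import deque
from pathlib import Path


def tail(lines, n=10):
    """Return the last n lines of any iterable of lines, newlines stripped."""
    return [line.rstrip("\n") for line in deque(lines, maxlen=n)]


def slot_stdout_files(boinc_dir):
    """List stdout.txt files found one level below <boinc_dir>/slots."""
    return sorted(Path(boinc_dir).glob("slots/*/stdout.txt"))


def show_recent_output(boinc_dir, n=10):
    """Print the last n lines of each slot's stdout.txt."""
    for path in slot_stdout_files(boinc_dir):
        print(f"== {path} ==")
        with open(path, errors="replace") as fh:
            for line in tail(fh, n):
                print(line)


if __name__ == "__main__":
    # Placeholder path: point this at your own BOINC data directory.
    show_recent_output(".", n=5)
```

Unlike "tail -f" this does not follow the file; run it again (or from cron) to see newer output.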
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
"Many thanks!" It's very important in Rosetta@Home that you set the WUs to stay in memory while preempted in your preferences. I did write this to you in a mail more than a week ago. Why oh why will men never listen? But the crunching time for Rosetta WUs differs a lot from one to the next. I think my longest took about 5 1/2 - 6 hours, and it stayed in memory while preempted. Another good piece of advice is to set the time between switching applications to at least 120 min (default 60 min). Happy crunching. :-) "I'm trying to maintain a shred of dignity in this world." - Me |
©2024 University of Washington
https://www.bakerlab.org