Report stuck work units here

Author	Message
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7250 - Posted: 22 Dec 2005, 20:42:32 UTC - in response to Message 7235. "...I firmly believe that a moderator should moderate as little as possible..." Moderate moderation, perhaps?

Brett Kneisley Joined: 17 Dec 05 Posts: 2 Credit: 3,593,841 RAC: 0	Message 7353 - Posted: 23 Dec 2005, 11:17:19 UTC New here, and I read down to see what information would help in fixing a problem I am having. I have several work units waiting that start with sample 207. Two have already been listed by my system as 'computational error'. The original estimated work time is 2:05. My system has been running SETI as well, switching between the two every 10 minutes. When the Rosetta computation reaches over 10 minutes (20%) and the timer shifts to SETI and then comes back, it will drop the comp time to less than 10 minutes, run to 18 - 20, stop after the 10 minutes have run, and drop back to less than 8 minutes total time. It never went over 20% complete, even after 9 hours. I have at this time changed the computation time to 3 hours to see what will happen. I have Windows XP, if that helps any. Is there any other info that is needed?

Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0	Message 7355 - Posted: 23 Dec 2005, 11:43:53 UTC - in response to Message 7353. Last modified: 23 Dec 2005, 11:51:57 UTC "When the Rosetta computation reaches over 10 minutes (20%) and the timer shifts to SETI and then comes back, it will drop the comp time to less than 10 minutes" Sounds like the usual problem of not keeping work units in memory, in this case combined with very short times between switches (if we had an FAQ, this would probably be at the top of the list). Check your preferences: to run Rosetta alongside other projects, you need to set "Leave applications in memory while preempted?" to yes, or you will likely never finish a work unit. I'd also change the setting for "Switch between applications every" to at least the recommended 60 minutes, but if you don't keep the work in memory, you will lose work done (back to 10%, 20%, 30% or whichever percentage it was at before the switch) every time you switch. Of course, there is also the problem with a bad batch of work units, mostly in batches 204 to 207. They will crash soon after starting, and there is nothing you can do about that. * Join BOINC@Australia today *
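The effect described in this post can be sketched with a toy model. This is only an illustration: the 10% checkpoint spacing is read off the "back to 10%, 20%, 30%" remark, the 125-minute run time is the 2:05 estimate quoted earlier, and the function name `slices_to_finish` is made up. The real application may checkpoint differently.

```python
import math


def slices_to_finish(total_minutes, slice_minutes, keep_in_memory,
                     checkpoint_fraction=0.10, max_slices=1000):
    """Count the time slices a work unit needs to finish in this toy model.

    Assumption: the application only checkpoints at every `checkpoint_fraction`
    of the total work. If it is not kept in memory, everything done since the
    last checkpoint is lost each time the client switches to another project.
    Returns None if the work unit never finishes within `max_slices`.
    """
    checkpoint_minutes = total_minutes * checkpoint_fraction
    done = 0.0
    for n in range(1, max_slices + 1):
        done += slice_minutes
        if done >= total_minutes:
            return n
        if not keep_in_memory:
            # Preempted and unloaded: fall back to the last checkpoint.
            done = math.floor(done / checkpoint_minutes) * checkpoint_minutes
    return None


if __name__ == "__main__":
    for slice_len in (10, 60):
        for keep in (False, True):
            print(slice_len, keep, slices_to_finish(125, slice_len, keep))
```

In this model a 10-minute switch interval never reaches the first 12.5-minute checkpoint unless the application stays in memory, while a 60-minute interval still finishes, only with some work repeated along the way.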

Brett Kneisley Joined: 17 Dec 05 Posts: 2 Credit: 3,593,841 RAC: 0	Message 7356 - Posted: 23 Dec 2005, 12:20:16 UTC The current Rosetta work unit hit 30%; I then updated my preferences to 'Leave applications in memory while preempted'. After the update the computation time did drop, but not below the 30% level. One problem fixed. I'm leaving it running to get some of the work units out of the way. If they are bad units I'll find out soon enough. 2 new units were downloaded, one says : Default **** 206 and the Second is : NO_RAND_WTS I know there was a problem with a Default 205 and those were to be ditched. Same with these new ones?

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7372 - Posted: 23 Dec 2005, 14:06:52 UTC - in response to Message 7356. "2 new units were downloaded, one says : Default **** 206 and the Second is : NO_RAND_WTS I know there was a problem with a Default 205 and those were to be ditched. Same with these new ones?" Nope! Only the DEFAULT_xxxx_205's should be aborted. The _other_ DEFAULT ones are "good", or at least as good as any other WU generated in that batch. In other words, they _could_ be "short WUs" and fail quickly, but it's not probable. Most likely they're the "best" of the bunches you could get, and the most likely both to earn you credit and to be useful to the project. The project uses names that are descriptive of what's going on. Those who expect every WU name to be a boring random string of letters and numbers sometimes get concerned by names like "random jitter whatever", but the names are explained over in Science, and really mean something.
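For anyone sorting through a long task list, the rule above can be turned into a small check. This is a sketch under an assumption: the layout `LABEL_protein_batch_number[_copy]` is inferred from names quoted in this thread (such as DEFAULT_2reb_219_3444_0) and is not an official naming specification. The helper names and the name `DEFAULT_1abc_205_12_0` in the example are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from names in this thread:
#   LABEL_protein_batch_number[_copy]
WU_NAME = re.compile(
    r"(?P<label>.+?)_(?P<protein>\d[A-Za-z0-9]{3})_(?P<batch>\d+)"
    r"_(?P<number>\d+)(?:_(?P<copy>\d+))?"
)


class WorkUnitName(NamedTuple):
    label: str
    protein: str
    batch: int
    number: int
    copy: Optional[int]


def parse_wu_name(name: str) -> Optional[WorkUnitName]:
    """Split a work unit name into its parts, or return None if it does not fit."""
    m = WU_NAME.fullmatch(name)
    if m is None:
        return None
    copy = int(m["copy"]) if m["copy"] is not None else None
    return WorkUnitName(m["label"], m["protein"], int(m["batch"]),
                        int(m["number"]), copy)


def is_bad_default_205(name: str) -> bool:
    """True only for DEFAULT_xxxx_205 work units, the ones to abort."""
    wu = parse_wu_name(name)
    return wu is not None and wu.label.upper() == "DEFAULT" and wu.batch == 205


if __name__ == "__main__":
    for name in ("DEFAULT_1abc_205_12_0", "DEFAULT_2reb_219_3444_0"):
        print(name, is_bad_default_205(name))
```

Names that do not follow the assumed layout simply parse to None, so they are never flagged for aborting.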

pieface Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0	Message 7400 - Posted: 23 Dec 2005, 19:44:53 UTC This is just an update to my message nr 7186 from yesterday on a 'stuck' WU. I left it (and Rosetta) suspended overnight to see if there would be any reply, and since there wasn't anything new this morning I thought I would just abort the WU and get on with it. I 'resumed' it before aborting, the pct complete went back to zero and, wouldn't you know it, the danged thing went from zero to completion in 4,892 CPU secs. Odd behavior for something that should be 'repeatable'???

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7404 - Posted: 23 Dec 2005, 19:48:09 UTC - in response to Message 7400. "This is just an update to my message nr 7186 from yesterday on a 'stuck' WU. I left it (and Rosetta) suspended overnight to see if there would be any reply, and since there wasn't anything new this morning I thought I would just abort the WU and get on with it. I 'resumed' it before aborting, the pct complete went back to zero and, wouldn't you know it, the danged thing went from zero to completion in 4,892 CPU secs. Odd behavior for something that should be 'repeatable'???" Some Rosetta WUs take a random number seed from the clock time, so they are not repeatable if they go back to 0%. I don't know if that applies to your WU or not. River~~
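The point about clock-based seeds can be shown in a few lines. This is a generic illustration using Python's standard `random` module, not Rosetta's actual random number code; the function name `trajectory` is made up.

```python
import random
import time


def trajectory(seed, steps=5):
    """Return the first few pseudo-random draws produced from a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


if __name__ == "__main__":
    # Same seed: the run is repeatable, so a restart from 0% retraces its steps.
    print(trajectory(42) == trajectory(42))

    # Seed taken from the clock: each restart explores a different path,
    # so one attempt may hang or fail where the next one completes.
    print(trajectory(time.time_ns()))
    print(trajectory(time.time_ns()))
```

With a fixed seed a restart would hit the same trouble spot again; with a clock seed there is no reason to expect it to.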

Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,702,007 RAC: 0	Message 7408 - Posted: 23 Dec 2005, 20:30:23 UTC I have moved several postings that did not relate to stuck work units to the "Moderated messages moved here" thread. As this thread is a staff-created sticky for a particular problem, the other discussions should take place elsewhere.

ecafkid Joined: 5 Oct 05 Posts: 40 Credit: 15,177,319 RAC: 0	Message 7613 - Posted: 25 Dec 2005, 15:44:40 UTC default_1hz6_219_3398_0 stuck at 1% after 16:32:55 with 21:04:49 remaining. I guess I will abort.

mgabriel Joined: 18 Sep 05 Posts: 5 Credit: 96,494 RAC: 0	Message 7710 - Posted: 27 Dec 2005, 2:20:00 UTC DEFAULT_2reb_219_3444_0 1% after 11:36:19 on an x2 3800+

Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0	Message 7723 - Posted: 27 Dec 2005, 8:52:35 UTC Last modified: 27 Dec 2005, 8:52:42 UTC If it goes more than 4 hours, take a screen shot (prtscrn key) of the graphics display (highlight the WU when running and select "Show Graphics"), save the image as a jpg file (use Paint), then stop and restart BOINC. The work unit will restart at 0 (sorry) and run from there ... One of the questions we have is whether the system is doing ANYTHING other than updating the clock ... You MAY be able to lobby for extra credit ... :) If you win, let me know, I have one worth 175 CS ... :)

Los Alcoholicos~DJNL Joined: 10 Nov 05 Posts: 1 Credit: 248,497 RAC: 0	Message 7729 - Posted: 27 Dec 2005, 12:45:51 UTC Looks stuck at 1%:
Rosetta Version 481 [workunit: DEFAULT_2reb_220_2101]
1% complete
CPU time: 7 hr 39 min 31 sec
stage: Ab Initio
Step: 2118
Accepted Rmsd: 8.359
Accepted energy: 0.5129008
It's running on an AMD Sempron 2600+, Win XP Home SP2, and left in memory when swapped. I have it suspended now. The stderr.txt is empty, and here are the first/last 10 lines of stdout.txt:
[2005-12-27 05:28:14] :: BOINC :: boinc_init()
command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe aa 2reb _ -abrelax -stringent_relax -more_relax_cycles -relax_score_filter -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -max_frags 400 -output_silent_gz -paths frags400.txt -filter1 -90 -filter2 -115 -nstruct 10
[STR OPT]New value for [-paths] frags400.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
[T/F OPT]Default FALSE value for [-version]
[T/F OPT]Default FALSE value for [-score]
[T/F OPT]Default FALSE value for [-abinitio]
[T/F OPT]Default FALSE value for [-refine]
[T/F OPT]Default FALSE value for [-assemble]
[T/F OPT]Default FALSE value for [-idealize]
__________________________________________________________________________________
score0 done: (best, low) rms 0 0 13.7818356
---------------------------------------------------------
score1 done: (best, low) rms (best,low) -12.5528154 -18.1141415 13.5991316 8.55768871
standard trials: 2000 accepts: 877 %: 43.85
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 -5.522 -5.522 -5.522 22 8.558 7.168 8.558
[REAL OPT]Default value for [-cpu_frac] 0.100000001
[REAL OPT]Default value for [-frame_rate] 10

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7765 - Posted: 27 Dec 2005, 23:25:25 UTC I have had a run of errors on this Linux box. Sadly I can't give you any files from it as it is remote and I didn't have access to ssh yesterday/today; all I had was control via BOINCview. Only one (marked with stars) is a 'stuck' WU, all the rest died early. However, in view of the suspicion of one job possibly tainting the next, I thought I would let you see the whole lot. Also of concern is the fact that a CPDN WU died as well, with error code 11, not 131. I have seen error code 11 a few times on failed Rosetta WUs over the last few days, so if there is some 'taint' it may be that it causes that error code. Or maybe the CPDN WU would have died then anyway. Who knows. By the way, this is a twin-CPU box, so the duplicated times are not an error. That again may, or may not, suggest tainting from one CPU to another.
26th
19:54:50 DEFAULT_2tif_219_7301_0 error 131 after 1h 4m 25sec
19:57:44 NO_RANDOM_WTS_OR_FRAGS_1dtj_223_812_0 error 131 after 14sec
19:57:44 NO_RANDOM_WTS_OR_FRAGS_1mky_223_530_0 error 131 after 4sec
19:57:45 DEFAULT_1b72_219_7654_0 error 131 after 29sec
20:03:23 DEFAULT_2tif_220_2806_0 error 131 after 53sec
27th
0700-1137 *** MORE_FRAGS_2reb_222_897_0 repeatedly stuck at 80%, 6h 51m 36sec, restarted client several times, clock restarted from 80% checkpoint & ran ok at first, then stopped again at around this figure. I can't be sure it was the same time, but always just short of 7hr.
11:37:30 ditto aborted by user
14:02:34 sulphur (CPDN WU) error 11 after 18days cpu time :-(
14:04:14 DEFAULT_1ogw_220_1383_1 error 131 after 58sec
Most recent successful outcome: 14:40 on 26th. Time now: 23:06 on 27th. I have two Rosetta WUs apparently running OK; it will be interesting to see if they finish or not! R~~

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7781 - Posted: 28 Dec 2005, 2:27:29 UTC Result = NO_BARCODE_FRAGS_1r69_227_1143_0 This ran on a Linux box for around 12 hours before I noticed it was stuck with less than 2hrs CPU recorded. top showed that it was not actually running, and it was therefore aborted. I kept the std*.txt files, and parts of these are given here, as they would stretch this thread. One thing I noted was the heartbeat message in the stderr file. This is a known BOINC problem and is caused when the inter-process communication from the client to the app is delayed by other events in the box. The app should exit without generating an error; on other BOINC platforms it simply restarts from the previous checkpoint. I wonder if this is another manifestation of the way Rosetta does not like being removed from memory? Just a guess, and not fully consistent with earlier observations on another box (see earlier post).

Steve Dodd Joined: 13 Dec 05 Posts: 7 Credit: 4,011,547 RAC: 0	Message 7782 - Posted: 28 Dec 2005, 2:28:54 UTC Sorry for the lack of information being provided. New to Rosetta. 1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram Aborted before I found this thread. (Rosetta 4.80)

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7784 - Posted: 28 Dec 2005, 2:45:50 UTC This result was running on a single-CPU Linux box. Clock stopped & client restarted several times; eventually the result was aborted. std*.txt files are given here. Similar comment as before: note the heartbeat again.

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7786 - Posted: 28 Dec 2005, 2:49:35 UTC - in response to Message 7782. "Sorry for the lack of information being provided. New to Rosetta. 1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram Aborted before I found this thread. (Rosetta 4.80)" Hi Steve - every little helps. But can you say if both the clock and progress were stuck, or was the clock running and only the progress stuck? If you don't remember, no worries.

Hoelder1in Joined: 30 Sep 05 Posts: 169 Credit: 3,915,947 RAC: 0	Message 7791 - Posted: 28 Dec 2005, 4:51:38 UTC Last modified: 28 Dec 2005, 4:52:54 UTC I wanted to report two cases of the "Clock Stops error" (as described by River~~ in the "Four kinds of errors" thread). They both happened on Linux a few weeks ago. In both cases 'top' showed that the task status had gone from RN to SN, i.e., the task just sat there and prevented other Rosetta work from being done. After killing the respective Rosetta job, the WU continued from the last checkpoint to completion (no points were lost). Oh, and all of this happened on a hyperthreading CPU. In the second of the two cases I actually tar'ed the respective slot directory before killing the job, intending to report this after collecting a few more cases (which hasn't happened so far). If the saved slot directory is of interest I can make it available when I am back home again next week.
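On Linux the same R-versus-S check can be scripted instead of watching 'top'. The sketch below reads the state letter from `/proc/<pid>/stat`, a standard Linux interface; the helper names are made up, and the process ID of the Rosetta task has to be found separately.

```python
def parse_state(stat_line: str) -> str:
    """Extract the one-letter state from the contents of /proc/<pid>/stat.

    The line looks like "pid (command name) state ...". The command name may
    itself contain spaces or parentheses, so split after the LAST ')'.
    """
    after_name = stat_line[stat_line.rfind(")") + 1:]
    return after_name.split()[0]


def read_state(pid: int) -> str:
    """Return the current state letter of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


# Common state letters: R = running, S = sleeping, D = uninterruptible wait,
# T = stopped, Z = zombie.
def looks_idle(state: str) -> bool:
    """True if a task that should be crunching is sitting in a sleep state."""
    return state == "S"
```

A task that shows S on every check over several minutes, while BOINC still lists it as running, would match the "clock stops" case reported here.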

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 7815 - Posted: 28 Dec 2005, 12:49:40 UTC Last modified: 28 Dec 2005, 12:51:35 UTC INCREASE_CYCLES_10_1hz6_226_241 got stuck for 27 hours at 1%. Edit: win-2k box, clock still running, still 99%+ in top. So we are still seeing the other kind of stuck WU. I have suspended this one, not aborted, as it has already been somewhere else first. Will abort when advised that files have been removed from the server. Files will be posted here. R~~
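The two kinds of stuck WU discussed in this thread can be told apart from two readings of CPU time and percent complete taken some time apart. The sketch below is only an illustration with made-up names and sample values; because Rosetta reports progress in coarse steps, the readings need to be far enough apart for the percentage to have had a chance to move.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    cpu_seconds: float   # CPU time shown for the task
    percent_done: float  # progress shown for the task


def classify(earlier: Reading, later: Reading) -> str:
    """Compare two readings of the same task taken some time apart."""
    if later.percent_done > earlier.percent_done:
        return "progressing"
    if later.cpu_seconds > earlier.cpu_seconds:
        # CPU time keeps climbing but the percentage does not move.
        return "clock running, progress stuck"
    # Neither CPU time nor percentage has changed.
    return "clock stopped"


if __name__ == "__main__":
    # Hypothetical readings: 1% done after 20 h of CPU, still 1% after 27 h.
    print(classify(Reading(72000, 1.0), Reading(97200, 1.0)))
```

Reporting which of the two cases applies, along with the WU name, answers the question asked earlier about whether the clock was still running.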

Steve Dodd Joined: 13 Dec 05 Posts: 7 Credit: 4,011,547 RAC: 0	Message 7845 - Posted: 28 Dec 2005, 21:18:22 UTC - in response to Message 7786. "Sorry for the lack of information being provided. New to Rosetta. 1hz6A_topology_sample_129151_0 -- 10:40:36 @ 1%. Intel 2.8GHz, HT, 1G Ram Aborted before I found this thread. (Rosetta 4.80)" "Hi Steve - every little helps. But can you say if both the clock and progress were stuck, or was the clock running and only the progress stuck? If you don't remember, no worries." Clock was humming right along. Just no progress. Sorry it took so long to respond.