Report stuck work units here

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 6355 - Posted: 15 Dec 2005, 20:47:21 UTC

Our apologies for the apparent problems with the recent batch of jobs. We should be able to track down the infinite loop, if there is one, pretty quickly with your help. Please post screen shots of your stuck work units here (or alternatively the information at the top and bottom of the screensaver)--this will help us identify the problem work units and the stage in the calculations where jobs are getting stuck.

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6360 - Posted: 15 Dec 2005, 22:59:52 UTC

a) At what point do we call a WU stuck? I have one stuck at 1% after 1 hour 30 minutes; the total estimate on this box for this kind of WU is 6 hours 30 minutes.

b) My boxes run BOINC as a service under a dedicated user account, so I can't see the graphics. Is there another way to get you the information you need?



Supporting BOINC, a great concept !

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6362 - Posted: 15 Dec 2005, 23:13:36 UTC - in response to Message 6360.  

a) At what point do we call a WU stuck? I have one stuck at 1% after 1 hour 30 minutes; the total estimate on this box for this kind of WU is 6 hours 30 minutes.

b) My boxes run BOINC as a service under a dedicated user account, so I can't see the graphics. Is there another way to get you the information you need?


1% after 1.5 hours sounds stuck to me. If you can't send graphics, the complete Work Unit would still help.


Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6363 - Posted: 15 Dec 2005, 23:18:43 UTC

Only the name of the unit or the whole active Slot ?

If slot, where / how to send ? E-Mail ?



Supporting BOINC, a great concept !

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6364 - Posted: 15 Dec 2005, 23:21:41 UTC
Last modified: 15 Dec 2005, 23:44:10 UTC

it just jumped to 10%



Supporting BOINC, a great concept !

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6366 - Posted: 15 Dec 2005, 23:42:50 UTC - in response to Message 6363.  

Only the name of the unit or the whole active Slot ?

If slot, where / how to send ? E-Mail ?


Hi Yeti, sorry to be a censor, but can you edit the profanity out of your previous post?

I'm not sure if your WU is still a candidate for being hung, perhaps you are having a different problem.

Until we figure out a good way for you to give us the whole slot, if you could go into the active Slot and post the content of stderr.txt, as well as the first 10 and last 10 lines of stdout.txt, that would help a lot.

Thanks,
Jack
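
The request above (all of stderr.txt plus the first and last 10 lines of stdout.txt) could be gathered with a short script. This is only a minimal sketch, not a project tool: the helper names and the slot path in the usage note are hypothetical, and it assumes nothing beyond the slot directory containing the two files named in the post.

```python
from pathlib import Path


def head_and_tail(text, n=10):
    """Return (first n lines, last n lines) of text.

    If the text has 2*n lines or fewer, everything goes in the first list
    so that no line is reported twice.
    """
    lines = text.splitlines()
    if len(lines) <= 2 * n:
        return lines, []
    return lines[:n], lines[-n:]


def slot_report(slot_dir, n=10):
    """Build a forum-ready report from one slot directory (path is an assumption)."""
    slot = Path(slot_dir)
    # errors="replace" keeps the script going if the logs contain odd bytes.
    stderr_text = (slot / "stderr.txt").read_text(errors="replace")
    stdout_text = (slot / "stdout.txt").read_text(errors="replace")
    head, tail = head_and_tail(stdout_text, n)
    parts = ["stderr:", stderr_text.rstrip(), "", "stdout first:", *head]
    if tail:
        parts += ["", "stdout last:", *tail]
    return "\n".join(parts)
```

Something like `print(slot_report("slots/0"))` would then produce text to paste into a post, where `slots/0` is a placeholder for whichever slot the work unit is actually running in. Run it while the result is still active, since the files may be removed once the result is reported.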

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6369 - Posted: 15 Dec 2005, 23:50:48 UTC

Okay, edit done, sorry

stderr:

# =====================================
# random seed: 1639541
# =====================================

stdout first:

2005-12-15 22:44:20 :: BOINC :: boinc_init()
command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.80_windows_intelx86.exe aa 1ogw _ -abrelax_mode -stringent_relax -more_relax_cycles -relax_score_filter -filter1 -105 -filter2 -145 -output_chi_silent -vary_omega -sim_aneal -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -barcode_from_fragments_length 10 -ssblocks -barcode_mode 3 -barcode_file 1ogw.top7_lowenergy.cst -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10
[STR OPT]Default value for [-paths] paths.txt.
[T/F OPT]Default FALSE value for [-unix_paths]
--------------------------------------------
WARNING:: paths.txt file not found!!
Setting all paths to .
Using default fragment file names:
aa*****03_05.200_v1_3
aa*****03_05.200_v1_3
--------------------------------------------
[T/F OPT]Default FALSE value for [-version]


stdout last:

Size: 3 NUMBER OF FRAGS FOR POS: 69 50
Size: 3 NUMBER OF FRAGS FOR POS: 70 50
Size: 3 NUMBER OF FRAGS FOR POS: 71 50
Size: 3 NUMBER OF FRAGS FOR POS: 72 50
Size: 3 NUMBER OF FRAGS FOR POS: 73 50
Size: 3 NUMBER OF FRAGS FOR POS: 74 50
score0 done: (best, low) rms
0 0 21.2227993
---------------------------------------------------------
score1 done: (best, low) rms (best,low)
8.89822865 3.30363035 18.3562622 13.2358065
standard trials: 2000 accepts: 629 %: 31.45
-----------------------------------------------------
Alternate score2/score5...
kk score2 score5 low_score n_low_accept rms rms_min low_rms
0 10.208 17.806 10.208 17 13.236 11.080 13.236
1 8.439 18.236 13.531 21 15.868 10.943 13.702
2 -44.739 -33.730 -44.738 36 11.227 8.677 11.227
3 -71.017 -60.008 -60.002 40 11.134 8.677 11.134



Supporting BOINC, a great concept !

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6375 - Posted: 16 Dec 2005, 0:07:36 UTC

Thank you Yeti, we will look into this.

For completeness, do you also have the Work Unit name?

Also is the percentage complete now climbing at a more typical rate?


Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6377 - Posted: 16 Dec 2005, 0:11:12 UTC - in response to Message 6366.  

I'm not sure if your WU is still a candidate for being hung, perhaps you are having a different problem.

I only had really hung WUs several weeks ago, when I started with Rosetta. In the last week I have seen several times that WUs stay at 1% for a very long time, but normally they jump to 10% after 1, 2, or 3 hours. After that, they go on much faster. (The jump from 1% to 10% happened at 1:35; now the WU is at 2:05 and says 20%.)

Until we figure out a good way for you to give us the whole slot, if you could go into the active Slot and post the content of stderr.txt, as well as the first 10 and last 10 lines of stdout.txt, that would help a lot.

I have made a copy of the whole slot and saved it; I can zip or rar it and e-mail it to you. If you don't want to post an e-mail address, send me the address via my registered e-mail address.

Or I can give you an address where you can download the slot from one of my servers ...

The WU-Name: 1ogw_topology_sample_88400_0





Supporting BOINC, a great concept !

Yeti
Joined: 2 Nov 05
Posts: 45
Credit: 11,088,948
RAC: 57,679
Message 6378 - Posted: 16 Dec 2005, 0:14:51 UTC

If I see similar things, shall I collect the data as I did with this one, or wait until you have had a look at this one?



Supporting BOINC, a great concept !

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 3,429,746
RAC: 14,613
Message 6379 - Posted: 16 Dec 2005, 0:16:08 UTC

Also is the percentage complete now climbing at a more typical rate?
==========
How can anybody answer that with the Rosetta WUs? I see some WUs take 3 hours to get to 20%, then jump 30% to 50% in the next 10 minutes. So what's a typical rate for that WU ... ???

These WUs have a mind of their own, and I don't think there is any set rate at which they progress. I can finish 10 WUs and no two of them take the same amount of time to reach 100%; there may be a difference of 3 or 4 hours between them.

This is not a rant, just my observations of the Rosetta WUs ... :)

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6380 - Posted: 16 Dec 2005, 0:33:08 UTC - in response to Message 6379.  


How can anybody answer that with the Rosetta WUs? I see some WUs take 3 hours to get to 20%, then jump 30% to 50% in the next 10 minutes. So what's a typical rate for that WU ... ???


Fair enough :)

At the moment we'd like to find the problem that's causing WUs to stick on 1% for over 10 hours. This should not be typical.

Yeti: If you see a case where it is stuck for more than 10 hours I'd very much like to see your Slot. In the meantime I'll try to figure out what was going on with the one you already sent.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 6475 - Posted: 16 Dec 2005, 21:08:04 UTC
Last modified: 16 Dec 2005, 21:38:10 UTC

This box, running 1ogw__topology_sample_84611_2

Did over 6000 sec and ./boinc_cmd --get_results reported fraction done = 0.01

I am sorry, didn't get the stderr & stdout, I had not realised they would disappear as soon as I aborted the result!

This happened because the client promptly reported the result and deleted the files.

One thing I did notice tho was that the result never reached its first checkpoint, which might help you pinpoint where it enters its infloop.
On my 600MHz linux box the first checkpoint came at 322 sec; whereas this box runs at 700MHz so the first checkpoint should have been a little sooner than that all things being equal.

***Please note if you wait till after the abort, and wait a little longer, you may be seeing the stdout & stderr relating to the new work, not the aborted work.

I think maybe disable network (to stop the client reporting) or copy the info while the result is still running. Which technique would be more useful please?

On my linux boxes I am using command line only.

regards,

Jack Schonbrun
Joined: 1 Nov 05
Posts: 115
Credit: 5,954
RAC: 0
Message 6480 - Posted: 16 Dec 2005, 23:07:56 UTC
Last modified: 16 Dec 2005, 23:09:38 UTC

River~~
This is perfect, just the information we need. It's possible that this is normal behavior; we are testing that now. You've given us what we need to do this. Some of the protocols take a while before they hit the first checkpoint.

The ones that we'd really like to catch are those that are stuck at 1% for 10 or more hours.

Jack


River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 6481 - Posted: 16 Dec 2005, 23:23:16 UTC - in response to Message 6480.  
Last modified: 16 Dec 2005, 23:25:44 UTC

...The ones that we'd really like to catch are those that are stuck at 1% for 10 or more hours.

Jack



OK, if I get another one I will leave it overnight before aborting & see what happens. I'm off to bed now, as I live in UTC timezone...

Jack: Do please notice the addition I edited into my previous post, as you were posting as I was editing.

R~~

Hammer
Joined: 11 Dec 05
Posts: 2
Credit: 9,597
RAC: 0
Message 6497 - Posted: 17 Dec 2005, 2:40:27 UTC - in response to Message 6366.  

Until we figure out a good way for you to give us the whole slot, if you could go into the active Slot and post the content of stderr.txt, as well as the first 10 and last 10 lines of stdout.txt, that would help a lot.

Thanks,
Jack


1ogw_topology_sample_106451_2
1ogw_topology_sample_131011_0

Stuck at 1% for some time, but just like Yeti got bumped up to 10% then kept going. Had another stuck at 80% the other day for 2 days before I noticed. As has been described before, the percent jumps are always odd, sometimes taking an hour to move 10%, and sometimes taking only a few minutes, all on the same WU. Could stick to 1% for an hour before it jumps to 10%. Hard to tell if it's actually stuck.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 6551 - Posted: 17 Dec 2005, 13:06:29 UTC - in response to Message 6380.  

At the moment we'd like to find the problem that's causing WUs to stick on 1% for over 10 hours.


Should we scale this to machine speed?

i.e., if 10 hours is the reporting point on a 2.8GHz box, it would seem premature to also report at 10 hours on a 665MHz box.

I've got boxes running at both those speeds already attached to Rosetta, so it is a practical question from me.

It's useful to have a guideline like 10hrs, and I'm suggesting it would be even more helpful for you to give a guideline for a mythical 1GHz box and donors can scale it appropriately up or down for their slower or faster boxes.

R~~
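
The scaling suggested above can be sketched in a few lines. This is a rough illustration only: it assumes run time is inversely proportional to clock speed, which the project has not confirmed and which ignores CPU architecture, memory, and other load on the machine. The function names are hypothetical.

```python
def scaled_threshold_hours(reference_hours, reference_ghz, machine_ghz):
    """Scale a reporting threshold inversely with clock speed.

    Assumption: run time is inversely proportional to clock speed. Real
    machines differ in more than clock speed, so treat this as a guide.
    """
    if reference_ghz <= 0 or machine_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    return reference_hours * reference_ghz / machine_ghz


def looks_stuck(fraction_done, cpu_hours, threshold_hours):
    """True if the unit is still at or below 1% once the threshold has passed."""
    return fraction_done <= 0.01 and cpu_hours >= threshold_hours
```

If 10 hours were the guideline for a 2.8GHz box, `scaled_threshold_hours(10, 2.8, 0.665)` would give roughly 42 hours for a 665MHz box under this assumption.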

Mark Rush
Joined: 6 Oct 05
Posts: 13
Credit: 50,777,520
RAC: 8,289
Message 6881 - Posted: 20 Dec 2005, 15:47:59 UTC - in response to Message 6355.  

OK, I have a unit that is 1% complete after 28 hours of work. I don't know how to do a screen shot, so here's the stuff at the top of the screen saver:
rosetta [workunit:2reb_abrelax_rand_len10_jit02_omega_sim_filters_53131]

At the bottom left of the screen is:
1% complete
CPU time: 27 hr 53 min 10 sec
Mark Rush - Total credit: 2014.34 - RAC 30.7302
Rosetta Fools
Rosetta@home v0 http://boinc.bakerlab.org/rosetta/

At the bottom right of the screen is:
Stage: Ab initio
Step: 2291
Accepted RMSD: 7.908
Accepted Energy: -3.025729

I won't abort this unit for a while in case you need more information from it.

Also, for what it's worth, this computer is using BOINC Manager 4.45, running Seti, climateprediction, Einstein, LHC (though there are no WUs), and Predictor. It's a 3.0 GHz machine running Windows XP Pro with 512 MB of RAM. The WUs stay in memory after they pause.

Mark


Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,178
RAC: 10
Message 6895 - Posted: 20 Dec 2005, 17:52:53 UTC

Mark, if you can locate which one of the "Slots" directories that Rosetta is running in, and just make a copy of that whole directory before you do anything else it would probably be the biggest help. Someone from the project can tell you what part of that they actually need. Normally for a "backup" of any files, you have to quit BOINC first, but in this case I would think it would be better to grab it "open".

Heck, if you have the disk space, I'd just make a copy of the whole BOINC folder!
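
Copying a slot directory while the work unit is still running could look something like the sketch below. It is an illustration under assumptions, not a project-supplied tool: the function name and both paths are placeholders, and files that the running application holds locked may fail to copy on some systems.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir, backup_root):
    """Copy a slot directory to backup_root/<slot name>-<timestamp>.

    Both paths are placeholders chosen by the caller. If some files cannot
    be copied (for example because they are locked), shutil.copytree
    raises shutil.Error after copying what it can.
    """
    slot = Path(slot_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{slot.name}-{stamp}"
    shutil.copytree(slot, target)  # creates target and any missing parents
    return target
```

The timestamp in the folder name keeps repeated snapshots of the same slot from colliding, which helps if the same work unit needs to be captured more than once.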


Mark Rush
Joined: 6 Oct 05
Posts: 13
Credit: 50,777,520
RAC: 8,289
Message 6897 - Posted: 20 Dec 2005, 17:55:51 UTC - in response to Message 6895.  

Bill:

Where would I look for the "slots" directories?

And, I apologize in advance but I have meetings all afternoon and so probably won't get a chance to look until tomorrow.

Mark

©2024 University of Washington
https://www.bakerlab.org