Work unit errors.

adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 69120 - Posted: 10 Jan 2011, 14:39:18 UTC
Last modified: 10 Jan 2011, 14:41:10 UTC

I know you have problems, and that completed wu's are not uploading, but we do appear to be able to download new work units. These new work units, to me anyway, seem to fall into one of three categories.

There are ones that fail almost immediately. These fail with a trace like this...

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
[2011- 1- 9 15:59:13:] :: BOINC:: Initializing ... ok.
[2011- 1- 9 15:59:13:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: Option matching -relax:fastrelax_repeats not found in command line top-level context

</stderr_txt>
]]>
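
For anyone sorting through several failed results, the line that matters in a trace like the one above is the one starting with `ERROR:`. A minimal sketch for pulling those lines out of saved stderr text, assuming the text has been copied from the result page into a string; `extract_errors` is a hypothetical helper, not part of BOINC or Rosetta:

```python
def extract_errors(stderr_text):
    """Return the message part of every line that starts with 'ERROR:'."""
    errors = []
    for line in stderr_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ERROR:"):
            # Drop the 'ERROR:' prefix and keep only the message.
            errors.append(stripped[len("ERROR:"):].strip())
    return errors


# Two lines taken from the trace in this post.
sample = """Options::initialize() End reached
ERROR: Option matching -relax:fastrelax_repeats not found in command line top-level context
"""

for message in extract_errors(sample):
    print(message)
```

Grouping results by the extracted message would show quickly whether the immediate failures all share the same missing option.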

There are those that run for hours, and are then flagged as failed. The error log appears normal. There is one here, I have several others like these.

The others run and run without making progress.

I can't see how these are doing science. I have set No New Tasks for the time being.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 69123 - Posted: 10 Jan 2011, 15:44:30 UTC
Last modified: 10 Jan 2011, 15:44:58 UTC

adrianxw, can you explain further what you mean by "...without making progress"? ...are they not checkpointing? Not producing many models? Can you give a few of the WU names that you would put in this category?
Rosetta Moderator: Mod.Sense
adrianxw
Message 69125 - Posted: 10 Jan 2011, 16:54:27 UTC
Last modified: 10 Jan 2011, 17:07:36 UTC

I mean, the wu is running, elapsed CPU time is increasing, and the To Completion estimate is increasing, actually faster than the elapsed time. abrelax_helixfrag_1enh_SAVE_ALL_OUT_22843_2800_0 is one: it has a CPU time of 09:29:45, an estimated To Completion time of 20:32:28, and is showing a static 15.483% complete in the progress column. Similarly named wu's have completed normally.

In view of the validate errors, I was considering aborting it actually.
Mod.Sense
Message 69126 - Posted: 10 Jan 2011, 17:03:18 UTC

It looks like you run with a 6-hour runtime preference and so are still about an hour away from the watchdog detecting that the task has run more than 4 hours past the target and ending it for you. I'd let it run its course; the watchdog is watching it for you. The completion time is simply calculated based on the percent complete, so it is not a distressing sign in and of itself.
adrianxw
Message 69127 - Posted: 10 Jan 2011, 17:10:08 UTC

I can set it to run longer if it a) is going to achieve anything or b) will accept the runtime change on the fly. I've suspended it for the moment to allow time for reply/action.
Mod.Sense
Message 69132 - Posted: 10 Jan 2011, 18:51:13 UTC

My prior comment assumed you were on the host using the 6hr runtime preference. If you do change that target, and then update to the project (with a successful completion on the scheduler request), then the change will take effect on the existing work units. Beware, this will affect how BOINC requests new work to maintain your buffer of work.
adrianxw
Message 69134 - Posted: 10 Jan 2011, 18:57:01 UTC
Last modified: 10 Jan 2011, 18:58:42 UTC

The time limit was set to 6 hours. I have set that machine to 1 Day for the moment and re-enabled the wu; let's see what happens.
adrianxw
Message 69212 - Posted: 12 Jan 2011, 8:08:59 UTC
Last modified: 12 Jan 2011, 8:26:32 UTC

Very strange, I recommend someone "back there" have a try running this wu, ideally on a Windows XP system over a stock speed Intel Core 2 Quad.

After bumping the runtime limit up to 1 day, I left it to run. With the quota Rosetta has at the moment, (10%), I didn't know what would happen or when. Looking at it yesterday evening, it had crunched 16:15:46 hours and was showing 27.186% done. Didn't look good.

I then did something I don't do often: I turned on the graphics to have a look at what was going on. Two things struck me almost at once. First, it was a smallish, quite simple-looking protein. Second, and more importantly, the rate of "progress" shown by the graphics seemed faster than expected. Indeed, it was knocking on about 0.015% every 10 - 12 seconds. I figured I'd leave the graphics window open and see how it looked this morning.

So it is 100.00% completed, and showing a run-time of 18:32:22. I can't say what the quality of the result will be, as it is stuck in the uploading state with several others - there are other threads about that.

The casual fiddling I did leads me to suspect that having the graphics window open, (or having opened the graphics at least once), made the unit "go", (released a semaphore/critical section etc.). This may point towards an unusual weakness in the program code.
adrianxw
Message 69404 - Posted: 18 Jan 2011, 11:56:17 UTC
Last modified: 18 Jan 2011, 12:04:41 UTC

Lo and behold, another one. This one. It shows 17:34:10 running and a static 42.151% complete, with 19:27:43 to go, rising. Similar, but not the same machine.

After the previous wu, I tried to start the graphics, it opens the window, but remains black.

I'd set the desired run time to 1 day earlier, but set it back to the 6 hour normal. The machine with the current "long runner" was not in the same group as this one, so should never have seen the change, I would have expected the time out to have stopped it by now?

There is a problem here. I'm setting No New Tasks across my systems.
Mod.Sense
Message 69405 - Posted: 18 Jan 2011, 13:23:38 UTC

Not sure what you mean about not being in the "same group". The runtime preference applies to all machines assigned to the location where you've set the preference. Did you mean you have more than one location defined?

The 17.5hrs you mention, is that CPU time? Or elapsed time? It sounds like perhaps you are seeing the problem where BOINC Manager stops assigning CPU time to a task, yet shows it with a running status anyway.

I believe the watchdog perks up every 15 minutes or so to see if there is any need to end a running task. And it will only step in if the task has exceeded the target runtime by more than 4 hours (this is to help give it time to complete at least the one required model per task). So, if a task has been running for 17.5hrs, and I reset my runtime preference to 6 hours, and updated to the project so the machine is aware of the new runtime preference, I would expect the watchdog to wrap up that task about 15 minutes later because it is past the now expected 6hr runtime by more than 4 hrs.
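
The rule described here can be written down as a small check. This is only a sketch of the rule as stated in this post, not Rosetta's actual watchdog code; the function name is hypothetical, and it assumes the comparison is made against the task's CPU time.

```python
from datetime import timedelta

# From the post: the watchdog steps in only after the target runtime
# has been exceeded by more than 4 hours.
GRACE = timedelta(hours=4)


def watchdog_should_end(cpu_time, target_runtime, grace=GRACE):
    """True once cpu_time exceeds target_runtime by more than grace."""
    return cpu_time > target_runtime + grace


six_hours = timedelta(hours=6)
one_day = timedelta(hours=24)

# 17.5 h on a 6 h target is past the 10 h cutoff.
print(watchdog_should_end(timedelta(hours=17, minutes=30), six_hours))  # True
# 09:29:45 on a 6 h target is still inside the 4 h grace period.
print(watchdog_should_end(timedelta(hours=9, minutes=29, seconds=45), six_hours))  # False
# 17.5 h on a 1 day target is nowhere near the 28 h cutoff.
print(watchdog_should_end(timedelta(hours=17, minutes=30), one_day))  # False
```

Per the description above, a check like this would run roughly every 15 minutes, so a task that crosses the cutoff would be wrapped up within about that long.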

The graphic display has nothing to do with how the tasks run. It sometimes takes a couple of minutes to establish itself. It also depends on the % of CPU you have configured to allow the graphic to utilize in your preferences.
adrianxw
Message 69407 - Posted: 18 Jan 2011, 15:47:02 UTC
Last modified: 18 Jan 2011, 16:07:31 UTC

This machine is in the "Home" group, the machine I refer to above is in the "Work" group. As far as I know, changes to settings in one should not affect settings in others. This being the case, it has, and has always had, 6 hours set.

The 17 hours, (now 21 hours and still 42.151% done), is elapsed. The CPU time in the Properties tab does not appear to be advancing. There are four processes shown as "Running" in BOINC Manager; if the task is showing as running but actually is not, then the process is wasting 25% of that machine. I have suspended it.

The graphics on these machines normally start within a few seconds of issuing the request. I just went back to that machine, started the graphics and waited. After two minutes, I had a plain black window, and the window title had "Not Responding" in it.
adrianxw
Message 69443 - Posted: 20 Jan 2011, 19:26:35 UTC
Last modified: 20 Jan 2011, 19:29:17 UTC

I suspended the task a while back. I was going to abort it this morning, but figured it had been off for a while, I'd "just give it another chance". So it then started running, and ran on to an apparent normal completion.

If the watchdog is supposed to have done stuff, well, it didn't. On both of the events I report in this thread, it apparently took manual action to get things going. I have seen problems on two different machines now, on different wu's.

I cannot risk running Rosetta anymore on machines that I am not looking at several times a day.

There is a problem here.
Mod.Sense
Message 69460 - Posted: 21 Jan 2011, 14:00:39 UTC

When the watchdog does its thing, it reports the task back in a normal manner, so it isn't really possible to know it's done nothing, because when it's working properly it doesn't bark. I mean the only way to know is by seeing tasks that clearly exceed its guidelines, which it sounds like you may have had.

The only issue that I've heard many times that makes tasks appear to run long is that for some reason, and I believe it is specific to Windows, the BOINC Manager stops assigning CPU time to tasks in a "running" status. The BOINC Manager will show elapsed time increasing, but the task's properties and Windows task manager confirm its CPU time is not increasing. The only thing that seems to resolve that state is a complete restart (not reload, just exit and run again or reboot the machine) of BOINC. The problem with these is that since they are not getting any CPU time, the watchdog never gets to run to clean things up.
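
The symptom described here (elapsed time climbing while CPU time stands still) can be checked from two or more readings of a task. A minimal sketch, assuming the readings are noted by hand from the task's properties in BOINC Manager; `is_stalled` and both thresholds are hypothetical choices, not anything BOINC provides:

```python
def is_stalled(samples, min_elapsed_gain=600.0, max_cpu_gain=1.0):
    """Flag a task whose elapsed time advances while its CPU time does not.

    samples: list of (elapsed_seconds, cpu_seconds) readings, oldest first.
    The task counts as stalled when elapsed time grew by at least
    min_elapsed_gain seconds between the first and last reading while
    CPU time grew by no more than max_cpu_gain seconds.
    """
    if len(samples) < 2:
        return False  # one reading cannot show a trend
    first_elapsed, first_cpu = samples[0]
    last_elapsed, last_cpu = samples[-1]
    elapsed_gain = last_elapsed - first_elapsed
    cpu_gain = last_cpu - first_cpu
    return elapsed_gain >= min_elapsed_gain and cpu_gain <= max_cpu_gain


# Elapsed time moved on by more than three hours, CPU time did not move.
print(is_stalled([(63250.0, 30000.0), (75600.0, 30000.0)]))  # True
# A healthy task: CPU time keeps pace with elapsed time.
print(is_stalled([(0.0, 0.0), (900.0, 880.0)]))  # False
```

A task flagged this way would be a candidate for the full BOINC restart mentioned above, since the watchdog cannot act on a task that is receiving no CPU time.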

From the way you mentioned that the CPU time was not advancing, I believe this may be what you are seeing here. I've been trying to gather additional details on what makes one hang up like that and not another for several months now. One thing I've noticed on my own machine is that it seems more likely to occur when there are tasks for other projects running, and when BOINC is bumping up against its maximum memory allowed in the preferences. Were these factors on your machine? Can you think of anything else on the machines that was different than over the past few months? I haven't found any correlation between WU names and the likelihood of such an issue occurring.
adrianxw
Message 69464 - Posted: 21 Jan 2011, 18:06:56 UTC
Last modified: 21 Jan 2011, 18:14:58 UTC

The machine that wu was on, (Intel Core2 Quad), rarely runs anything other than BOINC, certainly nothing else recently, (months). The projects allocated to it are Climate Prediction, Docking, Einstein, Leiden, Malaria Control, POEM, Rosetta and SIMAP. That project portfolio has not changed for some time, (years).

I fiddle around with the quotas from time to time, but not often, and not much. I tend to think of that machine as "un-attended", even though it is in the same room as me here, (I have to move the screen from this system to that to look at it). Of course, the projects change their apps from time to time, POEM has been going through the motions with POEM++ recently for example. Rosetta had a 20% share of it.

I HAVE seen tasks in the Waiting for Memory state on there recently, which may be relevant. It has 2GB in it. It will get the memory from this machine, (also 2GB), when this one is upgraded later this year.

Another thing is the graphics card: it is CUDA compatible and Einstein uses that now, I think. I doubt this is relevant, but hey, it is a difference.

[edit] Forgot SIMAP on the tasks [/edit]



©2024 University of Washington
https://www.bakerlab.org