Report Maximum CPU Time Exceeded WU HERE

Message boards : Number crunching : Report Maximum CPU Time Exceeded WU HERE

Honza
Joined: 18 Sep 05
Posts: 48
Credit: 173,517
RAC: 0
Message 10697 - Posted: 12 Feb 2006, 17:04:24 UTC

Had a WU taking ~70 hours which has not errored out due to long processing
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8363123

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10972 - Posted: 19 Feb 2006, 22:00:13 UTC - in response to Message 10964.  

This WU took 165 hours before it finally decided that it was running for too long...



This is not a max time error; it is a 1% hang issue, so I have moved the post to the proper thread. While the WU did finally fail, the cause was the system aborting it after 165 hours. If I am not mistaken, Dr. Kim has said he will grant the credit in this case.

Moderator9
ROSETTA@home FAQ
Moderator Contact

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10977 - Posted: 19 Feb 2006, 22:37:37 UTC

Could someone please explain the difference between "max cpu time exceeded" and otherwise "hung/stuck WUs" to me?

I mean, let's say I have a WU "stuck", loaded in memory but somehow not actually running -shown with "top" command and "ps" shows them as "SN"=stopped,nice- (I've had a few such situations under Linux).

If the user doesn't intervene to "kill" the stuck Rosetta task manually (so BOINC re-runs the same WU with the only difference being the random seed, apparently), would it abort on its own after X days have passed?

In short, my question is: do the "Max CPU time exceeded" WUs actually consume 100% CPU cycles during the X days they kept "running" until they reached their TTL? Or could they just be "stuck" WUs which simply hit their TTL?

PS: I'm thoroughly confused about the definitions of the various issues (bugs) we're trying to track, and I have read the R@H forums every day for the last month.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10978 - Posted: 20 Feb 2006, 0:25:10 UTC - in response to Message 10977.  
Last modified: 20 Feb 2006, 2:27:53 UTC

Could someone please explain the difference between "max cpu time exceeded" and otherwise "hung/stuck WUs" to me?

I mean, let's say I have a WU "stuck", loaded in memory but somehow not actually running -shown with "top" command and "ps" shows them as "SN"=stopped,nice- (I've had a few such situations under Linux).

If the user doesn't intervene to "kill" the stuck Rosetta task manually (so BOINC re-runs the same WU with the only difference being the random seed, apparently), would it abort on its own after X days have passed?

In short, my question is: do the "Max CPU time exceeded" WUs actually consume 100% CPU cycles during the X days they kept "running" until they reached their TTL? Or could they just be "stuck" WUs which simply hit their TTL?

PS: I'm thoroughly confused about the definitions of the various issues (bugs) we're trying to track, and I have read the R@H forums every day for the last month.



Max time WUs run normally and only fail as a result of hitting the maximum time allotted for them to run when the project sent them out.

Hung work units run, but the progress never increases. Usually they stick at 1% complete, but it can happen anywhere. While they may eventually fail, they are usually aborted or restarted by the user.
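
The distinction above can be sketched in a few lines of Python. This is only an illustration, not anything the project or the BOINC client runs: the `Sample` type, the `classify` function, and the four-hour stall threshold are all hypothetical. It treats a WU as hung when its reported progress has not moved while CPU time kept accumulating, and as a max time failure when progress was still advancing at the moment the CPU limit was reached.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cpu_seconds: float    # CPU time consumed so far
    fraction_done: float  # progress reported by the task, 0.0 to 1.0


def classify(samples, cpu_limit_seconds, stall_seconds=4 * 3600):
    """Return 'hung', 'max_time', or 'running' for time-ordered samples.

    stall_seconds is a hypothetical threshold, not a BOINC setting.
    """
    last = samples[-1]
    # Walk backwards to find when the current progress value first appeared.
    stall_start = last.cpu_seconds
    for s in reversed(samples):
        if s.fraction_done != last.fraction_done:
            break
        stall_start = s.cpu_seconds
    if last.cpu_seconds - stall_start >= stall_seconds:
        return "hung"       # CPU time keeps growing, progress does not
    if last.cpu_seconds >= cpu_limit_seconds:
        return "max_time"   # still progressing, but out of allotted time
    return "running"
```

A WU stuck at 1% for many hours would come back as "hung" even if it is later killed by the CPU limit, which matches how the 165-hour case was handled earlier in this thread.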

Moderator9
ROSETTA@home FAQ
Moderator Contact

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 10989 - Posted: 20 Feb 2006, 8:43:23 UTC

I've had a few of these on different machines:
2/19/2006 10:16:41 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_1gvp__250_35_2: exceeded CPU time limit 50195.312500
2/19/2006 10:16:41 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1gvp__250_35_2 (Maximum CPU time exceeded)
2/19/2006 10:16:42 PM||request_reschedule_cpus: process exited

Is there anything I can do to prevent this from occurring?
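
For anyone collecting these from several machines, here is a minimal sketch that pulls the result name and the CPU time limit out of a client log line like the ones above. The function name `parse_abort_line` is made up for this example, and the pattern is written only against the line format shown in this post; other client versions may word it differently.

```python
import re

# Matches lines such as:
# ...|rosetta@home|Aborting result NAME: exceeded CPU time limit 50195.312500
ABORT_RE = re.compile(r"Aborting result (\S+): exceeded CPU time limit ([\d.]+)")


def parse_abort_line(line):
    """Return (result_name, limit_seconds), or None if the line does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return m.group(1), float(m.group(2))


line = ("2/19/2006 10:16:41 PM|rosetta@home|Aborting result "
        "PRODUCTION_ABINITIO_1gvp__250_35_2: exceeded CPU time limit 50195.312500")
name, limit = parse_abort_line(line)
print(name, f"{limit / 3600:.1f} h")
```

For the line quoted above, the limit of 50195.3125 seconds works out to a little under 14 hours of CPU time.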

Join the Teddies@WCG

Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 10990 - Posted: 20 Feb 2006, 9:40:20 UTC

Owlie forgot to mention that the ones above were also 4.82 with the CPU time set to 4 days......

Nite Owl
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 10992 - Posted: 20 Feb 2006, 9:48:41 UTC - in response to Message 10990.  

Owlie forgot to mention that the ones above were also 4.82 with the CPU time set to 4 days......

Thanks Scribbles... I set half my machines to 2 days and left the others at 4...

ExtraTerrestrial Apes
Joined: 3 Jan 06
Posts: 3
Credit: 5,756,092
RAC: 4,378
Message 10998 - Posted: 20 Feb 2006, 14:47:37 UTC

I got one here, result:
http://www.boinc.bakerlab.org/rosetta/result.php?resultid=11775907
WU (4.82):
http://www.boinc.bakerlab.org/rosetta/workunit.php?wuid=6142533
client:
http://www.boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=128060
error:
<core_client_version>5.2.13</core_client_version>
<message>Maximum CPU time exceeded</message>
<stderr_txt>
# random seed: 216581
# cpu_run_time_pref: 28800
</stderr_txt>

MrS
Scanning for our furry friends since Jan 2002

Cureseekers~Joschy
Joined: 8 Dec 05
Posts: 2
Credit: 1,969,809
RAC: 0
Message 11019 - Posted: 20 Feb 2006, 18:39:54 UTC
Last modified: 20 Feb 2006, 18:49:21 UTC


Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 11034 - Posted: 20 Feb 2006, 20:26:56 UTC

Thanks much for reporting these CPU time errors. It looks like
we were able to largely solve the problem in the jobs submitted
after January. We reduced the number of structures per work unit and
extended the max CPU time; none of these later jobs appear to
have given the error.

We're now setting up jobs for the updated application. As David
will explain in a note soon, we're now tapping the BOINC resources
to unleash the powerful information available in sequence "homologues"
(sequences related to the target protein and thus expected
to have nearly the same fold). Very exciting!

These next jobs should hopefully be even less likely to
trigger the max CPU time error.
We now are allowing you to set the maximum time you want
your computer to crunch (default 8 hours) before
returning us structures, rather than asking for a specific
number of structures back. So far this seems to have worked on the
test server. Please do report any further Max CPU time errors
here!
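
The switch from "return N structures" to "crunch for a target time" can be pictured with the sketch below. It is not Rosetta's code: `crunch`, `make_structure`, and the injectable `clock` parameter are hypothetical names used only to show the control flow, assuming the time check happens between structures.

```python
import time


def crunch(make_structure, target_seconds, clock=time.process_time):
    """Generate structures until the time budget is used up.

    At least one structure is always produced, and the budget is only
    checked between structures (an assumption of this sketch).
    """
    start = clock()
    structures = []
    while True:
        structures.append(make_structure())
        if clock() - start >= target_seconds:
            break
    return structures
```

Under this assumption a run can overshoot its target by up to the time one structure takes, so a work unit whose individual structures are very slow could still run well past the preference.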




Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 11065 - Posted: 21 Feb 2006, 4:24:36 UTC

Whoa now, what is this???

I set my cpu time for 24 hours and I get a max cpu time exceeded after 10 hours.

Here is the WU, and here is the pertinent info:

CPU time 36185.368987

stderr out

<core_client_version>5.2.14</core_client_version>
<message>Maximum CPU time exceeded
</message>
<stderr_txt>
# random seed: 910501
# cpu_run_time_pref: 86400

</stderr_txt>

Validate state Invalid
Claimed credit 101.245443882576
Granted credit 0
application version 4.81
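
To make the mismatch explicit, here is a small sketch that reads the `cpu_run_time_pref` value out of the stderr text and compares it with the reported CPU time. The helper name `run_time_pref_seconds` is invented for this example; the numbers are the ones from this result.

```python
import re


def run_time_pref_seconds(stderr_text):
    """Return the cpu_run_time_pref value in seconds, or None if absent."""
    m = re.search(r"#\s*cpu_run_time_pref:\s*(\d+)", stderr_text)
    return int(m.group(1)) if m else None


stderr_text = """# random seed: 910501
# cpu_run_time_pref: 86400
"""
pref = run_time_pref_seconds(stderr_text)
cpu_time = 36185.368987  # "CPU time" field of the result, in seconds

print(f"preference: {pref / 3600:.1f} h, used: {cpu_time / 3600:.1f} h")
# prints "preference: 24.0 h, used: 10.1 h"
```

So the task was aborted at roughly 10 of the 24 hours requested, which suggests the abort came from a separate limit rather than from the run time preference itself.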



gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 11067 - Posted: 21 Feb 2006, 5:05:36 UTC
Last modified: 21 Feb 2006, 5:07:07 UTC

I have had two 'max cpu time exceeded' errors reported since upgrading to 4.82. It seems to have been caused by setting my 'target cpu run time' to 4hrs whilst these two WUs were already at 6+ hours of progress, or at least they both errored out shortly after I changed that value.

These are the WUs in question:

https://boinc.bakerlab.org/rosetta/result.php?resultid=11797455
https://boinc.bakerlab.org/rosetta/result.php?resultid=11796520

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11103 - Posted: 21 Feb 2006, 12:25:21 UTC - in response to Message 11067.  

I have had two 'max cpu time exceeded' errors reported since upgrading to 4.82. It seems to have been caused by setting my 'target cpu run time' to 4hrs whilst these two WUs were already at 6+ hours of progress, or at least they both errored out shortly after I changed that value.

These are the WUs in question:

https://boinc.bakerlab.org/rosetta/result.php?resultid=11797455
https://boinc.bakerlab.org/rosetta/result.php?resultid=11796520



I suspect this is a WU-related issue. It is possible that the bounds limit has not been set correctly for these WUs to accommodate the new time settings. I will bring it to the attention of the project team.

Moderator9
ROSETTA@home FAQ
Moderator Contact

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11136 - Posted: 21 Feb 2006, 19:13:22 UTC

Just as I suspected, these latest Max time errors are a WU-related issue. Please see this from David Kim on the subject:

"The max time errors are due to an older batch of work units. I cancelled that batch and also updated all the rsc_fpops_bound values to a fairly high value so as to not reach the limit in 4 days.

It is difficult, though, to guarantee not reaching the limit since it also depends on the client's benchmark...

...In the future we will try to prevent sending out the work units that take a long time to produce a single model. The previous batches of ab initio runs have a filter being used that actually ignores structures that do not fit the filtering criteria, thus for some proteins many structures are being modeled before reaching one that passes the filters. We are going to turn the filters off for future batches and filter them ourselves as a post process.

Thanks,

David K"


So while there may be a few more of these that come out of longer queues, for the most part these Max time errors should stop very soon.

If you see any, please keep reporting them here in this thread.
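
The dependence on the client's benchmark can be illustrated with a bit of arithmetic. This is a sketch under an assumption that is not stated in the thread: that the client derives its CPU time limit by dividing the work unit's `rsc_fpops_bound` by the host's benchmarked floating-point speed. The function names and all the numbers below are hypothetical.

```python
def cpu_time_limit_seconds(rsc_fpops_bound, host_fpops):
    """Assumed relation: limit = operation bound / benchmarked ops per second."""
    return rsc_fpops_bound / host_fpops


def fpops_bound_for(limit_seconds, host_fpops):
    """Bound needed so a host with this benchmark gets the given time limit."""
    return limit_seconds * host_fpops


bound = 1.0e14  # hypothetical bound, in floating-point operations
for host_fpops in (1.0e9, 2.0e9):  # hypothetical benchmark results
    hours = cpu_time_limit_seconds(bound, host_fpops) / 3600
    print(f"benchmark {host_fpops:.0e} flops -> limit {hours:.1f} h")

four_days = 4 * 86400
print(f"bound for 4 days at 2e9 flops: {fpops_bound_for(four_days, 2.0e9):.3e}")
```

Under that assumption, a host with a higher benchmark gets a shorter limit for the same bound, which is why one fixed value cannot guarantee four days on every machine.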



Moderator9
ROSETTA@home FAQ
Moderator Contact

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 11518 - Posted: 1 Mar 2006, 12:23:03 UTC

Any progress yet in granting credits for MCTE WUs?
Or did I miss it somewhere?


Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 11526 - Posted: 1 Mar 2006, 16:27:08 UTC - in response to Message 11518.  

Any progress yet in granting credits for MCTE WUs?
Or did I miss it somewhere?



As was reported before, it will be AT LEAST mid-March before the project team can deal with the credit granting process for this class of WU failures, and maybe longer. They did say they would grant the credit in due course, but they are focused on fixing run time errors at this time. The cause of the Max time errors has been isolated and fixed so people should not see any more of them. But the credit granting process takes time.


Moderator9
ROSETTA@home FAQ
Moderator Contact

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 11527 - Posted: 1 Mar 2006, 18:01:54 UTC - in response to Message 11526.  

Any progress yet in granting credits for MCTE WUs?
Or did I miss it somewhere?



As was reported before, it will be AT LEAST mid-March before the project team can deal with the credit granting process for this class of WU failures, and maybe longer. They did say they would grant the credit in due course, but they are focused on fixing run time errors at this time. The cause of the Max time errors has been isolated and fixed so people should not see any more of them. But the credit granting process takes time.


Thanks. I don't visit these forums on a regular basis, so I must have missed it.




©2024 University of Washington
https://www.bakerlab.org