Message boards : Number crunching : Report Maximum CPU Time Exceeded WU HERE
Author | Message |
---|---|
Honza Send message Joined: 18 Sep 05 Posts: 48 Credit: 173,517 RAC: 0 |
Had a WU taking ~70 hours which has not errored out due to long processing https://boinc.bakerlab.org/rosetta/workunit.php?wuid=8363123 |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
This WU took 165 hours before it finally decided that it was running for too long... This is not a max time error. This is a 1% hang issue, so I have moved the post to the proper thread. While the WU did finally fail, the cause was the system aborting it after 165 hours. If I am not mistaken, Dr. Kim has said he will grant the credit in this case. Moderator9 ROSETTA@home FAQ Moderator Contact |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Could someone please explain the difference between "max cpu time exceeded" and otherwise "hung/stuck WUs" to me? I mean, let's say I have a WU "stuck", loaded in memory but somehow not actually running (seen with the "top" command, and "ps" shows them as "SN"=stopped,nice). I've had a few such situations under Linux. If the user doesn't intervene to "kill" the stuck Rosetta task manually (so BOINC re-runs the same WU with the only difference being the random seed, apparently), would it abort on its own after X days have passed? In short, my question is: do the "Max CPU time exceeded" WUs actually consume 100% CPU cycles during the X days they kept "running" until they reached their TTL? Or could it be just "stuck" WUs which simply hit their TTL? PS: I'm thoroughly confused about the definitions of the various issues (bugs) we're trying to track, and I have read the R@H forums every day for the last month. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
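For anyone checking a suspect task with "ps" on Linux, here is a minimal sketch that spells out what a STAT string means. It assumes the procps state codes, where `S` is an interruptible sleep and `T` is a stopped process, so "SN" reads as a sleeping, niced task rather than a stopped one. The function name `describe_stat` is made up for illustration and only a few common codes are covered.

```python
# First character of the STAT column in procps `ps` (Linux).
PRIMARY_STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually I/O)",
    "T": "stopped by a job-control signal",
    "Z": "zombie",
}

# Modifier characters that may follow the primary state.
MODIFIERS = {
    "N": "low priority (nice)",
    "<": "high priority",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat: str) -> str:
    """Turn a STAT string such as 'SN' into a readable description."""
    if not stat:
        raise ValueError("empty STAT string")
    parts = [PRIMARY_STATES.get(stat[0], f"unknown state {stat[0]!r}")]
    parts += [MODIFIERS.get(ch, f"unknown flag {ch!r}") for ch in stat[1:]]
    return ", ".join(parts)


if __name__ == "__main__":
    print(describe_stat("SN"))
    print(describe_stat("TN"))
```

A task that still accumulates CPU time would normally show `R` for most samples, so the state alone does not say whether a WU is hung; it only helps rule out a process that is not being scheduled at all.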
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Could someone please explain the difference between "max cpu time exceeded" and otherwise "hung/stuck WUs" to me? Max time WUs run normally and fail only as a result of hitting the maximum time allotted for them to run when the project sent them out. Hung work units run, but the progress never increases. Usually they stick at 1% complete, but it can happen anywhere. While they may fail, usually they are aborted or restarted by the user. Moderator9 ROSETTA@home FAQ Moderator Contact |
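The distinction above can be sketched as a small classifier. This is only an illustration under stated assumptions: the samples are hypothetical (CPU seconds, fraction done) pairs that a user might note down from the BOINC manager, and the one-hour stall window is an arbitrary choice, not a project setting. A stalled task is reported as hung even if it later runs into the CPU limit, which matches how the 165-hour WU earlier in the thread was treated.

```python
from typing import List, Tuple


def classify(samples: List[Tuple[float, float]],
             cpu_limit: float,
             stall_seconds: float = 3600.0) -> str:
    """Classify a work unit from (cpu_seconds, fraction_done) samples in time order."""
    if not samples:
        return "no data"

    last_cpu, last_frac = samples[-1]

    # Walk backwards to find the CPU time at which the current
    # progress value was first observed.
    first_cpu_at_frac = last_cpu
    for cpu, frac in reversed(samples):
        if frac != last_frac:
            break
        first_cpu_at_frac = cpu

    # Progress frozen while CPU time keeps accumulating: a hang.
    if last_cpu - first_cpu_at_frac >= stall_seconds:
        return "hung"

    # Progress was still moving, but the allotted CPU time ran out.
    if last_cpu >= cpu_limit:
        return "max cpu time exceeded"

    return "running"


if __name__ == "__main__":
    stuck = [(0.0, 0.0), (600.0, 0.01), (4800.0, 0.01)]
    slow = [(0.0, 0.0), (20000.0, 0.4), (50200.0, 0.9)]
    print(classify(stuck, cpu_limit=50195.3125))
    print(classify(slow, cpu_limit=50195.3125))
```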
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
I've had a few of these on different machines: 2/19/2006 10:16:41 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_1gvp__250_35_2: exceeded CPU time limit 50195.312500 Is there anything I can do to prevent this from occurring? Join the Teddies@WCG |
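When collecting several of these for a report, a small parser can pull the result name and the limit out of each message line. This is a sketch that assumes the pipe-separated layout of the line quoted above; the helper name `parse_abort` is invented here and other client versions may word the message differently.

```python
import re
from typing import Optional, Tuple

# Matches the message part of the client log line quoted in the post.
ABORT_RE = re.compile(
    r"Aborting result (?P<result>\S+): "
    r"exceeded CPU time limit (?P<limit>\d+(?:\.\d+)?)"
)


def parse_abort(line: str) -> Optional[Tuple[str, float]]:
    """Return (result_name, cpu_limit_seconds), or None if the line does not match."""
    # The log line is "timestamp|project|message"; keep only the message.
    message = line.rsplit("|", 1)[-1]
    match = ABORT_RE.search(message)
    if match is None:
        return None
    return match.group("result"), float(match.group("limit"))


if __name__ == "__main__":
    line = ("2/19/2006 10:16:41 PM|rosetta@home|Aborting result "
            "PRODUCTION_ABINITIO_1gvp__250_35_2: exceeded CPU time limit 50195.312500")
    parsed = parse_abort(line)
    if parsed is not None:
        name, limit = parsed
        print(f"{name}: limit {limit:.0f} s (about {limit / 3600:.1f} hours)")
```

Converting the limit to hours makes it easier to compare against the "target cpu run time" preference, which the other posts quote in hours or days.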
Scribe Send message Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Owlie forgot to mention that the ones above were also 4.82 with the CPU time set to 4 days...... |
Nite Owl Send message Joined: 2 Nov 05 Posts: 87 Credit: 3,019,449 RAC: 0 |
Owlie forgot to mention that the ones above were also 4.82 with the CPU time set to 4 days...... Thanks Scribbles... I set half my machines to 2 days and left the others at 4... |
ExtraTerrestrial Apes Send message Joined: 3 Jan 06 Posts: 3 Credit: 5,764,899 RAC: 2,487 |
I got one here, result: http://www.boinc.bakerlab.org/rosetta/result.php?resultid=11775907 WU (4.82): http://www.boinc.bakerlab.org/rosetta/workunit.php?wuid=6142533 client: http://www.boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=128060 error: <core_client_version>5.2.13</core_client_version> <message>Maximum CPU time exceeded</message> <stderr_txt> # random seed: 216581 # cpu_run_time_pref: 28800 </stderr_txt> MrS Scanning for our furry friends since Jan 2002 |
Cureseekers~Joschy Send message Joined: 8 Dec 05 Posts: 2 Credit: 1,969,809 RAC: 0 |
One more time: NO_SIM_ANNEAL_BARCODE_30_1n0u_251_13883_1 NO_SIM_ANNEAL_BARCODE_30_1dtj_251_13882_1 NO_SIM_ANNEAL_BARCODE_30_1n0u_251_13879_1 PRODUCTION_ABINITIO_2chf__250_2426_1 PRODUCTION_ABINITIO_2vik__250_1994_1 PRODUCTION_ABINITIO_2chf__250_2400_1 PRODUCTION_ABINITIO_1dhn__250_911_0 These 7 WUs have exit code "-177". |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Thanks much for reporting these CPU time errors. It looks like we were able to largely solve the problem in the jobs submitted after January. We reduced the number of structures per work unit and extended the max CPU time; none of these later jobs appear to have given the error. We're now setting up jobs for the updated application. As David will explain in a note soon, we're now tapping the BOINC resources to unleash the powerful information available in sequence "homologues" (sequences related to the target protein and thus expected to have nearly the same fold). Very exciting! These next jobs should hopefully be even less likely to trigger the max CPU time error. We are now allowing you to set the maximum time you want your computer to crunch (default 8 hours) before returning structures to us, rather than asking for a specific number of structures back. So far this seems to have worked on the test server -- please do report any further Max CPU time errors here! |
Darren Send message Joined: 6 Oct 05 Posts: 27 Credit: 43,535 RAC: 0 |
Whoa now, what is this??? I set my cpu time for 24 hours and I get a max cpu time exceeded after 10 hours. Here is the WU, and here is the pertinent info: CPU time 36185.368987 stderr out <core_client_version>5.2.14</core_client_version> <message>Maximum CPU time exceeded </message> <stderr_txt> # random seed: 910501 # cpu_run_time_pref: 86400 </stderr_txt> Validate state Invalid Claimed credit 101.245443882576 Granted credit 0 application version 4.81 |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
I have had two 'max cpu time exceeded' errors reported since upgrading to 4.82. It seems to have been caused by setting my 'target cpu run time' to 4hrs whilst these two WUs were already at 6+ hours of progress, or at least they both errored out shortly after I changed that value. These are the WUs in question: https://boinc.bakerlab.org/rosetta/result.php?resultid=11797455 https://boinc.bakerlab.org/rosetta/result.php?resultid=11796520 |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I have had two 'max cpu time exceeded' errors reported since upgrading to 4.82. It seems to have been caused by setting my 'target cpu run time' to 4hrs whilst these two WUs were already at 6+ hours of progress, or at least they both errored out shortly after I changed that value. I suspect this is a WU related issue. It is possible that the bounds limit has not been set right for these to accommodate the new time settings. I will bring it to the attention of the project team. Moderator9 ROSETTA@home FAQ Moderator Contact |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Just as I suspected, these latest Max time errors are a WU-related issue. Please see this from David Kim on the subject: "The max time errors are due to an older batch of work units. I cancelled that batch and also updated all the rsc_fpops_bound values to a fairly high value so as to not reach the limit in 4 days. It is difficult, though, to guarantee not reaching the limit since it also depends on the client's benchmark... ...In the future we will try to prevent sending out the work units that take a long time to produce a single model. The previous batches of ab initio runs use a filter that actually ignores structures that do not fit the filtering criteria; thus, for some proteins many structures are modeled before reaching one that passes the filters. We are going to turn the filters off for future batches and filter them ourselves as a post process. Thanks, David K" So while there may be a very few more of these that come out of longer queues, for the most part these Max time errors should now stop very soon. If you see any, please keep reporting them here on this thread. Moderator9 ROSETTA@home FAQ Moderator Contact |
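David Kim's note says the limit depends on both rsc_fpops_bound and the client's benchmark. A rough sketch of that relationship follows, assuming the client derives its CPU time limit as the bound divided by the host's benchmarked floating-point speed. That formula is an assumption about BOINC clients of this era, not something stated in the thread, and every number below is hypothetical.

```python
SECONDS_PER_DAY = 86400


def cpu_time_limit(rsc_fpops_bound: float, benchmark_flops: float) -> float:
    """CPU seconds allowed before abort, under the assumed bound / benchmark rule."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return rsc_fpops_bound / benchmark_flops


def bound_for(target_seconds: float, benchmark_flops: float) -> float:
    """Bound needed so a host with this benchmark gets target_seconds of CPU time."""
    return target_seconds * benchmark_flops


if __name__ == "__main__":
    bound = 4e14  # hypothetical rsc_fpops_bound for one work unit
    for flops in (1e9, 2e9):  # hypothetical benchmark results
        limit = cpu_time_limit(bound, flops)
        print(f"benchmark {flops:.0e} FLOPS -> limit {limit / SECONDS_PER_DAY:.2f} days")
```

Under this assumption, the same bound gives a host with a higher benchmark score a shorter CPU time limit, which would explain why one fixed value cannot guarantee 4 days on every client.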
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Any progress yet in granting credits for MCTE WUs? Or did I miss it somewhere? |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Already any progress in granting credits for MCTE WU's ?? As was reported before, it will be AT LEAST mid-March before the project team can deal with the credit granting process for this class of WU failures, and maybe longer. They did say they would grant the credit in due course, but they are focused on fixing run time errors at this time. The cause of the Max time errors has been isolated and fixed so people should not see any more of them. But the credit granting process takes time. Moderator9 ROSETTA@home FAQ Moderator Contact |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Already any progress in granting credits for MCTE WU's ?? Thanks. I don't visit these forums on a regular basis, so I must've missed it. |
©2024 University of Washington
https://www.bakerlab.org