Message boards : Number crunching : Report stuck & aborted WU here please
Author | Message |
---|---|
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 84 |
That's the bottom line: the DCF gets totally out of range for the longer WUs to complete successfully. I brought this up some time ago, but, like the 1% (or other percentage point) stuck WUs, which have been present since before I joined the project, nothing seems to be done about it. I'm sure the devs are trying to correct both problems, as it has to be a pain in the rear end to have them brought up day after day in the forum ... |
JDHalter Joined: 3 Nov 05 Posts: 13 Credit: 722,679 RAC: 0 |
I encountered the 1% bug on a WU with command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.81_windows_intelx86.exe xx 1lou A -output_silent_gz -silent -increase_cycles 10 -nstruct 40 and random seed: 510621. I aborted the WU, but I did get the stdout.txt file saved before it was aborted. I didn't read the instructions properly to test and help out on my own, sorry. Hopefully the info above will help to recreate the issue back at the UWashington Bakerlab, or wherever someone has the proper work unit files. JDHalter XPC - Rosetta@Home |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Our plan is to get accurate times for each WU using an alpha test server, and then adjust the number of structures made accordingly. This should solve the timing-out problem. (We are wondering whether some of the problems might also be coming from optimized clients with inaccurate throughput estimates.) The 1% problem we are still trying to figure out. |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote]Our plan is to get accurate times for each WU using an alpha test server, and then adjust the number of structures made accordingly. This should solve the timing-out problem. (We are wondering whether some of the problems might also be coming from optimized clients with inaccurate throughput estimates.) The 1% problem we are still trying to figure out.[/quote] David, actually I have tried the distribution version of BOINC (5.2.13), a version that was released through one of the Mac teams that, while optimized, does not increase credit claims or benchmarks (5.2.13), and a version that is optimized that was also released through one of the Mac teams. There is no difference in the percentage of Max time WUs between the three versions of BOINC I have tried, for any particular system. The fail point does vary in relation to system speed (slow systems generally fail with longer run times than fast ones). If no adjustments are made to the DCF, an optimized client will cause failures to occur at lower CPU time values on a particular system as compared to unoptimized clients. But the variation is only about 10% more failures on an optimized client that is not adjusted for a higher DCF. I also ran 5.2.10 for a few weeks, and I saw no Max time WU failures at all under that release, optimized or not (I tried both). To be fair, I used 5.2.10 before the 1% hang solution was deployed. This is the reason that I am now convinced that the 1% hang WU termination is actually at the bottom of the Max time failures. This appears to be a BOINC issue (optimized or otherwise) only insofar as the system is not able to handle widely divergent WU run times. Unlike many folks I have read on these boards, I plan to stick with this project through thick and thin, testing things and trying to provide info to your team. In a nutshell, this is what I have found by testing and observation. Optimized or not, the version of BOINC 5.2.13 does not matter.
The percentage of Max time failures is the same across optimized and unoptimized BOINC clients. The only thing that seems to reduce the number of Max time failures is to manually raise the DCF in the status file when you start to see Max time failures, but this will only work for about a day at best. The system NEVER raises the DCF on its own; it always lowers it, thus aggravating the problem. This is to be expected, because a WU that errors will not be used to adjust the system run parameters. Once the Max time fail point is established for a particular system, there is very little variation for that system. All WUs will generally fail at almost exactly the same CPU second in the process. Even if the DCF is altered manually to improve the situation, there is a high limit beyond which raising it further will not improve the failure rate. In part this seems to be related to the time-to-completion deductions being larger if the DCF is set too high. As the WU runs, these deductions actually run the clock out faster than would be the case with a lower DCF setting. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
[quote]Our plan is to get accurate times for each WU using an alpha test server, and then adjust the number of structures made accordingly. This should solve the timing-out problem. (We are wondering whether some of the problems might also be coming from optimized clients with inaccurate throughput estimates.) The 1% problem we are still trying to figure out.[/quote] Thanks, Phil. Just to be clear, we did not change anything as a fix to the 1% bug; the reason that WUs have been timing out is that the batches running the last couple of weeks are longer. We made them longer to reduce traffic and help out dialup users, but it is evident now that we did not increase the max CPU time limit enough (it is a tradeoff, though, as pointed out in this thread: the higher we make this limit, the worse the problems caused by the 1% bug). We don't understand your point that the system cannot handle widely divergent WU times. If each WU comes with an appropriate max CPU time cutoff, why should this be a problem? We are going to fix this by 1) doubling the max CPU time limit (David already did this), 2) tuning the max CPU time limit to the work unit (next batch of WUs), and 3) fixing the 1% bug. 1 and 2 are easy; if Phil is correct, then instead of 2) we can adjust the number of structures made so that each WU takes the same amount of time. 3) is the hard one, and is the one we suspect is a rosetta-boinc interaction problem! |
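The alternative mentioned in this post, adjusting the number of structures so that each WU takes about the same time, comes down to a division. A minimal sketch, assuming a per-structure CPU time has been measured on a test server; the function name and the timings are hypothetical, not project values.

```python
import math


def structures_for_target(target_cpu_seconds: float, seconds_per_structure: float) -> int:
    """Return how many structures fit in the target run time (at least 1)."""
    if seconds_per_structure <= 0:
        raise ValueError("seconds_per_structure must be positive")
    return max(1, math.floor(target_cpu_seconds / seconds_per_structure))


# Hypothetical timings: a fast protein at 300 s per structure, a slow one at 1800 s.
fast = structures_for_target(4 * 3600, 300)
slow = structures_for_target(4 * 3600, 1800)
```

With a four-hour target, the fast protein would get many more structures per WU than the slow one, so both WUs would finish in roughly the same CPU time on the reference machine.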
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Phil, I don't understand how DCF could affect the maximum CPU time allowed, since I can't find code that supports this in the standard BOINC source. |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
David, first let me say thank you for the time you and your team are spending working with all of us out here. I know it can't be easy. The wider the gap between the shortest and the longest WUs, the larger the adjustments to the DCF seem to be when adjustments are made. The system seems to adjust the DCF whenever it encounters a work unit that runs shorter than the estimated time for completion. As we know, the longest WUs often fail, and as a result the system will not use them to raise the DCF. Because of this, all the WUs that complete have the effect of lowering the DCF. The larger the gap between the smallest and the largest, the more of those at the top of the size range will fail. As more fail, the DCF adjusts to a lower value the longer the system runs under those conditions. As it is now, BOINC is not designed to deal with this situation. It is designed to have some tolerance for WUs that run for some reasonable amount of time longer than the projections, and then compensate for that by increasing the DCF. If all the WUs the system sees are longer than expected, but within the BOINC tolerance for long WUs, then the DCF will slowly rise to meet the actual processing time. But the BOINC tolerance for overtime WUs is nothing like 300 or 400 percent longer than a normal WU. Many of the recent R@H WUs have had a variation of as much as 1000%. This exceeds the tolerance ranges in the system. I have been tracking the BOINC Alpha developments, as I am sure you have, but so far as I can see this issue is not even on the radar screen. From my point of view it is shortsighted of the BOINC team not to have allowed for a large variation in WU size, but so far they have never seen variations on any of the projects with this kind of range. While I hate to say this, I think adjusting the times for WU completions to provide more accurate projections is good, but I do not think it will solve this problem by itself.
I have adjusted the DCF on a few occasions to allow as much as 35 hours to complete a WU, and they still fail on Max time at around 7. This implies that BOINC has some other function that it is looking at to determine the Max run length. This is what has led me to believe that there is some absolute maximum time range for each particular system, beyond which the WU will always fail. The approach of adjusting the number of structures to produce WUs of at least similar size should work. Even if you made the WUs longer than the largest we have now, it should still work, because the system could eventually adjust to that. I think if you do make them longer, everyone would have to reset the project when they first start showing up in the queues, or all of them will fail and the system will not adjust. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
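The selection effect described in this post (failed WUs never feed back into the correction factor, so only the short ones move it) can be illustrated with a toy model. This is a sketch under stated assumptions, not BOINC's actual update rule: the smoothing rule, the `rate`, the fixed cutoff, and all run times are hypothetical.

```python
def simulate_dcf(runtimes, base_estimate, cutoff, dcf=1.0, rate=0.1):
    """Update a toy correction factor only from WUs that finish under the cutoff."""
    for actual in runtimes:
        if actual > cutoff:
            continue  # errored result: contributes nothing to the correction
        ratio = actual / base_estimate  # how far off the base estimate was
        dcf += rate * (ratio - dcf)     # move a fraction of the way toward it
    return dcf


# Hypothetical mix of short and long WUs, in hours, against a 4-hour base estimate.
runtimes = [2.0, 10.0, 3.0, 12.0, 2.5, 11.0] * 20
biased = simulate_dcf(runtimes, base_estimate=4.0, cutoff=7.0)
unbiased = simulate_dcf(runtimes, base_estimate=4.0, cutoff=float("inf"))
```

In this model the factor drifts below 1.0 when every long WU is cut off, and settles well above 1.0 when all results are allowed to count, which matches the direction of the drift Phil reports without saying anything about how the real client computes it.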
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
[quote]I have adjusted the DCF on a few occasions to allow as much as 35 hours to complete a WU, and they still fail on Max time at around 7. This implies that BOINC has some other function that it is looking at to determine the Max run length. This is what has led me to believe that there is some absolute maximum time range for each particular system, beyond which the WU will always fail.[/quote] As I understand it, the rsc_fpops_bound sets the absolute maximum time (depending also on the benchmark). See http://boinc.berkeley.edu/work.php. I don't see how DCF affects the max run time. |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Phil, David, the DCF seems to be used in conjunction with the benchmarks by the BOINC client to determine the time display for "to completion" in the BOINC manager. This time projection is easily changed by stopping BOINC and editing the DCF for the project. It takes a little experimenting (or math) to figure out how much adjustment represents, say, one minute or one hour, but there is a direct relationship between the DCF value and the completion time shown in the manager. But adjusting the DCF will allow the system to run a WU longer. I have had the system routinely fail WUs at 7 hours, made an adjustment to the DCF, and successfully processed a 20-hour WU as a result. The relationship is direct. The DCF is adjusted on the fly by the system when a WU completes. Since we know that BOINC does all the housekeeping on CPU time, percent complete, time remaining, and credit claims, it is logical to assume that BOINC also adjusts the DCF. I cannot direct you right to the line in the code that does this. All of the other projects I run will start out with wild time projections for WUs when the software is first installed. After about five WUs complete, the system will usually settle into a more reasonable time projection. After a few days with projects like E@H and P@H, the time projections will be almost a perfect match for actual run times. All the while the system is making small adjustments to the DCF for the project. R@H makes much larger adjustments over much shorter times, and almost never projects a correct time for processing a WU. Now, what is interesting is that for R@H the time to completion actually rises along with the CPU time during processing. This is because the percent complete does not change, and the completion time is a function of percent complete and CPU time. Each time R@H checkpoints and adjusts the percent complete, the completion time drops in one big jump, in direct proportion to what the system "thinks" is appropriate for 10%.
But this is calculated by BOINC on the fly. So the amount of adjustment in the completion clock is based on a calculated 10%, but the 10% is based on a false value because the clock has been rising instead of falling during normal processing. So 10% is not really 10% of the processing time required. It is actually the original value of the completion clock plus the time added to the clock during the run. Since the amount added to the clock is not in a one-to-one relationship to the CPU clock, the amount deducted as 10% from the completion clock is never the same amount. For longer WUs the variation in these deductions is significantly more than for shorter WUs. This fact is part of what BOINC is not designed to deal with. I think this may have some play in the Max time problem, but I have not completed enough testing to determine exactly how. R@H is the only application where I have ever seen the completion time rise as processing proceeds. I am now seeing some of the "Production_abinitio" WUs where the percent adjusts more often than every 10%, and this helps even things out a bit. But the clock still rises during processing, and I am certain BOINC does not like that. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
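The rising completion clock described in this post follows from any estimator that projects total time from CPU time and percent complete. A minimal sketch, assuming the simplest such estimator; the formula and the numbers are illustrative assumptions, and the real BOINC client may compute its display differently.

```python
def remaining_estimate(cpu_seconds: float, fraction_done: float) -> float:
    """Naive time to completion: project total time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_seconds / fraction_done - cpu_seconds


# Progress is stuck at 10% while CPU time keeps accumulating.
stuck = [remaining_estimate(t, 0.10) for t in (600, 1200, 1800)]
# The reported fraction then jumps to 20% at the same CPU time.
after_jump = remaining_estimate(1800, 0.20)
```

While the reported fraction stays fixed, each additional CPU second increases the estimate; when the fraction finally jumps, the estimate drops in one step, and the size of that drop depends on how much CPU time had accumulated beforehand.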
Grutte Pier [Wa Oars]~GP500 Joined: 30 Nov 05 Posts: 14 Credit: 432,089 RAC: 0 |
I have 2 kinds of WUs aborting: BARCODE_FRAG_30_1di2_234_7434_0, 3x 234's for access violation. <core_client_version>5.2.13</core_client_version> <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> ***UNHANDLED EXCEPTION**** Reason: Access Violation (0xc0000005) at address 0x7C9122BA read attempt to address 0x7FFFFFF8 Exiting... </stderr_txt> at different times. PRODUCTION_ABINITIO_1louA_250_408_0, 3x 240's, all on CPU time exceeded. Will we get a refund on these ;)? It seems fair, because they are like the others before Christmas (not our PCs' fault). |
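The two numbers in the error message above are the same value: the negative exit code is the 32-bit status 0xc0000005 read as a signed integer. A small sketch of the conversion; the helper name is hypothetical.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned hexadecimal."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


code = to_unsigned_hex(-1073741819)
```

Masking with 0xFFFFFFFF keeps the low 32 bits, which turns the signed value reported by the client back into the hexadecimal code shown in the exception text.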
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote]I have adjusted the DCF on a few occasions to allow as much as 35 hours to complete a WU, and they still fail on Max time at around 7. This implies that BOINC has some other function that it is looking at to determine the Max run length. This is what has led me to believe that there is some absolute maximum time range for each particular system, beyond which the WU will always fail.[/quote] David, I will take a look at the link in a moment, but the DCF is obviously some kind of "tweaking" factor. I am certain it was never intended to make gross adjustments to run time. While it is true that the system has some internal functions to control run length, clearly BOINC should not have absolute times hardcoded in. So I think the rsc_fpops_bound sets the absolute time as you say, and I am certain it is a range, but it can be altered by the correction factor. As I said, I cannot point you to the specific code line, but I have seen the DCF adjustments work to lengthen the run time for a WU type. So at least empirical testing shows there is a relationship. I suspect that since Max time is an error, it must be inside one of the traps. I have noticed that on many of the Max time WUs I have processed, other systems have also failed on the same WU. If the system is a Mac it will show a Max time failure; if the other system is a PC it will show a memory access exception. So there seems to be some difference as to how the PC and the Mac see this problem. Interestingly, both system types seem to fail at between 80% and 90% complete. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Phil, thanks for your thorough explanations. Is this seen on both standard and optimized clients (where you adjust DCF and see changes in the max run time allowed)? If DCF is used for the benchmark, then it could affect the max run time allowed. |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Phil, yes, no one should edit the work unit xml file (or any files, in general). I provided the link to help explain what we should expect in terms of how the max run times are set. I'm still a bit confused about what you are seeing compared to what the expected behaviour is. |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote]Phil, yes, no one should edit the work unit xml file (or any files, in general). I provided the link to help explain what we should expect in terms of how the max run times are set. I'm still a bit confused about what you are seeing compared to what the expected behaviour is.[/quote] David, the problem is that while BOINC is an open book, it is really a black box in terms of application. There are many subtle things going on that are still not well understood when WUs come into play from the range of projects on board. I think this Max time thing is one of those things where BOINC is just not designed to deal with a single project with such a wide range in WU size. I have seen variations on projects when they change WU types a number of times. The change at E@H to the "Alberts" is a good example. The Alberts run about twice as fast as a normal E@H WU. But this variation only occurs once in a few months, never five or six times a day, and never covering the range we are seeing here. That said, BOINC SHOULD be able to deal with this. A project should not have to worry about this issue. It occurs to me, as a result of our discussion today, that the flops boundary may provide an answer for this issue. I do not think the system really cares how close the working estimate of the processing time for a WU is (I know the users will). What it cares about is this flops count boundary. I presume that you can adjust that with relative ease on the server. If so, that is the number I would think must be adjusted to solve the Max time problem. That said, you obviously need to allow for some headroom in that value. While we all might think that, say, 10% should be sufficient, I think I would take a more empirical approach. I might set that value sufficiently high, temporarily, so that any system should be able to complete any particular WU. I would then watch the results and see how long things are actually taking across the platforms reporting the results.
Then I would set the value 10 to 20 percent above what I was seeing as actual processing times. Just my opinion; your mileage may vary. I want you guys to succeed at this because the possibilities for your work are enormous. So if you think I can help, let me know. I am a Senior Computer Systems Analyst, and have designed and built a number of large applications over the years. I am getting further away from actual programming, but still do a lot of troubleshooting, testing and debugging, and I too work in a research lab. You guys have my e-mail address; if you want help, just let me know. We can trade phone numbers if you think that might help speed things up. I guess, in short, if you are looking for volunteers I am willing. I think you are doing a great job working on a hard bunch of problems short-handed, and I can see on some of the other threads that at least a few folks are becoming abusive and ungrateful. They do not speak for the majority. Regards Phil |
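The empirical approach suggested in this post (observe actual processing times, then set the limit 10 to 20 percent above them) can be sketched as follows. The function name and the observed times are hypothetical, and the sketch works in CPU seconds; turning the result into a server-side flops bound would additionally need a reference benchmark, which is not modelled here.

```python
def limit_with_headroom(observed_cpu_seconds, headroom=0.20):
    """Return a CPU time limit set `headroom` above the slowest observed run."""
    if not observed_cpu_seconds:
        raise ValueError("need at least one observed run time")
    return max(observed_cpu_seconds) * (1.0 + headroom)


# Hypothetical successful run times, in seconds, collected with a generous limit.
observed = [14400.0, 21600.0, 25200.0]
limit = limit_with_headroom(observed, headroom=0.20)
```

Using the slowest observed run rather than the average keeps the limit above every result that actually completed, which is the point of measuring first and tightening afterwards.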
TB Horst Joined: 1 Oct 05 Posts: 8 Credit: 208,743 RAC: 0 |
2006-01-17 08:32:05 [rosetta@home] Aborting result PRODUCTION_ABINITIO_2vik__250_18_0: exceeded CPU time limit 27251.420455 2006-01-17 08:32:05 [rosetta@home] Unrecoverable error for result PRODUCTION_ABINITIO_2vik__250_18_0 (Maximum CPU time exceeded) |
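The limit reported in the first log line is in CPU seconds. A small sketch that pulls the number out of the message and converts it to hours; the regular expression is an assumption based only on the wording of this one line, not on a documented log format.

```python
import re

LOG_LINE = (
    "2006-01-17 08:32:05 [rosetta@home] Aborting result "
    "PRODUCTION_ABINITIO_2vik__250_18_0: exceeded CPU time limit 27251.420455"
)


def cpu_limit_hours(line: str) -> float:
    """Extract the CPU time limit (seconds) from an abort message, in hours."""
    match = re.search(r"exceeded CPU time limit ([0-9.]+)", line)
    if match is None:
        raise ValueError("no CPU time limit found in line")
    return float(match.group(1)) / 3600.0


hours = cpu_limit_hours(LOG_LINE)
```

For this host the limit works out to a little over seven and a half hours of CPU time.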
[B^S] Paul@home Joined: 18 Sep 05 Posts: 34 Credit: 393,096 RAC: 0 |
Hi guys, found this thread interesting as I have been 'hit' with quite a few max time errors. I could be way off the mark here, but there does not appear to be a link (that I can find) between max cpu time and duration correction factor. The only place I can see max cpu time being set is in app.C. Here it sets max cpu time to be fpops bound / benchmarked fpops value: max_cpu_time = rp->wup->rsc_fpops_bound/gstate.host_info.p_fpops; However, that means for optimised BOINC clients where the p_fpops value is very high, the max cpu time will drop and runs the risk of being exceeded, even if the WU is progressing correctly. The same problem happens if the fpops_bound value is too low... The only way around that would be to have a sufficiently high fpops_bound or try to exclude non-standard clients (that would be a tin of worms!) Edit: since standard clients can also hit this error, it would seem that the fpops_bound value in some of the WUs might be a bit on the low side... Also, it has been noted before that BOINC can at times record CPU time closer to wall time rather than actual crunch time. I can't find the posts to back this up, but if that was happening, then the current_cpu_time would be increasing even though little work was being done... and more results will hit max_cpu. Cheers, Paul Wanna visit BOINC Synergy team site? Click below! Join BOINC Synergy Team |
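The relationship quoted from app.C in this post can be worked through with numbers. A minimal sketch in Python rather than the client's own C++; the bound and both benchmark values are hypothetical, chosen only to show the effect of an inflated benchmark.

```python
def max_cpu_time(rsc_fpops_bound: float, p_fpops: float) -> float:
    """CPU time limit in seconds: operation bound divided by benchmarked speed."""
    if p_fpops <= 0:
        raise ValueError("p_fpops must be positive")
    return rsc_fpops_bound / p_fpops


BOUND = 2.7e13                           # hypothetical rsc_fpops_bound for one WU
standard = max_cpu_time(BOUND, 1.0e9)    # hypothetical standard-client benchmark
optimized = max_cpu_time(BOUND, 2.0e9)   # hypothetical doubled benchmark
```

Doubling the benchmark halves the allowed CPU time even though the application itself runs no faster, which is the risk Paul describes for clients that report a very high p_fpops.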
kevint Joined: 8 Oct 05 Posts: 84 Credit: 2,530,451 RAC: 0 |
This work unit processed for about 5 hours; when BOINC did its automatic switch to work on another project, this WU aborted. https://boinc.bakerlab.org/rosetta/result.php?resultid=7095632 I have had this problem with other WUs on other machines. SETI.USA |
Snake Doctor Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
[quote] max_cpu_time = rp->wup->rsc_fpops_bound/gstate.host_info.p_fpops; [/quote] It would be really great if it worked the way it is documented. The fact is that, while not linear in the sense that adding 1 to the DCF will give you 1 additional minute of processing time, adjusting the DCF does in fact increase the time. Depending on the system benchmarks, this adjustment provides varying amounts of increase. I think that the sum of the earlier discussion was that the bound value should be increased. I suspect that that is what David (Kim) was looking at. It would seem that this value may have been static while the WU size increased, and that may be the problem. Regards Phil We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
godpiou Joined: 22 Dec 05 Posts: 7 Credit: 1,373 RAC: 0 |
[quote]This work unit processed for about 5 hours; when BOINC did its automatic switch to work on another project, this WU aborted.[/quote] Hi Kevint, it's my turn to help people in the same way others helped me with the same problem (I think it's the same one...). This is apparently a known bug. You just have to set "leave applications in memory when preempted" to YES under "View or edit general preferences", in the "Preferences" section. You can get there by clicking "Your account" on the Rosetta home page. Hope this helps! Godpiou |
©2024 University of Washington
https://www.bakerlab.org