Discussion on increasing the default run time

Author	Message
robertmiles Send message Joined: 16 Jun 08 Posts: 1251 Credit: 14,421,737 RAC: 0	Message 63043 - Posted: 25 Aug 2009, 23:48:35 UTC - in response to Message 63032. I'd happily change my run-time prefs so that computers that are on lots have a high run-time and the others have a low run-time but I find this really difficult as they're tied to the BOINC work/home/school settings (which I think are poor, but not the project's fault ;) ). I also use BAM but that doesn't allow changes to the run-time, so I'm left with the default. Being able to select a run-time preferences per machine would be useful, but probably only for a minority i guess... (just noticed the project haven't posted on this for a while!) I've noticed that the World Community Grid project lets you make some machine-specific settings through their web site, but then several other settings do not propagate through to that machine if changed in other ways. I don't use BAM, so I don't know if this is compatible with BAM. However, it looks like I may soon need to switch managers so that I can control BOINC on my two desktops from my laptop, which appears to be short of much power for running longer workunits well, so could you tell me if BAM seems suitable for that purpose? ID: 63043 · Rating: 0 · rate: / Reply Quote

Warped Send message Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0	Message 63068 - Posted: 28 Aug 2009, 17:05:24 UTC - in response to Message 63042. I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected 16 hours run time. However, I find this thread as well as the others discussing long-running models to be of little interest when I have work units running for about 4 hours. Is the preferred run time really applied? I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time. The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip? ID: 63068 · Rating: 0 · rate: / Reply Quote

dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0	Message 63069 - Posted: 28 Aug 2009, 18:08:22 UTC - in response to Message 63068. Last modified: 28 Aug 2009, 18:28:50 UTC I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time. The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip? I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;) In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct. That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly. ID: 63069 · Rating: 0 · rate: / Reply Quote

cenit Send message Joined: 1 Apr 07 Posts: 13 Credit: 1,630,287 RAC: 0	Message 63070 - Posted: 28 Aug 2009, 21:34:07 UTC - in response to Message 63069. I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time. The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip? I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;) In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct. That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly. "Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50, no new features only bugs solved). It was used as the easy way to solve some bugs that arose with large uploads (if I remember correctly, they didn't even investigate if the problem was in BOINC or somewhere in their code, because this trick solved easily the bug). I don't think that, atm, it's so important to solve drastically this problem; anyway, it should be interesting to know if they have any problem with server load now... ID: 63070 · Rating: 0 · rate: / Reply Quote

dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0	Message 63073 - Posted: 29 Aug 2009, 9:55:27 UTC - in response to Message 63070. Snip ... In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct. That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly. "Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50, no new features only bugs solved). It was used as the easy way to solve some bugs that arose with large uploads (if I remember correctly, they didn't even investigate if the problem was in BOINC or somewhere in their code, because this trick solved easily the bug). I don't think that, atm, it's so important to solve drastically this problem; anyway, it should be interesting to know if they have any problem with server load now... Interesting. If anyone is looking for fairly reliable repro steps on getting uploads to fail, try the following. Set up a machine, and adjust the maximum upload rate to 2 kbytes/sec on the advanced preferences page. Grab yourself a task like this one: 276073593 let it complete, and then try to upload it. The key section of the name appears to be the "ddg_predictions" string. I've seen a few of these guys going by, they seem to produce very large result files. I've had two that are in excess of 7 Mb and one that was over 11 Mb. It's worth noting that if I temporarily adjust the upload speed to something over my connection's max (384 kbits/sec, i.e. ~ 40 kbytes/sec), the transfer will then go through without problems. However it's a bit of a pain doing this, I'm about to the point that if I find another of these WU that's stuck uploading, I'm going to force that upload through, and then abort any of these jobs that I see in the queue. ID: 63073 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 63076 - Posted: 29 Aug 2009, 13:44:35 UTC So yes warped, you are seeing what I would expect. When your preference of 16hrs is near, the tasks end. And if 99 models is reached prior to that, the task will end at that time (at least in the "mini" application). The amount of data reported back on the uploads varies by the type of protein and type of study being done. But the primary factor or multiplier on the size is the number of models. At one point there were batches of WUs that were running 20 models an hour. The upload size and potential for hitting the maximum outfile size were very large for long runtime preferences. 99 models was just a way to strike a compromise between giving the desired runtime, and having a predictable and reasonable upload size. dgnuff From what you are describing, it sounds like the only issue with any given type of work unit is the resulting size of the result file. Any time you have a large file that must move and a very limited bandwidth, there is a conflict to be resolved. The BOINC client can do partial file transfers and continue where it has left off. But I believe it also times out on connections that are actually moving data as well. I've seen connections ended after 5 minutes, and then restarting, at least on downloads. I presume uploads are similar. I am not sure why Berkeley made the client work that way. Seems to me that an active connection that is still successfully moving data should be left alone. So when you say you can get an upload to "fail", do you mean a retry occurs? Or do you mean that so many retries occur that... well the WU you linked looks like it arrived in-tact. So, eventually the upload was completed. I am unclear what you mean about the upload being "stuck". I think what you are seeing sounds normal for a connection with very limited bandwidth. And the client will continue working on, and completing getting it sent all by itself. This is part of why they decided to limit to 99 models too. The uploads on tasks that produced many many models were approaching 100MB, which is large enough to cause difficulty in many environments. Rosetta Moderator: Mod.Sense ID: 63076 · Rating: 0 · rate: / Reply Quote

thatoneguy Send message Joined: 8 Jun 06 Posts: 3 Credit: 2,636,731 RAC: 0	Message 64601 - Posted: 26 Dec 2009, 3:45:42 UTC - in response to Message 56932. Back to the main issue... what would be the best way to transition to an increased run time. If it is possible to do so, temporarily decrease the amount of work that can be downloaded. I think it is possible to fudge the report deadline so that computers don't ask for more work, but still receive credit for past due WUs. Following the change, simply increasing the deadline would ease almost all problems stemming from long run-times. The problem remains of course that WUs may take a long time to return to the server. As long as credit is given for the late work, I think most people won't care about the change (except for the few people who have their computer on so seldom that they won't be able to complete any work on time). ID: 64601 · Rating: 0 · rate: / Reply Quote

S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0	Message 64914 - Posted: 11 Jan 2010, 16:07:22 UTC I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited Boinc and went to bed. This morning the said WU restarted but from 0% I lost 3 hours of work on just this WU not taking into consideration the other WU that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to loose uncountable hours of work. ID: 64914 · Rating: 0 · rate: / Reply Quote

S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0	Message 64915 - Posted: 11 Jan 2010, 16:32:14 UTC On second thought, you can do whatever you want. I am outa here............. ID: 64915 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64917 - Posted: 11 Jan 2010, 17:12:46 UTC Steve, what you are describing is a very new issue that has turned up with a new type of work unit that seems to be having some checkpointing issues. Transient, I don't think an overclock would be needed to cause the symptoms he's reporting. I've asked Sarel to look in to it. Steve, I'm curious how the runtime of a task is effecting your user experience (other then loss of work, which I clearly already understand). You appear to have racked up 25,000 credits in just 4 days, so clearly you have machines running 24x7 so how does running one task for 3 hours have a disadvantage over running 3 tasks for an hour each? Rosetta Moderator: Mod.Sense ID: 64917 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 64918 - Posted: 11 Jan 2010, 18:24:48 UTC - in response to Message 64914. I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited Boinc and went to bed. This morning the said WU restarted but from 0% I lost 3 hours of work on just this WU not taking into consideration the other WU that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to loose uncountable hours of work. can you give us more information. what was the job id? can you link us to your job information? DK ID: 64918 · Rating: 0 · rate: / Reply Quote

S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0	Message 64921 - Posted: 11 Jan 2010, 20:57:40 UTC Hi, so let me try to explain this better. If you have 4 or 8 or 12 WU in varying degrees of completion and you shut down for the night (because I shut 2 computers of 3 down at night) the average loss will be higher than 1 hour WU. When you restart the next morning and you loose everything that you did the night before it gets frustrating and it has been so for the past 4 days. That is why I am not really interested in your project. I have since detached from your project so I cannot give you WU numbers. Thank you. ID: 64921 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64924 - Posted: 12 Jan 2010, 4:21:20 UTC Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. Giving a balance between losing CPU effort, and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less then 7.5min. of CPU time is lost when powering off. Sarel is making the needed changes (posted here) so this will be true for his new type of work units as well. I just don't want to see you leave a very worthwhile project for the wrong reasons. Mad Max's Post on Saturday the 9th was one of the first posts that was specific enough to identify the problem, and yours then confirmed the issue. And here we are Monday the 11th, and the problem is being addressed. Rosetta Moderator: Mod.Sense ID: 64924 · Rating: 0 · rate: / Reply Quote

DJStarfox Send message Joined: 19 Jul 07 Posts: 145 Credit: 1,255,054 RAC: 0	Message 64939 - Posted: 12 Jan 2010, 20:23:25 UTC Even if you ignore Steve's experience with this project, I hope you recognize that one point has been made clear repeatedly. Checkpoints are a critical feature of BOINC applications. If you need to make checkpoints work within a single decoy's generation, then make it happen. Given that, there's nothing wrong with doubling the default/minimum run times. ID: 64939 · Rating: 0 · rate: / Reply Quote

Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 30,228,217 RAC: 2,447	Message 64957 - Posted: 14 Jan 2010, 1:51:55 UTC - in response to Message 64924. Last modified: 14 Jan 2010, 2:18:58 UTC Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. Giving a balance between losing CPU effort, and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less then 7.5min. of CPU time is lost when powering off. On my observations (after I have faced a similar problem, I some time watched disk writing of Rosetta application) majority of WUs wrote checkpoints even much more often - about 1 time each 1-2 minutes (I think according to setting in BOINС which by default set to 60 seconds). Except for two types WUs - one did not write checkpoints at all (as you have marked this problem is already localised and FIX for it should be included in new version Rosetta mini 2.05) and another wrote checkpoints as usually, but after restarting for any reason could not use them (or did not try at all). If the job of 2nd type once again gets to me I will try to catch it. I think an indirect tag of such tasks there should be a bad ratio between "claimed credit" and "granted credit" (on the scale of the concrete computer). As in this case: https://boinc.bakerlab.org/rosetta/result.php?resultid=309578283 I think having the complete server statistics probably to sort tasks by this ratio and to look in what types of tasks there is a bad ratios more often. By this criterion tasks having one (or both) from following disadvantages should "emerge": 1. Problems with checkpoints mechanism 2. Bad optimisation (executed more slowly in comparison with the others) But, while I do not have any ideas how to separate one from others... P.S. I am impressed by speed of response (only few days between "bug report" and fix for it), on matching with many other projects it are very fast feedback. ID: 64957 · Rating: 0 · rate: / Reply Quote

Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 64966 - Posted: 14 Jan 2010, 14:33:13 UTC Last modified: 14 Jan 2010, 14:35:02 UTC @S_Koss: why don't you just hibernate the systems instead of shuting them down? Works perfect for me and I don't loose even one second of work. BTT: no problem for me if the default run times be encreased, I run WUs for 12-24 hours. . ID: 64966 · Rating: 0 · rate: / Reply Quote

Rabinovitch Send message Joined: 28 Apr 07 Posts: 28 Credit: 5,439,728 RAC: 0	Message 64973 - Posted: 14 Jan 2010, 17:07:41 UTC - in response to Message 56932. We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers. Nice idea. And what about increasing maximum crunching time? I am ready crunch even several days if it necessary, or crunch untill all models will be processed. What about checkbox like "Work till the end"? :-) ID: 64973 · Rating: 0 · rate: / Reply Quote

Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 30,228,217 RAC: 2,447	Message 64981 - Posted: 14 Jan 2010, 21:44:43 UTC - in response to Message 64957. Last modified: 14 Jan 2010, 21:53:40 UTC On my observations (after I have faced a similar problem, I some time watched disk writing of Rosetta application) majority of WUs wrote checkpoints even much more often - about 1 time each 1-2 minutes (I think according to setting in BOINС which by default set to 60 seconds). Except for two types WUs - one did not write checkpoints at all (as you have marked this problem is already localised and FIX for it should be included in new version Rosetta mini 2.05) and another wrote checkpoints as usually, but after restarting for any reason could not use them (or did not try at all). If the job of 2nd type once again gets to me I will try to catch it. I think an indirect tag of such tasks there should be a bad ratio between "claimed credit" and "granted credit" Long it was not necessary to wait, it is seems I got one of such tasks just right now. I will post the "report" in an appropriate topic a bit later: minirosetta 2.03 ID: 64981 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 64982 - Posted: 14 Jan 2010, 21:46:36 UTC - in response to Message 64973. And what about increasing maximum crunching time? I am ready crunch even several days if it necessary, or crunch untill all models will be processed. What about checkbox like "Work till the end"? :-) There is no end. There are literally trillions of trillions of possible models. The current 24hr maximum attempts to strike a balance between getting results back to the project with a fast turnaround time, and minimizing burden on servers and bandwidth for downloads. Originally the maximum was 4 days, but just think if a problem arose and you ran for 4 days before the watchdog realized it and kicked in to end the task. Rosetta Moderator: Mod.Sense ID: 64982 · Rating: 0 · rate: / Reply Quote

Nuadormrac Send message Joined: 27 Sep 05 Posts: 37 Credit: 202,469 RAC: 0	Message 66124 - Posted: 15 May 2010, 0:54:02 UTC - in response to Message 56983. This also brings up another issue with such a possible increase; though it's a credit related one, so might not take the same precedence as... And yet depending how the units are treated, it might effect the science also. If the processing time is increased, and the unit deadlocks, hangs, or in some way crashes after the initial model(s) had been successfully been processed, it will after whatever time is spent hanging, error out. And yet not everything in the WU was bad. Now because the units don't have the time involved of a CPDN unit, it's unlikely that trickles would be introduced. However, an effect of lengthening the runtime can also be that a unit that does error latter on will have a higher chance to error out; and if this occurs then any science which was accumulated prior to the model within the WU that did error could be lost, and the credits for those models which were completed without error most assuredly would be, unless something along the line of trickles or partial validation/crediting could be implemented to allow the successfully processed models within the unit to be validated and counted as such. I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully. But as I report here (and previously) I get Mini Rosetta WUs constantly crashing out with "Can't acquire lockfile - exiting" error messages - maybe 60% failure rate with a 3-hour runtime, reducing to 40% failure rate with a 2-hour run time. I've seen this reported by several other people running a 64-bit OS - not just on Vista or with an AMD machine. That said, I don't know how widespread it is. Perhaps you can analyse results at your end. As stated in the post linked above, I get no errors at all with Rosetta Beta, so I'm inclined to think it's not some aberration with my machine. I'd really like to see some feedback on this issue and some assurance it's being investigated in some way. I'd ask that a minimum run time of 2 hours is allowed (I can just about handle that) or some mechanism that allows me to reject all Mini Rosetta WUs. If not, I'm prepared to abort all Mini Rosetta WUs before they run. It's really a waste of time me receiving them if 60% are going to crash out on me anyway. I've commented on this before here, here, here and first of all and more extensively here - see follow-up messages in that thread. No such issues arose for me with my old AMD single core XPSP2 machine - only when I got this new AMD quad-core Vista64 machine. Any advice appreciated. It's a very big Rosetta issue for me, so while I'm sure you'll save a whole load of bandwidth if you go ahead with the proposed changes I just hope some allowance can be made for people in my situation. ID: 66124 · Rating: 0 · rate: / Reply Quote