Message boards : Number crunching : Does Rosie create new jobs if no 'net connection available?
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
hey - how come this thread has suddenly become the River tutorial page. I'm stopping here before even I get tired of the sight of my own typing ;-) True - the exponential backoffs are built into BOINC, but that means they are in the client, not in the app. The simple approach of the app testing the network connection would produce the problem I stated. So what is needed to do the automated extension properly is to provide a way for the app to ask the client to download the next WU (or do an empty update if no work is wanted at present). It then tests to see if the scheduler request has succeeded. The client will go into update deferral if the request does not succeed, and the update will be re-tried on an exponential backoff. New code for the app is to send the request (just once) and the test (that time and each following time). New code for the client is to receive and act on the two kinds of request from the app (unless an existing request can be pressed into service). So I guess what I meant was it is a lot more work than just putting in a line in the app to send a network test packet out. It is POSSIBLE but not simple
Absolutely. But in my opinion it would not be worth the effort for the amount of work saved, not least because of the risk that even one bug in the implementation could lose more work than is likely to be saved. A few local outages don't add up to a lot of lost crunching on the level of the project (annoying as they are to the cruncher affected). A project outage is best covered by doing work for other projects in the meantime. And because it involves changes to the BOINC code as well as to the app, the changes would have to be re-done for each new client, or you'd have to persuade Berkeley that they should adopt the new interface. But if enough people want it, yes it is possible. River~~ |
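To make the "update deferral" idea concrete, here is a minimal Python sketch of an exponential backoff schedule for a failed scheduler request. The function name, the 60-second base delay, the 4-hour cap and the jitter option are illustrative assumptions only; the real BOINC client uses its own constants and logic.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, jitter=0.0, rng=None):
    """Return the wait (in seconds) before each retry of a failed request.

    The delay doubles after every failure until it reaches `cap`.
    `base`, `cap` and `jitter` are made-up values for illustration,
    not the constants the BOINC client actually uses.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        # Optional random spread so many hosts do not retry in lockstep.
        delay *= 1.0 + jitter * rng.random()
        delays.append(delay)
    return delays


if __name__ == "__main__":
    for i, d in enumerate(backoff_delays(8), start=1):
        print(f"retry {i}: wait {d:.0f} s")
```

The point of the doubling is that a long outage costs only a handful of failed requests, while a short one is still retried quickly.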
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,885,277 RAC: 1,668 |
So what is needed to do the automated extension properly is to provide a way for the app to ask the client to download the next WU (or do an empty update if no work is wanted at present). It then tests to see if the scheduler request has succeeded. The client will go into update deferral if the request does not succeed, and the update will be re-tried on an exponential backoff. Rather than having to implement this in BOINC, I might be over-simplifying, but I can't see why it couldn't be done by Rosetta alone(?). I can't see any reason there'd need to be a change to anything other than the Rosetta app. Here's my reasoning: No changes would be required server-side, as AFAIK the server doesn't know how many models it's going to get back from a job anyway - it's down to the time taken for each run that determines the number of models. I wouldn't have thought that adding extra runs into the result file would make any difference to the server. Rosetta must be able to tell if there is work available as it crunches the work if it is! If there is no work available then it's a case of running the next model of the current job and adding it to the result file. I doubt BOINC has any interaction with the result files other than sending them as and when it should(?), so it seems to me that the required change is purely within Rosetta: If it's the last job in the queue then it shouldn't be marked for upload as more models will be run on it until there's another job added to the queue. It can then be marked for upload by BOINC, and as BOINC has no idea what's in the result file it'll send it as if it was the result file without the extra models in.
It's not just project outages, but connection outages at either end and firewall issues. I had a computer with 118 hours of idle time over the weekend as it had no net connection available. I'd have thought this would add quite a few percent onto the total production, and that percentage will scale with the project as it grows. It also means people will find it easier to manage - especially those on dial-up or with intermittently available connections, as they don't need to micro-manage the project to the same extent. Are there any major flaws in my logic? (I don't mind minor flaws...) |
Rollo Send message Joined: 2 Jan 06 Posts: 21 Credit: 106,369 RAC: 0 |
I am no expert, but I think that Rosetta has no knowledge of the BOINC queue. Therefore it cannot decide whether it is working on the last model or not... |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I think that's right. R@H would have to "complete" the WU, delivering a results file to BOINC for transfer back to the project. It wouldn't really "see" it's out of work until it's already "completed" the WU. And it wouldn't know if more work is downloading presently. So these are some of the hiccups to making it work. There's no way to pull the WU BACK from BOINC and say "I've decided I'm NOT done with it, please give it back to me... and never mind about sending this result, I'll return it later when I complete it... again". Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
agree so far
No. There are two lots of software running on your machine, the client and the app. The client is written by BOINC, and is the same across all projects. The app is project specific, and sometimes there can be more than one app to a project (as for example when E@H was running a mix of work on the Einstein and Albert apps). It is the client that schedules work on the local box, not the app and not the scheduler. Therefore to make this scheme work there must be new communication between the app and the client. Therefore the client needs to change. Therefore the changes involve BOINC code. It is the client that tells the app to start, stop, pause, unload from memory, resume, abort. Where these things are initiated by the user, the manager tells the client and the client tells the app.
Unless the reason for no work is that the user has set "No more work", or the client has decided it is overcommitted and is not downloading new work till the box is a bit emptier. Without communicating with the client the app cannot know *why* there is no more work held locally. Your simplification makes it impossible ever to stop crunching Rosetta work - which is totally against the BOINC spirit. To implement this acceptably we must make sure that the auto extend feature kicks in *only* when it is a network issue preventing new work arriving. And as it is the client rather than the app that handles the network, that means it is a client (i.e. a BOINC) issue
True - once we get to upload there is no issue with the current code being used
Please read my post again on this point. What I actually said was
Yes local outages are annoying to the individual - I just had a five week network outage on my seven fastest boxes due to a telecom company error. It is ******y annoying. But it is still small beer compared to the project at large. And how far do we go with returning supersize-me work units - could Rosetta have used work returned 35 days after it was issued and 35x normal size? What would that do to the disk quotas? Just where would I stop? To do this as a quick hack would be dangerously stupid. The implications for the servers and for other projects sharing the client are complicated. To do this properly would take a lot of careful design and, in my opinion, the time spent would be better spent developing another relaxation strategy or suchlike. I have not been reticent about suggesting new innovations where they can be done simply and safely (I am proud to say that two of my ideas have been adopted by this project) but I repeat, code that runs on autonomously can do a lot of damage if it runs on when it should not. It is inherently a BOINC issue, not a local project issue. There is a non-automated way to do it, as suggested by Feet1st - the advantage of using manual intervention is that it will only be applied when it really is a network issue. Seems by far the best way forward to me. I am currently testing a 37 hour run to see if it breaks anything... watch this space River~~ |
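The condition argued for in this post (extend the current task *only* when a network problem is what stops new work arriving) can be written down as a small decision function. This is a sketch in Python under stated assumptions: `ClientState` and all of its field names are hypothetical, and no such interface between the app and the BOINC client exists in the thread; it only shows which facts the app would need from the client.

```python
from dataclasses import dataclass


@dataclass
class ClientState:
    """Facts only the client knows. All field names are hypothetical."""
    queued_rosetta_tasks: int            # other Rosetta tasks waiting locally
    no_new_work_requested: bool          # user has set "No more work"
    overcommitted: bool                  # client is deliberately not fetching work
    last_scheduler_request_failed: bool  # last update did not reach the server


def should_extend_current_task(state: ClientState) -> bool:
    """Extend the running task only when a network failure is the
    reason no new work has arrived."""
    if state.queued_rosetta_tasks > 0:
        return False  # more work is already on the box
    if state.no_new_work_requested or state.overcommitted:
        return False  # the queue is empty on purpose
    return state.last_scheduler_request_failed
```

Every input to the function lives in the client rather than the app, which is the reason given above for treating this as a BOINC change and not a Rosetta-only change.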
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Fellows, before we go on discussing ideas, here is the BOINC API: http://boinc.berkeley.edu/api.php All BOINC project science apps MUST be designed around its features. and here is the source: http://setiathome.berkeley.edu/cgi-bin/cvsweb.cgi/boinc/ It's obvious that what Rosetta needs isn't present in the current BOINC features. Which is probably why other similar projects (e.g. FAH) haven't transitioned to BOINC yet. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,885,277 RAC: 1,668 |
fair enough! I think it'd be a good feature to add at some point, but of course the bugs are priority. What's this workaround then? Increase the time in account_boinc.bakerlab.org_rosetta.xml: <cpu_run_time>28800</cpu_run_time> ? If you run out of work and then increase this value and restart BOINC will the job pick up and continue or is it too late by then? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
If you run out of work and then increase this value and restart BOINC will the job pick up and continue or is it too late by then? In order to expect it to work, you'd have to figure out that you are GOING to be out of work, or know ahead of time that you won't have a network connection or whatever the case... wait for the end of a model in the current WU ;), exit BOINC, increase the time setting, restart BOINC. It will crunch that last WU longer. Keep in mind the project sets the max time to run to 24hrs presently. So, let's not exceed that. If you had several WUs that weren't completed yet, they'd all run longer. The new target runtime probably won't be reflected in the WU "remaining time" calculation, so don't be mystified. Once the WU is "completed" I think you're out of luck. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
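For anyone who would rather script the "increase the time setting" step than edit the file by hand, here is a rough Python sketch. It rewrites only the `<cpu_run_time>` element named earlier in the thread and refuses values above the 24-hour limit mentioned above. The function name is made up, the wrapper element in the sample is a placeholder, and nothing here is an official tool; run it only while BOINC is stopped, as in the steps above.

```python
import re

MAX_RUN_TIME = 24 * 3600  # the 24-hour project limit mentioned in the thread

_TAG = re.compile(r"<cpu_run_time>\s*(\d+)\s*</cpu_run_time>")


def set_cpu_run_time(xml_text: str, seconds: int) -> str:
    """Return xml_text with the <cpu_run_time> value replaced.

    Raises ValueError if the value is outside 1..MAX_RUN_TIME or the
    element is missing. Only this one element is touched; the rest of
    the text is passed through unchanged.
    """
    if not 0 < seconds <= MAX_RUN_TIME:
        raise ValueError(f"run time must be between 1 and {MAX_RUN_TIME} s")
    new_text, count = _TAG.subn(
        f"<cpu_run_time>{seconds}</cpu_run_time>", xml_text
    )
    if count == 0:
        raise ValueError("no <cpu_run_time> element found")
    return new_text


if __name__ == "__main__":
    # <prefs> is a placeholder wrapper, not the real file layout.
    sample = "<prefs><cpu_run_time>28800</cpu_run_time></prefs>"
    print(set_cpu_run_time(sample, 86400))
```

To apply it to the real file you would read account_boinc.bakerlab.org_rosetta.xml into a string, pass it through `set_cpu_run_time`, and write the result back, keeping a backup copy first.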
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Fellows, before we go on discussing ideas, here is the BOINC API: Though the program is open source and can be developed by anyone, so if you want to add a feature you could code it and hope they will add it in to a mainstream release. Maybe the GUI RPC could be used? Kind of backwards though :-s Or since network features are available to the app (like XtremLab probably uses, but then it also communicates by itself): Requesting network connection: If it appears that there is no physical network connection (e.g. gethostbyname() fails for a valid name) then call boinc_need_network(). This will alert the user that a network connection is needed. Periodically call boinc_network_poll() until it returns zero. Do whatever communication is needed. When done, call boinc_network_done(). This enables the hangup of a modem connection, if needed. void boinc_need_network(); int boinc_network_poll(); void boinc_network_done(); Seems it can see if there is no connection? Team mauisun.org |
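The call order in the quoted API text (need, poll until zero, communicate, done) can be illustrated with a small simulation. This is Python, not C, and `FakeBoincNetwork` is a stand-in written for this example: it is not a binding to the BOINC library, and `max_polls` and `poll_interval` are added assumptions that the quoted text does not mention.

```python
import time


class FakeBoincNetwork:
    """Python stand-in for the three C calls quoted above.

    This is NOT a real binding; it only imitates the documented
    behaviour so the call order can be shown.
    """

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready
        self.log = []

    def boinc_need_network(self):
        self.log.append("need")

    def boinc_network_poll(self):
        self.log.append("poll")
        if self._remaining > 0:
            self._remaining -= 1
            return 1   # non-zero: still no connection
        return 0       # zero: connection available

    def boinc_network_done(self):
        self.log.append("done")


def communicate_when_online(api, do_io, poll_interval=0.0, max_polls=100):
    """Follow the quoted sequence: need -> poll until 0 -> I/O -> done.

    Returns True if the I/O ran, False if polling gave up.
    `max_polls` is an added safety limit, not part of the quoted text.
    """
    api.boinc_need_network()
    for _ in range(max_polls):
        if api.boinc_network_poll() == 0:
            do_io()
            api.boinc_network_done()
            return True
        time.sleep(poll_interval)
    return False
```

As quoted, these calls tell the app whether a connection is available; they say nothing about what is in the client's work queue.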
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
If you run out of work and then increase this value and restart BOINC will the job pick up and continue or is it too late by then? I tried one for 37 hours and it worked fine - could we have some comment from the Rosetta folk please - in extremis (where we do not have a local connection, and where the alternative is that work on Rosetta stops), what is the longest run time you'd tolerate? Could your servers handle a million-second run (just over 11.5 days)? 2 million (23 days, WU would be late)?
I tested Feet1st's strategy, and after a stop / restart the new target runtime is reflected in the remaining time calculation and in the % done. The remaining time can end up a bit daft.
Agreed - it would take a lot deeper understanding of the file structures to revitalise a result once it has got to the completion code. Another interesting feature is that say you normally run with an 8-hour run length, and say you set 86400 (24 hours) as an emergency run length, you can then hit Update right away. The Update fails (as the net is down), but re-tries on a backoff schedule. As soon as the Update works, the time resets to your normal 8 hours. Any result with over 8 hours on the clock then finishes the model it is on, and then stops. My suggestion, then, once the Rosetta folk give us an upper limit on the tolerable length of a WU, is to set that max length when your local net goes down, together with an update attempt immediately after restart. That way the overrun will be kept as short as possible. Sample results: 37 hour run [note how I miscalculated 37 hrs] long run cut short by update [bother! no longer available] River~~ |
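The behaviour described here (the run length is only checked between models, so a result that is already over the new target finishes its current model and then stops) can be simulated in a few lines of Python. This is a sketch of the observed behaviour, not Rosetta's code; the function name and the `target_changes` mechanism are assumptions made for the example.

```python
def run_models(model_times, target_run_time, target_changes=None):
    """Simulate a task that checks its target only between models.

    model_times      -- CPU seconds each successive model takes
    target_run_time  -- initial target in seconds
    target_changes   -- optional {elapsed_seconds: new_target}; a change
                        takes effect at the first model boundary after
                        that much CPU time has been used
    Returns (models_completed, cpu_seconds_used).
    """
    changes = sorted((target_changes or {}).items())
    elapsed = 0
    done = 0
    for t in model_times:
        if elapsed >= target_run_time:
            break  # over target: stop instead of starting another model
        elapsed += t  # a model is never interrupted part-way
        done += 1
        while changes and changes[0][0] <= elapsed:
            target_run_time = changes.pop(0)[1]
    return done, elapsed


if __name__ == "__main__":
    hour = 3600
    # Emergency 24 h target; the Update succeeds after 10 h and resets it to 8 h.
    print(run_models([hour] * 30, 24 * hour, {10 * hour: 8 * hour}))
```

In the example the task has used 10 hours when the target drops back to 8, so it stops at that model boundary rather than running on to 24 hours.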
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
The Update fails (as the net is down), but re-tries on a backoff schedule. As soon as the Update works, the time resets to your normal 8 hours. Any result with over 8 hours on the clock then finishes the model it is on, and then stops. Oh, yep, great point! And if you DO have a network connection, say before you leave for vacation, the update to project is the simplest way to make changes (no xml file editing required). You can download a set of WUs. And suspend them and update to project and download more if you want. Then change your Rosetta preference for WU duration. Then update to project again. Release all the WUs. Now go to the commands menubar and select suspend network activity (so you don't update again during this connection). Go back to your Rosetta preferences and set your WU duration back to where you like it to be. Disconnect from the net, and then enable network activity again (otherwise I always seem to forget) :) River~~, I'm surprised you got a 37hr WU. BOINC has a max WU duration failsafe, and I've been somewhat puzzled how people report "hung" WUs that run for more than 24hrs when my understanding was that 24hrs was the present setting for max time before abort. I haven't had time to track down the proper file and attribute. But the project sends with each WU a maximum reasonable runtime, and if it exceeds that, it is supposed to be killed on the client by BOINC. Perhaps there's a BOINC bug in there somewhere, and that's not happening properly. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
The file upload size seems to be reasonable (I run at 1 day length and they're not much larger than the shorter jobs, say 300kb each). The only time there was a large upload was with some old jobs we did (maybe part of the 'full/everything' jobs, I cannot remember) when it was a stonking 2.5Mb. Team mauisun.org |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I don't think there is any limit to how long a WU can run except: 1) The WU needs to be uploaded before its deadline. 2) An unstable system is more likely to have an error the longer the WU runs. 3) There is a timeout that will eventually abort the WU. This timeout uses the BOINC method, so it is based on ops/benchmark, not time. It will be about 24 hours on a fast system with an optimized benchmark, but on a slow computer with the recommended Linux client it could be hundreds of hours. The timeout might be removed once they get the watchdog thread working. |
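Point 3 is easier to see with numbers. If the limit is an operation count divided by the host's benchmark score, the same work unit gets a very different wall-clock limit on different machines. The Python sketch below uses made-up figures (an operation bound chosen so that a 3 GFLOPS benchmark gives 24 hours); the names and values are assumptions, not numbers taken from Rosetta or BOINC.

```python
def timeout_hours(ops_bound, benchmark_flops):
    """Wall-clock limit (hours) implied by an operation-count bound."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return ops_bound / benchmark_flops / 3600.0


if __name__ == "__main__":
    # Hypothetical bound: 24 h of work at a 3 GFLOPS benchmark score.
    ops_bound = 3e9 * 24 * 3600
    for label, flops in [("fast host, 3 GFLOPS", 3e9),
                         ("slow host, 0.3 GFLOPS", 0.3e9)]:
        print(f"{label}: abort after about {timeout_hours(ops_bound, flops):.0f} h")
```

With these figures a host whose benchmark is ten times lower gets a limit ten times longer, which matches the "hundreds of hours" described above.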
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
yes, Feet1st, I was surprised too. But The Answer has already been posted (thanks to AMD_is_logical) - the max is in terms of ops not time. The box on which I tried the 37 hour run was a slow one, ~663MHz. I guess that accounts for most if not all of the long running hung WUs too. R~~ |
©2025 University of Washington
https://www.bakerlab.org