Message boards : Number crunching : Rosetta 4.0+
Author | Message |
---|---|
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back out to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? That should only affect me in the end, not the project...? Regarding the Rosetta deadline, I had not noticed it was so short indeed. But my cache is not "Rosetta only": I've always been a multi-project BOINCer. It's true it's an old habit from when the internet was not so stable and projects would often run short of tasks, so having a cache was always a pleasant idea. But again: this was absolutely not the problem I faced with the mini tasks (see the whole history of my explanations above). And again, I "solved" it by blocking the mini on that machine; that was enough for me and was not doing any harm to the project research. Thanks. |
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back out to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? That should only affect me in the end, not the project...? Yes, they are sent to other clients, but three days late, and if the researchers need the results pronto then that is a big problem. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back out to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? That should only affect me in the end, not the project...? The problem is that it actually takes considerable time & effort to produce the WUs in the first place that can then be moved here to Rosetta for us to process (that was why the Project was out of work for a day or 2 a while back: the huge surge in new crunchers took them by surprise, and producing new work wasn't just a case of a few keystrokes, it took time). And if it errors out, and then gets sent to another system that is having problems as well, it's a complete loss. And even if it does get done by another system, it would have been nice if that system had been able to process some new work, not something that had to be resent because it timed out, and of course it takes longer to get back than if it had been processed the first time around. The Estimated completion times eventually get close, but not correct, so BOINC is always going to underestimate how long it takes to return work. Especially so as some Tasks run a lot longer than the Target time, by more than others finish early. Yes, if there is an error, it goes out again to be checked. But having to do that because a system keeps continually missing deadlines really is a waste of resources. If you're not going to process it, then why download it? Especially so when you can easily stop it from occurring? Regarding the Rosetta deadline, I had not noticed it was so short indeed. But my cache is not "Rosetta only": I've always been a multi-project BOINCer. It's true it's an old habit from when the internet was not so stable and projects would often run short of tasks, so having a cache was always a pleasant idea. I'm used to the same thing with Seti having regular & irregular short & extended outages. But Rosetta isn't Seti, so I don't need a 4 day cache. I'm down to a 0.6 day cache now, and Rosetta is the only project I'm doing. If I did another project as well I wouldn't even have this much of a cache. But again: this was absolutely not the problem I faced with the mini tasks (see the whole history of my explanations above). And again, I "solved" it by blocking the mini on that machine; that was enough for me and was not doing any harm to the project research. Yet when I checked out your system at the time you originally posted, before you implemented your fix, most of your errors weren't Rosetta Mini Tasks that had Computation errors, but Rosetta Tasks that had missed their deadlines. The missed deadlines alone were producing more Errors than you were producing Valid work. And that does harm the project: as they say, "Even computation errors are useful", as it lets them determine what is wrong. But missed deadlines aren't useful, just a waste of server time, bandwidth & other systems' time checking something that shouldn't require checking. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
Yes they are sent to other clients but three days late and if the researchers need the results pronto then that is a big problem. ...if they don't end up with another such system and error out again. Grant Darwin NT |
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
Yes they are sent to other clients but three days late and if the researchers need the results pronto then that is a big problem. ...if they don't end up with another such system and error out again. Ouch - I’d assumed that crunchers with long deadlines were a small minority but if some WUs are hitting multiple deadlines maybe not. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
Ouch - I’d assumed that crunchers with long deadlines... It's not so much the deadline that's the problem, as it is the combination of deadlines, large cache, multiple projects, and the Estimated completion times being less (sometimes a lot less) than what the actual Run time will be (and you've got the 10 hour watchdog timer for those units that run long...). And if the posts here are anything to go by, there are quite a few of them about. Grant Darwin NT |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
The thread has raised a doubt in my mind. I have my preferred run time set to 12 hours. I know the work unit has a model: it generates a random number and runs the model to completion. It then looks to see how long that took and how long there is left within my preference and, if suitable, generates a new random number and runs the process again, and again, ad infinitum, until it decides there is not enough time to run it again. At that point it ends the work unit. With the urgency of the current situation, I can see the possibility that the first run of the model had a critical result, but that it was not returned for hours whilst the work unit ran with different random start points. Should the preferred runtime be set down, at least temporarily? Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
@Grant + Sid: I don't really understand what the problem is with having tasks (whatever the number) cancelled by the server because the deadline is reached. Are they not sent back out to other crunchers? The calculation will be done in the end, and no resource will actually be "wasted", correct? Or is it just about the "error count"? That should only affect me in the end, not the project...? The recent post by bcov explained the reasoning. Up to 2-3 years ago, the server software in use meant it wasn't even possible for tasks to be aborted before running. Then they upgraded, and aborting tasks was a rare event. The recent cancellation of running tasks is the first time I've ever seen that happen. But it was a decision from the project admins - no need for us to dwell on it. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
Should the preferred runtime be set down, at least temporarily? Not if they can still meet the deadline. A whole batch of tasks is issued and results are returned sooner or later within the deadline. No-one's expecting them to be returned instantaneously. 8hr tasks as a default (containing multiple results) or 12hrs is fine. Even 24 & 36hrs, as long as they still meet the deadline. Deadlines were cut from 8 days to 3 days. I think that addressed the concerns you're thinking about. |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 28 |
Fair enough, I'll leave it alone. I'm not seeing anything likely to hit the deadline at the moment. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
I can see the possibility that the first run of the model had a critical result, but that it was not returned for hours whilst the work unit ran with different random start points. Should the preferred runtime be set down, at least temporarily? Or it could be that last run. I figure the 8 hour default was chosen by the project as a good compromise between as many models as possible, and very few models. 8 hours gives them a good selection of models to work with, but they do have the 10 hour Watchdog timer, so if more time is needed for a Task that is producing exceptionally good data, then that's what happens. And if it ends up running into a dead end, or producing too much data (the 500MB result file limitation), it will bail out early. Better to run 36hr Target CPU time Tasks that are returned before the deadline than to run 2hr Target CPU time Tasks when most of them don't make the deadline. But doing that does require appropriate project settings that take into account the deadlines & the shorter than actual Estimated completion times. Grant Darwin NT |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
OK I get your point, Rosetta requires a short cache. We'll see in the future if I put this machine back to run on it. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
OK so now I decided to give it another go because of the Pentathlon, which chose (what a surprise) Rosetta as the main project. On that same machine, my cache is now work_buf_min_days = 0, work_buf_additional_days = 0.2 (I could verify it in the global_prefs.xml file, and the global_prefs_override.xml is empty). Quite reasonable, isn't it? Also it is limited to 8 tasks (using app_config.xml) because of the reduced RAM of that machine. And I am still blocking the mini task (using app_info.xml) because I don't feel like trying, and fighting, again. And guess what, it has downloaded MORE THAN 1000 TASKS on the machine!!!! Who is to blame? Not me! Hundreds of tasks are going to be cancelled by the server within a few days... (I still think it should normally not be a big problem for the project itself, but apparently all of your scholarly demonstrations above tend to show the contrary, so I hope "everybody" is not going to be angry at me again here...) |
Millenium Joined: 20 Sep 05 Posts: 68 Credit: 184,283 RAC: 0 |
Lol I have to say that is funny. I reattached to the project for the new address and it just downloaded like 20 tasks or so. BOINC says I have 0.3 and 0.5, so similar values for the work buffer. |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
Please let me know if this is still an issue. I updated the server scheduler to hopefully fix this cache size issue. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
Thank you! Actually now there is nothing I can do but wait; I see the server has started to reclaim a few tasks and I suppose it is going to do it on a larger scale soon. Since you are here: if you look at my earlier posts from a few weeks ago, I had a problem with all the mini tasks on this host (I posted the kind of error I got then), so I was forced to block the mini tasks by using an app_info file that declares only the rosetta app. It is quite tedious since I have to update the file and also download the application versions manually (mine are still 4.15 but I see I must now go to 4.20). But on the other hand I don't want to risk blocking several cores with unlimited wasted CPU cycles again with those mini tasks that this machine really doesn't like... Do you have any idea where it may come from? Any library version or something like that? Thanks. |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
The mini tasks should soon be gone forever since we have deprecated the app. There was however a batch that was submitted recently but I imagine most of those tasks have completed by now. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
This is good news for me then, I'll be able to remove that app_info and go back to fully automated mode! Thanks for the info. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
As expected it cancelled hundreds of tasks. The cache instructions seem to be followed now; I don't have hundreds of tasks anymore. I did remove the app_info and I'm now getting 4.20 tasks. Obviously that killed all the tasks that were running when I removed the file and restarted BOINC, but I suspected this would happen anyway... Thanks. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
The mini tasks should soon be gone forever since we have deprecated the app. There was however a batch that was submitted recently but I imagine most of those tasks have completed by now. Now all returned |
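
For anyone who wants to see the run-time loop adrianxw describes earlier in the thread written out, here is a rough Python sketch. It is not the actual Rosetta code: the function name `run_work_unit`, the injected `run_model` callable, and the stopping rule (stop once the time left is shorter than the longest model so far) are all assumptions made for illustration.

```python
import random
import time


def run_work_unit(target_seconds, run_model, clock=time.monotonic, rng=None):
    """Run models from random start points until another one would not fit.

    Hypothetical sketch: `run_model(seed)` stands in for one model run, and
    the "is there enough time left?" rule is an assumed heuristic.
    """
    rng = rng or random.Random()
    start = clock()
    results = []
    longest = 0.0
    while True:
        seed = rng.getrandbits(32)      # new random start point
        t0 = clock()
        results.append(run_model(seed)) # run the model to completion
        longest = max(longest, clock() - t0)
        remaining = target_seconds - (clock() - start)
        # Assumed rule: end the work unit when the time left is shorter
        # than the longest model seen so far.
        if remaining < longest:
            break
    return results
```

The point it illustrates is the one Sid and Grant make: every model run in the loop is returned together when the work unit ends, so a longer preferred run time only delays the first result by hours, which is fine as long as the deadline is still met.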
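
Grant's point about the combination of deadline, cache size, multiple projects and underestimated completion times can be put into a back-of-the-envelope calculation. The sketch below is a deliberately simplified model, not how the BOINC scheduler actually works: the function names, the `actual_over_estimated` ratio and the `project_share` fraction are assumptions, and the 1.5 ratio in the example is an illustrative number, not a measured one.

```python
def days_until_cache_drains(cache_days, actual_over_estimated=1.0, project_share=1.0):
    """Rough time for the last cached task to finish, in days.

    cache_days: buffered work as BOINC estimates it.
    actual_over_estimated: how much longer tasks really run than estimated
        (assumed ratio, 1.0 means the estimates are exact).
    project_share: fraction of CPU time this project gets (assumed, 0 to 1).
    """
    if not 0 < project_share <= 1:
        raise ValueError("project_share must be in (0, 1]")
    return cache_days * actual_over_estimated / project_share


def misses_deadline(cache_days, deadline_days, actual_over_estimated=1.0, project_share=1.0):
    """True if the last cached task would come back after the deadline."""
    drain = days_until_cache_drains(cache_days, actual_over_estimated, project_share)
    return drain > deadline_days


# 4-day cache, 3-day deadline, tasks running 1.5x the estimate, half the CPU:
print(misses_deadline(4, 3, actual_over_estimated=1.5, project_share=0.5))
# 0.6-day cache, Rosetta only, same 1.5x overrun:
print(misses_deadline(0.6, 3, actual_over_estimated=1.5))
```

Under these assumptions the 4-day cache takes about 12 days to drain against a 3-day deadline, while the 0.6-day cache stays comfortably inside it.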
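
Jerome mentions checking work_buf_min_days and work_buf_additional_days in global_prefs.xml. A small Python sketch of doing that check with the standard library follows. The `SAMPLE_PREFS` text is a trimmed, made-up stand-in (a real file has many more elements), and `read_cache_settings` is a hypothetical helper name; only the two tag names come from the post.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample; a real global_prefs.xml contains more settings.
SAMPLE_PREFS = """<global_preferences>
    <work_buf_min_days>0</work_buf_min_days>
    <work_buf_additional_days>0.2</work_buf_additional_days>
</global_preferences>"""


def read_cache_settings(xml_text):
    """Return (min_days, additional_days), or None for an empty file."""
    if not xml_text.strip():
        return None  # e.g. an empty override file
    root = ET.fromstring(xml_text)

    def value(tag):
        node = root.find(f".//{tag}")
        # Treat a missing or blank element as 0 days (assumption).
        if node is None or node.text is None or not node.text.strip():
            return 0.0
        return float(node.text)

    return value("work_buf_min_days"), value("work_buf_additional_days")


settings = read_cache_settings(SAMPLE_PREFS)
print(settings, "total days:", sum(settings))
```

With values like these the client should be asking for only a fraction of a day of work, which is why more than 1000 downloaded tasks pointed at the server side rather than at the local settings.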
©2024 University of Washington
https://www.bakerlab.org