Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
> Bandwidth usage massively increased in March

This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve been downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed).

In BOINC Manager you can set a limit on the amount of data transferred in a given period. It’s not very sophisticated and only works per machine, so when you’ve got several, the best you can do is set an allowance for each one as a proportion of your total limit, based on the number of tasks you expect it to run. (And if you do set a limit, you then need to keep an eye out for it being reached, at which point even the small result files for completed tasks won’t be uploaded.)

Bad tasks aside, one way to reduce the overall amount of network traffic while performing the same amount of work is to increase the target run time for tasks in your project preferences. Even though a longer run time might increase the upload size for each task (due to the greater number of results), that is often far outweighed by the saving in download size (which is fixed for each task, however long it runs). The credit per hour is more or less the same whatever target run time you choose.
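To put rough numbers on that trade-off, here is a minimal sketch using assumed per-task sizes (real Rosetta tasks vary from batch to batch); the point is that the fixed download dominates, so longer run times cut total traffic roughly in proportion:

```python
# Rough estimate of daily Rosetta network traffic per CPU core for different
# target run times. All sizes below are assumptions for illustration only;
# real task sizes vary from batch to batch.

DOWNLOAD_MB_PER_TASK = 10.0   # download size is fixed per task, whatever the run time
UPLOAD_MB_PER_HOUR = 0.1      # result files grow slowly with the number of models produced
CRUNCH_HOURS_PER_DAY = 24.0   # one fully loaded core

def daily_traffic_mb(target_run_time_hours: float) -> float:
    """Approximate MB transferred per core per day for a given target run time."""
    tasks_per_day = CRUNCH_HOURS_PER_DAY / target_run_time_hours
    download = tasks_per_day * DOWNLOAD_MB_PER_TASK
    upload = tasks_per_day * target_run_time_hours * UPLOAD_MB_PER_HOUR
    return download + upload

for hours in (2, 4, 8, 16, 24):
    print(f"{hours:2d} h target: ~{daily_traffic_mb(hours):6.1f} MB/day per core")
```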
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
> Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?

(a) With 1.1 million jobs in the queue and a completion rate of around 280,000 per day, I’d estimate at least 4 days…

(b) At just shy of 500 max per day you are still in Rosetta’s good books, so the number of tasks isn’t the issue. If it’s just backoff times you’re running into, either they’re set by the server and there’s nothing you can do about them, or you can try to force a connection by selecting Update on the Projects page.
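That estimate is simply the queue size divided by the daily completion rate; a quick check, assuming both figures stay roughly constant and ignoring any new batches added in the meantime:

```python
# Rough days-to-drain estimate from the figures quoted above.
queue_size = 1_100_000
completed_per_day = 280_000
print(f"~{queue_size / completed_per_day:.1f} days")  # ~3.9 days
```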
Bryn Mawr · Joined: 26 Dec 18 · Posts: 398 · Credit: 12,294,748 · RAC: 6,222
> Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?
> (a) With 1.1 million jobs in the queue and a completion rate of around 280,000 per day, I’d estimate at least 4 days…

The back-off time appears to be set by the server and is nearly doubling with each computation error that I’m returning :-(
robertmiles · Joined: 16 Jun 08 · Posts: 1233 · Credit: 14,338,560 · RAC: 2,014
> Shouldn't it have already done that when the 2nd genuine one was posted?
> Duplicate post deleted.
> You'd think there'd be a delete button. Who designs these things?

Yes. But it's rather slow to happen.
robertmiles · Joined: 16 Jun 08 · Posts: 1233 · Credit: 14,338,560 · RAC: 2,014
> Bandwidth usage massively increased in March
> This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve been downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed). [snip]

If anyone can get them to look at the log files to see why the errors are occurring, that might help. For the errors on my computer, they should quickly notice that something with "6mers" in its name is missing from the input files.
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?

Hard to say.
In most cases the results returned to the Rosetta servers are around 200kB-1MB, but they can be well over 1MB in some cases, depending on the type of Task being processed. I'd suggest disabling BOINC network access for a while and seeing what the average result file size being returned is.

Edit: and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and they appear to still be moving through the system.

I would also check whether there has been an increase in Windows Update traffic; that is the only thing that causes regular spikes in my network bandwidth. Also check your privacy settings, as having these set loosely results in a lot of data being sent back to Microsoft and other companies. The occasional YouTube session when I find some interesting videos can also result in a huge spike in data usage. But Brian's suggestion of the errored Tasks is the most likely cause.

> This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work.

I'm guessing you don't have any real options when it comes to ISPs? A 50GB limit on a 100Mb/s connection is insane IMHO. Higher-speed plans here come with high data caps: 50GB is something you used to get on a basic 25Mb/s starter plan; these days even 12Mb/s plans can have data caps as high as 500GB, and 100Mb/s plans have 1TB caps or are unlimited by default (of course we pay through the nose for those).

Grant
Darwin NT
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> ...how many good tasks I need to return to get back into Rosetta’s good books?

It's not an issue. Rosetta is set up to allow for such problems. Both of your systems are still good for plenty of Tasks each day: 491 on one, 502 on the other.

I can't remember the exact mechanism, but for example, for each Task that validates your limit increases by 2 (in practice it recovers faster than that: there were times at Seti where people were down to 1 Task per 24 hours, and once they started returning valid Tasks again, their limits were back in the hundreds or even thousands of Tasks per 24 hours within a few hours, depending on how fast they were returning valid work).

But it would be nice if the researchers would test their models a bit more before releasing them here. The odd error is OK, but when it's a case of the odd Task not erroring and all the others erroring out, it really is a bit silly.

Grant
Darwin NT
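To illustrate the grow-on-success, shrink-on-error quota behaviour Grant is describing, here is a minimal sketch. The doubling and halving factors and the 500-task cap are assumptions chosen to echo the numbers above, not Rosetta's actual server configuration:

```python
# Illustrative model of a per-host daily task quota in a BOINC-style scheduler:
# errors shrink the quota, validated results grow it back quickly.
# The factors and the cap below are assumptions, not Rosetta's real settings.

MAX_QUOTA = 500  # assumed per-host cap on tasks per 24 hours

def update_quota(quota: int, validated: bool) -> int:
    if validated:
        return min(MAX_QUOTA, quota * 2)  # recover quickly on good results
    return max(1, quota // 2)             # back off on errors, never below 1

quota = MAX_QUOTA
for validated in [False] * 6 + [True] * 5:  # a run of errors, then good results
    quota = update_quota(quota, validated)
    print("valid" if validated else "error", "->", quota)
```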
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> The back-off time appears to be set by the server and is nearly doubling with each computation error that I’m returning :-(

I haven't seen that occur myself (but most of my errors were returned while I wasn't here).
BOINC Manager backoffs set by the Scheduler usually only occur when there is a problem contacting the Scheduler, and a successful Scheduler contact resets them to the default 30 seconds. Returning errors should only result in a reduction in the number of Tasks per 24 hours for that host.

Grant
Darwin NT
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.

It really is a shame you don't read all of what's posted before you feel the need to comment.

Grant
Darwin NT
DizzyD · Joined: 23 Nov 20 · Posts: 6 · Credit: 1,438,330 · RAC: 0
Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit. My stats have dropped over 10% in the past day. |
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
> Hard to say.

I just checked my data usage, and it is actually less than it has been; in the last reporting period there were some large deferred Windows updates, so they would have skewed the figures. Even so, my average usage is around 1GB per day. Since 28/3 my usage has been only around 731MB per day (to date).

Grant
Darwin NT
Grant (SSSF) · Joined: 28 Mar 20 · Posts: 1725 · Credit: 18,378,164 · RAC: 20,578
> Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit.

That would be you. Along with everyone else: as mentioned in several posts here and in some other threads, there is a batch of work currently in the system that is producing almost nothing but errors.

> My stats have dropped over 10% in the past day.

Mine are still climbing, but that is after falling for 4 days straight due to the earlier lack of work; there is now a new batch of work, and it takes a while for granted Credit to stabilise.

Grant
Darwin NT
Falconet · Joined: 9 Mar 09 · Posts: 354 · Credit: 1,276,393 · RAC: 2,018
Queued jobs dropped to 393,000 from over a million at the last update. Looks like someone pulled some batches from circulation.
mrhastyrib · Joined: 18 Feb 21 · Posts: 90 · Credit: 2,541,890 · RAC: 0
I'm...um, "flattered" that you think about that, but just for the record, I don't roll that way. Kindly limit yourself to policing my vernacular, dawg. |
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
> Who is the guilty party

We’re here to help with scientific research, not to point fingers at the people doing it. Somebody made a mistake, and an experiment failed. It happens; that’s how people learn.
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as:

[sched_op] Deferring communication for …
[sched_op] Reason: Unrecoverable error for task …

But as that’s client-side, I would have expected it to be reset if you manually Update the project.
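For what it's worth, the near-doubling Bryn Mawr reports is consistent with an ordinary randomised exponential backoff. A minimal sketch of that idea, with an assumed one-minute base and four-hour cap (not values taken from the BOINC client source):

```python
import random

# Sketch of a randomised exponential backoff: each consecutive task failure
# roughly doubles the deferral, up to a ceiling. Base and cap are assumptions.
BASE_SECONDS = 60.0
CAP_SECONDS = 4 * 3600.0

def deferral_after(failures: int) -> float:
    """Deferral in seconds after a given number of consecutive failures."""
    raw = BASE_SECONDS * (2 ** failures)
    return min(CAP_SECONDS, raw) * random.uniform(0.5, 1.0)  # add jitter

for n in range(1, 8):
    print(f"{n} failure(s): defer ~{deferral_after(n) / 60:.0f} min")
```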
Bryn Mawr · Joined: 26 Dec 18 · Posts: 398 · Credit: 12,294,748 · RAC: 6,222
> There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as:
> [sched_op] Deferring communication for …
> [sched_op] Reason: Unrecoverable error for task …
> But as that’s client-side, I would have expected it to be reset if you manually Update the project.

As my system naturally runs one task out, one in, I was using manual Update to try to get fresh tasks, and it did not appear to be resetting the back-off.
Martin.Heinrich · Joined: 4 May 20 · Posts: 1 · Credit: 396,444 · RAC: 0
I have run Rosetta for a long time, but now Rosetta asks for too much RAM:

Rosetta@home: Notice from server
Rosetta needs 6675.72 MB RAM but only 1343.33 MB is available for use.

I can give it 3GB, but 6.5GB is not OK. Why don't I simply get tasks with a lower RAM demand? If this problem is not solved, Rosetta will not get more work done by my computers.
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
> Why don't I simply get tasks with a lower RAM demand?

There aren’t any available. Try again in a few days’ time; perhaps there will be some new, smaller work units.
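For context, the notice Martin quotes appears to compare the task's declared memory requirement against the share of physical RAM the client is allowed to use (the memory-usage percentages in BOINC's computing preferences). A minimal sketch of that comparison, with assumed numbers:

```python
# Sketch of the check behind "Rosetta needs X MB RAM but only Y MB is available
# for use": the task's memory bound is compared with the fraction of physical
# RAM that BOINC is allowed to use. All numbers below are assumptions.

def ram_available_for_boinc(total_ram_mb: float, usage_limit_pct: float) -> float:
    """RAM the client will offer to tasks, given the memory-usage preference."""
    return total_ram_mb * usage_limit_pct / 100.0

task_needs_mb = 6675.72  # from the server notice quoted above
available_mb = ram_available_for_boinc(total_ram_mb=4096, usage_limit_pct=50)

if task_needs_mb > available_mb:
    print(f"No work sent: task needs {task_needs_mb:.0f} MB, "
          f"but only {available_mb:.0f} MB is available for use")
```

Raising the memory-usage percentage only helps if the machine physically has enough RAM; it will not make a 6.5GB task fit into 3GB.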
jsm · Joined: 4 Apr 20 · Posts: 3 · Credit: 77,825,233 · RAC: 32,838
Thanks to all for their views. I confirm the fairly low 50GB cap on a 100Mb/s circuit is due to an old tariff. It is no longer available, but the ISP cannot remove it easily because of the regulator. All new users or changers automatically get unlimited data, at a substantially higher monthly cost, which I am trying to avoid because the cap has been adequate for years; I do not stream, peer-share or otherwise need substantial throughput.

I have followed the advice on run time and have changed the preference from the default 8 hrs to 22 hrs, and will see whether that helps. I confirm that it is the three Threadrippers which Wireshark identified straight away as the hoggers; every other endpoint line was low MBs. Presumably if the 'bad' tasks work through or are withdrawn, this will also help.

jsm