Problems and Technical Issues with Rosetta@home

Author	Message
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 101014 - Posted: 2 Apr 2021, 19:18:32 UTC - in response to Message 101006. Bandwidth usage massively increased in March This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed). In BOINC Manager you can set a limit on the amount of data transferred in a given period. It’s not very sophisticated and only works per machine, so when you’ve got several the best you can do is set an allowance for each one as a proportion of your total limit based on the number of tasks you expect it to run. (And if you do set a limit you then need to keep an eye out for it being reached, at which point even small results files for completed tasks won’t be uploaded.) Bad tasks aside, one way to reduce the overall amount of network traffic while performing the same amount of work is to increase the target run time for tasks in your project preferences. Even though a longer run time might increase the upload size needed for each task (due to the greater number of results), that is often far outweighed by the saving in download size (which is fixed for each task, however long it runs for). The credit per hour is more or less the same whatever target run time you choose. ID: 101014 · Rating: 0 · rate: / Reply Quote
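To put rough numbers on the run-time and per-machine allowance points in the post above, here is a minimal sketch. It assumes a fixed download size per task and an upload size that grows linearly with run time; the function names and every figure passed to them (core count, MB per task, MB per hour) are hypothetical placeholders, not measured Rosetta values.

```python
def daily_traffic_mb(cores, run_time_hours, download_mb_per_task, upload_mb_per_hour):
    """Estimate daily traffic (MB) for a host that keeps every core busy.

    Assumes the download size is fixed per task and the upload size
    grows linearly with run time. All inputs are illustrative.
    """
    tasks_per_day = cores * 24 / run_time_hours
    download = tasks_per_day * download_mb_per_task               # fixed cost per task
    upload = tasks_per_day * run_time_hours * upload_mb_per_hour  # grows with run time
    return download + upload


def split_allowance(total_limit_mb, expected_tasks_by_host):
    """Divide a total data allowance between hosts in proportion to expected tasks."""
    total_tasks = sum(expected_tasks_by_host.values())
    return {
        host: total_limit_mb * tasks / total_tasks
        for host, tasks in expected_tasks_by_host.items()
    }


if __name__ == "__main__":
    # Hypothetical 32-core host, 100 MB download per task, 0.1 MB upload per hour.
    for hours in (8, 22):
        print(hours, "h target:", round(daily_traffic_mb(32, hours, 100, 0.1)), "MB/day")
    # Hypothetical 50 GB cap shared by two hosts expecting 300 and 100 tasks.
    print(split_allowance(50_000, {"big_host": 300, "small_host": 100}))
```

Under these assumptions the download term dominates, so a longer target run time cuts traffic roughly in proportion to the drop in tasks per day, while the upload term stays about the same.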

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 101015 - Posted: 2 Apr 2021, 19:34:53 UTC - in response to Message 101012. Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books? (a) With 1.1 million jobs in the queue and a completion rate around 280,000 per day, I’d estimate at least 4 days… (b) At just shy of 500 max per day you still are in Rosetta’s good books, so number of tasks isn’t the issue. If it’s just backoff times you’re running into, either that’s set by the server and there’s nothing you can do about it, or you can try to force a connection by selecting Update on the Projects page. ID: 101015 · Rating: 0 · rate: / Reply Quote
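The "at least 4 days" figure in (a) is simply the queue size divided by the completion rate. A minimal sketch of that arithmetic, assuming the rate stays constant and no new work is added to the queue (the function name is made up for illustration):

```python
def days_to_drain(queued_jobs, completed_per_day):
    """Days needed to clear the queue at a constant completion rate."""
    return queued_jobs / completed_per_day


if __name__ == "__main__":
    # Figures quoted in the post: 1.1 million queued, about 280,000 completed per day.
    print(round(days_to_drain(1_100_000, 280_000), 1))  # 3.9
```

Any new batches released in the meantime would push the real figure higher, which fits the "at least" in the estimate.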

Bryn Mawr Send message Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 6	Message 101016 - Posted: 2 Apr 2021, 19:41:35 UTC - in response to Message 101015. Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books? (a) With 1.1 million jobs in the queue and a completion rate around 280,000 per day, I’d estimate at least 4 days… (b) At just shy of 500 max per day you still are in Rosetta’s good books, so number of tasks isn’t the issue. If it’s just backoff times you’re running into, either that’s set by the server and there’s nothing you can do about it, or you can try to force a connection by selecting Update on the Projects page. The backoff time appears to be set by the server and is near doubling with each computation error that I’m returning :-( ID: 101016 · Rating: 0 · rate: / Reply Quote
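For a sense of how quickly a delay that doubles with each error grows, here is a minimal sketch of a generic doubling backoff. It only illustrates the pattern described in the post above; it is not taken from the BOINC source, and the 60-second base, the one-day cap, and the function name are all hypothetical.

```python
def doubling_backoff(consecutive_errors, base_seconds=60, cap_seconds=86_400):
    """Delay in seconds after a run of consecutive errors, doubling each time.

    base_seconds and cap_seconds are illustrative assumptions, not BOINC values.
    """
    if consecutive_errors <= 0:
        return 0
    return min(base_seconds * 2 ** (consecutive_errors - 1), cap_seconds)


if __name__ == "__main__":
    for errors in range(1, 8):
        print(errors, "errors ->", doubling_backoff(errors), "seconds")
```

With these placeholder values, seven consecutive errors already mean a wait of over an hour, which is why a run of bad tasks can leave a host idle even when work is available.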

robertmiles Send message Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 101017 - Posted: 2 Apr 2021, 20:58:11 UTC - in response to Message 101007. Duplicate post deleted. You'd think there'd be a delete button. Who designs these things? There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them. Shouldn't it have already done that when the 2nd genuine one was posted? Yes. But it's rather slow to happen. ID: 101017 · Rating: 0 · rate: / Reply Quote

robertmiles Send message Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 101018 - Posted: 2 Apr 2021, 21:03:52 UTC - in response to Message 101014. Bandwidth usage massively increased in March This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed). [snip] If anyone can get them to look at the log files to see why the errors are occurring, that might help. For the errors on my computer, they should quickly notice that something with "6mers" in its name is missing from the input files. ID: 101018 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101019 - Posted: 2 Apr 2021, 22:35:45 UTC - in response to Message 101006. Last modified: 2 Apr 2021, 23:03:33 UTC Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem? Hard to say. In most cases the results returned to the Rosetta servers are around 200kB-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed. I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is. Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system. I would also check to see if there has been an increase in Windows update traffic; that is the only thing that causes regular spikes in my network bandwidth- also check your privacy settings, as having these set loosely results in a lot of data being sent back to Microsoft & other companies. Also the occasional YouTube usage when I find some interesting videos can result in a huge spike in data usage. But Brian's suggestion of the errored Tasks is the most likely cause. This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work. I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps. 50GB is something you used to get on a basic 25Mb/s starter plan- these days even 12Mb/s plans can have as much as 500GB data caps. 100Mb/s plans are 1TB caps or unlimited by default (of course we pay through the nose for those). Grant Darwin NT ID: 101019 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101020 - Posted: 2 Apr 2021, 22:45:47 UTC - in response to Message 101012. ...how many good tasks I need to return to get back into Rosetta’s good books? It's not an issue. Rosetta is set up to allow for such problems. Both of your systems are still good for plenty of Tasks each day- 491 on one, 502 on the other. I can't remember the exact mechanism, but for example for each Task that Validates, your limit increases by 2 (it's actually more than that- there were times at Seti where people were down to 1 Task per 24hrs. Once they started returning valid Tasks again, within a few hours (depending on how fast they were returning Valid work) their limits were back in the hundreds & even thousands of Tasks per 24 hours). But it would be nice if the researchers would test their models a bit more before releasing them here. The odd error is OK, but when it's a case of the odd Task not being an error and all others erroring out it really is a bit silly. Grant Darwin NT ID: 101020 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101021 - Posted: 2 Apr 2021, 22:51:11 UTC - in response to Message 101016. The back off time appears to be set by the server and is near doubling with each computation error that I’m returning :-( I haven't seen that occur myself (but most of my errors were returned while I wasn't here). Boinc Manager backoffs are set by the Scheduler and usually only occur when there is a problem contacting the Scheduler. A successful Scheduler contact & it's reset to the default 30 seconds. Returning errors should only result in a reduction in the number of Tasks per 24 hours for that host. Grant Darwin NT ID: 101021 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101022 - Posted: 2 Apr 2021, 22:57:22 UTC - in response to Message 101011. That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again. It really is a shame you don't read all of what's posted before you feel the need to comment. Grant Darwin NT ID: 101022 · Rating: 0 · rate: / Reply Quote

DizzyD Send message Joined: 23 Nov 20 Posts: 6 Credit: 1,438,330 RAC: 0	Message 101023 - Posted: 3 Apr 2021, 0:59:59 UTC Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit. My stats have dropped over 10% in the past day. ID: 101023 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101024 - Posted: 3 Apr 2021, 1:01:15 UTC - in response to Message 101019. Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem? Hard to say. In most cases the results returned to the Rosetta servers are around 200kB-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed. I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is. Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system. I would also check to see if there has been an increase in Windows update traffic; that is the only thing that causes regular spikes in my network bandwidth- also check your privacy settings, as having these set loosely results in a lot of data being sent back to Microsoft & other companies. Also the occasional YouTube usage when I find some interesting videos can result in a huge spike in data usage. But Brian's suggestion of the errored Tasks is the most likely cause. This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work. I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps. 50GB is something you used to get on a basic 25Mb/s starter plan- these days even 12Mb/s plans can have as much as 500GB data caps. 100Mb/s plans are 1TB caps or unlimited by default (of course we pay through the nose for those). I just checked out my Data usage, and it is actually less than it has been- in the last reporting period there were some large deferred Windows updates so they would have skewed the figures. Even so, my average usage is around 1GB per day. Since the 28/3 my usage is only around 731MB per day (to date). Grant Darwin NT ID: 101024 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 101025 - Posted: 3 Apr 2021, 1:04:35 UTC - in response to Message 101023. Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit. That would be you. Along with everyone else- as mentioned in several posts here & some other threads there is a current batch of work that is presently producing almost nothing but errors. My stats have dropped over 10% in the past day. Mine are still climbing, but that is after falling for 4 days straight due to the lack of work for a while, and the fact there is now a new batch of work and that it takes a while for granted Credit to stabilise. Grant Darwin NT ID: 101025 · Rating: 0 · rate: / Reply Quote

Falconet Send message Joined: 9 Mar 09 Posts: 355 Credit: 1,669,337 RAC: 0	Message 101026 - Posted: 3 Apr 2021, 4:10:27 UTC Queued jobs dropped to 393,000 from over a million on the last update. Looks like someone pulled off some batches from circulation. ID: 101026 · Rating: 0 · rate: / Reply Quote

mrhastyrib Send message Joined: 18 Feb 21 Posts: 90 Credit: 2,541,890 RAC: 0	Message 101027 - Posted: 3 Apr 2021, 5:52:37 UTC - in response to Message 101009. you have a big appendage. I'm...um, "flattered" that you think about that, but just for the record, I don't roll that way. Kindly limit yourself to policing my vernacular, dawg. ID: 101027 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 101028 - Posted: 3 Apr 2021, 9:35:04 UTC - in response to Message 101023. Who is the guilty party We’re here to help with scientific research, not to point fingers at the people doing it. Somebody made a mistake, and an experiment failed. It happens; that’s how people learn. ID: 101028 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 101029 - Posted: 3 Apr 2021, 10:05:19 UTC - in response to Message 101021. There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as [sched_op] Deferring communication for … [sched_op] Reason: Unrecoverable error for task … But as that’s client-side I would have expected it to be reset if you manually Update a project. ID: 101029 · Rating: 0 · rate: / Reply Quote
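For anyone who wants to tally these deferrals from a saved copy of the event log, here is a minimal sketch. It assumes each log line contains the two messages quoted above and that the deferral duration is printed as HH:MM:SS; the timestamp prefix, the duration, and the task name in the sample are made up for illustration, since the post elides them.

```python
import re

# Patterns for the two [sched_op] messages quoted in the post.
# The HH:MM:SS duration format is an assumption.
DEFER_RE = re.compile(r"\[sched_op\] Deferring communication for (\d+):(\d+):(\d+)")
REASON_RE = re.compile(r"\[sched_op\] Reason: (.*)")

# Hypothetical sample lines; the prefix, duration and task name are invented.
SAMPLE_LOG = [
    "03-Apr-2021 10:00:00 [Rosetta@home] [sched_op] Deferring communication for 00:05:23",
    "03-Apr-2021 10:00:00 [Rosetta@home] [sched_op] Reason: Unrecoverable error for task example_task_0",
]


def find_deferrals(log_lines):
    """Return a list of {'seconds', 'reason'} dicts, one per deferral message."""
    deferrals = []
    pending = None  # most recent deferral still waiting for its Reason line
    for line in log_lines:
        match = DEFER_RE.search(line)
        if match:
            hours, minutes, seconds = (int(part) for part in match.groups())
            pending = {"seconds": hours * 3600 + minutes * 60 + seconds, "reason": None}
            deferrals.append(pending)
            continue
        match = REASON_RE.search(line)
        if match and pending is not None:
            pending["reason"] = match.group(1).strip()
            pending = None
    return deferrals


if __name__ == "__main__":
    for entry in find_deferrals(SAMPLE_LOG):
        print(entry["seconds"], "seconds:", entry["reason"])
```

Comparing the `seconds` values across successive failures would show whether the delay really is doubling, and whether a manual Update resets it.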

Bryn Mawr Send message Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 6	Message 101030 - Posted: 3 Apr 2021, 11:12:47 UTC - in response to Message 101029. There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as [sched_op] Deferring communication for … [sched_op] Reason: Unrecoverable error for task … But as that’s client-side I would have expected it to be reset if you manually Update a project. As my system naturally runs one out, one in I was using manual update to try to get fresh tasks and it did not appear to be resetting the back off. ID: 101030 · Rating: 0 · rate: / Reply Quote

Martin.Heinrich Send message Joined: 4 May 20 Posts: 1 Credit: 396,444 RAC: 0	Message 101032 - Posted: 3 Apr 2021, 13:46:52 UTC I have run Rosetta for a long time, but now Rosetta asks for too much RAM. Rosetta@home: Notice from server: Rosetta needs 6675.72 MB RAM but only 1343.33 MB is available for use. I can give it 3GB but 6.5GB is not OK. Why don't I simply get tasks with less RAM demand? If this problem is not solved, then Rosetta will not get more work done by my computers. ID: 101032 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 101033 - Posted: 3 Apr 2021, 14:02:10 UTC - in response to Message 101032. Why dont I simply get tasks with less RAM demand ? There aren’t any available. Try again in a few days’ time; perhaps there will be some new smaller work units. ID: 101033 · Rating: 0 · rate: / Reply Quote

jsm Send message Joined: 4 Apr 20 Posts: 3 Credit: 91,569,702 RAC: 3	Message 101035 - Posted: 3 Apr 2021, 15:11:25 UTC - in response to Message 101019. Thanks to all for their views. I confirm the fairly low 50GB cap vs 100Mbps circuit is due to an old tariff. It is no longer available but the ISP cannot remove it easily because of the regulator. All new users or changers automatically have unlimited but at a substantially higher monthly cost which I am trying to avoid because the cap has been adequate for years. I do not stream, peer or otherwise have need for substantial throughput. I have followed the advice for run time and have changed the preferences from the default 8 hrs to 22 hrs and will see whether that helps. I confirm that it is the three Threadrippers which Wireshark identified straight away as the hogs - every other endpoint's traffic was only a few MB. Presumably if the 'bad' tasks work through or are withdrawn this will also help. jsm ID: 101035 · Rating: 0 · rate: / Reply Quote