Problems and Technical Issues with Rosetta@home

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101037 - Posted: 3 Apr 2021, 17:26:57 UTC - in response to Message 101017.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?
There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?
Yes. But it's rather slow to happen.
So why would this way be any faster?
ID: 101037
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101038 - Posted: 3 Apr 2021, 17:28:27 UTC - in response to Message 101019.  

This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work.
I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps.
50GB is something you used to get on a basic 25Mb/s starter plan; these days even 12Mb/s plans can have data caps as high as 500GB. 100Mb/s plans come with 1TB caps or unlimited data by default (of course we pay through the nose for those).
Wow. In the UK there's no data cap whatsoever. I can download at 54Mbps 24/7.
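To put a cap like that 50GB one in perspective, here is a rough back-of-the-envelope sketch in Python; the per-task transfer size and host size are hypothetical examples, not measured Rosetta figures:

# Hypothetical example figures; real per-task transfer sizes vary a lot by batch.
cores = 24            # e.g. one Threadripper-class host
runtime_h = 8         # default target CPU run time per task
mb_per_task = 5.0     # assumed combined download + upload per task

tasks_per_day = cores * 24 / runtime_h          # 72 tasks/day with these numbers
gb_per_month = tasks_per_day * mb_per_task * 30 / 1000
print(f"{tasks_per_day:.0f} tasks/day, roughly {gb_per_month:.0f} GB/month")
# ~11 GB/month per host under these assumptions; a few busy hosts, plus
# re-downloads from erroring batches, can push past a 50 GB monthly cap.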
ID: 101038
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101039 - Posted: 3 Apr 2021, 17:29:30 UTC - in response to Message 101020.  

But it would be nice if the researchers would test their models a bit more before releasing them here. The odd error is OK, but when it's a case of the odd Task not being an error and all the others erroring out, it really is a bit silly.
There's Ralph@home for that. Not sure why they hardly ever use it. I think I get tasks from there once every 3 months.
ID: 101039
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101040 - Posted: 3 Apr 2021, 17:32:08 UTC - in response to Message 101022.  

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
It really is a shame you don't read all of what's posted before you feel the need to comment.
You said "Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering." which is a false assumption. It could take a while. My computers for example haven't tried to get more Rosetta since it ran out, since they got stuff from other projects.
ID: 101040
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101041 - Posted: 3 Apr 2021, 17:33:38 UTC - in response to Message 101025.  

Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit.
That would be you.
I assume he meant which scientist....
ID: 101041
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 101042 - Posted: 3 Apr 2021, 17:34:20 UTC - in response to Message 101027.  


you have a big appendage.

I'm...um, "flattered" that you think about that, but just for the record, I don't roll that way. Kindly limit yourself to policing my vernacular, dawg.
Bigot!
ID: 101042
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101043 - Posted: 3 Apr 2021, 18:23:42 UTC - in response to Message 101040.  

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
It really is a shame you don't read all of what's posted before you feel the need to comment.
You said "Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering." which is a false assumption. It could take a while. My computers for example haven't tried to get more Rosetta since it ran out, since they got stuff from other projects.
And i'll repeat it again- It really is a shame you don't read all of what's posted before you feel the need to comment.
I addressed the point you made in the post that i quoted when i made that statement.


Regardless of caches & resource share settings & people's micro-management of their projects- if you compare the graph of the current recovery with past recoveries after a lack of work over the same recovery time frame, the loss of almost 1/3 of the processing resources is quite obvious.
The mis-configured Work Units are making it impossible for a large number of hosts to do any work (and that batch of Tasks that produced nothing but errors in a matter of seconds didn't help things along either).

Current recovery (or lack of) after almost 3 days. [graph not shown]

Previous recovery (after a much longer outage) after 3 days. [graph not shown]
Grant
Darwin NT
ID: 101043
Kissagogo27
Joined: 31 Mar 20
Posts: 86
Credit: 2,981,693
RAC: 1,241
Message 101044 - Posted: 3 Apr 2021, 19:38:05 UTC - in response to Message 101018.  

No errors at all here, but then none of my computers meet the 6GB RAM requirement ...
ID: 101044
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 101045 - Posted: 3 Apr 2021, 19:42:44 UTC - in response to Message 101037.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?
There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?
Yes. But it's rather slow to happen.
So why would this way be any faster?

I didn't say it would be faster. However, it will give users less to waste their time reading.
ID: 101045
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 101047 - Posted: 4 Apr 2021, 1:31:19 UTC - in response to Message 100999.  

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for In progress numbers to get back to their pre-work-shortage levels. And that's without running out of work again only a few hours after new work started coming through (which is what happened this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.
A few days in and the impact of the mis-configured Work Units is becoming clearer. Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering.
For all of the latest & greatest systems out there, there are an awful lot more older, much more resource-limited systems.


Returning to my anecdote about a remote PC of mine that can't download any Rosetta tasks (so it's running its backup project, WCG, 24/7): my local laptop is also doing weird things. It refuses to run a particular Rosetta task, so it's running the ones it has room for - a combination of WCG and later Rosetta tasks, but only 3 on 4 cores. Now that I know it's definitely happening, I've set NNT and suspended all running tasks except for the one problem Rosetta task. It still refuses to run, even as the only task. No tasks are running in my experiment!

So, maintaining NNT, I've found some combination of WCG and Rosetta tasks that'll run together on all 4 cores. I'll work my way through my small cache until all are completed bar the problem task and see if it runs then. If not, I'll finally abort it and just grab fresh tasks.

Bit of a weird one. Even attempting to micromanage tasks doesn't entirely work. No wonder that graph is running so much lower than it was, if I'm any example.
ID: 101047
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 101048 - Posted: 4 Apr 2021, 1:50:07 UTC - in response to Message 101019.  

Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
Hard to say.
In most cases the results returned to the Rosetta servers are around 200k-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed.
I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is.

Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system.

I haven't mentioned this because I'm doing some experiments with overclocking and I thought the errors were being caused by me. So it was everyone? Interesting to know.

In the last day or so, these computation errors appear to have stopped. Can others confirm that too?
Hopefully that stops all the re-downloading issues and bandwidth penalties.
Is it stopping the excessive memory & disk space demands too?

And now I notice queued jobs have plummeted to barely more than 100k. Hmm...
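On Grant's quoted suggestion of checking the average result size while network access is suspended, a rough Python sketch like the one below could do it. The project-directory path is the Debian/Ubuntu default and is an assumption; it simply treats files created in the project folder since uploads stopped as pending results, which approximates the upload volume.

from pathlib import Path
import time

# Assumed default BOINC data directory on Debian/Ubuntu; adjust for your system.
PROJECT_DIR = Path("/var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta")
WINDOW_HOURS = 12  # roughly how long network access has been suspended

cutoff = time.time() - WINDOW_HOURS * 3600
recent = [f for f in PROJECT_DIR.iterdir()
          if f.is_file() and f.stat().st_mtime >= cutoff]

total_mb = sum(f.stat().st_size for f in recent) / 1e6
avg_mb = total_mb / len(recent) if recent else 0.0
print(f"{len(recent)} files from the last {WINDOW_HOURS} h, "
      f"{total_mb:.1f} MB total, {avg_mb:.2f} MB average")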
ID: 101048
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 101049 - Posted: 4 Apr 2021, 2:01:59 UTC - in response to Message 101035.  

Thanks to all for their views. I confirm the fairly low 50GB cap vs 100Mbps circuit is due to an old tariff. It is no longer available, but the ISP cannot remove it easily because of the regulator. All new users or changers automatically get unlimited data, but at a substantially higher monthly cost, which I am trying to avoid because the cap has been adequate for years. I do not stream, peer or otherwise have need for substantial throughput.
I have followed the advice on run time and have changed the preference from the default 8 hrs to 22 hrs, and will see whether that helps. I confirm that it is the three Threadrippers which Wireshark identified straight away as the hoggers - every other endpoint's line was low MBs. Presumably if the 'bad' tasks work through or are withdrawn, this will also help.
jsm

Good idea to increase the runtime, but be aware that the tasks you already hold in your cache will run for much longer than Boinc realises, so it's entirely possible/probable you won't meet the deadline on the later ones.

If my memory serves me, the unstarted tasks will still show as 8hrs long, but will actually run for your new preference of 22hrs. This runtime figure for unstarted tasks doesn't update, so the mismatch persists.
The way around this is to reduce your cache size by around two-thirds, so even though Boinc continues to see the wrong expected runtime, you won't exceed deadlines in practice.
You may've already noticed this on your Threadrippers. Crazy as it seems, the solution I've described ought to prevent the problems I've pointed to.
It's a feature rather than a bug... <cough>
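To make the arithmetic behind that advice concrete, here is a minimal Python sketch; the ~3-day deadline and queue sizes are assumed example values, not project figures:

def hours_to_drain(queued_tasks: int, cores: int, actual_runtime_h: float) -> float:
    """Wall-clock hours needed to finish the queue if every task really runs
    actual_runtime_h, regardless of what Boinc's estimate still shows."""
    return queued_tasks * actual_runtime_h / cores

estimated_h = 8      # what Boinc still shows for unstarted tasks
actual_h = 22        # what they will really run after the preference change
deadline_h = 3 * 24  # assumed ~3-day deadline

overcommit = actual_h / estimated_h   # ~2.75x more work than Boinc thinks
queued, cores = 30, 4                 # a cache sized while tasks looked like 8 h jobs
print(f"Overcommit factor: {overcommit:.2f}")
print(f"Drain time: {hours_to_drain(queued, cores, actual_h):.0f} h "
      f"vs deadline of {deadline_h} h")
# 30 tasks * 22 h / 4 cores = 165 h > 72 h, so deadlines would be missed;
# cutting the cache to roughly a third (e.g. 10 tasks -> 55 h) keeps it safe.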
ID: 101049
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 101050 - Posted: 4 Apr 2021, 3:57:41 UTC - in response to Message 101048.  

[snip]

In the last day or so, these computation errors appear to have stopped. Can others confirm that too?
Hopefully that stops all the re-downloading issues and bandwidth penalties.
Is it stopping the excessive memory & disk space demands too?

And now I notice queued jobs have plummeted to barely more than 100k. Hmm...

The computation errors due to problems with 6mers have stopped.

I didn't see those other errors, so I can't tell if they have stopped.
ID: 101050
Kissagogo27
Joined: 31 Mar 20
Posts: 86
Credit: 2,981,693
RAC: 1,241
Message 101054 - Posted: 4 Apr 2021, 10:06:32 UTC

Weird thing: I just got a resend, but I'm not sure whether to finish it!

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217325166
ID: 101054
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101055 - Posted: 4 Apr 2021, 10:30:05 UTC - in response to Message 101054.  
Last modified: 4 Apr 2021, 10:36:08 UTC

This issue was discussed recently in another thread.

The work unit got resent because the first machine hadn’t completed it by its deadline. But 10 minutes later – after you’d started the resend but before you’d finished it – the other host submitted its results. I think you’ll still get credit if you complete it before the deadline, but from the science perspective there’s no point because the results are already in.
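For illustration, here is a minimal sketch of that timeout/resend behaviour in plain Python; this is not actual BOINC server code, and the hosts and timestamps are hypothetical:

from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ResultCopy:
    host: str
    deadline: datetime
    reported: datetime | None = None

def needs_resend(copies: list[ResultCopy], now: datetime) -> bool:
    """A new copy is issued only if every copy sent so far is unreported
    and past its deadline."""
    return all(c.reported is None and now > c.deadline for c in copies)

first = ResultCopy("original host", deadline=datetime(2021, 4, 4, 10, 0))
now = datetime(2021, 4, 4, 10, 20)

copies = [first]
if needs_resend(copies, now):
    copies.append(ResultCopy("your host", deadline=now + timedelta(days=3)))

# Ten minutes later the original host reports anyway: the resend is now
# scientifically redundant, but finishing it before its own deadline
# should still earn credit.
first.reported = now + timedelta(minutes=10)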
ID: 101055
Kissagogo27
Joined: 31 Mar 20
Posts: 86
Credit: 2,981,693
RAC: 1,241
Message 101057 - Posted: 4 Apr 2021, 13:38:15 UTC
Last modified: 4 Apr 2021, 13:41:09 UTC

With a 10GB BOINC disk space setting, BOINC still sends weird messages like this one:


04-Apr-2021 15:13:03 [Rosetta@home] Rosetta needs 6675.72 MB RAM but only 4060.49 MB is available for use.
04-Apr-2021 15:21:01 [Rosetta@home] Rosetta needs 252.21MB more disk space. You currently have 8330.86 MB available and it needs 8583.07 MB.


and nothing was downloaded ...

I had to reset the project and the message went away, and now BOINC is downloading tasks again . . .

Even the first message, about the lack of memory, is gone for the moment too ..
ID: 101057
Kissagogo27
Joined: 31 Mar 20
Posts: 86
Credit: 2,981,693
RAC: 1,241
Message 101058 - Posted: 4 Apr 2021, 15:52:39 UTC

Haha, they're back.

04-Apr-2021 15:57:48 [Rosetta@home] Sending scheduler request: To fetch work.
04-Apr-2021 15:57:48 [Rosetta@home] Requesting new tasks for CPU
04-Apr-2021 15:57:51 [Rosetta@home] Scheduler request completed: got 0 new tasks
04-Apr-2021 15:57:51 [Rosetta@home] No tasks sent
04-Apr-2021 15:57:51 [Rosetta@home] Rosetta needs 6675.72 MB RAM but only 4060.49 MB is available for use.
04-Apr-2021 15:57:51 [Rosetta@home] Rosetta needs 134.84MB more disk space. You currently have 8448.22 MB available and it needs 8583.07 MB.
04-Apr-2021 15:57:51 [Rosetta@home] Project requested delay of 31 seconds
04-Apr-2021 15:57:56 [Rosetta@home] General prefs: from Rosetta@home (last modified 04-Apr-2021 15:30:57)
04-Apr-2021 15:57:56 [Rosetta@home] Computer location: home
04-Apr-2021 15:57:56 [---] General prefs: using separate prefs for home
04-Apr-2021 15:57:56 [---] Preferences:
04-Apr-2021 15:57:56 [---] max memory usage when active: 4060.49 MB
04-Apr-2021 15:57:56 [---] max memory usage when idle: 4060.49 MB
04-Apr-2021 15:57:56 [---] max disk usage: 12.00 GB
04-Apr-2021 15:57:56 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
ID: 101058
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101063 - Posted: 4 Apr 2021, 19:13:29 UTC - in response to Message 101057.  

This issue was discussed recently in another thread.

As well as claiming they needed 6.6 GB of RAM, the recent work units were configured to require 8.5 GB of disk space. With a preference capping BOINC's disk usage at 10 GB, and more than 1.5 GB of that already in use (around 2 GB is normal for R@h), the server was unable to send those tasks and so issued that warning.

Resetting the project didn’t make any difference because the disk space that freed up was immediately consumed again by the smaller tasks you were able to download.
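To spell out the arithmetic behind those event-log warnings, here is a simplified Python sketch of the two checks implied; the real BOINC scheduler logic is more involved, but the numbers below come straight from the quoted log:

def can_send(task_ram_mb, usable_ram_mb, task_disk_mb, disk_available_mb):
    """Return (ok, reasons) for a single task against the host's reported limits."""
    reasons = []
    if task_ram_mb > usable_ram_mb:
        reasons.append(f"needs {task_ram_mb:.2f} MB RAM but only "
                       f"{usable_ram_mb:.2f} MB is available for use")
    if task_disk_mb > disk_available_mb:
        reasons.append(f"needs {task_disk_mb - disk_available_mb:.2f} MB "
                       f"more disk space")
    return not reasons, reasons

# Figures from the log: 6675.72 MB RAM bound vs 4060.49 MB usable,
# 8583.07 MB disk bound vs 8330.86 MB free within the BOINC disk limit.
ok, why = can_send(task_ram_mb=6675.72, usable_ram_mb=4060.49,
                   task_disk_mb=8583.07, disk_available_mb=8330.86)
print(ok)        # False
for r in why:
    print(r)     # mirrors the two warnings in the event log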
ID: 101063
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101065 - Posted: 4 Apr 2021, 22:04:27 UTC - in response to Message 101063.  
Last modified: 4 Apr 2021, 22:08:51 UTC

Resetting the project didn’t make any difference because the disk space that freed up was immediately consumed again by the smaller tasks you were able to download.
Along with the executables & support data files.
Over time as different Tasks are downloaded, those support data files will be re-downloaded & the lack of disk space issue will re-occur (if the configuration issue for certain Work Units hasn't been fixed by then), as you soon found out.


Since the project is out of work again, no one will be able to get any new work now, other than the odd resend.
Hopefully the next batches of work will have their system requirements configured more appropriately.
Grant
Darwin NT
ID: 101065
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1734
Credit: 18,532,940
RAC: 17,945
Message 101067 - Posted: 5 Apr 2021, 0:56:12 UTC

A new batch of work has been loaded up- hopefully these have their requirements set properly, and they won't error out in a matter of seconds either.
Grant
Darwin NT
ID: 101067