Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

mrhastyrib

Joined: 18 Feb 21
Posts: 90
Credit: 2,530,236
RAC: 0
Message 101027 - Posted: 3 Apr 2021, 5:52:37 UTC - in response to Message 101009.  


you have a big appendage.

I'm...um, "flattered" that you think about that, but just for the record, I don't roll that way. Kindly limit yourself to policing my vernacular, dawg.
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101028 - Posted: 3 Apr 2021, 9:35:04 UTC - in response to Message 101023.  

Who is the guilty party
We’re here to help with scientific research, not to point fingers at the people doing it. Somebody made a mistake, and an experiment failed. It happens; that’s how people learn.
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101029 - Posted: 3 Apr 2021, 10:05:19 UTC - in response to Message 101021.  

There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as
[sched_op] Deferring communication for …
[sched_op] Reason: Unrecoverable error for task …
But as that’s client-side I would have expected it to be reset if you manually Update a project.
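If you save the event log to a text file, a short script can pick out the backoff entries for you. This is only a rough sketch: the two prefixes are copied from the lines quoted above, the function name `find_backoff_lines` is made up for illustration, and nothing is assumed about what follows the prefixes in a real log line.

```python
import re
from typing import List

# Prefixes copied from the post above. Whatever follows them in a real
# log line is not shown in the post, so the pattern does not assume it.
BACKOFF_PATTERN = re.compile(
    r"\[sched_op\] (?:Deferring communication for|Reason:)"
)


def find_backoff_lines(log_text: str) -> List[str]:
    """Return the lines of log_text that mention a scheduler backoff."""
    return [line for line in log_text.splitlines()
            if BACKOFF_PATTERN.search(line)]


if __name__ == "__main__":
    # Made-up sample text with placeholders, not real client output.
    sample = "\n".join([
        "unrelated line",
        "[sched_op] Deferring communication for <duration>",
        "[sched_op] Reason: <reason text>",
    ])
    for line in find_backoff_lines(sample):
        print(line)
```

Running it before and after a manual Update should show whether a new deferral entry appears each time.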
Bryn Mawr

Joined: 26 Dec 18
Posts: 374
Credit: 10,682,382
RAC: 4,746
Message 101030 - Posted: 3 Apr 2021, 11:12:47 UTC - in response to Message 101029.  

There does seem to be a backoff when a task fails. With sched_op_debug selected in your event log options, you should see it logged as
[sched_op] Deferring communication for …
[sched_op] Reason: Unrecoverable error for task …
But as that’s client-side I would have expected it to be reset if you manually Update a project.


As my system naturally runs one out, one in, I was using manual Update to try to get fresh tasks, and it did not appear to be resetting the backoff.
Martin.Heinrich

Joined: 4 May 20
Posts: 1
Credit: 396,444
RAC: 0
Message 101032 - Posted: 3 Apr 2021, 13:46:52 UTC

I have run Rosetta for a long time, but now:

Rosetta asks for too much RAM

Rosetta@home: Notice from server
Rosetta needs 6675.72 MB RAM but only 1343.33 MB is available for use.

I can give it 3 GB, but 6.5 GB is not OK. Why don't I simply get tasks with lower RAM demand?

If this problem is not solved, then Rosetta will not get more work done by my computers.
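For reference, the arithmetic behind that notice can be sketched as follows. This is not how the server makes its decision; it only restates the two numbers from the notice. The function name `ram_shortfall_mb` is made up, and treating 1 GB as 1024 MB is an assumption.

```python
MB_PER_GB = 1024  # assumption: the notice counts in binary megabytes


def ram_shortfall_mb(required_mb: float, available_mb: float) -> float:
    """Return how many MB are missing, or 0.0 if the task would fit."""
    return max(0.0, required_mb - available_mb)


if __name__ == "__main__":
    required_mb = 6675.72   # from the server notice
    available_mb = 1343.33  # from the server notice
    print(f"required:  {required_mb / MB_PER_GB:.2f} GB")
    print(f"available: {available_mb / MB_PER_GB:.2f} GB")
    print(f"shortfall: {ram_shortfall_mb(required_mb, available_mb):.2f} MB")
```

Even with 3 GB (3072 MB) made available to BOINC, the same check still leaves a shortfall, which is why raising the memory limit alone would not bring these particular tasks in.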
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101033 - Posted: 3 Apr 2021, 14:02:10 UTC - in response to Message 101032.  

Why don't I simply get tasks with lower RAM demand?
There aren’t any available. Try again in a few days’ time; perhaps there will be some new smaller work units.
jsm

Joined: 4 Apr 20
Posts: 3
Credit: 64,510,067
RAC: 60,717
Message 101035 - Posted: 3 Apr 2021, 15:11:25 UTC - in response to Message 101019.  

Thanks to all for their views. I confirm the fairly low 50 GB cap vs 100 Mbps circuit is due to an old tariff. It is no longer available, but the ISP cannot remove it easily because of the regulator. All new users or changers automatically have unlimited, but at a substantially higher monthly cost, which I am trying to avoid because the cap has been adequate for years. I do not stream, peer or otherwise have need for substantial throughput.
I have followed the advice for run time and have changed the preferences from the default 8 hrs to 22 hrs and will see whether that helps. I confirm that it is the three Threadrippers which Wireshark identified straight away as the hoggers; every other endpoint line was in the low MBs. Presumably if the 'bad' tasks work through or are withdrawn this will also help.
jsm
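A rough way to see why the longer run time should cut traffic: fewer tasks finish per day, so fewer are fetched and returned. The sketch below assumes each thread runs one task at a time, back to back, for exactly the target run time, and that per-task transfer volume stays about the same. The function name `tasks_per_day` is made up for illustration.

```python
def tasks_per_day(target_runtime_hours: float, threads: int = 1) -> float:
    """Tasks completed per day if every task runs for the target run time."""
    return threads * 24.0 / target_runtime_hours


if __name__ == "__main__":
    before = tasks_per_day(8)   # default run time mentioned in the post
    after = tasks_per_day(22)   # new preference mentioned in the post
    print(f"per thread, before: {before:.2f} tasks/day")
    print(f"per thread, after:  {after:.2f} tasks/day")
    print(f"reduction factor:   {before / after:.2f}x")
```

Under those assumptions the number of tasks per thread per day drops from 3 to a little over 1, a factor of 2.75. Tasks that error out within seconds break the assumption entirely, since they are replaced almost immediately whatever the run time preference.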
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101037 - Posted: 3 Apr 2021, 17:26:57 UTC - in response to Message 101017.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?
There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?
Yes. But it's rather slow to happen.
So why would this way be any faster?
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101038 - Posted: 3 Apr 2021, 17:28:27 UTC - in response to Message 101019.  

This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work.
I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps.
50GB is something you used to get on a basic 25Mb/s starter plan; these days even 12Mb/s plans can have data caps as high as 500MB. 100Mb/s plans have 1TB caps or are unlimited by default (of course we pay through the nose for those).
Wow. In the UK there's no data cap whatsoever. I can download at 54Mbps 24/7.
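To put the numbers in this exchange side by side, here is a back-of-envelope sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and a link running flat out the whole time, which real connections rarely manage. Both function names are made up for illustration.

```python
def hours_to_exhaust_cap(cap_gb: float, link_mbps: float) -> float:
    """Hours of full-speed transfer needed to use up a data cap."""
    cap_bits = cap_gb * 1e9 * 8
    seconds = cap_bits / (link_mbps * 1e6)
    return seconds / 3600.0


def volume_tb(link_mbps: float, days: float) -> float:
    """Terabytes moved by a link running at full speed for some days."""
    bytes_per_second = link_mbps * 1e6 / 8
    return bytes_per_second * 86400.0 * days / 1e12


if __name__ == "__main__":
    print(f"50 GB cap at 100 Mbps: {hours_to_exhaust_cap(50, 100):.2f} hours")
    print(f"54 Mbps for 30 days:   {volume_tb(54, 30):.1f} TB")
```

On those assumptions a 50 GB cap lasts a little over an hour at 100 Mbps, while an uncapped 54 Mbps line could in principle move about 17.5 TB in 30 days.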
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101039 - Posted: 3 Apr 2021, 17:29:30 UTC - in response to Message 101020.  

But it would be nice if the researchers would test their models a bit more before releasing them here. The odd error is OK, but when it's a case of the odd Task not being an error and all others erroring out it really is a bit silly.
There's Ralph@home for that. Not sure why they hardly ever use it. I think I get tasks from there once every 3 months.
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101040 - Posted: 3 Apr 2021, 17:32:08 UTC - in response to Message 101022.  

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
It really is a shame you don't read all of what's posted before you feel the need to comment.
You said "Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering." which is a false assumption. It could take a while. My computers for example haven't tried to get more Rosetta since it ran out, since they got stuff from other projects.
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101041 - Posted: 3 Apr 2021, 17:33:38 UTC - in response to Message 101025.  

Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit.
That would be you.
I assume he meant which scientist....
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,667,248
RAC: 6,865
Message 101042 - Posted: 3 Apr 2021, 17:34:20 UTC - in response to Message 101027.  


you have a big appendage.

I'm...um, "flattered" that you think about that, but just for the record, I don't roll that way. Kindly limit yourself to policing my vernacular, dawg.
Bigot!
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1480
Credit: 14,534,479
RAC: 12,518
Message 101043 - Posted: 3 Apr 2021, 18:23:42 UTC - in response to Message 101040.  

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
It really is a shame you don't read all of what's posted before you feel the need to comment.
You said "Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering." which is a false assumption. It could take a while. My computers for example haven't tried to get more Rosetta since it ran out, since they got stuff from other projects.
And i'll repeat it again- It really is a shame you don't read all of what's posted before you feel the need to comment.
I addressed the point you made in the post that i quoted when i made that statement.


Regardless of caches & resource share settings & people's micro-management of their projects- if you compare the graph of the current recovery with past recoveries after a lack of work over the same recovery time frame, the loss of almost 1/3 of the processing resources is quite obvious.
The mis-configured Work Units are making it impossible for a large number of hosts to do any work (and that batch of Tasks that produced nothing but errors in a matter of seconds didn't help things along either).

Current recovery (or lack of) after almost 3 days.



Previous recovery (after a much longer outage) after 3 days.


Grant
Darwin NT
Kissagogo27

Joined: 31 Mar 20
Posts: 83
Credit: 2,632,624
RAC: 2,090
Message 101044 - Posted: 3 Apr 2021, 19:38:05 UTC - in response to Message 101018.  

No errors at all here; no computer with the 6 GB RAM requirement ...
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,834,938
RAC: 1,233
Message 101045 - Posted: 3 Apr 2021, 19:42:44 UTC - in response to Message 101037.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?
There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?
Yes. But it's rather slow to happen.
So why would this way be any faster?

I didn't say it would be faster. However, it will give users less to waste their time reading.
Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,377,521
RAC: 11,117
Message 101047 - Posted: 4 Apr 2021, 1:31:19 UTC - in response to Message 100999.  

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for In progress numbers to get back to their pre-work shortage numbers. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.
A few days in and the impact of the mis-configured Work Units is becoming clearer. Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering.
For all of the latest & greatest systems there are, there are an awful lot more older, much more resource-limited systems.


Returning to my anecdote about a remote PC I have being unable to download any Rosetta tasks, so running its backup project, WCG, 24/7, my local laptop is also doing weird things. It refuses to run a particular Rosetta task, so it's running those it has room for - a combination of WCG and later Rosetta tasks, but only 3 on 4 cores. Now I know it's definitely happening, I've set NNT and suspended all running tasks except for the one problem Rosetta task. It still refuses to run, even as the only task. No tasks are running in my experiment!

So, maintaining NNT, I've found some combination of WCG and Rosetta tasks that'll run together on all 4 cores. I'll work my way through my small cache until all are completed bar the problem task and see if it runs then. If not, I'll finally abort it and just grab fresh tasks.

Bit of a weird one. Even attempting to micromanage tasks doesn't entirely work. No wonder that graph is running so much lower than it was, if I'm any example
Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,377,521
RAC: 11,117
Message 101048 - Posted: 4 Apr 2021, 1:50:07 UTC - in response to Message 101019.  

Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
Hard to say.
In most cases the results returned to the Rosetta servers are around 200k-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed.
I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is.

Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system.

I haven't mentioned this because I'm doing some experiments with overclocking and I thought the errors were being caused by me. So it was everyone? Interesting to know.

In the last day or so, these computation errors appear to have stopped. Can others confirm that too?
Hopefully that stops all the re-downloading issues and bandwidth penalties.
Is it stopping the excessive memory & disk space demands too?

And now I notice queued jobs have plummeted to barely more than 100k. Hmm...
Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,377,521
RAC: 11,117
Message 101049 - Posted: 4 Apr 2021, 2:01:59 UTC - in response to Message 101035.  

Thanks to all for their views. I confirm the fairly low 50 GB cap vs 100 Mbps circuit is due to an old tariff. It is no longer available, but the ISP cannot remove it easily because of the regulator. All new users or changers automatically have unlimited, but at a substantially higher monthly cost, which I am trying to avoid because the cap has been adequate for years. I do not stream, peer or otherwise have need for substantial throughput.
I have followed the advice for run time and have changed the preferences from the default 8 hrs to 22 hrs and will see whether that helps. I confirm that it is the three Threadrippers which Wireshark identified straight away as the hoggers; every other endpoint line was in the low MBs. Presumably if the 'bad' tasks work through or are withdrawn this will also help.
jsm

Good idea to increase the runtime, but be aware that the tasks you already hold in your cache will run for much longer than Boinc realises, so it's entirely possible/probable you won't meet deadline on the later ones.

If my memory serves me, the unstarted tasks will still show they're 8hrs long, but will actually run for your new preference of 22hrs. This runtime figure for unstarted tasks doesn't update so it will be a permanent problem.
The way around this is to reduce your cache size by around two-thirds, so even though it continues to show Boinc the wrong expected runtime, you won't exceed deadlines in practice.
You may've already noticed this on your threadrippers. Crazy as it seems, the solution I've described ought to prevent the problems I've pointed to.
It's a feature rather than a bug... <cough>
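The "around two-thirds" figure falls out of the ratio between the run time BOINC still expects and the run time the tasks will really take. A minimal sketch, assuming the estimate stays at 8 hrs while the tasks actually run for 22 hrs; the function name `adjusted_cache_days` and the 1-day starting cache are made up for illustration.

```python
def adjusted_cache_days(cache_days: float,
                        estimated_hours: float,
                        actual_hours: float) -> float:
    """Scale the cache so the real workload matches what was intended."""
    return cache_days * estimated_hours / actual_hours


if __name__ == "__main__":
    current_cache_days = 1.0  # hypothetical starting value
    new_cache_days = adjusted_cache_days(current_cache_days, 8, 22)
    reduction = 1.0 - new_cache_days / current_cache_days
    print(f"new cache setting: {new_cache_days:.2f} days")
    print(f"reduction:         {reduction:.0%}")
```

With those inputs the cache should shrink to roughly 36% of its old value, a cut of about 64%, which is where the two-thirds rule of thumb comes from.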
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,834,938
RAC: 1,233
Message 101050 - Posted: 4 Apr 2021, 3:57:41 UTC - in response to Message 101048.  

[snip]

In the last day or so, these computation errors appear to have stopped. Can others confirm that too?
Hopefully that stops all the re-downloading issues and bandwidth penalties.
Is it stopping the excessive memory & disk space demands too?

And now I notice queued jobs have plummeted to barely more than 100k. Hmm...

The computation errors due to problems with 6mers have stopped.

I didn't see those other errors, so I can't tell if they have stopped.



©2024 University of Washington
https://www.bakerlab.org