Problems and Technical Issues with Rosetta@home

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100976 - Posted: 1 Apr 2021, 13:19:16 UTC - in response to Message 100937.  

Each task has 3 processes using the same amount of ram, but only one of those 3 is using the cpu, the other two are near zero cpu time. So, 6 tasks, 18 processes using 1-2gb each.
That doesn’t sound right, but as I don’t run BOINC on Linux I can’t add more…

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100977 - Posted: 1 Apr 2021, 13:29:55 UTC - in response to Message 100953.  

should I set up any firewall rules?
Assuming it’s the same as on Windows:

The only thing that requires Internet access is the client, and it only makes HTTP(S) connections to the project servers. So you need to open tcp/80 and/or tcp/443 outbound (plus udp/53 or whatever else your DNS needs if that’s not handled by a separate resolver); everything else can be blocked.
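
As a rough illustration of the default-deny policy described above, here is a minimal sketch in Python. It is not a real firewall configuration and not anything BOINC ships: the `Rule` class and `is_allowed_outbound` helper are hypothetical names made up for this example, and the allow-list simply restates the three ports mentioned in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    protocol: str  # "tcp" or "udp"
    port: int
    purpose: str


# Outbound allow-list restating the post; anything not listed is blocked.
ALLOWED_OUTBOUND = (
    Rule("tcp", 80, "HTTP to project servers"),
    Rule("tcp", 443, "HTTPS to project servers"),
    Rule("udp", 53, "DNS lookups, only if no separate resolver handles them"),
)


def is_allowed_outbound(protocol: str, port: int) -> bool:
    """Return True if the default-deny policy permits this outbound connection."""
    proto = protocol.lower()
    return any(r.protocol == proto and r.port == port for r in ALLOWED_OUTBOUND)


if __name__ == "__main__":
    for proto, port in [("tcp", 443), ("tcp", 80), ("udp", 53), ("tcp", 22)]:
        verdict = "allow" if is_allowed_outbound(proto, port) else "block"
        print(f"{proto}/{port}: {verdict}")
```

The actual rules would be written in whatever syntax your firewall uses; the point is only that the client needs nothing inbound and very little outbound.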

Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 100978 - Posted: 1 Apr 2021, 13:43:03 UTC - in response to Message 100968.  

Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.
Examples:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs look fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.



My reply on your other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14525&postid=100960


278 fails here

Sid Celery
Joined: 11 Feb 08
Posts: 2127
Credit: 41,266,340
RAC: 7,498
Message 100979 - Posted: 1 Apr 2021, 13:59:23 UTC - in response to Message 100956.  

Regarding my 2 day cache, I do not consider that excessive given that my £35 SBC running off an SD card (OK, it's an SSD now but when I started it was an SD card) has a more reliable uptime than the servers running the project. In the time I've been doing work for Rosetta I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects and I looked into it, however my current setup is severely restricted in that department. I am currently using the Balena client image mostly out of convenience and I have not found another project compatible with that and my processor architecture. At the moment I haven't got the time to go digging around in Linux trying to make this work as I'm still very much learning. I know enough to be dangerous but even online guides tend to make assumptions about people's prior knowledge that are a considerable block to entry.

For an individual host's circumstances it's fine if you have a specific reason, but as a general rule it is excessive.

About a year ago we had a lot of new hosts arrive from Seti with huge multicore machines (which you don't) who were used to large caches, because there were no restrictions there, and ran with the shortest runtimes (which you don't) so they were hoovering up all available tasks, to the exclusion of everyone else, running them for the shortest, least productive time, sending them back almost immediately, then complaining they couldn't re-fill their oversize caches again. Simultaneously, a whole bunch of very keen new users with more reasonable settings couldn't get any tasks to run at all and had their enthusiasm knocked out of them. With tasks in short supply, no-one was happy.

The solution was to cut deadlines from 7 to 3 days, to force the immediate abort (for re-issue) of tasks that couldn't make their deadline, and to remove the possibility of 1hr runtimes, so tasks ran for a 2hr minimum and an 8hr default. That way the tasks which were available didn't sit in offline caches that wouldn't run for a week while others had empty cores waiting for work, and the tasks that did come back were more productive. The result was immediate availability of work for everyone, no more shortage of tasks and more rapid task turnaround of greater value for the project.

I only say it out loud now because all those reasons still apply and we're tight on tasks in the queue again, so it helps to eke them out just a little longer.
Obviously, your (currently) 18 tasks with 4 running to default hrs doesn't do any harm individually - more as a general rule, like when 32 & 64-core machines had 2-3000 tasks in their cache, each running just an hour (or less).
My old 8-core used to store around 50 tasks, now my 16-core keeps nearer 55-60. All proportionate.
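
To put rough numbers on the cache sizes mentioned above, here is a back-of-envelope sketch. It assumes every core runs one task at a time and every task runs for exactly the target runtime; the function names are made up for this example, and the 4-core Pi is an assumption inferred from the "4 running" figure.

```python
import math


def tasks_to_fill_cache(cores: int, runtime_hours: float, cache_days: float) -> int:
    """Tasks needed to keep every core busy for cache_days."""
    tasks_per_core = cache_days * 24 / runtime_hours
    return math.ceil(cores * tasks_per_core)


def hours_to_drain(cached_tasks: int, cores: int, runtime_hours: float) -> float:
    """Hours needed to work through a cache of the given size."""
    return cached_tasks * runtime_hours / cores


if __name__ == "__main__":
    # Assumed 4-core Pi, default 8hr runtime, 2-day cache.
    print(tasks_to_fill_cache(4, 8, 2))
    # 64-core host at 1hr runtimes: tasks consumed per day.
    print(tasks_to_fill_cache(64, 1, 1))
    # 16-core host holding 60 tasks at 8hr runtimes.
    print(hours_to_drain(60, 16, 8))
```

Under those assumptions a 2-day cache on a 4-core host at 8hr runtimes is 24 tasks, while a 64-core host at 1hr runtimes gets through 1536 tasks a day, which is where caches in the thousands come from. Sixty tasks on a 16-core host at 8hrs each is about 30 hours of work, comfortably inside a 3-day deadline.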

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100980 - Posted: 1 Apr 2021, 14:01:40 UTC - in response to Message 100956.  

I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects and I looked into it, however my current setup is severely restricted in that department. I am currently using the Balena client image mostly out of convenience and I have not found another project compatible with that and my processor architecture.
If you have access to the BOINC Manager application, you might try adding World Community Grid. That reportedly has an ARM Linux application for its OpenPandemics sub-project, and much smaller work units. Otherwise it might be worth getting in touch with Balena to explain the issue and see if they would consider adding something for a different project in the same way they did for Rosetta (though they may find it harder to convince IBM than Baker Lab to let them hack at their applications; SiDock is another similar project without an ARM build (yet) that might benefit from that kind of effort).

Do bear in mind that we are here to help the project, not the other way round. If they happen not to have any work that needs doing at any given time, it’s their choice not to make use of a resource that’s available to them, not a cause for us to complain.

I’m not sure which part of the U.K. you’re in where an idle Pi is useful for heating, but I think I’d like to move there… (I’ve got four 8-cylinder Xeons pulling 500 W out the wall and barely keeping the place warm…)

Garry Heather
Joined: 23 Nov 20
Posts: 10
Credit: 362,743
RAC: 0
Message 100981 - Posted: 1 Apr 2021, 14:11:17 UTC - in response to Message 100980.  
Last modified: 1 Apr 2021, 14:19:05 UTC

I did reach out to Balena about their solution (their response: https://forums.balena.io/t/fold-client-offers-unsupported-project-climateprediction-net/218911/9?u=goto_gosub) and I subsequently tried a couple of other projects included in their manager (cannot remember which now) and none of those worked either. Hopefully without sounding disrespectful to Balena, I do not think they are going to make any changes to how their client works any time soon for a number of reasons, not least getting other projects on board and committing time and resources to making someone else's project compatible.

On a slightly different note, at the time of writing this the server status as reported on the website appears to be OK but my rig is reporting internet connectivity but no response from the servers again. Ho hum.

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100983 - Posted: 1 Apr 2021, 14:23:50 UTC - in response to Message 100981.  

Ho hum.
You’ve got 4 tasks running and 20 ready to start. That’s 24 more than a lot of other people…

Garry Heather
Joined: 23 Nov 20
Posts: 10
Credit: 362,743
RAC: 0
Message 100984 - Posted: 1 Apr 2021, 15:23:51 UTC - in response to Message 100983.  
Last modified: 1 Apr 2021, 15:40:48 UTC

This is true, but let's have some context here. I just wanted my single Pi to be kept busy because the cost of leaving it on 24/7 is not insignificant to me. There are some people here with multiple monsters processing work units. My solitary Pi was never going to make a dent in their requirements, so please do not think badly of me for trying to cache enough to keep it busy for a couple of days.

I will complete the units currently being processed but suspect that this project is not for me. I have aborted my cached units back into the pool.

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100985 - Posted: 1 Apr 2021, 15:56:58 UTC - in response to Message 100984.  
Last modified: 1 Apr 2021, 16:22:05 UTC

Nobody was asking or expecting you to abort the jobs – but what’s done is done, and cannot be undone. It makes no difference to the project who runs them, so please don’t be dissuaded from participating. The ones that weren’t resends are already out to other hosts. My machines are out of Rosetta work primarily because of the way I chose to set them up, and I’m too lazy to go round and change them all just to work around a bug in the work unit configuration. It’s arguably better that machines capable of running the ‘big’ tasks don’t pick up the ‘small’ ones, so that less-powerful machines do have a chance to run something.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,874,133
RAC: 8,427
Message 100989 - Posted: 1 Apr 2021, 17:58:17 UTC - in response to Message 100954.  


Don't you just hate folk who put @ in a sentence?

Like you just did? :^P
Are you one of those pricks who said "made you look" in the school playground as a kid? If so, how's the broken nose?

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,874,133
RAC: 8,427
Message 100990 - Posted: 1 Apr 2021, 17:59:25 UTC - in response to Message 100957.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 100993 - Posted: 1 Apr 2021, 18:45:40 UTC - in response to Message 100990.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?

There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.

mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100994 - Posted: 1 Apr 2021, 22:54:49 UTC - in response to Message 100977.  

should I set up any firewall rules?
Assuming it’s the same as on Windows:

The only thing that requires Internet access is the client, and it only makes HTTP(S) connections to the project servers. So you need to open tcp/80 and/or tcp/443 outbound (plus udp/53 or whatever else your DNS needs if that’s not handled by a separate resolver); everything else can be blocked.


Those ports seem to be open by default so I guess that I'm okay. Thanks for your reply.

mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100995 - Posted: 1 Apr 2021, 23:17:02 UTC - in response to Message 100979.  

I have seen several periods of downtime where work units have not been deployed for days at a time.

For an individual host's circumstances it's fine if you have a specific reason

This kind of reminds me of the hoarding that takes place here (even prior to the pandemic). There's a supply problem, which leads to hoarding, which makes it worse.

Kind of remarkable that we have too much unused CPU time to go around.

mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100996 - Posted: 1 Apr 2021, 23:25:33 UTC - in response to Message 100989.  


Are you one of those pricks who said "made you look" in the school playground as a kid? If so, how's the broken nose?


Woah, dude, where did that come from? Over the use of an "at" symbol?

If you get spun up that hard, that fast over what I write, maybe the better solution is to stop reading my posts, okay?

mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100997 - Posted: 1 Apr 2021, 23:36:14 UTC - in response to Message 100984.  


I will complete the units currently being processed but suspect that this project is not for me.


Don't take it personally. There's a three roll limit on toilet paper here because of some hoarders (not you). That's the rule. But best practice for the community at large is for folks to take less, if they can. If everybody does it, then there is more likely to be a ready supply available, including for you. It's something worth repeating, just so everyone is aware of it.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1691
Credit: 18,016,030
RAC: 20,964
Message 100999 - Posted: 2 Apr 2021, 1:05:24 UTC - in response to Message 100923.  

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for In progress numbers to get back to their pre-work shortage numbers. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.
A few days in and the impact of the mis-configured Work Units is becoming clearer. Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering.
For all of the latest & greatest systems there are, there are an awful lot more older, much more resource-limited systems.
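
For scale, the "around 200,000 (a third)" figure quoted above implies rough before and after totals. This is a sketch of that arithmetic only; both inputs are the approximate numbers from the post, and the function name is made up for this example.

```python
from fractions import Fraction


def implied_totals(drop: int, fraction_lost: Fraction) -> tuple[int, int]:
    """Return (before, after) task counts implied by an absolute and relative drop."""
    before = int(drop / fraction_lost)
    return before, before - drop


if __name__ == "__main__":
    before, after = implied_totals(200_000, Fraction(1, 3))
    print(before, after)
```

So 'tasks in progress' would have gone from roughly 600,000 to roughly 400,000, give or take the rounding in the original figures.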


Grant
Darwin NT

mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 101000 - Posted: 2 Apr 2021, 1:25:33 UTC - in response to Message 100999.  
Last modified: 2 Apr 2021, 1:26:59 UTC



Looks like the profile of a dead body lying in a shallow grave. How metaphorical.

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,188,754
RAC: 3,104
Message 101005 - Posted: 2 Apr 2021, 12:01:39 UTC - in response to Message 100999.  

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for In progress numbers to get back to their pre-work shortage numbers. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.
A few days in and the impact of the mis-configured Work Units is becoming clearer. Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering.
For all of the latest & greatest systems there are, there are an awful lot more older, much more resource-limited systems.



Just means more work for the rest of us!!

jsm
Joined: 4 Apr 20
Posts: 3
Credit: 77,309,664
RAC: 49,570
Message 101006 - Posted: 2 Apr 2021, 14:28:03 UTC

Bandwidth usage massively increased in March
I migrated to Rosetta from Seti almost exactly one year ago. For eleven months there was little impact on my capped 50 GB bandwidth allowance, but in March the usage more than doubled. I am using the same 6 computers and the same preferences, so nothing on my side has changed. When my ISP notified me that I had hit the cap halfway through March, I installed Wireshark and, after a difficult setup, captured packets at the router rather than at specific computers. Imagine my horror when I found that the culprit was Rosetta, using over 1 GB per 6 hours. This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100 Mbps) or cut back on Rosetta work.
Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
Any suggestions most welcome. I have clawed my way to league position 599 and would like to break 500 if possible.
Capt
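
For reference, the figures in the post work out roughly as follows. This is a minimal sketch that assumes the measured rate of 1 GB per 6 hours holds steady all month (the post says "over", so it is a lower bound) and that the month has 31 days; the function names are made up for this example.

```python
def monthly_usage_gb(gb_per_window: float, window_hours: float, days_in_month: int = 31) -> float:
    """Projected monthly transfer if the measured rate holds all month."""
    return gb_per_window * (24 / window_hours) * days_in_month


def days_until_cap(cap_gb: float, gb_per_window: float, window_hours: float) -> float:
    """Days of continuous running before the allowance is used up."""
    daily_gb = gb_per_window * 24 / window_hours
    return cap_gb / daily_gb


if __name__ == "__main__":
    print(monthly_usage_gb(1, 6))    # projected GB for a 31-day month
    print(days_until_cap(50, 1, 6))  # days until a 50 GB cap is reached
```

At that rate the projection is about 124 GB for the month, and a 50 GB allowance lasts about 12.5 days, which lines up with the ISP notice arriving around the middle of March.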



©2024 University of Washington
https://www.bakerlab.org