Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100948 - Posted: 1 Apr 2021, 1:30:22 UTC - in response to Message 100911.  

"Makes use of unused CPU cycles" Sounds so easy, doesn't it? I know that I have been here but a short time, but @Rosetta is higher maintenance than any turbulent girlfriend that I've ever had.

Then you have lived a charmed life, if my experience is anything to go by (usually not...)
ID: 100948
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100949 - Posted: 1 Apr 2021, 1:40:17 UTC - in response to Message 100913.  

Moreover, Rosetta should be a project which is run in the background. So, I should not equip my computer to meet Rosetta requirements, but Rosetta should try to use my resources.

I hear you, but all programs specify the minimum hardware requirements. My complaint would be that those requirements for @Rosetta seem to change without notice.

On the other hand, it seems like they don't have enough work to consistently utilize the hardware available to them. If that's the case, then it makes no sense to spend time fine-tuning the program so that even more capacity is available.

It would be nice if the kind of ground-breaking work done at Rosetta could be done on a basic device, but ground-breaking work kind-of rarely works like that.
We all have to understand we need to make a certain commitment to this project and if parameters change (meaning increase) then we have to see if we can fall in line with that if we have it to spare.
It only need be a minimal change to our background settings, taking no more than 30 seconds, maybe once a year. If we have it available, that's no trouble at all.

If people decide that they can't continue their journey here for the sake of that change, so be it.
The project knows what it requires (as long as someone hasn't cocked up on a certain batch of tasks) and if their req'ts change, they must know some hosts will drop out as a result.
It doesn't change their need - nor should it.
ID: 100949
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100950 - Posted: 1 Apr 2021, 1:50:16 UTC - in response to Message 100914.  

It is interesting to read how this is affecting other people - my Pi 4 rig I mentioned previously had acquired a cache of 2 days worth of units but has since stopped downloading more due to the insufficient memory issue. I do rather hope that this is not going to become the new normal.

At this project, with a 3-day deadline and default 8hr runtimes, calling down any more than 2 days of task cache is excessive.
I used to target 2-days including runtime (so about 1.6 days of tasks not running) but I, and a lot of people in the last year, have reduced it to 1 day or less.
We also set up back-up projects with a zero or minimal resource share for those occasions when all tasks are completed here.
That's good advice for anyone (some people recommend far less) so consider it the top-end.
One of the benefits is it very much reduces the resources you need to set aside for Rosetta, particularly when their minimum demands are increasing.
It's an equitable balance imo.
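
For anyone wanting to put numbers on that, here is a rough sketch of the cache arithmetic. It assumes one task per core, every task running the full default runtime, and a cache target that includes the tasks already running; the function name and parameters are invented for illustration and are not part of BOINC.

```python
def cache_estimate(cores, total_days, runtime_hours=8.0):
    """Rough count of tasks held for a cache target that includes runtime.

    Assumes one task per core and every task running for runtime_hours.
    Returns (running, queued), where queued tasks are still waiting to start.
    """
    total_hours = total_days * 24.0
    # Hours of work per core that are queued behind the running task.
    queued_hours = max(0.0, total_hours - runtime_hours)
    queued = round(cores * queued_hours / runtime_hours)
    return cores, queued


# A 4-core host (a Pi 4, say) at the 2-day top end, then at 1 day.
print(cache_estimate(4, 2.0))
print(cache_estimate(4, 1.0))
```

On those assumptions a 4-core host targeting 2 days holds 4 running plus 20 waiting, and dropping to 1 day cuts the waiting tasks to 8.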
ID: 100950
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100951 - Posted: 1 Apr 2021, 2:10:36 UTC - in response to Message 100923.  
Last modified: 1 Apr 2021, 2:27:52 UTC

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for 'In progress' numbers to get back to their pre-shortage levels. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.

What is odd is that these messages are occurring now, with Tasks that don't require much RAM at all (less than 300MB) compared to many of the previous Tasks (around 800MB). Every one of my current Tasks is using less than 300MB.

It certainly is odd, and what I've observed matches what you've both said.
But I don't agree (if it's what you're saying, which it may not be) that people are voting with their feet, nor that it's hosts being slow to restore their caches.

Much more likely is that people who don't micromanage their systems are coming up against these higher Disk/RAM req'ts and either not noticing or not knowing how to resolve it themselves (after all, we've struggled), and that's amounting to this one-third reduction in hosts grabbing tasks.

Someone may notice the shortfall on the project side and tie it in with the whole Disk/RAM thing, or they may be aware that the queued tasks are down below 1m again, think it's a localised issue that'll resolve itself, and so not actively do anything about it. I suspect we'll find out which within another day or so.

Edit: Sorry, I was going to add one more thing.
I have one remote PC on my team, which I know is running but something's gone wrong with its video output, so I can't change anything on my one-day-per-month visits to it during the UK lockdown (building work at the house it's in has clogged the whole room with dust).
What I've recently noticed is that it hasn't downloaded a Rosetta task for a week or so, but is pulling down lots of WCG tasks. I suspect it's hit this same wall on either Disk or RAM, even though it only runs 4 cores but has 16GB RAM, which ought to be plenty of space on both counts.
This is just the kind of host I'm talking about above: available for work, loads of space for work in theory, but unable to pull any tasks down, so it's running its back-up project 24/7.
ID: 100951
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100952 - Posted: 1 Apr 2021, 2:13:41 UTC - in response to Message 100926.  

Still odd that with my number of cores/threads and available system RAM I haven't had issues.
It must be the case that the server only considers a host’s total available RAM and disk space (not per core) in deciding whether a task is suitable.

So if a task tells the server it might need 6.6 GB of RAM, the server will never send it to any host with less (even if in practice it would not need anywhere near that much), but it will happily send you 24 of them because they can run (just maybe not all at the same time).

There can’t be many machines with >6 GB RAM per core…

You think?
How does that tie in with the fact that if the %age of Disk or RAM allocated to Boinc is changed, then it resolves the issue?
I may well be misunderstanding your point tbf
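
A toy sketch of how both observations could be true at once. This is not the real scheduler logic, only the hypothesis quoted above combined with the assumption that the "available RAM" the server sees is the host's total RAM scaled by the percentage allocated to BOINC. Every name here is hypothetical.

```python
def usable_ram_gb(total_ram_gb, boinc_ram_pct):
    """RAM the client may use under an assumed 'use at most X%' preference."""
    return total_ram_gb * boinc_ram_pct / 100.0


def server_would_send(task_ram_bound_gb, total_ram_gb, boinc_ram_pct):
    """Toy per-host check: the task's declared bound is compared once against
    the host's usable RAM, with no division by the number of cores."""
    return task_ram_bound_gb <= usable_ram_gb(total_ram_gb, boinc_ram_pct)


# Illustrative numbers only: a task declaring 6.6 GB, an 8 GB host.
print(server_would_send(6.6, 8, 50))
print(server_would_send(6.6, 8, 90))
```

With these made-up figures the 8 GB host is refused at 50% and accepted at 90%, whatever its core count, which would explain why raising the percentage clears the message.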
ID: 100952
mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100953 - Posted: 1 Apr 2021, 2:15:34 UTC - in response to Message 100948.  


Then you have lived a charmed life, if my experience is anything to go by (usually not...)


It didn't start out so hot, but it's getting better year by year. I'm grateful for the life that I have now.

On another topic, running Linux Mint here, should I set up any firewall rules? Or does BOINC operate thru the open ports?
ID: 100953
mrhastyrib
Joined: 18 Feb 21
Posts: 90
Credit: 2,541,890
RAC: 0
Message 100954 - Posted: 1 Apr 2021, 2:17:52 UTC - in response to Message 100939.  


Don't you just hate folk who put @ in a sentence?

Like you just did? :^P
ID: 100954
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100955 - Posted: 1 Apr 2021, 2:35:04 UTC - in response to Message 100953.  

Then you have lived a charmed life, if my experience is anything to go by (usually not...)

It didn't start out so hot, but it's getting better year by year. I'm grateful for the life that I have now.

On another topic, running Linux Mint here, should I set up any firewall rules? Or does BOINC operate thru the open ports?

Good for you. Balance out that good karma with a little bit of negative karma here. You'll cope

I'm completely ignorant on Linux (though I used MinT on Ataris 20-30 years ago).
I'm not aware of Boinc needing anything out of the ordinary, but someone else will be along in a minute with the answer you need.
ID: 100955
Garry Heather
Joined: 23 Nov 20
Posts: 10
Credit: 362,743
RAC: 0
Message 100956 - Posted: 1 Apr 2021, 5:08:26 UTC - in response to Message 100950.  
Last modified: 1 Apr 2021, 6:06:11 UTC

Regarding my 2 day cache, I do not consider that excessive given that my £35 SBC running off an SD card (OK, it's an SSD now but when I started it was an SD card) has a more reliable uptime than the servers running the project. In the time I've been doing work for Rosetta I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects and I looked into it, however my current setup is severely restricted in that department. I am currently using the Balena client image mostly out of convenience and I have not found another project compatible with that and my processor architecture. At the moment I haven't got the time to go digging around in Linux trying to make this work as I'm still very much learning. I know enough to be dangerous but even online guides tend to make assumptions about people's prior knowledge that are a considerable block to entry.
ID: 100956
Sophie
Joined: 13 Aug 19
Posts: 5
Credit: 1,410,379
RAC: 140
Message 100967 - Posted: 1 Apr 2021, 11:07:16 UTC

Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.
Examples:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs look fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.
ID: 100967
Falconet
Joined: 9 Mar 09
Posts: 354
Credit: 1,276,393
RAC: 828
Message 100968 - Posted: 1 Apr 2021, 11:09:38 UTC - in response to Message 100967.  

Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.
Examples:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs look fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.



My reply on your other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14525&postid=100960
ID: 100968
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100976 - Posted: 1 Apr 2021, 13:19:16 UTC - in response to Message 100937.  

Each task has 3 processes using the same amount of ram, but only one of those 3 is using the cpu, the other two are near zero cpu time. So, 6 tasks, 18 processes using 1-2gb each.
That doesn’t sound right, but as I don’t run BOINC on Linux I can’t add more…
ID: 100976
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100977 - Posted: 1 Apr 2021, 13:29:55 UTC - in response to Message 100953.  

should I set up any firewall rules?
Assuming it’s the same as on Windows:

The only thing that requires Internet access is the client, and it only makes HTTP(S) connections to the project servers. So you need to open tcp/80 and/or tcp/443 outbound (plus udp/53 or whatever else your DNS needs if that’s not handled by a separate resolver); everything else can be blocked.
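
Once the rules are in place, one way to check the outbound side is to attempt a TCP connection from the BOINC host. This is a minimal sketch using only Python's standard library; the function name is made up, and the host name in the usage note is simply the one that appears in the project URLs in this thread.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds.

    A failed DNS lookup, a refused connection and a timeout all count
    as False, since each raises a subclass of OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("boinc.bakerlab.org", 443)` exercises both name resolution and the tcp/443 rule, so a False result could mean either is blocked (or that the server itself is not answering).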
ID: 100977
Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 100978 - Posted: 1 Apr 2021, 13:43:03 UTC - in response to Message 100968.  

Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.
Examples:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs look fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.



My reply on your other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14525&postid=100960


278 fails here
ID: 100978
Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 100979 - Posted: 1 Apr 2021, 13:59:23 UTC - in response to Message 100956.  

Regarding my 2 day cache, I do not consider that excessive given that my £35 SBC running off an SD card (OK, it's an SSD now but when I started it was an SD card) has a more reliable uptime than the servers running the project. In the time I've been doing work for Rosetta I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects and I looked into it, however my current setup is severely restricted in that department. I am currently using the Balena client image mostly out of convenience and I have not found another project compatible with that and my processor architecture. At the moment I haven't got the time to go digging around in Linux trying to make this work as I'm still very much learning. I know enough to be dangerous but even online guides tend to make assumptions about people's prior knowledge that are a considerable block to entry.

For an individual host's circumstances it's fine if you have a specific reason, but as a general rule it is excessive.

About a year ago we had a lot of new hosts arrive from Seti with huge multicore machines (which you don't) who were used to large caches, because there were no restrictions there, and ran with the shortest runtimes (which you don't) so they were hoovering up all available tasks, to the exclusion of everyone else, running them for the shortest, least productive time, sending them back almost immediately, then complaining they couldn't re-fill their oversize caches again. Simultaneously, a whole bunch of very keen new users with more reasonable settings couldn't get any tasks to run at all and had their enthusiasm knocked out of them. With tasks in short supply, no-one was happy.

The solution was to cut deadlines from 7 to 3 days, force the immediate aborting (for re-issue) of tasks that couldn't make deadline, and remove the possibility of 1hr runtimes so tasks ran for a 2hr minimum and 8hrs by default. That way the tasks which were available didn't sit in offline caches that wouldn't run for a week while others had empty cores waiting for work, and the tasks that did come back were more productive. The result was immediate availability of work for everyone, no more shortage of tasks and more rapid task turnaround of greater value for the project.

I only say it out loud now because all those reasons still apply and we're tight on tasks in the queue again, so it helps to eke them out just a little longer.
Obviously, your (currently) 18 tasks with 4 running to default hrs don't do any harm individually; it's more of a general rule, as when 32- and 64-core machines had 2,000-3,000 tasks in their cache, each running just an hour (or less).
My old 8-core used to store around 50 tasks, now my 16-core keeps nearer 55-60. All proportionate.
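
To put rough numbers on that, here is a back-of-envelope sketch. It assumes every core stays busy and every task runs for exactly the stated runtime; the function name is invented for illustration.

```python
import math


def hours_to_clear(cached_tasks, cores, runtime_hours):
    """Hours needed to work through a cache, assuming all cores stay busy."""
    # Tasks run in waves of one per core; a partial wave still takes a full runtime.
    waves = math.ceil(cached_tasks / cores)
    return waves * runtime_hours


# A 64-core host holding 3,000 tasks, at 1hr and then at the 8hr default.
print(hours_to_clear(3000, 64, 1))
print(hours_to_clear(3000, 64, 8))
# A 16-core host holding 60 tasks at the 8hr default.
print(hours_to_clear(60, 16, 8))
```

On those assumptions the 3,000-task cache clears in 47 hours at 1hr runtimes, comfortably inside the old 7-day deadline, but would need 376 hours at 8hrs, far beyond a 72-hour deadline. The 60-task cache clears in 32 hours.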
ID: 100979
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100980 - Posted: 1 Apr 2021, 14:01:40 UTC - in response to Message 100956.  

I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects and I looked into it, however my current setup is severely restricted in that department. I am currently using the Balena client image mostly out of convenience and I have not found another project compatible with that and my processor architecture.
If you have access to the BOINC Manager application, you might try adding World Community Grid. That reportedly has an ARM Linux application for its OpenPandemics sub-project, and much smaller work units. Otherwise it might be worth getting in touch with Balena to explain the issue and see if they would consider adding something for a different project in the same way they did for Rosetta (though they may find it harder to convince IBM than Baker Lab to let them hack at their applications; SiDock is another similar project without an ARM build (yet) that might benefit from that kind of effort).

Do bear in mind that we are here to help the project, not the other way round. If they happen not to have any work that needs doing at any given time, it’s their choice not to make use of a resource that’s available to them, not a cause for us to complain.

I’m not sure which part of the U.K. you’re in where an idle Pi is useful for heating, but I think I’d like to move there… (I’ve got four 8-cylinder Xeons pulling 500 W out the wall and barely keeping the place warm…)
ID: 100980
Garry Heather
Joined: 23 Nov 20
Posts: 10
Credit: 362,743
RAC: 0
Message 100981 - Posted: 1 Apr 2021, 14:11:17 UTC - in response to Message 100980.  
Last modified: 1 Apr 2021, 14:19:05 UTC

I did reach out to Balena about their solution (their response: https://forums.balena.io/t/fold-client-offers-unsupported-project-climateprediction-net/218911/9?u=goto_gosub) and I subsequently tried a couple of other projects included in their manager (I cannot remember which now), and none of those worked either. Hopefully without sounding disrespectful to Balena, I do not think they are going to make any changes to how their client works any time soon for a number of reasons, not least the difficulty of getting other projects on board and committing time and resources to making someone else's project compatible.

On a slightly different note, at the time of writing this the server status as reported on the website appears to be OK but my rig is reporting internet connectivity but no response from the servers again. Ho hum.
ID: 100981
Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100983 - Posted: 1 Apr 2021, 14:23:50 UTC - in response to Message 100981.  

Ho hum.
You’ve got 4 tasks running and 20 ready to start. That’s 24 more than a lot of other people…
ID: 100983



©2024 University of Washington
https://www.bakerlab.org