Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
"Makes use of unused COP cycles" Sounds so easy, doesn't it? I know that I have been here but a short time, but @Rosetta is higher maintenance than any turbulent girlfriend that I've ever had Then you have lived a charmed life, if my experience is anything to go by (usually not...) |
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
Moreover, Rosetta should be a project which is run in the background. So, I should not equip my computer to meet Rosetta's requirements; Rosetta should try to use my resources.

It would be nice if the kind of ground-breaking work done at Rosetta could be done on a basic device, but ground-breaking work rarely works like that. We all have to understand we need to make a certain commitment to this project, and if parameters change (meaning increase) then we have to see whether we can fall in line with that, if we have the resources to spare.

It only needs to be a minimal change to our background settings, taking no more than 30 seconds, maybe once a year. If we have the headroom available, that's no trouble at all. If people decide they can't continue their journey here for the sake of that change, so be it.

The project knows what it requires (as long as someone hasn't cocked up on a certain batch of tasks), and if its requirements change, it must know that some hosts will drop out as a result. That doesn't change its need - nor should it.
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
It is interesting to read how this is affecting other people - my Pi 4 rig I mentioned previously had acquired a cache of 2 days' worth of units but has since stopped downloading more due to the insufficient-memory issue. I do rather hope that this is not going to become the new normal.

At this project, with a 3-day deadline and default 8-hour runtimes, calling down more than 2 days of task cache is excessive. I used to target 2 days including runtime (so about 1.6 days of tasks not yet running), but I, and a lot of people in the last year, have reduced it to 1 day or less. We also set up back-up projects with a zero or minimal resource share for those occasions when all tasks here are completed.

That's good advice for anyone (some people recommend far less), so consider it the top end. One of the benefits is that it greatly reduces the resources you need to set aside for Rosetta, particularly when the project's minimum demands are increasing. It's an equitable balance, imo.
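For anyone wondering how to set that kind of cache in practice: it comes down to the two "store ... days of work" values in BOINC's computing preferences, which you can change in BOINC Manager under Options > Computing preferences. The sketch below shows the same thing from the command line on a typical Linux package install; the file path, the example values, and the assumption that a partial override file is enough are all mine, so treat it as a minimal illustration rather than a recipe.

```
# Sketch only: request roughly a 1-day cache (0.5 days minimum + 0.5 days extra).
# Path assumes a Debian/Ubuntu-style install; adjust for your system, and keep
# any other fields you already have in this file.
sudo tee /var/lib/boinc-client/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
  <work_buf_min_days>0.5</work_buf_min_days>
  <work_buf_additional_days>0.5</work_buf_additional_days>
</global_preferences>
EOF

# Ask the running client to re-read the override file
# (may need to be run from the BOINC data directory, or with --passwd).
boinccmd --read_global_prefs_override
```

The resource share for a back-up project is set on that project's website; a share of 0 means it only gets work when your other projects have none.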
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
In the past it has taken several days for 'In progress' numbers to get back to their pre-work-shortage numbers. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).

Say hello to two fewer hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.

You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…

It certainly is odd, and what I've observed matches what you've both said. But I don't agree (if that's what you're saying, which it may not be) that people are voting with their feet, nor that it's hosts being slow to restore their caches. Much more likely is that people who don't micromanage their systems are coming up against these higher Disk/RAM requirements and either not noticing or not knowing how to resolve it themselves (after all, we've struggled), and that's amounting to this one-third reduction in hosts grabbing tasks.

Someone may notice the shortfall on the project side and tie it in with the whole Disk/RAM thing, or they may see that the queued tasks are down below 1m again, think it's a localised issue that will resolve itself, and not actively do anything about it. I suspect we'll find out which within another day or so.

Edit: Sorry, I was going to add one more thing. I have one remote PC on my team, which I know is running, but something's gone wrong with its video output, so I can't change anything on my one-day-per-month visits to it during the UK lockdown (building work at the house it's in has clogged the whole room with dust). What I've recently noticed is that it hasn't downloaded a Rosetta task for a week or so, but is pulling down lots of WCG tasks. I suspect it's hit this same wall on either disk or RAM, even though it only runs 4 cores and has 16 GB RAM, which ought to be plenty of space on both counts. This is just the kind of host I'm talking about above: available for work, loads of space for work in theory, but it can't pull any tasks down, so it's running its back-up project 24/7.
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
Still odd that with my number of cores/threads and available system RAM I haven't had issues.

It must be the case that the server only considers a host’s total available RAM and disk space (not per core) in deciding whether a task is suitable.

You think? How does that tie in with the fact that changing the percentage of disk or RAM allocated to BOINC resolves the issue? I may well be misunderstanding your point, tbf.
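For anyone following along and wondering which knobs are being discussed: on a stock BOINC install the disk and memory ceilings live in the Disk and Memory tab of BOINC Manager's Computing preferences, or in the same global_prefs_override.xml file mentioned above. The field names below are the ones I believe the client parses, and the values are only an illustrative sketch; Balena-style images may not expose this file at all.

```
# Sketch: loosen the disk/RAM ceilings the client enforces (illustrative values only).
# Keep any other fields already present in your override file.
sudo tee /var/lib/boinc-client/global_prefs_override.xml >/dev/null <<'EOF'
<global_preferences>
  <disk_max_used_gb>20</disk_max_used_gb>            <!-- absolute cap on BOINC disk use -->
  <disk_max_used_pct>50</disk_max_used_pct>          <!-- ...or this % of the disk, whichever is smaller -->
  <disk_min_free_gb>1</disk_min_free_gb>             <!-- always leave this much free -->
  <ram_max_used_busy_pct>50</ram_max_used_busy_pct>  <!-- RAM limit while the machine is in use -->
  <ram_max_used_idle_pct>90</ram_max_used_idle_pct>  <!-- RAM limit while it is idle -->
</global_preferences>
EOF
boinccmd --read_global_prefs_override
```

The client applies the most restrictive of the disk limits, which is why loosening just one of them sometimes isn't enough to let new tasks download.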
mrhastyrib · Joined: 18 Feb 21 · Posts: 90 · Credit: 2,541,890 · RAC: 0
It didn't start out so hot, but it's getting better year by year. I'm grateful for the life that I have now.

On another topic: running Linux Mint here, should I set up any firewall rules? Or does BOINC operate through the ports that are already open?
mrhastyrib · Joined: 18 Feb 21 · Posts: 90 · Credit: 2,541,890 · RAC: 0
Like you just did? :^P
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
Then you have lived a charmed life, if my experience is anything to go by (usually not...)

Good for you. Balance out that good karma with a little bit of negative karma here. You'll cope.

I'm completely ignorant on Linux (though I used MiNT on Ataris 20-30 years ago). I'm not aware of BOINC needing anything out of the ordinary, but someone else will be along in a minute with the answer you need.
Garry Heather · Joined: 23 Nov 20 · Posts: 10 · Credit: 362,743 · RAC: 0
Regarding my 2-day cache, I do not consider that excessive, given that my £35 SBC running off an SD card (OK, it's an SSD now, but when I started it was an SD card) has more reliable uptime than the servers running the project. In the time I've been doing work for Rosetta I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

I also get the idea of other projects, and I looked into it; however, my current setup is severely restricted in that department. I am currently using the Balena client image, mostly out of convenience, and I have not found another project compatible with both it and my processor architecture. At the moment I haven't got the time to go digging around in Linux trying to make this work, as I'm still very much learning. I know enough to be dangerous, but even online guides tend to make assumptions about people's prior knowledge that are a considerable barrier to entry.
Garry Heather · Joined: 23 Nov 20 · Posts: 10 · Credit: 362,743 · RAC: 0
Duplicate post deleted. |
Sophie · Joined: 13 Aug 19 · Posts: 5 · Credit: 1,410,379 · RAC: 342
Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting. Examples:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs are looking fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.
Falconet · Joined: 9 Mar 09 · Posts: 354 · Credit: 1,276,393 · RAC: 2,018
Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.

My reply on your other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14525&postid=100960
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
Each task has 3 processes using the same amount of RAM, but only one of those 3 is using the CPU; the other two have near-zero CPU time. So, 6 tasks, 18 processes using 1-2 GB each.

That doesn’t sound right, but as I don’t run BOINC on Linux I can’t add more…
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
Should I set up any firewall rules?

Assuming it’s the same as on Windows: the only thing that requires Internet access is the client, and it only makes HTTP(S) connections to the project servers. So you need to allow tcp/80 and/or tcp/443 outbound (plus udp/53, or whatever else your DNS needs if that’s not handled by a separate resolver); everything else can be blocked.
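To make that concrete for Linux Mint: if you use ufw (which I believe Mint ships, disabled by default) and want a default-deny outbound policy, something like the sketch below covers the ports Brian lists. This is only an illustration, not an official recommendation; with ufw's default of allowing outbound traffic, BOINC needs no rules at all.

```
# Illustrative ufw rules for a default-deny outbound policy (adjust to taste).
sudo ufw default deny incoming      # BOINC needs no inbound ports at all
sudo ufw default deny outgoing      # block outbound traffic by default
sudo ufw allow out 53               # DNS (drop this if a local resolver handles it)
sudo ufw allow out 80/tcp           # HTTP  - scheduler/download servers
sudo ufw allow out 443/tcp          # HTTPS - ditto
sudo ufw enable
```

Bear in mind that a default-deny outbound policy will also affect everything else on the machine that talks to the Internet, so only go that far if you actually want that behaviour.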
Trotador · Joined: 30 May 09 · Posts: 108 · Credit: 291,214,977 · RAC: 0
Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.

278 failures here.
Sid Celery · Joined: 11 Feb 08 · Posts: 2141 · Credit: 41,518,559 · RAC: 10,612
Regarding my 2-day cache, I do not consider that excessive, given that my £35 SBC running off an SD card (OK, it's an SSD now, but when I started it was an SD card) has more reliable uptime than the servers running the project. In the time I've been doing work for Rosetta I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

For an individual host's circumstances it's fine if you have a specific reason, but as a general rule it is excessive.

About a year ago we had a lot of new hosts arrive from Seti with huge multicore machines (which you don't have) whose owners were used to large caches, because there were no restrictions there, and who ran with the shortest runtimes (which you don't). They were hoovering up all available tasks, to the exclusion of everyone else, running them for the shortest, least productive time, sending them back almost immediately, then complaining they couldn't re-fill their oversize caches again. Simultaneously, a whole bunch of very keen new users with more reasonable settings couldn't get any tasks to run at all and had their enthusiasm knocked out of them. With tasks in short supply, no-one was happy.

The solution was to cut deadlines from 7 to 3 days, force the immediate aborting and re-issue of tasks that couldn't make deadline, and remove the 1-hour runtime option so tasks ran 2 hours minimum with a default of 8 hours. That way the available tasks didn't sit in offline caches that wouldn't run for a week while others had empty cores waiting for work, and the tasks that did come back were more productive. The result was immediate availability of work for everyone, no more shortage of tasks, and more rapid task turnaround of greater value to the project.

I only say it out loud now because all those reasons still apply and we're tight on tasks in the queue again, so it helps to eke them out just a little longer. Obviously, your (currently) 18 tasks with 4 running at default runtimes don't do any harm individually - it's more a general rule, like when 32- and 64-core machines had 2,000-3,000 tasks in their caches, each running just an hour (or less). My old 8-core used to store around 50 tasks; now my 16-core keeps nearer 55-60. All proportionate.
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
I have seen several periods of downtime where work units have not been deployed for days at a time. I don't mind having a Pi dedicated to the task so long as it's doing real work and not just heating the room.

If you have access to the BOINC Manager application, you might try adding World Community Grid. That reportedly has an ARM Linux application for its OpenPandemics sub-project, and much smaller work units. Otherwise it might be worth getting in touch with Balena to explain the issue and see if they would consider adding something for a different project in the same way they did for Rosetta (though they may find it harder to convince IBM than Baker Lab to let them hack at their applications; SiDock is another similar project without an ARM build (yet) that might benefit from that kind of effort).

Do bear in mind that we are here to help the project, not the other way round. If they happen not to have any work that needs doing at any given time, it’s their choice not to make use of a resource that’s available to them, not a cause for us to complain.

I’m not sure which part of the U.K. you’re in where an idle Pi is useful for heating, but I think I’d like to move there… (I’ve got four 8-cylinder Xeons pulling 500 W out the wall and barely keeping the place warm…)
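For readers with a normal BOINC install (rather than the Balena image), a back-up project can also be attached from the command line. The sketch below is illustrative only: the account key is a placeholder you get from the project website, and most people will simply use the Add Project wizard in BOINC Manager. Setting the project's resource share to 0 on its website keeps it as a pure back-up.

```
# Illustrative only: attach World Community Grid as a back-up project.
# Replace YOUR_ACCOUNT_KEY with the account key shown on the project website.
boinccmd --project_attach https://www.worldcommunitygrid.org/ YOUR_ACCOUNT_KEY

# Check that the client has picked it up.
boinccmd --get_project_status
```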
Garry Heather · Joined: 23 Nov 20 · Posts: 10 · Credit: 362,743 · RAC: 0
I did reach out to Balena about their solution (their response: https://forums.balena.io/t/fold-client-offers-unsupported-project-climateprediction-net/218911/9?u=goto_gosub), and I subsequently tried a couple of other projects included in their manager (cannot remember which now) and none of those worked either. Hopefully without sounding disrespectful to Balena, I do not think they are going to make any changes to how their client works any time soon, for a number of reasons, not least getting other projects on board and committing time and resources to making someone else's project compatible.

On a slightly different note: at the time of writing, the server status as reported on the website appears to be OK, but my rig is reporting internet connectivity and yet no response from the servers again. Ho hum.
Brian Nixon · Joined: 12 Apr 20 · Posts: 293 · Credit: 8,432,366 · RAC: 0
Ho hum.

You’ve got 4 tasks running and 20 ready to start. That’s 24 more than a lot of other people…