Posts by floyd

1) Message boards : Number crunching : task swamping on multi-project host guidance requested (Message 101446)
Posted 22 Apr 2021 by floyd
Post:
* Gutted all my controls via app_config on projects
Please don't be so vague; undoing app_config settings is not trivial. Deleting the file is of course not enough, but neither is reloading the (now non-existent) configuration, updating the project, or restarting BOINC: at the very least the CPU and GPU values persist. The project's original values only come back with new tasks, and I'm not sure to what extent they are applied even then. I am, however, sure that the values displayed for old tasks are not updated without another client restart, so whatever you see there may be outdated.
When I want to revert app_config settings, I first change them to the values used by the project, then reload the configuration, and only then delete the file and restart the client. And I try to avoid app_config in the first place. Don't think of app_config as an easy and safe configuration tool for average users; it is a later add-on to BOINC which, as far as I know, has never been fully integrated. If you use it, expect unexpected things to happen. I'm quite sure that getting many more tasks than you could finish was one such thing.
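To make clear which values I mean, here is the kind of app_config.xml I'm talking about. This is only a sketch; the app name and the numbers are placeholders, not anything your projects actually use. The CPU and GPU numbers inside gpu_versions are exactly the ones that stick around after the file is gone:

    <app_config>
        <app>
            <name>example_app</name>             <!-- placeholder; must match the project's app name -->
            <max_concurrent>4</max_concurrent>
            <gpu_versions>
                <gpu_usage>1.0</gpu_usage>       <!-- GPUs used per task -->
                <cpu_usage>1.0</cpu_usage>       <!-- CPUs reserved per task -->
            </gpu_versions>
        </app>
        <project_max_concurrent>6</project_max_concurrent>
    </app_config>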

* Changed my prefs to use max of 75% of cores
At that point the event log will show you how many CPUs that translates to. Likely the correct six. I've seen BOINC schedule one CPU more than configured when in panic mode but that shouldn't be the case here with only 10 tasks in progress and nearly full time left.

Updated all projects and kickstarted it. Rosetta took 5 cores, GPU projects 1 per, 2 total.
Is that what the Manager showed you? Again, that may not reflect reality. Without a different configuration I'd expect 1 core in total to be scheduled for the GPU tasks and the remaining 5 of 6 for CPU tasks. Real usage will more likely have been 2+5, more than you wanted. If the Manager displayed exactly that in this case, it was coincidence.

* Changed my prefs to use max of 74% of cores because inclusive programmer math.
I don't think so.
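To put numbers on the percentages, on what appears to be an 8-thread host, and assuming the client simply truncates when converting the percentage into a CPU count (my assumption, I haven't checked the source; if it rounds instead, 74% and 75% come out the same):

    0.75 × 8 = 6.0   ->  6 usable CPUs
    0.74 × 8 = 5.92  ->  5 usable CPUs (if truncated)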
Either set 75% and configure the GPU projects to schedule 1 CPU and 1 GPU per task; that way up to 2 CPUs will be scheduled for (usually) 2 GPU tasks and the remaining 4-6 for CPU tasks. Or set 50% and 0.1 CPU + 1 GPU; due to the way BOINC schedules CPUs it will then not reserve any for GPU support (though the applications still use them), and you will always have 4 CPUs for CPU tasks but never more. Those are two simple suggestions; of course you can make things more complicated by running several tasks per GPU.
2) Message boards : Number crunching : task swamping on multi-project host guidance requested (Message 101386)
Posted 20 Apr 2021 by floyd
Post:
First, I agree with Grant's analysis of the underlying problem. Second, I'd like to suggest another course of action which may be more to your liking.

Some remarks first. Don't use project_max_concurrent, and if you do, make sure to adjust "use n% of CPUs" accordingly; otherwise you can expect BOINC to fetch more tasks than you allow it to actually process. Don't insist on running 1 Milkyway, 1 Einstein and 4 Rosetta tasks at all times. Run each project CPU only or GPU only and use resource share to balance the projects within each group; this will not work across groups. Keep your cache of work small, though you don't need to go as far as 0.01 days; maybe 0.1 to 0.5 days is good.

Plan 1:
(This is mostly what Grant already suggested) Configure your GPU projects to reserve 1 CPU per task. Configure BOINC to use 100% of CPUs or whatever your preferred maximum is.
Pro: Will always fully load your CPU.
Con: May not run GPU work at all times.

Plan 2:
Set "use n% of CPUs" for CPU tasks only. Make sure there's one CPU left for any possible GPU task running concurrently. Configure the GPU projects to reserve 0.1 (or even less) CPUs per task so the total of all possible GPU tasks is less than 1 CPU.
Pro: Will always run GPU work if available.
Con: If there is no GPU work, it will not do more CPU work instead.

I think plan 2 is more like what you want.
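As a sketch of Plan 2, assuming the standard app_config.xml format and with a placeholder application name (you would need one such block per GPU application, in each GPU project's directory):

    <app_config>
        <app>
            <name>gpu_app_name</name>         <!-- placeholder; the project's real app name goes here -->
            <gpu_versions>
                <gpu_usage>1.0</gpu_usage>    <!-- one task per GPU -->
                <cpu_usage>0.1</cpu_usage>    <!-- small enough that no full CPU gets reserved -->
            </gpu_versions>
        </app>
    </app_config>

Plus "use n% of CPUs" set low enough that at least one CPU stays free for the GPU applications. Plan 1 is the same thing with <cpu_usage>1.0</cpu_usage> and the CPU percentage back at your preferred maximum.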
3) Message boards : Number crunching : Some Tasks failing with STATUS_ACCESS_VIOLATION (Message 97939)
Posted 6 Jul 2020 by floyd
Post:
I would suggest going to a manual voltage and fixed clock multiplier
And I would suggest not doing that before the system is stable at or close to default settings, if at all. The OP seems more concerned about stability than performance.
4) Message boards : Number crunching : Some Tasks failing with STATUS_ACCESS_VIOLATION (Message 97938)
Posted 6 Jul 2020 by floyd
Post:
Motherboard BIOS updates for the CPU Microcode updates should sort those sort of issues out,
They haven't so far and I don't expect any more to come.
5) Message boards : Number crunching : Some Tasks failing with STATUS_ACCESS_VIOLATION (Message 97929)
Posted 5 Jul 2020 by floyd
Post:
I agree that this is likely a hardware issue, but I'd like to add some thoughts that haven't come up yet. First, regarding the relevance of MemTest: I had a computer that could run it for hours without a single error but would crash after seconds under real (BOINC) load until I increased the RAM voltage. Second, early first-generation Ryzens were buggy. I have a 1700 too and it has several problems, including frequent access violations when running R@H or WCG's MIP. I also have a later 1700X that handles all of that just fine.
6) Message boards : Number crunching : Downloaded way too many tasks at once? (Message 96400)
Posted 12 May 2020 by floyd
Post:
There can be no sensible reason for that. Imagine you have a cardboard box
I understand your reasoning, I just disagree when you say there's no alternative to your suggested action. Let's get back to BOINC; your cardboard box example is a little too simplified.

So you have some tasks running, and a big Rosetta task and some smaller tasks waiting. The Rosetta task doesn't fit, so you run a smaller one instead and have made sure there are no unused resources. But that's not the end of the story: the big one is still waiting. When a task ends and you get some free memory, you could face the same situation and again decide to do something else first. Can you be sure the big task will ever run? You are relying on luck there. You always do as much as possible, but you can't be sure that everything will get done eventually.

There is an alternative if you don't have enough memory: keep what you have and wait for more. That way you'll eventually have enough, but you don't do as much work as you could. This is not ignoring the situation; there is a plan behind it, just with different objectives than yours.

These are just two simple alternatives, both with their advantages and drawbacks. You could of course improve them at the cost of added complexity. I'm not trying to start an argument about the best decision here, I just want to show that there is more than one. If someone doesn't do what you think they should, that doesn't necessarily mean they haven't thought about it. In fact they may have thought about it carefully and decided otherwise. We simply don't know.
7) Message boards : Number crunching : WUs stuck at 99.50% and no progress (Message 96373)
Posted 11 May 2020 by floyd
Post:
I also saw the checkpoint missing, but I simply don't know what that could mean.
I also don't know what that means in detail, but I would think that in all that time the task hasn't reached the first intermediate point where anything is worth saving.

First problematic task has PID 12258 and definitely it is doing something in there
Well I'm surprised now. I've occasionally seen tasks with the clock ticking but nothing being done. Those continued fine after a restart. But I've never seen a task work that long without coming to a result. Someone with more detailed knowledge will have to tell us how we can know if the task is actually making progress and will eventually come to an end.

By the way, I also have three of those running. Only around two hours now and nothing suspicious, except the displayed progress is quite high for that short time.
Addendum: The first task finished after three hours.
8) Message boards : Number crunching : WUs stuck at 99.50% and no progress (Message 96364)
Posted 11 May 2020 by floyd
Post:
31 hours of runtime and no checkpoint? Check whether those tasks cause any CPU load; I dare guess they haven't done any work at all and the 99.5% progress is just fake. You can turn LAIM (leave applications in memory) off, then suspend and resume the tasks to restart them from the beginning. You have enough time left, but watch whether they work normally this time. Or abort them and leave them to somebody else.
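If you prefer the command line over the Manager, roughly the same can be done like this; the project URL and task name are placeholders, and LAIM is the "leave non-GPU tasks in memory while suspended" preference, here set in global_prefs_override.xml:

    in global_prefs_override.xml:
        <leave_apps_in_memory>0</leave_apps_in_memory>
    then:
        boinccmd --read_global_prefs_override
        boinccmd --task <project_url> <task_name> suspend
        boinccmd --task <project_url> <task_name> resume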
9) Message boards : Number crunching : Downloaded way too many tasks at once? (Message 96359)
Posted 11 May 2020 by floyd
Post:
0.5GB left, Rosetta is too big, so obviously try to fit in a smaller project's task?! Boinc really annoys me sometimes with its gross stupidity.
Not doing what you find obvious could well have been a design decision. Not all people are idiots, you know.
10) Message boards : News : Switch to using SSL (Secure Socket Layer) (Message 96119)
Posted 5 May 2020 by floyd
Post:
all I can do is pick from a list which indicates it is using HTTP. I cannot find anywhere within BOINC to add the HTTPS address.
Is there no "Project URL" field below that list? If your preferred UI (boincmgr?) doesn't show it for some reason there should still be boinccmd as a backup.
11) Message boards : News : Switch to using SSL (Secure Socket Layer) (Message 95911)
Posted 3 May 2020 by floyd
Post:
If the all_projects_list.xml file is distributed again by a BOINC patch/release, we lose the modification.
Not only with a new release. As far as I know the list of BOINC projects is maintained at Berkeley and the clients update it from there as part of the normal operations. I don't remember the interval, maybe every 30 days or so. The one on the computer I'm writing from is dated 2020-04-25 and I haven't done anything to it. The Rosetta admin will need to get that central list updated.
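From memory, the relevant part of that entry would look roughly like this once corrected; I may be off on details other than the two URL lines, and other tags are omitted:

    <project>
        <name>Rosetta@home</name>
        <url>https://boinc.bakerlab.org/rosetta/</url>
        <web_url>https://boinc.bakerlab.org/rosetta/</web_url>
    </project>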
12) Message boards : News : Switch to using SSL (Secure Socket Layer) (Message 95908)
Posted 3 May 2020 by floyd
Post:
STEP 5 : UPDATE url- and web_url-lines under Rosetta@home section in all_projects_list.xml

edit /var/lib/boinc/all_projects_list.xml
Is there a reason for this, other than to be able to just pick the project from the list? If there is no such reason, don't do it. If there is a reason, remember that the list is updated regularly. Your changes will be reverted.

Don't you think this method is the safest way for all of the Rosetta project files to be changed with the HTTPS URL ?
The safest way is to stay away from the data directory and just do what you're told from the UI. There are some precautions, though.

STEP 1: When convenient
Set the project to No New Tasks, then finish your work or abort it as you wish. IMPORTANT: Report your results, i.e. do a project update.

STEP 2: remove the project
Remove the project.

STEP 3: then add https://boinc.bakerlab.org/rosetta/
Precaution: Set a very small cache size or even set it to zero, to avoid being swamped with fresh work.
Add the project as you usually do. If you use the add project wizard, don't pick the project from the list but enter the URL directly. (You can even choose it from the list, but then don't hit Next right away; adjust the URL first. The only difference should be the s in https.)
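The same three steps can also be done with boinccmd, roughly like this, assuming the old attachment still uses the plain http URL and with your own email, password and account key:

    boinccmd --project http://boinc.bakerlab.org/rosetta/ nomorework
    (finish or abort the remaining tasks, then report them:)
    boinccmd --project http://boinc.bakerlab.org/rosetta/ update
    boinccmd --project http://boinc.bakerlab.org/rosetta/ detach
    boinccmd --lookup_account https://boinc.bakerlab.org/rosetta/ <email> <password>
    boinccmd --project_attach https://boinc.bakerlab.org/rosetta/ <account_key>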
13) Message boards : Number crunching : Request: Make tasks share the database (Message 90379)
Posted 17 Feb 2019 by floyd
Post:
This space isn't really unused, it's used for wear leveling.
There are two aspects to space. More space allows broader wear leveling, so more data can be written before cells reach their end of life. And of course more space means being able to store more data at a time. I don't want to buy larger drives for the former when I don't need the latter.

More free space on SSD means longer life time, because the data won't be written always to the same few empty cells.
As I understand it the whole space is used for wear leveling, not only the "free" space. Of course things get easier if cells are known to contain no user data. But when "free" space is already much larger than "used" space you won't gain much from even more.

To achieve that I decided to follow the manufacturer's recommendations, which for the devices in question mostly are between 40 and 70 TB of total writes, (...) 10 year life time, (...) 9.6 GB a day (...) That extrapolates to 3.5 TB/year
So that means between 10 and 20 years life time (when we think about the writes)
I think you misunderstood that. Those 3.5 TB/a come only from rewriting the DB over and over, which I'm suggesting to avoid. As you can see, this alone gets halfway to my (you may say self-imposed) limit, and of course it comes on top of the other write operations, which can't be avoided but by themselves may not be a problem. Those 3.5 unnecessary extra TB are my problem. Having said that, maybe there actually is a good reason why things need to be this way; in that case I'd be happy to hear it. My guess is that it was simply easier to implement and didn't cause any harm in HDD days, and later "it has always been that way".

much longer than any HDD will last
I just recently replaced an 80 GB HDD. 14 years old, without a single bad sector and total overkill for a BOINC machine. Only problem is, it's PATA, none of my mainboards takes that any more, and a controller costs nearly as much as a cheap SSD. Now the host has a 120 GB SSD with 4 GB used.

The SSD will very likely fail long time before that for some other random reason and not because of the data written to it by the rosetta application
You say it will fail anyway. I say if I'm careful it may not fail while it's still useful.
14) Message boards : Number crunching : Request: Make tasks share the database (Message 90375)
Posted 17 Feb 2019 by floyd
Post:
I agree with the calculation but not with the conclusion. Perhaps I should explain my thoughts better. What I need is just a storage device for a dedicated crunching machine. Any old HDD (sic) is good enough for that; space doesn't matter and speed is also not very important. Even more so, any SSD should be fine. With that in mind I buy small, inexpensive SSDs, if for some reason I decide against an HDD. Unfortunately those small SSDs offer little endurance and are often from lesser-known brands where you can't be sure about the quality. Larger SSDs are better in that respect but cost a lot more, and in the end 98% of the space is unused.
I don't care how much I can possibly stress a device before I kill it. On the contrary, I wish to make as sure as possible that I don't, not even accidentally. To achieve that I decided to follow the manufacturers' recommendations, which for the devices in question are mostly between 40 and 70 TB of total writes, if any are given at all. For Intel this is even a hard limit; I'm not sure about others. So to play it safe, and hoping for a 10-year lifetime, I want to limit writes to a few terabytes per year, the lower the better. And now Rosetta@home comes into play and writes 9.6 GB a day, as per your example, which seems realistic to me. That extrapolates to 3.5 TB/year, just from writing the same data over and over again, thousands of times, where once or a few times could be enough. I'm not saying this is going to kill the device, but it certainly makes my plan impossible and increases the risk of failure. Whether the risk is high or low I can't say for sure, but I'm not willing to take it. From my point of view it is just not necessary.
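To spell that out with the figures already mentioned, nothing new added:

    9.6 GB/day × 365 days  ≈ 3.5 TB/year
    3.5 TB/year × 10 years ≈ 35 TB

which by itself is most of a 40 TB rating and half of a 70 TB one, before counting any other writes.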
15) Message boards : Number crunching : Request: Make tasks share the database (Message 90369)
Posted 16 Feb 2019 by floyd
Post:
This is about every task generating its own copy of a 400 MB database, and I assume that those copies are read-only so they will always remain identical. I think that assumption is correct, though I haven't seen this discussed before. (For simplicity I ignore the fact that there actually seem to be two databases, one for rosetta and one for minirosetta.)

Well, if all tasks work with identical copies of a database, it shouldn't be too hard to make them work with a single copy. Disk space is not my main concern here, though it has always annoyed me that this amount of data can easily fill your disk cache if you run several tasks, thus flushing more useful data out. It will make your system less responsive with no gain.

But what really needs to be addressed IMO is the total amount of data written to disk. I was about to return to this project after some weeks of absence, but then it occurred to me that in the meantime I had replaced my HDDs with SSDs, so I'd better do some calculations. My result is that the above behavior can easily cause terabytes to be written where a few hundred megabytes should be enough. For HDDs this won't matter much, but I'm not willing to wear my SSDs down for nothing, so I'm off again until there is a solution.
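For a rough idea of the scale, this is just the arithmetic behind that statement, using the ~400 MB per copy and, as an example, the roughly 24 task starts a day that correspond to the 9.6 GB/day figure mentioned above:

    ~400 MB per task × ~24 tasks/day ≈ 9.6 GB/day ≈ 3.5 TB/year written,
    versus ~400 MB written once for a shared copy.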

As far as I can see the only problem with a single database is that you can't easily tell when it's no longer needed so it can be deleted. But not deleting it at all, even keeping a few versions, is IMO still a better solution than creating a fresh copy for every task. The problem is not the data kept, it's the data written.
16) Message boards : Number crunching : Errors while computing (Message 90033)
Posted 19 Dec 2018 by floyd
Post:
(Ignoring your parallel thread on the same topic)

Almost all of those errors are about missing or invalid files, and they are exe and zip files. But those files can't really be (permanently) missing or invalid, or you'd see more failing tasks. Something is probably modifying those files, moving or deleting them, or making them inaccessible. That smells like a virus scanner. Yes, you wrote that the BOINC data directory is supposed to be off limits, but have you verified that?
Also, you mentioned "BOINC Data", "BOINC2" and "BOINC". You don't run multiple instances of BOINC on the same data directory, do you?
In any case you should closely examine your log files, both BOINC's and the virus scanner's.
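If the standard event log doesn't tell you enough, you can make BOINC log more through cc_config.xml in the data directory; this is only a sketch, and the exact set of flags depends on the client version:

    <cc_config>
        <log_flags>
            <task>1</task>
            <file_xfer>1</file_xfer>
            <sched_ops>1</sched_ops>
            <file_xfer_debug>1</file_xfer_debug>
            <slot_debug>1</slot_debug>
        </log_flags>
    </cc_config>

Then reload with Options / Read config files (or restart the client) and watch the Event Log while the problem tasks run.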
17) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 87870)
Posted 9 Dec 2017 by floyd
Post:
I just saw a similar problem but under Windows 10 and on an Intel CPU.

7H2LD3_51C703_fold_and_dock_SAVE_ALL_OUT_538615_1685
http://boinc.bakerlab.org/workunit.php?wuid=864346673
There are many possible causes for an access violation. Your task list doesn't show any other errors, and you will probably never find out what happened in this single incident. If it doesn't happen repeatedly, just ignore it.
18) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87868)
Posted 9 Dec 2017 by floyd
Post:
I have had several of these as well. They take much longer and eventually get a comp error.
Apparently there's an emergency stop built in, 4 hours after target time.


Some of them might stop after 8 hours, but at least on my Mac there are some tasks which run longer. On my Linux machines many tasks also occurred which run longer. There I have a script to catch them.
My reply was to mmonnin, who has the same problem as Trotador (and me too by the way): 4.06 tasks taking much longer than they should and then failing at a predictable time. What you and William Waggoner describe could well be something else. The effect is different, and it's happening with 3.78 tasks. Your script killed a few 4.06 too, but maybe too early. I've seen 4.06 tasks take somewhat longer than expected and still finish.
19) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 87866)
Posted 9 Dec 2017 by floyd
Post:
I have had several of these as well. They take much longer and eventually get a comp error.
Apparently there's an emergency stop built in, 4 hours after target time.
20) Message boards : Number crunching : Minirosetta 3.73-3.78 (Message 87831)
Posted 5 Dec 2017 by floyd
Post:
New "RMA Ryzen" has not this problem, so they find it and resolve...
I can't agree with that conclusion. The fact that you get a "good" processor (i.e. one that passes this particular test) back only shows that those things exist. It does not prove that current processors in general are good, nor that anything has changed at all.

