Posts by rjs5

1) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 109084)
Posted 4 Apr 2024 by rjs5
Post:
28 tasks with validate error... great... but I suppose that's just the way it goes with a beta.


NO. That is the way Rosetta has chosen.

There should be a preference option that allows you to OPT OUT of the BETA work units. This is ESPECIALLY true if the project gives ZERO credit for the computing. About 25% of the BETA work units I am receiving run for several hours, finish without errors, and are marked INVALID as wasted work.

These INVALID results are a problem with the Rosetta BETA binary. Rosetta has chosen to run all the BETA units for hours instead of minutes. They could run the BETA binaries for minutes instead of hours until the BETA binaries have some successes.
2) Message boards : Number crunching : No Work Recieved since June 22, 2022 (Message 106944)
Posted 20 Sep 2022 by rjs5
Post:
But I can guarantee you I did NOT change it to stop work being issued
The Project will do it if the system produces errors.
However, what errors & how many are required to trigger that happening are unknown.


One error that seems to shut off Python work is an "Out of Memory" error.

I have no problem with Rosetta changing a machine status to stop receiving Python WU. I just wish they had the courtesy to generate a NOTICE to me when that happens. That seems like a "no brainer", but too much to expect from the Rosetta developers.
3) Message boards : Number crunching : fedora 32 and 36 (Message 106807)
Posted 23 Aug 2022 by rjs5
Post:
For what it's worth, I have managed to install Fedora 36 LXDE and Fedora 32 along with vbox 6.1.

It seems to be working OK so far. This combination has done a good number of LHC and Cosmology vbox WUs.

Not sure why Rosetta only uses half the number of CPUs available. Does it use 2 CPUs per vbox job?

Anyway it seems fedora 36 does work.



I had to downgrade boinc 7.20.2-1.fc36 back to the previous version, because WU started failing. That version of boinc signed on as a BETA version.
4) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 106791)
Posted 16 Aug 2022 by rjs5
Post:
I installed the latest Boinc release (7.20.2-1.fc36) for this Fedora and started getting errors.
I DOWNGRADED back to the previous Boinc release (7.16.11-6.fc35) and it seems to work fine again so far.

It is because of the security changes in BOINC 7.18.1 and later.
Some projects have adapted to it, but not the pythons.
https://quchempedia.univ-angers.fr/athome/forum_thread.php?id=166


Thanks. You have given better advice than admin have ever offered. I'm impressed!!!
5) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 106787)
Posted 16 Aug 2022 by rjs5
Post:
Linux Fedora 36 distribution Boinc problems

I installed the latest Boinc release (7.20.2-1.fc36) for this Fedora and started getting errors.
I DOWNGRADED back to the previous Boinc release (7.16.11-6.fc35) and it seems to work fine again so far.

For the Climate Prediction project, I built a small test program that verified all the dynamic libraries and support programs were properly installed.

TOO bad the Rosetta developers cannot do something simple like that for the Python environments they require.
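
A check of that kind can be quite small. The sketch below is only an illustration in Python, not the program I wrote for Climate Prediction; the library and program names in the example call are assumptions, since the project has not published a requirement list.

```python
import ctypes.util
import shutil


def check_environment(libraries, programs):
    """Return the names that could not be located on this machine."""
    missing = []
    for lib in libraries:
        # find_library returns None when the library cannot be located
        if ctypes.util.find_library(lib) is None:
            missing.append("library:" + lib)
    for prog in programs:
        # shutil.which returns None when the program is not on PATH
        if shutil.which(prog) is None:
            missing.append("program:" + prog)
    return missing


if __name__ == "__main__":
    # Hypothetical requirement lists, for illustration only.
    problems = check_environment(["c", "m"], ["VBoxManage"])
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All listed libraries and programs were found.")
```

Run once before accepting work, something like this would turn a silent computation error into a readable message.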
6) Message boards : Number crunching : Not getting units? (Message 106669)
Posted 28 Jul 2022 by rjs5
Post:
Rosetta seems to automatically block Python WU when they detect an "out of memory" error.

I don't have any problem with them shutting off Python WU after an "out of memory" error. It makes sense when they are dragging around 1gb of disk and network traffic. I have a BIG problem with Rosetta not informing me with a message that they have taken this action.

The whole Python/VirtualBox release was clumsy and amateurish. How much effort does it take to send a message to a person when their profile (ALLOW/SKIP) is changed?


Rosetta python work has a nasty habit of blocking work to a computer that has had a few errors, or whatever takes its fancy.
Go to the details page for your computers in your account at Rosetta and scroll down to the bottom of those pages.
You will see a blue/red button labeled `skip` or `allow`.
If you see the word `allow`, click it to get work again, coz U got blocked for some reason.
Bin there dun that, it's a pain, but when python first started there was no button to restart work flow.
Or it could be something else; worth a try.
7) Message boards : Number crunching : Limit number of Python jobs (Message 106413)
Posted 22 Jun 2022 by rjs5
Post:
I play around with app_config.xml and have been able to do some fine tuning.

Try and change the 2 to what you would like to limit it. It will not control

<app_config>
  <app>
    <name>rosetta_python_projects</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>

Another app_config.xml line I use to control the number of active project jobs is the project_max_concurrent line. BOINC may download more work, but it will only execute the specified number at once. I was surprised that if a WU "starts", the slot is set up and the memory is allocated in Linux.

<project_max_concurrent> 4 </project_max_concurrent>
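
If you would rather generate the file than edit it by hand, the two settings above can be combined as in the sketch below. This is only an illustration: the function name is made up, and the placement of project_max_concurrent directly under app_config follows my reading of the BOINC client configuration documentation.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, app_limit, project_limit):
    """Return app_config.xml text with a per-app and a per-project limit."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(app_limit)
    ET.SubElement(root, "project_max_concurrent").text = str(project_limit)
    # ET.indent only exists in Python 3.9 and later; without it the
    # output is still valid XML, just written on a single line.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Values taken from the examples in this post.
    print(build_app_config("rosetta_python_projects", 2, 4))
```

Save the output as app_config.xml in the Rosetta project directory and have the BOINC client re-read its config files for it to take effect.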


Is it possible to limit the number that run at once? on my machine with 32gb ram it runs 11 at a time maxing out the ram, the issue is though that its not even half using my CPU (5950x) which causes a few cores to run really fast and the heat to jump up causing the fans to spin up.
I don't want to stop doing them but limiting to like 4 at any one time would be great.
8) Message boards : Number crunching : Please remove Virtualbox as a dependency. (Message 106308)
Posted 28 May 2022 by rjs5
Post:
No news??


Crickets. I did not expect any response. dcdc has been quite responsive in the past, but the developers can't be bothered.

The new Python application does more than take 3gb of memory per task. It also chews up your disk drives by checkpointing too frequently. Most of the problems and stalls are related to checkpointing. A checkpoint option on the preferences would help.

cheers
9) Message boards : Number crunching : Please remove Virtualbox as a dependency. (Message 106178)
Posted 10 May 2022 by rjs5
Post:
If they were to share the tools used to create the virtualbox, would anyone here be able to convert it to a non-VB task? I.e. do the work for them and send the method back? I expect they are all contracted to do specific work based on funding etc which this would fall outside of.


I successfully built earlier versions of Rosetta. I could work with you again to look at it.

I saw two problems with the Python build. It demands 2.8gb of memory for each work unit and it compresses and saves 1gb+ of files to disk. I had to replace a SATA SSD with an M.2 to get enough write speed. I added memory and PrimoCache in write-back mode to reduce the write traffic to the disks. It cleared up my Rosetta problems with Python ... other than not being able to run as many tasks.
10) Message boards : Number crunching : Stalled WU (Message 105928)
Posted 13 Apr 2022 by rjs5
Post:
BOINCTasks shows whether a task is using CPU time or not so you can see what to abort.
https://efmer.com/boinctasks/download-boinctasks/


I use Windows BOINCTasks and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays there. I have never seen one finish after the CPU goes to 0%.
On Linux I use "top -i -c -d3" to get a similar display. I press "SHIFT P" to sort processes by CPU usage.

"-i" only show running processes
"-c" show the command line so you can see what is burning CPU
"-d 3" sample every 3 seconds so I can see the display

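The same "zero CPU" test can be scripted on Linux. The sketch below is only an illustration: it takes two snapshots of per-process CPU time from /proc and reports processes whose CPU time did not advance. Filtering on the name "VBoxHeadless" is an assumption about how the vbox tasks show up in the process list; adjust it to whatever top shows on your machine.

```python
import os
import time


def read_cpu_ticks():
    """Map pid -> (command name, utime + stime in clock ticks) from /proc."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The command name is in parentheses and may contain spaces,
        # so split around the last ')'.
        name = data[data.index("(") + 1:data.rindex(")")]
        fields = data[data.rindex(")") + 2:].split()
        # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
        ticks[int(entry)] = (name, int(fields[11]) + int(fields[12]))
    return ticks


def find_stalled(before, after, name_contains=""):
    """Return sorted pids in both snapshots whose CPU time did not advance."""
    return sorted(
        pid for pid, (name, used) in before.items()
        if pid in after and after[pid][1] == used and name_contains in name
    )


if __name__ == "__main__" and os.path.isdir("/proc"):
    first = read_cpu_ticks()
    time.sleep(3)  # same 3-second interval as "top -d 3"
    second = read_cpu_ticks()
    # "VBoxHeadless" is an assumed process name for the vbox tasks.
    for pid in find_stalled(first, second, name_contains="VBoxHeadless"):
        print(pid, second[pid][0], "used no CPU in the last 3 seconds")
```

A single 3-second window can catch a task that is merely waiting on disk, so treat a hit as a candidate and check it again before aborting anything.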

I have two computers with near identical configurations and I saw the number of stalls/hangs increase SIGNIFICANTLY when I simply updated VirtualBox to a newer version than the one that comes with BOINC. When I uninstalled BOINC and VirtualBox and reinstalled them, the problems cleared up. It appears the Rosetta developers/integrator introduced some dependency on a particular VirtualBox version.

Using VirtualBox was supposed to reduce the Rosetta developer problems with different environments. It looks more like they just put a 3gb vbox wrapper around it and introduced a new set of problems.

BOINC startup times when running Rosetta WU are now minutes instead of seconds.
Checkpoints that write gb of data to the BOINC drive are going to kill volunteer HW.
Excess memory demands exhaust memory and add to the unnecessary excess power needed to run Rosetta WU.
11) Message boards : Number crunching : Lot of failures (Message 105783)
Posted 1 Apr 2022 by rjs5
Post:
>>> LARGE part of clients/volunteers runs Windows

Certainly. Let us, however, consider the purpose. We are trying to help them. It is up to them to decide if Linux users are sufficient in number to achieve the result they require. It is already mentioned in this thread that the amount of work from the project is a lot less than it used to be. They will, however, pick up on the fact that Windows users are seeing crashed work and stop issuing it to these people, which, obviously, includes me.


I looked at a number of your failing WU DETAILS and there was the same failure by the other machine running the WU.
It looks like the WU are bad and you are OK.
12) Message boards : Number crunching : 3 x 36-Processor Machines with CPU set to 50% are now working (Message 105764)
Posted 31 Mar 2022 by rjs5
Post:
I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine.

It isn't the .xml file itself that is the problem, but the "<project_max_concurrent>" tag (also the "<max_concurrent>" tag).
Under certain conditions, BOINC thinks it needs to download more work.

You can check it with a test case. https://github.com/BOINC/boinc/issues/4322
It caused me problems here the last time I used it a year or two ago, and no one has said it has been fixed yet that I have seen.


I watch my changes to the configuration until I am sure they work and cause no problems. I have never had problems with this particular option, but I will watch closer ... just in case.

How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ???
13) Message boards : Number crunching : Multiattach mode disk images (Message 105746)
Posted 28 Mar 2022 by rjs5
Post:
The github issue regarding Step 1 has recently been added to the next BOINC client milestone.
That's a good sign.


Great!! Maybe other projects can use this feature...


You can also download other premade vbox images of the Rosetta environment.
https://www.osboxes.org/virtualbox-images/
14) Message boards : Number crunching : 3 x 36-Processor Machines with CPU set to 50% are now working (Message 105745)
Posted 28 Mar 2022 by rjs5
Post:
It has not but it is not as simple as use this tag and you will be flooded. I’ve used exactly that app_config file on all my projects for several years and never had a problem.

You can investigate it in more detail, and maybe avoid the problem, or not, as the case may be.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5726&postid=45384#45384


I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine.

Your disk cache with the WRITE BACK enabled suggestion is very good. It will reduce disk write traffic and save the SSD/HDD drive. VirtualBox BOINC crunchers can decide on using memory to reduce disk writes or to run more jobs.

Thanks
15) Message boards : Number crunching : 3 x 36-Processor Machines with CPU set to 50% are now working (Message 105718)
Posted 27 Mar 2022 by rjs5
Post:
The Rosetta conversion to vbox caused big problems for me.

1. I had to figure out the Rosetta ALLOW switch.
2. I had to limit the number of Rosetta jobs active on the computer (currently 8gb/job) with a 3-line app_config.xml.
3. I found high memory errors in one machine that had been running fine.
4. I had to load VirtualBox packages on a Linux machine so the vbox jobs would run.

I think things have stabilized.


64-gb Fedora Linux machine.
I had to load VirtualBox package to fix COMPUTATION ERRORS.


64-gb Windows 11 Machine
Heavy disk usage caused by WU setup and runtime paging from lack of memory.
Near zero CPU usage. Long runs.
I LIMITED the maximum Rosetta jobs to 8. I can probably relax that some. The jobs seem to want 3gb to start with, but demand more later in the computation.
The failures likely occurred when disk space was exhausted.

"app_config.xml" file at C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\app_config.xml (3 lines) limits the number of project jobs executed simultaneously.

<app_config>
<project_max_concurrent> 8 </project_max_concurrent>
</app_config>
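
The limit itself is just a memory budget. The sketch below is only an illustration of the arithmetic, assuming a flat per-job figure (the 8gb/job mentioned in this post); real jobs start smaller and grow, so treat the result as a starting point. The function name and the reserved_gb parameter are made up for the example.

```python
def max_concurrent_jobs(total_ram_gb, per_job_gb, reserved_gb=0.0):
    """Return how many jobs fit in RAM after setting aside reserved_gb."""
    if per_job_gb <= 0:
        raise ValueError("per_job_gb must be positive")
    # Never budget with negative memory if the reserve exceeds total RAM.
    usable_gb = max(total_ram_gb - reserved_gb, 0)
    return int(usable_gb // per_job_gb)


if __name__ == "__main__":
    # 64-gb machine at 8gb/job, nothing held back for the OS.
    print(max_concurrent_jobs(64, 8))
    # Same machine with 10% of memory (6.4gb) held back.
    print(max_concurrent_jobs(64, 8, reserved_gb=6.4))
```

With nothing reserved, 64gb at 8gb/job gives the limit of 8 used above; holding back 10% of memory drops it to 7.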



128-gb Windows 11 Machine
Frequent stalled jobs with little CPU usage. Constant high disk usage.
Isolated two bad memory sticks in the 64gb to 128gb memory range.
2 x 16gb DIMM sticks on order.
Added the 3-line app_config.xml file above.
16) Message boards : Number crunching : Constant computation errors. (Message 105638)
Posted 22 Mar 2022 by rjs5
Post:
You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act.


I understand you're tired of this situation (like me), but i think you're a little bit impolite.
This project is not "a circus", it's science and every kind of help, from simply cpu time to Foldit volunteers, is done with a purpose.


Rosetta may not be a "circus", BUT the person integrating the "science program" with the "real world machines" is unqualified to do the job. There are simple warning messages and parameter testing limits that can be implemented that could screen out most of the error situations before they reach volunteer machines.

Simple things like a "Set the ALLOW computer detail switch to enable Python jobs" message. There are many of these informational messages that could be added, but the integrator is unqualified or simply lazy.

My suggestion: require each researcher submitting WU to the public have an identifier embedded in the WU name. Make incompetence public, traceable and give researchers CREDIT for their successes and failures.

8-)
17) Message boards : Number crunching : Not getting work (Message 105327)
Posted 4 Mar 2022 by rjs5
Post:
Thanks for the explanations. My stats change, although I never see any work on that computer, so I guess it's getting them at night. I'm running Linux so don't know if there's VirtualBox for that setup. I'll just leave it alone.


I am running a Fedora Linux box. I had installed BOINC, but there were no BOINC+VirtualBox packages, so I just installed the virtualbox packages in addition. It seemed to work.

I am seeing mainly Rosetta Python WU being sent down. They take a huge amount of memory and I am seeing a few hung jobs. There seem to be many jobs available so you should see the machine running them.

I am running 18 CPUs on an 18C/36T machine with 64gb of memory. The 18 WU will cause Linux to consume all 64gb of memory and a good chunk of the swap space.
18) Message boards : Number crunching : Not getting work (Message 105325)
Posted 4 Mar 2022 by rjs5
Post:
Rosetta 4.20 tasks are not always available. they send out a few days. This may be the problem. If you install virtualbox, you will receive some python tasks and those are always available.


Even if you install the VirtualBox version of BOINC, you still have to "ALLOW" that computer to accept the vbox work units. I fell into that trap. I just installed VirtualBox BOINC and nothing happened. I had to ALLOW each computer to accept WU.

Rosetta added an ALLOW/SKIP option to each COMPUTER profile. You have to explicitly set the ALLOW option. The Rosetta people failed to add a "WARNING" or any information that would help a user find this failure.

I am still getting a number of failures and hung Rosetta WU where they just keep running. This is happening on a machine with plenty of memory, disk and all enabled to run BOINC WU.
19) Message boards : Number crunching : Does Rosetta work with Windows 11? (Message 105290)
Posted 28 Feb 2022 by rjs5
Post:
I had a look at the "details" page for the Windows PCs and the "application details" reckons that one has completed 1534 python tasks, the other 8.
Have a check at the bottom of each Windows PC details page to see what the [blue+red] skip/allow button is showing.
If you see the word "allow", click it.
The skip/allow button setting is a silent killer of downloads.
Worth a try.


THAT WORKED!!! Thank you very much!!

What a foolish way to implement this new feature. If they are going to remove machines from a majority of all the new WU, the least they could do is add a warning at each failed update.
20) Message boards : Number crunching : Does Rosetta work with Windows 11? (Message 105269)
Posted 27 Feb 2022 by rjs5
Post:
I have 3 machines running Rosetta WU. Two are running Windows 11 and one is running Linux. I have not been following the numerous problems others are having and posting. After scanning several threads, I gave up.

I installed VirtualBox packages on the Linux machine and it is getting Rosetta WU and is running pretty good. I am getting a couple WU with the message "Postponed: VM job unmanageable, restarting later" and I am just deleting them.

On the Windows 11 machines, I installed BOINC + VirtualBox, but I am not getting any Rosetta WU. They are both large machines with 64gb and 128gb of memory. I have made 90% of the memory available and 500gb of disk. I cannot figure out why Rosetta is not working.

I cannot find any error messages or reasons why Rosetta WU don't get downloaded to the two Windows 11 machines.

Any thoughts to get Rosetta WU downloaded to the Windows machines??

