1)
Message boards :
Number crunching :
Rosetta overheating my CPU
(Message 111989)
Posted 29 Jan 2025 by rjs5 Post: Sorry, folks. My computer is too important for me to fry it running Rosetta.
Given that there are hundreds of thousands of computer systems that are capable of running at 100% load for days & weeks (and months & years) on end without overheating, the other option would be to fix the problem with your CPU cooling.
A couple of things I have noticed about Rosetta. The Rosetta developers "performance tune" Rosetta with a single instance running. They aggressively inline code and take advantage of the cache hierarchy to get low memory latency. That works fine unless you run multiple copies of Rosetta. Multiple copies of the big-footprint Rosetta evict each other's cache lines, so the cache miss rate skyrockets and the data has to be retrieved from slow main memory. Accessing main memory requires the CPU to use large, power-hungry transistors to drive the memory bus. Driving memory buffers is a major source of the heat and is why Rosetta generates more heat. If the CPU temperature reaches the "silicon pause temperature", the CPU automatically inserts PAUSE instructions to slow execution and cool the CPU. Running the CPU at 100% is slower than running the CPU at 99% and 98% and ... The maximum THROUGHPUT of Rosetta workloads is probably reached closer to 50% CPU because of the big footprint.
Running a "mix" of BOINC workloads is tough (impossible?) to fine tune with the BOINC controls. I usually devote a machine and set the CPU PREFERENCE to 60%-70% of CPU for 100% of time. If I see TASK MANAGER or Linux "top" show more than 70% or so load, I will reduce the % of CPU. Getting the best THROUGHPUT on any machine depends on cache sizes, memory sizes, disk speed, ... rjs5 |
2)
Problems and Technical Issues with Rosetta@home
(Message 109084)
Posted 4 Apr 2024 by rjs5 Post: 28 tasks with validate error...great....but I suppose that's just the way it goes with a beta.
NO. That is the way Rosetta has chosen. There should be a preference option that allows you to OPT OUT of the BETA work units. This is ESPECIALLY true if the project gives ZERO credit for the computing. About 25% of the BETA work units I am receiving run for several hours, finish without errors, and are then marked INVALID, which makes them wasted work. These INVALID results are a problem with the Rosetta BETA binary. Rosetta has chosen to run all the BETA units for hours instead of minutes. They could run the BETA binaries for minutes instead of hours until the BETA binaries have some successes. |
3)
No Work Recieved since June 22, 2022
(Message 106944)
Posted 20 Sep 2022 by rjs5 Post: But I can guarantee you I did NOT change it to stop work being issued. The Project will do it if the system produces errors. One error that seems to shut off Python work is an "Out of Memory" error. I have no problem with Rosetta changing a machine status to stop receiving Python WU. I just wish they had the courtesy to generate a NOTICE to me when that happens. That seems like a "no brainer", but too much to expect from the Rosetta developers. |
4)
fedora 32 and 36
(Message 106807)
Posted 23 Aug 2022 by rjs5 Post: For what it's worth, I have managed to install fedora 36 lxde and fedora 32 along with vbox 6.1. I had to downgrade boinc 7.20.2-1.fc36 back to the previous version, because WU started failing. That version of boinc signed on as a BETA version. |
5)
Problems and Technical Issues with Rosetta@home
(Message 106791)
Posted 16 Aug 2022 by rjs5 Post: I installed the latest Boinc release (7.20.2-1.fc36) for this Fedora and started getting errors. Thanks. You have given better advice than admin have ever offered. I'm impressed!!! |
6)
Problems and Technical Issues with Rosetta@home
(Message 106787)
Posted 16 Aug 2022 by rjs5 Post: Linux Fedora 36 distribution Boinc problems. I installed the latest Boinc release (7.20.2-1.fc36) for this Fedora and started getting errors. I DOWNGRADED back to the previous Boinc release (7.16.11-6.fc35) and it seems to work fine again so far. For the Climate Prediction project, I built a small test program that verified all the dynamic libraries and support programs were properly installed. TOO bad the Rosetta developers cannot do something simple like that for the Python environments they require. |
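The kind of pre-flight check described in the post above can be sketched in a few lines of Python. This is only an illustration, not the test program written for Climate Prediction: the function name `check_environment` and the program and library names in the usage section are assumptions, since the actual requirements of the Rosetta Python tasks are not listed in the thread.

```python
import shutil
from ctypes.util import find_library


def check_environment(programs, libraries):
    """Return (missing_programs, missing_libraries) for the given names.

    programs  -- executable names that should be found on PATH
    libraries -- shared library names without the 'lib' prefix or suffix
    """
    # shutil.which returns None when the executable is not on PATH.
    missing_programs = [p for p in programs if shutil.which(p) is None]
    # find_library returns None when the linker cannot locate the library.
    missing_libraries = [lib for lib in libraries if find_library(lib) is None]
    return missing_programs, missing_libraries


if __name__ == "__main__":
    # Hypothetical requirement lists, chosen only for illustration.
    progs, libs = check_environment(["VBoxManage", "boinc"], ["c", "m"])
    for name in progs:
        print(f"MISSING program: {name}")
    for name in libs:
        print(f"MISSING library: {name}")
    if not progs and not libs:
        print("All listed programs and libraries were found.")
```

Running something like this before accepting work would turn a silent run of failed WU into one readable message per missing piece.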
7)
Not getting units?
(Message 106669)
Posted 28 Jul 2022 by rjs5 Post: Rosetta seems to automatically block Python WU when they detect an "out of memory" error. I don't have any problem with them shutting off Python WU after an "out of memory" error. It makes sense when they are dragging around 1gb of disk and network traffic. I have a BIG problem with Rosetta not informing me with a message that they have taken this action. The whole Python/VirtualBox release was clumsy and amateurish. How much effort does it take to send a message to a person when their profile (ALLOW/SKIP) is changed? Rosetta python work has a nasty habit of blocking work to a computer that has had a few errors or whatever takes its fancy |
8)
Limit number of Python jobs
(Message 106413)
Posted 22 Jun 2022 by rjs5 Post: I play around with app_config.xml and have been able to do some fine tuning. Try and change the 2 to what you would like to limit it. It will not control
<app_config>
  <app>
    <name>rosetta_python_projects</name>
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
Another app_config.xml line I use to control the number of active project jobs is the project_max_concurrent line. It may download more but only execute the specified amount. I was surprised that if a WU "starts", the slot is set up and the memory is allocated in Linux.
<project_max_concurrent>4</project_max_concurrent>
Is it possible to limit the number that run at once? On my machine with 32gb ram it runs 11 at a time, maxing out the ram. The issue is though that it's not even half using my CPU (5950x), which causes a few cores to run really fast and the heat to jump up, causing the fans to spin up. |
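For anyone who would rather generate the file than hand-edit it, here is a minimal Python sketch that builds the same app_config.xml structure shown in the post. The helper name `build_app_config` is hypothetical; the element names and the app name `rosetta_python_projects` are taken from the post, and the values 2 and 4 are just the examples used there.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name=None, max_concurrent=None,
                     project_max_concurrent=None):
    """Return app_config.xml text with the requested limits."""
    root = ET.Element("app_config")
    # Per-application limit: only emitted when both pieces are given.
    if app_name is not None and max_concurrent is not None:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    # Project-wide limit across all of the project's applications.
    if project_max_concurrent is not None:
        limit = ET.SubElement(root, "project_max_concurrent")
        limit.text = str(project_max_concurrent)
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("rosetta_python_projects", 2, 4))
```

The output still has to be saved as app_config.xml in the project's folder under the BOINC data directory, and the client has to re-read its config files (or be restarted) before the limits apply.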
9)
Please remove Virtualbox as a dependency.
(Message 106308)
Posted 28 May 2022 by rjs5 Post: No news?? Crickets. I did not expect any response. dcdc has been quite responsive in the past, but the developers can't be bothered. The new Python application does more than take 3gb of memory per task. It also chews up your disk drives by checkpointing too frequently. Most of the problems and stalls are related to checkpointing. A checkpoint option on the preferences would help. cheers |
10)
Please remove Virtualbox as a dependency.
(Message 106178)
Posted 10 May 2022 by rjs5 Post: If they were to share the tools used to create the virtualbox, would anyone here be able to convert it to a non-VB task? I.e. do the work for them and send the method back? I expect they are all contracted to do specific work based on funding etc which this would fall outside of. I successfully built earlier versions of Rosetta. I could work with you again to look at it. I saw two problems with the Python build. It demands 2.8gb of memory for each work unit and it compresses and saves 1gb+ of files to disk. I had to replace a SATA SSD with an M.2 to get enough write speed. I added memory and PrimoCache in write-back mode to reduce the write traffic to the disks. It cleared up my Rosetta problems with Python ... other than not being able to run as many tasks. |
11)
Stalled WU
(Message 105928)
Posted 13 Apr 2022 by rjs5 Post: BOINCTasks shows whether a task is using CPU time or not so you can see what to abort.
I use Windows BOINCTasks and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays there. I have never seen one finish after the CPU goes to 0%.
On Linux I use "top -i -c -d3" to get a similar display. I press "SHIFT P" to sort processes by CPU usage.
"-i" only show running processes
"-c" show the command line so you can see what is burning CPU
"-d 3" sample every 3 seconds so I can see the display
I have two computers with near identical configurations and I saw the number of stalls/hangs increase SIGNIFICANTLY when I simply updated VirtualBox to a newer version than comes with BOINC. When I uninstalled BOINC and VirtualBox and reinstalled again, the problems cleared up. It appears the Rosetta developers/integrator introduced some dependency on a specific VirtualBox version.
Using VirtualBox was supposed to reduce the Rosetta developer problems with different environments. It looks more like they just put a 3gb vbox wrapper around it and introduced a new set of problems. BOINC startup times when running Rosetta WU are now minutes instead of seconds. Checkpoints that write gb of data to the BOINC drive are going to kill volunteer HW. Excess memory demands exhaust memory and add to the unnecessary excess power needed to run Rosetta WU. |
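The "CPU goes to zero and stays there" test from the post can also be automated. Below is a rough Python sketch of just the comparison step, assuming you already have two samples of cumulative CPU seconds per process taken some time apart. The function name `find_stalled` and the 0.5-second threshold are made up for illustration; collecting the samples (from /proc, a process library, or BOINC itself) is left out.

```python
def find_stalled(before, after, min_cpu_seconds=0.5):
    """Return the sorted PIDs whose CPU time did not advance between samples.

    before, after   -- dicts mapping pid -> cumulative CPU seconds
    min_cpu_seconds -- a process that gained less than this is called stalled
    """
    stalled = []
    for pid, cpu_before in before.items():
        if pid not in after:
            # The process exited between the two samples, so it is not hung.
            continue
        if after[pid] - cpu_before < min_cpu_seconds:
            stalled.append(pid)
    return sorted(stalled)


if __name__ == "__main__":
    # Made-up sample data: 102 gained no CPU time, 103 exited.
    first = {101: 50.0, 102: 10.0, 103: 5.0}
    second = {101: 53.0, 102: 10.0}
    print(find_stalled(first, second))
```

A single quiet interval is weak evidence, since a task can pause for disk I/O during a checkpoint, so it is probably safer to require several stalled samples in a row before aborting anything.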
12)
Lot of failures
(Message 105783)
Posted 1 Apr 2022 by rjs5 Post: >>> LARGE part of clients/volunteers runs Windows I looked at a number of your failing WU DETAILS and there was the same failure by the other machine running the WU. It looks like the WU are bad and you are OK. |
13)
3 x 36-Processor Machines with CPU set to 50% are now working
(Message 105764)
Posted 31 Mar 2022 by rjs5 Post: I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine. I watch my changes to the configuration until I am sure they work without problems. I have never had problems with this particular option, but I will watch closer ... just in case. How did you set up PrimoCache? Did you enable DEFER-WRITES or ... ??? |
14)
Multiattach mode disk images
(Message 105746)
Posted 28 Mar 2022 by rjs5 Post: The github issue regarding Step 1 has recently been added to the next BOINC client milestone. You can also download other premade vbox images of the Rosetta environment. https://www.osboxes.org/virtualbox-images/ |
15)
3 x 36-Processor Machines with CPU set to 50% are now working
(Message 105745)
Posted 28 Mar 2022 by rjs5 Post: It has not but it is not as simple as use this tag and you will be flooded. I’ve used exactly that app_config file on all my projects for several years and never had a problem. I think the XML works fine for Rosetta. There have been some problems in the past with the projects and options, but I think Rosetta is fine. Your suggestion of a disk cache with WRITE BACK enabled is very good. It will reduce disk write traffic and save the SSD/HDD drive. VirtualBox BOINC crunchers can decide between using memory to reduce disk writes and using it to run more jobs. Thanks |
16)
3 x 36-Processor Machines with CPU set to 50% are now working
(Message 105718)
Posted 27 Mar 2022 by rjs5 Post: The Rosetta conversion to vbox caused big problems for me.
1. I had to figure out the Rosetta ALLOW switch.
2. I had to limit the number of Rosetta jobs active on the computer (currently 8gb/job) with 3-line app_config.xml.
3. I found high memory errors in one machine that had been running fine.
4. I had to load VirtualBox packages on a Linux machine so the vbox jobs would run.
I think things have stabilized.
64-gb Fedora Linux machine
I had to load VirtualBox package to fix COMPUTATION ERRORS.
64-gb Windows 11 Machine
Heavy disk usage caused by WU setup and runtime paging from lack of memory. Near zero CPU usage. Long runs. I LIMITED the maximum Rosetta jobs to 8. I can probably relax that some. The jobs seem to want 3gb to start with, but demand more later in the computation. The failures likely occurred when disk space requests exhausted.
"app_config.xml" file at C:\ProgramData\BOINC\projects\boinc.bakerlab.org_rosetta\app_config.xml (3 lines) limits the number of project jobs executed simultaneously.
<app_config>
  <project_max_concurrent>8</project_max_concurrent>
</app_config>
128-gb Windows 11 Machine
Frequent stalled jobs with little CPU usage. Constant high disk usage. Isolated two bad memory sticks in the 64gb to 128gb memory range. 2 x 16gb DIMM sticks on order. Added the 3-line app_config.xml file above. |
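A rough way to pick the number for project_max_concurrent is to budget by memory rather than by CPU count. The Python sketch below assumes a fixed per-task memory figure and a reserve for the OS; both are guesses you would adjust for your own machine (the post only says jobs want about 3gb to start and more later), and the function name `max_concurrent_tasks` is hypothetical.

```python
def max_concurrent_tasks(total_ram_gb, per_task_gb, reserve_gb=4,
                         cpu_threads=None):
    """Estimate how many tasks fit in RAM without paging.

    total_ram_gb -- installed memory
    per_task_gb  -- assumed peak memory of one task
    reserve_gb   -- memory left for the OS and other programs (a guess)
    cpu_threads  -- optional cap so the result never exceeds the CPU count
    """
    usable = total_ram_gb - reserve_gb
    if usable <= 0 or per_task_gb <= 0:
        return 0
    # Whole tasks only: round down.
    limit = int(usable // per_task_gb)
    if cpu_threads is not None:
        limit = min(limit, cpu_threads)
    return limit


if __name__ == "__main__":
    # 64gb machine, 3gb per task, 4gb held back for the OS.
    print(max_concurrent_tasks(64, 3))
    # Same machine if tasks really grow to 8gb each.
    print(max_concurrent_tasks(64, 8))
```

Because the jobs ask for more memory later in the run, it makes sense to budget with the peak figure you actually observe rather than the 3gb starting size.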
17)
Constant computation errors.
(Message 105638)
Posted 22 Mar 2022 by rjs5 Post: You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act. Rosetta may not be a "circus", BUT the person integrating the "science program" with the "real world machines" is unqualified to do the job. There are simple warning messages and parameter testing limits that could be implemented to screen out most of the error situations before they reach volunteer machines. Simple things like a "Set the ALLOW computer detail switch to enable Python jobs" message. There are many of these informational messages that could be added, but the integrator is unqualified or simply lazy. My suggestion: require each researcher submitting WU to the public to have an identifier embedded in the WU name. Make incompetence public and traceable, and give researchers CREDIT for their successes and failures. 8-) |
18)
Not getting work
(Message 105327)
Posted 4 Mar 2022 by rjs5 Post: Thanks for the explanations. My stats change, although I never see any work on that computer, so I guess it's getting them at night. I'm running Linux so don't know if there's VirtualBox for that setup. I'll just leave it alone. I am running a Fedora Linux box. I had installed BOINC but there were no BOINC+VirtualBox packages, so I just installed the virtualbox packages in addition. It seemed to work. I am seeing mainly Rosetta Python WU being sent down. They take a huge amount of memory and I am seeing a few hung jobs. There seem to be many jobs available so you should see the machine running them. I am running 18 CPU on an 18C/36T machine with 64gb of memory. The 18 WU will cause Linux to consume all 64gb of memory and a good chunk of the swap space. |
19)
Not getting work
(Message 105325)
Posted 4 Mar 2022 by rjs5 Post: Rosetta 4.20 tasks are not always available. they send out a few days. This may be the problem. If you install virtualbox, you will receive some python tasks and those are always available. Even if you install the VirtualBox version of BOINC, you still have to "ALLOW" that computer to accept the vbox work units. I fell into that trap. I just installed VirtualBox BOINC and nothing happened. I had to ALLOW each computer to accept WU. Rosetta added an ALLOW/SKIP option to each COMPUTER profile. You have to explicitly set the ALLOW option. The Rosetta people failed to add a "WARNING" or any information that would help a user find this failure. I am still getting a number of failures and hung Rosetta WU where they just keep running. This is happening on a machine with plenty of memory, disk and all enabled to run BOINC WU. |
20)
Does Rosetta work with Windows 11?
(Message 105290)
Posted 28 Feb 2022 by rjs5 Post: I had a look at the "details" page for the Windows PCs and the "application details" reckons that one has completed 1534 python tasks, the other 8. THAT WORKED!!! Thank you very much!! What a foolish way to implement this new feature. If they are going to remove machines from a majority of all the new WU, the least they could do is add a warning at each failed update. |
©2025 University of Washington
https://www.bakerlab.org