Message boards : Number crunching : Rosetta overheating my CPU
| Author | Message | 
|---|---|
| Fubar the Benevolent Despot Joined: 1 Jul 13 Posts: 2 Credit: 1,020,491 RAC: 0 |
 It appears I'm going to have to quit crunching on Rosetta. I've been having an issue with my CPU giving out overheating warnings for some time, ever since I downloaded the Core Temp program for a completely unrelated issue. (I can only imagine how long it's been having the issue before that, probably since day 1 on this build.) I've gone all Sherlock Holmes, checking every part even remotely connected to cooling: power supply, heat sink fans, etc. I added 3 more case fans, two for intake, one for exhaust. I even removed and reapplied thermal paste. Finally, by sheer luck, I had left task mangler open as I was preparing to do other things and saw a huge spike in power usage at the same time as the CPU cooler fan kicked into high gear. I looked at the Core Temp program, also open, and saw the temp spike up. The power usage was coming from Rosetta. I suspended the work and reset the min & max in Core Temp. 3 days later, it hasn't gone within 10 degrees of warning, and only then when BOINC was running Milky Way. Without that running, I'm down 30 degrees below the warning level at maximum. I'm going to keep checking through the weekend but the data seems pretty clear: Rosetta puts too much load on my system. It's an AMD Ryzen 7 3700X (Matisse) with 16 GB RAM, if you care. Sorry, folks. My computer is too important for me to fry it running Rosetta. |
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "Sorry, folks. My computer is too important for me to fry it running Rosetta." Given that there are hundreds of thousands of computer systems that are capable of running at 100% load for days & weeks (and months & years) on end without overheating, the other option would be to fix the problem with your CPU cooling. Even at Milkyway your system misses the deadline 75% of the time, so there's a good chance your CPU is thermally throttling even there, as 12 days is plenty of time to return work; here at Rosetta it's 3 days. "I even removed and reapplied thermal paste." What were the CPU temperatures before & after re-doing the CPU heatsink, at idle & at full load? (I'd suggest the Cinebench 2024 multicore test, using all cores & threads, to load test it, with BOINC processing suspended.) If you used more than the slightest smear of paste, then there is your problem. Heatsink paste is meant to fill the air gaps between the heatsink & the CPU, and not come between the metal of the CPU & the metal of the heatsink; otherwise it acts as a heat insulator, not a conductor (its heat conductivity is much greater than that of air, but way, way, way less than that of metal to metal). (I would also hope that the heatsink doesn't still have its protective plastic cover on it...) Grant Darwin NT |
| Fubar the Benevolent Despot Joined: 1 Jul 13 Posts: 2 Credit: 1,020,491 RAC: 0 |
 Thank you for pointing out several things I already knew. For your info, so you are aware of from whence I speak, I've been dealing with computers since 1981, yes, 1981. I have a programming degree and have owned and built computers for years before the internet even existed. For the geeks, my first was an 8Mhz 8088 with a single 5 1/4" floppy drive, a 1200 baud modem and a 12" CGA screen. And I've been running BOINC since it came out and SETI@Home before that. I've been around for a minute. I couldn't give a rat's patoot how long it takes to run a specific task, be it from Milkyway, Rosetta, Einstein or any of the others that are available for BOINC. Perhaps your system misses deadlines, I am not so afflicted. Rosetta is cranking out 10-15 F more than the second hottest thing on this box, Milkyway, and is the only thing on this box that causes overheat warnings. Milkyway is running 20-25 F more than the next hottest load. With BOINC suspended, I'm running 60+ F below warning levels and 65-70 F under Rosetta. I've dealt with "my cooling issues". I've rerouted wiring, added three case fans, et cetera, as detailed in the OP. Perhaps your reading skills are "sub optimal"? Or your comprehension. And I'm not going to go into a boring recitation of temperatures just for your amusement and edification. Over the course of several months, I've identified the problem, and dealt with it. I didn't post to get in a debate about what you think I should do, I posted to be polite and let y'all know I'm done and why. Peace out. | 
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "Perhaps your system misses deadlines, I am not so afflicted." Utter rubbish. Here at Rosetta, you aborted half, and the rest timed out, i.e. you missed the deadlines. For Milkyway, you completed 7, and 33 missed the deadline. Massive failure rate. "I've dealt with 'my cooling issues'." No, you haven't. Dealing with the issue would imply you fixed the problem. Since that isn't the case, your method of "dealing" with the problem is to just ignore that there is an issue, and avoid activities that bring that issue to the fore. "I didn't post to get in a debate about what you think I should do, I posted to be polite and let y'all know I'm done and why." Nothing polite about your response to someone offering suggestions as to how you might be able to fix the issue, as the only reason for someone to post here about such an issue would be to sort out the problem they are having. If they have no intention of fixing their problem, then there's no need for them to post here to say goodbye. But given your attitude and how little you have contributed to the project, and your choice to avoid workloads that show that there is a problem with your system that requires fixing, then your leaving here is our gain, and MilkyWay's loss. Grant Darwin NT |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "Thank you for pointing out several things I already knew." At the risk of antagonising you further, can I make one further suggestion before you abandon altogether? My motivation being that there's a world of possibilities between overheating and running 60+F below warning levels. My background being I used to overclock my PCs to within (and beyond) an inch of their lives - having melted sockets into the motherboard when I've got it wrong, done all the same things on high capacity case fans, looked at airflow temp gradients within the case, installing AIO CPU coolers etc. All stuff you'd know and I've no doubt you've max'd out all those possibilities. But there's somewhere else you can go. Given Milky Way is only marginally less problematic than Rosetta it's reasonable to look at Boinc as a whole being the problem, and there are settings within it that allow you to throttle Boinc before the strain on the CPU forces effectively the same thing. And I know this because I have 2 PCs - one which runs within thermal limits with Boinc at 100% and one that can't quite seem to manage it. Within Boinc, under Options/Computing Preferences, on the Computing tab, in both the sections "When computer is in use" and "When computer is not in use" there's a setting "Use at most 100% of the CPUs and at most 100% of CPU time". Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels. On my problem PC, in the summer months, I'm having to use a figure as low as 70%, while at this time of year I can up it to nearer 90% but never quite unrestricted. On my 'good' PC I run at 100% year round. I've never got to the bottom of why the 2 PCs are so different, but the point is that both contribute as much as they can while running safely and reliably, which is a better solution than the overkill of abandoning Boinc projects altogether.
What I hope this also does is allow you to run tasks from all Boinc projects more consistently and to completion before deadlines, while your non-Boinc activity has more capacity to run too, at temps that don't make you think your CPU is about to fry, so everyone wins. If you've tried this already, I'll expect you to respond accordingly and that'll be fine too. I can't expect anything less.     | 
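The two percentages described above can also be set in a local preferences file instead of through the Manager. A minimal sketch, assuming the client reads a `global_prefs_override.xml` file from its data directory and that the element names `max_ncpus_pct` (the "% of the CPUs" box) and `cpu_usage_limit` (the "% of CPU time" box) match your client version; the helper name `build_override` is hypothetical, and the client has to be told to re-read its preferences before the file takes effect:

```python
import xml.etree.ElementTree as ET


def build_override(max_ncpus_pct: float, cpu_usage_limit: float) -> str:
    """Return the text of a BOINC-style preferences override.

    Assumption: element names follow global_prefs_override.xml as used by
    the BOINC client; check them against your own installation.
    """
    for name, value in (("max_ncpus_pct", max_ncpus_pct),
                        ("cpu_usage_limit", cpu_usage_limit)):
        if not 0 < value <= 100:
            raise ValueError(f"{name} must be in (0, 100], got {value}")

    root = ET.Element("global_preferences")
    # "Use at most N% of the CPUs"
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    # "...and at most N% of CPU time"
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The suggestion above: keep all CPUs, drop CPU time from 100% to 90%.
    print(build_override(100, 90))
```

Stepping `cpu_usage_limit` down in small increments and watching the temperature after each change mirrors the trial-and-error approach described in the post.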
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels." The drawback with that method is thermal stress: you're maxing out the CPU, then stopping it, then maxing it out & then stopping it, etc. Short, sharp heating & cooling cycles aren't good for electronics (really long cycles, no problem; extremely short cycles, no problem). Reducing the number of cores/threads being used will result in lower CPU temperatures, and it won't result in longer processing times which will take the BOINC Manager some time to adjust to (and when 21% or less of downloaded work is actually being processed and returned in time, taking even longer to process work is just going to make that result even worse). Grant Darwin NT |
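The trade-off between the two throttling methods can be put into rough numbers. This is an idealized sketch, assuming a task needs a fixed amount of CPU time and that a "% of CPU time" limit simply stretches wall-clock time in proportion; real runs may deviate, and the function names are hypothetical:

```python
def wall_clock_hours(cpu_hours: float, cpu_time_pct: float) -> float:
    """Wall-clock time for one task when it only runs cpu_time_pct of the time."""
    if not 0 < cpu_time_pct <= 100:
        raise ValueError("cpu_time_pct must be in (0, 100]")
    return cpu_hours / (cpu_time_pct / 100)


def cpu_hours_per_day(running_tasks: int, cpu_time_pct: float) -> float:
    """Total CPU-hours delivered per 24 h for a given number of running tasks."""
    return running_tasks * 24 * (cpu_time_pct / 100)


if __name__ == "__main__":
    # Duty-cycle throttling: 16 tasks, each running 75% of the time.
    print(wall_clock_hours(12, 75))   # per-task wall clock stretches
    print(cpu_hours_per_day(16, 75))
    # Core limiting: 12 tasks, each running 100% of the time.
    print(wall_clock_hours(12, 100))  # per-task wall clock unchanged
    print(cpu_hours_per_day(12, 100))
```

Under these assumptions both settings deliver the same total CPU-hours per day, but only the duty-cycle method lengthens each task's wall-clock time, which is what matters against a 3-day deadline.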
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "Reduce the second 100% first to 90% and see how it affects your temps. Adjust further until you reach a CPU temp level you're comfortable with that's consistently below warning levels." "The drawback with that method is thermal stress: you're maxing out the CPU, then stopping it, then maxing it out & then stopping it, etc. Short, sharp heating & cooling cycles aren't good for electronics (really long cycles, no problem; extremely short cycles, no problem). Reducing the number of cores/threads being used will result in lower CPU temperatures, and it won't result in longer processing times which will take the BOINC Manager some time to adjust to (and when 21% or less of downloaded work is actually being processed and returned in time, taking even longer to process work is just going to make that result even worse)." I actually agree with you 100% if not for (or but or except) this is my badly running PC (Intel i5-9600K), set to run between 85 and 90% of the time (in use/not in use) with a 12hr (43200sec) task runtime. I have no idea at all why it's running with CPU time so close to wall-clock time and pretty much exactly to 12hrs (so no Boinc time adjustment) and delivering so much credit per task. If I hadn't told you how I've got it set up, could you guess? I certainly couldn't. There are no telltale hints at all that I can see. Before I dialled down the Boinc settings, because it's unattended 4 days/wk, I'd visit it and find it'd crashed days previously. I'd had it thoroughly cleaned out, along with case and CPU AIO fans, with only a marginal improvement, so I had no choice but to dial it down as described. And this is the result - no crashes through heat overload while unattended, runs tasks faithfully at temps ~10% below warning levels. I can hardly complain. And if it works for me I offer the information to others who may have an issue to see if it works for them too. Can't do any harm to try. |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "...if not for (or but or except) this is my badly running PC (Intel i5-9600K), set to run between 85 and 90% of the time (in use/not in use) with a 12hr (43200sec) task runtime." Slight correction - I'm now with that PC and I have it running between 90 & 95%, not 85 & 90%, but also I'd found it crashed while I was away, so still not ideal. I've actually slightly lowered my room heating temp rather than adjust Boinc running time because the weather has slightly improved. I'll see how that goes when I return next week. |
| Ianab Joined: 21 Jun 09 Posts: 1 Credit: 2,929,705 RAC: 1 |
 With my machines I often limit the BOINC work, especially in the summer. Not because they are actually overheating, but because I am, and don't want to pay double electricity to crank up the AC. But I limit the % of CPUs to use, sometimes down to just a single task. That way the CPU heat load is steady, but only a fraction of the max. The individual tasks actually complete slightly quicker, as many systems can boost the clock speed when only 1 or 2 cores are in use, plus there is less contention for cache and RAM access. I see the OP has an 8 core / 16 thread CPU. Now I don't know what his cooling solution is, but I can guarantee that if he switched to 75% CPU use (12 tasks) his machine would not be overheating. That the machine's cooling is inadequate for running at 100% load is another issue, but I don't know what he's got, so I can't comment on it just being an under-spec cooler, poor case design, not enough fans etc. The stock cooler for that CPU is a small downdraft unit, with a couple of heat pipes. Probably going to struggle with full load, but of course we don't know what's actually in the box. Different software executes different instructions in the CPU, and those take different amounts of power, depending on the sections of the chip that are used. That explains why running 100% on this project creates more heat than 100% with a different program: simply a different mix of instructions. I see the Ryzen also has power limiting software which would limit clock speed (or reduce the boost) when CPU power goes over a certain limit. That might be a workaround for a marginal cooling solution. Lock it to use a max of 75W rather than the default of 88W. |
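The "75% CPU use (12 tasks)" figure above is simple arithmetic on the thread count. A small sketch, assuming the client truncates the result to a whole number and never goes below one task (the exact rounding rule is an assumption here, and `concurrent_tasks` is a hypothetical helper name):

```python
def concurrent_tasks(logical_cpus: int, max_ncpus_pct: float) -> int:
    """Estimate how many single-threaded tasks run at once.

    Assumption: the percentage is applied to the logical CPU count,
    truncated to an integer, with a floor of one task.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if not 0 < max_ncpus_pct <= 100:
        raise ValueError("max_ncpus_pct must be in (0, 100]")
    return max(1, int(logical_cpus * max_ncpus_pct / 100))


if __name__ == "__main__":
    # Ryzen 7 3700X from the opening post: 8 cores / 16 threads.
    for pct in (100, 75, 50, 1):
        print(pct, concurrent_tasks(16, pct))
```

On a 16-thread CPU this gives 16, 12, 8 and 1 tasks for 100%, 75%, 50% and 1%, so the percentage can be chosen by counting down from the full thread count until temperatures settle.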
| TheFiend Joined: 27 Jan 12 Posts: 8 Credit: 27,500,074 RAC: 0 |
 My AMD Ryzen 3900X has no problems running Rosetta at 100% CPU load 24/7 and doesn't get anywhere near 95 degrees C... I use the AMD Ryzen Master app to drop the CPU voltage and have it running very stable at 1.3125V CPU core voltage, with the CPU maxing out at about 70 degrees C, air-cooled. |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "...if not for (or but or except) this is my badly running PC (Intel i5-9600K), set to run between 85 and 90% of the time (in use/not in use) with a 12hr (43200sec) task runtime." Mission successful. Reading other people's comments, it's my Ryzen 7 5800X (8C/16T) machine that runs fine at 100% 24/7/365, but my i5-9600K (6C) that has problems - both AIO CPU water-cooled. It makes me think it's an Intel thing, with so many saying their AMDs run fine. |
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "Mission successful." The issues with current Intel CPUs are generally with the high-end models (i9s). But if your i5 is overclocked at all, then I can see it having issues, as those ones weren't really made for overclocking (yes, I know it's got an unlocked multiplier, but I'm pretty sure there wasn't a lot of headroom there, even with their higher TDP rating with those lower end models). Grant Darwin NT |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "Mission successful." "The issues with current Intel CPUs are generally with the high-end models (i9s)." You're certainly right that I chose unlocked CPUs with the intention of overclocking, but where I used to force the 3.7k default on this i5-9600K up to 4.3k 24/7, I've eased that way back to 3.9k - and with the 90/95% BOINC runtime it averages that original 3.7k, with temps 15C below the max, which may explain how I've got to this point. But with decent cooling it ought to be doing a lot better than that and not giving me the issues I've seen. The whole thing's a balancing act, where I've now kind of given up playing for the sake of stability. Edit: What I've noticed as I remind myself what I'm running at is that RAM support on this MB/CPU combo is technically 2666MHz but I'm running mine at 3030MHz. I'm now wondering whether that's the root of my instability... Edit2: This PC was originally built using a 4-core i3-8350K CPU which I upgraded to this 6-core i5-9600K after a few years as the MB supported it. I don't think that introduced any issues, but potentially it might have somewhere. |
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "Edit: What I've noticed as I remind myself what I'm running at is that RAM support on this MB/CPU combo is technically 2666MHz but I'm running mine at 3030MHz. I'm now wondering whether that's the root of my instability..." Have you got the latest BIOS on that motherboard? I avoid BIOS updates like the plague; however, I finally bit the bullet & upgraded the BIOS on both of my systems, partly due to security issues, but also because a couple of the updates specifically addressed stability & performance issues. The BIOS on my motherboards was only 2 versions after the initial release BIOS; the most recently released one, while over 3 years old now, was about the 6th or 7th BIOS update. And surprisingly it did improve performance: the all-core 100% load CPU clock was boosted by around 200MHz. Not a lot, but not bad for a freebie (especially when you consider a mid-cycle CPU model refresh often only results in a similar clock boost). Grant Darwin NT |
| noderaser Joined: 4 Oct 05 Posts: 17 Credit: 150,001 RAC: 0 |
 For those users who are having heat issues, you can use a utility called TThrottle which can scale your BOINC computation to meet a specified CPU and GPU temperature. It's quite useful, especially for systems where cooling is insufficient such as in laptops. Unfortunately, it is only available for Windows but there is a different utility for Linux that I'll have to dig up. |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "Edit: What I've noticed as I remind myself what I'm running at is that RAM support on this MB/CPU combo is technically 2666MHz but I'm running mine at 3030MHz. I'm now wondering whether that's the root of my instability..." "Have you got the latest BIOS on that motherboard?" Fair question. I've forgotten tbh. I'm pretty certain I'd have updated the BIOS after doing the CPU upgrade, but whether I've looked at it since, I can't say. I did the upgrade shortly after that 9k series became unsupported, most likely to benefit from a price-drop, at which point I imagine the BIOS was quite mature, but it's quite possible there's been at least a few updates since then. Now I'm back home on my main Ryzen 7 5800X, where I'm more confident of a recent BIOS update, I find I'm 2.5yrs and 8 BIOS updates behind with all sorts of security fixes not installed <cough> I'll definitely be updating this one as I believe early versions also fixed problems with 5000-series CPUs running hot, so I daren't imagine how out of date the other one is. It's very likely you've hit on something I didn't consider, so thanks for that nudge. Edit: And Chipset, LAN, SATA, Audio and VGA drivers, though I'm maybe only 2 versions away on those. It's going to be a long night of upgrades and reboots. |
| Grant (SSSF) Joined: 28 Mar 20 Posts: 1895 Credit: 18,534,891 RAC: 0 |
 "For those users who are having heat issues, you can use a utility called TThrottle which can scale your BOINC computation to meet a specified CPU and GPU temperature. It's quite useful, especially for systems where cooling is insufficient such as in laptops. Unfortunately, it is only available for Windows but there is a different utility for Linux that I'll have to dig up." The problem with it is that it does the same thing that the "Use at most xx % of CPU time" setting does: it starts & stops the CPU in a duty cycle. Lots of thermal stress, and a discrepancy between Run time & CPU time. And unless you're prepared to manually configure the programme, it will affect all programmes running on that system (word processing, email, AV scanning etc., if they result in the CPU reaching the preset temperature limit). Better to just let it run at 100%, but limit the number of cores/threads in use. That way there are no issues with estimated completion times, deadlines, and work scheduling, & the temperatures are still kept under control. The cooling systems on much older laptops were much more effective (as they all ran much hotter anyway). Unless you have a desktop replacement unit, portable workstation or higher end gaming system, I personally wouldn't run any sort of computing load on current notebooks, ultra-lites or general laptops these days (other than for a few days as a stress test). They just aren't designed for continuous heavy loads (they can do it, but they run hot and thermally throttle very quickly). Grant Darwin NT |
| Sid Celery Joined: 11 Feb 08 Posts: 2471 Credit: 46,466,719 RAC: 234 |
 "Have you got the latest BIOS on that motherboard?" I'm back. By which I mean I upgraded everything (in reverse order to what's listed above) and all was going well until I updated the BIOS. Well, updating the BIOS went well too, until it completed and asked me if I wanted to switch fTPM on or not before rebooting - making clear it was an important question - but I didn't know what the answer was because I didn't understand the question. So I kind of guessed. Not a good idea. I still had access to the BIOS after, but I couldn't hit the desktop. Restoring saved BIOS settings didn't help either. Tbf I don't really know whether this was the problem. I could disable fTPM again in the BIOS, but I still couldn't hit the desktop. Eventually I noticed the order of booting my drives (NVME and HDD) had been reset and a couple of goes at correcting that finally returned me in one piece. I could tell I was finally on the right track because, as it turned out, Windows Update had downloaded and partially done its monthly update just as I rebooted to update the BIOS - what were the chances of that? - before completing the installation once the drives were back in the right order. All this could've gone very wrong and it might be a minor miracle that I escaped without consequences, except to my heart rate. This is on my 'good' machine btw (the AMD 5800X) not the problem one. That's still to be done <gulp> So anyway... Where I was expecting the machine to run cooler, the fact that it was also more stable meant that the clock speed (3.8k default, was running at 4.2k) increased to about 4.375k and so ends up running 5-10C hotter and within 10% of its maximum temp. So now, when I say it's running at 100% Boinc CPU runtime year-round, I can only reasonably make that claim for this winter. I'll wait to see what the summer brings... What a palaver... |
| rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 |
 "Sorry, folks. My computer is too important for me to fry it running Rosetta." "Given that there are hundreds of thousands of computer systems that are capable of running at 100% load for days & weeks (and months & years) on end without overheating, the other option would be to fix the problem with your CPU cooling." A couple of things I have noticed about Rosetta. The Rosetta developers "performance tune" Rosetta with a single instance running. They aggressively inline code, and take advantage of the cache hierarchy to get low memory latency. That works fine unless you run multiple copies of Rosetta. Multiple copies of the big-footprint Rosetta generate many cache misses, evict cache lines, and access slow memory to retrieve the data. Multiple copies of Rosetta cause the cache miss rate to skyrocket. Accessing main memory requires the CPU to use large, power hungry transistors to drive the memory bus. Driving memory buffers is a major source of the heat and why Rosetta generates more heat. If the CPU temperature reaches the "silicon pause temperature", the CPU automatically inserts PAUSE instructions to slow execution and cool the CPU. Running the CPU at 100% is slower than running the CPU at 99% and 98% and ... The maximum THROUGHPUT of Rosetta work loads is probably closer to 50% because of the big footprint. Running a "mix" of BOINC workloads is tough (impossible?) to fine tune with the BOINC controls. I usually devote a machine and set the CPU PREFERENCE to 60%-70% of CPU for 100% of time. If I see TASK MANAGER or Linux "top" show more than 70% or so load, I will reduce the % of CPU. Getting the best THROUGHPUT on any machine depends on cache sizes, memory sizes, disk speed, ... rjs5 |
| TheFiend Joined: 27 Jan 12 Posts: 8 Credit: 27,500,074 RAC: 0 |
 "Have you got the latest BIOS on that motherboard?" Have you tried the AMD Ryzen Master app? You can tweak the CPU core voltage downwards to drop the temperature, plus it has a test mode that you can run to see if the tweaks are stable... I have one of my 3900Xs running at 4.125GHz, 1.3125V ... 63 degrees air-cooled running Rosetta tasks. |
         ©2025 University of Washington 
https://www.bakerlab.org