1)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 105080)
Posted 20 Feb 2022 by entity Post: I was finally able to get about 54 of the tasks running before the system started allocating swap space. I stopped there to prevent any additional I/O from paging. I may have to cut back a bit, as I'm starting to get the "VM Job unmanageable -- restarting later" message. I may cut back to around 32 tasks (256GB / 8GB).
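The cutback arithmetic can be sketched as a quick sizing check (a hypothetical helper; the 8GB-per-task figure is the poster's own estimate for these vbox tasks):

```python
# Rough check of how many memory-heavy tasks fit in RAM without
# spilling into swap. All figures come from the post above.

def max_concurrent_tasks(total_ram_gb, per_task_gb, reserve_gb=0):
    """Tasks that fit in RAM, optionally leaving reserve_gb for the OS."""
    return int((total_ram_gb - reserve_gb) // per_task_gb)

print(max_concurrent_tasks(256, 8))  # -> 32, matching the planned cutback
```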
2)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 105072)
Posted 20 Feb 2022 by entity Post: Are the vbox tasks limited as to how many can run concurrently? I can only get 17 to run at the same time; all others are in "waiting to run" status in BOINC. There is no app_config file in the projects directory, and 202GB of free memory as of this writing. The BOINC client was not acting correctly, so I restarted it. It took almost 15 minutes for the client to restart the 17 VBoxHeadless processes. During the restart the client runs 100% busy and BOINC Manager is totally unresponsive. However, once you let the processes complete their startup and the boinc process drops back to under 5% utilization, you can start more. Starting 10 more processes causes the boinc process to jump back to 100% busy for about 10 minutes; once it drops back to 5%, the tasks show as running. It seems to be related to BOINC and VBox, since I/O is negligible while tasks are starting. I think I can babysit this thing and get it to where I want it. Thanks for the insights.
3)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 105067)
Posted 20 Feb 2022 by entity Post: "Can you try to change the use at most memory setting in computing preferences > disk and memory?" The use-at-most setting is 99% for memory and 100% for the CPUs. The server has 128 threads and 256GB of memory, yet only 17 tasks are running, and there is no message in the log indicating that BOINC is waiting for any resource. BOINC has copied the VDI file to 88 slots, resulting in about 697GB of used disk space on a 900GB disk. BOINC is told to leave 1% free as the most restrictive disk parameter.
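The disk numbers above can be checked with some quick arithmetic (a sketch using only the figures in the post; BOINC copies the VDI image into each slot directory):

```python
# Why the 900GB disk is nearly full: one VDI copy per slot directory.
slots = 88
used_gb = 697
per_slot_gb = used_gb / slots            # ~7.9GB per VDI copy

disk_gb = 900
min_free_fraction = 0.01                 # the "leave 1% free" preference
usable_gb = disk_gb * (1 - min_free_fraction)

headroom_gb = usable_gb - used_gb        # space left before BOINC stalls
print(round(per_slot_gb, 1), round(headroom_gb, 1))  # -> 7.9 194.0
```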
4)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 105064)
Posted 20 Feb 2022 by entity Post: Are the vbox tasks limited as to how many can run concurrently? I can only get 17 to run at the same time. All others are in "waiting to run" status in BOINC. There is no app_config file in the projects directory.
5)
Message boards :
Number crunching :
L3 Cache is not a problem, CPU Performance Counter proves it
(Message 93979)
Posted 9 Apr 2020 by entity Post: "I know that this has been discussed quite a few times in this forum. People said that the Rosetta application is not optimized and needs 5MB of L3 cache for each instance running (though I don't really know where they got this number). But it looks to me like no one has ever shown any data to prove it. So I decided to figure it out." Unfortunately, statistics are always open to interpretation. "If you torture the data long enough, it will confess" -- economist Ronald H. Coase. 1. Many contributors have reported that if they run small numbers of Rosetta instances concurrently they see less of a problem, but as the number of concurrent instances increases the problem becomes more pronounced. This observation seems to be backed by your data, where the L3 hit ratio drops between 4 concurrent and 8 concurrent instances. What happens when concurrency jumps to 16 threads, 32 threads, etc.? 2. The PHYSICAL CORE IPC data indicates ~56% utilization. It seems like it should be closer to 100%. Why so low? Does it indicate that the cores are waiting significantly? 3. The memory READ/WRITE numbers need more context. Where are they measured? I assume those are main memory stats and not cache memory stats, based on the size of the numbers. If the L3 cache needed to be refreshed with new data due to misses, that would result in high read rates, just as your data indicates. Memory transfer rates indicate only that data is being transferred; they definitely don't indicate memory utilization. I would suggest they more rightly indicate memory channel utilization. I submit that in a highly optimized L3 cache situation there would be little to no memory transfer activity; however, the transfer rate not being at full utilization doesn't support the claim that there isn't an L3 cache problem. 4. Are you making the assumption that the Rosetta program has a reasonably static processing profile?
Could the program have higher L3 cache demands at various points in the processing stream, and you just happened to catch it at a low point in demand? 5. You make the statement: "I don't know whether Rosetta's algorithm itself is optimized, but looks like it is pretty effective in keeping CPU busy." I ask, based on what? If you are using CPU utilization reported by the OS, I contend that isn't a reliable number in this case. That number is based on OS wait bits that never get set here. The thread is dispatched to the CPU and, as far as the OS is concerned, it is active on the processor. If the core is waiting on memory transfers between cache levels or main memory, the OS doesn't see that. The only way the OS knows a thread is waiting is if it gives up the CPU voluntarily or is interrupted by a higher-priority task, and the only way it knows a CPU is waiting is if the wait bit for that processor is set. I'm not saying the data isn't useful or pertinent. Just be careful about drawing conclusions from such a small amount of data. In other words, it's hard to run a program for 100 seconds and then say everything is OK.
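The utilization figure questioned in point 2 would typically be derived from raw performance counters like this (a sketch with illustrative numbers only, not values from the thread; the 4-instructions-per-cycle peak is an assumed issue width):

```python
# How a "~56% physical core utilization" figure falls out of
# performance counters. Counter values here are hypothetical.

def ipc(instructions, cycles):
    """Instructions retired per core clock cycle."""
    return instructions / cycles

def core_utilization(measured_ipc, peak_ipc):
    """Fraction of the core's theoretical peak throughput achieved."""
    return measured_ipc / peak_ipc

measured = ipc(instructions=2.24e9, cycles=1.0e9)   # hypothetical counts
print(round(core_utilization(measured, peak_ipc=4.0), 2))  # -> 0.56
```

A low ratio by itself doesn't say *why* the core stalled; correlating it with L3 miss counters over longer runs, as the post suggests, is what would separate cache pressure from other stalls.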
6)
Message boards :
Number crunching :
Limited thread usage under Linux?
(Message 93778)
Posted 7 Apr 2020 by entity Post: I would stop and restart BOINC, then look at the log to see how many CPUs BOINC detected, and also look for any messages as to why it is being limited. I would also check whether any local preferences are overriding the global preferences from the website.
7)
Message boards :
Number crunching :
Rosetta 4.1+ and 4.2+
(Message 93574)
Posted 5 Apr 2020 by entity Post: Downloaded 1135 WUs and working through the last 500 right now. All Linux machines with different distributions...
8)
Message boards :
Number crunching :
Rosetta 4.1+ and 4.2+
(Message 93289)
Posted 3 Apr 2020 by entity Post: "I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about misuse of the L3 cache causing issues he was noticing on Xeon E5's (he didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per two physical cores (4MB for the 7th gen Core i5, and a 12MB SmartCache for the 9th gen Core i7). Maybe it's coincidence, but I find it curious. Hope this helps someone track down what's going on. Cheers!" I've been seeing this in several threads recently. This is a response we got at WCG from the MIP project developers a couple of years ago: "The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses, and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel. Nothing seemed slower for us because we are always running in that regime. We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises. Long term, identifying these issues may end up improving Rosetta for everyone that uses it, so pat yourselves on the back for that!"
9)
Message boards :
Number crunching :
Rosetta x86 on AMD CPU
(Message 93286)
Posted 3 Apr 2020 by entity Post: "I think something is off with the CPU utilization by the new 4.12 app for the new Ryzen 3k systems." We were seeing this same behavior over at WCG on the MIP project, which also uses Rosetta. The MIP developers posted this reply after we had sent them some data on what we were seeing (this was a couple of years ago): "The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses, and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel. Nothing seemed slower for us because we are always running in that regime. We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises. Long term, identifying these issues may end up improving Rosetta for everyone that uses it, so pat yourselves on the back for that!" It's that sitting idle while waiting for data from main memory that causes the temps and energy use to drop.
10)
Message boards :
Number crunching :
Rosetta 4.1+ and 4.2+
(Message 93108)
Posted 2 Apr 2020 by entity Post: "I always try to suggest ways people can run the way they like, without purchase of additional hardware. So another approach would be to add another BOINC project that requires less memory to run. I usually suggest World Community Grid. Their tasks often run in less than 100MB. Running both projects with the same resource share should allow you to use all of your CPUs, and afford enough memory for the larger R@h tasks." Be careful at WCG. Their ARP, FAH2, and MIP work units are quite large, and MIP is running Rosetta, so... HST is quite small but also quite rare. SCC might be a good choice. MCM might also be a good choice but is over 100MB, probably closer to 300MB or 400MB.
11)
Message boards :
Number crunching :
0 new tasks, Rosetta?
(Message 93098)
Posted 2 Apr 2020 by entity Post: I was at WCG for almost 16 years and decided to come here as it seemed to be a more useful project. I guess I have been VERY fortunate in the ability of my machines to stay busy. I have had a steady group of 1600 tasks since I came on board a couple of days ago. I have never seen it below 1000, and it's been holding steady at about 1560 for the past 24 hours. I looked at WCG's potential Pandemic project and I'm skeptical as to how useful it could be. Admittedly, the description and announcements are all I can go on, but it doesn't seem to address anything specific. I might have to wait for some announcement from someone at the Scripps Institute for more information.
12)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 93080)
Posted 2 Apr 2020 by entity Post: Oh, you're right. This is a known problem in Rosetta that the developers have acknowledged but probably haven't fixed yet; they indicated that it would take a major rewrite of the code. The L3 cache tends to become over-utilized and the CPU waits for data to make the trip from main memory, hence the CPU runs cooler (more waiting). There was a post by a developer on another project that suggested limiting the number of tasks run concurrently; they indicated that each task uses about 4MB of L3 cache. Concerning the run time, I noticed that the run parameters include something like cpu_seconds=57500. That is about 16 hours. They are ignoring the Target CPU runtime setting.
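The 16-hour figure follows directly from the parameter value quoted in the post:

```python
# Converting the observed run parameter to hours.
cpu_seconds = 57500
hours = cpu_seconds / 3600
print(round(hours, 1))  # -> 16.0, roughly double the 8-hour target runtime
```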
13)
Message boards :
News :
Rosetta's role in fighting coronavirus
(Message 92907)
Posted 1 Apr 2020 by entity Post: A few things (I just recently came back to R@H): (1) Noticed that I downloaded 4.12 of Rosetta last night. (2) Downloaded "a bunch" of conducting_fibre_*_*_*_* work units that have already run 10 hours and are only 63% complete (I'm using the 8-hour default), and are due Apr 3. A lot are not going to make it if they run the full 16 to 20 hours. (3) Is there maintenance that happens about 0000 UTC? I noticed that the scheduler was not running and many WUs were "stuck" in upload and download. It seemed to clear about 0300 UTC. Is this an every-night thing?
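The "not going to make it" projection in point (2) can be made explicit with the figures from the post (a sketch assuming progress reporting is roughly linear, which it often isn't for Rosetta):

```python
# Projecting total runtime from elapsed time and reported completion.
def projected_total_hours(elapsed_hours, fraction_done):
    """Naive linear extrapolation of remaining work."""
    return elapsed_hours / fraction_done

total = projected_total_hours(elapsed_hours=10, fraction_done=0.63)
print(round(total, 1))  # -> 15.9, consistent with the 16-to-20-hour worry
```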
14)
Message boards :
News :
Rosetta's role in fighting coronavirus
(Message 92762)
Posted 31 Mar 2020 by entity Post: "I would not recommend F@H on Linux. I just left there because of the issues concerning software dependencies. As much as F@H likes to state their software will run most anywhere (which is why they suggest ignoring dependencies during install), I and others have found that not to be the case. I never could get FAHControl running on 3 different distributions without having to go back and install deprecated software. If one is running a Long Term Support release of a distribution (which tends to be back-leveled) then the chances are greater that the install will work. It wasn't worth fiddling with. Just one man's opinion, of course." I think that's the trick: to run only LTS releases, which I don't do. The Ubuntu 20.04 LTS release is out in a couple of weeks and it's going to be interesting to see if it still works. They may have a new client out by then. I was able to get the client to run by using the config.xml file to define my slots, but it wasn't easy and took a lot of trial and error. Apologies for hijacking this thread to discuss another project. This is my last post on this particular topic. Back to R@H.
15)
Message boards :
News :
Rosetta's role in fighting coronavirus
(Message 92747)
Posted 31 Mar 2020 by entity Post: "There has been a discussion that Folding@home is likely to produce a BOINC version soon. Not ready yet, though." I would not recommend F@H on Linux. I just left there because of the issues concerning software dependencies. As much as F@H likes to state their software will run most anywhere (which is why they suggest ignoring dependencies during install), I and others have found that not to be the case. I never could get FAHControl running on 3 different distributions without having to go back and install deprecated software. If one is running a Long Term Support release of a distribution (which tends to be back-leveled) then the chances are greater that the install will work. It wasn't worth fiddling with. Just one man's opinion, of course.
16)
Message boards :
News :
Rosetta's role in fighting coronavirus
(Message 92645)
Posted 30 Mar 2020 by entity Post: I was draining down my 24-thread 32GB machines, which had a mix of Rosetta and other work. 2 of the 3 machines were well into the swap file, but I was able to catch them before they ran out. I trimmed back some work and they are now working just fine; once the other work drains off I will open them up again. The other machine was out of swap and totally thrashing (disk light on continuously and couldn't log on). I rebooted it and trimmed back some work like the others. On my 128-thread 256GB machine, Rosetta filled up the root filesystem, as I had only 50GB allocated to it; when Rosetta started up with about 500MB per slot, BOINC died. I was still able to log on, so I shrank some LVs and extended the root LV. All good now. Everything is running as expected.
17)
Message boards :
News :
Rosetta's role in fighting coronavirus
(Message 92568)
Posted 29 Mar 2020 by entity Post: Just added my 384 threads after being gone for 2 years... 11 servers, 24/7. I don't care about the names; I'll run anything that comes down the network link. First units should be coming back in about 4 hours from the faster machines. The slowest machines are about 8 hours. All 1600 WUs downloaded should be back in about 24 to 28 hours.
©2024 University of Washington
https://www.bakerlab.org