High write-rate with Pythons

Author	Message
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104000 - Posted: 2 Jan 2022, 15:36:32 UTC I always monitor the write-rates for my projects, to ensure that the SSD will not be damaged. For Windows, SsdReady (free version) works. http://www.ssdready.com/ssdready/ Currently six pythons (plus two 4.20 Rosettas) are running on my main Win10 machine, a Ryzen 3600 (VBox 5.2.44). The writes are 1.4 TB (that is terabytes) per day, which is way too much for safety. That is due to the pythons; the 4.20 are much less. I always limit my SSDs to less than 100 GB/day, and preferably 70 GB/day. But I use a write-cache also; for Windows it is PrimoCache. https://www.romexsoftware.com/en-us/primo-cache/index.html Using 24 GB for a write-cache (a huge amount) and 4 hours latency, I can reduce the writes to the SSD to 31% of the writes by the OS, or about 434 GB/day. That is still too much for me. For Linux, I can use the built-in write cache by increasing the parameters. https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ I will have to move the project to a dedicated machine where a failure of the SSD would not be so catastrophic. ID: 104000 · Rating: 0 · rate: / Reply Quote
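The arithmetic in the post above can be checked with a short sketch. The function names are hypothetical; the only inputs are the figures quoted in the post (1.4 TB/day of OS writes, 31% of them reaching the SSD, and a 100 GB/day budget).

```python
def ssd_writes_gb_per_day(os_writes_gb_per_day: float, fraction_reaching_ssd: float) -> float:
    """Writes that still reach the SSD after the write-cache absorbs the rest."""
    return os_writes_gb_per_day * fraction_reaching_ssd


def fraction_needed_for_budget(os_writes_gb_per_day: float, budget_gb_per_day: float) -> float:
    """Largest fraction of OS writes that may reach the SSD without exceeding the budget."""
    return budget_gb_per_day / os_writes_gb_per_day


if __name__ == "__main__":
    os_writes = 1400.0  # 1.4 TB/day expressed in GB/day, as measured in the post
    # GB/day reaching the SSD with the 31% figure from the post
    print(round(ssd_writes_gb_per_day(os_writes, 0.31)))
    # Fraction of OS writes a 100 GB/day budget would allow
    print(round(fraction_needed_for_budget(os_writes, 100.0), 3))
```

Under these numbers the cache would have to absorb roughly 93% of the OS writes to stay inside the 100 GB/day limit, which is why 31% is described as still too much.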

tullio Send message Joined: 10 May 20 Posts: 63 Credit: 630,125 RAC: 0	Message 104001 - Posted: 2 Jan 2022, 16:27:29 UTC Last modified: 2 Jan 2022, 16:29:47 UTC I have an HP laptop running 24/7 since 2014 on SuSE Leap 15.0. It has a hybrid disk, which has a small SSD partition and a 1 TB rotating disk. It is now running BOINC QuChemPedIA@home. It should be dead by all forecasts and yet it is running on an AMD E-450 CPU. Tullio ID: 104001 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104002 - Posted: 2 Jan 2022, 17:06:53 UTC - in response to Message 104001. I have an HP laptop running 24/7 since 2014 on SuSE Leap 15.0. It has a hybrid disk, which has a small SSD partition and a 1 TB rotating disk. It is now running BOINC QuChemPedIA@home. It should be dead by all forecasts and yet it is running on an AMD E-450 CPU. Tullio The lifetime estimates given on SsdReady are probably very conservative, but they were showing about six months for my test. But QuChemPedIA is easy compared to the pythons, and your rotating disk will take more writes than an SSD anyway. ID: 104002 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 104005 - Posted: 2 Jan 2022, 22:43:51 UTC - in response to Message 104002. The lifetime estimates given on SsdReady are probably very conservative, but they were showing about six months for my test. But QuChemPedIA is easy compared to the pythons, and your rotating disk will take more writes than an SSD anyway. The problem i can see with that tool is that it shows the reads/writes reported by the Operating System, and not by the SSD itself. Information seems to be rather sparse, but for current SSDs Write Amplification seems to be around 0.5 or so- ie the actual data written to a drive is half that of the value reported by the OS. Of course for smaller SSDs, or those low on free space, the value could be well over 2. Even so- the huge disc writes for Python work could be what was occurring with Rosetta tasks when i first started at this project- every new Task resulted in the database files being written to the new Task directory- each & every time a new Task started. Someone from another project showed them how to just link to those files so they didn't have to be copied & written again every time a new Task started. A massive reduction in disk activity. How are VirtualBox jobs handled by other projects? I thought it might be a case of security requiring the VM image to be copied to the new Task folder every time. If not, just linking to the VM image file for each Python Task (same as for Rosetta 4.20 Tasks) would reduce the writes by 8GB for each new Task. Grant Darwin NT ID: 104005 · Rating: 0 · rate: / Reply Quote
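The "link instead of copy" idea described in the post above can be sketched in general terms. This is only an illustration of hard-linking with a copy fallback, not how BOINC or Rosetta actually place files; the function name is hypothetical. Note that a hard link shares the underlying data, so anything written through one name is visible through the other, which means the approach only suits files the task treats as read-only.

```python
import os
import shutil


def place_shared_file(source: str, destination: str) -> bool:
    """Make `destination` available without rewriting the data if possible.

    Returns True if a hard link was created (no data blocks written),
    False if the file had to be copied (full rewrite of the contents).
    """
    try:
        # A hard link adds a second directory entry for the same data,
        # so only filesystem metadata is written.
        os.link(source, destination)
        return True
    except OSError:
        # Linking fails across filesystems or where links are unsupported;
        # fall back to an ordinary copy.
        shutil.copy2(source, destination)
        return False
```

With a hard link, starting a new task costs a directory entry instead of several gigabytes of writes, which is the saving the post estimates.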

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104006 - Posted: 2 Jan 2022, 22:55:53 UTC - in response to Message 104005. Last modified: 2 Jan 2022, 23:04:02 UTC I thought it might be a case of security requiring the VM image to be copied to the new Task folder every time. If not, just linking to the VM image file for each Python Task (same as for Rosetta 4.20 Tasks) would reduce the writes by 8GB for each new Task. It looks like a new "vm_image.vdi" file is downloaded into the slots directory each time a new work unit is started. At 7.5 GB each, it all adds up. But I think those are the "easy" writes for the SSD, since they are just a large file that is serially transferred. It is the "random" writes done while processing a work unit that are hard on the SSD. The last time I checked on LHC/CMS, the writes were not that much, though I use some write cache on every project. As for write amplification, it used to be a lot more than 2; more like 10 if you were lucky. But that was some years ago. I don't know what they are doing now. PS - Yes, SsdReady is probably just looking at the OS writes, and is a bit pessimistic. It really is probably only 31% of that with the test that I did. ID: 104006 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 104007 - Posted: 2 Jan 2022, 23:44:48 UTC - in response to Message 104006. Last modified: 3 Jan 2022, 0:16:18 UTC It looks like a new "vm_image.vdi" file is downloaded into the slots directory each time a new work unit is started. At 7.5 GB each, it all adds up. But I think those are the "easy" writes for the SSD, since they are just a large file that is serially transferred. It is the "random" writes done while processing a work unit that are hard on the SSD. Even with just the default OS write caching, and the SSD's own wear levelling optimisations, the random writes wouldn't be that much of an issue IMHO- even with lots of cores/threads, as the wear levelling alone would consolidate all those individual writes into single block sized writes. I suspect for a given Task that all the result data & checkpointing writes combined would have to be a lot less than 7.5GB. Do the other projects download a new VM file for each & every Task they process as well? You'd think that it must be possible for just a single VM file to be downloaded, and when a new Task is sent out it gets a security hash check to make sure the file already on the system is correct & can be used- even if it has to be copied in to the slot each time with the new Task. Only if the hash value is different, or the Project releases a new VM image should it have to download another VM image to the processing system. And ideally the VM image should be usable like the Rosetta 4.20 Database files are- each Task links to the existing database files. No need to copy them every single time to the new Task slot. I have to admit that i would have thought by now it would be possible to implement a VM in a much more efficient manner. I'd have thought the VM software would just require the necessary data- ie CPU type (which would include instructions supported etc) and the OS and its version. Then its configuration (RAM, Storage, Keyboard, Video card etc). You'd have an image for a particular CPU & OS, then just configuration data for everything else. So for cloud type deployments there would be multiple CPU + OS images, but each one is used as a template to provide hundreds (or even thousands) of VMs. EDIT- thinking about it, it gets very ugly very quickly when multiple different VMs and software on them are needed. But here at Rosetta it's the simplest possible usage of a VM. Grant Darwin NT ID: 104007 · Rating: 0 · rate: / Reply Quote
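The hash check proposed in the post above could look roughly like the following. This is a minimal sketch under stated assumptions: SHA-256 is chosen here for illustration only, the function names are hypothetical, and nothing in it is taken from the BOINC client.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a multi-gigabyte image never sits in RAM at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_can_be_reused(path: str, expected_sha256: str) -> bool:
    """True if the image already on disk matches the hash sent with the new task."""
    try:
        return sha256_of_file(path) == expected_sha256.lower()
    except FileNotFoundError:
        # No local image yet, so it has to be downloaded.
        return False
```

Verifying an existing image this way only reads the file, so it adds no write load to the SSD; a fresh download would be needed only when the check fails.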

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104008 - Posted: 3 Jan 2022, 0:02:20 UTC - in response to Message 104007. Even with just the default OS write caching, and the SSD's own wear levelling optimisations, the random writes wouldn't be that much of an issue IMHO- even with lots of cores/threads, as the wear levelling alone would consolidate all those individual writes into single block sized writes. I suspect for a given Task that all the result data & checkpointing writes combined would have to be a lot less than 7.5GB. Wear leveling is done after the fact, even after garbage collection. But I don't think that directs the writes into a nice order. I think the writes in real time go wherever they can. I did not directly measure how much of the total writes were due to the .VDI downloads, but about half would be an approximation. That leaves the other half that would be a problem I think. Do the other projects download a new VM file for each & every Task they process as well? You'd think that it must be possible for just a single VM file to be downloaded, and when a new Task is sent out it gets a security hash check to make sure the file already on the system is correct & can be used- even if it has to be copied in to the slot each time with the new Task. Only if the hash value is different, or the Project releases a new VM image should it have to download another VM image to the processing system. And ideally the VM image should be usable like the Rosetta 4.20 Database files are- each Task links to the existing database files. No need to copy them every single time to the new Task slot. I doubt very much that the other projects download a new VM file for every task, at least not after they are developed. I think it is one of those things on their to-do list to fix up as soon as they can, from the comments I have seen. But I have not checked it recently. 
I have to admit that i would have thought by now it would be possible to implement a VM in a much more efficient manner. I'd have thought the VM software would just require the necessary data- ie CPU type (which would include instructions supported etc) and the OS and its version. Then its configuration (RAM, Storage, Keyboard, Video card etc). You'd have an image for a particular CPU & OS, then just configuration data for everything else. So for cloud type deployments there would be multiple CPU + OS images, but each one is used as a template to provide hundreds (or even thousands) of VMs. You probably should look at the files yourself and maybe see what I have missed. But yes, they should have fixed it up. I was going to say that this seems to be a side-project from a single researcher that did not get the attention of their full development effort. But that could be inaccurate, and it could be good science anyway. I will try to do it if I can. ID: 104008 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 104009 - Posted: 3 Jan 2022, 0:40:19 UTC - in response to Message 104008. Wear leveling is done after the fact, even after garbage collection. But I don't think that directs the writes into a nice order. I think the writes in real time go wherever they can. There are two types of wear levelling- modern controllers use both types. Static wear levelling is where existing data is moved in order to free up those less used blocks for new data, and is done as a part of the Garbage Collection/Trim work. Dynamic wear levelling is where data being written is written to the least used available blocks. Write amplification reduction also helps with wear levelling by reducing the total number of writes to the drive. I tend to think of Write amplification reduction, Wear levelling & Garbage collection as a single entity (and you can throw overprovisioning in there as well)- improving SSD endurance/life expectancy/write optimisation, even though they are actually different things. They all improve the drive's life expectancy & improve its performance. Grant Darwin NT ID: 104009 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104014 - Posted: 3 Jan 2022, 7:56:28 UTC - in response to Message 104009. I tend to think of Write amplification reduction, Wear levelling & Garbage collection as a single entity (and you can throw overprovisioning in there as well)- improving SSD endurance/life expectancy/write optimisation, even though they are actually different things. They all improve the drive's life expectancy & improve its performance. That is a very nice explanation. But I think the write endurance for the memory cells has gone from about 100k cycles to 10k in the last few years. So even if the write amplification has gone from 10 to 1, we are back where we started. That may not be so bad, but I will use a write-cache as required. ID: 104014 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 104015 - Posted: 3 Jan 2022, 8:40:41 UTC - in response to Message 104014. But I think the write endurance for the memory cells has gone from about 100k cycles to 10k in the last few years. It's actually worse than that. It's gone from 100k P/E (Program/Erase) Cycles for SLC, to 10k P/E Cycles for MLC, to 3k P/E Cycles for TLC, to 1k P/E Cycles for QLC NAND. And each successive generation has had longer write times than the previous generation (so much higher write latency). So if you've got the RAM, and a good UPS, every little bit of help is worthwhile. Although it does show just how much SSD controllers have developed over the years (and i suspect in the case of QLC in particular how much 3D NAND and smaller manufacturing nodes have helped with the storage density allowing for much higher levels of overprovisioning even with the very large capacity drives) that the endurance levels have been maintained and performance has actually increased (at least up to & including TLC. QLC really is only suited to read intensive/ very occasional write applications). Grant Darwin NT ID: 104015 · Rating: 0 · rate: / Reply Quote
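To see how the P/E cycle counts and write amplification discussed above combine, here is a first-order estimate. It is a simplified model, not a manufacturer's rating: it assumes perfect wear levelling and ignores overprovisioning, and the function names are hypothetical. The cycle counts are the ones quoted in the post.

```python
# Program/Erase cycle figures as quoted in the post above.
PE_CYCLES = {"SLC": 100_000, "MLC": 10_000, "TLC": 3_000, "QLC": 1_000}


def estimated_tb_written(capacity_gb: float, nand_type: str, write_amplification: float) -> float:
    """Host terabytes the drive could absorb before its cells reach their cycle limit."""
    if write_amplification <= 0:
        raise ValueError("write_amplification must be positive")
    # Every cell can be rewritten PE_CYCLES times; write amplification means
    # each host gigabyte costs more (or fewer) gigabytes of NAND writes.
    return capacity_gb * PE_CYCLES[nand_type] / write_amplification / 1000.0


def estimated_lifetime_days(capacity_gb: float, nand_type: str,
                            write_amplification: float, host_writes_gb_per_day: float) -> float:
    """Days until the estimated endurance is used up at a steady host write rate."""
    total_tb = estimated_tb_written(capacity_gb, nand_type, write_amplification)
    return total_tb * 1000.0 / host_writes_gb_per_day
```

For example, `estimated_tb_written(500, "TLC", 2.0)` models a 500 GB TLC drive with a write amplification of 2. The drive's published endurance rating, where available, is the better number to plan around.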

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104016 - Posted: 3 Jan 2022, 12:28:59 UTC - in response to Message 104015. Well you are right about the over-provisioning. I buy larger SSDs anyway. But it looks like they have run out of string. They will need to think of something else. The pythons are giving me so much trouble that it may not matter much on this one. Thanks for your input. ID: 104016 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104292 - Posted: 17 Jan 2022, 2:51:40 UTC - in response to Message 104006. It looks like a new "vm_image.vdi" file is downloaded into the slots directory each time a new work unit is started. At 7.5 GB each, it all adds up. It may not be quite so bad. There is an "AIMNet_vm_v2.vdi" file (7,285,760 KB) that is downloaded to the Projects directory (boinc.bakerlab.org_rosetta) when you attach to Rosetta and receive a python. Then, there is a "vm_image.vdi" file of exactly the same size created in each slot directory when you download each work unit. Chances are, it is just a renamed copy of the file in the Projects directory, and so is downloaded only once. But I can't readily confirm that directly. ID: 104292 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 104783 - Posted: 10 Feb 2022, 20:35:55 UTC I just had a 500 GB NVMe M.2 drive fail on my Ryzen 3900X where I ran most of the pythons. I stopped that about a week ago, and have been running other projects fine ever since, so the pythons may have had nothing to do with it. But it is the first M.2 drive I have ever had fail, so it may not be a coincidence. ID: 104783 · Rating: 0 · rate: / Reply Quote

computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0	Message 104785 - Posted: 10 Feb 2022, 21:14:30 UTC - in response to Message 104783. The "vm_image.vdi" in each slots dir is indeed a copy of "AIMNet_vm_v2.vdi" that you see in the projects dir. That's how vbox apps currently work. The description can be found here (the page has some typos :-)): https://boinc.berkeley.edu/trac/wiki/VboxApps#Creatingappversions https://boinc.berkeley.edu/trac/wiki/VboxApps#Howitworks ID: 104785 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 106343 - Posted: 3 Jun 2022, 15:14:27 UTC
I have found out that the settings given for Linux in my original post do not survive a reboot. To ensure that they do, you need to place the values you want in the "/etc/sysctl.conf" file. (These work for Ubuntu 20.04; there should be something similar for the other versions.)
For example, to set the write cache to 8 GB/8.5 GB and the latency to one hour (3600 seconds, suitable for 32 GB main memory) run: "sudo gedit /etc/sysctl.conf"
Then insert in the file:
vm.swappiness=0
vm.dirty_background_bytes=8000000000
vm.dirty_bytes=8500000000
vm.dirty_writeback_centisecs=500
vm.dirty_expire_centisecs=360000
To activate: "sudo sysctl --system", or else reboot.
Check values: "sysctl -a | grep dirty"
You can check the actual writes to disk using "dstat", by installing "pcp" first: "sudo apt install pcp"
Then, "dstat -d 3600" gives the writes in one hour, for example.
ID: 106343 · Rating: 0 · rate: / Reply Quote
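As an alternative way to check how much actually reaches the disk, the kernel's own counters in /proc/diskstats can be read directly. This is a minimal sketch, Linux only; the device name used in the usage note ("nvme0n1") is an assumption and the function names are hypothetical. The field position follows the kernel's documented diskstats layout, where the tenth column is sectors written and a sector is always 512 bytes.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of the device


def bytes_written(diskstats_text: str, device: str) -> int:
    """Total bytes written to `device` since boot, parsed from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads completed, reads merged, sectors read,
        # ms reading, writes completed, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise KeyError(device)


def read_bytes_written(device: str) -> int:
    """Read the live counter for `device` from the running system."""
    with open("/proc/diskstats") as handle:
        return bytes_written(handle.read(), device)
```

Calling `read_bytes_written("nvme0n1")` twice, an hour apart, and subtracting the two results gives the writes to that device in that hour.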

computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0	Message 106344 - Posted: 3 Jun 2022, 15:41:22 UTC - in response to Message 106343. Setting the write cache to an extreme delay doesn't really help since the cache daemon also writes to disk when all RAM is in use and the OS requests some more. On Linux I prefer to mount the /slots/ dir as a zram device. It shares the same RAM with the disk cache but transparently compresses the data. Typical savings due to compression: 40%. A process to backup/restore data to/from a real disk needs to be implemented locally but that's an easy task. On Windows there are a couple of tools doing similar things, even freeware. Just look for "dynamic ramdisk with compression". ID: 106344 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 106345 - Posted: 3 Jun 2022, 17:15:16 UTC - in response to Message 106344. Setting the write cache to an extreme delay doesn't really help since the cache daemon also writes to disk when all RAM is in use and the OS requests some more. With 96 GB DRAM, and a cache size of 64 GB (and page flush every 8 hours), all RAM is not in use. Using 30 cores of a Ryzen 3950X, I am running 9 Pythons and 21 Rosetta 4.20. iostat shows the OS is writing 1 TB/day. dstat shows that less than 1 GB is getting to the SSD. ID: 106345 · Rating: 0 · rate: / Reply Quote

computezrmle Send message Joined: 9 Dec 11 Posts: 63 Credit: 9,680,103 RAC: 0	Message 106346 - Posted: 3 Jun 2022, 17:37:57 UTC - in response to Message 106345. Yes, I know you prefer that extremely long write delay. I just wanted to point out that there are other methods. I have had good experience with LHC (ATLAS native), which I write directly to a tmpfs "partition" (since it has a rather bad compression rate). This is the fastest possible method. LHC (vbox apps like CMS) compresses rather well and runs fine on a zram device. This is comparable to Rosetta's vbox app. All those methods crash in case of a power line outage since the data is not written to disk. ID: 106346 · Rating: 0 · rate: / Reply Quote

Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 106347 - Posted: 3 Jun 2022, 18:42:26 UTC - in response to Message 106346. I looked into tmpfs, and decided that a write-cache was simpler. But if you want to economize on memory, it could work. I did not see a particularly simple way to restore the contents of tmpfs after a reboot, which you may need to do depending on what you store there. Also, you have to figure out how to place the BOINC data folder there. It is simple in Windows, but not so in Linux. ID: 106347 · Rating: 0 · rate: / Reply Quote

Paddles Send message Joined: 15 Mar 15 Posts: 11 Credit: 5,878,302 RAC: 0	Message 106701 - Posted: 3 Aug 2022, 17:10:22 UTC This discussion, and noticing the high amount of data written to my drive, had me wondering: would getting a separate SSD to use solely for BOINC data be sensible, for risk-management (to reduce the wear rate on the primary system SSD)? Or is that overly paranoid? My computer is not quite 1 year old, the 500GB SSD has 131TB written. I don't know how much of that is R@H VBox jobs but I'm guessing a lot of it. ID: 106701 · Rating: 0 · rate: / Reply Quote
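The figures in the post above can be turned into a daily rate for comparison with the 100 GB/day limit mentioned at the start of the thread. This sketch assumes a full 365 days in service; since the machine is "not quite 1 year old", the real average would be somewhat higher. The function name is hypothetical.

```python
def average_gb_per_day(total_written_tb: float, days_in_service: float) -> float:
    """Average host writes per day, given the drive's lifetime total."""
    if days_in_service <= 0:
        raise ValueError("days_in_service must be positive")
    return total_written_tb * 1000.0 / days_in_service


if __name__ == "__main__":
    # 131 TB written, assumed 365 days of use (an upper bound on the drive's age)
    print(round(average_gb_per_day(131, 365)))
```

Comparing that average against the drive's rated endurance gives a rough idea of whether a separate BOINC drive is worth it.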