Message boards : Number crunching : Running Rosetta on Raspberry Pi 3B+ (how to guide)
Author | Message |
---|---|
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
Hello guys, I would like to write a guide for getting Rosetta to run on a Raspberry Pi 3B+, or probably any other board with 1 GB or more of RAM. It seems like many people are struggling with it.

My setup:
HW: Raspberry Pi 3B+
No-name USB charger, 5 V 1 A, with a very short cable
SD card, 16 GB
The newest version of Raspbian

Starting state: I assume that you have Raspbian installed and BOINC running. I set up my BOINC client to use 100 percent of the available RAM and 90 percent of swap.

1. In case you have less than 500 MB of RAM per core, set up ZRAM (follow the guide in the readme): https://github.com/novaspirit/rpi_zram
2. Increase the swap size. I set it to 2048 MB. I am not sure how much is needed or how exactly it works in combination with ZRAM; I simply went with the biggest value, since I have enough free space on the SD card. https://wpitchoune.net/tricks/raspberry_pi3_increase_swap_size.html
3. Put your kernel into 64-bit mode and set up the BOINC client to receive 64-bit work: http://marksrpicluster.blogspot.com/2020/04/do-something-useful-with-your-pi4.html
4. Now you should be able to download tasks and run them on 2-3 cores.
5. (Optional) In case you are fine using only SSH, decrease the amount of RAM available to the GPU to 16 MB. This will give you a few more MB of RAM: sudo raspi-config - Advanced Options - Memory split - 16MB
6. (Optional) Decrease the power consumption of the chip and reduce temperatures:
- Disable the USB hub and Ethernet: echo 0 | sudo tee /sys/devices/platform/soc/3f980000.usb/buspower >/dev/null
- Disable HDMI: sudo tvservice --off
These commands need to be reapplied after each reboot.

In the beginning, only 3 tasks were able to run at the same time. However, after a few hours of run time, my Raspberry Pi managed to start all 4 tasks. While doing so, it stored 1.3 GB of data in swap. I have no clue how this happened, but it seems to be working fine.

The speed of calculations seems to be OK compared to my Android Huawei Honor 8 (I am using only the efficient A53 cores on it). It seems to take a bit over 30,000 s to complete one WU on the RPi 3B+, and around 36,000 s on the A53 cores of the Honor 8. So the RPi is even a bit faster? Maybe there are fewer background processes... If you have any questions, just ask. |
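A minimal sketch of the check behind step 1 of the guide above (less than 500 MB of RAM per core means ZRAM is worth setting up). The function names `ram_per_core_mb` and `needs_zram` are made up for this illustration, and the sketch assumes `MemTotal` in /proc/meminfo is given in kB and that `nproc` is available.

```shell
#!/bin/sh
# Hypothetical helper: RAM per core in MB, from total RAM in kB and a core count.
ram_per_core_mb() {
    local total_kb="$1"
    local cores="$2"
    echo $(( total_kb / 1024 / cores ))
}

# Hypothetical helper: succeeds when RAM per core is below the 500 MB
# threshold mentioned in step 1 of the guide.
needs_zram() {
    [ "$(ram_per_core_mb "$1" "$2")" -lt 500 ]
}

# Example on the running system (Linux only; skipped elsewhere).
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    sys_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    sys_cores=$(nproc)
    if needs_zram "$sys_total_kb" "$sys_cores"; then
        echo "Under 500 MB of RAM per core: ZRAM is probably worth setting up"
    else
        echo "500 MB or more of RAM per core: ZRAM may not be needed"
    fi
fi
```

On a 1 GB board with 4 cores this reports well under 500 MB per core, which matches the situation the guide is written for.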
dcdc Send message Joined: 3 Nov 05 Posts: 1831 Credit: 119,585,338 RAC: 9,805 |
Thanks for this- will give it a go. |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
Updates: 32MB of RAM seems to be enough for the GPU to run VNC at 720p. So far I have seen some issues with WUs repeatedly initializing and stopping. I guess that the RAM is simply right at the limit. Let's see how it works in the long run. |
Tom Rinehart Send message Joined: 28 Mar 20 Posts: 7 Credit: 1,637,467 RAC: 0 |
On my RPi 3B running Raspbian Buster Lite (Minimal image based on Debian Buster), I only had to put the kernel in 64 bit mode, set up BOINC client to receive 64 bit work, and lower the GPU RAM to 16MB using raspi-config. It downloaded 4 WUs and just started working. |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
That is interesting, am I overengineering this? I could try to disable ZRAM to see how it works. What is so far your average time per WU? Are all your cores running? My times so far: 7:48 7:53 |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
Thanks for sharing! I like your setup. Are you connecting over wifi? Could you perhaps measure the power draw? You may use either a kill-a-watt or the built-in power gauge inside your laptop if you connect to a high power charger port. I'm interested in how low we can get in power. Can you underclock the CPU/GPU? Can you switch off the LEDs? Can we switch wifi on only periodically for transfers (from cron, for example)?

3 Rosetta tasks in 1GB sounds like pushing the limit a bit, but we'll see how it performs. You may get some more space if you add a little bit more zram and select a better algorithm. Basically I would adjust the end of the script like this:

    mem=$(( ($totalmem * 4 / 3 / $cores)* 1024 ))
    modprobe deflate
    modprobe zlib
    modprobe lz4hc_compress
    core=0
    while [ $core -lt $cores ]; do
      echo deflate > /sys/block/zram$core/comp_algorithm ||
      echo zlib > /sys/block/zram$core/comp_algorithm ||
      echo lz4hc > /sys/block/zram$core/comp_algorithm ||
      echo lz4 > /sys/block/zram$core/comp_algorithm # not sure which one this kernel has
      echo $mem > /sys/block/zram$core/disksize
      mkswap /dev/zram$core
      swapon --discard -p 5 /dev/zram$core # reclaim memory better
      let core=core+1
    done

You can verify swap usage with cat /proc/swaps and zramctl (or, lacking that, cat /sys/block/zram0/{mem_used_total,orig_data_size}). You should also disable "keeping apps in memory while suspended" in your computing preferences.

You may set the GPU memory as low as you like, because you can run a vncserver in a virtual framebuffer without a real X11. You may need to double check ~/.vnc/xstartup to execute an lxsession or at least a terminal. But why do you need X, if you can use boinccmd, boinctui or even remote RPC? |
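As a worked example of the sizing line in the script above, here is a small sketch of the same arithmetic. It assumes `totalmem` is the total RAM in kB (as printed by `free`) and that the result is the size in bytes of each zram device; the function name `zram_disksize_bytes` and the 1,000,000 kB input are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper mirroring the line
#   mem=$(( ($totalmem * 4 / 3 / $cores) * 1024 ))
# Assumption: the first argument is total RAM in kB; output is bytes per zram device.
zram_disksize_bytes() {
    local total_kb="$1"
    local cores="$2"
    echo $(( (total_kb * 4 / 3 / cores) * 1024 ))
}

# Illustrative input: 1,000,000 kB of RAM spread over 4 zram devices.
# Integer arithmetic: 1000000*4/3 = 1333333, /4 = 333333 kB, *1024 bytes.
zram_disksize_bytes 1000000 4   # prints 341332992
```

With the 4/3 factor, the four devices together advertise about one third more swap than there is physical RAM, which only pays off if the swapped pages compress well.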
Tom Rinehart Send message Joined: 28 Mar 20 Posts: 7 Credit: 1,637,467 RAC: 0 |
Mine appears to be running only one WU at a time. I will try the zram approach once the current set of work I have is done. No WUs have completed so far. I also have an RPi 3B running 64-bit Ubuntu. It is running two WUs at a time. It has completed three WUs so far, at 13,418.19s, 5,232.10s, and 3,171.98s. |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
I don't know what kind of WU they send on ARM platforms, but it could happen that you were just lucky. On my desktops I get a huge one every once in a while. Please follow your error/invalid statistics closely and keep us posted. |
ProDigit Send message Joined: 6 Dec 18 Posts: 27 Credit: 2,718,346 RAC: 0 |
If you go headless you can run with just 8MB of VRAM (I set my resolution to 480/512p headless). If you only run Boinc on the desktop OS, you'll be able to run it with 24MB of VRAM if you choose 720p resolution (and 16MB at 480p, but the Raspbian desktop OS isn't very sub-768pix friendly). I would seriously advise against using a swap file. Your swap file should read zero usage. Any Boinc project does a lot of disk reads or writes, so if you do want to run it, I would recommend booting from a bootable SSD if possible (via USB 3.0). It not only speeds up the process, but it will also be much less susceptible to data corruption than SD cards are. And, especially if you're not running an A1/A2 card from Sandisk, you are probably going to experience system lags on the desktop. From HTOP it appears you're only running 3 out of 4 cores for Rosetta. Each one seems to use about 300-400MB of RAM. In most scenarios, I would recommend running only 2 instances of Rosetta on a 1GB system, and 4 with 2GB of available RAM. |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
Those are some good points. Although using a compressing file system like btrfs might also help. After you set up zram, it is normal to see zram swap usage (verify in /proc/swaps) and it isn't slowing things down. Swapping out unused portions of daemons/executables to SD can also help free up valuable space for others, so a few hundred there could also be considered normal. (Then you could always try zswap that could maybe even compress swap on disk as well) I guess I wouldn't use such a small cruncher interactively as a desktop myself, so lagging shouldn't be an issue. |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
Thank you guys for the advice. In the morning, after waking up, only a single task was running. During the night, the RPi downloaded Rosetta for Portable devices. Maybe it is more RAM demanding? I managed to initialize one more task by stopping and restarting the already running task, but not more.

> Thanks for sharing! I like your setup. Are you connecting over wifi? Could you perhaps measure the power draw? You may use either a kill-a-watt or the built-in power gauge inside your laptop if you connect to a high power charger port.

I will try to measure consumption in the evening. I have not tried any fancy turning wifi on and off yet.

> 3 Rosetta tasks in 1GB sounds like pushing the limit a bit, but we'll see how it performs. You may get some more space if you added a little bit more zram and if you select a better algorithm. Basically I would adjust the end of the script like this:
>
>     mem=$(( ($totalmem * 4 / 3 / $cores)* 1024 ))
>     modprobe deflate
>     modprobe zlib
>     modprobe lz4hc_compress
>     core=0
>     while [ $core -lt $cores ]; do
>       echo deflate > /sys/block/zram$core/comp_algorithm ||
>       echo zlib > /sys/block/zram$core/comp_algorithm ||
>       echo lz4hc > /sys/block/zram$core/comp_algorithm ||
>       echo lz4 > /sys/block/zram$core/comp_algorithm # not sure which one this kernel has
>       echo $mem > /sys/block/zram$core/disksize
>       mkswap /dev/zram$core
>       swapon --discard -p 5 /dev/zram$core # reclaim memory better
>       let core=core+1
>     done

I used your code, but it seems like the usage of the zram partitions has even gone down. I am not sure why. I do not quite understand how to properly configure zram.

> You may set the GPU memory as low as you like, because you can run a vncserver in a virtual framebuffer without a real X11. You may need to double check ~/.vnc/xstartup to execute an lxsession or at least a terminal. But why do you need X, if you can use boinccmd, boinctui or even remote RPC?

Thanks, I did not know about boinctui. I have not tried using remote RPC yet, but I will definitely have to learn how to do so.

It seems like vncserver does not start with anything less than 32MB. Even though I am fine using the CLI in most cases, it still feels more comfortable for me to use a GUI. |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
This is my current state:
CLI only, 4MB VRAM
ZRAM algorithm from sangaku
4 WUs running again. Maybe it just needs time? A certain order?...
https://prnt.sc/ryz2ua https://prnt.sc/ryz560

    pi@raspberrypi:~ $ cat /proc/swaps
    Filename      Type        Size      Used    Priority
    /var/swap     file        2024444   0       -2
    /dev/zram0    partition   330040    21776   5
    /dev/zram1    partition   330040    21796   5
    /dev/zram2    partition   330040    21552   5
    /dev/zram3    partition   330040    21660   5

Edit: 30 minutes later, all 4 tasks are still running, but the usage of the ZRAM partitions has increased significantly, to about 200 000 each. Usage of the swap file is still 0. |
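To follow how much swap lands in zram versus on the SD card, the /proc/swaps listing above can be summed up with a few lines of awk. This is only a sketch: the function name `swap_usage_summary` is made up here, and it assumes the five-column layout shown in the post (Filename, Type, Size, Used, Priority) with sizes in kB.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/swaps-style text on stdin and sums the
# "Used" column (4th field) separately for zram devices and everything else.
swap_usage_summary() {
    awk 'NR > 1 {
             if ($1 ~ /^\/dev\/zram/) zram += $4; else other += $4
         }
         END { printf "zram_used_kb=%d other_used_kb=%d\n", zram, other }'
}

# Sample input copied from the listing in the post above.
swap_usage_summary <<'EOF'
Filename Type Size Used Priority
/var/swap file 2024444 0 -2
/dev/zram0 partition 330040 21776 5
/dev/zram1 partition 330040 21796 5
/dev/zram2 partition 330040 21552 5
/dev/zram3 partition 330040 21660 5
EOF
# prints: zram_used_kb=86784 other_used_kb=0
```

On a live system the same function can be fed directly with `swap_usage_summary < /proc/swaps`.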
ProDigit Send message Joined: 6 Dec 18 Posts: 27 Credit: 2,718,346 RAC: 0 |
Not sure how compressing RAM/Swap would work out for you, however it's pretty easy to set up the swap partition on an external drive, so you don't have to worry about SD card wear as much. However, not all Boinc projects use a lot of RAM. You could run 2 threads running Rosetta, and 2 running other project(s). |
PorkyPies Send message Joined: 6 Apr 20 Posts: 45 Credit: 1,650,779 RAC: 0 |
The current tasks seem to go up to 800MB, so you should be able to get away with at least one task on the 1GB Pi’s. I recommend keeping at least 1 core free on the Pi. If you’ve got a 1GB Pi, only run one at a time. If using zram you might be able to run two at a time. Why? zram needs some CPU time to do its job, so keeping cores free means they are available without having to swap the task out. When the tasks start they unzip over 1.5GB of data and copy files to the slot directory. That is a killer for the SD card. I believe the unzip part is multi-threaded so it will use all available cores, but it still takes a few minutes to write to the SD card. MarksRpiCluster |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
> Not sure how compressing RAM/Swap would work out for you, however it's pretty easy to set up the swap partition on an external drive, so you don't have to worry about SD card wear as much.

I have disabled swap (set the size to 0). Everything seems to be running fine.

> However, not all Boinc projects use a lot of RAM.

Thanks for the idea, I added Universe@Home and I might also try to join LHC@Home. There are definitely moments when Rosetta cannot use all the available RAM, simply because its tasks are so big. The question is how to set it up so that it runs each project on 2 cores. I changed the resource share of Universe@Home to 1, and I am testing it...

> I recommend keeping at least 1 core free on the Pi. If you’ve got a 1GB Pi only run one at a time. If using zram you might be able to run two at a time. Why? zram needs some CPU time to do its job so keeping cores free means they are available without having to swap the task out.

I will try to test this over the next few days, and see how long a WU takes with no cores free and with a single core free.

> When the tasks start they unzip over 1.5GB of data and copy files to the slot directory. That is a killer for the SD card. I believe the unzip part is multi-threaded so it will use all available cores but it still takes a few minutes to write to the SD card.

I noticed that if I suspend Rosetta and switch, it takes over 2 minutes to return from Universe to Rosetta. I guess that demonstrates the load...

For anyone interested, here are current measurements of the RPi 3B+: https://www.raspberrypi.org/forums/viewtopic.php?p=1286716 According to these results, a 4-core load plus wifi results in ca. 390mA. I did not test this myself. |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
You probably have "keep tasks in memory on suspend" disabled. This is the right setting in such a constrained system, but it causes tasks to start/stop a bit slower when switching, for example because the executable needs to reload and reconstruct its full state from the last checkpoint. This is normal. Yes, zram, especially with deflate, takes a measurable amount of time, but my heuristic (yet to be proven formally) is that the heap of Rosetta is mostly static, with about half of it active at any given time, so the other half can easily be stored on slower media (be it compressed in RAM or in flash). I use it on low end PC's and find its performance tradeoff acceptable even for desktop use (slow disk swapping vs. more compression). I haven't tried it on an rPi yet, but it should generally not hurt in case a given workload actively uses less than half of the memory. My RAC isn't impacted on a 2core/2GB+zram PC. I would probably still enable at least a little bit of swap on the flash as well (let's say 0.5-1GB). By default, it creates zrams with a higher priority than disk swaps, so the system will only touch the flash if it is really desperate. I've also noticed that work may not be available at any given time instant, so be patient. On my PC's I usually set a buffer of 1-2 days. |
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
Seeing 8 boinc processes in your listing worries me. Maybe some of them got stuck? Anyway, could you perhaps share the results after completion (time, credits, decoy count, WS_MAX)? |
lakotamm Send message Joined: 28 Jun 19 Posts: 22 Credit: 171,192 RAC: 0 |
> Seeing 8 boinc processes in your listing worries me. Maybe some of them got stuck? Anyway, could you perhaps share the results after completion (time, credits, decoy count, WS_MAX)?

It seems like it is just htop showing it in a weird way; top shows it correctly. https://prnt.sc/rzn8ih You can see the stats here: https://boinc.bakerlab.org/rosetta/results.php?hostid=3710842 It seems like I have had 1 error so far. Let's see whether more appear.

The biggest issue right now seems to be the start-up of a WU. If I have 4 WUs running and one finishes, another one won't start by itself. Instead, it stays waiting for memory. I need to manually suspend one of the running WUs, which allows the new WU to start. Afterwards I can resume the suspended WU. I might make a script for this and possibly use an external drive for storing the tasks. Or use only 2-3 cores for Rosetta. However, this would mean leaving 25 percent of CPU time unused. So far I have not found a way of selecting only 2-3 cores for a project and using the rest for another project. |
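The manual workaround described above (suspend one running WU so the waiting one can start, then resume it) could be scripted roughly as follows. This is a sketch under assumptions, not a tested fix: it relies on the `boinccmd --task <project URL> <task name> suspend|resume` form, the project URL is assumed to be https://boinc.bakerlab.org/rosetta/, and the function name `bounce_task`, the variables and the 60-second default pause are all made up for illustration.

```shell
#!/bin/sh
# Assumptions: boinccmd is on PATH and may talk to the local client.
# Both variables can be overridden from the environment.
BOINCCMD="${BOINCCMD:-boinccmd}"
PROJECT_URL="${PROJECT_URL:-https://boinc.bakerlab.org/rosetta/}"

# Hypothetical helper: suspend one running task, wait so a task that is
# "waiting for memory" gets a chance to start, then resume the first one.
#   $1 = task name as shown by the client, $2 = pause in seconds (default 60)
bounce_task() {
    local task_name="$1"
    local pause_s="${2:-60}"
    "$BOINCCMD" --task "$PROJECT_URL" "$task_name" suspend || return 1
    sleep "$pause_s"
    "$BOINCCMD" --task "$PROJECT_URL" "$task_name" resume
}

# Dry run that only prints the two commands instead of calling boinccmd:
#   BOINCCMD=echo; bounce_task some_task_name 0
```

Picking which task to bounce, and noticing that a new one is waiting, is left out here because it depends on parsing the client's task list.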
PorkyPies Send message Joined: 6 Apr 20 Posts: 45 Credit: 1,650,779 RAC: 0 |
> So far I have not found a way of selecting only 2-3 cores for a project and using the rest for another project.

I use an app_config.xml file for that.

    <app_config>
      <project_max_concurrent>3</project_max_concurrent>
    </app_config>

It goes in the BOINC projects bakerlab folder. See here for details on configuring the BOINC client. MarksRpiCluster |
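A small sketch of creating that file from the shell. The function name `write_app_config` is made up, and the real project folder is an assumption: on a Raspbian package install it is typically something like /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta, but check your own installation. The example below writes to a temporary directory so it can be tried safely.

```shell
#!/bin/sh
# Hypothetical helper: write an app_config.xml that limits how many tasks
# of one project run at the same time.
#   $1 = project directory, $2 = maximum concurrent tasks
write_app_config() {
    local project_dir="$1"
    local max_tasks="$2"
    cat > "$project_dir/app_config.xml" <<EOF
<app_config>
    <project_max_concurrent>$max_tasks</project_max_concurrent>
</app_config>
EOF
}

# Demo against a throwaway directory instead of the real BOINC folder.
demo_dir=$(mktemp -d)
write_app_config "$demo_dir" 3
cat "$demo_dir/app_config.xml"

# On the real project folder (assumed path, usually needs root), the client
# has to re-read its config files afterwards, e.g. by restarting it.
```

Limiting Rosetta to 3 concurrent tasks leaves the fourth core for another project, provided that project has work available.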
bkil Send message Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
This looks like the dreaded "finish file present too long" bug, usually caused by I/O contention or slow disk I/O when writing the result files and cleaning up after itself. It's a pity that this was at the very end, after it had successfully completed the computations:

Exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
Run time: 7 hours 27 min 34 sec
CPU time: 7 hours 26 min 11 sec
Peak working set size: 402.13 MB
Peak swap size: 631.23 MB
Peak disk usage: 961.72 MB

Could you perhaps try a development version of BOINC to see if it has been fixed already? I see that the website binaries for PC have already been updated, but ARM seems to be lagging behind (maybe compile from source?). I could think of a quick workaround script that detects when a job is finishing and temporarily suspends all others while it is saving its files, but that would be a major hack. Or perhaps trying a faster SD card or a faster file system might help (maybe not).
- https://boinc.bakerlab.org/rosetta/forum_thread.php?id=12554&postid=93698
- https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13644&postid=92493
|
©2024 University of Washington
https://www.bakerlab.org