Problems and Technical Issues with Rosetta@home

Author	Message
Bryn Mawr Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 5	Message 99786 - Posted: 27 Nov 2020, 12:34:00 UTC - in response to Message 99785.
"Hi all, Can someone tell me if I should abort the below tasks? It's a bit annoying to find your CPU crunching nothing for two days. Thank you!"
Have you any other Rosetta jobs running correctly alongside those? Can you see the file names of the WUs to see if they’re all the same type? Could you suspend (some of) those tasks and see if the replacement tasks run OK? That should tell you whether it’s the tasks that are at fault or a problem with your setup. ID: 99786

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 99788 - Posted: 27 Nov 2020, 13:20:15 UTC - in response to Message 99785.
Those look badly broken. They’re running but making no progress, which is why the estimated remaining time has grown so large. Kill them. Is this your Intel machine? It is returning nothing but errors lately. There seems to be something seriously wrong with it. ID: 99788

nikolce Joined: 28 Apr 07 Posts: 2 Credit: 2,002,356 RAC: 0	Message 99792 - Posted: 27 Nov 2020, 21:01:10 UTC
Thanks for bringing the errors to my attention. Apparently from that point on it has been returning nothing but errors. I caught it today since my RAC dropped a little. As recommended I killed the tasks and it started to drop on the next tasks within minutes. I restarted the PC and tested the CPU and memory with Prime95 on smallest and large FFTs for 15 minutes each (I know it should be longer), with no errors. Meanwhile the PC was not showing any signs of instability. I've resumed the project and the tasks have been doing fine for almost an hour now. I'll keep a close eye on it over the next couple of days. I thought I'd have to retire the old bugger. Thank you! ID: 99792

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 99795 - Posted: 27 Nov 2020, 21:26:35 UTC - in response to Message 99792.
"Thanks for bringing the errors to my attention. Apparently from that point on it has been returning nothing but errors. I caught it today since my RAC dropped a little. As recommended I killed the tasks and it started to drop on the next tasks within minutes."
If it occurs again, try rebooting before aborting the Tasks.
Grant Darwin NT ID: 99795

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 99885 - Posted: 3 Dec 2020, 16:04:05 UTC
stub_cyc_target tasks completing in anywhere from under 2 hours to over 19 (against a default target of 8). (Not a problem; just an observation…) ID: 99885

lazyacevw Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0	Message 99910 - Posted: 4 Dec 2020, 18:08:24 UTC Last modified: 4 Dec 2020, 18:11:05 UTC
Can anyone comment on the level of compression the tasks are sent with and, separately, the level of compression that is applied before submitting completed workloads? Anyone know the type of compression? lzma2? bzip2?
The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month. I've been running pretty much the same batch of clients since April and the internet connection is dedicated solely to R@H, so I can only assume that work units have become more complex and larger. If I don't have any data hiccups, I average around 150,000 credits. I've started to shut down a few clients in order to just stay under my data cap. I'm not sure if the usage is purely work units or if it is the ancillary files R@H downloads (like the database_357d5d93529_n_methyl files) that they use to set up different variables and references. I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines.
I'd really like to bring a few dozen more cores online but I'm in a holding pattern until my data usage goes down. When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer, plus I'm sure it creates unnecessary network oscillations in the R@H distribution server. There have been a few days where I forgot to connect and kept breaking the 2 day crunch time limit. I'm sure that is inefficient for the project as a whole.
Would the devs consider upping the compression levels? It does slightly increase client and server overhead but most computers are more than capable of spending an extra minute or two to accomplish more compression. It might help people bring more systems online. ID: 99910

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 99916 - Posted: 4 Dec 2020, 19:27:10 UTC - in response to Message 99910. Last modified: 4 Dec 2020, 19:38:47 UTC
In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names. The big (500 MB Zip) database and the applications can go months between updates, so even though they’re relatively large they shouldn’t be affecting your recent usage.
You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time. As a single data point, my current usage seems to be averaging around 4 MB per task.
You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate. ID: 99916
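For anyone who finds daily_xfer_history.xml hard to read, the following is a minimal Python sketch that sums transfers per calendar month. It assumes the file consists of `<dx>` records holding `<when>` (days since 1 Jan 1970), `<up>` and `<down>` (bytes); that layout is an assumption and should be checked against your own file. The function name `monthly_totals` and the sample data are hypothetical.

```python
import datetime as dt
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample in the assumed layout; real files may differ.
SAMPLE = """<daily_xfers>
  <dx><when>18596</when><up>1000000.000000</up><down>200000.000000</down></dx>
  <dx><when>18597</when><up>1500000.000000</up><down>300000.000000</down></dx>
  <dx><when>18598</when><up>2500000.000000</up><down>400000.000000</down></dx>
</daily_xfers>"""


def monthly_totals(xml_text):
    """Return {"YYYY-MM": {"up": bytes, "down": bytes}} summed per month."""
    root = ET.fromstring(xml_text)
    totals = defaultdict(lambda: {"up": 0.0, "down": 0.0})
    for dx in root.iter("dx"):
        # Assumption: <when> is a day count since the Unix epoch.
        day = dt.date(1970, 1, 1) + dt.timedelta(days=int(dx.findtext("when")))
        month = day.strftime("%Y-%m")
        totals[month]["up"] += float(dx.findtext("up", "0"))
        totals[month]["down"] += float(dx.findtext("down", "0"))
    return dict(totals)


if __name__ == "__main__":
    for month, t in sorted(monthly_totals(SAMPLE).items()):
        print(f"{month}: up {t['up'] / 1e6:.1f} MB, down {t['down'] / 1e6:.1f} MB")
```

To use it on a real machine, read the contents of daily_xfer_history.xml from the BOINC data directory and pass the text to `monthly_totals` in place of `SAMPLE`.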

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0	Message 99920 - Posted: 4 Dec 2020, 22:52:54 UTC - in response to Message 99916. Last modified: 4 Dec 2020, 23:12:00 UTC
"You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate."
Someone would need to try that out to see if it would make any significant difference; I suspect the difference would be minimal (if any). The downloads of Tasks and their support files are rather small in size. It's the returned result files that can be extremely large (I've noticed a couple over 900MB in size). Running the Tasks for longer will result in a larger result file. So instead of returning 2 smaller files, you're returning one larger one. The only saving is fewer downloaded files, which as I mentioned are very small in comparison, resulting in little if any reduction in data transfer.
"The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month."
Which is around the time you brought some new systems online. A 4 core/thread, 8c/t & of course the 64c/t Threadripper system. They would all have a significant impact on the amount of results you return.
"When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer"
Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot. If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from that system for all of the others.
"I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines."
I suspect Linux has a similar option to Windows Update that allows systems to check for their updates on the local network before checking for them over the internet.
"Would the devs consider upping the compression levels? It does slightly increase client and server overhead but most computers are more than capable of spending an extra minute or two to accomplish more compression."
The problem is that you can only compress data up to a certain point, after which no further compression is possible. And spending 2, 3 or 4 times as long compressing (and uncompressing) the data for a 3-5% saving in file size is really not an option.
Grant Darwin NT ID: 99920
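The time-versus-size trade-off described above can be measured rather than guessed. The sketch below uses only Python's standard zlib, bz2 and lzma modules on a synthetic, text-like buffer. The sample data is a hypothetical stand-in and not a real Rosetta result file, so the numbers it prints say nothing about what the project would actually gain.

```python
import bz2
import lzma
import time
import zlib


def make_sample(n_lines=20000):
    """Build a repetitive, text-like buffer (hypothetical stand-in data)."""
    lines = [
        f"ATOM {i:6d} CA ALA A{i % 999:4d} {i * 0.137 % 100:8.3f} {i * 0.291 % 100:8.3f}\n"
        for i in range(n_lines)
    ]
    return "".join(lines).encode("ascii")


def compare(data):
    """Return {codec name: (compressed size in bytes, seconds spent compressing)}."""
    codecs = {
        "zlib-1": (lambda d: zlib.compress(d, 1), zlib.decompress),
        "zlib-6": (lambda d: zlib.compress(d, 6), zlib.decompress),
        "zlib-9": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bz2-9": (lambda d: bz2.compress(d, 9), bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must reproduce the input exactly.
        assert decompress(packed) == data
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    sample = make_sample()
    print(f"original: {len(sample)} bytes")
    for name, (size, seconds) in compare(sample).items():
        print(f"{name:7s} {size:9d} bytes  {seconds * 1000:8.1f} ms")
```

Running it against a copy of an actual upload file (read in binary mode and passed to `compare`) would give a more meaningful picture of whether a higher level is worth the extra CPU time.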

MarkJ Joined: 28 Mar 20 Posts: 72 Credit: 25,292,180 RAC: 0	Message 99924 - Posted: 5 Dec 2020, 3:46:24 UTC
You could put a proxy server on your network and that can save on duplicate transfers. Unfortunately, since Rosetta switched to HTTPS it won’t help with project data files. I find the best benefit is for OS updates: the first machine downloads them, and subsequent requests are served from the proxy server. It also works well with Einstein and their locality scheduler.
BOINC blog ID: 99924

lazyacevw Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0	Message 99935 - Posted: 6 Dec 2020, 13:10:10 UTC - in response to Message 99924. Last modified: 6 Dec 2020, 13:10:38 UTC
"In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names."
Thanks. I too have seen gz and zip files in project slot directories along with a lot of uncompressed files. It just isn't clear which files were generated, which were decompressed from files received, and which were created when all of the work is said and done. Essentially, I can only easily see the middle part of the data processing.
"You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time."
I took a look at the file but I couldn't make much sense of it.
"Which is around the time you brought some new systems online. A 4 core/thread, 8c/t & of course the 64c/t Threadripper system. They would all have a significant impact on the amount of results you return."
It does appear from my profile that I brought new systems online recently, but in actuality I just reinstalled the OS on most of my systems when R@H ran out of tasks. That TR has been running every day since early April.
"Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot. If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from that system for all of the others."
I'm using switches connected to a wired personal hotspot. I tried using a USB-C to Ethernet adapter but my Samsung S8 doesn't appear to support it. Internet sharing isn't as easy to set up as it is with Windows but I will look into it. My clients do not have wireless adapters either. I might see about using a Windows laptop and have it connect wirelessly to my phone and then share its connection over LAN. So, I do have a few options.
"You could put a proxy server on your network and that can save on duplicate transfers"
I really should set up a pfsense/squid box for the network, partly for extra security and partly for the network caching feature. It would work as long as the Linux updater doesn't use HTTPS. I could probably also set up a local update mirror. Thanks for the tip! ID: 99935

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 99948 - Posted: 7 Dec 2020, 5:08:29 UTC
A task running for over 12 hours so far, even though I've selected a run length of 8 hours: 3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311
The estimated time remaining is INCREASING, not decreasing. It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one. Is something wrong with this task? Should I abort it? ID: 99948

Speedy Joined: 25 Sep 05 Posts: 163 Credit: 841,187 RAC: 0	Message 99949 - Posted: 7 Dec 2020, 7:03:35 UTC - in response to Message 99948.
"A task running for over 12 hours so far, even though I've selected a run length of 8 hours: 3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311 The estimated time remaining is INCREASING, not decreasing. It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one. Is something wrong with this task? Should I abort it?"
Have you tried exiting BOINC and opening it again, or restarting your computer/laptop? If after a restart it starts back at, for example, 10 hours, let it run and see if it will finish within the 12 hours. If it doesn't and keeps running past 13 hours, feel free to abort.
Have a crunching good day!! ID: 99949

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 99950 - Posted: 7 Dec 2020, 10:56:28 UTC - in response to Message 99948.
I had some stub_cyc_target tasks complete in under 2 hours; some after more than 19. I aborted one that had been running for 1½ days, as even though it still seemed to be running, its progress percentage was increasing so slowly that it didn’t seem likely it would reach 100% in any reasonable amount of time. Maybe leave yours for a few more hours, and kill it if it gets to a full day without completing?
Once a task has overrun, its remaining time estimate becomes meaningless, as BOINC has no way of knowing when it will finish. And sometimes tasks can get in a state where they are running but not reporting progress, so BOINC estimates progress using elapsed time towards a target perpetually 10 minutes in the future, meaning the value only asymptotically approaches 100%. ID: 99950
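The asymptotic behaviour described above can be illustrated with a toy model. This is not BOINC's actual code: the function name `estimated_fraction` and the formula elapsed / (elapsed + 600 s) are assumptions chosen only to match the description of a target that always sits 10 minutes ahead of the elapsed time.

```python
def estimated_fraction(elapsed_s, lookahead_s=600.0):
    """Toy progress estimate: elapsed time over a target that is always
    lookahead_s seconds in the future (assumed model, not BOINC source)."""
    return elapsed_s / (elapsed_s + lookahead_s)


if __name__ == "__main__":
    # The estimate keeps rising but never reaches 100%.
    for hours in (1, 8, 15.5, 24, 36):
        fraction = estimated_fraction(hours * 3600)
        print(f"{hours:>5} h elapsed -> {fraction * 100:.3f}% done")
```

Under this model the displayed percentage always increases yet stays below 100%, which is consistent with a task that appears to be creeping forward without ever finishing.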

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 99952 - Posted: 7 Dec 2020, 13:15:50 UTC - in response to Message 99949.
"A task running for over 12 hours so far, even though I've selected a run length of 8 hours: 3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311 The estimated time remaining is INCREASING, not decreasing. It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one. Is something wrong with this task? Should I abort it?"
"Have you tried exiting BOINC and opening it again, or restarting your computer/laptop? If after a restart it starts back at, for example, 10 hours, let it run and see if it will finish within the 12 hours. If it doesn't and keeps running past 13 hours, feel free to abort."
I let it go overnight. It finally finished after 15.5 hours. ID: 99952

Joe Joined: 24 Nov 17 Posts: 1 Credit: 3,817,457 RAC: 0	Message 100051 - Posted: 16 Dec 2020, 4:40:18 UTC
I've been having this issue with my FreeBSD machine with BOINC installed on it always failing to compute jobs: https://kitsunehosting.net/nextcloud/index.php/s/rysi6tY6TE33oZr/preview
Now that I'm looking at it, I'm pretty sure it never completed a job. Is there anything I should look into? Maybe some logs or something, to find out what's failing and whether I can fix it? Thanks so much for reading. ID: 100051

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 100055 - Posted: 16 Dec 2020, 10:51:45 UTC - in response to Message 100051. Last modified: 16 Dec 2020, 11:09:36 UTC
Your machine does have some credit, so it’s obviously succeeded in running something at some point – just not recently.
The Exec format error failures I assume are because the system is unable to run Rosetta’s Linux application. (Note it was trying to run the 32-bit application, which may not be appropriate for a 64-bit system. It’s also possible that older application versions were able to run, but recent updates have broken something. I don’t know enough about BSD’s Linux capability to be able to diagnose further. Rosetta@home does not provide a native BSD application. One user did report success running the 64-bit Rosetta Linux application on FreeBSD recently.)
The others failed to download some of their input files. Could something be blocking downloads? ID: 100055

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 100081 - Posted: 20 Dec 2020, 22:18:07 UTC Last modified: 20 Dec 2020, 22:21:55 UTC
Several of a new batch of horns5 tasks failing with access violations shortly after startup ID: 100081

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 100082 - Posted: 20 Dec 2020, 23:30:58 UTC - in response to Message 100081. Last modified: 20 Dec 2020, 23:35:56 UTC
"Several of a new batch of horns5 tasks failing with access violations shortly after startup"
Maybe limited to Windows? I am running seven now (1 to 6 hours) on Ubuntu 18.04.5 (Ryzen 3900X) without a problem.
PS - The sizes are quite reasonable, being less than 500 MB. That indicates they are not a new project, but a continuation of horns4. It is interesting to speculate what that might be... ID: 100082

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 100083 - Posted: 20 Dec 2020, 23:47:33 UTC - in response to Message 100081. Last modified: 21 Dec 2020, 0:03:04 UTC
"Several of a new batch of horns5 tasks failing with access violations shortly after startup"
I looked at the stderr log for several of your failed tasks. About two thirds of them failed while trying to access location 0, and I can't read the dump well enough to tell what instruction was trying to access that location. I'll have to leave the problem to someone who can read dumps better than I can.
I did notice that you are using Windows 7, rather than the newer Windows 10. The only recent horns5 task I spotted for my Windows 10 computer completed and validated. Also, I noticed that all of your computers run BOINC 7.16.5; my computer runs 7.16.11. If no one else helps, you could try updating BOINC on one of your computers showing the problem, and Windows on another, to see if either of these older versions causes the problem. ID: 100083

Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 100084 - Posted: 21 Dec 2020, 0:12:43 UTC
Also noticed the graphics app does not work (either disappears immediately or hangs) with those horns5 tasks that do manage to run ID: 100084