Problems and Technical Issues with Rosetta@home

Author	Message
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 9,715,484 RAC: 8,600	Message 91868 - Posted: 4 Mar 2020, 21:54:33 UTC - in response to Message 91867.
>>>> What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here.
>>> Within the stdout file the app reports the number of scenarios that have been processed in the time you make available. Normally, in the standard 8 hour window, you’ll process the data with maybe 40 different starting positions (these are known as decoys for some reason that escapes me). I set my processing window to 6 hours and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished the first run through the data after 10 hours (6 hour preference plus 4 hours allowed overrun), when the watchdog aborts the process.
>> Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.
> You’ll find it in your account > project preferences > target cpu time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope I’d waste slightly less processing time.
Strange, all projects that have given me errors cause them well under the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, but still complete fine.

Bryn Mawr Joined: 26 Dec 18 Posts: 374 Credit: 10,709,223 RAC: 5,616	Message 91871 - Posted: 5 Mar 2020, 11:59:25 UTC - in response to Message 91868.
> Strange, all projects that have given me errors cause them well under the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, but still complete fine.
Rosetta works differently to most in that it does as much processing as it can in the time allowed, rather than taking as much time as a fixed amount of processing needs. To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit set. As the WU closes down at the end of a decoy if it predicts that there is not time to process another before the deadline, this can only really happen if a single decoy runs for more than the 4 hours. I’d guess that this implies the decoy is in a loop, but I cannot say that for certain.
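The behaviour described in this post can be sketched as a toy model. This is not Rosetta's code: the function name, the integer-minute bookkeeping, and the "average decoy time so far" prediction are all assumptions made for illustration. Only the 8 hour default, the 6 hour preference, and the 4 hour overrun come from the thread.

```python
def run_work_unit(decoy_minutes, target_minutes=480, overrun_minutes=240):
    """Toy model of one work unit (hypothetical, not the real scheduler).

    decoy_minutes lists how long each successive decoy would take.
    Returns (outcome, decoys_completed, minutes_used).
    """
    hard_limit = target_minutes + overrun_minutes
    elapsed = 0
    done = 0
    for cost in decoy_minutes:
        # Watchdog: a decoy still running at the hard limit is aborted.
        if elapsed + cost > hard_limit:
            return ("aborted by watchdog", done, hard_limit)
        elapsed += cost
        done += 1
        # Assumed prediction: the next decoy takes about the average so far.
        average = elapsed / done
        if elapsed + average > target_minutes:
            break  # no time for another decoy, close the WU cleanly
    return ("finished", done, elapsed)


# A 6 hour preference with 12 minute decoys stops after 30 decoys.
print(run_work_unit([12] * 100, target_minutes=360))
# A single decoy that never gets under 10 hours is killed by the watchdog.
print(run_work_unit([700], target_minutes=360))
```

Under this model a normal work unit always ends close to the target time, and the watchdog only fires when one decoy alone outlasts the overrun allowance, which is the case the post suspects is a loop.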

Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 9,715,484 RAC: 8,600	Message 91875 - Posted: 5 Mar 2020, 19:01:19 UTC - in response to Message 91871.
> Rosetta works differently to most in that it does as much processing as it can in the time allowed, rather than taking as much time as a fixed amount of processing needs. To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit set. As the WU closes down at the end of a decoy if it predicts that there is not time to process another before the deadline, this can only really happen if a single decoy runs for more than the 4 hours. I’d guess that this implies the decoy is in a loop, but I cannot say that for certain.
My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though it will take 4 days, but it seems to complete at a random point somewhere in there. Often at only "2% completed" it jumps to 100% and says it was successful. I guess it's looking for an answer somewhere in there and finds it early?

Bryn Mawr Joined: 26 Dec 18 Posts: 374 Credit: 10,709,223 RAC: 5,616	Message 91876 - Posted: 5 Mar 2020, 20:43:13 UTC - in response to Message 91875.
> My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines), I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation, the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly as though it will take 4 days, but it seems to complete at a random point somewhere in there, often at only "2% completed", it jumps to 100% and says it was successful, I guess it's looking for an answer somewhere in there and finds it early?
Pass, I’ve never looked at LHC so I wouldn’t know.

Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 9,715,484 RAC: 8,600	Message 91877 - Posted: 5 Mar 2020, 21:01:09 UTC - in response to Message 91876.
> Pass, I’ve never looked at LHC so I wouldn’t know.
They've got Atlas tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64 core task :-)

rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92241 - Posted: 24 Mar 2020, 21:38:52 UTC
All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st gen AppleTV running Linux (OSMC with all GUI etc. disabled) are failing for going over the RAM limit. See here. E.g.:
working set size > client RAM limit: 167.87MB > 167.55MB
Is there something wrong with how the working set size is matched to the amount of available RAM? Or can I limit to the Rosetta Mini application only?

rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92247 - Posted: 25 Mar 2020, 2:07:34 UTC - in response to Message 92241.
Looks like the same is happening for mini tasks, e.g. task 1132535295:
working set size > client RAM limit: 170.39MB > 167.55MB
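To see how far over the limit each of these tasks went, the error line can be pulled apart with a small script. This is only a sketch: the regular expression is written against the two messages quoted in this thread, and the function name is made up for the example.

```python
import re

# Pattern based on the message text quoted above; other client versions
# may word it differently (assumption).
RAM_LIMIT_ERROR = re.compile(
    r"working set size > client RAM limit: ([\d.]+)MB > ([\d.]+)MB"
)


def parse_ram_limit_error(message):
    """Return (working_set_mb, limit_mb, overshoot_mb) or None if no match."""
    match = RAM_LIMIT_ERROR.search(message)
    if match is None:
        return None
    working_set = float(match.group(1))
    limit = float(match.group(2))
    return working_set, limit, round(working_set - limit, 2)


print(parse_ram_limit_error(
    "working set size > client RAM limit: 170.39MB > 167.55MB"
))
```

For the two tasks quoted, the overshoot works out to 0.32 MB and 2.84 MB, so the tasks are only just over what the client allows on this box.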

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 92274 - Posted: 25 Mar 2020, 15:16:04 UTC
rlpm, your host profile shows 256MB of memory, and the "mini" tasks require just as much memory as any others. They seem to have moved the documentation on minimum host requirements on the R@h website, so I'm not finding it at the moment. But the basic guideline is 1GB of memory per active CPU core. I might suggest that you attach the machine to World Community Grid. They have a number of bioscience projects running there, which can generally run in a smaller memory footprint.
Rosetta Moderator: Mod.Sense
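As a rough illustration of the 1GB per active core guideline, the number of tasks a host can reasonably run at once could be estimated as below. The function and its parameter names are hypothetical; the only figure taken from the post is the 1GB per core rule of thumb, and real tasks may need more or less.

```python
def max_concurrent_tasks(ram_gb, cpu_cores, gb_per_task=1.0):
    """Estimate how many tasks fit, assuming gb_per_task GB each (guideline only)."""
    fits_in_memory = int(ram_gb // gb_per_task)
    return min(cpu_cores, fits_in_memory)


# A single-core host with 256MB cannot fit even one task under the guideline.
print(max_concurrent_tasks(ram_gb=0.25, cpu_cores=1))
# A hypothetical 2-core host with 4GB can keep both cores busy.
print(max_concurrent_tasks(ram_gb=4, cpu_cores=2))
```

The estimate should use the memory actually available to BOINC rather than the installed amount, since the operating system and the client's own memory preferences reduce what tasks can use.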

rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92279 - Posted: 25 Mar 2020, 16:40:34 UTC - in response to Message 92274.
Thanks Mod.Sense. It would be nice if BOINC automatically failed early, perhaps even at project attachment, if the host doesn't meet the minimum requirements for any app (RAM, disk, instruction set, OS). I already have my old 1st gen RasPis crunching on TN-Grid (gene sequencing) via BOINC, so I'll do the same with this AppleTV.

bormolino Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0	Message 92292 - Posted: 25 Mar 2020, 20:11:24 UTC
The graphics of the Rosetta 4.07 WU for COVID-19 do not work. The graphics window shows "Stage unknown" and "No shared mem". The graphics of the other WUs are working without any problems.

Sid Celery Joined: 11 Feb 08 Posts: 1982 Credit: 38,461,917 RAC: 15,153	Message 92355 - Posted: 26 Mar 2020, 19:29:06 UTC
I've seen the Rosetta stats for the number of new users who've come on board recently - basically quadrupled with massive throughput, which is great. The number of in-progress tasks is similarly huge - well over a million - more than I can ever remember seeing. A little earlier this afternoon I saw my buffers were smaller than usual and noticed that a few calls for new tasks had brought none down. This is hardly surprising. Before I finally got to this page to mention the task shortage, more had come on stream, which is great. I guess all I'm saying is, especially with all the new users around, if there's an interruption in task supply in the coming days/weeks, we (more accurately, I) need to have a little patience and understanding. It's going to happen and it's surprising it hasn't happened already. Great job on keeping the tasks coming through - thanks.

Shaky Jake Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0	Message 92455 - Posted: 28 Mar 2020, 13:58:41 UTC - in response to Message 80621.
I have an older desktop computer with a Pentium Duo CPU that is having a problem with the COVID-19 workunits. They are erroring out at about 2 min.
EXAMPLE:
Task: 1134452442
Name: 0ef4jx8h_jhr_design1_COVID-19_SAVE_ALL_OUT_903439_1_0
Workunit: 1021756085
Created: 27 Mar 2020, 9:12:21 UTC
Sent: 27 Mar 2020, 9:38:35 UTC
Report deadline: 4 Apr 2020, 9:38:35 UTC
Received: 28 Mar 2020, 12:10:42 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 11 (0x0000000B) Unknown error code
Computer ID: 3794680
Run time: 2 min 15 sec
CPU time: 1 min 59 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 1.87 GFLOPS
Application version: Rosetta v4.08 x86_64-pc-linux-gnu
Stderr output:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0ef4jx8h_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0ef4jx8h_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3902678
Starting watchdog...
Watchdog active.
</stderr_txt>
]]>
I have seen a couple that did complete and were validated.
EXAMPLE:
Task: 1133949909
Name: 0gr1iv8s_jhr_design1_COVID-19_SAVE_ALL_OUT_903456_1_0
Workunit: 1021309240
Created: 26 Mar 2020, 20:05:44 UTC
Sent: 26 Mar 2020, 20:22:20 UTC
Report deadline: 3 Apr 2020, 20:22:20 UTC
Received: 27 Mar 2020, 23:58:09 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 3794680
Run time: 13 hours 53 min 23 sec
CPU time: 10 hours 30 min 46 sec
Validate state: Valid
Credit: 222.11
Device peak FLOPS: 1.87 GFLOPS
Application version: Rosetta v4.07 i686-pc-linux-gnu
Stderr output:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0gr1iv8s_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0gr1iv8s_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3546964
Starting watchdog...
Watchdog active.
======================================================
DONE :: 3 starting structures 37846.6 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: WS_max 9.36336e-97
BOINC :: Watchdog shutting down...
18:53:10 (26863): called boinc_finish(0)
</stderr_txt>
]]>
Should I stop using this computer for this project or let it continue? All of the other workunits appear to process with no problems.
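For anyone comparing several results like these, the timing figures can be extracted from the stderr text with a short script. This is a sketch only: the patterns are written against the output pasted above (the `-cpu_run_time` flag and the "DONE" summary lines), and the function and dictionary key names are made up for the example.

```python
import re


def summarize_stderr(text):
    """Pull the target run time and decoy statistics out of a stderr dump."""
    summary = {}
    target = re.search(r"-cpu_run_time (\d+)", text)
    if target:
        # The flag is in seconds; 28800 seconds is the 8 hour default.
        summary["target_hours"] = int(target.group(1)) / 3600
    cpu = re.search(r"([\d.]+) cpu seconds", text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    if cpu:
        summary["cpu_hours"] = round(float(cpu.group(1)) / 3600, 2)
    if decoys:
        summary["decoys"] = int(decoys.group(1))
        summary["attempts"] = int(decoys.group(2))
    if cpu and decoys and summary["decoys"] > 0:
        per_decoy = float(cpu.group(1)) / summary["decoys"] / 3600
        summary["cpu_hours_per_decoy"] = round(per_decoy, 2)
    return summary


sample = (
    "-nstruct 10000 -cpu_run_time 28800 -watchdog "
    "DONE :: 3 starting structures 37846.6 cpu seconds "
    "This process generated 3 decoys from 3 attempts"
)
print(summarize_stderr(sample))
```

For the successful task above this gives an 8 hour target, about 10.51 CPU hours used, and roughly 3.5 CPU hours per decoy, which agrees with the reported CPU time of 10 hours 30 min 46 sec.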

IBM01902 Joined: 23 Mar 20 Posts: 3 Credit: 43,044 RAC: 0	Message 92460 - Posted: 28 Mar 2020, 14:40:07 UTC - in response to Message 92455.
I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.

rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92464 - Posted: 28 Mar 2020, 15:16:30 UTC - in response to Message 92460.
> <message> process got signal 11 </message>
The process is crashing. More info: signal 11 is SIGSEGV (default action: core dump), meaning an invalid memory reference. The people with access to the code will have to look into it. I don't know whether there are any crash reports (stack traces, etc.) that you can pull to provide more information to them.
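The signal number in that message can be turned into a name with Python's standard `signal` module, which is a quick way to triage a batch of failed results. A minimal sketch follows; the function name is made up, and signal numbering is platform dependent, so the lookup should be run on the same kind of system that produced the error.

```python
import re
import signal


def crash_signal_name(stderr_text):
    """Return the name of the signal reported in a BOINC stderr dump, if any."""
    match = re.search(r"process got signal (\d+)", stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        return signal.Signals(number).name
    except ValueError:
        # Number not defined on this platform.
        return f"signal {number}"


print(crash_signal_name("<message> process got signal 11 </message>"))
```

This only names the signal. It says nothing about why the invalid memory access happened, which still needs a stack trace or the project's own debugging.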

Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 9,715,484 RAC: 8,600	Message 92468 - Posted: 28 Mar 2020, 16:21:24 UTC - in response to Message 92460.
> I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me.
Working ok for me on all my computers. My oldest is an Intel Q8400 (about 10 years old). It's a pity you can't select which sub projects to run in the Rosetta preferences. Most projects allow you to pick which ones, so you can block the ones that don't work on your machines. I guess as long as some of them work, you should keep going. Sending one back with an error just means the server will try someone else.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 92474 - Posted: 28 Mar 2020, 17:18:13 UTC - in response to Message 92455.
@Shaky Jake. I see you have two machines. It appears the one with 2 CPUs and 2GB of memory is where the errors are occurring the most (the other machine has 2 CPUs and 4GB). This is consistent with what I have gleaned from others as well. I believe the Project Team will be tagging the COVID tasks as requiring more memory in the coming days. This should help things run smoother going forward.
Rosetta Moderator: Mod.Sense

Shaky Jake Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0	Message 92489 - Posted: 28 Mar 2020, 21:01:16 UTC - in response to Message 92455. Last modified: 28 Mar 2020, 21:10:21 UTC
I found the problem. I am short 0.1 GB of memory, so when 2 COVID-19 WUs try to run, one of them will fail due to lack of memory. I have ordered additional memory. Until it arrives I have set the computer to run only 1 WU at a time. Thanks Mod.Sense.
Everything seems to be running OK using only 1 core. I am going to upgrade to 4GB of memory. I think that will solve the problem. My other computer is a laptop with 2 cores and 4GB memory and it has had no problems.
Shaky Jake

rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0	Message 92490 - Posted: 28 Mar 2020, 21:22:44 UTC - in response to Message 92489. Last modified: 28 Mar 2020, 21:28:47 UTC
The binaries should check that there's enough memory for the WU, both at process start time and at run time, by checking the results of malloc etc. Since the process on your computer hit a segfault, it may have been due to a memory allocation failing without the software checking the result of the allocation. There must be some checking in the 32-bit (Linux) version of the Rosetta & Rosetta Mini binaries, since I've encountered this error message on an older box with only 256MB of memory:
working set size > client RAM limit: 180.00MB > 179.51MB
(But it would be nice to have the check happen ahead of time -- before sending the WU to the computer.)

bormolino Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0	Message 92491 - Posted: 28 Mar 2020, 21:24:50 UTC
The graphics of the Rosetta 4.07 WU for COVID-19 do not work. The graphics window shows "Stage unknown" and "No shared mem". The graphics of the other WUs are working without any problems.

EHM-1 Joined: 21 Mar 20 Posts: 23 Credit: 183,782 RAC: 0	Message 92534 - Posted: 29 Mar 2020, 15:37:52 UTC Last modified: 29 Mar 2020, 15:41:28 UTC
Hello all - Longtime SETI@Home user here, new to Rosetta. Hope I'm posting in the right place; please advise me if not. I attached several days ago, and the screensaver was displaying what I would expect for processing until a couple of days ago. Since at least yesterday morning (midday Mar 28 UT), the processing screen displays what I would call a blank template, with no indication that anything is being processed. See image below. Any ideas? Anyone else encountering this? I could find no mention of anything similar in the forums. Thanks in advance for any help.
Eric
PS - Just after posting, I now see that bormolino might be reporting the same issue just above my post.