Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 91767 - Posted: 24 Feb 2020, 14:03:26 UTC - in response to Message 91718.  

When this has happened to me it has self-corrected after about an hour - give it time, then go for another update and you should get some new tasks.

Do you mean completely self-corrected, or self-corrected after you aborted the task? If I don't abort the task, I've seen it still stuck after about 18 hours. It just keeps retrying and failing to download about every 3 hours.

I abort the transfer (not the task), and normally that is enough to allow downloads to restart when I do a project update.

On the odd occasion, however, it has given the message you reported after the update. In that case I leave it an hour and redo the update; on every occasion so far the update has succeeded in bringing down new WUs.

For some reason, aborting the transfer first, before aborting the task, didn't work for me. Aborting the task first, then the download, was always much more successful.
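If you find yourself doing this a lot, the same abort-and-update sequence can be scripted against the client with boinccmd. A minimal sketch, assuming boinccmd is on your PATH and is allowed to talk to the local client; the stalled filename below is a placeholder:

# Hypothetical recovery script for a stuck Rosetta@home download, driven
# through boinccmd (shipped with the BOINC client). The filename below is
# a placeholder; list the transfers first to find the real one.
import subprocess

PROJECT = "https://boinc.bakerlab.org/rosetta/"

def boinccmd(*args):
    # Run one boinccmd subcommand and return its output.
    out = subprocess.run(["boinccmd", *args],
                         capture_output=True, text=True, check=True)
    return out.stdout

print(boinccmd("--get_file_transfers"))            # spot the stalled file
boinccmd("--file_transfer", PROJECT,
         "example_input.zip", "abort")             # abort the transfer
boinccmd("--project", PROJECT, "update")           # then update the project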
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91769 - Posted: 24 Feb 2020, 16:18:16 UTC - in response to Message 91767.  


For some reason, aborting the transfer first, before aborting the task, didn't work for me. Aborting the task first, then the download, was always much more successful.


Odd; when I abort the transfer, the task disappears about 10 seconds later, so I don’t need to abort it.

I’m on BOINC 7.16.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91773 - Posted: 24 Feb 2020, 19:12:47 UTC - in response to Message 91769.  


Odd; when I abort the transfer, the task disappears about 10 seconds later, so I don’t need to abort it.


That would be my expectation of the BOINC Manager. When one of the files required by a task permanently fails to download, the task is aborted. There will also be times when several of the tasks you are downloading depend upon the same file, and all of them abort when that file's transfer fails.
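As a toy model of that cascade (illustrative only, not BOINC source): several queued tasks can share one input file, so one permanently failed download takes them all down together.

# Toy model of the cascade described above, not actual BOINC code:
# tasks that depend on a file whose download permanently fails are
# aborted together.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    input_files: set = field(default_factory=set)
    state: str = "downloading"

def on_download_failed(tasks, failed_file):
    # Abort every not-yet-running task that needs the failed file.
    for task in tasks:
        if failed_file in task.input_files and task.state == "downloading":
            task.state = "aborted"

tasks = [
    Task("rb_job_01", {"common_db.zip", "job_a.zip"}),   # hypothetical names
    Task("rb_job_02", {"common_db.zip", "job_b.zip"}),
]
on_download_failed(tasks, "common_db.zip")
print([(t.name, t.state) for t in tasks])   # both end up aborted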
Rosetta Moderator: Mod.Sense
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91774 - Posted: 24 Feb 2020, 19:19:29 UTC - in response to Message 91773.  
Last modified: 24 Feb 2020, 19:20:15 UTC


That would be my expectation of the BOINC Manager. When one of the files required by a task permanently fails to download, the task is aborted. There will also be times when several of the tasks you are downloading depend upon the same file, and all of them abort when that file's transfer fails.


It's a pity the BOINC Manager doesn't seem to notice for hours. I get the log message "some download is stalled" (which prevents any more tasks being downloaded from that project) up to an hour or so after I've cancelled both the download and the task.

1) Rosetta needs to stop its server stalling downloads.

2) BOINC needs to fix the client so it doesn't get upset just because one file failed, and then fail to notice that the user has told it to give up.
amgthis

Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 91836 - Posted: 2 Mar 2020, 22:05:46 UTC - in response to Message 91700.  

I haven't seen the small-file stall on download lately, but I did have this one hang a couple of days ago, more than once.

rb_02_20_16480_16303_ab_t000_h001_robetta.zip 3.06 KiB

I'm still not convinced it's a by-product of the super-slow DSL on my end.

In any case, I think it's cleared up.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91838 - Posted: 2 Mar 2020, 22:18:57 UTC - in response to Message 91836.  

I haven't seen the small-file stall on download lately, but I did have this one hang a couple of days ago, more than once.

rb_02_20_16480_16303_ab_t000_h001_robetta.zip 3.06 KiB

I'm still not convinced it's a by-product of the super-slow DSL on my end.

In any case, I think it's cleared up.


I doubt it was your connection; I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow. Pity nobody admitted what they did wrong!
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91851 - Posted: 3 Mar 2020, 20:20:17 UTC - in response to Message 91838.  

I doubt it was your connection; I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow. Pity nobody admitted what they did wrong!


I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job had a single decoy that was still running 4 hours after my 6-hour limit.

So far there have been 11 such jobs across 2 machines, 3 on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91852 - Posted: 3 Mar 2020, 20:39:52 UTC - in response to Message 91851.  

I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job had a single decoy that was still running 4 hours after my 6-hour limit.

So far there have been 11 such jobs across 2 machines, 3 on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278


What is a decoy? All my machines complete a task in 7.5 to 8.5 hours, and they're not particularly fast machines; one is 12 years old. I've seen no errors in over a week. There must be a pattern here.
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91855 - Posted: 3 Mar 2020, 23:46:07 UTC - in response to Message 91852.  

What is a decoy? All my machines complete a task in 7.5 to 8.5 hours, and they're not particularly fast machines; one is 12 years old. I've seen no errors in over a week. There must be a pattern here.


Within the stdout file the app reports the number of scenarios processed in the time you make available. Normally, in the standard 8-hour window, you’ll process the data from maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished even their first run through the data after 10 hours (the 6-hour preference plus the 4 hours of allowed overrun), at which point the watchdog aborts the process.
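In rough pseudocode, that scheme works out as below; a sketch under assumed numbers (a hypothetical average decoy cost), not the app's real logic:

# Sketch of the run-to-time-target scheme described above (illustrative,
# not the Rosetta app's source). A decoy always runs to completion; a new
# one starts only if one more is predicted to fit the target CPU time.
import random

TARGET_CPU_HOURS = 6.0           # the "target CPU time" preference

def simulate_task(avg_decoy_hours=0.2):   # hypothetical average decoy cost
    cpu_used, decoys = 0.0, 0
    while True:
        cpu_used += random.expovariate(1.0 / avg_decoy_hours)  # one decoy
        decoys += 1
        predicted_next = cpu_used / decoys
        if cpu_used + predicted_next > TARGET_CPU_HOURS:
            break                # no room for another decoy: close down
    return decoys, cpu_used

print(simulate_task())           # typically around 30 decoys in 6 hours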
robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 91862 - Posted: 4 Mar 2020, 13:39:06 UTC
Last modified: 4 Mar 2020, 13:44:58 UTC

Another small-file download problem, which is currently blocking me from getting any more R@H tasks:

10v3nmgb_c14394_10mer_gb_000420_SAVE_ALL_OUT_896889_53_1

https://boinc.bakerlab.org/rosetta/result.php?resultid=1125377637

A wingmate timed out and therefore may have had the same problem:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1012342810

Relevant lines from the log:

3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error
3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site
3/4/2020 4:05:15 AM | | Internet access OK - project servers may be temporarily down.
3/4/2020 4:42:49 AM | Rosetta@home | Sending scheduler request: To report completed tasks.
3/4/2020 4:42:49 AM | Rosetta@home | Reporting 2 completed tasks
3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled
3/4/2020 4:42:51 AM | Rosetta@home | Scheduler request completed

Does the file that fails to download even exist on the server?

The expected size of the file is only 3.23 KB.

Could the server have problems serving files below a certain size?

DSL speed here is not especially high or low.

Enough other BOINC projects are selected on this computer to keep it busy.
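As an aside, the "Backing off 04:50:13" line above is the client's randomized retry delay, which grows with each failure. A rough sketch of that style of policy; the constants are illustrative guesses, not BOINC's exact values:

# Sketch of a BOINC-style randomized exponential backoff for a failing
# download. Constants are illustrative, not BOINC's actual numbers.
import random

MIN_DELAY_S = 60               # first retry after about a minute
MAX_DELAY_S = 5 * 3600         # cap, in the ballpark of the 04:50:13 above

def backoff_delay(failures):
    # Double the delay per failure, capped, with jitter to spread load.
    base = min(MIN_DELAY_S * 2 ** failures, MAX_DELAY_S)
    return random.uniform(base / 2, base)

for n in range(8):
    print(f"failure {n}: retry in ~{backoff_delay(n) / 60:.1f} min")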
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91865 - Posted: 4 Mar 2020, 18:46:43 UTC - in response to Message 91855.  

Within the stdout file the app reports the number of scenarios processed in the time you make available. Normally, in the standard 8-hour window, you’ll process the data from maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished even their first run through the data after 10 hours (the 6-hour preference plus the 4 hours of allowed overrun), at which point the watchdog aborts the process.


Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91866 - Posted: 4 Mar 2020, 18:48:35 UTC - in response to Message 91862.  

Another small-file download problem, which is currently blocking me from getting any more R@H tasks [...] The expected size of the file is only 3.23 KB. Could the server have problems serving files below a certain size?


It's always the little 3 kB files that get stuck for me. That suggests either a different server produces those and is misbehaving, or the files themselves are corrupt for some reason. They don't even download using a web browser.
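One way to check that by hand is to fetch the file outside BOINC. A small sketch; the URL is a guess at the project's download area, so take the real one from the event log or client_state.xml:

# Fetch a suspect input file directly, bypassing BOINC, to see whether
# the server serves it at all. The URL below is hypothetical.
import urllib.request

url = ("https://boinc.bakerlab.org/rosetta/download/"
       "rb_02_20_16480_16303_ab_t000_h001_robetta.zip")   # placeholder path

try:
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    print(f"OK: HTTP {resp.status}, {len(data)} bytes received")
except Exception as exc:     # HTTP errors, timeouts, DNS failures, ...
    print(f"Download failed: {exc}")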
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91867 - Posted: 4 Mar 2020, 21:35:23 UTC - in response to Message 91865.  

Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.


You’ll find it in your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91868 - Posted: 4 Mar 2020, 21:54:33 UTC - in response to Message 91867.  

You’ll find it in your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.


Strange; all the projects that have given me errors produced them well under the normal processing time. And some projects, like LHC, seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91871 - Posted: 5 Mar 2020, 11:59:25 UTC - in response to Message 91868.  



Strange; all the projects that have given me errors produced them well under the normal processing time. And some projects, like LHC, seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.


Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the set time limit. Because a WU normally shuts down at the end of a decoy, once it predicts there is not enough time to process another before the deadline, this can really only happen when a single decoy runs for more than 4 hours. I’d guess that implies the decoy is stuck in a loop, but I can’t say that for certain.
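As a sketch, that watchdog rule amounts to the check below (numbers taken from this thread; illustrative, not the app's source):

# Sketch of the watchdog rule described above: abort a work unit still
# running 4 hours past the target CPU time. Illustrative only.
TARGET_HOURS = 6.0     # the user's "target CPU time" preference
GRACE_HOURS = 4.0      # allowed overrun before the watchdog fires

def watchdog_should_abort(cpu_hours_used):
    return cpu_hours_used > TARGET_HOURS + GRACE_HOURS

print(watchdog_should_abort(9.5))    # False: still inside the overrun
print(watchdog_should_abort(10.2))   # True: aborted, reported as an error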
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91875 - Posted: 5 Mar 2020, 19:01:19 UTC - in response to Message 91871.  

Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the set time limit. Because a WU normally shuts down at the end of a decoy, once it predicts there is not enough time to process another before the deadline, this can really only happen when a single decoy runs for more than 4 hours. I’d guess that implies the decoy is stuck in a loop, but I can’t say that for certain.


My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation, so perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point, often at only "2% completed"; it jumps to 100% and reports success. I guess it's looking for an answer in there and sometimes finds it early?
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91876 - Posted: 5 Mar 2020, 20:43:13 UTC - in response to Message 91875.  

My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation, so perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point, often at only "2% completed"; it jumps to 100% and reports success. I guess it's looking for an answer in there and sometimes finds it early?


Pass; I’ve never looked at LHC, so I wouldn’t know.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91877 - Posted: 5 Mar 2020, 21:01:09 UTC - in response to Message 91876.  

Pass; I’ve never looked at LHC, so I wouldn’t know.


They've got ATLAS tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64-core task :-)
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92241 - Posted: 24 Mar 2020, 21:38:52 UTC

All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st-gen Apple TV running Linux (OSMC with the GUI etc. disabled) are failing for going over the RAM limit, e.g.:
working set size > client RAM limit: 167.87MB > 167.55MB

Is there something wrong with how the working set size is matched to the amount of available RAM?

Or can I restrict my host to the Rosetta Mini application only?
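For what it's worth, that message looks like the client comparing the task's working set against the share of physical RAM the preferences let BOINC use. A sketch with assumed numbers; the percentage is a guess chosen to reproduce the 167.55 MB limit on a 256 MB Apple TV:

# Sketch of the check behind "working set size > client RAM limit"
# (illustrative, not BOINC source). The limit is physical RAM times the
# "use at most X% of memory" preference; the 65.45% here is a guess.
TOTAL_RAM_MB = 256.0        # 1st-gen Apple TV
MEM_PREF_PCT = 65.45        # hypothetical memory-usage preference

ram_limit_mb = TOTAL_RAM_MB * MEM_PREF_PCT / 100.0   # ~167.55 MB

def exceeds_limit(working_set_mb):
    return working_set_mb > ram_limit_mb

print(exceeds_limit(167.87))   # True: matches the failing tasks above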
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92247 - Posted: 25 Mar 2020, 2:07:34 UTC - in response to Message 92241.  

Looks like the same thing is happening for Rosetta Mini tasks, e.g. task 1132535295:
working set size > client RAM limit: 170.39MB > 167.55MB