Problems and Technical Issues with Rosetta@home

Author	Message
mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 0	Message 87816 - Posted: 4 Dec 2017, 17:54:54 UTC 40 valid tasks and 37 in progress. Seems fine. ID: 87816 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 389	Message 87817 - Posted: 4 Dec 2017, 18:16:45 UTC Showing 10,000 mini-Rosetta tasks now. Go get'em ID: 87817 · Rating: 0 · rate: / Reply Quote

Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 307,078,775 RAC: 4,694	Message 87842 - Posted: 6 Dec 2017, 23:25:53 UTC several Rosetta 4.06 units failing with this error: <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394 Starting watchdog... Watchdog active. BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43533.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 12:16:35 (64706): called boinc_finish(0) pure virtual method called terminate called without an active exception </stderr_txt> ]]> ID: 87842 · Rating: 0 · rate: / Reply Quote

William Waggoner Send message Joined: 19 Nov 17 Posts: 3 Credit: 24,650 RAC: 0	Message 87846 - Posted: 7 Dec 2017, 15:27:35 UTC I am on a Mac, 10.13, running SETI and Rosetta and I have four Rosetta tasks that don't pause when BOINC Manager pauses. They are chewing up about 50% CPU each, apparently system time, even if I kill BOINC Manager. I have tried Force Quitting them but they return to the same problem. The task names from ps are: minirosetta_3.78_x86_64-apple-darwin @G058259_GEO_3_1-160_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G213038_GEO_3_1-95_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G018595_GEO_3_10-140_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G153189_GEO_3_1-72_TEST_v02_t000__krypton.flags Since aborting these four there was one more that persisted. I've stopped Rosetta for now until I get this straightened out. ID: 87846 · Rating: 0 · rate: / Reply Quote

Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5146 Credit: 0 RAC: 0	Message 87849 - Posted: 7 Dec 2017, 19:45:38 UTC - in response to Message 87846. Do you know if this is a general issue with all jobs or for specific jobs? ID: 87849 · Rating: 0 · rate: / Reply Quote

William Waggoner Send message Joined: 19 Nov 17 Posts: 3 Credit: 24,650 RAC: 0	Message 87856 - Posted: 7 Dec 2017, 21:44:05 UTC - in response to Message 87849. It is not all jobs in Rosetta. I have encountered five so far. No jobs in any other project have done the same thing though. ID: 87856 · Rating: 0 · rate: / Reply Quote

ypsilon Send message Joined: 19 Mar 08 Posts: 2 Credit: 312,546,107 RAC: 39,278	Message 87861 - Posted: 8 Dec 2017, 14:03:51 UTC I noticed the same on my Mac (10.11.6). I also noticed that the kernel task takes about 50% with this type of task and that the task runs much longer than my 4h runtime. Here ist an example task. I aborted it after a runtime of 15:39:40 Fr 8 Dez 14:36:21 2017 \| Rosetta@home \| task G176481_GEO_3_1-90_TEST_v02_t000__krypton_SAVE_ALL_OUT_03_09_538050_167_1 aborted by user this is not the only one. I noticed it a several times the last days. ID: 87861 · Rating: 0 · rate: / Reply Quote

Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5146 Credit: 0 RAC: 0	Message 87864 - Posted: 8 Dec 2017, 21:50:46 UTC I alerted the researcher who owns these jobs. There is definitely an issue and he should be aborting them soon. A corrupted input file is causing this very bad behavior. Please go ahead and abort these _GEO_ jobs if they are running too long. You may have to force abort them unfortunately. Sorry for the inconvenience. ID: 87864 · Rating: 0 · rate: / Reply Quote

mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 0	Message 87865 - Posted: 9 Dec 2017, 0:51:32 UTC - in response to Message 87842. several Rosetta 4.06 units failing with this error: <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394 Starting watchdog... Watchdog active. BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43533.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 12:16:35 (64706): called boinc_finish(0) pure virtual method called terminate called without an active exception </stderr_txt> ]]> I have had several of these as well. They take much longer and eventually get a comp error. https://boinc.bakerlab.org/workunit.php?wuid=864077368 ID: 87865 · Rating: 0 · rate: / Reply Quote

ypsilon Send message Joined: 19 Mar 08 Posts: 2 Credit: 312,546,107 RAC: 39,278	Message 87867 - Posted: 9 Dec 2017, 9:12:42 UTC - in response to Message 87866. I have had several of these as well. They take much longer and eventually get a comp error. Apparently there's an emergency stop built in, 4 hours after target time. Some of them might stop after 8 Hours, but at least on my mac there are some tasks which run longer. On may Linux machines occurred many tasks which run longer also. There I have a script to catch them. Here is a log from the last few days: Do 30. Nov 11:00:01 CET 2017 UfA2QuiJ_fold_and_dock_SAVE_ALL_OUT_523940_1806_0 aborted Do 30. Nov 12:50:01 CET 2017 T1wKatx0_fold_and_dock_SAVE_ALL_OUT_523844_1809_0 aborted Do 30. Nov 22:25:01 CET 2017 ExWr3Dh9_fold_and_dock_SAVE_ALL_OUT_523951_1809_0 aborted Fr 1. Dez 02:35:01 CET 2017 5mGIFocw_fold_and_dock_SAVE_ALL_OUT_523953_1809_0 aborted Fr 1. Dez 13:40:01 CET 2017 KMdj0rL5_fold_and_dock_SAVE_ALL_OUT_523842_1812_0 aborted Fr 1. Dez 17:10:02 CET 2017 5f6I1vYG_fold_and_dock_SAVE_ALL_OUT_523837_1812_0 aborted Sa 2. Dez 18:15:01 CET 2017 vusPMBXW_fold_and_dock_SAVE_ALL_OUT_523826_1834_0 aborted So 3. Dez 08:00:01 CET 2017 T0zUfk05_fold_and_dock_SAVE_ALL_OUT_523962_1847_0 aborted Mo 4. Dez 11:20:01 CET 2017 qe4GEPtL_fold_and_dock_SAVE_ALL_OUT_523846_1868_0 aborted Mo 4. Dez 23:40:01 CET 2017 EXDyYP7D_fold_and_dock_SAVE_ALL_OUT_523819_1869_1 aborted Di 5. Dez 06:15:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.1.8_0001_SAVE_ALL_OUT_537824_736_0 aborted Di 5. Dez 10:50:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.14.5_0001_SAVE_ALL_OUT_537819_597_0 aborted Mi 6. Dez 02:55:01 CET 2017 WCeQrXeO_fold_and_dock_SAVE_ALL_OUT_523901_1733_1 aborted Mi 6. Dez 08:55:02 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_1979_0 aborted Mi 6. Dez 13:35:01 CET 2017 A1seaa4W_fold_and_dock_SAVE_ALL_OUT_523928_1990_0 aborted Do 7. Dez 02:30:01 CET 2017 jote4wq3_fold_and_dock_SAVE_ALL_OUT_524006_2016_0 aborted Do 7. Dez 19:05:01 CET 2017 BiqOzAZk_fold_and_dock_SAVE_ALL_OUT_523813_2055_0 aborted Do 7. Dez 23:00:01 CET 2017 WybJiODL_fold_and_dock_SAVE_ALL_OUT_523892_2058_0 aborted Fr 8. Dez 03:05:01 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_2058_0 aborted Fr 8. Dez 07:40:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.651.1_0001_SAVE_ALL_OUT_538287_42_0 aborted Sa 9. Dez 08:10:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.996.1_0001_SAVE_ALL_OUT_538294_560_0 aborted ID: 87867 · Rating: 0 · rate: / Reply Quote

mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 0	Message 87871 - Posted: 9 Dec 2017, 15:01:45 UTC Now I'm just aborting tasks that aren't staying on track with the others. Lots of wasted hours now. ID: 87871 · Rating: 0 · rate: / Reply Quote

Luigi R. Send message Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0	Message 87881 - Posted: 10 Dec 2017, 22:54:08 UTC Same here. https://boinc.bakerlab.org/result.php?resultid=958964042 https://boinc.bakerlab.org/result.php?resultid=958964041 ID: 87881 · Rating: 0 · rate: / Reply Quote

Gaurav Bhardwaj Send message Joined: 10 Oct 08 Posts: 2 Credit: 693,069 RAC: 0	Message 87888 - Posted: 11 Dec 2017, 19:16:50 UTC - in response to Message 87881. Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is very low probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon. ID: 87888 · Rating: 0 · rate: / Reply Quote

JohnH Send message Joined: 25 Mar 13 Posts: 43 Credit: 2,319,355 RAC: 0	Message 87901 - Posted: 14 Dec 2017, 18:33:57 UTC No Rosetta tasks A G A I N. ID: 87901 · Rating: 0 · rate: / Reply Quote

robertmiles Send message Joined: 16 Jun 08 Posts: 1264 Credit: 14,421,737 RAC: 0	Message 87903 - Posted: 14 Dec 2017, 20:32:46 UTC - in response to Message 87888. Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is very low probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon. Some ideas you may want to think about: Grant some credit for work that does not pass the filters, even if not as much as for work that does pass the filters. Check for an end-of-run every time a starting point reaches a final decision on whether it produces good results, not just if the work on that starting point has passed the filters. This means that users get some credit if they only remove at least one starting point from the list of starting points that would otherwise be sent to at least one more user. ID: 87903 · Rating: 0 · rate: / Reply Quote

googloo Send message Joined: 15 Sep 06 Posts: 137 Credit: 23,975,196 RAC: 8	Message 87910 - Posted: 15 Dec 2017, 19:44:38 UTC - in response to Message 87901. Last modified: 15 Dec 2017, 19:47:24 UTC Same here Message 87901 - Posted: 14 Dec 2017, 18:33:57 UTC No Rosetta tasks A G A I N. ID: 87910 · Rating: 0 · rate: / Reply Quote

Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5146 Credit: 0 RAC: 0	Message 87911 - Posted: 15 Dec 2017, 19:59:47 UTC I found a bottleneck in our work unit generation that happens in certain situations. It should be fixed now. There is an endless supply of scientific work queued and planned to be queued in the near future. The more crunching the better! Sorry for these workunit distribution issues. Thanks! ID: 87911 · Rating: 0 · rate: / Reply Quote

Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 87912 - Posted: 16 Dec 2017, 0:58:23 UTC No new workunits being downloaded on at least 1 of my boxes. Log says: "Server Error: Feeder not running." I suspect this just just happened as the server status page still shows the Feeder as running. **38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research ID: 87912 · Rating: 0 · rate: / Reply Quote

Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 0	Message 87913 - Posted: 16 Dec 2017, 0:59:19 UTC - in response to Message 87912. Seems like it's now fixed (only about 1 minute after I posted this!) Nice work :) **38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research ID: 87913 · Rating: 0 · rate: / Reply Quote

xaminmo Send message Joined: 14 Nov 17 Posts: 2 Credit: 2,075,523 RAC: 0	Message 88212 - Posted: 2 Feb 2018, 1:37:37 UTC It seems when BOINC requests work, Rosetta sends more than requested. I do not have this problem with any other project. This is worse as of the last week or so, but it does not always happen this way. The main system I notice this on is here: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3290081 My runtimes closely match what I've selected in project properties, so it's not that. If I set resource share to 10, I still pull down about 5x the work of any of the other projects I crunch on. Deadline is around a week out, so this forces other projects to not run, in favor of Rosetta, because eventually it wants to make sure the jobs are not at risk. When it uploads the work, it grabs new replacement work, and still favors Rosetta. I've opened a similar thread in the BOINC forums, since I think they should force the scheduler to comply with user wishes, not with what the project sends. But I see it as a problem with both sides. I'm hoping to get a workaround, or some sort of dev committment to help improve this situatoin. ID: 88212 · Rating: 0 · rate: / Reply Quote