Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 13,030 |
40 valid tasks and 37 in progress. Seems fine. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
Showing 10,000 mini-Rosetta tasks now. Go get'em |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 0 |
several Rosetta 4.06 units failing with this error: <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process got signal 11 </message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394 Starting watchdog... Watchdog active. BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43533.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 12:16:35 (64706): called boinc_finish(0) pure virtual method called terminate called without an active exception </stderr_txt> ]]> |
William Waggoner Send message Joined: 19 Nov 17 Posts: 3 Credit: 24,650 RAC: 0 |
I am on a Mac, 10.13, running SETI and Rosetta and I have four Rosetta tasks that don't pause when BOINC Manager pauses. They are chewing up about 50% CPU each, apparently system time, even if I kill BOINC Manager. I have tried Force Quitting them but they return to the same problem. The task names from ps are: minirosetta_3.78_x86_64-apple-darwin @G058259_GEO_3_1-160_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G213038_GEO_3_1-95_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G018595_GEO_3_10-140_TEST_v02_t000__krypton.flags minirosetta_3.78_x86_64-apple-darwin @G153189_GEO_3_1-72_TEST_v02_t000__krypton.flags Since aborting these four there was one more that persisted. I've stopped Rosetta for now until I get this straightened out. |
Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5144 Credit: 0 RAC: 0 |
Do you know if this is a general issue with all jobs or for specific jobs? |
William Waggoner Send message Joined: 19 Nov 17 Posts: 3 Credit: 24,650 RAC: 0 |
It is not all jobs in Rosetta. I have encountered five so far. No jobs in any other project have done the same thing though. |
ypsilon Send message Joined: 19 Mar 08 Posts: 2 Credit: 291,819,405 RAC: 7,902 |
I noticed the same on my Mac (10.11.6). I also noticed that the kernel task takes about 50% with this type of task and that the task runs much longer than my 4h runtime. Here is an example task. I aborted it after a runtime of 15:39:40: Fr 8 Dez 14:36:21 2017 | Rosetta@home | task G176481_GEO_3_1-90_TEST_v02_t000__krypton_SAVE_ALL_OUT_03_09_538050_167_1 aborted by user. This is not the only one. I have noticed it several times over the last few days. |
Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5144 Credit: 0 RAC: 0 |
I alerted the researcher who owns these jobs. There is definitely an issue and he should be aborting them soon. A corrupted input file is causing this very bad behavior. Please go ahead and abort these _GEO_ jobs if they are running too long. You may have to force abort them unfortunately. Sorry for the inconvenience. |
mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 13,030 |
Trotador wrote: "several Rosetta 4.06 units failing with this error:" I have had several of these as well. They take much longer and eventually get a comp error. https://boinc.bakerlab.org/workunit.php?wuid=864077368 |
ypsilon Send message Joined: 19 Mar 08 Posts: 2 Credit: 291,819,405 RAC: 7,902 |
mmonnin wrote: "I have had several of these as well. They take much longer and eventually get a comp error." Apparently there's an emergency stop built in, 4 hours after target time. Some of them might stop after 8 hours, but at least on my Mac there are some tasks which run longer. On my Linux machines there were also many tasks that ran longer. There I have a script to catch them. Here is a log from the last few days:
Do 30. Nov 11:00:01 CET 2017 UfA2QuiJ_fold_and_dock_SAVE_ALL_OUT_523940_1806_0 aborted
Do 30. Nov 12:50:01 CET 2017 T1wKatx0_fold_and_dock_SAVE_ALL_OUT_523844_1809_0 aborted
Do 30. Nov 22:25:01 CET 2017 ExWr3Dh9_fold_and_dock_SAVE_ALL_OUT_523951_1809_0 aborted
Fr 1. Dez 02:35:01 CET 2017 5mGIFocw_fold_and_dock_SAVE_ALL_OUT_523953_1809_0 aborted
Fr 1. Dez 13:40:01 CET 2017 KMdj0rL5_fold_and_dock_SAVE_ALL_OUT_523842_1812_0 aborted
Fr 1. Dez 17:10:02 CET 2017 5f6I1vYG_fold_and_dock_SAVE_ALL_OUT_523837_1812_0 aborted
Sa 2. Dez 18:15:01 CET 2017 vusPMBXW_fold_and_dock_SAVE_ALL_OUT_523826_1834_0 aborted
So 3. Dez 08:00:01 CET 2017 T0zUfk05_fold_and_dock_SAVE_ALL_OUT_523962_1847_0 aborted
Mo 4. Dez 11:20:01 CET 2017 qe4GEPtL_fold_and_dock_SAVE_ALL_OUT_523846_1868_0 aborted
Mo 4. Dez 23:40:01 CET 2017 EXDyYP7D_fold_and_dock_SAVE_ALL_OUT_523819_1869_1 aborted
Di 5. Dez 06:15:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.1.8_0001_SAVE_ALL_OUT_537824_736_0 aborted
Di 5. Dez 10:50:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.14.5_0001_SAVE_ALL_OUT_537819_597_0 aborted
Mi 6. Dez 02:55:01 CET 2017 WCeQrXeO_fold_and_dock_SAVE_ALL_OUT_523901_1733_1 aborted
Mi 6. Dez 08:55:02 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_1979_0 aborted
Mi 6. Dez 13:35:01 CET 2017 A1seaa4W_fold_and_dock_SAVE_ALL_OUT_523928_1990_0 aborted
Do 7. Dez 02:30:01 CET 2017 jote4wq3_fold_and_dock_SAVE_ALL_OUT_524006_2016_0 aborted
Do 7. Dez 19:05:01 CET 2017 BiqOzAZk_fold_and_dock_SAVE_ALL_OUT_523813_2055_0 aborted
Do 7. Dez 23:00:01 CET 2017 WybJiODL_fold_and_dock_SAVE_ALL_OUT_523892_2058_0 aborted
Fr 8. Dez 03:05:01 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_2058_0 aborted
Fr 8. Dez 07:40:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.651.1_0001_SAVE_ALL_OUT_538287_42_0 aborted
Sa 9. Dez 08:10:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.996.1_0001_SAVE_ALL_OUT_538294_560_0 aborted |
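The script mentioned in the post above is not shown in the thread, so the following is only a minimal sketch of how such a catcher might look, not ypsilon's actual script. It assumes `boinccmd --get_tasks` prints one `key: value` pair per line under a numbered `N) -----------` header, with fields named `name`, `project URL`, `active_task_state` and `current CPU time`; check this against the output of your own client version. The 4-hour grace period comes from the post above, and the helper names (`parse_tasks`, `overdue`, `abort_task`, `run_once`) are hypothetical.

```python
import subprocess

GRACE_S = 4 * 3600  # "emergency stop ... 4 hours after target time" (from the thread)


def parse_tasks(text):
    """Parse `boinccmd --get_tasks` style output into a list of dicts.

    Assumption: each task starts with a header like "1) -----------"
    and is followed by indented "key: value" lines.
    """
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("-----------") and ")" in line:
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def overdue(tasks, target_s, grace_s=GRACE_S, project_substr="bakerlab"):
    """Return running tasks whose CPU time exceeds target + grace."""
    limit = target_s + grace_s
    late = []
    for task in tasks:
        if project_substr not in task.get("project URL", ""):
            continue
        if task.get("active_task_state") != "EXECUTING":
            continue
        try:
            cpu_s = float(task.get("current CPU time", "0"))
        except ValueError:
            continue  # skip tasks with an unparsable CPU time
        if cpu_s > limit:
            late.append(task)
    return late


def abort_task(task):
    """Ask the local BOINC client to abort one task."""
    subprocess.run(
        ["boinccmd", "--task", task["project URL"], task["name"], "abort"],
        check=True,
    )


def run_once(target_s):
    """Fetch the task list once and abort everything past the limit."""
    result = subprocess.run(
        ["boinccmd", "--get_tasks"], capture_output=True, text=True, check=True
    )
    for task in overdue(parse_tasks(result.stdout), target_s):
        print(task["name"], "aborted")
        abort_task(task)
```

Calling `run_once(8 * 3600)` from cron every few minutes would match an 8-hour target runtime: the limit is then 28800 s + 14400 s = 43200 s, in line with the `14400s + 28800s` shown in the stderr output quoted earlier in the thread.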
mmonnin Send message Joined: 2 Jun 16 Posts: 61 Credit: 25,390,629 RAC: 13,030 |
Now I'm just aborting tasks that aren't staying on track with the others. Lots of wasted hours now. |
Luigi R. Send message Joined: 7 Feb 14 Posts: 39 Credit: 2,045,527 RAC: 0 |
|
Gaurav Bhardwaj Send message Joined: 10 Oct 08 Posts: 2 Credit: 693,069 RAC: 0 |
Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is a very low-probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon. |
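A rough back-of-the-envelope illustration of why a job can run for hours without output. It assumes, purely for illustration, that each attempt passes all filters independently with the same probability $p$; the thread gives neither the real pass rates nor the number of attempts per hour, so the numbers below are hypothetical.

$$
P(\text{no structure after } n \text{ attempts}) = (1 - p)^{n} \approx e^{-pn}
$$

With $p = 0.001$ and $n = 1000$ attempts this gives $(0.999)^{1000} \approx 0.37$, so under these assumptions roughly one run in three would still have produced nothing.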
JohnH Send message Joined: 25 Mar 13 Posts: 43 Credit: 2,319,355 RAC: 0 |
No Rosetta tasks A G A I N. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 826 |
Gaurav Bhardwaj wrote: "Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is a very low-probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon." Some ideas you may want to think about: (1) Grant some credit for work that does not pass the filters, even if not as much as for work that does pass the filters. (2) Check for an end-of-run every time a starting point reaches a final decision on whether it produces good results, not just if the work on that starting point has passed the filters. This means that users get some credit if they only remove at least one starting point from the list of starting points that would otherwise be sent to at least one more user. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,813,645 RAC: 1,448 |
Same here Message 87901 - Posted: 14 Dec 2017, 18:33:57 UTC |
Admin Project administrator Send message Joined: 1 Jul 05 Posts: 5144 Credit: 0 RAC: 0 |
I found a bottleneck in our work unit generation that happens in certain situations. It should be fixed now. There is an endless supply of scientific work queued and planned to be queued in the near future. The more crunching the better! Sorry for these workunit distribution issues. Thanks! |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
No new workunits being downloaded on at least 1 of my boxes. Log says: "Server Error: Feeder not running." I suspect this just happened, as the server status page still shows the Feeder as running. 38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Seems like it's now fixed (only about 1 minute after I posted this!) Nice work :) 38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
xaminmo Send message Joined: 14 Nov 17 Posts: 2 Credit: 2,075,523 RAC: 0 |
It seems when BOINC requests work, Rosetta sends more than requested. I do not have this problem with any other project. This is worse as of the last week or so, but it does not always happen this way. The main system I notice this on is here: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3290081 My runtimes closely match what I've selected in project properties, so it's not that. If I set resource share to 10, I still pull down about 5x the work of any of the other projects I crunch on. Deadline is around a week out, so this forces other projects to not run, in favor of Rosetta, because eventually it wants to make sure the jobs are not at risk. When it uploads the work, it grabs new replacement work, and still favors Rosetta. I've opened a similar thread in the BOINC forums, since I think they should force the scheduler to comply with user wishes, not with what the project sends. But I see it as a problem with both sides. I'm hoping to get a workaround, or some sort of dev commitment to help improve this situation. |
©2024 University of Washington
https://www.bakerlab.org