Problems and Technical Issues with Rosetta@home

mmonnin
Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 13,030
Message 87816 - Posted: 4 Dec 2017, 17:54:54 UTC

40 valid tasks and 37 in progress. Seems fine.

Sid Celery
Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 87817 - Posted: 4 Dec 2017, 18:16:45 UTC

Showing 10,000 mini-Rosetta tasks now. Go get 'em.

Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 87842 - Posted: 6 Dec 2017, 23:25:53 UTC

Several Rosetta 4.06 work units are failing with this error:

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43533.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
12:16:35 (64706): called boinc_finish(0)
pure virtual method called
terminate called without an active exception

</stderr_txt>
]]>
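
The `BOINC:: CPU time` line in the log above can be checked mechanically. The following is a minimal Python sketch, not part of Rosetta or BOINC. It assumes (based on the `-cpu_run_time 28800` flag in the command line) that 28800s is the target runtime and that 14400s is an extra allowance before the watchdog ends the task. The function names are made up for illustration.

```python
import re

# Matches e.g. "BOINC:: CPU time: 43533.6s, 14400s + 28800s"
# Assumed meaning of the three numbers: CPU time used, extra allowance, target runtime.
_PATTERN = re.compile(r"CPU time:\s*([\d.]+)s,\s*([\d.]+)s\s*\+\s*([\d.]+)s")


def parse_watchdog_line(line):
    """Return (cpu_seconds, allowance_seconds, target_seconds) or None if no match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    cpu, allowance, target = (float(group) for group in match.groups())
    return cpu, allowance, target


def over_limit(line):
    """True if the reported CPU time exceeds allowance + target."""
    parsed = parse_watchdog_line(line)
    if parsed is None:
        return False
    cpu, allowance, target = parsed
    return cpu > allowance + target
```

For the line posted above, 43533.6s is larger than 14400s + 28800s = 43200s, so under these assumptions the task ran roughly 334 seconds past the combined limit before it was ended.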

William Waggoner
Joined: 19 Nov 17
Posts: 3
Credit: 24,650
RAC: 0
Message 87846 - Posted: 7 Dec 2017, 15:27:35 UTC

I am on a Mac (macOS 10.13) running SETI and Rosetta, and I have four Rosetta tasks that don't pause when BOINC Manager pauses. Each one is using about 50% CPU, apparently system time, even if I kill BOINC Manager. I have tried force-quitting them, but they come back with the same problem.

The task names from ps are:

minirosetta_3.78_x86_64-apple-darwin @G058259_GEO_3_1-160_TEST_v02_t000__krypton.flags
minirosetta_3.78_x86_64-apple-darwin @G213038_GEO_3_1-95_TEST_v02_t000__krypton.flags
minirosetta_3.78_x86_64-apple-darwin @G018595_GEO_3_10-140_TEST_v02_t000__krypton.flags
minirosetta_3.78_x86_64-apple-darwin @G153189_GEO_3_1-72_TEST_v02_t000__krypton.flags

Since I aborted these four, one more task has shown the same behavior. I've stopped Rosetta for now until I get this straightened out.

Admin
Project administrator
Joined: 1 Jul 05
Posts: 5144
Credit: 0
RAC: 0
Message 87849 - Posted: 7 Dec 2017, 19:45:38 UTC - in response to Message 87846.  

Do you know if this is a general issue with all jobs or for specific jobs?

William Waggoner
Joined: 19 Nov 17
Posts: 3
Credit: 24,650
RAC: 0
Message 87856 - Posted: 7 Dec 2017, 21:44:05 UTC - in response to Message 87849.  

It is not all jobs in Rosetta. I have encountered five so far. No jobs in any other project have done the same thing though.

ypsilon
Joined: 19 Mar 08
Posts: 2
Credit: 291,819,405
RAC: 7,902
Message 87861 - Posted: 8 Dec 2017, 14:03:51 UTC

I noticed the same on my Mac (10.11.6). I also noticed that the kernel task uses about 50% CPU while this type of task is running, and that the task runs much longer than my 4-hour target runtime.

Here is an example task. I aborted it after a runtime of 15:39:40:
Fr 8 Dez 14:36:21 2017 | Rosetta@home | task G176481_GEO_3_1-90_TEST_v02_t000__krypton_SAVE_ALL_OUT_03_09_538050_167_1 aborted by user

This is not the only one. I have noticed it several times over the last few days.

Admin
Project administrator
Joined: 1 Jul 05
Posts: 5144
Credit: 0
RAC: 0
Message 87864 - Posted: 8 Dec 2017, 21:50:46 UTC

I alerted the researcher who owns these jobs. There is definitely an issue and he should be aborting them soon. A corrupted input file is causing this very bad behavior. Please go ahead and abort these _GEO_ jobs if they are running too long. You may have to force abort them unfortunately. Sorry for the inconvenience.

mmonnin
Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 13,030
Message 87865 - Posted: 9 Dec 2017, 0:51:32 UTC - in response to Message 87842.  

several Rosetta 4.06 units failing with this error:

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.06_x86_64-pc-linux-gnu @9res_cis_hydrophobic_nmethyl_c.103.1_1_0001.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2224394
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43533.6s, 14400s + 28800s[2017-12- 6 12:16:35:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43533.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
12:16:35 (64706): called boinc_finish(0)
pure virtual method called
terminate called without an active exception

</stderr_txt>
]]>


I have had several of these as well. They take much longer and eventually get a comp error.
https://boinc.bakerlab.org/workunit.php?wuid=864077368

ypsilon
Joined: 19 Mar 08
Posts: 2
Credit: 291,819,405
RAC: 7,902
Message 87867 - Posted: 9 Dec 2017, 9:12:42 UTC - in response to Message 87866.  

I have had several of these as well. They take much longer and eventually get a comp error.
Apparently there's an emergency stop built in, 4 hours after target time.


Some of them might stop after 8 hours, but at least on my Mac there are some tasks that run longer. On my Linux machines there were also many tasks that ran longer. There I have a script to catch them. Here is a log from the last few days:

Do 30. Nov 11:00:01 CET 2017 UfA2QuiJ_fold_and_dock_SAVE_ALL_OUT_523940_1806_0 aborted
Do 30. Nov 12:50:01 CET 2017 T1wKatx0_fold_and_dock_SAVE_ALL_OUT_523844_1809_0 aborted
Do 30. Nov 22:25:01 CET 2017 ExWr3Dh9_fold_and_dock_SAVE_ALL_OUT_523951_1809_0 aborted
Fr 1. Dez 02:35:01 CET 2017 5mGIFocw_fold_and_dock_SAVE_ALL_OUT_523953_1809_0 aborted
Fr 1. Dez 13:40:01 CET 2017 KMdj0rL5_fold_and_dock_SAVE_ALL_OUT_523842_1812_0 aborted
Fr 1. Dez 17:10:02 CET 2017 5f6I1vYG_fold_and_dock_SAVE_ALL_OUT_523837_1812_0 aborted
Sa 2. Dez 18:15:01 CET 2017 vusPMBXW_fold_and_dock_SAVE_ALL_OUT_523826_1834_0 aborted
So 3. Dez 08:00:01 CET 2017 T0zUfk05_fold_and_dock_SAVE_ALL_OUT_523962_1847_0 aborted
Mo 4. Dez 11:20:01 CET 2017 qe4GEPtL_fold_and_dock_SAVE_ALL_OUT_523846_1868_0 aborted
Mo 4. Dez 23:40:01 CET 2017 EXDyYP7D_fold_and_dock_SAVE_ALL_OUT_523819_1869_1 aborted
Di 5. Dez 06:15:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.1.8_0001_SAVE_ALL_OUT_537824_736_0 aborted
Di 5. Dez 10:50:01 CET 2017 8res_cis_hydrophobic_nmethyl_8res_c.14.5_0001_SAVE_ALL_OUT_537819_597_0 aborted
Mi 6. Dez 02:55:01 CET 2017 WCeQrXeO_fold_and_dock_SAVE_ALL_OUT_523901_1733_1 aborted
Mi 6. Dez 08:55:02 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_1979_0 aborted
Mi 6. Dez 13:35:01 CET 2017 A1seaa4W_fold_and_dock_SAVE_ALL_OUT_523928_1990_0 aborted
Do 7. Dez 02:30:01 CET 2017 jote4wq3_fold_and_dock_SAVE_ALL_OUT_524006_2016_0 aborted
Do 7. Dez 19:05:01 CET 2017 BiqOzAZk_fold_and_dock_SAVE_ALL_OUT_523813_2055_0 aborted
Do 7. Dez 23:00:01 CET 2017 WybJiODL_fold_and_dock_SAVE_ALL_OUT_523892_2058_0 aborted
Fr 8. Dez 03:05:01 CET 2017 XVaB0DRA_fold_and_dock_SAVE_ALL_OUT_523891_2058_0 aborted
Fr 8. Dez 07:40:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.651.1_0001_SAVE_ALL_OUT_538287_42_0 aborted
Sa 9. Dez 08:10:01 CET 2017 9res_cis_hydrophobic_nmethyl_c.996.1_0001_SAVE_ALL_OUT_538294_560_0 aborted
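
A script of this kind might look roughly like the Python sketch below. It is not the script that produced the log above. It assumes that `boinccmd --get_tasks` prints one `key: value` pair per line, with each task introduced by a line such as `1) -----------` and with fields named `name`, `project URL`, and `current CPU time` (check the output of your own client version). Aborting uses `boinccmd --task <url> <name> abort`. The helper names and the default URL filter are illustrative.

```python
import subprocess


def parse_tasks(text):
    """Parse `boinccmd --get_tasks`-style output into a list of dicts."""
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Assumed task separator, e.g. "1) -----------"
        if ")" in line and line.endswith("-----------"):
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def find_overdue(tasks, limit_seconds, url_part="bakerlab"):
    """Return tasks of the matching project whose CPU time exceeds the limit."""
    overdue = []
    for task in tasks:
        try:
            cpu = float(task.get("current CPU time", "0"))
        except ValueError:
            continue  # skip tasks whose CPU time cannot be read
        if url_part in task.get("project URL", "") and cpu > limit_seconds:
            overdue.append(task)
    return overdue


def abort_task(task):
    """Ask the local BOINC client to abort one task."""
    subprocess.run(
        ["boinccmd", "--task", task["project URL"], task["name"], "abort"],
        check=True,
    )


def main(limit_seconds=12 * 3600):
    """Abort every overdue task; intended to be run periodically, e.g. from cron."""
    output = subprocess.run(
        ["boinccmd", "--get_tasks"], capture_output=True, text=True, check=True
    ).stdout
    for task in find_overdue(parse_tasks(output), limit_seconds):
        abort_task(task)
        print(task["name"], "aborted")
```

A sensible limit is the target runtime plus the 4-hour allowance mentioned above, so that only tasks the built-in stop failed to catch are aborted.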

mmonnin
Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 13,030
Message 87871 - Posted: 9 Dec 2017, 15:01:45 UTC

Now I'm just aborting tasks that aren't staying on track with the others. Lots of wasted hours now.

Luigi R.
Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 87881 - Posted: 10 Dec 2017, 22:54:08 UTC


Gaurav Bhardwaj
Joined: 10 Oct 08
Posts: 2
Credit: 693,069
RAC: 0
Message 87888 - Posted: 11 Dec 2017, 19:16:50 UTC - in response to Message 87881.  

Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is a very low-probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon.
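
To illustrate why a job can run for hours without producing a structure: if each attempt passes all filters independently with probability p, the chance that none of n attempts passes is (1 - p)^n, and the expected number of attempts per accepted structure is 1/p. The short Python sketch below computes both. The independence assumption and any values of p and n you plug in are illustrative, not measurements from these jobs.

```python
def prob_no_structure(p, n):
    """Probability that n independent attempts all fail the filters."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return (1.0 - p) ** n


def expected_attempts(p):
    """Expected number of attempts needed for one structure to pass."""
    if p <= 0.0:
        raise ValueError("p must be positive")
    return 1.0 / p
```

For example, with a hypothetical pass probability of 0.001 per attempt, about 37% of runs that make 1000 attempts would still finish with no accepted structure.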

JohnH
Joined: 25 Mar 13
Posts: 43
Credit: 2,319,355
RAC: 0
Message 87901 - Posted: 14 Dec 2017, 18:33:57 UTC

No Rosetta tasks A G A I N.

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 87903 - Posted: 14 Dec 2017, 20:32:46 UTC - in response to Message 87888.  

Thanks for pointing out the issues with really long jobs to us. Some of these jobs are intended to predict the structures of cyclic peptides, and invoke a few different filters during their runs. For some peptides, passing all these filters is a very low-probability event, and therefore no structure makes it through even after hours of running. We are looking further into it, and will update you with more information very soon.

Some ideas you may want to think about:

Grant some credit for work that does not pass the filters, even if not as much as for work that does pass the filters.

Check for an end-of-run every time a starting point reaches a final decision on whether it produces good results, not just if the work on that starting point has passed the filters.

This means that users get some credit even if all they do is eliminate at least one starting point from the list of starting points that would otherwise be sent to at least one more user.
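
The two ideas above could be sketched as follows. This is only an illustration in Python, not how Rosetta or the BOINC credit system actually works. The `evaluate` callback, the budget handling, and the credit weights are all hypothetical.

```python
def run_work_unit(starting_points, evaluate, cpu_budget):
    """Process starting points until the CPU budget is used up.

    evaluate(point) is a hypothetical callback returning
    (passed_filters: bool, cpu_seconds_used: float).
    """
    used = 0.0
    passed, rejected = [], []
    for point in starting_points:
        ok, cost = evaluate(point)
        used += cost
        (passed if ok else rejected).append(point)
        # End-of-run check after every final decision, pass or fail,
        # not only after a structure passes the filters.
        if used >= cpu_budget:
            break
    return passed, rejected, used


def credit(passed, rejected, full=1.0, partial=0.25):
    """Full credit per passing point, reduced credit per rejected point."""
    return full * len(passed) + partial * len(rejected)
```

With this structure, a run in which every starting point is rejected still ends on time and still earns a nonzero amount of credit.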

googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,813,645
RAC: 1,448
Message 87910 - Posted: 15 Dec 2017, 19:44:38 UTC - in response to Message 87901.  
Last modified: 15 Dec 2017, 19:47:24 UTC

Same here

Message 87901 - Posted: 14 Dec 2017, 18:33:57 UTC

No Rosetta tasks A G A I N.

Admin
Project administrator
Joined: 1 Jul 05
Posts: 5144
Credit: 0
RAC: 0
Message 87911 - Posted: 15 Dec 2017, 19:59:47 UTC

I found a bottleneck in our work unit generation that happens in certain situations. It should be fixed now. There is an endless supply of scientific work queued and planned to be queued in the near future. The more crunching the better! Sorry for these workunit distribution issues. Thanks!

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 87912 - Posted: 16 Dec 2017, 0:58:23 UTC

No new workunits being downloaded on at least 1 of my boxes. Log says:
"Server Error: Feeder not running."

I suspect this just happened, as the server status page still shows the Feeder as running.
**38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 87913 - Posted: 16 Dec 2017, 0:59:19 UTC - in response to Message 87912.  

Seems like it's now fixed (only about 1 minute after I posted this!) Nice work :)
**38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research

xaminmo
Joined: 14 Nov 17
Posts: 2
Credit: 2,075,523
RAC: 0
Message 88212 - Posted: 2 Feb 2018, 1:37:37 UTC

It seems when BOINC requests work, Rosetta sends more than requested.
I do not have this problem with any other project.
This is worse as of the last week or so, but it does not always happen this way.

The main system I notice this on is here:
https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3290081

My runtimes closely match what I've selected in project properties, so it's not that.

If I set resource share to 10, I still pull down about 5x the work of any of the other projects I crunch on.

The deadline is about a week out, so this forces other projects not to run, in favor of Rosetta, because eventually the client wants to make sure the Rosetta jobs are not at risk of missing it.

When it uploads the work, it grabs new replacement work, and still favors Rosetta.

I've opened a similar thread in the BOINC forums, since I think they should force the scheduler to comply with user wishes, not with what the project sends.

But I see it as a problem with both sides.

I'm hoping to get a workaround, or some sort of dev commitment to help improve this situation.
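
For reference, BOINC resource share is relative: a project's intended long-term fraction of computing is its share divided by the sum of the shares of all attached projects. The Python sketch below shows that arithmetic. The other project names and their shares of 100 are hypothetical, and this describes the intended long-term split, not a cap on how much work a single scheduler request returns.

```python
def share_fractions(shares):
    """Map each project to its intended fraction of computing resources."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total resource share must be positive")
    return {name: value / total for name, value in shares.items()}


# Hypothetical setup: Rosetta at 10, two other projects at 100 each.
example = share_fractions({"Rosetta@home": 10, "Project A": 100, "Project B": 100})
```

In that hypothetical setup Rosetta's intended fraction is 10/210, a little under 5%, so receiving about five times the work of the other projects is well away from the configured split.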

©2024 University of Washington
https://www.bakerlab.org