MiniRosetta 3.17 Problems.

Message boards : Number crunching : MiniRosetta 3.17 Problems.

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71507 - Posted: 27 Oct 2011, 7:45:54 UTC
Last modified: 27 Oct 2011, 7:52:04 UTC

Hi.

I've had two different types of tasks error out. The same types ran on this rig before with the 3.14 app without erroring.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=418800096

place_CE_20110919_EBOV_GP_2d1v_ProteinInterfaceDesign_31440_359_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

ERROR: drSOP
ERROR:: Exit from: src/protocols/protein_interface_design/movers/PlaceStubMover.cc line: 1063
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
=================================================================================

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=418800129

3filtr5A_CYpa_2aak_ProteinInterfaceDesign_23Aug2011_30588_1098_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

ERROR: drSOP
ERROR:: Exit from: src/protocols/protein_interface_design/movers/PlaceStubMover.cc line: 1063
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
ID: 71507 · Rating: 0
Shawn
Volunteer moderator
Project developer
Project scientist
Joined: 22 Jan 10
Posts: 17
Credit: 53,741
RAC: 0
Message 71513 - Posted: 27 Oct 2011, 18:55:19 UTC

Thanks for letting us know.

As you are probably aware, we recently changed our version of Rosetta@home. These current jobs are associated with protocols written for an older version. I did not notice any compatibility problems at the time, but I will do some more testing on these jobs to find out why they didn't work.
ID: 71513 · Rating: 0
Shawn
Volunteer moderator
Project developer
Project scientist
Joined: 22 Jan 10
Posts: 17
Credit: 53,741
RAC: 0
Message 71516 - Posted: 27 Oct 2011, 22:45:33 UTC - in response to Message 71513.  

I think we've identified the problem, and the ProteinInterfaceDesign team is now aware of the issue. Thanks once again for your time, your computational resources, and your feedback!
ID: 71516 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 71517 - Posted: 28 Oct 2011, 4:07:32 UTC
Last modified: 28 Oct 2011, 4:08:57 UTC

See, if you guys would post a summary of this problem on the front page... it'd have a profound effect on users. They'd see that the Rosetta team is working... etc. Same goes for when the server goes down. Say: "Hey, someone unplugged the servers during last night's party. We'll get that fixed as soon as possible." Something along those lines would be great for people trying to find out what's going on. Just my humble advice.
ID: 71517 · Rating: 0
pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 71522 - Posted: 28 Oct 2011, 15:16:00 UTC
Last modified: 28 Oct 2011, 15:21:02 UTC

I really don't mind the small things like the DrSOP problem; they tie up some resources for download and then upload, but I don't get charged extra for that. But during the same timeframe I also had something like a dozen ProteinInterfaceDesign and Ploop2x3 tasks run to their full allotted time (6 hours or so, depending on how the watchdog was feeling), and when the validator finally caught up they were marked as invalid. I had some of these on both machines I had crunching Rosetta: one is a Win XP x64 system and the other a Win7 box, with no overclocking at all. Here are a couple of examples. Any ideas, or did anyone else get those kinds of results in this last batch?

Ploop2x3
Ploop2x3
PID

note: edited to take out 'over the weekend'.
ID: 71522 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 71524 - Posted: 28 Oct 2011, 19:24:02 UTC

ID: 71524 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 302,521,161
RAC: 0
Message 71533 - Posted: 29 Oct 2011, 21:35:06 UTC

lots of errors, stop downloading units

https://boinc.bakerlab.org/rosetta/result.php?resultid=459660390
https://boinc.bakerlab.org/rosetta/result.php?resultid=459660074
https://boinc.bakerlab.org/rosetta/result.php?resultid=459660070
https://boinc.bakerlab.org/rosetta/result.php?resultid=459635613
https://boinc.bakerlab.org/rosetta/result.php?resultid=459658860
ID: 71533 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 302,521,161
RAC: 0
Message 71534 - Posted: 29 Oct 2011, 21:45:03 UTC
Last modified: 29 Oct 2011, 21:57:27 UTC

More info

T0... units seem OK

ab_07_19... crashing all

2stubs... crash

place_CE_... crash

rlx_jsr... OK
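For anyone who wants to check a task list against the breakdown above, here is a minimal Python sketch. The prefix table is simply transcribed from this post (it is not an official list from the project), and the function name `reported_status` is made up for illustration.

```python
# Status per task-name prefix, transcribed from the report above.
# This is one user's observation, not an official list from the project.
REPORTED_STATUS = {
    "T0": "ok",
    "ab_07_19": "crash",
    "2stubs": "crash",
    "place_CE_": "crash",
    "rlx_jsr": "ok",
}


def reported_status(task_name):
    """Return the status reported for the longest matching prefix, or 'unknown'."""
    matches = [prefix for prefix in REPORTED_STATUS if task_name.startswith(prefix)]
    if not matches:
        return "unknown"
    return REPORTED_STATUS[max(matches, key=len)]


if __name__ == "__main__":
    for name in (
        "place_CE_20110919_EBOV_GP_2d1v_ProteinInterfaceDesign_31440_359_0",
        "ab_07_19_1fnaA_filtnr_IGNORE_THE_REST_06_08_28682_52_1",
        "3filtr5A_CYpa_2aak_ProteinInterfaceDesign_23Aug2011_30588_1098_0",
    ):
        print(name, "->", reported_status(name))
```

Task names that match none of the listed prefixes come back as "unknown" rather than being guessed at.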
ID: 71534 · Rating: 0
Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 29,995,343
RAC: 14,010
Message 71539 - Posted: 30 Oct 2011, 17:07:14 UTC
Last modified: 30 Oct 2011, 17:12:44 UTC

ID: 71539 · Rating: 0
TJ
Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 71540 - Posted: 30 Oct 2011, 20:13:49 UTC

All my WUs error out very soon. I got these error messages:

ERROR: [ERROR] invalid header input for kill_hairpins file.
ERROR:: Exit from: ..\..\..\src\core\scoring\SS_Killhairpins_Info.cc line: 370
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Greetings,
TJ.
ID: 71540 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 71543 - Posted: 30 Oct 2011, 22:55:15 UTC

Yup, I got some dead hairpin files as well in the ab_07_19_ series.
The things you have to do to a protein to make them behave :¬)

https://boinc.bakerlab.org/rosetta/result.php?resultid=459619880

https://boinc.bakerlab.org/rosetta/result.php?resultid=459639244
ID: 71543 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 71549 - Posted: 31 Oct 2011, 4:14:46 UTC

Some more errors, on a different type of task. The other tasks I've had have been running OK apart from these.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=401795240

ab_07_19_1fnaA_filtnr_IGNORE_THE_REST_06_08_28682_52_1

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

Starting work on structure: _00001

ERROR: [ERROR] invalid header input for kill_hairpins file.
ERROR:: Exit from: src/core/scoring/SS_Killhairpins_Info.cc line: 370
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Watchdog active.

</stderr_txt>

==================================================================================

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=401801710

ab_07_19_1acfA_control_IGNORE_THE_REST_03_07_28679_51_0

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

Starting work on structure: _00001

ERROR: [ERROR] invalid header input for kill_hairpins file.
ERROR:: Exit from: src/core/scoring/SS_Killhairpins_Info.cc line: 370
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
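
If you are collecting several of these reports, a small script can pull the error message and exit location out of the stderr text. The following is only a rough Python sketch based on the line shapes shown in the outputs above; the function name `summarize_stderr` and the regular expressions are assumptions for illustration, not part of BOINC or Rosetta.

```python
import re

# Matches lines such as:
#   ERROR:: Exit from: src/core/scoring/SS_Killhairpins_Info.cc line: 370
EXIT_RE = re.compile(r"^ERROR:: Exit from:\s*(?P<file>\S+)\s+line:\s*(?P<line>\d+)\s*$")
# Matches the message line that precedes it, e.g. "ERROR: drSOP"
MSG_RE = re.compile(r"^ERROR:\s(?P<msg>.+)$")


def summarize_stderr(text):
    """Return a list of (message, source_file, line_number) tuples.

    The message is the most recent "ERROR: ..." line seen before each
    "ERROR:: Exit from: ..." line, or None if there was none.
    """
    message = None
    results = []
    for raw in text.splitlines():
        line = raw.strip()
        exit_match = EXIT_RE.match(line)
        if exit_match:
            results.append(
                (message, exit_match.group("file"), int(exit_match.group("line")))
            )
            message = None
            continue
        msg_match = MSG_RE.match(line)
        if msg_match:
            message = msg_match.group("msg")
    return results


if __name__ == "__main__":
    sample = (
        "Starting work on structure: _00001\n"
        "\n"
        "ERROR: [ERROR] invalid header input for kill_hairpins file.\n"
        "ERROR:: Exit from: src/core/scoring/SS_Killhairpins_Info.cc line: 370\n"
        "BOINC:: Error reading and gzipping output datafile: default.out\n"
        "called boinc_finish\n"
    )
    for entry in summarize_stderr(sample):
        print(entry)
```

Grouping the resulting tuples by source file and line number should make it easy to see whether different workunits are failing at the same place.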




ID: 71549 · Rating: 0
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,930,751
RAC: 236
Message 71554 - Posted: 31 Oct 2011, 17:38:08 UTC - in response to Message 71553.  

The offending jobs have been removed.

Rosetta is a large and diverse project. Unlike more focused efforts such as SETI@Home, the breadth of compute tasks being performed on Rosetta@Home is incredible. While offering enormous flexibility, this greatly complicates testing and validation. Unfortunately, some bad jobs slipped in this time. In many cases, Rosetta@Home users such as myself find out about failing jobs when you do, and we're just as frustrated when such jobs are distributed.

Thank you for your continued support.

But why the *snap* can't any sysadmin post some proper info about this in a timely fashion?

It's just a matter of simple communication, doesn't even cost much time. :-(

Ralf

ID: 71554 · Rating: 0
Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 71555 - Posted: 31 Oct 2011, 19:19:01 UTC - in response to Message 71553.  

The offending jobs have been removed.

Rosetta is a large and diverse project. Unlike more focused efforts such as SETI@Home, the breadth of compute tasks being performed on Rosetta@Home is incredible. While offering enormous flexibility, this greatly complicates testing and validation. Unfortunately, some bad jobs slipped in this time. In many cases, Rosetta@Home users such as myself find out about failing jobs when you do, and we're just as frustrated when such jobs are distributed.

Thank you for your continued support.


Why isn't ralph being used to catch these errors? All workunits I've received from ralph recently have been using app version 3.14.


Best,
Snags
ID: 71555 · Rating: 0
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,930,751
RAC: 236
Message 71556 - Posted: 31 Oct 2011, 19:23:04 UTC - in response to Message 71555.  

Why isn't ralph being used to catch these errors? All workunits I've received from ralph recently have been using app version 3.14.

Yeah, what RALPH@Home has been doing recently is a bit odd. Several times I got swamped with sets of 20 WUs at a time, with a mix of applications labeled both as "Rosetta Mini Beta 3.17" (currently 2 awaiting their turn) and as "Rosetta Mini 3.14" (another 20 WUs piled up to eventually be processed).

Ralf
ID: 71556 · Rating: 0
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,930,751
RAC: 236
Message 71559 - Posted: 31 Oct 2011, 20:53:31 UTC - in response to Message 71558.  

RALPH has separate executables for minirosetta (current version of Rosetta@Home) and minirosetta_beta (next version of Rosetta@Home). At the moment, the two applications are identical, despite their different version numbers.

minirosetta => 3.18
minirosetta_beta => 3.17

During the update process, the two versions will diverge. The idea behind this is to always have a running version of the software currently deployed on Rosetta@Home available for test.

And are you sure that everyone's on the same page here? :?

Ralf
ID: 71559 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1250
Credit: 14,421,737
RAC: 0
Message 71576 - Posted: 7 Nov 2011, 0:35:32 UTC
Last modified: 7 Nov 2011, 0:42:15 UTC

Looks like the 3.14 problem with workunits that stop using any CPU time at all but don't tell BOINC that they're finished isn't fully fixed.

Does appear to be less frequent, though.

Rosetta Mini 3.17
T0552_boinc_alignment_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_34966_22
CPU time at last checkpoint 01:17:50
CPU time 01:17:51
Elapsed time 25:00:05
Estimated time remaining 60:12:19
Fraction done 10.594%
Max RAM usage 95 MB
Working set size 546.09 MB

No longer using any CPU time, but still claims to be running.
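
As a rough way to spot this kind of stall from the numbers BOINC Manager shows, one can compare CPU time against elapsed time. The Python sketch below is only a heuristic under stated assumptions: the helper names and the 0.25 threshold are invented for illustration, and a host that throttles BOINC's CPU usage will naturally show a lower ratio even when nothing is wrong.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_fraction(cpu_time, elapsed_time):
    """Fraction of the elapsed wall-clock time that was spent on the CPU."""
    elapsed = to_seconds(elapsed_time)
    return to_seconds(cpu_time) / elapsed if elapsed else 0.0


def looks_stalled(cpu_time, elapsed_time, threshold=0.25):
    """Flag a task whose CPU fraction is below an arbitrary threshold."""
    return cpu_fraction(cpu_time, elapsed_time) < threshold


if __name__ == "__main__":
    # Figures from the task properties quoted above.
    fraction = cpu_fraction("01:17:51", "25:00:05")
    print(f"CPU fraction: {fraction:.1%}")
    print("Looks stalled:", looks_stalled("01:17:51", "25:00:05"))
```

With the figures above the task has spent only about 5% of its elapsed time on the CPU, which fits a process that has stopped making progress. A single snapshot cannot prove a stall, so checking whether CPU time still increases between two readings is the more reliable test.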

64-bit Vista SP2 with 8 GB; BOINC allowed to use 40%

11/3/2011 1:42:40 AM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/3/2011 1:42:40 AM | | log flags: file_xfer, sched_ops, task
11/3/2011 1:42:40 AM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/3/2011 1:42:40 AM | | Data directory: C:\ProgramData\BOINC
11/3/2011 1:42:40 AM | | Running under account Bobby
11/3/2011 1:42:40 AM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/3/2011 1:42:40 AM | | Processor: 6.00 MB cache
11/3/2011 1:42:40 AM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/3/2011 1:42:40 AM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/3/2011 1:42:40 AM | | Memory: 8.00 GB physical, 15.66 GB virtual
11/3/2011 1:42:40 AM | | Disk: 919.67 GB total, 555.16 GB free
11/3/2011 1:42:40 AM | | Local time is UTC -5 hours
11/3/2011 1:42:40 AM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)

Selected workunit length 12 hours.

Restarting BOINC lost all but 01:19:52 of the elapsed time.

I'll give the workunit one more chance to restart properly; if that isn't adequate, I'll put Rosetta@Home on No new tasks again until the next minirosetta version is ready.

I have not seen such a problem with the RALPH@Home 3.18 workunits (6 hour length selected), so I'll continue to run those.
ID: 71576 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1250
Credit: 14,421,737
RAC: 0
Message 71577 - Posted: 7 Nov 2011, 3:32:59 UTC - in response to Message 71576.  

Now finished, returned, and in Pending status.
ID: 71577 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1250
Credit: 14,421,737
RAC: 0
Message 71578 - Posted: 7 Nov 2011, 19:55:56 UTC

The same no-longer-using-CPU-time problem is also present in another workunit.

T0538_boinc_rosetta_cm_medal_ss_v2_cmiles_IGNORE_THE_REST_34758_10367
CPU time at last checkpoint 02:06:31
CPU time 02:07:46
Elapsed time 03:11:44
Fraction done 16.687%

BOINC Manager claims it is running, but Windows Task Manager says it is using no CPU time at all.

11/6/2011 6:23:11 PM | | Starting BOINC client version 6.12.34 for windows_x86_64
11/6/2011 6:23:11 PM | | log flags: file_xfer, sched_ops, task
11/6/2011 6:23:11 PM | | Libraries: libcurl/7.21.6 OpenSSL/1.0.0d zlib/1.2.5
11/6/2011 6:23:11 PM | | Data directory: C:\ProgramData\BOINC
11/6/2011 6:23:11 PM | | Running under account Bobby
11/6/2011 6:23:11 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
11/6/2011 6:23:11 PM | | Processor: 6.00 MB cache
11/6/2011 6:23:11 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
11/6/2011 6:23:11 PM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
11/6/2011 6:23:11 PM | | Memory: 8.00 GB physical, 15.80 GB virtual
11/6/2011 6:23:11 PM | | Disk: 919.67 GB total, 527.06 GB free
11/6/2011 6:23:11 PM | | Local time is UTC -6 hours
11/6/2011 6:23:11 PM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28562, CUDA version 4010, compute capability 2.1, 1024MB, 476 GFLOPS peak)

I'm about to restart BOINC to give that workunit another chance to restart properly, but I've already set No new tasks for Rosetta@home on that computer.
ID: 71578 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1250
Credit: 14,421,737
RAC: 0
Message 71579 - Posted: 7 Nov 2011, 20:04:16 UTC - in response to Message 71578.  

The restart made that workunit return quickly, with 99 decoys done; now in a pending state.

Could that mean that 3.17 has trouble doing something reasonable after it finishes 99 decoys? Some of the previous versions of minirosetta did.
ID: 71579 · Rating: 0



©2025 University of Washington
https://www.bakerlab.org