Minirosetta v1.45 bug thread

Message boards : Number crunching : Minirosetta v1.45 bug thread

xsc2

Joined: 9 Jul 08
Posts: 4
Credit: 62,354
RAC: 0
Message 57677 - Posted: 7 Dec 2008, 13:21:37 UTC

ID: 57677 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,813,645
RAC: 3,531
Message 57680 - Posted: 7 Dec 2008, 15:29:44 UTC

Task ID 212423733
Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_ccr19_olange_5384_13138_0
Workunit 193652172

Validate state Invalid
Claimed credit 14.8874783714289
Granted credit 0
application version 1.45
ID: 57680 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57681 - Posted: 7 Dec 2008, 15:43:38 UTC - in response to Message 57680.  

Task ID 212423733
Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_ccr19_olange_5384_13138_0
Workunit 193652172

Validate state Invalid
Claimed credit 14.8874783714289
Granted credit 0
application version 1.45

---
Here is the link to his task:

https://boinc.bakerlab.org/rosetta/result.php?resultid=212423733
Another (0xc0000005) error.
ID: 57681 · Rating: 0
guhungry

Joined: 1 Dec 08
Posts: 1
Credit: 620,505
RAC: 0
Message 57683 - Posted: 7 Dec 2008, 17:04:28 UTC
Last modified: 7 Dec 2008, 17:08:14 UTC

I have a lot of them, and every error I looked at returned exit code -1073741819 (0xc0000005) from a cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs task.
---------------------------------------------
212713228
212673521
212467256
212463757
212356024
212292194
212243396
212205055
212172410
212319134
212285788
212244308
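
The decimal and hexadecimal forms of this exit code are the same 32-bit value, read once as a signed and once as an unsigned integer. A minimal Python sketch of the conversion follows; the helper names `to_signed` and `to_unsigned` are illustrative only and are not part of BOINC or Rosetta. On Windows, 0xc0000005 is the access-violation status code. The same arithmetic at 8 bits reproduces the "193 (0xc1, -63)" form that appears in the Mac and Linux reports in this thread.

```python
def to_signed(value, bits=32):
    """Reinterpret an unsigned integer as a two's-complement signed one."""
    mask = (1 << bits) - 1
    value &= mask
    # If the top bit is set, the signed reading is value - 2**bits.
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def to_unsigned(value, bits=32):
    """Reinterpret a signed integer as unsigned by masking to `bits` bits."""
    return value & ((1 << bits) - 1)


print(to_signed(0xC0000005))          # -1073741819
print(hex(to_unsigned(-1073741819)))  # 0xc0000005
print(to_signed(0xC1, bits=8))        # -63
```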
ID: 57683 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 57685 - Posted: 7 Dec 2008, 19:10:43 UTC

Another (0xc0000005) error:

cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_30836_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=212733641

Is there a problem with the cs_vanilla workunits?

I notice that this is one of the first two 1.45 workunits I've seen running on my dual-core machine at the same time - is there some problem with that and my memory size (2 GB total)?
ID: 57685 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2141
Credit: 41,518,559
RAC: 10,612
Message 57686 - Posted: 7 Dec 2008, 19:11:26 UTC - in response to Message 57664.  
Last modified: 7 Dec 2008, 19:17:22 UTC

Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).

9 out of the next 11 were successful too, making 24 good out of 32, which is the best performance I've had for a very long time. Combined with a continuing 100% record on Beta 5.98s (much more coming through recently) I'm officially happier and less frustrated.

My 5th best day ever!

Not perfect yet, but just reporting some better news instead of constant misery. You must be working on the right lines. Keep it up!
ID: 57686 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 57687 - Posted: 7 Dec 2008, 19:47:52 UTC

Another crash on Mac OS X 10.4.11. Task 212684901 : Workunit 193888088

Same area of code as before (update_domain_map)

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
SIGBUS: bus error

Crashed executable name: minirosetta_1.45_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Sun Dec 7 06:56:14 2008

Thread 0 Crashed:
0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41
1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403
2 ...etta_1.45_i686-apple-darwin 0x000830f3 __ZN4core4pose4Pose13scoring_beginEN7utility7pointer10owning_ptrINS_7scoring17ScoreFunctionInfoEEE + 1329
3 ...etta_1.45_i686-apple-darwin 0x000fa4e2 __ZNK4core7scoring13ScoreFunctionclERNS_4pose4PoseE + 4686
4 ...etta_1.45_i686-apple-darwin 0x001938b1 __ZNK9protocols8abinitio18AbrelaxApplication13process_decoyERN4core4pose4PoseERKNS2_7scoring13ScoreFunctionESsRNS2_2io6silent12SilentStructE + 35
5 ...etta_1.45_i686-apple-darwin 0x001afe27 __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 9651
6 ...etta_1.45_i686-apple-darwin 0x001b5381 __ZN9protocols8abinitio18AbrelaxApplication3runEv + 881
7 ...etta_1.45_i686-apple-darwin 0x00009a87 _main + 3941
8 ...etta_1.45_i686-apple-darwin 0x0000292e __start + 216
9 ...etta_1.45_i686-apple-darwin 0x00002855 start + 41

etc.
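
The frame names in the trace are C++ symbols in mangled form (the Itanium C++ ABI scheme used by GCC, with an extra leading underscore on Mac OS X). As a rough sketch, the leading namespace, class and function names can be recovered by reading the length-prefixed pieces. `nested_name` below is a hypothetical helper that ignores argument types, templates and operators, so a real demangler such as `c++filt` remains the better tool for anything beyond a quick look.

```python
import re


def nested_name(mangled):
    """Decode only the leading nested name of an Itanium-ABI mangled symbol.

    Returns the input unchanged if it does not start with _ZN / _ZNK.
    """
    match = re.match(r"_+ZNK?", mangled)
    if not match:
        return mangled
    pos = match.end()
    parts = []
    # Each component is written as <decimal length><identifier>.
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return "::".join(parts)


frame0 = ("__ZNK4core10kinematics8AtomTree17update_domain_map"
          "ERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_")
print(nested_name(frame0))  # core::kinematics::AtomTree::update_domain_map
```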


ID: 57687 · Rating: 0
steve
Joined: 27 Nov 08
Posts: 7
Credit: 1,085
RAC: 0
Message 57690 - Posted: 7 Dec 2008, 22:41:01 UTC - in response to Message 57612.  

David,

I just received an error while this task was being processed:

Time of download:
12/7/2008 12:30:34 PM|rosetta@home|Starting task fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0 using minirosetta version 145

Time of Error
12/7/2008 3:31:37 PM|rosetta@home|Started upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0

Error Message
"Could not write to a specified memory location". The message asked me if I wanted to debug; I pressed Cancel.

Time of Upload:
The file was uploaded to your server as:
12/7/2008 3:31:43 PM|rosetta@home|Finished upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0


Steve


Please post bugs and issues regarding minirosetta version 1.45.

This update includes fixes to long runtimes for 'relax' jobs, validation errors, check point recovery issues, and numerical instability in hydrogen-bond scoring.

We think we might have fixed the preemption problem so please keep an eye out for this. The "can't acquire lockfile" issue might also be related. If you are having lockfile problems, please make sure there are no other boinc applications running in the same slot. If necessary, turn off the client and make sure all boinc apps are not running, and then restart the client.

ID: 57690 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 57697 - Posted: 8 Dec 2008, 4:41:46 UTC

Hi.

This one broke after 2hrs, 44min.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=194032196

Mon 08 Dec 2008 15:03:25 EST|rosetta@home|Output file cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_flua_olange_5385_36439_0_0 for task cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_flua_olange_5385_36439_0 absent

pete.

ID: 57697 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57699 - Posted: 8 Dec 2008, 4:56:26 UTC

ID: 57699 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 57700 - Posted: 8 Dec 2008, 5:28:43 UTC - in response to Message 57699.  

Here's some more bad cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs:

https://boinc.bakerlab.org/rosetta/result.php?resultid=212592040
https://boinc.bakerlab.org/rosetta/result.php?resultid=212475523
https://boinc.bakerlab.org/rosetta/result.php?resultid=212454329
https://boinc.bakerlab.org/rosetta/result.php?resultid=212415902
https://boinc.bakerlab.org/rosetta/result.php?resultid=212349479
https://boinc.bakerlab.org/rosetta/result.php?resultid=212298709
https://boinc.bakerlab.org/rosetta/result.php?resultid=212268849
https://boinc.bakerlab.org/rosetta/result.php?resultid=212260558


Makes me suspect that at least one of the following is true:

1. cs_vanilla workunits are a high fraction of the workunits now going out.

2. The cs_vanilla workunits are using a new feature of 1.45 that hasn't been adequately tested for its ability to finish properly.
ID: 57700 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 57702 - Posted: 8 Dec 2008, 16:36:54 UTC

ERROR: Illegal value for integer option -run:jran specified:

in workunit 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_5476_258_1

AdeB
ID: 57702 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2141
Credit: 41,518,559
RAC: 10,612
Message 57703 - Posted: 8 Dec 2008, 16:49:24 UTC - in response to Message 57686.  

Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).

9 out of the next 11 were successful too, making 24 good out of 32, which is the best performance I've had for a very long time. Combined with a continuing 100% record on Beta 5.98s (much more coming through recently) I'm officially happier and less frustrated.

And now 16 more successes out of 18 making 40 out of 50. Most errors came early, so I'm now confident enough to up my run-time from 2 to 3 hours again.

Good work on this "Can't acquire lockfile" problem. I'm just going to tidy up the lockfiles, reboot and see if the good results continue.

Efforts much appreciated here. Let's see if it can be nailed in the next update.
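
The running totals quoted in this post are consistent with each other. A small Python check follows, assuming the first 21 results contained 15 successes. That figure is inferred from "24 good out of 32" minus "9 out of the next 11" and is not stated directly in the post.

```python
# (successes, results) per batch; the first pair is inferred, not quoted.
batches = [(15, 21), (9, 11), (16, 18)]


def running_totals(batches):
    """Return (good so far, results so far, percent good) after each batch."""
    good = total = 0
    out = []
    for ok, n in batches:
        good += ok
        total += n
        out.append((good, total, 100.0 * good / total))
    return out


for good, total, pct in running_totals(batches):
    print(f"{good} of {total} succeeded ({pct:.1f}%)")
# 15 of 21 succeeded (71.4%)
# 24 of 32 succeeded (75.0%)
# 40 of 50 succeeded (80.0%)
```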
ID: 57703 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2141
Credit: 41,518,559
RAC: 10,612
Message 57704 - Posted: 8 Dec 2008, 17:42:41 UTC - in response to Message 57703.  

I'm just going to tidy up the lockfiles, reboot and see if the good results continue.

On this, when trying to stop the BOINC service I wasn't allowed to until I'd ended the boinc.exe client process under User Name boinc_master (Vista64 OS quad-core AMD Phenom).

In the Task Manager I 'showed processes for all users' to do this and saw that 2 rosetta_beta_5.98_windows_x86_64.exe*32 processes were still running (correct) but also about 20 minirosetta_1.4x_windows_x86_64.exe*32 processes were running. About half of those were for MiniRosetta 1.40 and the other half for v1.45. All under the User Name boinc_project. There should just have been 2 for v1.45.

My last Mini 1.40 WU was completed late last Friday, so these 10-ish 1.40 processes have persisted for 3 days (no re-boots in that time).

I manually ended all these processes.

Going to the C:\ProgramData\BOINC\slots folder, there were 23 folders (numbered from 0 to 22), the first 19 of which contained a 0-byte boinc_lockfile file and a stderr.txt and a stdout.txt file. The other 4 folders contained the files I'd expect for running processes.

I deleted all the boinc_lockfile files and re-booted. On start-up, the first 19 folders had been removed, leaving the 4 active ones.

I'm no programmer and may be talking out of my hat, but could these old processes still running have something to do with being unable to acquire boinc_lockfile?

When there's a Compute Error is there some fault in the way the process closes down and releases the files it holds open? Or could it be to do with the user (me) aborting a WU manually when I see it's stalled and producing error messages?

I'm guessing wildly, obviously, but hopefully this means something more sensible to you clever chaps. You seem to be close to some solutions for a problem that's persisted over several versions, so maybe this is the final clue you need? Hope it helps.
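
The manual check described in this post, looking through the slots folder for folders that hold nothing but a 0-byte boinc_lockfile plus stderr.txt and stdout.txt, could be scripted roughly as follows. This is only a sketch: `find_leftover_slots` is a hypothetical helper, the folder path is the one named in the post with its backslashes restored (an assumption), and the script only reports folders rather than deleting anything. Whether such folders are safe to clean up while the client is running is not something the thread establishes.

```python
from pathlib import Path

# File names reported in the post for the leftover slot folders.
LEFTOVER_FILES = {"boinc_lockfile", "stderr.txt", "stdout.txt"}


def find_leftover_slots(slots_dir):
    """List slot folders holding only a 0-byte boinc_lockfile plus stderr/stdout."""
    leftovers = []
    for slot in sorted(Path(slots_dir).iterdir()):
        if not slot.is_dir():
            continue
        names = {p.name for p in slot.iterdir()}
        lock = slot / "boinc_lockfile"
        if (
            lock.is_file()
            and lock.stat().st_size == 0
            and names <= LEFTOVER_FILES
        ):
            leftovers.append(slot.name)
    return leftovers


if __name__ == "__main__":
    # Assumed location, taken from the post; adjust for other systems.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    if slots.is_dir():
        print(find_leftover_slots(slots))
```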
ID: 57704 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 57706 - Posted: 8 Dec 2008, 18:50:14 UTC

Some more workunits failing on Mac OS X 10.4.11

Task 212933060 : Workunit 194105301
Task 212892168 : Workunit 194071153
(names **_ZNMP_RELAX_**)

both failing at startup

ERROR: Illegal value for integer option -run:jran specified:

Also Task 212828576 : Workunit 194016740 (cs_vanilla_* again) failing halfway through in update_domain_map

Thread 0 Crashed:
0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41
1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403

etc.


ID: 57706 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 57708 - Posted: 8 Dec 2008, 20:43:04 UTC

Hi, me again.

This one crashed overnight after 5hrs, 30min. Not good.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=194070527

Mon 08 Dec 2008 20:55:01 EST|rosetta@home|Output file cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_38205_0_0 for task cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_nsp1_olange_5389_38205_0 absent


<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8b883ab]
[0x8bb211c]
[0xffffe500]
[0x85d2f9a]
[0x85b766e]
[0x83efc90]
[0x811a22a]
[0x812a216]
[0x812be61]
[0x804b884]
[0x8c0dc1c]
[0x8048111]

Exiting...

</stderr_txt>

pete.

ID: 57708 · Rating: 0
Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 57709 - Posted: 8 Dec 2008, 20:49:43 UTC

Running Windows XP Home, I found 2 WUs with this error:
ERROR: Illegal value for integer option -run:jran specified:

1wjdA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjdA-_5475_616_0
1dsvA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1dsvA-_5475_81_0

The first WU also had the same error on the second run.

Have a nice day,
Path7.

ID: 57709 · Rating: 0
stewjack

Joined: 23 Apr 06
Posts: 39
Credit: 95,871
RAC: 0
Message 57712 - Posted: 8 Dec 2008, 21:10:06 UTC
Last modified: 8 Dec 2008, 21:15:11 UTC

This WU was crunched under v.1.45 and exited with an error.

========
Task ID 212609043
Name: loopbuild_reference_hombench_loopbuild_t327__IGNORE_THE_REST_1XMAA_2_5453_3_0

Workunit 193817482

https://boinc.bakerlab.org/rosetta/result.php?resultid=212609043

==== stderr out ======
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007FA87A read attempt to address 0x000002E8


Jack
ID: 57712 · Rating: 0
Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 57714 - Posted: 8 Dec 2008, 21:24:12 UTC

Hi everyone, due to a limit on command line length set by BOINC, the jobs named "*_ZN_ABRELAX_*" have their command lines automatically truncated when sent out on Rosetta@Home. That is why you see them stop right away with an error like "ERROR: Illegal value for integer option -run:jran specified: ". In fact, the same workunits returned with a very good success rate on the testing server. We are investigating why the alpha testing server did not catch such errors in the first place. This is another unfortunate incident which will be a new lesson for us. Sorry for any inconvenience this has caused you.
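
To illustrate how a truncated command line can produce that message with nothing after the colon, here is a minimal Python sketch. The actual BOINC length limit and the real Rosetta command line are not given in this thread, so the command line, the cut position, and the `parse_int_option` helper are all made up for illustration; only the option name `-run:jran` and the wording of the error come from the posts above.

```python
def parse_int_option(cmdline, name):
    """Return the integer following `name`, or None if `name` is absent."""
    tokens = cmdline.split()
    if name not in tokens:
        return None
    idx = tokens.index(name)
    # If the line was cut right after the option name, no value is left.
    value = tokens[idx + 1] if idx + 1 < len(tokens) else ""
    try:
        return int(value)
    except ValueError:
        raise ValueError(
            f"Illegal value for integer option {name} specified: {value}"
        ) from None


# Hypothetical command line; only -run:jran is taken from the thread.
full = "-hypothetical_option some_value -run:jran 1234567"
# Hypothetical limit chosen so the cut lands right after the option name.
limit = full.index("-run:jran") + len("-run:jran")
cut = full[:limit]

print(parse_int_option(full, "-run:jran"))  # 1234567
try:
    parse_int_option(cut, "-run:jran")
except ValueError as err:
    print("ERROR:", err)
```

With the full line the seed parses normally; with the truncated line the option name survives but its value does not, which matches the empty value at the end of the reported error.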
ID: 57714 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 57721 - Posted: 9 Dec 2008, 0:23:12 UTC - in response to Message 57699.  

Here's some more bad cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs:

https://boinc.bakerlab.org/rosetta/result.php?resultid=212592040
https://boinc.bakerlab.org/rosetta/result.php?resultid=212475523
https://boinc.bakerlab.org/rosetta/result.php?resultid=212454329
https://boinc.bakerlab.org/rosetta/result.php?resultid=212415902
https://boinc.bakerlab.org/rosetta/result.php?resultid=212349479
https://boinc.bakerlab.org/rosetta/result.php?resultid=212298709
https://boinc.bakerlab.org/rosetta/result.php?resultid=212268849
https://boinc.bakerlab.org/rosetta/result.php?resultid=212260558


I notice that for most of those workunits, your wingman returned an error on the same workunit, so I'd suspect problems built into either the cs_vanilla workunits or a new feature of minirosetta that few other workunits have used before.

In those cases where your wingman completed the workunit without a problem, it was on an Intel Xeon CPU with a lot more memory. This leads me to believe that cs_vanilla workunits need a lot more memory than your computer has, and suggests that your rather old version of BOINC (5.2.13) may not be sending the information needed to choose only workunits that will work with your memory size. The version of BOINC used where those workunits were successful was 6.2.19.

My computer is in between - using BOINC 5.10.45 with a total RAM memory of 2 GB, and the only cs_vanilla workunit I've had so far ran longer than most of yours, but still failed.
ID: 57721 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org