Problems with Minirosetta v1.54

rembertw

Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59530 - Posted: 12 Feb 2009, 14:02:04 UTC - in response to Message 59520.  

Mod.Sense

I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?


In the meantime I have set that computer to NNT (No New Tasks) and changed the preferred runtime. I will reactivate that computer and evaluate on Saturday or after the weekend. You'll be informed :)
BrnmccO1

Joined: 26 Jun 07
Posts: 17
Credit: 578,825
RAC: 0
Message 59532 - Posted: 12 Feb 2009, 21:23:42 UTC

Very good so far, zero error results on all machines for a long time. Version 1.54 is much better than the previous versions and much more stable. Keep up the good work stamping out the bugs.

It's been a long time since I've reviewed the results on all my crunchers and found no compute errors. If things keep going the way they are, we might break 100 Tflops yet!
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 59560 - Posted: 14 Feb 2009, 17:10:02 UTC

Workunit 205979363
Task 228619747
Name loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t332__IGNORE_THE_REST_2FLIA_6_6646_10_1
Mac OS X 10.4.11

This failed after 216 seconds : tail of stderr below

Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.
interpolate rotamers bin out of range: ARG 1.43667e-05 nan nan nan nan nan
81 81 19 20 2147483649 22 1.43667e-06 nan
ERROR:: Exit from: src/core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Yaroslav Isakov

Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59596 - Posted: 16 Feb 2009, 3:03:25 UTC
Last modified: 16 Feb 2009, 3:05:52 UTC

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 59601 - Posted: 16 Feb 2009, 10:15:59 UTC - in response to Message 59596.  

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2

I got a couple of validate errors too:
Task 228125280
Task 228133134
There's nothing more frustrating than completing a job ok only for it to go wrong when uploaded.

I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.
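For anyone wanting to check their own results the same way, here is a minimal sketch in Python that scans one task's stderr text for the "Hbond tripped" line and for lines beginning with "ERROR:". The function name `summarize_stderr` is hypothetical, and the markers are taken only from the excerpts quoted in this thread, so treat it as a rough aid rather than a statement of how Rosetta reports errors.

```python
# Marker text copied from the stderr excerpts quoted in this thread.
# Matching is case-insensitive because both "hbond tripped" and
# "Hbond tripped." appear in the reports.
WARNING_MARKER = "hbond tripped"


def summarize_stderr(text):
    """Return (hbond_tripped, error_lines) for one task's stderr text."""
    hbond_tripped = False
    error_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if WARNING_MARKER in line.lower():
            hbond_tripped = True
        elif line.startswith("ERROR:"):
            # Covers both the "ERROR:" and "ERROR::" forms seen above.
            error_lines.append(line)
    return hbond_tripped, error_lines


sample = """Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
"""

tripped, errors = summarize_stderr(sample)
print(tripped)
for line in errors:
    print(line)
```

Running this over several results would show whether "Hbond tripped" also turns up in tasks that finish cleanly, which is what would be needed before calling it the cause.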
epcorian

Joined: 1 Jan 09
Posts: 16
Credit: 253,062
RAC: 0
Message 59610 - Posted: 16 Feb 2009, 16:39:56 UTC - in response to Message 59428.  
Last modified: 16 Feb 2009, 16:42:55 UTC

I think I spoke too soon...that first WU crunched successfully but only 1 other WU was successful out of the 8 WU's. 2/8, better but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PC's running 32-bit XP SP3 crunch some Mini's and they did just fine.


So this weekend I installed a fresh copy of XP x64, upgraded it to SP2, installed my x64 version of NOD32 antivirus, and set BOINC to "use at most 75% of the processors" (3 of 4 cores on my Q6600), and it's crunching Mini's and Beta's without a problem! 1 successful Beta, 5 successful Mini's with 4 more coming down the pipe. So it looks like Mini does not like Vista x64. From my adventures on Google, it turns out that XP x64 is actually based on the Server 2003 code tree while Vista is based on crap. :)
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59614 - Posted: 16 Feb 2009, 18:41:30 UTC

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 59615 - Posted: 16 Feb 2009, 18:45:12 UTC - in response to Message 59614.  

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...


If I remember correctly, they have created a 99 model stop line to keep the tasks from running forever.
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59617 - Posted: 16 Feb 2009, 19:25:37 UTC
Last modified: 16 Feb 2009, 19:27:33 UTC

Yeah, the 99 stop limit was to avoid a problem with the file size that is zipped up and uploaded. However, I was just wondering if there is now a new companion problem that the validator does not properly handle those results... or, the result itself is somehow bad...

Given that I have gone back to the 3rd of Feb and have well over a hundred (220) results with only three errors, this is a puzzlement ...

{edit}
added number ..

Also I note that the runtime is only 145 seconds ... so that was fast work ... :)
Pharrg

Joined: 10 Jul 06
Posts: 10
Credit: 6,478
RAC: 0
Message 59625 - Posted: 17 Feb 2009, 2:22:04 UTC

I started running Rosetta this morning on a 64bit Vista machine and all seems to be working well. It's been working well on other projects too. Here is what I'm running:

Core i7 920 CPU
Asus P6T6 WS Revolution motherboard
6 GB DDR3 Triple Channel RAM
Vista Home Premium SP1 64bit

64bit BOINC 6.6.7

As I said, no problems yet and a number of WU's have completed already.


Pharrg

Joined: 10 Jul 06
Posts: 10
Credit: 6,478
RAC: 0
Message 59626 - Posted: 17 Feb 2009, 3:14:15 UTC

Ok, after a number of successful completions, I did see one that looks like it failed. Message as follows:

2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished
2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent


Don't know the cause of that one...

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59627 - Posted: 17 Feb 2009, 6:35:01 UTC

Well, a couple hundred tasks and several with the same error, multiple systems (3 different), based on Xeon, Q9300, and i7 processors, various amounts of available RAM, though in common all are running Win XP Pro 32-Bit:

228932012
229013783
229066094
229072515

The error:

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
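As a reading aid, the "Reason:" line can be picked apart mechanically. The sketch below is an assumption-laden example in Python: the regular expression is fitted only to the single line format quoted above, and `parse_reason` is a hypothetical helper, not part of BOINC or Rosetta. On Windows, 0xc0000005 is the access-violation status code, and a read from address 0x00000000 usually points to a null pointer being dereferenced.

```python
import re

# Pattern fitted to the one "Reason:" line format quoted in this thread.
REASON_PATTERN = re.compile(
    r"Reason: (?P<kind>.+?) \((?P<code>0x[0-9a-fA-F]+)\) at address "
    r"(?P<pc>0x[0-9a-fA-F]+) (?P<op>read|write) attempt to address "
    r"(?P<target>0x[0-9a-fA-F]+)"
)


def parse_reason(line):
    """Parse an exception 'Reason:' line; return None if it does not match."""
    match = REASON_PATTERN.search(line)
    if match is None:
        return None
    target = int(match.group("target"), 16)
    return {
        "kind": match.group("kind"),
        "code": int(match.group("code"), 16),
        "pc": int(match.group("pc"), 16),    # address of the faulting instruction
        "op": match.group("op"),             # "read" or "write"
        "target": target,                    # address the code tried to access
        "null_access": target == 0,          # True suggests a null pointer
    }


sample = ("Reason: Access Violation (0xc0000005) at address 0x004E3308 "
          "read attempt to address 0x00000000")
info = parse_reason(sample)
print(info["kind"], hex(info["pc"]), info["op"], info["null_access"])
```

Comparing the faulting address across reports is the useful part: the same address from different machines suggests the same spot in the application rather than flaky hardware.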

Yaroslav Isakov

Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59631 - Posted: 17 Feb 2009, 12:16:07 UTC - in response to Message 59601.  


I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.


Hey, you're right, all my errors are with Hbond tripped in stderr, so I think that it's a source of problems
Pharrg

Joined: 10 Jul 06
Posts: 10
Credit: 6,478
RAC: 0
Message 59632 - Posted: 17 Feb 2009, 15:42:53 UTC
Last modified: 17 Feb 2009, 15:45:01 UTC

So... I completed a bunch more tasks successfully, then got a 2nd task where it said the output file was missing. Anyone else getting these?

2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished
2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent

I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?
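A quick way to check this kind of hunch over a longer list of failures is to group the task names by their leading part. The Python sketch below assumes, purely for illustration, that the last two underscore-separated fields of a task name are a work-unit number and a replica index, and that everything before them identifies the batch; `batch_prefix` and `group_by_batch` are hypothetical names.

```python
from collections import defaultdict


def batch_prefix(task_name):
    """Drop the last two underscore-separated fields of a task name.

    Assumption: those two fields are a work-unit number and a replica
    index, and the remainder identifies the batch.
    """
    return task_name.rsplit("_", 2)[0]


def group_by_batch(task_names):
    """Map each batch prefix to the list of task names that share it."""
    groups = defaultdict(list)
    for name in task_names:
        groups[batch_prefix(name)].append(name)
    return dict(groups)


# The two failed tasks reported in the messages above.
failed = [
    "ss-neg-1i17__7365_4677_1",
    "ss-neg-1i17__7365_5964_0",
]

for prefix, names in group_by_batch(failed).items():
    print(prefix, len(names))
```

If most failures land under one prefix while other batches run cleanly on the same machine, that points at the batch rather than the host.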
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 59633 - Posted: 17 Feb 2009, 17:14:07 UTC - in response to Message 59632.  
Last modified: 17 Feb 2009, 17:15:24 UTC


I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?


I had one of those fail too. Firewall blocked it from reporting the symbol tables :(
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59634 - Posted: 17 Feb 2009, 17:25:15 UTC

Looks like Pharrg actually had three of these fail

ss-neg-1i17__7365_5964_0
ss-neg-1i17__7365_5190_1 (wingman failed too)
ss-neg-1i17__7365_4677_1 (wingman failed too)

Rosetta Moderator: Mod.Sense
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 59635 - Posted: 17 Feb 2009, 17:40:09 UTC

I had two more similar tasks on my machines, so I suspended others to try and run them.

I've got an ss-neg-1je9 that seems normal so far. But my other ss-neg-1i17 doesn't seem able to display graphics. Black window, no pane lines, on WinXP.
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 59637 - Posted: 17 Feb 2009, 18:44:34 UTC
Last modified: 17 Feb 2009, 18:45:25 UTC

Yep, my next ss-neg-1i17 failed too.

As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,691,837
RAC: 1,806
Message 59638 - Posted: 17 Feb 2009, 21:39:56 UTC

2 ss-neg tasks died on me as well; I have a 3rd in progress at 50% complete so far.

Here are the failures:

ss-neg-1i17__7365_1743_0

ss-neg-1i17__7365_542_1

They both do the following:

initialization is ok, but then when it is about to start it errors out:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
----------

Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,373,493
RAC: 10,990
Message 59640 - Posted: 17 Feb 2009, 23:35:02 UTC
Last modified: 17 Feb 2009, 23:35:45 UTC



