Problems with Minirosetta v1.54

Message boards : Number crunching : Problems with Minirosetta v1.54

svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 59641 - Posted: 18 Feb 2009, 0:53:03 UTC

A couple of these ss-neg-1i17* workunits are failing on Mac OS X 10.4.11:

Workunit 208810096, Task 229094592, Name ss-neg-1i17__7365_4132_0

and

Workunit 208854507, Task 229142269, Name ss-neg-1i17__7365_4742_0

They're both failing in the same routine. Here's the crash info from the first one:

Thread 0 Crashed:
0 ...etta_1.54_i686-apple-darwin 0x001b13b7 __ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE + 235
1 ...etta_1.54_i686-apple-darwin 0x00027735 __ZN4core12conformation12Conformation15setup_atom_treeEv + 109
2 ...etta_1.54_i686-apple-darwin 0x0002a378 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 2910
3 ...etta_1.54_i686-apple-darwin 0x00400e64 __ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE + 516
4 ...etta_1.54_i686-apple-darwin 0x00107b23 __ZN9protocols5boinc5Boinc18worker_is_finishedERKi + 913
5 ...etta_1.54_i686-apple-darwin 0x00c8d172 __ZN9protocols7jobdist18BaseJobDistributorIN7utility7pointer10owning_ptrINS0_8BasicJobEEEE8next_jobERS6_Ri + 2102
6 ...etta_1.54_i686-apple-darwin 0x001177a5 __ZN9protocols8abinitio18AbrelaxApplication4foldERN4core4pose4PoseEN7utility7pointer10owning_ptrINS_8ProtocolEEE + 1449
7 ...etta_1.54_i686-apple-darwin 0x001289ad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 807
8 ...etta_1.54_i686-apple-darwin 0x000039cc _main + 1356
9 ...etta_1.54_i686-apple-darwin 0x00001dee __start + 216
10 ...etta_1.54_i686-apple-darwin 0x00001d15 start + 41
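
For anyone who wants to read the mangled names in a trace like this one, here is a minimal sketch of decoding the simplest case of the Itanium C++ ABI mangling (a plain nested name made of length-prefixed identifiers). The function name demangle_simple is hypothetical, and the sketch deliberately gives up on template arguments and substitutions; a full tool such as c++filt handles those.

```python
def demangle_simple(symbol: str) -> str:
    """Decode the qualified name from a simple mangled symbol like _ZN3foo3barEv.

    Illustrative only: handles plain length-prefixed identifiers inside
    N...E. Anything else (template arguments, substitutions, qualifiers)
    is returned unchanged.
    """
    # Mac OS X prepends one extra underscore to every symbol.
    s = symbol[1:] if symbol.startswith("__Z") else symbol
    if not s.startswith("_ZN"):
        return symbol

    pos = 3
    parts = []
    while pos < len(s) and s[pos].isdigit():
        # Read the decimal length prefix, then that many characters.
        end = pos
        while end < len(s) and s[end].isdigit():
            end += 1
        length = int(s[pos:end])
        parts.append(s[end:end + length])
        pos = end + length

    # A plain nested name closes with 'E' right after the last identifier.
    if not parts or pos >= len(s) or s[pos] != "E":
        return symbol
    return "::".join(parts)


# Frames 1 and 7 from the trace above.
print(demangle_simple("__ZN4core12conformation12Conformation15setup_atom_treeEv"))
print(demangle_simple("__ZN9protocols8abinitio18AbrelaxApplication3runEv"))
```

Only the qualified function name is recovered; the argument types that follow the closing E are ignored.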



AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 59645 - Posted: 18 Feb 2009, 4:37:41 UTC

I've had three ss-neg-1i17__7365 WUs fail with segmentation violations on three different Linux machines:

https://boinc.bakerlab.org/rosetta/result.php?resultid=229167706
https://boinc.bakerlab.org/rosetta/result.php?resultid=229161990
https://boinc.bakerlab.org/rosetta/result.php?resultid=229084435

(I notice that only the third number is different in the stack traces of the above three WUs.)

robertmiles

Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,472
RAC: 1,593
Message 59647 - Posted: 18 Feb 2009, 9:16:58 UTC

A workunit with some odd behavior, but no definite error:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=209046400

A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.

It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.

Is this something normal that just happened at an unusual time, or something more significant?

rembertw

Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59649 - Posted: 18 Feb 2009, 10:57:15 UTC - in response to Message 59520.  

Mod.Sense wrote: "What is it showing for the estimated runtime, before the task starts?"


There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%

I think my settings before were asking for about 6 hours of runtime, and now 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT (No New Tasks) for Rosetta, but I'm willing to wait a while longer.

Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59650 - Posted: 18 Feb 2009, 13:14:18 UTC

Three more errors ... this time two I have not seen before:

229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader

229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz

So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...

Keith T.

Joined: 1 Mar 07
Posts: 58
Credit: 34,135
RAC: 0
Message 59651 - Posted: 18 Feb 2009, 13:30:29 UTC

At least 3 of my recent tasks have resulted in Validate errors.

https://boinc.bakerlab.org/rosetta/result.php?resultid=227721905
https://boinc.bakerlab.org/rosetta/result.php?resultid=227934901
https://boinc.bakerlab.org/rosetta/result.php?resultid=227919237

Please could someone in authority explain why there have been so many of these recently.

I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.

Keith

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59655 - Posted: 18 Feb 2009, 14:47:25 UTC

rembertw, the maximum runtime preference possible is 24 hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28 hrs. So, if you could, let it run at least 29 hrs, and if it is still running at that point, abort it.

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? Antivirus software? Windows service pack? Age of machine? BOINC version?
Rosetta Moderator: Mod.Sense

Yaroslav Isakov

Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59657 - Posted: 18 Feb 2009, 15:01:59 UTC


Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 59658 - Posted: 18 Feb 2009, 18:57:12 UTC

About 12 hours ago the following WU ended with an Unhandled Exception Detected:

ss-neg-1i17__7365_3969_1

This WU had already failed with the same error on another computer.

Path7.

Sid Celery

Joined: 11 Feb 08
Posts: 1981
Credit: 38,422,373
RAC: 13,429
Message 59667 - Posted: 19 Feb 2009, 5:04:25 UTC

Another one snuck through:

ss-neg-1i17__7365_4076_1

Looks like I'll have to abort all these on sight. I'm not sure any of them have run successfully for me yet. :(

Paul D. Buck

Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59668 - Posted: 19 Feb 2009, 7:07:58 UTC

New error -161 on both Mini 1.54 and 5.98 ...

Mini-1.54
229605017
229597762
229594079
229593677

5.98
229601150


Yaroslav Isakov

Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59672 - Posted: 19 Feb 2009, 16:29:16 UTC
Last modified: 19 Feb 2009, 16:32:01 UTC

Hey! Very strange one! It's valid, but with Hbond tripped and a very short run time, 2380 secs instead of ~10000:
loopbuild_chunk_1_3_B_hb_t357__IGNORE_THE_REST_1VBGA_4_7477_27_0

BTW, I notice that all my wrong results (and this last one) are loopbuild_chunk*.

xrobert

Joined: 28 Oct 05
Posts: 3
Credit: 168,865
RAC: 0
Message 59674 - Posted: 19 Feb 2009, 18:02:55 UTC

So far, all my mini-Rosetta WUs are getting stuck. I have to abort them.
The normal WUs work fine.



rembertw

Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59677 - Posted: 20 Feb 2009, 7:03:21 UTC - in response to Message 59655.  
Last modified: 20 Feb 2009, 7:12:40 UTC

Mod.Sense wrote: "I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?"


It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same BOINC version.

Some things I noticed:
- When a 0% task (only with Rosetta 1.54) is paused manually after x hours and then restarted, its runtime also resets to 0.
- When the 1.54 task starts, both processors get work (multiple projects). However, when one of the other projects' tasks stops, the 2nd processor starts idling. It cannot get another task to run from Rosetta or any other project, despite the queue having multiple tasks ready to start or continue.

I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It had already run for 24h+ before, but its time was reset because of a pause. At this moment it is at 19h again. I will let it run until it gets past 31h of runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.

[edit]Changed "all" in "both" and corrected a typo[/edit]

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59683 - Posted: 20 Feb 2009, 14:32:26 UTC

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
Rosetta Moderator: Mod.Sense

rembertw

Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59684 - Posted: 20 Feb 2009, 15:23:06 UTC - in response to Message 59683.  

Mod.Sense wrote: "rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another."

I agree, but this shows up only when it has started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Mod.Sense wrote: "Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?"

Standard setup with full permissions, running on a local hard drive. No fancy settings.

Mod.Sense wrote: "I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?"

Every now and again I do a BOINC upgrade on my machines. I heard some negative comments about the current BOINC version, which is why I considered waiting until summer or so to upgrade. I guess now the time has come. To be certain, I'll do a total BOINC uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helps...

robertmiles

Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,472
RAC: 1,593
Message 59686 - Posted: 20 Feb 2009, 16:41:25 UTC - in response to Message 59684.  

rembertw wrote (replying to Mod.Sense):

Mod.Sense: "rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another."

rembertw: "I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta."

Mod.Sense: "Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?"

rembertw: "Standard setup with full authority running on a local hard drive. No fancy settings."

Mod.Sense: "I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?"

rembertw: "Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped..."


Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59688 - Posted: 20 Feb 2009, 18:31:56 UTC
Last modified: 20 Feb 2009, 18:33:48 UTC

robertmiles, if you were directing the question to me, I try to stay out of that one. I am only recommending a change of BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF (duration correction factor) problems reported on the 6.6 series (which is the current test version), and I think the 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (Nothing against 6.2.28, but it's not listed anymore for some reason.)

You can see more BOINC versions for download on this page:
http://boinc.berkeley.edu/download_all.php
Rosetta Moderator: Mod.Sense

TimL

Joined: 16 Sep 06
Posts: 17
Credit: 15,480,956
RAC: 0
Message 59723 - Posted: 22 Feb 2009, 9:59:14 UTC

Hi all,
loopbuild_mamaln_ideal_hb_t305__IGNORE_THE_REST_1zc0_1_7630_19 finished early with error -
Access Violation (0xc0000005) at address 0x7C91AA01 read attempt to address 0x0D1BF548

Haven't had much luck getting errors of late but will mention that I had just bumped the bus speed up a touch when this error occurred.



TomaszPawel

Joined: 28 Apr 07
Posts: 54
Credit: 2,791,145
RAC: 0
Message 59751 - Posted: 23 Feb 2009, 7:06:15 UTC - in response to Message 59045.  




©2024 University of Washington
https://www.bakerlab.org