Problems with Minirosetta v1.54

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,228,767
RAC: 1,961
Message 59637 - Posted: 17 Feb 2009, 18:44:34 UTC
Last modified: 17 Feb 2009, 18:45:25 UTC

Yep, my next ss-neg-1i17 failed too.

As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Greg_BE
Joined: 30 May 06
Posts: 4877
Credit: 4,566,920
RAC: 1,991
Message 59638 - Posted: 17 Feb 2009, 21:39:56 UTC

Two ss-neg tasks died on me as well; I have a third in progress, at 50% complete so far.

Here are the failures:

ss-neg-1i17__7365_1743_0

ss-neg-1i17__7365_542_1

They both do the following:

Initialization is OK, but just as the task is about to start it errors out:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
----------
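
For anyone collecting these reports, here is a minimal sketch (not Rosetta or BOINC code) that pulls the fields out of a "Reason:" line like the one above. The field names and the `parse_reason` helper are made up for illustration, and the "write attempt" alternative is an assumption about how a write fault would be worded. 0xc0000005 is the Windows access violation status code, and a target address of 0x00000000 points to a null pointer read.

```python
import re

# Pattern for the "Reason:" line as it appears in the report above.
# The group names (code, fault_addr, access, target_addr) are illustrative only.
REASON_RE = re.compile(
    r"Reason: (?P<name>.+?) \((?P<code>0x[0-9a-fA-F]+)\) "
    r"at address (?P<fault_addr>0x[0-9a-fA-F]+) "
    r"(?P<access>read|write) attempt to address (?P<target_addr>0x[0-9a-fA-F]+)"
)

def parse_reason(line):
    """Return the fields of an exception 'Reason:' line, or None if it does not match."""
    m = REASON_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # A target address of zero means the code dereferenced a null pointer.
    fields["null_deref"] = int(fields["target_addr"], 16) == 0
    return fields

line = ("Reason: Access Violation (0xc0000005) at address 0x004E3308 "
        "read attempt to address 0x00000000")
info = parse_reason(line)
print(info["name"], info["fault_addr"], info["null_deref"])
```

Because the same fault address (0x004E3308) would show up for every host hitting the same bug, comparing `fault_addr` across reports is a quick way to tell whether two crashes are likely the same one.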


Sid Celery
Joined: 11 Feb 08
Posts: 1606
Credit: 28,938,571
RAC: 19,385
Message 59640 - Posted: 17 Feb 2009, 23:35:02 UTC
Last modified: 17 Feb 2009, 23:35:45 UTC


svincent
Joined: 30 Dec 05
Posts: 219
Credit: 10,088,171
RAC: 6,728
Message 59641 - Posted: 18 Feb 2009, 0:53:03 UTC

A couple of these ss-neg-1i17* workunits are failing on Mac OS X 10.4.11:

Workunit 208810096, Task 229094592, Name ss-neg-1i17__7365_4132_0

and

Workunit 208854507, Task 229142269, Name ss-neg-1i17__7365_4742_0

They're both failing in the same routine. Here's the crash info from the first one:

Thread 0 Crashed:
0 ...etta_1.54_i686-apple-darwin 0x001b13b7 __ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE + 235
1 ...etta_1.54_i686-apple-darwin 0x00027735 __ZN4core12conformation12Conformation15setup_atom_treeEv + 109
2 ...etta_1.54_i686-apple-darwin 0x0002a378 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 2910
3 ...etta_1.54_i686-apple-darwin 0x00400e64 __ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE + 516
4 ...etta_1.54_i686-apple-darwin 0x00107b23 __ZN9protocols5boinc5Boinc18worker_is_finishedERKi + 913
5 ...etta_1.54_i686-apple-darwin 0x00c8d172 __ZN9protocols7jobdist18BaseJobDistributorIN7utility7pointer10owning_ptrINS0_8BasicJobEEEE8next_jobERS6_Ri + 2102
6 ...etta_1.54_i686-apple-darwin 0x001177a5 __ZN9protocols8abinitio18AbrelaxApplication4foldERN4core4pose4PoseEN7utility7pointer10owning_ptrINS_8ProtocolEEE + 1449
7 ...etta_1.54_i686-apple-darwin 0x001289ad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 807
8 ...etta_1.54_i686-apple-darwin 0x000039cc _main + 1356
9 ...etta_1.54_i686-apple-darwin 0x00001dee __start + 216
10 ...etta_1.54_i686-apple-darwin 0x00001d15 start + 41
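
The symbol names in the trace above are mangled C++ names. A tool such as c++filt gives the full signature; as a rough sketch, the leading namespace, class, and function names can also be read directly, because each one is stored as a length followed by that many characters. The `nested_name` helper below is hypothetical and deliberately partial: it stops at the first template argument or parameter type, so for templated classes (frame 5, for example) it returns only the class name.

```python
import re

def nested_name(symbol):
    """Decode only the leading namespace/class/function names of a mangled symbol.

    Handles the length-prefixed names after "_ZN" and stops at the first
    character that is not a digit (template arguments, parameter types, ...).
    """
    # Mac OS X prepends one extra underscore to every symbol.
    if symbol.startswith("__Z"):
        symbol = symbol[1:]
    if not symbol.startswith("_ZN"):
        return symbol  # not a nested C++ name; return it unchanged
    pos, parts = 3, []
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        start = pos + len(digits)
        length = int(digits)
        parts.append(symbol[start:start + length])
        pos = start + length
    return "::".join(parts)

# Symbols copied from frames 0 to 3 of the trace above.
frames = [
    "__ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE",
    "__ZN4core12conformation12Conformation15setup_atom_treeEv",
    "__ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE",
    "__ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE",
]
for symbol in frames:
    print(nested_name(symbol))
```

Read this way, the crash sits in core::kinematics::build_tree, reached from core::io::serialization::read_binary via Conformation::fold_tree and Conformation::setup_atom_tree.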



AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 59645 - Posted: 18 Feb 2009, 4:37:41 UTC

I've had three ss-neg-1i17__7365 WUs fail with segmentation violations on three different linux machines:

https://boinc.bakerlab.org/rosetta/result.php?resultid=229167706
https://boinc.bakerlab.org/rosetta/result.php?resultid=229161990
https://boinc.bakerlab.org/rosetta/result.php?resultid=229084435

(I notice that only the third number is different in the stack traces of the above three WUs.)

robertmiles
Joined: 16 Jun 08
Posts: 1061
Credit: 11,699,656
RAC: 7,772
Message 59647 - Posted: 18 Feb 2009, 9:16:58 UTC

A workunit with some odd behavior, but no definite error:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=209046400

A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.

It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.

Is this something normal that just happened at an unusual time, or something more significant?

rembertw
Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59649 - Posted: 18 Feb 2009, 10:57:15 UTC - in response to Message 59520.  

Mod.Sense

What is it showing for the estimated runtime, before the task starts?


There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%

I think my earlier settings asked for about 6 hours of runtime, and the current ones ask for 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT (No New Tasks) for Rosetta, but I'm willing to wait a while longer.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59650 - Posted: 18 Feb 2009, 13:14:18 UTC

Three more errors ... this time two I have not seen before:

229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader

229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz

So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...

Keith T.
Joined: 1 Mar 07
Posts: 58
Credit: 34,135
RAC: 0
Message 59651 - Posted: 18 Feb 2009, 13:30:29 UTC

At least 3 of my recent tasks have resulted in Validate errors.

https://boinc.bakerlab.org/rosetta/result.php?resultid=227721905
https://boinc.bakerlab.org/rosetta/result.php?resultid=227934901
https://boinc.bakerlab.org/rosetta/result.php?resultid=227919237

Could someone in authority please explain why there have been so many of these recently?

I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.

Keith

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59655 - Posted: 18 Feb 2009, 14:47:25 UTC

rembertw, the maximum runtime preference possible is 24hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28hrs. So, if you could, let it run at least 29hrs and if it is still running at that point, then abort it.
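
As a small sketch of that check, assuming the run time is read off the BOINC manager as an "HH:MM:SS" string: the 28-hour figure is taken from the paragraph above, while the helper names `hours` and `watchdog_overdue` are made up for illustration and are not part of BOINC or Minirosetta.

```python
WATCHDOG_LIMIT_HOURS = 28  # v1.54 watchdog cutoff, as stated above

def hours(hms):
    """Convert an 'HH:MM:SS' run time string to a number of hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600

def watchdog_overdue(runtime_hms, limit_hours=WATCHDOG_LIMIT_HOURS):
    """True when a task has run past the point where the watchdog should have ended it."""
    return hours(runtime_hms) > limit_hours

print(watchdog_overdue("18:03:14"))  # the run time reported earlier in the thread
print(watchdog_overdue("29:00:00"))  # the suggested point at which to abort
```

A task that is still running when `watchdog_overdue` returns True is one the watchdog failed to stop, which is the case worth aborting and reporting.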

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? Antivirus software? Windows service pack? Age of machine? BOINC version?
Rosetta Moderator: Mod.Sense

Yaroslav Isakov
Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59657 - Posted: 18 Feb 2009, 15:01:59 UTC


Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 59658 - Posted: 18 Feb 2009, 18:57:12 UTC

About 12 hours ago another WU ended with "Unhandled Exception Detected":

ss-neg-1i17__7365_3969_1

This WU had already failed with the same error on another computer.
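
The trailing number in a task name is what distinguishes copies of the same workunit. A minimal sketch, assuming the usual BOINC convention that a task name is the workunit name plus "_N" and that only one copy is sent out initially, so anything above 0 is a resend; `split_task_name` is a hypothetical helper, not a BOINC function.

```python
def split_task_name(task_name):
    """Split 'workunit_N' into (workunit_name, N)."""
    workunit, _, copy_index = task_name.rpartition("_")
    return workunit, int(copy_index)

# Task names reported in this thread.
failed = [
    "ss-neg-1i17__7365_1743_0",
    "ss-neg-1i17__7365_542_1",
    "ss-neg-1i17__7365_3969_1",
]
for name in failed:
    workunit, copy_index = split_task_name(name)
    # Assumption: one initial copy per workunit, so a higher index means a resend.
    note = "resent copy" if copy_index > 0 else "first copy"
    print(workunit, copy_index, note)
```

Under that assumption, a "_1" task that fails the same way as its "_0" sibling suggests the fault is in the workunit or the application rather than in one host.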

Path7.

Sid Celery
Joined: 11 Feb 08
Posts: 1606
Credit: 28,938,571
RAC: 19,385
Message 59667 - Posted: 19 Feb 2009, 5:04:25 UTC

Another one snuck through:

ss-neg-1i17__7365_4076_1

Looks like I'll have to abort all these on sight. I'm not sure any of them have run successfully for me yet. :(

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59668 - Posted: 19 Feb 2009, 7:07:58 UTC

New error -161 on both Mini 1.54 and 5.98 ...

Mini-1.54
229605017
229597762
229594079
229593677

5.98
229601150


Yaroslav Isakov
Joined: 2 Nov 07
Posts: 11
Credit: 98,027
RAC: 0
Message 59672 - Posted: 19 Feb 2009, 16:29:16 UTC
Last modified: 19 Feb 2009, 16:32:01 UTC

Hey! A very strange one! It's valid, but with Hbond tripped and a very short run time, 2380 secs instead of ~10000:
loopbuild_chunk_1_3_B_hb_t357__IGNORE_THE_REST_1VBGA_4_7477_27_0

BTW, I notice that all my wrong results (and this last one) are loopbuild_chunk*.

xrobert
Joined: 28 Oct 05
Posts: 3
Credit: 166,578
RAC: 0
Message 59674 - Posted: 19 Feb 2009, 18:02:55 UTC

So far, all my Minirosetta WUs are getting stuck. I have to abort them.
The normal WUs work fine.



rembertw
Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59677 - Posted: 20 Feb 2009, 7:03:21 UTC - in response to Message 59655.  
Last modified: 20 Feb 2009, 7:12:40 UTC

mod.sense

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?


It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same BOINC version.

Some things I noticed:
- When a 0% task (only with Rosetta 1.54) is paused manually after x hours and then restarted, its run time also resets to 0.
- When the 1.54 task starts, both processors get work (multiple projects). However, when one of the other projects' tasks stops, the second processor starts idling. It cannot get another task to run from Rosetta or any other project, despite the queue having multiple tasks ready to start or continue.

I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It had already run for 24h+ before, but a pause reset its time. At the moment it is at 19h again. I will let it run until it gets past 31h of runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.

[edit]Changed "all" to "both" and corrected a typo[/edit]

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59683 - Posted: 20 Feb 2009, 14:32:26 UTC

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
Rosetta Moderator: Mod.Sense

rembertw
Joined: 21 Apr 07
Posts: 14
Credit: 628,529
RAC: 0
Message 59684 - Posted: 20 Feb 2009, 15:23:06 UTC - in response to Message 59683.  

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows up only once it has started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, which makes me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a BOINC upgrade on my machines. I heard some negative comments about the current BOINC version, which is why I considered waiting until summer or so to upgrade. I guess the time has come now. To be certain, I'll do a total BOINC uninstall on that computer, followed by a cleanup, before I download the current version. We'll see if this helps...

robertmiles
Joined: 16 Jun 08
Posts: 1061
Credit: 11,699,656
RAC: 7,772
Message 59686 - Posted: 20 Feb 2009, 16:41:25 UTC - in response to Message 59684.  

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...


Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.



©2021 University of Washington
https://www.bakerlab.org