Message boards : Number crunching : Problems with Minirosetta v1.54
Author | Message |
---|---|
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
A couple of these ss-neg-1i17* workunits are failing on Mac OS X 10.4.11: Workunit 208810096, Task 229094592, Name ss-neg-1i17__7365_4132_0, and Workunit 208854507, Task 229142269, Name ss-neg-1i17__7365_4742_0. They're both failing in the same routine. Here's the crash info from the first one:
Thread 0 Crashed:
0 ...etta_1.54_i686-apple-darwin 0x001b13b7 __ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE + 235
1 ...etta_1.54_i686-apple-darwin 0x00027735 __ZN4core12conformation12Conformation15setup_atom_treeEv + 109
2 ...etta_1.54_i686-apple-darwin 0x0002a378 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 2910
3 ...etta_1.54_i686-apple-darwin 0x00400e64 __ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE + 516
4 ...etta_1.54_i686-apple-darwin 0x00107b23 __ZN9protocols5boinc5Boinc18worker_is_finishedERKi + 913
5 ...etta_1.54_i686-apple-darwin 0x00c8d172 __ZN9protocols7jobdist18BaseJobDistributorIN7utility7pointer10owning_ptrINS0_8BasicJobEEEE8next_jobERS6_Ri + 2102
6 ...etta_1.54_i686-apple-darwin 0x001177a5 __ZN9protocols8abinitio18AbrelaxApplication4foldERN4core4pose4PoseEN7utility7pointer10owning_ptrINS_8ProtocolEEE + 1449
7 ...etta_1.54_i686-apple-darwin 0x001289ad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 807
8 ...etta_1.54_i686-apple-darwin 0x000039cc _main + 1356
9 ...etta_1.54_i686-apple-darwin 0x00001dee __start + 216
10 ...etta_1.54_i686-apple-darwin 0x00001d15 start + 41 |
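The symbols in the crash report above are C++ names in mangled form, with the extra leading underscore that Mac OS X adds. As a rough reading aid, the sketch below recovers only the scope-qualified function name from such a symbol. The helper `qualified_name` is hypothetical, written for this illustration; it is not part of Rosetta or BOINC. It assumes the plain nested-name layout (`_ZN`, then length-prefixed identifiers) and stops at the first template or argument encoding, so a full demangler such as `c++filt` is still needed for complete signatures.

```python
import re


def qualified_name(symbol: str) -> str:
    """Return the leading scope-qualified name of a mangled C++ symbol.

    Hypothetical helper: handles only the plain nested-name prefix
    (_ZN<len><identifier>...) and stops at the first template,
    substitution, or argument encoding.
    """
    # Mac OS X prepends an underscore, so __ZN... is really _ZN...
    stripped = symbol.lstrip("_")
    if not stripped.startswith("ZN"):
        return symbol  # not a nested mangled name (e.g. _main, start)

    pos, parts = 2, []
    while True:
        match = re.match(r"\d+", stripped[pos:])
        if match is None:
            break  # reached 'E', a template ('I'), or another encoding
        length = int(match.group())
        start = pos + match.end()
        parts.append(stripped[start:start + length])
        pos = start + length
    return "::".join(parts)


# Frames 1 and 4 from the crash report above.
print(qualified_name("__ZN4core12conformation12Conformation15setup_atom_treeEv"))
# core::conformation::Conformation::setup_atom_tree
print(qualified_name("__ZN9protocols5boinc5Boinc18worker_is_finishedERKi"))
# protocols::boinc::Boinc::worker_is_finished
```

For templated frames such as frame 5, this sketch returns only the part before the template arguments (`protocols::jobdist::BaseJobDistributor`), which is enough to see which class the call went through.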
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I've had three ss-neg-1i17__7365 WUs fail with segmentation violations on three different Linux machines: https://boinc.bakerlab.org/rosetta/result.php?resultid=229167706 https://boinc.bakerlab.org/rosetta/result.php?resultid=229161990 https://boinc.bakerlab.org/rosetta/result.php?resultid=229084435 (I notice that only the third number is different in the stack traces of the above three WUs.) |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
A workunit with some odd behavior, but no definite error: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=209046400 A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished. It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal. Is this something normal that just happened at an unusual time, or something more significant? |
rembertw Joined: 21 Apr 07 Posts: 14 Credit: 628,529 RAC: 0 |
Mod.Sense wrote: "What is it showing for the estimated runtime, before the task starts?" There is a new task running on that same computer: - Estimated runtime: 09:43:55 - Current runtime: 18:03:14 - Progress: 0% I think my settings were previously asking for about 6 hours of runtime and now ask for 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT for Rosetta, but I'm willing to wait a while longer. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Three more errors ... this time two I have not seen before:
229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader
229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz
So, two shiny new errors and one old rusty access violation that quite a few of us have seen ... |
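The access violation line above carries three useful numbers: the Windows exception code (0xC0000005 is the standard access-violation status), the address of the faulting instruction, and the address the program tried to read. The sketch below pulls these out of a message in the format quoted in this thread. The function name `parse_access_violation` and the dictionary keys are made up for this illustration, and the `write` alternative in the pattern is an assumption, since only `read attempt` lines appear here.

```python
import re

# Pattern modelled on the lines quoted in this thread; "write" is an assumption.
ACCESS_VIOLATION = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"(read|write) attempt to address (0x[0-9a-fA-F]+)"
)


def parse_access_violation(text):
    """Extract exception code and addresses, or return None if no match."""
    match = ACCESS_VIOLATION.search(text)
    if match is None:
        return None
    code, instruction, access, target = match.groups()
    target_address = int(target, 16)
    return {
        "code": int(code, 16),
        "instruction": int(instruction, 16),
        "access": access,
        "target": target_address,
        # A target of 0 means the code dereferenced a null pointer.
        "null_dereference": target_address == 0,
    }


line = ("Reason: Access Violation (0xc0000005) at address 0x004E3308 "
        "read attempt to address 0x00000000")
info = parse_access_violation(line)
print(hex(info["code"]), info["access"], info["null_dereference"])
# 0xc0000005 read True
```

A null target address, as in task 229355014, points to a null pointer being dereferenced inside the application; a non-zero target only says that the address was not readable.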
Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0 |
At least 3 of my recent tasks have resulted in Validate errors. https://boinc.bakerlab.org/rosetta/result.php?resultid=227721905 https://boinc.bakerlab.org/rosetta/result.php?resultid=227934901 https://boinc.bakerlab.org/rosetta/result.php?resultid=227919237 Please could someone in authority explain why there have been so many of these recently. I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH. Keith |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
rembertw, the maximum runtime preference possible is 24hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28hrs. So, if you could, let it run at least 29hrs and if it is still running at that point, then abort it. I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? Antivirus software? Windows service pack? Age of machine? BOINC version? Rosetta Moderator: Mod.Sense |
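The advice above comes down to two thresholds: the watchdog is expected to end a v1.54 task once it has run longer than 28 hours, and a task still running at 29 hours should be aborted by hand. A minimal sketch of that rule follows. The names `to_hours` and `advice` are invented for this illustration, and the runtime is assumed to be in the HH:MM:SS form reported earlier in the thread.

```python
WATCHDOG_LIMIT_HOURS = 28  # watchdog should end the task beyond this
MANUAL_ABORT_HOURS = 29    # moderator's suggested point for a manual abort


def to_hours(hhmmss: str) -> float:
    """Convert an 'HH:MM:SS' runtime into fractional hours."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours + minutes / 60 + seconds / 3600


def advice(runtime: str) -> str:
    """Return what to do with a task given its current runtime."""
    hours = to_hours(runtime)
    if hours <= WATCHDOG_LIMIT_HOURS:
        return "keep running"
    if hours < MANUAL_ABORT_HOURS:
        return "wait for watchdog"
    return "abort manually"


print(advice("18:03:14"))  # keep running
print(advice("29:30:00"))  # abort manually
```

The one-hour gap between the two thresholds simply gives the watchdog a chance to act before the task is aborted manually.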
Yaroslav Isakov Joined: 2 Nov 07 Posts: 11 Credit: 98,027 RAC: 0 |
|
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
About 12 hours ago the following WU ended with "Unhandled Exception Detected": ss-neg-1i17__7365_3969_1. This WU had the same error earlier when it ran on another computer. Path7. |
Sid Celery Joined: 11 Feb 08 Posts: 2141 Credit: 41,526,036 RAC: 10,392 |
Another one snuck through: ss-neg-1i17__7365_4076_1 Looks like I'll have to abort all these on sight. I'm not sure any of them have run successfully for me yet. :( |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
|
Yaroslav Isakov Joined: 2 Nov 07 Posts: 11 Credit: 98,027 RAC: 0 |
Hey! Very strange one! It's valid, but with Hbond tripped and a very short time, 2380 secs instead of ~10000: loopbuild_chunk_1_3_B_hb_t357__IGNORE_THE_REST_1VBGA_4_7477_27_0 BTW, I notice that all my wrong results (and this last one) are loopbuild_chunk*. |
xrobert Joined: 28 Oct 05 Posts: 3 Credit: 168,865 RAC: 0 |
So far, all my mini-Rosetta WUs are sticking. I have to abort them. The normal WUs work fine. |
rembertw Joined: 21 Apr 07 Posts: 14 Credit: 628,529 RAC: 0 |
Mod.Sense wrote: "I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?" It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same BOINC version. Some things I noticed: - When a 0% task (only with Rosetta 1.54) gets paused manually after x hours and is then restarted, the time also resets to 0. - When the 1.54 task starts, both processors get work (multiple projects). However, when one of the other projects' tasks stops, the 2nd processor starts idling. It cannot get another task to run from Rosetta or any other project, despite the queue having multiple tasks ready to start or continue. I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It had already run for 24h+ before, but because of a pause its time was reset. At this moment it is at 19h again. I will let it run until it gets past 31h of runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment. [edit]Changed "all" to "both" and corrected a typo[/edit] |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another. Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere? I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine? Rosetta Moderator: Mod.Sense |
rembertw Joined: 21 Apr 07 Posts: 14 Credit: 628,529 RAC: 0 |
Mod.Sense wrote: "rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another." I agree, but this shows up only when it has started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, which makes me think the problem is with Minirosetta. Mod.Sense wrote: "Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?" Standard setup with full permissions, running on a local hard drive. No fancy settings. Mod.Sense wrote: "I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?" Every now and again I do a BOINC upgrade on my machines. I heard some negative comments about the current BOINC version, which is why I considered waiting until summer or so to upgrade. I guess the time has come now. To be certain, I'll do a total BOINC uninstall on that computer, followed by a cleanup, before I download the current version. We'll see if this helps... |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Mod.Sense wrote: "rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another." Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
robertmiles, if you were directing the question to me, I try to stay out of that one. I am only recommending a change of BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF problems reported on the 6.6 series (which is the current test version), and I think the 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (Nothing against 6.2.28, but it's not listed anymore for some reason.) You can see more BOINC versions for download on this page: http://boinc.berkeley.edu/download_all.php Rosetta Moderator: Mod.Sense |
TimL Joined: 16 Sep 06 Posts: 17 Credit: 15,509,973 RAC: 4 |
Hi all, loopbuild_mamaln_ideal_hb_t305__IGNORE_THE_REST_1zc0_1_7630_19 finished early with an error: Access Violation (0xc0000005) at address 0x7C91AA01 read attempt to address 0x0D1BF548. I haven't had much luck getting errors of late, but I will mention that I had just bumped the bus speed up a touch when this error occurred. |
TomaszPawel Joined: 28 Apr 07 Posts: 54 Credit: 2,791,145 RAC: 0 |
Hi: https://boinc.bakerlab.org/rosetta/result.php?resultid=229237620 https://boinc.bakerlab.org/rosetta/result.php?resultid=229237514 https://boinc.bakerlab.org/rosetta/result.php?resultid=229145242 https://boinc.bakerlab.org/rosetta/result.php?resultid=228892067 https://boinc.bakerlab.org/rosetta/result.php?resultid=228820491 https://boinc.bakerlab.org/rosetta/result.php?resultid=228820477 Any tips? |
©2024 University of Washington
https://www.bakerlab.org