Problems with Rosetta version 5.80

Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0	Message 46370 - Posted: 16 Sep 2007, 16:02:05 UTC
I had to abort this one due to 'waiting for memory'. All the others have worked without a problem. https://boinc.bakerlab.org/rosetta/result.php?resultid=105692089

stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0	Message 46371 - Posted: 16 Sep 2007, 16:34:59 UTC - in response to Message 46370. Last modified: 16 Sep 2007, 16:41:11 UTC
Quoting Message 46370: "I had to abort this one due to 'waiting for memory'. All the others have worked without a problem. https://boinc.bakerlab.org/rosetta/result.php?resultid=105692089"
Evan, I have had the same 'waiting for memory' problem once, out of 3 CAPRI WUs. I run a single-core CPU with 512 MB of memory. My WU is very similar to yours. https://boinc.bakerlab.org/rosetta/result.php?resultid=105635179
Jack

David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0	Message 46373 - Posted: 16 Sep 2007, 17:18:06 UTC - in response to Message 46345.
Quoting Message 46345: "I hope that the few CAPRI14 that actually make it through are worth it."
I echo that sentiment. I'm changing my preferences to allow BOINC access to 90% of memory all the time (whether or not the computer is "idle").
Rosie, Rosie, she's our gal, If she can't do it, no one shall!

Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0	Message 46374 - Posted: 17 Sep 2007, 17:22:51 UTC - in response to Message 46355.
Quoting Message 46355 (including its nested quotes): "Good question ! May be sheer coincidence, but seems we're hearing about this with the Q6600's more than "average"... I'm running standard Boinc client, Q6600, 2 GB RAM, Swap = 75% of page file, Vista Premium (32). I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition? I have a dual core e6600 4 gig and 70% of mine are bad also. I think there must be something else."
I am running a qx6700 with 2 gig on Vista Ultimate. Not one of my WUs has had this memory problem. I am also running several projects at a time (CPDN, malariacontrol, SETI, Einstein, WCG, Rosetta), so there are always several apps in memory (they stay resident when tasks are switched, and there are multiple instances of Rosetta too, of course). That is why there is a heavy load on the memory. I will keep watching my results to see whether this memory problem appears on my machine.
Regards
Rayburner

sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0	Message 46377 - Posted: 16 Sep 2007, 17:48:37 UTC
I've had one CAPRI14 WU fail: 105503053 on this computer but all others have finished fine.
--Timothy

Paul Joined: 29 Oct 05 Posts: 193 Credit: 68,291,456 RAC: 0	Message 46386 - Posted: 16 Sep 2007, 20:25:25 UTC
Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure
Rayburner: Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP? I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs. It would be good to have a test point with a Q6600 and Vista running 100% R@H. This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly.
JMarks: What is your config? e6600 4GB RAM Swap ?? OS ??
Thx! Paul

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 46392 - Posted: 16 Sep 2007, 22:09:12 UTC - in response to Message 46346.
Quoting Message 46346: "Here's another one of those CAPRI units which failed. https://boinc.bakerlab.org/rosetta/result.php?resultid=105829549 I happen to have looked at graphics when it froze. 82 models were crunched when it failed, model 83 was at step 537. After that it was just waiting for the watchdog to terminate the task."
DK/Rhiju, could you look into why this task only received 20 credits? I had thought that if 80 models were completed prior to the failure, they should be reported and used by the project, with credit issued accordingly.
Rosetta Moderator: Mod.Sense

Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0	Message 46399 - Posted: 17 Sep 2007, 0:53:35 UTC
1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1g4u_-lig_plexinmonomer__2085_4000_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=105842490
Compute error
Exit status -1073741819 (0xc0000005)
PC had been running unattended for > 3 hours when this occurred. Screen Saver is Blank Screen.
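For anyone comparing exit statuses across reports: the negative decimal number and the hex value in the post above are the same 32-bit code, and 0xC0000005 is the Windows NTSTATUS value for an access violation. Below is a minimal Python sketch of that conversion. The function names and the one-entry lookup table are made up for illustration and are not part of BOINC or Rosetta.

```python
def to_unsigned32(exit_status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


# Deliberately tiny lookup table, for illustration only.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe(exit_status: int) -> str:
    """Format an exit status as hex plus a symbolic name, if one is known."""
    code = to_unsigned32(exit_status)
    name = KNOWN_CODES.get(code, "unknown")
    return f"0x{code:08x} ({name})"


print(describe(-1073741819))
```

This only decodes the number. It says nothing about why the task touched invalid memory.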

Gen_X_Accord Joined: 5 Jun 06 Posts: 154 Credit: 279,018 RAC: 0	Message 46414 - Posted: 17 Sep 2007, 8:35:47 UTC
The only strange thing I've noticed about 5.80 is that my granted credits are much lower than normal.

2vpArAUZW8AX5M6mdoxJDTKxKLky Joined: 22 May 06 Posts: 4 Credit: 2,205,841 RAC: 0	Message 46420 - Posted: 17 Sep 2007, 12:03:14 UTC Last modified: 17 Sep 2007, 12:04:24 UTC
Lots of Access Violations on one of my computers: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=225341 Capri units...

hbobeck Joined: 4 Sep 07 Posts: 1 Credit: 861 RAC: 0	Message 46421 - Posted: 17 Sep 2007, 12:07:37 UTC - in response to Message 46414.
Something is going terribly wrong... 7 validate errors in the last few days! (WUs 95174310, 95809736, 95873445, 96251758, 96251759, 96273830, 96299805). Any particular reason for this?
Harry

Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0	Message 46442 - Posted: 17 Sep 2007, 15:28:23 UTC - in response to Message 46386.
Quoting Message 46386: "Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure Rayburner: Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP? I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs. It would be good to have a test point with a Q6600 and Vista running 100% R@H. This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly. JMarks: What is your config? e6600 4GB RAM Swap ?? OS ??"
I've been running 100% Rosetta for the last 10 hours. So far no memory problems but one client error: https://boinc.bakerlab.org/rosetta/result.php?resultid=106067477

Greg_BE Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 46443 - Posted: 17 Sep 2007, 15:29:50 UTC Last modified: 17 Sep 2007, 15:30:22 UTC
From ricky@seti.usa, posted in the Cafe section:
One of my PC's stops running R@H WU's when the screensaver kicks in and I am getting the following message from another PC from BOINC:
9/16/2007 14:23:04|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit
9/16/2007 14:23:07|rosetta@home|Task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 exited with zero status but no 'finished' file
9/16/2007 14:23:07|rosetta@home|If this happens repeatedly you may need to reset the project.
9/16/2007 14:23:07|rosetta@home|Restarting task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 using rosetta_beta version 580

Ingemar Joined: 28 Feb 06 Posts: 20 Credit: 1,680 RAC: 0	Message 46479 - Posted: 17 Sep 2007, 21:41:44 UTC
A large fraction of the CAPRI-something jobs are failing. We are removing these jobs from the queue now and will not run any more of them until we have located the problem. Sorry for the inconvenience!

Ricky@SETI.USA Joined: 13 Dec 05 Posts: 20 Credit: 97,355 RAC: 0	Message 46489 - Posted: 18 Sep 2007, 0:23:23 UTC
I have an AMD desktop that downloaded 7 WUs 24 hours ago and so far has completed only 1 WU. The problem is that it seems to hang and stop running. At first I thought it was a screensaver problem, but after turning off the screensaver it still hangs, while other projects are doing fine. These WUs all have FIXBACKBONE in their file name. I am thinking of aborting them, because when R@H hangs nothing gets done and my other projects end up late.
"Life is like an Ice Cream cone, just when you think you got it licked, it drips all over you!"

Resnick_MEDIC_Lab Joined: 5 Jun 06 Posts: 2751 Credit: 4,276,053 RAC: 0	Message 46492 - Posted: 18 Sep 2007, 1:14:09 UTC Last modified: 18 Sep 2007, 1:27:07 UTC
And a third "double failure" here. Unfortunately, it took my pc 8,256 seconds, while the other pc (a T5500 dual-core) took only 87 seconds to "fail"... Again, I have to wonder if quad-cores (i.e., Q6600's) fail "bigger" (taking 100 times longer)... If I had failed at 87 seconds, that would have been 8,169 seconds (2.25 hours) that could have been spent obtaining "valid" results with a different wu... Why is the same wu failing at two different run times, and at two different points in the program?
1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1mh1_-lig_plexinmonomer__2085_9238
stderr out
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2926863
ERROR:: Exit from: .pose.cc line: 769
</stderr_txt>
]]>
Validate state Invalid
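As a quick check on the run times quoted above, here is the arithmetic as a small Python sketch. It uses only the two run times from the post; the variable names are illustrative.

```python
slow_s = 8256  # run time on the quad-core host, in seconds
fast_s = 87    # run time on the T5500 host, in seconds

ratio = slow_s / fast_s      # how many times longer the slow failure took
wasted_s = slow_s - fast_s   # seconds that an early failure would have saved
wasted_h = wasted_s / 3600   # the same difference, in hours

print(f"ratio: {ratio:.1f}x, difference: {wasted_s} s ({wasted_h:.2f} h)")
```

By this calculation the slow failure took roughly 95 times as long, and the difference is about 2.27 hours, close to the rounded figures in the post.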

(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0	Message 46524 - Posted: 18 Sep 2007, 15:19:05 UTC
2007-09-18 18:17:30 [rosetta@home] Sending scheduler request: Requested by user
2007-09-18 18:17:30 [rosetta@home] (not requesting new work or reporting completed tasks)
2007-09-18 18:17:35 [rosetta@home] Scheduler RPC succeeded
2007-09-18 18:17:35 [rosetta@home] Message from server: Project encountered internal error: shared memory
2007-09-18 18:17:35 [rosetta@home] Deferring communication for 1 hr 0 min 0 sec
2007-09-18 18:17:35 [rosetta@home] Reason: project is down
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE-1g4u_-nosillyloop_plexinmonomer__2067_8577_0_0
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Temporarily failed upload of 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0: http error
2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Started upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:44 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:45 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Temporarily failed upload of t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0: http error
2007-09-18 18:17:45 [rosetta@home] Backing off 3 hr 26 min 11 sec on upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
2007-09-18 18:17:47 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:47 [rosetta@home] [file_xfer] Temporarily failed upload of 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0: http error
2007-09-18 18:17:47 [rosetta@home] Backing off 2 hr 29 min 35 sec on upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
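To see at a glance how long each upload in a log like the one above will wait before retrying, the "Backing off" lines can be pulled out with a short script. This is a rough Python sketch, assuming the message wording is exactly as it appears in this excerpt; other BOINC client versions may phrase it differently, and `BACKOFF_RE` and `parse_backoffs` are hypothetical names, not part of BOINC.

```python
import re

# Matches e.g. "Backing off 1 hr 29 min 34 sec on upload of file <name>".
# The pattern is inferred from the log excerpt above (an assumption).
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec on upload of file (\S+)"
)


def parse_backoffs(log_text):
    """Return a list of (file_name, backoff_seconds) found in the log text."""
    results = []
    for match in BACKOFF_RE.finditer(log_text):
        hours, minutes, seconds, name = match.groups()
        total = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds)
        results.append((name, total))
    return results


sample = """\
2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:45 [rosetta@home] Backing off 3 hr 26 min 11 sec on upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:47 [rosetta@home] Backing off 2 hr 29 min 35 sec on upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
"""

for name, seconds in parse_backoffs(sample):
    print(f"{seconds:6d} s  {name}")
```

The three delays come out as 5374, 12371 and 8975 seconds, so the longest wait here is under three and a half hours.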

Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0	Message 46527 - Posted: 18 Sep 2007, 15:49:54 UTC - in response to Message 46442.
Quoting Message 46386: "Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure Rayburner: Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP? I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs. It would be good to have a test point with a Q6600 and Vista running 100% R@H. This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly. JMarks: What is your config? e6600 4GB RAM Swap ?? OS ??"
Quoting Message 46442: "I've been running 100% rosetta for the last 10 hours. So far no memory problems but one client error: https://boinc.bakerlab.org/rosetta/result.php?resultid=106067477"
Result of 24 hours of Rosetta only: 45 successes / 2 client errors, both pose loops t30 WUs (4.44% error rate).
In total, across all the WUs I crunched recently: 3 validate errors and 3 client errors (pose loops t30 for the client errors) --> 4.34% error rate
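A note on the percentages above: the quoted figure of about 4.44% matches 2 errors divided by 45 successes. If the error rate is instead taken as errors divided by all tasks (successes plus errors), the 24-hour run comes out slightly lower. A minimal Python sketch of both conventions follows; the helper names are illustrative only.

```python
def errors_per_success(errors: int, successes: int) -> float:
    """Errors as a percentage of successful tasks."""
    return 100.0 * errors / successes


def errors_per_task(errors: int, successes: int) -> float:
    """Errors as a percentage of all tasks (successes plus errors)."""
    return 100.0 * errors / (successes + errors)


# Numbers from the 24-hour Rosetta-only run reported above.
print(f"{errors_per_success(2, 45):.2f}% of successes")
print(f"{errors_per_task(2, 45):.2f}% of all tasks")
```

Either way, the failure rate on that host is in the 4% range, far below the 70% reported earlier in the thread for the e6600 machine.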

BarryAZ Joined: 27 Dec 05 Posts: 153 Credit: 30,845,917 RAC: 0	Message 46532 - Posted: 18 Sep 2007, 16:11:29 UTC
OK, based on the reports embedded in this thread along with the current shared memory error, I've suspended processing on Rosetta for now and am busily aborting all of the Capri 'bad boy' work units I have out there on workstations (and there are a LOT of them running loose). I'm wondering though if the better approach, once the Rosetta folks have corrected the shared memory issue and are able to announce they have purged the database of the Capri work units, would be to Reset Rosetta on workstations. For now, I'm limiting the damage to other projects (by the CPU waste that Capri work units can cause), by the action of suspending Rosetta on the workstations. Sure would be nice to see some newsflash on this though -- rather than expect folks to wander down here to get the news.

Greg_BE Joined: 30 May 06 Posts: 5774 Credit: 6,139,760 RAC: 0	Message 46533 - Posted: 18 Sep 2007, 17:22:47 UTC
I had some issues with old CAPRI WUs on 5.78, so I just reset and redownloaded BOINC. It then got 5.80 and 7 days of work automatically. No comm errors or anything.