Message boards : Number crunching : Problems with Rosetta version 5.80
Author | Message |
---|---|
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
I had to abort this one due to 'waiting for memory'. All the others have worked without a problem. https://boinc.bakerlab.org/rosetta/result.php?resultid=105692089 |
stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
Evan wrote: "I had to abort this one due to 'waiting for memory'. All the others have worked without a problem." Evan, I have had the same 'waiting for memory' problem on one out of 3 CAPRI WUs. I run a single-core CPU with 512 MB of memory. My WU is very similar to yours. https://boinc.bakerlab.org/rosetta/result.php?resultid=105635179 Jack |
David Emigh Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
"I hope that the few CAPRI14 that actually make it through are worth it." I echo that sentiment. I'm changing my preferences to allow BOINC access to 90% of memory all the time (whether or not the computer is "idle"). Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
Good question! Maybe sheer coincidence, but it seems we're hearing about this with the Q6600s more than "average"... I think there must be something else. I am running a QX6700 with 2 GB on Vista Ultimate. Not one WU has shown this memory problem. I am also running several projects at a time (CPDN, malariacontrol, SETI, Einstein, WCG, Rosetta). So there are always several apps in memory (they stay in memory when tasks are switched, and there are multiple instances of Rosetta, of course). That is why there is a heavy load on the memory. I will keep watching my results to see whether this memory problem appears on my machine. Regards, Rayburner |
sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
|
Paul Joined: 29 Oct 05 Posts: 193 Credit: 66,607,429 RAC: 9,920 |
Swap space increased from 400 MB to 2048 MB. The system is attached to World Community Grid at 50% and R@H at 50%. Result: 2 successes and 1 failure. Rayburner: Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP? I also wonder if the Vista memory manager is better and corrects for this memory conflict between WUs. It would be good to have a test point with a Q6600 and Vista running 100% R@H. This sounds like an issue with the XP memory manager, BOINC, and large-memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly. JMarks: What is your config? E6600, 4 GB RAM, swap ??, OS ?? Thx! Paul |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Here's another one of those CAPRI units which failed. DK/Rhiju, could you look into why this task only received 20 credits? I had thought that if 80 models were completed prior to the failure, they should be reported and utilized by the project, and credit issued accordingly as well. Rosetta Moderator: Mod.Sense |
Keith T. Joined: 1 Mar 07 Posts: 58 Credit: 34,135 RAC: 0 |
1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1g4u_-lig_plexinmonomer__2085_4000_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=105842490 Compute error. Exit status -1073741819 (0xc0000005). The PC had been running unattended for > 3 hours when this occurred. The screen saver is Blank Screen. |
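
The exit status in the post above appears both as a signed decimal and as hex. A minimal sketch, assuming the status is a 32-bit value, shows how the two forms relate; the helper name `to_unsigned32` is made up for illustration. On Windows, 0xC0000005 is the NTSTATUS code for an access violation, which matches the access violations reported elsewhere in this thread.

```python
def to_unsigned32(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


exit_status = -1073741819  # decimal value quoted in the post above
code = to_unsigned32(exit_status)

# Print in the same lowercase hex form that BOINC shows next to the decimal.
print(f"0x{code:08x}")
```
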
Gen_X_Accord Joined: 5 Jun 06 Posts: 154 Credit: 279,018 RAC: 0 |
The only strange thing I've noticed about 5.80 is that my granted credits are much lower than normal. |
Konstantin Iliev Joined: 22 May 06 Posts: 4 Credit: 2,205,841 RAC: 0 |
Lots of Access Violations on one of my computers: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=225341 Capri units... |
hbobeck Joined: 4 Sep 07 Posts: 1 Credit: 861 RAC: 0 |
Something is going terribly wrong... 7 validate errors in the last few days! (WUs 95174310, 95809736, 95873445, 96251758, 96251759, 96273830, 96299805). Any particular reason for this? Harry |
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
Paul wrote: "Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure" I've been running 100% Rosetta for the last 10 hours. So far no memory problems, but one client error: https://boinc.bakerlab.org/rosetta/result.php?resultid=106067477 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
From Ricky@SETI.USA, posted in the Cafe section: One of my PCs stops running R@H WUs when the screensaver kicks in, and on another PC I am getting the following messages from BOINC:
9/16/2007 14:23:04|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit
9/16/2007 14:23:07|rosetta@home|Task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 exited with zero status but no 'finished' file
9/16/2007 14:23:07|rosetta@home|If this happens repeatedly you may need to reset the project.
9/16/2007 14:23:07|rosetta@home|Restarting task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 using rosetta_beta version 580 |
Ingemar Joined: 28 Feb 06 Posts: 20 Credit: 1,680 RAC: 0 |
A large fraction of the CAPRI-something jobs are failing. We are removing these jobs from the queue now and will not run any more of them until we have located the problem. Sorry for the inconvenience! |
Ricky@SETI.USA Joined: 13 Dec 05 Posts: 20 Credit: 97,355 RAC: 0 |
I have an AMD desktop that downloaded 7 WUs 24 hours ago and so far has completed only 1 WU. The problem is that it seems to hang and stop running. At first I thought it was a screensaver problem, but after turning off the screensaver it still hangs. Other projects are doing fine. These WUs all have FIXBACKBONE in their file names. I am thinking of aborting them because they are making other projects late: when R@H hangs, nothing gets done. "Life is like an Ice Cream cone, just when you think you got it licked, it drips all over you!" |
The_Bad_Penguin Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
And a third "double failure" here. Unfortunately, it took my PC 8,256 seconds, while the other PC (a T5500 dual-core) took only 87 seconds to "fail"... Again, I have to wonder if quad-cores (i.e., Q6600s) fail "bigger" (taking roughly 95 times longer)... If I had failed at 87 seconds, that would have been 8,169 seconds (about 2.27 hours) that could have been spent obtaining "valid" results with a different WU... Why is the same WU failing at two different run times, and at two different points in the program? 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1mh1_-lig_plexinmonomer__2085_9238 stderr out <core_client_version>5.10.13</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 2926863 ERROR:: Exit from: .pose.cc line: 769 </stderr_txt> ]]> Validate state Invalid |
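
The run times quoted in the post above (8,256 seconds versus 87 seconds) can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are made up for illustration. The difference works out to about 2.27 hours and the ratio to roughly 95.

```python
slow_run_s = 8256  # seconds before the task failed on the poster's PC
fast_run_s = 87    # seconds before the same WU failed on the T5500 host

wasted_s = slow_run_s - fast_run_s   # CPU time lost relative to the fast failure
wasted_h = wasted_s / 3600           # same figure in hours
ratio = slow_run_s / fast_run_s      # how many times longer the slow failure took

print(f"wasted: {wasted_s} s = {wasted_h:.2f} h, ratio: {ratio:.1f}x")
```
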
(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
2007-09-18 18:17:30 [rosetta@home] Sending scheduler request: Requested by user
2007-09-18 18:17:30 [rosetta@home] (not requesting new work or reporting completed tasks)
2007-09-18 18:17:35 [rosetta@home] Scheduler RPC succeeded
2007-09-18 18:17:35 [rosetta@home] Message from server: Project encountered internal error: shared memory
2007-09-18 18:17:35 [rosetta@home] Deferring communication for 1 hr 0 min 0 sec
2007-09-18 18:17:35 [rosetta@home] Reason: project is down
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE-1g4u_-nosillyloop_plexinmonomer__2067_8577_0_0
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Temporarily failed upload of 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0: http error
2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Started upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:44 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:45 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Temporarily failed upload of t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0: http error
2007-09-18 18:17:45 [rosetta@home] Backing off 3 hr 26 min 11 sec on upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
2007-09-18 18:17:47 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:47 [rosetta@home] [file_xfer] Temporarily failed upload of 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0: http error
2007-09-18 18:17:47 [rosetta@home] Backing off 2 hr 29 min 35 sec on upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0 |
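
The "Backing off" entries in the log above give each upload retry delay as hours, minutes and seconds. As a rough sketch, the delays could be pulled out and converted to seconds like this. The regular expression and the function name `backoff_seconds` are assumptions based only on the wording of these log lines, not on any BOINC API, and the file names in the sample lines are shortened to placeholders.

```python
import re

# Assumed message shape, taken from the log above:
# "Backing off <H> hr <M> min <S> sec on upload of file ..."
BACKOFF_RE = re.compile(r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec")


def backoff_seconds(line: str):
    """Return the back-off delay in seconds, or None if the line has none."""
    match = BACKOFF_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Delays copied from the log; "file_A" etc. are placeholder file names.
sample_lines = [
    "2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file file_A",
    "2007-09-18 18:17:45 [rosetta@home] Backing off 3 hr 26 min 11 sec on upload of file file_B",
    "2007-09-18 18:17:47 [rosetta@home] Backing off 2 hr 29 min 35 sec on upload of file file_C",
    "2007-09-18 18:17:35 [rosetta@home] Scheduler RPC succeeded",
]

delays = [backoff_seconds(line) for line in sample_lines]
print(delays)
```

Lines without a back-off message yield `None`, so they can be filtered out before summing or averaging the delays.
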
Rayburner Joined: 4 Oct 05 Posts: 32 Credit: 16,518,823 RAC: 0 |
Paul wrote: "Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure" Result of 24 hours of Rosetta only: 45 successes and 2 client errors, both on pose loops t30 WUs (4.44% error rate). Across all the WUs I crunched recently: 3 validate errors and 3 client errors (pose loops t30 for the client errors), a 4.34% error rate. |
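
The 24-hour figures in the post above (45 successes, 2 client errors) give slightly different percentages depending on the denominator. A small sketch, with a hypothetical helper `error_rate`, shows both: the quoted 4.44% matches errors per success (2/45), while errors as a share of all 47 tasks comes to about 4.26%.

```python
def error_rate(errors: int, successes: int) -> float:
    """Failed tasks as a percentage of all tasks (errors + successes)."""
    return 100.0 * errors / (errors + successes)


successes, client_errors = 45, 2  # counts reported in the post above

share_of_all = error_rate(client_errors, successes)   # 2 out of 47 tasks
per_success = 100.0 * client_errors / successes       # 2 errors per 45 successes

print(f"{share_of_all:.2f}% of all tasks, {per_success:.2f}% relative to successes")
```
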
BarryAZ Joined: 27 Dec 05 Posts: 153 Credit: 30,843,285 RAC: 0 |
OK, based on the reports embedded in this thread along with the current shared memory error, I've suspended processing on Rosetta for now and am busily aborting all of the CAPRI 'bad boy' work units I have out there on workstations (and there are a LOT of them running loose). I'm wondering, though, whether the better approach, once the Rosetta folks have corrected the shared memory issue and are able to *announce* they have purged the database of the CAPRI work units, would be to *Reset* Rosetta on the workstations. For now, I'm limiting the damage to other projects (the CPU waste that CAPRI work units can cause) by suspending Rosetta on the workstations. It sure would be nice to see some newsflash on this, though -- rather than expecting folks to wander down here to get the news. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I had some issues with old CAPRI WUs on 5.78, so I just reset and re-downloaded BOINC; it then got 5.80 and 7 days of work automatically. No comm errors or anything. |
©2024 University of Washington
https://www.bakerlab.org