Problems with Rosetta version 5.80

Message boards : Number crunching : Problems with Rosetta version 5.80


Jmarks
Joined: 16 Jul 07
Posts: 132
Credit: 98,025
RAC: 0
Message 46362 - Posted: 16 Sep 2007, 14:17:13 UTC - in response to Message 46355.  

Good question ! May be sheer coincidence, but seems we're hearing about this with the Q6600's more than "average"...

I'm running standard Boinc client, Q6600, 2 GB RAM, Swap = 75% of page file, Vista Premium (32).

I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition?



I have a dual core e6600 4 gig and 70% of mine are bad also.


I bet it has more to do with the fact that we have more memory available than other PCs, so we get more of those WUs.
Jmarks
ID: 46362
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 46370 - Posted: 16 Sep 2007, 16:02:05 UTC

I had to abort this one due to 'waiting for memory'. All the others have worked without a problem.

https://boinc.bakerlab.org/rosetta/result.php?resultid=105692089
ID: 46370
stewjack

Joined: 23 Apr 06
Posts: 39
Credit: 95,871
RAC: 0
Message 46371 - Posted: 16 Sep 2007, 16:34:59 UTC - in response to Message 46370.  
Last modified: 16 Sep 2007, 16:41:11 UTC

I had to abort this one due to 'waiting for memory'. All the others have worked without a problem.

https://boinc.bakerlab.org/rosetta/result.php?resultid=105692089


Evan,
I have had the same 'waiting for memory' problem on one out of 3 CAPRI WUs.
I run a single-core CPU with 512 MB of memory.

My WU is very similar to yours.
https://boinc.bakerlab.org/rosetta/result.php?resultid=105635179


Jack


ID: 46371
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46373 - Posted: 16 Sep 2007, 17:18:06 UTC - in response to Message 46345.  

I hope that the few CAPRI14 that actually make it through are worth it.


I echo that sentiment.

I'm changing my preferences to allow BOINC access to 90% of memory all the time (whether or not the computer is "idle")

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 46373
Rayburner

Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 265
Message 46374 - Posted: 16 Sep 2007, 17:22:51 UTC - in response to Message 46355.  

Good question ! May be sheer coincidence, but seems we're hearing about this with the Q6600's more than "average"...

I'm running standard Boinc client, Q6600, 2 GB RAM, Swap = 75% of page file, Vista Premium (32).

I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition?



I have a dual core e6600 4 gig and 70% of mine are bad also.


I think there must be something else.

I am running a qx6700 with 2 gig on Vista Ultimate. Not one of these WUs has had this memory problem here. I am also running several projects at a time (CPDN, malariacontrol, SETI, Einstein, WCG, Rosetta), so there are always several apps in memory (they stay in memory when tasks are switched, and that includes multiple instances of Rosetta, of course). That is why there is a heavy load on the memory.

I will keep watching my results to see whether this memory problem appears on my machine.

Regards
Rayburner

ID: 46374
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 46377 - Posted: 16 Sep 2007, 17:48:37 UTC

I've had one CAPRI14 WU fail: 105503053 on this computer but all others have finished fine.

--Timothy
ID: 46377
Paul

Joined: 29 Oct 05
Posts: 193
Credit: 64,397,037
RAC: 494
Message 46386 - Posted: 16 Sep 2007, 20:25:25 UTC

Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure

Rayburner:

Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP?

I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs.

It would be good to have a test point with a Q6600 and Vista running 100% R@H.

This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly.

JMarks:

What is your config?
e6600
4GB RAM
Swap ??
OS ??




Thx!

Paul

ID: 46386
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46392 - Posted: 16 Sep 2007, 22:09:12 UTC - in response to Message 46346.  

Here's another one of those CAPRI units which failed.

https://boinc.bakerlab.org/rosetta/result.php?resultid=105829549

I happen to have looked at graphics when it froze. 82 models were crunched when it failed, model 83 was at step 537. After that it was just waiting for the watchdog to terminate the task.


DK/Rhiju, could you look into why this task only received 20 credits? I had thought that if 80 models were completed prior to the failure, these should be reported and utilized by the project, with credit issued accordingly.
Rosetta Moderator: Mod.Sense
ID: 46392
Keith T.
Joined: 1 Mar 07
Posts: 58
Credit: 34,135
RAC: 0
Message 46399 - Posted: 17 Sep 2007, 0:53:35 UTC

1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1g4u_-lig_plexinmonomer__2085_4000_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=105842490

Compute error Exit status -1073741819 (0xc0000005)

The PC had been running unattended for > 3 hours when this occurred. Screen Saver is Blank Screen.
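
For anyone comparing error reports: the two numbers in the exit status above are the same 32-bit value, written once as a signed decimal and once in hexadecimal. 0xC0000005 is the Windows access violation status code. A minimal sketch of the conversion in Python, where the helper name `to_unsigned32` is only illustrative and not part of BOINC:

```python
def to_unsigned32(exit_status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return exit_status & 0xFFFFFFFF

# Value copied from the result page quoted above.
code = to_unsigned32(-1073741819)
print(hex(code))  # 0xc0000005
```

This makes it easier to match the decimal value shown on a result page against the hexadecimal codes other posters quote.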
ID: 46399
Gen_X_Accord
Joined: 5 Jun 06
Posts: 154
Credit: 279,018
RAC: 0
Message 46414 - Posted: 17 Sep 2007, 8:35:47 UTC

The only thing I've noticed strange about 5.80 is that my granted credits are much lower than normal.
ID: 46414
Konstantin Iliev

Joined: 22 May 06
Posts: 4
Credit: 2,205,841
RAC: 0
Message 46420 - Posted: 17 Sep 2007, 12:03:14 UTC
Last modified: 17 Sep 2007, 12:04:24 UTC

Lots of Access Violations on one of my computers: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=225341

Capri units...
ID: 46420
hbobeck

Joined: 4 Sep 07
Posts: 1
Credit: 861
RAC: 0
Message 46421 - Posted: 17 Sep 2007, 12:07:37 UTC - in response to Message 46414.  

Something is going terribly wrong... 7 validate errors in the last few days! (WUs 95174310, 95809736, 95873445, 96251758, 96251759, 96273830, 96299805).

Any particular reason for this???

Harry
ID: 46421
Rayburner

Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 265
Message 46442 - Posted: 17 Sep 2007, 15:28:23 UTC - in response to Message 46386.  

Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure

Rayburner:

Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP?

I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs.

It would be good to have a test point with a Q6600 and Vista running 100% R@H.

This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly.

JMarks:

What is your config?
e6600
4GB RAM
Swap ??
OS ??





I've been running 100% rosetta for the last 10 hours. So far no memory problems but one client error:

https://boinc.bakerlab.org/rosetta/result.php?resultid=106067477


ID: 46442
Greg_BE
Joined: 30 May 06
Posts: 5502
Credit: 5,466,357
RAC: 1,905
Message 46443 - Posted: 17 Sep 2007, 15:29:50 UTC
Last modified: 17 Sep 2007, 15:30:22 UTC

from ricky@seti.usa posted in the Cafe section

One of my PC's stops running R@H WU's when the screensaver kicks in and I am getting the following message from another PC from BOINC:

9/16/2007 14:23:04|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit
9/16/2007 14:23:07|rosetta@home|Task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 exited with zero status but no 'finished' file
9/16/2007 14:23:07|rosetta@home|If this happens repeatedly you may need to reset the project.
9/16/2007 14:23:07|rosetta@home|Restarting task 1mh1__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1mh1_-lig_rxplxn_0585plexinmonomer__2084_138_0 using rosetta_beta version 580
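
When collecting reports like this, it can help to split the client messages into fields. The lines above appear to follow a `date time|project|message` layout, so a minimal Python sketch (the function name `parse_line` is hypothetical, and the month/day/year order is inferred from the sample) could look like this:

```python
from datetime import datetime

def parse_line(line: str):
    """Split 'date time|project|message' into its three fields."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), project, message

# Two lines copied from the log above.
sample = [
    "9/16/2007 14:23:04|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit",
    "9/16/2007 14:23:07|rosetta@home|If this happens repeatedly you may need to reset the project.",
]
for when, project, message in map(parse_line, sample):
    flag = "ERROR" if message.startswith("[error]") else "info"
    print(when.isoformat(), project, flag)
```

Splitting with a limit of 2 keeps any `|` characters inside the message text intact.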



ID: 46443
Ingemar

Joined: 28 Feb 06
Posts: 20
Credit: 1,680
RAC: 0
Message 46479 - Posted: 17 Sep 2007, 21:41:44 UTC

A large fraction of the CAPRI-something jobs are failing. We are removing these jobs from the queue now and will not run any more of them until we have located the problem. Sorry for the inconvenience!

ID: 46479
Ricky@SETI.USA
Joined: 13 Dec 05
Posts: 20
Credit: 97,355
RAC: 2
Message 46489 - Posted: 18 Sep 2007, 0:23:23 UTC

I have an AMD desktop that downloaded 7 WUs 24 hours ago and so far has completed only 1 WU. The problem is that it seems to hang and stop running. At first I thought it was a screensaver problem, but after turning off the screensaver it still hangs; other projects are doing fine.

These WUs all have FIXBACKBONE in their file name. I am thinking of aborting them because they are making other projects late: when R@H hangs, nothing gets done.

"Life is like an Ice Cream cone, just when you think you got it licked, it drips all over you!"

ID: 46489
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,109,756
RAC: 182
Message 46492 - Posted: 18 Sep 2007, 1:14:09 UTC
Last modified: 18 Sep 2007, 1:27:07 UTC

And a third "double failure" here

Unfortunately, it took my pc 8,256 seconds, while the other pc (a T5500 dual-core) took only 87 seconds to "fail"...

Again, I have to wonder if quad-cores (i.e., Q6600's) fail "bigger" (taking 100 times longer)...

If I had failed at 87 seconds, that would have been 8,169 seconds (about 2.27 hours) that could have been spent obtaining "valid" results with a different WU...
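
The figures in this comparison can be checked with a few lines of Python. The variable names are illustrative; the two run times are the ones quoted above. The difference works out to about 2.27 hours, and the ratio to roughly 95:

```python
slow_seconds = 8256   # run time before failure on the poster's PC
fast_seconds = 87     # run time before failure on the T5500

wasted = slow_seconds - fast_seconds
ratio = slow_seconds / fast_seconds

print(wasted)                    # 8169
print(round(wasted / 3600, 2))   # 2.27
print(round(ratio))              # 95
```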

Why is the same wu failing at two different run times, and at two different points in the program?

1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1mh1_-lig_plexinmonomer__2085_9238

stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2926863
ERROR:: Exit from: .pose.cc line: 769

</stderr_txt>
]]>

Validate state Invalid
ID: 46492
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 46524 - Posted: 18 Sep 2007, 15:19:05 UTC

2007-09-18 18:17:30 [rosetta@home] Sending scheduler request: Requested by user
2007-09-18 18:17:30 [rosetta@home] (not requesting new work or reporting completed tasks)
2007-09-18 18:17:35 [rosetta@home] Scheduler RPC succeeded
2007-09-18 18:17:35 [rosetta@home] Message from server: Project encountered internal error: shared memory
2007-09-18 18:17:35 [rosetta@home] Deferring communication for 1 hr 0 min 0 sec
2007-09-18 18:17:35 [rosetta@home] Reason: project is down
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE-1g4u_-nosillyloop_plexinmonomer__2067_8577_0_0
2007-09-18 18:17:40 [rosetta@home] [file_xfer] Started upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Temporarily failed upload of 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0: http error
2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file 1mh1__BOINC_CAPRI14_DOCK_FIXBACKBONE-1mh1_-plexindimer__2067_8698_0_0
2007-09-18 18:17:43 [rosetta@home] [file_xfer] Started upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:44 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:45 [---] Project communication failed: attempting access to reference site
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Temporarily failed upload of t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0: http error
2007-09-18 18:17:45 [rosetta@home] Backing off 3 hr 26 min 11 sec on upload of file t030__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-t030_-lig_rxplxn_1152plexinmonomer__2084_586_0_0
2007-09-18 18:17:45 [rosetta@home] [file_xfer] Started upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
2007-09-18 18:17:47 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-09-18 18:17:47 [rosetta@home] [file_xfer] Temporarily failed upload of 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0: http error
2007-09-18 18:17:47 [rosetta@home] Backing off 2 hr 29 min 35 sec on upload of file 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-lig_rxplxn_1036plexinmonomer__2084_2482_0_0
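
To see how long the client intends to wait before retrying each upload, the "Backing off" messages can be converted to seconds. This is a rough Python sketch: the regular expression is based only on the wording seen in the log above, the assumption that the hour and minute parts may be absent is mine, and the file name in the sample line is a placeholder.

```python
import re

# Matches e.g. "Backing off 1 hr 29 min 34 sec"; hr and min parts are optional (assumed).
BACKOFF = re.compile(r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec")

def backoff_seconds(line: str):
    """Return the back-off delay in seconds, or None if the line has none."""
    match = BACKOFF.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

line = "2007-09-18 18:17:43 [rosetta@home] Backing off 1 hr 29 min 34 sec on upload of file example_0_0"
print(backoff_seconds(line))  # 5374
```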
ID: 46524
Rayburner

Joined: 4 Oct 05
Posts: 32
Credit: 16,518,823
RAC: 265
Message 46527 - Posted: 18 Sep 2007, 15:49:54 UTC - in response to Message 46442.  

Swap space increased from 400MB to 2048MB. The system is attached to the World Community Grid 50% and R@H 50%. 2 success & 1 failure

Rayburner:

Can you try going to 100% on R@H and see if you start getting failures similar to what we see on XP?

I also wonder if the Vista memory manager is better & corrects for this memory conflict between WUs.

It would be good to have a test point with a Q6600 and Vista running 100% R@H.

This sounds like an issue with the XP memory manager, BOINC & large memory work units. If we can isolate the CPU types and OS, we might help find this issue quickly.

JMarks:

What is your config?
e6600
4GB RAM
Swap ??
OS ??





I've been running 100% rosetta for the last 10 hours. So far no memory problems but one client error:

https://boinc.bakerlab.org/rosetta/result.php?resultid=106067477



Result of 24 Hours of rosetta only:

45 successes / 2 client errors, both pose loops t30 WUs (4.44% error rate)

In total, across all the WUs I crunched recently: 3 validate errors and 3 client errors (the client errors were all pose loops t30 WUs) --> 4.34% error rate
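
A note on how the first percentage can be reproduced: dividing the 2 errors by the 45 successes gives 4.44%, while counting the errors against all 47 tasks gives a slightly lower rate. A small Python sketch of both calculations, where the function name `error_rate` is only illustrative:

```python
def error_rate(errors: int, successes: int) -> float:
    """Errors as a percentage of all completed tasks."""
    return 100 * errors / (errors + successes)

print(round(error_rate(2, 45), 2))  # 4.26  (errors / all tasks)
print(round(100 * 2 / 45, 2))       # 4.44  (errors / successes)
```

Either convention works for comparing machines, as long as everyone in the thread uses the same one.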

ID: 46527
BarryAZ

Joined: 27 Dec 05
Posts: 153
Credit: 30,837,810
RAC: 1
Message 46532 - Posted: 18 Sep 2007, 16:11:29 UTC

OK, based on the reports embedded in this thread along with the current shared memory error, I've suspended processing on Rosetta for now and am busily aborting all of the Capri 'bad boy' work units I have out there on workstations (and there are a LOT of them running loose).

I'm wondering though if the better approach, once the Rosetta folks have corrected the shared memory issue and are able to *announce* they have purged the database of the Capri work units, would be to *Reset* Rosetta on workstations. For now, I'm limiting the damage to other projects (by the CPU waste that Capri work units can cause), by the action of suspending Rosetta on the workstations.

Sure would be nice to see some newsflash on this though -- rather than expect folks to wander down here to get the news.


ID: 46532



©2022 University of Washington
https://www.bakerlab.org