Problems with Rosetta version 5.80

Message boards : Number crunching : Problems with Rosetta version 5.80

Beezlebub
Joined: 18 Oct 05
Posts: 40
Credit: 260,375
RAC: 0
Message 46266 - Posted: 15 Sep 2007, 13:46:43 UTC

This Capri14 WU ended in a "client error" but has a debug readout, which might be useful: https://boinc.bakerlab.org/rosetta/result.php?resultid=105518403
e6600 quad @ 2.5ghz
2418 floating point
5227 integer

e6750 dual @ 3.71ghz
3598 floating point
7918 integer


DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,196,825
RAC: 0
Message 46267 - Posted: 15 Sep 2007, 13:47:00 UTC - in response to Message 46264.  

Just noticed each WU is consuming about 248MB of RAM. With 2 GB of RAM, this was not a problem until the Q6600 went into the system. 4 WUs are consuming 1/2 of the system memory.

What changed in 5.8 to cause the massive memory consumption and all of the computation errors? Can you do anything to pull in the memory requirements? Did the previous versions hold memory requirements at about 128MB per WU?



Go into Your Account and Edit
General preferences
Disk and memory usage
Use at most - 50% of memory when computer is in use
*** Lower this to what you want.

P.S. This post is not about 5.80; you should start a separate thread in 'Number Crunching'.


I posted for him because I noticed the same thing. My post here went unanswered.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3564

Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 46280 - Posted: 15 Sep 2007, 16:24:00 UTC - in response to Message 46153.  

Please report problems with this version. Thanks!


While crunching Rosetta Beta 5.80, WU 95855024, on my (1 core) AMD Sempron 3000+ processor, BOINC reported a “Waiting for memory” error.
My computer (Windows XP Home SP2) has 448 MB of memory, which exceeds the recommended system requirements.

To get rid of this problem, I gave 5.80 more memory by raising the “Use at most 50% of memory when computer is in use” setting to 60%.
This has solved the problem (so far).
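
The effect of that preference change can be checked with some rough arithmetic. The Python sketch below is only an illustration: it assumes BOINC's limit is simply total RAM times the percentage, and it borrows the roughly 248 MB per-WU figure reported earlier in this thread for a different machine. The function names are made up for this example, and the real client's memory accounting is more involved.

```python
def memory_budget_mb(total_ram_mb: float, max_used_pct: float) -> float:
    """RAM in MB allowed under a 'use at most N% of memory' preference (simplified model)."""
    return total_ram_mb * max_used_pct / 100.0


def tasks_that_fit(total_ram_mb: float, max_used_pct: float, per_task_mb: float) -> int:
    """Number of tasks of the given size that fit inside the memory budget."""
    return int(memory_budget_mb(total_ram_mb, max_used_pct) // per_task_mb)


# Assumption: ~248 MB per WU, as reported for another machine in this thread.
PER_TASK_MB = 248

if __name__ == "__main__":
    for pct in (50, 60):
        budget = memory_budget_mb(448, pct)
        fits = tasks_that_fit(448, pct, PER_TASK_MB)
        print(f"{pct}% of 448 MB = {budget:.1f} MB, room for {fits} task(s)")
```

Under those assumptions, 50% of 448 MB leaves 224 MB, which is less than one task needs, while 60% leaves 268.8 MB, which is enough for one. The same arithmetic gives room for 4 tasks within 50% of 2 GB (1024 MB), which fits the Q6600 report above.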

Off topic: The screen saver looks like a beautiful piece of art!

Path7.

Gorkan
Joined: 13 Sep 07
Posts: 10
Credit: 151,300
RAC: 0
Message 46281 - Posted: 15 Sep 2007, 16:35:34 UTC

I dunno, looks like it was chewing on something it didn't want to swallow.
On the plus side, it didn't leave a mess on the floor.

<core_client_version>5.10.20</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2944148


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0092541B read attempt to address 0x16481000

Engaging BOINC Windows Runtime Debugger...



********************
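
The two forms of the exit code in this log are the same number. A minimal Python sketch, with a helper name invented for this example, shows how the signed decimal value maps to the hexadecimal one. 0xC0000005 is the Windows status code for an access violation, which matches the exception record in the log.

```python
def ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as the unsigned hex value Windows reports."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


if __name__ == "__main__":
    # Exit code taken from the stderr output above.
    print(ntstatus_hex(-1073741819))
```

Masking with 0xFFFFFFFF keeps the low 32 bits, so the negative value is read as its two's-complement bit pattern.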

Paul
Joined: 29 Oct 05
Posts: 193
Credit: 64,858,035
RAC: 38
Message 46283 - Posted: 15 Sep 2007, 16:38:59 UTC

Thanks for the help with the preferences. I made some changes.

After more investigation of the computation errors, it is clear that of my 5 systems, only the one with a quad-core processor (Q6600) is getting the computation errors. Of course, this is also the busiest system.

It looks like about 3 WUs fail for every one that succeeds.

Any suggestions are welcome.

I can provide the debug info if it will help.

https://boinc.bakerlab.org/rosetta/result.php?resultid=105762597
https://boinc.bakerlab.org/rosetta/result.php?resultid=105843964
https://boinc.bakerlab.org/rosetta/result.php?resultid=105811123
https://boinc.bakerlab.org/rosetta/result.php?resultid=105739580

Thx!

Paul

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46286 - Posted: 15 Sep 2007, 17:27:29 UTC
Last modified: 15 Sep 2007, 17:28:34 UTC

Here is a double failure...

t030__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-t030_-plexinmonomer__2083_2234

stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3553867
si
</stderr_txt>
]]>


Validate state Invalid

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46293 - Posted: 15 Sep 2007, 18:02:12 UTC - in response to Message 46262.  
Last modified: 15 Sep 2007, 18:24:00 UTC

I got 0 credits for this wu: too many results:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94605647


There is a quirk with the BOINC server software which reissues a task too soon. Seems to only happen when one machine gets a compute error. This is discussed in an existing thread, and there is an item on the BOINC boards to get this corrected.
Rosetta Moderator: Mod.Sense

[RKN] schatten1411 , Mitglied des Teams und des VEREINS Rechenkraft.net
Joined: 25 Apr 07
Posts: 12
Credit: 441,995
RAC: 0
Message 46303 - Posted: 15 Sep 2007, 20:17:20 UTC

Same error as in 5.78, in the beta as well?

104916668 545978 11 Sep 2007 15:23:03 UTC 14 Sep 2007 5:36:06 UTC Over Success Done 9,430.36 50.13 20.00

I know you are working on it right now, but what about correcting the errors for the WUs that have already been completed?

RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 46304 - Posted: 15 Sep 2007, 20:56:47 UTC - in response to Message 46239.  

OK, great! I'm glad you were able to catch one. Assuming that others behave the same way (a bit of a stretch with only a single one observed, but it's all we have to go by)... the fact that it is still on model one is the reason why the task fails and only 20 credits are granted.


Here's another one

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46319 - Posted: 16 Sep 2007, 2:06:25 UTC
Last modified: 16 Sep 2007, 2:07:36 UTC

And a second "double failure" here

1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1g4u_-lig_plexinmonomer__2085_1427

stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2944674
ERROR:: Exit from: .pose.cc line: 769

</stderr_txt>
]]>

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46320 - Posted: 16 Sep 2007, 2:14:33 UTC
Last modified: 16 Sep 2007, 2:25:26 UTC

It's a good thing I don't care as much for credits as I do for the science...

Just look at the amount of time that is being "wasted"...

My wu's are set for 3 hrs (10,800 seconds), and this one ran for over TWICE that, 21,653 seconds !!!

All for 20 credits...

Here it is...

1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE_POSE_LOOPS-1g4u_-plexinmonomer__2083_3748_0

stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3582353
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -69.8041 for 1800 seconds
**********************************************************************
GZIP SILENT FILE: .xx1g4u.out

</stderr_txt>
]]>

Validate state Valid
Claimed credit 93.156397645739
Granted credit 20
application version 5.80
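
The watchdog message suggests a simple rule: end the run if the score has not changed for 1800 seconds. The Python sketch below illustrates that kind of check on a list of (time, score) samples. It is a guess at the general idea, not Rosetta's actual watchdog code; the function name, the sampling format, and the exact-equality comparison are all assumptions.

```python
from typing import Iterable, Optional, Tuple


def find_stall(
    samples: Iterable[Tuple[float, float]],
    limit_s: float = 1800.0,
) -> Optional[Tuple[float, float]]:
    """Return (time, score) of the first sample at which the score has stayed
    unchanged for at least limit_s seconds, or None if that never happens.

    samples must be (time_in_seconds, score) pairs in increasing time order.
    """
    last_score: Optional[float] = None
    unchanged_since = 0.0
    for t, score in samples:
        if last_score is None or score != last_score:
            # Score moved (or first sample): restart the stall timer.
            last_score = score
            unchanged_since = t
        elif t - unchanged_since >= limit_s:
            return t, score
    return None


if __name__ == "__main__":
    # Hypothetical samples: the score settles at -69.8041 and stops moving.
    run = [(0, -50.0), (600, -69.8041), (1200, -69.8041),
           (1800, -69.8041), (2400, -69.8041)]
    print(find_stall(run))
```

A check like this only ends a stuck run; it does not explain why the score stopped changing, which is what the reports in this thread are trying to pin down.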

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 46323 - Posted: 16 Sep 2007, 3:50:19 UTC

Here are a couple of failed WUs, both of them Capri14...

WU 95938082
WU 95780562

I think it is important to note that this computer has not had a single failure on non-Capri14 WUs, but has a dismal record of about 7 failures for each 8 attempts with Capri14...
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

tazrt
Joined: 31 Aug 06
Posts: 6
Credit: 468,735
RAC: 0
Message 46336 - Posted: 16 Sep 2007, 9:09:32 UTC

Hi,
I also have some trouble with 3 Capri-WUs.

2 of them are valid (granted Credit for 6-8h runtime = 20 credits) but have gotten stuck:
WUID:96055341 and WUID:95800425

1 Capri is invalid: Access Violation (0xc0000005)
WUID:95766573

PC is a non-overclocked Q6600 with 2 GB RAM.
Target CPU Runtime:12h.

Daniel
Joined: 4 Nov 05
Posts: 1
Credit: 11,084
RAC: 0
Message 46339 - Posted: 16 Sep 2007, 9:53:34 UTC

Still running 5.80 since Friday, with no errors.

system:
athlon64-3000
win2k sp4
1GB RAM

Rolly
Joined: 31 Dec 05
Posts: 4
Credit: 717,205
RAC: 0
Message 46340 - Posted: 16 Sep 2007, 10:40:13 UTC

I also noticed a first failure on my system, Result 105943441. It seems the unit also hung somewhere during computation.

I was surprised that this non-Capri unit is also using the Beta core. I understand using the Beta core for a competition on Rosetta@home, but for less urgent workunits, wouldn't it be better to test it on Ralph first?

Jmarks
Joined: 16 Jul 07
Posts: 132
Credit: 98,025
RAC: 0
Message 46345 - Posted: 16 Sep 2007, 12:08:27 UTC

Here are 4 more

https://boinc.bakerlab.org/rosetta/result.php?resultid=104541449
https://boinc.bakerlab.org/rosetta/result.php?resultid=104542777
https://boinc.bakerlab.org/rosetta/result.php?resultid=104585618
https://boinc.bakerlab.org/rosetta/result.php?resultid=104621606

I hope that the few CAPRI14 that actually make it through are worth it.
Jmarks

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,836,395
RAC: 0
Message 46346 - Posted: 16 Sep 2007, 12:14:15 UTC
Last modified: 16 Sep 2007, 12:15:02 UTC

Here's another one of those CAPRI units which failed.

https://boinc.bakerlab.org/rosetta/result.php?resultid=105829549

I happened to be looking at the graphics when it froze. 82 models had been crunched when it failed, and model 83 was at step 537. After that it was just waiting for the watchdog to terminate the task.

Paul
Joined: 29 Oct 05
Posts: 193
Credit: 64,858,035
RAC: 38
Message 46347 - Posted: 16 Sep 2007, 12:16:29 UTC

I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition?

I finally disconnected my Q6600 computer from Rosetta and started on other projects. Thus far, no computation errors.

Most of my other computers are Core Duo and they report no issues.

Is anyone else using an optimized boinc client?

Q6600
2GB RAM
500 GB Disk
400 MB Swap < this is very small
XP Home

Is anyone having this problem with Vista 32 or 64?

I will try increasing my swap space to 2GB and see if it corrects the problem.
Thx!

Paul

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 46353 - Posted: 16 Sep 2007, 12:55:25 UTC - in response to Message 46347.  
Last modified: 16 Sep 2007, 13:04:00 UTC

Good question ! May be sheer coincidence, but seems we're hearing about this with the Q6600's more than "average"...

I'm running standard Boinc client, Q6600, 2 GB RAM, Swap = 75% of page file, Vista Premium (32).

EDIT--> Just noticed an interesting post here.

I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition?

Jmarks
Joined: 16 Jul 07
Posts: 132
Credit: 98,025
RAC: 0
Message 46355 - Posted: 16 Sep 2007, 13:06:11 UTC - in response to Message 46353.  

Good question ! May be sheer coincidence, but seems we're hearing about this with the Q6600's more than "average"...

I'm running standard Boinc client, Q6600, 2 GB RAM, Swap = 75% of page file, Vista Premium (32).

I am starting to wonder if this problem with the failed work units is related to multicore or Q6600 processors. Could it be a memory management issue with the WUs attempting to access the same memory locations creating a lock or race condition?



I have a dual-core E6600 with 4 GB of RAM, and 70% of mine are bad also.
Jmarks

©2023 University of Washington
https://www.bakerlab.org