minirosetta v1.19 bug thread

Ingleside
Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 53028 - Posted: 13 May 2008, 0:07:27 UTC - in response to Message 53020.  

Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.

Adding a little more, there are at least 2 open Trac tickets about this, #113 and #336.


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 53028 · Rating: 0
Rom Walton (BOINC)
Volunteer moderator
Project developer
Joined: 17 Sep 05
Posts: 18
Credit: 40,071
RAC: 0
Message 53029 - Posted: 13 May 2008, 4:33:35 UTC

I'll throw in a bit more about the no heartbeat message.

At least once per release cycle we try to resolve this issue; so far, the attempts have led to crashes within the core client.

DNS resolution is done through libcurl, and using either libcurl's native async-dns solution or the c-ares library hasn't resolved the issue. We haven't found a way to reproduce this issue in a lab environment, and so we haven't been able to give the libcurl guys enough information to get it fixed.

So until we can get more info to the libcurl guys who can then fix it, the no heartbeat message is better than a crash.
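
For readers unfamiliar with the message, here is a minimal Python sketch of a heartbeat timeout. It is not BOINC's actual code. It assumes the application records when it last heard from the core client and gives up once a timeout passes; the class names, method names, and the 30-second value are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the time since the last heartbeat (hypothetical helper)."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        # timeout_s is an assumed value, not taken from the BOINC source.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat arrived from the client."""
        self.last_beat = self.clock()

    def expired(self):
        """Return True once no heartbeat has arrived within timeout_s."""
        return self.clock() - self.last_beat > self.timeout_s


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    monitor = HeartbeatMonitor(timeout_s=30.0, clock=clock)
    clock.advance(10)
    print(monitor.expired())  # still inside the timeout window
    clock.advance(31)         # simulate a client stalled in a blocking call
    print(monitor.expired())  # no beat() arrived, so the timeout has passed
```

If the client's main loop is stuck in a blocking call for longer than the timeout, `beat()` is never called and the application treats the silence as a lost client.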

----- Rom
ID: 53029 · Rating: 0
Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 53032 - Posted: 13 May 2008, 11:37:03 UTC

Finally got one to finish https://boinc.bakerlab.org/rosetta/workunit.php?wuid=148643026

It consumed 1,063MB of memory and similar VM. This was on a 12hr run.

Bet if I had rebooted it would have failed.

I watched the last 5% in Task Manager. The time-to-completion estimate stopped at 9 mins 59 secs and the WU finished at 96.6%.

I then got a lot less credit than requested LOL

Hope the result was worth it. :)
ID: 53032 · Rating: 0
[KWSN]John Galt 007
Joined: 4 Aug 06
Posts: 6
Credit: 1,017,647
RAC: 0
Message 53035 - Posted: 13 May 2008, 14:55:53 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=145523230

Client errors on 2 machines, one of which is mine. 0.00 seconds, so no time lost.
ID: 53035 · Rating: 0
RiverboatSam
Joined: 9 Dec 05
Posts: 1
Credit: 59,080
RAC: 0
Message 53037 - Posted: 13 May 2008, 16:58:46 UTC

Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.
ID: 53037 · Rating: 0
glaesum
Joined: 16 Oct 06
Posts: 21
Credit: 508,632
RAC: 8
Message 53038 - Posted: 13 May 2008, 17:03:35 UTC

error #161 (whatever that is)

finally a wu failed, that's on top of the usual non-fatal 120 error:
resultid=162869266

<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
:
BOINC :: Watchdog shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>rb_05_12_11631_20348_T0397_IGNORE_THE_REST_10_16_3247_49_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
]]>
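
For anyone collecting these reports, here is a short Python sketch that pulls the key fields out of a pasted result like the one above. It uses regular expressions rather than an XML parser because pasted fragments are often truncated and not well-formed. The function name and the returned keys are made up for illustration.

```python
import re


def summarize_result(text: str) -> dict:
    """Extract a few fields from a pasted BOINC result fragment."""
    version = re.search(
        r"<core_client_version>([^<]+)</core_client_version>", text
    )
    return {
        # Client version string, or None if the tag is missing.
        "client_version": version.group(1) if version else None,
        # Every "# cpu_run_time_pref: N" line, as integers (seconds).
        "run_time_prefs": [
            int(v) for v in re.findall(r"#\s*cpu_run_time_pref:\s*(\d+)", text)
        ],
        # Every <error_code> value, including negative ones.
        "error_codes": [
            int(v)
            for v in re.findall(r"<error_code>\s*(-?\d+)\s*</error_code>", text)
        ],
    }


if __name__ == "__main__":
    sample = """<core_client_version>5.10.30</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 14400
called boinc_finish
</stderr_txt>
<file_xfer_error>
<error_code>-161</error_code>
</file_xfer_error>"""
    print(summarize_result(sample))
```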
ID: 53038 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53040 - Posted: 13 May 2008, 18:51:06 UTC
Last modified: 13 May 2008, 18:55:46 UTC

Error number 4, at 77,900+ CPU seconds.

Reason: Access Violation (0xc0000005) at address 0x005C1E7C write attempt to address 0x00000024

Large and detailed debugger report available at the link, if anyone is reading those things at this point.

The host that received the above error is 1/4 on mini 1.19 tasks that have a runtime preference in excess of 12 hours, but is 8/8 on mini 1.19 tasks with a runtime preference of 12 hours or less.
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 53040 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 53041 - Posted: 13 May 2008, 19:11:28 UTC - in response to Message 53037.  

Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.


Nothing changed on Rosetta's end this morning. It is designed to run at a low priority, so anything else your computer is working on is ahead of it in line for the CPU. You can configure BOINC to use a fraction of the CPU, or to run only at specific times of day. To set these up for that specific machine, go to the advanced view, open the Advanced pulldown menu, and select Preferences.

So, using all of your CPU is normal when you aren't doing anything else. And if it is causing any noticeable impact on your work, that is more likely an issue of how much memory is available than of the CPU being used.
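
As a rough illustration of what a time-of-day restriction amounts to, the Python sketch below decides whether computing is allowed at a given hour. It is not BOINC's code. The function name is hypothetical, and treating equal start and end hours as "no restriction" is an assumption made for the example.

```python
def in_run_window(hour: float, start_hour: float, end_hour: float) -> bool:
    """Return True if `hour` (0-24) falls inside the allowed window."""
    if start_hour == end_hour:
        # Assumed convention: identical hours mean no restriction at all.
        return True
    if start_hour < end_hour:
        # Same-day window, e.g. 08:00 to 17:00.
        return start_hour <= hour < end_hour
    # Window that wraps past midnight, e.g. 22:00 to 06:00.
    return hour >= start_hour or hour < end_hour


if __name__ == "__main__":
    # Off-hours window from 22:00 to 06:00.
    for h in (23, 3, 12):
        print(h, in_run_window(h, 22, 6))
```

The wrap-around branch is the part that is easy to get wrong: an off-hours window such as 22:00 to 06:00 has a start hour larger than its end hour.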
Rosetta Moderator: Mod.Sense
ID: 53041 · Rating: 0
Alan Roberts
Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 53043 - Posted: 13 May 2008, 20:20:04 UTC

I have just finished an "observing" session on a Windows 2K server where multiple Mini 1.19 tasks were not honoring suspend behavior. I'm allowed to run Rosetta jobs on this machine during off hours. When I examined the tasks within BOINC Manager they reported as suspended, and were not accumulating CPU time. Checking in Windows Task Manager showed Rosetta Mini merrily consuming CPU. When I toggled Activity with BOINC Manager from Run based on preferences to Run always I would see the CPU time within BOINC Manager "catch up" to that shown in Windows Task Manager.

I aborted the first Mini job, and the second started and demonstrated the same behavior. I shut down the BOINC service (which did kill everything) and restarted. Problem continued. I shut down BOINC again, uninstalled and reinstalled BOINC (5.10.45), and restarted. Problem continued. I aborted the second Mini job, observed the problem with the third one, and aborted that job as well.

Now I've got Beta 5.96 tasks downloaded, and these are obeying suspend/resume flawlessly.

Has anyone else seen this, and more importantly, if so, is there a fix? I have a collection of machines where I'm allowed to run Rosetta only during off hours ... I'll have to pull them out of action if I can't count on reliable time-of-day suspends. Alternatively, is there anything I can do to tell a machine exhibiting this behavior to avoid Mini jobs, since Beta 5.96 is behaving correctly?
ID: 53043 · Rating: 0
caesar1987
Joined: 28 Nov 06
Posts: 13
Credit: 22,268
RAC: 0
Message 53048 - Posted: 14 May 2008, 0:45:13 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=162352303
ID: 53048 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 53049 - Posted: 14 May 2008, 8:50:12 UTC

Validate error on an 84k+ seconds task (I'd say... rather annoying)

https://boinc.bakerlab.org/rosetta/result.php?resultid=162388905
ID: 53049 · Rating: 0
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 53054 - Posted: 14 May 2008, 17:00:30 UTC

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024
https://boinc.bakerlab.org/rosetta/result.php?resultid=162428256

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB
+ 1480 MB of RAM in use
https://boinc.bakerlab.org/rosetta/result.php?resultid=162386305

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB

https://boinc.bakerlab.org/rosetta/result.php?resultid=162246424
ID: 53054 · Rating: 0
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 53055 - Posted: 14 May 2008, 17:03:01 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=161724764
https://boinc.bakerlab.org/rosetta/result.php?resultid=161544482
https://boinc.bakerlab.org/rosetta/result.php?resultid=161438499
https://boinc.bakerlab.org/rosetta/result.php?resultid=161028445
ID: 53055 · Rating: 0
popandbob
Joined: 30 Oct 05
Posts: 4
Credit: 1,668,419
RAC: 1
Message 53059 - Posted: 14 May 2008, 19:31:42 UTC

3 errors...

Error 1

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 36000
# cpu_run_time_pref: 36000

ERROR: Conformation: fold_tree nres should match size
ERROR:: Exit from: ..\..\src\core\conformation\Conformation.cc line: 192
called boinc_finish

</stderr_txt>

Error 2

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

ERROR: unrecognized atom_type_name HOH
ERROR:: Exit from: c:\cygwin\home\boinc\boinc_build\minirosetta_1.19\mini\src\core/chemical/AtomTypeSet.hh line: 79
called boinc_finish

</stderr_txt>
]]>

Error 3

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 36000


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004

Engaging BOINC Windows Runtime Debugger...
ID: 53059 · Rating: 0
alpha
Joined: 4 Nov 06
Posts: 27
Credit: 1,550,107
RAC: 0
Message 53061 - Posted: 14 May 2008, 19:43:11 UTC

Access violation (exit code -1073741819 (0xc0000005)) after nearly 22,000 seconds:

https://boinc.bakerlab.org/rosetta/result.php?resultid=162546878
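
The two numbers in that exit code are the same 32-bit value, printed once as a signed integer and once as unsigned hex. A small Python sketch of the conversion in both directions (the function names are illustrative):

```python
def exit_code_to_hex(code: int) -> str:
    """Format a signed 32-bit exit code as unsigned hex."""
    # Masking with 0xFFFFFFFF reinterprets the value as unsigned 32-bit.
    return f"0x{code & 0xFFFFFFFF:08x}"


def hex_to_exit_code(text: str) -> int:
    """Convert an unsigned 32-bit hex string back to a signed integer."""
    value = int(text, 16)
    # Values with the top bit set are negative in two's complement.
    return value - (1 << 32) if value >= (1 << 31) else value


if __name__ == "__main__":
    print(exit_code_to_hex(-1073741819))   # 0xc0000005
    print(hex_to_exit_code("0xc0000005"))  # -1073741819
```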
ID: 53061 · Rating: 0
Jipsu
Joined: 27 Jan 08
Posts: 10
Credit: 454,555
RAC: 0
Message 53072 - Posted: 15 May 2008, 12:47:45 UTC
Last modified: 15 May 2008, 12:48:23 UTC

I think the out of memory error is already corrected in minirosetta v1.2, which is going through testing at RALPH at the moment.

24h minirosetta v1.2 tasks are taking around 150M of memory, and the out of memory error in minirosetta v1.19 seems to exist only in the Windows version of the application.

Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.
ID: 53072 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53079 - Posted: 15 May 2008, 18:38:43 UTC - in response to Message 53072.  

{...}
Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.


"Pointless" only for those who: 1) Participate in RALPH@home, 2) Have long runtime preferences, 3) Run Windows operating systems, and 4) Agree with the conclusion that the problem is solved.
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 53079 · Rating: 0
Venturini Dario[VENETO]
Joined: 25 May 07
Posts: 22
Credit: 245,028
RAC: 0
Message 53100 - Posted: 17 May 2008, 10:02:17 UTC - in response to Message 53049.  

Validate error on an 84k+ seconds task (I'd say... rather annoying)

https://boinc.bakerlab.org/rosetta/result.php?resultid=162388905



And here's another one:

https://boinc.bakerlab.org/rosetta/result.php?resultid=162547290


Both happened after a segfault error some hours before completion.
ID: 53100 · Rating: 0
Dr Who Fan
Joined: 28 May 06
Posts: 62
Credit: 229,047
RAC: 85
Message 53112 - Posted: 17 May 2008, 22:27:58 UTC
Last modified: 17 May 2008, 22:30:04 UTC

Anyone else seen this yet?

I have a single instance of minirosetta v1.19 using both "cores" of my Pentium 4 with Hyper-Threading.
It is not following the BOINC rules to use only 1 core/app/cpu.

It is currently running:
Task ID 164060225
Task Name h003__BOINC_ABRELAX_IGNORE_THE_REST-S25-5-S3-3--h003_-_3321_121_0

ID: 53112 · Rating: 0
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 53116 - Posted: 18 May 2008, 1:19:31 UTC

For those with inquiring minds:

rb_05_17_11407_20379_tim23_IGNORE_THE_REST_06_10_3329_49

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10440.8 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 28.6254621435188
Granted credit 0
application version 1.19



and


rb_05_17_11462_20386_CRF-BP_IGNORE_THE_REST_06_17_3330_30

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10685.3 cpu seconds
This process generated 6 decoys from 6 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 29.2953911287149
Granted credit 0
application version 1.19
ID: 53116 · Rating: 0