James Thompson Forum moderator Project administrator Project developer Project scientist Joined: Oct 13 05 Posts: 46 ID: 4392 Credit: 186,109 RAC: 0
We have an updated version of minirosetta v1.19 which should fix some of the stability issues with v1.15. Post minirosetta v1.19 bugs here.
____________
things must be going pretty well as the thread is so quiet...
good news too with win98 OS - the 1.19 app is running, completing and validating although an error message is still getting thrown up. no idea if this matters or not.
on all three wus completed so far this is the message:
Task ID 161439715
Name score13_hb_envtest62_A_1ctf__3171_14411_0
Workunit 147493846
Received 8 May 2008 11:10:33 UTC
Outcome Success
<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13875.8 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
]]>
work unit ID nos are:
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147390671
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147405464
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=147493846
ID: 52910 | Rating: 0 | rate:
/
radu Joined: May 7 08 Posts: 4 ID: 257126 Credit: 66,301 RAC: 0
I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.
It is quite possible (and logical IMO) that the client forcibly terminates all related processes upon detach. Otherwise it could not clean up client_state.xml, slots/ and projects/.
Peter
ID: 52913 | Rating: 0 | rate:
/
radu Joined: May 7 08 Posts: 4 ID: 257126 Credit: 66,301 RAC: 0
I get a crash when I detach from the project.
I'm not sure if this is a minirosetta bug.
Log messages seem to show that minirosetta was running when the crash occurred.
It is quite possible (and logical IMO) that the client forcibly terminates all related processes upon detach. Otherwise it could not clean up client_state.xml, slots/ and projects/.
Peter
I'm new to BOINC so I don't know how the detach operation is handled.
I don't use the gui manager and boinc_client appears to be the only BOINC related process running:
$ ps -e | grep boinc
6279 ? 00:00:05 boinc_client
Anyway killing related processes should not generate segmentation faults, so it's clearly an error in boinc_client.
I don't know if it has anything to do with minirosetta though.
Anyway killing related processes should not generate segmentation faults, so it's clearly an error in boinc_client.
I'm sorry, you are right. I was thinking of Rosetta crashing and overlooked that it was actually the client that crashed. Of course it should not. (And actually the application should also exit cleanly if asked to by the client.)
I don't know if it has anything to do with minirosetta though.
It should not. Which client, 5.10.45?
Peter
ID: 52915 | Rating: 0 | rate:
/
radu Joined: May 7 08 Posts: 4 ID: 257126 Credit: 66,301 RAC: 0
It should not. Which client, 5.10.45?
yes, 5.10.45
ID: 52916 | Rating: 0 | rate:
/
Rob Joined: Oct 16 06 Posts: 3 ID: 120303 Credit: 121,375 RAC: 0
Someone forgot to post the Minirosetta 1.19 details on the version thread.
ID: 52917 | Rating: 0 | rate:
/
Alexander Klauer Joined: Mar 10 08 Posts: 3 ID: 246483 Credit: 110,308 RAC: 0
Hi, I switched off my computer yesterday, in the middle (maybe 60%) of a task. When I switched it back on today, I got
Fri 09 May 2008 09:51:30 AM CEST|rosetta@home|URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 762923; location: (none); project prefs: default
Fri 09 May 2008 09:51:31 AM CEST|rosetta@home|Restarting task fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0 using minirosetta version 119
Fri 09 May 2008 09:52:00 AM CEST|rosetta@home|Computation for task fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0 finished
Fri 09 May 2008 09:52:01 AM CEST|rosetta@home|Starting lambda_repressor_folding_3191_8370_0
Fri 09 May 2008 09:52:01 AM CEST|rosetta@home|Starting task lambda_repressor_folding_3191_8370_0 using rosetta_beta version 596
Fri 09 May 2008 09:52:03 AM CEST|rosetta@home|Started upload of fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0_0
Fri 09 May 2008 09:52:14 AM CEST|rosetta@home|Finished upload of fa_max_dis_9-1ptq_-test_2008-5-6_3222_268_0_0
so the task finished virtually immediately after restart.
When I switched my computer on yesterday morning, I also had some task crunching at 0%. Back then I believed an old task had been restarted from the beginning due to some fluke, but now it seems more likely that the same thing as today happened. To me, it seems too much of a coincidence for a task interrupted in the middle to finish immediately after resuming, twice in a row.
All those crashes are a result of an out of memory error.
With 4Gb of memory what do I do to put it right?
You can run out of memory even with 64 GB of RAM... (Do you know the sentence about 64 KB of RAM?)
How much pagefile do you have available there? Any other memory load? Like other projects' applications, preempted and waiting in memory? Take a look occasionally at Task Manager, Performance tab - what are the Commit Charge values like? If the Total (or Peak) ever reaches the Limit, that's it. You're running at least 7 projects on the host, each Rosetta task can require up to 600-900 MB, CPDN at least some 200-300 MB, other projects need something as well, and it is a quad...
Peter
Yes, I understand, but my commit charge is a fraction of my available charge (10% at the moment). I have increased my page file to 6GB with a total memory of 4GB on Win XP Pro 64.
It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2976 ID: 106194 Credit: 0 RAC: 0
Fat Loss, I'm guessing that the error is an indication that the task grew to exceed the maximum memory it was configured for, and so was terminated by BOINC. And so, regardless of your machine's physical configuration or % memory used to BOINC etc. etc. it still would have failed. So that would tend to indicate a logic problem in Mini, or perhaps a task that should be created with a higher memory maximum allowed.
We'll have to wait to see what DK finds.
____________ Rosetta Moderator: Mod.Sense
ID: 52977 | Rating: 0 | rate:
/
Rom Walton (BOINC) Forum moderator Project administrator Project developer Joined: Sep 17 05 Posts: 18 ID: 84 Credit: 23,714 RAC: 0
It just strikes me that the very knowledgeable Rom is arrogant enough to point to the cause without indicating any sort of a solution.
In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash.
The sign that this sort of problem has occurred is:
Sorry for not explaining the situation sooner, I was heading for bed and I started thinking about how I was going to help the devs debug this problem in the wild if they are unable to reproduce this issue in the lab.
At present there isn't anything in the BOINC application framework that'll help them debug this in the wild.
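The 2GB figure mentioned above follows from simple address-space arithmetic. A minimal sketch, assuming the default 32-bit Windows split in which half of the 4 GiB address space is reserved for user mode (the variable names are illustrative only):

```python
# Address-space arithmetic for a 32-bit process (illustrative only).
POINTER_BITS = 32
GIB = 1024 ** 3

# Bytes addressable with 32-bit pointers.
total_address_space = 2 ** POINTER_BITS

# Assumption: default split, half of the address space for user mode.
user_mode_limit = total_address_space // 2

print(total_address_space // GIB, "GiB total")
print(user_mode_limit // GIB, "GiB user mode")
```

Adding physical RAM or pagefile does not raise this per-process ceiling, which is why the 4GB machine above still hit it.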
Thanks Rom and sorry for being a bit short with you. Sometimes wonder where all this irritability comes from.
I sometimes long for a slower pace of life LOL
____________
Task ID 162368970
Name SSPAIR_MIN_ABINITIO_1fna_3115_6915_2
Workunit 145958649
Created 10 May 2008 22:29:48 UTC
Sent 10 May 2008 22:30:26 UTC
Received 11 May 2008 15:18:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 735230
Report deadline 20 May 2008 22:30:26 UTC
CPU time 0
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
ERROR: Option matching -fudge not found in command line top-level context
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 0
Granted credit 0
application version 1.19
In this particular case there isn't anything that any of us can do, I've passed the info on to the MiniRosetta devs. Basically MiniRosetta is a 32-bit process, and generally 32-bit processes are limited to 2GB of user-mode memory. MiniRosetta hit that limit and so when it asked for more the OS said NO, leading to the crash.
Watching Process Explorer, the MiniRosetta application was constantly grabbing more memory. Both physical and virtual were increasing throughout the task's run. My preference was set to 24 hours, but it only made it to ~15.5 hours before reaching the 2GB limit. I then changed preferences to 2 hours and a task finished properly. It appears I will have to set preferences at 12 hours on this machine to avoid the 2GB limit.
For people running a Core2 at 3GHz or higher, you may want to try setting preferences at 8 hours or less to see if that helps.
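To put rough numbers on the observation above, here is a back-of-the-envelope sketch. It assumes memory use grows linearly from near zero and hits 2048 MB at about 15.5 hours; both are simplifying assumptions, and the function names are hypothetical:

```python
def growth_rate_mb_per_hour(limit_mb, hours_to_limit):
    """Average growth rate, assuming linear growth starting near zero."""
    return limit_mb / hours_to_limit


def projected_usage_mb(rate_mb_per_hour, runtime_hours):
    """Projected memory use after a given runtime under the same assumption."""
    return rate_mb_per_hour * runtime_hours


LIMIT_MB = 2048          # 2GB user-mode limit for a 32-bit process
HOURS_TO_LIMIT = 15.5    # reported time at which the limit was reached

rate = growth_rate_mb_per_hour(LIMIT_MB, HOURS_TO_LIMIT)
for hours in (2, 8, 12, 24):
    usage = projected_usage_mb(rate, hours)
    status = "over limit" if usage >= LIMIT_MB else "under limit"
    print(f"{hours:>2} h: ~{usage:.0f} MB ({status})")
```

Under these assumptions a 12-hour preference stays below the limit while a 24-hour one does not, which is consistent with the report. Real growth depends on the task and on CPU speed.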
____________
In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...
lines to stderr. The WUs ended up being marked "invalid".
These WUs were on separate machines, both running Linux.
This error, and the one I reported earlier in this thread, have the same signature as the errors I was getting with mini 1.15, errors which crippled two stable and reliable crunchers until I discovered a workaround.
The only difference now is that the mini 1.19 workunits take about twice as long to crash, resulting in twice as much wasted CPU time...
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 52993 | Rating: 0 | rate:
/
radu Joined: May 7 08 Posts: 4 ID: 257126 Credit: 66,301 RAC: 0
More segfaults, on Linux running the 5.10.45 client.
Apparently there was a problem with the network connection and the client kept trying to reconnect.
All tasks whose results were about to be sent were marked with "compute error", for example: http://boinc.bakerlab.org/rosetta/result.php?resultid=162637592
I hope this helps.
Output of dmesg:
tg3: eth0: Link is down.
minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
Clocksource tsc unstable (delta = -116217092 ns)
minirosetta_1.1[1348]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1363]: segfault at ff7fbff8 rip 881dcb0 rsp ff7fbed8 error 6
rosetta_beta_5.[1367]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1375]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1379]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
minirosetta_1.1[1390]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1395]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1411]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1407]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1422]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1426]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1434]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1440]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
rosetta_beta_5.[1449]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1454]: segfault at ff5fbfe8 rip 8e8fe90 rsp ff5fbec8 error 6
rosetta_beta_5.[1470]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1463]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1486]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1481]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
minirosetta_1.1[1498]: segfault at ff1fbff8 rip 881dcb0 rsp ff1fbed8 error 6
rosetta_beta_5.[1503]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
minirosetta_1.1[1509]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
rosetta_beta_5.[1514]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
minirosetta_1.1[1520]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6
rosetta_beta_5.[1526]: segfault at ff3fbfe8 rip 8e8fe90 rsp ff3fbec8 error 6
rosetta_beta_5.[1537]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6
rosetta_beta_5.[1543]: segfault at ff7fbfe8 rip 8e8fe90 rsp ff7fbec8 error 6
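A dmesg dump like the one above is easier to read once it is grouped. The following is a minimal sketch, not a BOINC tool: it assumes the kernel log format shown in this post and counts segfaults per process name and instruction pointer (rip). The helper name summarize_segfaults is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6
SEGFAULT_RE = re.compile(
    r"^(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>\d+)$"
)


def summarize_segfaults(lines):
    """Count segfaults per (process name, rip); non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.match(line.strip())
        if match:
            counts[(match.group("proc"), match.group("rip"))] += 1
    return counts


# A few lines copied from the dmesg output above.
sample = [
    "tg3: eth0: Link is down.",
    "minirosetta_1.1[1243]: segfault at ff5fbff8 rip 881dcb0 rsp ff5fbed8 error 6",
    "minirosetta_1.1[1258]: segfault at ff3fbff8 rip 881dcb0 rsp ff3fbed8 error 6",
    "Clocksource tsc unstable (delta = -116217092 ns)",
    "rosetta_beta_5.[1353]: segfault at ff1fbfe8 rip 8e8fe90 rsp ff1fbec8 error 6",
]

summary = summarize_segfaults(sample)
for (proc, rip), count in sorted(summary.items()):
    print(proc, rip, count)
```

In the full listing every minirosetta fault has rip 881dcb0 and every rosetta_beta fault has rip 8e8fe90, so each application appears to fail at a single location.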
For some reason this seems to be a problem with the Windows version of minirosetta. On my Linux server the memory usage peak seems to be around 150MB, and both computers have the same runtime preferences.
Hello all,
Running Ubuntu 7.10 x86 this Task ID: 162048556 has Outcome = Success, but a double message:
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 13761.5 cpu seconds
This process generated 4 decoys from 4 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 16847.8 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
]]>
From Boinc I got this message:
ma 12 mei 2008 01:33:52 CEST|rosetta@home|Task 1bkrA_BOINC_ABRELAX_IGNORE_THE_REST-S25-10-S3-11--1bkrA-_3181_3_1
exited with zero status but no 'finished' file
ma 12 mei 2008 01:33:52 CEST|rosetta@home|If this happens repeatedly you may need to reset the project.
Its total runtime was 16848.04 seconds.
This WU errored before, running on Windows XP as Invalid.
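The "double message" can be spotted mechanically by counting DONE blocks in stderr. This is a small sketch, assuming the stderr layout shown in the post above; parse_runs is a hypothetical helper, not part of BOINC or Rosetta:

```python
import re

DONE_RE = re.compile(r"DONE ::\s*(\d+) starting structures\s+([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"This process generated\s+(\d+) decoys from\s+(\d+) attempts")


def parse_runs(stderr_text):
    """Return one (cpu_seconds, decoys) tuple per DONE block found in stderr."""
    cpu_seconds = [float(m.group(2)) for m in DONE_RE.finditer(stderr_text)]
    decoys = [int(m.group(1)) for m in DECOY_RE.finditer(stderr_text)]
    return list(zip(cpu_seconds, decoys))


# stderr excerpt copied from the post above.
stderr_text = """
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
DONE :: 1 starting structures 13761.5 cpu seconds
This process generated 4 decoys from 4 attempts
called boinc_finish
# cpu_run_time_pref: 14400
DONE :: 1 starting structures 16847.8 cpu seconds
This process generated 1 decoys from 1 attempts
called boinc_finish
"""

runs = parse_runs(stderr_text)
if len(runs) > 1:
    print("task finished more than once:", runs)
```

More than one tuple means the application reached its end, was restarted by the client, and finished again, which fits the "exited with zero status but no 'finished' file" message.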
Apparently there was a problem with the network connection and the client kept trying to reconnect.
That's usually caused by a known, unfixed BOINC flaw, not Rosetta. When BOINC is resolving a domain name, it blocks all other communication, including with running tasks. If that continues past 30 seconds, things start failing/crashing. The only known workaround is changing the DNS timeout. That's done in resolv.conf (usually located at /etc/resolv.conf) by adding the line
options timeout:2
That makes each attempt 2 seconds, with the default of 2 retries per DNS server. You can play with the options some based on your number of DNS servers, but try not to go over 25 seconds.
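A rough way to check a resolv.conf setting against the 25-second guideline is sketched below. It uses a deliberately simplified model (timeout times attempts times number of servers) and assumes the commonly documented resolver defaults of a 5-second timeout and 2 attempts. Real resolvers may apply backoff differently, so treat the result as an estimate; the function name is hypothetical.

```python
def worst_case_dns_wait(timeout_s, attempts, num_servers):
    """Simplified upper bound: every attempt on every server times out."""
    return timeout_s * attempts * num_servers


HEARTBEAT_LIMIT_S = 30   # tasks give up after about 30 s without a heartbeat
SAFE_BUDGET_S = 25       # guideline from the post above

# Assumed defaults (5 s timeout, 2 attempts) with 3 nameservers.
default_wait = worst_case_dns_wait(5, 2, 3)

# With "options timeout:2" and the same 3 nameservers.
tuned_wait = worst_case_dns_wait(2, 2, 3)

print("default:", default_wait, "s, within budget:", default_wait <= SAFE_BUDGET_S)
print("tuned:  ", tuned_wait, "s, within budget:", tuned_wait <= SAFE_BUDGET_S)
```

Under this model the assumed defaults with three nameservers land right at the 30-second heartbeat limit, while timeout:2 leaves a comfortable margin.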
I got a similar err msg as AMD_is_logical except the task succeeded and validated even on top of the error reporting with every work unit done on win98 os. note, the 'psipred' line only occurred three times = no. of decoys, hmmmm?? http://boinc.bakerlab.org/rosetta/result.php?resultid=162095619
Received 11 May 2008 2:20:03 UTC
<core_client_version>5.10.30</core_client_version>
<stderr_txt>
AllocateAndInitializeSid Error 120
failed to create shared mem segment
WARNING: Override of option -out:nstruct sets a different value
can not open psipred_ss2 file tt
# cpu_run_time_pref: 14400
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
======================================================
DONE :: 1 starting structures 10977 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
(this work unit had been through boinc v.6.2 with another user where it failed to validate)
]]>
I just had a couple of mtlr_test2_S.00000001.*_3238_1 WUs error out.
In both cases the WU ran the normal length of time (16 hr), then printed a bunch of:
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
can not open psipred_ss2 file tt
...
lines to stderr. The WUs ended up being marked "invalid".
These WUs were on separate machines, both running Linux.
ID: 53007 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2976 ID: 106194 Credit: 0 RAC: 0
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
____________ Rosetta Moderator: Mod.Sense
ID: 53010 | Rating: 0 | rate:
/
Alan Roberts Joined: Jun 7 06 Posts: 61 ID: 93009 Credit: 2,803,710 RAC: 2,245
I've got a Mini 1.19 work unit with a duration of 13:27:06 (machine is set for 14hr target) that has consumed 33:17:05, with a progress of 0.000%.
Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something? Thanks.
____________
ID: 53015 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2976 ID: 106194 Credit: 0 RAC: 0
Is this a stuck job that should be aborted, or should I let it grind on in hopes of producing something?
I'd suggest you suspend it and resume it again and if progress % doesn't change within 5min of going back to a "running" status, I'd abort it.
____________ Rosetta Moderator: Mod.Sense
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
Not sure if there's a trac for this, but waiting for DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on win2k: one time when the internet connection went down, it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do DNS lookups...
Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".
And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out "read-error", I'm getting a "no heartbeat" in BOINC...
____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 53021 | Rating: 0 | rate:
/
Alan Roberts Joined: Jun 7 06 Posts: 61 ID: 93009 Credit: 2,803,710 RAC: 2,245
Mod.Sense, thanks for the suggestion. It doesn't seem to have fixed anything for this machine, but it does produce an interesting result.
According to BOINC Manager, this task is now suspended, and I see no CPU time accumulation within BOINC Manager. According to Windows Task Manager, minirosetta_1.1 is still grinding along, consuming CPU. When I resume the task the CPU time display within BOINC Manager catches up with what Windows Task Manager reports.
I'm off to see if there is a later version of BOINC, but this work unit is looking like an abort.
____________
Bitspit, is that the root cause of the "no heartbeat from client in 31 secs" msgs?? Do you have a link to the trac item for this?
Not sure if there's a trac for this, but waiting for DNS lookup is definitely one of the reasons for "no heartbeat". If I remember correctly, I've seen this problem on win2k: one time when the internet connection went down, it was spitting out "no heartbeat" all the time, while another machine running win2003 just continued crunching even though it couldn't do DNS lookups...
Not sure, but during very heavy disk usage it's likely also possible to get a "no heartbeat".
And, at least in my experience, each and every time any of the DVD players makes a nasty noise before spitting out "read-error", I'm getting a "no heartbeat" in BOINC...
David Baker has gotten this error on his own laptop.
stderr out <core_client_version>5.4.9</core_client_version>
<stderr_txt>
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
No heartbeat from core client for 30 sec - exiting
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 131.476 cpu seconds
This process generated 0 decoys from 0 attempts
Yes, that usually is the cause of it. I don't know if there's an official bug report on it. I do know it's a question that shows up in the BOINC forums every few months where the explanation is given and they claim it would be too much effort to fix.
Adding a little more, there are at least 2 open Trac tickets about this, #113 and #336.
____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 53028 | Rating: 0 | rate:
/
Rom Walton (BOINC) Forum moderator Project administrator Project developer Joined: Sep 17 05 Posts: 18 ID: 84 Credit: 23,714 RAC: 0
I'll throw in a bit more about the no heartbeat message.
At least once per release cycle we try to resolve this issue; so far the attempts to resolve it have led to crashes within the core client.
DNS resolution is done through libcurl, and using either libcurl's native async-dns solution or the c-ares library hasn't resolved the issue. We haven't found a way to reproduce this issue in a lab environment, and so we haven't been able to give the libcurl guys enough information to get it fixed.
So until we can get more info to the libcurl guys who can then fix it, the no heartbeat message is better than a crash.
Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.
____________
Large and detailed debugger report available at the link, if anyone is reading those things at this point.
The host that received the above error is 1/4 on mini 1.19 tasks that have a runtime preference in excess of 12 hours, but is 8/8 on mini 1.19 tasks with a runtime preference of 12 hours or less.
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
ID: 53040 | Rating: 0 | rate:
/
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2976 ID: 106194 Credit: 0 RAC: 0
Suddenly this morning, Rosetta is using all my CPU resources. I am having to kill it in order to do any work. I need to figure out how to leave the project - I cannot have this happening.
Nothing has changed on the end of Rosetta suddenly this morning. It is designed to run at a low priority, so anything else your computer is working on is ahead in line for the CPU. You can configure BOINC to use a fraction of the CPU, or to only run at specific times of day. You can just go to the advanced view, then use the advanced pulldown menu, and select preferences to set these up for that specific machine.
So, using all of your CPU is normal when you aren't doing anything else. And if it is causing any noticeable impact on your work, it is actually more likely an issue of how much memory is available than of the CPU being used.
____________ Rosetta Moderator: Mod.Sense
ID: 53041 | Rating: 0 | rate:
/
Alan Roberts Joined: Jun 7 06 Posts: 61 ID: 93009 Credit: 2,803,710 RAC: 2,245
I have just finished an "observing" session on a Windows 2K server where multiple Mini 1.19 tasks were not honoring suspend behavior. I'm allowed to run Rosetta jobs on this machine during off hours. When I examined the tasks within BOINC Manager they reported as suspended, and were not accumulating CPU time. Checking in Windows Task Manager showed Rosetta Mini merrily consuming CPU. When I toggled Activity with BOINC Manager from Run based on preferences to Run always I would see the CPU time within BOINC Manager "catch up" to that shown in Windows Task Manager.
I aborted the first Mini job, and the second started and demonstrated the same behavior. Shutdown the BOINC service (which did kill everything), and restarted. Problem continued. Shutdown BOINC again, uninstalled and reinstalled BOINC (5.10.45) and restarted. Problem continued. Aborted the second Mini job, observed problem with the third one, also aborted the job.
Now I've got Beta 5.96 tasks downloaded, and these are obeying suspend/resume flawlessly.
Has anyone else seen this, and more importantly if so is there a fix? I have a collection of machines where I'm allowed to run Rosetta only during off hours ... I'll have to pull them out of action if I can't count on reliable time of day suspends. Alternatively, is there any thing I can do to tell any machine exhibiting this behavior to avoid Mini jobs, since Beta 5.96 is behaving correctly?
____________
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3051 write attempt to address 0x00000024
http://boinc.bakerlab.org/rosetta/result.php?resultid=162428256
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB
\\\ + 1480Mb use of ram
http://boinc.bakerlab.org/rosetta/result.php?resultid=162386305
- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x76A342EB
I think the out of memory error is corrected already in minirosetta v1.2 which is going thru testing at ralph at the moment.
24h minirosetta v1.2 tasks are taking around 150M of memory, and the out of memory error in minirosetta v1.19 seems to exist only in the Windows version of the application.
Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.
{...}
Just throwing my thoughts around, but I think it's pointless to post out of memory errors since the problem is already fixed in v1.2.
"Pointless" only for those who: 1) Participate in RALPH@home, 2) Have long runtime preferences, 3) Run Windows operating systems, and 4) Agree with the conclusion that the problem is solved.
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
I have a single incidence of minirosetta v1.19 using both "cores" of my Pentium 4 with Hyper Thread.
It is not following the BOINC rules to use only 1 core/app/cpu.
It is currently running: Task ID 164060225
Task Name h003__BOINC_ABRELAX_IGNORE_THE_REST-S25-5-S3-3--h003_-_3321_121_0
____________
Running Ubuntu 7.10 x86 this task: 1opd__BOINC_ABRELAX_SAVE_ALL_OUT_IGNORE_THE_REST-S25-11-S3-4--1opd_-_3252_12
ended with a validate error for me after 11,727.76 seconds and ended successfully on the second run after 7,642.63 seconds running on Windows XP Professional Edition.
I've switched my computer off while this WU was running (no issues with that before).
Scott McInness Joined: Mar 15 08 Posts: 1 ID: 247250 Credit: 392,233 RAC: 0
I've just updated BOINC on a PC that I haven't used for BOINC for about 12 months (wow, there's an x64 version now!) and every work unit initiated with mini 1.19 x86_64 crashes after less than a second. It also seems to run as a 32-bit process...
165005856 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165012983 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165017550 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165018958 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
165019984 - Access Violation (0xc0000005) at address 0x73010175 read attempt to address 0x73010175
There is a Rosetta Beta 5.96 x86_64 task running atm (which is also running as a 32-bit process) just on 13% without problem, and SETI tasks (32-bit only) seem to work too.
Task ID 164939423
Name rb_05_19_11641_20436_T0407_IGNORE_THE_REST_04_16_3332_224_0 had a Compute error
CPU time 4351.235
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
======================================================
DONE :: 1 starting structures 4351.19 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
This is not a bug. I was wondering whether there are any plans to display which model the work unit is up to? Thanks for your hard work on this application on behalf of all of the crunchers.
Cheers
Speedy
One of my systems seems to be having issues running minirosetta v1.19 WUs. It is a 4 processor Intel Xeon CPU X3210 (two dual core chips) running Server 2003 R2.
It seems to be crunching through Rosetta Beta 5.96 WUs no problem, but when it goes to start a mini 1.19 WU, it switches the task to "running" but no CPU time is ever used and the task stays at 0%. If I suspend all of the mini 1.19 WUs that are queued up, the system immediately begins crunching on any Rosetta Beta 5.96 WUs without any problem. I have left the system sitting in the "running" @ 0% state on mini units for hours and it hasn't gotten anywhere; my only option seems to be to suspend or abort the work units.
I have two other machines - one an Intel P4, the other a Core 2 Quad 9300, both running XP - that seem to have no problems running mini or beta WUs.
Is it possible to get the client to not receive mini WUs? Or is there some known reason behind these stalled work units that there is a workaround for?
Thanks,
Adam
ID: 53225 | Rating: 0 | rate:
/
Jeremy Joined: May 15 08 Posts: 13 ID: 259031 Credit: 2,636 RAC: 0
I have had nothing but Compute errors with the mini version of rosetta. See this page
http://boinc.bakerlab.org/rosetta/results.php?userid=259031
I'd rather only have the normal ones, for 2 reasons. One, it keeps giving errors, so the CPU time isn't put to use. Two, it doesn't have proper graphics, but I've read that that is not a priority.
I'd like to help debugging this application by sending whatever information you need.
Here is my host sheet.
http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=812509
another one: h001__BOINC_ABRELAX_IGNORE_THE_REST-S25-11-S3-5--h001_-_3324_45140_0
Client error
Client state Done
Exit status -1073741819 (0xc0000005)
Computer ID 293392
Report deadline 30 May 2008 19:09:52 UTC
CPU time 19774.5
stderr out <core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C3030 write attempt to address 0x00000004
Engaging BOINC Windows Runtime Debugger...
it did grant me credit, amazingly enough
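The negative exit status and the hex code in the report above are the same 32-bit value shown two ways. A minimal sketch of the conversion follows; the lookup table only covers the two codes that appear in this thread, and the helper names are hypothetical:

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return code & 0xFFFFFFFF


# Codes seen in this thread only; not an exhaustive list.
KNOWN_CODES = {
    0xC0000005: "Access Violation",
    0xE06D7363: "C++ exception (reported here as Out Of Memory)",
}


def describe_exit_status(code):
    """Format an exit status as hex plus a short description when known."""
    unsigned = to_unsigned32(code)
    return f"0x{unsigned:08x} {KNOWN_CODES.get(unsigned, 'unknown')}"


print(describe_exit_status(-1073741819))
```

This helps when searching the forum: some reports quote the decimal exit status while others quote the hex code from the debugger output.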
____________