Rosetta@home

Minirosetta v1.47 bug thread.

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 57902 - Posted 15 Dec 2008 22:08:36 UTC

HoHo kids!

We've got a new minirosetta version, with - you've guessed it - more bug fixes ! Woo!

Please report remaining issues here - that would be grand :)
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Stephen

Joined: Apr 26 08
Posts: 32
ID: 255217
Credit: 429,286
RAC: 0
Message 57903 - Posted 15 Dec 2008 22:21:18 UTC - in response to Message ID 57902.

are there any new changes to the science?

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,318,714
RAC: 3,121
Message 57905 - Posted 15 Dec 2008 23:25:25 UTC

......sooooo which bugs do you feel you've fixed?

Which of the many users that have abandoned the project due to problems should feel it is safe to reenter the waters?
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 57906 - Posted 16 Dec 2008 0:04:21 UTC - in response to Message ID 57905.

......sooooo which bugs do you feel you've fixed?

Which of the many users that have abandoned the project due to problems should feel it is safe to reenter the waters?


Amongst a bunch of minor things, one major bug that was fixed was causing jobs to crash when they entered the full-atom stage but had a full-atom energy > 0. That usually occurs rarely, which would explain the random errors seen with the cs_vanilla jobs. The bug was due to a wrongly initialized variable.
This bug was also causing the majority of the ccc_1_8_* jobs to fail on RALPH (we didn't move these over to BOINC of course, since we noticed the bug there).
The reason those failed more frequently was that they have constraints built in, and those cause the energy to be offset to higher values, increasing the frequency of the problem to more like 70%.

Looking at the RALPH results, I think we've fixed most of the easily reproducible errors. I recently ran close to 10000 WUs on our local compute cluster resulting in.. well.. 0 errors. This is where it gets tricky really, if stuff is only failing on other platforms or due to machine-dependent issues or *god knows what*. I will propose that the lab acquire a small farm of Windows machines to do extensive bug testing & hunting on, to get a grip on these errors.. but believe us, these are difficult grounds.

Thanks for bearing with us,

Mike
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 57910 - Posted 16 Dec 2008 1:24:57 UTC

Sorry Mike, not a good start...

1483407

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

Chilean

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,201,906
RAC: 4,658
Message 57911 - Posted 16 Dec 2008 1:39:50 UTC - in response to Message ID 57906.

......sooooo which bugs do you feel you've fixed?

Which of the many users that have abandoned the project due to problems should feel it is safe to reenter the waters?


Amongst a bunch of minor things, one major bug that was fixed was causing jobs to crash when they entered the full-atom stage but had a full-atom energy > 0. That usually occurs rarely, which would explain the random errors seen with the cs_vanilla jobs. The bug was due to a wrongly initialized variable.
This bug was also causing the majority of the ccc_1_8_* jobs to fail on RALPH (we didn't move these over to BOINC of course, since we noticed the bug there).
The reason those failed more frequently was that they have constraints built in, and those cause the energy to be offset to higher values, increasing the frequency of the problem to more like 70%.

Looking at the RALPH results, I think we've fixed most of the easily reproducible errors. I recently ran close to 10000 WUs on our local compute cluster resulting in.. well.. 0 errors. This is where it gets tricky really, if stuff is only failing on other platforms or due to machine-dependent issues or *god knows what*. I will propose that the lab acquire a small farm of Windows machines to do extensive bug testing & hunting on, to get a grip on these errors.. but believe us, these are difficult grounds.

Thanks for bearing with us,

Mike


I can't even imagine the loads of code you (guys) went thru.
____________

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,318,714
RAC: 3,121
Message 57913 - Posted 16 Dec 2008 1:56:14 UTC

The one bug that comes to mind that would not be easy to observe by counting successfully completed results, on a farm of Linux machines all running only a single project, would be where the tasks were not suspending properly. Someone mentioned a BOINC API compatibility problem might be the cause?

What would reasonable memory expectations be now? Are all the 1.47 tasks tagged as needing 512MB minimum? Or is there a mix? And, of that 512MB, what should one expect to see a task actually using when running normally?
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 57914 - Posted 16 Dec 2008 2:12:00 UTC - in response to Message ID 57913.

The one bug that comes to mind that would not be easy to observe by counting successfully completed results, on a farm of Linux machines all running only a single project, would be where the tasks were not suspending properly. Someone mentioned a BOINC API compatibility problem might be the cause?

You're right. However I believe David Kim has updated and fixed this problem, at 1.45. If you guys *still* see problems with suspension of jobs then do let us know. We also hope that this lockfile problem should be largely fixed. We'll have to wait for the error statistics to come in before we know if the API fix has worked.



What would reasonable memory expectations be now? Are all the 1.47 tasks tagged as needing 512MB minimum? Or is there a mix? And, of that 512MB, what should one expect to see a task actually using when running normally?


I can't speak for the enzyme design guys but to give you an idea:

The jobs named "*_rlbd_*" and "*_rlbn_*" should take no more than 160 MB or so.
The jobs named "cc2_*" or "*_chunk_*" should take between 150 and 320MB or so (they are much larger proteins).

I'm not aware of any jobs that require more than 400MB; that would definitely point to a problem. Although the enzyme design guys may well have higher requirements.

Mike


____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
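
For anyone who wants to compare what Task Manager shows against the rough figures in the post above, here is a minimal Python sketch. It only encodes the numbers quoted there ("160 MB or so", "between 150 and 320MB or so", and 400MB as the point where something looks wrong); the table, the function names and the idea of matching on the task name are illustrative assumptions, not anything the project ships.

```python
from fnmatch import fnmatchcase

# Approximate upper bounds in MB, copied from the figures quoted in the post.
# The table layout and names are hypothetical, for illustration only.
EXPECTED_MAX_MB = [
    ("*_rlbd_*", 160),
    ("*_rlbn_*", 160),
    ("cc2_*", 320),
    ("*_chunk_*", 320),
]

# "I'm not aware of any jobs that require more than 400MB"
FALLBACK_MAX_MB = 400


def expected_max_mb(task_name):
    """Return the rough memory ceiling for a task name, or None if no pattern matches."""
    for pattern, limit in EXPECTED_MAX_MB:
        if fnmatchcase(task_name, pattern):
            return limit
    return None


def exceeds_expectation(task_name, observed_mb):
    """True if the observed memory use is above the rough ceiling for this task type."""
    limit = expected_max_mb(task_name)
    if limit is None:
        limit = FALLBACK_MAX_MB
    return observed_mb > limit
```

Since the original figures are given as "or so", a small overshoot should not be read as a bug; the check is only useful for spotting tasks that are far outside the expected range.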

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 57915 - Posted 16 Dec 2008 2:14:15 UTC - in response to Message ID 57910.

Sorry Mike, not a good start...


Yes, I know. I'm not saying the app is perfect - just that we found a bunch of definite bugs that are now fixed. No doubt there are still issues - we're working on it :)

Mike

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 57916 - Posted 16 Dec 2008 2:16:13 UTC - in response to Message ID 57911.


I can't even imagine the loads of code you (guys) went thru.


Well.. to give you an idea .. Minirosetta has more than 200000 (yes two hundred thousand) lines of code. Each day there are maybe around 20 additions to the code, with around 40 people working on the code at each given time.

But we'll get there, I'm optimistic that with time we'll find the problems.
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 57923 - Posted 16 Dec 2008 8:53:03 UTC
Last modified: 16 Dec 2008 8:54:42 UTC

i hope you guys get a small farm of windows machines to double check problems against your linux machines. windows is what the majority of us crunchers use and certain error types may or may not show up on linux.

for instance, how does one tell the difference between a machine error and an application error when the task dies with a (0xc0000005) error? is this something that shows up on your linux machines? or is that a specific windows error code?

also in another thread you mentioned aborting tasks that are using lower than 1.47. would these tasks be reissued using 1.47 or would they use the same mini that they originated with?
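
A note for readers comparing the exit codes reported in this thread: the decimal and hexadecimal forms are the same value written two ways. 0xc0000005 is the Windows status code for an access violation (the stderr excerpt earlier in the thread says as much), which is roughly the Windows counterpart of the SIGSEGV seen in other reports here. A minimal Python sketch of the conversion follows; the helper names are illustrative and not part of BOINC.

```python
def as_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as its unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def as_signed8(code):
    """Reinterpret the low byte of an exit code as a signed 8-bit value."""
    low = code & 0xFF
    return low - 0x100 if low >= 0x80 else low


# Windows-style codes as they appear in the task reports in this thread.
for code in (-1073741819, -177):
    print(code, hex(as_unsigned32(code)))

# The Mac report prints "193 (0xc1, -63)": same byte, three notations.
print(193, hex(193), as_signed8(193))
```

This prints 0xc0000005 for -1073741819, 0xffffff4f for -177, and -63 for 193. The conversion alone does not say whether the machine or the application is at fault; it only shows which kind of failure the code names.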

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 57926 - Posted 16 Dec 2008 10:50:42 UTC

My 1.47 cc2_1_8_mammoth-tasks have all crashed on Ralph, now my 1.47 cc2_1_8_native-tasks are crashing on Rosetta.

Example (1 of 2):
cc2_1_8_native_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_4_5599_36_0

<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(95094,0xa0538fa0) malloc: *** error for object 0x1747df0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 57927 - Posted 16 Dec 2008 11:59:02 UTC

And now all today's imported 1.47-tasks for the upcoming week have collapsed, most of them after less than 1 minute of computing, one was manually aborted as potentially ever-lasting.

It seems that I have to stick to my 5.98-tasks for some days and increase the default runtime.

jjwhalen

Joined: Dec 20 06
Posts: 4
ID: 137022
Credit: 399,398
RAC: 0
Message 57928 - Posted 16 Dec 2008 12:17:35 UTC

Minirosetta apparently "looks like" malware, whether it actually is or not. This applies to all versions I've run, thru v1.47.

I run BOINC on two WinVista (God help me) boxes: one a 32 bit Sony with ZoneAlarm Pro|ESET NOD32 for security; the other a 64 bit Sony with Kaspersky Internet Security 2009.

On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released. Interestingly, ZoneAlarm Pro's application module hasn't had a problem with it.

On the 64 bit machine, Kaspersky's Application Control module gives Minirosetta's executable a Threat Rating of "Potentially Dangerous" with a heuristic Danger Index score of 82. I have to manually override Kaspersky and move Minirosetta out of the "Untrusted Application" zone, to allow it to execute. (By comparison, Rosetta Beta 5.98 has a DI of 12, as does SETI's recently released Astropulse 5.0. SETI's regular Enhanced v6.03 has a DI of zero.)

I realize that heuristic analysis is as much art as science, but both ESET and Kaspersky are rated at or near the top of their field. Of 10 project hosts I subscribe to, with over 25 project executables, Minirosetta is the ONLY one that has ever sent up a red flag to my security suite(s). Since most folks leave their security suite (if any) on autopilot, there are potentially many testers who never get to run Minirosetta because the .exe goes immediately into a black hole. Somewhere in those 200,000 lines of code, something apparently looks funky.

____________
Best wishes:)

Nothing But Idle Time

Joined: Sep 28 05
Posts: 209
ID: 1675
Credit: 139,545
RAC: 0
Message 57929 - Posted 16 Dec 2008 12:32:05 UTC
Last modified: 16 Dec 2008 12:34:34 UTC

After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours.

Because of the 1-hour runtimes BOINC also downloaded additional tasks to fill the cache. Not good.

funkydude

Joined: Jun 15 08
Posts: 12
ID: 264493
Credit: 146,106
RAC: 0
Message 57935 - Posted 16 Dec 2008 13:17:00 UTC - in response to Message ID 57928.


On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released.


Hello, I've been using both nod32 and rosetta for years now, I've never had nod32 detect rosetta as anything malicious, make sure you are updated. v3.0.672.0 DB 3695 as of writing.

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 57936 - Posted 16 Dec 2008 13:29:41 UTC - in response to Message ID 57915.

Sorry Mike, not a good start...


Yes, I know. I'm not saying the app is perfect - just that we found a bunch of definite bugs that are now fixed. No doubt there are still issues - we're working on it :)


That's ok. Just that I'm trying to get more active here again after some computer problems and the first 1.47 task crashed out quickly. The next 4 have run with no problems though. Hopefully that continues. Usually all the problems are mine, not yours.

Good to see a more active presence from you in this forum. Your feedback to issues makes a big difference, even if it's just to say you're working on it without a solution yet. That matters too.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 57939 - Posted 16 Dec 2008 14:42:37 UTC - in response to Message ID 57936.

Sorry Mike, not a good start...


Yes, I know. I'm not saying the app is perfect - just that we found a bunch of definite bugs that are now fixed. No doubt there are still issues - we're working on it :)


That's ok. Just that I'm trying to get more active here again after some computer problems and the first 1.47 task crashed out quickly. The next 4 have run with no problems though. Hopefully that continues. Usually all the problems are mine, not yours.

Good to see a more active presence from you in this forum. Your feedback to issues makes a big difference, even if it's just to say you're working on it without a solution yet. That matters too.



Just to expand on the point of this person....Thanks for taking the time to tell us what is going on. We like to know and the silence has been deafening lately.
Thanks again for breaking it. We hope for more news as time goes along.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 57941 - Posted 16 Dec 2008 21:26:31 UTC

Hi.

I found a problem with the graphics on Ubuntu 8.04. Mini 1.45 worked fine, but now when I click the show graphics button all I get is the outline of the graphics window; it looked transparent. I could not close it normally, I had to go to processes and kill it from there. Also, it was showing that the graphics was using mini 1.40 for some reason. I'm sure that mini 1.45 was using the graphics 1.45. Not a biggie, but still.

pete.

____________


RodrigoPS

Joined: Nov 28 08
Posts: 3
ID: 289807
Credit: 860,206
RAC: 34
Message 57944 - Posted 16 Dec 2008 23:14:36 UTC - in response to Message ID 57941.
Last modified: 16 Dec 2008 23:16:20 UTC

Hi.

I found a problem with the graphics on Ubuntu 8.04. Mini 1.45 worked fine, but now when I click the show graphics button all I get is the outline of the graphics window; it looked transparent. I could not close it normally, I had to go to processes and kill it from there. Also, it was showing that the graphics was using mini 1.40 for some reason. I'm sure that mini 1.45 was using the graphics 1.45. Not a biggie, but still.

pete.



I'm having the same problem, but in XP 32-bit, in one of the hosts after the installation of mini 1.47

Mr. Ed

Joined: Dec 16 08
Posts: 8
ID: 292934
Credit: 28,443
RAC: 0
Message 57945 - Posted 16 Dec 2008 23:19:30 UTC
Last modified: 17 Dec 2008 0:08:00 UTC

Not sure if this is related or not...

Crux of my problem is this, I have no graphic display, the screen saver is blank and when I hit the 'show graphics' button in the advanced view, it opens a window (title - minirosetta version 1.47 [workunit: cs_noe_ .... etc]) that is blank, and then becomes unresponsive within about 10 seconds and requires the process to be killed.

BOINC Manager Version : 6.4.5
wxWidgets Ver : 2.8.7
Rosetta application : Rosetta Mini 1.47
Microsoft Windows Vista Business x86 Editon, (06.00.6000.00)
Don't know if you need this but..
PC : GenuineIntel Intel(R) Celeron(R) CPU 2.80GHz [x86 Family 15 Model 4 Stepping 9], 1 GB RAM, NVIDIA GeForce 8500 GT

New account/install, 44 mins old according to its first work unit.. Vista is a fresh build, <24hrs old...

The workunits are running/progressing along, I would just like to see what I'm crunching :)

stewjack

Joined: Apr 23 06
Posts: 39
ID: 78784
Credit: 95,871
RAC: 0
Message 57946 - Posted 16 Dec 2008 23:28:48 UTC - in response to Message ID 57944.
Last modified: 16 Dec 2008 23:31:00 UTC


I'm having the same (graphics) problem, but in XP 32-bit, in one of the hosts after the installation of mini 1.47


Same problem here. Also with XP 32bit. I guess that makes it three (edit now 4 ) of us.

The WU seems to be unaffected. The WU has only been processing for about 15 minutes, but it is check-pointing regularly.
____________

chris

Joined: Oct 18 06
Posts: 6
ID: 121763
Credit: 4,410,116
RAC: 5,526
Message 57947 - Posted 16 Dec 2008 23:54:34 UTC - in response to Message ID 57946.

Same here. No graphics for WU 196031795.

____________

Mr. Ed

Joined: Dec 16 08
Posts: 8
ID: 292934
Credit: 28,443
RAC: 0
Message 57948 - Posted 17 Dec 2008 1:00:40 UTC

How odd... It just started working, I did nothing/made no changes.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 57951 - Posted 17 Dec 2008 3:10:52 UTC
Last modified: 17 Dec 2008 3:51:15 UTC

Hi.

Yes, odd indeed. It seems to be this type of task that has the problem; for me anyhow, the one that was affected has finished.

This one // cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_hr1958_olange_5606_12051_0


EDIT // Now i have a t071_ task running and the graphics are fine, go figure.

pete.
____________


Mr. Ed

Joined: Dec 16 08
Posts: 8
ID: 292934
Credit: 28,443
RAC: 0
Message 57952 - Posted 17 Dec 2008 3:30:51 UTC

If it's any help, mine was -

cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_ccr19_olange_5604_12614

DSL

Joined: Dec 6 08
Posts: 1
ID: 291474
Credit: 2,766
RAC: 0
Message 57953 - Posted 17 Dec 2008 4:32:33 UTC

I have a WinXP 32-bit machine with Norton Antivirus 2009 installed.
minirosetta v1.47 is known to have fixed many bugs, but there is still a major fault in this version: it is detected by my antivirus as a high security risk threat and is automatically removed. So you download the new version, and after some time you find it has been evaporated by your antivirus. I don't know whether it really contains a virus or not, but the fact is that there is something in the thousands of lines of minirosetta code that the antivirus does not like. I hope this issue will be resolved soon. My message to the developers of minirosetta is to fix it as early as possible, because most new users will not run it again on their machines after it has been flagged by the antivirus as a threat.

So it is bad to hear that the new version still contains a major bug. :-(
____________

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 57958 - Posted 17 Dec 2008 8:33:53 UTC - in response to Message ID 57953.

I have a WinXP 32-bit machine with Norton Antivirus 2009 installed.
minirosetta v1.47 is known to have fixed many bugs, but there is still a major fault in this version: it is detected by my antivirus as a high security risk threat and is automatically removed. So you download the new version, and after some time you find it has been evaporated by your antivirus. I don't know whether it really contains a virus or not, but the fact is that there is something in the thousands of lines of minirosetta code that the antivirus does not like. I hope this issue will be resolved soon. My message to the developers of minirosetta is to fix it as early as possible, because most new users will not run it again on their machines after it has been flagged by the antivirus as a threat.

So it is bad to hear that the new version still contains a major bug. :-(



why not set your antivirus to manual, and then when it grabs minirosetta you can tell it to ignore that kind of file. we all know minirosetta is a safe application. it's just NAV and other antivirus software that think it has an infection. I bet if you ran housecall from trendmicro you would find no problems. I run AVG free and none of the tasks have ever triggered that program and my system is virus free.

Chilean

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,201,906
RAC: 4,658
Message 57965 - Posted 17 Dec 2008 13:42:54 UTC - in response to Message ID 57928.

Minirosetta apparently "looks like" malware, whether it actually is or not. This applies to all versions I've run, thru v1.47.

I run BOINC on two WinVista (God help me) boxes: one a 32 bit Sony with ZoneAlarm Pro|ESET NOD32 for security; the other a 64 bit Sony with Kaspersky Internet Security 2009.

On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released. Interestingly, ZoneAlarm Pro's application module hasn't had a problem with it.

On the 64 bit machine, Kaspersky's Application Control module gives Minirosetta's executable a Threat Rating of "Potentially Dangerous" with a heuristic Danger Index score of 82. I have to manually override Kaspersky and move Minirosetta out of the "Untrusted Application" zone, to allow it to execute. (By comparison, Rosetta Beta 5.98 has a DI of 12, as does SETI's recently released Astropulse 5.0. SETI's regular Enhanced v6.03 has a DI of zero.)

I realize that heuristic analysis is as much art as science, but both ESET and Kaspersky are rated at or near the top of their field. Of 10 project hosts I subscribe to, with over 25 project executables, Minirosetta is the ONLY one that has ever sent up a red flag to my security suite(s). Since most folks leave their security suite (if any) on autopilot, there are potentially many testers who never get to run Minirosetta because the .exe goes immediately into a black hole. Somewhere in those 200,000 lines of code, something apparently looks funky.


That's weird, because I have NOD32 on one of my PC's and it doesn't have a problem with rosetta. I changed to Avast Pro... and still no problems :S
____________

A Few Good Men

Joined: Mar 25 07
Posts: 14
ID: 157915
Credit: 2,031,382
RAC: 55
Message 57968 - Posted 17 Dec 2008 14:33:05 UTC
Last modified: 17 Dec 2008 14:59:50 UTC

3 machines with XP Pro sp3, 1 machine sp2, 1 machine Server 2003, mini 1.47: the last 24 hours have been all Exit Status -177 (0xffffff4f) Maximum Memory exceeded.

A Sample from 2 machines:

Task ID 215092853 workunit 196054490

Task ID 215087694 work unit 196045728

they both have same computer ID 964014

1 of the machines is running a Beta 5.98 task concurrently; I'm going to hold out on detaching to see what it produces for a result.
88.12 RAC on 4 x XPPRO + 1 x server2003 x 24 hours

1 quadcore 2.66ghz
1 HT 2.8ghz
1 1.6ghz
1.5ghz
1 Mobile at 1.9ghz

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 57972 - Posted 17 Dec 2008 16:12:17 UTC - in response to Message ID 57968.

3 machines with XP Pro sp3, 1 machine sp2, 1 machine Server 2003, mini 1.47: the last 24 hours have been all Exit Status -177 (0xffffff4f) Maximum Memory exceeded.

A Sample from 2 machines:

Task ID 215092853 workunit 196054490

Task ID 215087694 work unit 196045728

they both have same computer ID 964014

1 of the machines is running a Beta 5.98 task concurrently, Im going to holdout on detaching to see what it produces for a result.
88.12 RAC on 4 x XPPRO + 1 x server2003 x 24 hours

1 quadcore 2.66ghz
1 HT 2.8ghz
1 1.6ghz
1.5`ghz
1 Mobile at 1.9ghz



can you point to which specific machine(s) this is happening on.
you have so many there is no quick way to know which machine the tasks you listed belong to.

A Few Good Men

Joined: Mar 25 07
Posts: 14
ID: 157915
Credit: 2,031,382
RAC: 55
Message 57974 - Posted 17 Dec 2008 17:11:29 UTC - in response to Message ID 57972.

3 machines with XP Pro sp3, 1 machine sp2, 1 machine Server 2003, mini 1.47: the last 24 hours have been all Exit Status -177 (0xffffff4f) Maximum Memory exceeded.

A Sample from 2 machines:

Task ID 215092853 workunit 196054490

Task ID 215087694 work unit 196045728

they both have same computer ID 964014

1 of the machines is running a Beta 5.98 task concurrently, Im going to holdout on detaching to see what it produces for a result.
88.12 RAC on 4 x XPPRO + 1 x server2003 x 24 hours

1 quadcore 2.66ghz
1 HT 2.8ghz
1 1.6ghz
1.5`ghz
1 Mobile at 1.9ghz



can you point to which specific machine(s) this is happening on.
you have so many there is no quick way to know which machine the tasks you listed belong to.


I merged machines to assist.

The 2 that I have posted tasks from are
964014
965938

The other machines with similar errors are
961824
954192
954486

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 57977 - Posted 17 Dec 2008 18:52:20 UTC - in response to Message ID 57974.

3 machines with XP Pro sp3, 1 machine sp2, 1 machine Server 2003, mini 1.47: the last 24 hours have been all Exit Status -177 (0xffffff4f) Maximum Memory exceeded.

A Sample from 2 machines:

Task ID 215092853 workunit 196054490

Task ID 215087694 work unit 196045728

they both have same computer ID 964014

1 of the machines is running a Beta 5.98 task concurrently, Im going to holdout on detaching to see what it produces for a result.
88.12 RAC on 4 x XPPRO + 1 x server2003 x 24 hours

1 quadcore 2.66ghz
1 HT 2.8ghz
1 1.6ghz
1.5`ghz
1 Mobile at 1.9ghz



can you point to which specific machine(s) this is happening on.
you have so many there is no quick way to know which machine the tasks you listed belong to.


I merged machines to assist.

The 2 that I have posted tasks from are
964014
965938

The other machines with similar errors are
961824
954192
954486



computer 964014 has less than their new recommendation of 512MB of memory. this must be one of the tasks they were talking about.

December 10, 2008
We are now recommending systems with at least 512MB of memory. The majority of tasks will run fine with 256MB but some tasks will involve larger proteins that will use more memory.

computer 965938 is having a lockfile issue; there has been a lot of discussion in the 1.45 thread about this. you have to delete the empty slot folders in the boinc slot folder located in the projects folder. do a search in the forums about lockfiles. it is discussed heavily in the 1.45 thread.

only one other computer had an issue, but that is due to defective task.
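
For the lockfile workaround described above, it may help to see which slot folders are actually empty before touching anything. The following is a minimal Python sketch under stated assumptions: the location of the slots folder varies by install and is passed in by the user, the function name is made up for illustration, and the script only lists candidates rather than deleting them.

```python
import sys
from pathlib import Path


def empty_slot_dirs(slots_root):
    """Return the subdirectories of slots_root that contain no files or folders."""
    root = Path(slots_root)
    return sorted(
        entry for entry in root.iterdir()
        if entry.is_dir() and not any(entry.iterdir())
    )


if __name__ == "__main__":
    # Usage (path is an assumption, adjust to your own BOINC install):
    #   python list_empty_slots.py "C:\path\to\BOINC\slots"
    for slot in empty_slot_dirs(sys.argv[1]):
        print(slot)
```

Listing first and removing by hand, with BOINC stopped, seems the safer order; whether removing the folders cures the lockfile error on a given machine is not something this sketch can confirm.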

JChojnacki

Joined: Sep 17 05
Posts: 71
ID: 105
Credit: 6,731,215
RAC: 1,607
Message 57985 - Posted 17 Dec 2008 23:32:51 UTC - in response to Message ID 57902.

HoHo kids!


And Happy Holidays to everyone at the Baker Lab, as well as my fellow crunchers.

Oh, and this WU failed: 214946535

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

~Joel



____________



Stephen

Joined: Apr 26 08
Posts: 32
ID: 255217
Credit: 429,286
RAC: 0
Message 57988 - Posted 18 Dec 2008 0:55:00 UTC

Like some other people mentioned, the WU titled

"cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_flua_olange_5605_43_0"

appears to be running fine, but when I click "show graphics" the window becomes unresponsive and requires the app to restart. the other work units are working without any problem.

Stephen

Joined: Apr 26 08
Posts: 32
ID: 255217
Credit: 429,286
RAC: 0
Message 57989 - Posted 18 Dec 2008 0:58:56 UTC

I'm running vista64. I am running BOINC 64-bit edition. boincmgr.exe and boinctray.exe are running in 64-bit mode. however, minirosetta_1.47_windows_x86_64.exe is currently running in 32-bit mode. it says *32 next to the name, which I believe indicates that it is running in 32-bit mode.

Mr. Ed

Joined: Dec 16 08
Posts: 8
ID: 292934
Credit: 28,443
RAC: 0
Message 57994 - Posted 18 Dec 2008 4:33:25 UTC - in response to Message ID 57945.

Not sure if this is related or not...

Crux of my problem is this, I have no graphic display, the screen saver is blank and when I hit the 'show graphics' button in the advanced view, it opens a window (title - minirosetta version 1.47 [workunit: cs_noe_ .... etc]) that is blank, and then becomes unresponsive within about 10 seconds and requires the process to be killed.

BOINC Manager Version : 6.4.5
wxWidgets Ver : 2.8.7
Rosetta application : Rosetta Mini 1.47
Microsoft Windows Vista Business x86 Editon, (06.00.6000.00)
Don't know if you need this but..
PC : GenuineIntel Intel(R) Celeron(R) CPU 2.80GHz [x86 Family 15 Model 4 Stepping 9], 1 GB RAM, NVIDIA GeForce 8500 GT

New account/install, 44 mins old according to its first work unit.. Vista is a fresh build, <24hrs old...

The workunits are running/progressing along, I would just like to see what I'm crunching :)


Getting this again...

cc_nonideal_1_0_nocst4_hb_t286__IGNORE_THE_REST_1ESCA_7_5665_10

WU ID 196233123
PC ID 966609

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,571,652
RAC: 1,878
Message 57995 - Posted 18 Dec 2008 6:17:59 UTC - in response to Message ID 57989.

I'm running vista64. I am running BOINC 64-bit edition. boincmgr.exe and boinctray.exe are running in 64-bit mode. however, minirosetta_1.47_windows_x86_64.exe is currently running in 32-bit mode. it says *32 next to the name, which I believe indicates that it is running in 32-bit mode.


As far as I know there is no real 64 bit version for rosetta. It is the 32 bit version in a 64 bit wrapper.
____________
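
One way to check the 32-bit versus 64-bit question without relying on Task Manager is to read the machine field from the executable's PE header. The sketch below is a minimal Python example of that general technique: the file starts with "MZ", the 4-byte value at offset 0x3C points to the "PE" signature, and the two bytes after the signature give the target machine. The function names are illustrative, and nothing here is specific to minirosetta.

```python
import struct

# Machine values from the PE/COFF header; only the two relevant here are named.
MACHINE_NAMES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x64 (64-bit)",
}


def pe_machine(data):
    """Return a readable machine type for the bytes of a Windows executable."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # Offset 0x3C holds the little-endian file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The COFF header follows the signature; its first field is Machine.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, hex(machine))


def pe_machine_of_file(path):
    """Read an executable from disk and report its machine type."""
    with open(path, "rb") as handle:
        return pe_machine(handle.read())
```

Calling `pe_machine_of_file` on the downloaded executable in the project folder would show which kind of binary it is, whatever the file name says.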

Zilli Samuel

Joined: Mar 2 06
Posts: 3
ID: 62673
Credit: 30,077
RAC: 0
Message 58006 - Posted 18 Dec 2008 16:13:30 UTC

I have the problem with Norton Antivirus 2009 too: it deletes the minirosetta exe file because it's a "high security risk threat".
I entered the BOINC path in Norton's exclusion paths to solve it, but it would be better if the Rosetta staff talked to the Norton staff to avoid this problem...
____________

jay

Joined: Jan 12 08
Posts: 11
ID: 234922
Credit: 91,634
RAC: 0
Message 58010 - Posted 18 Dec 2008 19:25:23 UTC

Question on memory size..

Greetings!
First of all, thanks to all of the developers for debugging the code.

I have a question about the memory size and page fault rate for minirosetta 1.47.

I was looking at the Windows (XP) Task Manager, at the memory size and page fault rate.

I admit that I do not know what it all means, and would like to ask the forum for an explanation that would help me.

Environment: Here is what BOINC says:
Processor: 2 GenuineIntel Intel(R) Core(TM) Duo CPU T2300 @ 1.66GHz [x86 Family 6 Model 14 Stepping 12]
Processor features: fpu tsc pae nx sse sse2 mmx
OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 3, (05.01.2600.00)
Memory: 2.00 GB physical, 4.87 GB virtual
Disk: 107.41 GB total, 78.57 GB free

Here is what the Task Manager is showing for minirosetta 1.47:
Mem usage: 184,944K ( Varying between 170,000K and 247,000K while I watched.)
PF delta: 3,228 ( in a three second period)
VM size: 199,344K ( and moving up to 243,000 K)

I was running 2 BOINC projects at once: Rosetta and WCG Clean Energy.
If I suspend all others so that only Rosetta is running, the page faults are more sporadic: mostly zero, then up to 6,375 in the three-second period.

With BOINC only running the Rosetta task, the Task Manager says:

Commit charge (K)
total: 788748
limit: 5107808
peak: 1319708

Physical Memory (K)
total: 2,095,532
available: 1,127,112
System cache: 838,252



Bottom line: I assume that this page fault rate is not good.
Do you know of anything I can tweak to help?

THANK YOU!!
Jay E.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58011 - Posted 18 Dec 2008 21:57:24 UTC - in response to Message ID 58010.


Can you afford to add more physical memory to that machine? That should at least decrease the page fault rate, although I don't know if it's the cheapest way to do this.

Here's a good place to find out what memory fits that machine, and how much it can hold:

http://www.crucial.com/

However, note that your version of Windows has a limit on how much of the installed memory it can actually use, probably about 3.5 GB.

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 58012 - Posted 18 Dec 2008 22:00:40 UTC

This WU had a validate error:

normal_relax_rlbd_1ynv_IGNORE_THE_REST_DECOY_5565_171_0

It looks from the stderr file like it crunched normally for 16 hours (my current preference) with no error. However, it was then marked "Invalid" with no explanation. The only other thing I see is that it crunched an unusually high number of decoys (8777 decoys). Does that cause problems with the validator?

Feet1st Profile

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,318,714
RAC: 3,121
Message 58013 - Posted 18 Dec 2008 22:28:12 UTC
Last modified: 18 Dec 2008 22:37:08 UTC

Jay, RE: page faults...

If you change the view you can add a column to display the number of faults since the task started. I have long runtimes, but currently have two tasks from Ralph that topped 100,000,000 page faults. One in 15hrs and the other in 19hrs. This is the highest fault rate I've ever seen. Indeed, I recall the days when I thought that 1M per hour of runtime was excessive.

The only solace I can offer is that not all faults are hard faults to disk. Some recorded faults are "soft". Perhaps someone else can further elaborate on the concepts.
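
To put the figures quoted in this thread on a common scale, here is a rough sketch that converts them to page faults per second. It is plain arithmetic, nothing specific to Rosetta or BOINC, and the function name is made up for illustration. Note that these counters do not separate soft faults from hard ones.

```python
def faults_per_second(total_faults, hours):
    """Average page-fault rate over a run of the given length."""
    return total_faults / (hours * 3600)

# Figures quoted in the thread
ralph_15h = faults_per_second(100_000_000, 15)  # Ralph task, 15 hrs
ralph_19h = faults_per_second(100_000_000, 19)  # Ralph task, 19 hrs
old_limit = faults_per_second(1_000_000, 1)     # "1M per hour of runtime"
jay_delta = 3228 / 3                            # PF delta over a three second period

print(round(ralph_15h), round(ralph_19h), round(old_limit), round(jay_delta))
# prints: 1852 1462 278 1076
```

On these numbers, the rate Jay observed sits below the average of the two Ralph tasks, and well above the old "1M per hour" rule of thumb.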
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Stephen Profile

Joined: Apr 26 08
Posts: 32
ID: 255217
Credit: 429,286
RAC: 0
Message 58024 - Posted 19 Dec 2008 4:07:51 UTC
Last modified: 19 Dec 2008 4:35:03 UTC

A WU will get to around 85% complete, then progress stays the same and time to completion stays around 10 minutes. If I suspend all tasks and resume, the "stuck" WUs will complete.

Edited: doing this also rolls back the "CPU time spent" to around 30 minutes.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58027 - Posted 19 Dec 2008 5:47:30 UTC
Last modified: 19 Dec 2008 5:49:36 UTC

Stephen, this may be part of why you are having problems keeping all 8 CPUs busy. Suggest you just let BOINC manage the machine for the next 12 hours or so. Don't abort, suspend, update, anything at all.

Some tasks will take longer than 3 hours to run, and their % complete progress bar will not move steadily. Rather than tell you the task has -30 minutes left, they reflect the situation by making time move very slowly after the task gets to 10 minutes remaining.

It's simply a problem with the estimate, not the work being done.
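
A minimal sketch of the behaviour described above, purely as an illustration. The formula and the names (`displayed_remaining`, `floor_s`) are assumptions made for this example, not the actual minirosetta or BOINC code. The only detail taken from this thread is that the countdown slows down once about 10 minutes remain, instead of going negative.

```python
def displayed_remaining(elapsed_s, target_s, floor_s=600):
    """Hypothetical 'time to completion' that never reaches zero or goes negative.

    Before the last `floor_s` seconds it is the plain difference; after that
    it shrinks more and more slowly the longer the task overruns.
    """
    naive = target_s - elapsed_s
    if naive > floor_s:
        return naive
    overrun = floor_s - naive  # seconds since the display hit the 10-minute mark
    return floor_s * floor_s / (floor_s + overrun)

three_hours = 3 * 3600
print(displayed_remaining(3600, three_hours))                # 7200
print(displayed_remaining(three_hours, three_hours))         # 300.0
print(displayed_remaining(three_hours + 1800, three_hours))  # 120.0
```

A task that is 30 minutes past its target would show 2 minutes left here rather than -30 minutes, which matches the "stuck near 10 minutes" symptom without implying the work itself has stalled.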
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58029 - Posted 19 Dec 2008 8:52:37 UTC

How do you "lose credit" on a task?
On this task I claimed 83 and got 68 for 4 hrs runtime. That is just weird when most of the other work I have been running always comes out on the plus side for granted credit.

rochester new york Profile

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 957,089
RAC: 119
Message 58030 - Posted 19 Dec 2008 9:54:17 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=213832280

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58031 - Posted 19 Dec 2008 12:26:59 UTC - in response to Message ID 58030.

http://boinc.bakerlab.org/rosetta/result.php?resultid=213832280


You didn't have to reboot your computer a few times during the task's run, did you?
That will kill a task.

rochester new york Profile

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 957,089
RAC: 119
Message 58032 - Posted 19 Dec 2008 14:14:26 UTC - in response to Message ID 58031.
Last modified: 19 Dec 2008 14:15:09 UTC

Yes I did... thanks for that info. A Microsoft upgrade required a reboot.




Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58033 - Posted 19 Dec 2008 14:44:26 UTC - in response to Message ID 58032.


Here's a tip: before rebooting (you never know how many times Windows will want you to do that when you install an update), go to the activity tab of BOINC Manager and put all activity in suspend. Wait for your hard drive to stop grinding away with all the saving, and then you can reboot. Also be sure to have the "leave jobs/tasks in memory" option turned on. Then you will not lose your position in the task. Suspend seems to save everything to the hard drive, so you can reboot all you want and not lose any data for the task.


rochester new york Profile

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 957,089
RAC: 119
Message 58034 - Posted 19 Dec 2008 14:50:09 UTC - in response to Message ID 58033.




Thanks again... I'll do that next time.








Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58035 - Posted 19 Dec 2008 15:18:35 UTC
Last modified: 19 Dec 2008 17:09:18 UTC

I do not agree with Greg's comments about preservation of work and the reasons why, but would prefer to take them up in another thread if you'd like to discuss further.

[edit]
We're discussing this under a new thread here.
____________
Rosetta Moderator: Mod.Sense

rochester new york Profile

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 957,089
RAC: 119
Message 58037 - Posted 19 Dec 2008 15:30:07 UTC - in response to Message ID 58035.


OK, I just want to know what to do.



I do not agree with greg's comments about preservation of work and reasons why, but would prefer to take them up in another thread if you'd like to discuss further.

kr12

Joined: Dec 6 07
Posts: 2
ID: 224503
Credit: 85,902
RAC: 0
Message 58044 - Posted 19 Dec 2008 20:25:15 UTC

"graphic viewer" hangs with this task
cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_mth1598_olange_5607_11086_0
(http://boinc.bakerlab.org/rosetta/result.php?resultid=215720373)

stewjack

Joined: Apr 23 06
Posts: 39
ID: 78784
Credit: 95,871
RAC: 0
Message 58050 - Posted 20 Dec 2008 4:43:14 UTC - in response to Message ID 58044.

"graphic viewer" hangs with this task
cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_mth1598_olange_5607_11086_0


I had the same thing happen with this similar WU.

cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_nsp1_olange_5608_14752_0

Note: I didn't have time to mess with this one - so I just aborted it.

____________

rhb

Joined: Jan 19 07
Posts: 5
ID: 142744
Credit: 277,050
RAC: 0
Message 58052 - Posted 20 Dec 2008 7:14:45 UTC

I had a computation error. Running Ubuntu Linux 6.06, Boinc 5.4.9.
This is the first error I've seen in the last two weeks.

http://boinc.bakerlab.org/rosetta/result.php?resultid=215760302

Task ID 215760302
Name cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_nsp1_olange_5608_24330_0
Workunit 196639962

<core_client_version>5.4.9</core_client_version>
<message>
process exited with code 193 (0xc1)
</message>
<stderr_txt>
*** glibc detected *** double free or corruption (!prev): 0x0bd2d980 ***
SIGABRT: abort called


____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 58068 - Posted 20 Dec 2008 20:53:45 UTC

Hi.

This one has problems, it's failed twice.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=194507659

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b979b7]
[0x8bc20b0]
[0xffffe500]
[0x84c0863]
[0x85ddf0a]
[0x85df32e]
[0x85e65b8]
[0x819a650]
[0x818d3b7]
[0x818ee89]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

Exiting...

</stderr_txt>

pete.
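
For anyone puzzled by the three numbers in "process exited with code 193 (0xc1, -63)": they appear to be the same byte written three ways (decimal, hex, and signed 8-bit). A quick sketch in plain Python arithmetic shows the conversion; `as_signed` is just a helper name made up here.

```python
def as_signed(value, bits):
    """Reinterpret an unsigned integer as a two's-complement number of `bits` width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value

code = 193
print(code, hex(code), as_signed(code, 8))
# prints: 193 0xc1 -63
```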

____________


svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,057,909
RAC: 5,477
Message 58084 - Posted 21 Dec 2008 3:19:52 UTC

I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems.

The first time Show graphics is pressed the graphics app starts and displays a blank window. Moving the mouse causes the graphics app to crash.

The second and subsequent times Show graphics is pressed the graphics app starts and displays a blank window along with the spinning rainbow beach ball. The graphics app is frozen and you can't even force quit in the normal way: it's necessary to quit via the Activity Monitor.
____________

lusvladimir

Joined: Oct 18 05
Posts: 12
ID: 5401
Credit: 1,784,854
RAC: 0
Message 58087 - Posted 21 Dec 2008 9:38:41 UTC
Last modified: 21 Dec 2008 9:41:39 UTC

Running Debian Linux, BOINC 6.2.14.

http://boinc.bakerlab.org/result.php?resultid=215464278

Task ID 215464278
Name cc_nonideal_1_3_nocst4_hb_t286__IGNORE_THE_REST_1VYHA_6_5693_20_0
Workunit 196380006

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 3600
*** glibc detected *** double free or corruption (!prev): 0x0e13a4f0 ***
SIGABRT: abort called
Stack trace (23 frames):
____________

NewtonianRefractor

Joined: Sep 29 08
Posts: 19
ID: 281324
Credit: 2,350,860
RAC: 0
Message 58088 - Posted 21 Dec 2008 10:17:10 UTC

The graphics for one of my Minirosetta 1.47 work units crash. If I click on the show graphics button in BOINC, a window is launched, but it remains black, and to close it I have to manually end the unresponsive process. The work unit runs fine though. It's under BOINC 6.2.19.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58089 - Posted 21 Dec 2008 13:34:11 UTC

Two more that wasted my CPU time by crashing halfway:

http://boinc.bakerlab.org/rosetta/result.php?resultid=215547790
t071_1_RDC_NMR_NESG_5480_118996_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 941.5781
--------------

http://boinc.bakerlab.org/rosetta/result.php?resultid=215490731
t072_1_RDC_NMR_NESG_5481_92626_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 12309.66
-----------------------------
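
A side note on reading these: the negative "Exit status" and the hex value in brackets are the same 32-bit number, shown signed and unsigned. 0xC0000005 is the Windows access violation status, which matches the "Access Violation (0xc0000005)" text in other stderr dumps in this thread. A small sketch of the conversion (the helper name is made up for illustration):

```python
def to_unsigned32(exit_status):
    """Show a signed 32-bit exit status as the unsigned value Windows documents."""
    return exit_status & 0xFFFFFFFF

status = -1073741819
print(hex(to_unsigned32(status)))
# prints: 0xc0000005
```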

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58090 - Posted 21 Dec 2008 13:35:46 UTC - in response to Message ID 58084.

I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems.

The first time Show graphics is pressed the graphics app starts and displays a blank window. Moving the mouse causes the graphics app to crash.

The second and subsequent times Show graphics is pressed the graphics app starts and displays a blank window along with the spinning rainbow beach ball. The graphics app is frozen and you can't even force quit in the normal way: it's necessary to quit via the Activity Monitor.


I'm seeing somewhat similar problems under Windows Vista SP1.

12/21/2008 7:18:31 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_ccr19_olange_5604_39348_0 using minirosetta version 147

Moving the mouse had no particular effect, but the graphics window stayed blank and shutting it down gave some error messages before it finally worked. I normally let minirosetta run without graphics.

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58093 - Posted 21 Dec 2008 15:09:29 UTC

Hi all! I'm back connected to the internet. Sadly, only to find more errors -
we'll be back to debugging after the holidays.

Quick comments on the major issues reported above:

- The graphics problems with cs_noe_* jobs. This is very strange. We have NOT updated the graphics app, so these jobs must be doing something funny that the graphics app doesn't like. I'll ask the person submitting these to try to run the graphics app locally to see if we can reproduce this error.

- The normal_relax_rlb[dn]_* jobs validator error. I thought I had fixed this, so this must be something else. Yes, the validator will reject the WU if it has produced more than some number of decoys (around 128 or so per hour). This points to some other problem: evidently it's racing through decoys and not doing anything with them, thereby producing thousands of results. How that can happen on a sporadic basis (< 1/1000 WUs, it seems) is puzzling me. I'll have to look into that one.

- Virus scanners: Aehm - not really a bug. We have no control over what virus scanners seem to "recognise" in it as malware or a virus. They won't tell us either; they have been wholly unhelpful in this matter. The only solution I see right now is to set exceptions in your virus scanner to ignore apps coming from ralph.bakerlab.org and boinc.bakerlab.org.
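
As a rough illustration of the validator rule mentioned in the list above: the sketch below assumes a flat limit of 128 decoys per hour (the post only says "around 128 or so") and uses made-up names. It is not the project's validator code. The 8777 decoys in 16 hours are the figures from the invalid normal_relax_rlbd result reported earlier in this thread.

```python
MAX_DECOYS_PER_HOUR = 128  # assumed value, quoted only approximately in the thread

def passes_decoy_rate_check(decoys, cpu_hours, limit=MAX_DECOYS_PER_HOUR):
    """Return True if the result's decoy rate is at or below the limit."""
    return decoys / cpu_hours <= limit

print(8777 / 16)                          # 548.5625 decoys per hour
print(passes_decoy_rate_check(8777, 16))  # False
print(passes_decoy_rate_check(38, 16))    # True
```

The last line uses roughly what 16 hours would yield at 25 minutes per decoy, so a normally behaving task of that kind stays far below the limit.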




Has anyone seen any new lockfile problems? Or are these finally a thing of the past?


Mike






____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,057,909
RAC: 5,477
Message 58094 - Posted 21 Dec 2008 15:56:51 UTC

Task 215936807; Workunit 194706499; Name 1dsvA_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1dsvA-_5479_5614_1; crashed on Mac OS X 10.4.11 after 4 secs (thankfully)

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
SIGSEGV: segmentation violation

Crashed executable name: minirosetta_1.47_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Sat Dec 20 23:23:58 2008

Thread 0 Crashed:
0 ...etta_1.47_i686-apple-darwin 0x0022f77f __ZNK4core10kinematics8AtomTree20torsion_angle_dof_idERKNS_2id6AtomIDES5_S5_S5_Rd + 139
1 ...etta_1.47_i686-apple-darwin 0x0023415a __ZNK4core10kinematics8AtomTree13torsion_angleERKNS_2id6AtomIDES5_S5_S5_ + 284
2 ...etta_1.47_i686-apple-darwin 0x00022b1c __ZN4core12conformation12Conformation15setup_atom_treeEv + 1384
3 ...etta_1.47_i686-apple-darwin 0x00025055 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 4167
4 ...etta_1.47_i686-apple-darwin 0x00984800 __ZNK9protocols8abinitio16KinematicControl25prepare_pose_for_samplingERN4core4pose4PoseE + 32
5 ...etta_1.47_i686-apple-darwin 0x0060a1d3 __ZN9protocols8abinitio17KinematicAbinitio5applyERN4core4pose4PoseE + 5277
6 ...etta_1.47_i686-apple-darwin 0x0060d8fd __ZN9protocols8abinitio29JumpingFoldConstraintsWrapper5applyERN4core4pose4PoseE + 3927
7 ...etta_1.47_i686-apple-darwin 0x001b13ee __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 4468
8 ...etta_1.47_i686-apple-darwin 0x001b7fad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 1137
9 ...etta_1.47_i686-apple-darwin 0x00008cc8 _main + 4078
10 ...etta_1.47_i686-apple-darwin 0x00001bce __start + 216
11 ...etta_1.47_i686-apple-darwin 0x00001af5 start + 41

etc.
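
The symbol names in that trace are mangled C++ names. A tool such as c++filt will decode them fully; the sketch below is only a simplified stand-in, assuming the plain nested-name form (`_ZN`, optional qualifiers, then length-prefixed parts) and ignoring argument types, templates and substitutions. The function name is made up for illustration.

```python
import re

def qualified_name(symbol):
    """Extract 'a::b::c' from a simple nested mangled symbol; return others unchanged."""
    s = symbol.lstrip("_")      # the Mac trace shows an extra leading underscore
    if not s.startswith("ZN"):
        return symbol           # e.g. _main or start: not a nested C++ name
    s = s[2:].lstrip("rVK")     # skip cv-qualifiers such as K (const member function)
    parts = []
    while s and s[0].isdigit():
        m = re.match(r"\d+", s)
        length = int(m.group())
        begin = m.end()
        parts.append(s[begin:begin + length])
        s = s[begin + length:]
    return "::".join(parts)

print(qualified_name(
    "__ZNK4core10kinematics8AtomTree20torsion_angle_dof_idERKNS_2id6AtomIDES5_S5_S5_Rd"))
# prints: core::kinematics::AtomTree::torsion_angle_dof_id
print(qualified_name("__ZN9protocols8abinitio18AbrelaxApplication3runEv"))
# prints: protocols::abinitio::AbrelaxApplication::run
```

Read this way, the top frame of the crash is in AtomTree::torsion_angle_dof_id, reached while the conformation was setting up its atom tree.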


____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58095 - Posted 21 Dec 2008 16:41:29 UTC - in response to Message ID 57926.

My 1.47 cc2_1_8_mammoth-tasks have all crashed on Ralph, now my 1.47 cc2_1_8_native-tasks are crashing on Rosetta.

Example (1 of 2):
cc2_1_8_native_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_4_5599_36_0

<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(95094,0xa0538fa0) malloc: *** error for object 0x1747df0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation



Aehm - I can't see your RALPH failure for this job. I had one result come back and it was a success.

http://ralph.bakerlab.org/rah_queue_ops/db_action.php?table=result&id=1228006

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58096 - Posted 21 Dec 2008 16:45:10 UTC - in response to Message ID 57929.

After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours.

Because of the 1-hour runtimes BOINC also downloaded additional tasks to fill the cache. Not good.


We run a number of very different jobs on R@home, covering a number of different problems in structure prediction and now also protein design. Thus, depending on the type of workunit, runtimes may vary hugely. The rlbd jobs do indeed run very quickly (requiring something like 25 minutes per decoy).

What was your very first job?

I think we will put a limit into the code that will abort jobs running over 6 hours in the next update. Watch this space..



____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58097 - Posted 21 Dec 2008 16:50:33 UTC - in response to Message ID 57939.



Just to expand on the point of this person....Thanks for taking the time to tell us what is going on. We like to know and the silence has been deafening lately.
Thanks again for breaking it. We hope for more news as time goes along.


I appreciate that, thanks. I'll try to keep you guys up to date; your feedback is pretty indispensable for our debugging.
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 58098 - Posted 21 Dec 2008 17:43:23 UTC

3 errors:

1. This one has failed twice: 4.3 sec
216056173

- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00476D2D read attempt to address 0x00000000



2. 216056174 6.5 sec

Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

3. This one also failed twice. .02 sec
216056175

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00476D2D read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...



____________

Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58099 - Posted 21 Dec 2008 18:43:31 UTC

CPU type GenuineIntel
Intel(R) Pentium(R) 4 CPU 2.60GHz [Family 15 Model 2 Stepping 9]
Number of CPUs 2
Operating System Linux
2.6.24-22-generic

process exited with code 193 (0xc1, -63)
Stack trace (22 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f03420]
[0x83c53bc]
[0x84356a0]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf524]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215801702

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (20 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7fa5420]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf1ff]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215414530

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (23 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f48420]
[0x8ace23a]
[0x84348d3]
[0x8ace5f6]
[0x8acd739]
[0x83b1c55]
[0x862a631]
[0x83f65af]
[0x80cece6]
[0x80de98f]
[0x82c37e4]
[0x82b897a]
[0x82c16c1]
[0x818d6ee]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215035006

What's going on with the Rosetta Linux app? Sometimes it works, sometimes it's duff? Machine NOT overclocked in the slightest.

Cheers

____________


Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58100 - Posted 21 Dec 2008 20:13:07 UTC - in response to Message ID 58089.
Last modified: 21 Dec 2008 20:16:28 UTC





edit - more of the same type of task errored out

http://boinc.bakerlab.org/rosetta/result.php?resultid=215554911
t071_1_RDC_NMR_NESG_5480_119941_0
state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 9361.141

http://boinc.bakerlab.org/rosetta/result.php?resultid=215583938
t072_1_RDC_NMR_NESG_5481_100236_0
state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 4056.126

I am aborting the remaining t071 and t072 tasks due to 4 errors in 5-6 hours.
Wasting my time with that junk.

Another note: these 2 tasks did not respond to a suspend command, in the sense that the time to completion continued to count even though the actual running time had stopped and the status showed as suspended.

Hope the t073 tasks are better.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58101 - Posted 21 Dec 2008 20:27:51 UTC

I think you guys should recheck the code (or whatever) of the t071 and t072 tasks, as I see someone before me had one of this series of tasks and ran into a compute error of the same nature as what I reported. I aborted that task since I am not interested in wasting my CPU time on a task bugged with compute errors.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58103 - Posted 22 Dec 2008 0:45:10 UTC - in response to Message ID 58090.



Another workunit with graphics problems:

12/21/2008 11:27:13 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_flua_olange_5605_35210_0 using minirosetta version 147

The previous one seemed to complete successfully despite the graphics problem.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58104 - Posted 22 Dec 2008 1:03:03 UTC - in response to Message ID 58096.

After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours.

Because of the 1-hour runtimes BOINC also downloaded additional tasks to fill the cache. Not good.


We run a number of very different jobs on R@home, covering a number of different problems in structure prediction and now also protein design. Thus, depending on the type of workunit, runtimes may vary hugely. The rlbd jobs do indeed run very quickly (requiring something like 25 minutes per decoy).

What was your very first job ??

I think we will put a limit into the code that will abort jobs running over 6 hours in the next update. Watch this space..




What effect will that have on users who have chosen default workunit times over 6 hours? Is this 6 hours per decoy or 6 hours for the whole workunit? If it only aborts one decoy, will the other decoys still continue, with credit for the decoys that completed successfully both before and after this aborted decoy?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58105 - Posted 22 Dec 2008 6:48:14 UTC - in response to Message ID 58104.

What effect will that have on users who have chosen default workunit times over 6 hours? Is this 6 hours per decoy or 6 hours for the whole workunit? If it only aborts one decoy, will the other decoys still continue, with credit for the decoys that completed successfully both before and after this aborted decoy?


Yes, he's talking about per model. If any models that run that long are cut off, it would help assure a more consistent runtime in line with each person's stated preference. Not perfect, but better than having some specific models haul off and run for 12 hours.

So, yes, if time remains for the task, another model may begin.

I won't comment on credit, because it's not my decision, and so far as I know no specific decision has been made yet. But the project has always maintained that even "failures" provide information valuable to advancing the project.

At present, the model would run for (sometimes) as much as 12 hours or more, and you'd get the same credit average as those that are running models with the more average runtime under 3 hrs. So, if nothing else, just cutting it off at 6 hours (or whatever length is deemed appropriate) prevents you from running for more than that for essentially zero credit. So this approach limits your credit loss, if nothing else.
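
To make the per-model cutoff concrete, here is a small simulation. It is only a sketch under assumptions: the 6-hour cap is the figure floated in this thread, the function and its names are invented for illustration, and the real application may decide when to start a new model differently.

```python
def run_task(model_hours, preferred_hours, cap_hours=6.0):
    """Simulate one task: each model runs at most `cap_hours`, and a new
    model starts only while the preferred runtime has not been used up."""
    elapsed = 0.0
    log = []  # (hours spent on the model, finished normally?)
    for needed in model_hours:
        if elapsed >= preferred_hours:
            break
        ran = min(needed, cap_hours)
        log.append((ran, needed <= cap_hours))
        elapsed += ran
    return elapsed, log

# A 2-hour model, then a runaway 12-hour model, with an 8-hour preference
print(run_task([2.0, 12.0, 1.5], preferred_hours=8.0))
# prints: (8.0, [(2.0, True), (6.0, False)])
```

Without the cap, the same task would have run 14 hours against an 8-hour preference; with it, the runaway model is cut off at 6 hours and the task ends on target.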
____________
Rosetta Moderator: Mod.Sense

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 58108 - Posted 22 Dec 2008 10:31:06 UTC - in response to Message ID 58095.

My 1.47 cc2_1_8_mammoth-tasks have all crashed on Ralph, now my 1.47 cc2_1_8_native-tasks are crashing on Rosetta.

...


#Aehm - i can't see your RALPH failure for this job. I had one result come back and it was a success..

http://ralph.bakerlab.org/rah_queue_ops/db_action.php?table=result&id=1228006


I believe I am not allowed access to rah_queue_ops ;-) so I cannot check your observation.

However, my Ralph mammoth-failures flourish, the ultimate example:

cc2_1_8_mammoth_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_7_6585_1_0

That said, I seem to have reconciled with Rosetta by rebooting the computer in question. Why this was suddenly necessary on a computer with no new program installations, no new configurations, no system upgrades, no separate computing on the side, and successfully computing 1.47 tasks 24 hours earlier, I am unable to explain. Even the subsequently installed BOINC 6.5 works like a charm. So I am loaded with tasks for a peaceful Christmas session and hope for the best until reporting time next weekend.


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58109 - Posted 22 Dec 2008 11:27:20 UTC

come on guys, you say this stuff is tested and ok and then it bombs on a windows machine.

can someone tell me if this is a program error or an error caused by too high an OC speed? since not all the tasks I get error out, it would seem more a case of a bad program than of the OC speed.

see below for a series of tasks that died part of the way through.


http://boinc.bakerlab.org/rosetta/result.php?resultid=215716365
cc2_1_8_native_cen_cst_hb_t311__IGNORE_THE_REST_2B5AA_7_5843_16_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)

CPU time 4999.172
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
------

http://boinc.bakerlab.org/rosetta/result.php?resultid=215736070
Name t074_1_RDC_NMR_NESG_5568_92427_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 9133.313
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

----------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215742498
Name 1wjbA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjbA-_5478_130_1
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 2.984375
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215811069
Name t073_1_RDC_NMR_NESG_5563_143956_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 12305.66
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

---------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215833987
Name t073_1_RDC_NMR_NESG_5563_146392_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8922.172
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

--------------------




xsc2

Joined: Jul 9 08
Posts: 4
ID: 267987
Credit: 62,354
RAC: 0
Message 58111 - Posted 22 Dec 2008 14:04:08 UTC

Exit status -1073741819 (0xc0000005)
http://boinc.bakerlab.org/rosetta/result.php?resultid=216178769

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58114 - Posted 22 Dec 2008 14:36:04 UTC
Last modified: 22 Dec 2008 14:39:12 UTC

This makes 10 tasks in a day's time that have died with the 0xc error. COME ON!
This ran to within 10 minutes of completion and died. Geez!
Then you insult me with no credit granted for a 99% completed task.


http://boinc.bakerlab.org/rosetta/result.php?resultid=216155882
1g47A_BOINC_MPZN_vanilla_abrelax_5901_6856_0
Workunit 196996323
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 13796
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

A Few Good Men

Joined: Mar 25 07
Posts: 14
ID: 157915
Credit: 2,031,382
RAC: 55
Message 58115 - Posted 22 Dec 2008 15:39:42 UTC

Well... The latest attempt to effectively utilize @home computers to further mankind in medical fields has reduced my last machine into a power-wasting room heater.
Just for the fun of it, go to a Rosetta server acquiring results from the last 2 versions and search "Outcome Client error"

I'll check back after a few months to see if things are any better here.


Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58118 - Posted 22 Dec 2008 18:39:27 UTC

wuid=196939593

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process got signal 8
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 26914 seconds. Greater than 3X preferred time: 7200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>
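For anyone reading the log above, the watchdog message boils down to a simple comparison. The sketch below only reproduces the arithmetic stated in the message (CPU time greater than 3X the preferred time); the function name is made up, and the real watchdog inside minirosetta may check other things as well.

```python
def watchdog_should_end(cpu_time, cpu_run_time_pref, factor=3):
    """Return True when CPU time exceeds factor times the preferred runtime.

    Hypothetical helper: the factor of 3 is taken from the log text
    "Greater than 3X preferred time", nothing more.
    """
    return cpu_time > factor * cpu_run_time_pref


# Values from the stderr output above.
limit = 3 * 7200             # 21600 seconds
overrun = 26914 - limit      # seconds past the limit when the run was ended
print(watchdog_should_end(26914, 7200), limit, overrun)
```

So with a 7200-second preference the limit is 21600 seconds, and this task was 5314 seconds past it when the watchdog ended the run.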

____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58119 - Posted 22 Dec 2008 18:58:32 UTC

your vanilla task died at 2hrs and 23 mins.
this makes about 12 failures now in 2 days.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216178144
1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8912.25
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58121 - Posted 23 Dec 2008 0:22:58 UTC

yet another one dies...what is going on? is it the program or my OC speed? this makes 12 in 2 days.

http://boinc.bakerlab.org/rosetta/result.php?resultid=216194755
Name t073_1_RDC_NMR_NESG_5563_176398_0
Workunit 197027384
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 25.375
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400

Chu

Joined: Feb 23 06
Posts: 120
ID: 61076
Credit: 112,439
RAC: 0
Message 58126 - Posted 23 Dec 2008 3:31:31 UTC - in response to Message ID 58119.

Hi greg_be, this WU is one of my jobs and I just double-checked this sub-batch. So far about 9000 clients have returned results successfully, with a normal error rate. The fact that you have recently got the same error code from many different Rosetta@home workunits makes me think it is more likely due to some incompatible setup on your computer, though I don't know exactly what is causing this. Did this problem happen to you before?

your vanilla task died at 2hrs and 23 mins.
this makes about 12 failures now in 2 days.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216178144
1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8912.25
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400



____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58130 - Posted 23 Dec 2008 9:09:13 UTC - in response to Message ID 58126.
Last modified: 23 Dec 2008 9:11:41 UTC

Chu,

Thanks for replying. I suspect it's my overclock speed. If you have that many clients returning good tasks, and I see the last one I posted went through to another client OK, I have to assume my speed is too high for these tasks on RAH.
I dropped the speed by 10 MHz to see if that corrects the problem. If it continues, I will drop it some more until things become steady. As of a week ago I could run at the faster speed with no problems, but this week the majority die.

Normally when my speed is too high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran to within 10 mins of completion and died.

Can you tell me how to see the difference between an error due to Windows or OC speed vs a program error that triggers a Windows dump with '-1073741819 (0xc0000005)'?

Thanks again for the reply.
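For reference, the two numbers in that exit status are the same value written two ways: the signed decimal is the 32-bit Windows status code 0xC0000005, which is STATUS_ACCESS_VIOLATION (the process touched memory it was not allowed to). The small sketch below shows the conversion; the helper names and the one-entry lookup table are just for illustration. Note that the code by itself does not say whether a program bug or unstable hardware caused the bad memory access.

```python
# Illustrative lookup with a single well-known NTSTATUS value.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned status value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code):
    """Format an exit code the way the task pages show it, plus a name if known."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return f"0x{status:08x} ({name})"


print(describe(-1073741819))
```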

Hi greg_be, this WU is one of my jobs and I just double checked this sub-batch, so far about 9000 clients have returned results successfully with normal error rate. The fact that you recently have got same error code from many different Rosetta@home workunits makes me think that it is more likely due to some certain incompatible setup on your computer, though I don't know what is exactly causing this. Did this problem happen to you before?

your vanilla task died at 2hrs and 23 mins.
this makes about 12 failures now in 2 days.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216178144
1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8912.25
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


HA-SOFT, s.r.o.

Joined: Jan 27 07
Posts: 10
ID: 144015
Credit: 65,377,643
RAC: 50,339
Message 58132 - Posted 23 Dec 2008 9:48:28 UTC - in response to Message ID 58130.

I have the same problem, but only on a 64-bit Win 2008 server, and there it affects all Minirosetta tasks. Minirosetta 1.45 had this problem too. All other PCs (32-bit, XP 64-bit) have no problem.

Zdenek


Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.

Can you tell me how to see the difference between a error due to windows or OC speed vs a program error that triggers a windows dump with '-1073741819 (0xc0000005)'?

Thanks again for the reply.



____________

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 58134 - Posted 23 Dec 2008 11:17:31 UTC - in response to Message ID 58130.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge, i.e. if your task had run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the RAM being pushed, the hard drive saying enough, the CPU sending that one bit of data too fast, etc. By backing off in 10 MHz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 MHz increments until the errors come back.
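The back-off-then-creep-up procedure can be written down as a small search. This is only a sketch: `is_stable` is a hypothetical stand-in for "run tasks at this clock for a while and see whether any error out", and the 3000 MHz starting point is an invented number.

```python
def find_stable_clock(start_mhz, is_stable, coarse=10, fine=1, floor_mhz=0):
    """Back off in coarse steps until stable, then creep up in fine steps.

    is_stable(mhz) is a caller-supplied check (hypothetical here); in
    practice it means crunching at that clock long enough to trust it.
    Returns the highest clock found stable, never above start_mhz.
    """
    clock = start_mhz
    # Coarse phase: drop the clock until the errors stop.
    while not is_stable(clock):
        clock -= coarse
        if clock < floor_mhz:
            raise ValueError("no stable clock found above the floor")
    # Fine phase: go back up until the next step would bring errors back.
    while clock + fine <= start_mhz and is_stable(clock + fine):
        clock += fine
    return clock


# Made-up machine that is stable at or below 2995 MHz.
print(find_stable_clock(3000, lambda mhz: mhz <= 2995))
```

Each call to the stability check costs real crunching time, which is why the coarse steps come first.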
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58137 - Posted 23 Dec 2008 11:39:54 UTC - in response to Message ID 58134.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.


i suspect you're right about the ram frequency and the cpu. probably just a bit too high for these tasks now. i might raise it by 5 mhz after tonight just to see what happens. my RAC is already low enough. i can't "afford" to take much more in errors.

Chu

Joined: Feb 23 06
Posts: 120
ID: 61076
Credit: 112,439
RAC: 0
Message 58144 - Posted 23 Dec 2008 19:00:03 UTC - in response to Message ID 58132.

greg_be and all,

When there is a new minirosetta version, we usually put a Windows debug symbol image in a downloadable location, so when a WU crashes it should provide a backtrace of how the error was caused (this does not work every time, which makes our debugging very hard). If it is an error from the Minirosetta program or a bad command line/input file setup, stdout or stderr will usually print a message as a hint, for example the hbond NAN problem in previous versions. We should also see a significantly higher error rate among either all or certain batches of WUs. If it is caused by interaction with the host's hardware or software, we will usually see that certain client hosts keep encountering errors or failures. We wish we could tell what has gone wrong in every scenario when an error occurs; however, most of us Rosetta developers are far from being experts on computer software/hardware, and we can only hope to trap errors locally on our testing machines to continue debugging.
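As a rough illustration of that rule of thumb, one can compare error rates grouped by batch against error rates grouped by host. The record fields (`host`, `batch`, `outcome`), the threshold, and the sample data below are all invented for the example; this is not the project's actual tooling.

```python
from collections import defaultdict


def error_rates(results, key):
    """Fraction of non-successful results for each value of results[i][key]."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        if r["outcome"] != "Success":
            errors[r[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}


def suspects(rates, threshold):
    """Keys whose error rate is above the threshold, sorted for stable output."""
    return sorted(k for k, rate in rates.items() if rate > threshold)


# Invented sample: host A fails on every batch, hosts B and C never fail.
results = [
    {"host": h, "batch": b, "outcome": "Client error" if h == "A" else "Success"}
    for h in ("A", "B", "C")
    for b in ("x", "y", "z")
]
print(suspects(error_rates(results, "host"), 0.5))   # hosts that keep failing
print(suspects(error_rates(results, "batch"), 0.5))  # batches that keep failing
```

Failures concentrated on a few hosts across many batches point at the host setup; failures concentrated in a batch across many hosts point at the program or the job setup.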

Thank you all for volunteering to help us with this project, and sorry about any inconvenience or trouble caused on your computers. Please continue to report problems and/or possible fixes you have found, as every bit of such information will certainly help us improve R@H stability and resolve hidden bugs/problems sooner or later. Happy holidays to everyone and happy crunching!

I have the same problem on 64 bit Win 2008 server only for all Minirosetta tasks. Minirosetta 1.45 had this problem too. All other PC (32bit, XP64bit) have no problem.

Zdenek


Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.

Can you tell me how to see the difference between a error due to windows or OC speed vs a program error that triggers a windows dump with '-1073741819 (0xc0000005)'?

Thanks again for the reply.




____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58145 - Posted 23 Dec 2008 19:44:39 UTC

Chu,

I reduced the OC amount by 10 mhz and then brought it back up 5 mhz.
Everything seems stable now as I have run nearly a day without trouble since backing down. It would seem your program is more and more sensitive to tiny things that high OC rates create. In any case backing down the cpu OC speed a bit seems to have solved this issue.

thanks for taking the time to discuss this problem with me and the other person.

staffann Profile

Joined: Oct 7 07
Posts: 7
ID: 210542
Credit: 57,681
RAC: 65
Message 58146 - Posted 23 Dec 2008 22:00:59 UTC

I had one WU crash on me today. Running on a WinXPSP3 Athlon X2 3800+ with 1Gb RAM. Link to task details.

216493218
Name 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_16326_0
Workunit 197297715
Created 23 Dec 2008 8:53:31 UTC
Sent 23 Dec 2008 9:33:56 UTC
Received 23 Dec 2008 22:08:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 625945
Report deadline 2 Jan 2009 9:33:56 UTC
CPU time 4928.609
stderr out

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58150 - Posted 24 Dec 2008 2:20:48 UTC - in response to Message ID 58137.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.


i suspect your right about the ram frequency and the cpu. probably just a bit to high for these tasks now. i might raise it by 5 mhz after tonight just to see what happens. my RAC is already low enough. i can't "afford" to take much more in errors.


Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58151 - Posted 24 Dec 2008 2:21:13 UTC - in response to Message ID 58137.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.


i suspect your right about the ram frequency and the cpu. probably just a bit to high for these tasks now. i might raise it by 5 mhz after tonight just to see what happens. my RAC is already low enough. i can't "afford" to take much more in errors.


Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?

(_KoDAk_) Profile

Joined: Jul 18 06
Posts: 109
ID: 100677
Credit: 1,859,263
RAC: 0
Message 58154 - Posted 24 Dec 2008 6:35:31 UTC

Exit status -1073741819 (0xc0000005)
http://boinc.bakerlab.org/rosetta/result.php?resultid=214936635
http://boinc.bakerlab.org/rosetta/result.php?resultid=216341024
http://boinc.bakerlab.org/rosetta/result.php?resultid=215006649
http://boinc.bakerlab.org/rosetta/result.php?resultid=214872151
Exit status 1 (0x1)
http://boinc.bakerlab.org/rosetta/result.php?resultid=212896182


____________

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 58156 - Posted 24 Dec 2008 12:50:29 UTC - in response to Message ID 58150.

Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?


I am using version 6.4.5 on some of my PCs and am not having any issues.
____________

DaveSun

Joined: May 3 07
Posts: 5
ID: 172723
Credit: 200,480
RAC: 0
Message 58157 - Posted 24 Dec 2008 13:38:45 UTC

I found this WU stalled after 15 hrs. I suspended the task and then reenabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shutdown in an unusual way.

STDERR OUT

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>

Dalton

Joined: Nov 30 05
Posts: 2
ID: 24789
Credit: 22,811,652
RAC: 5,240
Message 58158 - Posted 24 Dec 2008 14:04:03 UTC - in response to Message ID 58157.

I found this WU stalled after 15 hrs. I suspended the task and then reenabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shutdown in an unusual way.

STDERR OUT

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>



I've been getting those C++ popups as well, on multiple machine/OS configurations. After one appears, that core of the CPU seems to refuse to pick up work. This is new for me.
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58159 - Posted 24 Dec 2008 14:18:26 UTC - in response to Message ID 58151.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.


i suspect your right about the ram frequency and the cpu. probably just a bit to high for these tasks now. i might raise it by 5 mhz after tonight just to see what happens. my RAC is already low enough. i can't "afford" to take much more in errors.


Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?



robert, after dropping the OC 10 mhz and then bringing it back 5 mhz (total reduction 5 mhz) I have not had any further issues. so at least for my machine the errors were caused by OC'ing too far. this accounts for the huge amount of failures I had. It would seem the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, you're not having the same problem I was. Also I am on 6.4.5 after upgrading from the old version.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58160 - Posted 24 Dec 2008 14:20:59 UTC - in response to Message ID 58154.

Exit status -1073741819 (0xc0000005)
http://boinc.bakerlab.org/rosetta/result.php?resultid=214936635
http://boinc.bakerlab.org/rosetta/result.php?resultid=216341024
http://boinc.bakerlab.org/rosetta/result.php?resultid=215006649
http://boinc.bakerlab.org/rosetta/result.php?resultid=214872151
Exit status 1 (0x1)
http://boinc.bakerlab.org/rosetta/result.php?resultid=212896182



kodak, that looks similar to the rash of broken tasks I had.
are you OC'd at all?

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 58163 - Posted 24 Dec 2008 21:59:14 UTC

Hi.

I have this task running at the moment, and it's odd. This morning when I restarted the system, BOINC was showing 5 hrs, 4 mins completed. When the task got its turn to run, it dropped back to 1 hr, 33 mins and showed 2 models. It would have done more than two in the five hours!

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513

Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147

pete.


____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58164 - Posted 24 Dec 2008 22:12:02 UTC - in response to Message ID 58163.

normally this is due to when the last checkpoint was set. seems kind of odd that you would lose up to 4 hrs of work between checkpoints. it acts like it lost all the latest checkpoint data. it also looks like you're running a really old version of boinc. you might want to update to the latest version.

Merry Christmas
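A quick bit of arithmetic on the numbers pete reported, assuming the drop really is a restart from the last saved checkpoint (that is my reading, not something confirmed by the project). The helper names are made up.

```python
def seconds(hours, minutes):
    """Convert an hours/minutes reading from the BOINC manager to seconds."""
    return hours * 3600 + minutes * 60


def lost_on_restart(shown_before, last_checkpoint):
    """CPU time thrown away when a task resumes from its last checkpoint."""
    return max(0, shown_before - last_checkpoint)


lost = lost_on_restart(seconds(5, 4), seconds(1, 33))
lost_hours, lost_minutes = divmod(lost // 60, 60)
print(lost, lost_hours, lost_minutes)
```

That works out to 12660 seconds, or 3 hrs 31 mins of work redone after the restart.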

Hi.

I have this task at the moment running, it's odd. This morning when i restarted

the system Boinc was showing 5hrs,4mins completed, when the task got it's turn to

run it dropped back to 1hr,33mins and showing 2 models, it would have done more

than two in the five hours!

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513

Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147

pete.


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58165 - Posted 24 Dec 2008 22:17:22 UTC - in response to Message ID 58159.

Chu,

Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.

Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.


I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.


i suspect your right about the ram frequency and the cpu. probably just a bit to high for these tasks now. i might raise it by 5 mhz after tonight just to see what happens. my RAC is already low enough. i can't "afford" to take much more in errors.


Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?



robert, after dropping the OC 10 mhz and then bringing it back 5mhz (total reduction 5 mhz) I have not had any further issues. so at least for my machine the errors were caused by OC'ing to far. this accounts for the huge amount of failures I had. It would seem the the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, then your not having the same problem as I was. Also I am on 6.4.5 after upgrading from the old version.


dec 24 22.15 UTC - system is stable and RAC is slowly returning to normal.
Chu - thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. probably would have got to that conclusion after a few more errors.

stewjack

Joined: Apr 23 06
Posts: 39
ID: 78784
Credit: 95,871
RAC: 0
Message 58166 - Posted 24 Dec 2008 23:12:52 UTC - in response to Message ID 58163.

Hi.

I have this task at the moment running, it's odd. This morning when i restarted

the ... task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
pete.


I have had that happen three times during the last 4 or 5 days. I didn't report it because technically such behavior is not prohibited: the tasks complete and grant credit. However, I have set my task length to 2 hours for now, and these tasks run well over that time.

NOTE: I have checkpoint logging turned on!

ALL TIMES APPROX.

4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0

3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0

3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0

NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!




____________

Stacey Baird Profile
Avatar

Joined: Apr 11 06
Posts: 19
ID: 75056
Credit: 74,745
RAC: 0
Message 58167 - Posted 25 Dec 2008 0:42:17 UTC - in response to Message ID 57902.

HoHo kids!

We've got a new minirosetta version, with - you've guessed it - more bug fixes ! Woo!

Please report remaining issues here - that would be grand :)


Hello, I don't know if this is a bug AND I am not one to complain about receiving credit, however, I was very surprised to receive so much credit compared to claimed credit. Is the result below likely?

216467986
Name cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0
Workunit 197278592
Created 23 Dec 2008 6:24:21 UTC
Sent 23 Dec 2008 7:45:54 UTC
Received 24 Dec 2008 15:54:32 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 947263
Report deadline 2 Jan 2009 7:45:54 UTC
CPU time 5719.655
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time
failed to create shared mem segment
CreateSemaphore failure! Cannot create semaphore!

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time
======================================================
DONE :: 1 starting structures 5719.56 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 14.4476221738839
Granted credit 41.0260851670465
application version 1.47
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 58168 - Posted 25 Dec 2008 5:21:58 UTC - in response to Message ID 58163.

Hi.

I have this task at the moment running, it's odd. This morning when i restarted

the system Boinc was showing 5hrs,4mins completed, when the task got its turn to

run it dropped back to 1hr,33mins and showing 2 models, it would have done more

than two in the five hours!

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513

Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147

pete.



Well still looks odd to me, ended up taking 7hrs, 11min plus the 3 and a half

hours lost on restarting. I have a six hour R/T set and it still only did 4 models.

See below.

# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 25890.1 cpu seconds
This process generated 4 decoys from 4 attempts
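
For anyone comparing these summaries across tasks, here is a minimal Python sketch that pulls the CPU seconds and decoy count out of a stderr excerpt like the one above and works out the average time per decoy. The function name `summarize_run` and the regular expressions are assumptions based only on the two summary lines shown in this thread, not anything taken from the minirosetta source.

```python
import re

def summarize_run(stderr_text):
    """Return (cpu_seconds, decoys, seconds_per_decoy) from a stderr excerpt.

    Hypothetical helper: the patterns match only the two summary lines
    quoted in this thread.
    """
    cpu = re.search(r"DONE :: \d+ starting structures\s+([\d.]+) cpu seconds", stderr_text)
    dec = re.search(r"This process generated (\d+) decoys from (\d+) attempts", stderr_text)
    if cpu is None or dec is None:
        raise ValueError("summary lines not found")
    cpu_seconds = float(cpu.group(1))
    decoys = int(dec.group(1))
    # Guard against a run that reported zero decoys.
    per_decoy = cpu_seconds / decoys if decoys else float("inf")
    return cpu_seconds, decoys, per_decoy

sample = """DONE :: 1 starting structures 25890.1 cpu seconds
This process generated 4 decoys from 4 attempts"""
cpu_seconds, decoys, per_decoy = summarize_run(sample)
print(f"{per_decoy / 3600:.2f} hours per decoy")
```

On the figures above that comes to roughly 1.8 hours per model. If the models take about equal time, a fourth model started before the six hour mark would be expected to overshoot it.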



____________


DaveSun

Joined: May 3 07
Posts: 5
ID: 172723
Credit: 200,480
RAC: 0
Message 58170 - Posted 25 Dec 2008 15:49:57 UTC - in response to Message ID 58157.

I found this WU stalled after 15 hrs. I suspended the task and then reenabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shutdown in an unusual way.

STDERR OUT

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>



Had This WU this morning with the same error. It ran for 7 hours before stalling. Both are vanilla type. I still have one more of these in progress, it is currently at 21 hours and so far looks good.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 58172 - Posted 25 Dec 2008 23:37:52 UTC

Hi.

Here's another one doing strange things, when i shutdown last night it had run for 6hrs,30min and had done 18 models, when it restarted it went back to 5hrs, 26min and showing 18 models, it then ran to 6hrs, 18min and still only 18 models!
Still odd i haven't seen this before, the same type of task.

Fri 26 Dec 2008 09:03:52 EST|rosetta@home|Restarting task cc_nonideal_1_3_nocst4_hb_t306__IGNORE_THE_REST_1AZVA_6_5992_27_0 using minirosetta version 147

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=197386767

# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 22718.4 cpu seconds
This process generated 18 decoys from 18 attempts
======================================================

pete.


____________


Stacey Baird Profile

Joined: Apr 11 06
Posts: 19
ID: 75056
Credit: 74,745
RAC: 0
Message 58173 - Posted 26 Dec 2008 4:35:26 UTC

I am having much the same problems with stops, starts, incomprehensible progress (if any progress) reports, strange error reports, stalling, misrepresentation of time budgeting in the Tasks function and other weirdness.

Minirosetta v1.47 wastes too much time and steals processing time from other processing jobs that actually work.

I suspect that part of the problem is programmers and others being on Christmas break and not being available for problem solving.

As a result I have suspended Rosetta processing until at least January 3rd pending cleanup of the issues.
____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58175 - Posted 26 Dec 2008 13:19:46 UTC - in response to Message ID 58166.


NOTE: I have checkpoint logging turned on!

ALL TIMES APPROX.

4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0

3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0

3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0

NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!


This is pointing to a problem with checkpointing in the FoldCst protocol. I'll put this high on the todo list for the 1.48 release.
The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible - what kind of machine was this on ?

Mike


____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

stewjack

Joined: Apr 23 06
Posts: 39
ID: 78784
Credit: 95,871
RAC: 0
Message 58177 - Posted 26 Dec 2008 14:55:13 UTC - in response to Message ID 58175.


The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible


That would make sense. Normally my WU run time is set to 4 hours.


- what kind of machine was this on ?


Compaq Presario 6029
AMD Athlon XP 2100 (1.7 GHz)
Windows XP Home ( BOINC v 6.2.19 )
RAM: 768 MB
VIDEO CARD: Radeon 9250 128MB
Dial-up: USRobotics Controller Modem

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58180 - Posted 27 Dec 2008 9:52:34 UTC

serious credit issue here:
cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0
Claimed credit 106.166115188458
Granted credit 74.8691857584611

That is worse than the other mammoth task i had which had something like a 10 point difference. It also ran over my preferences of time. See long running tasks thread.

Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58183 - Posted 27 Dec 2008 11:26:21 UTC - in response to Message ID 58099.

After clean runs of memtest86+ 2.10 and prime95 for linux and I can no longer get decent results out of prime95 even though memtest86+ 2.10 will run fine.

As you'd most likely expect I'm putting the errors below down to hardware !!

Don't know if it's the CPU or more likely the mainboard northbridge. Have a newer CPU on order to rule that out.

Have removed said machine from my "farm".

Cheers and Happy Christmas and a computational bug free New Year


CPU type GenuineIntel
Intel(R) Pentium(R) 4 CPU 2.60GHz [Family 15 Model 2 Stepping 9]
Number of CPUs 2
Operating System Linux
2.6.24-22-generic

process exited with code 193 (0xc1, -63)
Stack trace (22 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f03420]
[0x83c53bc]
[0x84356a0]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf524]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215801702

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (20 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7fa5420]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf1ff]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215414530

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (23 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f48420]
[0x8ace23a]
[0x84348d3]
[0x8ace5f6]
[0x8acd739]
[0x83b1c55]
[0x862a631]
[0x83f65af]
[0x80cece6]
[0x80de98f]
[0x82c37e4]
[0x82b897a]
[0x82c16c1]
[0x818d6ee]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

http://boinc.bakerlab.org/rosetta/result.php?resultid=215035006

What's going on with the Rosetta Linux App ? Sometimes it works , sometimes it's duff ? Machine NOT overclocked in the slightest

Cheers
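
As an aside on reading the "code 193 (0xc1, -63)" lines above: the three numbers are the same byte shown in decimal, in hex, and as a signed 8-bit value. A small Python sketch of the conversion follows; the helper name `as_signed_byte` is made up for illustration, and the conversion says nothing about why the task crashed.

```python
def as_signed_byte(code):
    """Reinterpret an exit code's low 8 bits as a signed value (hypothetical helper)."""
    low = code & 0xFF
    return low - 256 if low >= 128 else low

code = 193
print(code, hex(code), as_signed_byte(code))
```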


____________


Rifleman

Joined: Nov 19 08
Posts: 17
ID: 288725
Credit: 139,408
RAC: 0
Message 58188 - Posted 27 Dec 2008 22:52:28 UTC

I am having problems with another WU that ran fine up to 99 percent and then stalled. I let one run for 37 hours until the watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me: http://boinc.bakerlab.org/rosetta/result.php?resultid=216862173
I am debating cancelling the WU that is presently doing the same thing, as wasting all that CPU time for 2 decoys seems like--well---a waste!!
http://boinc.bakerlab.org/rosetta/result.php?resultid=217161601

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58192 - Posted 28 Dec 2008 0:21:56 UTC - in response to Message ID 58188.

I am having problems with another WU that ran fine up to 99 percent and then stalled. I let one run for 37 hours until the watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me: http://boinc.bakerlab.org/rosetta/result.php?resultid=216862173
I am debating cancelling the WU that is presently doing the same thing, as wasting all that CPU time for 2 decoys seems like--well---a waste!!
http://boinc.bakerlab.org/rosetta/result.php?resultid=217161601


Where did it seem to get stalled at - about 10 minutes left to go? If so, that's what typically happens when a minirosetta workunit goes out with a serious underestimate of the time required to run it. When I had one like that, a few versions ago, I let it finish (in about 4 times the time I set as preference) and at least got some credit for it, but not much more than typical for workunits that actually finished in the estimated time. At about 10 minutes left to go, the estimated time calculations get messed up, but not the calculations leading to the desired results.

Rifleman

Joined: Nov 19 08
Posts: 17
ID: 288725
Credit: 139,408
RAC: 0
Message 58193 - Posted 28 Dec 2008 0:58:07 UTC

Hi Robert. Yeah----it stopped at about 10 minutes to go-----and stayed that way for 25 hours---lol. Watchdog terminated it.
I aborted another after 18 hours in. It was the same type protein as the first one. I have 2 more being crunched at the moment and am watching to see how they do after 12 hours in.
Task ID 216862173
Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_17673_0
Workunit 197639536
Created 25 Dec 2008 6:09:31 UTC
Sent 25 Dec 2008 7:37:31 UTC
Received 27 Dec 2008 5:01:41 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 948562
Report deadline 4 Jan 2009 7:37:31 UTC
CPU time 134234.2
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 43200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 134233 seconds. Greater than 3X preferred time: 43200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 561.58588373264
Granted credit 117.029798631356
application version 1.47
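
The watchdog message above suggests a simple rule: end the run once CPU time passes three times the preferred run time. Here is a minimal Python sketch of that check, assuming the factor of 3 quoted in the log is the whole rule; the real watchdog code is not shown in this thread and may well do more. `watchdog_should_stop` and `factor` are made-up names.

```python
def watchdog_should_stop(cpu_seconds, preferred_seconds, factor=3):
    """True once CPU time exceeds factor * preferred run time.

    Assumption: mirrors only the "Greater than 3X preferred time" log line.
    """
    return cpu_seconds > factor * preferred_seconds

# Figures from the log above: 43200 s preference, stopped at 134233 s.
limit = 3 * 43200
print(limit, watchdog_should_stop(134233, 43200))
```

With a 12 hour preference the cutoff works out to 129600 seconds (36 hours), which lines up with the run being ended at about 37 hours.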

Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58194 - Posted 28 Dec 2008 8:28:56 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=217325144

Nearly 16 hrs in when I spotted it and now it reports, after a manual abort, it has done 0 CPU time ?!?!
____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58195 - Posted 28 Dec 2008 9:33:57 UTC

guys,
don't forget to also post this info in the "Report long-running models here" thread.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 58197 - Posted 28 Dec 2008 13:57:49 UTC

Somewhere below the question was raised if the "Lock file" error has been fixed. It has not. If you look at this Computer you can see that I have several.

It is not at all clear why this happened.

As you can see it is a 4 Core processor with HT giving 8 virtual processors and I know that at one point I had at least 4 tasks running at the same time. Could this be a concurrency problem? At any rate this is a new machine in the prime of its existence in that it is just over a week old. It is run 24/7 and I have been running about 6-8 projects on the machine and I am not seeing errors like this on other projects. Heck, even GPU Grid is running reasonably well ...

The log files do not record the start time of the processing so you cannot tell for sure if that is the problem here. I still have a few tasks to go and I will run them to completion and see if I get more of these errors in the remaining tasks I have.

I note that my Mac Pro, also with 8 processors has not had this error, but, the project loading on that computer is such that I can't recall an instance where I had more than one Rosetta task running at the same time.

Looking at my other computers, all are multi-processor with at least 4 CPUs and I cannot see this error on any of those machines. I have two tasks running on the i7 right now so I will see if they will die with a collision. the tasks are cc2_1_8_native_cen_cst_hb_t373 and cc2_1_8_native_fa_cst_hb_t373 ...

I have been ignoring Rosetta so I cannot say that I know what the alphabet soup that makes up the task id means (if anything) so I can't tell if there is something common in the actual tasks or not ...

I just find it disappointing that this error surfaced so late in processing. One would think that the error would surface immediately.
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 58202 - Posted 28 Dec 2008 17:17:19 UTC

Since my last post I have completed two tasks successfully on this machine. I have two more in the queue and they are running now. So, by the time you read this they should probably have run to completion or failure. Watching my 8 CPU systems for some time now I have noted that, in general, I never seem to have more than 2 Rosetta tasks running at the same time due to other projects.

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...
____________

Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58205 - Posted 28 Dec 2008 19:46:19 UTC
Last modified: 28 Dec 2008 19:55:30 UTC

* sigh *

http://boinc.bakerlab.org/rosetta/result.php?resultid=217461782

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
SIGABRT: abort called
Stack trace (27 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f22420]
[0x8c24ca4]
[0x8c12c5b]
[0x8c10261]
[0x8c10296]
[0x8c0fe43]
[0x8c0f86c]
[0x8a88ba5]
[0x8559c48]
[0x83e8bc3]
[0x87f80df]
[0x87dc3c7]
[0x80de412]
[0x80d0686]
[0x80d0b2e]
[0x80c88b9]
[0x80de971]
[0x80d7d76]
[0x8064271]
[0x8117277]
[0x8127c00]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

Exiting...

</stderr_txt>
]]>

and

http://boinc.bakerlab.org/rosetta/result.php?resultid=217459230

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>
____________


robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58215 - Posted 29 Dec 2008 2:11:33 UTC - in response to Message ID 58202.
Last modified: 29 Dec 2008 2:38:47 UTC

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...


I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores.

Adding more physical memory also helps, but I had previously increased it to the limit of what my machine can handle (2 GB).

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 58216 - Posted 29 Dec 2008 2:41:56 UTC - in response to Message ID 58215.
Last modified: 29 Dec 2008 3:05:54 UTC

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...


I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores.


According to my Task manager my peak was 3.9 G with limit 5G so, I did not even get close. I have 3G normal RAM (well, 6 actually, but XP can only "see" 3 G) so ...

Well, I will try to increase the swap file, but, have suspended work on this machine till the project says something... over half the tasks failed with this one error and I am still waiting to see what happens to the last task ... it has been running with 11 min to go for a couple hours now ... if the % Complete was not slowly rising I would have killed it by now ... the main reason I am letting it run is that curiosity overwhelms me as to if it is going to fail with the same error after eating up 10 or more hours of my time or not ...

Oh, man, this is worse... I had nearly 10 hours on the clock. Changed the memory settings to increase the possible size of the swap file (even though it had 2G never used) and after a reboot, the task ended with 8 hours clock time. It looks like it is valid ... but that tells me that I just wasted nearly 2 hours on a task that should have ended ...

{edit add} The tasks that ended badly *MAY* have all been suspended. I cannot say for sure whether they were or not. They *MAY* have been. My setting for switching between tasks is 720 min (12 hours) to try to force most applications to finish before switching ... it is my way of trying to provide best results ... and with 4 plus cores it mostly works. But, I did notice that several of the Rosetta tasks did get suspended but I did not note which ones ... so more data to ponder if someone is actually going to look at this problem.{/edit} corrected time
____________

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 1
Message 58217 - Posted 29 Dec 2008 4:19:26 UTC
Last modified: 29 Dec 2008 4:55:22 UTC

This task http://www.boinc.bakerlab.org/rosetta/result.php?resultid=217385249 is running on Vista Home Premium and shows no graphics, either in the screensaver or when I click Show graphics. When I close the graphics window it comes up as not responding and then gives you 3 options:

  • Check for a solution & close the program
  • Close the program
  • Wait for the program to respond

I use Close the program. This task has been running with 10 minutes to go for almost an hour, with 97.525% done; it's moving at roughly .07.5% per minute. Should I abort it?

It has finished: Validate state Initial. I've noticed that my recent tasks have been going into a pending state, but they get credit quite quickly. Is this related to the cc2 jobs?
____________
Have a crunching good day!!

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,492,943
RAC: 7,820
Message 58226 - Posted 29 Dec 2008 16:47:58 UTC - in response to Message ID 58093.

Mike Tyka wrote:

Has anyone seen any new Lockfile problems ? Or are these finally a thing of the past?

I've made a song and dance about this before, so I should report my situation again:

With Mini 1.45 and Boinc 6.2.19 I had 80% success with a 2 hour runtime, dropping to 55% success with a 3 hour runtime over 116 WUs.

Upgrading to Boinc 6.4.5 for a short while before Mini 1.47s came through I thought I noticed less of the lockfile problem, but they've edged out of my history now.

Of the last 103 WUs:
9 were Beta 5.98s - 100% success as usual
94 Mini 1.47 - 93 success, 1 Computation Error here: 217352482
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10813.8 cpu seconds
This process generated 1904 decoys from 1904 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>cc_nonideal_3_5_nocst4_hb_t364__IGNORE_THE_REST_2CYEA_4_5826_14_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>

</message>
]]>

Claimed credit 52.7338164308948
Granted credit 52.7338164308948

I note some people are still getting problems, but mine seem to have completely gone, whether due to Boinc or the Mini WUs I don't know for sure, but I honestly don't care.

Excellent work, guys. Much appreciated here. Well done. This problem appeared for me along with this new machine in July and this is the first time I'm getting performance anything like this. My RAC has already increased by about 100 a day. I worried it was something I'd done.
____________

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,492,943
RAC: 7,820
Message 58227 - Posted 29 Dec 2008 16:59:37 UTC - in response to Message ID 58180.

serious credit issue here:
cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0
Claimed credit 106.166115188458
Granted credit 74.8691857584611

That is worse than the other mammoth task I had which had something like a 10 point difference. It also ran over my preferences of time. See long running tasks thread.

In different tasks I've had:
216878857 - CPU time 10076.6
Claimed credit 49.588655190211
Granted credit 100.839750703433

217129212 - CPU time 12904.09
Claimed credit 62.9250827866192
Granted credit 47.1981949319233

It varies. I wouldn't worry about it.
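
To put numbers on how much it varies, here is a quick Python sketch that computes granted credit as a fraction of claimed credit for the results quoted in this post. The `credit_ratio` helper is hypothetical, and this only compares the two figures; it does not reproduce how the project actually decides granted credit.

```python
def credit_ratio(claimed, granted):
    """Granted credit divided by claimed credit (hypothetical helper)."""
    return granted / claimed

# (claimed, granted) pairs copied from the post above.
results = {
    "cc2_1_8_mammoth_fa (quoted)": (106.166115188458, 74.8691857584611),
    "216878857": (49.588655190211, 100.839750703433),
    "217129212": (62.9250827866192, 47.1981949319233),
}
for name, (claimed, granted) in results.items():
    print(f"{name}: {credit_ratio(claimed, granted):.2f}x claimed")
```

A ratio above 1 means the task was granted more than it claimed, and below 1 means less.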
____________

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,492,943
RAC: 7,820
Message 58228 - Posted 29 Dec 2008 17:06:43 UTC - in response to Message ID 58114.

This makes 10 tasks in a days time that have died with the 0xc error. COME ON!
This ran to within 10 minutes of completion and died. Gees!
Then you insult me with no credit granted for a 99% completed task.


Later...

dec 24 22.15 UTC - system is stable and RAC is slowly returning to normal.
Chu - thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. Probably would have got to that conclusion after a few more errors.

I must've missed the apology elsewhere in the thread. I'm sure it was there somewhere. But maybe not.

Literally a thankless task.
____________

Hugh Miller

Joined: Nov 2 05
Posts: 1
ID: 8255
Credit: 37,808
RAC: 0
Message 58303 - Posted 31 Dec 2008 16:01:52 UTC
Last modified: 31 Dec 2008 16:02:11 UTC

I'm running:

BOINC 6.4.5
Rosetta Mini 1.47

on a machine with:

Win Vista Ultimate 64-bit SP1
Core Duo P8600 2.4GHz
4GB RAM
NVIDIA GEForce 9200M GS chipset, 256MB dedicated graphics memory

The screensaver behaves erratically. Sometimes it presents the familiar screen, other times it just goes white with a spinning cursor; if I hit ESC to exit, I get the errorbox reading:

minirosetta_graphics_1.20_windows_x86_64.exe is not responding

I have to bail manually from the screensaver at that point.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,492,943
RAC: 7,820
Message 58315 - Posted 1 Jan 2009 2:51:47 UTC

Happy New Year from this side of the Atlantic!

Once people sober up can you consider this scenario I've seen:

I glanced at my Boinc Manager earlier this evening and had one long-running WU at nearly 5 hours on a 3 hour run-time. A couple of hours later I noticed it had dropped back massively to just 19 minutes in (still the first model). It's done this again a few times since.

I upgraded to Boinc 6.4.5 a day or two before the Mini 1.47 WUs started coming through (mid-Dec), so I'm not sure which is responsible for this, but since the lockfile errors stopped crashing WUs out there have been several instances of WUs taking a long time with nothing at all reported in the manager's message tab, then finishing relatively early with no error message.

Am I imagining this or are others seeing the same thing? Without error messages I don't really know what to report, nor where to report it, but I'm sure it's happening.

I believe it happened with this completed WU and is currently happening with this in-progress WU. Both are cc2_1_8_mammoth_mix_fa_cst_hb jobs if that makes a difference.

Any ideas?
____________

arminius Profile

Joined: Sep 23 05
Posts: 8
ID: 863
Credit: 634,411
RAC: 0
Message 58409 - Posted 3 Jan 2009 9:07:58 UTC

some compute errors for lr5_score12

resultid=218389174

resultid=218356231
____________

arminius Profile

Joined: Sep 23 05
Posts: 8
ID: 863
Credit: 634,411
RAC: 0
Message 58411 - Posted 3 Jan 2009 9:58:04 UTC

next

resultid=218405620

stopping rosetta for now
____________

Greenshit

Joined: Jan 30 07
Posts: 3
ID: 144575
Credit: 55,173
RAC: 0
Message 58413 - Posted 3 Jan 2009 12:07:10 UTC

Three Compute errors in a row:
resultid=218358618
resultid=218443805
resultid=218358618

:-(
____________

Greenshit

Joined: Jan 30 07
Posts: 3
ID: 144575
Credit: 55,173
RAC: 0
Message 58415 - Posted 3 Jan 2009 12:15:48 UTC

sorry for typo, the last one should be:
resultid=218358619
____________

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58425 - Posted 3 Jan 2009 17:46:47 UTC
Last modified: 3 Jan 2009 17:55:23 UTC

I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:

-1073741819 (0xc0000005)

The workunits are as follows:

218380490
218380489
218380488

I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.

Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.
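
For readers wondering how -1073741819 and 0xc0000005 are the same number: the first is the 32-bit value read as signed, the second is the same bits shown as unsigned hex. 0xC0000005 is the Windows access violation status code. A minimal Python sketch of the conversion, with `to_unsigned32` being a name made up for illustration:

```python
def to_unsigned32(value):
    """Reinterpret a signed 32-bit exit status as unsigned (hypothetical helper)."""
    return value & 0xFFFFFFFF

status = -1073741819
print(hex(to_unsigned32(status)))
```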
____________



Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58427 - Posted 3 Jan 2009 18:19:19 UTC - in response to Message ID 58425.

I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:

-1073741819 (0xc0000005)

The workunits are as follows:

218380490
218380489
218380488

I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.

Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.


Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was OK.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58428 - Posted 3 Jan 2009 18:24:25 UTC

What's with this task and its credit?
cc2_1_8_native_cen_cst_hb_t369__IGNORE_THE_REST_1RXQA_14_5863_202_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=218243427
I am running the CPU flat out and produced 4 decoys in 11679.33 seconds with a runtime setting of 14400 seconds, and it granted me UNDER the claimed credit.
Claimed credit 78.1755065660898
Granted credit 32.0937916886001

That's just unbelievable.
My frustration is rising again with the bad credit granted and problems with downloads on your end, as well as the lousy credit for long-running tasks.

It is like the project is at the bottom of a sine wave again.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58429 - Posted 3 Jan 2009 18:31:45 UTC

Another lr5_score12 workunit that failed:

1/3/2009 9:34:50 AM|rosetta@home|Computation for task lr5_score12_rlbd_256b_IGNORE_THE_REST_DECOY_5559_1304_0 finished
1/3/2009 9:34:50 AM|rosetta@home|Output file lr5_score12_rlbd_256b_IGNORE_THE_REST_DECOY_5559_1304_0_0 for task lr5_score12_rlbd_256b_IGNORE_THE_REST_DECOY_5559_1304_0 absent


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=199023434

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58445 - Posted 4 Jan 2009 4:35:59 UTC - in response to Message ID 58427.

@greg_be

No, I am running stock. I lowered my runtime to 1 hour (thus no switching of apps) and of the 4 MR tasks that have completed, all look like they will validate. Is there causation here? I don't know, but I would be interested to find out.

It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...

I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:

-1073741819 (0xc0000005)

The workunits are as follows:

218380490
218380489
218380488

I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.

Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.


Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was OK.


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58451 - Posted 4 Jan 2009 9:30:41 UTC - in response to Message ID 58445.

Interesting that Win64 acts up for you. You're only 1 version of BOINC Manager 'out of date', but upgrading may or may not help. Leaving tasks in memory is something the group always recommends. I don't really have any other ideas at the moment. Could someone else look at his tasks and see if they have any ideas why he's crashing?

@greg_be

No, I am running stock. I lowered my runtime to 1 hour (thus no switching of apps) and of the 4 MR tasks that have completed, all look like they will validate. Is there causation here? I don't know, but I would be interested to find out.

It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...

I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:

-1073741819 (0xc0000005)

The workunits are as follows:

218380490
218380489
218380488

I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.

Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.


Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was OK.



Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 58456 - Posted 4 Jan 2009 11:18:26 UTC
Last modified: 4 Jan 2009 11:27:20 UTC

Exit status: -1073741819 (0xc0000005) unhandled exception detected:

lr5_score12_rlbd_1who_IGNORE_THE_REST_DECOY_5559_986_0
lr5_score12_rlbd_1mjc_IGNORE_THE_REST_DECOY_5559_534_0

AMD Turion Dual-Core RM-70 at stock speed: 2.0 GHz
Windows Vista SP1 32-bit.
Boinc 5.10.45 with throttling 40 %.
Didn't see any errors (before) on this machine after upgrading to minirosetta 1.45.

On their second run these tasks ran:
Successfully on a Mac,
had the same error on Windows Vista.

Have a nice day,
Path7.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58462 - Posted 4 Jan 2009 12:27:15 UTC - in response to Message ID 58451.

Interesting that Win64 acts up for you. You're only 1 version of BOINC Manager 'out of date', but upgrading may or may not help. Leaving tasks in memory is something the group always recommends. I don't really have any other ideas at the moment. Could someone else look at his tasks and see if they have any ideas why he's crashing?

@greg_be

No, I am running stock. I lowered my runtime to 1 hour (thus no switching of apps) and of the 4 MR tasks that have completed, all look like they will validate. Is there causation here? I don't know, but I would be interested to find out.

It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...

I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:

-1073741819 (0xc0000005)

The workunits are as follows:

218380490
218380489
218380488

I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.

Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.


Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was OK.





BOINC 6.4.5 is now available, which suggests that a few people found problems in BOINC 6.4.0 and more recent. I notice that all three of those workunits were the lr5_score12 type, which a few other people have been reporting having problems with. Note that some other threads indicate that Rosetta@home is likely to have problems supplying all the workunits that are requested for at least a few more hours, though.

I've had problems with one of the lr5_score12 workunits lately, but only after six workunits in a row that completed successfully and weren't the lr5_score12 type. Choosing the leave in memory option helps, especially if you also raise the upper limit on how much hard drive space BOINC can use and, at least for 32-bit Vista SP1, the upper limit on what fraction of the swap space BOINC can use.

Since then, another non-lr5_score12 workunit has completed on my machine successfully. Another lr5_score12 workunit is still running.

I'm using 14 hour workunits, but with 32-bit Vista, the leave in memory option, and with enough other projects to ensure switching to another workunit a few times before these workunits complete.

My lr5_score12 workunit with an error gave an error message similar to yours, so I wouldn't be surprised if it's an error specific to that batch of workunits.

If you'd like to increase the workunit time, I've found that there's a setting for how long workunits can go before deciding whether to switch to another workunit, but I don't remember if Rosetta@home includes this in the settings you're allowed to change. I currently have it set to 2 hours between such decisions, though.

Sharlee

Joined: Nov 8 05
Posts: 1
ID: 10258
Credit: 86,487
RAC: 0
Message 58463 - Posted 4 Jan 2009 12:31:18 UTC

New error to report:
I am running an i7 965 CPU with 6 GB of memory and Kaspersky antivirus. Is there anything I can do to fix this problem?


1/4/2009 5:45:01 AM|rosetta@home|Sending scheduler request: To fetch work. Requesting 84480 seconds of work, reporting 0 completed tasks
1/4/2009 5:45:11 AM|rosetta@home|Scheduler request completed: got 7 new tasks
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] expected 9e156df4c561be65533ceb64059254ab, got a500261b0525281e82d9c3166980820c
1/4/2009 5:45:22 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Finished download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Started download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Finished download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Started download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Finished download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Started download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Finished download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Started download of AT012.pdb
1/4/2009 5:45:51 AM|rosetta@home|Finished download of AT012.pdb
1/4/2009 5:45:53 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] expected 01275336f54af3e7ff7d41ae314e4f73, got 7cbad1935a58db3fe90e367e4d2f7daf
1/4/2009 5:45:53 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaAT01_09_05.200_v1_3.gz

____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58464 - Posted 4 Jan 2009 12:41:42 UTC - in response to Message ID 58463.

New error to report:
I am running an i7 965 CPU with 6 GB of memory and Kaspersky antivirus. Is there anything I can do to fix this problem?


1/4/2009 5:45:01 AM|rosetta@home|Sending scheduler request: To fetch work. Requesting 84480 seconds of work, reporting 0 completed tasks
1/4/2009 5:45:11 AM|rosetta@home|Scheduler request completed: got 7 new tasks
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] expected 9e156df4c561be65533ceb64059254ab, got a500261b0525281e82d9c3166980820c
1/4/2009 5:45:22 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Finished download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Started download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Finished download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Started download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Finished download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Started download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Finished download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Started download of AT012.pdb
1/4/2009 5:45:51 AM|rosetta@home|Finished download of AT012.pdb
1/4/2009 5:45:53 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] expected 01275336f54af3e7ff7d41ae314e4f73, got 7cbad1935a58db3fe90e367e4d2f7daf
1/4/2009 5:45:53 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaAT01_09_05.200_v1_3.gz


If you run out of Rosetta@home workunits that haven't been completed and reported, you can click on Reset project after selecting Rosetta@home in the Projects window of the Advanced view and make BOINC download all those files again.
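
For anyone curious what the "MD5 check failed" lines mean: the client compares the MD5 digest of the downloaded file with the digest the server says it should have, and rejects the file when they differ. The sketch below shows that kind of comparison with Python's standard hashlib. It is not BOINC's own code, the function names are made up for illustration, and the demo uses in-memory data rather than a real download.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=65536):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_expected(stream, expected_hex):
    """True if the stream's MD5 digest equals the expected hex string."""
    return md5_hex(stream) == expected_hex.lower()


# Demo on in-memory data; for a real file, pass open(path, "rb") instead.
print(matches_expected(io.BytesIO(b"hello"), "5d41402abc4b2a76b9719d911017c592"))  # True
```

A mismatch like the ones in the log means the bytes on disk are not the bytes the server expected, which is why downloading the files again is the usual remedy.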

Matthias Lehmkuhl

Joined: Nov 20 05
Posts: 10
ID: 13663
Credit: 708,110
RAC: 236
Message 58467 - Posted 4 Jan 2009 14:34:14 UTC

I also got one lr5_score12... WU with an error:

<message>
- exit code -1073741819 (0xc0000005)
</message>


wuid=198929431
____________
Matthias

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58480 - Posted 4 Jan 2009 18:06:45 UTC
Last modified: 4 Jan 2009 18:08:37 UTC

@ greg_be & robertmiles

Thanks for looking into this. I let rosetta run last night with increased runtimes and I left the application in memory but I see that 1 wu did fail: 218380754 for the same reason as before.

Also of note, there were 20 that failed because of client error while downloading--couldn't get input files, MD5 check failed: 218580846 for instance.

On this computer, I have set Rosetta to no new work and I had to abort the remaining wu's. I really want to attach here but the problems are far too severe at the moment. Perhaps I'll try again in 6 months, but I must say, this is getting a bit old...
____________



Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58484 - Posted 4 Jan 2009 18:19:08 UTC

The runtime should not directly affect the success of a task. But, since it will run more models, it increases the odds of you hitting a long-running model. So, running 5 models on 5 different 1 hour tasks should give you the same result as running 5 models on a single 5 hour task. But if 20% of the models are long-running, you would say that 100% of your 5hr tasks "fail", and only 20% of your 1hr tasks do.

But, with a 1 hour runtime preference, the watchdog will kick in much sooner. If watchdog is set to 3 times normal, it would only allow a task to run for 3 hours. Whereas with the longer runtime above, it would go for up to a total of 15 before ending the task.
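
The arithmetic above can be sketched roughly as follows. Two things here are assumptions and not project facts: each model is treated as independently long-running with the same probability, and the watchdog multiplier is taken as the 3 times mentioned above. The function names are made up for illustration. Under the independence assumption the chance that a 5-model task contains at least one long-running model comes out below 100%, so the 100% figure is best read as the average expectation of one long model per task.

```python
def chance_of_long_model(p_long: float, models_per_task: int) -> float:
    """Chance that a task hits at least one long-running model,
    assuming models are independent (a simplifying assumption)."""
    return 1.0 - (1.0 - p_long) ** models_per_task


def watchdog_limit_hours(runtime_pref_hours: float, multiplier: float = 3.0) -> float:
    """Longest a task may run before the watchdog ends it,
    assuming the limit is multiplier times the runtime preference."""
    return multiplier * runtime_pref_hours


print(round(chance_of_long_model(0.2, 1), 2))  # 0.2
print(round(chance_of_long_model(0.2, 5), 2))  # 0.67
print(watchdog_limit_hours(1))  # 3.0
print(watchdog_limit_hours(5))  # 15.0
```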
____________
Rosetta Moderator: Mod.Sense

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58486 - Posted 4 Jan 2009 18:21:13 UTC

Just for the fun of it I checked my desktop (AMD 4200+) for any errors, typically this one is and has been rock solid for years. Lo and behold, there was one error there that occurred in the past few hours with the same error as my vista laptop. So the error is not machine or cpu specific (AMD vs Intel...XP vs Vista) it has happened in each (as far as my setup at least).

AMD 4200:218361409

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,492,943
RAC: 7,820
Message 58491 - Posted 4 Jan 2009 18:36:53 UTC - in response to Message ID 58462.
Last modified: 4 Jan 2009 18:38:04 UTC

It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...

[...]

BOINC 6.4.5 is now available, which suggests that a few people found problems in BOINC 6.4.0 and more recent. I notice that all three of those workunits were the lr5_score12 type, which a few other people have been reporting having problems with. Note that some other threads indicate that Rosetta@home is likely to have problems supplying all the workunits that are requested for at least a few more hours, though.

@Robert\sslickerson

I had loads of problems (can't acquire lockfile) with Vista64 until Boinc 6.4.5 at which point they disappeared completely. I also reduced my runtime to 2 hours for greater success with earlier versions. With 6.4.5 they seem to have gone. An upgrade may help you too.

That said, it hasn't solved any issues with exception errors, which I still get to a small extent (1 out of 93 when I investigated). All your problems seem to be of that type (many more than me) so it may not solve your problems.

For what it's worth, I kept applications in memory, which I understand to be the best advice. Maybe you should try that too. Hope it helps you to some degree.
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 58493 - Posted 4 Jan 2009 18:42:07 UTC

Also check to see if processor usage is set to 100% ...

I saw a note on EaH that with Windows and the processor usage not set to 100% this is a common error. Given that this killed about 20 models here for me ... I am interested if this is really the case ... I know Rosetta runs well on OS-X in that I have not had any failures there ...

On Win XP I got 10 failures out of about 20 tries ... which is when *I* gave up again on RaH ...

I had set usage to 99% to give me a little more head room and that may have been enough to farble things up ...

Anyone up for the test?

This is addressed to the "Can't acquire lock-file" problem only ...
____________

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58499 - Posted 4 Jan 2009 19:56:07 UTC - in response to Message ID 58493.

Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).

Also check to see if processor usage is set to 100% ...

I saw a note on EaH that with Windows and the processor usage not set to 100% this is a common error. Given that this killed about 20 models here for me ... I am interested if this is really the case ... I know Rosetta runs well on OS-X in that I have not had any failures there ...

On Win XP I got 10 failures out of about 20 tries ... which is when *I* gave up again on RaH ...

I had set usage to 99% to give me a little more head room and that may have been enough to farble things up ...

Anyone up for the test?

This is addressed to the "Can't acquire lock-file" problem only ...


____________



LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 58507 - Posted 4 Jan 2009 22:51:22 UTC - in response to Message ID 58499.

Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).

Not to forget, it's an ideal time to reset the project too.

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 957,089
RAC: 119
Message 58540 - Posted 5 Jan 2009 20:45:17 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=197878878

Ian_D Profile

Joined: Sep 21 05
Posts: 55
ID: 757
Credit: 4,216,173
RAC: 0
Message 58565 - Posted 6 Jan 2009 12:23:46 UTC

<core_client_version>6.2.12</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800

ERROR: phil how did we get here-2?
ERROR:: Exit from: src/core/kinematics/AtomTree.cc line: 1378
called boinc_finish

</stderr_txt>
]]>


You're having a laugh, right ?

http://boinc.bakerlab.org/rosetta/result.php?resultid=218842260
____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58572 - Posted 6 Jan 2009 16:21:34 UTC

The second cruncher and I both got compute errors on lr5_score12_rlbd_1ubi_IGNORE_THE_REST_DECOY_5559_1100_1

The combined task summary is here: http://boinc.bakerlab.org/rosetta/workunit.php?wuid=198993154

the error is:
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

It ran for 1089.156 seconds of CPU time out of 14400 on my machine

and on the other system

Computer ID 593083
Report deadline 13 Jan 2009 17:11:30 UTC
CPU time 526.051
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

Jim Leatherman

Joined: Jun 15 08
Posts: 2
ID: 264510
Credit: 987,127
RAC: 0
Message 58573 - Posted 6 Jan 2009 17:07:16 UTC

After upgrading to 6.4.5 BOINC doesn't seem to be downloading any tasks now for Rosetta@Home. Same message all the time:

01/06/09 12:00:56|rosetta@home|Fetching scheduler list
01/06/09 12:01:01|rosetta@home|Master file download succeeded
01/06/09 12:01:06|rosetta@home|Sending scheduler request: To fetch work. Requesting 172801 seconds of work, reporting 0 completed tasks
01/06/09 12:01:11|rosetta@home|Scheduler request completed: got 0 new tasks

I have reset the project, but still no downloads -- was working fine prior to 6.4.5.

Any ideas?

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 58577 - Posted 6 Jan 2009 17:35:21 UTC

The problem is not at your end. If you have similar problems in the future always check the server status. Right now there are problems on the other end as you will see by the prominent red boxes.
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58579 - Posted 6 Jan 2009 18:39:58 UTC - in response to Message ID 58577.

The problem is not at your end. If you have similar problems in the future always check the server status. Right now there are problems on the other end as you will see by the prominent red boxes.


Generate work servers have been offline today (European time) for quite some time. No news from the team as to what is causing this outage. Keep an eye on the server status page to see when they come back online.

Rifleman

Joined: Nov 19 08
Posts: 17
ID: 288725
Credit: 139,408
RAC: 0
Message 58592 - Posted 7 Jan 2009 7:25:45 UTC

I have 3 finished WUs that don't seem to upload to the server---is that because of the problems today?

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58593 - Posted 7 Jan 2009 8:20:08 UTC - in response to Message ID 58592.

See the server problems thread. Apparently a connection to the outside world got pulled while they were working on the rack. They expect things to be up and running today. But it is midnight Pacific time at the moment, so don't expect anything to happen for at least 8 hours. If you go into BOINC Manager and then go to the Projects tab, you can set RAH to 'accept no new tasks' and that will stop it from requesting new work. This will cut back on your status messages. Turn it back on later tonight (European time).


I have 3 finished WUs that don't seem to upload to the server---is that because of the problems today?

yose-ue

Joined: Dec 30 05
Posts: 3
ID: 44964
Credit: 228,710
RAC: 0
Message 58654 - Posted 7 Jan 2009 21:17:04 UTC

This job (wuid=198707114) appears to have finished twice, and after using 71456 CPU seconds in total I was only granted 2 points.

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
======================================================
DONE :: 1 starting structures 47173.5 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
======================================================
DONE :: 1 starting structures 71456 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 156.119054549462
Granted credit 2
application version 1.47

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58655 - Posted 7 Jan 2009 21:21:30 UTC

DK has now corrected the problem where results are always granted 2 credits per model. See his post.
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58673 - Posted 8 Jan 2009 11:57:56 UTC

some bizarre behavior for these tasks

http://boinc.bakerlab.org/rosetta/result.php?resultid=218440422
lr5_score12_rlbd_2o7k_IGNORE_THE_REST_DECOY_5559_1165_0

Exit Status -1073741819 (0xc0000005)
CPU time 8809.906
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

Validate state Invalid
Claimed credit 58.9690388361006
Granted credit 58.9690388361006

But according to the tasks for user page the granted credit never happened.

---------

http://boinc.bakerlab.org/rosetta/result.php?resultid=218547095
lr5_score12_rlbd_1ubi_IGNORE_THE_REST_DECOY_5559_1100_1

Exit status -1073741819 (0xc0000005)
CPU time 1089.156
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...


Claimed credit 7.29025740598957
Granted credit 7.29025740598957

but again, no credit in the tasks for user page

slre

Joined: Dec 6 08
Posts: 2
ID: 291385
Credit: 1,345,264
RAC: 739
Message 58738 - Posted 11 Jan 2009 21:38:08 UTC

I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%.
The following task is going the same way:
abinitio_norelax_homfrag_129_B_1o7uA_SAVE_ALL_OUT_4626_11775_0
After 3 hours it was reporting 70% complete; it is now at 98.8% after 13.5 hours.

My main complaint is not that the tasks can overrun (though that is clearly a problem, it has been reported previously) but that I thought the target CPU time included a threshold (3*target cpu time?) that terminated an overrunning task. Minirosetta is clearly ignoring this if it's set, as my target time is set to 4 hours.

Is minirosetta supposed to act on target cpu time? If it is, why isn't it?

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58739 - Posted 11 Jan 2009 23:50:51 UTC - in response to Message ID 58738.
Last modified: 12 Jan 2009 0:23:24 UTC

I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%.
The following task is going the same way:
abinitio_norelax_homfrag_129_B_1o7uA_SAVE_ALL_OUT_4626_11775_0
After 3 hours it was reporting 70% complete; it is now at 98.8% after 13.5 hours.

My main complaint is not that the tasks can overrun (though that is clearly a problem, it has been reported previously) but that I thought the target CPU time included a threshold (3*target cpu time?) that terminated an overrunning task. Minirosetta is clearly ignoring this if it's set, as my target time is set to 4 hours.

Is minirosetta supposed to act on target cpu time? If it is, why isn't it?


It is, but it doesn't check continuously for an overrun. If you have BOINC set to give each workunit a two hour timeslice before deciding what workunit gets the next timeslice, as I do, it only checks for an overrun every two hours.

In other words, your actual limit should be (3*target cpu time) + 1 timeslice at present.

Also, the diminishing returns you see are at least partly an illusion. Minirosetta doesn't have a good way of measuring what percentage of the work has been done, so it estimates the percentage done from the fraction of the target CPU time it has already used. Once it gets within about 10 minutes of the target CPU time, it almost stops changing the reported percentage until it actually finishes.
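
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this post: the 3x multiplier and the extra timeslice come from the discussion here, not from the minirosetta source, and the function name is hypothetical.

```python
def watchdog_limit_hours(target_cpu_hours, timeslice_hours, multiplier=3):
    """Worst-case run time if the watchdog fires at multiplier * target
    but the overrun is only noticed once per timeslice (assumption)."""
    return multiplier * target_cpu_hours + timeslice_hours


# Target CPU time of 4 hours with a 2 hour BOINC timeslice
limit = watchdog_limit_hours(4, 2)
print(limit)
```

With a 4 hour target and a 2 hour timeslice this gives 14 hours, so a task that runs for 30 or 40 hours is well past anything this rule would allow.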

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58740 - Posted 11 Jan 2009 23:52:37 UTC - in response to Message ID 58738.

I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%.
The following task is going the same way:
abinitio_norelax_homfrag_129_B_1o7uA_SAVE_ALL_OUT_4626_11775_0
After 3 hours it was reporting 70% complete; it is now at 98.8% after 13.5 hours.

My main complaint is not that the tasks can overrun - though that is clearly a problem, and it has been reported previously - but that I thought the target CPU time included a threshold (3 * target CPU time?) that terminated an overrunning task. Minirosetta is clearly ignoring this if it's set, as my target time is set to 4 hours.

Is minirosetta supposed to act on target cpu time? If it is, why isn't it?




Be sure to post links to the tasks that ran over in the long-running models thread. Apparently the team reads that thread to find out what is going on and to make corrections in the next batch of similar tasks.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58742 - Posted 12 Jan 2009 0:53:44 UTC
Last modified: 12 Jan 2009 0:56:32 UTC

just a heads up:

1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_1a19A_SAVE_ALL_OUT_4626_9187_0 exited with zero status but no 'finished' file
1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 exited with zero status but no 'finished' file
1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
1/12/2009 1:23:10 AM|rosetta@home|Restarting task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 using minirosetta version 147

the 87 task: http://boinc.bakerlab.org/rosetta/result.php?resultid=219581418
the 86 task: http://boinc.bakerlab.org/rosetta/result.php?resultid=219581394

Both tasks got credit OK, so I don't know what that message was all about.
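
For anyone who wants to pull the affected task names out of a saved message log, here is a minimal Python sketch. It assumes each line has the `timestamp|project|message` layout shown in the excerpt above and matches only the exact warning text quoted there; the function name is hypothetical.

```python
import re

# Matches the warning quoted above; the task name is the token after "Task".
NO_FINISHED = re.compile(
    r"^Task (\S+) exited with zero status but no 'finished' file$"
)


def tasks_without_finished_file(log_lines):
    """Return task names from 'timestamp|project|message' lines with the warning."""
    names = []
    for line in log_lines:
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a message-log line in the assumed layout
        match = NO_FINISHED.match(parts[2])
        if match:
            names.append(match.group(1))
    return names


sample = [
    "1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_1a19A_SAVE_ALL_OUT_4626_9187_0 exited with zero status but no 'finished' file",
    "1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.",
    "1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 exited with zero status but no 'finished' file",
]
found = tasks_without_finished_file(sample)
print(found)
```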

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58743 - Posted 12 Jan 2009 0:57:09 UTC

A link to slre's task, it ran for over 40 hours! So, yes, clearly the watchdog should have ended it.

Robert, I don't believe the watchdog is dependent upon the BOINC task switching. On the other hand, it's not constantly checking either.
____________
Rosetta Moderator: Mod.Sense

slre

Joined: Dec 6 08
Posts: 2
ID: 291385
Credit: 1,345,264
RAC: 739
Message 58748 - Posted 12 Jan 2009 1:50:36 UTC - in response to Message ID 58743.

A link to slre's task, it ran for over 40 hours! So, yes, clearly the watchdog should have ended it.

Robert, I don't believe the watchdog is dependent upon the BOINC task switching. On the other hand, it's not constantly checking either.


Thanks for that; a) I didn't know you could link to aborted tasks; b) it made my case better than I did; and c) thanks for confirming there's a genuine problem.

S

HA-SOFT, s.r.o.

Joined: Jan 27 07
Posts: 10
ID: 144015
Credit: 65,377,643
RAC: 50,339
Message 58754 - Posted 12 Jan 2009 10:54:27 UTC - in response to Message ID 58144.

StdErr is empty or contains a message about an access violation (0xc0000005). The application hangs using 3 MB of RAM and does nothing. For example, I have about 10 minirosetta apps in memory that do nothing. When I kill them, there is no stderr or any other file in the slots directory.

Greg_BE and all,

When there is a new version of minirosetta, we usually put a Windows debug symbol image in a downloadable location. So when a WU crashes, it should provide a backtrace showing how the error was caused (this does not work every time, which makes our debugging very hard). If it is an error from the Minirosetta program or a bad command line/input file setup, stdout or stderr will usually print a message as a hint, for example, the hbond NAN problem in the previous versions. Also, we should see a significantly higher error rate among either all or certain batches of running WUs. If it is caused by interfacing with the host's hardware or software, we will usually see that certain client hosts keep encountering errors or failures. We wish we could tell what has gone wrong in every scenario when an error occurs. However, most of us Rosetta developers are far from being experts on computer software/hardware, and we can only hope to trap errors locally on our testing machines to continue debugging.

Thank you all for voluntarily helping us with this project, and sorry about any inconvenience or trouble caused on your computers. Please continue to report problems and/or possible fixes you have found, as every bit of such information will certainly help us improve R@H stability and resolve hidden bugs/problems sooner or later. Happy holidays to everyone and happy crunching!



____________

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 58777 - Posted 13 Jan 2009 2:15:46 UTC - in response to Message ID 58499.

Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).

In lieu of any direct reply, I note that every recent job for sslickerson has completed successfully.

Looks like Boinc 6.4.5 answers at least one person's problems with MiniRosetta WUs. Worth thinking about for anyone with otherwise persistent problems, it seems.

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58786 - Posted 13 Jan 2009 19:46:55 UTC

Hi all! Hope you all had a fabulous Christmas break. Despite being quiet on the message boards we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments and feedback and error reports have been invaluable in this process! We've also set up a Windows test-bed here locally which identified a number of hidden issues that the Linux machines we typically use didn't catch.

The next release 1.48 is about to go on RALPH and I am intending to test it very thoroughly before moving it onto BOINC. Since you guys posting here are already familiar with spotting problems I think it would be awesome if some of you experienced users could move over to RALPH@Home just for a few weeks while we test the new release. You've already seen the problems that used to occur and we need your feedback (and the extra processing power and variety of machines) to make sure we've fixed the issues we think we have fixed. I'll announce again here when the new version is actually out.

Here's a preview of the features that have been put into mini 1.48:

1.48 Release CHANGELOG

Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

Bug fix concerning intermittent crashes in _rlbd_ jobs.

Bug fix for a potential instability in handling text files (affects all types of WUs).

Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread.

Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

Added checkpointing to Looprelax.

The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about.



____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 58790 - Posted 13 Jan 2009 22:55:10 UTC - in response to Message ID 58786.

Despite being quiet on the message boards we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments and feedback and error reports have been invaluable in this process! We've also set up a windows test-bed here locally which identified a number of hidden issues that the Linux machines we typically use didn't catch.

That's the way I like it - that you're getting busy behind the scenes rather than getting bogged down here. But it's worth a quick progress report once a week to prevent the natives getting too restless.

Good to hear you're set up with a Windows machine to pick up problems on the majority platform and it's earned its corn already. I look forward to the results and a much quieter bug thread. The work on over-running WUs, intermittent crashes and extra check-pointing should make a big difference if they're successful.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 58796 - Posted 14 Jan 2009 5:14:36 UTC

Well, no work yet in RALPH ...

But, I did sign up for what it is worth ... I will watch and see if I get any work on one system ...
____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58797 - Posted 14 Jan 2009 7:11:41 UTC

Yeah - hold yer horses .. we've not done the update yet. I'll announce it here.

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

sslickerson Profile

Joined: Oct 14 05
Posts: 101
ID: 4578
Credit: 484,477
RAC: 0
Message 58812 - Posted 14 Jan 2009 17:09:48 UTC - in response to Message ID 58777.

Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).

In lieu of any direct reply, I note that every recent job for sslickerson has completed successfully.

Looks like Boinc 6.4.5 answers at least one person's problems with MiniRosetta WUs. Worth thinking about for anyone with otherwise persistent problems, it seems.


Hey there, sorry about not replying. Actually, the Rosetta WUs you are looking at are on my desktop (BOINC 6.4.5) which *typically* does not have issues with minirosetta. I have not allowed work on my laptop (BOINC 6.5.0) since the last batch of errors, so I am uncertain if the update would have fixed the issue.

I am going to reattach to RALPH for awhile and hopefully if there are errors we can get them fixed over there.

Timothy
____________



Krata

Joined: Oct 25 05
Posts: 2
ID: 6696
Credit: 17,084
RAC: 0
Message 58829 - Posted 15 Jan 2009 7:57:59 UTC - in response to Message ID 58812.

Hi,

I still have the same problem with the Minirosetta application (in at least the last 4 versions).

Symptoms - the application starts (shown as running in BOINC) but CPU usage is zero... there is no progress, and finally (e.g. after 2 hours) I am forced to abort it. Some tasks still finish without any problem...

Successful result example:
http://boinc.bakerlab.org/rosetta/result.php?resultid=220577616

Examples that needed to be aborted:
http://boinc.bakerlab.org/rosetta/result.php?resultid=220578787
http://boinc.bakerlab.org/rosetta/result.php?resultid=220578788

Due to these facts (no error, and so no work performed at all) I have switched to a different project. Thanks for any advice...

PS I tried detaching from the project, resetting and so on...

15/01/2009 08:53:44||Starting BOINC client version 6.4.5 for windows_intelx86
15/01/2009 08:53:44||log flags: task, file_xfer, sched_ops
15/01/2009 08:53:44||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3
15/01/2009 08:53:44||Data directory: C:\Documents and Settings\kratochvil\Desktop\boincnew\CommonAppData\BOINC
15/01/2009 08:53:44||Running under account kratochvil
15/01/2009 08:53:44||Processor: 1 GenuineIntel Intel(R) Pentium(R) M processor 1.73GHz [x86 Family 6 Model 13 Stepping 8]
15/01/2009 08:53:44||Processor features: fpu tsc sse sse2 mmx
15/01/2009 08:53:44||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 2, (05.01.2600.00)
15/01/2009 08:53:44||Memory: 1.99 GB physical, 4.82 GB virtual
15/01/2009 08:53:44||Disk: 74.53 GB total, 9.79 GB free
15/01/2009 08:53:44||Local time is UTC +1 hours
15/01/2009 08:53:44||Using HTTP proxy CZproxy.de.eurw.ey.net:8080
15/01/2009 08:53:44||No CUDA devices found
15/01/2009 08:53:44||No coprocessors
15/01/2009 08:53:44|rosetta@home|URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 984920; location: home; project prefs: default
15/01/2009 08:53:44|QMC@HOME|URL: http://qah.uni-muenster.de/; Computer ID: 114583; location: (none); project prefs: default
15/01/2009 08:53:44||General prefs: from rosetta@home (last modified 14-Jun-2008 11:07:07)
15/01/2009 08:53:44||Computer location: home
15/01/2009 08:53:44||General prefs: using separate prefs for home
15/01/2009 08:53:44||Reading preferences override file
15/01/2009 08:53:44||Preferences limit memory usage when active to 1426.87MB
15/01/2009 08:53:44||Preferences limit memory usage when idle to 1834.55MB
15/01/2009 08:53:45||Preferences limit disk usage to 2.00GB
15/01/2009 08:53:45|QMC@HOME|Restarting task one_bench12_s22-ecp2-TZmf.13431_0 using Amolqc-preRC1 version 501

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3378
ID: 106194
Credit: 0
RAC: 0
Message 58836 - Posted 15 Jan 2009 14:36:50 UTC
Last modified: 15 Jan 2009 14:39:29 UTC

Krata, I do not have any specific advice to offer you to resolve the problem you describe. I only see a few tasks from that host, and only one completed normally and only two were aborted. So, perhaps greater numbers will help reveal more symptoms.

Could I ask that you keep an eye on the news portion of the home page and come back when the new Mini version is available? It will correct the majority of problems people have been reporting.

If you are willing, you might also consider attaching to Ralph to help test the new version. They need machines like yours that were having problems before, to be certain they have corrected them. The new release is not yet ready for testing, so you won't see many (or any) tasks available on Ralph right now. But they should be there soon.
____________
Rosetta Moderator: Mod.Sense

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 58916 - Posted 19 Jan 2009 9:08:54 UTC

** 1.48 released over on RALPH@Home **

Good evening all. For those who've been following this thread and are interested in helping us get the minirosetta app stable, I've just released a new application version over on RALPH with a whole slew of stuff in it to make it more stable, or at least give us more feedback on where it breaks. It's a first step.
Since you've already been giving us incredibly invaluable feedback over the last weeks and months, I'd really appreciate your feedback on this new app over on RALPH. Does it run more stably? Do any of the familiar problems crop up? Overrunning WUs? Weird crashes, etc.?

thanks !

mike


____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,443,745
RAC: 1,849
Message 58921 - Posted 19 Jan 2009 13:09:39 UTC - in response to Message ID 58916.
Last modified: 19 Jan 2009 13:23:12 UTC

** 1.48 released over on RALPH@Home **

Good evening all. For those who've been following this thread and are interested in helping us get the minirosetta app stable, I've just released a new application version over on RALPH with a whole slew of stuff in it to make it more stable, or at least give us more feedback on where it breaks. It's a first step.
Since you've already been giving us incredibly invaluable feedback over the last weeks and months, I'd really appreciate your feedback on this new app over on RALPH. Does it run more stably? Do any of the familiar problems crop up? Overrunning WUs? Weird crashes, etc.?

thanks !

mike



I've been over on ralph. Looks like you may have made the 1.48 program available over there, but so far I've seen no sign of any new workunits in the queue over there for testing it. I'll need to run at least 10 workunits using it to tell if it's better or not, unless it's worse than 1.47.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58925 - Posted 19 Jan 2009 16:34:05 UTC - in response to Message ID 58916.

** 1.48 released over on RALPH@Home **

Good evening all. For those who've been following this thread and are interested in helping us get the minirosetta app stable, I've just released a new application version over on RALPH with a whole slew of stuff in it to make it more stable, or at least give us more feedback on where it breaks. It's a first step.
Since you've already been giving us incredibly invaluable feedback over the last weeks and months, I'd really appreciate your feedback on this new app over on RALPH. Does it run more stably? Do any of the familiar problems crop up? Overrunning WUs? Weird crashes, etc.?

thanks !

mike




The project seems to be disabled at the moment for "maintenance".

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 58944 - Posted 20 Jan 2009 15:42:38 UTC

back to 1.47 errors

this one crashed and burned:
jump-neg-1aiu___6220_9692_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=221608803

state Compute error
Exit status -1073741819 (0xc0000005)

CPU time 10506.81
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0059C5C0 write attempt to address 0x00978D98

Engaging BOINC Windows Runtime Debugger...
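
For readers wondering how the two numbers in the report above relate: the exit status -1073741819 and 0xc0000005 are the same 32-bit value, read as signed and unsigned respectively. A small Python sketch (the helper name is hypothetical) shows the conversion, and also that the `cpu_run_time_pref` of 14400 seconds is the 4 hour target run time:

```python
def to_unsigned32(exit_status):
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return exit_status & 0xFFFFFFFF


code = to_unsigned32(-1073741819)
print(hex(code))

# cpu_run_time_pref is given in seconds
target_hours = 14400 / 3600
print(target_hours)
```

On Windows, 0xC0000005 is the status code for an access violation, which matches the "Access Violation" line in the stderr output.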

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,776
RAC: 660
Message 59015 - Posted 24 Jan 2009 19:12:04 UTC

clemA_BOINC_ABRELAX-clemA-_6226_187582_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)

9659.594
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0059C5C0 write attempt to address 0x00978D98

Engaging BOINC Windows Runtime Debugger...


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC