Which of the many users that have abandoned the project due to problems should feel it is safe to reenter the waters?
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
ID: 57905
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
......sooooo which bugs do you feel you've fixed?
Which of the many users that have abandoned the project due to problems should feel it is safe to reenter the waters?
Amongst a bunch of minor things, one major bug that was fixed was causing jobs to crash when they entered the full-atom stage but had a full-atom energy > 0. This occurs only rarely, which would explain the random errors seen with the cs_vanilla jobs. The bug was due to a wrongly initialized variable.
This bug was also causing the majority of the ccc_1_8_* jobs to fail on RALPH (we didn't move these over to BOINC of course, since we noticed the bug there).
The reason those failed more frequently is that they have constraints built in, and those offset the energy to higher values, increasing the frequency of the problem to more like 70%.
Looking at the RALPH results, I think we've fixed most of the easily reproducible errors. I recently ran close to 10000 WUs on our local compute cluster resulting in.. well.. 0 errors. This is where it gets really tricky: if stuff is only failing on other platforms, or due to machine-dependent issues, or *god knows what*. I will propose that the lab acquire a small farm of Windows machines to do extensive bug testing & hunting on, to get a grip on these errors.. but believe us, this is difficult ground.
Thanks for bearing with us,
Mike
I can't even imagine the loads of code you (guys) went thru.
____________
The one bug that comes to mind that would not be easy to observe by counting successfully completed results, on a farm of Linux machines all running only a single project, would be where the tasks were not suspending properly. Someone mentioned a BOINC API compatibility problem might be the cause?
What would reasonable memory expectations be now? Are all the 1.47 tasks tagged as needing 512MB minimum? Or is there a mix? And, of that 512MB, what should one expect to see a task actually using when running normally?
____________
ID: 57913
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
The one bug that comes to mind that would not be easy to observe by counting successfully completed results, on a farm of Linux machines all running only a single project, would be where the tasks were not suspending properly. Someone mentioned a BOINC API compatibility problem might be the cause?
You're right. However I believe David Kim has updated and fixed this problem, at 1.45. If you guys *still* see problems with suspension of jobs then do let us know. We also hope that this lockfile problem should be largely fixed. We'll have to wait for the error statistics to come in before we know if the API fix has worked.
What would reasonable memory expectations be now? Are all the 1.47 tasks tagged as needing 512MB minimum? Or is there a mix? And, of that 512MB, what should one expect to see a task actually using when running normally?
I can't speak for the enzyme design guys but to give you an idea:
The jobs named "*_rlbd_*" and "*_rlbn_*" should take no more than 160 MB or so.
The jobs named "cc2_*" or "*_chunk_*" should take between 150 and 320MB or so (they are much larger proteins).
I'm not aware of any jobs that require more than 400MB; that would definitely point to a problem. Although the enzyme design guys may well have higher requirements.
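For anyone who wants to compare what Task Manager shows against these figures, here is a minimal sketch. It only encodes the rough numbers quoted in the post above; the function name, the "first match wins" ordering and the example task names are assumptions made for illustration, and the enzyme design jobs may legitimately need more.

```python
from fnmatch import fnmatchcase

# Rough ceilings in MB, taken from the figures quoted in the post above.
# The patterns and the "first match wins" ordering are illustrative assumptions.
EXPECTED_MAX_MB = [
    ("*_rlbd_*", 160),
    ("*_rlbn_*", 160),
    ("cc2_*", 320),
    ("*_chunk_*", 320),
]
SUSPICIOUS_MB = 400  # above this, the post suggests something is wrong


def check_memory(task_name, observed_mb):
    """Return 'ok', 'high' or 'suspicious' for one task's observed memory use."""
    if observed_mb > SUSPICIOUS_MB:
        return "suspicious"
    for pattern, ceiling in EXPECTED_MAX_MB:
        if fnmatchcase(task_name, pattern):
            return "ok" if observed_mb <= ceiling else "high"
    return "ok"  # no figure was quoted for this job type
```

A call such as `check_memory("normal_relax_rlbd_1234", 150)` (a made-up task name) reports "ok", while anything over 400MB is reported as "suspicious" regardless of the job name.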
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Sorry Mike, not a good start...
Yes, I know. I'm not saying the app is perfect - just that we found a bunch of definite bugs that are now fixed. No doubt there are still issues - we're working on it :)
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
I can't even imagine the loads of code you (guys) went thru.
Well.. to give you an idea .. Minirosetta has more than 200000 (yes two hundred thousand) lines of code. Each day there are maybe around 20 additions to the code, with around 40 people working on the code at each given time.
i hope you guys get a small farm of windows machines to double check problems against your linux machines. windows is what the majority of us crunchers use and certain error types may or may not show up on linux.
for instance, how does one tell the difference between a machine error and an application error when the task dies with a (0xc0000005) error? is this something that shows up on your linux machines? or is that a specific windows error code?
also in another thread you mentioned aborting tasks that are using lower than 1.47. would these tasks be reissued using 1.47 or would they use the same mini that they originated with?
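On the error code question: 0xc0000005 is the Windows status code for an access violation (an invalid memory access), so that particular value is Windows-specific. The code alone does not say whether the hardware or the application was at fault. The sketch below only shows how the signed exit status on a result page and the hex value relate; the helper names are made up for illustration.

```python
# Windows NTSTATUS value for an access violation (an invalid memory access).
STATUS_ACCESS_VIOLATION = 0xC0000005


def to_unsigned(status, bits=32):
    """Reinterpret a signed exit status as an unsigned value of the given width."""
    return status & ((1 << bits) - 1)


def to_signed(value, bits=32):
    """Reinterpret an unsigned value as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def describe(status):
    """Format an exit status as 'signed (0xhex)', flagging access violations."""
    unsigned = to_unsigned(status)
    note = " (Windows access violation)" if unsigned == STATUS_ACCESS_VIOLATION else ""
    return f"{status} (0x{unsigned:08x}){note}"
```

For example, `describe(-1073741819)` yields the 0xc0000005 form, and `to_signed(0xc1, bits=8)` applies the same reinterpretation at 8 bits.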
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(95094,0xa0538fa0) malloc: *** error for object 0x1747df0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
And now all today's imported 1.47-tasks for the upcoming week have collapsed, most of them after less than 1 minute of computing, one was manually aborted as potentially ever-lasting.
It seems that I have to stick to my 5.98-tasks for some days and increase the default runtime.
Minirosetta apparently "looks like" malware, whether it actually is or not. This applies to all versions I've run, thru v1.47.
I run BOINC on two WinVista (God help me) boxes: one a 32 bit Sony with ZoneAlarm Pro|ESET NOD32 for security; the other a 64 bit Sony with Kaspersky Internet Security 2009.
On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released. Interestingly, ZoneAlarm Pro's application module hasn't had a problem with it.
On the 64 bit machine, Kaspersky's Application Control module gives Minirosetta's executable a Threat Rating of "Potentially Dangerous" with a heuristic Danger Index score of 82. I have to manually override Kaspersky and move Minirosetta out of the "Untrusted Application" zone, to allow it to execute. (By comparison, Rosetta Beta 5.98 has a DI of 12, as does SETI's recently released Astropulse 5.0. SETI's regular Enhanced v6.03 has a DI of zero.)
I realize that heuristic analysis is as much art as science, but both ESET and Kaspersky are rated at or near the top of their field. Of 10 project hosts I subscribe to, with over 25 project executables, Minirosetta is the ONLY one that has ever sent up a red flag to my security suite(s). Since most folks leave their security suite (if any) on autopilot, there are potentially many testers who never get to run Minirosetta because the .exe goes immediately into a black hole. Somewhere in those 200,000 lines of code, something apparently looks funky.
After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours.
Because of the 1-hour runtimes BOINC also downloaded additional tasks to fill the cache. Not good.
On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released.
Hello, I've been using both nod32 and rosetta for years now, I've never had nod32 detect rosetta as anything malicious, make sure you are updated. v3.0.672.0 DB 3695 as of writing.
Yes, I know. I'm not saying the app is perfect - just that we found a bunch of definite bugs that are now fixed. No doubt there are still issues - we're working on it :)
That's ok. Just that I'm trying to get more active here again after some computer problems and the first 1.47 task crashed out quickly. The next 4 have run with no problems though. Hopefully that continues. Usually all the problems are mine, not yours.
Good to see a more active presence from you in this forum. Your feedback on issues makes a big difference, even if it's just to say you're working on it without a solution yet. That matters too.
Just to expand on the point of this person....Thanks for taking the time to tell us what is going on. We like to know and the silence has been deafening lately.
Thanks again for breaking it. We hope for more news as time goes along.
ID: 57939
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 362,889 RAC: 796
Hi.
I found a problem with the graphics on Ubuntu 8.04, mini 1.45 worked fine but now when i click
the show graphics button all i get is the outline of the graphic window, it looked transparent.
I could not close it normally i had to go to processes and kill it from there, also it was
showing that the graphics was using mini 1.40 for some reason. I'm sure that mini 1.45 was
Crux of my problem is this, I have no graphic display, the screen saver is blank and when I hit the 'show graphics' button in the advanced view, it opens a window (title - minirosetta version 1.47 [workunit: cs_noe_ .... etc]) that is blank, and then becomes unresponsive within about 10 seconds and requires the process to be killed.
BOINC Manager version: 6.4.5
wxWidgets version: 2.8.7
Rosetta application: Rosetta Mini 1.47
Microsoft Windows Vista Business x86 Editon, (06.00.6000.00)
Don't know if you need this but..
PC: GenuineIntel Intel(R) Celeron(R) CPU 2.80GHz [x86 Family 15 Model 4 Stepping 9], 1 GB RAM, NVIDIA GeForce 8500 GT
New account/install, 44 mins old according to its first work unit.. Vista is a fresh build, <24hrs old...
The workunits are running/progressing along, I would just like to see what I'm crunching :)
I have a WinXP 32-bit machine with Norton Antivirus 2009 installed.
minirosetta v1.47 is known to have fixed many bugs, but there is still a major fault in this version: my antivirus detects it as a high security risk threat and removes it automatically. So you download the new version and after some time you find it has been removed by your antivirus. I don't know whether it really contains a virus or not, but the fact is that there is something in the thousands of lines of minirosetta code that the antivirus does not like. I hope this issue will be resolved soon, and I ask the developers of minirosetta to fix it as early as possible, because most new users will not run it again on their machines after their antivirus has flagged it as a threat.
So it is bad to hear that the new version still contains a major bug. :-(
____________
why not set your antivirus to manual and then when it grabs minirosetta you can tell it to ignore that kind of file. we all know minirosetta is a safe application. it's just NAV and other antivirus software that thinks it has an infection. I bet if you ran housecall from trendmicro you would find no problems. I run AVG free and none of the tasks have ever triggered that program and my system is virus free.
On the first machine, NOD32 Antivirus thinks the Minirosetta .exe either contains a viral signature or looks bad heuristically (their UI doesn't say which). I have to add an exclusion to get the thing out of quarantine, every time a new version is released. Interestingly, ZoneAlarm Pro's application module hasn't had a problem with it.
That's weird, because I have NOD32 on one of my PC's and it doesn't have a problem with rosetta. I changed to Avast Pro... and still no problems :S
____________
4 machines with XP Pro (3 on SP3, 1 on SP2) and 1 machine with Server 2003, all running mini 1.47: for the last 24 hours every result has been Exit Status -177 (0xffffff4f) Maximum memory exceeded.
A sample from 2 machines:
Task ID 215092853, workunit 196054490
Task ID 215087694, workunit 196045728
They both have the same computer ID, 964014.
One of the machines is running a Beta 5.98 task concurrently; I'm going to hold off on detaching to see what result it produces.
88.12 RAC on 4 x XP Pro + 1 x Server 2003 over 24 hours:
1 quad-core at 2.66 GHz
1 HT at 2.8 GHz
1 at 1.6 GHz
1 at 1.5 GHz
1 mobile at 1.9 GHz
can you point to which specific machine(s) this is happening on.
you have so many there is no quick way to know which machine the tasks you listed belong to.
I merged machines to assist.
The 2 that I have posted tasks from are
964014
965938
The other machines with similar errors are
961824
954192
954486
computer 964014 has less than their new recommendation of 512MB of memory. this must be one of the tasks they were talking about.
December 10, 2008
We are now recommending systems with at least 512MB of memory. The majority of tasks will run fine with 256MB but some tasks will involve larger proteins that will use more memory.
computer 965938 is having a lockfile issue; there has been a lot of discussion about this in the 1.45 thread. you have to delete the empty slot folders in the boinc slot folder located in the projects folder. do a search in the forums for lockfiles.
only one other computer had an issue, but that is due to defective task.
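The lockfile workaround described above (removing empty slot folders) could be scripted roughly as follows. This is a sketch only: the location of the slots folder differs between installs, so it is passed in as a parameter rather than guessed, the function names are made up, and it seems prudent to stop BOINC before deleting anything. By default it only lists what it would remove.

```python
from pathlib import Path


def empty_slot_dirs(slots_root):
    """Return the sub-directories of slots_root that contain no entries."""
    root = Path(slots_root)
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and not any(d.iterdir())
    )


def remove_empty_slot_dirs(slots_root, dry_run=True):
    """List (and, if dry_run is False, delete) empty slot directories."""
    found = empty_slot_dirs(slots_root)
    if not dry_run:
        for d in found:
            d.rmdir()  # rmdir only succeeds on an empty directory
    return found
```

Running `remove_empty_slot_dirs(path)` first and checking the returned list before calling it again with `dry_run=False` keeps the step reversible until the last moment.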
appears to be running fine, but when I click "show graphics" the window becomes unresponsive and requires the app to restart. the other work units are working without any problem.
I'm running vista64. I am running BOINC 64-bit edition. boincmgr.exe and boinctray.exe are running in 64-bit mode. however, minirosetta_1.47_windows_x86_64.exe is currently running in 32 bit mode. it says *32 next to the name, which I believe indicates that it is running in 32-bit mode.
As far as I know there is no real 64 bit version for rosetta. It is the 32 bit version in a 64 bit wrapper.
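The "*32" marker in Task Manager does mean a 32-bit process on 64-bit Windows. To check an executable file directly, one option is to read the Machine field from its PE header, as in the sketch below. The function name is made up for illustration, only the two most common machine types are named, and nothing here has been run against the actual minirosetta binary.

```python
import struct

# COFF Machine values for the two common Windows targets.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)"}


def pe_machine(path):
    """Read the Machine field from a Windows PE executable's COFF header."""
    with open(path, "rb") as f:
        header = f.read(64)
        if len(header) < 64 or header[:2] != b"MZ":
            raise ValueError("not an MZ/PE executable")
        # Offset 0x3C of the DOS header holds the file offset of the PE signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        f.seek(pe_offset)
        sig_and_machine = f.read(6)
    if len(sig_and_machine) < 6 or sig_and_machine[:4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", sig_and_machine, 4)
    return MACHINE_NAMES.get(machine, hex(machine))
```

Pointing `pe_machine` at the downloaded .exe in the BOINC project folder would show which of the two it was built as.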
____________
ID: 57995
Zilli Samuel Joined: Mar 2 06 Posts: 3 ID: 62673 Credit: 22,229 RAC: 0
I have the problem with Norton Antivirus 2009 too: it deletes the minirosetta exe file because it's a "high security risk threat".
I entered the BOINC path in Norton's exclusion paths to solve it, but it would be better if the Rosetta staff talked to the Norton staff to avoid this problem...
____________
ID: 58006
jay Joined: Jan 12 08 Posts: 10 ID: 234922 Credit: 57,684 RAC: 0
Question on memory size..
Greetings!
First of all, thanks to all of the developers for debugging the code.
I have a question about the memory size and page fault rate for mini-rosetta 1.47 .
I was looking at the windows (XP) task manager and looking at the memory size and page fault rate.
I admit that I do not know what it all means - and would like to ask the forum for an explanation that would help me..
Environment: Here is what BOINC says:
Processor: 2 GenuineIntel Intel(R) Core(TM) Duo CPU T2300 @ 1.66GHz [x86 Family 6 Model 14 Stepping 12]
Processor features: fpu tsc pae nx sse sse2 mmx
OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 3, (05.01.2600.00)
Memory: 2.00 GB physical, 4.87 GB virtual
Disk: 107.41 GB total, 78.57 GB free
Here is what the Task Manager is showing for mini-rosetta 1.47
Mem usage: 184,944K ( Varying between 170,000K and 247,000K while I watched.)
PF delta: 3,228 ( in a three second period)
VM size: 199,344K ( and moving up to 243,000 K)
I was running 2 Boinc projects at once: Rosetta and WCG-clean energy.
If I suspend all others so that only Rosetta is running, the page faults are more sporadic, mostly zero, then up to 6,375 in the three second period.
With Boinc only running the Rosetta task, the task manager says:
Physical Memory (K)
total: 2,095,532
available: 1,127,112
System cache: 838,252
Bottom Line - I assumed that the pf rate is not good.
Do you know of anything I can tweak to help??
THANK YOU!!
Jay E.
Can you afford to add more physical memory to that machine? That should at least decrease the page fault rate, although I don't know if it's the cheapest way to do this.
Here's a good place to find out what memory fits that machine, and how much it can hold:
It looks from the stderr file like it crunched normally for 16 hours (my current preference) with no error. However, it was then marked "Invalid" with no explanation. The only other thing I see is that it crunched an unusually high number of decoys (8777 decoys). Does that cause problems with the validator?
If you change the view you can add a column to display the number of faults since the task started. I have long runtimes, but currently have two tasks from Ralph that topped 100,000,000 page faults. One in 15hrs and the other in 19hrs. This is the highest fault rate I've ever seen. Indeed, I recall the days when I thought that 1M per hour of runtime was excessive.
The only solace I can offer is that not all faults are hard faults to disk. Some recorded faults are "soft". Perhaps someone else can further elaborate on the concepts.
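To put the page fault figures in this discussion on a common scale, the arithmetic can be sketched as below. The numbers are the ones quoted in the posts; the variable names are made up, and the counts include soft faults, so a high rate does not by itself mean heavy disk paging.

```python
def faults_per_second(fault_count, seconds):
    """Average page-fault rate over an observation window."""
    return fault_count / seconds


# Figures quoted in the posts above.
task_manager_sample = faults_per_second(3_228, 3)          # PF delta over 3 s
peak_sample = faults_per_second(6_375, 3)                  # sporadic peak over 3 s
ralph_15h = faults_per_second(100_000_000, 15 * 3600)      # 100M faults in 15 h
old_rule_of_thumb = faults_per_second(1_000_000, 3600)     # 1M faults per hour

for label, rate in [
    ("Task Manager sample", task_manager_sample),
    ("sporadic peak", peak_sample),
    ("RALPH task, 15 h", ralph_15h),
    ("1M per hour", old_rule_of_thumb),
]:
    print(f"{label}: {rate:.0f} faults/s")
```

Seen this way, the Task Manager sample and the RALPH tasks are the same order of magnitude, and both are several times the older "1M per hour" rule of thumb.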
____________
a WU will get to around 85% complete , progress will stay the same. time to completion stays around 10 minutes. i suspend all tasks, resume then the "stuck" WUs will complete.
edited: doing this also rolls back the "cpu time spent" to around 30 minutes
ID: 58024
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Stephen, this may be part of why you are having problems keeping all 8 CPUs busy. Suggest you just let BOINC manage the machine for the next 12 hours or so. Don't abort, suspend, update, anything at all.
Some tasks will take longer than 3 hours to run, and their % complete progress bar will not move steadily. Rather than tell you the task has -30 minutes left, they reflect the situation by making time move very slowly after the task gets to 10 minutes remaining.
It's simply a problem with the estimate, not the work being done.
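To make the behaviour described above concrete, here is a purely illustrative sketch of a remaining-time display that slows down instead of going negative. The decay formula and all names are assumptions invented for this example; this is not the formula BOINC or minirosetta actually uses.

```python
FLOOR_SECONDS = 600  # the "10 minutes remaining" point mentioned above


def displayed_remaining(target_seconds, elapsed_seconds):
    """Illustrative only: a remaining-time display that never goes negative."""
    naive = target_seconds - elapsed_seconds
    if naive >= FLOOR_SECONDS:
        return naive
    overrun = FLOOR_SECONDS - naive  # how far past the 10-minute point we are
    # Hypothetical decay: shrink slowly toward zero rather than dropping below it.
    return FLOOR_SECONDS * FLOOR_SECONDS / (FLOOR_SECONDS + overrun)
```

With a 3-hour target, a task that has already run 30 minutes over would still show a small positive figure here rather than "-30 minutes", which matches the kind of display the post describes.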
____________ Rosetta Moderator: Mod.Sense
how do you "lose credit" on a task?
on this task i claimed 83 and got 68 for 4 hrs runtime. That is just weird when most of the other work I have been running always comes out on the plus side for granted.
here's a tip: before rebooting, because you never know how many times windows will want you to do that when you install an update, go to the activity tab of boinc manager and put all activity in suspend. wait for your hard drive to stop grinding away with all the saving and then you can reboot. also be sure to have the leave jobs/tasks in memory option turned on as well. then you will not lose your position in the task. suspend seems to save everything to the hard drive and you can reboot all you want and not lose any data for the task.
yes i did... thanks for that info a Microsoft upgrade required a reboot
you didn't have to reboot your computer a few times during the task's run did you?
that will kill a task.
ID: 58034
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
I do not agree with greg's comments about preservation of work and reasons why, but would prefer to take them up in another thread if you'd like to discuss further.
[edit]
We're discussing this under a new thread here.
____________ Rosetta Moderator: Mod.Sense
"graphic viewer" hangs with this task
cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_mth1598_olange_5607_11086_0
(http://boinc.bakerlab.org/rosetta/result.php?resultid=215720373)
I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems.
The first time Show graphics is pressed the graphics app starts and displays a blank window. Moving the mouse causes the graphics app to crash.
The second and subsequent times Show graphics is pressed the graphics app starts and displays a blank window along with the spinning rainbow beach ball. The graphics app is frozen and you can't even force quit in the normal way: it's necessary to quit via the Activity Monitor.
____________
The graphics for one of my Minirosetta 1.47 work units crash. If I click on the show graphics button in BOINC, a window is launched, but it remains black, and to close it I have to manually end the unresponsive process. The work unit runs fine though. It's under BOINC 6.2.19
http://boinc.bakerlab.org/rosetta/result.php?resultid=215547790
t071_1_RDC_NMR_NESG_5480_118996_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 941.5781
--------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215490731
t072_1_RDC_NMR_NESG_5481_92626_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 12309.66
-----------------------------
I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems.
The first time Show graphics is pressed the graphics app starts and displays a blank window. Moving the mouse causes the graphics app to crash.
The second and subsequent times Show graphics is pressed the graphics app starts and displays a blank window along with the spinning rainbow beach ball. The graphics app is frozen and you can't even force quit in the normal way: it's necessary to quit via the Activity Monitor.
I'm seeing somewhat similar problems under Windows Vista SP1.
12/21/2008 7:18:31 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_ccr19_olange_5604_39348_0 using minirosetta version 147
Moving the mouse had no particular effect, but the graphics window stayed blank and shutting it down gave some error messages before it finally worked. I normally let minirosetta run without graphics.
ID: 58090 | Rating: 0
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Hi all! I'm back online, sadly only to find more errors. We'll get back to debugging after the holidays.
Quick comments on the major issues reported above:
- The graphics problems with cs_noe_* jobs: this is very strange. We have NOT updated the graphics app, so these jobs must be doing something that the graphics app doesn't like. I'll ask the person submitting these to run the graphics app locally to see if we can reproduce the error.
- The validator error on normal_relax_rlb[dn]_* jobs: I thought I had fixed this, so this must be something else. Yes, the validator will reject a WU if it has produced more than some number of decoys (around 128 or so per hour). This points to some other problem: evidently the job is racing through decoys without doing anything with them, thereby producing thousands of results. How that can happen on a sporadic basis (< 1/1000 WUs, it seems) is puzzling me. I'll have to look into that one.
- Virus scanners: not really a bug. We have no control over what virus scanners "recognise" in the app as malware or a virus. The vendors won't tell us either; they have been wholly unhelpful in this matter. The only solution I see right now is to set exceptions in your virus scanner to ignore apps coming from ralph.bakerlab.org and boinc.bakerlab.org.
Has anyone seen any new lockfile problems? Or are these finally a thing of the past?
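For anyone wondering what the decoy-rate rule mentioned above amounts to, here is a minimal sketch. It is not the project's validator code: the function names are made up for illustration, and the threshold of 128 decoys per hour is only the approximate figure quoted in the post.

```python
def decoys_per_hour(n_decoys, cpu_seconds):
    """Return decoys produced per CPU hour."""
    if cpu_seconds <= 0:
        # No measurable CPU time: treat any output as an unbounded rate.
        return float("inf") if n_decoys > 0 else 0.0
    return n_decoys * 3600.0 / cpu_seconds


def exceeds_decoy_limit(n_decoys, cpu_seconds, max_per_hour=128.0):
    """Hypothetical check: True if a result reports implausibly many decoys.

    max_per_hour is an assumption taken from the "around 128 or so per hour"
    figure in the post, not a confirmed project setting.
    """
    return decoys_per_hour(n_decoys, cpu_seconds) > max_per_hour
```

A workunit that "races through decoys" would show up here as thousands of decoys for a few hours of CPU time, far above the limit, while a normal result stays well under it.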
Task 215936807; Workunit 194706499; Name 1dsvA_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1dsvA-_5479_5614_1; crashed on Mac OS X 10.4.11 after 4 secs (thankfully)
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
SIGSEGV: segmentation violation
Crashed executable name: minirosetta_1.47_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Sat Dec 20 23:23:58 2008
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(95094,0xa0538fa0) malloc: *** error for object 0x1747df0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
#Aehm - i can't see your RALPH failure for this job. I had one result come back and it was a success..
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours.
Because of the 1-hour runtimes BOINC also downloaded additional tasks to fill the cache. Not good.
We run a number of very different jobs on R@home, covering different problems in structure prediction and now also protein design. Thus, depending on the type of workunit, runtimes may vary hugely. The rlbd jobs do indeed run very quickly (requiring something like 25 minutes per decoy).
What was your very first job ??
I think we will put a limit into the code that will abort jobs running over 6 hours in the next update. Watch this space..
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Just to expand on the point of this person....Thanks for taking the time to tell us what is going on. We like to know and the silence has been deafening lately.
Thanks again for breaking it. We hope for more news as time goes along.
http://boinc.bakerlab.org/rosetta/result.php?resultid=215547790
t071_1_RDC_NMR_NESG_5480_118996_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 941.5781
--------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215490731
t072_1_RDC_NMR_NESG_5481_92626_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 12309.66
-----------------------------
edit - more of the same type of task errored out
http://boinc.bakerlab.org/rosetta/result.php?resultid=215554911
t071_1_RDC_NMR_NESG_5480_119941_0
state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 9361.141
http://boinc.bakerlab.org/rosetta/result.php?resultid=215583938
t072_1_RDC_NMR_NESG_5481_100236_0
state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 4056.126
i am aborting the remaining t071 and t072 tasks due to 4 errors in 5-6 hours.
wasting my time with that junk.
another note: these 2 tasks did not respond to a suspend command in the sense that the time to completion continued to count even though the actual running time had stopped and the status showed as suspended.
i think you guys should recheck the code or whatever of the t071 and t072 tasks as I see someone before me had one of these series of tasks and ran into a computer error of the same nature of what i reported. i aborted that task since i am not interested in wasting my cpu time on a compute error bugged task.
Another workunit with graphics problems:
12/21/2008 11:27:13 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_flua_olange_5605_35210_0 using minirosetta version 147
The previous one seemed to complete successfully despite the graphics problem.
I think we will put a limit into the code that will abort jobs running over 6 hours in the next update. Watch this space..
What effect will that have on users who have chosen default workunit times over 6 hours? Is this 6 hours per decoy or 6 hours for the whole workunit? If it only aborts one decoy, will the other decoys still continue, with credit for the decoys that completed successfully both before and after this aborted decoy?
ID: 58104 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Yes, he's talking about per model. If any models that run that long are cut off, it would help assure a more consistent runtime in line with each person's stated preference. Not perfect, but better than having some specific models haul off and run for 12 hours.
So, yes, if time remains for the task, another model may begin.
I won't comment on credit, because it's not my decision, and so far as I know no specific decision has been made yet. But the project has always maintained that even "failures" provide information valuable to advancing the project.
At present, a model will sometimes run for 12 hours or more, and you get the same credit average as those running models with the more typical runtime of under 3 hours. So if nothing else, cutting it off at 6 hours (or whatever length is deemed appropriate) prevents you from running longer than that for essentially zero credit. This approach limits your credit loss, if nothing else.
____________ Rosetta Moderator: Mod.Sense
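To make the per-model cutoff described above concrete, here is a small simulation. It is only a sketch of the behaviour as explained in this thread, not the minirosetta implementation: `run_task`, its parameters, and the 6-hour default are illustrative assumptions.

```python
def run_task(model_seconds, preferred_seconds, per_model_limit=6 * 3600):
    """Simulate one task under a hypothetical per-model time limit.

    model_seconds: how long each model would need if left alone.
    preferred_seconds: the user's chosen runtime for the whole task.
    Returns (outcomes, elapsed_seconds).
    """
    elapsed = 0
    outcomes = []
    for needed in model_seconds:
        if elapsed >= preferred_seconds:
            break  # no time remains for the task, so no new model starts
        if needed > per_model_limit:
            elapsed += per_model_limit  # model is cut off at the limit
            outcomes.append("cut off")
        else:
            elapsed += needed
            outcomes.append("completed")
    return outcomes, elapsed


# An 8-hour preference with one runaway 12-hour model in the middle.
outcomes, elapsed = run_task([3600, 12 * 3600, 1800], 8 * 3600)
```

In this example the runaway model costs 6 hours instead of 12, and a third model still gets to start because time remains for the task.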
When this is said I seem to have reconciled with Rosetta by rebooting the computer in question. Why this was suddenly necessary on a computer with no new program installations, no new configurations, no system upgrades, no separate computing on the side, and successfully computing 1.47-tasks 24 hours earlier, I am unable to explain. Even the subsequently installed Boinc 6.5 works like a charm. So I am loaded with tasks for a peaceful Christmas session and hope for the best until reporting time next weekend.
come on guys, you say this stuff is tested and ok and then it bombs on a windows machine.
Can someone tell me whether this is a program error or an error caused by too high an OC speed? Since not all the tasks I get error out, it seems more like a case of a bad program than the OC speed.
see below for a series of tasks that died part of the way through.
http://boinc.bakerlab.org/rosetta/result.php?resultid=215716365
cc2_1_8_native_cen_cst_hb_t311__IGNORE_THE_REST_2B5AA_7_5843_16_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
http://boinc.bakerlab.org/rosetta/result.php?resultid=215736070
Name t074_1_RDC_NMR_NESG_5568_92427_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 9133.313
stderr out
----------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215742498
Name 1wjbA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjbA-_5478_130_1
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 2.984375
stderr out
------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215811069
Name t073_1_RDC_NMR_NESG_5563_143956_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 12305.66
stderr out
---------------
http://boinc.bakerlab.org/rosetta/result.php?resultid=215833987
Name t073_1_RDC_NMR_NESG_5563_146392_0
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8922.172
stderr out
This makes 10 tasks in a days time that have died with the 0xc error. COME ON!
This ran to within 10 minutes of completion and died. Gees!
Then you insult me with no credit granted for a 99% completed task.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216155882
1g47A_BOINC_MPZN_vanilla_abrelax_5901_6856_0
Workunit 196996323
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 13796
stderr out
Well... The latest attempt to effectively utilize @home computers to further mankind in medical fields has reduced my last machine to a power-wasting room heater.
Just for the fun of it, go to a Rosetta server acquiring results from the last 2 versions and search for "Outcome Client error".
I'll check back after a few months to see if things are any better here.
<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process got signal 8
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 26914 seconds. Greater than 3X preferred time: 7200 seconds
**********************************************************************
called boinc_finish
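The watchdog message in the log above can be checked by hand. The sketch below assumes the rule is simply "CPU time greater than 3 times the preferred runtime", which is all the log message states; the function name and the `factor` parameter are hypothetical.

```python
def watchdog_should_end(cpu_seconds, preferred_seconds, factor=3):
    """Return True if CPU time exceeds factor * preferred runtime.

    Mirrors the wording of the stderr message only; the real watchdog
    may apply additional conditions that are not shown in the log.
    """
    return cpu_seconds > factor * preferred_seconds


# Values from the log: cpu_run_time_pref 7200, CPU time 26914 seconds.
limit = 3 * 7200  # 21600 seconds
triggered = watchdog_should_end(26914, 7200)
```

With a 7200-second preference the limit is 21600 seconds, so a task at 26914 seconds is about an hour and a half past it when the watchdog ends the run.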
your vanilla task died at 2hrs and 23 mins.
this makes about 12 failures now in 2 days.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216178144
1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 8912.25
stderr out
yet another one dies...what is going on? is it the program or my OC speed? this makes 12 in 2 days.
http://boinc.bakerlab.org/rosetta/result.php?resultid=216194755
Name t073_1_RDC_NMR_NESG_5563_176398_0
Workunit 197027384
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 25.375
stderr out
Chu Forum moderator Project administrator Project developer Project scientist Joined: Feb 23 06 Posts: 120 ID: 61076 Credit: 112,439 RAC: 4
Hi greg_be, this WU is one of my jobs and I just double-checked this sub-batch. So far about 9000 clients have returned results successfully, with a normal error rate. The fact that you have recently got the same error code from many different Rosetta@home workunits makes me think it is more likely due to some incompatible setup on your computer, though I don't know exactly what is causing it. Has this problem happened to you before?
Thanks for replying. I suspect it's my overclock speed. If you have that many clients returning good tasks, and I see the last one I posted went through on another client OK, I have to assume my speed is too high for these tasks on RAH.
I dropped the speed by 10 MHz to see if that corrects the problem; if it continues I will drop it some more until things become steady. As of a week ago I could run at the faster speed with no problems, but this week the majority die.
Normally when my speed is too high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran to within 10 minutes of completion and died.
Can you tell me how to tell the difference between an error due to Windows or OC speed and a program error that triggers a Windows dump with '-1073741819 (0xc0000005)'?
Thanks again for the reply.
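On the exit status itself: the two numbers in '-1073741819 (0xc0000005)' are the same 32-bit value, shown once as a signed integer and once in hex. The sketch below does that conversion. The helper names and the one-entry lookup table are illustrative; 0xC0000005 is the Windows status code for an access violation, which unstable hardware and a program bug can both produce, so the code alone does not say which one it was.

```python
# Hypothetical helper table with a single well-known Windows status code.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def to_unsigned_hex(exit_status):
    """Show a signed 32-bit exit status as unsigned hex, as BOINC prints it."""
    return "0x{:08x}".format(exit_status & 0xFFFFFFFF)


def describe(exit_status):
    """Look up a name for the status, or 'unknown' if it is not in the table."""
    return KNOWN_STATUS.get(exit_status & 0xFFFFFFFF, "unknown")


hex_code = to_unsigned_hex(-1073741819)
name = describe(-1073741819)
```

So the code only tells you that the process touched memory it should not have; working out why still needs the stderr backtrace or a pattern across hosts and batches.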
I have the same problem on 64 bit Win 2008 server only for all Minirosetta tasks. Minirosetta 1.45 had this problem too. All other PC (32bit, XP64bit) have no problem.
Zdenek
I think this could happen if the system is close to being okay but right on the edge, i.e. if your task had run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the RAM being pushed, the hard drive saying enough, the CPU sending that one bit of data too fast, etc. By backing off in 10 MHz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 MHz increments until the errors come back.
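The back-off-then-creep-up procedure suggested above is a coarse-then-fine search. Here is a sketch of it, with a big caveat: `is_stable` is a hypothetical callback, and on a real machine each call means running tasks for hours or days at that clock and watching for errors, not a quick function call.

```python
def find_stable_clock(start_mhz, is_stable, coarse_step=10, fine_step=1,
                      floor_mhz=0):
    """Back off in coarse steps until stable, then creep up in fine steps.

    is_stable(mhz) is an assumed test supplied by the caller.
    Returns the highest clock found stable, never above start_mhz.
    """
    clock = start_mhz
    # Coarse phase: drop 10 MHz at a time until the errors stop.
    while not is_stable(clock):
        clock -= coarse_step
        if clock < floor_mhz:
            raise ValueError("no stable clock found above the floor")
    # Fine phase: go back up 1 MHz at a time until the errors would return.
    while clock + fine_step <= start_mhz and is_stable(clock + fine_step):
        clock += fine_step
    return clock


# Toy example: pretend this machine is stable at or below 3195 MHz.
best = find_stable_clock(3200, lambda mhz: mhz <= 3195)
```

In the toy example the coarse phase lands on 3190 and the fine phase climbs back to 3195, which matches the "drop 10, bring back 5" result reported later in the thread.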
____________
I suspect you're right about the RAM frequency and the CPU. Probably just a bit too high for these tasks now. I might raise it by 5 MHz after tonight just to see what happens. My RAC is already low enough. I can't "afford" to take many more errors.
ID: 58137 | Rating: 0
Chu Forum moderator Project administrator Project developer Project scientist Joined: Feb 23 06 Posts: 120 ID: 61076 Credit: 112,439 RAC: 4
greg_be and all,
When there is a new minirosetta version, we usually put a Windows debug symbol image in a downloadable location, so when a WU crashes it should provide a backtrace showing how the error was caused (this does not work every time, which makes our debugging very hard). If the error comes from the Minirosetta program itself or from a bad command line or input file setup, stdout or stderr will usually print a message as a hint, for example the hbond NAN problem in previous versions. We should also see a significantly higher error rate among either all or certain batches of WUs. If it is caused by interaction with the host's hardware or software, we will usually see that certain client hosts keep encountering errors or failures. We wish we could tell what went wrong in every scenario, but most of us Rosetta developers are far from being experts on computer software/hardware, and we can only hope to trap errors locally on our testing machines to continue debugging.
Thank you all for volunteering to help us with this project, and sorry about any inconvenience or trouble caused on your computers. Please continue to report problems and any fixes you have found, as every bit of such information will help us improve R@H stability and resolve hidden bugs sooner or later. Happy holidays to everyone and happy crunching!
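The rule of thumb in this post (errors clustered in a batch point at the job, errors clustered on a host point at the machine) can be sketched as a simple grouping of results. The record layout, the field names `host`, `batch` and `outcome`, and the sample data are all made up for illustration; this is not how the project's database is queried.

```python
from collections import defaultdict


def error_rates(results, key):
    """Return the error fraction per group, grouping results by `key`."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        if r["outcome"] != "success":
            errors[r[key]] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Invented sample: host A fails on every batch, other hosts succeed.
sample = [
    {"host": "A", "batch": "t071", "outcome": "error"},
    {"host": "A", "batch": "t072", "outcome": "error"},
    {"host": "B", "batch": "t071", "outcome": "success"},
    {"host": "C", "batch": "t072", "outcome": "success"},
    {"host": "B", "batch": "t072", "outcome": "success"},
]
by_host = error_rates(sample, "host")
by_batch = error_rates(sample, "batch")
```

Here the per-host view singles out host A, while no batch stands out nearly as strongly, which is the pattern that suggests a host-side cause rather than a bad batch.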
I reduced the OC amount by 10 mhz and then brought it back up 5 mhz.
Everything seems stable now as I have run nearly a day without trouble since backing down. It would seem your program is more and more sensitive to tiny things that high OC rates create. In any case backing down the cpu OC speed a bit seems to have solved this issue.
thanks for taking the time to discuss this problem with me and the other person.
I had one WU crash on me today. Running on a WinXPSP3 Athlon X2 3800+ with 1Gb RAM. Link to task details.
216493218
Name 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_16326_0
Workunit 197297715
Created 23 Dec 2008 8:53:31 UTC
Sent 23 Dec 2008 9:33:56 UTC
Received 23 Dec 2008 22:08:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 625945
Report deadline 2 Jan 2009 9:33:56 UTC
CPU time 4928.609
stderr out
Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?
I am using version 6.4.5, on some of my pc's, and am not having any issues.
____________
I found this WU stalled after 15 hrs. I suspended the task and then reenabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shutdown in an unusual way.
STDERR OUT
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400
</stderr_txt>
]]>
I've been getting those C++ popups as well on multiple machine/OS configs; it seems that the affected CPU core then refuses to get work afterwards. This is a new event for me.
____________
robert, after dropping the OC 10 MHz and then bringing it back 5 MHz (total reduction 5 MHz) I have not had any further issues. So at least for my machine the errors were caused by OC'ing too far. This accounts for the huge number of failures I had. It would seem the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, you're not having the same problem I was. Also, I am on 6.4.5 after upgrading from the old version.
Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
Normally this is due to the last checkpoint set. It seems kind of odd that you would lose up to 4 hrs of work between checkpoints. It acts like it lost all the latest checkpoint data. It also looks like you're running a really old version of BOINC. You might want to update to the latest version.
Merry Christmas
Hi.
I have this task running at the moment, and it's odd. This morning when I restarted the system BOINC was showing 5hrs, 4mins completed; when the task got its turn to run it dropped back to 1hr, 33mins and showing 2 models, it would have done more
Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
Thanks for replying. I suspect its my over clock speed. If you have that many clients returning good tasks and I see the last one I posted went through to another client ok, I have to assume my speed is to high for these tasks on RAH.
I dropped the speed by 10 mhz to see if that corrects the problem, if it continues then I will drop it some more until things become steady. as of a week ago I could run at the faster speed with no problems. But this week the majority die.
Normally when my speed is to high, the tasks fail immediately. So I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died.
I think this could happen if the system is close to being okay but just on that edge. ie if your task has run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the ram being pushed, the hard drive saying enough, the cpu sending that one bit of data too fast etc etc. By backing off in 10 mhz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 mhz increments until the errors come back.
I suspect you're right about the RAM frequency and the CPU; probably just a bit too high for these tasks now. I might raise it by 5 MHz after tonight just to see what happens. My RAC is already low enough. I can't "afford" to take much more in errors.
Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?
Robert, after dropping the OC 10 MHz and then bringing it back 5 MHz (total reduction 5 MHz) I have not had any further issues. So at least for my machine the errors were caused by OC'ing too far; this accounts for the huge number of failures I had. It would seem that the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, you're not having the same problem as I was. Also, I am on 6.4.5 after upgrading from the old version.
dec 24 22.15 UTC - system is stable and RAC is slowly returning to normal.
Chu - thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. Probably would have got to that conclusion after a few more errors.
I have this task at the moment running, it's odd. This morning when i restarted
the ... task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
pete.
I have had that happen three times during the last 4 or 5 days. I didn't report it because technically
such actions are not prohibited. The tasks complete and grant credit.
However, I have set my task length to 2 hours for now,
and these tasks run well over that time.
NOTE: I have checkpoint logging turned on!
ALL TIMES APPROX.
4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0
3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0
3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0
NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!
We've got a new minirosetta version, with - you've guessed it - more bug fixes ! Woo!
Please report remaining issues here - that would be grand :)
Hello, I don't know if this is a bug AND I am not one to complain about receiving credit, however, I was very surprised to receive so much credit compared to claimed credit. Is the result below likely?
216467986
Name cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0
Workunit 197278592
Created 23 Dec 2008 6:24:21 UTC
Sent 23 Dec 2008 7:45:54 UTC
Received 24 Dec 2008 15:54:32 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 947263
Report deadline 2 Jan 2009 7:45:54 UTC
CPU time 5719.655
stderr out
CreateFile error 32 when trying set file time
failed to create shared mem segment
CreateSemaphore failure! Cannot create semaphore!
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
CreateFile error 32 when trying set file time
======================================================
DONE :: 1 starting structures 5719.56 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
]]>
Validate state Valid
Claimed credit 14.4476221738839
Granted credit 41.0260851670465
application version 1.47
____________
ID: 58167 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 362,889 RAC: 796
Hi.
I have this task at the moment running, it's odd. This morning when I restarted
the system BOINC was showing 5hrs, 4mins completed; when the task got its turn to
run it dropped back to 1hr, 33mins and showing 2 models, it would have done more
Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
pete.
Well still looks odd to me, ended up taking 7hrs, 11min plus the 3 and a half
hours lost on restarting. I have a six hour R/T set and it still only did 4 models.
See below.
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 25890.1 cpu seconds
This process generated 4 decoys from 4 attempts
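A quick bit of arithmetic on the figures in that stderr excerpt may help explain the overrun. The sketch below assumes, and this is only an assumption, that the app decides whether to start another model by checking elapsed CPU time against the run-time preference between models, and that every model takes about the same time. The function names are made up for the illustration.

```python
def seconds_per_decoy(cpu_seconds: float, decoys: int) -> float:
    """Average CPU seconds spent per completed decoy."""
    return cpu_seconds / decoys


def models_started(avg_seconds: float, preferred_seconds: float) -> int:
    """Count models under the ASSUMED rule: start another model only while
    elapsed CPU time is still below the preferred run time."""
    elapsed = 0.0
    count = 0
    while elapsed < preferred_seconds:
        elapsed += avg_seconds
        count += 1
    return count


avg = seconds_per_decoy(25890.1, 4)   # figures from the stderr excerpt above
print(f"average per decoy: {avg:.0f} s")
print(f"models under the assumed rule: {models_started(avg, 21600)}")
```

At roughly 1.8 hours per decoy, three models finish inside the 6 hour (21600 s) preference, so under this assumed rule a fourth one starts and carries the task past the target. That would be consistent with 4 decoys in about 7 hours, though it says nothing about the time lost on restart.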
I found this WU stalled after 15 hrs. I suspended the task and then re-enabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup reporting a C++ runtime error, saying that the application had asked to be shut down in an unusual way.
STDERR OUT
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400
</stderr_txt>
]]>
Had this WU this morning with the same error. It ran for 7 hours before stalling. Both are vanilla type. I still have one more of these in progress; it is currently at 21 hours and so far looks good.
ID: 58170 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 362,889 RAC: 796
Hi.
Here's another one doing strange things. When I shut down last night it had run for 6hrs, 30min and had done 18 models; when it restarted it went back to 5hrs, 26min and showing 18 models, and it then ran to 6hrs, 18min and still only 18 models!
Still odd, I haven't seen this before, and it's the same type of task.
Fri 26 Dec 2008 09:03:52 EST|rosetta@home|Restarting task cc_nonideal_1_3_nocst4_hb_t306__IGNORE_THE_REST_1AZVA_6_5992_27_0 using minirosetta version 147
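If you want to track these restarts without eyeballing the messages tab, a small parser along these lines might do. It is a sketch only: the `timestamp|project|message` layout and the "Restarting task ... using ... version ..." wording are inferred purely from the log lines pasted in this thread, not from any BOINC documentation, and `parse_restart` is a made-up helper name.

```python
import re
from typing import NamedTuple, Optional


class Restart(NamedTuple):
    timestamp: str   # kept as text; the exact date format is not assumed
    project: str
    task: str
    app: str
    version: int


# Message wording as seen in the pasted lines (an assumption beyond that).
_RESTART = re.compile(r"^Restarting task (\S+) using (\S+) version (\d+)$")


def parse_restart(line: str) -> Optional[Restart]:
    """Return a Restart record for a 'Restarting task' line, else None."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    match = _RESTART.match(message)
    if match is None:
        return None
    return Restart(timestamp, project, match.group(1),
                   match.group(2), int(match.group(3)))
```

Counting how often the same task name shows up in these records would give a rough picture of how many times a task was restarted, which could then be compared against the time it appears to lose.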
I am having much the same problems with stops, starts, incomprehensible progress (if any progress) reports, strange error reports, stalling, misrepresentation of time budgeting in the Tasks function and other weirdness.
Minirosetta v1.47 wastes too much time and steals processing time from other processing jobs that actually work.
I suspect that part of the problem is programmers and others being on Christmas break and not being available for problem solving.
As a result I have suspended Rosetta processing until at least January 3rd pending cleanup of the issues.
____________
ID: 58173 | Rating: 0
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
NOTE: I have checkpoint logging turned on!
ALL TIMES APPROX.
4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0
3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0
3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0
NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!
This is pointing to a problem with checkpointing in the FoldCst protocol. I'll put this high on the todo list for the 1.48 release.
The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible - what kind of machine was this on ?
The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible
That would make sense. Normally my WU run time is set to 4 hours.
- what kind of machine was this on ?
Compaq Presario 6029
AMD Athlon XP 2100 (1.7 GHz)
Windows XP Home ( BOINC v 6.2.19 )
RAM: 768 MB
VIDEO CARD: Radeon 9250 128MB
Dial-up: USRobotics Controller Modem
That is worse than the other mammoth task I had which had something like a 10 point difference. It also ran over my preferences of time. See long running tasks thread.
After previously clean runs of memtest86+ 2.10 and prime95 for Linux, I can no longer get decent results out of prime95, even though memtest86+ 2.10 still runs fine.
As you'd most likely expect I'm putting the errors below down to hardware !!
Don't know if it's the CPU or more likely the mainboard northbridge. Have a newer CPU on order to rule that out.
Have removed said machine from my "farm".
Cheers and Happy Christmas and a computational bug free New Year
CPU type GenuineIntel
Intel(R) Pentium(R) 4 CPU 2.60GHz [Family 15 Model 2 Stepping 9]
Number of CPUs 2
Operating System Linux
2.6.24-22-generic
I am having problems with another WU that ran fine up to 99 percent and then stalled. I let one run for 37 hours until the watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me: http://boinc.bakerlab.org/rosetta/result.php?resultid=216862173
I am debating cancelling the WU that is presently doing the same thing as wasting all that CPU time for 2 decoys seems like--well---a waste!!
http://boinc.bakerlab.org/rosetta/result.php?resultid=217161601
Where did it seem to get stalled at - about 10 minutes left to go? If so, that's what typically happens when a minirosetta workunit goes out with a serious underestimate of the time required to run it. When I had one like that, a few versions ago, I let it finish (in about 4 times the time I set as preference) and at least got some credit for it, but not much more than typical for workunits that actually finished in the estimated time. At about 10 minutes left to go, the estimated time calculations get messed up, but not the calculations leading to the desired results.
Hi Robert. Yeah----it stopped at about 10 minutes to go-----and stayed that way for 25 hours---lol. Watchdog terminated it.
I aborted another after 18 hours in. It was the same type protein as the first one. I have 2 more being crunched at the moment and am watching to see how they do after 12 hours in.
Task ID 216862173
Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_17673_0
Workunit 197639536
Created 25 Dec 2008 6:09:31 UTC
Sent 25 Dec 2008 7:37:31 UTC
Received 27 Dec 2008 5:01:41 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 948562
Report deadline 4 Jan 2009 7:37:31 UTC
CPU time 134234.2
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 43200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 134233 seconds. Greater than 3X preferred time: 43200 seconds
**********************************************************************
called boinc_finish
</stderr_txt>
]]>
Validate state Valid
Claimed credit 561.58588373264
Granted credit 117.029798631356
application version 1.47
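The watchdog message in that stderr spells out the rule it applied, so it can be written down in a couple of lines. This is a sketch based only on the wording of the message ("Greater than 3X preferred time"); how minirosetta actually implements its watchdog is not shown in this thread, and the function name is made up.

```python
def watchdog_should_end(cpu_seconds: float,
                        preferred_seconds: float,
                        factor: float = 3.0) -> bool:
    """True when CPU time exceeds factor x the preferred run time.

    The factor of 3 is taken from the stderr message above; treating it
    as a simple strict comparison is an assumption.
    """
    return cpu_seconds > factor * preferred_seconds


# Figures from the result above: 134233 s of CPU time, 43200 s preference.
print(watchdog_should_end(134233, 43200))
```

With a 12 hour (43200 s) preference the cutoff works out to 129600 s, or 36 hours, which fits the report of the task running about 37 hours before being ended.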
guys,
don't forget to also post this info in the "Report long-running models here" thread.
ID: 58195 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Somewhere below the question was raised if the "Lock file" error has been fixed. It has not. If you look at this Computer you can see that I have several.
It is not at all clear why this happened.
As you can see it is a 4 Core processor with HT giving 8 virtual processors and I know that at one point I had at least 4 tasks running at the same time. Could this be a concurrency problem? At any rate this is a new machine in the prime of its existence in that it is just over a week old. It is run 24/7 and I have been running about 6-8 projects on the machine and I am not seeing errors like this on other projects. Heck, even GPU Grid is running reasonably well ...
The log files do not record the start time of the processing so you cannot tell for sure if that is the problem here. I still have a few tasks to go and I will run them to completion and see if I get more of these errors in the remaining tasks I have.
I note that my Mac Pro, also with 8 processors has not had this error, but, the project loading on that computer is such that I can't recall an instance where I had more than one Rosetta task running at the same time.
Looking at my other computers, all are multi-processor with at least 4 CPUs and I cannot see this error on any of those machines. I have two tasks running on the i7 right now so I will see if they will die with a collision. the tasks are cc2_1_8_native_cen_cst_hb_t373 and cc2_1_8_native_fa_cst_hb_t373 ...
I have been ignoring Rosetta so I cannot say that I know what the alphabet soup that makes up the task id means (if anything) so I can't tell if there is something common in the actual tasks or not ...
I just find it disappointing that this error surfaced so late in processing. One would think that the error would surface immediately.
____________
ID: 58197 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Since my last post I have completed two tasks successfully on this machine. I have two more in the queue and they are running now. So, by the time you read this they should probably have run to completion or failure. Watching my 8 CPU systems for some time now I have noted that, in general, I never seem to have more than 2 Rosetta tasks running at the same time due to other projects.
On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?
Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...
____________
I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores.
Adding more physical memory also helps, but I had previously increased it to the limit of what my machine can handle (2 GB).
ID: 58215 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
According to my Task manager my peak was 3.9 G with limit 5G so, I did not even get close. I have 3G normal RAM (well, 6 actually, but XP can only "see" 3 G) so ...
Well, I will try to increase the swap file, but, have suspended work on this machine till the project says something... over half the tasks failed with this one error and I am still waiting to see what happens to the last task ... it has been running with 11 min to go for a couple hours now ... if the % Complete was not slowly rising I would have killed it by now ... the main reason I am letting it run is that curiosity overwhelms me as to if it is going to fail with the same error after eating up 10 or more hours of my time or not ...
Oh, man, this is worse... I had nearly 10 hours on the clock. Changed the memory settings to increase the possible size of the swap file (even though it had 2G never used) and after a reboot, the task ended with 8 hours clock time. It looks like it is valid ... but that tells me that I just wasted nearly 2 hours on a task that should have ended ...
{edit add} The tasks that ended badly *MAY* have all been suspended. I cannot say for sure whether they were or not. My setting for switching between tasks is 720 min (12 hours) to try to force most applications to finish before switching ... it is my way of trying to provide best results ... and with 4 plus cores it mostly works. But, I did notice that several of the Rosetta tasks did get suspended but I did not note which ones ... so more data to ponder if someone is actually going to look at this problem.{/edit} corrected time
____________
This task http://www.boinc.bakerlab.org/rosetta/result.php?resultid=217385249 is running on Vista Home Premium and shows no graphics, either in the screen saver or when I click Show graphics. When I close the graphics window it comes up as not responding and then gives you 3 options:
* Check for a solution & close the program
* Close the program
* Wait for the program to respond
I use Close the program. This task has been running with 10 minutes to go for almost an hour with 97.525% done; it's moving at roughly .07.5% per minute. Should I abort it?
It has finished; validate state is Initial. I've noticed that my recent tasks have been going into a pending state but they get credit quite quickly. Is this related to the cc2 jobs?
____________
Have a crunching good day!! Live in NZ y not join Smile City?
Has anyone seen any new Lockfile problems ? Or are these finally a thing of the past?
I've made a song and dance about this before, so I should report my situation again:
With Mini 1.45 and Boinc 6.2.19 I had 80% success with a 2 hour runtime, dropping to 55% success with a 3 hour runtime over 116 WUs.
Upgrading to Boinc 6.4.5 for a short while before Mini 1.47s came through I thought I noticed less of the lockfile problem, but they've edged out of my history now.
Of the last 103 WUs:
9 were Beta 5.98s - 100% success as usual
94 Mini 1.47 - 93 success, 1 Computation Error here: 217352482
Outcome Client error
Client state Compute error
Exit status 0 (0x0)
<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 10813.8 cpu seconds
This process generated 1904 decoys from 1904 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
I note some people are still getting problems, but mine seem to have completely gone, whether due to Boinc or the Mini WUs I don't know for sure, but I honestly don't care.
Excellent work, guys. Much appreciated here. Well done. This problem appeared for me along with this new machine in July and this is the first time I'm getting performance anything like this. My RAC has already increased by about 100 a day. I worried it was something I'd done.
____________
That is worse than the other mammoth task I had which had something like a 10 point difference. It also ran over my preferences of time. See long running tasks thread.
In different tasks I've had: 216878857 - CPU time 10076.6
Claimed credit 49.588655190211
Granted credit 100.839750703433
217129212 - CPU time 12904.09
Claimed credit 62.9250827866192
Granted credit 47.1981949319233
It varies. I wouldn't worry about it.
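To put numbers on "it varies", here is a small sketch that turns the claimed and granted figures quoted above into a ratio. It only does the division; it makes no claim about how the project actually decides granted credit, and `credit_ratio` is a made-up helper name.

```python
def credit_ratio(claimed: float, granted: float) -> float:
    """Granted credit as a multiple of claimed credit."""
    return granted / claimed


# (claimed, granted) pairs copied from the two results quoted above.
results = {
    "216878857": (49.588655190211, 100.839750703433),
    "217129212": (62.9250827866192, 47.1981949319233),
}

for task_id, (claimed, granted) in results.items():
    ratio = credit_ratio(claimed, granted)
    print(f"{task_id}: granted/claimed = {ratio:.2f}")
```

One of these tasks was granted about twice what it claimed and the other about three quarters, so on this machine the swings go both ways.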
____________
This makes 10 tasks in a day's time that have died with the 0xc error. COME ON!
This ran to within 10 minutes of completion and died. Geez!
Then you insult me with no credit granted for a 99% completed task.
Later...
dec 24 22.15 UTC - system is stable and RAC is slowly returning to normal.
Chu - thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. Probably would have got to that conclusion after a few more errors.
I must've missed the apology elsewhere in the thread. I'm sure it was there somewhere. But maybe not.
Literally a thankless task.
____________
ID: 58228 | Rating: 0
Hugh Miller Joined: Nov 2 05 Posts: 1 ID: 8255 Credit: 37,692 RAC: 0
I'm running:
BOINC 6.4.5
Rosetta Mini 1.47
on a machine with:
Win Vista Ultimate 64-bit SP1
Core Duo P8600 2.4GHz
4GB RAM
NVIDIA GeForce 9200M GS chipset, 256MB dedicated graphics memory
The screensaver behaves erratically. Sometimes it presents the familiar screen, other times it just goes white with a spinning cursor; if I hit ESC to exit, I get the errorbox reading:
minirosetta_graphics_1.20_windows_x86_64.exe is not responding
I have to bail manually from the screensaver at that point.
Once people sober up can you consider this scenario I've seen:
I glanced at my Boinc Manager earlier this evening and had one long-running WU at nearly 5 hours on a 3 hour run-time. A couple of hours later I noticed it had dropped back massively to just 19 minutes in (still the first model). It's done this again a few times since.
I upgraded to Boinc 6.4.5 a day or two before the Mini 1.47 WUs started coming through (mid-Dec), so I'm not sure which is responsible for this, but since the lockfile errors stopped crashing WUs out there have been several instances of WUs taking a long time with nothing at all reported in the manager's message tab, then finishing relatively early with no error message.
Am I imagining this or are others seeing the same thing? Without error messages I don't really know what to report, nor where to report it, but I'm sure it's happening.
I believe it happened with this completed WU and is currently happening with this in-progress WU. Both are cc2_1_8_mammoth_mix_fa_cst_hb jobs if that makes a difference.
I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:
I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can't waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.
Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.
____________
Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was OK.
What's with this task and its credit?
cc2_1_8_native_cen_cst_hb_t369__IGNORE_THE_REST_1RXQA_14_5863_202_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=218243427
I am running flat out CPU speed and produced 4 decoys in 11679.33 seconds with a setting of 14400 seconds, and it grants me UNDER the claimed credit.
Claimed credit 78.1755065660898
Granted credit 32.0937916886001
That's just unbelievable.
My frustration is rising again with bad credit granted and problems with downloads on your end, as well as the lousy credit for long running tasks.
It is like the project is at the bottom of a sine wave again.
No, I am running stock. I lowered my runtime to 1 hour (thus no switching of apps) and the 4 MR tasks that have completed all look like they will validate. Is there causation here? I don't know, but I would be interested to know.
It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...
Interesting that Win64 acts up for you. You're only 1 version of BOINC manager 'out of date', but that may or may not help. Leaving in memory, that's something the group always recommends. I don't really have any other ideas at the moment. Could someone else look at his tasks and see if they have any ideas why he's crashing?
AMD Turion Dual-Core RM-70 at stock speed: 2.0 GHz
Windows Vista SP1 32-bit.
Boinc 5.10.45 with throttling 40 %.
Didn't see any errors (before) on this machine after upgrading to minirosetta 1.45.
On their second run these tasks ran:
Successfully on a Mac,
had the same error on Windows Vista.
Interesting that Win64 acts up for you. Your only 1 version of boinc manager 'out of date', but that may or may not help. Leaving in memory, thats something the group always recommends. I don't really have any other idea's at the moment. Could someone else look at his tasks and see if they have any idea's why he's crashing?
@greg_be
No, I am running stock. I lowered my runtime to 1 hour (thus no switching of apps) and of the 4 completed MR that have completed, all look like they will validate. Is there causation here, idk, but I would be interested to know.
It seems like the 4 or 5 times that I have come back to Rosetta with this setup (64bit Vista) everything works well until the runtime is increased to greater than 1 hour. Perhaps I will increase the runtime but switch to "leave app in memory" to see if there is any change...
I've had a fairly consistent failure rate for the mini-Rosetta app on my 64bit Vista computer for several months now (hence the reason why it is rarely crunching here). I thought I saw some light at the end so I attached again yesterday only to find 3 more tasks that have failed. All have error code:
I do hope project staff will look into these. I would really like to get back over to ROSETTA on this machine but I can' waste the cycles without the fix. I can run some RALPH WU if this is needed to track it down. Also, all three WU had messages reporting that the "Output file was missing" prior to failure.
Edit Added: Paul Buck mentioned a few posts ago that his tasks that failed were possibly suspended and I know for a fact that the tasks that failed on my computer were indeed suspended and were not left in memory after the suspension.
Quick question: are you OC'd at all?
This looks like what I had when my OC speed was too high.
I lowered it and all was ok.
BOINC 6.4.5 is now available, which suggests that a few people found problems in BOINC 6.4.0 and more recent. I notice that all three of those workunits were the lr5_score12 type, which a few other people have been reporting having problems with. Note that some other threads indicate that Rosetta@home is likely to have problems supplying all the workunits that are requested for at least a few more hours, though.
I've had problems with one of the lr5_score12 workunits lately, but only after six workunits in a row that weren't the lr5_score12 type had completed successfully. Choosing the leave in memory option helps, especially if you also raise the upper limit on how much hard drive space BOINC can use and, at least for 32-bit Vista SP1, the upper limit on what fraction of the swap space BOINC can use.
Since then, another non-lr5_score12 workunit has completed on my machine successfully. Another lr5_score12 workunit is still running.
I'm using 14 hour workunits, but with 32-bit Vista, the leave in memory option, and enough other projects to ensure switching to another workunit a few times before these workunits complete.
My lr5_score12 workunit with an error gave an error message similar to yours, so I wouldn't be surprised if it's an error specific to that batch of workunits.
If you'd like to increase the workunit time, I've found that there's a setting for how long workunits can go before deciding whether to switch to another workunit, but I don't remember if Rosetta@home includes this in the settings you're allowed to change. I currently have it set to 2 hours between such decisions, though.
New error to report:
I am running an i7 CPU at 965 with 6 GB of memory and Kaspersky antivirus. Is there anything I can do to fix this problem?
1/4/2009 5:45:01 AM|rosetta@home|Sending scheduler request: To fetch work. Requesting 84480 seconds of work, reporting 0 completed tasks
1/4/2009 5:45:11 AM|rosetta@home|Scheduler request completed: got 7 new tasks
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:13 AM|rosetta@home|Started download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|Started download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:22 AM|rosetta@home|[error] expected 9e156df4c561be65533ceb64059254ab, got a500261b0525281e82d9c3166980820c
1/4/2009 5:45:22 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaat01_03_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Finished download of boinc_mfr_aaat01_09_05.200_v1_3.gz
1/4/2009 5:45:44 AM|rosetta@home|Started download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Finished download of AT01_.fasta
1/4/2009 5:45:45 AM|rosetta@home|Started download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Finished download of boinc_description_file.txt
1/4/2009 5:45:46 AM|rosetta@home|Started download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Finished download of AT01.pdb
1/4/2009 5:45:49 AM|rosetta@home|Started download of AT012.pdb
1/4/2009 5:45:51 AM|rosetta@home|Finished download of AT012.pdb
1/4/2009 5:45:53 AM|rosetta@home|Finished download of boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] MD5 check failed for boinc_mfr_aaAT01_09_05.200_v1_3.gz
1/4/2009 5:45:53 AM|rosetta@home|[error] expected 01275336f54af3e7ff7d41ae314e4f73, got 7cbad1935a58db3fe90e367e4d2f7daf
1/4/2009 5:45:53 AM|rosetta@home|[error] Checksum or signature error for boinc_mfr_aaAT01_09_05.200_v1_3.gz
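For anyone who wants to dig into a log like the one above, here is a minimal Python sketch, not anything BOINC itself provides. It pulls the file names out of the "MD5 check failed" messages, checks whether any names differ only by letter case (as boinc_mfr_aaat01_... and boinc_mfr_aaAT01_... do in this log), and computes a file's MD5 so it can be compared by hand with the "expected" value. The function names are made up for illustration, and the idea that a case-insensitive file system is mixing up the two downloads is only a guess that the log does not confirm.

```python
import hashlib
import re
from collections import defaultdict

# Matches the message text seen in the log above.
MD5_FAILURE = re.compile(r"\[error\] MD5 check failed for (\S+)")


def failed_files(log_lines):
    """Return the file names mentioned in 'MD5 check failed' messages."""
    names = []
    for line in log_lines:
        # Lines in the log above look like: timestamp|project|message
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        match = MD5_FAILURE.search(parts[2])
        if match:
            names.append(match.group(1))
    return names


def case_collisions(names):
    """Group names that are identical once letter case is ignored."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding every file name from the log to case_collisions would list the aaat01/aaAT01 pairs. If md5_of_file on a downloaded copy returns the "got" value rather than the "expected" one, the copy on disk is not the file the server described.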
If you run out of Rosetta@home workunits that haven't been completed and reported, you can select Rosetta@home in the Projects window of the Advanced view and click Reset project to make BOINC download all those files again.
Thanks for looking into this. I let rosetta run last night with increased runtimes and I left the application in memory but I see that 1 wu did fail: 218380754 for the same reason as before.
Also of note, there were 20 that failed because of client error while downloading--couldn't get input files, MD5 check failed: 218580846 for instance.
On this computer, I have set Rosetta to no new work and I had to abort the remaining wu's. I really want to attach here but the problems are far too severe at the moment. Perhaps I'll try again in 6 months, but I must say, this is getting a bit old...
____________
ID: 58480
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
The runtime should not directly affect the success of a task. But, since it will run more models, it increases the odds of you hitting a long-running model. So, running 5 models on 5 different 1 hour tasks should give you the same result as running 5 models on a single 5 hour task. But if 20% of the models are long-running, you would say that 100% of your 5hr tasks "fail", and only 20% of your 1hr tasks do.
But, with a 1 hour runtime preference, the watchdog will kick in much sooner. If the watchdog is set to 3 times normal, it would only allow a task to run for 3 hours, whereas with the longer 5 hour runtime above, it would go for up to a total of 15 hours before ending the task.
____________ Rosetta Moderator: Mod.Sense
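The arithmetic in the post above can be sketched in a few lines of Python. This is a rough illustration under assumptions the post does not state: each model is long-running independently with the same probability, and one long-running model is enough for a task to "fail". The function names are made up, and the watchdog multiplier of 3 is simply the figure quoted in the post.

```python
def p_task_hits_long_model(p_long, models_per_task):
    """Chance that at least one model in a task is long-running,
    assuming models are independent (an assumption, not a project fact)."""
    return 1.0 - (1.0 - p_long) ** models_per_task


def watchdog_limit_hours(runtime_pref_hours, multiplier=3):
    """Longest a task may run before the watchdog ends it,
    using the '3 times normal' figure from the post."""
    return runtime_pref_hours * multiplier


one_hour = p_task_hits_long_model(0.2, 1)    # one model per 1 hour task
five_hour = p_task_hits_long_model(0.2, 5)   # five models per 5 hour task
print(f"1hr task: {one_hour:.0%}, 5hr task: {five_hour:.0%}")
print(watchdog_limit_hours(1), watchdog_limit_hours(5))
```

Under these assumptions a 5 hour task hits a long-running model about 67% of the time rather than 100%. The moderator's point still holds: longer tasks look much worse even though the per-model odds are unchanged.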
Just for the fun of it I checked my desktop (AMD 4200+) for any errors; typically this one is and has been rock solid for years. Lo and behold, there was one error there that occurred in the past few hours, the same error as on my Vista laptop. So the error is not machine or CPU specific (AMD vs Intel, XP vs Vista); it has happened on each (as far as my setup goes, at least).
@Robert\sslickerson
I had loads of problems (can't acquire lockfile) with Vista64 until BOINC 6.4.5, at which point they disappeared completely. I also reduced my runtime to 2 hours for greater success with earlier versions. With 6.4.5 they seem to have gone. An upgrade may help you too.
That said, it hasn't solved any issues with exception errors, which I still get to a small extent (1 out of 93 when I investigated). All your problems seem to be of that type (many more than mine), so it may not solve your problems.
For what it's worth, I kept applications in memory, which I understand to be the best advice. Maybe you should try that too. Hope it helps you to some degree.
____________
ID: 58491
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Also check to see if processor usage is set to 100% ...
I saw a note on EaH that with Windows and the processor usage not set to 100% this is a common error. In that this killed about 20 models here for me ... I am interested if this is really the case ... I know Rosetta runs well on OS-X in that I have not had any failures there ...
On Win XP I got 10 failures out of about 20 tries ... which is when *I* gave up again on RaH ...
I had set usage to 99% to give me a little more head room and that may have been enough to farble things up ...
Anyone up for the test?
This is addressed to the "Can't acquire lock-file" problem only ...
____________
Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).
The problem is not at your end. If you have similar problems in the future, always check the server status. Right now there are problems on the other end, as you will see by the prominent red boxes.
____________
The work generation servers have been offline today (European time) for quite some time. No news from the team as to what is causing this outage. Keep an eye on the server status page to see when they come back online.