Minirosetta v1.47 bug thread.

Author	Message
Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58121 - Posted: 23 Dec 2008, 0:22:58 UTC Yet another one dies... what is going on? Is it the program or my OC speed? This makes 12 in 2 days. https://boinc.bakerlab.org/rosetta/result.php?resultid=216194755 Name t073_1_RDC_NMR_NESG_5563_176398_0 Workunit 197027384 Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 25.375 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400

Chu Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0	Message 58126 - Posted: 23 Dec 2008, 3:31:31 UTC - in response to Message 58119. Hi greg_be, this WU is one of my jobs and I just double-checked this sub-batch. So far about 9000 clients have returned results successfully, with a normal error rate. The fact that you have recently got the same error code from many different Rosetta@home workunits makes me think it is more likely due to some incompatible setup on your computer, though I don't know exactly what is causing it. Did this problem happen to you before? [Quoted from Message 58119:] your vanilla task died at 2hrs and 23 mins. this makes about 12 failures now in 2 days. https://boinc.bakerlab.org/rosetta/result.php?resultid=216178144 1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0 Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 8912.25 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58130 - Posted: 23 Dec 2008, 9:09:13 UTC - in response to Message 58126. Last modified: 23 Dec 2008, 9:11:41 UTC Chu, thanks for replying. I suspect it's my overclock speed. If you have that many clients returning good tasks, and I see the last one I posted went through on another client OK, I have to assume my speed is too high for these tasks on RAH. I dropped the speed by 10 MHz to see if that corrects the problem; if it continues, I will drop it some more until things become steady. As of a week ago I could run at the faster speed with no problems, but this week the majority die. Normally when my speed is too high, the tasks fail immediately, so I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran to within 10 mins of completion and died. Can you tell me how to tell the difference between an error due to Windows or OC speed and a program error that triggers a Windows dump with '-1073741819 (0xc0000005)'? Thanks again for the reply.
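
A note on reading the exit status that keeps appearing in this thread: BOINC shows it both as a signed decimal (-1073741819) and as hex (0xc0000005), and these are the same 32-bit value. 0xC0000005 is the Windows NTSTATUS code for an access violation. The short Python sketch below only shows the conversion and a lookup in a small, hand-picked table; the function names and the table are illustrative assumptions, not part of BOINC or Rosetta. Note that the code by itself names the kind of fault; it cannot say whether unstable hardware or a program bug triggered it.

```python
# Minimal sketch: reinterpret a signed 32-bit exit status as an NTSTATUS value.
# The helper names and the small lookup table are illustrative only.

# A few well-known NTSTATUS values (not an exhaustive list).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def to_unsigned_hex(exit_status: int) -> str:
    """Return the exit status as an unsigned 32-bit hex string."""
    return f"0x{exit_status & 0xFFFFFFFF:08x}"


def describe(exit_status: int) -> str:
    """Look up the symbolic name, or report that the table does not know it."""
    return KNOWN_NTSTATUS.get(exit_status & 0xFFFFFFFF, "unknown (not in table)")


if __name__ == "__main__":
    status = -1073741819  # value reported in the posts above
    print(to_unsigned_hex(status), describe(status))
```

An access violation means the process touched memory it was not allowed to. Both a software bug and marginal RAM/CPU settings can produce it, which is why the error rate across many hosts is needed to tell the two apart.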

HA-SOFT, s.r.o. Joined: 27 Jan 07 Posts: 10 Credit: 94,518,643 RAC: 0	Message 58132 - Posted: 23 Dec 2008, 9:48:28 UTC - in response to Message 58130. I have the same problem, only on a 64-bit Win 2008 server, for all Minirosetta tasks. Minirosetta 1.45 had this problem too. All my other PCs (32-bit, XP 64-bit) have no problem. Zdenek

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 3	Message 58134 - Posted: 23 Dec 2008, 11:17:31 UTC - in response to Message 58130. I think this could happen if the system is close to being okay but just on the edge, i.e. if your task had run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the RAM being pushed, the hard drive saying enough, the CPU sending that one bit of data too fast, etc. By backing off in 10 MHz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 MHz increments until the errors come back.
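
The back-off procedure mikey describes (step down in 10 MHz increments until the errors stop, then creep back up in 1 MHz increments until they return) can be sketched as a simple search. This is only an illustration in Python: `is_stable` is a hypothetical stand-in for "ran a good stretch of work units at this clock without a compute error", and the clock numbers in the usage example are made up, since no actual clock speeds are given in the thread.

```python
from typing import Callable


def find_stable_clock(
    start_mhz: int,
    is_stable: Callable[[int], bool],  # hypothetical: True if the clock ran error-free
    coarse_step: int = 10,
    fine_step: int = 1,
    floor_mhz: int = 0,
) -> int:
    """Step down coarsely until stable, then step up finely while still stable."""
    clock = start_mhz

    # Phase 1: back off in coarse increments until the errors stop.
    while not is_stable(clock):
        clock -= coarse_step
        if clock < floor_mhz:
            raise ValueError("no stable clock found above the floor")

    # Phase 2: creep back up in fine increments, never reaching the original
    # (failing) setting, and stop at the last clock that was still stable.
    while clock + fine_step < start_mhz and is_stable(clock + fine_step):
        clock += fine_step

    return clock


if __name__ == "__main__":
    # Made-up example: pretend anything at or below 2994 MHz is stable.
    print(find_stable_clock(3000, lambda mhz: mhz <= 2994))
```

In practice each `is_stable` check takes hours or days of crunching and the result is not perfectly repeatable, so a real search would want some safety margin below the highest clock that passed.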

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58137 - Posted: 23 Dec 2008, 11:39:54 UTC - in response to Message 58134. I suspect you're right about the RAM frequency and the CPU; probably just a bit too high for these tasks now. I might raise it by 5 MHz after tonight just to see what happens. My RAC is already low enough. I can't "afford" to take much more in errors.

Chu Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0	Message 58144 - Posted: 23 Dec 2008, 19:00:03 UTC - in response to Message 58132. greg_be and all, When there is a new version of minirosetta, we usually put a Windows debug symbol image in a downloadable location, so when a WU crashes it should provide a backtrace of how the error was caused (this does not work every time, and that makes our debugging very hard). If it is an error from the Minirosetta program or a bad command line/input file setup, stdout or stderr will usually print a message as a hint, for example the hbond NAN problem in previous versions. Also, we should see a significantly higher error rate among either all or certain batches of WUs. If it is caused by the interface with the host's hardware or software, we will usually see that certain client hosts keep encountering errors or failures. We wish we could tell what has gone wrong every time an error occurs; however, most of us Rosetta developers are far from being experts on computer software/hardware, and we can only hope to trap errors locally on our testing machines to continue debugging. Thank you all for volunteering to help us with this project, and sorry about any inconvenience or trouble caused on your computer. Please continue to report problems and/or possible fixes you have found, as every bit of such information will certainly help us improve R@H stability and resolve hidden bugs/problems sooner or later. Happy holidays to everyone and happy crunching!
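
Chu's rule of thumb (a program or batch problem shows up as a raised error rate across the whole batch, while a host problem shows up as one machine failing across different batches) can be written down as a small triage function. The Python sketch below is purely illustrative: the record layout `(host, batch, succeeded)`, the function names, and both thresholds are assumptions made for the example, not anything the project is known to use.

```python
from collections import defaultdict

# Each result is (host_id, batch_name, succeeded). This layout is hypothetical.


def error_rates(results, key_index):
    """Return {key: fraction of failed results}, grouped by host (0) or batch (1)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in results:
        key = record[key_index]
        totals[key] += 1
        if not record[2]:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def triage(results, host, batch, batch_threshold=0.2, host_threshold=0.5):
    """Guess where to look first; both thresholds are arbitrary illustration values."""
    batch_rate = error_rates(results, 1).get(batch, 0.0)
    host_rate = error_rates(results, 0).get(host, 0.0)
    if batch_rate >= batch_threshold:
        return "suspect application or batch setup"
    if host_rate >= host_threshold:
        return "suspect host hardware or software"
    return "inconclusive"


if __name__ == "__main__":
    # Made-up data: 99 hosts succeed on batch A, one host fails on A and on B.
    data = [(f"host{i}", "A", True) for i in range(99)]
    data += [("greg", "A", False), ("greg", "B", False)]
    print(triage(data, host="greg", batch="A"))
```

The ordering matters: a bad batch would also push up every host's error rate, so the batch-wide rate is checked before blaming an individual machine.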

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58145 - Posted: 23 Dec 2008, 19:44:39 UTC Chu, I reduced the OC amount by 10 MHz and then brought it back up 5 MHz. Everything seems stable now, as I have run nearly a day without trouble since backing down. It would seem your program is more and more sensitive to the tiny things that high OC rates create. In any case, backing down the CPU OC speed a bit seems to have solved this issue. Thanks for taking the time to discuss this problem with me and the other person.

staffann Joined: 7 Oct 07 Posts: 7 Credit: 69,937 RAC: 0	Message 58146 - Posted: 23 Dec 2008, 22:00:59 UTC I had one WU crash on me today. Running on a WinXPSP3 Athlon X2 3800+ with 1Gb RAM. Link to task details. 216493218 Name 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_16326_0 Workunit 197297715 Created 23 Dec 2008 8:53:31 UTC Sent 23 Dec 2008 9:33:56 UTC Received 23 Dec 2008 22:08:04 UTC Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 625945 Report deadline 2 Jan 2009 9:33:56 UTC CPU time 4928.609 stderr out <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt>

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 58150 - Posted: 24 Dec 2008, 2:20:48 UTC - in response to Message 58137. Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?

robertmiles Joined: 16 Jun 08 Posts: 1265 Credit: 14,424,358 RAC: 0	Message 58151 - Posted: 24 Dec 2008, 2:21:13 UTC - in response to Message 58137. Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results?

(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0	Message 58154 - Posted: 24 Dec 2008, 6:35:31 UTC Exit status -1073741819 (0xc0000005) https://boinc.bakerlab.org/rosetta/result.php?resultid=214936635 https://boinc.bakerlab.org/rosetta/result.php?resultid=216341024 https://boinc.bakerlab.org/rosetta/result.php?resultid=215006649 https://boinc.bakerlab.org/rosetta/result.php?resultid=214872151 Exit status 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=212896182

mikey Joined: 5 Jan 06 Posts: 1900 Credit: 12,902,147 RAC: 3	Message 58156 - Posted: 24 Dec 2008, 12:50:29 UTC - in response to Message 58150. I am using version 6.4.5 on some of my PCs and am not having any issues.

Dalton Joined: 30 Nov 05 Posts: 2 Credit: 27,777,725 RAC: 0	Message 58158 - Posted: 24 Dec 2008, 14:04:03 UTC - in response to Message 58157. [Quoted from Message 58157:] I found this WU stalled after 15 hrs. I suspended the task and then reenabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shutdown in an unusual way. STDERR OUT <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> The system cannot find the path specified. (0x3) - exit code 3 (0x3) </message> <stderr_txt> # cpu_run_time_pref: 86400 </stderr_txt> ]]> [Reply:] I've been getting those C++ popups as well on multiple machine/OS configs; it seems as if that core on the CPU then refuses to get work after that. This is a new event for me.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58159 - Posted: 24 Dec 2008, 14:18:26 UTC - in response to Message 58151. Robert, after dropping the OC 10 MHz and then bringing it back 5 MHz (total reduction 5 MHz) I have not had any further issues, so at least for my machine the errors were caused by OC'ing too far. This accounts for the huge number of failures I had. It would seem that the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, you're not having the same problem as I was. Also, I am on 6.4.5 after upgrading from the old version.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58160 - Posted: 24 Dec 2008, 14:20:59 UTC - in response to Message 58154. kodak, that looks similar to the rash of broken tasks I had. Are you OC'd at all?

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 58163 - Posted: 24 Dec 2008, 21:59:14 UTC Hi. I have this task running at the moment, and it's odd. This morning when I restarted the system, BOINC was showing 5 hrs 4 mins completed; when the task got its turn to run, it dropped back to 1 hr 33 mins and was showing 2 models. It would have done more than two in the five hours! https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513 Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147 pete.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58164 - Posted: 24 Dec 2008, 22:12:02 UTC - in response to Message 58163. Normally this is due to the last checkpoint set. It seems kind of odd that you would lose up to 4 hrs of work between checkpoints; it acts like it lost all the latest checkpoint data. It also looks like you're running a really old version of BOINC. You might want to update to the latest version. Merry Christmas
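
To put a number on the rollback pete reported: the task showed 5 hrs 4 mins before the restart and 1 hr 33 mins after it resumed. Assuming the resumed figure reflects the last checkpoint that was actually saved, the difference is the work that was redone. A minimal Python sketch of that arithmetic (the function name is made up for the example):

```python
from datetime import timedelta


def lost_work(shown_before_restart: timedelta, shown_after_restart: timedelta) -> timedelta:
    """CPU time that has to be redone after resuming from the last saved checkpoint."""
    # Clamp at zero in case the displayed time did not go backwards at all.
    return max(shown_before_restart - shown_after_restart, timedelta(0))


if __name__ == "__main__":
    before = timedelta(hours=5, minutes=4)   # shown before the restart
    after = timedelta(hours=1, minutes=33)   # shown once the task resumed
    print(lost_work(before, after))          # 3:31:00
```

That is about three and a half hours with no usable checkpoint, in line with the "up to 4 hrs" estimate above.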

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 58165 - Posted: 24 Dec 2008, 22:17:22 UTC - in response to Message 58159. Dec 24, 22:15 UTC - system is stable and RAC is slowly returning to normal. Chu, thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. I probably would have got to that conclusion after a few more errors.

stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0	Message 58166 - Posted: 24 Dec 2008, 23:12:52 UTC - in response to Message 58163. I have had that happen three times during the last 4 or 5 days. I didn't report it because technically such actions are not prohibited: the tasks complete and grant credit. However, I have set my task length to 2 hours for now, and these tasks run well over that time. NOTE: I have checkpoint logging turned on! ALL TIMES APPROX. 4 hours with no checkpoints after 40 min: cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0; 3.5 hours with no checkpoints after 35 min: cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0; 3 hours with no checkpoints after 50 min: cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0. NOTE: On the last WU I noticed that when I restarted the task, well into the no-checkpointing period, checkpointing restarted for a short period of time!