Message boards : Number crunching : Minirosetta v1.47 bug thread.
Author | Message |
---|---|
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Yet another one dies... what is going on? Is it the program or my OC speed? This makes 12 in 2 days. https://boinc.bakerlab.org/rosetta/result.php?resultid=216194755 Name t073_1_RDC_NMR_NESG_5563_176398_0 Workunit 197027384 Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 25.375 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 |
Chu Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Hi greg_be, this WU is one of my jobs and I just double-checked this sub-batch. So far about 9000 clients have returned results successfully, with a normal error rate. The fact that you have recently gotten the same error code from many different Rosetta@home workunits makes me think it is more likely due to some incompatible setup on your computer, though I don't know exactly what is causing it. Did this problem happen to you before? Your vanilla task died at 2 hrs and 23 mins. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Chu, Thanks for replying. I suspect it's my overclock speed. Since you have that many clients returning good tasks, and I see the last one I posted went through on another client OK, I have to assume my speed is too high for these tasks on RAH. I dropped the speed by 10 MHz to see if that corrects the problem; if it continues, I will drop it some more until things become steady. As of a week ago I could run at the faster speed with no problems, but this week the majority die. Normally when my speed is too high, the tasks fail immediately, so I don't understand how a task can run eight thousand seconds and then crash. I had another one that ran up to 10 mins of completion and died. Can you tell me how to see the difference between an error due to Windows or OC speed vs. a program error that triggers a Windows dump with '-1073741819 (0xc0000005)'? Thanks again for the reply. |
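For anyone puzzling over the two forms of the exit status: the negative number and the hex value are the same 32-bit pattern, and 0xC0000005 is the Windows NTSTATUS code for an access violation. Here is a minimal Python sketch of the conversion. The function names and the one-entry lookup table are made up for illustration. Note that the code by itself does not say whether the bad memory access came from a program bug or from unstable hardware, which is why it cannot answer the question above on its own.

```python
# Hypothetical helpers for reading BOINC-style exit statuses on Windows.

# Deliberately tiny lookup table; only the value seen in this thread is listed.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def describe(exit_code: int) -> str:
    """Return the symbolic name for the exit code if it is in the table."""
    return KNOWN_NTSTATUS.get(exit_code & 0xFFFFFFFF, "unknown")


if __name__ == "__main__":
    code = -1073741819
    print(to_unsigned_hex(code), describe(code))
```

An access violation only means the process touched memory it was not allowed to touch. A bad pointer in the application and a flipped bit from an unstable overclock can both end up there.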
HA-SOFT, s.r.o. Joined: 27 Jan 07 Posts: 10 Credit: 94,518,643 RAC: 0 |
I have the same problem, but only on 64-bit Win 2008 Server, for all Minirosetta tasks. Minirosetta 1.45 had this problem too. All other PCs (32-bit, XP 64-bit) have no problem. Zdenek |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
I think this could happen if the system is close to being okay but just on that edge, i.e., if your task had run for only 7 thousand seconds it would have completed just fine, but that little extra time pushed it over the edge. This could come from the RAM being pushed, the hard drive saying enough, the CPU sending that one bit of data too fast, etc. By backing off in 10 MHz increments I think you will find the solution fairly quickly. Then you could even go back up in 1 MHz increments until the errors come back. |
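The back-off procedure described here (coarse steps down until stable, then fine steps back up until errors return) can be sketched as a small search. This is only an illustration: `is_stable` is a hypothetical callback standing in for "run tasks at this clock for a day and see if any fail", and the clock values in the example are invented.

```python
def find_stable_clock(is_stable, start_mhz, coarse=10, fine=1, floor_mhz=0):
    """Back off in coarse steps until stable, then creep up in fine steps.

    is_stable: hypothetical callable taking a clock in MHz and returning
               True if tasks complete without errors at that clock.
    Returns the highest clock found to be stable, never above start_mhz.
    """
    freq = start_mhz

    # Coarse phase: drop the clock until tasks stop failing.
    while not is_stable(freq):
        if freq - coarse < floor_mhz:
            raise ValueError("no stable clock found above the floor")
        freq -= coarse

    # Fine phase: raise the clock until the errors come back,
    # keeping the last value that was still stable.
    while freq + fine <= start_mhz and is_stable(freq + fine):
        freq += fine

    return freq


if __name__ == "__main__":
    # Invented example: pretend anything above 2793 MHz produces errors.
    print(find_stable_clock(lambda mhz: mhz <= 2793, start_mhz=2800))
```

In practice each `is_stable` check costs hours of crunching, so the coarse step keeps the number of checks small and the fine step only runs over the last 10 MHz.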
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I suspect you're right about the RAM frequency and the CPU. Probably just a bit too high for these tasks now. I might raise it by 5 MHz after tonight just to see what happens. My RAC is already low enough. I can't "afford" to take much more in errors. |
Chu Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
greg_be and all, When there is a new version of minirosetta, we usually put a Windows debug symbol image in a downloadable location. So when a WU crashes, it should provide a backtrace of how the error was caused (this does not work every time, and that makes our debugging very hard). If it is an error from the Minirosetta program or a bad command line/input file setup, stdout or stderr will usually print a message as a hint, for example, the hbond NAN problem in the previous versions. Also, we should see a significantly higher error rate among either all or certain batches of WUs running. If it is caused by interfacing with the host's hardware or software, we will usually see that certain client hosts keep encountering errors or failures. We wish we could tell what has gone wrong in every scenario when an error occurs; however, most of us Rosetta developers are far from being experts on computer software/hardware, and we can only hope to trap errors locally on our testing machines to continue with debugging. Thank you all for voluntarily helping us with this project, and sorry about any inconvenience/trouble caused on your computers. Please continue to report problems and/or possible fixes you have found, as every bit of such information will certainly help us to improve R@H stability and resolve hidden bugs/problems sooner or later. Happy holidays to everyone and happy crunching! |
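The rule of thumb in this post (a program or input problem shows up as an elevated error rate across a batch, while a host problem shows up as the same few hosts failing repeatedly) can be sketched roughly as below. This is not the project's actual tooling. The record fields `host`, `batch` and `outcome` and the 5x threshold are assumptions made up for the example.

```python
from collections import defaultdict


def error_rates(results, key):
    """Error rate per distinct value of `key` (e.g. per host or per batch)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in results:
        totals[record[key]] += 1
        if record["outcome"] != "success":
            errors[record[key]] += 1
    return {k: errors[k] / totals[k] for k in totals}


def outliers(rates, baseline, factor=5.0):
    """Keys whose error rate is well above the overall baseline rate."""
    return sorted(k for k, rate in rates.items() if rate > baseline * factor)


if __name__ == "__main__":
    # Invented data: one host fails often, two others never fail.
    results = (
        [{"host": "h1", "batch": "b1", "outcome": "error"}] * 6
        + [{"host": "h1", "batch": "b1", "outcome": "success"}] * 4
        + [{"host": "h2", "batch": "b1", "outcome": "success"}] * 45
        + [{"host": "h3", "batch": "b1", "outcome": "success"}] * 45
    )
    failed = sum(r["outcome"] != "success" for r in results)
    baseline = failed / len(results)
    print("suspect hosts:", outliers(error_rates(results, "host"), baseline))
    print("suspect batches:", outliers(error_rates(results, "batch"), baseline))
```

With this invented data only the host view flags anything, which matches the pattern described for a host-side cause.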
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Chu, I reduced the OC amount by 10 MHz and then brought it back up 5 MHz. Everything seems stable now, as I have run nearly a day without trouble since backing down. It would seem your program is more and more sensitive to the tiny things that high OC rates create. In any case, backing down the CPU OC speed a bit seems to have solved this issue. Thanks for taking the time to discuss this problem with me and the other person. |
staffann Joined: 7 Oct 07 Posts: 7 Credit: 69,937 RAC: 0 |
I had one WU crash on me today. Running on a WinXP SP3 Athlon X2 3800+ with 1 GB RAM. Link to task details. 216493218 Name 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_16326_0 Workunit 197297715 Created 23 Dec 2008 8:53:31 UTC Sent 23 Dec 2008 9:33:56 UTC Received 23 Dec 2008 22:08:04 UTC Server state Over Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 625945 Report deadline 2 Jan 2009 9:33:56 UTC CPU time 4928.609 stderr out <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Do you think it could be a problem in BOINC 6.4.5 instead? Chu, could you check how many machines running workunits from that batch under BOINC 6.4.5 on similar hardware have returned successful results? |
(_KoDAk_) Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
Exit status -1073741819 (0xc0000005) https://boinc.bakerlab.org/rosetta/result.php?resultid=214936635 https://boinc.bakerlab.org/rosetta/result.php?resultid=216341024 https://boinc.bakerlab.org/rosetta/result.php?resultid=215006649 https://boinc.bakerlab.org/rosetta/result.php?resultid=214872151 Exit status 1 (0x1) https://boinc.bakerlab.org/rosetta/result.php?resultid=212896182 |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,450 |
I am using version 6.4.5 on some of my PCs and am not having any issues. |
Dalton Joined: 30 Nov 05 Posts: 2 Credit: 27,777,725 RAC: 0 |
I found this WU stalled after 15 hrs. I suspended the task and then re-enabled it later. After it started again, it stalled at the same point. I looked at the box and it had a popup saying that it had a C++ runtime error that had asked to be shut down in an unusual way. I've been getting those C++ popups as well, on multiple machine/OS configs; it seems that the affected CPU core then refuses to get work after that. This is a new event for me. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
robert, after dropping the OC 10 MHz and then bringing it back 5 MHz (total reduction 5 MHz), I have not had any further issues. So at least for my machine, the errors were caused by OC'ing too far. This accounts for the huge number of failures I had. It would seem the new mini is even more sensitive than 1.45 to whatever signals OC'ing produces. For those who get 1 failure in 20 tasks, you're not having the same problem I was. Also, I am on 6.4.5 after upgrading from the old version. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Exit status -1073741819 (0xc0000005) Kodak, that looks similar to the rash of broken tasks I had. Are you OC'd at all? |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. I have this task running at the moment, and it's odd. This morning when I restarted the system, BOINC was showing 5 hrs, 4 mins completed; when the task got its turn to run, it dropped back to 1 hr, 33 mins and showed 2 models. It would have done more than two in the five hours! https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513 Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147 pete. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Normally this is due to where the last checkpoint was set. It seems kind of odd that you would lose up to 4 hrs of work between checkpoints. It acts like it lost all the latest checkpoint data. It also looks like you're running a really old version of BOINC. You might want to update to the latest version. Merry Christmas |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Dec 24 22:15 UTC - system is stable and RAC is slowly returning to normal. Chu - thanks for taking the time to look into the average return of the various tasks you sent out. It was definitely a case of too much OC and no way to verify it. I probably would have got to that conclusion after a few more errors. |
stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
I have had that happen three times during the last 4 or 5 days. I didn't report it because technically such actions are not prohibited. The tasks complete and grant credit. However, I have set my task length to 2 hours for now, and these tasks run well over that time. NOTE: I have checkpoint logging turned on! ALL TIMES APPROX. 4 hours with no checkpoints after 40 min: cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0. 3.5 hours with no checkpoints after 35 min: cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0. 3 hours with no checkpoints after 50 min: cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0. NOTE: On the last WU I noticed that when I restarted the task, well into the no-checkpointing period, checkpointing restarted for a short period of time! |
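To put rough numbers on the checkpoint reports in this thread, here is a small Python sketch. It assumes checkpoint times have already been pulled out of the log by hand as minutes of run time. No particular BOINC log format is parsed, and both function names are invented for the example. It also assumes that a restarted task resumes exactly from its last checkpoint.

```python
def longest_checkpoint_gap(checkpoint_minutes, now_minutes):
    """Longest stretch of run time (in minutes) with no checkpoint written."""
    times = [0.0] + sorted(checkpoint_minutes) + [now_minutes]
    return max(later - earlier for earlier, later in zip(times, times[1:]))


def work_lost_on_restart(checkpoint_minutes, now_minutes):
    """Run time (in minutes) thrown away if the task restarts right now."""
    last_checkpoint = max(checkpoint_minutes, default=0.0)
    return now_minutes - last_checkpoint


if __name__ == "__main__":
    # First task above: checkpoints stop at about 40 min, then roughly
    # 4 more hours pass. The 10-minute spacing before that is invented.
    print(longest_checkpoint_gap([10, 20, 30, 40], now_minutes=280))

    # Earlier report: 5 h 4 min shown before restart, 1 h 33 min after.
    print(work_lost_on_restart([93], now_minutes=5 * 60 + 4))
```

Under those assumptions the earlier report works out to 211 minutes, about 3.5 hours of lost work, which is in line with the "up to 4 hrs" estimate given in reply.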
©2024 University of Washington
https://www.bakerlab.org