Message boards : Number crunching : Minirosetta v1.47 bug thread.
Author | Message |
---|---|
stewjack Send message Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
Hi. I have had that happen three times during the last 4 or 5 days. I didn't report it because technically such actions are not prohibited. The tasks complete and grant credit. However, I have set my task length to 2 hours for now, and these tasks run well over that time. NOTE: I have checkpoint logging turned on! ALL TIMES APPROX. 4 hours with no checkpoints after 40 min: cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0. 3.5 hours with no checkpoints after 35 min: cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0. 3 hours with no checkpoints after 50 min: cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0. NOTE: On the last WU I noticed that when I restarted the task, well into the no-checkpointing period, checkpointing restarted for a short period of time! |
Send message Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
HoHo kids! Hello, I don't know if this is a bug AND I am not one to complain about receiving credit, however, I was very surprised to receive so much credit compared to claimed credit. Is the result below likely? 216467986 Name cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0 Workunit 197278592 Created 23 Dec 2008 6:24:21 UTC Sent 23 Dec 2008 7:45:54 UTC Received 24 Dec 2008 15:54:32 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 947263 Report deadline 2 Jan 2009 7:45:54 UTC CPU time 5719.655 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying 
set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time ====================================================== DONE :: 1 starting structures 5719.56 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 14.4476221738839 Granted credit 41.0260851670465 application version 1.47 |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Well, it still looks odd to me: it ended up taking 7 hrs, 11 min, plus the 3 and a half hours lost on restarting. I have a six hour R/T set and it still only did 4 models. See below. # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 25890.1 cpu seconds This process generated 4 decoys from 4 attempts |
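For anyone comparing runs, the summary at the end of stderr is enough to work out the average CPU time per model. Here is a minimal Python sketch, assuming the summary always has the wording seen in the logs posted in this thread; the regex and the function name `seconds_per_decoy` are my own illustration, not part of BOINC or minirosetta.

```python
import re

# Matches the end-of-run summary seen in the stderr output posted in this
# thread, e.g. "DONE :: 1 starting structures 25890.1 cpu seconds
# This process generated 4 decoys from 4 attempts".
# Assumption: the wording is stable; this pattern is illustrative only.
SUMMARY_RE = re.compile(
    r"DONE :: \d+ starting structures\s+(?P<cpu>\d+(?:\.\d+)?) cpu seconds\s+"
    r"This process generated\s+(?P<decoys>\d+) decoys from\s+(?P<attempts>\d+) attempts"
)


def seconds_per_decoy(stderr_text):
    """Return average CPU seconds per decoy, or None if no usable summary is found."""
    match = SUMMARY_RE.search(stderr_text)
    if match is None:
        return None
    decoys = int(match.group("decoys"))
    if decoys == 0:
        return None  # no finished decoy, so an average is undefined
    return float(match.group("cpu")) / decoys


if __name__ == "__main__":
    sample = (
        "DONE :: 1 starting structures 25890.1 cpu seconds "
        "This process generated 4 decoys from 4 attempts"
    )
    average = seconds_per_decoy(sample)
    print(f"{average:.1f} CPU seconds per decoy ({average / 3600:.2f} hours)")
```

With the figures from the post above, that works out to a bit under two hours per model, so four models in a six hour run time setting is roughly what the arithmetic predicts.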
DaveSun Send message Joined: 3 May 07 Posts: 5 Credit: 200,480 RAC: 0 |
I found this WU stalled after 15 hrs. I suspended the task and then re-enabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup saying that a C++ runtime error had occurred and that the application had asked to be shut down in an unusual way. I had this WU this morning with the same error. It ran for 7 hours before stalling. Both are vanilla type. I still have one more of these in progress; it is currently at 21 hours and so far looks good. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Here's another one doing strange things. When I shut down last night it had run for 6 hrs, 30 min and had done 18 models. When it restarted it went back to 5 hrs, 26 min, still showing 18 models; it then ran to 6 hrs, 18 min and still had only 18 models! Still odd; I haven't seen this before. It is the same type of task. Fri 26 Dec 2008 09:03:52 EST|rosetta@home|Restarting task cc_nonideal_1_3_nocst4_hb_t306__IGNORE_THE_REST_1AZVA_6_5992_27_0 using minirosetta version 147 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197386767 # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 22718.4 cpu seconds This process generated 18 decoys from 18 attempts ====================================================== pete. |
Send message Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
I am having much the same problems with stops, starts, incomprehensible progress (if any progress) reports, strange error reports, stalling, misrepresentation of time budgeting in the Tasks function and other weirdness. Minirosetta v1.47 wastes too much time and steals processing time from other processing jobs that actually work. I suspect that part of the problem is programmers and others being on Christmas break and not being available for problem solving. As a result I have suspended Rosetta processing until at least January 3rd pending cleanup of the issues. |
Mike Tyka Send message Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
This is pointing to a problem with checkpointing in the FoldCst protocol. I'll put this high on the todo list for the 1.48 release. The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible. What kind of machine was this on? Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
stewjack Send message Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
That would make sense. Normally my WU run time is set to 4 hours.
Compaq Presario 6029 AMD Athlon XP 2100 (1.7 GHz) Windows XP Home (BOINC v 6.2.19) RAM: 768 MB VIDEO CARD: Radeon 9250 128MB Dial-up: USRobotics Controller Modem |
Send message Joined: 30 May 06 Posts: 5652 Credit: 5,622,096 RAC: 0 |
Serious credit issue here: cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0 Claimed credit 106.166115188458 Granted credit 74.8691857584611 That is worse than the other mammoth task I had, which had something like a 10 point difference. It also ran over my time preference. See the long-running tasks thread. |
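To put the credit complaints in this thread on one scale, it helps to look at granted credit as a fraction of claimed credit. A small Python sketch follows, using only figures already posted here; the helper `credit_ratio` is a name made up for illustration and says nothing about how the project actually computes granted credit.

```python
def credit_ratio(claimed, granted):
    """Granted credit divided by claimed credit (1.0 means the two match)."""
    if claimed <= 0:
        raise ValueError("claimed credit must be positive")
    return granted / claimed


# (claimed, granted) pairs copied from results posted earlier in this thread.
REPORTED = {
    "cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0":
        (14.4476221738839, 41.0260851670465),
    "cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0":
        (106.166115188458, 74.8691857584611),
}

if __name__ == "__main__":
    for name, (claimed, granted) in REPORTED.items():
        print(f"{name}: granted/claimed = {credit_ratio(claimed, granted):.2f}")
```

A ratio well above 1 (the first task) and one well below 1 (the mammoth task) in the same application version suggests the spread goes both ways rather than every task being short-changed.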
Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
After earlier clean runs of memtest86+ 2.10 and prime95 for Linux, I can no longer get decent results out of prime95, even though memtest86+ 2.10 still runs fine. As you'd most likely expect, I'm putting the errors below down to hardware!! I don't know if it's the CPU or, more likely, the mainboard northbridge. I have a newer CPU on order to rule that out, and have removed said machine from my "farm". Cheers, Happy Christmas and a computational-bug-free New Year. CPU type GenuineIntel |
Rifleman Send message Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
I am having problems with another WU that ran fine up to 99 percent and then stalled. I let one run for 37 hours until the watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me: https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173 I am debating cancelling the WU that is presently doing the same thing, as wasting all that CPU time for 2 decoys seems like--well---a waste!! https://boinc.bakerlab.org/rosetta/result.php?resultid=217161601 |
Send message Joined: 16 Jun 08 Posts: 1219 Credit: 13,532,238 RAC: 94 |
"I am having problems with another WU that ran fine up to 99 percent and gets stalled. I let one run for 37 hours until watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me: https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173" Where did it seem to get stalled at - about 10 minutes left to go? If so, that's what typically happens when a minirosetta workunit goes out with a serious underestimate of the time required to run it. When I had one like that, a few versions ago, I let it finish (in about 4 times the time I set as preference) and at least got some credit for it, but not much more than typical for workunits that actually finished in the estimated time. At about 10 minutes left to go, the estimated time calculations get messed up, but not the calculations leading to the desired results. |
Rifleman Send message Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
Hi Robert. Yeah----it stopped at about 10 minutes to go-----and stayed that way for 25 hours---lol. Watchdog terminated it. I aborted another after 18 hours in. It was the same type protein as the first one. I have 2 more being crunched at the moment and am watching to see how they do after 12 hours in. Task ID 216862173 Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_17673_0 Workunit 197639536 Created 25 Dec 2008 6:09:31 UTC Sent 25 Dec 2008 7:37:31 UTC Received 27 Dec 2008 5:01:41 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 948562 Report deadline 4 Jan 2009 7:37:31 UTC CPU time 134234.2 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 43200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 134233 seconds. Greater than 3X preferred time: 43200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 561.58588373264 Granted credit 117.029798631356 application version 1.47 |
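The watchdog message in that log gives enough to check the numbers by hand. Below is a minimal Python sketch of the comparison the message describes, assuming the rule is simply "CPU time greater than 3 times the preferred run time"; the real watchdog code is not shown in this thread and may apply other conditions, and the names `watchdog_limit` and `exceeds_watchdog_limit` are hypothetical.

```python
# Factor taken from the "Greater than 3X preferred time" log message.
# Assumption: this is the only condition; the actual watchdog may differ.
WATCHDOG_FACTOR = 3


def watchdog_limit(cpu_run_time_pref, factor=WATCHDOG_FACTOR):
    """CPU seconds beyond which the run would be ended under the assumed rule."""
    return factor * cpu_run_time_pref


def exceeds_watchdog_limit(cpu_seconds, cpu_run_time_pref, factor=WATCHDOG_FACTOR):
    """True if the reported CPU time is past the assumed watchdog limit."""
    return cpu_seconds > watchdog_limit(cpu_run_time_pref, factor)


if __name__ == "__main__":
    pref = 43200      # "# cpu_run_time_pref: 43200" (12 hours)
    cpu_time = 134233  # "CPU time: 134233 seconds"
    print("limit:", watchdog_limit(pref), "seconds")
    print("over limit:", exceeds_watchdog_limit(cpu_time, pref))
```

For a 12 hour preference the assumed limit is 36 hours of CPU time, which lines up with the task being ended at a little over 37 hours.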
Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217325144 It was nearly 16 hrs in when I spotted it, and now, after a manual abort, it reports that it has done 0 CPU time ?!?! |
Send message Joined: 30 May 06 Posts: 5652 Credit: 5,622,096 RAC: 0 |
guys, don't forget to also post this info in the "Report long-running models here" thread. |
Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Somewhere below the question was raised whether the "Lock file" error has been fixed. It has not. If you look at this Computer you can see that I have several. It is not at all clear why this happened. As you can see it is a 4 Core processor with HT giving 8 virtual processors and I know that at one point I had at least 4 tasks running at the same time. Could this be a concurrency problem? At any rate this is a new machine in the prime of its existence in that it is just over a week old. It is run 24/7 and I have been running about 6-8 projects on the machine and I am not seeing errors like this on other projects. Heck, even GPU Grid is running reasonably well ... The log files do not record the start time of the processing so you cannot tell for sure if that is the problem here. I still have a few tasks to go and I will run them to completion and see if I get more of these errors in the remaining tasks I have. I note that my Mac Pro, also with 8 processors, has not had this error, but the project loading on that computer is such that I can't recall an instance where I had more than one Rosetta task running at the same time. Looking at my other computers, all are multi-processor with at least 4 CPUs and I cannot see this error on any of those machines. I have two tasks running on the i7 right now so I will see if they will die with a collision. The tasks are cc2_1_8_native_cen_cst_hb_t373 and cc2_1_8_native_fa_cst_hb_t373 ... I have been ignoring Rosetta so I cannot say that I know what the alphabet soup that makes up the task id means (if anything) so I can't tell if there is something common in the actual tasks or not ... I just find it disappointing that this error surfaced so late in processing. One would think that the error would surface immediately. |
Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Since my last post I have completed two tasks successfully on this machine. I have two more in the queue and they are running now. So, by the time you read this they should probably have run to completion or failure. Watching my 8 CPU systems for some time now I have noted that, in general, I never seem to have more than 2 Rosetta tasks running at the same time due to other projects. On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time? Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to babysit the machines ... |
Send message Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
* sigh * https://boinc.bakerlab.org/rosetta/result.php?resultid=217461782 <core_client_version>6.2.15</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 terminate called after throwing an instance of 'std::bad_alloc' what(): St9bad_alloc SIGABRT: abort called Stack trace (27 frames): [0x8b979b7] [0x8bc20b0] [0xb7f22420] [0x8c24ca4] [0x8c12c5b] [0x8c10261] [0x8c10296] [0x8c0fe43] [0x8c0f86c] [0x8a88ba5] [0x8559c48] [0x83e8bc3] [0x87f80df] [0x87dc3c7] [0x80de412] [0x80d0686] [0x80d0b2e] [0x80c88b9] [0x80de971] [0x80d7d76] [0x8064271] [0x8117277] [0x8127c00] [0x8129a1a] [0x804b9c8] [0x8c1dbac] [0x8048111] Exiting... </stderr_txt> ]]> and https://boinc.bakerlab.org/rosetta/result.php?resultid=217459230 <core_client_version>6.2.15</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> |
Send message Joined: 16 Jun 08 Posts: 1219 Credit: 13,532,238 RAC: 94 |
"On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?" I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores. Adding more physical memory also helps, but I had previously increased it to the limit of what my machine can handle (2 GB). |
Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?" According to my Task Manager my peak was 3.9 G with a limit of 5 G, so I did not even get close. I have 3 G normal RAM (well, 6 actually, but XP can only "see" 3 G) so ... Well, I will try to increase the swap file, but I have suspended work on this machine till the project says something ... over half the tasks failed with this one error and I am still waiting to see what happens to the last task ... it has been running with 11 min to go for a couple of hours now ... if the % Complete was not slowly rising I would have killed it by now ... the main reason I am letting it run is that curiosity overwhelms me as to whether it is going to fail with the same error after eating up 10 or more hours of my time or not ... Oh, man, this is worse ... I had nearly 10 hours on the clock. I changed the memory settings to increase the possible size of the swap file (even though it had 2 G never used) and after a reboot, the task ended with 8 hours clock time. It looks like it is valid ... but that tells me that I just wasted nearly 2 hours on a task that should have ended ... {edit add} The tasks that ended badly *MAY* have all been suspended. I cannot say for sure whether they were or not. They *MAY* have been. My setting for switching between tasks is 720 min (12 hours) to try to force most applications to finish before switching ... it is my way of trying to provide best results ... and with 4 plus cores it mostly works. But I did notice that several of the Rosetta tasks did get suspended, and I did not note which ones ... so more data to ponder if someone is actually going to look at this problem.{/edit} corrected time |