Message boards : Number crunching : Minirosetta v1.47 bug thread.
Author | Message |
---|---|
Stacey Baird Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
HoHo kids! Hello, I don't know if this is a bug AND I am not one to complain about receiving credit, however, I was very surprised to receive so much credit compared to claimed credit. Is the result below likely? 216467986 Name cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0 Workunit 197278592 Created 23 Dec 2008 6:24:21 UTC Sent 23 Dec 2008 7:45:54 UTC Received 24 Dec 2008 15:54:32 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 947263 Report deadline 2 Jan 2009 7:45:54 UTC CPU time 5719.655 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time failed to create shared mem segment CreateSemaphore failure! Cannot create semaphore! CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying 
set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time CreateFile error 32 when trying set file time ====================================================== DONE :: 1 starting structures 5719.56 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 14.4476221738839 Granted credit 41.0260851670465 application version 1.47 |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Well, it still looks odd to me: it ended up taking 7 hrs, 11 min, plus the three and a half hours lost on restarting. I have a six-hour run time (R/T) set and it still only did 4 models. See below. # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 25890.1 cpu seconds This process generated 4 decoys from 4 attempts |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Here's another one doing strange things. When I shut down last night it had run for 6 hrs, 30 min and had done 18 models. When it restarted it went back to 5 hrs, 26 min, still showing 18 models; it then ran to 6 hrs, 18 min and still had only 18 models! Still odd, I haven't seen this before; it's the same type of task. Fri 26 Dec 2008 09:03:52 EST|rosetta@home|Restarting task cc_nonideal_1_3_nocst4_hb_t306__IGNORE_THE_REST_1AZVA_6_5992_27_0 using minirosetta version 147 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197386767 # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 22718.4 cpu seconds This process generated 18 decoys from 18 attempts ====================================================== pete. |
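For anyone comparing run times across tasks, the two summary blocks quoted above can be reduced to a rough "CPU seconds per decoy" figure. The following is only a sketch: the regular expression is written against the summary text exactly as it was pasted in these posts, and the function name `seconds_per_decoy` is made up for illustration, not part of minirosetta or BOINC.

```python
import re

# Matches the end-of-run summary as it appears in the stderr pasted in this
# thread. The exact wording is taken from those posts; other application
# versions may print something different (assumption).
SUMMARY_RE = re.compile(
    r"DONE\s*::\s*(?P<starts>\d+) starting structures\s+"
    r"(?P<cpu>[\d.]+) cpu seconds\s+"
    r"This process generated (?P<decoys>\d+) decoys from (?P<attempts>\d+) attempts"
)


def seconds_per_decoy(stderr_text):
    """Return CPU seconds per generated decoy, or None if it cannot be computed."""
    match = SUMMARY_RE.search(stderr_text)
    if match is None:
        return None  # no summary block found in the text
    decoys = int(match.group("decoys"))
    if decoys == 0:
        return None  # avoid dividing by zero when nothing was generated
    return float(match.group("cpu")) / decoys


if __name__ == "__main__":
    first = ("DONE :: 1 starting structures 25890.1 cpu seconds "
             "This process generated 4 decoys from 4 attempts")
    second = ("DONE :: 1 starting structures 22718.4 cpu seconds "
              "This process generated 18 decoys from 18 attempts")
    for text in (first, second):
        print(seconds_per_decoy(text))
```

On the two tasks above this works out to well over an hour and a half per decoy for the first and roughly twenty minutes per decoy for the second, which is one way to see why a six-hour run time preference can still yield only a handful of models.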
Stacey Baird Joined: 11 Apr 06 Posts: 19 Credit: 74,745 RAC: 0 |
I am having much the same problems with stops, starts, incomprehensible progress (if any progress) reports, strange error reports, stalling, misrepresentation of time budgeting in the Tasks function and other weirdness. Minirosetta v1.47 wastes too much time and steals processing time from other processing jobs that actually work. I suspect that part of the problem is programmers and others being on Christmas break and not being available for problem solving. As a result I have suspended Rosetta processing until at least January 3rd pending cleanup of the issues. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
This is pointing to a problem with checkpointing in the FoldCst protocol. I'll put this high on the todo list for the 1.48 release. The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible - what kind of machine was this on ? Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0 |
That would make sense. Normally my WU run time is set to 4 hours.
Compaq Presario 6029 AMD Athlon XP 2100 (1.7 GHz) Windows XP Home (BOINC v 6.2.19) RAM: 768 MB VIDEO CARD: Radeon 9250 128MB Dial-up: USRobotics Controller Modem |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Serious credit issue here: cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0 Claimed credit 106.166115188458 Granted credit 74.8691857584611 That is worse than the other mammoth task I had, which had something like a 10 point difference. It also ran over my time preference. See the long running tasks thread. |
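To put the credit complaints in this thread on a common scale, it can help to look at the ratio of granted to claimed credit rather than the raw difference. The snippet below is plain arithmetic on figures quoted in the posts above; it says nothing about how the project actually decides granted credit, and the helper name `credit_ratio` is invented for this example.

```python
def credit_ratio(claimed, granted):
    """Return granted / claimed, or None when claimed credit is not positive."""
    if claimed <= 0:
        return None
    return granted / claimed


if __name__ == "__main__":
    # (claimed, granted) pairs copied from results quoted earlier in the thread.
    quoted_results = {
        "cc_nonideal_2_2_nocst4_hb_t297": (14.4476221738839, 41.0260851670465),
        "cc2_1_8_mammoth_fa_cst_hb_t303": (106.166115188458, 74.8691857584611),
    }
    for name, (claimed, granted) in quoted_results.items():
        print(f"{name}: granted/claimed = {credit_ratio(claimed, granted):.2f}")
```

A ratio above 1 means the task was granted more than it claimed (the first case), and a ratio below 1 means it was granted less (the second).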
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
After clean runs of memtest86+ 2.10 and prime95 for Linux, I can no longer get decent results out of prime95, even though memtest86+ 2.10 still runs fine. As you'd most likely expect, I'm putting the errors below down to hardware! I don't know if it's the CPU or, more likely, the mainboard northbridge. I have a newer CPU on order to rule that out, and have removed said machine from my "farm". Cheers, Happy Christmas and a computational-bug-free New Year. CPU type GenuineIntel |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
I am having problems with another WU that ran fine up to 99 percent and gets stalled. I let one run for 37 hours until watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me. : https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173 I am debating cancelling the WU that is presently doing the same thing as wasting all that CPU time for 2 decoys seems like--well---a waste!! https://boinc.bakerlab.org/rosetta/result.php?resultid=217161601 |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
Rifleman wrote: "I am having problems with another WU that ran fine up to 99 percent and gets stalled. I let one run for 37 hours until watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me. : https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173" Where did it seem to stall, at about 10 minutes left to go? If so, that's what typically happens when a minirosetta workunit goes out with a serious underestimate of the time required to run it. When I had one like that, a few versions ago, I let it finish (in about 4 times the time I set as preference) and at least got some credit for it, but not much more than typical for workunits that actually finished in the estimated time. At about 10 minutes left to go, the estimated time calculations get messed up, but not the calculations leading to the desired results. |
Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
Hi Robert. Yeah----it stopped at about 10 minutes to go-----and stayed that way for 25 hours---lol. Watchdog terminated it. I aborted another after 18 hours in. It was the same type protein as the first one. I have 2 more being crunched at the moment and am watching to see how they do after 12 hours in. Task ID 216862173 Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_17673_0 Workunit 197639536 Created 25 Dec 2008 6:09:31 UTC Sent 25 Dec 2008 7:37:31 UTC Received 27 Dec 2008 5:01:41 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 948562 Report deadline 4 Jan 2009 7:37:31 UTC CPU time 134234.2 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 43200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 134233 seconds. Greater than 3X preferred time: 43200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 561.58588373264 Granted credit 117.029798631356 application version 1.47 |
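The watchdog message in the log above ("Greater than 3X preferred time") suggests a simple threshold on CPU time relative to the run time preference. The sketch below only mirrors that log message; the real watchdog code is not shown anywhere in this thread, so the function name `watchdog_would_end` and the `factor` parameter are assumptions made for illustration.

```python
def watchdog_would_end(cpu_time_s, cpu_run_time_pref_s, factor=3.0):
    """Return True when CPU time exceeds `factor` times the preferred run time.

    The default factor of 3 is read off the log text quoted in this thread;
    it is an assumption about the behaviour, not the project's actual code.
    """
    return cpu_time_s > factor * cpu_run_time_pref_s


if __name__ == "__main__":
    # Task 216862173: 134233 s of CPU time with a 43200 s (12 hour) preference.
    # The limit would be 3 * 43200 = 129600 s, which this task exceeded.
    print(watchdog_would_end(134233, 43200))

    # The 4-decoy task reported earlier: 25890.1 s with a 21600 s preference.
    # That stays under the 64800 s limit, so this rule alone would not end it.
    print(watchdog_would_end(25890.1, 21600))
```

This also shows why a task stuck at "10 minutes to go" can keep running for a very long time with a 12 hour preference: under this rule nothing would stop it until it had used 36 hours of CPU time.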
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=217325144 Nearly 16 hrs in when I spotted it and now it reports, after a manual abort, it has done 0 CPU time ?!?! |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
guys, don't forget to also post this info in the "Report long-running models here" thread. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Somewhere below the question was raised whether the "Lock file" error has been fixed. It has not. If you look at this Computer you can see that I have several. It is not at all clear why this happened. As you can see it is a 4 Core processor with HT giving 8 virtual processors and I know that at one point I had at least 4 tasks running at the same time. Could this be a concurrency problem? At any rate this is a new machine in the prime of its existence in that it is just over a week old. It is run 24/7 and I have been running about 6-8 projects on the machine and I am not seeing errors like this on other projects. Heck, even GPU Grid is running reasonably well ... The log files do not record the start time of the processing so you cannot tell for sure if that is the problem here. I still have a few tasks to go and I will run them to completion and see if I get more of these errors in the remaining tasks I have. I note that my Mac Pro, also with 8 processors, has not had this error, but the project loading on that computer is such that I can't recall an instance where I had more than one Rosetta task running at the same time. Looking at my other computers, all are multi-processor with at least 4 CPUs and I cannot see this error on any of those machines. I have two tasks running on the i7 right now so I will see if they will die with a collision. The tasks are cc2_1_8_native_cen_cst_hb_t373 and cc2_1_8_native_fa_cst_hb_t373 ... I have been ignoring Rosetta so I cannot say that I know what the alphabet soup that makes up the task id means (if anything), so I can't tell if there is something common in the actual tasks or not ... I just find it disappointing that this error surfaced so late in processing. One would think that the error would surface immediately. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Since my last post I have completed two tasks successfully on this machine. I have two more in the queue and they are running now. So, by the time you read this they should probably have run to completion or failure. Watching my 8 CPU systems for some time now I have noted that, in general, I never seem to have more than 2 Rosetta tasks running at the same time due to other projects. On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time? Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ... |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
* sigh * https://boinc.bakerlab.org/rosetta/result.php?resultid=217461782 <core_client_version>6.2.15</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 terminate called after throwing an instance of 'std::bad_alloc' what(): St9bad_alloc SIGABRT: abort called Stack trace (27 frames): [0x8b979b7] [0x8bc20b0] [0xb7f22420] [0x8c24ca4] [0x8c12c5b] [0x8c10261] [0x8c10296] [0x8c0fe43] [0x8c0f86c] [0x8a88ba5] [0x8559c48] [0x83e8bc3] [0x87f80df] [0x87dc3c7] [0x80de412] [0x80d0686] [0x80d0b2e] [0x80c88b9] [0x80de971] [0x80d7d76] [0x8064271] [0x8117277] [0x8127c00] [0x8129a1a] [0x804b9c8] [0x8c1dbac] [0x8048111] Exiting... </stderr_txt> ]]> and https://boinc.bakerlab.org/rosetta/result.php?resultid=217459230 <core_client_version>6.2.15</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # cpu_run_time_pref: 7200 # cpu_run_time_pref: 7200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
Paul D. Buck wrote: "On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?" I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores. Adding more physical memory also helps, but I had previously increased it to the limit of what my machine can handle (2 GB). |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?" According to my Task Manager my peak was 3.9 G with limit 5G so, I did not even get close. I have 3G normal RAM (well, 6 actually, but XP can only "see" 3 G) so ... Well, I will try to increase the swap file, but, have suspended work on this machine till the project says something... over half the tasks failed with this one error and I am still waiting to see what happens to the last task ... it has been running with 11 min to go for a couple hours now ... if the % Complete was not slowly rising I would have killed it by now ... the main reason I am letting it run is that curiosity overwhelms me as to whether it is going to fail with the same error after eating up 10 or more hours of my time or not ... Oh, man, this is worse... I had nearly 10 hours on the clock. Changed the memory settings to increase the possible size of the swap file (even though it had 2G never used) and after a reboot, the task ended with 8 hours clock time. It looks like it is valid ... but that tells me that I just wasted nearly 2 hours on a task that should have ended ... {edit add} The tasks that ended badly *MAY* have all been suspended. I cannot say for sure that they were or not. They *MAY* have been. My setting for switching between tasks is 720 min (12 hours) to try to force most applications to finish before switching ... it is my way of trying to provide best results ... and with 4 plus cores it mostly works. But, I did notice that several of the Rosetta tasks did get suspended but I did not note which ones ... so more data to ponder if someone is actually going to look at this problem.{/edit} corrected time |
Speedy Joined: 25 Sep 05 Posts: 163 Credit: 808,337 RAC: 1 |
This task http://www.boinc.bakerlab.org/rosetta/result.php?resultid=217385249 is running on Vista Home Premium and shows no graphics, either in the screen saver or when I click Show graphics. When I close the graphics window it comes up as not responding and then gives me 3 options;
I use Close the program. This task has been running with 10 minutes to go for almost an hour with 97.525% done. It's moving at roughly .07.5% per minute. Should I abort it? Have a crunching good day!! |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
Mike Tyka wrote: Has anyone seen any new Lockfile problems ? Or are these finally a thing of the past? I've made a song and dance about this before, so I should report my situation again: With Mini 1.45 and Boinc 6.2.19 I had 80% success with a 2 hour runtime, dropping to 55% success with a 3 hour runtime over 116 WUs. Upgrading to Boinc 6.4.5 for a short while before Mini 1.47s came through I thought I noticed less of the lockfile problem, but they've edged out of my history now. Of the last 103 WUs: 9 were Beta 5.98s - 100% success as usual 94 Mini 1.47 - 93 success, 1 Computation Error here: 217352482 Outcome Client error Client state Compute error Exit status 0 (0x0) <core_client_version>6.4.5</core_client_version> Claimed credit 52.7338164308948 Granted credit 52.7338164308948 I note some people are still getting problems, but mine seem to have completely gone, whether due to Boinc or the Mini WUs I don't know for sure, but I honestly don't care. Excellent work, guys. Much appreciated here. Well done. This problem appeared for me along with this new machine in July and this is the first time I'm getting performance anything like this. My RAC has already increased by about 100 a day. I worried it was something I'd done. |
©2024 University of Washington
https://www.bakerlab.org