Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
My MacBook refuses to compute any loopbuild_boinc4_hombench_-task, cf. this result: <core_client_version>6.2.18</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 3600 # cpu_run_time_pref: 3600 # cpu_run_time_pref: 3600 # cpu_run_time_pref: 3600 # cpu_run_time_pref: 3600 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 671.186 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Other tasks complete as expected. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Many problems on an iMac2 on OSX 10.4.11 a) Tasks partially completed ; either waiting to run or waiting for memory b) Mon Nov 10 08:06:41 2008|rosetta@home|Task 1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0 exited with zero status but no 'finished' file Mon Nov 10 08:06:41 2008|rosetta@home|If this happens repeatedly you may need to reset the project. Mon Nov 10 08:06:42 2008|rosetta@home|Restarting task 1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0 using minirosetta version 140 Believe, but can't be certain, that this was a task that had yet to complete after 12 hours work: it appears to now be starting again. c) Mon Nov 10 08:16:01 2008|rosetta@home|Resuming task 1hzh_2a1i_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_200_0 using minirosetta version 140 This task now stuck after 1:05 minutes of processing |
DJStarfox Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
App: Rosetta Mini 1.40 Name: 2ci2l_BOINC_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--2ci2l-_4678_394_0 BOINC: 5.10.45 x86_64 OS: Fedora 8 x86_64 Problem: Program WILL NOT STOP CRUNCHING even if I tell BOINC to suspend all processing. Killing it and BOINC is the only way. Edit: It is behaving better since restarting the BOINC daemon. But that was really weird. Note: Other projects/apps were suspending fine before the restart. |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
I've got another one of those workunits that are running longer than expected: I told it to suspend, which apparently worked. Windows Task Manager now says it's using only 97,000K of memory, but I suspect that it doesn't include any part of it that's been moved to the swapfile. The workunits already on my machine from other BOINC projects are now catching up with their CPU time allotments, and haven't given this workunit another chance yet, even though I had increased Rosetta@home's share of my machine's CPU time shortly before this problem started. I had also increased the upper limit on virtual memory size to 7 GB. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Looks like I have another runaway task: it's at 6hrs, 45min and 97.655%. Note to Sarel: this task restarted after it had run all day yesterday for over 16hrs non-stop and was at 99.001%; it then went back to 2hrs, 30min at 41.64%. I have aborted it, not going to waste more time. Can someone please fix this? pete. |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
I've got another one of those workunits that are running longer than expected: It's now running again, and at over 16.5 CPU hours. It's back up to 228,832K. I've noticed that a significant fraction of the minirosetta v1.40 workunits that have performed poorly on my machine lately or have been mentioned in this thread as having problems for other people have 4704 as part of their name. Is this significant, or just an indication of the current group of workunits? |
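One way to look at the 4704 question is to tally the batch number across the task names posted in this thread. The following is a minimal sketch, assuming (this is a guess from the names quoted here, not confirmed by the project) that task names end in `_<batch>_<index>_<attempt>`; the helper `batch_of` is hypothetical.

```python
import re
from collections import Counter

# Assumed naming pattern: ..._<batch>_<index>_<attempt>. This is inferred
# from the names quoted in this thread, not documented project behaviour.
TAIL = re.compile(r"_(\d+)_(\d+)_(\d+)$")

def batch_of(task_name):
    """Return the assumed batch number from a task name, or None."""
    match = TAIL.search(task_name)
    return match.group(1) if match else None

# Full task names reported earlier in this thread.
reported = [
    "1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0",
    "1hzh_2a1i_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_200_0",
    "2ci2l_BOINC_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--2ci2l-_4678_394_0",
]

# Count how often each assumed batch number appears among the reports.
counts = Counter(batch_of(name) for name in reported)
print(counts.most_common())
```

A tally like this only shows which batches get reported; without knowing how many tasks each batch contains, it cannot tell whether 4704 fails more often or is simply the largest batch in circulation.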
kcolagio Joined: 7 Oct 05 Posts: 1 Credit: 62,988 RAC: 0 |
Running under Windows XP, I have it go inactive when I'm using the system (about 6 hours out of the day). Often I'll see notices that Windows is running out of virtual memory. The system is a 2.4 GHz Quad Core system with 4 Gig of memory (which Windows only sees 3 Gig of *sigh* ). Looking in the task manager, I see that there are 4 instances of Minirosetta_1.40_windows_intelx86 running and that they are using between 207 Meg and 290 Meg of memory. There are also (if it's related) 2 instances of rosetta_beta_5.98_windows_intelx86 running that are taking 215 Meg each. While paused, they are using 0% of the CPU (which is right in my book), but they have used up to 1 hour 4 minutes of CPU time...I have no idea if this is "normal" or not. No idea if any of this helps, but it seems out of the ordinary to me...and I hate just killing the processes that are acting badly. Let me know if you need more info. |
Adam Gajdacs (Mr. Fusion) Joined: 26 Nov 05 Posts: 13 Credit: 2,869,545 RAC: 2,204 |
1hzh_1u9p_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_97_0 using minirosetta version 140 (Wu ID: 188064180) Yesterday this task had been running for over 13 hours on a 4 hours target CPU time. It was stuck on model 1, step 79500, where step did not change for over an hour (the protein display did, however, once in every 15-20 seconds or so). Progress was increasing at the rate of roughly 0.001% per 15-20 seconds at 98.6% or so. I don't run my system 24/7 (that's why I have a relatively short runtime specified), so I had shut it down yesterday for the night, and today it's started over from 0%; looks like it didn't checkpoint even once in all those 13+ hours. So I'm considering aborting this (and any similar) WU at this point. In general, the memory use of the 1.40 has skyrocketed again, it fluctuates between 100-350 Mbytes of physical and commits about 300-350Mbytes virtual memory. Once again, this tends to fill up all available PM+VM on multi-core systems as the Rosetta WUs started in parallel will hit the combined memory limit within seconds, thus they get suspended to the "Waiting for memory" state, and then a new WU gets started only to hit the memory limit again. I usually have at least 3-4 "stuck" Rosetta WUs in memory, each holding 200-300Mbytes of VM (and a similar amount of PM until the system is forced to completely page them out). |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
These two CAPRI_comp_ems.1b.pdb.gz_docksim.protocol_8_12_4682_ WUs were ended by the watchdog because they ran over 48 hours (3x my 16 hour setting): https://boinc.bakerlab.org/rosetta/result.php?resultid=205806719 https://boinc.bakerlab.org/rosetta/result.php?resultid=205765025 |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
This WU bombed out on both machines (one Linux and the other Windows) with a file xfer error: IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1ukf_4683_83 <file_xfer_error> <file_name>IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1ukf_4683_83_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> |
Warren B. Rogers Joined: 3 Oct 05 Posts: 5 Credit: 1,127,824 RAC: 0 |
Hello everyone, I've also had trouble with this version of Minirosetta. The WU will get to about 98% completion and show approximately 9 minutes to completion and then it seems to get stuck at that point. I've stopped the WU and let other projects get a chance to complete, and when BOINC returns to the WU it will start from the beginning and sometimes complete in approximately 2 hours, or it will do the same thing and get stuck at 98% and run for over 6 hours. I've had 2 end with Compute Errors and 1 with a Validate Error. And I've seen even the WU's that complete are getting shut down by the watchdog because of too many restarts. I hope this information helps. Warren Rogers |
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 4 |
188575665 is doing the same thing. It has been running for 04:43:43, is 96.592% complete, and the time to completion flips between 00:09:52 and 00:09:53 every few seconds. It is also a 1hzh_2he4_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_262 wu. Aborted. And yet again, I suppose I have to suspend Rosetta on my remote systems. Getting to be a habit that. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1lb4_4683_127_0 ran 5 hrs and 13 mins and then died with a huge debug output. exit status is -1073741819 (0xc0000005) Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0083A59D read attempt to address 0xFFFFFFCC Engaging BOINC Windows Runtime Debugger... 3 call stacks and a bunch of other stuff... That is just annoying as hell to run 5 hrs out of 6 and then die and get no credit. LAME! |
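The two numbers in that exit status are the same 32-bit value written two ways: once as a signed decimal and once as unsigned hex. A short sketch of the conversion (the helper name `to_unsigned32` is just for illustration):

```python
def to_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF

code = to_unsigned32(-1073741819)
print(hex(code))  # 0xc0000005
```

On Windows, 0xC0000005 is the access violation status code, which matches the "Reason: Access Violation" line in the debugger output.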
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home at https://boinc.bakerlab.org/forum_thread.php?id=4477 As far as I can tell from the messages here, people are seeing two major problems: 1. long run times with relatively low credit; 2. larger than anticipated memory requirements. Please let me know if you see any other type of problem. Since this is a departure from previous simulations on Rosetta @ Home, we expected to run into some trouble, but obviously, after the extensive testing that we had carried out (with no glitches), we didn't expect this much! We're currently looking into ways of fixing this immediately as well as in the longer term. My colleagues and I will post new messages to this thread once we've figured this out. By the way, I should mention that even this early, we're seeing that from the simulations that ran well we've gotten a huge amount of very useful output! Much more than on any other platform that I had worked with before! Thank you very much for your patience and for providing all this feedback! |
DaBrat and DaBear Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
Nothing but the following... 8 plus hours run for 9 credits https://boinc.bakerlab.org/rosetta/result.php?resultid=206158806 |
mikus Joined: 7 Nov 05 Posts: 58 Credit: 700,115 RAC: 0 |
Rosetta/BOINC does not validate against partial results. It should. The typical Rosetta task runs multiple decoys (each of which I believe is an *independent* simulation). I had such a task terminate because, while calculating decoy 7, it came up with a NaN. The results from the correctly completed previous 6 decoys were discarded. I looked in the 'Workunit Details' page and saw that another system was identified as successfully completing that same task. The catch -- it did only 5 decoys. There is something fundamentally unfair when ALL the work from a system that did more crunching gets discarded, while accepting work from a system that crunched less. |
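The behaviour requested here could look roughly like the sketch below: each decoy is treated as independent, and a NaN in one decoy ends the run without discarding the decoys that already finished. This is a toy illustration only, not Rosetta or BOINC code; `run_decoys` is a hypothetical helper and the score values are made up.

```python
import math

def run_decoys(scores):
    """Keep every decoy completed before the first non-finite score.

    `scores` stands in for the per-decoy results of one task.
    Returns (completed_decoys, failed_index or None).
    """
    completed = []
    for index, score in enumerate(scores, start=1):
        if not math.isfinite(score):
            # Stop here, but still report the decoys that already finished.
            return completed, index
        completed.append(score)
    return completed, None

# Six good decoys, then a NaN on decoy 7, as in the case described above.
# The numeric scores are invented for the example.
done, failed_at = run_decoys(
    [-101.2, -98.7, -103.4, -99.9, -100.5, -97.3, float("nan")]
)
print(len(done), failed_at)  # 6 7
```

Under this scheme the task that failed on decoy 7 would still return six decoys, one more than the host that was credited for five.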
adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 4 |
1. long run times with relatively low credit That is not specific to this version. It was mentioned many times in this thread. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
the memory in this computer is small but did complete some work units https://boinc.bakerlab.org/rosetta/results.php?hostid=439347 https://boinc.bakerlab.org/rosetta/results.php?hostid=267483 https://boinc.bakerlab.org/rosetta/result.php?resultid=204400732 |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,338,560 RAC: 2,014 |
the memory in this computer is small but did complete some work units Rochester, it looks like the memory and time estimates for the problem workunits are now accurate enough they don't send you any of the memory-hungry workunits with design, jacob, or sarel in their names, or the workunits with serious underestimates of time required that often have 4704 in their names, but still not accurate enough to handle some of us who can handle a little more, but not the maximum required. |
Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Hello Sarel, Thanks for your reaction, and good to read you are still excited about the new mode you put into 1.40 : ) @ Please let me know if you see any other type of problem. 1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0 This WU was running for more than 15 hours (runtime preference = 6 hours) when I restarted my computer (Windows update). The WU started again with 38 minutes processor time! If possible, more checkpoints would be welcome. Have a nice day, Path7. |
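To illustrate why more frequent checkpoints matter for the restarts reported in this thread, here is a minimal sketch of periodic checkpointing. It is a generic toy, not how minirosetta or the BOINC API actually does it; the function names, the JSON file format, and the step counts are all assumptions made for the example.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}

def run(path, total_steps, checkpoint_every, stop_after=None):
    """Advance from the last checkpoint; optionally simulate a shutdown."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # placeholder for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulated restart: progress since last save is lost
    return state["step"]

# Hypothetical task: 100 steps, checkpoint every 10, machine restarts at 57.
checkpoint_file = os.path.join(tempfile.mkdtemp(), "task.json")
run(checkpoint_file, total_steps=100, checkpoint_every=10, stop_after=57)
resumed_from = load_checkpoint(checkpoint_file)["step"]
print(resumed_from)  # 50
```

With a checkpoint every 10 steps, the simulated restart loses 7 steps of work; with no checkpoints at all, as several posts above describe, the task would resume from step 0.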
©2024 University of Washington
https://www.bakerlab.org