Minirosetta v1.40 bug thread

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 56806 - Posted: 10 Nov 2008, 17:32:14 UTC

My MacBook refuses to compute any loopbuild_boinc4_hombench_ task; cf. this result:

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 671.186 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Other tasks complete as expected.
ID: 56806 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 56807 - Posted: 10 Nov 2008, 17:50:39 UTC

Many problems on an iMac2 on OSX 10.4.11

a) Tasks partially completed; either waiting to run or waiting for memory

b) Mon Nov 10 08:06:41 2008|rosetta@home|Task 1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0 exited with zero status but no 'finished' file
Mon Nov 10 08:06:41 2008|rosetta@home|If this happens repeatedly you may need to reset the project.
Mon Nov 10 08:06:42 2008|rosetta@home|Restarting task 1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0 using minirosetta version 140

I believe, but can't be certain, that this was a task that had yet to complete after 12 hours' work; it now appears to be starting again.

c) Mon Nov 10 08:16:01 2008|rosetta@home|Resuming task 1hzh_2a1i_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_200_0 using minirosetta version 140

This task is now stuck after 1:05 minutes of processing.




ID: 56807 · Rating: 0
DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 56808 - Posted: 10 Nov 2008, 18:10:13 UTC
Last modified: 10 Nov 2008, 18:11:30 UTC

App: Rosetta Mini 1.40
Name: 2ci2l_BOINC_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--2ci2l-_4678_394_0
BOINC: 5.10.45 x86_64
OS: Fedora 8 x86_64
Problem: Program WILL NOT STOP CRUNCHING even if I tell BOINC to Suspend all processing. Killing it and BOINC is only way.

Edit: It is behaving better since restarting BOINC daemon. But that was really weird. Note: Other projects/apps were suspending fine before the restart.
ID: 56808 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56810 - Posted: 10 Nov 2008, 19:14:32 UTC - in response to Message 56802.  

I've got another one of those workunits that are running longer than expected:

11/9/2008 5:57:49 PM|rosetta@home|Starting 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1
11/9/2008 5:57:54 PM|rosetta@home|Starting task 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1 using minirosetta version 140

Last night, it had accumulated about 6 CPU hours and claimed that it would finish in another 10 CPU minutes. This morning, it has accumulated over 12 CPU hours and claims that it will finish in another 9 CPU minutes and 56 seconds.

Also, it's currently the most memory hungry process on my machine. The Windows Task Manager recently said it was using over 256,000K of memory - over 10 times as much as the next process - but then dropped that to a little over 200,000K and is now 223,132K.

Since it hasn't let any other process take a turn in its CPU core for much longer than the 2 hours I've tried to set it for, I'll suspend it for a while and see if that helps.

The other person with a similar workunit had a compute error after about 6 CPU hours.


I told it to suspend, which apparently worked. Windows Task Manager now says it's using only 97,000K of memory, but I suspect that it doesn't include any part of it that's been moved to the swapfile.

The workunits already on my machine from other BOINC projects are now catching up with their CPU time allotments, and haven't given this workunit another chance yet, even though I had increased Rosetta@home's share of my machine's CPU time shortly before this problem started. I had also increased the upper limit on virtual memory size to 7 GB.
ID: 56810 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 56813 - Posted: 10 Nov 2008, 22:30:44 UTC - in response to Message 56799.  

Looks like I have another runaway task. It's at 6 hrs 45 min and 97.655%, and as slow as wet cement: about .001% every 10 sec, better than the last one but not much.

I bet I don't get much for it if and when it finishes.

1hzh_2fe5_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_76

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188078846

pete.


Note to Sarel.

This task restarted after it had run nonstop all day yesterday for over 16 hrs and was at 99.001%. It then went back to 2 hrs 30 min at 41.64%. I have aborted it; I'm not going to waste more time. Can someone please fix this?

pete.


ID: 56813 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56815 - Posted: 11 Nov 2008, 0:08:47 UTC - in response to Message 56810.  
Last modified: 11 Nov 2008, 0:15:53 UTC

I've got another one of those workunits that are running longer than expected:

11/9/2008 5:57:49 PM|rosetta@home|Starting 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1
11/9/2008 5:57:54 PM|rosetta@home|Starting task 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1 using minirosetta version 140


It's now running again, and at over 16.5 CPU hours. It's back up to 228,832K.

I've noticed that a significant fraction of the minirosetta v1.40 workunits that have performed poorly on my machine lately or have been mentioned in this thread as having problems for other people have 4704 as part of their name. Is this significant, or just an indication of the current group of workunits?
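
One way to check the pattern asked about above is to tally the workunit names reported so far by their batch number. The following is a minimal sketch, not a project tool. It assumes, purely from the names quoted in this thread, that a name ends in a 4-digit batch number, an index, and an optional replica number; the helper `batch_of` and the `names` list are illustrative only.

```python
import re
from collections import Counter

# Assumed layout (inferred from names quoted in this thread, not from
# project documentation): <description>_<4-digit batch>_<index>[_<replica>]
NAME_TAIL = re.compile(r"_(\d{4})_(\d+)(?:_(\d+))?$")

def batch_of(name):
    """Return the 4-digit batch number as an int, or None if the name doesn't fit."""
    match = NAME_TAIL.search(name)
    return int(match.group(1)) if match else None

# Workunit names copied from earlier posts in this thread.
names = [
    "1hzh_1nio_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_160_0",
    "1hzh_2a1i_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_200_0",
    "2ci2l_BOINC_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--2ci2l-_4678_394_0",
    "1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1",
    "1hzh_2fe5_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_76",
]

# Count how many reported names fall into each batch.
counts = Counter(batch_of(n) for n in names)
for batch, count in counts.most_common():
    print(batch, count)
```

A tally like this only shows which batches are being reported, not whether the batch number itself causes the problem; the same names also share the `fchbonds` and `sarel` labels.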
ID: 56815 · Rating: 0
kcolagio
Joined: 7 Oct 05
Posts: 1
Credit: 62,988
RAC: 0
Message 56817 - Posted: 11 Nov 2008, 2:00:28 UTC


Running under Windows XP, I have it go inactive when I'm using the system (about 6 hours out of the day). Often I'll see notices that Windows is running out of virtual memory.

The system is a 2.4 GHz Quad Core system with 4 Gig of memory (which Windows only sees 3 Gig of *sigh* ).

Looking in the task manager, I see that there are 4 instances of Minirosetta_1.40_windows_intelx86 running and that they are using between 207 Meg and 290 Meg of memory.

There are also (if it's related) 2 instances of rosetta_beta_5.98_windows_intelx86 running that are taking 215 Meg each.

While paused, they are using 0% of the CPU (which is right in my book), but they have used up to 1 hour 4 minutes of CPU time...I have no idea if this is "normal" or not.

No idea if any of this helps, but it seems out of the ordinary to me...and I hate just killing the processes that are acting badly.

Let me know if you need more info.

ID: 56817 · Rating: 0
Adam Gajdacs (Mr. Fusion)
Joined: 26 Nov 05
Posts: 13
Credit: 2,876,565
RAC: 1,373
Message 56820 - Posted: 11 Nov 2008, 9:04:04 UTC

1hzh_1u9p_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_97_0 using minirosetta version 140 (Wu ID: 188064180)

Yesterday this task had been running for over 13 hours on a 4 hours target CPU time. It was stuck on model 1, step 79500, where step did not change for over an hour (the protein display did, however, once in every 15-20 seconds or so). Progress was increasing at the rate of roughly 0.001% per 15-20 seconds at 98.6% or so.

I don't run my system 24/7 (that's why I have a relatively short runtime specified), so I had shut it down yesterday for the night, and today it's started over from 0%; looks like it didn't checkpoint even once in all those 13+ hours. So I'm considering aborting this (and any similar) WU at this point.

In general, the memory use of 1.40 has skyrocketed again; it fluctuates between 100-350 Mbytes of physical memory and commits about 300-350 Mbytes of virtual memory. Once again, this tends to fill up all available PM+VM on multi-core systems, as the Rosetta WUs started in parallel will hit the combined memory limit within seconds. They then get suspended to the "Waiting for memory" state, and a new WU gets started only to hit the memory limit again. I usually have at least 3-4 "stuck" Rosetta WUs in memory, each holding 200-300 Mbytes of VM (and a similar amount of PM until the system is forced to completely page them out).
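
To put the reported progress rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the progress percentage keeps advancing linearly at the observed rate, which may well not hold for a stuck task; the function name `hours_remaining` is made up for illustration.

```python
def hours_remaining(percent_done, percent_per_tick, seconds_per_tick):
    """Estimate hours left, assuming progress stays linear (an assumption)."""
    remaining = 100.0 - percent_done
    ticks = remaining / percent_per_tick
    return ticks * seconds_per_tick / 3600.0

# Figures from the post: about 98.6% done, 0.001% per 15-20 seconds.
low = hours_remaining(98.6, 0.001, 15)
high = hours_remaining(98.6, 0.001, 20)
print(f"{low:.1f} to {high:.1f} hours")
```

Under that assumption the task would need several more hours on top of the 13 already spent, far beyond the 4 hour target, so the "minutes remaining" display cannot be taken at face value.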
ID: 56820 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 56828 - Posted: 11 Nov 2008, 13:08:08 UTC

These two CAPRI_comp_ems.1b.pdb.gz_docksim.protocol_8_12_4682_ WUs were ended by the watchdog because they ran over 48 hours (3x my 16 hour setting):
https://boinc.bakerlab.org/rosetta/result.php?resultid=205806719
https://boinc.bakerlab.org/rosetta/result.php?resultid=205765025
ID: 56828 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 56829 - Posted: 11 Nov 2008, 13:20:02 UTC

This WU bombed out on both machines (one Linux and the other Windows) with a file xfer error:
IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1ukf_4683_83

<file_xfer_error>
<file_name>IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1ukf_4683_83_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
ID: 56829 · Rating: 0
Warren B. Rogers
Joined: 3 Oct 05
Posts: 5
Credit: 1,127,824
RAC: 0
Message 56830 - Posted: 11 Nov 2008, 13:29:04 UTC

Hello everyone,

I've also had trouble with this version of Minirosetta. The WU will get to about 98% completion and show approximately 9 minutes to completion, and then it seems to get stuck at that point. I've stopped the WU and let other projects get a chance to complete, and when BOINC returns to the WU it will start from the beginning and sometimes complete in approximately 2 hours, or it will do the same thing and get stuck at 98% and run for over 6 hours. I've had 2 end with Compute Errors and 1 with a Validate Error. And I've seen that even the WUs that complete are getting shut down by the watchdog because of too many restarts. I hope this information helps.


Warren Rogers
ID: 56830 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 2
Message 56840 - Posted: 11 Nov 2008, 17:20:45 UTC
Last modified: 11 Nov 2008, 17:55:50 UTC

188575665 is doing the same thing. It has been running for 04:43:43 and is 96.592% complete, and the time to completion flips between 00:09:52 and 00:09:53 every few seconds.

It is also a 1hzh_2he4_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_262 wu.

Aborted. And yet again, I suppose I have to suspend Rosetta on my remote systems. Getting to be a habit that.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 56840 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 56842 - Posted: 11 Nov 2008, 17:34:43 UTC

IL23p40_p40BrubYhbond_design_jecorn_SAVE_ALL_OUT_IGNORE_THE_REST_ip40_1lb4_4683_127_0 ran 5 hrs and 13 mins and then died with a huge debug output.

exit status is -1073741819 (0xc0000005)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0083A59D read attempt to address 0xFFFFFFCC

Engaging BOINC Windows Runtime Debugger...


3 call stacks and a bunch of other stuff...

that is just annoying as hell to run 5 hrs out of 6 and then die and get no credit. LAME!
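
For anyone comparing error reports: the negative exit status and the hex code in the log above are the same 32-bit value, shown once as a signed integer and once as unsigned hex. A small sketch of the conversion (the helper name `as_ntstatus` is illustrative only):

```python
def as_ntstatus(exit_status):
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit hex string."""
    return f"0x{exit_status & 0xFFFFFFFF:08x}"

# Exit status copied from the report above.
print(as_ntstatus(-1073741819))
```

0xC0000005 is the Windows access violation status, which matches the "Access Violation" reason printed by the debugger.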
ID: 56842 · Rating: 0
Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 56848 - Posted: 11 Nov 2008, 19:16:26 UTC

I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on

https://boinc.bakerlab.org/forum_thread.php?id=4477

As far as I can tell from the messages here, people are seeing two major problems:
1. long run times with relatively low credit
2. larger than anticipated memory requirements

Please let me know if you see any other type of problem.

Since this is a departure from previous simulations on Rosetta @ Home, we expected to run into some trouble, but obviously, after the extensive testing that we had carried out (with no glitches), we didn't expect this much! We're currently looking into ways of fixing this immediately as well as in the longer term. My colleagues and I will post new messages to this thread once we've figured this out.

By the way, I should mention that even this early, we're seeing that from the simulations that ran well we've gotten a huge amount of very useful output! Much more than on any other platform that I had worked with before!

Thank you very much for your patience and for providing all this feedback!
ID: 56848 · Rating: 0
DaBrat and DaBear
Joined: 9 Aug 08
Posts: 16
Credit: 213,180
RAC: 0
Message 56850 - Posted: 11 Nov 2008, 19:36:06 UTC

Nothing but the following... 8 plus hours run for 9 credits

https://boinc.bakerlab.org/rosetta/result.php?resultid=206158806
ID: 56850 · Rating: 0
mikus
Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 56851 - Posted: 11 Nov 2008, 19:55:43 UTC

Rosetta/BOINC does not validate against partial results. It should.

The typical Rosetta task runs multiple decoys (each of which I believe is an *independent* simulation). I had such a task terminate because, while calculating decoy 7, it came up with a NaN. The results from the correctly completed previous 6 decoys were discarded.

Looked in the 'Workunit Details' page and saw that another system was identified as successfully completing that same task. The catch -- it did only 5 decoys.

There is something fundamentally unfair when ALL the work from a system that did more crunching gets discarded, while accepting work from a system that crunched less.
.
ID: 56851 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 2
Message 56852 - Posted: 11 Nov 2008, 19:55:57 UTC

1. long run times with relatively low credit

That is not specific to this version. It was mentioned many times in this thread.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 56852 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 56854 - Posted: 11 Nov 2008, 21:14:23 UTC
Last modified: 11 Nov 2008, 21:44:26 UTC

ID: 56854 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56857 - Posted: 11 Nov 2008, 21:58:44 UTC - in response to Message 56854.  
Last modified: 11 Nov 2008, 22:02:05 UTC

the memory in this computer is small but did complete some work units

https://boinc.bakerlab.org/rosetta/results.php?hostid=439347

https://boinc.bakerlab.org/rosetta/results.php?hostid=267483

https://boinc.bakerlab.org/rosetta/result.php?resultid=204400732


Rochester, it looks like the memory and time estimates for the problem workunits are now accurate enough they don't send you any of the memory-hungry workunits with design, jacob, or sarel in their names, or the workunits with serious underestimates of time required that often have 4704 in their names, but still not accurate enough to handle some of us who can handle a little more, but not the maximum required.
ID: 56857 · Rating: 0
Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 56859 - Posted: 11 Nov 2008, 23:20:26 UTC - in response to Message 56848.  
Last modified: 11 Nov 2008, 23:21:34 UTC

Hello Sarel,

Thanks for your reaction, and good to read you are still excited about the new mode you put into 1.40 : )

@ Please let me know if you see any other type of problem.

1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0
This WU was running for more than 15 hours (runtime preference = 6 hours) when I restarted my computer (Windows update).
The WU started again with 38 minutes processor time!
If possible more checkpoints will be welcome.

Have a nice day,
Path7.
ID: 56859 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org