Minirosetta v1.40 bug thread

DaBrat and DaBear
Joined: 9 Aug 08
Posts: 16
Credit: 213,180
RAC: 0
Message 56863 - Posted: 12 Nov 2008, 1:03:31 UTC

This appeared to run smoothly but was marked invalid.

https://boinc.bakerlab.org/rosetta/result.php?resultid=205965356


Server state: Over
Outcome: Validate error
Client state: Done
Exit status: 0 (0x0)
Computer ID: 871503
Report deadline: 18 Nov 2008 2:48:02 UTC
CPU time: 9476.889
stderr out:
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>

======================================================
DONE :: 1 starting structures 9476.44 cpu seconds
This process generated 7 decoys from 7 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state: Invalid
Claimed credit: 30.7157035506969
Granted credit: 0
Application version: 1.40
ID: 56863 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56864 - Posted: 12 Nov 2008, 1:03:47 UTC - in response to Message 56859.  
Last modified: 12 Nov 2008, 1:05:10 UTC


1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0
This WU was running for more than 15 hours (runtime preference = 6 hours) when I restarted my computer (Windows update).
The WU started again with 38 minutes processor time!
If possible, more checkpoints would be welcome.

Have a nice day,
Path7.


Path7,

Are you running Vista SP1? If so, I've found that when you are applying a Definition Update for Windows Defender, you don't have to shut down the computer. Suspending all the workunits under BOINC is enough, and allows you to resume them in a few minutes without losing anything except those few minutes of CPU time.

Most other types of Vista updates seem to require a BOINC shutdown, though, and often a reboot.
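
Path7's request for more checkpoints can be illustrated with a minimal sketch. This is not how Minirosetta or BOINC implements checkpointing; the state fields (`step`, `cpu_seconds`) and the `run` loop are hypothetical, chosen only to show why frequent checkpoints limit how much CPU time a restart can lose.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "cpu_seconds": 0.0}


def run(path, total_steps, seconds_per_step, checkpoint_every):
    """Hypothetical work loop: resume from the checkpoint, save periodically."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1
        state["cpu_seconds"] += seconds_per_step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With `checkpoint_every=3`, a restart in this toy loop loses at most two completed steps instead of everything since the task began.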
ID: 56864 · Rating: 0
DaBrat and DaBear
Joined: 9 Aug 08
Posts: 16
Credit: 213,180
RAC: 0
Message 56865 - Posted: 12 Nov 2008, 1:21:25 UTC - in response to Message 56851.  

Rosetta/BOINC does not validate against partial results. It should.

The typical Rosetta task runs multiple decoys (each of which I believe is an *independent* simulation). I had such a task terminate because, while calculating decoy 7, it came up with a NaN. The results from the 6 previous, correctly completed decoys were discarded.

I looked at the 'Workunit Details' page and saw that another system was identified as successfully completing that same task. The catch -- it did only 5 decoys.

There is something fundamentally unfair when ALL the work from a system that did more crunching gets discarded, while work from a system that crunched less is accepted.


I got the same thing on both machines that returned 7 decoys: either a NaN or a validate error, even though no errors were reported. One task ran for over 9 hours today on a machine with 4 GB of memory and a 3 GHz dual core that wasn't being used for anything else. Hope I get more than 9 credits for this one.
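
The partial-result idea from the quoted post can be sketched as follows. This is only an illustration of the proposal, not Rosetta's or BOINC's validator; `run_task`, `credit`, and the per-decoy credit rule are hypothetical, and a NaN in the input list stands in for the numerical failure described above.

```python
import math


def run_task(decoy_energies):
    """Return (completed, error) instead of discarding everything on failure.

    decoy_energies is a hypothetical list of per-decoy scores.
    """
    completed = []
    for index, energy in enumerate(decoy_energies, start=1):
        if math.isnan(energy):
            # Stop here, but keep the decoys that finished cleanly.
            return completed, f"NaN in decoy {index}"
        completed.append(energy)
    return completed, None


def credit(completed, credit_per_decoy):
    """Hypothetical rule: grant credit for each decoy that finished cleanly."""
    return len(completed) * credit_per_decoy
```

Under this rule, a task that fails on decoy 7 would still report 6 decoys, one more than the 5-decoy result that was accepted.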
ID: 56865 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 56868 - Posted: 12 Nov 2008, 2:44:28 UTC

A number of my machines are diskless, swapless, single-core Linux boxes with 512 MB of installed memory. That has worked fine for quite some time, but now these machines are getting WUs that use too much memory, which stops crunching on those machines (as they don't have any swap disk). The problem is with WUs starting with "1hzh_". For example:

1hzh_2fzp_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_136_0

The only thing you can see in the stderr is that it was restarted several times. That's because it kept running out of memory. I eventually just aborted it. Crunching had stopped on several other machines due to 1hzh_ WUs, so I went through and aborted these WUs on all my 512 MB machines.
ID: 56868 · Rating: 0
Miguel Madden
Joined: 30 Nov 05
Posts: 1
Credit: 52,162
RAC: 0
Message 56869 - Posted: 12 Nov 2008, 3:51:34 UTC
Last modified: 12 Nov 2008, 3:53:06 UTC

Greetings, partners. I don't know if the following is an issue with the new version. I'm getting extremely high temperatures in my cores; as a matter of fact, I have to shut one down, because with both cores crunching Rosetta WUs they reached a dangerous 80 Celsius. Other projects using both cores give me 73 at most.

I aborted some of the units.

Edit: One more thing: The graphics are frozen.
ID: 56869 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56870 - Posted: 12 Nov 2008, 5:21:56 UTC - in response to Message 56869.  
Last modified: 12 Nov 2008, 5:27:48 UTC

Greetings, partners. I don't know if the following is an issue with the new version. I'm getting extremely high temperatures in my cores; as a matter of fact, I have to shut one down, because with both cores crunching Rosetta WUs they reached a dangerous 80 Celsius. Other projects using both cores give me 73 at most.

I aborted some of the units.

Edit: One more thing: The graphics are frozen.


mmadden,

Do you have the option of decreasing the percentage of CPU time that BOINC projects use instead, and then checking whether Minirosetta v1.40 actually obeys the lower limit?

Also, can you check if it is one of the most memory-hungry processes on your machine while it is running?

You might also want to check for signs that your machine has so little free memory that the application has shut down graphics, if it can do that.
ID: 56870 · Rating: 0
dazman
Joined: 28 May 06
Posts: 1
Credit: 51,457,893
RAC: 0
Message 56873 - Posted: 12 Nov 2008, 8:10:49 UTC - in response to Message 56848.  

I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on

https://boinc.bakerlab.org/forum_thread.php?id=4477

As far as I can tell from the messages here, people are seeing two major problems:
1. long run times with relatively low credit
2. larger than anticipated memory requirements



Yes, I just started having problems with memory, and found it was these new units. They are using WAY too much memory. I have an 8-core 2.8 GHz Mac Pro with only 2 GB of RAM. I run it on 7 cores (since running on 8 slows Spotlight search to a crawl), and it's been running fine for months, until now. Even though I have BOINC set to use only 45% of memory, it's not obeying that rule. I'm going to have to stop running Rosetta@Home until this is resolved. Time to move on to a new project.
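
A quick back-of-the-envelope calculation shows how tight this configuration is. It assumes the 2 GB of RAM is 2048 MB and that the 45% limit is shared by all running BOINC tasks; both are assumptions, and the function name is made up for illustration.

```python
def per_task_budget_mb(total_ram_mb, boinc_fraction, running_tasks):
    """Average memory available to each task under a shared BOINC limit."""
    return total_ram_mb * boinc_fraction / running_tasks


# 2048 MB of RAM, 45% allowed for BOINC, 7 tasks running at once.
budget = per_task_budget_mb(2048, 0.45, 7)
print(f"{budget:.0f} MB per task")
```

For the machine described above this works out to roughly 132 MB per task, which leaves little headroom for a memory-hungry work unit.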
ID: 56873 · Rating: 0
svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 56887 - Posted: 12 Nov 2008, 23:44:14 UTC

As others have reported already, I'm seeing tasks fail apparently as a result of a numerical error in a routine that calculates hydrogen bonding. The tasks end up being resent to other computers, which fail in the same way. Bit of a waste.

----

Task ID 206696346
Name loopbuild_boinc4_grow10_hombench_loopbuild_t293__IGNORE_THE_REST_1VQ1A_3_4710_10_1
Workunit 188513139

NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763

----

Task ID 206661889
Name loopbuild_boinc4_grow10_hombench_loopbuild_t293__IGNORE_THE_REST_1WY7A_9_4710_13_0
Workunit 188528726

ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763

------

Mac OS X 10.4.11 : Boinc 6.2.18


ID: 56887 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 56890 - Posted: 13 Nov 2008, 0:26:33 UTC

ID: 56890 · Rating: 0
RC
Joined: 27 Sep 05
Posts: 13
Credit: 262,048
RAC: 0
Message 56899 - Posted: 13 Nov 2008, 11:15:06 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188277752

This WU ran 19.5 hours before I killed it (my run time preference is 6 hours). The next computer to pick it up had a compute error.
ID: 56899 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,813,645
RAC: 1,448
Message 56910 - Posted: 13 Nov 2008, 17:00:30 UTC - in response to Message 56848.  
Last modified: 13 Nov 2008, 17:01:32 UTC

I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on

https://boinc.bakerlab.org/forum_thread.php?id=4477

As far as I can tell from the messages here, people are seeing two major problems:
1. long run times with relatively low credit
2. larger than anticipated memory requirements

Please let me know if you see any other type of problem.


Do you still want feedback on the work units that have problems 1 and/or 2?
ID: 56910 · Rating: 0
Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 56912 - Posted: 13 Nov 2008, 18:10:48 UTC - in response to Message 56910.  

I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on

https://boinc.bakerlab.org/forum_thread.php?id=4477

As far as I can tell from the messages here, people are seeing two major problems:
1. long run times with relatively low credit
2. larger than anticipated memory requirements

Please let me know if you see any other type of problem.


Do you still want feedback on the work units that have problems 1 and/or 2?


Thanks for the offer to help! I'm now in the process of finding out what went wrong. The user reports on specific workunit failures are invaluable in figuring this out, but for the time being I have quite a few to work with :) I'll let you know once this is resolved.
ID: 56912 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 56914 - Posted: 13 Nov 2008, 19:34:31 UTC - in response to Message 56912.  

I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on

https://boinc.bakerlab.org/forum_thread.php?id=4477

As far as I can tell from the messages here, people are seeing two major problems:
1. long run times with relatively low credit
2. larger than anticipated memory requirements

Please let me know if you see any other type of problem.


Do you still want feedback on the work units that have problems 1 and/or 2?


Thanks for the offer to help! I'm now in the process of finding out what went wrong. The user reports on specific workunit failures are invaluable in figuring this out, but for the time being I have quite a few to work with :) I'll let you know once this is resolved.


You might want to check if you have the capability to give workunits that use the new mode different time and memory estimates (possibly even sometimes different from each other) so they can be directed to suitable machines.
ID: 56914 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56915 - Posted: 13 Nov 2008, 19:45:38 UTC - in response to Message 56914.  
Last modified: 13 Nov 2008, 19:47:12 UTC

You might want to check if you have the capability to give workunits that use the new mode different time and memory estimates (possibly even sometimes different from each other) so they can be directed to suitable machines.


The memory part is already done: large-memory tasks are matched to large-memory machines. The runtime is defined by the users here at Rosetta@home, and the fact that some of (OK, many of) Sarel's tasks exceed the runtime target is exactly what he's working to correct. Until then, there's no better estimate of how long they will take anyway. So Sarel is working to make sure each model completes in the more normal hour or two of CPU time. Then the user's runtime preference will be the best approximation of runtime available, which is how the project works.

So, no grander change is required. Correcting (improving) the long-running models is the solution. I don't know if you've seen the graphic, but the proteins Sarel is tackling are absolutely huge! So, they are bound to turn up some behaviors in the program that smaller proteins do not run across.
Rosetta Moderator: Mod.Sense
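
The point about model length and the runtime preference can be illustrated with a toy simulation. It assumes a simplified rule in which a new model is started whenever the elapsed time is still below the user's preference; the real Minirosetta logic may differ, and `simulate_runtime` is a hypothetical name.

```python
def simulate_runtime(preferred_hours, hours_per_model):
    """Start models until the preference is reached; return (models, hours).

    Simplified assumption: a model is only started while the elapsed time is
    below the preference, and a running model is never interrupted.
    """
    elapsed = 0.0
    models = 0
    while elapsed < preferred_hours:
        elapsed += hours_per_model
        models += 1
    return models, elapsed


print(simulate_runtime(6, 1.5))   # short models: the task stops near the target
print(simulate_runtime(6, 15.0))  # one long model: the task overshoots
```

With 1.5-hour models and a 6-hour preference the task finishes after 4 models and 6 hours, whereas a single 15-hour model runs 9 hours past the preference, which matches the overruns reported in this thread.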
ID: 56915 · Rating: 0
Gavin Shaw
Joined: 1 Feb 07
Posts: 10
Credit: 506,456
RAC: 0
Message 56922 - Posted: 14 Nov 2008, 0:43:53 UTC

Had one unit go funny.

Long Unit

My preference is set to 4 hours, but this one ran far longer than that and didn't finish. When I last looked, it was still on the first model/decoy. I also noticed that it was a resend.

Never surrender and never give up. In the darkest hour there is always hope.

ID: 56922 · Rating: 0
funkydude
Joined: 15 Jun 08
Posts: 28
Credit: 397,934
RAC: 0
Message 56923 - Posted: 14 Nov 2008, 1:36:02 UTC - in response to Message 56922.  

Rosetta Mini doesn't always respect BOINC's "Snooze" setting on making projects suspend. The weird thing is that I had 2 Minis running, and when I hit "Snooze", 1 suspended and 1 continued.
ID: 56923 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 56924 - Posted: 14 Nov 2008, 16:51:35 UTC

Rosetta Mini doesn't always respect BOINC's "Snooze" setting on making projects suspend.
I find it better to use 'suspend', which you can find on the activity list.
ID: 56924 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 56926 - Posted: 14 Nov 2008, 17:59:35 UTC

I seem to be having a lot of WUs bomb out with the message:

ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763
called boinc_finish

Here are some examples:

h011__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-14-S3-7--h011_-_4675_56_0
h010__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-6-S3-4--h010_-_4675_56_0
foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t293__olange_IGNORE_THE_REST_1NV8A_12_4735_30_0
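
When many work units fail with the same message, a small script can confirm that they all exit at the same source location. The sketch below parses only the `ERROR:: Exit from:` line format quoted in this thread; the function name is hypothetical, and other Rosetta error formats are not handled.

```python
import re

# Matches lines such as:
# ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763
EXIT_PATTERN = re.compile(
    r"ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)"
)


def find_exit_site(stderr_text):
    """Return (source_file, line_number) for the first exit line, or None."""
    match = EXIT_PATTERN.search(stderr_text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))
```

Running this over the stderr of each failed task makes it easy to group reports by failure site before posting them.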
ID: 56926 · Rating: 0
Mattia Verga
Joined: 15 Jul 06
Posts: 3
Credit: 124,357
RAC: 0
Message 56929 - Posted: 14 Nov 2008, 19:23:53 UTC

"Too many restarts with no progress. Keep application in memory while preempted."

206344438
ID: 56929 · Rating: 0
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 56931 - Posted: 14 Nov 2008, 19:54:19 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=205975695
ID: 56931 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org