Message boards : Number crunching : Minirosetta v1.40 bug thread
Author | Message |
---|---|
DaBrat and DaBear Send message Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
This appeared to run smoothly but was marked invalid. https://boinc.bakerlab.org/rosetta/result.php?resultid=205965356 Server state Over Outcome Validate error Client state Done Exit status 0 (0x0) Computer ID 871503 Report deadline 18 Nov 2008 2:48:02 UTC CPU time 9476.889 stderr out <core_client_version>6.2.18</core_client_version> <![CDATA[ <stderr_txt> ====================================================== DONE :: 1 starting structures 9476.44 cpu seconds This process generated 7 decoys from 7 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 30.7157035506969 Granted credit 0 application version 1.40 |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 826 |
Path7, Are you running Vista SP1? If so, I've found that when you are applying a Definition Update for Windows Defender, you don't have to shut down the computer. Suspending all the workunits under BOINC is enough, and allows you to resume them in a few minutes without losing anything except those few minutes of CPU time. Most other types of Vista updates seem to require a BOINC shutdown, though, and often a reboot. |
DaBrat and DaBear Send message Joined: 9 Aug 08 Posts: 16 Credit: 213,180 RAC: 0 |
Rosetta/BOINC does not validate against partial results. It should. I got the same thing on both machines that returned 7 decoys: either a NaN error or a validate error, though no errors were reported. One task ran for over 9 hours sometime today on a 3 GHz dual-core machine with 4 GB of memory that wasn't being used for anything else. Hope I get more than 9 credits for this one. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
A number of my machines are diskless swapless single core Linux with 512k installed memory. That has worked fine for quite some time, but now these machines are getting WUs that use too much memory, which stops crunching on those machines (as they don't have any swap disk). The problem is with WUs starting with "1hzh_". For example: 1hzh_2fzp_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_136_0 The only thing you can see in the stderr is that it was restarted several times. That's because it kept running out of memory. I eventually just aborted it. Crunching had stopped on several other machines due to 1hzh_ WUs, so I went through and aborted these WUs on all my 512k machines. |
Miguel Madden Send message Joined: 30 Nov 05 Posts: 1 Credit: 52,162 RAC: 0 |
Greetings partners. I don't know if the following is an issue with the new version. I'm getting extremely high temperatures in my cores; as a matter of fact, I had to shut one core down because with both cores crunching Rosetta WUs they reached a dangerous 80 °C. Other projects using both of my cores give me 73 °C at most. I aborted some of the units. Edit: One more thing: the graphics are frozen. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 826 |
Greetings partners. I don't know if the following is an issue with the new version. I'm getting extremely high temperatures in my cores; as a matter of fact, I had to shut one core down because with both cores crunching Rosetta WUs they reached a dangerous 80 °C. Other projects using both of my cores give me 73 °C at most. mmadden, do you have the option of decreasing the percentage of time BOINC projects use the CPU instead, and then checking whether Minirosetta v1.40 actually obeys this decrease? Also, can you check whether it is one of the most memory-hungry processes on your machine while it is running? You might also want to check for signs that your machine has so little free memory that the application has shut down graphics, if it can do that. |
dazman Send message Joined: 28 May 06 Posts: 1 Credit: 51,457,893 RAC: 0 |
I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on Yes, I just started having problems with memory, and found it was these new units. They are using WAY too much memory. I have an 8-core 2.8 GHz Mac Pro with only 2 GB of RAM. I run it on 7 cores (since running on 8 slows Spotlight search to a crawl), and it's been running fine for months, until now. Even though I have BOINC set to use only 45% of memory, it's not obeying that rule. I'm going to have to stop running Rosetta@Home until this is resolved. Time to move on to a new project. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
As others have reported already, I'm seeing tasks fail apparently as a result of a numerical error in a routine that calculates hydrogen bonding. The tasks end up being resent to other computers, which fail in the same way. Bit of a waste. ---- Task ID 206696346 Name loopbuild_boinc4_grow10_hombench_loopbuild_t293__IGNORE_THE_REST_1VQ1A_3_4710_10_1 Workunit 188513139 NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 ---- Task ID 206661889 Name loopbuild_boinc4_grow10_hombench_loopbuild_t293__IGNORE_THE_REST_1WY7A_9_4710_13_0 Workunit 188528726 ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 ------ Mac OS X 10.4.11 : Boinc 6.2.18 |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
RC Send message Joined: 27 Sep 05 Posts: 13 Credit: 262,048 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188277752 This WU ran 19.5 hours before I killed it (my run time preference is 6 hours). The next computer to pick it up had a compute error. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,813,645 RAC: 1,448 |
I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on Do you still want feedback on the work units that have problems 1 and/or 2? |
Sarel Send message Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on Thanks for the offer to help! I'm now in the process of finding out what went wrong. The user reports on specific workunit failures are invaluable in figuring this out, but for the time being I have quite a few to work with :) I'll let you know once this is resolved. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 826 |
I'm so sorry for this mess. The jobs labeled with the words design, jacob, or sarel, are related to a new mode that we've put into v1.40. You can read more about this new mode and why we're excited about running it on Rosetta @ Home on You might want to check if you have the capability to give workunits that use the new mode different time and memory estimates (possibly even sometimes different from each other) so they can be directed to suitable machines. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
You might want to check if you have the capability to give workunits that use the new mode different time and memory estimates (possibly even sometimes different from each other) so they can be directed to suitable machines. The memory part is already done: large-memory tasks are aligned with large-memory machines. The runtime is defined by the users here at Rosetta@home. And the fact that some of (OK, many of) Sarel's tasks exceed the runtime target is exactly what he's working to correct. Until then, there's no better estimate of how long they will take anyway. So Sarel is working to make sure the models each complete in the more normal hour or two of CPU time. Then the user's runtime preference will be the best approximation of runtime available, which is how the project works. So, no grander change is required. Correcting (improving) the long-running models is the solution. I don't know if you've seen the graphic, but the proteins Sarel is tackling are absolutely huge! So, they are bound to turn up some behaviors in the program that smaller proteins do not run across. Rosetta Moderator: Mod.Sense |
Gavin Shaw Send message Joined: 1 Feb 07 Posts: 10 Credit: 506,456 RAC: 0 |
Had one unit go funny. Long Unit My preference is set to 4 hours and this one went way longer than that and didn't finish? When I last looked it was still on the first model/decoy. I also notice that it was a resend as well. Never surrender and never give up. In the darkest hour there is always hope. |
funkydude Send message Joined: 15 Jun 08 Posts: 28 Credit: 397,934 RAC: 0 |
Rosetta Mini doesn't always respect BOINC's "Snooze" setting for suspending projects. The weird thing is I had 2 Minis running, and when I hit "Snooze", 1 suspended and 1 continued. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Rosetta Mini doesn't always respect BOINC's "Snooze" setting on making projects suspend. I find it better to use 'suspend', which you can find on the activity list. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I seem to be having a lot of WUs bomb out with the message: ERROR: NANs occured in hbonding! ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763 called boinc_finish Here are some examples: h011__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-14-S3-7--h011_-_4675_56_0 h010__BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-6-S3-4--h010_-_4675_56_0 foldcst_minimalist_core3_homo_bench_foldcst_cheat_chunk_t293__olange_IGNORE_THE_REST_1NV8A_12_4735_30_0 |
Mattia Verga Send message Joined: 15 Jul 06 Posts: 3 Credit: 124,357 RAC: 0 |
|
(_KoDAk_) Send message Joined: 18 Jul 06 Posts: 109 Credit: 1,859,263 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=205975695 |
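For anyone collecting reports like the ones above, here is a minimal Python sketch of how the stderr text quoted in this thread could be summarized automatically. It only assumes the message formats that appear in the posts (the "DONE" summary with CPU seconds and decoy counts, and the "NANs occured in hbonding!" error with its "Exit from:" line). The function name `summarize_stderr` and the dictionary keys are hypothetical, not part of BOINC or Rosetta.

```python
import re


def summarize_stderr(text):
    """Extract a few fields from a Minirosetta stderr dump.

    Hypothetical helper: the patterns are based only on the stderr
    excerpts quoted in this thread, not on any documented format.
    """
    summary = {
        "cpu_seconds": None,       # float, from "... cpu seconds"
        "decoys": None,            # int, from "generated N decoys"
        "attempts": None,          # int, from "from M attempts"
        "nan_in_hbonding": False,  # True if the NaN error message is present
        "exit_site": None,         # (source file, line number) if reported
    }

    cpu = re.search(r"(\d+(?:\.\d+)?)\s+cpu seconds", text)
    if cpu:
        summary["cpu_seconds"] = float(cpu.group(1))

    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    if decoys:
        summary["decoys"] = int(decoys.group(1))
        summary["attempts"] = int(decoys.group(2))

    # The log spells it "occured"; match the message verbatim.
    summary["nan_in_hbonding"] = "NANs occured in hbonding" in text

    site = re.search(r"Exit from:\s*(\S+)\s+line:\s*(\d+)", text)
    if site:
        summary["exit_site"] = (site.group(1), int(site.group(2)))

    return summary
```

Pasting the stderr section of a result page into `summarize_stderr` gives a quick way to tell a task that finished its decoys but failed validation apart from one that exited on the hbonding NaN check.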
©2024 University of Washington
https://www.bakerlab.org