Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
This version of Rosetta is pretty similar to 5.95, though we'll be running a few RNA workunits -- please post if there are any issues. |
alpha Send message Joined: 4 Nov 06 Posts: 27 Credit: 1,550,107 RAC: 0 |
Waking up this morning I've got two compute errors from one computer: Task 150487485 Task 150487488 Exit status -1073741819 (0xc0000005) on both. Zero credit. |
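For readers wondering how the two numbers in that exit status relate: the decimal value is the same 32-bit code read as a signed integer, and 0xC0000005 is the Windows access-violation status. A minimal sketch of the conversion follows; the helper name `to_ntstatus` is made up for illustration and is not part of BOINC or Rosetta.

```python
def to_ntstatus(exit_status: int) -> str:
    """Reinterpret a signed 32-bit exit status as an unsigned hex code."""
    # Masking with 0xFFFFFFFF maps negative values onto their
    # two's-complement unsigned equivalent.
    return f"0x{exit_status & 0xFFFFFFFF:08x}"


print(to_ntstatus(-1073741819))  # 0xc0000005
```

This only decodes the number; it says nothing about which part of the application triggered the fault.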
skildude Send message Joined: 13 Dec 05 Posts: 7 Credit: 1,295,582 RAC: 0 |
1ehzA_BOINC_SEQSEP_TITRATION_TEMP4_RNA_ABINITIO_RNA_CONTACT-1ehzA-_299 I've had at least 8 WUs fail at various times while in progress. The WUs all get a computation error the moment I attempt to "show graphics"; half failed before I even noticed I had them. Perhaps the graphics need to be turned off on these WUs. Typical error, even for WUs that failed immediately: stderr out <core_client_version>5.10.45</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 # cpu_run_time_pref: 14400 # random seed: 3404258 Too many restarts with no progress. Keep application in memory while preempted. ====================================================== DONE :: 1 starting structures 2420.03 cpu seconds This process generated 0 decoys from 0 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... </stderr_txt> <message> <file_xfer_error> <file_name>1ehzA_BOINC_SEQSEP_TITRATION_TEMP4_RNA_ABINITIO_RNA_CONTACT-1ehzA-_2996_9923_0_0</file_name> <error_code>-161</error_code> </file_xfer_error> </message> ]]> Space is a vast empty space. Let us hope that it does not occupy the region between your ears. Come visit Team Starfire at www.TSWB.org |
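The repeated seed lines in that stderr output are what precede the "Too many restarts with no progress" message. A rough way to count restarts in a saved stderr dump is sketched below. It assumes, without confirmation from the project, that the application prints exactly one `# random seed:` line each time it starts; `count_restarts` is a hypothetical helper.

```python
import re


def count_restarts(stderr_text: str) -> int:
    """Count task starts, assuming one '# random seed:' line per start."""
    return len(re.findall(r"# random seed: \d+", stderr_text))


# Three starts copied in the same shape as the log above.
sample = "# cpu_run_time_pref: 14400 # random seed: 3404258 " * 3
print(count_restarts(sample))  # 3
```

A high count combined with zero decoys suggests the task never got far enough to write a checkpoint between restarts, though the log alone does not show why.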
vicel Send message Joined: 28 Mar 06 Posts: 5 Credit: 957,142 RAC: 0 |
WU 1wrp__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1wrp_-native__3004_6 (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137744826) has frozen. Today I restarted my computer. After that, this WU started with status "Running", but the calculation makes no progress (CPU time does not change, and System Monitor shows 0% for every "rosetta_beta_5.96..." process). The Message Log shows only "resuming task 1wrp_..." without any errors. I tried to restart the WU with suspend and resume, but progress still stops. (Another WU ran.) Linux Ubuntu 7.10 (32-bit), BOINC 5.10.45, Rosetta Beta 5.96 (for this WU). Intel Core 2 Duo E4500 (2.2 GHz). RAM: 4 GB (3 GB actually used). HDD: 33 GB free. Motherboard: Intel DG33BU (integrated video). SETI@home/Rosetta@home: 50/50. |
Andrii Muliar Send message Joined: 10 Nov 05 Posts: 12 Credit: 7,655,243 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137758041 Weird search process: During the process of search Accepted RMSD & Energy remains constant: |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I got a grey Windows box saying 5.96 encountered an error and is shutting down. The computation was around 66.36% finished when it crashed. *Read the error log here, as it is quite lengthy.* If it's a Windows thing, then at the moment I am trying to finish the batch of files I got after the last crash, and I will load a new version of XP on my drive when they are done. Here is what was in BOINC Manager: 4/1/2008 6:18:55 PM|rosetta@home|Starting task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 using rosetta_beta version 596 4/1/2008 6:18:57 PM|rosetta@home|Started upload of bench80_rozilla_abrelax_natfrag_2hnfA_2986_38472_0_0 4/1/2008 6:19:02 PM|rosetta@home|Finished upload of bench80_rozilla_abrelax_natfrag_2hnfA_2986_38472_0_0 4/1/2008 7:31:08 PM|rosetta@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks 4/1/2008 7:31:13 PM|rosetta@home|Scheduler request succeeded: got 0 new tasks 4/1/2008 7:31:31 PM||Project communication failed: attempting access to reference site 4/1/2008 7:31:32 PM||Access to reference site succeeded - project servers may be temporarily down. 4/1/2008 10:29:53 PM|rosetta@home|Computation for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 finished 4/1/2008 10:29:54 PM|rosetta@home|Output file bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0_0 for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 absent |
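The BOINC Manager lines quoted above follow a `timestamp|project|message` layout, so the wall-clock time between a task starting and its computation finishing can be worked out mechanically. The sketch below uses two lines copied from that post; the function names are illustrative, and the timestamp format string is an assumption based on how these particular lines look (other locales will differ).

```python
from datetime import datetime

LOG = """4/1/2008 6:18:55 PM|rosetta@home|Starting task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 using rosetta_beta version 596
4/1/2008 10:29:53 PM|rosetta@home|Computation for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 finished"""

TASK = "bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0"


def parse_line(line):
    """Split one log line into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), project, message


def wall_time(log, task):
    """Return the time between 'Starting task' and 'Computation ... finished'."""
    start = end = None
    for line in log.splitlines():
        when, _, message = parse_line(line)
        if message.startswith(f"Starting task {task} "):
            start = when
        elif message == f"Computation for task {task} finished":
            end = when
    return end - start if start and end else None


print(wall_time(LOG, TASK))  # 4:10:58
```

Note that this is elapsed wall-clock time, not CPU time, and it includes any period the task spent suspended.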
Christian Diepold Send message Joined: 23 Sep 05 Posts: 37 Credit: 300,225 RAC: 0 |
I also experienced a WU crash with a Windows XP box popping up and asking for a click on OK (Rosetta experienced a critical error ...). As it happened on my main PC, no big deal, I caught it the second it happened. But I hope it doesn't happen on any of the unmonitored crunchers out there. |
M.L. Send message Joined: 21 Nov 06 Posts: 182 Credit: 180,462 RAC: 0 |
05-Apr-2008 00:40:59 [rosetta@home] Task FRA_t035_CAPRI15_hom1dyoB_2_IGNORE_THE_RESTt035_2_t035.1dyo.second_round_starting_pdb.pdb_3020_54511_0 exited with a DLL initialization error. 05-Apr-2008 00:40:59 [rosetta@home] If this happens repeatedly you may need to reboot your computer. 05-Apr-2008 00:40:59 [rosetta@home] Task FRA_t848_hom1jnx_1_IGNORE_THE_RESTt848_1_t848_1jnx.template_0002.pdb_3021_2209_0 exited with a DLL initialization error. 05-Apr-2008 00:40:59 [rosetta@home] If this happens repeatedly you may need to reboot your computer. 05-Apr-2008 00:41:01 [rosetta@home] Restarting task FRA_t035_CAPRI15_hom1dyoB_2_IGNORE_THE_RESTt035_2_t035.1dyo.second_round_starting_pdb.pdb_3020_54511_0 using rosetta_beta version 596 05-Apr-2008 00:41:01 [rosetta@home] Restarting task FRA_t848_hom1jnx_1_IGNORE_THE_RESTt848_1_t848_1jnx.template_0002.pdb_3021_2209_0 using rosetta_beta version 596 Both seem good now but will check in a while. (did not reboot {see the time}). |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
The stdout.txt file for some WUs seems to be growing to over 40 megabytes. The following WUs are doing this: BAK_1auq_Nterm_loop_3031_10951, BAK_1auq_Nterm_loop_3031_11495, BAK_1auq_Nterm_loop_3031_2944. At the moment, the WUs are still crunching. Looking at one of the stdout.txt files shows a lot of lines like: ... STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0320764 1 55 -4.99472 -4.99646 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0552378 2 55 -6.28509 -6.28283 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0535297 1 56 -6.78289 -6.78004 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0397191 2 56 -7.03684 -7.03334 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0533867 3 56 -36.1671 -36.1738 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0479667 2 58 -5.27631 -5.27237 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0940199 1 59 -6.51416 -6.5067 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0393486 3 59 -32.5412 -32.5466 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0327196 1 61 -4.3825 -4.38517 STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.054975 1 62 -5.24462 -5.25139 ... ad nauseam |
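Anyone who wants to gauge how large those deviations get without scrolling through a 40 MB file could summarize the `dev=` values instead. The sketch below runs on three lines copied from the post above; `summarize_devs` is a hypothetical helper, and reading `dev=` as the only number of interest is an assumption, since the meaning of the trailing columns is not documented in this thread.

```python
import re

SAMPLE = """STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0320764 1 55 -4.99472 -4.99646
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0552378 2 55 -6.28509 -6.28283
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0940199 1 59 -6.51416 -6.5067"""


def summarize_devs(text):
    """Return (number of mismatch lines, largest dev value) for a stdout dump."""
    devs = [float(m) for m in re.findall(r"dev=\s*(\d+\.\d+)", text)]
    return len(devs), max(devs) if devs else None


print(summarize_devs(SAMPLE))  # (3, 0.0940199)
```

For a real stdout.txt, reading the file line by line rather than loading it whole would keep memory use low.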
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
As a followup to my previous post, the three WUs completed normally: https://boinc.bakerlab.org/rosetta/result.php?resultid=153052812 https://boinc.bakerlab.org/rosetta/result.php?resultid=153058613 https://boinc.bakerlab.org/rosetta/result.php?resultid=152966018 When I first noticed these, all of them had a stdout.txt file a bit over 40MB. But the stdout.txt file didn't grow much bigger, even with another 8 hours or so of crunching. |
Larry256 Send message Joined: 11 Nov 05 Posts: 2 Credit: 3,928,279 RAC: 9,246 |
I experienced 2 WU crashes with a Windows XP box popping up and asking for a click on OK: https://boinc.bakerlab.org/rosetta/result.php?resultid=153734357 https://boinc.bakerlab.org/rosetta/result.php?resultid=153734536 |
Brian Kidd Send message Joined: 9 Dec 06 Posts: 5 Credit: 327 RAC: 0 |
Thanks for letting us know about the large stdout files that you're getting. I'm currently looking into this issue and it looks like the error is basically harmless. We'll correct the large stdout files on the next release. |
DJStarfox Send message Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
When WUs finish, three (out of four) processes/executables are left in memory as orphaned processes for each WU. The results do show as valid. Here's an example: https://boinc.bakerlab.org/rosetta/result.php?resultid=155168683 System: Linux Fedora 7 x86_64, 2 x AMD Opteron 248 HE, 2 x 1 GB RAM, BOINC 5.8.16 x86 |
msirois Send message Joined: 12 Mar 08 Posts: 1 Credit: 112,554 RAC: 0 |
On about half of the jobs, when I reach around 95% completed, progress simply crawls. The "To completion" time stops, but the percentage increments extremely slowly. I assume the job is progressing, but I don't know. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"On about half of the jobs, when I reach around 95% completed, progress simply crawls. The "To completion" time stops, but the percentage increments extremely slowly. I assume the job is progressing, but I don't know." That would be normal, especially if you are still on the first model and/or have a short preferred runtime specified in your Rosetta preferences. Rosetta Moderator: Mod.Sense |
hedera Send message Joined: 15 Jul 06 Posts: 76 Credit: 5,246,877 RAC: 991 |
I don't know if this is an issue with 5.96 (nothing actually seems to be crashing) or just with the RNA work units, but I'm posting this from my laptop because my desktop machine is completely non-responsive - I have 3 WU's "active" and among them they're eating over half a GB of RAM. And I only have a GB on a WinXP box. I literally can't get a response from keyboard or mouse. I've ordered a memory upgrade to 4 GB after which you can analyze RNA if you want to, but in the meantime, have pity!! --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Hedera, you can tell BOINC to "snooze", or limit it to less than all of your CPUs, or put a memory constraint on it to help keep it responsive for you. Most likely, as you already seem aware, you are bottlenecking on memory. Yes, those RNAs do take a lot. So if you limit the amount of memory BOINC is allowed to use, it will basically scale back on how many tasks it tries to run at the same time. Rosetta Moderator: Mod.Sense |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hedera, can you post a link to one of your results, so that I can figure out what the problem is? I actually did not expect any of these RNA jobs to take a lot of memory, so I definitely want to track this down. "Hedera, you can tell BOINC to "snooze", or limit it to less than all of your CPUs, or put a memory constraint on it to help keep it responsive for you. Most likely, as you already seem aware, you are bottlenecking on memory. Yes, those RNAs do take a lot. So if you limit the amount of memory BOINC is allowed to use, it will basically scale back on how many tasks it tries to run at the same time." |
hedera Send message Joined: 15 Jul 06 Posts: 76 Credit: 5,246,877 RAC: 991 |
At the moment this task is 54% done and "waiting for memory": https://boinc.bakerlab.org/rosetta/result.php?resultid=156211157 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142644377 And these tasks are running: https://boinc.bakerlab.org/rosetta/result.php?resultid=156203235 (31% done) https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142636856 https://boinc.bakerlab.org/rosetta/result.php?resultid=156409361 (4% done) https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142826368 I may have screwed up its scheduling last night when I suspended the project so I could do some actual work on the box. I don't usually have 3 at a time. One of the 2 running tasks has a working memory set of just under 192K, and the other set size is just under 116K; I don't know which is which. I actually have restricted memory use on my system, to 50% of memory when the box is in use and 85% when not, and until this last round of WUs that was enough. It isn't as bad this morning as it was last night. (I got the memory upgrade today! But it isn't in yet.) --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
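To make the 50%/85% limits concrete, here is a back-of-the-envelope sketch of how a memory cap can leave one task "waiting for memory" while two others run. It is not BOINC's actual scheduling logic; the greedy rule and the three task sizes (200, 180 and 170 MB, chosen only to total a bit over half a GB) are assumptions for illustration.

```python
def memory_budget_mb(total_ram_mb, percent_allowed):
    """Memory available to tasks under a percentage cap."""
    return total_ram_mb * percent_allowed / 100


def tasks_that_fit(task_sizes_mb, budget_mb):
    """Greedily admit tasks in order while they fit in the budget."""
    running, used = [], 0.0
    for size in task_sizes_mb:
        if used + size <= budget_mb:
            running.append(size)
            used += size
    return running


tasks = [200, 180, 170]  # hypothetical working-set sizes in MB

in_use = memory_budget_mb(1024, 50)  # 512.0 MB while the PC is in use
idle = memory_budget_mb(1024, 85)    # 870.4 MB while the PC is idle

print(tasks_that_fit(tasks, in_use))  # [200, 180]
print(tasks_that_fit(tasks, idle))    # [200, 180, 170]
```

Under these made-up numbers, the third task only gets to run once the machine is idle or the RAM is upgraded, which matches the general behavior described in the post.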
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I've got a bunch of these CAPRI15 tasks. First one has been running 4hrs and shows peak memory usage of 470M and it still hasn't completed the first model. This is the one running now: https://boinc.bakerlab.org/rosetta/result.php?resultid=156183910 Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
©2024 University of Washington
https://www.bakerlab.org