Message boards : Number crunching : Problems with rosetta 5.48
Author | Message |
---|---|
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
-- Looks like the HINGE units also grant only about half the credit on Linux compared to other WUs... What the heck is going on??? Here is a HINGE result: https://boinc.bakerlab.org/rosetta/result.php?resultid=65587313 Here is a result from just before running all HINGE WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=65547754 The machine is running 8 at a time, so the L4 cache is sharing all the data, but it's still huge code and it isn't granting much credit... Hmmm. Looking for a team ??? Join BoincSynergy!! |
Rene Send message Joined: 2 Dec 05 Posts: 10 Credit: 67,269 RAC: 0 |
Other Rosetta WUs in the queue seem to have the same problem... only this one stopped at 1.030%. At least this WU came with some info:
----------------
<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 3363307
Graphics are disabled due to configuration...
# random seed: 3363307
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8bb654e] [0x8bb68ad] [0x82631fc] [0x86a0dbf] [0x8ae0a75] [0x843d845] [0x80e592d] [0x8521d37] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
</stderr_txt>
]]>
-----------------
;-) |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one. Your prefs at 95% would be enough for 477Mb, 0.95 x 512 = 486 (now there's a nostalgic number!) so I am guessing that you have shared memory with your video card, yes? If not, I am totally confuzdled. If so, you could go into the BIOS and reduce graphics memory to the smallest setting and see if that helped. Or possibly (used to be the case, don't know if it is still true with modern cards) there might be a program that came with the card to adjust the amount of video memory. If BOINC's 477Mb is 477 x 1024 x 1024 then you need to get the video down to around 35Mb. If BOINC's 477 is 477 x 1000 x 1000, then that corresponds to about 454 x 1024 x 1024, and you need a video allocation < 58 Mb, or maybe 56Mb depending on rounding, etc. So either way a setting of 32Mb would show whether that is the way forward. Remember, if you change any settings, to note what they are now, so that you can get back to where you are now if the effect on your system is too horrible... R~~ |
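A small sketch of the arithmetic River~~ walks through, with the 95% preference folded into the same check. The 476.84 MB figure is the job requirement the scheduler quotes later in this thread; exactly how BOINC rounds, and whether it means MB or MiB, is an assumption, so both readings are shown:

```python
# Sketch of the check River~~ describes: the scheduler refuses work when
# pref_pct x (RAM visible to BOINC) is smaller than the job's memory bound.
# 512 MB board RAM, the 95% preference and the 476.84 MB job bound come from
# this thread; the exact units and rounding BOINC uses are assumptions.

BYTES_PER_MIB = 1024 * 1024

def max_video_share(total_mb=512.0, pref_pct=0.95, job_mb=476.84):
    """Largest shared-video allocation (MB) that still leaves enough RAM."""
    needed = job_mb / pref_pct      # RAM BOINC must see to satisfy the pref
    return total_mb - needed

if __name__ == "__main__":
    # Reading 1: the 476.84 figure is already in MiB.
    print("max video RAM if the bound is MiB: %.1f MB" % max_video_share())
    # Reading 2: the bound is decimal megabytes; convert to MiB first.
    job_mib = 476.84 * 1000 * 1000 / BYTES_PER_MIB
    print("max video RAM if the bound is MB : %.1f MB" %
          max_video_share(job_mb=job_mib))
```

Either way, dropping the shared-video allocation to the smallest BIOS setting, as suggested above, is the quickest way to test the theory.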
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Memory requirements appear to have skyrocketed suddenly (not sure if this is due to the HINGE workunits or the new 5.48 rosetta client). On a dual-core AMD X2 with 1GB of memory I still get new work from the project, but the performance of the system has become poor because of swapping to disk. I have also seen in boinc_manager that at one point only one task was running instead of the usual two, with the other showing a status of "waiting". However, I did not see any corresponding entries in the messages tab (such as suspending and restarting the task).
Edit: I thought this was 'waiting for memory', but it just happened again, and after expanding the width of the status column in boinc manager it is actually 'waiting to run', which corresponds to the time after the task crashes with SIGSEGV and before boinc restarts it.
One HINGE workunit is not completing properly and keeps getting restarted. I'm going to abort that one. Here is the information from the stderr.txt file:
Graphics are disabled due to configuration...
# random seed: 2872684
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (23 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x849380d] [0x8a0efe3] [0x804d254] [0x854b9cd] [0x867872a] [0x867ab18] [0x867e675] [0x868a9e1] [0x854ffc0] [0x8690835] [0x804db1d] [0x8872178] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x837ce31] [0x8a0e7b3] [0x8a10a15] [0x8066e41] [0x8886d4d] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (21 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x848e320] [0x8681d8b] [0x855441e] [0x8687e62] [0x8688fe0] [0x868ab44] [0x854ffc0] [0x8690835] [0x804db1d] [0x887152d] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
The workunit is 58632028 (HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731). The system has completed at least one HINGE workunit with 5.48 successfully.
Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Computation for task HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0 finished
Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Starting HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0
Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Starting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 08:06:35 AM PST|rosetta@home|[file_xfer] Started upload of file HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0_0
Sun 04 Mar 2007 08:06:59 AM PST|rosetta@home|[file_xfer] Finished upload of file HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0_0
Sun 04 Mar 2007 08:06:59 AM PST|rosetta@home|[file_xfer] Throughput 38581 bytes/sec
Sun 04 Mar 2007 09:54:32 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 09:54:32 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 09:56:59 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 09:57:31 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 09:57:31 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 09:59:03 AM PST|rosetta@home|Sending scheduler request: Requested by user
Sun 04 Mar 2007 09:59:03 AM PST|rosetta@home|Reporting 1 tasks
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Scheduler RPC succeeded [server version 509]
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Deferring communication for 4 min 2 sec
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Reason: requested by project
Sun 04 Mar 2007 09:59:09 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 10:16:18 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 10:16:18 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 10:31:41 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 10:34:32 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 10:34:32 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 10:37:33 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
This is just from my home desktop system. There seem to be widespread issues, because some of the servers at work haven't reported any results in over 12 hours, and others show fewer results than they should for the number of CPUs they are running. Why was 5.48 rolled out so quickly? It seems to have been on Ralph for just a couple of days. Team Helix |
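For anyone who wants to spot this kind of restart loop without scrolling through the Messages tab, here is a minimal sketch that counts restarts per task in a saved BOINC message log. stdoutdae.txt is the usual client log file name, but point it at whatever copy of the messages you have; the restart threshold is arbitrary:

```python
# Counts how often each task is restarted in a BOINC message log, to flag
# workunits stuck in the restart loop shown above.

import re
from collections import Counter

RESTART = re.compile(r"Restarting task (\S+) using rosetta")

def restart_counts(path="stdoutdae.txt"):
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = RESTART.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for task, n in restart_counts().most_common():
        if n >= 3:                      # one or two restarts can be normal
            print("%s restarted %d times - probably stuck" % (task, n))
```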
j2satx Send message Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
My guess is they only have 4,000 hosts to test on Ralph, but 273,000 to test with on Rosetta. Same as the graphics.........everything seems to get rolled out before it is thoroughly tested on Ralph. |
hedera Send message Joined: 15 Jul 06 Posts: 76 Credit: 5,263,150 RAC: 87 |
A little more background on my system, which I thought was successfully running 2 Rosetta tasks:
Windows XP Pro, SP2
Pentium 4, 3.20 GHz
1 GB RAM
The process list in the Windows Task Manager showed that rosetta_5.48_windows_intelx86.exe was averaging around 290 MB of system memory when running. I looked at the CPU usage, and it was at 50%, which means that at that time only one task was actually running. So I looked at my tasks and discovered that only one is running - the other is "Waiting for memory". So I HAVE the problem after all; my system is just handling it a little more gracefully than some. This is a negative change - with earlier versions I was able to run 2 tasks simultaneously, using 100% of the CPU. My system status shows about 378 MB of physical memory available. It seems to me that a system this size ought to be able to run two tasks at once. --hedera
Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
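A back-of-the-envelope check of that last point, assuming the client simply compares the per-task memory figure against RAM times the memory preference. Whether it uses the declared bound or the measured working set is not clear from the thread; the sketch uses the 476.84 MB bound the scheduler quotes, and the percentages are examples, not hedera's actual settings:

```python
# Two HINGE tasks against the memory limit BOINC enforces on a 1 GB machine.

def tasks_fit(n_tasks, per_task_mb, ram_mb, pref_pct):
    limit = ram_mb * pref_pct          # memory BOINC is allowed to use
    need = n_tasks * per_task_mb       # memory the tasks would claim
    return need <= limit, need, limit

for pct in (0.5, 0.75, 0.9):
    ok, need, limit = tasks_fit(2, 476.84, 1024, pct)
    print("pref %3.0f%%: need %4.0f MB, limit %4.0f MB -> %s"
          % (pct * 100, need, limit,
             "both run" if ok else "one task waits for memory"))
```

With the declared bound, two tasks need roughly 954 MB, so even a 90% limit on a 1 GB machine leaves one of them "Waiting for memory".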
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Additional problem: after aborting workunit 58632028 (HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731) through boinc manager, it remained in the system (not using CPU time, but of course holding on to all the allocated memory). This meant that there were now 3 rosetta clients on this dual-CPU system (however, only two showed up in the boinc manager task list). The aborted task still remained in the system even after completely shutting down the boinc client! Note that its parent process id has changed to 1 (init) instead of the process id of the no-longer-running boinc client.
user 28422 1 2 10:37 pts/1 00:01:40 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28423 28422 0 10:37 pts/1 00:00:01 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28424 28423 0 10:37 pts/1 00:00:00 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28425 28423 0 10:37 pts/1 00:00:00 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
The stderr.txt file is:
Graphics are disabled due to configuration...
# random seed: 2930724
SIGSEGV: segmentation violation
Stack trace (27 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8812550] [0x804d2f0] [0x888e6d0] [0x8af7190] [0x837c075] [0x8a0e7b3] [0x804d254] [0x854b9cd] [0x867872a] [0x867ab18] [0x867e6c6] [0x868a9e1] [0x854ffc0] [0x8690835] [0x804db1d] [0x8872178] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8b59755] [0x8b59831] [0x8b59618] [0x8069816] [0x875e7a9] [0x8be67cf] [0x8b7791d] [0x8b8102d] [0x8c1292a]
Exiting...
I would not be surprised if this turns out to be related to the same issue that prevents the watchdog timer from properly terminating rosetta tasks on Linux. Obviously, processes with large memory requirements that cannot be terminated through the user interface (a simple kill works just fine) when they are causing a problem are not a good thing! Team Helix |
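A minimal sketch of the manual cleanup described above: find rosetta processes that have been reparented to init (PPID 1) after the BOINC client is gone and send them the "simple kill" mentioned. Linux-only, run as the user that owns the tasks; the process-name prefix, and the assumption that the helper children/threads exit along with the main process, are mine, not from the thread:

```python
# Find orphaned rosetta tasks (parent is init) and terminate them with SIGTERM.

import os
import signal
import subprocess

def orphaned_rosetta_pids():
    out = subprocess.check_output(["ps", "-eo", "pid,ppid,comm"], text=True)
    pids = []
    for line in out.splitlines()[1:]:          # skip the ps header line
        pid, ppid, comm = line.split(None, 2)
        if comm.startswith("rosetta_") and ppid == "1":
            pids.append(int(pid))
    return pids

if __name__ == "__main__":
    for pid in orphaned_rosetta_pids():
        print("terminating orphaned rosetta task, pid", pid)
        # Plain SIGTERM; the report above says a simple kill is enough.
        os.kill(pid, signal.SIGTERM)
```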
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
Yet at the same time there are questions on the Ralph message boards about why there isn't enough work for testers. I don't think a lack of testers and test machines is the issue. Team Helix |
Stan in CT Send message Joined: 21 Dec 05 Posts: 1 Credit: 241,611 RAC: 0 |
Houston (administrators), YOU have a problem!! I, along with others, am trying to help you with our computers. Without making any adjustments on our part, we are suddenly not able to help you. What's with this 'not enough memory'? |
NYgnat Send message Joined: 28 Dec 06 Posts: 2 Credit: 248,388 RAC: 40 |
Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one.
My problem is quite similar on W2K/512MB and preferences set to 1 and 2 GIGs. Note that everything in the Rosetta folder under BOINC is still GZIPPED!
Drive:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta
Directory listing is:
03/04/2007 07:13p <DIR> .
03/04/2007 07:13p <DIR> ..
12/28/2006 04:19p 11,994 avgE_from_pdb.gz
12/28/2006 04:19p 6,665,378 bbdep02.May.sortlib.gz
12/28/2006 04:19p 245 bb_hbW.gz
12/28/2006 04:19p 177,854 disulf_jumps.dat.gz
12/28/2006 04:20p 1,667,226 DunbrackBBDepRots12.dat.gz
12/28/2006 04:20p 356 Equil_AM.mean.dat.gz
12/28/2006 04:20p 322 Equil_AM.stddev.dat.gz
12/28/2006 04:20p 187 Equil_bp_AM.mean.dat.gz
12/28/2006 04:20p 169 Equil_bp_AM.stddev.dat.gz
12/28/2006 04:20p 1,267 Fij_AM.dat.gz
12/28/2006 04:20p 383 Fij_bp_AM.dat.gz
12/28/2006 04:20p 689 jump_templates.dat.gz
12/28/2006 04:20p 516,693 jump_templates_v2.dat.gz
12/28/2006 04:20p 165 Paa.gz
12/28/2006 04:20p 1,832 Paa_n.gz
12/28/2006 04:20p 120,642 Paa_pp.gz
12/28/2006 04:20p 2,515 paircutoffs.gz
12/28/2006 04:20p 71,365 pdbpairstats_fine.gz
12/28/2006 04:20p 19,510 phi.theta.36.HS.resmooth.gz
12/28/2006 04:20p 11,863 phi.theta.36.SS.resmooth.gz
12/28/2006 04:20p 136,751 plane_data_table_1015.dat.gz
12/28/2006 04:20p 425,070 Rama_smooth_dyn.dat_ss_6.4.gz
12/28/2006 04:20p 1,132 SASA-angles.dat.gz
12/28/2006 04:20p 68,809 SASA-masks.dat.gz
12/28/2006 04:20p 2,475 sasa_offsets.txt.gz
12/28/2006 04:20p 34,833 sasa_prob_cdf.txt.gz
12/28/2006 04:20p 2,998 sc_hbW.gz
29 File(s) 9,942,723 bytes
2 Dir(s) 12,516,331,520 bytes free
Log file is as follows:
3/4/2007 7:03:14 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
3/4/2007 7:03:14 PM|rosetta@home|Reason: To fetch work
3/4/2007 7:03:14 PM|rosetta@home|Requesting 8640 seconds of new work
3/4/2007 7:03:19 PM|rosetta@home|Scheduler request succeeded
3/4/2007 7:03:19 PM|rosetta@home|Message from server: Your preferences limit memory usage to 431.54MB, and a job requires 476.84MB
3/4/2007 7:03:19 PM|rosetta@home|Message from server: No work sent
3/4/2007 7:03:19 PM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)
3/4/2007 7:03:19 PM|rosetta@home|No work from project
3/4/2007 7:07:25 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
3/4/2007 7:07:25 PM|rosetta@home|Reason: To fetch work
3/4/2007 7:07:25 PM|rosetta@home|Requesting 8640 seconds of new work
3/4/2007 7:07:31 PM|rosetta@home|Scheduler request succeeded
3/4/2007 7:07:31 PM|rosetta@home|Message from server: Your preferences limit memory usage to 431.54MB, and a job requires 476.84MB
3/4/2007 7:07:31 PM|rosetta@home|Message from server: No work sent
3/4/2007 7:07:31 PM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)
3/4/2007 7:07:31 PM|rosetta@home|No work from project |
NYgnat Send message Joined: 28 Dec 06 Posts: 2 Credit: 248,388 RAC: 40 |
Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one.
My problem is quite similar on W2K/512MB and preferences set to 1 and 2 GIGs.
The problem was alleviated when I set prefs to 99% memory usage; then it started to download 5.48. Possibly a global problem with W2K at 512MB RAM. My video card has its own RAM. I tried 95% and that didn't work either... Seems 5.48 is a memory hog... |
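A minimal sketch of why 95% could still fail while 99% worked, assuming the limit the scheduler prints is the preference percentage times the RAM BOINC actually detects, which on Windows is usually a little under the installed 512 MB. The detected sizes below are illustrative values, not NYgnat's real ones:

```python
# Smallest memory-preference percentage that satisfies the quoted job bound.

JOB_MB = 476.84                      # requirement quoted by the scheduler

def min_pref_pct(detected_mb):
    """Smallest 'use at most X% of memory' setting that satisfies the job."""
    return 100.0 * JOB_MB / detected_mb

for detected in (511.5, 503.0, 495.0):
    print("BOINC sees %.1f MB -> need at least %.1f%%"
          % (detected, min_pref_pct(detected)))
```

Once the detected RAM drops to the mid-490s MB range, a 95% setting falls just short of the 476.84 MB requirement while 99% still clears it.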
Thomas Leibold Send message Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
It is happening again, and this time I definitely saw the message "waiting for memory" in the boinc manager task window (workunit 58661010). What I'm wondering now is whether this is related to the two different memory preference settings introduced with Boinc 5.8. This machine is running 5.8.11 and has 1GB of memory, which is enough for the scheduler to give it two of the HINGE workunits. However, when I'm doing other work on the system, the available memory may drop enough for Boinc to try stopping one of the Rosetta tasks (which then promptly fails with the same kind of SIGSEGV we see when the watchdog timer tries to terminate a Rosetta task).
2007-03-04 15:08:09 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 exited with zero status but no 'finished' file
2007-03-04 15:08:09 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 15:09:18 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 using rosetta version 548
2007-03-04 16:38:17 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 exited with zero status but no 'finished' file
2007-03-04 16:38:17 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 16:39:23 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 using rosetta version 548
2007-03-04 16:50:13 [rosetta@home] Computation for task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0 finished
2007-03-04 16:50:13 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 16:50:15 [rosetta@home] [file_xfer] Started upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0_0
2007-03-04 16:50:43 [rosetta@home] [file_xfer] Finished upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0_0
2007-03-04 16:50:43 [rosetta@home] [file_xfer] Throughput 38845 bytes/sec
2007-03-04 17:12:50 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:12:51 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 17:13:59 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 17:53:45 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:53:45 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 17:55:43 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 17:59:25 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:59:25 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:00:29 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:02:12 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:02:12 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:02:12 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:03:33 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:03:33 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:03:57 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:08:20 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:08:20 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:12:46 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:12:48 [rosetta@home] Computation for task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 finished
2007-03-04 18:12:48 [rosetta@home] Sending scheduler request: To fetch work
2007-03-04 18:12:48 [rosetta@home] Requesting 8308 seconds of new work, and reporting 1 completed tasks
2007-03-04 18:12:50 [rosetta@home] [file_xfer] Started upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0_0
2007-03-04 18:12:53 [rosetta@home] Scheduler RPC succeeded [server version 509]
2007-03-04 18:12:53 [rosetta@home] Deferring communication for 4 min 2 sec
2007-03-04 18:12:53 [rosetta@home] Reason: requested by project
2007-03-04 18:12:54 [rosetta@home] [file_xfer] Finished upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0_0
2007-03-04 18:12:54 [rosetta@home] [file_xfer] Throughput 38087 bytes/sec
2007-03-04 18:12:55 [rosetta@home] Starting HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0
2007-03-04 18:12:55 [rosetta@home] Starting task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:19:10 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:21:26 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:25:50 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
stderr.txt:
Graphics are disabled due to configuration...
# random seed: 2930724
SIGSEGV: segmentation violation
Stack trace (27 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8812550] [0x804d2f0] [0x888e6d0] [0x8af7190] [0x837c075] [0x8a0e7b3] [0x804d254] [0x854b9cd] [0x867872a] [0x867ab18] [0x867e6c6] [0x868a9e1] [0x854ffc0] [0x8690835] [0x804db1d] [0x8872178] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8b59755] [0x8b59831] [0x8b59618] [0x8069816] [0x875e7a9] [0x8be67cf] [0x8b7791d] [0x8b8102d] [0x8c1292a]
Exiting...
Graphics are disabled due to configuration...
# random seed: 2930724
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x8551b3e] [0x8687e62] [0x8688fe0] [0x868ab44] [0x854ffc0] [0x8690835] [0x804db1d] [0x887152d] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (20 frames):
[0x8b63f87] [0x8b7fdcc] [0xffffe420] [0x849380d] [0x8a0efe3] [0x804d254] [0x854b9cd] [0x867872a] [0x854f14f] [0x8690835] [0x804db1d] [0x8872178] [0x8886b19] [0x87c3bc2] [0x8320d55] [0x8521a9b] [0x863a03b] [0x863a0e4] [0x8bdf184] [0x8048121]
Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
P.S.: While I was still typing this message the workunit suddenly became 'finished' within 2 seconds of being restarted and was returned. A new workunit was started and seems to be doing the same thing (stopped and restarted by Boinc with "waiting for memory" status), except it hasn't crashed yet (no SIGSEGV in stderr.txt so far).
P.P.S.: I even got partial credit for this one (workunit 58661010):
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures built 31 (nstruct) times
This process generated 2 decoys from 2 attempts
0 starting pdbs were skipped
======================================================
Regarding that message "Keep application in memory while preempted": I already do that, because otherwise Rosetta doesn't work at all. How about fixing that problem too? It sure has been around long enough! Keeping ~80MB per Rosetta client in memory was just a minor annoyance; with these new HINGE workunits and several hundred MB it becomes a real issue. Team Helix |
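An illustration of the BOINC 5.8 behaviour suspected above, assuming separate "computer in use" and "computer idle" memory limits are enforced against the tasks' working sets. The ~290 MB figure is the one hedera reported earlier in the thread; the percentages are example values, not Thomas's actual settings:

```python
# When the machine switches from idle to in-use the memory limit drops, and
# one of the two HINGE tasks gets preempted ("waiting for memory").

RAM_MB = 1024
PER_TASK_MB = 290          # working set of one HINGE task, per hedera's report
N_TASKS = 2                # dual-core box, two Rosetta tasks

def tasks_allowed(limit_pct):
    """How many tasks fit under a given memory limit (capped at N_TASKS)."""
    return min(N_TASKS, int((RAM_MB * limit_pct) // PER_TASK_MB))

for state, pct in (("idle", 0.90), ("in use", 0.50)):
    n = tasks_allowed(pct)
    print("%-6s: limit %4.0f MB -> %d running, %d waiting for memory"
          % (state, RAM_MB * pct, n, N_TASKS - n))
```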
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Hi all, the "HINGE" WUs simulate a very large protein with more than 800 residues (versus the fewer than 200 residues we normally run on BOINC) and thus require much more memory than usual. We have put a higher memory requirement on these jobs. Also, they have been assigned high priority because they are for a blind docking prediction with a deadline coming soon. That explains why some low-memory clients temporarily cannot receive jobs. |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
Other Rosetta WUs in the queue seem to have the same problem... only this one stopped at 1.030%
I think the problem is a lack of memory, as your machine has only 256MB. |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
See, everyone, it's not a big conspiracy :P Once again, BOINC tells us exactly what the issue is but people still act like something's horribly wrong -_- |
Chu Send message Joined: 23 Feb 06 Posts: 120 Credit: 112,439 RAC: 0 |
All the jobs we run on Rosetta@Home are tested on Ralph@Home beforehand, and if we had seen an unusually high rate of problems we would not have added them to the queue here. For example, all the "HINGE" WUs were tested on Ralph with the same high memory requirement, and the error rate was normal. However, since Ralph is only for testing purposes, we do not usually send out many jobs per batch. This means that: 1. one client computer may not get multiple jobs running at the same time; 2. not all platforms are tested in the same distribution as represented on BOINC (I think Ralph is over-represented by Windows platforms). This could be partially responsible for the problems you are reporting here. We are sorry for any inconvenience and will try our best to do better testing in the future.
|
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
Hi all, the "HINGE" WUs simulate a very large protein with more than 800 residues (versus the fewer than 200 residues we normally run on BOINC) and thus require much more memory than usual. We have put a higher memory requirement on these jobs. Also, they have been assigned high priority because they are for a blind docking prediction with a deadline coming soon. That explains why some low-memory clients temporarily cannot receive jobs.
I recommend announcing in advance when WUs with bigger memory requirements are to be sent out (preferably as a news item, together with a warning that they might cause more errors and some problems). People don't complain if they know the cause and the need for it (deadline, competition, etc.). So much for the social aspects of DC. |
j2satx Send message Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
All the jobs we run on Rosetta@Home are tested on Ralph@Home beforehand, and if we had seen an unusually high rate of problems we would not have added them to the queue here. For example, all the "HINGE" WUs were tested on Ralph with the same high memory requirement, and the error rate was normal. However, since Ralph is only for testing purposes, we do not usually send out many jobs per batch. This means that: 1. one client computer may not get multiple jobs running at the same time; 2. not all platforms are tested in the same distribution as represented on BOINC (I think Ralph is over-represented by Windows platforms). This could be partially responsible for the problems you are reporting here. We are sorry for any inconvenience and will try our best to do better testing in the future.
You could show us the distribution of hosts on Ralph (like this page from Docking: http://docking.utep.edu/sharedmemory.php) so we can see which machines we would have to add to help. Perhaps you could also double the number of test WUs for Ralph to reduce the errors showing up on Rosetta. I've taken all my machines off Rosetta and added some to Ralph. |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
Hi all, the "HINGE" WUs are simulating a very large protein which has more than 800 residues ( versus less than 200 residues normally we have run on BOINC ) and thus requires much more memory than usual. We have put a higher memory requirement on these jobs. Also, high priority has been assigned because it is for a blind docking prediction with the deadline coming soon. That can explain why some low-memory clients can not receive jobs temporarily. Hi Chu... This may seem a little petty, but, could you have the HINGE WU's grant a bit more credit... Seems like my machines were working harder and should get a little more credit than they did... If it's a big deal or you need to change all granting, then don't bother... it's just a thought... I will still crunch them when I get them... Looking for a team ??? Join BoincSynergy!! |
netwraith Send message Joined: 3 Sep 06 Posts: 80 Credit: 13,483,227 RAC: 0 |
See, everyone, it's not a big conspiracy :P I don't think anyone thought it was a conspiracy... just a lack of *COMMUNICATION*. As has been stated elsewhere, a little blurb on the front page or a posting in the message boards on the subject could have averted some of the anguish... It still would not have helped those who did not see it and were idle, but you know I am talking about the principle, not necessarily the effectiveness... Looking for a team ??? Join BoincSynergy!! |