Problems with rosetta 5.48

Message boards : Number crunching : Problems with rosetta 5.48



Profile netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 37405 - Posted: 4 Mar 2007, 14:14:42 UTC


Looks like the HINGE units also grant about half the credit on Linux compared to other WUs... What the heck is going on?

Here is a HINGE result...
https://boinc.bakerlab.org/rosetta/result.php?resultid=65587313

Here is a result just prior to running all HINGE WU's
https://boinc.bakerlab.org/rosetta/result.php?resultid=65547754

The machine is running 8 at a time, so the L4 cache is shared across all that data, but it's still a huge executable, and it isn't granting much credit... Hmmm...
Looking for a team ??? Join BoincSynergy!!


Rene
Joined: 2 Dec 05
Posts: 10
Credit: 67,269
RAC: 0
Message 37406 - Posted: 4 Mar 2007, 14:35:48 UTC - in response to Message 37402.  
Last modified: 4 Mar 2007, 14:36:22 UTC

Other Rosetta WUs in the queue seem to have the same problem... only this one stopped at 1.030%

;-)


At least this wu came with some info...


----------------

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 3363307
Graphics are disabled due to configuration...
# random seed: 3363307
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8bb654e]
[0x8bb68ad]
[0x82631fc]
[0x86a0dbf]
[0x8ae0a75]
[0x843d845]
[0x80e592d]
[0x8521d37]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...

</stderr_txt>
]]>


-----------------

;-)
Profile River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 37414 - Posted: 4 Mar 2007, 16:43:43 UTC - in response to Message 37377.  

Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one.

3/4/2007 12:16:54 AM|rosetta@home|Message from server: Your preferences limit memory usage to 460.34MB, and a job requires 476.84MB
3/4/2007 12:16:54 AM|rosetta@home|Message from server: No work sent
3/4/2007 12:16:54 AM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)

Change preferences to 95%, change pagefile to 2gb, still can not get any work.


Your prefs at 95% would allow 477 MB, since 0.95 x 512 = 486 (now there's a nostalgic number!), so I am guessing that your video card shares system memory, yes? If not, I am totally confuzdled.

If so, you could go into the BIOS and reduce the graphics memory to the smallest setting and see if that helps. Or possibly (this used to be the case, I don't know if it is still true with modern cards) there might be a program that came with the card to adjust the amount of video memory.

If BOINC's 477 MB means 477 x 1024 x 1024 bytes, then you need to get the video allocation down to around 35 MB. If it means 477 x 1000 x 1000 bytes, that corresponds to about 454 x 1024 x 1024, and you need a video allocation under 58 MB, or maybe 56 MB depending on rounding. Either way, a setting of 32 MB would show whether that is the way forward.
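
To make that arithmetic concrete, here is a minimal Python sketch. It only mirrors the back-of-the-envelope reasoning in this post (it is not the scheduler's actual formula); the point is the binary-MB versus decimal-MB ambiguity.

# Sketch of the reasoning above: with 512 MiB installed and a 95% preference,
# how much memory can a shared video card take before a "476.84MB" job no
# longer fits? Numbers come from the server message quoted earlier.
MIB = 1024 * 1024      # binary megabyte
MB = 1000 * 1000       # decimal megabyte

total_ram = 512        # installed RAM in MiB
prefs_pct = 95
job_reported = 476.84  # "a job requires 476.84MB"

prefs_limit = total_ram * prefs_pct / 100
print(f"prefs alone would allow {prefs_limit:.1f} MiB for the {job_reported} MB job")

# If the server's figure is binary MB, video memory must stay below ~35 MiB;
# if it is decimal MB, the job is ~455 MiB and ~57 MiB of video would still fit.
for label, job_mib in (("reading 476.84 as binary MB", job_reported),
                       ("reading 476.84 as decimal MB", job_reported * MB / MIB)):
    print(f"{label}: max video allocation ~{total_ram - job_mib:.0f} MiB")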

Remember, if you change any settings, note what they are now, so that you can get back to where you are if the effect on your system is too horrible...

R~~
Thomas Leibold

Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 37417 - Posted: 4 Mar 2007, 18:50:09 UTC
Last modified: 4 Mar 2007, 18:57:28 UTC

Memory requirements appear to have skyrocketed suddenly (not sure if this is due to the HINGE workunits or the new 5.48 rosetta client).

On a dual core AMD X2 with 1GB of memory I still get new work from the project, but the performance of the system has become poor because of swapping to disk.

I have also seen in the BOINC manager that at times only one task was running instead of the usual two, with the other showing a status of "waiting". However, I did not see any corresponding entries in the messages tab (such as suspending and restarting the task).
Edit: I thought this was 'waiting for memory', but it just happened again, and after widening the status column in the BOINC manager the status is actually 'waiting to run'; it corresponds to the time between the task crashing with SIGSEGV and BOINC restarting it.

One HINGE workunit is not completing properly and keeps getting restarted. I'm going to abort that one. Here is the information from the stderr.txt file:

Graphics are disabled due to configuration...
# random seed: 2872684
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (23 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x849380d]
[0x8a0efe3]
[0x804d254]
[0x854b9cd]
[0x867872a]
[0x867ab18]
[0x867e675]
[0x868a9e1]
[0x854ffc0]
[0x8690835]
[0x804db1d]
[0x8872178]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x837ce31]
[0x8a0e7b3]
[0x8a10a15]
[0x8066e41]
[0x8886d4d]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (21 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x848e320]
[0x8681d8b]
[0x855441e]
[0x8687e62]
[0x8688fe0]
[0x868ab44]
[0x854ffc0]
[0x8690835]
[0x804db1d]
[0x887152d]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...

The workunit is 58632028 (HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731). The system has completed at least one HINGE workunit with 5.48 successfully.

Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Computation for task HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0 finished
Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Starting HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0
Sun 04 Mar 2007 08:06:32 AM PST|rosetta@home|Starting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 08:06:35 AM PST|rosetta@home|[file_xfer] Started upload of file HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0_0
Sun 04 Mar 2007 08:06:59 AM PST|rosetta@home|[file_xfer] Finished upload of file HINGE_1nd7_CAPRI_11nd7_1_cc1nd7.ppk_0481_1594_4121_0_0
Sun 04 Mar 2007 08:06:59 AM PST|rosetta@home|[file_xfer] Throughput 38581 bytes/sec
Sun 04 Mar 2007 09:54:32 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 09:54:32 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 09:56:59 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 09:57:31 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 09:57:31 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 09:59:03 AM PST|rosetta@home|Sending scheduler request: Requested by user
Sun 04 Mar 2007 09:59:03 AM PST|rosetta@home|Reporting 1 tasks
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Scheduler RPC succeeded [server version 509]
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Deferring communication for 4 min 2 sec
Sun 04 Mar 2007 09:59:08 AM PST|rosetta@home|Reason: requested by project
Sun 04 Mar 2007 09:59:09 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 10:16:18 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 10:16:18 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 10:31:41 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
Sun 04 Mar 2007 10:34:32 AM PST|rosetta@home|Task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 exited with zero status but no 'finished' file
Sun 04 Mar 2007 10:34:32 AM PST|rosetta@home|If this happens repeatedly you may need to reset the project.
Sun 04 Mar 2007 10:37:33 AM PST|rosetta@home|Restarting task HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731_0 using rosetta version 548
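
A restart loop like the one above is easy to spot automatically. The sketch below is only illustrative; it assumes the messages have been saved to a text file (the file name is hypothetical) and that they use the stock "project|message" format shown here.

# Count how often each task was restarted and whether it ever finished,
# so workunits stuck in a crash/restart loop stand out.
import re
from collections import Counter

restarts = Counter()
finished = set()

with open("boinc_messages.log") as log:   # hypothetical export of the Messages tab
    for line in log:
        m = re.search(r"Restarting task (\S+)", line)
        if m:
            restarts[m.group(1)] += 1
        m = re.search(r"Computation for task (\S+) finished", line)
        if m:
            finished.add(m.group(1))

for task, count in restarts.most_common():
    state = "finished" if task in finished else "still looping"
    print(f"{task}: restarted {count} times ({state})")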

This is just from my home desktop system. There seem to be widespread issues, because some of the servers at work haven't reported any results in over 12 hours, and others show fewer results than they should for the number of CPUs they are running.

Why was 5.48 rolled out so quickly? It seems to have been on Ralph for just a couple of days.
Team Helix
j2satx

Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 37420 - Posted: 4 Mar 2007, 19:11:35 UTC - in response to Message 37417.  
Last modified: 4 Mar 2007, 19:12:45 UTC



Why was 5.48 rolled out so quickly ? It seems to have been on ralph for just a couple of days.


My guess is they only have 4,000 hosts to test with on Ralph, but 273,000 on Rosetta. Same as with the graphics... everything seems to get rolled out before it is thoroughly tested on Ralph.
Profile hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,151,801
RAC: 672
Message 37421 - Posted: 4 Mar 2007, 19:30:50 UTC

A little more background on my system, which I thought was successfully running 2 Rosetta tasks:

Windows XP Pro, SP2
Pentium 4, 3.20 GHz
1 GB RAM


The process list in the Windows Task Manager showed that rosetta_5.48_windows_intelx86.exe was averaging around 290 MB of system memory when running. I looked at CPU usage, and it was at 50%, which means that at that time only one task was actually running. So I looked at my tasks and discovered that only one is running; the other is "Waiting for memory". So I HAVE the problem after all, my system is just handling it a little more gracefully than some. This is a negative change: with earlier versions I was able to run 2 tasks simultaneously, using 100% of the CPU. My system status shows about 378 MB of physical memory available. It seems to me that a system this size ought to be able to run two tasks at once.
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Thomas Leibold

Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 37426 - Posted: 4 Mar 2007, 19:57:13 UTC
Last modified: 4 Mar 2007, 20:04:38 UTC

Additional problem: after aborting workunit 58632028 (HINGE_1nd7_CAPRI_31nd7_3_cc1nd7.ppk_0983_1596_1731) through the BOINC manager, it remained in the system (not using CPU time, but of course holding on to all the allocated memory).

This meant that there were now 3 rosetta applications on this dual-CPU system (although only two showed up in the BOINC manager task list). The aborted task remained in the system even after completely shutting down the BOINC client!

Note that the parent process ID has changed to 1 (init) instead of the process ID of the no-longer-running BOINC client.

user 28422 1 2 10:37 pts/1 00:01:40 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28423 28422 0 10:37 pts/1 00:00:01 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28424 28423 0 10:37 pts/1 00:00:00 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684
user 28425 28423 0 10:37 pts/1 00:00:00 rosetta_5.48_i686-pc-linux-gnu aa 1nd7 _ -s 1nd7_3_cc1nd7.ppk_0983.pdb -dock_mcm -read_all_chains -use_pdb_numbering -dock -dock_mcm -docking_pose_symmetry -pose -nstruct 5 -ex1 -ex2aro_only -dock_score_norepack -pose_symm_n_monomers 2 -docking_pose_symm_full -symm_type cn -randomize1 -hinge -hinge_start 255 -hinge_end 260 -inner_cycle_mcm_hinge 7 -fake_native -output_silent_gz -pose_silent_out -no_filters -output_all -accept_all -cpu_run_time 10800 -watchdog -constant_seed -jran 2872684

The stderr.txt file is:

Graphics are disabled due to configuration...
# random seed: 2930724
SIGSEGV: segmentation violation
Stack trace (27 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8812550]
[0x804d2f0]
[0x888e6d0]
[0x8af7190]
[0x837c075]
[0x8a0e7b3]
[0x804d254]
[0x854b9cd]
[0x867872a]
[0x867ab18]
[0x867e6c6]
[0x868a9e1]
[0x854ffc0]
[0x8690835]
[0x804db1d]
[0x8872178]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8b59755]
[0x8b59831]
[0x8b59618]
[0x8069816]
[0x875e7a9]
[0x8be67cf]
[0x8b7791d]
[0x8b8102d]
[0x8c1292a]

Exiting...

I would not be surprised if this turns out to be related to the same issue that prevents the watchdog timer from properly terminating rosetta tasks on Linux.
Obviously it is not good to have processes with large memory requirements that cannot be terminated through the user interface when they are causing a problem (a simple kill from the command line works just fine).
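
Since a plain kill does work, a small script can clean up these leftovers. The sketch below is a hypothetical helper, not part of BOINC; it assumes that any rosetta_* process whose parent has become init (PID 1) is one of these orphans, so check the list it prints before letting it send signals.

# Find rosetta processes that were reparented to init (PPID 1) after the
# BOINC client exited, and terminate them. The heuristic and names are
# assumptions for illustration; verify the PIDs before trusting it.
import os
import signal
import subprocess

def orphaned_rosetta_pids():
    ps = subprocess.run(["ps", "-eo", "pid,ppid,comm"],
                        capture_output=True, text=True, check=True).stdout
    orphans = []
    for line in ps.splitlines()[1:]:
        pid, ppid, comm = line.split(None, 2)
        if ppid == "1" and comm.startswith("rosetta_"):
            orphans.append(int(pid))
    return orphans

if __name__ == "__main__":
    for pid in orphaned_rosetta_pids():
        print(f"terminating orphaned rosetta process {pid}")
        os.kill(pid, signal.SIGTERM)
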
Team Helix
Thomas Leibold

Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 37427 - Posted: 4 Mar 2007, 20:09:36 UTC - in response to Message 37420.  


My guess is they only have 4,000 hosts to test on Ralph, but 273,000 to test with on Rosetta. Same as the graphics.........everything seems to get rolled out before it is thoroughly tested on Ralph.

Yet at the same time, the Ralph message boards have questions about why there isn't enough work for testers. I don't think a lack of testers and test machines is the issue.
Team Helix
Stan in CT

Joined: 21 Dec 05
Posts: 1
Credit: 241,611
RAC: 0
Message 37437 - Posted: 5 Mar 2007, 0:15:38 UTC
Last modified: 5 Mar 2007, 1:14:15 UTC

Houston (administrators), YOU have a problem!!
I, along with others, am trying to help you with our computers.
Without making any adjustments on our part, we are suddenly not able to help you.
What's with this 'not enough memory'?
NYgnat

Joined: 28 Dec 06
Posts: 2
Credit: 244,894
RAC: 0
Message 37438 - Posted: 5 Mar 2007, 0:18:16 UTC - in response to Message 37414.  

Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one.

3/4/2007 12:16:54 AM|rosetta@home|Message from server: Your preferences limit memory usage to 460.34MB, and a job requires 476.84MB
3/4/2007 12:16:54 AM|rosetta@home|Message from server: No work sent
3/4/2007 12:16:54 AM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)

Change preferences to 95%, change pagefile to 2gb, still can not get any work.


Your prefs at 95% would be enough for 477Mb, 0.95 x 512 = 486 (now there's a nostalgic number!) so I am guessing that you have shared memory with your video card, yes? If not, I am totally confuzdled.

If so, you could go into the BIOS and reduce graphics memory to the smallest setting and see if that helped. Or possibly (used to be the case, don't know if it is still true with modern cards) there might be a program that came with the card to adjust the amount of video memory.

If BOINC's 477Mb is 477 x 1024 x 1024 then you need to get the video down to around 35Mb. If BOINC's 477 is 477 x 1000 x 1000, then that corresponds to about 454 x 1024 x 1024, and you need a video allocation < 58 Mb, or maybe 56Mb depending on rounding, etc. So either way a setting of 32Mb would show whether that is the way forward.

Remember, if you change any settings, to note what they are now, so that you can get back to where you are now if the effect on your system is too horrible...

R~~


My problem is quite similar on W2K/512MB, with preferences set to 1 and 2 GB:
Note that everything in the Rosetta folder under BOINC is still gzipped!
Drive:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta
The directory listing is:


03/04/2007 07:13p <DIR> .
03/04/2007 07:13p <DIR> ..
12/28/2006 04:19p 11,994 avgE_from_pdb.gz
12/28/2006 04:19p 6,665,378 bbdep02.May.sortlib.gz
12/28/2006 04:19p 245 bb_hbW.gz
12/28/2006 04:19p 177,854 disulf_jumps.dat.gz
12/28/2006 04:20p 1,667,226 DunbrackBBDepRots12.dat.gz
12/28/2006 04:20p 356 Equil_AM.mean.dat.gz
12/28/2006 04:20p 322 Equil_AM.stddev.dat.gz
12/28/2006 04:20p 187 Equil_bp_AM.mean.dat.gz
12/28/2006 04:20p 169 Equil_bp_AM.stddev.dat.gz
12/28/2006 04:20p 1,267 Fij_AM.dat.gz
12/28/2006 04:20p 383 Fij_bp_AM.dat.gz
12/28/2006 04:20p 689 jump_templates.dat.gz
12/28/2006 04:20p 516,693 jump_templates_v2.dat.gz
12/28/2006 04:20p 165 Paa.gz
12/28/2006 04:20p 1,832 Paa_n.gz
12/28/2006 04:20p 120,642 Paa_pp.gz
12/28/2006 04:20p 2,515 paircutoffs.gz
12/28/2006 04:20p 71,365 pdbpairstats_fine.gz
12/28/2006 04:20p 19,510 phi.theta.36.HS.resmooth.gz
12/28/2006 04:20p 11,863 phi.theta.36.SS.resmooth.gz
12/28/2006 04:20p 136,751 plane_data_table_1015.dat.gz
12/28/2006 04:20p 425,070 Rama_smooth_dyn.dat_ss_6.4.gz
12/28/2006 04:20p 1,132 SASA-angles.dat.gz
12/28/2006 04:20p 68,809 SASA-masks.dat.gz
12/28/2006 04:20p 2,475 sasa_offsets.txt.gz
12/28/2006 04:20p 34,833 sasa_prob_cdf.txt.gz
12/28/2006 04:20p 2,998 sc_hbW.gz
29 File(s) 9,942,723 bytes
2 Dir(s) 12,516,331,520 bytes free

Log file is as follows:

3/4/2007 7:03:14 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
3/4/2007 7:03:14 PM|rosetta@home|Reason: To fetch work
3/4/2007 7:03:14 PM|rosetta@home|Requesting 8640 seconds of new work
3/4/2007 7:03:19 PM|rosetta@home|Scheduler request succeeded
3/4/2007 7:03:19 PM|rosetta@home|Message from server: Your preferences limit memory usage to 431.54MB, and a job requires 476.84MB
3/4/2007 7:03:19 PM|rosetta@home|Message from server: No work sent
3/4/2007 7:03:19 PM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)
3/4/2007 7:03:19 PM|rosetta@home|No work from project
3/4/2007 7:07:25 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
3/4/2007 7:07:25 PM|rosetta@home|Reason: To fetch work
3/4/2007 7:07:25 PM|rosetta@home|Requesting 8640 seconds of new work
3/4/2007 7:07:31 PM|rosetta@home|Scheduler request succeeded
3/4/2007 7:07:31 PM|rosetta@home|Message from server: Your preferences limit memory usage to 431.54MB, and a job requires 476.84MB
3/4/2007 7:07:31 PM|rosetta@home|Message from server: No work sent
3/4/2007 7:07:31 PM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)
3/4/2007 7:07:31 PM|rosetta@home|No work from project




NYgnat

Joined: 28 Dec 06
Posts: 2
Credit: 244,894
RAC: 0
Message 37447 - Posted: 5 Mar 2007, 2:35:14 UTC - in response to Message 37438.  

Have 512mb on a 1.2ghz cpu with w2k. I can no longer get work units after finishing the last one.

3/4/2007 12:16:54 AM|rosetta@home|Message from server: Your preferences limit memory usage to 460.34MB, and a job requires 476.84MB
3/4/2007 12:16:54 AM|rosetta@home|Message from server: No work sent
3/4/2007 12:16:54 AM|rosetta@home|Message from server: (there was work but your computer doesn't have enough memory)

Change preferences to 95%, change pagefile to 2gb, still can not get any work.


Your prefs at 95% would be enough for 477Mb, 0.95 x 512 = 486 (now there's a nostalgic number!) so I am guessing that you have shared memory with your video card, yes? If not, I am totally confuzdled.

If so, you could go into the BIOS and reduce graphics memory to the smallest setting and see if that helped. Or possibly (used to be the case, don't know if it is still true with modern cards) there might be a program that came with the card to adjust the amount of video memory.

If BOINC's 477Mb is 477 x 1024 x 1024 then you need to get the video down to around 35Mb. If BOINC's 477 is 477 x 1000 x 1000, then that corresponds to about 454 x 1024 x 1024, and you need a video allocation < 58 Mb, or maybe 56Mb depending on rounding, etc. So either way a setting of 32Mb would show whether that is the way forward.

Remember, if you change any settings, to note what they are now, so that you can get back to where you are now if the effect on your system is too horrible...

R~~


My problem is quite similar on W2K/512MB and preferences set to 1 and 2 GIGs:

The problem was alleviated when I set prefs to 99% memory usage; then it started to download 5.48. Possibly a global problem with W2K at 512 MB RAM; my video card has its own RAM. I tried 95% and that didn't work either... Seems 5.48 is a memory hog...

Thomas Leibold

Joined: 30 Jul 06
Posts: 55
Credit: 19,627,164
RAC: 0
Message 37448 - Posted: 5 Mar 2007, 2:48:32 UTC - in response to Message 37417.  


I have also seen in boinc_manager that there was a time when only one task was running instead of the usual two with the other one showing a status of "waiting". However I did not see any corresponding entries in the messages tab (such as suspending and restarting the task).
Edit: I thought this was 'waiting for memory', but it just happened again and after expanding the width of the status column in boinc manager it is 'waiting to run' and corresponds to the time after it crashes with SIGSEGV before boinc restarts it again.


It is happening again, and this time I definitely saw the message "waiting for memory" in the BOINC manager task window (workunit 58661010).

What I'm wondering now is whether this is related to the two different memory preference settings introduced with BOINC 5.8. This machine is running 5.8.11 and has 1 GB of memory, which is enough for the scheduler to give it two of the HINGE workunits. However, when I'm doing other work on the system, the available memory may drop enough for BOINC to try stopping one of the Rosetta tasks (which then promptly fails with the same kind of SIGSEGV we see when the watchdog timer tries to terminate a Rosetta task).

2007-03-04 15:08:09 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 exited with zero status but no 'finished' file
2007-03-04 15:08:09 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 15:09:18 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 using rosetta version 548
2007-03-04 16:38:17 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 exited with zero status but no 'finished' file
2007-03-04 16:38:17 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 16:39:23 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_1078_1595_3691_0 using rosetta version 548
2007-03-04 16:50:13 [rosetta@home] Computation for task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0 finished
2007-03-04 16:50:13 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 16:50:15 [rosetta@home] [file_xfer] Started upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0_0
2007-03-04 16:50:43 [rosetta@home] [file_xfer] Finished upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0534_1595_1654_0_0
2007-03-04 16:50:43 [rosetta@home] [file_xfer] Throughput 38845 bytes/sec
2007-03-04 17:12:50 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:12:51 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 17:13:59 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 17:53:45 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:53:45 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 17:55:43 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 17:59:25 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 17:59:25 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:00:29 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:02:12 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:02:12 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:02:12 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:03:33 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:03:33 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:03:57 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:08:20 [rosetta@home] Task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 exited with zero status but no 'finished' file
2007-03-04 18:08:20 [rosetta@home] If this happens repeatedly you may need to reset the project.
2007-03-04 18:12:46 [rosetta@home] Restarting task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 using rosetta version 548
2007-03-04 18:12:48 [rosetta@home] Computation for task HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0 finished
2007-03-04 18:12:48 [rosetta@home] Sending scheduler request: To fetch work
2007-03-04 18:12:48 [rosetta@home] Requesting 8308 seconds of new work, and reporting 1 completed tasks
2007-03-04 18:12:50 [rosetta@home] [file_xfer] Started upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0_0
2007-03-04 18:12:53 [rosetta@home] Scheduler RPC succeeded [server version 509]
2007-03-04 18:12:53 [rosetta@home] Deferring communication for 4 min 2 sec
2007-03-04 18:12:53 [rosetta@home] Reason: requested by project
2007-03-04 18:12:54 [rosetta@home] [file_xfer] Finished upload of file HINGE_1nd7_CAPRI_21nd7_2_cc1nd7.ppk_0181_1595_3740_0_0
2007-03-04 18:12:54 [rosetta@home] [file_xfer] Throughput 38087 bytes/sec
2007-03-04 18:12:55 [rosetta@home] Starting HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0
2007-03-04 18:12:55 [rosetta@home] Starting task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:19:10 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:21:26 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548
2007-03-04 18:25:50 [rosetta@home] Resuming task HINGE_1nd7_CAPRI_31nd7_.pdb_1596_4655_0 using rosetta version 548

stderr.txt:
Graphics are disabled due to configuration...
# random seed: 2930724
SIGSEGV: segmentation violation
Stack trace (27 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8812550]
[0x804d2f0]
[0x888e6d0]
[0x8af7190]
[0x837c075]
[0x8a0e7b3]
[0x804d254]
[0x854b9cd]
[0x867872a]
[0x867ab18]
[0x867e6c6]
[0x868a9e1]
[0x854ffc0]
[0x8690835]
[0x804db1d]
[0x8872178]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8b59755]
[0x8b59831]
[0x8b59618]
[0x8069816]
[0x875e7a9]
[0x8be67cf]
[0x8b7791d]
[0x8b8102d]
[0x8c1292a]

Exiting...
Graphics are disabled due to configuration...
# random seed: 2930724
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8551b3e]
[0x8687e62]
[0x8688fe0]
[0x868ab44]
[0x854ffc0]
[0x8690835]
[0x804db1d]
[0x887152d]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
Stack trace (20 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x849380d]
[0x8a0efe3]
[0x804d254]
[0x854b9cd]
[0x867872a]
[0x854f14f]
[0x8690835]
[0x804db1d]
[0x8872178]
[0x8886b19]
[0x87c3bc2]
[0x8320d55]
[0x8521a9b]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800

P.S.: While I was still typing this message the workunit suddenly became 'finished' within 2 seconds of being restarted and was returned. A new workunit was started and seems to be doing the same thing (stopped and restarted by Boinc with "waiting for memory" status) except it hasn't crashed yet (no SIGSEGV in stderr.txt so far).

P.P.S.: I even got partial credit for this one (workunit 58661010):
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE ::     1 starting structures built        31 (nstruct) times
This process generated      2 decoys from       2 attempts
                            0 starting pdbs were skipped
======================================================

Regarding that message "Keep application in memory while preempted": I already do that, because otherwise Rosetta doesn't work at all. How about fixing that problem too? It sure has been around long enough! Keeping ~80 MB per Rosetta task in memory was just a minor annoyance; with these new HINGE workunits at several hundred MB each, it becomes a real issue.
Team Helix
Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 37449 - Posted: 5 Mar 2007, 4:39:13 UTC

Hi all, the "HINGE" WUs are simulating a very large protein with more than 800 residues (versus the fewer than 200 residues we normally run on BOINC), and they therefore require much more memory than usual. We have put a higher memory requirement on these jobs. They have also been assigned high priority because they are for a blind docking prediction with a deadline coming soon. That can explain why some low-memory clients cannot receive jobs temporarily.
Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 37450 - Posted: 5 Mar 2007, 4:56:15 UTC - in response to Message 37406.  

I think the problem is a lack of memory, as your machine has only 256 MB.
Other Rosetta wu in que seems to have the same problem... only this one stopt at 1.030%

;-)


At least this wu came with some info...


----------------

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 3363307
Graphics are disabled due to configuration...
# random seed: 3363307
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x8b63f87]
[0x8b7fdcc]
[0xffffe420]
[0x8bb654e]
[0x8bb68ad]
[0x82631fc]
[0x86a0dbf]
[0x8ae0a75]
[0x843d845]
[0x80e592d]
[0x8521d37]
[0x863a03b]
[0x863a0e4]
[0x8bdf184]
[0x8048121]

Exiting...

</stderr_txt>
]]>


-----------------

;-)


MattDavis
Joined: 22 Sep 05
Posts: 206
Credit: 1,377,748
RAC: 0
Message 37451 - Posted: 5 Mar 2007, 4:56:45 UTC
Last modified: 5 Mar 2007, 4:57:14 UTC

See, everyone, it's not a big conspiracy :P

Once again, BOINC tells us exactly what the issue is but people still act like something's horribly wrong -_-
Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 37453 - Posted: 5 Mar 2007, 5:12:41 UTC - in response to Message 37427.  
Last modified: 5 Mar 2007, 5:17:08 UTC

All the jobs we run on Rosetta@Home are tested on Ralph@Home beforehand, and if we see an unusually high rate of problems, we do not add them to the queue here. For example, all the "HINGE" WUs were tested on Ralph with the same high memory requirement, and the error rate was normal. However, since Ralph is only for testing purposes, we do not usually send out very many jobs in each batch. This means that 1. one client computer may not get multiple jobs running at the same time, and 2. not all platforms are tested in the same proportions as they are represented on the main project (I think Ralph is over-represented by Windows platforms). This could be partially responsible for the problems you are reporting here. We are sorry for any inconvenience and will try our best to do better testing in the future.

My guess is they only have 4,000 hosts to test on Ralph, but 273,000 to test with on Rosetta. Same as the graphics.........everything seems to get rolled out before it is thoroughly tested on Ralph.

Yet at the same time on the Ralph message boards are questions why there isn't enough work for testers. I don't think a lack of testers and test machines is the issue.


tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 37464 - Posted: 5 Mar 2007, 9:17:41 UTC - in response to Message 37449.  

Hi all, the "HINGE" WUs are simulating a very large protein which has more than 800 residues ( versus less than 200 residues normally we have run on BOINC ) and thus requires much more memory than usual. We have put a higher memory requirement on these jobs. Also, high priority has been assigned because it is for a blind docking prediction with the deadline coming soon. That can explain why some low-memory clients can not receive jobs temporarily.


I recommend announcing in advance when WUs with bigger memory requirements are about to be sent out (preferably as a news item, together with a warning that they might cause more errors and some problems). People don't complain if they know the cause and the need for it (deadline, competition, etc.). So much for the social aspects of DC.
j2satx

Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 37467 - Posted: 5 Mar 2007, 11:00:50 UTC - in response to Message 37453.  

All the jobs we are running on Rosetta@Home were tested on Ralph@Home beforehand and if we have seen an unusually higher rate of problems, we will not add them to the queue here. For example, all the "HINGE" WUs have been tested on Ralph with the same high memory requirement and the error rate was normal. However, since Ralph is only for testing purpose, we do not usually send out too many jobs for each batch. So this means that 1. one client computer may not get multiple jobs running at the same time; 2. not all platforms are tested with the same distribution as represented on BOINC ( I think Ralph is overly represented by windows platforms ). This could be partially responsible for the problems you guys are reporting here. We are sorry for any inconvenience and will try our best to do better testing in future.

You could show us the distribution of hosts on Ralph (like this page from Docking: http://docking.utep.edu/sharedmemory.php) and let us see which machines we need to add; that would help.

Perhaps you could double the number of test WUs for Ralph to reduce the errors showing up on Rosetta.

I've taken all my machines off Rosetta and added some to Ralph.
Profile netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 37475 - Posted: 5 Mar 2007, 15:13:25 UTC - in response to Message 37449.  
Last modified: 5 Mar 2007, 15:13:58 UTC

Hi all, the "HINGE" WUs are simulating a very large protein which has more than 800 residues ( versus less than 200 residues normally we have run on BOINC ) and thus requires much more memory than usual. We have put a higher memory requirement on these jobs. Also, high priority has been assigned because it is for a blind docking prediction with the deadline coming soon. That can explain why some low-memory clients can not receive jobs temporarily.


Hi Chu... This may seem a little petty, but could you have the HINGE WUs grant a bit more credit? It seems like my machines were working harder and should get a little more credit than they did...

If it's a big deal or you would need to change the granting across the board, then don't bother... it's just a thought... I will still crunch them when I get them...


Looking for a team ??? Join BoincSynergy!!


Profile netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 37476 - Posted: 5 Mar 2007, 15:18:02 UTC - in response to Message 37451.  

See, everyone, it's not a big conspiracy :P

Once again, BOINC tells us exactly what the issue is but people still act like something's horribly wrong -_-


I don't think anyone thought it was a conspiracy... just a lack of *COMMUNICATION*. As has been said elsewhere, a little blurb on the front page or a posting on the message boards could have averted some of the anguish... It still would not have helped those who did not see it and were left idle, but you know I am talking about the principle and not necessarily the effectiveness...


Looking for a team ??? Join BoincSynergy!!

