Waiting to Run

Message boards : Number crunching : Waiting to Run

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,568,105
RAC: 59,147
Message 74609 - Posted: 28 Nov 2012, 14:48:18 UTC

Does it just switch to "Waiting to Run" when you move the mouse? With the current settings I believe it will drop back to the last checkpoint (probably 0%) when interrupted, because when you move the mouse or press a key the limit switches from 90% of RAM available to 50% available, possibly causing the switch to "Waiting to Run" and back to 0% complete.

Are you able to leave the work in memory (paged to disk) when the task is suspended? That way it won't have to drop back to the last checkpoint and can continue processing from where it was up to.

i.e. change this to 1:
<leave_apps_in_memory>0</leave_apps_in_memory>
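
A minimal sketch of what that change might look like in global_prefs_override.xml, assuming the file lives in the BOINC data directory and that all other preferences stay as they are (only the one setting is shown; the comment is just an annotation):

```xml
<global_preferences>
    <!-- 1 = keep suspended tasks in memory (they can be paged to disk),
         so they resume where they stopped instead of at the last checkpoint -->
    <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

BOINC has to re-read the file (or be restarted) before the change takes effect.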

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 74611 - Posted: 28 Nov 2012, 16:07:15 UTC - in response to Message 74608.  
Last modified: 28 Nov 2012, 16:15:40 UTC

Actually, shouldn't these be the other way around:

<max_ncpus_pct>100.000000</max_ncpus_pct>
<cpu_usage_limit>50.000000</cpu_usage_limit>

I think max_ncpus_pct is the percentage of processors to use (so 50% to use one physical processor) and cpu_usage_limit is the proportion of run-time to pause-time while running. I'd recommend swapping those values and getting BOINC to re-read the file.
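
To illustrate the distinction, here is a sketch of the swapped values as they might appear in the override file. It assumes the goal is to use half of the processors, each without throttling; the comments are annotations, not part of anyone's posted settings.

```xml
<global_preferences>
    <!-- percentage of processors BOINC may use: 50 = half of them -->
    <max_ncpus_pct>50.000000</max_ncpus_pct>
    <!-- percentage of CPU time: 100 = run continuously, no pauses -->
    <cpu_usage_limit>100.000000</cpu_usage_limit>
</global_preferences>
```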

Yep. Haven't seen that.


Well, I was a bit hopeful. With the following changes (see my current settings below) it ran for about 1 1/2 days and then one job hit "Waiting to Run".

You might have missed my first post:

<leave_apps_in_memory>1</leave_apps_in_memory>
<ram_max_used_busy_pct>65.000000</ram_max_used_busy_pct>

If it still goes into "Waiting to Run" with that, I'd try:
<ram_max_used_busy_pct>70.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>70.000000</ram_max_used_idle_pct>

And then set this in your client_state.xml (exit BOINC first):
<user_run_request>1</user_run_request>

You probably have a 2 there right now.

E the P
Joined: 5 Jun 06
Posts: 36
Credit: 28,333,251
RAC: 0
Message 74627 - Posted: 30 Nov 2012, 13:42:48 UTC - in response to Message 74611.  

Ok, I've made the two suggested changes, changing leave_apps_in_memory to "1" and upping the busy memory to 65%. I'll report back on how it works.

E the P
Joined: 5 Jun 06
Posts: 36
Credit: 28,333,251
RAC: 0
Message 74662 - Posted: 4 Dec 2012, 13:02:42 UTC - in response to Message 74627.  

Sadly, same results. It runs for about a day to a day and a half and then hangs. Here is my latest override file:
---------------------------------------------------------
<global_preferences>
<run_on_batteries>0</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>0</run_gpu_if_user_active>
<idle_time_to_run>0.000000</idle_time_to_run>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>1</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.100000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>50.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>100.000000</disk_max_used_gb>
<disk_max_used_pct>50.000000</disk_max_used_pct>
<disk_min_free_gb>0.000000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>65.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>100.000000</cpu_usage_limit>
<suspend_cpu_usage>0.000000</suspend_cpu_usage>
</global_preferences>

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 74667 - Posted: 4 Dec 2012, 19:35:47 UTC - in response to Message 74662.  

Have you tried this part?
If it still goes into "Waiting to Run" with that, I'd try:
<ram_max_used_busy_pct>70.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>70.000000</ram_max_used_idle_pct>

And then set this in your client_state.xml (exit BOINC first):
<user_run_request>1</user_run_request>

You probably have a 2 there right now.

Also post the log from when BOINC suspends the task.

It also might be helpful to use <cpu_sched>, <cpu_sched_debug> and <mem_usage_debug> in cc_config, so we can better see in the log what's going on there.
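
For reference, a sketch of a cc_config.xml with those three log flags switched on, assuming the file sits in the BOINC data directory and contains no other options that need to be preserved:

```xml
<cc_config>
    <log_flags>
        <cpu_sched>1</cpu_sched>
        <cpu_sched_debug>1</cpu_sched_debug>
        <mem_usage_debug>1</mem_usage_debug>
    </log_flags>
</cc_config>
```

The extra messages should show up in the event log once BOINC has re-read its config files or been restarted.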

What is the size of the pagefile/partition (whatever that is called in Linux)?

E the P
Joined: 5 Jun 06
Posts: 36
Credit: 28,333,251
RAC: 0
Message 74704 - Posted: 10 Dec 2012, 14:11:15 UTC - in response to Message 74667.  

So far it's gone the entire weekend with no hangups. I'll keep monitoring and apply your suggestions if it hangs.

E the P
Joined: 5 Jun 06
Posts: 36
Credit: 28,333,251
RAC: 0
Message 74966 - Posted: 24 Jan 2013, 15:27:42 UTC - in response to Message 74704.  

Well, as a final wrap-up: I am now processing for about 4-5 days without a termination. At this point I can live with that. I want to thank everyone on this list for your suggestions and help.

Kong Kandal den 1.
Joined: 28 Apr 06
Posts: 1
Credit: 9,024,376
RAC: 0
Message 75450 - Posted: 24 Apr 2013, 14:41:19 UTC - in response to Message 74966.  

Hello

I am experiencing the same problem and have not found any solution.
I have tried all the tricks in this thread, but nothing seems to help.

Any advice will be appreciated.

Thank you.

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 75547 - Posted: 30 Apr 2013, 9:36:37 UTC - in response to Message 75450.  

For any advice you need to unhide your computers or post a link to the details page of the affected machine. Also post the contents of your global_prefs_override.xml or, if that file is not available on your computer (in your BOINC data directory), then global_prefs.xml.

Xenus
Joined: 14 May 09
Posts: 2
Credit: 659,344
RAC: 8
Message 75561 - Posted: 4 May 2013, 16:11:30 UTC - in response to Message 74054.  

I'm running BOINC on an Ubuntu 12 system and about 6-8 weeks ago it began to develop a problem (no new software/hardware changes). It will frequently get stuck with one job in the "Waiting to Run" state. If I manually abort that work unit it will begin to run the next job normally. The pattern is inconsistent. Sometimes it will process 2-4 work units just fine, other times it will hang on 2-3 in a row. Any thoughts?


Exactly the same problem in Ubuntu 12.04 and 12.10 with BOINC 7.0.27 and Rosetta tasks.
I also get the next task stuck on "Waiting to Run" for no good reason. Aborting that task then gets the "Ready to Start" tasks running.



Xenus
Joined: 14 May 09
Posts: 2
Credit: 659,344
RAC: 8
Message 75562 - Posted: 4 May 2013, 16:21:57 UTC - in response to Message 75561.  

Looks like the max memory issue. Increasing the percentage of memory BOINC is allowed to use gets the process running again. It seems the Rosetta jobs have large and/or varying memory requirements. Ideally there should be a log message to indicate that a job can't run without more memory, or the job should simply abort itself to allow another job to run.

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 75662 - Posted: 25 May 2013, 16:39:03 UTC - in response to Message 75562.  

You need to allow at least 500 MB per Rosetta task, better 1 GB since some tasks need that much. Check the posts in this thread if the issue comes back; all the relevant settings have been posted above.
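
As a rough worked example, assume a hypothetical machine with 4 GB of RAM running 2 Rosetta tasks at once (these numbers are illustrative, not taken from any machine in this thread). At 1 GB per task BOINC needs 2 GB, which is 50% of 4 GB, so both memory limits would have to be at least 50%. A sketch with some headroom:

```xml
<global_preferences>
    <!-- 2 tasks x 1 GB = 2 GB = 50% of a hypothetical 4 GB machine;
         70% (2.8 GB) leaves headroom for larger tasks -->
    <ram_max_used_busy_pct>70.000000</ram_max_used_busy_pct>
    <ram_max_used_idle_pct>70.000000</ram_max_used_idle_pct>
</global_preferences>
```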



©2024 University of Washington
https://www.bakerlab.org