Posts by amgthis

41) Questions and Answers : Windows : partial completion 'waiting to run' (Message 71630)
Posted 22 Nov 2011 by amgthis
Post:
I've since put together another quad core box this time with the 2500
sandy bridge intel and 16g of ram. I have boinc manager set to use
100% of memory and plenty of disk space. I'm also now running Debian
'squeeze' release on this box. Unfortunately, Boinc running Rosetta
still exhibits this behavior of abandoning wu's partially completed, and
starting work with *later* expire dates. This has resulted in lots
of work dying on the vine and expiring prior to completion. I've also
set my extra work buffer up to 9 days sometimes because I've run out of
work. Now I've lowered it to 4-5 days, since it never really gathers
enough work for all cores for all days you set anyhow. Plus I didn't want
it starting even more work before finishing others already in progress.
BTW, right now Rosetta is the only project for this manager to
try and manage. (6.10.58 from the debian stable tree)

With all cores running 100% 24/7 no restrictions - my memory free
is over 9 gigs. No swap being used.

It seems the manager really isn't all that great at queuing work
in a way that consistently keeps good work from going to waste by
not being returned on time.


Mod Sense, thanks again. More experimentation is needed. I have only one i7 cpu
but I can tweak both ways for a couple of weeks and watch what happens. I'm hoping to install 64-bit Windows 7 if I can get past some BIOS issues. I did watch while BOINC 'orphaned' off several of my nearly complete WU's as time expired and they were still 'waiting to run'. So that answered one question I had - BOINC will let the WU expire past its due date and start newer WU's with
later deadline dates if you have memory issues like I do.



I have a Dell T7500 with dual 6 core Xeons (HT enabled) for a total of 24 threads available (boinc is set to use only 85%).

I'm seeing this same behavior and I have 48gb of ram. I will check to see how much I am allocating to boinc when I get home and adjust from there.

Sounds like I'm close to running out of headspace trying to do too much all at once (GPUGrid is also running with two GPUs at the same time).

Mike

42) Questions and Answers : Windows : partial completion 'waiting to run' (Message 68851)
Posted 21 Dec 2010 by amgthis
Post:
Mod Sense, thanks again. More experimentation is needed. I have only one i7 cpu
but I can tweak both ways for a couple of weeks and watch what happens. I'm hoping to install 64-bit Windows 7 if I can get past some BIOS issues. I did watch while BOINC 'orphaned' off several of my nearly complete WU's as time expired and they were still 'waiting to run'. So that answered one question I had - BOINC will let the WU expire past its due date and start newer WU's with
later deadline dates if you have memory issues like I do.

More testing is in order.

Merry Christmas everyone!
43) Questions and Answers : Windows : partial completion 'waiting to run' (Message 68783)
Posted 7 Dec 2010 by amgthis
Post:
Mod.Sense - first thanks for taking the time for such a detailed response.
I believe what you are saying makes complete sense for my situation. I just installed my first i7 Bloomfield core cpu, and while it's a quad I was a little surprised to see it running 8 tasks right off the bat. The i7's Hyper-Threading accounts for that, apparently. The box has 4 gigs of RAM but, being Windows XP, it's only using 3. I just checked another box with a Q9550 quad that has done the same thing, with 2 WU's now waiting to run. Same deal: XP,
4 gigs of RAM (3 usable), etc. I think I just hit some bigger projects that pegged my RAM.

My preferences are set to use 100% of all memory, page file, etc. on my boxes.

Everything else you write appears to be what I've seen. I'll recheck system log messages also to see if this is started by a 'waiting for memory' issue that morphs into the 'waiting to run' as you write.

I'm leaving the rest of your great response complete so hopefully it can help someone else if they experience this and are wondering. Now it will be shown twice on the page.

Thanks again and best of the Holidays to you and everyone associated with Rosetta@home.

/amgthis
My best guess is that this is a memory issue. What can happen, especially with a many core machine, is that a task reaches a point or a model that requires more memory than the rest of the execution has. The combination of all 4 running at the same time then exceeds your preference for how much memory BOINC should use and the task goes to a status of "waiting for memory"... and BOINC seems to take a note that indicates it was using xxx MB of memory when it got deferred to the waiting status.

And so it starts a new task, hoping it might run with less memory and often it can. At no point during the execution of the new task does the memory requirement of the 3 other tasks go low enough for this one that's waiting to run. And so BOINC continues to wait on that one.

Then you reboot your computer, or restart BOINC, and it knows how much memory that task needs, and it doesn't start it for the same reasons that existed when the machine was powered down. At this point, I believe I'm correct in saying it will show a "waiting to run" status rather than the previous "waiting for memory" status. The reason for this might be that it only shows the waiting for memory status when this run of BOINC has actually kicked it out due to the preference on how much memory to use for BOINC tasks. Since it hasn't run it yet this time, it shows the status differently. But the underlying fact is that BOINC knows how much memory needs to be free for that task to run, and that is now why it is waiting.

Most people allow a higher percentage of memory to be used by BOINC when the machine is idle. And so often such tasks will be picked up and run when the machine is not in use and BOINC is allowed more memory. But BOINC strives to preserve as much completed work as possible as well, and so it probably wouldn't transition back to that task until another task reaches a checkpoint. So I wouldn't expect it to instantly pick it up when the machine is idle for the configured number of minutes.

BOINC is trying to meet your preferences. One presumably is to use all 4 CPUs, and another is for BOINC to live within your preference for memory usage.

What happens when the task approaches deadline? It sounds like it does eventually run... when it does run again, do you find you only have 3 active tasks? Or perhaps a fourth that is just getting started and is not using much memory yet?

How much memory does your machine have? How much is BOINC configured to use? (check the messages as BOINC starts).

If memory does prove to be the issue, there are only a few ways for it to run any differently than it already is:
1) Get more memory, or allow BOINC to use a higher percentage of memory (which may make your machine a bit sluggish, but try it and see. You can always set it back)
2) Keep the BOINC % of memory the same for when active, but allow BOINC to use more when idle. This can make it take a moment to wake up after you've been away for a while. Generally not a big deal, just don't panic fearing a blue screen of death. But it should give enough for the task to run when you are away.
3) Limit BOINC to using 3 (or fewer) CPUs. This would reduce the amount of memory BOINC needs to run, but also reduce your throughput.
4) Manually suspend tasks until your pesky one runs.
5) Don't worry about it. BOINC will "git 'er done" when the deadline approaches. No tinkering required.
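
To make the memory reasoning above concrete, here is a rough sketch in Python. It is not BOINC's actual scheduler; the function name, the greedy order, and the per-task memory figures are all assumptions made up for illustration. It only shows how a memory budget (total RAM times the "use at most X% of memory" preference) can leave one large task waiting while smaller, newer tasks are started in its place.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    mem_mb: int  # memory the task was last seen using (hypothetical figure)


def pick_runnable(tasks, total_ram_mb, boinc_mem_pct, n_cpus):
    """Greedy illustration: start each task, in order, if a CPU is free
    and the task still fits inside the memory budget."""
    budget_mb = total_ram_mb * boinc_mem_pct / 100
    running, waiting = [], []
    used_mb = 0
    for task in tasks:
        if len(running) < n_cpus and used_mb + task.mem_mb <= budget_mb:
            running.append(task.name)
            used_mb += task.mem_mb
        else:
            # Does not fit right now, so it sits and waits.
            waiting.append(task.name)
    return running, waiting


# Hypothetical queue: three ordinary tasks, one memory-hungry task, one new task.
queue = [Task("a", 300), Task("b", 300), Task("c", 300),
         Task("big", 1100), Task("d", 300)]

# 3 GB usable, BOINC allowed 50%: "big" waits and the newer task "d" starts.
print(pick_runnable(queue, total_ram_mb=3072, boinc_mem_pct=50, n_cpus=4))

# Same machine, BOINC allowed 100%: "big" fits and "d" waits for a free CPU.
print(pick_runnable(queue, total_ram_mb=3072, boinc_mem_pct=100, n_cpus=4))
```

Options 1 to 3 in the list map onto the three arguments: raising `total_ram_mb` or `boinc_mem_pct` enlarges the budget, and lowering `n_cpus` reduces how much memory is needed at once.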

44) Questions and Answers : Windows : partial completion 'waiting to run' (Message 68767)
Posted 6 Dec 2010 by amgthis
Post:
Win XP with intel quad core cpu's.
45) Questions and Answers : Windows : partial completion 'waiting to run' (Message 68766)
Posted 6 Dec 2010 by amgthis
Post:
I notice sometimes my work units are shown partially (sometimes nearly) finished, but listed as 'waiting to run' while other work units have been started. I try to cache several days' worth of work since I've run out many times in the past when the project is down. I don't understand why these units stop in the middle while others are started and finished, and then new work is started, yet somehow the 'waiting to run' units sit. Some are around 95% complete and they just sit and wait until they expire because the work was not completed by the deadline. Does anyone know why this occasionally happens? The BOINC manager version doesn't seem to matter; I have this happen with new and old versions. Why, if a partial WU shows 'waiting to run' and it's almost totally finished, does it never restart before a brand new WU starts?
46) Message boards : Number crunching : minirosetta 2.15 (Message 67956)
Posted 4 Oct 2010 by amgthis
Post:
I have a quad core Q6700 with 4 gigs of RAM and I'm having the same problem reported here. With Windoze XP SP2 I'm getting constant 'nag' bubbles about low system memory. I check the usage under task manager and one instance is using over 1 gig of memory. The other 3 running WU's are looking more typical, using
right around ~300k each of RAM, plus or minus. The work unit that is sucking over a gig is this one:

task T0592_t4_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_22268_1995_0

running under 2.15.

I have my preferences basically set to use any resources they can grab, which has always worked great up till now. Also, I never run the graphic screen saver. I leave my computer on 24/7 with no other restrictions on Rosetta, other
than the time it accesses my local LAN.
47) Message boards : Number crunching : No new work (Message 64689)
Posted 1 Jan 2010 by amgthis
Post:
My boxes are running out of work.
48) Questions and Answers : Windows : lr8 wu's won't run (Message 64331)
Posted 2 Dec 2009 by amgthis
Post:
These look like the "rama" work units that were posted last week. Your machines are hidden so I can't see the history of your specific WUs or when you got them to determine why it took you so long to run in to them. But, these have been discussed in some detail on the Number Crunching board.



Thanks, Mod.Sense. All of my boxes have been running slower this last week or
so - maybe that accounts for the backup.
49) Questions and Answers : Windows : lr8 wu's won't run (Message 64328)
Posted 1 Dec 2009 by amgthis
Post:
Is anyone else having trouble with these?

01-Dec-2009 12:44:41 [rosetta@home] Starting lr8_combine_smooth_torsion_it00_rama04_A_rlbd_1tul_IGNORE_THE_REST_DECOY_14889_747_0
01-Dec-2009 12:44:42 [rosetta@home] Starting task lr8_combine_smooth_torsion_it00_rama04_A_rlbd_1tul_IGNORE_THE_REST_DECOY_14889_747_0 using minirosetta version 200
01-Dec-2009 12:44:53 [rosetta@home] Computation for task lr8_combine_smooth_torsion_it00_rama04_A_rlbd_1tul_IGNORE_THE_REST_DECOY_14889_747_0 finished
01-Dec-2009 12:44:53 [rosetta@home] Output file lr8_combine_smooth_torsion_it00_rama04_A_rlbd_1tul_IGNORE_THE_REST_DECOY_14889_747_0_0 for task lr8_combine_smooth_torsion_it00_rama04_A_rlbd_1tul_IGNORE_THE_REST_DECOY_14889_747_0 absent

dies after ~ 11 seconds and then 'output file absent'.

I've had a bunch of these do this on several boxes.
Sorry about the word wrap but I thought the time stamp should be included.
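
For anyone wanting to spot these in a long message log, here is a small Python sketch. It assumes the log lines look exactly like the ones pasted above (timestamp, project in square brackets, message); the function name and the 60 second cutoff are my own choices, not anything from BOINC. It pairs each "Starting task" line with a later "Output file ... absent" line for the same task and reports how long the task ran.

```python
import re
from datetime import datetime

# Matches lines such as:
# 01-Dec-2009 12:44:42 [rosetta@home] Starting task <name> using minirosetta version 200
LINE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.+?)\] (.*)$")
START = re.compile(r"^Starting task (\S+)")
ABSENT = re.compile(r"^Output file \S+ for task (\S+) absent$")


def short_failures(log_text, max_seconds=60):
    """Return (task_name, seconds_run) for tasks whose output file was
    reported absent within max_seconds of the task starting."""
    started = {}
    failures = []
    for raw in log_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # not in the assumed format, skip it
        when = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
        message = match.group(3)
        start = START.match(message)
        if start:
            started[start.group(1)] = when
            continue
        absent = ABSENT.match(message)
        if absent and absent.group(1) in started:
            elapsed = (when - started[absent.group(1)]).total_seconds()
            if elapsed <= max_seconds:
                failures.append((absent.group(1), elapsed))
    return failures
```

Fed the four lines above, this should report the one task with a run time of 11 seconds. Logs written in a different timestamp layout would need a different pattern.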
50) Message boards : Number crunching : Minirosetta 1.90 and 1.91 (Message 62785)
Posted 5 Aug 2009 by amgthis
Post:
Anyone else having trouble with these w/u's bombing out after only 20 seconds or so? More output file absent errors?

05-Aug-2009 12:16:20 [rosetta@home] Starting task lr5_combine_mods_run01_rlbn_1enh_IGNORE_THE_REST_NATIVE_14608_23_0 using minirosetta version 190
05-Aug-2009 12:16:42 [rosetta@home] Computation for task lr5_combine_mods_run01_rlbn_1enh_IGNORE_THE_REST_NATIVE_14608_23_0 finished
05-Aug-2009 12:16:42 [rosetta@home] Output file lr5_combine_mods_run01_rlbn_1enh_IGNORE_THE_REST_NATIVE_14608_23_0_0 for task lr5_combine_mods_run01_rlbn_1enh_IGNORE_THE_REST_NATIVE_14608_23_0 absent
51) Message boards : Number crunching : Problems with web site (Message 59593)
Posted 16 Feb 2009 by amgthis
Post:
server status says everything is running but no results can be uploaded. ???

huh??

52) Questions and Answers : Windows : SAN upgrade issue? (Message 57524)
Posted 3 Dec 2008 by amgthis
Post:


Hitting 'update' about 10 times is a slow and dirty fix. Once the master file
is fetched, all is redirected to the new server URL.






this must be due to the upgrade not being quite completed yet:

01-Dec-2008 16:59:22 [rosetta@home] Message from server: Server error: can't attach shared memory

'patience is a virtue' my old girlfriend
used to claim. I'm still not sure I completely believed her...... 8^)

53) Message boards : Number crunching : Problems with web site (Message 57523)
Posted 3 Dec 2008 by amgthis
Post:


Hitting 'update' about 10 times worked for me. PITA on 20 boxes, though.
I guess I could have done it thru boingmanager. Thanks moderators and others
for the suggestion.

/amgthis



Now getting a message from BOINC

Mon 01 Dec 2008 07:49:39 PM EST|rosetta@home|Message from server: Server error: can't attach shared memory

This has been happening the last couple of hours...



Ditto for me. And before that, I was seeing all of my Rosetta Mini 1.40 "abinitio_nohomgraf_..." tasks failing to report in for at least several hours this morning. They never did successfully transfer. I've got them queued up.

54) Questions and Answers : Windows : SAN upgrade issue? (Message 57439)
Posted 2 Dec 2008 by amgthis
Post:
this must be due to the upgrade not being quite completed yet:

01-Dec-2008 16:59:22 [rosetta@home] Message from server: Server error: can't attach shared memory

'patience is a virtue' my old girlfriend
used to claim. I'm still not sure I completely believed her...... 8^)
55) Questions and Answers : Windows : Output file absent (Message 57227)
Posted 25 Nov 2008 by amgthis
Post:
I'm thinking this is why I'm getting many 'computation error' messages, even on units that have run what appears to be the full time to completion (7:48 or so)
I'm set for 8 hr. work units:

<snip>

24-Nov-2008 11:51:07 [rosetta@home] Computation for task loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_73_0 finished
24-Nov-2008 11:51:07 [rosetta@home] Output file loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_73_0_0 for task loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_73_0 absent
24-Nov-2008 11:51:07 [rosetta@home] Computation for task loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_74_0 finished
24-Nov-2008 11:51:07 [rosetta@home] Output file loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_74_0_0 for task loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t286__olange_IGNORE_THE_REST_1FXWF_9_4817_74_0 absent

<snip>

anyone else seen this? I have had a lot of the infamous 'no finished file' errors also but this one
appears to be new for me.

/amgthis
56) Questions and Answers : Windows : What's wrong with the "Rosetta mini with new score terms 1.02"? (Message 56752)
Posted 7 Nov 2008 by amgthis
Post:
Same here all of those work units done blowed up.

/amgthis



Are you from South Carolina?


No, Bruce. I was just practicing my NASCAR-speak.

8^)

If I could understand the error messages better I'd
forward them along but I think other people had that covered
already.


Of course! NASCAR speak. I used to live near Charlotte, a big NASCAR city (Lowe's Motor Speedway, but I bet you knew that).

My Rosetta trouble has cleared up.

Good Luck!

Bruce.


Thanks, Bruce. Yeah I think with 1.40 hopefully all the problems will stop.
I still had some of the older "mini with new score terms" queued but they are
almost all gone now. I've had no other problems (lately) with any other version(s).

Cheers,

/amgthis
57) Questions and Answers : Windows : What's wrong with the "Rosetta mini with new score terms 1.02"? (Message 56730)
Posted 6 Nov 2008 by amgthis
Post:
Same here all of those work units done blowed up.

/amgthis



Are you from South Carolina?


No, Bruce. I was just practicing my NASCAR-speak.

8^)

If I could understand the error messages better I'd
forward them along but I think other people had that covered
already.
58) Questions and Answers : Windows : What's wrong with the "Rosetta mini with new score terms 1.02"? (Message 56693)
Posted 4 Nov 2008 by amgthis
Post:
Same here all of those work units done blowed up.

/amgthis
59) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52870)
Posted 5 May 2008 by amgthis
Post:
The mini rosetta 1.15 units just continually crash. Why keep queuing them to
distribute until the problems are sorted? People are wasting k watts of power
for nothing in the meantime...
I would think we would just line up 5.96 units until the bugs were sorted instead
of wasting thousands of watts of energy for nothing.

????


I just abort them as soon as I see them but I'm sure that may be a problem for someone such as yourself with 14K RAC, sadly I only have 1 computer...

I would set Rosetta to "no new work" for the time being and come back later when this gets all sorted out.

Yes, you are right. I should stop whining and do as you say or just run another
project in the meantime. Hopefully it will be sorted out soon.

/amgthis


60) Message boards : Number crunching : minirosetta v1.15 bug thread (Message 52856)
Posted 4 May 2008 by amgthis
Post:
The mini rosetta 1.15 units just continually crash. Why keep queuing them to
distribute until the problems are sorted? People are wasting k watts of power
for nothing in the meantime...
I would think we would just line up 5.96 units until the bugs were sorted instead
of wasting thousands of watts of energy for nothing.

????





©2024 University of Washington
https://www.bakerlab.org