Frequent hung work units

Questions and Answers : Windows : Frequent hung work units


Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64863 - Posted: 8 Jan 2010, 20:12:48 UTC
Last modified: 8 Jan 2010, 20:19:15 UTC

I have two Windows machines running BOINC and Rosetta@home: one XP machine and one Win7 x64 machine. Both are running BOINC v6.10.18. I'm frequently getting hung WUs on *both* machines. When a WU is hung, I see no progress after more than 15 hours of run time and an ever-climbing 'To completion' time. I can abort the WUs in question, and the machine will typically process several additional WUs, but it will eventually (within a day or two) hang again.

When a task is hung, it's reported as running, and I can see the process in the Windows Task Manager (minirosetta_2.03_windows_intelx86.exe on the XP box that I'm at right now). It's 'using' memory, but it never uses any processor time. Meanwhile, another WU running on the other core is taking its typical 50%. I've got one right now that says it's been running for 15.5 hours with 30.5 hours remaining.

[edit]I should add that while a work unit is hung, the screensaver does not work; it simply displays a completely blank black screen. Additionally, the hung work units do not always hang in the same place. They typically start processing normally and get hung up midway. A currently hung WU (job 16684_27_0) is stuck at 8.459%, hasn't made any progress in more than 12 hours, and is currently using no CPU cycles.[/edit]
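
For anyone who wants to check this symptom without watching Task Manager, here is a minimal sketch of the test described above: a task counts as stalled when its cumulative CPU time has barely moved over a long stretch of wall-clock time. The function name, the one-hour window, and the one-second threshold are assumptions for illustration, not anything BOINC or Rosetta provides. The samples would have to be collected separately, for example by noting the "CPU Time" column in Task Manager at intervals.

```python
from typing import Sequence, Tuple

# One sample = (wall-clock seconds, cumulative CPU seconds of the process).
Sample = Tuple[float, float]


def is_stalled(samples: Sequence[Sample],
               min_wall_seconds: float = 3600.0,
               max_cpu_seconds: float = 1.0) -> bool:
    """Hypothetical helper: True if the process used at most
    max_cpu_seconds of CPU over a window of at least min_wall_seconds
    ending at the latest sample."""
    if len(samples) < 2:
        return False  # nothing to compare against yet
    latest_wall, latest_cpu = samples[-1]
    # Walk backwards to the newest sample that is old enough to
    # span the required window.
    for wall, cpu in reversed(samples[:-1]):
        if latest_wall - wall >= min_wall_seconds:
            return (latest_cpu - cpu) <= max_cpu_seconds
    return False  # not enough history to judge


if __name__ == "__main__":
    hung = [(0.0, 100.0), (1800.0, 100.2), (3600.0, 100.2)]
    healthy = [(0.0, 100.0), (3600.0, 3500.0)]
    print(is_stalled(hung), is_stalled(healthy))
```

A check like this only flags the symptom (memory in use, no CPU time); it says nothing about why the task stopped.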
ID: 64863 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64877 - Posted: 9 Jan 2010, 17:25:53 UTC

What is the status shown for the two active tasks? What shows in the messages around the time it stopped getting CPU time? I am thinking perhaps the BOINC Manager suspended the task to ensure that the two combined do not exceed your memory usage preferences. The amount of memory used by a WU varies as it runs, so at the random points in time where both are hitting a peak, it sounds like the total is crossing your preference. Note: I'm talking about the amount of memory BOINC is allowed to use, not the amount of memory on the machine.

I guess that doesn't really explain 10+ hours though.

Ideally, you would try suspending tasks and then restarting them first. This can often clear up any problem, and preserve the work you have completed.

Does this seem to you like a new problem with the 2.03 version? Or has it been occurring longer than that? Any pattern in the names of the WUs that are hanging?
Rosetta Moderator: Mod.Sense
ID: 64877 · Rating: 0
Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64919 - Posted: 11 Jan 2010, 18:49:01 UTC - in response to Message 64877.  

[quote]What is the status shown for the two active tasks?[/quote]


They display as 'Running'

[quote]What shows in the messages around the time it stopped getting CPU time?[/quote]


Nothing in there at all. It starts and then there are no further messages about it until I suspend or abort it.

[quote]I am thinking perhaps the BOINC Manager suspended the task to ensure that the two combined do not exceed your memory usage preferences. The amount of memory used by a WU varies as it runs, so at the random points in time where both are hitting a peak, it sounds like the total is crossing your preference. Note: I'm talking about the amount of memory BOINC is allowed to use, not the amount of memory on the machine.[/quote]


Both of the machines in question are set to allow up to 90% of memory when idle and 75% of swap space. Their memory usage doesn't appear to be anywhere near that high when they stall. Each machine has 4GB of physical memory and each process is using 180-300MB. Frequently, the next work unit to come along will use *more* memory (and more total memory between the two running WUs) and complete just fine.
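
As a rough check on the numbers above, here is a minimal sketch of the memory comparison being discussed. It assumes a simplified model in which BOINC's budget is just physical memory multiplied by the allowed fraction. The function names are hypothetical, and this is not BOINC's actual scheduling logic, which may weigh other factors such as swap limits.

```python
from typing import Iterable


def memory_budget_mb(physical_mb: float, max_fraction: float) -> float:
    """Assumed model: the budget is a fixed fraction of physical RAM."""
    return physical_mb * max_fraction


def exceeds_budget(task_mb: Iterable[float],
                   physical_mb: float,
                   max_fraction: float) -> bool:
    """True if the combined task memory is over the assumed budget."""
    return sum(task_mb) > memory_budget_mb(physical_mb, max_fraction)


if __name__ == "__main__":
    physical_mb = 4 * 1024        # 4 GB machine, as reported in the post
    max_fraction = 0.90           # "up to 90% of memory when idle"
    worst_case = [300.0, 300.0]   # two tasks at the top of the 180-300 MB range

    print(round(memory_budget_mb(physical_mb, max_fraction), 1))  # 3686.4
    print(exceeds_budget(worst_case, physical_mb, max_fraction))  # False
```

Under that assumption, even two tasks at 300 MB each total 600 MB against a budget of roughly 3.6 GB, which is consistent with the observation that memory limits do not look like the cause.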

[quote]Ideally, you would try suspending tasks and then restarting them first. This can often clear up any problem, and preserve the work you have completed.[/quote]


I actually tried that first -- suspending and resuming them had no impact.

[quote]Any pattern in the names of the WUs that are hanging?[/quote]


Here are the most recent WUs from the XP machine that I've had to abort:

homopt_nat.t312_.t312_.IGNORE_THE_REST.native_0001_0026.pdb.JOB_16681_27_0
homopt_nat.t322_.t322_.IGNORE_THE_REST.native_0001_0095.pdb.JOB_16684_27_0
homopt2b.t331_.t331_.IGNORE_THE_REST.S_00002_0000473_00069.pdb.JOB_16718_12_0

ID: 64919 · Rating: 0
Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64922 - Posted: 11 Jan 2010, 21:15:50 UTC - in response to Message 64919.  

I have another one that appears to be currently stalled as well:

homopt4.t293_.t293_.IGNORE_THE_REST.S_00002_0000001_0_0_00088.pdb_00008.pdb_00002.pdb.JOB_16810_2_0


It's only been running for 2.5 hours, but it's exactly the same presentation. It's showing as 'Running', taking 220MB of memory (Task Manager), and using no CPU time. The 'To completion' estimate just keeps on climbing - it's at 5 hours and climbing right now.
ID: 64922 · Rating: 0
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 64933 - Posted: 12 Jan 2010, 17:17:58 UTC

Those are my jobs. It seems like the display counter is not being updated during this protocol. This is a cosmetic problem though - the jobs are running just fine underneath and we're getting lots of good data back! I'll put a bug fix in the next version.


Cheers, Mike
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 64933 · Rating: 0
Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64934 - Posted: 12 Jan 2010, 18:44:22 UTC - in response to Message 64933.  
Last modified: 12 Jan 2010, 18:48:45 UTC

[quote]Those are my jobs. It seems like the display counter is not being updated during this protocol. This is a cosmetic problem though - the jobs are running just fine underneath and we're getting lots of good data back! I'll put a bug fix in the next version.

Cheers, Mike[/quote]



It's *not* a display problem. When they hang, they use no processor time whatsoever and they *never* finish. I have one that has been going for 30 hours solid right now.

Perhaps there is something specific about my configuration on these machines that is causing a problem, but once I get two hung WUs on a machine (both are dual core machines), it's completely stopped at that point and will never process another WU - at least out to just over 40 hours of run time.
ID: 64934 · Rating: 0
Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64952 - Posted: 13 Jan 2010, 18:39:39 UTC - in response to Message 64934.  

I find it difficult to believe that more people aren't having this problem. This is happening on two totally fresh/vanilla installs of BOINC with Rosetta@home as the only running science app.
ID: 64952 · Rating: 0
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 64955 - Posted: 13 Jan 2010, 21:03:48 UTC

Hmm. Unfortunately, we've not been able to reproduce this problem here.
There is an update going out today (2.05) (it may have already gone out, in fact) that fixed a different issue to do with checkpointing. Is that version still giving you these troubles?

Mike

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 64955 · Rating: 0
Brock Jones

Joined: 30 Dec 09
Posts: 6
Credit: 163,688
RAC: 0
Message 64956 - Posted: 14 Jan 2010, 1:13:47 UTC - in response to Message 64955.  

[quote]Hmm. Unfortunately, we've not been able to reproduce this problem here.
There is an update going out today (2.05) (it may have already gone out, in fact) that fixed a different issue to do with checkpointing. Is that version still giving you these troubles?[/quote]


Nope -- these are running 2.03. I just had one fail with a 'Computation error' after 53 hours. I've got another one that's been running for 32 hours and predicts 50 hours remaining.
ID: 64956 · Rating: 0
Admin

Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 64958 - Posted: 14 Jan 2010, 3:33:40 UTC

Brock,

I had the same issue, but the 2.05 update seems to have fixed it. Abort the stuck WUs, as nothing else will happen other than a time increase. I had to abort 5-6 in a row due to this issue. Hope this helps ya.
ID: 64958 · Rating: 0
Admin

Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 64968 - Posted: 14 Jan 2010, 15:06:38 UTC

Just had to abort a stuck homopt WU at 40% on minirosetta 2.05. It doesn't look like whatever the issue is has been totally fixed yet.
ID: 64968 · Rating: 0
mdillenk

Joined: 19 Feb 06
Posts: 8
Credit: 865,454
RAC: 0
Message 65200 - Posted: 4 Feb 2010, 3:41:35 UTC - in response to Message 64952.  

[quote]I find it difficult to believe that more people aren't having this problem. This is happening on two totally fresh/vanilla installs of BOINC with Rosetta@home as the only running science app.[/quote]

I'm having the same exact problem:
Jobs such as these never finish. In the BOINC client they look frozen, and the job doesn't use any CPU. I would guess that between 5% and 10% of the jobs do this. I'm running the 64-bit BOINC client on 64-bit Windows 7. Is anybody else having problems like this, or does anyone know what may be wrong?

t374__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_5733_0


t365__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_5996_0

lr15clus_opt_.1eyv.1eyv.IGNORE_THE_REST.c.10.2.pdb.pdb.JOB_17448_1_0
ID: 65200 · Rating: 0
banicki

Joined: 7 Dec 05
Posts: 1
Credit: 7,676,012
RAC: 1,024
Message 65452 - Posted: 3 Mar 2010, 3:30:27 UTC
Last modified: 3 Mar 2010, 3:37:18 UTC

Me too! My brand-new Win7 x64 machine picks up units, starts them, and sometimes they just stop processing. They stop requesting or using CPU for long periods of time, like 24 hours, with no credits rac'ed up. These are 4-hour units. Tasks that I aborted as suspected hangs:

lrmixclus2_opt_.1bq9.1bq9.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.1.pdb.pdb.JOB_18226_2_0 aborted by user
3/2/2010 9:04:11 PM rosetta@home Computation for task lrmixclus2_opt_.1bq9.1bq9.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.1.pdb.pdb.JOB_18226_2_0 finished
3/2/2010 9:04:27 PM rosetta@home task lrmixclus2_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.12.3.pdb.pdb.JOB_18257_2_0 aborted by user
3/2/2010 9:04:29 PM rosetta@home Computation for task lrmixclus2_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.12.3.pdb.pdb.JOB_18257_2_0 finished

I'm running 6.10.18
2/28/2010 5:22:23 PM Starting BOINC client version 6.10.18 for windows_x86_64
ID: 65452 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65453 - Posted: 3 Mar 2010, 4:51:44 UTC

Please post these details to the thread for the appropriate version number on the Number Crunching board. mdillenk, please do the same, and post your BOINC and Windows versions.
Rosetta Moderator: Mod.Sense
ID: 65453 · Rating: 0
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65865 - Posted: 28 Apr 2010, 2:49:27 UTC

Well, here it is, almost the end of April, and I am experiencing the exact same problem. My symptoms mirror those of Brock Jones. I am running XP SP3 on a Pentium D with 2 GB of RAM, and I have been plagued by this problem since about two days after I began using BOINC and processing Rosetta tasks. There appears to be nothing in the messages to offer any clues, and it takes a reboot to fix the problem.

Should we have to baby-sit these tasks in order to feel confident that they will complete? Is there a problem with Windows and BOINC?

I left the FAH Project to contribute to Rosetta, but this project appears to be even less stable than the "new" FAH SMP2 client.

Is there any cure for this problem?

deesy
ID: 65865 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65866 - Posted: 28 Apr 2010, 3:33:11 UTC

Which Rosetta version are the problem tasks running? Any pattern in the naming of problem and successful tasks?
Rosetta Moderator: Mod.Sense
ID: 65866 · Rating: 0
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65897 - Posted: 29 Apr 2010, 23:59:10 UTC

[quote]Which Rosetta version are the problem tasks running? Any pattern in the naming of problem and successful tasks?[/quote]


I have started a new thread with additional information, but the Windows Task Manager shows MiniRosetta_2.11_windows_intelx86.exe as the running processes.

No pattern that I could see.

deesy
ID: 65897 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65900 - Posted: 30 Apr 2010, 4:47:45 UTC

Link to the new thread to discuss the specifics of your issue.
Rosetta Moderator: Mod.Sense
ID: 65900 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org