Posts by Scott A. Howard*

1) Message boards : Number crunching : Problems with Minirosetta v1.54 (Message 59138)
Posted 29 Jan 2009 by Scott A. Howard*
Post:
Here's a follow-up.

I did the following:
1) detached from the project
2) removed the Rosetta project folder from under BOINC
3) removed all files from any slot that contained Rosetta data
4) reattached to the project
5) allowed 50% of the CPUs to be used (4 in this case)
6) allowed the four tasks to run - each expected to take about 4 hours

Observed results: the status of each task is "Running, high priority", each has used about 20 minutes of CPU time, and progress is 0.000%.

Setting the activity back to "Run based on preferences" stops each task from using CPU time, but the tasks are not removed from memory.

It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles.

I see that the thread consuming the CPU has a fairly regular call stack. Here it is. If you have the debug symbols for your build, you should be able to locate the routine and line at which the program is hung:

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
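Each minirosetta frame above is an offset relative to the executable's load address, so turning one into a routine name means finding the symbol with the largest start offset that does not exceed it. The following is a minimal sketch of that lookup. It assumes the symbols have already been extracted from the build's debug information into a sorted list of (offset, name) pairs. The table and function names below are hypothetical placeholders, not taken from the real minirosetta 1.54 binary.

```python
import bisect

# HYPOTHETICAL symbol table: (module-relative start offset, function name),
# sorted by offset. Placeholders only; real names and offsets would come
# from the debug symbols of the exact minirosetta build.
SYMBOLS = [
    (0x000010A0, "hypothetical_startup"),
    (0x000017C0, "hypothetical_worker_loop"),
    (0x0001AF00, "hypothetical_scheduler"),
    (0x00091A00, "hypothetical_acquire_lock"),
    (0x00092800, "hypothetical_run_protocol"),
]
_STARTS = [start for start, _ in SYMBOLS]


def parse_frame(frame):
    """Split 'module.exe+0x91a63' into ('module.exe', 0x91a63)."""
    module, _, offset = frame.rpartition("+")
    return module, int(offset, 16)


def resolve(offset):
    """Return 'name+0xdelta' for the symbol containing offset, or None.

    Picks the last symbol whose start is <= offset. This table has no
    function sizes, so an offset past the end of the last function is
    still attributed to that function.
    """
    i = bisect.bisect_right(_STARTS, offset) - 1
    if i < 0:
        return None  # offset lies before the first known symbol
    start, name = SYMBOLS[i]
    return f"{name}+{offset - start:#x}"


if __name__ == "__main__":
    for frame in [
        "minirosetta_1.54_windows_intelx86.exe+0x91a63",
        "minirosetta_1.54_windows_intelx86.exe+0x17d3",
    ]:
        module, offset = parse_frame(frame)
        print(frame, "->", resolve(offset))
```

A debugger loaded with the matching symbols does this lookup automatically. One caveat applies to frames below the top one: they are return addresses, which point just past the call instruction, so a debugger usually looks up the address minus one to land inside the calling line.
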
2) Message boards : Number crunching : Problems with Minirosetta v1.54 (Message 59135)
Posted 29 Jan 2009 by Scott A. Howard*
Post:
Hello,

Here's the problem in a nutshell.

On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (8 cores total) running XP Pro SP3, almost every Rosetta job (minirosetta version 1.54) fails. The typical failure mode is that the jobs exceed their CPU time allocation. For example, if a job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress; other times they are stuck at zero.

Also, the exe is not removed from memory when the computer is in use.

I have reset the project and detached and attached again but it continues to happen.

Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines.

Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration, or maybe a coding error that shows up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 to 8.)

I have a full development environment and debuggers if you want some traces.

Scott Howard


Addendum: Now that I've thought about it a little more, does the app use any global resource locking, e.g. mutexes, semaphores, or file access? Maybe that's why the progress is halted: it's deadlocked. I'm not sure why the task would continue to use CPU time, though. Just some random thoughts...
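A stuck program can still burn CPU if it waits by spinning, meaning it polls a condition in a loop instead of blocking in the kernel. Nothing in these posts shows whether minirosetta waits this way, so treat that as an assumption. Below is a generic Python sketch contrasting the two styles. A threading.Event that is never set stands in for a lock whose holder never releases it, and the function names are illustrative only.

```python
import threading
import time


def spin_wait(flag, timeout):
    """Busy-wait: repeatedly poll the flag, consuming CPU the whole time."""
    deadline = time.monotonic() + timeout
    while not flag.is_set():
        if time.monotonic() >= deadline:
            return False  # gave up; the "lock" was never released
    return True


def blocking_wait(flag, timeout):
    """Blocking wait: the thread sleeps in the OS until set or timed out."""
    return flag.wait(timeout)


def cpu_seconds(fn, *args):
    """CPU time (not wall time) this process spends running fn(*args)."""
    start = time.process_time()
    fn(*args)
    return time.process_time() - start


if __name__ == "__main__":
    never_set = threading.Event()  # stands in for a lock that is never released
    print("spin CPU s: ", cpu_seconds(spin_wait, never_set, 0.5))
    print("block CPU s:", cpu_seconds(blocking_wait, never_set, 0.5))
```

Both waits make no progress for the same wall-clock time. Only the spinning one shows up as CPU usage, which is the pattern of a task accumulating CPU time while its progress stays at 0.000%.
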

©2024 University of Washington
https://www.bakerlab.org