Problem with task "exited with zero status but no 'finished' file" error

Message boards : Number crunching : Problem with task "exited with zero status but no 'finished' file" error

sirlampsalot

Joined: 18 Feb 11
Posts: 1
Credit: 336,823
RAC: 0
Message 78008 - Posted: 8 Mar 2015, 14:55:42 UTC
Last modified: 8 Mar 2015, 15:13:28 UTC

Hi - I am a very occasional Rosetta cruncher. This may have happened in the past, but today was the first time I noticed this error, which is associated with every work unit:

"3/8/2015 8:40:28 AM | rosetta@home | Task rb_03_07_54085_99588_ab_stage0_t000___robetta_cstwt_3.0_IGNORE_THE_REST_03_09_246104_1456_0 exited with zero status but no 'finished' file"

I have two computers that generate this error (different hardware and OS, but both running BOINC 7.4.36). The log suggested resetting the project, which I did on both PCs. The error promptly returned after the reset.

Is this an error message I can ignore? Or how can I fix it?

I found this thread https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6586 which did not show up when I did a search for "zero status but no 'finished' file".
My CPU time is set to default.

Many thanks

Timo

Joined: 9 Jan 12
Posts: 185
Credit: 45,644,168
RAC: 214
Message 78009 - Posted: 8 Mar 2015, 18:48:55 UTC

Do you have a link to the workunit? I checked the workunits on your machines and most of them seem to be completing without issue - which ones threw these errors?
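One way to find out which tasks threw the error is to copy the BOINC event log into text and scan it. The following is a minimal Python sketch, not an official BOINC tool. The pattern is based only on the single log line quoted above, so other client versions or locales may need a different expression; the function name `tasks_without_finished_file` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this thread:
#   <timestamp> | <project> | Task <name> exited with zero status but no 'finished' file
# This is an assumption about the format, not a documented BOINC guarantee.
NO_FINISHED = re.compile(
    r"^(?P<when>[^|]+?)\s*\|\s*(?P<project>[^|]+?)\s*\|\s*"
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def tasks_without_finished_file(log_lines):
    """Count how often each task exited without a 'finished' file."""
    counts = Counter()
    for line in log_lines:
        # Drop surrounding whitespace and the quotes used when pasting into a post.
        match = NO_FINISHED.match(line.strip().strip('"'))
        if match:
            counts[match.group("task")] += 1
    return counts


if __name__ == "__main__":
    # The log line below is the one quoted in the first post.
    pasted = [
        "3/8/2015 8:40:28 AM | rosetta@home | Task "
        "rb_03_07_54085_99588_ab_stage0_t000___robetta_cstwt_3.0_"
        "IGNORE_THE_REST_03_09_246104_1456_0 "
        "exited with zero status but no 'finished' file",
    ]
    for task, n in tasks_without_finished_file(pasted).items():
        print(n, task)
```

A task that appears many times in the result is probably stuck in a restart loop, while a task that appears once may simply have been interrupted.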

Erik

Joined: 25 Jun 09
Posts: 11
Credit: 2,904,454
RAC: 0
Message 78017 - Posted: 11 Mar 2015, 23:39:14 UTC

Allowing BOINC to use 100% CPU time got rid of that error for me, possibly in conjunction with a 12-hour run time. In practice, the CPU usage tends to run between 75 and 90%. I also found that rebooting the machine will cause an in-progress work unit to fail.

Ananas

Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 78133 - Posted: 17 Apr 2015, 20:44:56 UTC
Last modified: 17 Apr 2015, 21:13:30 UTC

I have the same problem on one computer. From what I can see, it is caused by the heavy HDD activity when unpacking and starting each Rosetta WU, combined with the outdated API version, which still has the original and much too low heartbeat tolerance setting.

If your HDD is not very fast, this disk activity slows down the BOINC core client. The same Rosetta workunit that caused the slowdown then doesn't receive the heartbeat in time and restarts itself before it has managed to unzip the result. It also kills other Rosetta WUs in the process, because they run into the same heartbeat error. This goes on for quite a while until the core client gives up on restarting the workunit.
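To illustrate the mechanism, here is a toy model in Python. It is not BOINC code: the function name, the one-second heartbeat interval, the 35-second disk stall, and the 30 s and 60 s tolerances are all assumptions picked for illustration, not values taken from any BOINC API version.

```python
def first_missed_heartbeat(heartbeat_times, timeout, run_until):
    """Return the time at which an app gives up waiting, or None.

    heartbeat_times: sorted times (in seconds) at which the client
        managed to write a heartbeat.
    timeout: how long the app tolerates silence before giving up.
    run_until: time at which the simulation ends.
    """
    last = 0
    for t in heartbeat_times:
        if t - last > timeout:
            # The gap since the previous heartbeat exceeded the tolerance.
            return last + timeout
        last = t
    if run_until - last > timeout:
        return last + timeout
    return None


# Hypothetical scenario: the client writes a heartbeat every second, but
# is stalled by disk activity between t=10 and t=45 (unpacking a WU).
beats = [t for t in range(0, 61) if not (10 < t < 45)]

low_tolerance = first_missed_heartbeat(beats, timeout=30, run_until=60)
high_tolerance = first_missed_heartbeat(beats, timeout=60, run_until=60)

print("gives up at:", low_tolerance)   # app with the low tolerance
print("gives up at:", high_tolerance)  # app with the higher tolerance
```

In this model every app watching the same client sees the same gap, which matches the observation that one slow unpack takes down the other Rosetta WUs as well. An app with a larger tolerance rides out the same stall.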

The only other project that seems to use such an old API is Leiden Classical; its WUs also fall victim to the heartbeat bug now and then. Other projects are more robust when it comes to that problem.

AFAIK, the heartbeat timeout has been increased considerably in later API versions.

On the box where I have that problem, I can start one Rosetta WU; it usually manages to start, sometimes after one heartbeat crash. Trying to start a second one kills both, and neither of the two ever recovers from that restart/crash loop.

P.S.: Another way to fix it might be to split the compressed database into smaller parts, in the hope that the core client can use the breaks between the unzips to refresh the heartbeat in shared memory. Alternatively, the .gz files could be excluded from the .zip file and delivered separately (only the needed ones!), as packing them like that is very stupid anyway.

@Erik: I don't think that it is entirely fixed on your system; the WUs still show "unpacking ..." way too often. In a clean run, it would unpack the database and unpack the workunit file: two unpack commands per workunit and that's it.

©2024 University of Washington
https://www.bakerlab.org