Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
sslickerson Send message Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
I posted minirosetta version 1.29 and rosetta_beta 5.97 on ralph which both include a fix for this bad bug that stalls clients. The problem was a possible infinite loop in the boinc api when an access violation caused by our t405 job was caught after the job completed. Hopefully the tests running on ralph will confirm the fix. Thanks David! I really need to get back to Rosetta. Let's hope this works. Tim |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 2 |
If it was an infinite loop, surely the task would still be using 100% of its core, just doing nothing? It looked more like some asynchronous call had been fired off and the task was sitting idle waiting for a completion status that never appeared. Also, mine were sticking part way through the job, not at the end. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
Here is David Anderson's take on the issue: "Our guess is that something called from exit() (either an atexit() function or something internal to the C library) was causing a signal, and the signal handler (boinc_catch_signal()) called exit() which made the same thing happen, infinitely. I changed the signal handler to call _exit() instead of exit()." That change prevented the hanging in our local tests. I don't know what was happening with your particular job that was hanging mid run. Are you absolutely sure it was hung up? |
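The mechanism David Anderson describes can be sketched in a few lines of POSIX C. This is only an illustration under stated assumptions, not the BOINC source: `crashing_cleanup`, `catch_signal`, and `run_child_with_safe_handler` are hypothetical names, and the cleanup routine raises `SIGABRT` on purpose to stand in for whatever faulted during `exit()`. The handler calls `_exit()`, which ends the process without running `atexit()` functions, so the faulting cleanup cannot be entered a second time.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical cleanup routine standing in for the code that faulted
 * during exit(); here it deliberately raises SIGABRT. */
static void crashing_cleanup(void) {
    raise(SIGABRT);
}

/* Hypothetical handler (not the real boinc_catch_signal()).
 * Calling exit() here would re-enter exit processing while it is
 * already running; _exit() terminates the process immediately and
 * does not run atexit() functions. */
static void catch_signal(int signum) {
    (void)signum;
    _exit(3);
}

/* Runs the scenario in a child process and returns the child's exit
 * status, or -1 if the child could not be created or did not exit
 * normally. */
int run_child_with_safe_handler(void) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        signal(SIGABRT, catch_signal);
        atexit(crashing_cleanup);
        exit(0); /* exit() -> crashing_cleanup() -> SIGABRT -> catch_signal() */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_child_with_safe_handler()` from a small `main()` should return the handler's exit code rather than hang. Running the scenario in a forked child keeps the caller alive, so the outcome can be inspected from the parent.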
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0 |
24 hours of crunching time lost to a validate error: t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_36095_0 I hate it when that happens... Rosie, Rosie, she's our gal, If she can't do it, no one shall! |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 2 |
Are you absolutely sure it was hung up? I am absolutely 100% certain they, (note plural), were stuck! I am not stupid, if I see my quad CPU temperature bobbling around 35C I KNOW something is odd. When I look at BM and see that I have 4 tasks "Running", and yet, see from Process Manager that I am running 25% or 50% Windows Idle Process, (depended on which machine I was looking at), then something is screwed. So I suspend each project in turn until I see which is wasting the time, and was surprised to see it was Rosetta. As soon as I suspended Rosetta, the other projects' tasks started and filled the machines to 100%. Release the suspended tasks and they start again, they sit there with wall time and completion % fixed. If you read the thread, you will find others with similar stories. Suggesting we are "mistaken" is sticking your head in the sand, there is an issue here. It also irritates. I don't suspend Rosetta lightly, but it was clear to me that there was a problem. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
sslickerson Send message Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
Are you absolutely sure it was hung up? Yes, before I suspended Rosetta, none of my tasks of version 5.96 went to completion at 100%. All were hung prior to 100%. Task manager in windows showed exactly 50% system idle process. One core was always idle. |
Jipsu Send message Joined: 27 Jan 08 Posts: 10 Credit: 454,555 RAC: 0 |
This one was stuck on my Gentoo Linux-2.6.24-gentoo-r7, it finished after 2 restarts, but was marked invalid. It also has an interesting stderr. <core_client_version>5.10.45</core_client_version> <![CDATA[ <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 86400 # random seed: 3053615 ====================================================== DONE :: 1 starting structures 85344.8 cpu seconds This process generated 22 decoys from 22 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish *** glibc detected *** free(): invalid next size (normal): 0x0959cfb8 *** SIGABRT: abort called Stack trace (18 frames): [0x8e1b49b] [0x8e15d8c] [0xb7f97420] [0x8e870f4] [0x8e9c05f] [0x8ea10c5] [0x8ea13a3] [0x8e71d51] [0x8e73779] [0x87cb085] [0x8e8763f] [0x8e179ac] [0x8e17ab7] [0x8628fd6] [0x8768a2a] [0x8768b4a] [0x8e80034] [0x8048111] Exiting... Graphics are disabled due to configuration... # cpu_run_time_pref: 86400 ====================================================== DONE :: 1 starting structures 86247.3 cpu seconds This process generated 23 decoys from 23 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish *** glibc detected *** free(): invalid next size (normal): 0x0959d070 *** SIGABRT: abort called Stack trace (18 frames): [0x8e1b49b] [0x8e15d8c] [0xb7fa7420] [0x8e870f4] [0x8e9c05f] [0x8ea10c5] [0x8ea13a3] [0x8e71d51] [0x8e73779] [0x87cb085] [0x8e8763f] [0x8e179ac] [0x8e17ab7] [0x8628fd6] [0x8768a2a] [0x8768b4a] [0x8e80034] [0x8048111] Exiting... Graphics are disabled due to configuration... # cpu_run_time_pref: 86400 WARNING! attempt to gzip file ./xxd010.out failed: file does not exist. 
====================================================== DONE :: 1 starting structures 83399.6 cpu seconds This process generated 23 decoys from 23 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> https://boinc.bakerlab.org/result.php?resultid=172011426 |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
Are you absolutely sure it was hung up? I was just trying to double check. I hope this is a related problem that will be fixed with the api change. Bottom line is that the t405 task has uncovered a bug in rosetta++ that has to be fixed for that particular protocol which is important for some casp targets. Sorry, I wasn't assuming you were wrong in your diagnosis, just double checking to make sure. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 4,566 |
does anyone know if these tasks will be aborted after 4 restarts? I've got quite a few remotes that I don't have much contact with and assume at least some of these will be hit by the bug... |
Aaronb Send message Joined: 6 May 06 Posts: 1 Credit: 20,022 RAC: 0 |
Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle... I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64 bit on an Intel Q6600). |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274 please post for the team, why you aborted this task. this will help them solve whatever problem you and others might be having. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
does anyone know if these tasks will be aborted after 4 restarts? I've got quite a few remotes that I don't have much contact with and assume at least some of these will be hit by the bug... Actually I believe it would be 5 restarts. So when any specific task has run long enough that it records its initial information, and hasn't made any progress (i.e. saved a checkpoint) since the last restart, if this occurs 5 times, the task will be ended. Rosetta Moderator: Mod.Sense |
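The restart rule described above can be sketched as a small counter in C. This is a guess at the logic, not the actual Rosetta or BOINC code: `task_state`, `on_checkpoint`, and `on_restart` are hypothetical names, the limit of 5 comes from the moderator's "I believe", and the sketch assumes that stalled restarts accumulate and are never reset.

```c
#include <stdbool.h>

/* Assumed limit, taken from the moderator's "I believe it would be 5". */
#define MAX_STALLED_RESTARTS 5

/* Hypothetical per-task bookkeeping; not a real BOINC structure. */
struct task_state {
    int stalled_restarts; /* restarts that followed a run with no checkpoint */
    bool checkpointed;    /* true once a checkpoint is saved in the current run */
};

/* Called whenever the task saves a checkpoint. */
void on_checkpoint(struct task_state *t) {
    t->checkpointed = true;
}

/* Called each time the task is restarted (not on its first launch).
 * Returns true when the task should be ended. */
bool on_restart(struct task_state *t) {
    if (!t->checkpointed) {
        t->stalled_restarts++; /* no progress since the last restart */
    }
    t->checkpointed = false;   /* start tracking the new run */
    return t->stalled_restarts >= MAX_STALLED_RESTARTS;
}
```

Under these assumptions a task that never checkpoints is ended on its fifth restart, while a task that saves at least one checkpoint per run is never counted as stalled.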
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274 no problem i know them ...they live close by and i talked to them |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274 |
Stephen Send message Joined: 5 Jun 06 Posts: 23 Credit: 2,570,438 RAC: 0 |
I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64 bit on an Intel Q6600). Just a "me too", using 32 bit Ubuntu 7.10 and a dual core Athlon. I haven't checked today to see if crunching had resumed. It would be nice to get the answers to these questions: 1. Is there a known issue with crunching stopping after WUs reach 100%? 2. Are we supposed to "wait it out" or take some action? Stephen |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274 that has to be nice..lol |
Stephen Send message Joined: 5 Jun 06 Posts: 23 Credit: 2,570,438 RAC: 0 |
I've noticed some "watchdog" processes in ubuntu that have RT priority. Could these be pushing boinc into the background? |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 |
I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64 bit on an Intel Q6600). There is a known problem, most prevalent with t405 tasks (which have since been canceled), that can cause the client to stall when the task is complete. If you have a task at 100% and your CPU(s) are idle, please click the update button for the project in the BOINC manager while connected online, and if the task persists, abort it. We are testing a BOINC API fix for this on ralph. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 2 |
It is a shame that the t405 cancellation wasn't mentioned when it was done, I could have resumed Rosetta then instead of now! The news flow is certainly a bit stagnant, I appreciate CASP is a busy time, but a couple of lines on the news column on the front page would not take long. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
©2024 University of Washington
https://www.bakerlab.org