Problems with version 5.96

sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53889 - Posted: 21 Jun 2008, 4:23:21 UTC - in response to Message 53887.  

I posted minirosetta version 1.29 and rosetta_beta 5.97 on ralph which both include a fix for this bad bug that stalls clients. The problem was a possible infinite loop in the boinc api when an access violation caused by our t405 job was caught after the job completed. Hopefully the tests running on ralph will confirm the fix.


Thanks David! I really need to get back to Rosetta. Let's hope this works.

Tim



adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 53892 - Posted: 21 Jun 2008, 9:10:36 UTC
Last modified: 21 Jun 2008, 9:11:43 UTC

If it was an infinite loop, surely the task would still be using 100% of its core, just doing nothing? It looked more like some async call had been fired off and the task was sitting idle, waiting for a completion status that never appeared.

Also, mine were sticking part way through the job, not at the end.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53897 - Posted: 21 Jun 2008, 15:41:47 UTC

Here is David Anderson's take on the issue:

"Our guess is that something called from exit()
(either an atexit() function or something internal to the C library)
was causing a signal,
and the signal handler (boinc_catch_signal()) called exit()
which made the same thing happen, infinitely.
I changed the signal handler to call _exit() instead of exit()."


That change prevented the hanging in our local tests. I don't know what was happening with your particular job that was hanging mid run. Are you absolutely sure it was hung up?
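
The re-entry described in the quote above can be sketched as a toy model. This is only an illustration in Python, not BOINC's code: `model_exit`, `model__exit`, `faulty_cleanup`, and both handler functions are made-up names standing in for `exit()`, `_exit()`, a faulting cleanup routine, and the old and new behaviour of the signal handler. The guard `MAX_EXIT_CALLS` exists only so the buggy variant stops in the demo.

```python
class ProcessEnded(Exception):
    """Raised when the modelled process really terminates."""


class CleanupFault(Exception):
    """Stands in for the signal raised while exit() runs cleanup code."""


MAX_EXIT_CALLS = 25  # demo guard; a real process would just keep going


def faulty_cleanup():
    # Hypothetical atexit()-style function that faults every time it runs.
    raise CleanupFault()


def model_exit(signal_handler, stats):
    """Model of exit(): run cleanup first, then terminate."""
    stats["exit_calls"] += 1
    if stats["exit_calls"] >= MAX_EXIT_CALLS:
        raise RuntimeError("exit() re-entered repeatedly; giving up")
    try:
        faulty_cleanup()
    except CleanupFault:
        signal_handler(stats)  # the "signal" is delivered to the handler
    raise ProcessEnded()


def model__exit(stats):
    """Model of _exit(): terminate immediately, skipping cleanup."""
    raise ProcessEnded()


def handler_calling_exit(stats):
    # Old behaviour in the quote: the handler calls exit() again.
    model_exit(handler_calling_exit, stats)


def handler_calling__exit(stats):
    # Changed behaviour in the quote: the handler calls _exit().
    model__exit(stats)


def simulate(signal_handler):
    stats = {"exit_calls": 0}
    try:
        model_exit(signal_handler, stats)
    except ProcessEnded:
        return ("ended", stats["exit_calls"])
    except RuntimeError:
        return ("looping", stats["exit_calls"])


print(simulate(handler_calling_exit))
print(simulate(handler_calling__exit))
```

In this model the first variant never reaches termination on its own, while the second ends after a single call to the modelled `exit()`. It says nothing about tasks that hang mid-run.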

David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53901 - Posted: 21 Jun 2008, 18:14:24 UTC
Last modified: 21 Jun 2008, 18:15:38 UTC

24 hours of crunching time lost to a validate error:

t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_36095_0

I hate it when that happens...
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 53904 - Posted: 21 Jun 2008, 19:46:49 UTC
Last modified: 21 Jun 2008, 19:51:38 UTC

Are you absolutely sure it was hung up?

I am absolutely 100% certain they (note the plural) were stuck! I am not stupid: if I see my quad CPU temperature bobbling around 35C, I KNOW something is odd. When I look at BOINC Manager and see that I have 4 tasks "Running", and yet see from Process Manager that I am running 25% or 50% Windows Idle Process (depending on which machine I was looking at), then something is screwed.

So I suspended each project in turn until I saw which was wasting the time, and was surprised to see it was Rosetta. As soon as I suspended Rosetta, the other projects' tasks started and filled the machines to 100%. When I released the suspended tasks they started again, then sat there with wall time and completion % fixed. If you read the thread, you will find others with similar stories.
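
The check described above (tasks listed as "Running" while their time and completion % stay fixed) can be sketched as a comparison of two snapshots. This is only an illustration: `TaskSample`, `find_stalled`, the field names, and the sample values are all made up, and nothing here reads real data from the BOINC client.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    name: str
    state: str              # state shown by the manager, e.g. "Running"
    elapsed_seconds: float  # time counter shown for the task
    percent_done: float     # completion % shown for the task


def find_stalled(before, after):
    """Names of tasks 'Running' in both samples with time and progress unchanged."""
    earlier = {t.name: t for t in before}
    stalled = []
    for now in after:
        prev = earlier.get(now.name)
        if prev is None:
            continue  # task was not present in the first sample
        both_running = prev.state == "Running" and now.state == "Running"
        frozen = (now.elapsed_seconds == prev.elapsed_seconds
                  and now.percent_done == prev.percent_done)
        if both_running and frozen:
            stalled.append(now.name)
    return stalled


# Made-up samples taken a few minutes apart.
first = [TaskSample("task_a", "Running", 3600.0, 41.2),
         TaskSample("task_b", "Running", 7200.0, 63.0)]
second = [TaskSample("task_a", "Running", 3900.0, 44.0),
          TaskSample("task_b", "Running", 7200.0, 63.0)]
print(find_stalled(first, second))
```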

Suggesting we are "mistaken" is sticking your head in the sand; there is an issue here. It also irritates. I don't suspend Rosetta lightly, but it was clear to me that there was a problem.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 53905 - Posted: 21 Jun 2008, 20:52:25 UTC - in response to Message 53897.  

Are you absolutely sure it was hung up?


Yes. Before I suspended Rosetta, none of my version 5.96 tasks ran to completion; all hung before reaching 100%. Task Manager in Windows showed exactly 50% System Idle Process, so one core was always idle.



Jipsu

Joined: 27 Jan 08
Posts: 10
Credit: 454,555
RAC: 0
Message 53906 - Posted: 22 Jun 2008, 1:12:23 UTC

This one was stuck on my Gentoo Linux machine (2.6.24-gentoo-r7). It finished after 2 restarts, but was marked invalid. It also has an interesting stderr.

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 86400
# random seed: 3053615
======================================================
DONE :: 1 starting structures 85344.8 cpu seconds
This process generated 22 decoys from 22 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
*** glibc detected *** free(): invalid next size (normal): 0x0959cfb8 ***
SIGABRT: abort called
Stack trace (18 frames):
[0x8e1b49b]
[0x8e15d8c]
[0xb7f97420]
[0x8e870f4]
[0x8e9c05f]
[0x8ea10c5]
[0x8ea13a3]
[0x8e71d51]
[0x8e73779]
[0x87cb085]
[0x8e8763f]
[0x8e179ac]
[0x8e17ab7]
[0x8628fd6]
[0x8768a2a]
[0x8768b4a]
[0x8e80034]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 86400
======================================================
DONE :: 1 starting structures 86247.3 cpu seconds
This process generated 23 decoys from 23 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
*** glibc detected *** free(): invalid next size (normal): 0x0959d070 ***
SIGABRT: abort called
Stack trace (18 frames):
[0x8e1b49b]
[0x8e15d8c]
[0xb7fa7420]
[0x8e870f4]
[0x8e9c05f]
[0x8ea10c5]
[0x8ea13a3]
[0x8e71d51]
[0x8e73779]
[0x87cb085]
[0x8e8763f]
[0x8e179ac]
[0x8e17ab7]
[0x8628fd6]
[0x8768a2a]
[0x8768b4a]
[0x8e80034]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 86400
WARNING! attempt to gzip file ./xxd010.out failed: file does not exist.
======================================================
DONE :: 1 starting structures 83399.6 cpu seconds
This process generated 23 decoys from 23 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>
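
For anyone comparing several results like this one, the per-run figures can be pulled out of the stderr text with a small script. This is a rough sketch only: the regular expressions are written against the lines shown above and are assumptions about the log format, and `summarize` is a made-up helper name. The excerpt is copied from the output above with the stack traces left out.

```python
import re

# Patterns based on the lines in the stderr above (assumed format).
RUN_PATTERN = re.compile(
    r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated (\d+) decoys from (\d+) attempts"
)
ABORT_PATTERN = re.compile(r"\*\*\* glibc detected \*\*\* (.+?) \*\*\*")

EXCERPT = """\
DONE :: 1 starting structures 85344.8 cpu seconds
This process generated 22 decoys from 22 attempts
*** glibc detected *** free(): invalid next size (normal): 0x0959cfb8 ***
DONE :: 1 starting structures 86247.3 cpu seconds
This process generated 23 decoys from 23 attempts
*** glibc detected *** free(): invalid next size (normal): 0x0959d070 ***
DONE :: 1 starting structures 83399.6 cpu seconds
This process generated 23 decoys from 23 attempts
"""


def summarize(stderr_text):
    """Return (cpu_seconds, decoys, attempts) per run plus glibc abort messages."""
    runs = [(float(cpu), int(decoys), int(attempts))
            for _, cpu, decoys, attempts in RUN_PATTERN.findall(stderr_text)]
    aborts = ABORT_PATTERN.findall(stderr_text)
    return {"runs": runs, "aborts": aborts}


summary = summarize(EXCERPT)
for cpu, decoys, attempts in summary["runs"]:
    print(f"{cpu:.1f} cpu seconds, {decoys} decoys from {attempts} attempts")
print(f"{len(summary['aborts'])} glibc aborts")
```

On this excerpt the script finds three runs and two glibc aborts, which matches the two restarts mentioned in the post.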

https://boinc.bakerlab.org/result.php?resultid=172011426
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53907 - Posted: 22 Jun 2008, 6:14:42 UTC - in response to Message 53904.  

Are you absolutely sure it was hung up?

I am absolutely 100% certain they (note the plural) were stuck! I am not stupid: if I see my quad CPU temperature bobbling around 35C, I KNOW something is odd. When I look at BOINC Manager and see that I have 4 tasks "Running", and yet see from Process Manager that I am running 25% or 50% Windows Idle Process (depending on which machine I was looking at), then something is screwed.

So I suspended each project in turn until I saw which was wasting the time, and was surprised to see it was Rosetta. As soon as I suspended Rosetta, the other projects' tasks started and filled the machines to 100%. When I released the suspended tasks they started again, then sat there with wall time and completion % fixed. If you read the thread, you will find others with similar stories.

Suggesting we are "mistaken" is sticking your head in the sand; there is an issue here. It also irritates. I don't suspend Rosetta lightly, but it was clear to me that there was a problem.


I was just trying to double-check. I hope this is a related problem that will be fixed with the API change. The bottom line is that the t405 task has uncovered a bug in rosetta++ that has to be fixed for that particular protocol, which is important for some CASP targets. Sorry, I wasn't assuming you were wrong in your diagnosis, just double-checking to make sure.
dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 114,371,266
RAC: 53,072
Message 53908 - Posted: 22 Jun 2008, 11:00:05 UTC

Does anyone know if these tasks will be aborted after 4 restarts? I've got quite a few remotes that I don't have much contact with and assume at least some of these will be hit by the bug...
Aaronb

Joined: 6 May 06
Posts: 1
Credit: 20,022
RAC: 0
Message 53914 - Posted: 22 Jun 2008, 21:55:12 UTC - in response to Message 53717.  

Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle...


I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64-bit on an Intel Q6600.)
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53915 - Posted: 22 Jun 2008, 22:09:01 UTC

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 53918 - Posted: 23 Jun 2008, 10:18:52 UTC - in response to Message 53915.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274


Please post for the team why you aborted this task.
This will help them solve whatever problem you and others might be having.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 53929 - Posted: 23 Jun 2008, 15:59:04 UTC - in response to Message 53908.  

Does anyone know if these tasks will be aborted after 4 restarts? I've got quite a few remotes that I don't have much contact with and assume at least some of these will be hit by the bug...


Actually, I believe it would be 5 restarts. Once a task has run long enough to record its initial information, if it is restarted without having made any progress (i.e. saved a checkpoint) since the last restart, and this occurs 5 times, the task will be ended.
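
The rule as I understand it can be sketched like this. It is only an illustration of the description above, not the actual client or application code: `should_end_task` and the event names are made up, the limit of 5 is the figure I believe applies, and resetting the count whenever a checkpoint is saved is an assumption.

```python
RESTART_LIMIT = 5  # figure from the description above; not verified against the code


def should_end_task(events, limit=RESTART_LIMIT):
    """Decide whether a task is ended, given its history of events.

    events: strings in order, either "restart" or "checkpoint" (made-up names).
    A restart counts against the task only if no checkpoint was saved
    since the previous restart (assumption: a checkpoint resets the count).
    """
    restarts_without_progress = 0
    for event in events:
        if event == "checkpoint":
            restarts_without_progress = 0
        elif event == "restart":
            restarts_without_progress += 1
            if restarts_without_progress >= limit:
                return True
    return False


print(should_end_task(["restart"] * 4))
print(should_end_task(["restart"] * 5))
print(should_end_task(["restart"] * 4 + ["checkpoint", "restart"]))
```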
Rosetta Moderator: Mod.Sense
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53930 - Posted: 23 Jun 2008, 16:32:16 UTC - in response to Message 53918.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274


Please post for the team why you aborted this task.
This will help them solve whatever problem you and others might be having.

No problem, I know them... they live close by and I talked to them.
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 53931 - Posted: 23 Jun 2008, 16:32:59 UTC - in response to Message 53930.  
Last modified: 23 Jun 2008, 16:34:51 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274


Please post for the team why you aborted this task.
This will help them solve whatever problem you and others might be having.

No problem, I know 2 of them and live close by. I talked to them. That 1 machine was not mine; I was just giving them a hand.
Stephen

Joined: 5 Jun 06
Posts: 23
Credit: 2,570,438
RAC: 0
Message 53934 - Posted: 23 Jun 2008, 17:31:22 UTC - in response to Message 53914.  

I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64-bit on an Intel Q6600.)

Just a "me too", using 32-bit Ubuntu 7.10 and a dual-core Athlon. I haven't checked today to see if crunching has resumed.

It would be nice to get the answers to these questions:

1. Is there a known issue with crunching stopping after WUs reach 100%?
2. Are we supposed to "wait it out" or take some action?

Stephen
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 53936 - Posted: 23 Jun 2008, 17:49:51 UTC - in response to Message 53930.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=171780274


Please post for the team why you aborted this task.
This will help them solve whatever problem you and others might be having.

No problem, I know them... they live close by and I talked to them.


That has to be nice... lol
Stephen

Joined: 5 Jun 06
Posts: 23
Credit: 2,570,438
RAC: 0
Message 53944 - Posted: 24 Jun 2008, 2:17:10 UTC - in response to Message 53934.  

I've noticed some "watchdog" processes in Ubuntu that have RT priority. Could these be pushing BOINC into the background?
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53945 - Posted: 24 Jun 2008, 3:56:56 UTC - in response to Message 53934.  
Last modified: 24 Jun 2008, 3:59:16 UTC

I'm experiencing the same issue where the tasks will get to 100% then idle. (Ubuntu 8.04 64-bit on an Intel Q6600.)

Just a "me too", using 32-bit Ubuntu 7.10 and a dual-core Athlon. I haven't checked today to see if crunching has resumed.

It would be nice to get the answers to these questions:

1. Is there a known issue with crunching stopping after WUs reach 100%?
2. Are we supposed to "wait it out" or take some action?

Stephen



There is a known problem, most prevalent with t405 tasks (which have since been canceled), that can cause the client to stall when the task is complete. If you have a task at 100% and your CPU(s) are idle, please click the Update button for the project in the BOINC Manager while you are online. If the task persists, abort it. We are testing a BOINC API fix for this on Ralph.
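
For remote or headless machines, the same two steps can probably be done from the command line. The sketch below only builds the command lines and prints them; it does not run anything. It assumes the `boinccmd` tool that ships with BOINC (older clients may name it `boinc_cmd`), and the `--project ... update` and `--task ... abort` syntax should be checked against your client's own help output. The project URL is taken from the result links in this thread and must match the URL your client is attached to; the task name is the one reported earlier in the thread.

```python
import shlex

# Assumption: must match the project URL your client is attached to.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def update_command(project_url=PROJECT_URL):
    """Command line equivalent to clicking Update for the project."""
    return ["boinccmd", "--project", project_url, "update"]


def abort_command(task_name, project_url=PROJECT_URL):
    """Command line to abort one named task of the project."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


# Task name reported earlier in this thread, used only as an example.
task = "t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_36095_0"
for argv in (update_command(), abort_command(task)):
    print(shlex.join(argv))
```

Try the update first and abort only if the task is still sitting at 100% afterwards.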
adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 53948 - Posted: 24 Jun 2008, 5:36:29 UTC
Last modified: 24 Jun 2008, 5:37:36 UTC

It is a shame that the t405 cancellation wasn't mentioned when it was done; I could have resumed Rosetta then instead of now! The news flow is certainly a bit stagnant. I appreciate CASP is a busy time, but a couple of lines in the news column on the front page would not take long.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.



©2024 University of Washington
https://www.bakerlab.org