Very long run time.

Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75696 - Posted: 4 Jun 2013, 14:51:23 UTC
Last modified: 4 Jun 2013, 15:09:19 UTC

I have this wu here at the moment. Its remaining time shows "---", like a finished wu, but the elapsed time is 40:09:09 and increasing. It purports to be 54.248% completed.

This is the second issue I've reported to Rosetta today. I've suspended that wu and set No New Tasks pending replies.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 75696 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 75697 - Posted: 4 Jun 2013, 15:50:37 UTC

Did you happen to notice if the task was actually getting any CPU time? There are cases where between BOINC Manager and the OS, the task does not actually get CPU time. It would also explain why the watch-dog hasn't detected the problem and wrapped up the WU.
Rosetta Moderator: Mod.Sense
ID: 75697 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75699 - Posted: 4 Jun 2013, 19:05:58 UTC
Last modified: 4 Jun 2013, 19:10:00 UTC

I hadn't but will enable it and watch.

<edit>
And I saw that the elapsed time reduced to 03:28:23, the time to completion reverted to a normal looking 03:16:28.
</edit>
ID: 75699 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75701 - Posted: 5 Jun 2013, 2:42:48 UTC

The wu has run to completion and reported a success. That, of course, does not alter the fact that the problem occurred.
ID: 75701 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,456,727
RAC: 11,262
Message 75702 - Posted: 5 Jun 2013, 10:48:22 UTC - in response to Message 75699.  

I hadn't but will enable it and watch.

<edit>
And I saw that the elapsed time reduced to 03:28:23, the time to completion reverted to a normal looking 03:16:28.
</edit>


This means it wasn't actually crunching but the clock was still running; it was a 'hung unit'. This is a LONG STANDING BOINC problem that happens once in a while but never seems to happen when the 'experts' try to replicate it. Kind of like the noise you take the car to the mechanic for that he never hears: it is there, just not when the 'experts' look at it. Doing what you did (suspending the project, waiting a few seconds, and then resuming the project) often 'fixes' the problem.
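
One rough way to spot a unit like this is to compare how much CPU time the task gained against how much wall-clock time passed between two looks at it. The sketch below is only an illustration of that idea. The function name `looks_hung`, the sample format, and the 5% threshold are assumptions made for this example, not anything BOINC or Rosetta provides, and the readings would have to be noted by hand or taken from whatever task monitor you use.

```python
def looks_hung(samples, min_cpu_fraction=0.05):
    """Guess whether a task is hung from (elapsed_seconds, cpu_seconds) readings.

    samples: readings in time order; only the first and last are compared.
    min_cpu_fraction: hypothetical threshold; below this share of CPU time
    per wall-clock second the task is treated as not really crunching.
    """
    (elapsed_start, cpu_start) = samples[0]
    (elapsed_end, cpu_end) = samples[-1]

    wall_delta = elapsed_end - elapsed_start
    if wall_delta <= 0:
        # Not enough information to decide, so do not raise an alarm.
        return False

    cpu_delta = cpu_end - cpu_start
    return (cpu_delta / wall_delta) < min_cpu_fraction


# Ten minutes of elapsed time with one second of CPU time looks hung.
print(looks_hung([(0, 0), (600, 1)]))
# Ten minutes of elapsed time with almost ten minutes of CPU time does not.
print(looks_hung([(0, 0), (600, 590)]))
```

If the check flags a task, the suspend, wait, resume routine described above is the thing to try first.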
ID: 75702 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75753 - Posted: 12 Jun 2013, 19:51:55 UTC

I allowed new tasks and have not seen this again. I am, however, removing Rosetta from unattended systems.
ID: 75753 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,456,727
RAC: 11,262
Message 75757 - Posted: 13 Jun 2013, 11:24:26 UTC - in response to Message 75753.  

I allowed new tasks and have not seen this again. I am, however, removing Rosetta from unattended systems.


That's what a lot of people end up doing.
ID: 75757 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75758 - Posted: 13 Jun 2013, 16:01:15 UTC

This one also twitched the now-sensitized wu radar. It seemed stuck at 97.715%, with the elapsed time going up and the time to completion not moving. I stopped and restarted it, and it seemed to start progressing again. I'll let it and the other wu I've got (unstarted) run, then I think I'll take a "vacation" from Rosetta.

The completed wu's disappear from the list too fast to do any comparison (I'm thinking wu type etc.).

I have always regarded Rosetta as a steady safe project, but it is bordering on the dubious area right now.
ID: 75758 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75759 - Posted: 13 Jun 2013, 18:08:57 UTC
Last modified: 13 Jun 2013, 18:13:05 UTC

That one finished fine after restarting it. Grossly low credit though, 31.90 for 36,259.55 seconds. There IS something wrong here.
ID: 75759 · Rating: 0
Profile declis

Joined: 13 Jul 06
Posts: 1
Credit: 123,727
RAC: 0
Message 75762 - Posted: 15 Jun 2013, 17:16:03 UTC

same problem here,
18,388.82 secs and only 20.00 granted credits..
WU

ID: 75762 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,456,727
RAC: 11,262
Message 75777 - Posted: 19 Jun 2013, 15:12:56 UTC
Last modified: 19 Jun 2013, 15:13:12 UTC

The cryo units are screwing me again too:
25,773.17 115.91 20.00
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=533977210
and:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=533939813
25,782.22 115.96 25.64
PATHETIC!!

As well as the rb units:
25,782.21 115.96 33.99
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=533965328

I am normally getting around something like this:
9,695.84 43.61 53.33
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=533966921

BUT these darn rb and cryo units are STILL BAD, BAD, BAD!!!
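
To put the numbers above on a common footing, it helps to convert each result to granted credit per hour of run time. The snippet below is a minimal sketch of that arithmetic. It assumes the first figure on each line is the run time in seconds and the last is the granted credit (the middle figure is not used), and the helper name `credits_per_hour` is made up for this example.

```python
def credits_per_hour(run_seconds, granted_credit):
    """Convert a result's run time (seconds) and granted credit to credit/hour."""
    if run_seconds <= 0:
        raise ValueError("run time must be positive")
    return granted_credit * 3600.0 / run_seconds


# (label, run time in seconds, granted credit), copied from the lines above.
# The column meanings are an assumption, as noted in the text.
results = [
    ("cryo", 25773.17, 20.00),
    ("cryo", 25782.22, 25.64),
    ("rb", 25782.21, 33.99),
    ("typical", 9695.84, 53.33),
]

for label, seconds, granted in results:
    rate = credits_per_hour(seconds, granted)
    print(f"{label:8s} {rate:6.2f} credits/hour")
```

On these figures the cryo and rb results come out at roughly 3 to 5 credits per hour, against about 20 for the typical one.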
ID: 75777 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 75779 - Posted: 19 Jun 2013, 20:30:17 UTC

Part (most?) of this is working out the kinks and helping develop new techniques by volunteering our PCs' resources. They have made breakthroughs as evidenced in the other subforum and the twitter feed, etc. If you think you're getting "screwed" and that it's "pathetic", then it's probably time to move on, bro. Life is far too short to get your blood pressure up and/or be constantly worried about a background process that others use (to great benefit, granted) to crunch numbers in. Make no mistake: it's helping, whether the units fail or not. From the failures come learning.
ID: 75779 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,456,727
RAC: 11,262
Message 75780 - Posted: 19 Jun 2013, 22:05:08 UTC - in response to Message 75779.  

Part (most?) of this is working out the kinks and helping develop new techniques by volunteering our PCs' resources. They have made breakthroughs as evidenced in the other subforum and the twitter feed, etc. If you think you're getting "screwed" and that it's "pathetic", then it's probably time to move on, bro. Life is far too short to get your blood pressure up and/or be constantly worried about a background process that others use (to great benefit, granted) to crunch numbers in. Make no mistake: it's helping, whether the units fail or not. From the failures come learning.


Come 3.5 million and I will be gone. I have a goal that I am trying to meet, and until then I am here, come good news or bad.
ID: 75780 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 75806 - Posted: 25 Jun 2013, 10:34:08 UTC
Last modified: 25 Jun 2013, 10:37:09 UTC

>>> If you think you're getting "screwed" and that it's "pathetic", then it's probably time to move on, bro

The fact remains, however, that a project that is causing "issues" for the cruncher pool WILL cause people to move on. There are a lot of good projects out there now, and if DB wants to, at least, stay where he is in terms of numbers of participants, something needs to be done.
ID: 75806 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org