Please ABORT these 4 stuck workunits

Message boards : Number crunching : Please ABORT these 4 stuck workunits

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13576 - Posted: 12 Apr 2006, 20:41:38 UTC

Hi all:

We are getting reports back that the following four workunits are consistently getting stuck and hanging clients. The problem is currently under investigation. But for now please abort these work units so that your client can do something useful:

TRUNCATE_TERMINI_FULLRELAX_1enh__433
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433
TRUNCATE_TERMINI_FULLRELAX_1ptq__433
TRUNCATE_TERMINI_FULLRELAX_2tif__433

Thanks!
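For anyone with a long queue, checking task names against the four workunits above can be scripted. The following is only a rough sketch, not a project tool: the helper name `find_stuck_tasks` and the second sample task name are made up for illustration, and it assumes that task names begin with the workunit name followed by a suffix (as in `TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_131`, reported later in this thread). It only lists candidates; aborting is still done by hand in the BOINC client.

```python
# Hypothetical helper: flag queued tasks that belong to the four stuck workunits.
STUCK_WORKUNITS = (
    "TRUNCATE_TERMINI_FULLRELAX_1enh__433",
    "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433",
    "TRUNCATE_TERMINI_FULLRELAX_1ptq__433",
    "TRUNCATE_TERMINI_FULLRELAX_2tif__433",
)


def find_stuck_tasks(task_names):
    """Return the task names that start with one of the stuck workunit names."""
    # str.startswith accepts a tuple of prefixes.
    return [name for name in task_names if name.startswith(STUCK_WORKUNITS)]


if __name__ == "__main__":
    # Sample queue; the second name is invented for the example.
    queued = [
        "TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_131",
        "SOME_OTHER_WORKUNIT_500_12",
    ]
    for name in find_stuck_tasks(queued):
        print("abort candidate:", name)
```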



[DPC]Charley
Joined: 18 Mar 06
Posts: 9
Credit: 295,915
RAC: 0
Message 13600 - Posted: 12 Apr 2006, 23:32:28 UTC

Great, thanks for the warning. I just aborted one on this box that seemed to be getting stuck :)

crazyk4952
Joined: 23 Oct 05
Posts: 1
Credit: 541,814
RAC: 0
Message 13842 - Posted: 15 Apr 2006, 17:18:18 UTC - in response to Message 13576.  

I was wondering why my computer was not making any progress. It was one of these work units!

Thanks!


River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 14009 - Posted: 18 Apr 2006, 4:48:32 UTC - in response to Message 13576.  
Last modified: 18 Apr 2006, 4:51:33 UTC

...
But for now please abort
...
TRUNCATE_TERMINI_FULLRELAX_1enh__433
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433
TRUNCATE_TERMINI_FULLRELAX_1ptq__433
TRUNCATE_TERMINI_FULLRELAX_2tif__433


>doh<

I only noticed I had this when it ran to 27 hours. Sorry everyone @ Bakerlab, it was sat in my cache for 5 days before it ran, while the box focussed on another project. Main reason for posting is to remind others that if you have a sizable cache or combined caches from other projects, you still might have one of these beasties lurking...

Mine is aborted now, and a look at the output suggests it did nothing useful. I wonder how much longer it would have crunched for before it hit the buffers?

Anyway, it is my fault for not noticing before it started, you gave us fair warning.

River~~

gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 14419 - Posted: 22 Apr 2006, 22:04:03 UTC

I have a remotely based P4

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=143202

to which I have no easy means of access and which had a 2 day buffer before these WUs hit. It has crunched and timed out one of them

https://boinc.bakerlab.org/rosetta/result.php?resultid=16975136

and is probably crunching another right now, as it hasn't communicated with the server in several days. I'm worried there are still more left in the buffer...

Will your credit granting script cover these timeouts? I really hope so because I'm effectively down one machine until these WUs run out :(

-gpcola

gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 14420 - Posted: 22 Apr 2006, 22:09:09 UTC

dammit, trust me to not check thoroughly before I post - I see that WU has actually been granted 300 points... Why so little though - claimed credit was 727.45 :/

-gpcola

Theadalus
Joined: 6 Nov 05
Posts: 7
Credit: 9,945,810
RAC: 0
Message 14503 - Posted: 23 Apr 2006, 22:35:50 UTC

I aborted the following WU after 17+ hours and never got credited :(

- TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_131 (ID #13909791)

MattDavis
Joined: 22 Sep 05
Posts: 206
Credit: 1,377,748
RAC: 0
Message 14549 - Posted: 24 Apr 2006, 21:30:32 UTC

Why the HECK wouldn't you list on the FRONT PAGE that there were problem work units?

I looked at a computer (that I usually leave alone, trusting BOINC and the projects) AND SAW A WORK UNIT WAS UP TO 127 HOURS AND STILL NOT COMPLETED.

I check the news (front page) EVERY DAY but had to look in the forums for this. WHY WASN'T THIS ON THE FRONT PAGE? I've wasted HUNDREDS OF HOURS on two computers because this wasn't on the front page.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14550 - Posted: 24 Apr 2006, 22:01:24 UTC

I'm with Matt: I expect to see items that affect everyone in the project news. If it's any consolation to you, Matt, I believe you'll get credit for all of that time when they do their credit run at the end of the week.

Also, three upcoming changes should prevent this from happening again.
1) WUs are going to be tested on Ralph before being sent en masse to Rosetta.
The version they are testing on Ralph has two new features:
2) A watchdog will kill stuck WUs if they occur again, and report details back to the project so they can investigate why the WU got stuck and make any necessary corrections.
3) Checkpointing will occur more frequently and keep WUs moving forward, even when they don't get long periods of run time on the machine.

So these should help assure this doesn't happen again. The R@H team took the actions needed to prevent such problems in the future so we can all crunch more productively.
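To give a feel for what a watchdog like that does, here is a minimal sketch. It is not the actual Rosetta code, which nobody has posted here: the class name `ProgressWatchdog`, the one-hour stall limit in the example, and the idea of feeding it a "fraction done" value are all assumptions made for illustration.

```python
import time


class ProgressWatchdog:
    """Flags a task as stuck when its progress has not advanced for too long."""

    def __init__(self, max_stall_seconds, clock=time.monotonic):
        self.max_stall_seconds = max_stall_seconds
        self.clock = clock              # injectable so the logic can be tested
        self.last_progress = None       # highest progress value seen so far
        self.last_change = clock()      # time of the last real advance

    def report(self, progress):
        """Record the latest progress value (for example, fraction done)."""
        if self.last_progress is None or progress > self.last_progress:
            self.last_progress = progress
            self.last_change = self.clock()

    def is_stuck(self):
        """True when no advance has been reported within the stall limit."""
        return self.clock() - self.last_change > self.max_stall_seconds


if __name__ == "__main__":
    watchdog = ProgressWatchdog(max_stall_seconds=3600)  # assumed limit
    watchdog.report(0.15)
    print("stuck" if watchdog.is_stuck() else "still making progress")
```

A real watchdog would run alongside the science code and end the task when `is_stuck()` turns true; this sketch only covers the detection step.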
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

MattDavis
Joined: 22 Sep 05
Posts: 206
Credit: 1,377,748
RAC: 0
Message 14557 - Posted: 24 Apr 2006, 23:50:39 UTC
Last modified: 24 Apr 2006, 23:50:47 UTC

It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks.

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14560 - Posted: 25 Apr 2006, 2:14:21 UTC - in response to Message 14557.  
Last modified: 25 Apr 2006, 2:16:37 UTC

It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks.


This will not happen again after the next release of the application, which should be out by tomorrow. The new version has a soft shut-off feature for errant work units. It will stop an overrunning work unit, report any results it may have produced, and claim the appropriate credit.

This information did not appear on the home page or in the news because the project does not like to put transient information in permanent locations, and the news is predominantly for use outside the project.

In any case the problem you had should be gone very soon.

Moderator9

gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 14584 - Posted: 25 Apr 2006, 12:44:30 UTC - in response to Message 14550.  

If it's any consolation to you Matt, I believe you'll get credit for all of that time when they do their credit run at the end of the week.

Well, clearly Matt won't get credit for ALL that time lost if my own timed-out WU is anything to go by.

I have to say I'm extremely annoyed - you released bugged WUs and then stated (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1365#13591) that you will give credit to WUs that time out... but... this credit is nowhere near the full amount that would have been granted had the WU (or rather the WUs that would have been crunched in the same period) been crunchable. I'm in a position where I CAN'T babysit this computer, and for that I'm penalised. Seems just a little unfair, don't you think?

Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 14593 - Posted: 25 Apr 2006, 15:44:22 UTC

I don't think the project team realises that it is not always possible to babysit the computers running R@h. Of the 10 computers I'm running, 4 are not accessible on a frequent or regular basis because they are located elsewhere. So aborting isn't an option. And what about the "smart" remark "Look at the graphic display to see if a WU is stuck" when you're running BOINC as a service or from the CLI client, as I do with my Macs?

The only way I notice that anything is wrong is when I see my daily score collapsing.
(E.g. today I had to phone the owner of one of the external PCs, and after a long and annoying conversation he could tell me that a WU is at 15 hours and at 15% (but BOINC is installed as a service, so I don't know if it's stuck). He doesn't mind running a DC project for me, but he certainly did not like being bothered with it.)

I think the project team has to try harder to keep us (at least me) enthusiastic about this project.

By the way: once there was a promise to grant credits for the 'maximum cputime exceeded' error as well. Did anyone see that happen?

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 14594 - Posted: 25 Apr 2006, 16:11:03 UTC - in response to Message 14593.  
Last modified: 25 Apr 2006, 16:27:52 UTC


I think the project team has to try harder to keep us (at least me) enthusiastic about this project.

As far as I'm concerned they don't need to bother anymore.
Lost the motivation to run this project just before the start of our Stampede.
The DPC-poll however showed the majority wanted to run R@H so I had no choice.
The changes are not enough to keep me here, so another 5 days and I'm gone.


By the way: once there was a promise to grant credits for the 'maximum cputime exceeded' error as well. Did anyone see that happen?


Do you really think/expect that to happen?
Still waiting for credits but not for long.

@ gpcola : Just running to keep XS from second place during our Stampede.
The 1st of May you're welcome to take second place, I'm leaving.
Tell movieman to bring the promised wooden shoes to Bubbles.

gpcola
Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 14595 - Posted: 25 Apr 2006, 16:18:17 UTC - in response to Message 14593.  

I think the project team has to try harder to keep us (at least me) enthusiastic about this project.

I think my enthusiasm waned after about the 15th or 20th uncrunchable WU I encountered (1%ers were the bug of choice at that time). Since the 4.97 fiasco I have to say that the only reason I'm still on this project is because my team (xtremesystems) is currently trying to take second place from you cows :P

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 14601 - Posted: 25 Apr 2006, 17:22:06 UTC - in response to Message 14593.  

I don't think the project team realises that it is not always possible to babysit the computers running R@h.


I think the project team has more computers to babysit than any of us do. And they fully realize these problems need to be addressed. This is why they have addressed them in several ways. They are simply testing these enhancements on Ralph for a few days to confirm the changes resolve the problems they are intended to resolve, and to assure that the rollout to Rosetta will be a smooth, babysitter free one.

Stay the course. This is the most responsive project team out there. Responses are not instantaneous. It is the nature of software. But they have consistently taken interest and action in getting symptom reports, and delivering resolutions and process changes to assure that mistakes are not repeated.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 14602 - Posted: 25 Apr 2006, 17:37:44 UTC - in response to Message 14560.  

It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks.


This will not happen again after the next release of the application, which should be out by tomorrow. The new version has a soft shut-off feature for errant work units. It will stop an overrunning work unit, report any results it may have produced, and claim the appropriate credit.

This information did not appear on the home page or in the news because the project does not like to put transient information in permanent locations, and the news is predominantly for use outside the project.

In any case the problem you had should be gone very soon.


Perhaps the project could do a different RSS feed for these types of issues -

"NOTICE - Please Abort all WUs of type XXXXXXXXXXXXXXXXX"

"Project is down"

"New Version 5.x.x Released"

That way people don't have to monitor the forums to find out this information, and it would keep it off the front page News (although other projects DO use the front page News and Tech News for those types of announcements).
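Something like the following could produce such a feed. This is just a sketch of the idea using Python's standard library, not anything the project has said it will run: the function name `build_notice_feed`, the channel title, and the description text are placeholders, and the notice strings are the examples from above.

```python
import xml.etree.ElementTree as ET


def build_notice_feed(title, link, notices):
    """Build a minimal RSS 2.0 document with one <item> per notice string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 channels need a title, a link and a description.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Short-lived operational notices"
    for text in notices:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    feed = build_notice_feed(
        "Rosetta@home notices",                 # placeholder channel title
        "https://boinc.bakerlab.org/rosetta/",
        [
            "NOTICE - Please Abort all WUs of type XXXXXXXXXXXXXXXXX",
            "Project is down",
            "New Version 5.x.x Released",
        ],
    )
    print(feed)
```

Keeping these notices in their own feed would leave the front page news untouched while still letting people subscribe to the alerts.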


Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)

Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 14616 - Posted: 25 Apr 2006, 19:53:57 UTC - in response to Message 14594.  
Last modified: 25 Apr 2006, 19:56:58 UTC

...
By the way: once there was a promise to grant credits for the 'maximum cputime exceeded' error as well. Did anyone see that happen?

Do you really think/expect that to happen?
Still waiting for credits but not for long....


This question has come up again and again. As I have said before, the project IS GOING TO GRANT THOSE CREDITS. I have also indicated that a number of conditions had to be met before that could happen. All of those conditions have now been met except one, and it is a big one. The project never stabilized into a no error mode, despite the reductions in the error rate. The most annoying errors were the ones that remained.

The time that would have been spent awarding the credit for Max time errors has been devoted to chasing down and stopping other bugs in the system.

As I mentioned before, there was no point in doing the award process over and over in a piecemeal way when, if the project waited until the right time, the award could be done all at once. Soon after the deployment of the new version of Rosetta, there should be time (assuming it works OK) to go back and award these credits. The projected time I was given was mid to late March for the awards. If you look back over that time, there have been a number of issues that HAD to be addressed. Most users on this project would prefer to STOP the loss of additional credits/time by fixing the bugs, and then worry about cleaning up the credit awards. You may disagree with that approach, but that is what the project has decided to do.


As a personal note:

I am not just a Moderator on these forums, despite appearances resulting from the zero credits shown with my Moderator ID. I also run this project on my systems, and I personally have credits outstanding from the Max time problems totaling well over 9,000 credits. I have repeatedly put this question to the project team on behalf of the entire user community, and the answer is and always has been that the credit will be awarded. When that changes I (or the project) will tell you.

Your insinuations and innuendoes implying that the Project team has misled you in some way and does not intend to follow through on their promise are not true, nor are they welcome.
Moderator9

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 14619 - Posted: 25 Apr 2006, 20:06:54 UTC - in response to Message 14616.  
Last modified: 25 Apr 2006, 20:09:06 UTC

...
By the way: once there was a promise to grant credits for the 'maximum cputime exceeded' error as well. Did anyone see that happen?

Do you really think/expect that to happen?
Still waiting for credits but not for long....


As a personal note:

Your insinuations and innuendoes implying that the Project team has misled you in some way and does not intend to follow through on their promise are not true, nor are they welcome.


As I stated before I'll run R@H for just another 5 days and I promise you here and now, I will never reply here ever again.
Lost my confidence some time ago and that's deadly.



"Sick and tired by Anastasia"