Message boards : Number crunching : Please ABORT these 4 stuck workunits
Author | Message |
---|---|
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi all: We are getting reports back that the following four workunits are consistently getting stuck and hanging clients. The problem is currently under investigation. But for now please abort these work units so that your client can do something useful: TRUNCATE_TERMINI_FULLRELAX_1enh__433, TRUNCATE_TERMINI_FULLRELAX_1b3aA_433, TRUNCATE_TERMINI_FULLRELAX_1ptq__433, TRUNCATE_TERMINI_FULLRELAX_2tif__433. Thanks! |
[DPC]Charley Send message Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0 |
Great, thanks for the warning. Just did with one that seemed to be getting stuck on one of those on this box :) |
crazyk4952 Send message Joined: 23 Oct 05 Posts: 1 Credit: 541,814 RAC: 0 |
I was wondering why my computer was not making any progress. It was one of these work units! Thanks! |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... >doh< I only noticed I had this when it ran to 27 hours. Sorry everyone @ Bakerlab, it was sat in my cache for 5 days before it ran, while the box focussed on another project. Main reason for posting is to remind others that if you have a sizable cache or combined caches from other projects, you still might have one of these beasties lurking... Mine is aborted now, and a look at the output suggests it did nothing useful. I wonder how much longer it would have crunched for before it hit the buffers? Anyway, it is my fault for not noticing before it started, you gave us fair warning. River~~ |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
I have a remotely based P4 https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=143202 to which I have no easy means of access and which had a 2 day buffer before these WUs hit. It has crunched and timed out one of them https://boinc.bakerlab.org/rosetta/result.php?resultid=16975136 and is probably crunching another right now, as it hasn't communicated with the server in several days. I'm worried there are still more left in the buffer... Will your credit granting script cover these timeouts? I really hope so because I'm effectively down one machine until these WUs run out :( -gpcola |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
dammit, trust me to not check thoroughly before I post - I see that WU has actually been granted 300 points... Why so little though - claimed credit was 727.45 :/ -gpcola |
Theadalus Send message Joined: 6 Nov 05 Posts: 7 Credit: 9,945,810 RAC: 0 |
I aborted the following WU after 17+ hours and never got credited :( - TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_131 (ID #13909791) |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
Why the HECK wouldn't you list on the FRONT PAGE that there were problem work units? I looked at a computer (that I usually leave alone, trusting BOINC and the projects) AND SAW A WORK UNIT WAS UP TO 127 HOURS AND STILL NOT COMPLETED. I check the news (front page) EVERY DAY but had to look in the forums for this. WHY WASN'T THIS ON THE FRONT PAGE? I've wasted HUNDREDS OF HOURS on two computers because this wasn't on the front page. |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I'm with Matt; I expect to see items which affect everyone in the project news. If it's any consolation to you Matt, I believe you'll get credit for all of that time when they do their credit run at the end of the week. Also, there are three things coming up to prevent this from happening again. 1) WUs are going to be tested on Ralph before being sent en masse to Rosetta. The version they are testing on Ralph also has two new features: 2) A watchdog will kill stuck WUs if they occur again, and report details back to the project so they can investigate why the WU got stuck and make any necessary corrections. 3) Checkpointing will occur more frequently and keep WUs moving forward, even when they don't get long periods of run time on the machine. These should help ensure this doesn't happen again. The R@H team took the actions needed to prevent such problems in the future so we can all crunch more productively. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks. This will not happen again after the next release of the application. This should happen by tomorrow. In the new version there is a soft shut-off feature for errant work units. It will stop an overrunning work unit, report any results it may have produced, and claim the appropriate credit. The reason this information did not appear on the home page or in the news is that the project does not like to put transient information in permanent locations, and the news is predominantly for use outside the project. In any case the problem you had should be gone very soon. Moderator9 ROSETTA@home FAQ Moderator Contact |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
If it's any consolation to you Matt, I believe you'll get credit for all of that time when they do their credit run at the end of the week. Well clearly Matt won't get credit for ALL that time lost if my own timed out WU is anything to go by. I have to say I'm extremely annoyed - you released bugged WUs and then stated (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1365#13591) that you will give credit to WUs that timeout... but... this credit is nowhere near the full amount that would have been granted had the WU (or rather WUs that would have been crunched in the same period) been crunchable. I'm in a position where I CAN'T babysit this computer and for that I'm penalised. Seems just a little unfair don't you think? |
Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0 |
I don't think the project team realises that it is not always possible to babysit the computers running R@h. Of the 10 computers I'm running, 4 are not accessible on a frequent or regular basis because they are located elsewhere. So aborting isn't an option. And what about the "smart" remark "Look at the graphic display to see if a WU is stuck" when you're running BOINC as a service or from the CLI client, as I do with my Macs? The only way to notice that anything is wrong is when I see my daily score collapsing. (E.g. today I had to phone the owner of one of the external PCs, and after a long and annoying conversation he could tell me that a WU is at 15 hours and at 15% (but BOINC is installed as a service, so I don't know if it's stuck). He doesn't mind running a DC project for me, but he certainly did not like being bothered with it.) I think the project team has to try harder to keep us (at least me) enthusiastic about this project. By the way: there was once a promise to grant credits for the 'maximum cputime exceeded' error as well. Did anyone see that happen? |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
As far as I'm concerned they don't need to bother anymore. Lost the motivation to run this project just before the start of our Stampede. The DPC-poll however showed the majority wanted to run R@H so I had no choice. The changes are not enough to keep me here, so another 5 days and I'm gone.
Do you really think/expect that to happen? Still waiting for credits but not for long. @gpcola: Just running to keep XS from second place during our Stampede. The 1st of May you're welcome to take second place, I'm leaving. Tell movieman to bring the promised wooden shoes to Bubbles. |
gpcola Send message Joined: 31 Dec 05 Posts: 8 Credit: 361,118 RAC: 0 |
I think the project team has to try harder to keep us (at least me) enthusiastic about this project. I think my enthusiasm waned after about the 15th or 20th uncrunchable WU I encountered (1%ers were the bug of choice at that time). Since the 4.97 fiasco I have to say that the only reason I'm still on this project is because my team (xtremesystems) is currently trying to take second place from you cows :P |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I don't think the project team realises that it is not always possible to babysit the computers running R@h. I think the project team has more computers to babysit than any of us do. And they fully realize these problems need to be addressed. This is why they have addressed them in several ways. They are simply testing these enhancements on Ralph for a few days to confirm the changes resolve the problems they are intended to resolve, and to ensure that the rollout to Rosetta will be a smooth, babysitter-free one. Stay the course. This is the most responsive project team out there. Responses are not instantaneous. It is the nature of software. But they have consistently taken interest and action in getting symptom reports, and delivering resolutions and process changes to ensure that mistakes are not repeated. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Angus Send message Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
It's not about the credits. It's about how I can't trust BOINC to run without me checking up on it. I'm not one of those BOINC-obsessed people who checks their computers every 10 minutes. I leave my computers alone for weeks at a time, and I hate to go back and see that 127 hours have been wasted. Since I run many projects, that 127 hours was over the course of several weeks. Perhaps the project could do a different RSS feed for these types of issues - "NOTICE - Please Abort all WUs of type XXXXXXXXXXXXXXXXX" "Project is down" "New Version 5.x.x Released" That way people don't have to monitor the forums to find out this information, and it would keep it off the front page News (although other projects DO use the front page News and Tech News for those types of announcements). Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
... This question has come up again and again. As I have said before, the project IS GOING TO GRANT THOSE CREDITS. I have also indicated that a number of conditions had to be met before that could happen. All of those conditions have now been met except one, and it is a big one. The project never stabilized into a no-error mode, despite the reductions in the error rate. The most annoying errors were the ones that remained. The time that would have been spent awarding the credit for Max time errors has been devoted to chasing down and stopping other bugs in the system. As I mentioned before, there was no point in doing the award process over and over in a piecemeal way when, if the project waited until the right time, the award could be done all at once. Soon after the deployment of the new version of Rosetta, there should be time (assuming it works ok) to go back and award these credits. The projected time I was given was mid to late March for the awards. If you look back over that time, there have been a number of issues that HAD to be addressed. Most users on this project would prefer to STOP the loss of additional credits/time by fixing the bugs, and then worry about cleaning up the credit awards. You may disagree with that approach, but that is what the project has decided to do. As a personal note: I am not just a Moderator on these forums, despite appearances resulting from the zero credits shown with my Moderator ID. I also run this project on my systems, and I personally have credits outstanding from the Max time problems totaling well over 9,000 credits. I have repeatedly put this question to the project team on behalf of the entire user community, and the answer is and always has been that the credit will be awarded. When that changes I (or the project) will tell you. Your insinuations and innuendoes implying that the Project team has misled you in some way and does not intend to follow through on their promise are not true, nor are they welcome.
Moderator9 ROSETTA@home FAQ Moderator Contact |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
... As I stated before I'll run R@H for just another 5 days and I promise you here and now, I will never reply here ever again. Lost my confidence some time ago and that's deadly. "Sick and tired by Anastasia" |
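
Several posts above mention a watchdog that stops a work unit once it stops making progress. The thread does not say how the Rosetta@home watchdog actually decides a work unit is stuck, so the following is only a minimal sketch of one plausible approach: flag a task whose reported fraction done has not advanced within a time limit. The class name `StallWatchdog`, the `max_stall_seconds` threshold, and the sampling scheme are all hypothetical and are not taken from BOINC or Rosetta.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallWatchdog:
    """Flags a task whose reported progress has not advanced for too long.

    Hypothetical illustration only; not the real Rosetta@home watchdog.
    """
    max_stall_seconds: float                 # assumed threshold, not a project setting
    last_progress: float = -1.0              # highest fraction done seen so far
    last_change_at: Optional[float] = None   # time at which progress last advanced

    def observe(self, now: float, fraction_done: float) -> bool:
        """Record one progress sample; return True if the task looks stuck."""
        if self.last_change_at is None or fraction_done > self.last_progress:
            # First sample, or progress moved forward: reset the stall timer.
            self.last_progress = fraction_done
            self.last_change_at = now
            return False
        # No forward progress: stuck once the stall has lasted long enough.
        return (now - self.last_change_at) >= self.max_stall_seconds


# Example: progress sampled every 30 minutes, with a one-hour stall limit.
wd = StallWatchdog(max_stall_seconds=3600)
samples = [(0, 0.10), (1800, 0.15), (3600, 0.15), (5400, 0.15)]
flags = [wd.observe(t, p) for t, p in samples]
print(flags)
```

In this sketch the last sample is flagged because progress has sat at 15% for a full hour. A real client would then stop the task and report back, as described in the posts above; that part is left out here because the thread gives no detail on it.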
©2024 University of Washington
https://www.bakerlab.org