Message boards : Number crunching : Miscellaneous Work Unit Errors Version 5.01
Author | Message |
---|---|
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Dimitris, I think you're right about this being an infinite loop. We're really glad you posted. Bin just tracked down a potential cause of this problem. Please abort this workunit. I'm canceling these workunits (they are a tiny fraction of the current queued jobs, so hopefully they won't tie up too many more machines). I see, but I think it's in an endless loop, because |
Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Another one down the tubes. I am going to let all the WUs left (I am not accepting new work) run their course and then will probably remove myself from the project unless I get a satisfactory answer THAT I CAN UNDERSTAND (I am not shouting). Let's say I am not too keen on inefficiency and my cup is running over. This is not "official advice", but if you would just follow my suggestions to install BOINC v5.4.5 (latest devel version) and attach to RALPH, the Rosetta folks will receive (automatically, via the new BOINC software) MUCH more elaborate bug reports, which would hopefully allow them to track it down. It'll take less time to take those two simple steps (upgrade BOINC and join RALPH) than to post the error results here. Maybe your PC isn't contributing to the science, but if it can lead to a "cure" of some software incompatibility, it's almost as good. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Please abort these jobs. The workunits with the following names appear to be causing problems on some machines: HBLR_1.0_XXX_ROT_TRIALS_TRIE_449... It's a bit strange, since we didn't see this problem in our Ralph tests. But just to be safe, go ahead and abort! I've got the following three work units running at a snail's pace but running all the same. Perhaps they are resetting themselves as above; I don't know, as this is the first time I've looked at them. An Athlon64 3200+ and Sempron 3300+ running FC4 2.6.15-1.1831_FC4 and Sempron 2500+ running Mandrake linux 2.6.9-1.667. They are respectively running 19 hours 32 minutes at 38.52 percent done, 16 hours 50 minutes at 7.83 percent and 17 hours 31 minutes at 3.96 percent. All on 5.01 |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Jose, this is interesting. At present, we can't understand what's going on either! We've seen your error a few times on different machines, but yours is the first case where we see it for nearly every job. I think Dimitris' advice is good. You'd help us a LOT if you attached your project to Ralph ... we'd get a bunch of nice error reports (since your client is frequently causing the same error) and hopefully that will give us enough info to solve this problem. Another one down the tubes. I am going to let all the WUs left (I am not accepting new work) run their course and then will probably remove myself from the project unless I get a satisfactory answer THAT I CAN UNDERSTAND (I am not shouting). Let's say I am not too keen on inefficiency and my cup is running over. |
charmed Joined: 2 Nov 05 Posts: 11 Credit: 1,780,440 RAC: 0 |
Oh look, it's the weekend again. Imagine that a new version released on Friday and problems right away. How many times do you need to be hit over the head before you guys learn :-) :-) Talk about gluttons for punishment!! |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Yeah, that's a good point about the weekend. Give us a little credit, though -- we released the app on a Thursday, which is better than some of our previous disastrous Friday night rosetta@home releases! If you're interested, I'm about to release the next build (5.02) -- with the watchdog thread -- on Ralph. If our timing is right, we could release the next app *early* next week, rather than later! Oh look, it's the weekend again. Imagine that a new version released on Friday and problems right away. How many times do you need to be hit over the head before you guys learn :-) :-) Talk about gluttons for punishment!! |
Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Oh look, it's the weekend again. Imagine that a new version released on Friday and problems right away. How many times do you need to be hit over the head before you guys learn :-) :-) Talk about gluttons for punishment!! I see your smilies. But to be fair... they tested it on Ralph, and released it Thursday night... problem is, if you have a backlog or other projects to run, you may not get any new WUs for a few days from when they release it. So, Mondays would be the best time to release new code. Gives the maximum number of weekdays for experience and results to come in. Jose, I'm not clear how they plan to get you any WUs from Ralph. They seem to get downloaded just as fast as they are put out there. Rhiju, why not just have him download the rosetta_5.01_windows_intelx86.pdb file?? Doesn't that give you the diagnostics you need? And it wouldn't disturb the rest of the environment. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Jose, the advice from Feet1st seems reasonable. If you decide not to join Ralph, please do get the pdb file that is linked below. As for Ralph, based on Feet1st's comments, I'll make sure to send out more work tonight! Oh look, it's the weekend again. Imagine that a new version released on Friday and problems right away. How many times do you need to be hit over the head before you guys learn :-) :-) Talk about gluttons for punishment!! |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Warning: This one is a tad long. Rhiju, at the end there is a very important question for you. Also look in the middle for another red message: I got at least one good piece of news :). Jose, the advice from Feet1st seems reasonable. If you decide not to join Ralph, please do get the pdb file that is linked below. Rhiju: When I aborted the first batch of nasty 5.01 work units, I told you that I was going to get a second batch and set it up not to receive more WUs until ALL the WUs in that batch were processed, and that I was going to report back to you on the final status of each WU. My reasoning was (and still is) that by checking the error reports you were going to find a common thread (for example type of WU, size of WU, etc.). [Note: As of now, the only commonality that has been found is the type of error: all the erroneous WUs of that batch, regardless of type and size, have reported the same error.] Having found the common thread for the error, you and the other scientists could then find ways to solve them and provide answers and suggestions to us. Call me obtuse, hard headed or plain ornery (please be gentle, I am a tad sensitive LOL LOL LOL), but until ALL the units in this batch are processed, I WON'T change anything that may alter the computing environment/conditions under which the whole batch is processed. For a complete/reliable analysis of how the full batch was processed, all the units in the batch have to be run under the same conditions: adding/subtracting environmental variables can make the findings of a comparative analysis of the units of the batch being analysed specious. Chances are very high that if there is an error reported in any of the additional WUs it is going to be the same one that has been reported, BUT what if NOT? So let's see if there is any other nasty little surprise lurking in those other units. 
But more important: I think it is important to see if in that batch of nasty WUs there are units that can be completed successfully, as that would allow you and the others to compare the WUs that ran successfully with the ones that failed and check for differences that may lead to an understanding of why some failed and some succeeded, without adding into the analysis the issue that I changed "horses in the middle of the stream" (that I changed the way my computer was operating during the computational processes). Guess what?!!! The last unit from the batch that has been processed was completed (insert happy emotie here) without error (dancing, dancing). So I am very happy to report this unit to you: https://boinc.bakerlab.org/rosetta/result.php?resultid=17838238 Would you and the team check it and see what differentiated it from all those nasty companions that came with it in the batch? That could give a clue as to what has been happening with the first and second batches of WUs I got. What the heck!!!! I have become for a while a guinea pig (at 390 pounds, a big guinea pig). So let me finish the batch without making any changes to the computational environment. I may even get more successes (I can dream, can't I?). Then, after the whole batch is done, I will revisit the suggestions made as to the way I am running Rosetta on my computer. Be warned: As the self-proclaimed "Official Rosetta Guinea Pig" I reserve for myself the right to some ARGHS every now and then. Rhiju, since this thread is one for reporting errors, is there another way I can report to you (as I previously promised I was going to do) any other successful WUs in the batch? "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
If memory doesn't fail me, I got it from the BOINC download page? I am almost sure. Don't take my word for it... I tend to forget things fast. Right now, I am trying to remember where I placed my eyeglasses (I think I use eyeglasses... LOL LOL). Please do read Message 14365, the one where I explain to Rhiju why I am going to let the batch that is currently running run without changes to my computer environment. It will help you to understand why I am doing it. Cordially Jose [Who at least today had a good, complete WU to report and who is facing the near orgasmic possibility that a second unit in a row will be completed without error (insert happy dancing emotie here)] "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
charmed Joined: 2 Nov 05 Posts: 11 Credit: 1,780,440 RAC: 0 |
Yeah, it was a lame attempt at humour but it did feel a little like we were getting stuck in a loop as well, ours being weekends ;-) I did abort the three long runners that I had reported earlier and things have been running fine since. |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Be still my heart!!!! Guess what?!!!: the second consecutive unit came through without errors https://boinc.bakerlab.org/rosetta/result.php?resultid=17838239 Could it be that after a rough patch of nasties... all the nasties were weeded out and the batch I got is starting to process the good seed? BTW I have noticed something that may or may not be significant: All the nasty WUs, when they were downloaded (as a matter of fact ALL the units in the batch), had times "To Completion" of more than 10 hours, which was higher than the estimate I had placed in my preferences (I had placed 10). The first WU that I reported from this batch as completed without errors and this one I am reporting now had times "To Completion" of less than 10. I just looked at my work batch thingie (pardon the technical jargon) in my BOINC Manager and now all the remaining WUs are reporting "To Completion" times less than when they were originally sent to me: all of them less than 10 (to be exact, all are now reporting 09:20:02). I don't know how that happened, but it happened, and after it happened... the WUs are working fine. Watch me jinx the rest. LOL LOL LOL Let's see what happens during the rest of the day. Sign me a Happy for now Jose. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0 |
What should I do with this one? 1.6% after 17:30:00 hours of crunching, but still very active with the graphics. If it's not an error or a stuck WU, I don't mind that it takes its time :) http://members.lycos.nl/oldbutnotsowise/fora/rosetta_wu.png (sorry for the crossposting.) |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
What should I do with this one? What type of WU is it? If they are FACONTACTS_RECENTER jobs and/or HBLR1.0 jobs, you may as well abort them; Rhiju asked for them to be aborted in another post. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
[DPC]Division_Brabant~OldButNotSoWise Joined: 23 Jan 06 Posts: 42 Credit: 371,797 RAC: 0 |
It's indeed a FACONTACTS_RECENTER job. I've read that post, but on the other hand, if it's not completely useless to let it run (for days?), it's OK with me. |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
Please abort these jobs. The workunits with the following names appear to be causing problems on some machines: I got one of these, and it was clearly stuck in the same loop that others have reported. I aborted it and it was promptly sent out to the next victim. Whatever you did to keep it from being resent didn't work. HBLR_1.0_1hz6_ROT_TRIALS_TRIE_449_8 |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Well, it seems that the "WUs from hell" have been exorcised from the batch. I am happy to report that a third unit today was completed without error. That is the third in a row. (Insert very Happy Dancing Emotie) https://boinc.bakerlab.org/rosetta/result.php?resultid=17838240 When I reported the second WU that was completed without error, I mentioned the fact that all the remaining WUs in the batch were reporting a "Termination Time" of 09:20:02 instead of the original "Termination Time" of more than 10 with which they were downloaded. When I checked the remaining 7 WUs still left in the batch now, the "Termination Time" they are reporting is even lower than the 09:20:02 I reported in my last post: the units are now reporting a "Termination Time" of 09:00:04. So I ask: was the reduction in Termination Time I have been noticing with the successful processing of each of the remaining WUs in the batch expected? Is it normal?? Status as of 5:23 PM AST April 22: 7 WUs left in the batch; of those, one is being executed right now (CPU time about 1:20 for about 27% completion and an estimated termination time of 6:13) and 6 units are ready to run. Let's see what happens with them. Note: The graphics are showing a lot of movement and a lot of the "little dots" showing up all over the graph area (again, excuse my use of extremely technical jargon). The images do give the impression of faster processing. And before I am told that the screen saver slows down the computation time, please be advised that given the sorry state of TV in America, the Rosetta@home screen saver is the only show in town in my household. (LOL LOL LOL) "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
ACCKKKK you were right. During all the hassles and confusion the time was changed to 6 hours. So the batch has been running 6 hour WUs. BTW one more WU from the batch completed without problems: https://boinc.bakerlab.org/rosetta/result.php?resultid=17838279 So that means 6 WUs from the batch left (1 is working and 5 pending). Re the lettuce and sunflower seeds: this guinea pig likes BBQ Beef, especially from powerful Dutch cows... LOL LOL LOL. (I know it is not nice to gloat, but I just read the team totals and my soul and intellect and psyche need a good gloating break.) I wouldn't mind seeing the "missing credits" credited... they certainly would taste very nice: like a yummy triple chocolate mousse. Sign me a happier and more relaxed Jose "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
...Re the lettuce and sunflower seeds: this guinea pig likes BBQ Beef ... Let's say that in my case my taste is less noble. LOL LOL LOL. "This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato |
XS_DDT's_Cattle_Prods Joined: 24 Mar 06 Posts: 12 Credit: 1,180,072 RAC: 0 |
...Re the lettuce and sunflower seeds: this guinea pig likes BBQ Beef ... OH, believe me, XS is doing all that they can to put a stop to these Mad Cows. |
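
Several posts in this thread report elapsed time against percent done (for example 19 hours 32 minutes at 38.52 percent, or 17:30:00 at 1.6%). A rough way to judge whether such a work unit is stuck is to extrapolate its total run time and compare it with the target run time set in preferences. The sketch below is only an illustration: it assumes progress is roughly linear (which may not hold for Rosetta), takes the 10 hour target from the preference mentioned in the thread, and uses an arbitrary factor of 3 as the "suspicious" threshold. The function names are hypothetical and nothing here is part of BOINC or Rosetta@home.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate total run time from elapsed time and percent done."""
    if percent_done <= 0:
        # No measurable progress yet: the projection is unbounded.
        return float("inf")
    return elapsed_hours * 100.0 / percent_done


def looks_stuck(elapsed_hours: float, percent_done: float,
                target_hours: float = 10.0, factor: float = 3.0) -> bool:
    """Flag a work unit whose projected run time exceeds factor * target.

    target_hours and factor are assumptions chosen for illustration only.
    """
    return projected_total_hours(elapsed_hours, percent_done) > factor * target_hours


if __name__ == "__main__":
    # (label, elapsed hours, percent done) as reported in the thread.
    reports = [
        ("Athlon64 3200+", 19 + 32 / 60, 38.52),
        ("Sempron 3300+", 16 + 50 / 60, 7.83),
        ("Sempron 2500+", 17 + 31 / 60, 3.96),
        ("FACONTACTS_RECENTER job", 17.5, 1.6),
    ]
    for label, elapsed, pct in reports:
        total = projected_total_hours(elapsed, pct)
        flag = "suspicious" if looks_stuck(elapsed, pct) else "plausible"
        print(f"{label}: projected {total:.1f} h total ({flag})")
```

A projection far beyond the target is only a hint that a work unit may be looping; the moderators' abort requests in the thread remain the authoritative guidance.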
©2025 University of Washington
https://www.bakerlab.org