Message boards : Number crunching : Report Problems with Rosetta Version 5.16 I
Author | Message |
---|---|
Jphelan Joined: 7 Apr 06 Posts: 1 Credit: 88,443 RAC: 0 |
Since Rosetta 5.16 I have had to abort a greater number of work units after about a day, because a work unit freezes up while it is being worked on. |
Ian Joined: 14 Apr 06 Posts: 29 Credit: 326,863 RAC: 637 |
Couple more errors in the wee small hours (well, where I am anyway :)) https://boinc.bakerlab.org/rosetta/result.php?resultid=21060345 https://boinc.bakerlab.org/rosetta/result.php?resultid=21039948 Eyeballing it, I seem to go through bursts of great stability with no errors and then a brief period of alternating errors and success. Ian Cundell, St Albans, UK |
Seth Aaronson Joined: 5 Mar 06 Posts: 18 Credit: 3,976 RAC: 0 |
Moderator9, Since my errors and freezes seem to be related to the rosetta/BOINC screen saver, can you point me in the right direction to find some answers for the problems with that? Now that I am not using the BOINC screen saver, rosetta is error free for me. |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Moderator9, Seth, Yes. Could you please attach to Ralph at this address? The programmers are looking for problem systems to help find this specific error. Moderator9 ROSETTA@home FAQ Moderator Contact |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
There are too many errors with version 5.16 in my case. belldandy from pleiades, I hope you won't mind if I replace your original post with this one. The image was very large and it stretched the forum page display, requiring people to scroll right and left to read and reply to posts. I would recommend you try a project reset. There is no problem with the work unit batch, so the problem is local to your machine. I have seen this same error before, and on some systems a reset fixes it, on others an attach/reattach fixes it. If these things do not work then we will have to dig deeper. One thing I would recommend is that you upgrade to the BOINC 5.4.9 client. That is the current recommended version of BOINC. It is far more stable, and it works very well with version 5.16 of Rosetta. That alone might solve your problem. Moderator9 ROSETTA@home FAQ Moderator Contact |
belldandy from pleiades Joined: 2 Nov 05 Posts: 6 Credit: 102,731 RAC: 0 |
There are too many errors with version 5.16 in my case. I did use BOINC 5.4.9. I will try resetting the project tomorrow. Campeones everywhere! |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
A lot of my nodes are without work due to reaching their WU quotas. Rosetta should check their system and purge the BAD WUs they just sent out. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
A lot of my nodes are without work due to reaching their WU quotas. Rosetta should check their system and purge the BAD WUs they just sent out. I am not seeing any errors that look like a bad batch of work units, and I am running a mix of machines and work unit types. I have not seen any errors in over 2 weeks, with the single exception of a short group of work units that the project did abort. Could you post a few links to your results? Also, if you are seeing a lot of these, please attach to Ralph so we can get better diagnostics. Any of you running the screen-saver and seeing errors, please note that we are seeing problems on a few systems related to the screen-saver. Rhiju is tracking that down right now. So if you are seeing any relationship between the screen saver and your errors, please attach to the Ralph project. Moderator9 ROSETTA@home FAQ Moderator Contact |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Now this is weird: I reattached to Rosetta. I got a work unit that is not starting. When I checked the allotted DISK SPACE assigned to Rosetta by the manager I find that ZERO, bupkis, has been assigned. And RALPH, which has been assigned 1/11th of my resources, has 27+ Gigabytes assigned. There is no way a Rosetta WU can run on zero disk space. Can someone tell me what would drive the manager to do that? BTW I am attached to RALPH and I am waiting for jobs to run. |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Now this is weird: Your resource assignment is different from the disk use settings. In fact they are not directly related at all. That said, you are correct that Rosetta should be using about 15-20 MB of space. It is possible that it did not actually download, or that it did and BOINC has not noticed the change in disk use yet. Since it is not processing the work unit yet, you might try two things. 1) Make certain that rosetta has not been suspended in the projects tab. 2) Restart BOINC manager, and see if it wakes up. I directed Rhiju to your post at Ralph. He is thrilled at the data you provided (see his post to you there). He is contacting Rom (BOINC Development) to discuss your report. Thank you for helping them. Moderator9 ROSETTA@home FAQ Moderator Contact |
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
LINUX problem: I don't think Watchdog can catch it, because the whole process is sleeping. It was in this state for more than 2 days and the watchdog didn't catch it.
I also have leave-in-mem=yes, and it could be something with memory, as this is primarily a webserver and it has only 1GB RAM, so it can be low on RAM from time to time.
No, it wasn't a faulty WU. After restarting boinc, both WUs were completed successfully. |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
Now this is weird: More weirdness: The Rosetta exe and the Ralph Exe files have disappeared from the Task Manager. |
Thor[Free-DC] Joined: 24 Oct 05 Posts: 2 Credit: 354,251 RAC: 0 |
This is not really a bug, but it is bugging me: The new work units seem to have only very few "saving points". Which means, you put half an hour or even an hour of crunching in, shut down the computer for some reason, and when you get back to crunching, you have to start over again. I had this happen at least three times, so I wonder if there is any possibility to put more save spots in the WUs for the crunchers who are not running 24/7? Greets Thor[Free-DC] |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
This is not really a bug, but it is bugging me: You are already using the version that has had checkpoints added. Originally the checkpoints were only done at the end of a full model. Now they are every ~20 min. There will be a better way to tell when checkpoints occur in future versions, but they cannot add more checkpoints. This is just a limitation of the nature of the work at this time. It should only be falling back to the last checkpoint, not starting over. Unless of course you are shutting down when the percent complete is 1.04x%; then it will start over from the start because it has not check-pointed yet. Moderator9 ROSETTA@home FAQ Moderator Contact |
Laurenu2 Joined: 6 Nov 05 Posts: 57 Credit: 3,818,778 RAC: 0 |
This is not really a bug, but it is bugging me: I too have seen this happen. You reboot a PC that has an hour+ logged on it and it starts over at 00:00, so the checkpoints are not working on all WUs. And Mod 9, then you are the lucky one that does not get these errors. But just because you do not get them does not mean we are not getting them. If You Want The Best You Must forget The Rest ---------------And Join Free-DC---------------- |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
...And Mod 9, then you are the lucky one that does not get these errors. But just because you do not get them does not mean we are not getting them. You misunderstood the post. I am trying to DUPLICATE the errors you are getting and I am still not seeing them. I never said that you are not getting errors; I have no doubt you are. In any case I only mentioned that in response to the report that there is a bad batch of work units. At present there is no bad batch of work units that I can find. But if I cannot duplicate the errors you are seeing and I do not get the kind of information that might be required to see what is happening on your systems, then it becomes very difficult to help you. Could you at least provide a link to the error results so I can read the messages? We should both excuse each other's frustrations. This kind of diagnosis is not easy for either of us in an open forum, but so long as we communicate we can work to a solution together. Moderator9 ROSETTA@home FAQ Moderator Contact |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi belldandy: I just took a look at your results too. You're getting the same error every time -- and it's due to a problem reading in a file called bbdep02.May.sortlib.gz. (Not very obvious, huh?) It occurred with some 5.13 workunits also, maybe some old ones that were still running when you also got 5.16 on your system. I think that file is corrupted on your system. I'm not exactly sure how to fix this -- a boinc reinstall may trigger your system to re-download it. Alternatively, you could detach from the project, abort current workunits, and completely remove the directory that has this file, then start up BOINC again, and attach to the project. Thanks for posting -- hope one of those solutions works! It's certainly an error that we haven't seen before. There are too many errors with version 5.16 in my case. |
Rhiju Volunteer moderator Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi Laurenu2... can you post the results page for one of your nodes that has this problem? Thanks! I just looked through the pages for four or five of the nodes that are under your userid -- they all have had perfect success rates for the last three days! We're not aware of any bad WUs being sent out on rosetta@home, and have been checking that the error rates are low. Obviously, we need to know ASAP if there are any bad WUs. (There was a bad batch last week on ralph, but it was a small batch, and has been purged from the system.) A lot of my nodes are without work due to reaching their WU quotas. Rosetta should check their system and purge the BAD WUs they just sent out. |
Seth Aaronson Joined: 5 Mar 06 Posts: 18 Credit: 3,976 RAC: 0 |
Moderator9, What is the recommended way of doing that? Should I suspend rosetta after I've created a RALPH account, attach to RALPH, then start to use the BOINC screen saver? I'm also attached to SETI and Einstein. Please advise. -Seth |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
Moderator9, You can just treat RALPH like any other project for the most part. The biggest difference is that while credits are awarded on RALPH, there is no effort to restore lost credits. It is a development and diagnostic project. On a brighter note, you will get to see the next versions of RALPH before the rest of the world, and please do provide suggestions there if you think of any. The link I provided is the URL that BOINC Manager is going to ask you for. Once you are attached, set the project priority low, say 10-20 percent share of your system. This will ensure that when work is available you will get some, but it will not interfere with other processing too much. As far as running it, just treat it as you would rosetta. If you have errors, report them in the threads at RALPH, with a link to the result that had the error. Thank you for the help. Moderator9 ROSETTA@home FAQ Moderator Contact |
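
Editor's note on the "10-20 percent share" advice above: BOINC resource share is a relative weight, so a project's fraction of the machine is its share divided by the sum of all attached projects' shares. The sketch below is only an illustration of that arithmetic. The function names and the example share values (100 per project, 10 for RALPH) are assumptions chosen for the example, not settings taken from anyone's account in this thread.

```python
def share_fractions(shares):
    """Return each project's fraction of the total resource share.

    `shares` maps a project name to its (relative) resource share value.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total resource share must be positive")
    return {name: value / total for name, value in shares.items()}


def share_for_target(other_total, target_fraction):
    """Resource share a new project needs to reach `target_fraction`.

    Solves x / (other_total + x) = target_fraction for x, where
    `other_total` is the summed share of the projects already attached.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be between 0 and 1")
    return target_fraction * other_total / (1 - target_fraction)


if __name__ == "__main__":
    # Assumed values: Rosetta at 100 and RALPH at 10 gives RALPH 1/11th.
    print(share_fractions({"rosetta": 100, "ralph": 10}))

    # Assumed values: three other projects at 100 each; what share gives
    # RALPH 10 percent and 20 percent of the machine?
    print(share_for_target(300, 0.10))
    print(share_for_target(300, 0.20))
```

With those assumed numbers, a share of 10 against a single project at 100 works out to the "1/11th" figure mentioned earlier in the thread, and with three other projects at 100 each a RALPH share somewhere between roughly 33 and 75 lands in the 10-20 percent range. As noted above, this setting is separate from the disk use settings.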
©2024 University of Washington
https://www.bakerlab.org