Problems and Technical Issues with Rosetta@home

Author	Message
Ed Send message Joined: 2 Aug 11 Posts: 31 Credit: 662,563 RAC: 0	Message 70926 - Posted: 5 Aug 2011, 21:19:39 UTC - in response to Message 70925. Did you set the "While processor usage is less than X %" to 0%? Yes it is set at zero. And I shifted allocation. Seti and Rosetta are now 50 50 on a 2 core system so essentially they each have one core. I figured fighting disease is at least as important as finding ET. ID: 70926 · Rating: 0 · rate: /

TPCBF Send message Joined: 29 Nov 10 Posts: 111 Credit: 6,042,546 RAC: 24	Message 70965 - Posted: 8 Aug 2011, 1:09:01 UTC - in response to Message 70962. And what you're saying is? Ralf ID: 70965 · Rating: 0 · rate: /

Greg_BE Send message Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 70969 - Posted: 8 Aug 2011, 5:35:51 UTC - in response to Message 70962. Oh yeah, once again someone said awhile back that they would be monitoring the boards for discussions like this one. Once again the system fails and no one see's or says anything about it. It's not really an issue about the system failing - it simply that we don't (currently) have jobs that are ready to run right this minute. The running of the jobs on R@h is only one step in the process - it takes a while to figure out what sorts of jobs will give usable scientific results, to set up the jobs, test them to make sure they won't cause a huge failure rate, and then at the end of the runs to process the results to figure out what the next round should do. Usually we have enough things going on that the computational lull in one project will be covered by the compute phase of a different one. We just happen to have hit a point where none of the currently active projects is in an active compute phase. (And doesn't help that we're maximally distant from both the previous and next CASP - as you've probably noticed, activity seems to ramp up before [mad rush to finalize improvements], during, and after [post-analyis] CASP.) We're aware that the queue is empty - a message has been sent out on the appropriate internal mailing list. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. - It's somewhat trivial to re-run old jobs, but is that worth doing if no one is going to look at the results? I hesitate to say this, as I don't want it to sound like we're chasing you away(), but I'd agree with the implicit recommendation stated above to crunch other projects while we have this momentary lull. You can increase your stats on other projects secure in the knowledge that no one will gain on you with Rosetta@home. With any luck, we'll have new jobs for you early next week. (e.g. "We apologize for the inconvenience - Regular service should resume shortly.") ) We really do appreciate your efforts. Having access to the computational resources of R@h allows us to do things we couldn't do otherwise. Frankly speaking, I was surprised how quickly and easily R@h handled my recent jobs. I would have monopolized our local computational resources, but R@h crunched through it like it was nothing. - It's prompted me to think about possible process improvement experiments that I probably wouldn't have otherwise considered due to the computational cost. (Unfortunately, it's in the very preliminary stages and nowhere near the point where I could actually launch any jobs.) So why was this not posted out on the front page where everyone can read it instead of buried in deep in this topic? ID: 70969 · Rating: 0 · rate: /

TPCBF Send message Joined: 29 Nov 10 Posts: 111 Credit: 6,042,546 RAC: 24	Message 70971 - Posted: 8 Aug 2011, 6:05:44 UTC - in response to Message 70969. We're aware that the queue is empty - a message has been sent out on the appropriate internal mailing list. So why was this not posted out on the front page where everyone can read it instead of buried in deep in this topic? Looks like nobody who cares got the memo... :-( Ralf ID: 70971 · Rating: 0 · rate: /

[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2161 Credit: 13,047,946 RAC: 4,432	Message 70974 - Posted: 8 Aug 2011, 12:47:38 UTC - in response to Message 70969. So why was this not posted out on the front page where everyone can read it instead of buried in deep in this topic? +1 ID: 70974 · Rating: 0 · rate: /

Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 70977 - Posted: 8 Aug 2011, 17:04:18 UTC - in response to Message 70969. Oh yeah, once again someone said awhile back that they would be monitoring the boards for discussions like this one. Once again the system fails and no one see's or says anything about it. It's not really an issue about the system failing - it simply that we don't (currently) have jobs that are ready to run right this minute. The running of the jobs on R@h is only one step in the process - it takes a while to figure out what sorts of jobs will give usable scientific results, to set up the jobs, test them to make sure they won't cause a huge failure rate, and then at the end of the runs to process the results to figure out what the next round should do. Usually we have enough things going on that the computational lull in one project will be covered by the compute phase of a different one. We just happen to have hit a point where none of the currently active projects is in an active compute phase. (And doesn't help that we're maximally distant from both the previous and next CASP - as you've probably noticed, activity seems to ramp up before [mad rush to finalize improvements], during, and after [post-analyis] CASP.) We're aware that the queue is empty - a message has been sent out on the appropriate internal mailing list. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. - It's somewhat trivial to re-run old jobs, but is that worth doing if no one is going to look at the results? I hesitate to say this, as I don't want it to sound like we're chasing you away(), but I'd agree with the implicit recommendation stated above to crunch other projects while we have this momentary lull. You can increase your stats on other projects secure in the knowledge that no one will gain on you with Rosetta@home. With any luck, we'll have new jobs for you early next week. (e.g. "We apologize for the inconvenience - Regular service should resume shortly.") ) We really do appreciate your efforts. Having access to the computational resources of R@h allows us to do things we couldn't do otherwise. Frankly speaking, I was surprised how quickly and easily R@h handled my recent jobs. I would have monopolized our local computational resources, but R@h crunched through it like it was nothing. - It's prompted me to think about possible process improvement experiments that I probably wouldn't have otherwise considered due to the computational cost. (Unfortunately, it's in the very preliminary stages and nowhere near the point where I could actually launch any jobs.) So why was this not posted out on the front page where everyone can read it instead of buried in deep in this topic? +2 I mean... seriously... All it takes is to edit the home page and add this... and EVERYONE would be like "God, this project is serious, it's analyzing our work... doesn't like to send useless WUs to keep us busy like SETI. I'm going to add a similar project (like POEM@Home!) to further help this field of science". So little can do so much. ID: 70977 · Rating: 0 · rate: /

Sid Celery Send message Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 389	Message 70982 - Posted: 9 Aug 2011, 2:36:26 UTC - in response to Message 70919. When was then last time we had a post from the project about these problems? ...maybe everyone took a vacation at the same time Nothing since we were told there'd be no work until early next week. So, if everyone let the WU run longer, more work would be done with fewer WU and we would not run out of Rosetta tasks as often. This would make the best use of the admin's time, the server resources and probably give the scientists more bang for our efforts. It would seem to be that people would want to set their run times longer than 3 hours making every WU really count. Yes, that's exactly right. I usually have my preferences set to 8 hours, but once it became clear there'd be a delay in new WUs I maxed my remaining WUs to 24hrs. Some came to an end before that time, but most reached that kind of runtime. I ran out for a couple of hours on one core out of four last week on my desktop so Boinc grabbed some WCG WUs just as Rosetta WUs reappeared, and the same happened tonight, just as another batch seems to be coming through again, so I've had (almost) no downtime & only a few hours (maybe a day total) of running my backup project. On my laptop I had no downtime at all, but again a few WCG WUs came down to fill a buffer (maybe half a day's worth) both last week and tonight. So I keep just a 2 day buffer, only very intermittent resupply over the last 12 days, but less than a day on each machine not running my preferred project and almost no downtime to speak of. I only sneak a look at my status once a day (if that) and without any advice from the project team or mod it's pretty straightforward to guess what's happening with minimal intervention. It's not especially clever, so for the people this kind of thing matters to I'm surprised there's so much whinging to be honest. If people refuse to do even this small amount for themselves I don't see what can be complained about, especially as there's no permanent guarantee of continuous supply. In case you missed my mentioning it, new WUs seem to be available now. ID: 70982 · Rating: 0 · rate: /

Ed Send message Joined: 2 Aug 11 Posts: 31 Credit: 662,563 RAC: 0	Message 70987 - Posted: 9 Aug 2011, 9:05:34 UTC - in response to Message 70919. Last modified: 9 Aug 2011, 9:07:19 UTC So, if everyone let the WU run longer, more work would be done with fewer WU and we would not run out of Rosetta tasks as often. This would make the best use of the admin's time, the server resources and probably give the scientists more bang for our efforts. It would seem to be that people would want to set their run times longer than 3 hours making every WU really count. Follow-up to my earlier comment. For anyone who may not have seen it, this post provides some interesting insights around WU Run time and the actual work that is done. But the net net seems to be that setting a longer run time has a positive impact on the project and the infrastructure. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4489&nowrap=true#67551 ID: 70987 · Rating: 0 · rate: /

rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 70992 - Posted: 9 Aug 2011, 17:51:59 UTC I'm letting my computer run for 12 more hrs and then decide if i should disconnect until some answers can be found ID: 70992 · Rating: 0 · rate: /

Sid Celery Send message Joined: 11 Feb 08 Posts: 2541 Credit: 47,118,286 RAC: 389	Message 70993 - Posted: 9 Aug 2011, 18:10:31 UTC - in response to Message 70992. I'm letting my computer run for 12 more hrs and then decide if i should disconnect until some answers can be found Have you reported a problem somewhere I missed? WUs have been coming through all day - was there something else? ID: 70993 · Rating: 0 · rate: /

Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0	Message 70994 - Posted: 9 Aug 2011, 19:08:43 UTC Sid - just did a quick scan through my logs for today and I'm not seeing any problems with the number of available work units or with their successful completion. I also took a quick look at his task log and see that as recently as two days ago he was successfully processing tasks. Since then all I noticed was a bunch of "abort before start" (and not on the flex design WU) I did not see a post describing a problem so I can't offer any suggestions. ID: 70994 · Rating: 0 · rate: /

rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 71005 - Posted: 10 Aug 2011, 18:54:19 UTC - in response to Message 70993. I'm letting my computer run for 12 more hrs and then decide if i should disconnect until some answers can be found Have you reported a problem somewhere I missed? WUs have been coming through all day - was there something else? the units are not completing any more ID: 71005 · Rating: 0 · rate: /

Ed Send message Joined: 2 Aug 11 Posts: 31 Credit: 662,563 RAC: 0	Message 71021 - Posted: 11 Aug 2011, 3:09:17 UTC I have my preference set to 6 hours. I have run maybe 4 WU in the last 24 hours. They seems to be running to some kind of completion. Some are running more than 6 hours on the elapsed line. ID: 71021 · Rating: 0 · rate: /

Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0	Message 71024 - Posted: 11 Aug 2011, 4:41:16 UTC Rochester - the tasks which are not completing - are they actually pulling cycles or have they stalled - what does your Performance Monitor (or whatever Windows calls it) currently show. I also note you are running Windows 7 (unlike most of the Windows users here who have moldy copies of XP) - did you put on any maintenance this past weekend before the issue of non-completing tasks started? And finally, was whoever it was who gave you the moniker "New York" upset at you and seeking to punish you for something? ID: 71024 · Rating: 0 · rate: /

Chris Holvenstot Send message Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0	Message 71025 - Posted: 11 Aug 2011, 4:45:48 UTC Ed - thats about right - remember, the system will run past the target time if it is in the middle of a model (which for some unknown reason the call a decoy) If it does run past the target time it will either terminate when the model it is currently working on completes, or until the "watch dog" wakes up and terminates it. This occurs when you reach a point four hours past the target time. ID: 71025 · Rating: 0 · rate: /

Ed Send message Joined: 2 Aug 11 Posts: 31 Credit: 662,563 RAC: 0	Message 71027 - Posted: 11 Aug 2011, 11:21:07 UTC - in response to Message 71025. Ed - thats about right - remember, the system will run past the target time if it is in the middle of a model (which for some unknown reason the call a decoy) If it does run past the target time it will either terminate when the model it is currently working on completes, or until the "watch dog" wakes up and terminates it. This occurs when you reach a point four hours past the target time. Thanks for confirming my understanding of the process and the time factors. That was why I went to 6. That should allow it to run to perhaps 10 hours. If it is going longer than that something may be wrong. Happy cruncher. ID: 71027 · Rating: 0 · rate: /

rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 71029 - Posted: 11 Aug 2011, 11:36:25 UTC - in response to Message 71024. Last modified: 11 Aug 2011, 11:39:12 UTC Rochester - the tasks which are not completing - are they actually pulling cycles or have they stalled - what does your Performance Monitor (or whatever Windows calls it) currently show. I also note you are running Windows 7 (unlike most of the Windows users here who have moldy copies of XP) - did you put on any maintenance this past weekend before the issue of non-completing tasks started? And finally, was whoever it was who gave you the moniker "New York" upset at you and seeking to punish you for something? i don't think the tasks are even starting only maintenance might have been a auto de-frag i do once at the end of every month id need more info on the last new york thing https://boinc.bakerlab.org/rosetta/results.php?hostid=1423271&offset=60 ID: 71029 · Rating: 0 · rate: /

Greg_BE Send message Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 71034 - Posted: 12 Aug 2011, 2:59:11 UTC Keep an eye out on the new T0xxxxxxxxx tasks. Another person and me just had 1 each die on us. He got a validate error and mine crashed and burned 50% of the way through. ID: 71034 · Rating: 0 · rate: /

dcdc Send message Joined: 3 Nov 05 Posts: 1835 Credit: 124,952,580 RAC: 242	Message 71039 - Posted: 12 Aug 2011, 18:03:57 UTC - in response to Message 71034. Keep an eye out on the new T0xxxxxxxxx tasks. Another person and me just had 1 each die on us. He got a validate error and mine crashed and burned 50% of the way through. Same here - have one 50.940% through. BOINC time was increasing but using no CPU time. I've just suspended it and resumed it with no effect. The Time Remaining for it isn't given in BOINC Manager and the graphics close pretty quickly after opening without displaying anything... ID: 71039 · Rating: 0 · rate: /

rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 71040 - Posted: 12 Aug 2011, 19:02:11 UTC - in response to Message 71029. Rochester - the tasks which are not completing - are they actually pulling cycles or have they stalled - what does your Performance Monitor (or whatever Windows calls it) currently show. I also note you are running Windows 7 (unlike most of the Windows users here who have moldy copies of XP) - did you put on any maintenance this past weekend before the issue of non-completing tasks started? And finally, was whoever it was who gave you the moniker "New York" upset at you and seeking to punish you for something? i don't think the tasks are even starting only maintenance might have been a auto de-frag i do once at the end of every month id need more info on the last new york thing https://boinc.bakerlab.org/rosetta/results.php?hostid=1423271&offset=60 https://boinc.bakerlab.org/rosetta/results.php?hostid=1423271&offset=20 ID: 71040 · Rating: 0 · rate: /