Problems and Technical Issues with Rosetta@home

Author	Message
Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81152 - Posted: 7 Feb 2017, 5:28:30 UTC - in response to Message 81039. And while I'm annoyed, I may as well give this message a bump. 5 months now. That said, I'm aware we could be waiting another 4-8 years given recent events. I'll do a full boinc server upgrade when we get our hardware. This relates to David EK's message in the previous thread which reads: Our database server is running out of disk space. We had to reconfigure it which took a long time because it was over 140gigs, however it is operating at a very sluggish pace. Our project has been quite busy lately mainly due to Charity Engine providing 1000s of new hosts each day. This has been going on for quite some time and our database finally reached it's space limit with the current project configuration. We are working on a temporary solution since our full upgrade will take some time, in the order of months I am told. That's dated 8 Sep 2016, so 4 months ago. The new hardware must be due for delivery quite soon. What's the latest plan, please? <whistles> ID: 81152 · Rating: 0 · rate: / Reply Quote

[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 2205 Credit: 13,720,774 RAC: 6	Message 81153 - Posted: 7 Feb 2017, 9:01:20 UTC - in response to Message 81148. I think this alert only occurs when there are no non-android work units available. It's not a serious issue. IMHO, this is a serious issue, and I'm amazed that you think it's not a problem. I'm even more amazed that you respond after MONTHS. ID: 81153 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 81154 - Posted: 7 Feb 2017, 14:31:55 UTC To be clear, I took what DK said to mean that the specific text that is delivered when no work is available is not a serious issue (i.e. whether or not Android is mentioned), because the text does not change anything functionally (although fixing it would really make the event log much less confusing to people). The 24hr back-off experienced at times is a separate issue with implications for when work is reported and requested, system utilization, total work done etc., as mentioned by others. DK is looking into both issues. Rosetta Moderator: Mod.Sense ID: 81154 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81156 - Posted: 7 Feb 2017, 18:01:38 UTC - in response to Message 81154. Last modified: 7 Feb 2017, 18:09:42 UTC To be clear, I took what DK said to mean that the specific text that is delivered when no work is available is not a serious issue (i.e. whether or not Android is mentioned). Because the text does not change anything functionally (but would really make the event log much less confusing to people). The 24hr back-off experienced at times is a separate issue with implications to when work is reported and requested, system utilization, total work done etc. as mentioned by others. DK is looking in to both issues. The text is certainly confusing, but I think we worked out long ago what it meant to say. But the text is a consistent precursor to the unique form of backoff Boinc selects, bypassing the usual escalation. This legitimate complaint has been going on for 6 months or more and the only acknowledgement over the last few days has seemingly dismissed it as an issue. To me, whatever logic calls up that message leads directly to the 24hr backoff. Find the one and the other will be found 5 seconds later. Not 6 months, 5 seconds. We've been <remarkably> patient. I'd like this 'non-issue' to be dealt with and closed quickly so we can move on to hear about the much bigger one of the hardware upgrade and server software upgrade. On that, we can be much more understanding of delays due to both the cost and workload involved. Edit: luckily the bigger of my unattended PCs polled just in time to prevent backup project tasks coming down - partly due to some pre-planning I did before I left on Saturday. And I just realised I set my phone to "No New (Rosetta Android) Tasks" to clear up WCG tasks but forgot to unset it until this morning so ended up with even more WCG tasks... <sigh> ID: 81156 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81157 - Posted: 7 Feb 2017, 19:51:33 UTC When we run out of work units, the 24 hour backoff may be appropriate as it can help reduce the load on our servers when work is available again. So if you get the message, you can ignore it. If there is no work, there is no work. In the meantime, I'll see if we can improve the message to be more meaningful. I did increase the deadlines to 2 weeks for standard jobs, which should help keep your computers busy if we run out of work units. We do have plenty of work but it sometimes comes in waves. ID: 81157 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81158 - Posted: 7 Feb 2017, 21:18:43 UTC Also, please keep in mind that we had a significant outage a month ago that lasted a couple of weeks. This caused a reduction of work units because no one in the lab had access to their local data and local computing resources. It was caused by a campus-wide power outage. In the rare case that you are unable to keep a buffer of Rosetta@home work units, the most likely reason will be that there are no work units due to an unusual circumstance. However, these situations should be rare events. Sorry for any inconvenience. ID: 81158 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81159 - Posted: 7 Feb 2017, 23:08:52 UTC I updated the scheduler so that the hard coded default delays that were 24 hours are now 6 hours. We'll have to monitor how our servers handle this change, particularly after our job queue is at 0 for a while which hopefully will not happen anytime soon. I also changed the "Rosetta Mini for Android is not available for your type of Computer" message to "No work is available", which should avoid any confusion. Let me know if the issues persist. ID: 81159 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81160 - Posted: 8 Feb 2017, 0:22:16 UTC - in response to Message 81159. Short version: A 6 hour backoff ought to be a sufficient compromise even if people are unlucky enough to poll when there's no work twice running. Thanks. Longer version: most of the rest of that was nonsense. 1) There's often work 4 minutes after a 24hr backoff. No work just means at that precise moment and nothing more. 2) Increasing deadlines to 2 weeks for standard jobs makes no difference if the buffer is only a day or two large. The deadline could be 2 years and the buffer would still be exhausted in the day or two the buffer lasts. 3) The decisive factor is the difference between the backoff, the buffer size and the shortest deadline, so increasing the shortest deadline from 2 to 3 days made the biggest difference. That's appreciated. The rest is jam. 4) The power outage affected things most in early/mid January. We ran dry then for understandable reasons and switched projects to fill the gap. What's happening now is different. Meanwhile, hearing that the servers may or may not cope with a 6 hour backoff returns us to the question about the hardware/server-software upgrade. An updated statement on that is due after 5 months. Personally I'm already anticipating disappointment on that front, so better to get it out of the way sooner rather than later imo. ID: 81160 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81161 - Posted: 8 Feb 2017, 1:39:24 UTC Last modified: 8 Feb 2017, 1:41:12 UTC I'm just trying to help, respond to suggestions, requests, and questions, and also explain things. You can always email me at dekim at uw dot edu. That is the best way to get a fast response to urgent issues. I hope the changes I made in response to feedback help everyone, and sorry they didn't happen sooner. Just as a reminder to all, what I did was: 1. Changed the high priority job deadlines to 3 days and increased the standard deadlines to 2 weeks. If this increase to 2 weeks makes no difference, then I'd like to reduce it back to a week. I thought it would at least give users the option to increase their job buffer to help deal with the back off issue, but this also comes at the expense of increasing the size of our database, which we can handle at the moment. 2. I decreased the backoff delay to 6 hours, which is a hard coded value in the scheduler code. 3. I changed the "... is not available for your type of Computer" message to "No work is available", which is also hard coded in the scheduler code. I suspect the relatively recent no work backoff issues have been due to our database server hanging on some big queries like the table backups and stats dumps, causing the feeder queries to also take longer. I've been dealing with these issues lately but am working on what will hopefully be a long term solution. I might back up and then remove the 1 million Charity Engine users that are no longer active. I'll ask for an update about the server hardware upgrades but don't hold your breath. If there are any other issues and/or concerns please let me know. Thanks! ID: 81161 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 81162 - Posted: 8 Feb 2017, 5:54:19 UTC Thanks DK! Those were the main issues that I know have been annoying people. The 6 hour back-off should work well. It can help get results returned to you faster and hopefully keep the DB cleaner as well, because once the 24hr delay hits, all completed work stacks up "ready to report" until the delay time is reached. So it is more than just getting work, it is also the reporting back of it. I think where the extension to 2 weeks helps is when you do keep a fairly large buffer of work, but then have a pile of the shorter deadline work drop in at the front of the deadline list. So extending the 2day work into 3day work helps get the short deadline work done in time, and extending the other deadline marginally helps the rest of the queue proceed fairly normally. However, as Sid pointed out, since the project likes faster turn around, many of us with full time network connections tend to keep a work buffer of only a day or two. And so the extension out to 2 weeks really doesn't help that use-case at all (I believe the BOINC defaults are fairly close to that case as well). And so, if you intended to say that the DB server cannot really handle all of the outstanding WUs at the moment, it would seem that would be a change you could take back to help reduce the scale of the database. Although if the work is generally returned within 3 days (regardless of short or long deadline)... would the extended 2week deadline really result in more WUs residing in the DB? Rosetta Moderator: Mod.Sense ID: 81162 · Rating: 0 · rate: / Reply Quote

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81163 - Posted: 8 Feb 2017, 8:07:10 UTC I extended it to 2 weeks arbitrarily in order to give users the option to increase their buffer size to help deal with running out of work as this seemed to be the main issue for people - running out of work on the client. I assume most users don't really pay attention to this stuff but for the few that do, I wanted to give that option to help keep their clients filled with work. Our database has enough space at the moment if there is any increase in WUs but if it's not necessary as Sid suggests, then I'd like to bring it back to 1 week. 1 week is of course more in line with fast turn around times desired by anxious, impatient researchers. Is this ok for everyone? ID: 81163 · Rating: 0 · rate: / Reply Quote

Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 81165 - Posted: 8 Feb 2017, 19:02:48 UTC Well, just speaking with one voice here, but I think the "running out of work" issue is more resolved by the change from 24hr to 6hr back-off. So it was more the "sitting idle for 23hrs" that people were concerned about. And actually most were probably only idle for 12hrs of that time. It was actually the project running out of work, and the resulting 24hr back-off, that people were frustrated with. I understand that if you are out of work, as you say, you are out of work. And this will still happen. But I think your other changes accommodate. I'd suggest putting the long deadlines back to the 7 or 10 days. Could you tell us more about the weekly surge of rapid turn-around deadlines? Do these WUs get created over the course of the day Friday? Or is there suddenly half a million of them in the queue? I tend to extend my buffer at some point Thursday or Friday and load up a bit for the weekend. Understanding that the deadlines don't necessarily reflect when you need the work back (since the 3 days is defined on the task as it is assigned, and not from any given creation date), when do you need those WUs back to be most useful? Sunday evening? Pacific time? Monday morning? Rosetta Moderator: Mod.Sense ID: 81165 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81166 - Posted: 8 Feb 2017, 19:07:21 UTC - in response to Message 81161. Last modified: 8 Feb 2017, 19:58:48 UTC Posted too early - several aborted attempts to type this out including a blue-screen :( ID: 81166 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81168 - Posted: 8 Feb 2017, 19:57:25 UTC - in response to Message 81161. I'm just trying to help, respond to suggestions, requests, and questions, and also explain things. You can always email me at dekim at uw dot edu. That is the best way to get a fast response to urgent issues. I hope the changes I made in response to feedback helps everyone and sorry it didn't happen sooner. Just as a reminder to all, what I did was: 1. Changed the high priority job deadlines to 3 days and increased the standard deadlines to 2 weeks. If this increase to 2 weeks makes no difference, then I'd like to reduce it back to a week. I thought it would at least give users the option to increase their job buffer to help deal with the back off issue but this also comes at the expense of increasing the size of our database which we can handle at the moment. 2. I decreased the backoff delay to 6 hours which is a hard coded value in the scheduler code. 3. I changed the "... is not available for your type of Computer" message to "No work is available" which is also hard coded in the scheduler code. I suspect the relatively recent no work backoff issues have been due to our database server hanging on some big queries like the table backups and stats dumps, causing the feeder queries to also take longer. I've been dealing with these issues lately but am working on hopefully a long term solution. I might backup and then remove the 1 million Charity Engine users that are no longer active. I'll ask for an update about the server hardware upgrades but don't hold your breath. If there are any other issues and/or concerns please let me know. Thanks! First of all I should apologise for my previous post. I was still ticked off after the earlier message and I let that carry on through. I'm sorry for that and appreciate your actions this week. The default buffer is 0.35 days (0.1+0.25). 
The default runtime is 8 hours (0.33 days). When a full day backoff occurs, the resulting 1.68 days for default settings is right up against a 2-day deadline. Almost planning to fail - especially so when tasks from another project are in play. There's little scope for any increased buffer of tasks. With a combination of 3-day deadlines and a 6hr backoff (total 0.93 days) an unattended machine at default settings can even cope with a 2nd backoff. More importantly for me, it can cope with a reasonable buffer size of 1.5 or 2 days, so this fully addresses the issues reported and prevents tasks from backup projects taking their place. Thank you for that. (I'm not sure 2-day deadline tasks have worked their way out of the system yet, but the 6hr backoff goes a long way to resolving things in the meantime). Imo the 14 day deadline is a response to user settings that are better dealt with on our side. If longer deadlines encourage larger buffers at a time when tasks are at a premium, it sounds like a bad thing. If database queries hang and take longer, and this reduces the number of tasks becoming available and leads to more failed task requests, even more so, even if you can handle the increased database size. There's a case for saying 8 days is better than 7 (a week's absence plus a day to resolve) and I'd maintain that buffer size determines turnaround time more than longer deadlines, but if longer deadlines cause the problems you describe, it's not worth it. We should defer to your discretion on this matter and adapt to whatever you decide. Hopefully I can stop whining now... ID: 81168 · Rating: 0 · rate: / Reply Quote
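
The arithmetic in the post above can be checked with a short sketch. It assumes, as the post does, that worst-case turnaround is simply the work buffer plus one task runtime plus any scheduler backoffs, all in days. The function and variable names are illustrative only and are not BOINC settings or APIs.

```python
# All values are in days. The defaults are the figures quoted in the post;
# the names below are hypothetical, chosen only for this illustration.
DEFAULT_BUFFER = 0.1 + 0.25   # default work buffer (0.35 days)
DEFAULT_RUNTIME = 0.33        # default 8-hour runtime, rounded as in the post


def worst_case_turnaround(backoff, backoffs=1,
                          buffer=DEFAULT_BUFFER, runtime=DEFAULT_RUNTIME):
    """Buffer, plus one task runtime, plus any scheduler backoffs."""
    return buffer + runtime + backoffs * backoff


def slack(deadline, backoff, **kwargs):
    """Days left before the deadline; a negative value means a missed deadline."""
    return deadline - worst_case_turnaround(backoff, **kwargs)


if __name__ == "__main__":
    print(f"24hr backoff: {worst_case_turnaround(1.0):.2f} days")
    print(f"6hr backoff:  {worst_case_turnaround(0.25):.2f} days")
    print(f"slack, 2-day deadline, 24hr backoff: {slack(2.0, 1.0):.2f}")
    print(f"slack, 3-day deadline, two 6hr backoffs: {slack(3.0, 0.25, backoffs=2):.2f}")
    print(f"slack, 3-day deadline, 2-day buffer: {slack(3.0, 0.25, buffer=2.0):.2f}")
```

Under these assumptions the 24hr backoff leaves about a third of a day of slack against a 2-day deadline, while a 3-day deadline with a 6hr backoff still leaves room for a second backoff or a 2-day buffer. Tasks from other projects sharing the machine are not modelled here and would reduce the slack further.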

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81169 - Posted: 8 Feb 2017, 20:41:23 UTC - in response to Message 81165. Could you tell us more about the weekly surge of rapid turn-around deadlines? Do these WUs get created over the course of the day Friday? Or is there suddenly half a million of them in the queue? I tend to extend my buffer at some point Thursday or Friday and load up a bit for the weekend. Understanding that the deadlines don't necessarily reflect when you need the work back (since the 3 days is defined on the task as it is assigned, and not from any given creation date), when do you need those WUs back to be most useful? Sunday evening? Pacific time? Monday morning? Sounds like you two are ok to bring the deadline back to 8-10 days. The main cause of the database sluggishness is the large user and host tables which get backed up daily and also dumped for stats. For some reason it has been occurring more frequently lately. A moderate increase in workunits should be ok but it would be best to keep the deadlines shorter for reasonable turnaround and to reduce the number of workunits floating around and taking up space. The weekly surge has existed for years, we joined the CAMEO project in 2012. I'm not sure when I added the 2 day deadlines but they have been around for a while, years. The issues have cropped up lately because we have been running Android jobs fairly consistently, we've run out of standard workunits here and there, and our database server has been sluggish here and there. I'm trying to stabilize the system which should bring things back to normal. Every Friday evening we get 20 CAMEO targets submitted to Robetta. The targets have to be completed by Monday afternoon. These jobs get high priority and get put directly in front of our job queue. 
It usually takes an hour or so for the jobs to be generated after the first submission and may take all day to submit them all since there's a lot of pre-processing - domain prediction, template detection, etc. To be safe, it would be ideal to have the results back early Sunday since there is also post processing - refinement, model selection, domain assembly etc. ID: 81169 · Rating: 0 · rate: / Reply Quote

Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 81170 - Posted: 8 Feb 2017, 21:42:11 UTC IMHO: Science first. Just maintain the original deadlines... This "donate your unused CPU cycles" was probably interesting back in those days when there was no big difference in energy consumption between an idle or fully utilized CPU. Either have a dedicated machine for the purpose of crunching or stay away and join some stupid project like SETI. This heterogeneous architecture with all its uncertainties is already bad enough... ID: 81170 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81181 - Posted: 12 Feb 2017, 4:17:01 UTC - in response to Message 81169. Every Friday evening we get 20 CAMEO targets submitted to Robetta. The targets have to be completed by Monday afternoon. These jobs get high priority and get put directly in front of our job queue. It usually takes an hour or so for the jobs to be generated after the first submission and may take all day to submit them all since there's a lot of pre-processing - domain prediction, template detection, etc. To be safe, it would be ideal to have the results back early Sunday since there is also post processing - refinement, model selection, domain assembly etc. Did you mean early Sunday or early Monday? Either way, how is the 3 day deadline appropriate? From Friday evening to Monday morning is 2.5 days. Not that I want to get back into this, but istm 3-day deadlines miss your target, with 2.5 days the maximum you can pinch. ID: 81181 · Rating: 0 · rate: / Reply Quote
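
The Friday-to-Monday window in the post above can be illustrated with a small date calculation. The clock times are assumptions (the thread only says "Friday evening" and "Monday morning"), time zones are ignored, and the helper names are made up for this sketch. Following Mod.Sense's earlier remark, the deadline is counted from the moment a task is assigned rather than from when it was created.

```python
from datetime import datetime, timedelta

# Assumed clock times; the thread only says "Friday evening" and
# "Monday morning". Time zones are ignored for simplicity.
SUBMITTED = datetime(2017, 2, 10, 20, 0)  # a Friday, 20:00
NEEDED_BY = datetime(2017, 2, 13, 8, 0)   # the following Monday, 08:00


def window_days(start, end):
    """Length of the interval from start to end, in days."""
    return (end - start) / timedelta(days=1)


def due_in_time(assigned, deadline_days, target=NEEDED_BY):
    """True if the latest permitted return time is on or before the target."""
    return assigned + timedelta(days=deadline_days) <= target


if __name__ == "__main__":
    print(window_days(SUBMITTED, NEEDED_BY))
    print(due_in_time(SUBMITTED, 2))
    print(due_in_time(SUBMITTED, 3))
    # A task handed out on Saturday at noon, with a 2-day deadline
    print(due_in_time(SUBMITTED + timedelta(hours=16), 2))
```

With these assumed times the window is 2.5 days, so a 3-day deadline cannot guarantee results by Monday morning even for a task assigned immediately, and a 2-day deadline only guarantees it for tasks assigned by Saturday morning. A deadline is only an upper bound, though, so tasks returned well before it would still arrive in time.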

David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 81182 - Posted: 12 Feb 2017, 5:17:43 UTC - in response to Message 81181. Every Friday evening we get 20 CAMEO targets submitted to Robetta. The targets have to be completed by Monday afternoon. These jobs get high priority and get put directly in front of our job queue. It usually takes an hour or so for the jobs to be generated after the first submission and may take all day to submit them all since there's a lot of pre-processing - domain prediction, template detection, etc. To be safe, it would be ideal to have the results back early Sunday since there is also post processing - refinement, model selection, domain assembly etc. Did you mean early Sunday or early Monday? Either way, how is the 3 day deadline appropriate? From Friday evening to Monday morning is 2.5 days. Not that I want to get back into this, but istm 3-day deadlines miss your target, with 2.5 days the maximum you can pinch. 2 days was more optimal. 3 days was in response to user complaints. We can see how it goes. Thanks. ID: 81182 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2596 Credit: 47,220,881 RAC: 1	Message 81184 - Posted: 14 Feb 2017, 2:11:30 UTC - in response to Message 81182. Last modified: 14 Feb 2017, 2:12:30 UTC Every Friday evening we get 20 CAMEO targets submitted to Robetta. The targets have to be completed by Monday afternoon. These jobs get high priority and get put directly in front of our job queue. It usually takes an hour or so for the jobs to be generated after the first submission and may take all day to submit them all since there's a lot of pre-processing - domain prediction, template detection, etc. To be safe, it would be ideal to have the results back early Sunday since there is also post processing - refinement, model selection, domain assembly etc. Did you mean early Sunday or early Monday? Either way, how is the 3 day deadline appropriate? From Friday evening to Monday morning is 2.5 days. Not that I want to get back into this, but istm 3-day deadlines miss your target, with 2.5 days the maximum you can pinch. 2 days was more optimal. 3 days was in response to user complaints. We can see how it goes. Thanks. Ok. It'll affect you more than us. I was going to restore my buffer from 0+1.5 to 0+2.0 days with all these changes, but reflecting on everything that's been said I think 0.1+1.5 covers CASP, CAMEO and all other eventualities. A year-round setting that'll suit all my devices. It seems most of my deadlines are resolving themselves now and there haven't been any backoffs of any type for a while. ID: 81184 · Rating: 0 · rate: / Reply Quote

svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0	Message 81187 - Posted: 15 Feb 2017, 16:58:27 UTC I have a stuck workunit: it's spent several hours at least stuck on Model 6 Step 7205 (Fast Relax). It's an all-sheet hexamer. Is it worth letting it continue? 170214.3._fold_and_dock_SAVE_ALL_OUT_468868_40_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=901485891 ID: 81187 · Rating: 0 · rate: / Reply Quote