simIF2...ProteinInterfaceDesign...-tasks

Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66859 - Posted: 12 Jul 2010, 14:33:34 UTC

How long are these simIF2...ProteinInterfaceDesign...-tasks going to be dispatched?

I hate those WUs. They usually run 10 instead of 6 hours, with the last save point at a run time of 5:50. I usually have 2 or 3 of those running, with 7 to 9 hours done, when I have to turn off my computers in the morning before I go to work. I don't want to know how much CPU time I have already wasted, since they fall back to 5:xx hours and are aborted shortly after I restart my computer in the afternoon.

I'm about to increase my buffer level so I can abort all simIF2-tasks before they start. Those WUs are a waste of time and energy if you can't leave your computer running 24/7.

cu Joe
ID: 66859 · Rating: 0
Chris Holvenstot

Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66870 - Posted: 13 Jul 2010, 5:30:20 UTC
Last modified: 13 Jul 2010, 5:30:49 UTC

Instead of selectively aborting these tasks (and I too seem to have had a bunch of long-running tasks over the past week), why don't you set your target CPU time down to 2 hours?

Then when you encounter one of these tasks it should run 2 hours + 4 hours for the watchdog for a total of 6 hours - well within your limits.

It seems that culling your work queue with selective aborts is a bit harsh and could tend to evoke the ire of the project managers.
ID: 66870 · Rating: 0
dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 115,568,105
RAC: 59,147
Message 66871 - Posted: 13 Jul 2010, 8:15:48 UTC - in response to Message 66859.  

How long are these simIF2...ProteinInterfaceDesign...-tasks going to be dispatched?

I hate those WUs. They usually run 10 instead of 6 hours, with the last save point at a run time of 5:50. I usually have 2 or 3 of those running, with 7 to 9 hours done, when I have to turn off my computers in the morning before I go to work. I don't want to know how much CPU time I have already wasted, since they fall back to 5:xx hours and are aborted shortly after I restart my computer in the afternoon.

I'm about to increase my buffer level so I can abort all simIF2-tasks before they start. Those WUs are a waste of time and energy if you can't leave your computer running 24/7.

cu Joe

Could you hibernate instead of shutting down? That way it should restart from where it was up to rather than the checkpoint (AFAIK).
ID: 66871 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66877 - Posted: 13 Jul 2010, 10:57:55 UTC

Thanks for your posts!

@Chris: I don't see how this would help. The two affected computers run from 5 pm to 9 am. I would guess the chances are pretty much the same that I run into a long-running model whose last checkpoint was hours ago. I could imagine that this would happen even more often, since more tasks are processed. From this point of view I would say that increasing the run time to 12 hours would result in fewer problems. But I'm assuming that WUs run up to 4 hours longer, independent of the actual run time (2, 6 or 12 hours). Does anybody have information on this?

@dcdc: I'm afraid hibernation is not an option. I've tried hibernation and found it corrupts my database connections, so I have to reboot anyway.

cu Joe
ID: 66877 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66880 - Posted: 13 Jul 2010, 16:36:47 UTC

I'm assuming that WUs run up to 4 hours longer, independent of the actual run time (2, 6 or 12 hours). Does anybody have information on this?


Yes, the current implementation of the watchdog is to wrap up anything that runs more than about 4 hours longer than your configured target runtime. In the past it was a multiple of the target runtime, which did not produce the desired results for those with a longer runtime configured.
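
As a rough illustration of the rule described above (not the watchdog's actual code), the cutoff can be approximated as the target runtime plus a fixed grace period. The function name and the exact 4.0 constant are assumptions made for this sketch; the post only says "about 4 hours".

```python
# Assumption: the grace period is treated as exactly 4 hours here,
# although the post describes it as "about 4 hours".
WATCHDOG_GRACE_HOURS = 4.0


def watchdog_cutoff_hours(target_runtime_hours: float) -> float:
    """Approximate CPU time (in hours) at which the watchdog would end a task."""
    return target_runtime_hours + WATCHDOG_GRACE_HOURS


# The three target runtimes mentioned in this thread.
for target in (2, 6, 12):
    cutoff = watchdog_cutoff_hours(target)
    print(f"target {target:>2} h: watchdog steps in at about {cutoff:.0f} h")
```

For the targets discussed in this thread, that works out to roughly 6, 10 and 16 hours of CPU time.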
Rosetta Moderator: Mod.Sense
ID: 66880 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66881 - Posted: 13 Jul 2010, 17:21:57 UTC

Ok, thanks for this information.

What about the run time? Is 8 hours still the preferred target run time (I've read something like this in a CASP9 thread - because of short deadlines?)? Otherwise I would increase the run time to 12 hours.

Increasing the run time seems to be the best approach, since fewer simIF2-tasks are processed, thus lowering the overall risk. I don't know which way the tasks behave. Will they generally 'go berserk' after 5:50 run time, or 10 minutes before the target run time is reached? If the latter, this would again lower the risk of having to turn off the computer after it has passed the 'magic threshold'.

cu Joe
ID: 66881 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66883 - Posted: 13 Jul 2010, 18:05:33 UTC

Your odds of getting any given type of work will be the same, regardless of your runtime preference. The only thing that happens 10 minutes before the target runtime is that the displayed estimated time remaining slows down, i.e. 5 more seconds of CPU time may reduce the estimate by only 4 seconds or fewer. This is simply a way to show forward progress without running into a negative time remaining (which most would consider "berserk" behavior, and which might lead them to panic and abort the task unnecessarily). So, time to completion slows exponentially in the last 10 minutes of estimated time, and if this goes on for 4 hours, then the watchdog will step in.
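
To make the slowing estimate concrete, here is a purely hypothetical model of such a display. It is not Rosetta's actual formula, which is not given in this thread; the exponential decay, the function name `displayed_remaining` and the 600-second constant are all assumptions chosen only to match the description above (normal countdown until 10 minutes remain, then an estimate that keeps shrinking but never goes negative).

```python
import math

# Assumption: the taper covers the last 10 minutes of the estimate.
TAPER_SECONDS = 600.0


def displayed_remaining(cpu_seconds: float, target_seconds: float) -> float:
    """Hypothetical 'time remaining' display; assumes target >= 10 minutes."""
    taper_start = target_seconds - TAPER_SECONDS
    if cpu_seconds <= taper_start:
        # Normal countdown: one CPU second removes one second of estimate.
        return target_seconds - cpu_seconds
    # Inside the taper: decay exponentially toward zero without reaching it.
    overrun = cpu_seconds - taper_start
    return TAPER_SECONDS * math.exp(-overrun / TAPER_SECONDS)


target = 6 * 3600  # a 6-hour target runtime, in seconds
for cpu in (5 * 3600, target - 600, target, 10 * 3600):
    print(f"CPU {cpu:>6} s: about {displayed_remaining(cpu, target):.3f} s shown")
```

In this sketch, 5 extra CPU seconds inside the taper always reduce the displayed value by less than 5 seconds, and the value stays positive even at the point where the watchdog would step in.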

I would go with the runtime that works for you, regardless of CASP. Many of the tasks being worked on are not CASP tasks anyway.
Rosetta Moderator: Mod.Sense
ID: 66883 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66884 - Posted: 13 Jul 2010, 18:44:58 UTC

So it's only luck that the last saved checkpoint is near 5:50 run time? There have been some with an earlier last saved checkpoint, but I never saw one with a later checkpoint. I was hoping that on 12 hours run time models, this last saved checkpoint would be at 11:50 run time. This last saved checkpoint is the point where the calculation is resumed after a reboot, isn't it?

Anyway, I didn't spot any of those tasks in the last 24 hours. There are only fc_A_noSmallMvs....ProteinInterfaceDesign...-tasks around, which so far don't seem to 'go berserk'.

cu Joe
ID: 66884 · Rating: 0
Chilean

Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 66885 - Posted: 13 Jul 2010, 20:12:56 UTC

This is why I have my CPU time @ 2 hours. Any error won't "hurt" me more than 2 hours max (1 hour for dual cores). I too hate it when I have to restart and the WUs go back 30 mins or so.
ID: 66885 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66886 - Posted: 13 Jul 2010, 20:18:28 UTC

In a task that normally produces a model every 15 minutes or so (I don't know what normal is for the specific tasks in question), and on rarer occasions hits models that run 4+ hours without reaching completion, the task will normally be able to complete within 15 minutes of the target run time. In fact, the only reason it would still be running more than 15 minutes after the runtime target would be that it has hit one of these long models. And the long models don't seem to checkpoint either; perhaps none of the models do for that type of task, but the save at the end of each model suffices for the normal models.

So yes, I think the odds are that when you spot a long-running model, it would show a checkpoint taken within a half hour of your target runtime. And it is more a matter of encountering the long-running model at the end of your target runtime than of what your actual target runtime setting is.

If you hit a model that took 3 hours to run, and it was the first model of the task, you'd probably not notice it at all... unless it was still in that 3 hours when it was time to shut down your machine for the day.
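
To put numbers on what a shutdown costs in this situation, the CPU time lost is simply the run time at shutdown minus the run time at the last checkpoint. The sketch below uses figures reported earlier in the thread (last save point at 5:50, tasks at 7 to 9 hours when the machine is switched off); the helper names are made up for this illustration.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'H:MM' run time such as '5:50' to minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def lost_minutes(run_time_at_shutdown: str, last_checkpoint: str) -> int:
    """CPU minutes discarded when a task restarts from its last checkpoint."""
    return max(0, to_minutes(run_time_at_shutdown) - to_minutes(last_checkpoint))


# Figures from the first post: checkpoint at 5:50, shutdown at 7 to 9 hours.
for shutdown in ("7:00", "9:00"):
    lost = lost_minutes(shutdown, "5:50")
    print(f"shutdown at {shutdown}: {lost // 60} h {lost % 60:02d} min lost")
```

Under these assumptions each affected task loses between about 1 hour 10 minutes and 3 hours 10 minutes of CPU time per shutdown.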
Rosetta Moderator: Mod.Sense
ID: 66886 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66887 - Posted: 13 Jul 2010, 20:21:05 UTC - in response to Message 66885.  

This is why I have my CPU time @ 2 hours. Any error won't "hurt" me more than 2 hours max (1 hour for dual cores). I too hate it when I have to restart and the WUs go back 30 mins or so.


You could be "hurt" 4 hours per task, regardless of your runtime preference. And I am not following how having 2 tasks in progress running on a dual core would make any difference, unless you are averaging one normally running task on core 1 with a long running model on core 2.
Rosetta Moderator: Mod.Sense
ID: 66887 · Rating: 0
strauch
Volunteer moderator
Project developer
Project scientist

Joined: 15 Mar 10
Posts: 7
Credit: 40,011
RAC: 0
Message 66911 - Posted: 14 Jul 2010, 22:10:22 UTC

Thanks for your comments! The simIF simulations are only slightly different from other Protein-interface Design tasks that ran with no complaints. I think that these long jobs are something of a rarity for simIF as well, and have a good idea for where the problem originates. While I investigate this, I will submit no more such jobs, but please note that some jobs that were already disseminated from the BOINC server might still be running on your computers for a few days. Despite these running problems, I'd like to comment that the designs that I'm getting out of these runs are extremely promising and I'm eager to test them experimentally! I'll obviously let you know how well they work.
ID: 66911 · Rating: 0
Chris Holvenstot

Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 66915 - Posted: 15 Jul 2010, 5:40:56 UTC

strauch said ...

While I investigate this, I will submit no more such jobs


Hey my friend, I don't know if you have the ability to "push" jobs out to a specific user or system but my systems are up and running 24 hours a day and I really don't care if I have some long-running, low credit jobs in the queue as long as I understand what is going on so I don't scratch my head and wonder if there is some sort of problem.

If it would be useful, you can target systems 1290176, 1277612, 1312275, or 1277776 - these systems always seem to have the shortest work queues. Because the Rosetta servers are so reliable I am set up to have only a 0.25 day queue, but BOINC seems to have a mind of its own and the queue for several of my systems is much longer.

My current run time is set to 4 hours and with the additional 4 hours grace provided by the watchdog you have a total run time of eight hours. If you need an adjustment in this just let me know.

Chris

ID: 66915 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66918 - Posted: 15 Jul 2010, 9:38:08 UTC - in response to Message 66911.  

... I think that these long jobs are something of a rarity for simIF as well, and have a good idea for where the problem originates.


Thank you very much for your feedback, strauch. I appreciate this.

I didn't get any of those simIF2-tasks in the last two days, but I had two fc_A_noSmallMvs-tasks on one of my computers this morning, both already running for 7 hours. One had its last checkpoint at 3:40, the other at 4:10. I had to turn off the computer, thus again losing 7 hours of work. I could post the links to the results this afternoon, if you want to have a look.

It's just too hot to leave the computers running. I already killed two of the five HDDs in my working rig last week, when I left the computers running during the day. My (rather small) home office also heated up to almost 50° C.
I'm already using NZXT Zero 2 towers, fully equipped with seven 120 mm fans, but close to 50° C (35° C outdoor temperature) just seems to be too much... And I can't leave the window open when I'm not at home.

cu Joe


ID: 66918 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66920 - Posted: 15 Jul 2010, 14:46:23 UTC

These are the two results:
fc_A_noSmallMvs_fc6x_1tev_ProteinInterfaceDesign_20Jun2010_21458_110_1
fc_A_noSmallMvs_fc6x_2ije_ProteinInterfaceDesign_20Jun2010_21458_53_0

They were both automatically ended shortly after I turned the computer on. The first one had been running for 7:30 and the second one for 7:10 when I turned off my computer in the morning.
So this type of ProteinInterfaceDesign task might have a similar problem.

cu Joe
ID: 66920 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66921 - Posted: 15 Jul 2010, 16:47:56 UTC

Jochen, have you considered just suspending BOINC (keeping suspended tasks in memory) during the hours of the day that you are not home? (You can set this in the preferences tab in the advanced view.) A computer that is left on but not actively processing produces dramatically less heat (and uses correspondingly less power). You might try it some time when you are home and just see how your temps work out. Another thing to consider would be running at less than 100%; say, 60 or 70% will reduce the heat output, just not as much as suspending.

Keep in mind, these are just ideas from one user to another. I'm just a volunteer, not a Rosetta developer. There is still the whole point that checkpoints should be more frequent and long-running models should be avoided, so I'm not skirting that. Just trying to give you alternatives to consider in the meantime.
Rosetta Moderator: Mod.Sense
ID: 66921 · Rating: 0
Jochen

Joined: 6 Jun 06
Posts: 133
Credit: 3,847,433
RAC: 0
Message 66922 - Posted: 15 Jul 2010, 19:12:18 UTC

Jochen, have you considered just suspending BOINC (keeping suspended tasks in memory) during the hours of the day that you are not home?

No, actually I haven't. Sounds reasonable. I will give it a try on Saturday.

cu Joe
ID: 66922 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 1982
Credit: 38,463,172
RAC: 15,101
Message 66924 - Posted: 16 Jul 2010, 2:27:39 UTC - in response to Message 66911.  

Thanks for your comments! The simIF simulations are only slightly different from other Protein-interface Design tasks that ran with no complaints. I think that these long jobs are something of a rarity for simIF as well, and have a good idea for where the problem originates. While I investigate this, I will submit no more such jobs, but please note that some jobs that were already disseminated from the BOINC server might still be running on your computers for a few days. Despite these running problems, I'd like to comment that the designs that I'm getting out of these runs are extremely promising and I'm eager to test them experimentally! I'll obviously let you know how well they work.

As long as these long-running jobs are providing promising results in scientific terms I will continue to run them as long as they require, whether they achieve low credits or not.

Eyes firmly on the prize here.
ID: 66924 · Rating: 0
Sarel

Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 66953 - Posted: 20 Jul 2010, 1:01:39 UTC - in response to Message 66920.  

Hi Joe,

I'm trying to figure out whether these fc tasks are giving you trouble as well. I see that both of these fc tasks ran for many hours and produced many models, and you got credit accordingly. As far as I know, in-between models are saved, so you would have gotten plenty of credit for these if you stopped them mid-way, wouldn't you?

Sarel.

These are the two results:
fc_A_noSmallMvs_fc6x_1tev_ProteinInterfaceDesign_20Jun2010_21458_110_1
fc_A_noSmallMvs_fc6x_2ije_ProteinInterfaceDesign_20Jun2010_21458_53_0

They were both automatically ended shortly after I turned the computer on. The first one had been running for 7:30 and the second one for 7:10 when I turned off my computer in the morning.
So this type of ProteinInterfaceDesign task might have a similar problem.

cu Joe


ID: 66953 · Rating: 0
Chilean

Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 66955 - Posted: 20 Jul 2010, 2:32:25 UTC - in response to Message 66887.  

This is why I have my CPU time @ 2 hours. Any error won't "hurt" me more than 2 hours max (1 hour for dual cores). I too hate it when I have to restart and the WUs go back 30 mins or so.


You could be "hurt" 4 hours per task, regardless of your runtime preference. And I am not following how having 2 tasks in progress running on a dual core would make any difference, unless you are averaging one normally running task on core 1 with a long running model on core 2.


What I meant was that if I set my run time to, say, 24 hours, and in the 15th hour I get an error, I lose 15 hours of work.
ID: 66955 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org