No checkpoint in more than 1 hour - Largescale_large_fullatom...

Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...

Aglarond

Joined: 29 Jan 06
Posts: 26
Credit: 446,212
RAC: 0
Message 13834 - Posted: 15 Apr 2006, 13:57:49 UTC

Hi, I just had to reboot my computer (Centrino 1.5, 1G RAM). R@H was showing around 1.5% progress after running for a little more than 1 hour, and it was still working on the 1st model. After the reboot it started from the beginning.

That means there was no checkpoint in more than 1 hour. The standard settings in BOINC are to switch projects every 60 minutes and not to leave them in memory. I think people with several projects and standard settings will never finish 1 model. Is that so?

I think there should be at least a warning on the front page that this project checkpoints rarely. (Something similar to what CPDN Seasonal has.)

Regards, Aglarond
ID: 13834 · Rating: 0
[DPC]Charley

Joined: 18 Mar 06
Posts: 9
Credit: 286,193
RAC: 0
Message 13836 - Posted: 15 Apr 2006, 15:37:39 UTC

Yes, you're right.
Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea).
ID: 13836 · Rating: 0
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13867 - Posted: 16 Apr 2006, 1:18:40 UTC - in response to Message 13836.  

Yes, you're right.
Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea).



OK, I'm not sure I get it???

Is the answer to leave these in memory? Change settings to switch projects less frequently? Only do Rosey on dedicated (one project) machines? Abort these WUs?

Any advice would be appreciated.

-Sid
Proudly crunching with TeAm Anandtech
ID: 13867 · Rating: 0
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13869 - Posted: 16 Apr 2006, 1:27:56 UTC

There was mention of the new large WUs taking over 4 hours to finish a model on some machines. The application switch time needs to be set for more than the time it takes to finish a model/decoy on your machine. Or the keep-in-memory flag needs to be turned on.


ID: 13869 · Rating: 0
nairb

Joined: 8 Dec 05
Posts: 15
Credit: 897,741
RAC: 68
Message 13871 - Posted: 16 Apr 2006, 1:35:30 UTC

I have one dedicated machine for Rosetta. I have seen it take nearly 9 hrs to complete one of the largescale WUs. Mostly they take about 4 hrs, but some take more or less. It's a bit of a pain if you run more than one project on a machine and don't change the switch-project setting to be longer than the WU completion time.
ID: 13871 · Rating: 0
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13873 - Posted: 16 Apr 2006, 1:58:37 UTC

Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid
Proudly crunching with TeAm Anandtech
ID: 13873 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13877 - Posted: 16 Apr 2006, 2:43:21 UTC - in response to Message 13873.  

Thanks for the quick replies!

So do I need to have my switch time > time for the entire WU to complete?
Isn't there any checkpointing?

-Sid


ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones this could mean 5 or 6 hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting.

The BEST answer if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein.

Also keep in mind that "keep in memory" only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory.
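
To make the trade-off concrete, here is a rough Python sketch, not Rosetta or BOINC code. It assumes progress is saved only when a model completes, that each time slice has a fixed length, and that unloading the application discards any unfinished model. The function name and parameters are made up for illustration.

```python
def models_completed(model_hours, slice_hours, num_slices, leave_in_memory):
    """Count finished models when the only checkpoint is at model completion.

    Assumption: when the application is unloaded at the end of a slice
    (leave_in_memory is False), all work on the unfinished model is lost.
    """
    completed = 0
    progress = 0.0  # hours already spent on the current model
    for _ in range(num_slices):
        progress += slice_hours
        # Finish as many models as the accumulated time allows.
        while progress >= model_hours:
            completed += 1
            progress -= model_hours
        if not leave_in_memory:
            progress = 0.0  # restart the current model from scratch
    return completed


# A 4-hour model with 60-minute slices over one day of crunching:
print(models_completed(4, 1, 24, leave_in_memory=False))
print(models_completed(4, 1, 24, leave_in_memory=True))
```

Under these assumptions the first call never finishes a model, while the second finishes one every four slices, which is why keeping the application in memory or lengthening the switch time helps.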

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13877 · Rating: 0
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13878 - Posted: 16 Apr 2006, 2:48:19 UTC

Rosetta doesn't checkpoint until after it's done with a decoy/model. And if it takes up to 9 hours to complete a single decoy/model, then you won't get checkpointed until that's done. Perhaps the keep in memory flag is the way to go, in case we get more of these really large WUs in the future.
(Nairb - what are the math specs on that system that took 9 hours?)


ID: 13878 · Rating: 0
DigiK-oz

Joined: 8 Nov 05
Posts: 13
Credit: 333,730
RAC: 0
Message 13882 - Posted: 16 Apr 2006, 8:17:59 UTC

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!


ID: 13882 · Rating: 0
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13887 - Posted: 16 Apr 2006, 12:05:29 UTC - in response to Message 13882.  

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?

Proudly crunching with TeAm Anandtech
ID: 13887 · Rating: 0
nairb

Joined: 8 Dec 05
Posts: 15
Credit: 897,741
RAC: 68
Message 13891 - Posted: 16 Apr 2006, 12:50:37 UTC - in response to Message 13878.  
Last modified: 16 Apr 2006, 12:51:14 UTC

(Nairb - what are the math specs on that system that took 9 Hours?)


The WU was this one:
https://boinc.bakerlab.org/rosetta/result.php?resultid=16830240

The machine is a dual-CPU 1 GHz Coppermine with 700+ RAM. It spent 9.3 hrs at 1.6% or so. It very nearly got the abort option... except I forgot about it and it worked OK.

Not the fastest bit of kit, but it's very stable (Win NT4).
ID: 13891 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13894 - Posted: 16 Apr 2006, 15:00:48 UTC - in response to Message 13887.  

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?


The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely.
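
For readers wondering what "checkpoint more often" looks like in general, here is a minimal Python sketch of periodic checkpointing inside a long computation. It is not how Rosetta is implemented: it assumes the whole state fits in a small JSON file and that resuming from a saved step does not change the result, which is exactly what is hard for these models. All names (`save_checkpoint`, `load_checkpoint`, `run_model`, `stop_after`) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)  # atomic rename: never a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def run_model(path, total_steps, checkpoint_every, stop_after=None):
    """Run or resume one model; stop_after simulates being unloaded mid-run."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one unit of real work
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state["step"]  # interrupted before finishing
    save_checkpoint(path, state)
    return state["step"]
```

With a checkpoint every 10 steps, an interruption at step 57 loses only the 7 steps since step 50 instead of the whole model.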

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13894 · Rating: 0
DigiK-oz

Joined: 8 Nov 05
Posts: 13
Credit: 333,730
RAC: 0
Message 13895 - Posted: 16 Apr 2006, 15:15:30 UTC - in response to Message 13894.  

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?


The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely.


Well, NOT having checkpoints, memory images, or whatever will adversely affect the entire project, as the small home-crunchers are getting fed up with not getting any results in because the same WU restarts from scratch over and over again. Maybe there is a way to hand out these large WUs only to computers having a high RAC? Or only to people who have their run-time preferences set to 8 hours or something similar?
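
Something like the following Python sketch is what I have in mind. It is only an illustration, not the BOINC scheduler: the `Host` fields and the RAC threshold of 100 are invented for the example, and only the 8-hour run-time figure comes from the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    rac: float                 # recent average credit of the computer
    runtime_pref_hours: float  # user's run-time preference in hours


def eligible_for_large_wu(host, min_rac=100.0, min_runtime_hours=8.0):
    """Send large WUs only to hosts with a high RAC or a long run-time preference.

    min_rac is a made-up threshold; a real project would have to tune it.
    """
    return host.rac >= min_rac or host.runtime_pref_hours >= min_runtime_hours


# A part-time home cruncher versus a dedicated machine:
print(eligible_for_large_wu(Host(rac=12.0, runtime_pref_hours=2.0)))
print(eligible_for_large_wu(Host(rac=68.0, runtime_pref_hours=8.0)))
```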
ID: 13895 · Rating: 0
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13896 - Posted: 16 Apr 2006, 15:17:12 UTC - in response to Message 13894.  

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?


The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely.


I'm in a fortunate position in that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot ensure that a computer will not be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365.

I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer.

A thread asking for suggestions to increase project participation was posted some time ago.

I believe the single, most important answer is... your project must be hands-off with utter transparency to the work the PC was procured to do in the first place.

I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project.... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available.

Respectfully,
-Sid


Proudly crunching with TeAm Anandtech
ID: 13896 · Rating: 0
Bin Qian

Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 13897 - Posted: 16 Apr 2006, 16:25:30 UTC - in response to Message 13896.  

I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!

I'm in a fortunate position in that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot ensure that a computer will not be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365.

I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer.

A thread asking for suggestions to increase project participation was posted some time ago.

I believe the single, most important answer is... your project must be hands-off with utter transparency to the work the PC was procured to do in the first place.

I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project.... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available.

Respectfully,
-Sid



ID: 13897 · Rating: 1
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13902 - Posted: 16 Apr 2006, 17:42:08 UTC - in response to Message 13897.  

I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!



Thanks Bin!

Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor!

-Sid

Proudly crunching with TeAm Anandtech
ID: 13902 · Rating: 0
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 13904 - Posted: 16 Apr 2006, 20:27:20 UTC - in response to Message 13902.  

I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!



Thanks Bin!

Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor!

-Sid

I get the impression from your response that you have misinterpreted the information I tried to provide, to imply that I had in some way said the project does not think this is important. Nothing could be further from the truth. The original question was why did the Work Unit run so long without a checkpoint (see thread title). I never said this was an issue the project was ignoring, or not trying to fix. I just tried to explain WHY it was working the way it was.
Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 13904 · Rating: 0
Insidious
Joined: 10 Nov 05
Posts: 49
Credit: 604,937
RAC: 0
Message 13906 - Posted: 16 Apr 2006, 20:35:39 UTC - in response to Message 13904.  

I totally agree with you Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the boinc server. If your computer still has the largescale jobs running or queued please abort them - the new short jobs are waiting to be sent!

In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together.

I should add that for users who crunch Rosetta 24/7 or have "leave in memory" on, you can choose to let the largescale jobs currently in your computers keep running. These results are still of great interest to us!



Thanks Bin!

Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way, is plenty to keep me supporting such an interesting and valuable research endeavor!

-Sid

I get the impression from your response that you have misinterpreted the information I tried to provide, to imply that I had in some way said the project does not think this is important. Nothing could be further from the truth. The original question was why did the Work Unit run so long without a checkpoint (see thread title). I never said this was an issue the project was ignoring, or not trying to fix. I just tried to explain WHY it was working the way it was.


I didn't think you were dismissing the concerns. Actually, I saw your post as an attempt to help shed light. It was appreciated as was Bin's.

Please believe me when I tell you, I am much more the fan than the critic.

-Sid

Proudly crunching with TeAm Anandtech
ID: 13906 · Rating: 0
Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 13912 - Posted: 16 Apr 2006, 21:31:40 UTC - in response to Message 13894.  
Last modified: 16 Apr 2006, 21:38:02 UTC

This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!



It does sound like a rather stringent requirement.

Is this going to be the norm. from here on out?


The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely.


You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. A big-WUs-only project.

Bill
ID: 13912 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 13914 - Posted: 16 Apr 2006, 21:38:02 UTC - in response to Message 13912.  

... maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. Big WU's only.

Bill


RosettaExtreme?!

Hmmmm.... sounds interesting. :-D


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 13914 · Rating: 0

©2021 University of Washington
https://www.bakerlab.org