Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Here's another mammoth named task that ran 11 hrs and 36 mins according to BOINC Manager https://boinc.bakerlab.org/rosetta/result.php?resultid=218122036 CPU time 41793.55 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 14400 ====================================================== DONE :: 1 starting structures 41792.8 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== asking 279 and waiting for validation at this point in time. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
If a fast CPU runs flat out for 28 hours and generates one decoy------there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it. I've seen something somewhere about someone measuring the electricity used and calculating its added cost. The results indicated that few of the machines used will need even 50 cents (US) worth of extra electricity a day, even if the machine is running 24 hours a day. |
Rifleman Send message Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
If a fast CPU runs flat out for 28 hours and generates one decoy------there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it. According to tests done by Tom's Hardware my Phenom system running 24 hours a day loaded will use 285 Euros a year in hydro. That works out to roughly 600 in Canadian dollars. And that is just one machine. Your AMD 3600 is using 198.69 euros a year. Here's the link for most AMD processors. http://www.tomshardware.com/reviews/amd-power-cpu,1925-16.html |
Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,607,712 RAC: 6,512 |
Keep in mind that almost 100% of the power used by the computer is converted to heat. This means that in the winter, your computer crunches work units and acts as a mini-heater. The additional cost of electricity should be a wash, as you should spend less to heat the same room. The summer is a different story. If you have air conditioning (cooling), the computers require lots of cooling. If you don't have air conditioning, the heat may push you out of your house. With 11 computers, my electric bill went down in the winter. I think these computers will move to the garage this summer. It is just too expensive to keep them cool. I am also looking at some 45nm processors that might help. Thx! Paul |
Jerry Murphy Send message Joined: 29 May 06 Posts: 5 Credit: 5,118,090 RAC: 0 |
One of my data sets processed normally until it reached about 98%. Then it became slower and slower. Last night it had reached 99.66% after more than 44 hours of processing time. I had suspended processing on all other files (Rosetta and SETI) in hopes of getting the file processed. This morning it is gone. Processing seems to have returned to normal. If this was an unusually large file I would like to know how to avoid them in the future. Any comments? Thanks and Happy New Year. Jerry |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Also keep in mind that one of you is looking at all power required to run the machine, and the other is looking at the power of an idle machine as compared to the power of a machine that is actively crunching. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I moved Jerry's post here from the Cafe. Jerry, see the original post of this thread. That was abnormally long. Nothing you can do on your end other than kill them if they go several hours without reaching the next model. Jerry, you must be crunching under more than one user ID?? I only see one host attached under the ID you used for your post, and it doesn't have a long task as you describe. Can you provide a link to the task you are referring to? Rosetta Moderator: Mod.Sense |
Jerry Murphy Send message Joined: 29 May 06 Posts: 5 Credit: 5,118,090 RAC: 0 |
I moved Jerry's post here from the Cafe. Jerry, see the original post of this thread. That was abnormally long. Nothing you can do on your end other than kill them if they go several hours without reaching the next model. I don't know if I have two accounts, but I have the e-mail I received on 5/29/2006 when I signed up. I can send you my Account Key if that will help. I can try to send the link, but I don't know how. BTW, all seems well now. Jerry |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Jerry, all I was saying is that if you click on your name there to the left of the message, and click on your hosts, there is only one host. And that host only had one task when I looked, and it ran in about 3 hours. So, it was not the one you were talking about. Was the very long running one perhaps already past its 10-day deadline by the time you were able to put 44 hours into it? Rosetta Moderator: Mod.Sense |
Wissi Send message Joined: 19 Nov 08 Posts: 14 Credit: 485,807 RAC: 0 |
And again: https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451 This task is really taking a lot of time. After restarting this task again after a shutdown, it stated a processor time of 6 hrs 30 min and 15 seconds (after I stopped it the day before with a total time of 9 hrs 20 min and 37 seconds). I started my computer with the BOINC manager at about 15:40 on January 3rd, and the task was restarted at 15:42:20. It then got suspended due to a computing task for LHC@home for about 1 hour, and it restarted at 17:45:14, got suspended again for about 1 hour with restart at 19:38:41. From then on until now, the task was computing and computing. It now has a total processor time of 10 hours 38 minutes 50 seconds with 98.458% of completion and (guess what) with an estimated time of 15 minutes to go. Well, maybe the units for "remaining time" are wrong: minutes should be hours, and seconds should be minutes from what I have seen. Now (00:54 of January 4th) I will shut down my computer because I will go to sleep. I will see what happens tomorrow :-( When doing a rough addition, this job has already used more than 20 hours of processor time (assuming that at least 6 hours were "gone" due to the restarts and going back to the last checkpoint). But I think one of the major problems with those long running tasks is the selection of checkpoints. IMHO checkpoints should be chosen in a way that no more than 15 minutes of processor time are lost (didn't I read somewhere this value??) and this definitely does NOT work with this job. I will keep my computer running tomorrow and see if this awesome job finally ends (and I would bet a crate of beer that it will finish with a "client error" or a calculation error... thanks for the credits!). Ah, by the way: when trying to open the graphic window for this job, no graphic at all will be shown. The "minirosetta_graphics_1.40_windows_intelx86.exe" does not react; it has to be stopped by the task manager. Shouldn't this application also have the version number 1.47?? |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2145 Credit: 41,555,266 RAC: 8,961 |
Workunit 198419747 It's obvious what will most likely happen because you've already got the experience. It will start from its previous checkpoint, as every WU does, which will lose the majority of your processing time. Why isn't this obvious to you? But I think, one of the major problems with those long running tasks is the selection of checkpoints. IMHO checkpoints should be chosen in a way, that no more than 15 minutes of processor time are lost (didn't I read somewhere this value??) and this definitely does NOT work with this job. Yes, this is a legitimate complaint IMO and a challenge to the coders. 15 minutes or half an hour should be maximum between checkpoints so that no-one loses out and the project itself gets the greater benefit of the processing effort put in. Everyone's a winner. Maybe it wasn't such an issue in the past when WUs were less complicated, but as they become more complex then the checkpointing issue needs to be revisited. I haven't been around to comment earlier, but the claim that "something smells" is fatuous. What is supposed to smell about it? What could anyone possibly gain from this situation? Credits are just numbers - you could get one or a million and it wouldn't make the slightest difference to anything. If the WU doesn't complete then no-one's winning. It's very easy to throw these comments around, but to me it reflects worse on the person who says it. Also, bear in mind that no WU shows a time to completion of less than 10 minutes (maybe 9m 48s), even if it were to complete precisely when it should. I've never seen a lower figure ever. When a WU gets near the end it just shows it'll finish "any minute now". That's all it's saying. If it's a long-running model, then that ends up looking dumb, but it's only ever a guess, not ever a guarantee. This thread is supposed to highlight the issue of long-running models publicly in order to pin it down and stop it happening. Let's just keep our mind on that please. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
checkpoints should be chosen in a way, that no more than 15 minutes of processor time are lost No one will disagree with that. It is just one of those things that is very difficult to achieve. You have to balance between the time lost due to the tasks restarting or the machine rebooting, and the time lost doing all of the overhead required to take so many checkpoints. Rosetta is particularly difficult to take checkpoints in the middle of a model. But, over time, more checkpoints are added to more types of work. Rosetta Moderator: Mod.Sense |
Jerry Murphy Send message Joined: 29 May 06 Posts: 5 Credit: 5,118,090 RAC: 0 |
Jerry, all I was saying is that if you click on your name there to the left of the message, and click on your hosts, there is only one host. And that host only had one task when I looked, and it ran in about 3 hours. So, it was not the one you were talking about. Was the very long running one perhaps already past its 10-day deadline by the time you were able to put 44 hours into it? Yes, the deadline for that one was 12/14/08. |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
long-running models: 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_134659_0 took more than 3x my preferred time (which is 12 hours). 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_188618_1 is still running. I am the second one to try this workunit; the first time there was an error because there were too many restarts. Yesterday I saw that the CPU time was over 13 hours, and when I tried to look at the graphics it crashed. Today (after crunching for some other projects) it restarted at 6 hours. This time the graphics worked fine, but it took 20 minutes to go from 'model 1 step 203980' to 'model 1 step 203991'. So, what to do? How many steps are there in a model? Should I let it run because it is almost finished, or abort it because there is no way I can finish this model? AdeB |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
I have the WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 running for over 62 hours now; most of that time it has just been crawling towards the 100% mark from the 99.730 mark.... Going to hit deadline soon :( |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Have the wu: 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0 running over 62 hours now, most of that time it just crawling towards the 100% mark from 99.730 mark.... you going for the world record of crunching a single task? lol it looks like it will never finish. |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
I know this WU doesn't look promising, but I want it to finish by itself or quit for passing the deadline... I see no one else returned a result for this WU, I hope it will get taken care of, so it won't repeat again... Good day! :D |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
I know this WU doesn't look promising, but I want it to finish by itself or quit for passing the deadline... You might as well. Rosetta@home has finally got its workunit generators working again, but they haven't caught up with the demand for more workunits yet. I believe that Rosetta@home is one of the BOINC projects that will even let you return a workunit after the deadline and get credit for it, as long as not enough other people have already returned it to meet the quorum. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
long-running models: Rosetta@home is having problems keeping up with the demand for more workunits, so for now I'd suggest that you let it keep running. Note that after such workunits get within about 10 minutes of the expected runtime, but the actual time required is even longer, the reports of how much longer it will take almost stop changing until it finally completes. |
Idan Shifres Send message Joined: 12 Dec 08 Posts: 11 Credit: 126,517 RAC: 0 |
I know this WU doesn't look promising, but I want it to finish by itself or quit for passing the deadline... I will let it run, as you said in another post, it came to the last 10 mins of "estimated" time and just got stuck there, for 63 hours and 30 mins so far... hopefully this WU will finally finish and even better it would be if I'll get credit for it... :) I'll keep updating... :) |
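
For anyone who wants to put rough numbers on the checkpoint trade-off Mod.Sense describes above (time lost to restarts versus time spent writing checkpoints), here is a minimal Python sketch. It is not how Rosetta or BOINC actually schedules checkpoints; the function names, the assumed cost of half a minute per checkpoint, and the assumption that a restart lands on average halfway through an interval are all hypothetical and only there for illustration. The only figures taken from the thread are Wissi's two processor times (9 hrs 20 min 37 s before the shutdown, 6 hrs 30 min 15 s after the restart).

```python
import math


def hms_to_seconds(hours, minutes, seconds):
    """Convert an h/m/s reading from BOINC Manager into seconds."""
    return hours * 3600 + minutes * 60 + seconds


def expected_waste_minutes(work_min, interval_min, checkpoint_cost_min, restarts):
    """Checkpoint overhead plus expected recomputation after restarts.

    Assumption (not from the thread): each restart throws away, on
    average, half of one checkpoint interval.
    """
    overhead = (work_min / interval_min) * checkpoint_cost_min
    rework = restarts * interval_min / 2
    return overhead + rework


def best_interval_minutes(work_min, checkpoint_cost_min, restarts):
    """Interval that minimises expected_waste_minutes.

    Found by setting the derivative of W*o/c + R*c/2 with respect to
    the interval c to zero, which gives c = sqrt(2*W*o/R).
    """
    return math.sqrt(2 * work_min * checkpoint_cost_min / restarts)


# Time Wissi lost on a single restart, from the figures in the post.
lost_seconds = hms_to_seconds(9, 20, 37) - hms_to_seconds(6, 30, 15)
print(f"lost on restart: {lost_seconds // 3600} h {lost_seconds % 3600 // 60} min")

# Hypothetical task: 10 hours of work, 3 restarts, 0.5 min per checkpoint.
for interval in (5, 15, 60, 180):
    waste = expected_waste_minutes(600, interval, 0.5, 3)
    print(f"checkpoint every {interval:>3} min -> about {waste:.1f} min wasted")

print(f"best interval: {best_interval_minutes(600, 0.5, 3):.1f} min")
```

Under these made-up numbers, very frequent checkpoints waste time on overhead and very rare ones waste it on rework, which is the balance described in the thread; the real cost of a checkpoint in the middle of a Rosetta model is not stated here and may be much higher.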
©2024 University of Washington
https://www.bakerlab.org