Report long-running models here

Message boards : Number crunching : Report long-running models here

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58393 - Posted: 2 Jan 2009, 23:09:49 UTC - in response to Message 58391.  

mod,

How do you explain the weirdness of this user's tasks, which were running at 6 hrs with still a long way to go to completion? Then, upon reboot of the computer, nearly the same amount of work is shown completed for less time used.

kind of some odd stuff going on with his tasks.


It all relates to the time to completion estimate. As was stated earlier, when the machine rebooted, the task reverted to its last checkpoint. And at that point, the % complete is going to be based on the runtime preference. In short, the task is about to proceed down the same path: running for too long, showing about 10 minutes to go the whole time.

There's no exceptional weirdness described there. It is simply how the symptoms appear when you have a long-running model.
Rosetta Moderator: Mod.Sense
ID: 58393 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,544,194
RAC: 1,215
Message 58395 - Posted: 3 Jan 2009, 1:11:47 UTC

Here's another mammoth-named task that ran 11 hrs and 36 mins according to BOINC Manager

https://boinc.bakerlab.org/rosetta/result.php?resultid=218122036
CPU time 41793.55
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 41792.8 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

asking 279 and waiting for validation at this point in time.
ID: 58395 · Rating: 0
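To put the stderr figures above side by side, here is a minimal Python sketch that pulls the runtime preference and the actual CPU time out of the quoted text and reports the overrun. The excerpt is copied from the post; the `summarize` helper is a made-up name for illustration, not part of BOINC or Rosetta.

```python
import re

# Excerpt of the stderr output quoted in the post above.
stderr_txt = """\
# cpu_run_time_pref: 14400
DONE :: 1 starting structures 41792.8 cpu seconds
This process generated 1 decoys from 1 attempts
"""

def summarize(text):
    """Pull the runtime preference, CPU seconds used, and decoy count from the text."""
    pref = int(re.search(r"cpu_run_time_pref:\s*(\d+)", text).group(1))
    used = float(re.search(r"([\d.]+) cpu seconds", text).group(1))
    decoys = int(re.search(r"generated (\d+) decoys", text).group(1))
    return pref, used, decoys

pref, used, decoys = summarize(stderr_txt)
hours, rem = divmod(int(used), 3600)
print(f"preference: {pref / 3600:.1f} h, used: {hours} h {rem // 60} min")
print(f"overrun: {used / pref:.2f}x the preference for {decoys} decoy(s)")
```

On these figures the task used a little under three times its 4-hour preference to produce a single decoy.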
robertmiles
Joined: 16 Jun 08
Posts: 1193
Credit: 13,208,362
RAC: 839
Message 58404 - Posted: 3 Jan 2009, 4:52:03 UTC - in response to Message 58384.  

If a fast CPU runs flat out for 28 hours and generates one decoy------there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it.
I am new to the project and distributed computing in general but increasing my hydro bill by significant amounts there should be closer attention to the way these credits are awarded.


I've seen something somewhere about someone measuring the electricity used and calculating its added cost. The results indicated that few of the machines used will need even 50 cents (US) worth of extra electricity a day, even if the machine is running 24 hours a day.
ID: 58404 · Rating: 0
Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58406 - Posted: 3 Jan 2009, 5:31:53 UTC - in response to Message 58404.  
Last modified: 3 Jan 2009, 5:38:45 UTC

If a fast CPU runs flat out for 28 hours and generates one decoy------there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it.
I am new to the project and distributed computing in general but increasing my hydro bill by significant amounts there should be closer attention to the way these credits are awarded.


I've seen something somewhere about someone measuring the electricity used and calculating its added cost. The results indicated that few of the machines used will need even 50 cents (US) worth of extra electricity a day, even if the machine is running 24 hours a day.

According to tests done by Tom's Hardware my Phenom system running 24 hours a day loaded will use 285 Euros a year in hydro. That works out to roughly 600 in Canadian dollars. And that is just one machine. Your AMD 3600 is using 198.69 euros a year. Here's the link for most AMD processors.
http://www.tomshardware.com/reviews/amd-power-cpu,1925-16.html
ID: 58406 · Rating: 0
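For comparison with the per-day figure quoted earlier, the yearly costs in the post above can be spread over 365 days. This is only a sketch of the arithmetic on the two numbers given in the post; it assumes the machine runs every day of the year and does no currency conversion.

```python
# Yearly full-load cost figures as quoted in the post above (euros per year).
yearly_cost_eur = {"Phenom system": 285.00, "AMD 3600": 198.69}

def per_day(yearly):
    """Spread a yearly cost evenly over 365 days."""
    return yearly / 365

daily = {name: per_day(cost) for name, cost in yearly_cost_eur.items()}
for name, cost in daily.items():
    print(f"{name}: about {cost:.2f} EUR per day")
```

These are costs for the whole machine under load, which is a different quantity from the extra cost of crunching on a machine that would be switched on anyway.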
Paul
Joined: 29 Oct 05
Posts: 193
Credit: 64,405,714
RAC: 822
Message 58417 - Posted: 3 Jan 2009, 13:10:20 UTC - in response to Message 58406.  

Keep in mind that almost 100% of the power used by the computer is converted to heat. This means that in the winter, your computer crunches work units and acts as a mini-heater. The additional cost of electricity should be a wash, as you should spend less to heat the same room.

The summer is a different story. If you have air conditioning (cooling), the computers require lots of cooling. If you don't have air conditioning, the heat may push you out of your house.

With 11 computers, my electric bill went down in the winter. I think these computers will move to the garage this summer. It is just too expensive to keep them cool. I am also looking at some 45nm processors that might help.
Thx!

Paul

ID: 58417 · Rating: 0
Jerry Murphy
Joined: 29 May 06
Posts: 5
Credit: 4,522,529
RAC: 1,221
Message 58418 - Posted: 3 Jan 2009, 15:24:11 UTC

One of my data sets processed normally until it reached about 98%. Then it became slower and slower. Last night it had reached 99.66% after more than 44 hours of processing time. I had suspended processing on all other files (Rosetta and SETI) in hopes of getting the file processed. This morning it is gone. Processing seems to have returned to normal. If this was an unusually large file I would like to know how to avoid them in the future. Any comments?

Thanks and Happy New Year.
Jerry
ID: 58418 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58431 - Posted: 3 Jan 2009, 20:07:10 UTC

Also keep in mind that one of you is looking at all power required to run the machine, and the other is looking at the power of an idle machine as compared to the power of a machine that is actively crunching.
Rosetta Moderator: Mod.Sense
ID: 58431 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58432 - Posted: 3 Jan 2009, 20:20:42 UTC
Last modified: 3 Jan 2009, 20:21:55 UTC

I moved Jerry's post here from the Cafe. Jerry, see the original post of this thread. That was abnormally long. There is nothing you can do on your end other than kill them if they go several hours without reaching the next model.

Jerry, you must be crunching under more than one user ID?? I only see one host attached under the ID you used for your post, and it doesn't have a long task as you describe. Can you provide a link to the task you are referring to?
Rosetta Moderator: Mod.Sense
ID: 58432 · Rating: 0
Jerry Murphy
Joined: 29 May 06
Posts: 5
Credit: 4,522,529
RAC: 1,221
Message 58434 - Posted: 3 Jan 2009, 21:04:32 UTC - in response to Message 58432.  

I moved Jerry's post here from the Cafe. Jerry, see the original post of this thread. That was abnormally long. There is nothing you can do on your end other than kill them if they go several hours without reaching the next model.

Jerry, you must be crunching under more than one user ID?? I only see one host attached under the ID you used for your post, and it doesn't have a long task as you describe. Can you provide a link to the task you are referring to?


I don't know if I have two accounts, but I have the e-mail I received on 5/29/2006 when I signed up. I can send you my Account Key if that will help. I can try to send the link, but I don't know how. BTW-All seems well now.

Jerry
ID: 58434 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58436 - Posted: 3 Jan 2009, 22:00:43 UTC

Jerry, all I was saying is that if you click on your name there to the left of the message, and click on your hosts, there is only one host. And that host only had one task when I looked, and it ran in about 3 hours. So, it was not the one you were talking about. Was the very long-running one perhaps already past its 10-day deadline by the time you were able to put 44 hours into it?
Rosetta Moderator: Mod.Sense
ID: 58436 · Rating: 0
Wissi
Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 58440 - Posted: 4 Jan 2009, 0:01:45 UTC - in response to Message 58364.  

And again:

https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451
Workunit 198419747

Name is 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_156441_0 for Rosetta Mini 1.47.

Okay, here comes again, what I saw: yesterday, when shutting down my computer, workunit 198419747 had a processor time of 6 hrs 17 min and 40 seconds. Boinc told me, that 97.418% of the work was done. 9 mins 40 seconds was the time estimated to the end of the job.

...

Now, a few minutes ago, I restarted my computer, and so the boinc manager restarted the computation. Now the values are as follows:

Workunit 198419747: 4 hrs 8 mins 36 seconds of processor time, 96.131% work done, and 15 minutes and 1 seconds as an estimation until the end.


This task is really taking a lot of time. After restarting this task again after a shutdown, it stated a processor time of 6 hrs 30 min and 15 seconds (after I stopped it the day before with a total time of 9 hrs 20 min and 37 seconds). I started my computer with the BOINC manager at about 15:40 on January 3rd, and the task was restarted at 15:42:20. It then got suspended due to a computing task for LHC@home for about 1 hour, and it restarted at 17:45:14, got suspended again for about 1 hour with restart at 19:38:41. From then on until now, the task was computing and computing. It now has a total processor time of 10 hours 38 minutes 50 seconds with 98.458% of completion and (guess what) with an estimated time of 15 minutes to go. Well, maybe the units for "remaining time" are wrong: minutes should be hours, and seconds should be minutes from what I have seen. Now (00:54 of January 4th) I will shut down my computer because I will go to sleep. I will see what happens tomorrow :-(

When doing a rough addition, this job has used already more than 20 hours of processor time (assuming that at least 6 hours were "gone" due to the restarts and going back to the last checkpoint). But I think, one of the major problems with those long running tasks is the selection of checkpoints. IMHO checkpoints should be chosen in a way, that no more than 15 minutes of processor time are lost (didn't I read somewhere this value??) and this definitely does NOT work with this job.

I will keep my computer running tomorrow and see, if this awesome job finally ends (and I would bet a crate of beer, that it will finish with a "client error" or a calculation error... thanks for the credits!).

Ah, by the way: when trying to open the graphic window for this job, no graphic at all will be shown. The "minirosetta_graphics_1.40_windows_intelx86.exe" does not react, it has to be stopped by the task manager. Shouldn't this application have also the version number 1.47??
ID: 58440 · Rating: 0
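The rough addition in the post above can be itemized from the CPU times it reports. The sketch below counts only the two rollbacks for which both a before and an after figure are given; `secs` and `hms` are small helpers written for this example.

```python
def secs(h, m, s):
    """Convert hours, minutes, seconds to seconds."""
    return h * 3600 + m * 60 + s

def hms(total):
    """Format a number of seconds as hours, minutes, seconds."""
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m:02d}m {s:02d}s"

# (CPU time shown before shutdown, CPU time shown after restart), from the post.
restarts = [
    (secs(6, 17, 40), secs(4, 8, 36)),
    (secs(9, 20, 37), secs(6, 30, 15)),
]
latest = secs(10, 38, 50)  # CPU time shown at the time of the post

lost = sum(before - after for before, after in restarts)
print("lost to rollbacks:", hms(lost))
print("total CPU time spent:", hms(latest + lost))
```

On those figures alone the two rollbacks cost about 5 hours, for roughly 15.6 hours of CPU time in total. The poster's own estimate is higher, so there may have been further rollbacks that the post does not itemize.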
Sid Celery
Joined: 11 Feb 08
Posts: 1852
Credit: 34,136,262
RAC: 10,001
Message 58478 - Posted: 4 Jan 2009, 17:49:35 UTC - in response to Message 58440.  

Workunit 198419747


Now (00:54 of January 4th) I will shutdown my computer because I will go to sleep. I will see what happens tomorrow :-(

It's obvious what will most likely happen because you've already got the experience. It will start from its previous checkpoint, as every WU does, which will lose the majority of your processing time. Why isn't this obvious to you?

But I think, one of the major problems with those long running tasks is the selection of checkpoints. IMHO checkpoints should be chosen in a way, that no more than 15 minutes of processor time are lost (didn't I read somewhere this value??) and this definitely does NOT work with this job.

Yes, this is a legitimate complaint IMO and a challenge to the coders. 15 minutes or half an hour should be maximum between checkpoints so that no-one loses out and the project itself gets the greater benefit of the processing effort put in. Everyone's a winner.

Maybe it wasn't such an issue in the past when WUs were less complicated, but as they become more complex then the checkpointing issue needs to be revisited.

I haven't been around to comment earlier, but the claim that "something smells" is fatuous. What is supposed to smell about it? What could anyone possibly gain from this situation? Credits are just numbers - you could get one or a million and it wouldn't make the slightest difference to anything. If the WU doesn't complete then no-one's winning. It's very easy to throw these comments around, but to me it reflects worse on the person who says it.

Also, bear in mind that no WU shows a time to completion of less than 10 minutes (maybe 9m 48s), even if it were to complete precisely when it should. I've never seen a lower figure ever. When a WU gets near the end it just shows it'll finish "any minute now". That's all it's saying. If it's a long-running model, then that ends up looking dumb, but it's only ever a guess, not ever a guarantee.

This thread is supposed to highlight the issue of long-running models publicly in order to pin it down and stop it happening. Let's just keep our mind on that please.
ID: 58478 · Rating: 0
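One way to picture the behaviour described above is a toy model in which the displayed time remaining is never allowed to fall below a floor of about 10 minutes. This is purely an illustration and an assumption, not the actual BOINC or Rosetta estimator; `FLOOR_SECONDS`, `displayed_estimate`, and the 4-hour target are hypothetical.

```python
FLOOR_SECONDS = 600  # assumed floor of about 10 minutes, per the posts above

def displayed_estimate(elapsed, target):
    """Toy model: remaining time is target minus elapsed, but never below the floor."""
    remaining = max(target - elapsed, FLOOR_SECONDS)
    percent = 100.0 * elapsed / (elapsed + remaining)
    return remaining, percent

target = 4 * 3600  # hypothetical 4-hour runtime preference
for hours in (1, 4, 12, 28):
    remaining, percent = displayed_estimate(hours * 3600, target)
    print(f"{hours:>2} h in: {remaining / 60:.0f} min to go, {percent:.3f}% done")
```

In such a model, once a task overruns its target the time remaining stays pinned at the floor and the percentage creeps toward 100 without reaching it, which resembles the symptoms reported in this thread.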
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58494 - Posted: 4 Jan 2009, 18:46:30 UTC

checkpoints should be chosen in a way, that no more than 15 minutes of processor time are lost


No one will disagree with that. It is just one of those things that is very difficult to achieve. You have to balance between the time lost due to the tasks restarting or the machine rebooting, and the time lost doing all of the overhead required to take so many checkpoints. With Rosetta it is particularly difficult to take checkpoints in the middle of a model. But, over time, more checkpoints are added to more types of work.
Rosetta Moderator: Mod.Sense
ID: 58494 · Rating: 0
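The balance described above can be sketched with a standard first-order model, commonly known as Young's approximation: the fraction of time wasted is the cost of writing checkpoints plus the expected rework after a restart. This is a generic textbook model, not how Rosetta chooses its checkpoints, and the 30-second checkpoint cost and 8-hour restart spacing are made-up numbers.

```python
import math

def overhead_fraction(interval, checkpoint_cost, mean_time_between_restarts):
    """First-order wasted fraction: checkpoint writes plus expected rework."""
    writing = checkpoint_cost / interval
    rework = interval / (2 * mean_time_between_restarts)
    return writing + rework

def best_interval(checkpoint_cost, mean_time_between_restarts):
    """Interval that minimizes overhead_fraction (Young's approximation)."""
    return math.sqrt(2 * checkpoint_cost * mean_time_between_restarts)

# Hypothetical numbers: 30 s per checkpoint, one restart every 8 hours on average.
cost, mtbr = 30.0, 8 * 3600.0
t = best_interval(cost, mtbr)
print(f"best interval: about {t / 60:.0f} min, "
      f"overhead {overhead_fraction(t, cost, mtbr):.1%}")
```

Checkpointing much more often than this spends more time on overhead than it saves, and checkpointing much less often loses more work per restart, which is the trade-off the post describes.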
Jerry Murphy
Joined: 29 May 06
Posts: 5
Credit: 4,522,529
RAC: 1,221
Message 58503 - Posted: 4 Jan 2009, 21:09:04 UTC - in response to Message 58436.  

Jerry, all I was saying is that if you click on your name there to the left of the message, and click on your hosts, there is only one host. And that host only had one task when I looked, and it ran in about 3 hours. So, it was not the one you were talking about. Was the very long-running one perhaps already past its 10-day deadline by the time you were able to put 44 hours into it?


Yes, the deadline for that one was 12/14/08.
ID: 58503 · Rating: 0
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,368,361
RAC: 0
Message 58508 - Posted: 4 Jan 2009, 23:24:22 UTC

long-running models:
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_134659_0 took more than 3x my preferred time (which is 12 hours)
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_188618_1 is still running. I am the second one to try this workunit, the first time there was an error because there were too many restarts.
Yesterday I saw that the CPU time was over 13 hours; when I tried to look at the graphics it crashed. Today (after crunching for some other projects) it restarted at 6 hours. This time the graphics worked fine, but it took 20 minutes to go from 'model 1 step 203980' to 'model 1 step 203991'.
So, what to do? How many steps are there in a model? Should I let it run because it is almost finished, or abort it because there is no way I can finish this model?

AdeB
ID: 58508 · Rating: 0
Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58514 - Posted: 5 Jan 2009, 8:34:27 UTC

I have the WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0, which has been running for over 62 hours now; most of that time it has just been crawling towards the 100% mark from the 99.730% mark....

Going to hit deadline soon :(
ID: 58514 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5528
Credit: 5,544,194
RAC: 1,215
Message 58515 - Posted: 5 Jan 2009, 9:05:51 UTC - in response to Message 58514.  

I have the WU 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_51691_0, which has been running for over 62 hours now; most of that time it has just been crawling towards the 100% mark from the 99.730% mark....

Going to hit deadline soon :(



you going for the world record of crunching a single task? lol
it looks like it will never finish.
ID: 58515 · Rating: 0
Idan Shifres
Joined: 12 Dec 08
Posts: 11
Credit: 126,517
RAC: 0
Message 58519 - Posted: 5 Jan 2009, 12:32:53 UTC

I know this WU doesn't look promising, but I want it to finish by itself or quit for passing the deadline...

I see no one else returned a result for this WU, I hope it will get taken care of, so it won't repeat again...

Good day! :D
ID: 58519 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1193
Credit: 13,208,362
RAC: 839
Message 58521 - Posted: 5 Jan 2009, 13:15:37 UTC - in response to Message 58519.  

I know this WU doesn't look promising, but I want it to finish by itself or quit for passing the deadline...

I see no one else returned a result for this WU, I hope it will get taken care of, so it won't repeat again...

Good day! :D


You might as well. Rosetta@home has finally got its workunit generators working again, but they haven't caught up with the demand for more workunits yet. I believe that Rosetta@home is one of the BOINC projects that will even let you return a workunit after the deadline and get credit for it, as long as not enough other people have already returned it to meet the quorum.

ID: 58521 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1193
Credit: 13,208,362
RAC: 839
Message 58522 - Posted: 5 Jan 2009, 13:27:20 UTC - in response to Message 58508.  

long-running models:
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_134659_0 took more than 3x my preferred time (which is 12 hours)
1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_188618_1 is still running. I am the second one to try this workunit, the first time there was an error because there were too many restarts.
Yesterday I saw that the CPU time was over 13 hours; when I tried to look at the graphics it crashed. Today (after crunching for some other projects) it restarted at 6 hours. This time the graphics worked fine, but it took 20 minutes to go from 'model 1 step 203980' to 'model 1 step 203991'.
So, what to do? How many steps are there in a model? Should I let it run because it is almost finished, or abort it because there is no way I can finish this model?

AdeB


Rosetta@home is having problems keeping up with the demand for more workunits, so for now I'd suggest that you let it keep running. Note that once such a workunit gets within about 10 minutes of the expected runtime while the actual time required is longer, the reported time remaining almost stops changing until the task finally completes.
ID: 58522 · Rating: 0



©2022 University of Washington
https://www.bakerlab.org