Report long-running models here

William Timbrook

Joined: 2 Nov 05
Posts: 3
Credit: 11,623,185
RAC: 0
Message 55850 - Posted: 18 Sep 2008, 3:13:27 UTC - in response to Message 55835.  
Last modified: 18 Sep 2008, 3:17:47 UTC

William, parts of what you describe are normal and expected, and some parts are not. I've moved your posts here to this thread because you appear to have a 3hr runtime (the default) configured for that host, and so the 8hrs you report is well beyond that.

Your task was abinitio_nohomfrag_70_A_1qgvA_4466_9601, v1.34, running BOINC 6.2.18 and Windows 2000.

So it ran longer than expected.

The parts of what you describe that are normal are that any time you end BOINC or remove a task from memory (which happens if BOINC switches to running another project, suspending the R@h task, and you are not keeping suspended tasks in memory), you will lose some work. The amount lost depends on when Rosetta was last able to save a checkpoint. And some tasks are able to checkpoint more frequently than others.

So, seeing the CPU time reduced (sometimes all the way back to zero) when the task restarts, is normal.
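
As a rough illustration of the checkpoint behaviour described above, here is a minimal sketch. It is not Rosetta's or BOINC's actual code; the function name and the numbers are made up for the example. The CPU time shown after a restart is simply the time of the last checkpoint saved at or before the interruption:

```python
def cpu_time_after_restart(checkpoints_s, stopped_at_s):
    """Return the CPU time (seconds) a task resumes from.

    checkpoints_s: CPU times at which checkpoints were saved (hypothetical values).
    stopped_at_s:  CPU time when the task was removed from memory.
    """
    saved = [t for t in checkpoints_s if t <= stopped_at_s]
    # With no checkpoint saved yet, the task starts over from zero.
    return max(saved, default=0.0)


# A task that checkpoints every 30 minutes, stopped after 100 minutes:
resumed = cpu_time_after_restart([1800, 3600, 5400], 6000)
print(resumed, 6000 - resumed)  # resumes at 5400 s, so 600 s of work is lost

# A task that has not checkpointed yet, stopped after 4 hours:
print(cpu_time_after_restart([], 4 * 3600))  # 0.0
```

The second case is the painful one: a task that only checkpoints at the end of a long model loses everything done so far.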

The other thing is that the 3 hours you are probably seeing as the initial estimated time to completion is based on your runtime preference (which you can set here on the website in your Rosetta-specific preferences). More precisely, it is based on your BOINC client's history of running tasks with your runtime preference. Some tasks take longer than that. So, rather than showing a negative estimated time to completion once the original estimate is reached, the program makes the displayed time pass slower and slower once it reaches about 10 minutes remaining. So the part you describe, about 10 minutes remaining for an extended period of time, is normal as well.

The resulting confusion when tasks go longer than your preference is why I started this thread, and why the Project Team is working to address these long-running models that cause runtimes to be exceeded.




Thanks for the update.
I had another one like that which was experiencing the same thing. Seeing the 10hrs of cpu time just didn't look that comforting.
I wanted to finish the jobs but... some other hosts can pick those 2 up.
I was on 5.10.45 (?) and upgraded that host to 6.2.18, but it doesn't seem to have made a difference.
ID: 55850 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55953 - Posted: 22 Sep 2008, 17:24:37 UTC

This one took 4 hrs to complete the first model (and only checkpoint).
AA2A_6_modeling_1_AA2A_1_AA2A_2RH1_align_4492_20784_0

A similarly named WU has already run for over 2hrs and hasn't completed model 1.
AA2A_7_modeling_1_AA2A_1_AA2A_2RH1_align_4493_31406_0
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 55953 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55996 - Posted: 24 Sep 2008, 2:00:20 UTC
Last modified: 24 Sep 2008, 2:01:15 UTC

hombench_mtyka_looprelax_test_full_2_looprelax_t326__IGNORE_THE_REST_1A9XB_17_4531_8_0 using minirosetta version 134
Running almost 9 hours and still on model 1
ID: 55996 · Rating: 0
Profile Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 55997 - Posted: 24 Sep 2008, 6:25:19 UTC

These "hombench" WUs take way longer than expected. It happened on two different PCs from two different accounts.
ID: 55997 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56011 - Posted: 24 Sep 2008, 22:38:27 UTC
Last modified: 24 Sep 2008, 22:43:06 UTC

Two models in nearly 18hrs of crunchtime.
AA2A_7_modeling_1_AA2A_1_AA2A_2RH1_align_4493_97149_0
...and this one
...and this one
...and this one
...and this one
all AA2A's, all 2 models completed in ~62,000 seconds.

this one took over 12hrs to do just one model.
ID: 56011 · Rating: 0
Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 56023 - Posted: 25 Sep 2008, 11:21:33 UTC - in response to Message 56011.  


this one took over 12hrs to do just one model.

...and 30 credits for that 12 hours work; so generous.
ID: 56023 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 137
Message 56063 - Posted: 27 Sep 2008, 19:48:15 UTC
Last modified: 27 Sep 2008, 20:05:45 UTC

hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_485 on a machine that has 3 hour wu times ran for well over 7 hours. For a normal 3 hour wu it claims and gets 50-60 credit, this wu claimed 148 and was granted 20. There is something wrong here.

Windows XP SP3, Intel Q6600, CC 5.10.30, MR 1.34.


Task ID 195085827
Name hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_485_0
Workunit 178217022
Created 27 Sep 2008 9:01:52 UTC
Sent 27 Sep 2008 9:01:59 UTC
Received 27 Sep 2008 19:27:07 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 708637
Report deadline 7 Oct 2008 9:01:59 UTC
CPU time 26326.86
stderr out <core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
======================================================
DONE :: 1 starting structures 26326.4 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 148.027127998142
Granted credit 20.8630305375669
application version 1.34


I really don't want to drop Rosetta from my portfolio, I have been here from the start, but recently there have been a near continual series of issues with the production quality project which a good deal of the Beta and some of the Alphas do not have. This, in spite of a dedicated Beta test project. Not good enough?
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 56063 · Rating: 0
Semunozg
Joined: 23 Nov 05
Posts: 2
Credit: 15,659,163
RAC: 0
Message 56073 - Posted: 28 Sep 2008, 17:08:57 UTC - in response to Message 56063.  

hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_485 on a machine that has 3 hour wu times ran for well over 7 hours. For a normal 3 hour wu it claims and gets 50-60 credit, this wu claimed 148 and was granted 20. There is something wrong here.




Same here: a bunch of workunits that ran way over my set limit claimed 1xx+ credits and received 90 or less.
I don't mind having to crunch a WU for longer, but getting credit when it's due would be nice. I'm not going to drop the project because of this... but it would be nice if they fixed this, or at least kept the public informed about current problems... challenges... etc. A little more participation.

For example, SETI@Home has excellent forums and the team is constantly posting updates on diverse news, from server issues to TFLOPS goals. The Rosetta@Home team didn't even acknowledge the server outage that occurred a few nights ago (a forum Mod said it was a kernel panic, but this was only posted in the forums... and very few people actually read the forums).
ID: 56073 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 137
Message 56083 - Posted: 29 Sep 2008, 15:57:30 UTC

I'm not going to drop the project though because of this...

As I said in my report, it is not just this issue. Recently I have had to post a number of times with different, unrelated problems. This one, just short-changing on credit, is pretty minor, and goes along with the reasonably high number of simple wu crashes.

Others involving locking out cores or whole machines are much more serious. This from a production status project with a separate Beta tester.
ID: 56083 · Rating: 0
Hubington
Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56103 - Posted: 30 Sep 2008, 10:29:18 UTC

minirosetta 1.34: hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t328___4598_724

Usually it takes 2-3 hours for a work unit to complete for me; however, the unit listed above is currently at 9 hours, 98.2% complete, with 9 mins 50 secs remaining.

I first noticed the runtime of it earlier today when it was at about 6.5 hours, at 97.3% with 9 mins 50 secs remaining.

Now I know the remaining times are estimates, but there are estimates, there's what Windows estimates when you go to copy a file, and then there is this. Basically I'm worried that the work unit is just wasting cycles, and wondered if anyone has any thoughts on it. Based on a 60-second sample I just took, it is notching up 0.001% of progress every 20 seconds, so in theory it should complete in about 9-10 hours' time. That is assuming that it just contains a lot more work than normal rather than just spinning its wheels.
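
The back-of-envelope estimate above can be checked with a few lines of Python. This is only a sketch using the figures quoted in this post (98.2% complete, 0.001% of progress every 20 seconds), and it assumes, perhaps wrongly, that progress keeps advancing at that constant rate:

```python
def hours_to_finish(percent_done, step_percent, step_seconds):
    """Estimate remaining hours, assuming progress advances at a constant rate."""
    remaining_steps = (100.0 - percent_done) / step_percent
    return remaining_steps * step_seconds / 3600.0


print(round(hours_to_finish(98.2, 0.001, 20), 1))  # 10.0
```

That is 1.8% left at 0.001% per 20 seconds, or 1,800 steps of 20 seconds each.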

any comments welcomed
ID: 56103 · Rating: 0
Hubington
Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56104 - Posted: 30 Sep 2008, 10:42:05 UTC - in response to Message 56103.  

In fine accordance with Murphy's law, it just finished.

If someone could find out why the run time was over 3 times the norm, though, it could be useful, as others may kill off the work units thinking they had died.

ID: 56104 · Rating: 0
Profile Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 56111 - Posted: 30 Sep 2008, 13:42:30 UTC - in response to Message 56104.  

Read this thread https://boinc.bakerlab.org/forum_thread.php?id=4388

In fine accordance with Murphy's law, it just finished.

If someone could find out why the run time was over 3 times the norm, though, it could be useful, as others may kill off the work units thinking they had died.

ID: 56111 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 137
Message 56137 - Posted: 1 Oct 2008, 7:43:10 UTC

In fine accordance with Murphy's law, it just finished.

Your machines are hidden so I can't look at your results. I'm curious to see how the claimed/granted credit you got from that wu compares to the same machine's regular performance.


ID: 56137 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56177 - Posted: 2 Oct 2008, 22:09:12 UTC
Last modified: 2 Oct 2008, 22:09:56 UTC

abinitio_nohomfrag_70_A_2hx5A_4482_62740_0 did 6 models in 25hrs.

Not surprisingly, his brothers
abinitio_nohomfrag_70_A_2hx5A_4482_58343_0
abinitio_nohomfrag_70_A_2hx5A_4482_59391_0 both did 5 models in just over 24hrs and 19.5hrs.

abinitio_nohomfrag_70_A_2hx5A_4482_46233_1 did 4 models in 18hrs.

abinitio_nohomfrag_70_A_2hcmA_4482_36776_0 did 5 models in 26hrs.
ID: 56177 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56178 - Posted: 2 Oct 2008, 22:27:58 UTC
Last modified: 2 Oct 2008, 22:28:29 UTC

This AA2A has been running for 10hrs and hasn't even checkpointed yet. So it must still be on model one.

AA2A_20_modeling_1_AA2A_1_AA2A_2VTA_SAVE_ALL_OUT_align_4600_37939_0
ID: 56178 · Rating: 0
Hubington
Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56181 - Posted: 3 Oct 2008, 1:20:18 UTC - in response to Message 56137.  
Last modified: 3 Oct 2008, 1:32:39 UTC

In fine accordance with Murphy's law, it just finished.

Your machines are hidden so I can't look at your results. I'm curious to see how the claimed/granted credit you got from that wu compares to the same machine's regular performance.



yeah I'm paranoid :)

Here is a smattering of the surrounding results for that WU though:
CPU Time |Claimed |Granted
9,912.30 |37.37 |33.71
9,435.89 |32.95 |34.51
33,868.45 |127.69 |22.29
10,209.95 |35.65 |33.69
10,327.84 |36.06 |34.75
5,979.70 |20.88 |23.02
9,766.56 |34.10 |25.06

(the formatting gets messed up so I've separated the columns with | marks)

New one on the way, incidentally:

minirosetta 1.34: hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t286___4580_1561_0

It has currently been running for 36 hours & 5 mins! 99.540% complete.

OK, I just noticed something VERY worrying while trying to see how long it took to click over 0.001%: the run time jumped back 6 mins?!?!?! And now it has lost 0.001% from the progress, taking it back to 99.539.
ID: 56181 · Rating: 0
Profile adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 137
Message 56184 - Posted: 3 Oct 2008, 7:57:06 UTC
Last modified: 3 Oct 2008, 7:59:40 UTC

9,933.34 | 55.85 | 57.86
9,569.86 | 53.81 | 56.64
11,264.72 | 63.34 | 71.97
10,648.98 | 59.88 | 64.31
26,326.86 | 148.03 | 20.86 <------
10,750.00 | 60.44 | 63.28
9,991.64 | 56.18 | 58.85

They really stand out don't they.
ID: 56184 · Rating: 0
Hubington
Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56187 - Posted: 3 Oct 2008, 12:08:14 UTC
Last modified: 3 Oct 2008, 12:37:44 UTC

Well, I imagine that the claim is based on cycles used, so I imagine that your processor puts out more power than mine, which is why you generate more credits per hour than I do. But then the granted credit is probably based on results rather than effort put in, the theory being that X amount of effort usually yields Y amount of results. That is why you get small variances between the claimed and granted, usually being granted less than claimed but seemingly not always. Also I suspect certain sub-projects of WU yield more or fewer results per hour than others.

In the case of these work units though, the system is seemingly using a lot of cycles but producing little or no results for it, and so the claim going in is much higher than what's being granted.

Just an observation though
ID: 56187 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 56193 - Posted: 3 Oct 2008, 13:31:33 UTC

26,326.86 | 148.03 | 20.86 <------


Credit claimed represents how much time your computer put into it, as compared to the benchmarks for that computer. The credit granted is based on the work you actually completed and is the average of the claims of others that did similar work.

So, it looks to me as though that task took you 2.5 times longer than normal, while everyone else completed the models in "normal" time. Hence, you have a long-running model there that required dramatically more work than others for the same protein. And hence, this thread: to identify such occurrences so the team can track down what caused it to run for so long.
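
To make that comparison concrete, here is a small sketch using the CPU time, claimed credit and granted credit figures adrianxw posted earlier in this thread. Using the median CPU time as the "normal" runtime and a 2x cutoff are assumptions chosen for illustration, not how the project's credit system works:

```python
from statistics import median

# (CPU seconds, claimed credit, granted credit), copied from adrianxw's table
tasks = [
    (9933.34, 55.85, 57.86),
    (9569.86, 53.81, 56.64),
    (11264.72, 63.34, 71.97),
    (10648.98, 59.88, 64.31),
    (26326.86, 148.03, 20.86),
    (10750.00, 60.44, 63.28),
    (9991.64, 56.18, 58.85),
]

# Treat the median CPU time as this host's "normal" runtime (an assumption).
typical_cpu = median(cpu for cpu, _, _ in tasks)

for cpu, claimed, granted in tasks:
    slowdown = cpu / typical_cpu
    flag = "  <-- long-running" if slowdown > 2.0 else ""
    print(f"{cpu:10.2f} s  x{slowdown:.1f}  claimed {claimed:6.2f}  granted {granted:6.2f}{flag}")
```

Only the 26,326.86-second task is flagged, at roughly 2.5 times the typical runtime. Claimed credit per CPU second is nearly identical on every row (about 0.0056), which fits the claim being driven by time spent, while the granted figure is the one that drops.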
Rosetta Moderator: Mod.Sense
ID: 56193 · Rating: 0
Mark1212
Joined: 23 Sep 05
Posts: 1
Credit: 1,952,358
RAC: 0
Message 56194 - Posted: 3 Oct 2008, 13:38:13 UTC

You think you have it bad? What a waste of time and energy this one was:

194952967 26 Sep 2008 18:51:19 UTC Over Client error Compute error 61,849.00 562.03 ---

ID: 56194 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org