Report long-running models here

Message boards : Number crunching : Report long-running models here

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49379 - Posted: 4 Dec 2007, 16:51:09 UTC
Last modified: 15 Sep 2008, 14:10:12 UTC

I'd like to start a thread for reports of long-running models. These appear to be related more to the specific batches of work released than to any specific application version. So, I've moved the "problems with v1.34" posts that seemed more about runtime into this thread.

Here's what I'd like to see:

First, it is hard to talk about total time when everyone has different CPUs. So, for a frame of reference, we'll try to talk in terms of a fairly modern 3GHz machine. If yours is slower than that, you will have to adjust the times discussed here upwards accordingly.

Keep in mind that tasks starting with "AA2A" are expected to run about 3 hours per model. So, please only report an AA2A task if it runs longer than, say, 6hrs for a single model.

Each task typically runs through several "models". You can see the model number in the graphic display, or on the web page of the completed task (as the number of "decoys").

So, if you see tasks that are averaging more than 2 hours per model, or specific models that are taking more than 2 hours, please report them as described below.

Such tasks can typically be spotted by opening your results page and scanning the list for an abnormally high or low number of CPU seconds. (A low number can occur when you have, say, a 3-hour runtime preference and the first model takes 2.3hrs to complete. There isn't time for a second model, so the task reports in using about 40 minutes less CPU time than your average.)

Another way to spot such tasks is if you have a target runtime of 3hrs or less, and a task takes significantly longer than that to complete (say, 1 or more hours past your target).
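The rules of thumb above can be sketched as a small check. This is only an illustration: the function name, its parameters, and the linear clock-speed scaling are assumptions made here, not something the project provides, and it only looks at the per-task average rather than individual models.

```python
REFERENCE_GHZ = 3.0  # the frame-of-reference machine used in this thread


def should_report(cpu_seconds, models, cpu_ghz, task_name=""):
    """Return True if a finished task looks like a long-running case.

    Assumes runtime scales linearly with clock speed, which is only a
    rough approximation across different CPU families.
    """
    if models < 1:
        raise ValueError("need at least one completed model")
    # AA2A tasks are expected to take about 3 h per model, so use 6 h there.
    limit_hours = 6.0 if task_name.startswith("AA2A") else 2.0
    # Slower machines get a proportionally higher limit.
    limit_hours *= REFERENCE_GHZ / cpu_ghz
    hours_per_model = cpu_seconds / 3600.0 / models
    return hours_per_model > limit_hours
```

For example, a task that used 3 hours of CPU time for a single model would be flagged on a 3GHz machine but not on a 1.5GHz one, where the adjusted limit is 4 hours.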

Such tasks will likely also tend to be granted significantly less credit than is claimed. I don't want this thread to become a credit discussion; we have other threads that focus on that. So if granted credit is poor compared to your claim across your entire results page, that's not the case we are interested in here.

The theory is that specific models within a task are taking significantly longer than others, and if your series of models happens to visit one of these, then you will spend considerable time working through it and still only be granted the average credit per model. If the Project Team can study these outlying long-running models in more detail, they may be able to find coding changes that allow them to complete in a more normal period of time.

If you feel you have a long-running model to report here, please post with the following details:

Full WU name (you can copy the BOINC message from when the task completes).

Type of operating system (version of Windows, Linux distribution, or Mac info.)

BOINC version (see BOINC Manager "About" page).

Rosetta version (see BOINC Manager "tasks" page).

A link to the task's results page.

If a specific model took longer than the rest of them, what model # was shown in the graphic?
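To keep reports consistent, the details requested above could be collected in a small structure before posting. The class and field names below are hypothetical and exist only for illustration; nothing like this is part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass
class LongModelReport:
    """Hypothetical container for the details requested in this thread."""
    wu_name: str
    operating_system: str
    boinc_version: str
    rosetta_version: str
    results_link: str
    slow_model: str = "not noted"  # model # shown in the graphic, if known

    def as_post(self) -> str:
        """Render the report as plain text suitable for pasting into a reply."""
        lines = [
            f"WU name: {self.wu_name}",
            f"OS: {self.operating_system}",
            f"BOINC version: {self.boinc_version}",
            f"Rosetta version: {self.rosetta_version}",
            f"Results page: {self.results_link}",
            f"Slow model #: {self.slow_model}",
        ]
        return "\n".join(lines)
```

Calling `as_post()` on a filled-in report gives six labeled lines, one per requested item.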
Rosetta Moderator: Mod.Sense

Benny Mikkelsen
Joined: 2 May 07
Posts: 2
Credit: 672,295
RAC: 0
Message 55734 - Posted: 13 Sep 2008, 10:39:07 UTC

Hi

Now I have started my first 1.34 task. There seems to be
the same difficulty with completing jobs as I reported for
the 1.32 version (never got an answer to that).
When reaching about 95.5% completion the job goes into a
'loop', meaning the rest time counter becomes VERY slow.
The remaining 16 minutes will complete in about 3-5 hours!
I have observed this only for abinitio_homfrag tasks.
I'll reject any abinitio_homfrag tasks unless you convince
me that the completion is much wanted and the job OK.
Still, the credit for such tasks is very low regarding the
high 'overtime' for such jobs.

Benny

Benny Mikkelsen
Joined: 2 May 07
Posts: 2
Credit: 672,295
RAC: 0
Message 55737 - Posted: 13 Sep 2008, 13:30:30 UTC - in response to Message 55734.  

Hi again

The task abinitio_nohomfrag_70_A_1dzoA_4466_1025_0 finished
at 14:15:39, about 4 hours after reaching the 95% level.
Are the results useful?

Benny

Hi

Now I have started my first 1.34 task. There seems to be
the same difficulty with completing jobs as I reported for
the 1.32 version (never got an answer to that).
When reaching about 95.5% completion the job goes into a
'loop', meaning the rest time counter becomes VERY slow.
The remaining 16 minutes will complete in about 3-5 hours!
I have observed this only for abinitio tasks.
I'll reject any abinitio tasks unless you convince me
that the completion is much wanted and the job is OK.
Still, the credit for such tasks is very low regarding the
high 'overtime' for such jobs.

Benny

Greg_BE
Joined: 30 May 06
Posts: 5661
Credit: 5,699,872
RAC: 2,052
Message 55738 - Posted: 13 Sep 2008, 14:30:18 UTC - in response to Message 55737.  

They like ALL results, good or bad; it tells them a lot of stuff.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55775 - Posted: 15 Sep 2008, 15:30:12 UTC

abinitio_nohomfrag_70_A_1wouA_4466_201_0 ran almost 2hrs past my 24hr runtime preference on Linux 2.6.18-53.1.13.el5PAE, BOINC 5.10.45. It only produced 9 models in 26hrs of runtime, Mini 1.32... yet was granted MORE credit than claimed.

abinitio_homfrag_71_A_1wgbA_4443_49969_0 completed 4 models in 19hrs on Linux 2.6.18-53.1.13.el5PAE, BOINC 5.10.45. Mini 1.32.

t040_1_NMRREF_1_t040_1_S_00003_0001370_0IGNORE_THE_REST_core_4463_580_0, BOINC 5.10.45, Linux 2.6.18-53.1.13.el5PAE, 8 models in 21.5hrs, v5.98

GB1t_BOINC_MFR_ABRELAX_PICKED_4461_8121_0, BOINC 6.2.18, Win XP Pro, 6 models in 25hrs, and 1hr past RT Pref., v5.98

rb_09_05_12203_22029_T0482_tri_IGNORE_THE_REST_03_09_4455_12767_0, BOINC 5.10.45, Linux 2.6.18-53.1.13.el5PAE, 1 model in 12+hrs, v1.32
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

UtahTestLabs
Joined: 1 Jan 07
Posts: 4
Credit: 164,281
RAC: 0
Message 55785 - Posted: 15 Sep 2008, 18:47:52 UTC
Last modified: 15 Sep 2008, 19:12:45 UTC

Hello;

I am running Minirosetta V1.34 on abinitio_nohomfrag_70_A_2hnfA_4466_3274_0_0

I just noticed that it has been running for over 21 hours and is at 99.2%.
The percentage is only going up at 0.001% every 3 or 4 minutes.

Normally, your WUs take a little over 2 hours. Does this mean there is something wrong with my current WU, or should I wait for it to finish (which could take 20 more hours)?

Windows - Boinc Ver. 5.10.45

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55787 - Posted: 15 Sep 2008, 19:14:54 UTC

Utah, the rate at which the estimated runtime decreases at this point is not relevant. It's just trying to show you progress, without finding itself reaching zero time to completion without being done yet. In my experience with my 24hr runtime preference, they seem to complete on their own. But that one sure sounds odd. In fact, if your preference is 2hrs, I am surprised the watchdog hasn't cleaned it out of there.

Your machines are hidden, could you post a link to that WU?

I'd suggest you let it go to 24hrs, and if it still hasn't finished, then abort it.

UtahTestLabs
Joined: 1 Jan 07
Posts: 4
Credit: 164,281
RAC: 0
Message 55788 - Posted: 15 Sep 2008, 19:21:34 UTC - in response to Message 55787.  

Is this the link you need?
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=175174690

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55791 - Posted: 15 Sep 2008, 19:44:46 UTC

Yep, that's the link.

In rereading my post, I noticed my wording might be read to say go for 24hrs more. I meant to say go until it completes or total CPU time crosses 24hrs.

UtahTestLabs
Joined: 1 Jan 07
Posts: 4
Credit: 164,281
RAC: 0
Message 55793 - Posted: 15 Sep 2008, 20:01:53 UTC - in response to Message 55791.  

Thanks for the advice. I will wait and see where it stands tomorrow when I come in. If it is still going, I will abort it.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 55796 - Posted: 16 Sep 2008, 2:27:22 UTC

For a frame of reference, my first AA2A task took 6h:45m for its first model on a 2.8GHz P4 (more than double the expectation). And I was 4 hours into model 2 when I rebooted my PC. Model 2 gets to start over from step 0. So, it ran for 4 hours without checkpointing.

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 55800 - Posted: 16 Sep 2008, 10:09:17 UTC - in response to Message 49379.  

I'd like to start a thread for reports of long running models.


Pure mindreading. And you may be needing this discussion as a sticky...

The theory is that specific models within a task are taking significantly longer than others, ...


This part is a bit more than a theory. My documentation disappeared last week, but I remember a WU starting with a model lasting less than 6 hours and concluding with model 2 lasting around 17 hours, and a WU starting with models running about an hour and finishing with a model 9 running 4 hours longer than the default runtime.

My present results page consists of three tasks:

Task ID: 190622469 WU ID: 174119129 8 Sep 2008 15:44:37 UTC 15 Sep 2008 16:14:01 UTC Over Success Done CPU: 45,044.32 Claimed: 83.08 Granted: 22.40 (2 models) (25% of claim is a normal result for my computer these days)

Or81__BOINC_SYMM_FOLD_AND_DOCK_RELAX-Or81_-nmr_foldanddock__4445_17477_1:
Task ID: 190080493 WU ID: 171441384 6 Sep 2008 11:55:27 UTC 15 Sep 2008 8:21:41 UTC Over Success Done CPU: 124,679.00 (34h) Claimed: 229.96 Granted: 33.70 (1 model) (about 15% of claim, less than 1 credit per CPU hour)

amer__BOINC_SYMM_FOLD_AND_DOCK_RELAX-amer_-STIH_tetr_4437_41444_1:
Task ID: 190080492 WU ID: 171441412 6 Sep 2008 11:55:27 UTC 12 Sep 2008 9:40:47 UTC Over Success Done CPU: 204,936.90 (56h 55min) Claimed: 368.54 Granted: 64.58 (1 model)

# The first task included solely for comparison of benchmarks.
# The middle task is one of my poorest results as for credits.
# The third task with a record-breaking model demanding 57 hours of computing was also on the low side.
(It must be noted that I was the second cruncher of all these wus.)
But one has to be grateful that such models now are checkpointing.
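As a rough cross-check of figures like these, the granted-to-claimed ratio and the credit per CPU hour can be computed directly from the results page. This is a minimal sketch; the function name and dictionary keys are made up for illustration, and the numbers are copied from the three tasks listed above.

```python
def summarize(cpu_seconds, claimed, granted):
    """Return CPU hours, granted credit as % of claim, and credit per CPU hour."""
    hours = cpu_seconds / 3600.0
    return {
        "cpu_hours": hours,
        "granted_pct": 100.0 * granted / claimed,
        "credit_per_hour": granted / hours,
    }


# (label, CPU seconds, claimed credit, granted credit) from the results page
tasks = [
    ("first task", 45044.32, 83.08, 22.40),
    ("Or81 task", 124679.00, 229.96, 33.70),
    ("amer task", 204936.90, 368.54, 64.58),
]

for label, cpu_seconds, claimed, granted in tasks:
    s = summarize(cpu_seconds, claimed, granted)
    print(f"{label}: {s['cpu_hours']:.1f} h, "
          f"{s['granted_pct']:.0f}% of claim, "
          f"{s['credit_per_hour']:.2f} credits per CPU hour")
```

Comparing credit per CPU hour across tasks on the same machine is one way to see which tasks stand out without relying on the claim percentage alone.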

The inability to predict the runtime needed for such large wus is a bit of a problem, as they upset the computer's time schedule and downloading ability. I wonder if we could at least be helped by changing the graphics window to display
"Model 1 Step xxxx of nnnnn total steps"
Then one would at least see if the end is near...

Specifications:
iBook G4 1.2 GHz PPC, 768 MB RAM
MacOS 10.3.9
BOINC 6.2.18
Rosetta 5.98

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 55803 - Posted: 16 Sep 2008, 12:00:06 UTC - in response to Message 49379.  

I'd like to start a thread for reports of long running models....
The theory is that specific models within a task are taking significantly longer than others...


Does a long-running model imply something abnormal and erroneous? Help me understand.

The project staff has said the number of proteins, their widely varying sizes and number of amino acids, varying complexity and increasing complexity means that producing any single decoy may be a short or very long process. And this process is being studied and refined.

I'm not trying to be facetious, but why should we care if it takes 2 hours or 6 hours or 12 hours to produce a decoy as long as a decoy is produced? Are they not all equally useful?

If there is a program loop or the decoy otherwise gets hung then I can see some reason for concern. Seems like a lot of people complain about long running models because their chosen run times are simply too short to be viable for the ever larger and more complex proteins that are being studied.

So, if you see tasks that are averaging more than 2 hours per model, or specific models that are taking more than 2 hours, please report them as per below.


I can take the size of an atom and the size of the cosmos and average them also, but what possible use would it have? Given the variety of proteins and complexity and other factors, why would an average of 2 hours have any more meaning than my atom/cosmos average? OTOH, averaging the time it takes to produce decoys for a specific protein might be useful if the times tend to cluster around their average (narrow spread); but like the atom/cosmos example, if the spread is large then the average is less meaningful or even useless.

Speaking for myself, I think you need to be more specific about what you or the project is looking for and how to find it. And are there any particular proteins/tasks that are more suspicious than others? If I run a task that produces 34 decoys, some may have taken 1 hour to produce while others may have taken 10 hours and their average could be under 2 hours; how would I know?
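The point about averages hiding outliers can be made concrete with a made-up example. The runtimes below are hypothetical (33 decoys at 1 hour each and one at 10 hours), chosen only to match the 34-decoy scenario described above.

```python
from statistics import mean, pstdev


def runtime_summary(hours_per_decoy):
    """Return (average, population standard deviation, longest) in hours."""
    return mean(hours_per_decoy), pstdev(hours_per_decoy), max(hours_per_decoy)


# Hypothetical task: 33 quick decoys plus a single 10-hour outlier.
decoys = [1.0] * 33 + [10.0]
average, spread, longest = runtime_summary(decoys)
print(f"average {average:.2f} h, spread {spread:.2f} h, longest {longest:.1f} h")
```

Here the average stays well under 2 hours even though one decoy took 10, and the spread is larger than the average itself, which is the signal that the average alone is not telling the whole story.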

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 55805 - Posted: 16 Sep 2008, 12:38:16 UTC - in response to Message 55803.  

Speaking for myself, I think you need to be more specific about what you or the project is looking for and how to find it. And are there any particular proteins/tasks that are more suspicious than others? If I run a task that produces 34 decoys, some may have taken 1 hour to produce while others may have taken 10 hours and their average could be under 2 hours; how would I know?


Thanks NBIT. Good points. You are absolutely right, some proteins take longer than others. And Rosetta's batches typically comprise a number of different proteins.

I opened the thread because it is difficult to know that one out of the 34 decoys took 10 hours. And by soliciting more eyes, I am hoping to bring some specific examples back to the Project Team for them to study further. I've also suggested the CPU seconds be recorded in the .out file at the time each model is completed, so they will know the exact answer.
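The suggestion of recording CPU seconds as each model completes could look roughly like the sketch below. Rosetta itself is not written in Python, and the log line format and function name here are invented for illustration; the only real API used is `time.process_time()`, which reports CPU time consumed by the current process.

```python
import time


def run_models(model_fns, log):
    """Run each model function and append its CPU seconds to the log."""
    for number, run_model in enumerate(model_fns, start=1):
        start = time.process_time()
        run_model()
        elapsed = time.process_time() - start
        # Hypothetical log line; the real .out file format is not shown here.
        log.append(f"model {number} cpu_seconds {elapsed:.2f}")
    return log


# Two stand-in "models" that just burn a little CPU.
log = run_models([lambda: sum(range(100000)), lambda: sum(range(1000))], [])
```

With one line per model in the output, a single 10-hour model among 34 would be visible without anyone having to watch the graphic.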

The overall objective is for the majority of work sent out by Rosetta to complete each model within an hour, so the 2hrs "average" was meant to say anything that is running more than twice as long as expected.

By meeting this 1 hour (or less) per model objective, the BOINC runtime estimates and amount of work that it requests and downloads will better match your configured objectives. And the average runtime you see on the client should be very close to your Rosetta runtime preference. It will also help assure the credit system remains fair, and eliminate as much of the "luck of the draw" as possible from the credit results.

At present, some models take significantly longer than the objective. In order to identify the cause and make program revisions, it will be helpful to identify specific models that are taking longer than desired and rerun them in the lab to understand what they came across that is causing them to take so long. Then methods of more efficiently handling that set of circumstances can usually be identified, and code written to avoid them if they are typically fruitless, or seek them out if this seems to produce a better result. Either way, overall runtimes will become more consistent.

So, I won't call the current long-running models a "bug". Instead I would say they represent an area where it is believed improvements can be made. An "enhancement opportunity" you might say.

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 55807 - Posted: 16 Sep 2008, 14:14:15 UTC - in response to Message 55805.  

Agree, the project software should record and track whatever is necessary to achieve its objectives or observe any anomalies it chooses to define. This is especially true because most contributors only contribute computer time, not personal time.

bob
Joined: 11 Dec 05
Posts: 3
Credit: 618,194
RAC: 0
Message 55809 - Posted: 16 Sep 2008, 17:19:08 UTC

Took about 2 hours to get to 95%. Took 8 more hours to get to 99.9%. Finally gets to 100% but never finishes. Suspend and restart causes work to restart at 32%. I aborted this WU (abinitio_nohomfrag_70_A_1qgvA_4466_6257_1).

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 55811 - Posted: 16 Sep 2008, 18:19:16 UTC

A link to Bob's task: BOINC 6.2.18, Win 2000, Rosetta 1.34

The first user to crunch it was on Win Vista, and got an:
Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007D3863 read attempt to address 0x00000008

William Timbrook
Joined: 2 Nov 05
Posts: 3
Credit: 11,623,185
RAC: 0
Message 55823 - Posted: 17 Sep 2008, 1:33:49 UTC
Last modified: 17 Sep 2008, 2:23:14 UTC

I've noticed a couple of odd work unit packets but I think I now have a specific question.

Work Unit 175541763 has clocked over 8 hours of cpu time and 98.001% done. I went to the host machine, stopped and restarted boinc. Now the work unit is showing 55.704% done and about 100 minutes of cpu time.

Is this a known opportunity?



Thanks,
William

William Timbrook
Joined: 2 Nov 05
Posts: 3
Credit: 11,623,185
RAC: 0
Message 55825 - Posted: 17 Sep 2008, 5:18:52 UTC - in response to Message 55823.  

After 2 restarts of boinc (and upgrading to 6.2.18), the unit got stuck with 9:52 to finish. I aborted it.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 55835 - Posted: 17 Sep 2008, 14:49:34 UTC

William, parts of what you describe are normal and expected, and some parts are not. I've moved your posts here to this thread because you appear to have a 3hr runtime (the default) configured for that host, and so the 8hrs you report is well beyond that.

Your task was abinitio_nohomfrag_70_A_1qgvA_4466_9601, v1.34, running BOINC 6.2.18 and Windows 2000.

So it ran longer than expected.

The parts of what you describe that are normal are that any time you end BOINC or remove a task from memory (which happens if BOINC switches to running another project, suspending the R@h task, and you are not keeping suspended tasks in memory), you will lose some work. The amount lost depends on when Rosetta was last able to save a checkpoint. And some tasks are able to checkpoint more frequently than others.

So, seeing the CPU time reduced (sometimes all the way back to zero) when the task restarts, is normal.

The other thing is that the 3 hours you are probably currently seeing as the initial estimated time to completion is just based on your runtime preference (which you can set here on the website in your Rosetta-specific preferences). Actually, it is based on your BOINC client's history of working tasks with your runtime preference. Some tasks take longer than that. So, rather than showing a negative estimated time to complete once the original estimate is reached, the program starts to make time pass slower and slower once it reaches about 10 minutes remaining. So, the part you describe about 10 minutes remaining for an extended period of time is normal as well.
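The slowing countdown described above can be illustrated with a simple sketch. The actual formula Rosetta uses is not given in this thread, so the one below is an assumption chosen only to show the behavior: the display counts down normally until 10 minutes remain, then shrinks more and more slowly without ever reaching zero.

```python
def displayed_remaining(elapsed_min, estimate_min, slow_zone_min=10.0):
    """Return the time-to-completion (minutes) a client might display.

    Hypothetical formula: linear countdown until slow_zone_min remains,
    then a curve that approaches zero but never reaches it.
    """
    remaining = estimate_min - elapsed_min
    if remaining >= slow_zone_min:
        return remaining
    # Minutes that have passed since the countdown entered the slow zone.
    overrun = slow_zone_min - remaining
    return slow_zone_min * slow_zone_min / (slow_zone_min + overrun)


# With a 3-hour (180 min) estimate, sample a few points in time.
for elapsed in (60, 170, 180, 480):
    print(elapsed, round(displayed_remaining(elapsed, 180), 2))
```

Under this sketch, a task that has run for 8 hours against a 3-hour estimate would still show a fraction of a minute remaining, which matches the kind of behavior reported in this thread.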

The resulting confusion when tasks go longer than your preference is why I started this thread, and why the Project Team is working to address these long-running models that cause runtimes to be exceeded.

©2024 University of Washington
https://www.bakerlab.org