Report long-running models here

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 49379 - Posted: 4 Dec 2007, 16:51:09 UTC Last modified: 15 Sep 2008, 14:10:12 UTC
I'd like to start a thread for reports of long-running models. These appear to be related more to the specific batches of work released than to any specific application version. So, I've moved the "problems with v1.34" posts that seemed to be more about runtime into this thread. Here's what I'd like to see:

First, it is hard to talk about total time when everyone has different CPUs. So, for a frame of reference, we'll try to talk in terms of a fairly modern 3GHz machine. If yours is slower than that, you will have to adjust the times discussed here upwards accordingly.

Keep in mind that tasks starting with "AA2A" are expected to run about 3 hours per model. So, please only report an AA2A if it runs longer than, say, 6hrs for a single model.

Each task typically runs through several "models". You can see the model number in the graphic display, or on the web page of the completed task (as the number of "decoys"). So, if you see tasks that are averaging more than 2 hours per model, or specific models that are taking more than 2 hours, please report them as described below.

Such tasks might typically be noticed by opening your results page and scanning the list for an abnormally high or low number of CPU seconds. (Low can occur when you have, say, a 3-hour runtime preference and the first model takes 2.3hrs to complete. There isn't time for a second model, so it reports in using 40 minutes less CPU time than your average.) Another way you might spot such tasks is if you have a target runtime of 3hrs or less, and a task takes significantly longer than that to complete (say 1 or more hours past your target). Such tasks will likely also tend to be granted significantly less credit than is claimed.

I don't want this thread to become a credit discussion; we have other threads to focus on that. So if granted credit is poor compared to your claim for your entire results page, that's not the case we are interested in here.

The theory is that specific models within a task are taking significantly longer than others, and if your series of models happens to visit one of these, then you will spend considerable time working through it, and still only be granted the average credit per model. If the Project Team can study these outlying long-running models in more detail, they may be able to find coding changes that allow them to complete in a more normal period of time.

If you feel you have a long-running model to report here, please post with the following details:
Full WU name (you can copy the BOINC message from when the task completes).
Type of operating system (version of Windows, Linux distribution, or Mac info).
BOINC version (see BOINC Manager "About" page).
Rosetta version (see BOINC Manager "tasks" page).
A link to the task's results page.
If a specific model took longer than the rest of them, what model # was shown in the graphic?

Rosetta Moderator: Mod.Sense
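As an illustration of the reporting rule in the post above, here is a minimal Python sketch. It is not part of Rosetta or BOINC: the function names are made up for this example, and the linear scaling of the limit with CPU clock speed is only an assumption standing in for "adjust the times upwards accordingly".

```python
REFERENCE_GHZ = 3.0        # reference machine named in the post
DEFAULT_LIMIT_HOURS = 2.0  # report models slower than this on the reference machine
AA2A_LIMIT_HOURS = 6.0     # tasks starting with "AA2A" get a higher limit

def limit_for(task_name, cpu_ghz):
    """Per-model limit in hours for this task on this CPU (hypothetical helper)."""
    base = AA2A_LIMIT_HOURS if task_name.startswith("AA2A") else DEFAULT_LIMIT_HOURS
    # Assumption: a slower clock raises the limit proportionally.
    return base * REFERENCE_GHZ / cpu_ghz

def models_to_report(task_name, model_hours, cpu_ghz):
    """Return (model number, hours) pairs that exceed the limit."""
    limit = limit_for(task_name, cpu_ghz)
    return [(n, h) for n, h in enumerate(model_hours, start=1) if h > limit]

# Example with invented per-model times on a 3GHz machine.
print(models_to_report("abinitio_example", [1.0, 2.5, 0.5], 3.0))
```

Per-model times are not listed on the results page, so in practice the input would have to come from watching the model number in the graphic.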

Benny Mikkelsen Joined: 2 May 07 Posts: 2 Credit: 672,295 RAC: 0	Message 55734 - Posted: 13 Sep 2008, 10:39:07 UTC
Hi

I have now started my first 1.34 task. There seems to be the same difficulty with completing jobs as I reported for the 1.32 version (I never got an answer to that). On reaching about 95.5% completeness, the job goes into a 'loop', meaning the remaining-time counter becomes VERY slow. The remaining 16 minutes will take about 3-5 hours to complete! I have observed this only for abinitio_homfrag tasks. I'll reject any abinitio_homfrag tasks unless you convince me that the completion is much wanted and the job is OK. Still, the credit for such tasks is very low considering the high 'overtime' for such jobs.

Benny

Benny Mikkelsen Joined: 2 May 07 Posts: 2 Credit: 672,295 RAC: 0	Message 55737 - Posted: 13 Sep 2008, 13:30:30 UTC - in response to Message 55734.
Hi again

The task abinitio_nohomfrag_70_A_1dzoA_4466_1025_0 finished at 14:15:39, about 4 hours after reaching the 95% level. Are the results useful?

Benny

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 55738 - Posted: 13 Sep 2008, 14:30:18 UTC - in response to Message 55737.
They like ALL results, good or bad; it tells them a lot of stuff.

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 55775 - Posted: 15 Sep 2008, 15:30:12 UTC
abinitio_nohomfrag_70_A_1wouA_4466_201_0 ran almost 2hrs past 24hr runtime preference on Linux 2.6.18-53.1.13.el5PAE, BOINC 5.10.45. It only produced 9 models in 26hrs of runtime, Mini 1.32... yet was granted MORE credit than claimed.

abinitio_homfrag_71_A_1wgbA_4443_49969_0 completed 4 models in 19hrs on Linux 2.6.18-53.1.13.el5PAE, BOINC 5.10.45, Mini 1.32.

t040_1_NMRREF_1_t040_1_S_00003_0001370_0IGNORE_THE_REST_core_4463_580_0, BOINC 5.10.45, Linux 2.6.18-53.1.13.el5PAE, 8 models in 21.5hrs, v5.98

GB1t_BOINC_MFR_ABRELAX_PICKED_4461_8121_0, BOINC 6.2.18, Win XP Pro, 6 models in 25hrs, and 1hr past RT Pref., v5.98

rb_09_05_12203_22029_T0482_tri_IGNORE_THE_REST_03_09_4455_12767_0, BOINC 5.10.45, Linux 2.6.18-53.1.13.el5PAE, 1 model in 12+hrs, v1.32

Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/

UtahTestLabs Joined: 1 Jan 07 Posts: 4 Credit: 164,281 RAC: 0	Message 55785 - Posted: 15 Sep 2008, 18:47:52 UTC Last modified: 15 Sep 2008, 19:12:45 UTC
Hello; I am running Minirosetta V1.34 on abinitio_nohomfrag_70_A_2hnfA_4466_3274_0_0. I just noticed that it has been running for over 21 hours and is at 99.2%. The percentage is only going up by 0.001% every 3 or 4 minutes. Normally, your WUs take a little over 2 hours. Does this mean there is something wrong with my current WU, or should I wait for it to finish (which could take 20 more hours)?

Windows - BOINC Ver. 5.10.45

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 55787 - Posted: 15 Sep 2008, 19:14:54 UTC
Utah, the rate at which the estimated runtime decreases at this point is not relevant. It's just trying to show you progress, without finding itself reaching zero time to completion without being done yet. In my experience with my 24hr runtime preference, they seem to complete on their own. But that one sure sounds odd. In fact, if your preference is 2hrs, I am surprised the watchdog hasn't cleaned it out of there. Your machines are hidden, could you post a link to that WU?

I'd suggest you let it go to 24hrs, and if it still hasn't finished, then abort it.

UtahTestLabs Joined: 1 Jan 07 Posts: 4 Credit: 164,281 RAC: 0	Message 55788 - Posted: 15 Sep 2008, 19:21:34 UTC - in response to Message 55787.
Is this the link you need?
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=175174690

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 55791 - Posted: 15 Sep 2008, 19:44:46 UTC
Yep, that's the link. In rereading my post, I noticed my wording might be read to say go for 24hrs more. I meant to say go until it completes or total CPU time crosses 24hrs.

UtahTestLabs Joined: 1 Jan 07 Posts: 4 Credit: 164,281 RAC: 0	Message 55793 - Posted: 15 Sep 2008, 20:01:53 UTC - in response to Message 55791.
Thanks for the advice. I will wait and see where it stands tomorrow when I come in. If it is still going, I will abort it.

Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0	Message 55796 - Posted: 16 Sep 2008, 2:27:22 UTC
For a frame of reference, my first AA2A task took 6h:45m for its first model on a 2.8GHz P4 (more than double the expectation). And I was 4 hours into model 2 when I rebooted my PC. Model 2 gets to start over from step 0. So, it ran for 4 hours without checkpointing.

ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0	Message 55800 - Posted: 16 Sep 2008, 10:09:17 UTC - in response to Message 49379.
[quote]I'd like to start a thread for reports of long running models.[/quote]
Pure mindreading. And you may be needing this discussion as a sticky...

[quote]The theory is that specific models within a task are taking significantly longer than others, ...[/quote]
This part is a bit more than a theory. My documentation disappeared last week, but I remember a wu starting with a model lasting less than 6 hours and concluding with model 2 lasting around 17 hours, and a wu starting with models running about an hour and finishing with a model 9 running 4 hours longer than the default runtime.

My present results page consists of three tasks:

Task ID: 190622469 | WU ID: 174119129 | 8 Sep 2008 15:44:37 UTC to 15 Sep 2008 16:14:01 UTC | Over, Success, Done | CPU: 45,044.32 | Claimed: 83.08 | Granted: 22.40 (2 models)
(25% of claim is a normal result for my computer these days)

Or81__BOINC_SYMM_FOLD_AND_DOCK_RELAX-Or81_-nmr_foldanddock__4445_17477_1:
Task ID: 190080493 | WU ID: 171441384 | 6 Sep 2008 11:55:27 UTC to 15 Sep 2008 8:21:41 UTC | Over, Success, Done | CPU: 124,679.00 (34h) | Claimed: 229.96 | Granted: 33.70 (1 model)
(11% of claim, less than 1 credit per CPU hour)

amer__BOINC_SYMM_FOLD_AND_DOCK_RELAX-amer_-STIH_tetr_4437_41444_1:
Task ID: 190080492 | WU ID: 171441412 | 6 Sep 2008 11:55:27 UTC to 12 Sep 2008 9:40:47 UTC | Over, Success, Done | CPU: 204,936.90 (56h 55min) | Claimed: 368.54 | Granted: 64.58 (1 model)

# The first task is included solely for comparison of benchmarks.
# The middle task is one of my poorest results as for credits.
# The third task, with a record-breaking model demanding 57 hours of computing, was also on the low side.
(It must be noted that I was the second cruncher of all these wus.)

But one has to be grateful that such models are now checkpointing.

The inability to predict the runtime needed for such large wus is a bit of a problem, as they upset the computer's time schedule and downloading ability. I wonder if we could at least be helped by changing the graphics window to display "Model 1 Step xxxx of totally nnnnn steps". Then one would at least see if the end is near...

Specifications: iBook G4 1.2 GHz PPC, 768 MB RAM, MacOS 10.3.9, BOINC 6.2.18, Rosetta 5.98

Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0	Message 55803 - Posted: 16 Sep 2008, 12:00:06 UTC - in response to Message 49379.
[quote]I'd like to start a thread for reports of long running models.... The theory is that specific models within a task are taking significantly longer than others...[/quote]
Does a long-running model imply something abnormal and erroneous? Help me understand. The project staff has said that the number of proteins, their widely varying sizes and numbers of amino acids, and their varying and increasing complexity mean that producing any single decoy may be a short or a very long process. And this process is being studied and refined.

I'm not trying to be facetious, but why should we care if it takes 2 hours or 6 hours or 12 hours to produce a decoy as long as a decoy is produced? Are they not all equally useful? If there is a program loop or the decoy otherwise gets hung then I can see some reason for concern. Seems like a lot of people complain about long-running models because their chosen run times are simply too short to be viable for the ever larger and more complex proteins that are being studied.

[quote]So, if you see tasks that are averaging more than 2 hours per model, or specific models that are taking more than 2 hours, please report them as per below.[/quote]
I can take the size of an atom and the size of the cosmos and average them also, but what possible use would it have? Given the variety of proteins and complexity and other factors, why would an average of 2 hours have any more meaning than my atom/cosmos average? OTOH, averaging the time it takes to produce decoys for a specific protein might be useful if the times tend to cluster around their average (narrow spread); but like the atom/cosmos example, if the spread is large then the average is less meaningful or even useless.

Speaking for myself, I think you need to be more specific about what you or the project is looking for and how to find it. And are there any particular proteins/tasks that are more suspicious than others? If I run a task that produces 34 decoys, some may have taken 1 hour to produce while others may have taken 10 hours and their average could be under 2 hours; how would I know?
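The point about averages hiding slow decoys can be shown with a few lines of Python. This is only a sketch: the split of 31 decoys at 1 hour and 3 decoys at 10 hours is invented to match the "34 decoys, average under 2 hours" scenario in the post, and the "more than twice the median" cut-off is an assumption, not a project rule.

```python
from statistics import mean, median, pstdev

def summarize(decoy_hours):
    """Return basic statistics for a list of per-decoy runtimes in hours."""
    return {
        "count": len(decoy_hours),
        "mean": mean(decoy_hours),
        "median": median(decoy_hours),
        "max": max(decoy_hours),
        "stdev": pstdev(decoy_hours),
    }

# Invented data: 31 decoys at 1 hour each and 3 decoys at 10 hours each.
hours = [1.0] * 31 + [10.0] * 3
summary = summarize(hours)

# Assumed cut-off: anything slower than twice the median counts as slow.
slow = [h for h in hours if h > 2 * summary["median"]]

print(f"mean {summary['mean']:.2f} h, max {summary['max']:.1f} h, "
      f"stdev {summary['stdev']:.2f} h, slow decoys: {len(slow)}")
```

The mean stays below 2 hours even though three decoys took 10 hours each, which is why the maximum or the spread says more than the average alone.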

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 55805 - Posted: 16 Sep 2008, 12:38:16 UTC - in response to Message 55803.
[quote]Speaking for myself, I think you need to be more specific about what you or the project is looking for and how to find it. And are there any particular proteins/tasks that are more suspicious than others? If I run a task that produces 34 decoys, some may have taken 1 hour to produce while others may have taken 10 hours and their average could be under 2 hours; how would I know?[/quote]
Thanks NBIT. Good points. You are absolutely right, some proteins take longer than others. And Rosetta's batches are typically composed of a number of different proteins.

I opened the thread because it is difficult to know that one out of the 34 decoys took 10 hours. And by soliciting more eyes, I am hoping to bring some specific examples back to the Project Team for them to study further. I've also suggested that the CPU seconds be recorded in the .out file at the time each model is completed, so they will know the exact answer.

The overall objective is to have the majority of work sent on Rosetta complete each model within an hour, so the 2 hrs "average" was meant to say anything that is running more than twice as long as expected. By meeting this objective of 1 hour (or less) per model, the BOINC runtime estimates and the amount of work that it requests and downloads will better match your configured objectives. And the average runtime you see on the client should be very close to your Rosetta runtime preference. It will also help ensure the credit system remains fair, and eliminate as much of the "luck of the draw" as possible from the credit results.

At present, some models take significantly longer than the objective. In order to identify the cause and make program revisions, it will be helpful to identify specific models that are taking longer than desired and rerun them in the lab to understand what they came across that is causing them to take so long. Then methods of handling that set of circumstances more efficiently can usually be identified, and code written to avoid such circumstances if they are typically fruitless, or to seek them out if this seems to produce a better result. Either way, overall runtimes will become more consistent.

So, I won't call the current long-running models a "bug". Instead I would say they represent an area where it is believed improvements can be made. An "enhancement opportunity", you might say.

Rosetta Moderator: Mod.Sense

Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0	Message 55807 - Posted: 16 Sep 2008, 14:14:15 UTC - in response to Message 55805.
Agree, the project software should record and track whatever is necessary to achieve its objectives or observe any anomalies it chooses to define. This is especially true because most contributors only contribute computer time, not personal time.

bob Joined: 11 Dec 05 Posts: 3 Credit: 618,194 RAC: 0	Message 55809 - Posted: 16 Sep 2008, 17:19:08 UTC
Took about 2 hours to get to 95%. Took 8 more hours to get to 99.9%. It finally got to 100% but never finished. Suspending and restarting caused the work to restart at 32%. I aborted this WU (abinitio_nohomfrag_70_A_1qgvA_4466_6257_1).

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 55811 - Posted: 16 Sep 2008, 18:19:16 UTC
A link to Bob's task
BOINC 6.2.18, Win 2000, Rosetta 1.34

The first user to crunch it was on Win Vista, and got an:
Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x007D3863 read attempt to address 0x00000008

Rosetta Moderator: Mod.Sense

William Timbrook Joined: 2 Nov 05 Posts: 3 Credit: 11,623,185 RAC: 0	Message 55823 - Posted: 17 Sep 2008, 1:33:49 UTC Last modified: 17 Sep 2008, 2:23:14 UTC
I've noticed a couple of odd work unit packets, but I think I now have a specific question. Work Unit 175541763 has clocked over 8 hours of CPU time and is 98.001% done. I went to the host machine, and stopped and restarted BOINC. Now the work unit is showing 55.704% done and about 100 minutes of CPU time. Is this a known opportunity?

Thanks,
William

William Timbrook Joined: 2 Nov 05 Posts: 3 Credit: 11,623,185 RAC: 0	Message 55825 - Posted: 17 Sep 2008, 5:18:52 UTC - in response to Message 55823.
After 2 restarts of BOINC (and upgrading to 6.2.18), the unit got stuck with 9:52 to finish. I aborted it.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 55835 - Posted: 17 Sep 2008, 14:49:34 UTC
William, parts of what you describe are normal and expected, and some parts are not. I've moved your posts here to this thread because you appear to have a 3hr runtime (the default) configured for that host, and so the 8hrs you report is well beyond that.

Your task was abinitio_nohomfrag_70_A_1qgvA_4466_9601, v1.34, running BOINC 6.2.18 and Windows 2000. So it ran longer than expected.

The parts of what you describe that are normal are that any time you end BOINC or remove a task from memory (which happens if BOINC switches to running another project, suspending the R@h task, and you are not keeping suspended tasks in memory), you will lose some work. The amount lost depends on when Rosetta was last able to save a checkpoint. And some tasks are able to checkpoint more frequently than others. So, seeing the CPU time reduced (sometimes all the way back to zero) when the task restarts is normal.

The other thing is that the 3 hours you are probably currently seeing as the initial estimated time to completion is just based on your runtime preference (which you can set here on the website in your Rosetta-specific preferences). Actually, it is based on your BOINC client's history of working tasks with your runtime preference. Some tasks take longer than that. So, rather than showing a negative estimated time to complete once the original estimate is reached, the program starts to make time pass slower and slower once it reaches about 10 minutes remaining. So, the part you describe about 10 minutes remaining for an extended period of time is normal as well.

The resulting confusion when tasks go longer than your preference is why I started this thread, and why the Project Team is working to address these long-running models that cause runtimes to be exceeded.

Rosetta Moderator: Mod.Sense
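The slow-down of the estimated time to completion described above can be illustrated with a small Python sketch. This is not the formula Rosetta or BOINC actually uses, which the thread does not give. The function name `displayed_remaining`, the exact 600-second slow-down point and the hyperbolic decay are all assumptions, chosen only to reproduce the described behaviour: the figure counts down normally until about 10 minutes remain, then shrinks ever more slowly and never reaches zero or goes negative.

```python
SLOW_POINT = 600.0  # seconds; "about 10 minutes remaining" (assumed exact value)

def displayed_remaining(estimate_s, elapsed_s):
    """Remaining time to show, in seconds (hypothetical formula)."""
    raw = estimate_s - elapsed_s
    if raw >= SLOW_POINT:
        # Plenty of time left: show the plain difference.
        return raw
    # Past the slow-down point: decay towards zero without ever reaching it.
    overrun = SLOW_POINT - raw
    return SLOW_POINT * SLOW_POINT / (SLOW_POINT + overrun)

# A task with a 3-hour estimate that actually runs for 8 hours.
estimate = 3 * 3600
for hours in (1, 2.5, 3, 4, 8):
    shown = displayed_remaining(estimate, hours * 3600)
    print(f"{hours:>4} h elapsed: {shown / 60:7.2f} min shown")
```

With any scheme of this kind, a task that overruns its estimate by hours sits in the last few minutes of displayed time for most of its run, which matches the reports earlier in the thread.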