Minirosetta v1.40 bug thread

Message boards : Number crunching : Minirosetta v1.40 bug thread



Christian

Joined: 11 Jun 06
Posts: 1
Credit: 215,203
RAC: 0
Message 57112 - Posted: 21 Nov 2008, 0:31:39 UTC

ALLCON,
For some reason Minirosetta v1.40 continues to lock up my machine. This condition has existed for about two weeks now. It gets a little exasperating when I need my machine to do something and have to reboot it... consequently I will be discontinuing running BOINC and Rosetta until someone cleans up the problem, never mind the CPU tasking!

Hardware stats:

EVGA nForce 590 SLI mobo
AMD Athlon 64 x2 5000+
Corsair XMS 4 GB RAM (2 x 2 GB)
EVGA/Nvidia 8800 GT (512 MB x 2 in SLI)
BFG 650 W PSU
etc...

Software stats:
Win XP SP2 (up to date)
Trend Micro IS 2009 (up to date)

This is a fairly new build (6 months) and has very little in the way of garbage on it. I've never had a problem with BOINC or Rosetta before these past few weeks with the introduction of Minirosetta v1.40.
ID: 57112 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 57113 - Posted: 21 Nov 2008, 0:33:01 UTC - in response to Message 57106.  

I, too, have 8 WUs of Mini 1.40 that have been in progress for 15+ hours and are stuck above 98% completion, still showing 9 hours 57 minutes left to completion. In fact, the time to completion hasn't changed in over 10 hours.

I don't think my problem is for lack of RAM as I have 24GB installed. I am running Vista X64 on twin dual-core Xeons at 3.0GHz.

I have suspended all but one WU and I have bumped the task priority by two levels, just to see if I could hasten this one WU along. It doesn't seem to be helping as my CPUs are hardly even taxed at this point. So the problem does not seem to be a shortage of compute power. And I have over 1 Terabyte of free disk space. So it can't be for a lack of disk space either.

n2n


Are you sure that isn't 9 minutes 57 seconds to go? Rosetta@home workunits tend to stick at about that estimated time to go if they come with a serious underestimate of how much CPU time they need, until they finally reach a time when the actual time to go is less than that.

I've read of some of these workunits needing about 800 MB to run well, but if any one core on your machine can get this much, suspending jobs on other cores won't help it run any faster. Exception: If some of the reported cores on your machine are due to hyperthreading, telling it to use only as many cores as are available without hyperthreading often at least doubles the speed on that number of cores.

I've seen one of these workunits that seemed to get stuck actually take about 19.5 hours CPU time, but it seemed to complete normally otherwise. It got a bad credit granted to credit requested ratio, though.

Adjusting my settings so that BOINC is allowed to use more than the default of about 10 GB of disk space seemed to help my more recent jobs, though.
ID: 57113 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 57114 - Posted: 21 Nov 2008, 0:50:36 UTC - in response to Message 57112.  

ALLCON,
For some reason Minirosetta v1.40 continues to lock up my machine. This condition has existed for about two weeks now. It gets a little exasperating when I need my machine to do something and have to reboot it... consequently I will be discontinuing running BOINC and Rosetta until someone cleans up the problem, never mind the CPU tasking!

Software stats:
Win XP SP2 (up to date)
Trend Micro IS 2009 (up to date)


Christian,

Who or what is ALLCON?

Is that a 32-bit version of Windows XP SP2, which is unlikely to be able to actually use more than about 3.5 GB of your RAM, or a 64-bit version, which can use more of it?

When I had a similar problem on my machine, I found that it was helpful to tell BOINC that it could make use of more of my disk space than the default of about 10 GB; I had significantly more free disk space than that.
ID: 57114 · Rating: 0
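
For anyone wondering where the "about 3.5 GB" figure for 32-bit Windows comes from, here is a rough sketch of the arithmetic. A 32-bit address space covers 4 GiB in total, and part of it is reserved for devices such as graphics cards. The 0.5 GiB reservation below is a hypothetical round number chosen only to illustrate the point; the real amount depends on the hardware and is likely to differ on an SLI system.

```python
# Address-space arithmetic for a 32-bit operating system.
ADDRESS_BITS = 32
GIB = 2 ** 30

# Total number of byte addresses a 32-bit pointer can express, in GiB.
addressable_gib = 2 ** ADDRESS_BITS / GIB

# Hypothetical figure: address space reserved for graphics cards, PCI
# devices and firmware. The real amount varies from machine to machine.
reserved_gib = 0.5

usable_ram_gib = addressable_gib - reserved_gib
print(f"addressable: {addressable_gib:.1f} GiB, usable RAM: about {usable_ram_gib:.1f} GiB")
```

With two graphics cards installed, the reserved range may well be larger than assumed here, which would leave even less of the 4 GB visible to the operating system.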
Martin Johnson

Joined: 18 Oct 05
Posts: 19
Credit: 171,164
RAC: 0
Message 57115 - Posted: 21 Nov 2008, 0:53:58 UTC

1.4 units WILL NOT STOP / wait / suspend.
So my Rosetta RAC is rising, and the others are falling !!!
ID: 57115 · Rating: 0
Sarel

Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 57116 - Posted: 21 Nov 2008, 1:08:03 UTC

Sorry for being away for a while. I was busy in the wet lab testing some of my older designs (some of which show promise! when I get verification on this, I'll post an update on the protein-protein interactions thread).

Using the information that you posted on this thread, I've been able to reproduce on the lab's machines the long run-time problems that you have reported. I now have a good idea of how to avoid such occurrences, so that future runs will not be poorly behaved. We have also found a way to lower the memory signature of our runs, but for at least a while we'll keep the current 512 MB restriction, just in case. We will probably submit more protein-interface jobs to BOINC over the next week or so, and I will look for your messages to see whether we've completely resolved this issue.

So, I'm planning to sift through the 500 thousand designed models that you have produced over the next few days and am extremely excited about seeing all these new possibilities!
ID: 57116 · Rating: 0
Martin Johnson

Joined: 18 Oct 05
Posts: 19
Credit: 171,164
RAC: 0
Message 57117 - Posted: 21 Nov 2008, 1:57:09 UTC

What about this "refusal to stop" issue?
ID: 57117 · Rating: 0
Jim_Clark

Joined: 11 Sep 07
Posts: 7
Credit: 38,439
RAC: 0
Message 57121 - Posted: 21 Nov 2008, 4:22:16 UTC
Last modified: 21 Nov 2008, 4:26:23 UTC

On my AMD Athlon 64 X2 Dual Core with Windows XP Pro SP3, and with 2 GB RAM and 100 GB available HD, no Rosetta Mini WU of any version has ever completed successfully since Rosetta Mini came into existence. They fail with a compute error or sometimes lock up my computer after wasting time that could be applied to WUs that can complete OK.

So I abort all Rosetta Mini WUs until I finally get a Rosetta Beta WU. This is a lot of work, since I generally need to abort about 30 or more Rosetta Mini WUs to get one Rosetta Beta WU. About once a week, I allow one Rosetta Mini WU to run, to see if the problem is fixed yet -- which hasn't happened yet.

Other project sites such as World Community Grid and PrimeGrid allow me to choose which applications my computer will run. Why can't Rosetta provide this feature, too? I would like to run the Rosetta Beta WUs, but if I get tired of aborting hundreds of Rosetta Mini WUs, I may feel forced to abandon Rosetta altogether.

ID: 57121 · Rating: 0
Speedy

Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 173
Message 57122 - Posted: 21 Nov 2008, 5:25:00 UTC

This isn't a bug. Is there a way to delete old database files without the project re-downloading them once you have restarted BOINC? I have database rev 23035 25/6, 23035 7/8 & 25538 11/11, in total 54.8 MB. Do I need them all?
Thanks for any advice
Have a crunching good day!!
ID: 57122 · Rating: 0
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57125 - Posted: 21 Nov 2008, 6:23:21 UTC - in response to Message 57117.  

What about this "refusal to stop" issue?


Yes, and what about the recurring "exited with zero status but no 'finished' file" issue?
ID: 57125 · Rating: 0
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57126 - Posted: 21 Nov 2008, 6:28:43 UTC - in response to Message 57121.  

On my AMD Athlon 64 X2 Dual Core with Windows XP Pro SP3, and with 2 GB RAM and 100 GB available HD, no Rosetta Mini WU of any version has ever completed successfully since Rosetta Mini came into existence. They fail with a compute error or sometimes lock up my computer after wasting time that could be applied to WUs that can complete OK.

So I abort all Rosetta Mini WUs until I finally get a Rosetta Beta WU. This is a lot of work, since I generally need to abort about 30 or more Rosetta Mini WUs to get one Rosetta Beta WU. About once a week, I allow one Rosetta Mini WU to run, to see if the problem is fixed yet -- which hasn't happened yet.

Other project sites such as World Community Grid and PrimeGrid allow me to choose which applications my computer will run. Why can't Rosetta provide this feature, too? I would like to run the Rosetta Beta WUs, but if I get tired of aborting hundreds of Rosetta Mini WUs, I may feel forced to abandon Rosetta altogether.


Well, well, it seems I'm not alone here.
ID: 57126 · Rating: 0
Martin Johnson

Joined: 18 Oct 05
Posts: 19
Credit: 171,164
RAC: 0
Message 57127 - Posted: 21 Nov 2008, 7:12:03 UTC

No, you're not. But no one has yet admitted this is a problem,
so I too will be forced to abort until it is settled.
ID: 57127 · Rating: 0
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57128 - Posted: 21 Nov 2008, 9:27:29 UTC

I generally need to abort about 30 or more Rosetta Mini WUs to get one Rosetta Beta WU. About once a week, I allow one Rosetta Mini WU to run, to see if the problem is fixed yet -- which hasn't happened yet.

Well, well, it seems I'm not alone here.

No, you're not. But no one has yet admitted this is a problem,
so I too will be forced to abort until it is settled.


Abort. Abort. Abort.

I too am starting to feel like an abort-robot.
ID: 57128 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 57129 - Posted: 21 Nov 2008, 10:56:50 UTC

A workunit that refused to give up its CPU core when its timeslot ended and it was time for a workunit from a different BOINC project to take over that CPU core:

11/20/2008 10:46:25 PM|rosetta@home|Starting loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t326__olange_IGNORE_THE_REST_2GHRA_8_4830_404_0
11/20/2008 10:46:26 PM|rosetta@home|Starting task loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_cheat_chunk_control_standard_loopfiles_t326__olange_IGNORE_THE_REST_2GHRA_8_4830_404_0 using minirosetta version 140

I've now told the BOINC interface to suspend both that task and the whole Rosetta@home project, but it's still taking about half the CPU time on that CPU core.

I'm using the leave-in-memory option, in case that matters. The BOINC version is 5.10.45, under 32-bit Windows Vista SP1.
ID: 57129 · Rating: 0
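
The two log lines quoted in the post above follow a "timestamp|project|message" layout, so anyone collecting evidence of tasks that refuse to stop could pull the start time and task name out of a saved message log. The following is a minimal sketch, assuming the timestamp format seen in this post (month/day/year with an AM/PM clock) and an English locale; the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

def parse_log_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    stamp, project, message = line.split("|", 2)
    # Assumed format, based on the lines in the post: 11/20/2008 10:46:26 PM
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

def started_task(message):
    """Return the task name from a 'Starting task <name> using ...' message, else None."""
    prefix = "Starting task "
    if not message.startswith(prefix):
        return None
    return message[len(prefix):].split(" using ", 1)[0]

line = (
    "11/20/2008 10:46:26 PM|rosetta@home|Starting task "
    "loopbuild_minimalist_core_control_standardloopfile2_homo_bench_looprelax_"
    "cheat_chunk_control_standard_loopfiles_t326__olange_IGNORE_THE_REST_"
    "2GHRA_8_4830_404_0 using minirosetta version 140"
)

when, project, message = parse_log_line(line)
print(when.isoformat(), project, started_task(message))
```

Comparing the parsed start time with the time at which the task was suspended gives a concrete figure for how long it kept running, which is more useful in a bug report than "it won't stop".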
A Few Good Men

Joined: 25 Mar 07
Posts: 14
Credit: 2,031,382
RAC: 0
Message 57133 - Posted: 21 Nov 2008, 14:57:58 UTC

72 hours of crunching on a QX6700 with BOINC 6.2.19 and Minirosetta 1.40, and I have received 25 credits. "Computational errors". I have reset the project 3 times.
ID: 57133 · Rating: 0
droople

Joined: 19 Aug 08
Posts: 18
Credit: 3,330,109
RAC: 9
Message 57134 - Posted: 21 Nov 2008, 15:19:25 UTC

Hi

I ran a Minirosetta WU and got the following error message:
https://boinc.bakerlab.org/rosetta/result.php?resultid=206624050

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 3.59375 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>1hzh_1xk5_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_206_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>


I did check "Keep application in memory while preempted", but still got this error.
Any idea?

Cheers
ID: 57134 · Rating: 0
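
When comparing several failed results like the one above, it can help to pull the file name and error code out of each stderr dump. This is a small sketch that does so with a regular expression rather than an XML parser, on the assumption that the pasted text is only a fragment and may not be well-formed XML. The function name is hypothetical; the sample text is copied from the post.

```python
import re

def transfer_errors(stderr_text):
    """Return (file_name, error_code) pairs for each <file_xfer_error> block."""
    pattern = re.compile(
        r"<file_xfer_error>\s*"
        r"<file_name>(.*?)</file_name>\s*"
        r"<error_code>(-?\d+)</error_code>\s*"
        r"</file_xfer_error>",
        re.DOTALL,
    )
    return [(name, int(code)) for name, code in pattern.findall(stderr_text)]

sample = """<message>
<file_xfer_error>
<file_name>1hzh_1xk5_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_206_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>"""

errors = transfer_errors(sample)
print(errors)
```

The meaning of a given code is not decoded here; that would need to be looked up in the BOINC client's own list of error numbers.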
Daniel Kohn

Joined: 30 Dec 05
Posts: 18
Credit: 2,899,939
RAC: 0
Message 57135 - Posted: 21 Nov 2008, 15:31:08 UTC - in response to Message 57129.  

I noticed the other day that I "Snoozed" BOINC and one of my 2 Rosetta work-units kept crunching anyway.

ID: 57135 · Rating: 0
craig_bye

Joined: 30 Nov 06
Posts: 1
Credit: 84,909
RAC: 0
Message 57136 - Posted: 21 Nov 2008, 15:46:11 UTC

I too keep seeing an issue that the Rosetta Mini 1.40 just keeps running although BOINC reports it as "Waiting to Run". I've seen this twice now and I end up having to kill off the minirosetta_1.40_windows_intelx86.exe process.
ID: 57136 · Rating: 0
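
For those who end up killing the runaway process by hand, as described above, the same thing can be scripted. The sketch below builds a Windows `taskkill` command for the executable name given in the post and only runs it when the script is started with a `--kill` argument (a made-up flag for this example). It assumes `taskkill` is available on the system, which may not be the case on every Windows edition. Work done since the task's last checkpoint is likely to be lost when the process is ended this way.

```python
import subprocess
import sys

# Process name as reported in the post above.
IMAGE_NAME = "minirosetta_1.40_windows_intelx86.exe"

def taskkill_command(image_name):
    """Build a taskkill command that force-ends every process with this image name."""
    return ["taskkill", "/F", "/IM", image_name]

command = taskkill_command(IMAGE_NAME)
print(" ".join(command))

# Only run the command on Windows, and only when explicitly asked to.
if sys.platform == "win32" and "--kill" in sys.argv:
    subprocess.run(command, check=False)
```

Running the script without `--kill` just prints the command, so it can be checked before anything is terminated.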
sarha1

Joined: 23 Sep 05
Posts: 5
Credit: 6,339,735
RAC: 0
Message 57137 - Posted: 21 Nov 2008, 16:18:13 UTC

Really, "loopbuild_" WUs seem to ignore all the requests to suspend and use the full CPU.
ID: 57137 · Rating: 0
adrianxw

Joined: 18 Sep 05
Posts: 650
Credit: 11,637,805
RAC: 799
Message 57138 - Posted: 21 Nov 2008, 16:37:49 UTC
Last modified: 21 Nov 2008, 16:41:09 UTC

Crashed-out WUs:

208927642 Watchdog after 20,991 seconds?
208802490 after 16,202 seconds NAN in HBonding.
208717837 after 7,255 seconds NAN in HBonding.

Machines are all set to a 6-hour WU time. Leave in memory. Core client 6.2.19. All "loopbuild_....." WUs - man, they have long names.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 57138 · Rating: 0
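
To put the three crash times above in context, here is a quick conversion of the reported seconds into hours and into a share of the 6-hour run-time setting mentioned in the post. The numbers are copied from the post; the only assumption is that the reported seconds are CPU time counted against that 6-hour target.

```python
TARGET_HOURS = 6  # run-time preference quoted in the post

# (result ID, seconds at failure, reported reason), copied from the post
crashed = [
    (208927642, 20991, "Watchdog"),
    (208802490, 16202, "NAN in HBonding"),
    (208717837, 7255, "NAN in HBonding"),
]

for result_id, seconds, reason in crashed:
    hours = seconds / 3600
    share = hours / TARGET_HOURS
    print(f"{result_id}: {hours:.2f} h ({share:.0%} of target), {reason}")

# Total time spent on results that ended in an error.
total_hours = sum(seconds for _, seconds, _ in crashed) / 3600
print(f"total time spent on crashed results: {total_hours:.2f} h")
```

The first result ran for roughly 5.8 hours, so the watchdog fired shortly before the 6-hour target, while the other two failed at about 4.5 and 2.0 hours.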
rochester new york

Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57141 - Posted: 21 Nov 2008, 17:33:51 UTC

ID: 57141 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org