Rosetta@home

Problems with Minirosetta v1.54

Mike Tyka
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Oct 20 05
Posts: 95
ID: 5612
Credit: 2,190
RAC: 0
Message 59045 - Posted 26 Jan 2009 22:45:57 UTC
Last modified: 27 Jan 2009 1:36:31 UTC

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank-you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but we will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark (_rlbd_) jobs - caused by a buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long-running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words, the WUs should not overrun by more than around 4 hours. If they do, please let us know!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address problems with excessively large uploads.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many potential problems will now show up as meaningful error messages and not random segmentation faults.
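
To make the watchdog and checkpoint items above concrete, here is a minimal Python sketch of the two rules as they are described in this changelog. It is not minirosetta code: the function names and the assumption of evenly spaced checkpoints are made up for illustration.

```python
# Hypothetical sketch of the rules described in the changelog; not minirosetta code.
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "preferred runtime + 4 hours"


def watchdog_should_abort(cpu_seconds, preferred_runtime_seconds):
    """Return True once a task has overrun its preferred runtime by more than 4 hours."""
    return cpu_seconds > preferred_runtime_seconds + WATCHDOG_GRACE_SECONDS


def time_lost_on_restart(cpu_seconds, checkpoint_interval_seconds):
    """CPU time redone after a restart, assuming evenly spaced checkpoints (an assumption)."""
    return cpu_seconds % checkpoint_interval_seconds


preferred = 14400  # a 4 hour preference, in seconds
print(watchdog_should_abort(28000, preferred))  # still inside the 4 hour grace period
print(watchdog_should_abort(29000, preferred))  # past preferred runtime + 4 hours
print(time_lost_on_restart(5000, 600))          # never more than one checkpoint interval
```

The second function is why denser checkpoints shrink the "backjumping": the time lost can never exceed the interval between two checkpoints.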



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. In particular, I'd like to know about


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59047 - Posted 26 Jan 2009 23:40:44 UTC

The link in the news item that should bring you to this thread is truncated.
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59048 - Posted 27 Jan 2009 0:21:40 UTC
Last modified: 27 Jan 2009 0:24:04 UTC

The news item also shows the year as 2008 (which is probably the last time you had enough coffee to be able to read the calendar!!). All these improvements are going to send TeraFLOPS much higher! Nice work Mike, and BakerLab. I can really see that you've come through for people here.
____________
Rosetta Moderator: Mod.Sense

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 17
ID: 93705
Credit: 164,501
RAC: 18
Message 59074 - Posted 27 Jan 2009 22:46:52 UTC

Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
____________


Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59086 - Posted 28 Jan 2009 6:50:38 UTC

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59088 - Posted 28 Jan 2009 9:55:25 UTC

I don't know about others but my Rosetta machines are running dry!!! The new minirosetta is stuck downloading at 89.25% and has been there for HOURS!!!!
I have had to attach to a different project until it gets sorted out. So far all machines, exact same problem, one a dual core one a single core. If you look at my computers, they are not hidden, any task that says "outcome unknown" is because the mini-rosetta download ain't happening!!!! Message in Boinc says:
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
etc, etc, etc, etc forever!!!!
Another project now loves you!!
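
For anyone who wants to tally messages like these rather than eyeball them, here is a small Python sketch. It assumes only the `timestamp|project|message` layout visible in the lines above; the function name and the "Temporarily failed download of" prefix matching are choices made for this example, not part of BOINC.

```python
from collections import Counter

# Sample lines copied from the BOINC messages quoted in this post.
LOG = """\
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
"""

PREFIX = "Temporarily failed download of "


def failure_reasons(log_text):
    """Count (filename, reason) pairs for every 'Temporarily failed download' message."""
    reasons = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp, project, message
        if len(parts) != 3:
            continue
        message = parts[2]
        if message.startswith(PREFIX):
            # The reason is whatever follows the last ": " in the message.
            filename, _, reason = message[len(PREFIX):].rpartition(": ")
            reasons[(filename, reason)] += 1
    return reasons


for (filename, reason), count in failure_reasons(LOG).items():
    print(filename, "|", reason, "|", count)
```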
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59093 - Posted 28 Jan 2009 12:27:44 UTC

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
____________
Rosetta Moderator: Mod.Sense

Mike Tyka
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Oct 20 05
Posts: 95
ID: 5612
Credit: 2,190
RAC: 0
Message 59103 - Posted 28 Jan 2009 19:59:21 UTC

Paul,

Can you point me to what you read about lock-file problems on Einstein?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU? If I can make this happen here on my machine I could learn more about what's going on.

Mike

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59106 - Posted 28 Jan 2009 20:20:22 UTC

What do you mean by 100% CPU ?


"computing preferences" configured on website for the venue of the machine. The setting is called "Use at most" at the bottom of the processor usage section.

Can also be configured via the BOINC Manager for a specific host.
____________
Rosetta Moderator: Mod.Sense

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59108 - Posted 28 Jan 2009 22:10:00 UTC - in response to Message ID 59093.

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

I do not use a proxy, just straight to the net. I use Comcast.

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. The one I am looking at right now has been trying for 11:51:02 and is going to retry in 03:34:34, and counting.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.

Yes I have, no luck, the file is stuck at 89.25, 89.26 or 89.27% depending on the pc. I am stuck at exactly 5.85 meg of 6.56 meg on all machines.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59110 - Posted 28 Jan 2009 22:21:08 UTC

mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>

If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
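
As a rough sketch of what such a file could contain, the Python below builds a minimal cc_config.xml with only the file transfer debug flag turned on. The `<cc_config>` and `<log_flags>` wrapper elements reflect my understanding of the BOINC client configuration layout and should be checked against the BOINC documentation; the function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_cc_config(file_xfer_debug=True):
    """Return cc_config.xml text with the file transfer debug log flag set.

    Assumes log flags live under <cc_config><log_flags>; verify against the BOINC docs.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "file_xfer_debug")
    flag.text = "1" if file_xfer_debug else "0"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

The printed text would be saved as cc_config.xml in the BOINC data directory, and the client has to re-read its config file (or be restarted) before the extra messages show up.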
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 130
ID: 44923
Credit: 951,788
RAC: 618
Message 59112 - Posted 28 Jan 2009 23:24:06 UTC

I'm seeing a validate error on task 224245929 , workunit 204213187, Mac OS X 10.4.11. The task name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_115354_1 : it ran twice as long as it was supposed to and I was the second person to get it. The original person to whom it was sent also got the same validate error: irritating after it took twice as long as it was supposed to. It seems to be one of these zinc-containing proteins that have a habit of doing this.

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 1:26:32:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish

</stderr_txt>
]]>

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59116 - Posted 29 Jan 2009 0:11:48 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 13,026,869
RAC: 16,575
Message 59119 - Posted 29 Jan 2009 2:38:00 UTC - in response to Message ID 59108.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.

Sid Celery

Joined: Feb 11 08
Posts: 366
ID: 241409
Credit: 1,126,789
RAC: 1,954
Message 59121 - Posted 29 Jan 2009 2:51:39 UTC

Long-running model reported in the appropriate thread here



____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 13,026,869
RAC: 16,575
Message 59122 - Posted 29 Jan 2009 2:57:50 UTC

I'm seeing a number of WUs ending at 99 models. They are ending normally, but they often take less than half my 12 hour (43,200 sec) preference.

Some examples:
http://boinc.bakerlab.org/rosetta/result.php?resultid=223957908
http://boinc.bakerlab.org/rosetta/result.php?resultid=223968996
http://boinc.bakerlab.org/rosetta/result.php?resultid=223981088
http://boinc.bakerlab.org/rosetta/result.php?resultid=223989528
http://boinc.bakerlab.org/rosetta/result.php?resultid=223997524
http://boinc.bakerlab.org/rosetta/result.php?resultid=224065056

Mike Tyka
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Oct 20 05
Posts: 95
ID: 5612
Credit: 2,190
RAC: 0
Message 59123 - Posted 29 Jan 2009 3:10:15 UTC - in response to Message ID 59122.

I'm seeing a number of WUs ending at 99 models. They are ending normally, but they often take less than half my 12 hour (43,200 sec) preference.

Some examples:
http://boinc.bakerlab.org/rosetta/result.php?resultid=223957908
http://boinc.bakerlab.org/rosetta/result.php?resultid=223968996
http://boinc.bakerlab.org/rosetta/result.php?resultid=223981088
http://boinc.bakerlab.org/rosetta/result.php?resultid=223989528
http://boinc.bakerlab.org/rosetta/result.php?resultid=223997524
http://boinc.bakerlab.org/rosetta/result.php?resultid=224065056


Sorry, I should have mentioned there is a new rule. Mini will not produce more than 99 models. It will finish gracefully and grant full credit. The reason for this is that I want to prevent your individual uploads from getting too large. In the future there will be a better way to do this, such as checking that the output file size has not reached some limit.
It's just another safety hook that's been put in to prevent WUs from misbehaving.
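
A minimal sketch of that stopping rule, in Python rather than the real Rosetta code. The 99 model cap is the rule described above; the optional output size check stands in for the possible future improvement, and stopping once the preferred runtime is used up is an assumption about how a WU normally ends. All names are hypothetical.

```python
MAX_DECOYS = 99  # cap on models per WU, as described above


def should_stop(decoys_done, cpu_seconds, preferred_runtime_seconds,
                output_bytes=0, max_output_bytes=None):
    """Decide whether a WU should finish gracefully after the current model.

    max_output_bytes is a hypothetical future limit; None disables that check.
    """
    if decoys_done >= MAX_DECOYS:
        return True  # model cap reached, end gracefully with full credit
    if max_output_bytes is not None and output_bytes >= max_output_bytes:
        return True  # output would get too large to upload comfortably
    return cpu_seconds >= preferred_runtime_seconds  # assumed normal end of a WU


# 99 models done in well under half of a 12 hour (43,200 sec) preference:
print(should_stop(99, 20000, 43200))
```

This matches the reports above: a fast-running WU can hit the model cap long before the preferred runtime is reached.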

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 17
ID: 93705
Credit: 164,501
RAC: 18
Message 59124 - Posted 29 Jan 2009 5:11:01 UTC

Hello everyone.
No problems for me receiving Minirosetta v1.54 WUs.
I received 17 WUs, due by February 6, 2009 at 21:28:04 (France time).
The first calculations should begin today (January 29), and if there are problems I will report them here.
____________


Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59126 - Posted 29 Jan 2009 6:49:25 UTC - in response to Message ID 59103.

Paul,

can you point me to the thing you read about Lockfile problems on Einstein !?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.

Mike


Two places to start are here and here ... I can also report that since I made that change I have been getting good results on Win XP systems ... I cannot see the high error rate I had in the past as the tasks have been purged ...

It seemed to me to be a problem I had on XP and it was most severe on the i7 where there are more things going on ...

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59128 - Posted 29 Jan 2009 11:34:33 UTC - in response to Message ID 59116.
Last modified: 29 Jan 2009 12:05:38 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.

Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!

Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59129 - Posted 29 Jan 2009 11:35:24 UTC - in response to Message ID 59119.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.
____________

Greg_BE

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59132 - Posted 29 Jan 2009 13:26:47 UTC - in response to Message ID 59129.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59133 - Posted 29 Jan 2009 13:30:28 UTC - in response to Message ID 59128.

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.

Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!

Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!


Change #3....I downloaded and installed the latest version of DirectX, no changes noted.

Change #4....I installed Boinc 6.6.3, got this message "1/29/2009 8:28:31 AM|rosetta@home|Scheduler request completed: got 0 new tasks". I may have errored out all my available work for the day. No files downloading, so maybe it will take this time? No clue.
____________

mikey
Avatar

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59134 - Posted 29 Jan 2009 13:32:40 UTC - in response to Message ID 59110.
Last modified: 29 Jan 2009 13:39:56 UTC

mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>

If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.


Okay, I have downloaded the file and put it in the Boinc\Data directory. I took out the asterisks and changed the <file_xfer_debug> line to a 1; it was a zero.
As for the http setting I use Firefox 3.0.5 and do not see that setting. I know it is/was in IE, but I do not see it in Firefox.
____________

Scott A. Howard

Joined: Oct 16 05
Posts: 2
ID: 4994
Credit: 307,651
RAC: 66
Message 59135 - Posted 29 Jan 2009 14:38:20 UTC
Last modified: 29 Jan 2009 15:00:22 UTC

Hello,

Here's the problem in a nutshell.

On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (for a total of 8 cores) running on XP Pro SP3, almost every one of the Rosetta jobs (minirosetta version 1.54) fails. The typical failure mode is that they exceed their CPU time allocation. For example, if the job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress, other times they are stuck at zero.

Also, the exe is not removed from memory when the computer is in use.

I have reset the project and detached and attached again but it continues to happen.

Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines.

Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration - maybe a coding error showing up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 - 8).

I have a full development environment and debuggers if you want some traces.

Scott Howard


Addendum: Now that I thought about it a little more, does the app use any global resource locking? E.g., mutexes, semaphores, file access? Maybe that's why the progress is halted, it's deadlocked - but I am not sure why the task would continue to use CPU time though. Just some random thoughts...
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59136 - Posted 29 Jan 2009 15:28:39 UTC
Last modified: 29 Jan 2009 15:31:55 UTC

mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.

Once you have the file in the directory, abort the transfer.

You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.

The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.

Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?

Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
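
If it helps, the ping and tracert step can be scripted. The Python sketch below only builds and runs the standard commands (`ping -n` and `tracert` on Windows, `ping -c` and `traceroute` elsewhere); the function names are made up, and the host you pass in should be whatever host name the file transfer debug messages show. boinc.bakerlab.org is used below only as a placeholder taken from this thread.

```python
import platform
import subprocess


def diagnostic_commands(host, system=None):
    """Return the ping and traceroute command lines for the given platform."""
    system = system or platform.system()
    if system == "Windows":
        return [["ping", "-n", "4", host], ["tracert", host]]
    return [["ping", "-c", "4", host], ["traceroute", host]]


def run_diagnostics(host):
    """Run each command and print its output so it can be pasted into a post."""
    for command in diagnostic_commands(host):
        print("$", " ".join(command))
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except FileNotFoundError:
            print("command not installed on this machine")
            continue
        print(result.stdout or result.stderr)
```

Calling `run_diagnostics("boinc.bakerlab.org")` prints both results in one go; the trace can take a minute or two to finish.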
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59137 - Posted 29 Jan 2009 15:53:14 UTC - in response to Message ID 59074.

Hello, the version 1.47 was very well for me with 151 Workunits and 0 errors and an average CPU time 2.8 hours. Hope that the new version 1.54 will be as well...


1.47 worked rather well for me, with perhaps one out of ten workunits giving an error. Not enough 1.54 workunits yet to say whether 1.54 is better. I'm asking for 14 hour workunits, so it will take me longer to run that many.

Scott A. Howard

Joined: Oct 16 05
Posts: 2
ID: 4994
Credit: 307,651
RAC: 66
Message 59138 - Posted 29 Jan 2009 16:00:11 UTC - in response to Message ID 59135.
Last modified: 29 Jan 2009 16:09:39 UTC

Here's a follow up.

I did the following:
1) detached from the project.
2) removed the Rosetta project folder from under \Boinc\...
3) removed all files from a slot that contained Rosetta data
4) reattached to the project
5) allowed for 50% of the cpus to be used (4 in this case)
6) allowed the four projects to run - each expected to take about 4 hours

Observed results: The status for the projects are "Running, high priority", each has used about 20 minutes of cpu time, the progress is 0.000%

Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.

It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles.

I see the thread that's consuming the CPU has a pretty regular call stack. Here is the call stack. If you have your debug symbols for your build, you should be able to locate the routine and line at which the program is hung...

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
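
In case it is useful on your end, here is a small Python sketch that pulls the module name and offset out of frames in the format shown above, so the first frame inside the application can be fed to a symbol lookup. The regular expression and function names are mine, written only against the frame layout in this post.

```python
import re

# module, optional !symbol, then +0x<hex offset>
FRAME = re.compile(
    r"^(?P<module>[^!+\s]+)(?:!(?P<symbol>\w+))?\+0x(?P<offset>[0-9a-fA-F]+)"
)

# A few frames copied from the call stack above.
STACK = """\
ntkrnlpa.exe!KiSwapContext+0x2f
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
"""


def parse_frames(text):
    """Return (module, symbol or None, offset) for each recognisable frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((match["module"], match["symbol"], int(match["offset"], 16)))
    return frames


def first_frame_in(frames, module_prefix):
    """Return (module, offset) of the topmost frame whose module starts with the prefix."""
    for module, _symbol, offset in frames:
        if module.startswith(module_prefix):
            return module, offset
    return None


module, offset = first_frame_in(parse_frames(STACK), "minirosetta")
print(module, hex(offset))
```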
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59139 - Posted 29 Jan 2009 16:06:12 UTC - in response to Message ID 59136.
Last modified: 29 Jan 2009 16:46:54 UTC

mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.

Once you have the file in the directory, abort the transfer.

You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.

The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.

Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?

Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?


I changed the dual core settings to use both cores, this is a laptop and I do not like stressing it that much, and set the other project to no new work. I updated Rosetta and it proceeded to download new work. The same file stopped at the same place, 89.25%. I aborted it, after all other files were done downloading, and no new entries showed up in the cc_config.xml file.
I was browsing thru the stdout.txt file and found this:
9:21:33 AM: Error: can't open file 'C:\Boinc\\RebootPending.txt' (error 2: the system cannot find the file specified.)
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'

[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init boinc_socket returned 444

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init attempting connect

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init_poll sock = 444
It is in there many, many times.

I do not see what server I am downloading from, and only use the Windows firewall, so unless I could block thru the Hosts files, I do not know how to block that particular server anyway.

Yes each retry deferral is about 4 minutes.

I did find one more thing in that stdoutgui.txt file:
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'

[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect on 524 returned -1

It is also in there many, many times. I did a search and found where it said to change the attributes for the Boinc directory and all subdirectories. It was set to read only and when I unchecked that and changed it also for all subdirectories, Boinc will not run. It also auto defaults back to read only after it errors out.
DO NOT DO THIS LAST PART It crashed my whole Boinc setup and I had to delete the Boinc directory, and all subdirectories, then reboot and then reinstall Boinc from scratch. FORTUNATELY it did a repair install instead of a brand new install from scratch! I lost all workunits from all projects though!!!! I attached to Rosetta and guess what? The EXACT SAME FILE is stuck at the EXACT SAME PLACE!!! A TON of files are downloading besides just that one, but that one is stuck all over AGAIN!!!
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59140 - Posted 29 Jan 2009 16:07:39 UTC - in response to Message ID 59132.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59141 - Posted 29 Jan 2009 16:12:43 UTC - in response to Message ID 59128.

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.


What antivirus program do you have, and what version? Some antivirus programs don't fully turn off when you try to turn them off; they stop reporting that they have found a virus, but don't stop looking for a virus.

I'm also running Ad-Aware, but without this problem, so this antispyware program is less likely to be causing the problem.

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59142 - Posted 29 Jan 2009 16:23:13 UTC - in response to Message ID 59129.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59144 - Posted 29 Jan 2009 16:52:02 UTC - in response to Message ID 59142.
Last modified: 29 Jan 2009 16:53:29 UTC

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59145 - Posted 29 Jan 2009 16:58:30 UTC
Last modified: 31 Jan 2009 19:52:02 UTC

Scott Howard:

Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.


There are many many BOINC settings possible and you've not described any of yours. When you set BOINC to run based on preference, you are telling it to only use CPU on the days and during the hours you've configured. If you've configured it to not be running at the current time or day of the week, it will suspend the currently active tasks. Any time a task is suspended, it will not make any progress. And there is a memory setting for whether or not tasks should remain "in memory" (virtual memory) while suspended. Doing so preserves the work done since the last checkpoint taken by the task.

...so major portions of what you are reporting may be exactly what you have configured BOINC to do.

You have 4 hosts, three are Windows XP and one is Win Vista. Which one is having problems? Is it this one? There are many failed tasks there with access violations. Are you overclocking this machine? Other than more CPUs and a different CPU type, what is different about this machine compared with your others that have been running fine?
____________
Rosetta Moderator: Mod.Sense

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59148 - Posted 29 Jan 2009 17:14:30 UTC - in response to Message ID 59140.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.


I also use the Windows firewall, but the Vista SP1 version.

Laptops sometimes have problems with overheating when running BOINC workunits but set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use?

I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.

http://www.almico.com/speedfan.php

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 13,026,869
RAC: 16,575
Message 59149 - Posted 29 Jan 2009 17:16:54 UTC

To Scott_A_Howard,

I notice that your 8 core machine only has 3GB. That's a bit small for 8 rosetta tasks. In your BOINC preferences what percent of memory are you allowing the machine to use when the machine is/isn't in use? You might try setting both to 100% on that machine and see if it makes any difference.
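To put rough numbers on that memory concern, here is a back-of-the-envelope sketch in Python. It assumes the allowed share of RAM is divided evenly between tasks; the function name and the even split are illustrative assumptions, not a description of how BOINC actually allocates memory.

```python
def memory_per_task_mb(total_gb, tasks, allowed_percent=100.0):
    """Return the MB available per task if the allowed share of RAM is split evenly."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    # Convert GB to MB, then apply the percentage BOINC is allowed to use.
    allowed_mb = total_gb * 1024 * (allowed_percent / 100.0)
    return allowed_mb / tasks


if __name__ == "__main__":
    # 3 GB machine running 8 tasks, at two hypothetical memory-limit settings.
    for pct in (50, 100):
        per_task = memory_per_task_mb(3, 8, pct)
        print(f"{pct}% of 3 GB over 8 tasks: {per_task:.0f} MB per task")
```

Even with the limit at 100%, the even split leaves well under half a gigabyte per task, which is why raising both memory percentages is worth a try on that host.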

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59150 - Posted 29 Jan 2009 17:23:17 UTC - in response to Message ID 59145.

Scott Howard:


I seem to have the same problem. No special settings in Rosetta preferences, all kinds of computers under XP, and tasks running 100+ hours with 0% progress.

Nothing But Idle Time

Joined: Sep 28 05
Posts: 209
ID: 1675
Credit: 139,545
RAC: 0
Message 59151 - Posted 29 Jan 2009 17:44:21 UTC

resultid=224470749

Reason: Access Violation (0xc0000005) at address 0x00467846 read attempt to address 0x11B524C4

This task was running fine but after I suspended it, rebooted my system, and restarted the task it terminated almost immediately with access violation. Maybe restarts don't work very well or something is flakey with my hard drive or system. Having some troubles with access violations on Einstein tasks as well. But I've run memtest86 and prime95 and CHKDSK and none of them indicate any local computer problems. I'm just shaking my head in disgust.

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59152 - Posted 29 Jan 2009 17:51:16 UTC - in response to Message ID 59144.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.


I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.

My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?

Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.

AdeB

Joined: Dec 12 06
Posts: 42
ID: 135244
Credit: 389,999
RAC: 541
Message 59153 - Posted 29 Jan 2009 17:58:43 UTC

This task was aborted after my preferred runtime + 4 hours. It was working on the 3rd model.
stderr out:

...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
====>
called boinc_finish
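
A stderr excerpt like the one above can be summarized mechanically. The following is a minimal Python sketch, assuming the two line formats seen in this excerpt ("Starting work on structure: ..." and "# cpu_run_time_pref: ...") are stable; the helper name `summarize_stderr` is made up for illustration and is not part of minirosetta or BOINC.

```python
import re

# Excerpt copied from the stderr output quoted in this post.
STDERR_EXCERPT = """\
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
"""


def summarize_stderr(text):
    """Collect the structures that were started and the runtime preference in hours."""
    structures = re.findall(r"^Starting work on structure: (\S+)", text, flags=re.M)
    pref = re.search(r"^# cpu_run_time_pref: (\d+)", text, flags=re.M)
    pref_hours = int(pref.group(1)) / 3600 if pref else None
    return {"structures_started": structures, "pref_hours": pref_hours}


if __name__ == "__main__":
    summary = summarize_stderr(STDERR_EXCERPT)
    print(len(summary["structures_started"]), "structures started")
    print("runtime preference:", summary["pref_hours"], "hours")
```

For this excerpt the preference of 43200 seconds works out to 12 hours, and three structures were started before the task ended.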


AdeB
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59156 - Posted 29 Jan 2009 18:46:03 UTC - in response to Message ID 59148.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.


I also use the Windows firewall, but the Vista SP1 version.

Laptops sometimes have problems with overheating when running BOINC workunits but set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use?

I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.

http://www.almico.com/speedfan.php


I only run one core so the setting is to use 50% of the cpu's. Thus I do not have a problem with overheating on this laptop. I have it set to use 100% of the available cpu.
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59157 - Posted 29 Jan 2009 18:51:04 UTC - in response to Message ID 59152.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.


I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.

My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?

This is an Intel T2300 dual core, only using one of them for Boinc, 1.6ghz machine.

Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.

Yeah me telling Comcast what to do isn't going to happen in this lifetime. I can download any file in the World EXCEPT this damned mini-rosetta file and then ONLY thru Boinc!!!! I download the same file thru a direct download, Boinc just won't recognize it. Yes I did put the file in the proper directory. We have been thru this already.
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59159 - Posted 29 Jan 2009 19:05:10 UTC

I am not sure that Comcast is the problem as I do use their AV software and have no problems with downloading work ...

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59160 - Posted 29 Jan 2009 19:14:19 UTC
Last modified: 29 Jan 2009 19:17:27 UTC

mikey, have you tried a different version of BOINC?
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59161 - Posted 29 Jan 2009 19:17:13 UTC

rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
____________
Rosetta Moderator: Mod.Sense

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59162 - Posted 29 Jan 2009 19:24:53 UTC - in response to Message ID 59160.
Last modified: 29 Jan 2009 19:46:58 UTC

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59163 - Posted 29 Jan 2009 20:13:06 UTC - in response to Message ID 59162.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 13,026,869
RAC: 16,575
Message 59166 - Posted 29 Jan 2009 21:06:07 UTC

mikey,

You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.

If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
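
One way to rule out corruption of the manually downloaded copy is to hash it and compare against a known-good value. This is only a sketch in Python: it assumes the expected checksum is an MD5 hex digest obtained from a trusted source (for example, another machine where the download succeeded). How BOINC stores or reports its own checksum is not shown in this thread, and the function names are illustrative.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's MD5 against an expected hex digest (case-insensitive)."""
    return file_md5(path) == expected_hex.strip().lower()
```

Running `file_md5` on the same file on a working machine and on the problem machine would show whether the two copies differ, without needing to know which checksum BOINC expects.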

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59172 - Posted 29 Jan 2009 23:31:23 UTC - in response to Message ID 59163.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59173 - Posted 29 Jan 2009 23:35:32 UTC - in response to Message ID 59166.
Last modified: 29 Jan 2009 23:47:48 UTC

mikey,

You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.

If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.


Nope this is the only message regarding the file:
1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe

I exited Boinc, deleted the old file, copied the new one into its location and then restarted the whole pc. Then when Boinc started up that message, along with a few dozen others, came up.
I appreciate all the help but I am done trying to make this work. I am on to another project and will try again another time. THANK YOU ALL!!!!

PS in the time it took me to type this I attached to Poem@Home and got 8 new units plus all the associated files and the pc is now happily crunching.

Thanks again for all your help, I still have a hard time believing it is my pc that can download just fine from any other project but just cannot download one file from Rosetta. Here is a partial list of files just downloaded:
1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86
1/29/2009 6:37:15 PM|Poem@Home|Started download of JParmJan97
1/29/2009 6:37:23 PM|Poem@Home|Sending scheduler request: To fetch work. Requesting 95475 seconds of work, reporting 0 completed tasks
1/29/2009 6:37:28 PM|Poem@Home|Scheduler request completed: got 8 new tasks
1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86

As you can see it works just fine!! I do see that the mini-rosetta file has a ".exe" at the end while the poem file does not. Could that be the problem, no clue, seems it has worked for all other users.
Thanks for the ride it has been loads of fun but I am getting off for now. I will still come back and read and reply in the forums until my credits don't let me anymore.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59174 - Posted 30 Jan 2009 0:18:07 UTC
Last modified: 30 Jan 2009 0:18:27 UTC

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59177 - Posted 30 Jan 2009 2:43:17 UTC

Sometimes it is best to take a breather and come back later ...

At times the problems go away on their own for no apparent reason ... other times they can be found.

I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...

I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59180 - Posted 30 Jan 2009 10:01:06 UTC - in response to Message ID 59174.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59181 - Posted 30 Jan 2009 10:05:27 UTC - in response to Message ID 59177.

Sometimes it is best to take a breather and come back later ...

At times the problems go away on their own for no apparent reason ... other times they can be found.

I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...

I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...


Oh I have 17 computers, I think, on line here at home right now. All are crunching for Boinc, plus I have 2 video cards doing the folding thing. I do ABC and Poem right now. But if you click on my name you will see I have crunched for a few projects and am not intending to stop anytime soon. In fact I have 2 new motherboards and dual core cpus to bring on line this weekend to replace 2 single core machines. I have already set them to no new work in preparation for the changeover.
____________

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59184 - Posted 30 Jan 2009 12:12:52 UTC - in response to Message ID 59161.

Mod.Sense


I detached yesterday, and re-attached just now so there is no way to see (for me) what applications were running. Also, my computers are too widely spread to start micromanagement. Anyway, I'll keep an eye on a couple of computers for a couple of days to see if they reattach successfully and if that problem is indeed solved.

I supposed it was a wrong batch (or application) and detach/reattach was the fastest way to have a full reset. If the problem shows up again, I'll let you know. If it doesn't... then thanks for the info!

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59207 - Posted 31 Jan 2009 19:28:54 UTC
Last modified: 31 Jan 2009 19:31:41 UTC

disregard, found answer in short running model thread

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 17
ID: 93705
Credit: 164,501
RAC: 18
Message 59217 - Posted 1 Feb 2009 11:35:22 UTC

Hello everyone.
I do not understand why some people are having problems.
My desktop machine (Intel Core 2 CPU, Windows XP Home x86 SP2) has completed 27 WUs with v1.54 with 0 errors and an average CPU time of 2.8 hours. I am crossing my fingers that it continues this way...
I will only point out that somewhere between 80% and 85% of the work, the progress jumps directly to 100%.
With this new version I also notice that the processes generate more decoys and attempts (one example: 23 decoys from 23 attempts on a WU), and also that the average work per WU is greater than with v1.47.
Finally (although this is not the right forum), I should mention that one of my computers has been out of service for 8 days because of bad sectors on the hard drive, and it is in for repair. There are 3 WUs on it that I have not been able to return before the deadline and which I think will be lost.
It would be good if some kind person would inform those in charge of the Rosetta project of this problem.
Thank you very much in advance...
Best regards...
____________


johnny64

Joined: Dec 25 08
Posts: 3
ID: 294317
Credit: 69,465
RAC: 0
Message 59218 - Posted 1 Feb 2009 12:26:59 UTC

The 1.54 version seems to be in conflict with the Linux ABI in FreeBSD.
One machine I'm running boinc on is a FreeBSD one, boinc downloads and runs the Linux binaries through the Linuxulator. Version 1.47 worked flawlessly, but the 1.54 version crashes randomly on SIGILL. http://boinc.bakerlab.org/rosetta/results.php?hostid=973136 shows only one successful task, which was run with Rosetta beta rather than minirosetta; all of the minirosetta tasks crashed sooner or later.

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59224 - Posted 1 Feb 2009 21:13:19 UTC

1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_188268_0
3.5 hours for 1 decoy? kinda odd i think.

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 197,236
RAC: 217
Message 59225 - Posted 1 Feb 2009 22:18:23 UTC

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?

Gavin Shaw Profile

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 251,942
RAC: 0
Message 59228 - Posted 1 Feb 2009 23:07:15 UTC

Got this one a day or so ago. Not sure if it is a failure/error.

224812655

____________
Never surrender and never give up. In the darkest hour there is always hope.

P . P . L .

Joined: Aug 20 06
Posts: 365
ID: 105843
Credit: 362,889
RAC: 796
Message 59233 - Posted 2 Feb 2009 4:19:31 UTC

Hi Mike.

I did see somewhere that you said something about large upload file size, I think this is one that got away. ;)

99 models in 4hrs, 26min and result file of 8.32mb.

_CAPRI17_T39_2_.sjf_br_one_docking.protocol__6483_19318_1.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=205403421

pete.

____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59236 - Posted 2 Feb 2009 5:30:33 UTC
Last modified: 2 Feb 2009 5:35:16 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.
____________
Rosetta Moderator: Mod.Sense

P . P . L .

Joined: Aug 20 06
Posts: 365
ID: 105843
Credit: 362,889
RAC: 796
Message 59237 - Posted 2 Feb 2009 6:00:42 UTC - in response to Message ID 59236.
Last modified: 2 Feb 2009 6:02:21 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.


Hi.

Just as well it did finish after 99, I would hate to see the file size after 12 or 24 hours! :) I just returned another one the same size.

pete.
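
For a rough idea of what "12 or 24 hours" could have meant, here is a small Python sketch that extrapolates from the figures reported earlier (8.32 MB after 4 hours 26 minutes). It assumes the result file grows linearly with run time, which is a simplifying assumption and not something confirmed in this thread; the function name is illustrative.

```python
def estimate_output_mb(observed_mb, observed_minutes, target_hours):
    """Linearly extrapolate result-file size to a longer run time."""
    if observed_minutes <= 0:
        raise ValueError("observed_minutes must be positive")
    return observed_mb * (target_hours * 60) / observed_minutes


if __name__ == "__main__":
    observed_minutes = 4 * 60 + 26  # 99 models in 4 hrs 26 min
    for hours in (12, 24):
        size = estimate_output_mb(8.32, observed_minutes, hours)
        print(f"{hours} h run: roughly {size:.1f} MB")
```

Under that assumption a 12-hour run would land in the low twenties of megabytes and a 24-hour run at about double that, which illustrates why capping the task at 99 models keeps uploads manageable.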
____________


transient

Joined: Sep 30 06
Posts: 255
ID: 115553
Credit: 1,174,447
RAC: 1,791
Message 59238 - Posted 2 Feb 2009 6:22:36 UTC - in response to Message ID 59225.

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?


I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
____________

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 197,236
RAC: 217
Message 59239 - Posted 2 Feb 2009 13:39:00 UTC - in response to Message ID 59238.

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?

I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.

Thanks for copying here - I thought it was just a problem with the validator (the error message being the clue). You're right, there's no "Done" section after the first model starts until the boinc_finish, which is odd, but no mention of the watchdog cutting in, even though it does run a long time. But on the 1.47 WU there are 3 models done, so I'm not entirely convinced it's the same thing.

Usually long-running jobs get a default credit of 80, don't they? Looks like I missed out all ways. Oh well...

falingtrea

Joined: Aug 8 07
Posts: 1
ID: 196401
Credit: 70,530
RAC: 32
Message 59240 - Posted 2 Feb 2009 16:25:44 UTC

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.
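
For anyone who wants to scan a saved message log for notices like these, here is a minimal Python sketch. It assumes the three-field `timestamp|project|message` layout and the M/D/YYYY 12-hour timestamps seen in the excerpt above; the function name `parse_log_line` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

def parse_log_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    # Split on the first two pipes only, in case the message itself contains one.
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

lines = [
    "2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory",
    "2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec",
    "2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down",
]

for line in lines:
    when, project, message = parse_log_line(line)
    # Keep only server messages and deferral notices.
    if message.startswith(("Message from server:", "Deferring")):
        print(when.isoformat(), project, message)
```

Other BOINC versions or locales may format the timestamp differently, so the `strptime` pattern would need adjusting.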

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59245 - Posted 2 Feb 2009 22:02:36 UTC

There is something odd going on with the graphics of lr5_D_score12_rlbd_2hsh_IGNORE_THE_REST_DECOY_6246_424_0: the plot disappears completely at times, and so does the accepted energy. Then they reappear. It all seems to depend on the energy value at the moment. As far as I know, this is not normal.

TeAm Enterprise Profile

Joined: Sep 28 05
Posts: 15
ID: 1546
Credit: 14,190,849
RAC: 0
Message 59249 - Posted 3 Feb 2009 3:44:56 UTC

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59253 - Posted 3 Feb 2009 10:11:54 UTC - in response to Message ID 59249.

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.



What version of BOINC Manager are you using?
It looked like you were using 5.10.45, which is quite old.

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59254 - Posted 3 Feb 2009 12:41:38 UTC - in response to Message ID 59249.

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
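
The manual edit described above can also be scripted. This is only a rough Python sketch: it assumes each project sits in its own `<project>...</project>` block that contains both its `<master_url>` and the `<duration_correction_factor>` line. The function name `reset_dcf` and the sample text (including the 55.123456 value and the second project URL) are made up for illustration. As with the manual edit, BOINC should be shut down first, and keeping a backup copy of client_state.xml is a sensible precaution.

```python
import re

def reset_dcf(state_text, project_url, new_value=1.0):
    """Return state_text with the DCF reset inside the matching <project> block."""
    def fix_block(match):
        block = match.group(0)
        if project_url not in block:
            return block  # some other project: leave it untouched
        return re.sub(
            r"<duration_correction_factor>[^<]*</duration_correction_factor>",
            f"<duration_correction_factor>{new_value:.6f}</duration_correction_factor>",
            block,
        )
    # Visit every <project> block; only the one mentioning project_url changes.
    return re.sub(r"<project>.*?</project>", fix_block, state_text, flags=re.DOTALL)

# Hypothetical, heavily trimmed stand-in for client_state.xml.
sample = """<client_state>
<project>
    <master_url>http://boinc.bakerlab.org/rosetta/</master_url>
    <duration_correction_factor>55.123456</duration_correction_factor>
</project>
<project>
    <master_url>http://example.org/other/</master_url>
    <duration_correction_factor>0.705998</duration_correction_factor>
</project>
</client_state>"""

fixed = reset_dcf(sample, "boinc.bakerlab.org/rosetta")
print(fixed)
```

Working on the text rather than parsing the XML keeps every other line of the file byte-for-byte unchanged, which matches the "do NOT change anything else" advice.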
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59256 - Posted 3 Feb 2009 12:49:09 UTC

Compute error, though it looks more like a zip error ...


process exited with code 1 (0x1, -255)

Watchdog active.
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Not sure what to make of this error ... happened on the Mac Pro ...

TeAm Enterprise Profile

Joined: Sep 28 05
Posts: 15
ID: 1546
Credit: 14,190,849
RAC: 0
Message 59258 - Posted 3 Feb 2009 15:06:00 UTC - in response to Message ID 59253.

Using version 6.4.5 which I downloaded and installed about 6 days ago.


What version of BOINC Manager are you using?
It looked like you were using 5.10.45, which is quite old.

TeAm Enterprise Profile

Joined: Sep 28 05
Posts: 15
ID: 1546
Credit: 14,190,849
RAC: 0
Message 59259 - Posted 3 Feb 2009 15:11:39 UTC - in response to Message ID 59254.

That fixed it! Thanks, my duration was set at 55+.


I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.

cenit Profile

Joined: Apr 1 07
Posts: 11
ID: 161706
Credit: 504,070
RAC: 946
Message 59261 - Posted 3 Feb 2009 17:12:07 UTC - in response to Message ID 59240.

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.


You have to wait and it will correct itself.
Maybe it has been a long time since your last Rosetta WU... during that time the project changed its web address, so BOINC needs to re-fetch the master file. Leave it alone and within 24 hours at most it will re-download it and resume working!

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59266 - Posted 3 Feb 2009 23:48:58 UTC - in response to Message ID 59259.

That fixed it! Thanks, my duration was set at 55+.


I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.


Yeah, for some reason this has happened A LOT lately.
____________

transient

Joined: Sep 30 06
Posts: 255
ID: 115553
Credit: 1,174,447
RAC: 1,791
Message 59272 - Posted 4 Feb 2009 6:22:55 UTC

That could be related to the BOINC version (6.4.5 and higher). The complaints about the RDCF being completely off usually come from people who have installed it. A not uncommon opinion is that version 6.4.5 was made the recommended version too hastily, in order to get the CUDA capabilities out.
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59273 - Posted 4 Feb 2009 9:08:55 UTC
Last modified: 4 Feb 2009 9:09:38 UTC

lr6_E_score12_rlbd_1e6i_IGNORE_THE_REST_DECOY_6254_236_1
ERROR:: Exit from: ..\..\src\protocols\checkpoint\CheckPointer.cc line: 87
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59274 - Posted 4 Feb 2009 9:11:24 UTC

lr5_D_hybrid_rlbd_1bmg_IGNORE_THE_REST_DECOY_6250_424_0
Initializing options.... ok
ERROR: Option file open failed for: relax_options_lr5_D_hybrid_mtyka

KC0ISW

Joined: Sep 28 05
Posts: 2
ID: 1538
Credit: 21,263
RAC: 0
Message 59278 - Posted 4 Feb 2009 12:43:32 UTC - in response to Message ID 59274.

http://boinc.bakerlab.org/rosetta/result.php?resultid=226103545
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59331 - Posted 5 Feb 2009 13:40:19 UTC - in response to Message ID 59180.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59355 - Posted 5 Feb 2009 19:00:02 UTC - in response to Message ID 59331.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.


Guess what...............NO PROBLEM, it is crunching just fine. Here is the pc: http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1001897
It is on its first unit, so no results yet, but one unit is crunching just fine so far!!
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59363 - Posted 5 Feb 2009 20:19:02 UTC

lr5_D_hybrid_rlbd_1e6i_IGNORE_THE_REST_DECOY_6250_347_0 died at 3 hrs out of 4 and also kicked up a dialogue box on my desktop.

the error - exit code -1073740791 (0xc0000409)
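
The two numbers in that exit code are the same 32-bit value, shown once as a signed integer and once in hex. A minimal Python sketch of the conversion (the helper name `to_unsigned32` is made up for illustration):

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF

exit_code = -1073740791
print(hex(to_unsigned32(exit_code)))  # prints 0xc0000409
```

On Windows, 0xC0000409 is the NTSTATUS value STATUS_STACK_BUFFER_OVERRUN. That names the kind of fault, but by itself does not say what in the task triggered it.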

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59365 - Posted 5 Feb 2009 21:33:33 UTC - in response to Message ID 59355.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.


Guess what...............NO PROBLEM, it is crunching just fine. Here is the pc: http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1001897
It is on its first unit, so no results yet, but one unit is crunching just fine so far!!


I just thought of something....I wonder if changing the setting for:
Skip image file verification? to yes would have let my Windows pc's download the file? Hmmmmm
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59367 - Posted 5 Feb 2009 21:48:57 UTC

The image verification can't occur until the download completes. So, that's not what's causing the download problem.
____________
Rosetta Moderator: Mod.Sense

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 194,641
RAC: 29
Message 59371 - Posted 6 Feb 2009 1:21:48 UTC

Mod.Sense had asked me to post my results in here. A little history: I've been getting Compute Errors for every Minirosetta WU I try to crunch; they usually crash and burn within the first 60 seconds or so. I am running a Q6600 with everything at stock speeds, but I was throttling my processor to use only 3 of 4 cores, so it was suggested that I let all 4 cores run unthrottled. Here's what happened:

I changed it to: "On multiprocessor systems, use at most 100% of the processors" so that it would run completely unthrottled and use all 4 cores. I let it download minirosetta WUs and it got 5 of them; all failed, after 0:33, 1:39, 0:56, 0:38, and the last one at 0:51 crashed with a Vista popup saying "minirosetta_1.54_windows_x86_64.exe has stopped working".

So it didn't seem to help. I don't know what else to try, but I'm a little ashamed of all the compute errors when you look at my results page, so I think I may have to give up on minirosetta and just stick to Beta WUs; they seem to work great when I'm not messing around with the BOINC client.

I think it may have something to do with Vista 64, because I have an E8500 running Vista 64 and they fail on there too. But the E8500 is throttled to 1 core and is OC'ed from 3.16 GHz to 3.8 GHz (I've been told OC'ing will affect minirosetta), and it is my gaming rig, so I don't mind if it doesn't crunch WUs because it's crunching games! :)

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59374 - Posted 6 Feb 2009 2:58:56 UTC

And epcorian is not overclocked. Running BOINC version 6.4.5

They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.

Is it possible you've got something like an antivirus application that's conflicting on Vista?

The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here
____________
Rosetta Moderator: Mod.Sense

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 194,641
RAC: 29
Message 59376 - Posted 6 Feb 2009 3:29:44 UTC

That's right, the Q6600 isn't overclocked. The system contains an Intel DQ35JO MB, Q6600 processor, 4GB (2x2GB) Kingston Value RAM, Corsair HX-520W PS, 36GB WD Raptor HD, 2x750GB WD HDs in RAID 1, and a Zalman HSF, running Vista 64 SP1, no external video card. I use it as a home file and print server and recently a BOINC cruncher, as I leave it on 24/7. No issues with Beta WUs or SETI.

I do have NOD32 installed on there, but I tried disabling it (I haven't gone as far as uninstalling it) and they would still fail.

Maybe I should try an older version of the BOINC client, I will give it a go this weekend and post back.

Thanks!

NewtonianRefractor

Joined: Sep 29 08
Posts: 16
ID: 281324
Credit: 16,474
RAC: 0
Message 59382 - Posted 6 Feb 2009 9:38:53 UTC
Last modified: 6 Feb 2009 9:45:30 UTC

can someone please explain what happened here?

Here is another one.

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59384 - Posted 6 Feb 2009 12:28:02 UTC - in response to Message ID 59367.

The image verification can't occur until the download completes. So, that's not what's causing the download problem.


DARN, I was hoping that would solve my problem, oh well. Thanks!
____________

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59385 - Posted 6 Feb 2009 12:33:32 UTC - in response to Message ID 59374.

And epcorian is not overclocked. Running BOINC version 6.4.5

They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.

Is it possible you've got something like an antivirus application that's conflicting on Vista?

The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here


He is running a 64 bit OS though, I read on one of the projects that you need to do something to make 32 bit units work on a 64 bit system, is that true with Rosetta units too? That is NOT true for all projects and I do not remember where I read it.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59391 - Posted 6 Feb 2009 14:26:53 UTC
Last modified: 6 Feb 2009 14:29:47 UTC

Moved NewtonianRefractor's post here. They report a validation error on tasks that had a visit from the watchdog. The tasks ended at target runtime plus 4 hrs, but show validation errors.
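
As a rough illustration of the "target runtime plus 4 hrs" rule of thumb mentioned here, a small Python sketch. It assumes the cutoff is simply the runtime preference plus a fixed 4-hour allowance; the names are made up, and the real watchdog logic inside minirosetta may well differ.

```python
# Assumption taken from the post above: the watchdog allows 4 hours past the target.
WATCHDOG_GRACE_SECONDS = 4 * 3600

def watchdog_cutoff(cpu_run_time_pref):
    """CPU seconds after which the watchdog would be expected to end a task."""
    return cpu_run_time_pref + WATCHDOG_GRACE_SECONDS

def overran(cpu_seconds, cpu_run_time_pref):
    """True if a task has used at least as much CPU time as the assumed cutoff."""
    return cpu_seconds >= watchdog_cutoff(cpu_run_time_pref)

# A 3-hour preference gives a 7-hour cutoff, matching the 7-hour task reported earlier.
print(watchdog_cutoff(3 * 3600) / 3600)
print(overran(7 * 3600, 3 * 3600))
```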
____________
Rosetta Moderator: Mod.Sense

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59394 - Posted 6 Feb 2009 16:32:31 UTC - in response to Message ID 59161.

rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.


Same problem again on at least one of my computers. This time I have more details:
Application: Rosetta Mini 1.54
Task name: lr6_E_score12_rlbd_1ail_IGNORE_THE_REST_DECOY_6254_459_0

Total runtime before manual cancellation: 72:21:22
Total Progress: 0%
Time to go: 6:42:30 (as usual on my computers)

Any comments/ideas?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59395 - Posted 6 Feb 2009 17:10:50 UTC

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 130
ID: 44923
Credit: 951,788
RAC: 618
Message 59406 - Posted 7 Feb 2009 1:36:32 UTC

Similar error to that reported by Paul Buck

Task: 226615095
Workunit: 206537670
Name: loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t326__IGNORE_THE_REST_1R9GA_7_6642_18_0

Mac OSX 10.4.11

<core_client_version>6.2.18</core_client_version>
<![CDATA[

*** Probably irrelevant stuff deleted

End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/loopbuild_ref_tex_cst.loopbuild_tex_cst.t326_.tex.boinc_files.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/loopbuild_ref_tex_cst.loopbuild_tex_cst.t326_.tex.boinc_files.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>
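
To compare reports like this one with the earlier one, it can help to pull out just the "Exit from" line. A minimal Python sketch, assuming the `ERROR:: Exit from: <file> line: <n>` wording seen in the stderr output quoted in this thread; the function name `find_exit_points` is made up for illustration.

```python
import re

# Matches lines such as:
#   ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
EXIT_RE = re.compile(r"ERROR:: Exit from: (\S+) line: (\d+)")

def find_exit_points(stderr_text):
    """Return (source_file, line_number) for every 'Exit from' line found."""
    return [(path, int(line)) for path, line in EXIT_RE.findall(stderr_text)]

sample = "\n".join([
    "Hbond tripped.",
    "ERROR: dis==0 in pairtermderiv!",
    "ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338",
    "BOINC:: Error reading and gzipping output datafile: default.out",
])

print(find_exit_points(sample))
```

Because the file name is matched as any run of non-space characters, Windows-style paths with backslashes, like the CheckPointer.cc report earlier, are picked up as well.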


____________

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 194,641
RAC: 29
Message 59409 - Posted 7 Feb 2009 3:40:46 UTC

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59411 - Posted 7 Feb 2009 8:39:55 UTC - in response to Message ID 59395.

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?

- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects, though.

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at 7+ hours of runtime and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only thing "special" about this computer is that it doesn't have 24/7 internet access.

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59412 - Posted 7 Feb 2009 12:17:19 UTC - in response to Message ID 59409.

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.


ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
____________

[C@B] PcLis Profile

Joined: Dec 16 07
Posts: 3
ID: 227480
Credit: 42,596
RAC: 11
Message 59416 - Posted 7 Feb 2009 13:56:34 UTC

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a compute error. After a while I decided to stop crunching for this project. Even so, from time to time I try again, but nothing has changed, even with the new versions of Rosetta Mini, including this latest one.

The thing is, the Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project has no option to select subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and there is no point in throwing away wasted computing hours. I hope this problem is resolved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 101,685
RAC: 1
Message 59418 - Posted 7 Feb 2009 14:14:01 UTC

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59419 - Posted 7 Feb 2009 15:04:48 UTC - in response to Message ID 59172.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.


How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59420 - Posted 7 Feb 2009 15:25:32 UTC

I recently had a 1.54 workunit with a validate error for no reason I could spot in the Task ID details file. A wingman got a Success, but apparently with a much shorter preferred workunit length than the 14 hours I request.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=204095976

Could you check for problems in parts of the workunit the wingman probably never reached?

mikey

Joined: Jan 5 06
Posts: 859
ID: 47185
Credit: 102,931
RAC: 0
Message 59421 - Posted 7 Feb 2009 16:24:33 UTC - in response to Message ID 59419.
Last modified: 7 Feb 2009 16:27:55 UTC

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.


How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.


I only have one project per pc, but I will add a second if the first is having workunit issues. All machines have at least a 20 gig hard drive but most have a 100 gig or bigger hard drive. The one above is a laptop with a 50 gig hard drive with almost 30 gig free. I have Boinc setup to use no more than 50% of the free hard drive space and don't have any issues with space.
____________

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 194,641
RAC: 29
Message 59428 - Posted 7 Feb 2009 20:06:18 UTC - in response to Message ID 59412.

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.


ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.


I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8: better, but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake, I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some Minis and they did just fine.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59437 - Posted 8 Feb 2009 2:28:40 UTC - in response to Message ID 59418.
Last modified: 8 Feb 2009 2:29:01 UTC

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks


I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59439 - Posted 8 Feb 2009 2:43:27 UTC - in response to Message ID 59416.

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing has changed, not even with the new versions of Rosetta Mini, including this latest one.

The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project does not offer the option of selecting subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and it makes no sense to throw away hours of wasted computation. I hope this problem is solved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan


Hola Juan,

I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of Mini.

Looking at his 2 failed tasks, they both have Exit status -226 and the "Can't acquire lockfile" errors.

He is running Win Vista x86.

I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html
____________
Rosetta Moderator: Mod.Sense

Fishead

Joined: Sep 3 08
Posts: 7
ID: 276548
Credit: 89,566
RAC: 0
Message 59443 - Posted 8 Feb 2009 6:45:05 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206610287
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206617445
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206618707
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=204395981

According to the graphics screen of these four WUs, every "accepted" step becomes the new low energy state. No matter if the energy value is smaller or higher...

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59447 - Posted 8 Feb 2009 10:12:14 UTC

*I* cured the lock file problem by running with 100% time ... if he has opted to run at some lower percentage of CPU time this may be the issue. Something else to try ... and if it works we can report another success ... this is one of the issues that we have been trying to pin down in RALPH...

Evan

Joined: Dec 23 05
Posts: 264
ID: 42505
Credit: 353,034
RAC: 396
Message 59462 - Posted 8 Feb 2009 16:37:47 UTC

I have aborted the following loopbuilds:

226468615
226473496

They both were going on a slow boat to nowhere with an accepted energy of 1.#INF
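
In case it helps anyone scanning their own results for the same thing, here is a minimal sketch of how such runaway energies could be flagged. The spellings in the table are the way Microsoft's C runtime of this era prints non-finite numbers, plus the shortened "QNAN" quoted from the graphics window earlier in this thread; the function names are hypothetical and nothing here comes from the Rosetta code.

```python
import math

# Non-finite values as printed on Windows builds, plus the abbreviated
# form reported from the graphics window in this thread.
_NON_FINITE = {
    "1.#INF": math.inf,
    "-1.#INF": -math.inf,
    "1.#QNAN": math.nan,
    "-1.#IND": math.nan,
    "QNAN": math.nan,
}


def parse_energy(text):
    """Convert an energy string to a float, accepting the Windows spellings above."""
    token = text.strip().upper()
    if token in _NON_FINITE:
        return _NON_FINITE[token]
    return float(token)


def is_runaway(text):
    """True when the reported energy is infinite or not a number."""
    return not math.isfinite(parse_energy(text))


for sample in ["1.#INF", "QNAN", "-123.45"]:
    print(sample, is_runaway(sample))
```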


____________

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 101,685
RAC: 1
Message 59465 - Posted 8 Feb 2009 18:47:55 UTC - in response to Message ID 59437.

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended because it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Runtime is set to 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks


I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".


OK, I set the runtime to 8 hours so the watchdog would cut it at 24 hours. It has now uploaded and reported. I have dump files as well, if somebody on the team is interested. (Captured at the reported time and step.)
And I see I was not alone... :-(

parish

Joined: Aug 13 06
Posts: 1
ID: 104623
Credit: 375,447
RAC: 1
Message 59469 - Posted 8 Feb 2009 20:24:44 UTC

Hi,

The work units exit with status code 193 (0xc1).
Rosetta 5.98 and other projects work OK.

Am I missing something? Some library?

Full error report below:

Server state Over
Outcome Client error
Client state Compute error
Exit status 193 (0xc1)
CPU time 0

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 2- 8 1:29: 8:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
*** glibc detected *** corrupted double-linked list: 0x093544cc ***
SIGABRT: abort called
Stack trace (15 frames):
[0x8f88f07]
[0x8fb3778]
[0xb7fff420]
[0x9016944]
[0x902c693]
[0x90310d2]
[0x9031c84]
[0x903353d]
[0x9000ec7]
[0x81bed6d]
[0x81bee1d]
[0x8195f15]
[0x8048e93]
[0x900f84c]
[0x8048111]

Exiting...

</stderr_txt>
]]>
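
A side note on the three numbers in "code 193 (0xc1, -63)": they look like one value shown as decimal, as hex, and as a signed 8-bit number. The sketch below only reproduces that arithmetic; it is not taken from the BOINC source, the function name is made up, and it says nothing about what the code means.

```python
def describe_exit_status(status):
    """Show an exit status as decimal, hex of the low byte, and signed 8-bit value."""
    low_byte = status & 0xFF
    # Values 128..255 wrap around to negative numbers when read as a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return f"{status} ({hex(low_byte)}, {signed})"


print(describe_exit_status(193))
```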
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59473 - Posted 8 Feb 2009 23:12:20 UTC

As of v1.54, the watchdog kicks in at runtime pref. plus 4 hours. So, no longer 3 times runtime preference.
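
To make the difference concrete, here is a small sketch of the two rules. The function name and the version tuple are just for illustration; only the "plus 4 hours" and "3 times" rules come from this thread.

```python
def watchdog_cutoff_hours(runtime_pref_hours, version=(1, 54)):
    """Hours of runtime after which the watchdog ends a task."""
    if version >= (1, 54):
        return runtime_pref_hours + 4   # rule as of v1.54
    return runtime_pref_hours * 3       # earlier rule: 3 times the preference


# With an 8 hour runtime preference:
print(watchdog_cutoff_hours(8, version=(1, 47)))  # old rule
print(watchdog_cutoff_hours(8))                   # new rule
```

So an 8 hour preference gives a cutoff of 24 hours under the old rule but 12 hours under v1.54.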
____________
Rosetta Moderator: Mod.Sense

Andreas

Joined: Sep 22 08
Posts: 1
ID: 280173
Credit: 39,402
RAC: 0
Message 59493 - Posted 9 Feb 2009 22:07:37 UTC - in response to Message ID 59086.

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...


I, too, was plagued by frequent R@H lock file problems. Setting CPU to 100% seems to have cured that.
And, as I have a quad-core CPU, I can limit BOINC usage by setting "On Multiprocessor Systems, use at most 51% of all processors". (If I run BOINC at 100% on all cores, my system gets too hot - more precisely, my fan gets too loud)
-- Andreas
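
For reference, a sketch of how that processor percentage turns into a core count. It assumes the client rounds down and never drops below one core, which matches what I see here but is an assumption, not something taken from the BOINC documentation; the function name is made up.

```python
def usable_cores(total_cores, max_pct):
    """Cores BOINC would use, ASSUMING it rounds down and never goes below one."""
    return max(1, int(total_cores * max_pct // 100))


print(usable_cores(4, 51))   # quad-core limited to 51%
print(usable_cores(4, 100))  # no limit
```

Under that assumption, 51% on a quad-core leaves two cores for BOINC while each running task still gets 100% CPU time.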

Evan

Joined: Dec 23 05
Posts: 264
ID: 42505
Credit: 353,034
RAC: 396
Message 59494 - Posted 9 Feb 2009 23:02:10 UTC

problems with this one:
227327540

heartbeat error messages

</stderr_txt>
<message>
<file_xfer_error>
<file_name>abinitio_norelax_homfrag_natfrag_129_B_1o7uA_SAVE_ALL_OUT_6252_5178_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>




____________

robertmiles

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59502 - Posted 10 Feb 2009 14:44:11 UTC - in response to Message ID 59439.
Last modified: 10 Feb 2009 14:52:25 UTC

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing has changed, not even with the new versions of Rosetta Mini, including this latest one.

The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project does not offer the option of selecting subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and it makes no sense to throw away hours of wasted computation. I hope this problem is solved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan


Hola Juan,

I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of Mini.

Looking at his 2 failed tasks, they both have Exit status -226 and the "Can't acquire lockfile" errors.

He is running Win Vista x86.

I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html


I never learned enough Spanish to do such a translation myself, so I tried asking that web site to translate all of your reply at once to Spanish, in preparation for writing an answer in English and doing the same to it. It appeared that the translation succeeded, but enough of it was hidden by advertisements that it was unusable.

Anyone know another automatic translation site that doesn't have this problem?

I've been trying to trigger that problem over on RALPH@home by setting my CPU time less than 100% and unable to actually get it less than 100%, so you might want to consider this: For anyone having this problem repeatedly, give them 1.54 workunits with extra debugging output enabled. Then have someone on the RALPH@home staff analyze the results and give them credits according to the RALPH@home standards instead of the Rosetta@home standards.

Greg_BE

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59505 - Posted 10 Feb 2009 18:23:04 UTC
Last modified: 10 Feb 2009 18:32:00 UTC

http://www.babelfish.yahoo.com translates it as:

Hello, First of all, excuses to write in Castilian, but my English is insufficient. From August of 2008 me 99% of the tasks of Mini Rosetta with computational error are finalizing. After a time I decided not to continue processing in this project. Even so, sometimes I return to try it, but everything follows equal: even with the new versions of Mini Rosetta, including this last one. The case is that the tasks of Rosetta Beta do not fail to me, but of that one sends very few proporcinalmente to me. The pain is that in this project the possibility of selecting sub-projects, does not exist there is as if it in other many. I would like to continue processing for this project, but there is no way, and it is not question to throw low-achieving hours of computation. I hope that this problem is solved soon. As for me I will continue trying from time to time. A coridal greeting for all, Juan

he has 4 tasks running and 2 of them failed

abinitio_norelax_homfrag_natfrag_129_B_1tit__SAVE_ALL_OUT_6252_2628_0
he got a lockfile failure on this one and it ran only CPU time 683.9708

and

loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t363__IGNORE_THE_REST_1WWTA_12_6651_14_0
this got lockfile as well it ran for CPU time 2155.325

the other 2 are split with a completion and in process

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59518 - Posted 11 Feb 2009 16:45:08 UTC - in response to Message ID 59395.

Mod.Sense

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?


- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only "special" thing about this computer is that it doesn't have 24/7 internet access.

No solution as yet?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59520 - Posted 11 Feb 2009 19:18:25 UTC - in response to Message ID 59518.

Mod.Sense

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?


- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only "special" thing about this computer is that it doesn't have 24/7 internet access.

No solution as yet?


I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?

Odd, the failed task with some time on it shows that your core client version is 6.2.14, but your BOINC Windows Runtime Debugger Version is 6.5.0. Not sure how that would happen.

____________
Rosetta Moderator: Mod.Sense

Verrie Pearce

Joined: Dec 2 05
Posts: 3
ID: 27415
Credit: 90,299
RAC: 0
Message 59524 - Posted 12 Feb 2009 3:13:06 UTC - in response to Message ID 59045.

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here: http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark (_rlbd_) jobs - caused by a buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many potential problems will now show up as meaningful error messages and not random segmentation faults.



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. In particular, I'd like to know about


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike



____________

Verrie Pearce

Joined: Dec 2 05
Posts: 3
ID: 27415
Credit: 90,299
RAC: 0
Message 59525 - Posted 12 Feb 2009 3:14:52 UTC

I have reached the end: since your new patch, nothing works from your project. I keep resetting and still get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.
____________

Sid Celery

Joined: Feb 11 08
Posts: 366
ID: 241409
Credit: 1,126,789
RAC: 1,954
Message 59526 - Posted 12 Feb 2009 4:08:05 UTC - in response to Message ID 59525.

I have reached the end: since your new patch, nothing works from your project. I keep resetting and still get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.

Urgh - bad news :(

I notice you're using Boinc 6.2.19 with Vista64. Can you give it one last try and upgrade to 6.4.5? I had similar problems to you (not anywhere as bad) using Vista64 and these problems have disappeared for me after upgrading. It might make all the difference for you too.
____________

transient

Joined: Sep 30 06
Posts: 255
ID: 115553
Credit: 1,174,447
RAC: 1,791
Message 59527 - Posted 12 Feb 2009 6:08:38 UTC

Do you 'overclock' your PC? In that case lowering the overclock might help.
____________

Markus

Joined: Feb 21 08
Posts: 1
ID: 243327
Credit: 25,065
RAC: 0
Message 59528 - Posted 12 Feb 2009 8:22:53 UTC

Good morning!

I reinstalled my complete system a few days ago and restarted crunching rosetta@home again. Unfortunately I got some errors.

Here is what i got

12.02.2009 05:37:59|rosetta@home|Restarting task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 using minirosetta version 154
12.02.2009 05:38:00|rosetta@home|Task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 exited with zero status but no 'finished' file
12.02.2009 05:38:00|rosetta@home|If this happens repeatedly you may need to reset the project.

Therefore two workunits aborted with computation errors. Maybe it's just an error on my system; I just wanted to post it.

Greetings

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59530 - Posted 12 Feb 2009 14:02:04 UTC - in response to Message ID 59520.

Mod.Sense

I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?


In the meantime I have set that computer on NNT, and changed the preferred runtime. I will reactivate that computer, and evaluate Saturday or after the weekend. You'll be informed :)

BrnmccO1

Joined: Jun 26 07
Posts: 17
ID: 186323
Credit: 578,825
RAC: 0
Message 59532 - Posted 12 Feb 2009 21:23:42 UTC

Very good so far, zero error results on all machines for a long time. This 1.54 is much better than the prev versions, much more stable etc. Keep up the good work stamping out the bugs.

It's been a long time since I've reviewed the results on all my crunchers and found no compute errors. If things keep going the way they are, we might break 100 Tflops yet!
____________

svincent

Joined: Dec 30 05
Posts: 130
ID: 44923
Credit: 951,788
RAC: 618
Message 59560 - Posted 14 Feb 2009 17:10:02 UTC

Workunit 205979363
Task 228619747
Name loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t332__IGNORE_THE_REST_2FLIA_6_6646_10_1
Mac OS X 10.4.11

This failed after 216 seconds : tail of stderr below

Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.
interpolate rotamers bin out of range: ARG 1.43667e-05 nan nan nan nan nan
81 81 19 20 2147483649 22 1.43667e-06 nan
ERROR:: Exit from: src/core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

____________

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59596 - Posted 16 Feb 2009 3:03:25 UTC
Last modified: 16 Feb 2009 3:05:52 UTC

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 197,236
RAC: 217
Message 59601 - Posted 16 Feb 2009 10:15:59 UTC - in response to Message ID 59596.

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2

I got a couple of validate errors too:
Task 228125280
Task 228133134
There's nothing more frustrating than completing a job ok only for it to go wrong when uploaded.

I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 194,641
RAC: 29
Message 59610 - Posted 16 Feb 2009 16:39:56 UTC - in response to Message ID 59428.
Last modified: 16 Feb 2009 16:42:55 UTC

I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8, better but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some Minis and they did just fine.


So this weekend I installed a fresh copy of XP x64, upgraded it to SP2, installed my x64 version of NOD32 antivirus, and told BOINC to "use at most 75% of the processors", meaning 3 of the 4 cores on my Q6600. It's crunching Minis and Betas without a problem: 1 successful Beta and 5 successful Minis, with 4 more coming down the pipe. So it looks like Mini does not like Vista x64. From my adventures on Google, it turns out that XP x64 is actually based on the Server 2003 code tree, while Vista is based on crap. :)

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59614 - Posted 16 Feb 2009 18:41:30 UTC

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...

Greg_BE

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59615 - Posted 16 Feb 2009 18:45:12 UTC - in response to Message ID 59614.

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...


If I remember correctly, they have created a 99 model stop line to keep the tasks from running forever.

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59617 - Posted 16 Feb 2009 19:25:37 UTC
Last modified: 16 Feb 2009 19:27:33 UTC

Yeah, the 99 stop limit was to avoid a problem with the file size that is zipped up and uploaded. However, I was just wondering if there is now a new companion problem that the validator does not properly handle those results... or, the result itself is somehow bad...

In that I have gone back to the 3rd of Feb and have at least a hundred (220) results with only three errors this is a puzzlement ...

{edit}
added number ..

Also I note that the runtime is only 145 seconds ... so that was fast work ... :)
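
To show how a task can stop at 99 decoys long before the preferred runtime, here is a toy sketch. Only the 99 limit comes from the changelog; the time check is my guess at how such a loop might look, and the 1.5 seconds per decoy is a round figure close to 145 seconds / 99, not a measured value.

```python
MAX_DECOYS = 99  # per-workunit limit from the 1.54 changelog


def decoys_produced(seconds_per_decoy, runtime_pref_seconds, max_decoys=MAX_DECOYS):
    """Count decoys until either the decoy cap or the preferred runtime is reached."""
    elapsed = 0.0
    decoys = 0
    # Hypothetical rule: start another decoy only if it fits in the remaining time.
    while decoys < max_decoys and elapsed + seconds_per_decoy <= runtime_pref_seconds:
        elapsed += seconds_per_decoy
        decoys += 1
    return decoys, elapsed


# Very fast models with a 4 hour preference: the decoy cap ends the task first.
print(decoys_produced(1.5, 14400))
# Slow models: the runtime preference ends the task first.
print(decoys_produced(3600, 14400))
```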

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,403
RAC: 0
Message 59625 - Posted 17 Feb 2009 2:22:04 UTC

I started running Rosetta this morning on a 64bit Vista machine and all seems to be working well. It's been working well on other projects too. Here is what I'm running:

Core i7 920 CPU
Asus P6T6 WS Revolution motherboard
6Gb DDR3 Triple Channel RAM
Vista Home Premium SP1 64bit

64bit BOINC 6.6.7

As I said, no problems yet and a number of WU's have completed already.


____________

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,403
RAC: 0
Message 59626 - Posted 17 Feb 2009 3:14:15 UTC

Ok, after a number of successful completions, I did see one that looks like it failed. Message as follows:

2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished
2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent


Don't know the cause of that one...

____________

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59627 - Posted 17 Feb 2009 6:35:01 UTC

Well, a couple hundred tasks and several with the same error, multiple systems (3 different), based on Xeon, Q9300, and i7 processors, various amounts of available RAM, though in common all are running Win XP Pro 32-Bit:

228932012
229013783
229066094
229072515

The error:

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59631 - Posted 17 Feb 2009 12:16:07 UTC - in response to Message ID 59601.


I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.


Hey, you're right, all my errors are with Hbond tripped in stderr, so I think that it's a source of problems

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,403
RAC: 0
Message 59632 - Posted 17 Feb 2009 15:42:53 UTC
Last modified: 17 Feb 2009 15:45:01 UTC

So... I completed a bunch more tasks successfully, then got a 2nd task where it said the output file was missing. Anyone else getting these?

2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished
2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent

I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?
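
If anyone wants to check their own failures for the same pattern, a quick sketch along these lines would do it. The assumption that the last two underscore-separated fields are a job number and a replica index is mine, based only on how these names look, and `batch_prefix` is a made-up helper.

```python
import os
from collections import Counter


def batch_prefix(task_name):
    """Drop the last two underscore-separated fields (assumed: job number and replica index)."""
    return task_name.rsplit("_", 2)[0] + "_"


failed = ["ss-neg-1i17__7365_4677_1", "ss-neg-1i17__7365_5964_0"]

# Longest shared start of all failed task names.
print(os.path.commonprefix(failed))
# How many failures fall into each batch.
print(Counter(batch_prefix(name) for name in failed))
```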
____________

Feet1st

Joined: Dec 30 05
Posts: 1725
ID: 44890
Credit: 843,485
RAC: 54
Message 59633 - Posted 17 Feb 2009 17:14:07 UTC - in response to Message ID 59632.
Last modified: 17 Feb 2009 17:15:24 UTC


I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?


I had one of those fail too. Firewall blocked it from reporting the symbol tables :(
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59634 - Posted 17 Feb 2009 17:25:15 UTC

Looks like Pharrg actually had three of these fail

ss-neg-1i17__7365_5964_0
ss-neg-1i17__7365_5190_1 (wingman failed too)
ss-neg-1i17__7365_4677_1 (wingman failed too)

____________
Rosetta Moderator: Mod.Sense

Feet1st

Joined: Dec 30 05
Posts: 1725
ID: 44890
Credit: 843,485
RAC: 54
Message 59635 - Posted 17 Feb 2009 17:40:09 UTC

I had two more similar tasks on my machines, so I suspended others to try and run them.

I've got an ss-neg-1je9 that seems normal so far. But my other ss-neg-1i17 doesn't seem able to display graphics. Black window, no pane lines, on WinXP.
____________

Feet1st

Joined: Dec 30 05
Posts: 1725
ID: 44890
Credit: 843,485
RAC: 54
Message 59637 - Posted 17 Feb 2009 18:44:34 UTC
Last modified: 17 Feb 2009 18:45:25 UTC

Yep, my next ss-neg-1i17 failed too.

As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
____________

Greg_BE

Joined: May 30 06
Posts: 4363
ID: 85645
Credit: 774,444
RAC: 465
Message 59638 - Posted 17 Feb 2009 21:39:56 UTC

2 ss-neg tasks died on me as well. I have a 3rd in progress, at 50% complete so far.

Here are the failures:

ss-neg-1i17__7365_1743_0

ss-neg-1i17__7365_542_1

They both do the following:

initialization is ok, but then when it is about to start it errors out:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
----------

Sid Celery

Joined: Feb 11 08
Posts: 366
ID: 241409
Credit: 1,126,789
RAC: 1,954
Message 59640 - Posted 17 Feb 2009 23:35:02 UTC
Last modified: 17 Feb 2009 23:35:45 UTC

Ditto:

ss-neg-1i17__7365_5466_0
ss-neg-1i17__7365_1656_0
____________

svincent

Joined: Dec 30 05
Posts: 130
ID: 44923
Credit: 951,788
RAC: 618
Message 59641 - Posted 18 Feb 2009 0:53:03 UTC

A couple of these ss-neg-1i17* workunits are failing on Mac OS X 10.4.11

Workunit 208810096, Task 229094592, Name ss-neg-1i17__7365_4132_0

and

Workunit 208854507, Task 229142269, Name ss-neg-1i17__7365_4742_0

They're both failing in the same routine. Here's the crash info from the first one:

Thread 0 Crashed:
0 ...etta_1.54_i686-apple-darwin 0x001b13b7 __ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE + 235
1 ...etta_1.54_i686-apple-darwin 0x00027735 __ZN4core12conformation12Conformation15setup_atom_treeEv + 109
2 ...etta_1.54_i686-apple-darwin 0x0002a378 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 2910
3 ...etta_1.54_i686-apple-darwin 0x00400e64 __ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE + 516
4 ...etta_1.54_i686-apple-darwin 0x00107b23 __ZN9protocols5boinc5Boinc18worker_is_finishedERKi + 913
5 ...etta_1.54_i686-apple-darwin 0x00c8d172 __ZN9protocols7jobdist18BaseJobDistributorIN7utility7pointer10owning_ptrINS0_8BasicJobEEEE8next_jobERS6_Ri + 2102
6 ...etta_1.54_i686-apple-darwin 0x001177a5 __ZN9protocols8abinitio18AbrelaxApplication4foldERN4core4pose4PoseEN7utility7pointer10owning_ptrINS_8ProtocolEEE + 1449
7 ...etta_1.54_i686-apple-darwin 0x001289ad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 807
8 ...etta_1.54_i686-apple-darwin 0x000039cc _main + 1356
9 ...etta_1.54_i686-apple-darwin 0x00001dee __start + 216
10 ...etta_1.54_i686-apple-darwin 0x00001d15 start + 41
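
The frames above show mangled C++ names. A small Python sketch for making them readable follows; it assumes the `c++filt` tool is installed, and the helper names are hypothetical. Mac OS X symbols carry one extra leading underscore (`__ZN...` rather than `_ZN...`), so the sketch strips it before demangling. Whether a given `c++filt` build already does that on its own varies by platform, so treat the stripping step as an assumption to verify locally.

```python
import re
import shutil
import subprocess

# One crash-report frame: index, image name, address, symbol, "+ offset".
FRAME_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\S+)\s+\+\s+(\d+)\s*$"
)


def frame_symbol(line):
    """Return the symbol column of a crash-report frame, or None."""
    match = FRAME_RE.match(line)
    return match.group(4) if match else None


def strip_darwin_underscore(symbol):
    """Turn '__ZN...' (Mach-O spelling) into '_ZN...' (plain Itanium ABI)."""
    return symbol[1:] if symbol.startswith("__Z") else symbol


def demangle(symbol):
    """Demangle with c++filt if it is on PATH; otherwise return the input."""
    tool = shutil.which("c++filt")
    if tool is None:
        return symbol
    result = subprocess.run(
        [tool, strip_darwin_underscore(symbol)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or symbol
```

Frame 1, for example, is core::conformation::Conformation::setup_atom_tree(), reached from the binary pose reader in frame 3 (core::io::serialization::read_binary).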


____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 13,026,869
RAC: 16,575
Message 59645 - Posted 18 Feb 2009 4:37:41 UTC

I've had three ss-neg-1i17__7365 WUs fail with segmentation violations on three different linux machines:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229167706
http://boinc.bakerlab.org/rosetta/result.php?resultid=229161990
http://boinc.bakerlab.org/rosetta/result.php?resultid=229084435

(I notice that only the third number is different in the stack traces of the above three WUs.)

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59647 - Posted 18 Feb 2009 9:16:58 UTC

A workunit with some odd behavior, but no definite error:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=209046400

A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.

It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.

Is this something normal that just happened at an unusual time, or something more significant?

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59649 - Posted 18 Feb 2009 10:57:15 UTC - in response to Message ID 59520.

Mod.Sense

What is it showing for the estimated runtime, before the task starts?


There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%

I think my settings were asking for about 6 hours of runtime before and ask for 10 hours now. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT for Rosetta, but I'm willing to wait a while longer.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59650 - Posted 18 Feb 2009 13:14:18 UTC

Three more errors ... this time two I have not seen before:

229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader

229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz

So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...

Keith T.
Avatar

Joined: Mar 1 07
Posts: 36
ID: 150379
Credit: 12,752
RAC: 0
Message 59651 - Posted 18 Feb 2009 13:30:29 UTC

At least 3 of my recent tasks have resulted in Validate errors.

http://boinc.bakerlab.org/rosetta/result.php?resultid=227721905
http://boinc.bakerlab.org/rosetta/result.php?resultid=227934901
http://boinc.bakerlab.org/rosetta/result.php?resultid=227919237

Please could someone in authority explain why there have been so many of these recently?

I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.

Keith

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59655 - Posted 18 Feb 2009 14:47:25 UTC

rembertw, the maximum runtime preference possible is 24hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28hrs. So, if you could, let it run at least 29hrs and if it is still running at that point, then abort it.
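
The rule of thumb above can be written down as a short Python sketch. The 28-hour figure is the watchdog limit quoted in this post; the one-hour grace period is simply read off the "at least 29hrs" advice, and the function and constant names are made up for the example.

```python
WATCHDOG_LIMIT_HOURS = 28.0  # limit quoted above for v1.54 tasks
MANUAL_GRACE_HOURS = 1.0     # assumption: taken from the "at least 29hrs" advice


def parse_hms(text):
    """Convert a BOINC-style 'HH:MM:SS' runtime into hours."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60 + seconds / 3600


def should_abort_manually(runtime_hours):
    """True once a task has outlived the watchdog limit plus the grace period."""
    return runtime_hours >= WATCHDOG_LIMIT_HOURS + MANUAL_GRACE_HOURS


# The task reported earlier in the thread, at 18:03:14 of runtime:
still_wait = not should_abort_manually(parse_hms("18:03:14"))
```

With the reported runtime of 18:03:14 the task is still well inside the window, so the advice is to keep waiting.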

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? Antivirus software? Windows service pack? Age of machine? BOINC version?
____________
Rosetta Moderator: Mod.Sense

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59657 - Posted 18 Feb 2009 15:01:59 UTC

Another hbond tripped

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 2
Message 59658 - Posted 18 Feb 2009 18:57:12 UTC

About 12 hours ago the next WU ended with an Unhandled Exception Detected:

ss-neg-1i17__7365_3969_1

This WU had the same error before running on another computer.

Path7.

Sid Celery

Joined: Feb 11 08
Posts: 366
ID: 241409
Credit: 1,126,789
RAC: 1,954
Message 59667 - Posted 19 Feb 2009 5:04:25 UTC

Another one snuck through:

ss-neg-1i17__7365_4076_1

Looks like I'll have to abort all these on sight. I'm not sure any of them have run successfully for me yet. :(
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59668 - Posted 19 Feb 2009 7:07:58 UTC

New error -161 on both Mini 1.54 and 5.98 ...

Mini-1.54
229605017
229597762
229594079
229593677

5.98
229601150

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59672 - Posted 19 Feb 2009 16:29:16 UTC
Last modified: 19 Feb 2009 16:32:01 UTC

Hey! Very strange one! It's valid, but with Hbond tripped and a very short runtime, 2380 secs instead of ~10000:
loopbuild_chunk_1_3_B_hb_t357__IGNORE_THE_REST_1VBGA_4_7477_27_0

BTW, I notice that all my wrong results (and this last one) are loopbuild_chunk*.

xrobert Profile

Joined: Oct 28 05
Posts: 3
ID: 7210
Credit: 56,694
RAC: 121
Message 59674 - Posted 19 Feb 2009 18:02:55 UTC

So far, all my mini-Rosetta WUs are sticking. I have to abort them.
The normal WUs work fine.


____________

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59677 - Posted 20 Feb 2009 7:03:21 UTC - in response to Message ID 59655.
Last modified: 20 Feb 2009 7:12:40 UTC

mod.sense

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?


It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same BOINC version.

Some things I noticed:
- When a 0% task (only with Rosetta 1.54) is paused manually after x hours and then restarted, its runtime also resets to 0.
- When the 1.54 task starts, both processors get work (multiple projects). However, when one of the other projects' tasks stops, the 2nd processor starts idling. It cannot get another task to run from Rosetta or any other project, despite the queue having multiple tasks ready to start or continue.

I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It had already run 24h+ before, but its time was reset by a pause. At this moment it is at 19h again. I will let it run until it gets past 31h of runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.

[edit]Changed "all" in "both" and corrected a typo[/edit]

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59683 - Posted 20 Feb 2009 14:32:26 UTC

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
____________
Rosetta Moderator: Mod.Sense

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59684 - Posted 20 Feb 2009 15:23:06 UTC - in response to Message ID 59683.

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...

robertmiles Profile

Joined: Jun 16 08
Posts: 424
ID: 264600
Credit: 312,368
RAC: 198
Message 59686 - Posted 20 Feb 2009 16:41:25 UTC - in response to Message ID 59684.

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...


Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59688 - Posted 20 Feb 2009 18:31:56 UTC
Last modified: 20 Feb 2009 18:33:48 UTC

robertmiles, if you were directing the question to me, I try to stay out of that one. And am only recommending a change to BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF problems reported on the 6.6 (which is the current test version) and I think 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (nothing against 6.2.28, but it's not listed anymore for some reason)

You can see more BOINC versions for download on this page:
http://boinc.berkeley.edu/download_all.php
____________
Rosetta Moderator: Mod.Sense

TimL

Joined: Sep 16 06
Posts: 13
ID: 112884
Credit: 2,118,054
RAC: 5,546
Message 59723 - Posted 22 Feb 2009 9:59:14 UTC

Hi all,
loopbuild_mamaln_ideal_hb_t305__IGNORE_THE_REST_1zc0_1_7630_19 finished early with error -
Access Violation (0xc0000005) at address 0x7C91AA01 read attempt to address 0x0D1BF548

Haven't had much luck getting errors of late but will mention that I had just bumped the bus speed up a touch when this error occurred.


____________

TomaszPawel

Joined: Apr 28 07
Posts: 52
ID: 170716
Credit: 1,723,353
RAC: 763
Message 59751 - Posted 23 Feb 2009 7:06:15 UTC - in response to Message ID 59045.

Hi:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237514
http://boinc.bakerlab.org/rosetta/result.php?resultid=229145242
http://boinc.bakerlab.org/rosetta/result.php?resultid=228892067
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820491
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820477

Any tips?

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 608,823
RAC: 73
Message 59752 - Posted 23 Feb 2009 7:50:20 UTC - in response to Message ID 59688.

Mod.Sense

And am only recommending a change to BOINC version because problems are occurring with the version installed now.

I set up Boinc 6.4.5 on that computer, and it seems to be running fine with Rosetta. I still will wait for a general upgrade until there are new Boinc versions, I think.

robertmiles
"Current" is for me the version that the actual BOINC site gives as standard. Researching older versions and installing those is too much micromanagement for me. The same goes for posting on the boards... If this problem gets solved with 6.4.5 (and it seems to be solved) then I'm off again.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2399
ID: 106194
Credit: 0
RAC: 0
Message 59756 - Posted 23 Feb 2009 14:09:26 UTC - in response to Message ID 59751.

Hi:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237514
http://boinc.bakerlab.org/rosetta/result.php?resultid=229145242
http://boinc.bakerlab.org/rosetta/result.php?resultid=228892067
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820491
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820477

Any tips?


Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. It seems to be something specific to the 1i17; the other ss-neg tasks do not seem to be having any trouble.

The exception is the last one on your list, which got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?
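
The observation that the failures cluster on one target can be checked mechanically. Below is a rough Python sketch that groups failed task names by the part before the first double underscore. Treating that prefix as the batch or target name is an assumption based on the names posted in this thread, not a documented naming rule, and the function names are hypothetical.

```python
from collections import Counter


def batch_of(task_name):
    """Return the part of a task name before the first '__' (assumed batch name)."""
    head, separator, _rest = task_name.partition("__")
    return head if separator else task_name


def count_by_batch(task_names):
    """Count failed tasks per assumed batch name."""
    return Counter(batch_of(name) for name in task_names)


# Task names reported as failing earlier in this thread.
failed = [
    "ss-neg-1i17__7365_1743_0",
    "ss-neg-1i17__7365_542_1",
    "ss-neg-1i17__7365_5466_0",
    "ss-neg-1i17__7365_1656_0",
    "ss-neg-1i17__7365_4132_0",
    "ss-neg-1i17__7365_4742_0",
    "ss-neg-1i17__7365_3969_1",
    "ss-neg-1i17__7365_4076_1",
]
counts = count_by_batch(failed)
```

A batch that dominates the counts while its sibling batches stay clean is a good hint that the input data for that batch, rather than the volunteer machines, is at fault.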
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,023,621
RAC: 81
Message 59761 - Posted 23 Feb 2009 18:49:59 UTC

-161 error on 230728890

RodrigoPS
Avatar

Joined: Nov 28 08
Posts: 3
ID: 289807
Credit: 780,161
RAC: 30
Message 59782 - Posted 24 Feb 2009 22:01:20 UTC

I noticed that with minirosetta 1.54 the granted credit was very low on Athlon X2 processors, sometimes half the claimed credit. This did not occur with the single-core Athlon.