Rosetta@home

Problems with Minirosetta v1.54

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 59045 - Posted 26 Jan 2009 22:45:57 UTC
Last modified: 27 Jan 2009 1:36:31 UTC

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here: http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bug fixing and stability issues - we have virtually no new science in it, but we will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark jobs (_rlbd_) - caused by a buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, and now returns information on aborted jobs to help us figure out how the remaining long-running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words, WUs should not overrun by more than around 4 hours. If they do, please let us know!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessively large uploads.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many potential problems will now show up as meaningful error messages and not as random segmentation faults.
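
The watchdog rule described in the list above can be sketched as follows. This is only an illustration, assuming CPU time is tracked in seconds; the names `watchdog_should_abort` and `WATCHDOG_GRACE_SECONDS` are made up for this example and are not taken from the minirosetta source.

```python
# Hypothetical sketch of the watchdog rule from the changelog:
# abort when the runtime exceeds the preferred runtime plus 4 hours.
WATCHDOG_GRACE_SECONDS = 4 * 3600  # the "+ 4 hours" from the changelog


def watchdog_should_abort(cpu_seconds: float, preferred_seconds: float) -> bool:
    """Return True once the task has overrun its preference by more than the grace period."""
    return cpu_seconds > preferred_seconds + WATCHDOG_GRACE_SECONDS


if __name__ == "__main__":
    preferred = 4 * 3600  # a 4-hour runtime preference
    for hours in (3, 7.5, 8.5):
        print(hours, watchdog_should_abort(hours * 3600, preferred))
```

With a 4-hour preference, a task under this rule would only be aborted once it has used more than 8 hours of CPU time.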



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them now occur extremely rarely, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. In particular, I'd like to know about


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59047 - Posted 26 Jan 2009 23:40:44 UTC

The link in the news item that should bring you to this thread is truncated.
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59048 - Posted 27 Jan 2009 0:21:40 UTC
Last modified: 27 Jan 2009 0:24:04 UTC

The news item also shows the year as 2008 (which is probably the last time you had enough coffee to be able to read the calendar!!). All these improvements are going to send TeraFLOPS much higher! Nice work Mike, and BakerLab. I can really see that you've come through for people here.
____________
Rosetta Moderator: Mod.Sense

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 18
ID: 93705
Credit: 259,459
RAC: 0
Message 59074 - Posted 27 Jan 2009 22:46:52 UTC

Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
____________

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59086 - Posted 28 Jan 2009 6:50:38 UTC

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59088 - Posted 28 Jan 2009 9:55:25 UTC

I don't know about others but my Rosetta machines are running dry!!! The new minirosetta is stuck downloading at 89.25% and has been there for HOURS!!!!
I have had to attach to a different project until it gets sorted out. So far all machines have the exact same problem, one a dual core, one a single core. If you look at my computers, they are not hidden, any task that says "outcome unknown" is because the mini-rosetta download ain't happening!!!! The messages in Boinc say:
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
etc, etc, etc, etc forever!!!!
Another project now loves you!!
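
The messages above can be checked for how long each download attempt lasted before failing. The sketch below is only an illustration using the four download lines quoted in this post; the helper names are hypothetical and the `time|project|message` layout is inferred from the pasted lines.

```python
from datetime import datetime

# Lines copied from the post above (layout: time|project|message).
LOG = """\
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
"""


def parse_line(line):
    """Split one message into (timestamp, project, text)."""
    stamp, project, text = line.split("|", 2)
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"), project, text


def attempt_durations(log):
    """Pair each 'Started download' with the next failure and return the seconds elapsed."""
    durations = []
    started = None
    for line in log.splitlines():
        when, _project, text = parse_line(line)
        if text.startswith("Started download"):
            started = when
        elif "failed download" in text and started is not None:
            durations.append(int((when - started).total_seconds()))
            started = None
    return durations


print(attempt_durations(LOG))  # [308, 22]
```

The first attempt ran for 308 seconds, a little over five minutes, before the HTTP error was reported.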
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59093 - Posted 28 Jan 2009 12:27:44 UTC

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your machines are stopping... is it on the same file? You have to download the new programs, which are several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
____________
Rosetta Moderator: Mod.Sense

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 59103 - Posted 28 Jan 2009 19:59:21 UTC

Paul,

can you point me to the thing you read about lock-file problems on Einstein!?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU? If I can make this happen here on my machine I could learn more about what's going on.

Mike

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59106 - Posted 28 Jan 2009 20:20:22 UTC

What do you mean by 100% CPU ?


"computing preferences" configured on website for the venue of the machine. The setting is called "Use at most" at the bottom of the processor usage section.

Can also be configured via the BOINC Manager for a specific host.
____________
Rosetta Moderator: Mod.Sense

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59108 - Posted 28 Jan 2009 22:10:00 UTC - in response to Message ID 59093.

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your machines are stopping... is it on the same file? You have to download the new programs, which are several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

I do not use a proxy, just straight to the net. I use Comcast.

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. The one I am looking at right now has been trying for 11:51:02 and is going to retry in 03:34:34, and counting.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.

Yes I have, no luck, the file is stuck at 89.25, 89.26 or 89.27% depending on the pc. I am stuck at exactly 5.85 meg of 6.56 meg on all machines.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59110 - Posted 28 Jan 2009 22:21:08 UTC

mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>

If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
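
The difference can be illustrated with the HTTP/1.1 `Range` header, which lets a client ask only for the bytes it is still missing. The sketch below just builds request headers under that assumption; it is not BOINC's transfer code, and `resume_headers` is a hypothetical helper name.

```python
def resume_headers(bytes_already_downloaded: int, use_http_1_0: bool = False) -> dict:
    """Return request headers for continuing a partial download.

    Range requests were introduced with HTTP/1.1, so a client restricted to
    HTTP/1.0 sends no Range header and the transfer restarts from byte 0.
    """
    if use_http_1_0 or bytes_already_downloaded <= 0:
        return {}
    return {"Range": f"bytes={bytes_already_downloaded}-"}


print(resume_headers(5_850_000))
print(resume_headers(5_850_000, use_http_1_0=True))
```

If every attempt has to start from byte 0 and is cut off after a fixed window, a slow link can never finish a file that needs longer than that window.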
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 59112 - Posted 28 Jan 2009 23:24:06 UTC

I'm seeing a validate error on task 224245929 , workunit 204213187, Mac OS X 10.4.11. The task name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_115354_1 : it ran twice as long as it was supposed to and I was the second person to get it. The original person to whom it was sent also got the same validate error: irritating after it took twice as long as it was supposed to. It seems to be one of these zinc-containing proteins that have a habit of doing this.

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 1:26:32:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish

</stderr_txt>
]]>

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59116 - Posted 29 Jan 2009 0:11:48 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 59119 - Posted 29 Jan 2009 2:38:00 UTC - in response to Message ID 59108.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 59121 - Posted 29 Jan 2009 2:51:39 UTC

Long-running model reported in the appropriate thread here



____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 59122 - Posted 29 Jan 2009 2:57:50 UTC

I'm seeing a number of WUs ending at 99 models. They are ending normally, but they often take less than half my 12 hour (43,200 sec) preference.

Some examples:
http://boinc.bakerlab.org/rosetta/result.php?resultid=223957908
http://boinc.bakerlab.org/rosetta/result.php?resultid=223968996
http://boinc.bakerlab.org/rosetta/result.php?resultid=223981088
http://boinc.bakerlab.org/rosetta/result.php?resultid=223989528
http://boinc.bakerlab.org/rosetta/result.php?resultid=223997524
http://boinc.bakerlab.org/rosetta/result.php?resultid=224065056

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 59123 - Posted 29 Jan 2009 3:10:15 UTC - in response to Message ID 59122.

I'm seeing a number of WUs ending at 99 models. They are ending normally, but they often take less than half my 12 hour (43,200 sec) preference.

Some examples:
http://boinc.bakerlab.org/rosetta/result.php?resultid=223957908
http://boinc.bakerlab.org/rosetta/result.php?resultid=223968996
http://boinc.bakerlab.org/rosetta/result.php?resultid=223981088
http://boinc.bakerlab.org/rosetta/result.php?resultid=223989528
http://boinc.bakerlab.org/rosetta/result.php?resultid=223997524
http://boinc.bakerlab.org/rosetta/result.php?resultid=224065056


Sorry, I should have mentioned there is a new rule. Mini will not produce more than 99 models. It will finish gracefully and grant full credit. The reason for this is that I want to prevent your individual uploads from getting too large. In the future there will be a better way to do this, such as checking that the output file size has not reached some limit.
It's just another safety hook that's been put in to prevent WUs from misbehaving.
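
As a rough illustration of the rule, a stopping check might look like the sketch below. The 99-model cap comes from the post; the runtime comparison is a simplifying assumption, and the function name is hypothetical rather than anything from the Mini source.

```python
MAX_DECOYS_PER_WU = 99  # cap stated in the post above


def should_start_next_model(models_done: int, cpu_seconds: float, preferred_seconds: float) -> bool:
    """Start another model only if both the model cap and the runtime preference allow it."""
    if models_done >= MAX_DECOYS_PER_WU:
        return False  # end gracefully; full credit is still granted
    # Simplifying assumption: keep going while under the preferred runtime.
    return cpu_seconds < preferred_seconds


# A task with a 12-hour (43,200 s) preference that reaches 99 models after 5 hours stops early.
print(should_start_next_model(99, 5 * 3600, 43_200))
print(should_start_next_model(98, 5 * 3600, 43_200))
```

This matches the behaviour reported above: fast models hit the cap well before the 12-hour preference is used up.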

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 18
ID: 93705
Credit: 259,459
RAC: 0
Message 59124 - Posted 29 Jan 2009 5:11:01 UTC

Hello all.
I had no problems receiving Minirosetta v1.54 WUs.
I received 17 WUs, due by February 6, 2009 at 21:28:04 (France time).
The first calculations should begin today (January 29), and if there are problems I will report them here.
____________

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59126 - Posted 29 Jan 2009 6:49:25 UTC - in response to Message ID 59103.

Paul,

can you point me to the thing you read about Lockfile problems on Einstein !?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.

Mike


Two places to start are here and here ... I can also report that since I made that change I have been getting good results on Win XP systems ... I cannot see the high error rate I had in the past as the tasks have been purged ...

It seemed to me to be a problem I had on XP and it was most severe on the i7 where there are more things going on ...

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59128 - Posted 29 Jan 2009 11:34:33 UTC - in response to Message ID 59116.
Last modified: 29 Jan 2009 12:05:38 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.

Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!

Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59129 - Posted 29 Jan 2009 11:35:24 UTC - in response to Message ID 59119.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.
____________

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59132 - Posted 29 Jan 2009 13:26:47 UTC - in response to Message ID 59129.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59133 - Posted 29 Jan 2009 13:30:28 UTC - in response to Message ID 59128.

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.

Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!

Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!


Change #3....I downloaded and installed the latest version of DirectX, no changes noted.

Change #4....I installed Boinc 6.6.3, got this message "1/29/2009 8:28:31 AM|rosetta@home|Scheduler request completed: got 0 new tasks". I may have errored out all my available work for the day. No files downloading, so maybe it will take this time? No clue.
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59134 - Posted 29 Jan 2009 13:32:40 UTC - in response to Message ID 59110.
Last modified: 29 Jan 2009 13:39:56 UTC

mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>

If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.


Okay, I have downloaded the file and put it in the Boinc\Data directory. I took out the asterisks and changed the <file_xfer_debug> line to a 1; it was a zero.
As for the http setting I use Firefox 3.0.5 and do not see that setting. I know it is/was in IE, but I do not see it in Firefox.
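
For readers following along, a minimal cc_config.xml with only the transfer debug flag could be generated as in the sketch below. It assumes the usual BOINC layout, with logging flags inside `<log_flags>` under a `<cc_config>` root; the other flags from the moderator's example are not shown in this thread and are not reproduced here.

```python
import xml.etree.ElementTree as ET


def build_cc_config(file_xfer_debug: bool = True) -> str:
    """Build a minimal cc_config.xml body containing the file transfer debug flag."""
    root = ET.Element("cc_config")
    flags = ET.SubElement(root, "log_flags")
    ET.SubElement(flags, "file_xfer_debug").text = "1" if file_xfer_debug else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; it would be saved as cc_config.xml in the BOINC data
    # directory and picked up when the client restarts.
    print(build_cc_config())
```

An `<http_1_0>` entry, if present, is a separate BOINC client setting in this same file and has nothing to do with browser settings.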
____________

Scott A. Howard*

Joined: Oct 16 05
Posts: 2
ID: 4994
Credit: 8,416,995
RAC: 4,290
Message 59135 - Posted 29 Jan 2009 14:38:20 UTC
Last modified: 29 Jan 2009 15:00:22 UTC

Hello,

Here's the problem in a nutshell.

On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (for a total of 8 cores) running XP Pro SP3, almost every one of the Rosetta jobs (minirosetta version 1.54) fails. The typical failure mode is that they exceed their CPU time allocation. For example, if the job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress, other times they are stuck at zero.

Also, the exe is not removed from memory when the computer is in use.

I have reset the project and detached and attached again but it continues to happen.

Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines.

Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration - maybe a coding error showing up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 - 8).

I have a full development environment and debuggers if you want some traces.

Scott Howard


Addendum: Now that I've thought about it a little more, does the app use any global resource locking? E.g., mutexes, semaphores, file access? Maybe that's why the progress is halted, it's deadlocked - but I am not sure why the task would continue to use CPU time though. Just some random thoughts...
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59136 - Posted 29 Jan 2009 15:28:39 UTC
Last modified: 29 Jan 2009 15:31:55 UTC

mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.

Once you have the file in the directory, abort the transfer.

You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.

The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.

Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?

Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
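
A small sketch of the two commands being asked for is below. The host `boinc.bakerlab.org` is used only as a placeholder taken from the download URL earlier in the thread; the server named in the transfer debug messages may differ. The helper name is hypothetical.

```python
import platform


def diagnostic_commands(host: str, system: str = "") -> list:
    """Return the ping and traceroute command lines for the given OS name."""
    system = system or platform.system()
    if system == "Windows":
        return [["ping", "-n", "4", host], ["tracert", host]]
    # Most Unix-like systems use -c for the packet count and call the tool traceroute.
    return [["ping", "-c", "4", host], ["traceroute", host]]


for cmd in diagnostic_commands("boinc.bakerlab.org", "Windows"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` to execute it and capture the output for posting here.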
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59137 - Posted 29 Jan 2009 15:53:14 UTC - in response to Message ID 59074.

Hello, the version 1.47 was very well for me with 151 Workunits and 0 errors and an average CPU time 2.8 hours. Hope that the new version 1.54 will be as well...


1.47 worked rather well for me, with perhaps one out of ten workunits giving an error. Not enough 1.54 workunits yet to say whether 1.54 is better. I'm asking for 14 hour workunits, so it will take me longer to run that many.

Scott A. Howard*

Joined: Oct 16 05
Posts: 2
ID: 4994
Credit: 8,416,995
RAC: 4,290
Message 59138 - Posted 29 Jan 2009 16:00:11 UTC - in response to Message ID 59135.
Last modified: 29 Jan 2009 16:09:39 UTC

Here's a follow up.

I did the following:
1) detached from the project.
2) removed the Rosetta project folder from under \Boinc\...
3) removed all files from a slot that contained Rosetta data
4) reattached to the project
5) allowed for 50% of the cpus to be used (4 in this case)
6) allowed the four projects to run - each expected to take about 4 hours

Observed results: The status for the projects are "Running, high priority", each has used about 20 minutes of cpu time, the progress is 0.000%

Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.

It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles.

I see the thread that's consuming the CPU has a pretty regular call stack. Here is the call stack. If you have your debug symbols for your build, you should be able to locate the routine and line at which the program is hung...

ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
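
To match a stack like this against a build's debug symbols, the frames first need to be split into module and offset. The sketch below does only that parsing step, on a few lines copied from the stack above; the regular expression and helper names are illustrative assumptions, and resolving offsets to source lines still requires the project's own symbol files.

```python
import re

# module, optional !symbol, then a hexadecimal offset
FRAME = re.compile(r"^(?P<module>[^!+\s]+)(?:!(?P<symbol>\w+))?\+0x(?P<offset>[0-9a-fA-F]+)")

STACK = """\
ntkrnlpa.exe!KiSwapContext+0x2f
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
"""


def parse_frames(stack_text):
    """Return a list of (module, symbol_or_None, offset) tuples, outermost frame last."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((match.group("module"), match.group("symbol"), int(match.group("offset"), 16)))
    return frames


def first_app_frame(frames, prefix="minirosetta"):
    """Return the first frame that belongs to the application rather than the OS."""
    for frame in frames:
        if frame[0].startswith(prefix):
            return frame
    return None


module, _symbol, offset = first_app_frame(parse_frames(STACK))
print(module, hex(offset))
```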
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59139 - Posted 29 Jan 2009 16:06:12 UTC - in response to Message ID 59136.
Last modified: 29 Jan 2009 16:46:54 UTC

mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.

Once you have the file in the directory, abort the transfer.

You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.

The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.

Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?

Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?


I changed the dual core settings to use both cores (this is a laptop and I do not like stressing it that much) and set the other project to no new work. I updated Rosetta and it proceeded to download new work. The same file stopped at the same place, 89.25%. I aborted it, after all other files were done downloading, and no new entries showed up in the cc_config.xml file.
I was browsing thru the stdout.txt file and found this:
9:21:33 AM: Error: can't open file 'C:\Boinc\\RebootPending.txt' (error 2: the system cannot find the file specified.)
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'

[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init boinc_socket returned 444

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init attempting connect

[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init_poll sock = 444
It is in there many, many times.
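For anyone reading along: Winsock error 10061 is WSAECONNREFUSED ("connection refused"). Since it comes from RPC_CLIENT::init, it probably means that whatever was making the RPC call could not reach the BOINC client at that moment. That is only a guess from the log text. A rough Python sketch for counting how often each Winsock error shows up in such a log (the function name and the sample text are just illustrative, not part of BOINC):

```python
import re
from collections import Counter

# Matches the error number in lines such as:
#   ... RPC_CLIENT::init connect 2: Winsock error '10061'
WINSOCK_RE = re.compile(r"Winsock error '(\d+)'")

def count_winsock_errors(log_text):
    """Return a Counter mapping Winsock error code -> number of occurrences."""
    return Counter(int(code) for code in WINSOCK_RE.findall(log_text))

# Sample lines copied from the log excerpts in this thread.
sample = """\
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'
"""
print(count_winsock_errors(sample))
```

To run it on a real log, read the file into a string first and pass that in.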

I do not see what server I am downloading from, and only use the Windows firewall, so unless I could block thru the Hosts files, I do not know how to block that particular server anyway.

Yes each retry deferral is about 4 minutes.

I did find one more thing in that stdoutgui.txt file:
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'

[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect on 524 returned -1

It is also in there many, many times. I did a search and found where it said to change the attributes for the Boinc directory and all subdirectories. It was set to read only, and when I unchecked that and changed it for all subdirectories as well, Boinc would not run. It also automatically defaults back to read only after it errors out.
DO NOT DO THIS LAST PART! It crashed my whole Boinc setup and I had to delete the Boinc directory and all subdirectories, then reboot and reinstall Boinc from scratch. FORTUNATELY it did a repair install instead of a brand new install from scratch! I lost all workunits from all projects though!!!! I attached to Rosetta and guess what? The EXACT SAME FILE is stuck at the EXACT SAME PLACE!!! A TON of files are downloading besides just that one, but that one is stuck all over AGAIN!!!
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59140 - Posted 29 Jan 2009 16:07:39 UTC - in response to Message ID 59132.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59141 - Posted 29 Jan 2009 16:12:43 UTC - in response to Message ID 59128.

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.


What antivirus program do you have, and what version? Some antivirus programs don't fully turn off when you try to turn them off; they stop reporting that they have found a virus, but don't stop looking for a virus.

I'm also running Ad-Aware, but without this problem, so this antispyware program is less likely to be causing the problem.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59142 - Posted 29 Jan 2009 16:23:13 UTC - in response to Message ID 59129.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59144 - Posted 29 Jan 2009 16:52:02 UTC - in response to Message ID 59142.
Last modified: 29 Jan 2009 16:53:29 UTC

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59145 - Posted 29 Jan 2009 16:58:30 UTC
Last modified: 31 Jan 2009 19:52:02 UTC

Scott Howard:

Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.


There are many many BOINC settings possible and you've not described any of yours. When you set BOINC to run based on preference, you are telling it to only use CPU on the days and during the hours you've configured. If you've configured it to not be running at the current time or day of the week, it will suspend the currently active tasks. Any time a task is suspended, it will not make any progress. And there is a memory setting for whether or not tasks should remain "in memory" (virtual memory) while suspended. Doing so preserves the work done since the last checkpoint taken by the task.

...so major portions of what you are reporting may be exactly what you have configured BOINC to do.

You have 4 hosts; three are Windows XP and one is Win Vista. Which one is having problems? Is it this one? There are many failed tasks there with access violations. Are you overclocking this machine? Other than more CPUs and a different CPU type, what is different about this machine compared to your others that have been running fine?
____________
Rosetta Moderator: Mod.Sense

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59148 - Posted 29 Jan 2009 17:14:30 UTC - in response to Message ID 59140.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.


I also use the Windows firewall, but the Vista SP1 version.

Laptops sometimes have problems with overheating when running BOINC workunits but set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use?

I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.

http://www.almico.com/speedfan.php

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 59149 - Posted 29 Jan 2009 17:16:54 UTC

To Scott_A_Howard,

I notice that your 8 core machine only has 3GB. That's a bit small for 8 rosetta tasks. In your BOINC preferences what percent of memory are you allowing the machine to use when the machine is/isn't in use? You might try setting both to 100% on that machine and see if it makes any difference.
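To put a rough number on that, here is a small Python sketch of the arithmetic. It simply splits the RAM evenly and ignores what the operating system and other programs need. The 50% figure in the second call is only a hypothetical preference setting, not something read from that machine:

```python
def memory_per_task_mb(total_gb, tasks, usable_fraction=1.0):
    """Rough share of RAM per task in MB, ignoring the OS and other programs."""
    return total_gb * 1024 * usable_fraction / tasks

# 3 GB shared by 8 simultaneous tasks, with BOINC allowed 100% of memory.
print(memory_per_task_mb(3, 8))

# Same machine if the preference only allowed 50% of memory (hypothetical value).
print(memory_per_task_mb(3, 8, 0.5))
```

With everything allowed, that is 384 MB per task; at a 50% limit it drops to 192 MB, which is why raising both memory percentages is worth a try.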

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59150 - Posted 29 Jan 2009 17:23:17 UTC - in response to Message ID 59145.

Scott Howard:


I seem to have the same problem. No special settings in Rosetta preferences, all kind of computers under XP, and tasks running 100+ hours with 0% progress.

Nothing But Idle Time

Joined: Sep 28 05
Posts: 209
ID: 1675
Credit: 139,545
RAC: 0
Message 59151 - Posted 29 Jan 2009 17:44:21 UTC

resultid=224470749

Reason: Access Violation (0xc0000005) at address 0x00467846 read attempt to address 0x11B524C4

This task was running fine but after I suspended it, rebooted my system, and restarted the task it terminated almost immediately with access violation. Maybe restarts don't work very well or something is flakey with my hard drive or system. Having some troubles with access violations on Einstein tasks as well. But I've run memtest86 and prime95 and CHKDSK and none of them indicate any local computer problems. I'm just shaking my head in disgust.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59152 - Posted 29 Jan 2009 17:51:16 UTC - in response to Message ID 59144.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.


I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.

My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?

Also, you may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.

AdeB Profile

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,358,915
RAC: 2,105
Message 59153 - Posted 29 Jan 2009 17:58:43 UTC

This task was aborted after my preferred runtime + 4 hours. It was working on the 3rd model.
stderr out:

...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- F_00008_0003416_0
Fullatom mode ..
# cpu_run_time_pref: 43200
Starting work on structure: S_shuffle_00002 <--- F_00001_0000109_0
Fullatom mode ..
Starting work on structure: S_shuffle_00003 <--- F_00002_0003276_0
Fullatom mode ..
Hbond tripped.
====>
called boinc_finish


AdeB
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59156 - Posted 29 Jan 2009 18:46:03 UTC - in response to Message ID 59148.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


got any firewalls active?


No I use the Windows one, I have Windows XP Media Center on this laptop.


I also use the Windows firewall, but the Vista SP1 version.

Laptops sometimes have problems with overheating when running BOINC workunits but set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use?

I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.

http://www.almico.com/speedfan.php


I only run one core so the setting is to use 50% of the cpu's. Thus I do not have a problem with overheating on this laptop. I have it set to use 100% of the available cpu.
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59157 - Posted 29 Jan 2009 18:51:04 UTC - in response to Message ID 59152.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


I turned it off and nothing changed, I use the free version of Avast.


I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?


I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!

I am also using the 4.8 Home version of Avast.


I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.

My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?

This is an Intel T2300 dual core, only using one of them for Boinc, 1.6ghz machine.

Also, you may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.

Yeah me telling Comcast what to do isn't going to happen in this lifetime. I can download any file in the World EXCEPT this damned mini-rosetta file and then ONLY thru Boinc!!!! I download the same file thru a direct download, Boinc just won't recognize it. Yes I did put the file in the proper directory. We have been thru this already.
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59159 - Posted 29 Jan 2009 19:05:10 UTC

I am not sure that Comcast is the problem, as I do use their AV software and have no problems with downloading work ...

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59160 - Posted 29 Jan 2009 19:14:19 UTC
Last modified: 29 Jan 2009 19:17:27 UTC

mikey, have you tried a different version of BOINC?
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59161 - Posted 29 Jan 2009 19:17:13 UTC

rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
____________
Rosetta Moderator: Mod.Sense

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59162 - Posted 29 Jan 2009 19:24:53 UTC - in response to Message ID 59160.
Last modified: 29 Jan 2009 19:46:58 UTC

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59163 - Posted 29 Jan 2009 20:13:06 UTC - in response to Message ID 59162.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 59166 - Posted 29 Jan 2009 21:06:07 UTC

mikey,

You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.

If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
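
One way to rule out a corrupted manual download is to compute the file's checksum yourself. As far as I know, BOINC records an MD5 checksum for each file (the md5_cksum entry in client_state.xml), but treat that as an assumption and check your own installation. A minimal Python sketch, with hypothetical function names:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected_md5):
    """Compare the file's digest with an expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

You would call matches_expected() with the path of the manually downloaded .exe and the checksum your client expects. If the two differ, the copy on disk is not the file the server sent.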

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59172 - Posted 29 Jan 2009 23:31:23 UTC - in response to Message ID 59163.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59173 - Posted 29 Jan 2009 23:35:32 UTC - in response to Message ID 59166.
Last modified: 29 Jan 2009 23:47:48 UTC

mikey,

You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.

If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.


Nope this is the only message regarding the file:
1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe

I exited Boinc, deleted the old file, copied the new one into its location and then restarted the whole pc. Then when Boinc started up that message, along with a few dozen others, came up.
I appreciate all the help but I am done trying to make this work. I am on to another project and will try again another time. THANK YOU ALL!!!!

PS in the time it took me to type this I attached to Poem@Home and got 8 new units plus all the associated files and the pc is now happily crunching.

Thanks again for all your help, I still have a hard time believing it is my pc that can download just fine from any other project but just cannot download one file from Rosetta. Here is a partial list of files just downloaded:
1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86
1/29/2009 6:37:15 PM|Poem@Home|Started download of JParmJan97
1/29/2009 6:37:23 PM|Poem@Home|Sending scheduler request: To fetch work. Requesting 95475 seconds of work, reporting 0 completed tasks
1/29/2009 6:37:28 PM|Poem@Home|Scheduler request completed: got 8 new tasks
1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86
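
Message lines like these are easy to scan by script for downloads that start but never finish. A small Python sketch, assuming the "date|project|message" layout shown in these excerpts (the function name is made up for illustration):

```python
STARTED = "Started download of "
FINISHED = "Finished download of "

def unfinished_downloads(log_lines):
    """Return (project, file) pairs that were started but never reported finished."""
    started, finished = [], set()
    for line in log_lines:
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a "date|project|message" line
        _, project, message = parts
        if message.startswith(STARTED):
            started.append((project, message[len(STARTED):]))
        elif message.startswith(FINISHED):
            finished.add((project, message[len(FINISHED):]))
    return [item for item in started if item not in finished]

# Lines copied from the messages quoted in this post.
log = [
    "1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe",
    "1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86",
    "1/29/2009 6:37:15 PM|Poem@Home|Started download of JParmJan97",
    "1/29/2009 6:37:28 PM|Poem@Home|Scheduler request completed: got 8 new tasks",
    "1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86",
]
print(unfinished_downloads(log))
```

On this excerpt it reports the minirosetta file and JParmJan97. The latter only appears because the list above is partial, so run it over the complete message log before drawing conclusions.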

As you can see it works just fine!! I do see that the mini-rosetta file has a ".exe" at the end while the poem file does not. Could that be the problem, no clue, seems it has worked for all other users.
Thanks for the ride it has been loads of fun but I am getting off for now. I will still come back and read and reply in the forums until my credits don't let me anymore.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59174 - Posted 30 Jan 2009 0:18:07 UTC
Last modified: 30 Jan 2009 0:18:27 UTC

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59177 - Posted 30 Jan 2009 2:43:17 UTC

Sometimes it is best to take a breather and come back later ...

At times the problems go away on their own for no apparent reason ... other times they can be found.

I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...

I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59180 - Posted 30 Jan 2009 10:01:06 UTC - in response to Message ID 59174.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So, please monitor the new release thread and give it another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
____________

mikey

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59181 - Posted 30 Jan 2009 10:05:27 UTC - in response to Message ID 59177.

Sometimes it is best to take a breather and come back later ...

At times the problems go away on their own for no apparent reason ... other times they can be found.

I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...

I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...


Oh I have 17 computers, I think, on line here at home right now. All are crunching for Boinc, plus I have 2 video cards doing the folding thing. I do ABC and Poem right now. But if you click on my name you will see I have crunched for a few projects and am not intending to stop anytime soon. In fact I have 2 new motherboard and dual core cpus to bring on line this weekend to replace 2 single core machines. I have already set them to no new work in preparation for the changeover.
____________

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59184 - Posted 30 Jan 2009 12:12:52 UTC - in response to Message ID 59161.

Mod.Sense


I detached yesterday and re-attached just now, so there is no way (for me) to see what applications were running. Also, my computers are too widely spread to start micromanagement. Anyway, I'll keep an eye on a couple of computers for a couple of days to see if they reattach successfully and if that problem is indeed solved.

I supposed it was a wrong batch (or application) and detach/reattach was the fastest way to have a full reset. If the problem shows up again, I'll let you know. If it doesn't... then thanks for the info!

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59207 - Posted 31 Jan 2009 19:28:54 UTC
Last modified: 31 Jan 2009 19:31:41 UTC

disregard, found answer in short running model thread

darengosse Jean-Paul

Joined: Jun 9 06
Posts: 18
ID: 93705
Credit: 259,459
RAC: 0
Message 59217 - Posted 1 Feb 2009 11:35:22 UTC

Hello all.
I do not understand why some people are having problems.
My desktop machine (Intel Core 2 CPU, Windows XP Home x86 SP2) has completed 27 WUs with v1.54 with 0 errors and an average CPU time of 2.8 hours. I am crossing my fingers that it continues this well...
I will only note that somewhere between 80% and 85% progress, the task jumps directly to 100%.
With this new version I also notice that the tasks generate more decoys and attempts (one example: 23 decoys from 23 attempts on one WU), and that the average amount of work per WU seems greater than with v1.47.
To finish (although this is not the right forum), I should mention that one of my computers has been out of service for 8 days because of bad sectors on the hard drive, and it is in for repair. There are 3 WUs on it that I have not been able to return before the deadline, and I think they will be lost.
It would be kind if someone could let the people in charge of the Rosetta project know about this problem.
Thank you very much in advance...
Best regards...
____________

koniiiik

Joined: Dec 25 08
Posts: 3
ID: 294317
Credit: 69,586
RAC: 0
Message 59218 - Posted 1 Feb 2009 12:26:59 UTC

The 1.54 version seems to be in conflict with the Linux ABI in FreeBSD.
One machine I'm running boinc on is a FreeBSD one, boinc downloads and runs the Linux binaries through the Linuxulator. Version 1.47 worked flawlessly, but the 1.54 version crashes randomly on SIGILL. http://boinc.bakerlab.org/rosetta/results.php?hostid=973136 shows only one successful task, which was run with Rosetta beta rather than minirosetta; all of the minirosetta tasks crashed sooner or later.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59224 - Posted 1 Feb 2009 21:13:19 UTC

1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_188268_0
3.5 hours for 1 decoy? kinda odd i think.

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 59225 - Posted 1 Feb 2009 22:18:23 UTC

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?

Gavin Shaw Profile

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 59228 - Posted 1 Feb 2009 23:07:15 UTC

Got this one a day or so ago. Not sure if it is a failure/error.

224812655

____________
Never surrender and never give up. In the darkest hour there is always hope.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 59233 - Posted 2 Feb 2009 4:19:31 UTC

Hi Mike.

I did see somewhere that you said something about large upload file size; I think this is one that got away. ;)

99 models in 4 hrs 26 min, and a result file of 8.32 MB.

_CAPRI17_T39_2_.sjf_br_one_docking.protocol__6483_19318_1.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=205403421

pete.

____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59236 - Posted 2 Feb 2009 5:30:33 UTC
Last modified: 2 Feb 2009 5:35:16 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.
____________
Rosetta Moderator: Mod.Sense

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 59237 - Posted 2 Feb 2009 6:00:42 UTC - in response to Message ID 59236.
Last modified: 2 Feb 2009 6:02:21 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.


Hi.

Just as well it did finish after 99; I would hate to see the file size after 12 or 24 hours! :) I just returned another one the same size.

pete.
____________


transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 59238 - Posted 2 Feb 2009 6:22:36 UTC - in response to Message ID 59225.

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?


I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
____________

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 59239 - Posted 2 Feb 2009 13:39:00 UTC - in response to Message ID 59238.

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?

I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.

Thanks for copying here - I thought it was just a problem with the validator (the error message being the clue). You're right, there's no "Done" section after the first model starts until the boinc_finish, which is odd, but no mention of the watchdog cutting in, even though it does run a long time. But on the 1.47 WU there are 3 models done, so I'm not entirely convinced it's the same thing.

Usually long-running jobs get a default credit of 80, don't they? Looks like I missed out all ways. Oh well...

falingtrea

Joined: Aug 8 07
Posts: 1
ID: 196401
Credit: 271,913
RAC: 239
Message 59240 - Posted 2 Feb 2009 16:25:44 UTC

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.
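
When comparing logs like the one above, it can help to split each line into its parts. The sketch below is in Python and assumes the pipe-separated layout seen in this post (timestamp, project, message) and a month/day/year date order; other BOINC versions or locales may format lines differently, and `parse_boinc_line` is a hypothetical helper, not part of BOINC.

```python
from datetime import datetime


def parse_boinc_line(line):
    """Split 'timestamp|project|message' into a (datetime, str, str) tuple."""
    stamp, project, message = line.split("|", 2)
    # Assumed format, e.g. "2/2/2009 10:06:03 AM" (month/day/year, 12-hour clock).
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


LOG = """\
2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down"""

# Print only the lines that carry a server-side message or the stated reason.
for raw in LOG.splitlines():
    when, project, message = parse_boinc_line(raw)
    if message.startswith(("Message from server:", "Reason:")):
        print(when.isoformat(), project, message)
```

In this log the request itself succeeded ("Scheduler RPC succeeded"); the error text came back from the server, which is why the client deferred for an hour.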

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59245 - Posted 2 Feb 2009 22:02:36 UTC

There is something odd going on with the graphics of lr5_D_score12_rlbd_2hsh_IGNORE_THE_REST_DECOY_6246_424_0: the plot disappears completely at times, and so does the accepted energy. Then they reappear. It all seems to depend on the energy value of the moment. As far as I know this is not normal.

TeAm Enterprise Profile
Avatar

Joined: Sep 28 05
Posts: 18
ID: 1546
Credit: 20,535,719
RAC: 2,662
Message 59249 - Posted 3 Feb 2009 3:44:56 UTC

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59253 - Posted 3 Feb 2009 10:11:54 UTC - in response to Message ID 59249.

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.



What version of BOINC Manager are you using?
It looked like you were using 5.10.45, which is quite old.

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59254 - Posted 3 Feb 2009 12:41:38 UTC - in response to Message ID 59249.

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
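
To illustrate why a bad factor starves the machine of work, here is a minimal Python sketch. It assumes the client's displayed estimate is roughly the raw estimate multiplied by the Duration Correction Factor, which is the usual description of how BOINC uses it but is not spelled out in this thread. The helper names are hypothetical. `reset_dcf` works on a text snippet only and rewrites every `<duration_correction_factor>` element it is given, so pass it just the section for the one project; stop BOINC and back up client_state.xml before touching the real file.

```python
import re

# Matches the element shown in the post above, whatever number it contains.
DCF_TAG = re.compile(
    r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
)


def estimated_hours(raw_estimate_hours, dcf):
    """Assumed relationship: displayed estimate = raw estimate * DCF."""
    return raw_estimate_hours * dcf


def reset_dcf(xml_text, new_value="1.000000"):
    """Return xml_text with every DCF value replaced by new_value."""
    return DCF_TAG.sub(lambda m: m.group(1) + new_value + m.group(2), xml_text)


# A task that really takes about 3 hours looks like 165 hours with a DCF of 55.
print(estimated_hours(3, 55))

snippet = "<duration_correction_factor>55.000000</duration_correction_factor>"
print(reset_dcf(snippet))
```

With an estimate inflated that much, a request for a few days of work is satisfied by a single WU, which matches the idle-core symptom described.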
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59256 - Posted 3 Feb 2009 12:49:09 UTC

Compute error, though it looks more like a zip error ...


process exited with code 1 (0x1, -255)

Watchdog active.
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Not sure what to make of this error ... happened on the Mac Pro ...

TeAm Enterprise Profile
Avatar

Joined: Sep 28 05
Posts: 18
ID: 1546
Credit: 20,535,719
RAC: 2,662
Message 59258 - Posted 3 Feb 2009 15:06:00 UTC - in response to Message ID 59253.

Using version 6.4.5 which I downloaded and installed about 6 days ago.


what version of boinc manager are you using?
it looked like you were using 5.10.45 which is quite old.

TeAm Enterprise Profile
Avatar

Joined: Sep 28 05
Posts: 18
ID: 1546
Credit: 20,535,719
RAC: 2,662
Message 59259 - Posted 3 Feb 2009 15:11:39 UTC - in response to Message ID 59254.

That fixed it! Thanks, my duration was set at 55+.


I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.

cenit Profile

Joined: Apr 1 07
Posts: 13
ID: 161706
Credit: 1,630,287
RAC: 0
Message 59261 - Posted 3 Feb 2009 17:12:07 UTC - in response to Message ID 59240.

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.


You have to wait and it will correct itself.
Maybe it has been a long time since your last Rosetta WU. During that time the project changed its web address, so BOINC needs to re-fetch the master file. Leave it alone and within 24 hours at most it will re-download it and resume working!

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59266 - Posted 3 Feb 2009 23:48:58 UTC - in response to Message ID 59259.

That fixed it! Thanks, my duration was set at 55+.


I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.


Yeah, for some reason this has happened A LOT lately.
____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 59272 - Posted 4 Feb 2009 6:22:55 UTC

That could be related to the BOINC version (6.4.5 and higher). The complaints about the RDCF being completely off usually come from people who have installed it. A not uncommon opinion is that version 6.4.5 was made the recommended version too hastily, in order to get the CUDA capabilities out.
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59273 - Posted 4 Feb 2009 9:08:55 UTC
Last modified: 4 Feb 2009 9:09:38 UTC

lr6_E_score12_rlbd_1e6i_IGNORE_THE_REST_DECOY_6254_236_1
ERROR:: Exit from: ..\..\src\protocols\checkpoint\CheckPointer.cc line: 87
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59274 - Posted 4 Feb 2009 9:11:24 UTC

lr5_D_hybrid_rlbd_1bmg_IGNORE_THE_REST_DECOY_6250_424_0
Initializing options.... ok
ERROR: Option file open failed for: relax_options_lr5_D_hybrid_mtyka

KC0ISW

Joined: Sep 28 05
Posts: 2
ID: 1538
Credit: 58,926
RAC: 0
Message 59278 - Posted 4 Feb 2009 12:43:32 UTC - in response to Message ID 59274.

http://boinc.bakerlab.org/rosetta/result.php?resultid=226103545
____________

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59331 - Posted 5 Feb 2009 13:40:19 UTC - in response to Message ID 59180.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will be a different name after all. So, please monitor the new release thread and give another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.
____________

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59355 - Posted 5 Feb 2009 19:00:02 UTC - in response to Message ID 59331.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will be a different name after all. So, please monitor the new release thread and give another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.


Guess what...............NO PROBLEM, it is crunching just fine. Here is the pc: http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1001897
It is on its first unit, so no results yet, but one unit is crunching just fine so far!!
____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59363 - Posted 5 Feb 2009 20:19:02 UTC

lr5_D_hybrid_rlbd_1e6i_IGNORE_THE_REST_DECOY_6250_347_0 died at 3 hrs out of 4 and also kicked up a dialogue box on my desktop.

The error: exit code -1073740791 (0xc0000409)
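
The two numbers in that error are the same 32-bit value written two ways: once as a signed decimal and once as unsigned hex. A short Python sketch of the conversion follows; the helper names are invented for illustration. In Windows' NTSTATUS list 0xC0000409 is STATUS_STACK_BUFFER_OVERRUN, though nothing in this thread establishes what actually triggered it here.

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(to_unsigned32(-1073740791)))  # 0xc0000409
print(to_signed32(0xC0000409))          # -1073740791
```

Searching for the hex form is usually more productive than searching for the negative decimal.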

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59365 - Posted 5 Feb 2009 21:33:33 UTC - in response to Message ID 59355.

mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will be a different name after all. So, please monitor the new release thread and give another try at that time.


I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help


I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.


Guess what...............NO PROBLEM, it is crunching just fine. Here is the pc: http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1001897
It is on its first unit, so no results yet, but one unit is crunching just fine so far!!


I just thought of something... I wonder if changing the setting "Skip image file verification?" to yes would have let my Windows PCs download the file? Hmmmmm
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59367 - Posted 5 Feb 2009 21:48:57 UTC

The image verification can't occur until the download completes. So, that's not what's causing the download problem.
____________
Rosetta Moderator: Mod.Sense

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 253,062
RAC: 0
Message 59371 - Posted 6 Feb 2009 1:21:48 UTC

Mod.Sense asked me to post my results here. A little history: I've been getting compute errors for every Minirosetta WU I try to crunch; they usually crash and burn within the first 60 seconds or so. I am running a Q6600 with everything at stock speeds, but I was throttling my processor to use only 3 of 4 cores, so it was suggested that I let all 4 cores run unthrottled. Here's what happened:

I changed it to "On multiprocessor systems, use at most 100% of the processors" so that it would run completely unthrottled and use all 4 cores. I let it download minirosetta WUs; it got 5 of them and all failed, after 0:33, 1:39, 0:56, 0:38, and the last one at 0:51, which crashed with a Vista popup saying "minirosetta_1.54_windows_x86_64.exe has stopped working".

So it didn't seem to help. I don't know what else to try, but I'm a little ashamed of all the compute errors when you look at my results page, so I think I may have to give up on minirosetta and just stick to Beta WUs; they seem to work great when I'm not messing around with the BOINC client.

I think it may have something to do with Vista 64, because I have an E8500 running Vista 64 and they fail on there too. But the E8500 is throttled to 1 core and is OC'ed from 3.16 GHz to 3.8 GHz (I've been told OC'ing will affect minirosetta), and the E8500 is my gaming rig, so I don't mind if it doesn't crunch WUs because it's crunching games! :)

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59374 - Posted 6 Feb 2009 2:58:56 UTC

And epcorian is not overclocked. Running BOINC version 6.4.5

They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.

Is it possible you've got something like an antivirus application that's conflicting on Vista?

The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here
____________
Rosetta Moderator: Mod.Sense

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 253,062
RAC: 0
Message 59376 - Posted 6 Feb 2009 3:29:44 UTC

That's right, the Q6600 isn't overclocked. The system contains an Intel DQ35JO MB, Q6600 processor, 4 GB (2x2 GB) Kingston Value RAM, Corsair HX-520W PS, 36 GB WD Raptor HD, 2x750 GB WD HDs in RAID 1, and a Zalman HSF, running Vista 64 SP1 with no external video card. I use it as a home file and print server and, recently, a BOINC cruncher, as I leave it on 24/7. No issues with Beta WUs or SETI.

I do have NOD32 installed on there, but I tried disabling it (I haven't gone as far as uninstalling it) and they would still fail.

Maybe I should try an older version of the BOINC client, I will give it a go this weekend and post back.

Thanks!

NewtonianRefractor

Joined: Sep 29 08
Posts: 19
ID: 281324
Credit: 2,350,860
RAC: 0
Message 59382 - Posted 6 Feb 2009 9:38:53 UTC
Last modified: 6 Feb 2009 9:45:30 UTC

Can someone please explain what happened here?

Here is another one.

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59384 - Posted 6 Feb 2009 12:28:02 UTC - in response to Message ID 59367.

The image verification can't occur until the download completes. So, that's not what's causing the download problem.


DARN, I was hoping that would solve my problem, oh well. Thanks!
____________

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59385 - Posted 6 Feb 2009 12:33:32 UTC - in response to Message ID 59374.

And epcorian is not overclocked. Running BOINC version 6.4.5

They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.

Is it possible you've got something like an antivirus application that's conflicting on Vista?

The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here


He is running a 64-bit OS though. I read on one of the projects that you need to do something to make 32-bit units work on a 64-bit system; is that true with Rosetta units too? That is NOT true for all projects, and I do not remember where I read it.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59391 - Posted 6 Feb 2009 14:26:53 UTC
Last modified: 6 Feb 2009 14:29:47 UTC

Moved NewtonianRefractor's post here. They report validation errors on tasks that had a visit from the watchdog. The tasks ended at target runtime plus 4 hrs, but show with validation errors.
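
The "target runtime plus 4 hrs" behaviour can be written down as a simple rule. This Python sketch is inferred only from what is described in this thread (a 3-hour preference ending at 7 hours, and the `cpu_run_time_pref` value printed in task logs); the constant and function names are hypothetical and this is not the actual minirosetta watchdog code.

```python
# Assumed grace period, taken from "target runtime plus 4 hrs" in the post above.
WATCHDOG_GRACE_S = 4 * 3600


def watchdog_should_end(cpu_time_s, cpu_run_time_pref_s, grace_s=WATCHDOG_GRACE_S):
    """Return True once a task has run past its preferred runtime plus the grace period."""
    return cpu_time_s >= cpu_run_time_pref_s + grace_s


# A 3-hour preference would be cut off at 7 hours of CPU time.
print(watchdog_should_end(7 * 3600, 3 * 3600))   # True
# With cpu_run_time_pref 14400 (4 hours), 6 hours is still within the limit.
print(watchdog_should_end(6 * 3600, 14400))      # False
```

A task that ends exactly at preference plus four hours with no finished models is therefore a likely watchdog exit, even when the log does not say so explicitly.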
____________
Rosetta Moderator: Mod.Sense

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59394 - Posted 6 Feb 2009 16:32:31 UTC - in response to Message ID 59161.

rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.


Same problem again on at least one of my computers. This time I have more details:
Application: Rosetta Mini 1.54
Task name: lr6_E_score12_rlbd_1ail_IGNORE_THE_REST_DECOY_6254_459_0

Total runtime before manual cancellation: 72:21:22
Total Progress: 0%
Time to go: 6:42:30 (as usual on my computers)

Any comments/ideas?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59395 - Posted 6 Feb 2009 17:10:50 UTC

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 59406 - Posted 7 Feb 2009 1:36:32 UTC

Similar error to that reported by Paul Buck

Task: 226615095
Workunit: 206537670
Name: loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t326__IGNORE_THE_REST_1R9GA_7_6642_18_0

Mac OSX 10.4.11

<core_client_version>6.2.18</core_client_version>
<![CDATA[

*** Probably irrelevant stuff deleted

End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/loopbuild_ref_tex_cst.loopbuild_tex_cst.t326_.tex.boinc_files.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/loopbuild_ref_tex_cst.loopbuild_tex_cst.t326_.tex.boinc_files.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>


____________

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 253,062
RAC: 0
Message 59409 - Posted 7 Feb 2009 3:40:46 UTC

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59411 - Posted 7 Feb 2009 8:39:55 UTC - in response to Message ID 59395.

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?

- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though.

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at 7+ hours of runtime and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only thing "special" about this computer is that it doesn't have 24/7 internet access.

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59412 - Posted 7 Feb 2009 12:17:19 UTC - in response to Message ID 59409.

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.


ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
____________

[07p] PcLis Profile
Avatar

Joined: Dec 16 07
Posts: 3
ID: 227480
Credit: 62,843
RAC: 0
Message 59416 - Posted 7 Feb 2009 13:56:34 UTC

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a compute error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.

The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project offers no way to select subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and there is no point in throwing away wasted computing hours. I hope this problem is resolved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 107,923
RAC: 0
Message 59418 - Posted 7 Feb 2009 14:14:01 UTC

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59419 - Posted 7 Feb 2009 15:04:48 UTC - in response to Message ID 59172.

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.


How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.
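
The hypothesis in the post above (BOINC's disk allowance split equally among projects, further capped by a percentage of free space) can be put into numbers. This Python sketch only models that hypothesis; the thread does not confirm it, BOINC's real allocation may differ, and the function name and the 100 GB free-space figure are made up for illustration.

```python
def per_project_share_gb(boinc_limit_gb, free_gb, max_free_fraction, n_projects):
    """Equal split of whichever limit is tighter: the fixed cap or the free-space cap."""
    usable_gb = min(boinc_limit_gb, free_gb * max_free_fraction)
    return usable_gb / n_projects


# 30 GB shared among 8 projects, with a hypothetical 100 GB free and a 50% cap.
print(per_project_share_gb(30, 100, 0.5, 8))   # 3.75
# The same settings on a nearly full disk: the free-space cap wins.
print(per_project_share_gb(30, 20, 0.5, 8))    # 1.25
```

If the equal-split idea holds, a project with large work files could hit its share long before the overall BOINC limit is reached.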

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59420 - Posted 7 Feb 2009 15:25:32 UTC

I recently had a 1.54 workunit with a validate error for no reason I could spot in the Task ID details file. A wingman got a Success, but apparently with a much shorter preferred workunit length than the 14 hours I request.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=204095976

Could you check for problems in parts of the workunit the wingman probably never reached?

mikey
Avatar

Joined: Jan 5 06
Posts: 1445
ID: 47185
Credit: 3,503,433
RAC: 0
Message 59421 - Posted 7 Feb 2009 16:24:33 UTC - in response to Message ID 59419.
Last modified: 7 Feb 2009 16:27:55 UTC

mikey, have you tried a different version of BOINC?


Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!



Just a wild shot. ..

How is your disk space?

How about BOINC settings for disk space? Are you at BOINC's limit?


No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.


How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.


I only have one project per PC, but I will add a second if the first is having workunit issues. All machines have at least a 20 gig hard drive, but most have a 100 gig or bigger hard drive. The one above is a laptop with a 50 gig hard drive with almost 30 gig free. I have BOINC set up to use no more than 50% of the free hard drive space and don't have any issues with space.
____________

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 253,062
RAC: 0
Message 59428 - Posted 7 Feb 2009 20:06:18 UTC - in response to Message ID 59412.

So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.


ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.


I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8: better, but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some minis, and they did just fine.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59437 - Posted 8 Feb 2009 2:28:40 UTC - in response to Message ID 59418.
Last modified: 8 Feb 2009 2:29:01 UTC

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks


I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59439 - Posted 8 Feb 2009 2:43:27 UTC - in response to Message ID 59416.

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.

The thing is that Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. The pity is that this project offers no way to select subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and there is no point in throwing away hours of computation. I hope this problem is solved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan


Hola Juan,

I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of Mini.

Looking at his 2 failed tasks, they both have Exit status -226 and the "Can't acquire lockfile" error.

He is running Win Vista x86.

I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of Mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best you can using a tool like this: http://dictionary.reference.com/translate/text.html
____________
Rosetta Moderator: Mod.Sense

Fishead

Joined: Sep 3 08
Posts: 7
ID: 276548
Credit: 89,566
RAC: 0
Message 59443 - Posted 8 Feb 2009 6:45:05 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206610287
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206617445
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=206618707
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=204395981

According to the graphics screen of these four WUs, every "accepted" step becomes the new low energy state. No matter if the energy value is smaller or higher...

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59447 - Posted 8 Feb 2009 10:12:14 UTC

*I* cured the lock file problem by running with 100% time ... if he has opted to run at some lower percentage of CPU time this may be the issue. Something else to try ... and if it works we can report another success ... this is one of the issues that we have been trying to pin down in RALPH...

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 59462 - Posted 8 Feb 2009 16:37:47 UTC

I have aborted the following loopbuilds:

226468615
226473496

They both were going on a slow boat to nowhere with an accepted energy of 1.#INF
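
The "QNAN" and "1.#INF" values reported in this thread look like the way older Microsoft C runtimes print non-finite floating-point numbers. Below is a minimal sketch for spotting such values, assuming the energy arrives as text. The function name and the list of spellings are illustrative assumptions, not anything taken from minirosetta:

```python
import math

# Spellings of non-finite values seen in this thread plus common C-runtime
# variants. This list is an assumption, not taken from minirosetta.
NON_FINITE_TOKENS = ("#INF", "#IND", "#QNAN", "#SNAN", "QNAN", "NAN", "INF")

def is_bad_energy(text):
    """Return True if an energy printed as text is NaN, infinite, or unparseable."""
    token = text.strip().upper()
    if any(marker in token for marker in NON_FINITE_TOKENS):
        return True
    try:
        return not math.isfinite(float(token))
    except ValueError:
        # Output that cannot be parsed as a number is treated as suspect too.
        return True

for sample in ("1.#INF", "QNAN", "-123.45"):
    print(sample, is_bad_energy(sample))
```

A task whose accepted energy trips this check is unlikely to recover, which fits the "slow boat to nowhere" description above.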


____________

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 107,923
RAC: 0
Message 59465 - Posted 8 Feb 2009 18:47:55 UTC - in response to Message ID 59437.

Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended because it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Runtime is set to 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.

Should I let it try to finish?

Thanks


I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".


OK, I set the runtime to 8 hours so the watchdog would cut it off at 24 hours. It has now uploaded and reported. I have dump files as well, if somebody on the team is interested. (Captured at the reported time and step.)
And I see I was not alone... :-(

Arkadiusz Dykiel

Joined: Aug 13 06
Posts: 3
ID: 104623
Credit: 6,930,077
RAC: 9,498
Message 59469 - Posted 8 Feb 2009 20:24:44 UTC

Hi,

The work units exit with status code 193 (0xc1).
Rosetta 5.98 and other projects work OK.

Am I missing something? Some library?

Full error report below:

Server state Over
Outcome Client error
Client state Compute error
Exit status 193 (0xc1)
CPU time 0

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 2- 8 1:29: 8:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
*** glibc detected *** corrupted double-linked list: 0x093544cc ***
SIGABRT: abort called
Stack trace (15 frames):
[0x8f88f07]
[0x8fb3778]
[0xb7fff420]
[0x9016944]
[0x902c693]
[0x90310d2]
[0x9031c84]
[0x903353d]
[0x9000ec7]
[0x81bed6d]
[0x81bee1d]
[0x8195f15]
[0x8048e93]
[0x900f84c]
[0x8048111]

Exiting...

</stderr_txt>
]]>
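
The three numbers in "process exited with code 193 (0xc1, -63)" are the same value shown three ways: decimal, hexadecimal, and a negative form that is consistent with reading the low byte as a signed 8-bit number. A small sketch of that arithmetic (the helper name is made up for illustration, and the signed-byte reading is an assumption about how the client formats the message):

```python
def exit_code_views(code):
    """Return (decimal, hex string, signed 8-bit reading) of an exit status."""
    low_byte = code & 0xFF
    # Values 128..255 wrap around to negative numbers in a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return code, hex(code), signed

print(exit_code_views(193))  # (193, '0xc1', -63)
```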
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59473 - Posted 8 Feb 2009 23:12:20 UTC

As of v1.54, the watchdog kicks in at runtime pref. plus 4 hours. So, no longer 3 times runtime preference.
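
To make the difference concrete, here is a rough sketch of the two cutoff rules as described in this thread (3 times the runtime preference before v1.54, the preference plus 4 hours from v1.54 on). The function is illustrative only and is not minirosetta's actual watchdog code:

```python
def watchdog_cutoff_hours(runtime_pref_hours, version_154_or_later=True):
    """Hours of runtime after which the watchdog ends a task."""
    if version_154_or_later:
        return runtime_pref_hours + 4   # rule stated for v1.54
    return runtime_pref_hours * 3       # older rule mentioned above

for pref in (4, 8, 24):
    old_rule = watchdog_cutoff_hours(pref, version_154_or_later=False)
    new_rule = watchdog_cutoff_hours(pref)
    print(pref, old_rule, new_rule)
```

With an 8 hour preference the old rule gives 24 hours and the new one gives 12, so lowering the preference to force an early cutoff works even sooner under v1.54.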
____________
Rosetta Moderator: Mod.Sense

Andreas

Joined: Sep 22 08
Posts: 1
ID: 280173
Credit: 39,402
RAC: 0
Message 59493 - Posted 9 Feb 2009 22:07:37 UTC - in response to Message ID 59086.

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...


I, too, was plagued by frequent R@H lock file problems. Setting CPU to 100% seems to have cured that.
And, as I have a quad-core CPU, I can limit BOINC usage by setting "On Multiprocessor Systems, use at most 51% of all processors". (If I run BOINC at 100% on all cores, my system gets too hot - more precisely, my fan gets too loud)
-- Andreas
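
As a back-of-the-envelope check on the "use at most N% of all processors" preference, here is a minimal sketch. It assumes the client rounds the product down and never goes below one core. That rounding rule is an assumption made for this example and is not stated anywhere in the thread:

```python
def cores_used(total_cores, max_percent):
    """Estimate how many cores a percentage limit allows (assumed: floor, minimum 1)."""
    return max(1, int(total_cores * max_percent / 100))

print(cores_used(4, 51))   # quad-core limited to 51%
print(cores_used(4, 100))  # no limit
```

Under that assumption, 51% on a quad-core leaves two cores for BOINC while each running task still gets 100% CPU time.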

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 59494 - Posted 9 Feb 2009 23:02:10 UTC

problems with this one:
227327540

heartbeat error messages

</stderr_txt>
<message>
<file_xfer_error>
<file_name>abinitio_norelax_homfrag_natfrag_129_B_1o7uA_SAVE_ALL_OUT_6252_5178_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>




____________

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59502 - Posted 10 Feb 2009 14:44:11 UTC - in response to Message ID 59439.
Last modified: 10 Feb 2009 14:52:25 UTC

Hello,

First of all, apologies for writing in Spanish, but my English is not good enough.

Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.

The thing is that Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. The pity is that this project offers no way to select subprojects, as many others do.

I would like to keep crunching for this project, but there is no way, and there is no point in throwing away hours of computation. I hope this problem is solved soon. For my part, I will keep trying from time to time.

Warm regards to all,

Juan


Hola Juan,

I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of Mini.

Looking at his 2 failed tasks, they both have Exit status -226 and the "Can't acquire lockfile" error.

He is running Win Vista x86.

I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of Mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best you can using a tool like this: http://dictionary.reference.com/translate/text.html


I never learned enough Spanish to do such a translation myself, so I tried asking that web site to translate all of your reply at once to Spanish, in preparation for writing an answer in English and doing the same to it. It appeared that the translation succeeded, but enough of it was hidden by advertisements that it was unusable.

Anyone know another automatic translation site that doesn't have this problem?

I've been trying to trigger that problem over on RALPH@home by setting my CPU time less than 100% and unable to actually get it less than 100%, so you might want to consider this: For anyone having this problem repeatedly, give them 1.54 workunits with extra debugging output enabled. Then have someone on the RALPH@home staff analyze the results and give them credits according to the RALPH@home standards instead of the Rosetta@home standards.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59505 - Posted 10 Feb 2009 18:23:04 UTC
Last modified: 10 Feb 2009 18:32:00 UTC

http://www.babelfish.yahoo.com translates it as:

Hello, First of all, excuses to write in Castilian, but my English is insufficient. From August of 2008 me 99% of the tasks of Mini Rosetta with computational error are finalizing. After a time I decided not to continue processing in this project. Even so, sometimes I return to try it, but everything follows equal: even with the new versions of Mini Rosetta, including this last one. The case is that the tasks of Rosetta Beta do not fail to me, but of that one sends very few proporcinalmente to me. The pain is that in this project the possibility of selecting sub-projects, does not exist there is as if it in other many. I would like to continue processing for this project, but there is no way, and it is not question to throw low-achieving hours of computation. I hope that this problem is solved soon. As for me I will continue trying from time to time. A coridal greeting for all, Juan

he has 4 tasks running and 2 of them failed

abinitio_norelax_homfrag_natfrag_129_B_1tit__SAVE_ALL_OUT_6252_2628_0
he got a lockfile failure on this one and it ran only CPU time 683.9708

and

loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t363__IGNORE_THE_REST_1WWTA_12_6651_14_0
this got lockfile as well it ran for CPU time 2155.325

Of the other 2, one has completed and one is still in progress.

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59518 - Posted 11 Feb 2009 16:45:08 UTC - in response to Message ID 59395.

Mod.Sense

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?


- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at 7+ hours and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only thing "special" about this computer is that it doesn't have 24/7 internet access.

No solution as yet?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59520 - Posted 11 Feb 2009 19:18:25 UTC - in response to Message ID 59518.

Mod.Sense

Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?


- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though

Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at 7+ hours and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform Microsoft of the problem". The only thing "special" about this computer is that it doesn't have 24/7 internet access.

No solution as yet?


I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?

Odd, the failed task with some time on it shows that your core client version is 6.2.14, but your BOINC Windows Runtime Debugger Version is 6.5.0. Not sure how that would happen.

____________
Rosetta Moderator: Mod.Sense

Verrie Pearce

Joined: Dec 2 05
Posts: 3
ID: 27415
Credit: 90,299
RAC: 0
Message 59524 - Posted 12 Feb 2009 3:13:06 UTC - in response to Message ID 59045.

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark (_rlbd_) jobs - caused by a buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many, many potential problems will now show up as meaningful error messages and not random segmentation faults.



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. In particular I'd like to know about


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike



____________

Verrie Pearce

Joined: Dec 2 05
Posts: 3
ID: 27415
Credit: 90,299
RAC: 0
Message 59525 - Posted 12 Feb 2009 3:14:52 UTC

I have reached the end. Since your new patch, nothing works from your project. I keep resetting and still I get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.
____________

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 59526 - Posted 12 Feb 2009 4:08:05 UTC - in response to Message ID 59525.

I have reached the end. Since your new patch, nothing works from your project. I keep resetting and still I get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.

Urgh - bad news :(

I notice you're using BOINC 6.2.19 with Vista64. Can you give it one last try and upgrade to 6.4.5? I had similar problems to you (nowhere near as bad) using Vista64, and these problems have disappeared for me after upgrading. It might make all the difference for you too.
____________

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 59527 - Posted 12 Feb 2009 6:08:38 UTC

Do you 'overclock' your PC? In that case lowering the overclock might help.
____________

Markus

Joined: Feb 21 08
Posts: 1
ID: 243327
Credit: 28,072
RAC: 0
Message 59528 - Posted 12 Feb 2009 8:22:53 UTC

Good morning!

I reinstalled my complete system a few days ago and restarted crunching rosetta@home again. Unfortunately I got some errors.

Here is what i got

12.02.2009 05:37:59|rosetta@home|Restarting task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 using minirosetta version 154
12.02.2009 05:38:00|rosetta@home|Task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 exited with zero status but no 'finished' file
12.02.2009 05:38:00|rosetta@home|If this happens repeatedly you may need to reset the project.

As a result, two workunits aborted with a computation error. Maybe it is just an error on my system, but I wanted to post it.

Greetings

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59530 - Posted 12 Feb 2009 14:02:04 UTC - in response to Message ID 59520.

Mod.Sense

I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?


In the meantime I have set that computer on NNT, and changed the preferred runtime. I will reactivate that computer, and evaluate Saturday or after the weekend. You'll be informed :)

BrnmccO1

Joined: Jun 26 07
Posts: 17
ID: 186323
Credit: 578,825
RAC: 0
Message 59532 - Posted 12 Feb 2009 21:23:42 UTC

Very good so far, zero error results on all machines for a long time. This 1.54 is much better than the previous versions, much more stable, etc. Keep up the good work stamping out the bugs.

It's been a long time since I've reviewed the results on all my crunchers and found no compute errors. If things keep going the way they are, we might break 100 Tflops yet!
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 59560 - Posted 14 Feb 2009 17:10:02 UTC

Workunit 205979363
Task 228619747
Name loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t332__IGNORE_THE_REST_2FLIA_6_6646_10_1
Mac OS X 10.4.11

This failed after 216 seconds : tail of stderr below

Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.
interpolate rotamers bin out of range: ARG 1.43667e-05 nan nan nan nan nan
81 81 19 20 2147483649 22 1.43667e-06 nan
ERROR:: Exit from: src/core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

____________

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59596 - Posted 16 Feb 2009 3:03:25 UTC
Last modified: 16 Feb 2009 3:05:52 UTC

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 59601 - Posted 16 Feb 2009 10:15:59 UTC - in response to Message ID 59596.

Hello, I have some problems with Minirosetta 1.54
validate error (about 25,000 seconds of runtime each)

1
2
3

client error

1
2

I got a couple of validate errors too:
Task 228125280
Task 228133134
There's nothing more frustrating than completing a job ok only for it to go wrong when uploaded.

I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.

epcorian

Joined: Jan 1 09
Posts: 16
ID: 295242
Credit: 253,062
RAC: 0
Message 59610 - Posted 16 Feb 2009 16:39:56 UTC - in response to Message ID 59428.
Last modified: 16 Feb 2009 16:42:55 UTC

I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8, better but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some Minis and they did just fine.


So this weekend I installed a fresh copy of XP x64, upgraded it to SP2, installed my x64 version of NOD32 antivirus, and told BOINC to "...use at most 75% of the processors", meaning 3 of 4 cores on my Q6600, and it's crunching Minis and Betas without a problem! 1 successful Beta, 5 successful Minis with 4 more coming down the pipe. So it looks like Mini does not like Vista x64, and on my adventures on Google it turns out that XP x64 is actually based on the Server 2003 code tree while Vista is based on crap. :)

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59614 - Posted 16 Feb 2009 18:41:30 UTC

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59615 - Posted 16 Feb 2009 18:45:12 UTC - in response to Message ID 59614.

Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...

Validate Error
Exception

Does the system have an issue with too many decoys? The reissue has not returned ...


If I remember correctly, they have created a 99 model stop line to keep the tasks from running forever.

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59617 - Posted 16 Feb 2009 19:25:37 UTC
Last modified: 16 Feb 2009 19:27:33 UTC

Yeah, the 99 stop limit was to avoid a problem with the file size that is zipped up and uploaded. However, I was just wondering if there is now a new companion problem that the validator does not properly handle those results... or, the result itself is somehow bad...

In that I have gone back to the 3rd of Feb and have at least a hundred (220) results with only three errors this is a puzzlement ...

{edit}
added number ..

Also I note that The runtime is only 145 seconds ... so that was fast work ... :)

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,478
RAC: 0
Message 59625 - Posted 17 Feb 2009 2:22:04 UTC

I started running Rosetta this morning on a 64bit Vista machine and all seems to be working well. It's been working well on other projects too. Here is what I'm running:

Core i7 920 CPU
Asus P6T6 WS Revolution motherboard
6Gb DDR3 Triple Channel RAM
Vista Home Premium SP1 64bit

64bit BOINC 6.6.7

As I said, no problems yet and a number of WU's have completed already.


____________

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,478
RAC: 0
Message 59626 - Posted 17 Feb 2009 3:14:15 UTC

Ok, after a number of successful completions, I did see one that looks like it failed. Message as follows:

2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished
2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent


Don't know the cause of that one...

____________

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59627 - Posted 17 Feb 2009 6:35:01 UTC

Well, a couple hundred tasks and several with the same error, multiple systems (3 different), based on Xeon, Q9300, and i7 processors, various amounts of available RAM, though in common all are running Win XP Pro 32-Bit:

228932012
229013783
229066094
229072515

The error:

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59631 - Posted 17 Feb 2009 12:16:07 UTC - in response to Message ID 59601.


I notice yours are a bit different though.
The first ones just include the line:
hbond tripped


The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.


Hey, you're right, all my errors are with Hbond tripped in stderr, so I think that it's a source of problems

Pharrg

Joined: Jul 10 06
Posts: 10
ID: 99406
Credit: 6,478
RAC: 0
Message 59632 - Posted 17 Feb 2009 15:42:53 UTC
Last modified: 17 Feb 2009 15:45:01 UTC

So... I completed a bunch more tasks successfully, then got a 2nd task where it said the output file was missing. Anyone else getting these?

2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished
2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent

I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?
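
One quick way to test that hunch is to group failed task names by their batch prefix and count them. A minimal sketch, assuming the last two underscore-separated numbers in a task name are per-task counters and everything before them identifies the batch. That naming rule is a guess based on the names in this thread, and the helper names are made up:

```python
import re
from collections import Counter

def batch_prefix(task_name):
    """Drop the last two numeric fields (e.g. _4677_1) from a task name."""
    return re.sub(r"_\d+_\d+$", "", task_name)

def count_failures_by_batch(failed_tasks):
    """Count failed tasks per batch prefix."""
    return Counter(batch_prefix(name) for name in failed_tasks)

# Failed task names copied from posts in this thread.
failed = [
    "ss-neg-1i17__7365_4677_1",
    "ss-neg-1i17__7365_5964_0",
    "cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0",
]

for prefix, count in count_failures_by_batch(failed).most_common():
    print(count, prefix)
```

If one prefix dominates the list across different machines, the batch itself is the more likely culprit than any single host.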
____________

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 59633 - Posted 17 Feb 2009 17:14:07 UTC - in response to Message ID 59632.
Last modified: 17 Feb 2009 17:15:24 UTC


I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:

ss-neg-1i17__7365_

perhaps a bug in that one?


I had one of those fail too. Firewall blocked it from reporting the symbol tables :(
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59634 - Posted 17 Feb 2009 17:25:15 UTC

Looks like Pharrg actually had three of these fail

ss-neg-1i17__7365_5964_0
ss-neg-1i17__7365_5190_1 (wingman failed too)
ss-neg-1i17__7365_4677_1 (wingman failed too)

____________
Rosetta Moderator: Mod.Sense

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 59635 - Posted 17 Feb 2009 17:40:09 UTC

I had two more similar tasks on my machines, so I suspended others to try and run them.

I've got an ss-neg-1je9 that seems normal so far. But my other ss-neg-1i17 doesn't seem able to display graphics. Black window, no pane lines, on WinXP.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 59637 - Posted 17 Feb 2009 18:44:34 UTC
Last modified: 17 Feb 2009 18:45:25 UTC

Yep, my next ss-neg-1i17 failed too.

As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 59638 - Posted 17 Feb 2009 21:39:56 UTC

2 ss-neg tasks died on me as well. I have a 3rd in progress, at 50% complete so far.

Here are the failures:

ss-neg-1i17__7365_1743_0

ss-neg-1i17__7365_542_1

They both do the following:

initialization is ok, but then when it is about to start it errors out:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
----------

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 59640 - Posted 17 Feb 2009 23:35:02 UTC
Last modified: 17 Feb 2009 23:35:45 UTC

Ditto:

ss-neg-1i17__7365_5466_0
ss-neg-1i17__7365_1656_0
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 59641 - Posted 18 Feb 2009 0:53:03 UTC

A couple of these ss-neg-1i17* workunits are failing on Mac OS X 10.4.11

Workunit 208810096, Task 229094592, Name ss-neg-1i17__7365_4132_0

and

Workunit 208854507, Task 229142269, Name ss-neg-1i17__7365_4742_0

They're both failing in the same routine: here's the crash info from the first one

Thread 0 Crashed:
0 ...etta_1.54_i686-apple-darwin 0x001b13b7 __ZN4core10kinematics10build_treeERKNS0_8FoldTreeERKN7utility7vector1INS4_7pointer10access_ptrIKNS_12conformation7ResidueEEESaISB_EEERNS_2id10AtomID_MapINS6_10owning_ptrINS0_4tree4AtomEEEEE + 235
1 ...etta_1.54_i686-apple-darwin 0x00027735 __ZN4core12conformation12Conformation15setup_atom_treeEv + 109
2 ...etta_1.54_i686-apple-darwin 0x0002a378 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 2910
3 ...etta_1.54_i686-apple-darwin 0x00400e64 __ZN4core2io13serialization11read_binaryERNS_4pose4PoseERNS1_6BUFFERE + 516
4 ...etta_1.54_i686-apple-darwin 0x00107b23 __ZN9protocols5boinc5Boinc18worker_is_finishedERKi + 913
5 ...etta_1.54_i686-apple-darwin 0x00c8d172 __ZN9protocols7jobdist18BaseJobDistributorIN7utility7pointer10owning_ptrINS0_8BasicJobEEEE8next_jobERS6_Ri + 2102
6 ...etta_1.54_i686-apple-darwin 0x001177a5 __ZN9protocols8abinitio18AbrelaxApplication4foldERN4core4pose4PoseEN7utility7pointer10owning_ptrINS_8ProtocolEEE + 1449
7 ...etta_1.54_i686-apple-darwin 0x001289ad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 807
8 ...etta_1.54_i686-apple-darwin 0x000039cc _main + 1356
9 ...etta_1.54_i686-apple-darwin 0x00001dee __start + 216
10 ...etta_1.54_i686-apple-darwin 0x00001d15 start + 41


____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 59645 - Posted 18 Feb 2009 4:37:41 UTC

I've had three ss-neg-1i17__7365 WUs fail with segmentation violations on three different linux machines:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229167706
http://boinc.bakerlab.org/rosetta/result.php?resultid=229161990
http://boinc.bakerlab.org/rosetta/result.php?resultid=229084435

(I notice that only the third number is different in the stack traces of the above three WUs.)

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59647 - Posted 18 Feb 2009 9:16:58 UTC

A workunit with some odd behavior, but no definite error:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=209046400

A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.

It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.

Is this something normal that just happened at an unusual time, or something more significant?

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59649 - Posted 18 Feb 2009 10:57:15 UTC - in response to Message ID 59520.

Mod.Sense

What is it showing for the estimated runtime, before the task starts?


There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%

I think my settings before were asking for about 6 hours of runtime, and now 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT for Rosetta, but I'm willing to wait a bit longer.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59650 - Posted 18 Feb 2009 13:14:18 UTC

Three more errors ... this time two I have not seen before:

229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader

229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000

229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz

So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...

Keith T.
Avatar

Joined: Mar 1 07
Posts: 37
ID: 150379
Credit: 12,959
RAC: 0
Message 59651 - Posted 18 Feb 2009 13:30:29 UTC

At least 3 of my recent tasks have resulted in Validate errors.

http://boinc.bakerlab.org/rosetta/result.php?resultid=227721905
http://boinc.bakerlab.org/rosetta/result.php?resultid=227934901
http://boinc.bakerlab.org/rosetta/result.php?resultid=227919237

Please could someone in authority explain why there have been so many of these recently.

I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.

Keith

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59655 - Posted 18 Feb 2009 14:47:25 UTC

rembertw, the maximum runtime preference possible is 24 hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28 hrs. So, if you could, let it run at least 29 hrs, and if it is still running at that point, abort it.
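
As a rough illustration of the arithmetic in the paragraph above, here is a minimal Python sketch. It is not the actual minirosetta watchdog code: the function names are hypothetical, and the 4-hour grace period is only an assumption inferred from the 24 hrs and 28 hrs figures quoted here.

```python
# Hypothetical sketch of the timing described above, not the real watchdog.
GRACE_HOURS = 4  # assumption: 28 hrs cutoff minus the 24 hrs maximum preference


def watchdog_cutoff_hours(runtime_preference_hours):
    """Elapsed time after which the watchdog would be expected to end a task."""
    return runtime_preference_hours + GRACE_HOURS


def should_abort_manually(elapsed_hours, runtime_preference_hours, margin_hours=1):
    """True once a task has outlived the expected cutoff by a safety margin."""
    cutoff = watchdog_cutoff_hours(runtime_preference_hours)
    return elapsed_hours >= cutoff + margin_hours
```

With a 24-hour preference this gives a 28-hour cutoff, so a task still running at 29 hours would be a candidate for a manual abort.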

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? Antivirus software? Windows service pack? Age of machine? BOINC version?
____________
Rosetta Moderator: Mod.Sense

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59657 - Posted 18 Feb 2009 15:01:59 UTC

Another hbond tripped

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 59658 - Posted 18 Feb 2009 18:57:12 UTC

About 12 hours ago the next WU ended with an Unhandled Exception Detected:

ss-neg-1i17__7365_3969_1

This WU had the same error before running on another computer.

Path7.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 59667 - Posted 19 Feb 2009 5:04:25 UTC

Another one snuck through:

ss-neg-1i17__7365_4076_1

Looks like I'll have to abort all these on sight. I'm not sure any of them have run successfully for me yet. :(
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59668 - Posted 19 Feb 2009 7:07:58 UTC

New error -161 on both Mini 1.54 and 5.98 ...

Mini-1.54
229605017
229597762
229594079
229593677

5.98
229601150

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59672 - Posted 19 Feb 2009 16:29:16 UTC
Last modified: 19 Feb 2009 16:32:01 UTC

Hey! A very strange one! It's valid, but with Hbond tripped and a very short runtime, 2380 secs instead of ~10000:
loopbuild_chunk_1_3_B_hb_t357__IGNORE_THE_REST_1VBGA_4_7477_27_0

BTW, I notice that all my wrong results (and this last one) are loopbuild_chunk*.

xrobert Profile

Joined: Oct 28 05
Posts: 3
ID: 7210
Credit: 103,543
RAC: 0
Message 59674 - Posted 19 Feb 2009 18:02:55 UTC

So far, all my Minirosetta WUs are getting stuck. I have to abort them.
The normal WUs work fine.


____________

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59677 - Posted 20 Feb 2009 7:03:21 UTC - in response to Message ID 59655.
Last modified: 20 Feb 2009 7:12:40 UTC

mod.sense

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?


It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same BOINC version.

Some things I noticed:
- When a 0% task (only with Rosetta 1.54) is paused manually after x hours and then restarted, the time also resets to 0.
- When the 1.54 task starts, both processors get work (multiple projects). However, when one of the other projects' tasks stops, the 2nd processor starts idling. It cannot get another task to run from Rosetta or any other project, despite the queue having multiple tasks ready to start or continue.

I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It had already run 24h+ before, but because of a pause its time was reset. At this moment it is at 19h again. I will let it run until it gets past 31h of runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.

[edit]Changed "all" in "both" and corrected a typo[/edit]

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59683 - Posted 20 Feb 2009 14:32:26 UTC

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
____________
Rosetta Moderator: Mod.Sense

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59684 - Posted 20 Feb 2009 15:23:06 UTC - in response to Message ID 59683.

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59686 - Posted 20 Feb 2009 16:41:25 UTC - in response to Message ID 59684.

rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.

I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.

Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?

Standard setup with full authority running on a local hard drive. No fancy settings.

I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?

Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...


Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59688 - Posted 20 Feb 2009 18:31:56 UTC
Last modified: 20 Feb 2009 18:33:48 UTC

robertmiles, if you were directing the question to me, I try to stay out of that one. And am only recommending a change to BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF problems reported on the 6.6 (which is the current test version) and I think 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (nothing against 6.2.28, but it's not listed anymore for some reason)

You can see more BOINC versions for download on this page:
http://boinc.berkeley.edu/download_all.php
____________
Rosetta Moderator: Mod.Sense

TimL

Joined: Sep 16 06
Posts: 14
ID: 112884
Credit: 8,492,974
RAC: 5,867
Message 59723 - Posted 22 Feb 2009 9:59:14 UTC

Hi all,
loopbuild_mamaln_ideal_hb_t305__IGNORE_THE_REST_1zc0_1_7630_19 finished early with error -
Access Violation (0xc0000005) at address 0x7C91AA01 read attempt to address 0x0D1BF548

Haven't had much luck getting errors of late but will mention that I had just bumped the bus speed up a touch when this error occurred.


____________

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 59751 - Posted 23 Feb 2009 7:06:15 UTC - in response to Message ID 59045.

Hi:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237514
http://boinc.bakerlab.org/rosetta/result.php?resultid=229145242
http://boinc.bakerlab.org/rosetta/result.php?resultid=228892067
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820491
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820477

Any tips?

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59752 - Posted 23 Feb 2009 7:50:20 UTC - in response to Message ID 59688.

Mod.Sense

And am only recommending a change to BOINC version because problems are occurring with the version installed now.

I set up Boinc 6.4.5 on that computer, and it seems to be running fine with Rosetta. I still will wait for a general upgrade until there are new Boinc versions, I think.

robertmiles
"Current" for me is the version that the actual BOINC site offers as standard. Researching older versions and installing those is too much micromanagement for me. The same goes for posting on the boards... If this problem gets solved with 6.4.5 (and it seems to be solved) then I'm off again.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 59756 - Posted 23 Feb 2009 14:09:26 UTC - in response to Message ID 59751.

Hi:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237514
http://boinc.bakerlab.org/rosetta/result.php?resultid=229145242
http://boinc.bakerlab.org/rosetta/result.php?resultid=228892067
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820491
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820477

Any tips?


Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. Something specific to the 1i17, the other ss-neg's do not seem to be having any trouble.

Except for your last one on the list, it got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59761 - Posted 23 Feb 2009 18:49:59 UTC

-161 error on 230728890

RodrigoPS
Avatar

Joined: Nov 28 08
Posts: 3
ID: 289807
Credit: 860,206
RAC: 14
Message 59782 - Posted 24 Feb 2009 22:01:20 UTC

I noticed that with the minirosetta 1.54 the granted credit was very low in the Athlon X2 processors - sometimes half the claimed credit. This did not occur with the single core Athlon.

RodrigoPS
Avatar

Joined: Nov 28 08
Posts: 3
ID: 289807
Credit: 860,206
RAC: 14
Message 59834 - Posted 27 Feb 2009 0:05:01 UTC - in response to Message ID 59782.

I noticed that with the minirosetta 1.54 the granted credit was very low in the Athlon X2 processors - sometimes half the claimed credit. This did not occur with the single core Athlon.


Problem solved. Updating the motherboard BIOS (F8 to F9) caused a considerable loss of performance on PCs with Athlon X2 processors. Restoring BIOS F8 returned the system to normal.
____________

Mike* Profile

Joined: Feb 16 09
Posts: 5
ID: 301833
Credit: 102,030
RAC: 0
Message 59835 - Posted 27 Feb 2009 1:49:33 UTC

Hi all,
Had the below error show up.
I initially downloaded 3 WUs; the first 2 bombed and I aborted the 3rd. I then detached, re-attached, and downloaded 11 new ones.

Every one of them went south..

Boinc mgr is 6.2.18

Free disk is 88g
Used by boinc is 4.81
Use at most 100g
Leave 0
Use up to 50% disk
Leave apps in memory.
The only other project (which was suspended) was CPDN at 55% @ 1004 hrs (do not want to lose this).

My host is 1008545 (should be viewable)

At this point, I will wait until next week (SIMAP is starting its monthly run soon :)) and will try again.
I don't want to keep trashing WUs for no reason.

I do have the messages from boinc stored if they would be useful, but here is one thing I see, but it may only be due to the process crashing:

2/26/2009 8:04:04 PM|rosetta@home|Starting lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0
2/26/2009 8:04:05 PM|rosetta@home|Starting task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 using minirosetta version 154
2/26/2009 8:04:19 PM|rosetta@home|Computation for task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 finished
2/26/2009 8:04:19 PM|rosetta@home|Output file lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0_0 for task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 absent
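
For anyone sifting through a long message log like the one above, here is a minimal Python sketch that splits each line into its three pipe-separated fields and lists the tasks whose output file was reported absent. The function names are hypothetical, and the timestamp format is only assumed from the lines shown here; other BOINC versions or locales may print dates differently.

```python
from datetime import datetime


def parse_boinc_message(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    stamp, project, message = line.strip().split("|", 2)
    # Assumed format, taken from the log lines above: 2/26/2009 8:04:19 PM
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def tasks_with_absent_output(lines):
    """Return task names from 'Output file X for task Y absent' messages."""
    names = []
    for line in lines:
        _, _, message = parse_boinc_message(line)
        if message.startswith("Output file ") and message.endswith(" absent"):
            body = message[: -len(" absent")]
            names.append(body.split(" for task ", 1)[1])
    return names
```

Feeding it the four lines above would return the single task name from the last line.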

Thanks

mike

(extra blank lines removed)
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 2-26 20:10: 2:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x009882EA
Engaging BOINC Windows Runtime Debugger...
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x0040118E
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
]]>
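
A side note on the "exit code -1073741819 (0xc0000005)" line in the output above: the two numbers are the same 32-bit value, once read as a signed integer and once as unsigned hex. A small Python sketch of that conversion (the function name is made up for illustration):

```python
def exit_code_to_hex(code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x%08x" % (code & 0xFFFFFFFF)


print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```

That hex value matches the Access Violation code shown in the stderr text, so the negative exit code and the exception record are describing the same failure.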

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59839 - Posted 27 Feb 2009 6:05:40 UTC - in response to Message ID 59835.
Last modified: 27 Feb 2009 6:12:07 UTC

Hi all,
Had the below error show up.
I initially downloaded 3 WUs; the first 2 bombed and I aborted the 3rd. I then detached, re-attached, and downloaded 11 new ones.

Every one of them went south..

Boinc mgr is 6.2.18

Free disk is 88g
Used by boinc is 4.81
Use at most 100g
Leave 0
Use up to 50% disk
Leave apps in memory.
The only other project (which was suspended) was CPDN at 55% @ 1004 hrs (do not want to lose this).


mike

(extra blank lines removed)
<core_client_version>6.2.18</core_client_version>



A few questions that may help pin down the problem:

Are you able to find BOINC 6.2.28, and willing to upgrade to it? That's the only version I have used since 5.10.45, and I don't have that problem.

Have you gone to any extra effort to tell BOINC that it could use more virtual memory than the default?

Have you gone to any extra effort to tell your copy of Windows to allow a bigger swap file than the default?

How many BOINC projects do you have your BOINC Manager set up to recognize? I've seen some so far rather indistinct signs that BOINC divides the disk space it is allowed to use into equal sections for each BOINC project it recognizes before it starts dividing those sections into smaller subsections for each workunit. Therefore, if one BOINC project is heavy on disk space use, workunits for that project might run out of disk space even if some other BOINC project doesn't need all that is reserved for it.

Does this site tell you how much memory your machine has now and what the maximum for that model of computer is?

http://www.crucial.com/

I had problems getting my dual-core CPU to run two Rosetta@home workunits at the same time back when I had only 1 GB of memory to share between Vista and the two workunits, so I ordered an upgrade to the 2 GB maximum my model of computer can handle; now I can run two such workunits at once even while typing this.

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 59845 - Posted 27 Feb 2009 10:25:55 UTC - in response to Message ID 59756.
Last modified: 27 Feb 2009 10:26:31 UTC

Hi:

http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237620
http://boinc.bakerlab.org/rosetta/result.php?resultid=229237514
http://boinc.bakerlab.org/rosetta/result.php?resultid=229145242
http://boinc.bakerlab.org/rosetta/result.php?resultid=228892067
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820491
http://boinc.bakerlab.org/rosetta/result.php?resultid=228820477

Any tips?


Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. Something specific to the 1i17, the other ss-neg's do not seem to be having any trouble.

Except for your last one on the list, it got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?

Right, the last one was a multi-fix update from our "beloved" Microsoft....

Mike* Profile

Joined: Feb 16 09
Posts: 5
ID: 301833
Credit: 102,030
RAC: 0
Message 59846 - Posted 27 Feb 2009 11:48:29 UTC - in response to Message ID 59839.

A few questions that may help pin down the problem:

Are you able to find BOINC 6.2.28, and willing to upgrade to it? That's the only version I have used since 5.10.45, and I don't have that problem.

Have you gone to any extra effort to tell BOINC that it could use more virtual memory than the default?

Have you gone to any extra effort to tell your copy of Windows to allow a bigger swap file than the default?

How many BOINC projects do you have your BOINC Manager set up to recognize? I've seen some so far rather indistinct signs that BOINC divides the disk space it is allowed to use into equal sections for each BOINC project it recognizes before it starts dividing those sections into smaller subsections for each workunit. Therefore, if one BOINC project is heavy on disk space use, workunits for that project might run out of disk space even if some other BOINC project doesn't need all that is reserved for it.

Does this site tell you how much memory your machine has now and what the maximum for that model of computer is?

http://www.crucial.com/

I had problems getting my dual-core CPU to run two Rosetta@home workunits at the same time back when I had only 1 GB of memory to share between Vista and the two workunits, so I ordered an upgrade to the 2 GB maximum my model of computer can handle; now I can run two such workunits at once even while typing this.



The odd thing is that I had successfully finished 3 models a few days ago, and a couple before that (can't remember the version offhand; only 1 WU at a time), with no issues. I am attached to 7 projects but am not running them all. I set the projects to NNT and keep a small buffer so as not to have to worry about having too much. (Yeah, I know BOINC manages it, but I want to make sure everything gets done quickly.)
When you mentioned BOINC dividing the disk space, it made me wonder whether I had the non-active projects suspended, which I have usually done in the past.
I will retry after I get through the SIMAP run (this is why I keep the tasks low), making sure my buffer is small so that I hopefully do not grab 11 tasks.


Thanks

Mike


TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 59847 - Posted 27 Feb 2009 12:59:41 UTC
Last modified: 27 Feb 2009 13:02:04 UTC

Another bug:
http://boinc.bakerlab.org/rosetta/result.php?resultid=231152575
loopbuild_reference_allmodels_hb_t360

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 59848 - Posted 27 Feb 2009 13:33:44 UTC - in response to Message ID 59655.

[Mod.Sense]

I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine.


Last update: everything seems to be ok after I updated the Boinc version to 6.4.5. The exact reason for the 0% progress with Mini Rosetta is still a mystery but at least that computer is crunching again.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59850 - Posted 27 Feb 2009 13:49:41 UTC - in response to Message ID 59846.

The odd thing is that I had successfully finished 3 models a few days ago, and a couple before that (can't remember the version offhand; only 1 WU at a time), with no issues. I am attached to 7 projects but am not running them all. I set the projects to NNT and keep a small buffer so as not to have to worry about having too much. (Yeah, I know BOINC manages it, but I want to make sure everything gets done quickly.)
When you mentioned BOINC dividing the disk space, it made me wonder whether I had the non-active projects suspended, which I have usually done in the past.
I will retry after I get through the SIMAP run (this is why I keep the tasks low), making sure my buffer is small so that I hopefully do not grab 11 tasks.


Thanks

Mike




Another question that may help pin down the problem:

Did you have graphics enabled at any time during those runs? When I run minirosetta 1.58 for RALPH@home, it completes successfully if I never enable graphics, but fails if I have graphics enabled for a short time during the run.

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59858 - Posted 27 Feb 2009 16:05:56 UTC

Another bunch of Hbond tripped errors:
hw_mamaln_t290_3_hb_1xyh__IGNORE_THE_REST_1ihg_1_SAVE_ALL_OUT_7736_375_0
hw_mamaln_t290_3_hb_1ihg__IGNORE_THE_REST_1cyn_1_SAVE_ALL_OUT_7729_256_0
hw_mamaln_t290_3_hb_t290__IGNORE_THE_REST_1zkc_1_SAVE_ALL_OUT_7743_255_0
hw_mamaln_t290_3_hb_t290__IGNORE_THE_REST_1xwn_1_SAVE_ALL_OUT_7743_255_0

First three of them have valid status and:
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
called boinc_finish

Mike* Profile

Joined: Feb 16 09
Posts: 5
ID: 301833
Credit: 102,030
RAC: 0
Message 59867 - Posted 27 Feb 2009 23:34:09 UTC - in response to Message ID 59850.
Last modified: 28 Feb 2009 0:10:54 UTC

Another question that may help pin down the problem:

Did you have graphics enabled at any time during those runs? When I run minirosetta 1.58 for RALPH@home, it completes successfully if I never enable graphics, but fails if I have graphics enabled for a short time during the run.


No, I did not have the graphics running; the process crashed immediately upon startup (or at least within a few seconds).

Interesting thing..

Normally I only have 1 to 3 projects un-suspended at one time. I had more than that un-suspended, but set to No New Tasks.
I suspended ALL projects, shut down, and rebooted.
I started up BOINC, set it to not keep projects in memory and to 50% CPU (use the 1 core, non-HT), un-suspended Rosetta, asked for tasks, and hit update. It gave me 6, and then I let it do its thing.
Guess what: no issues.
I suspended 5 of the tasks to let the 1 run.
I also re-adjusted to 100% to use HT, restarted Docking, and had several Docking tasks and 1 Rosetta task finish.

Might be due to allocating memory among the active projects..

I am wondering whether any of the other bugs I saw here are the same issue with too many "active projects".
The programmer in me suspects that. Not knowing what goes on inside BOINC, I could not tell (besides, I don't do C++ or later).

Thanks for the "insight"..
Mike

p.s. added answer on graphics and spellings.

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59874 - Posted 28 Feb 2009 15:43:44 UTC

Very long WU (25000 seconds), probably ended by timeout (intended runtime + 4 hours):
wt_ub_BOINC_ABRELAX_3MERS_NOHOMS_t482_SAVE_ALL_OUT_IGNORE_THE_REST-S25-3-S3-3--wt_ub-_7707_42783_0

It slows down at about 90%, and I see in the graphics that for about 4 hours it does the SmallMoverMoverBase+Minimization stage.

And it's also a Hbond tripped result :(

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 59877 - Posted 28 Feb 2009 16:53:46 UTC - in response to Message ID 59874.

Very long WU (25000 seconds), probably ended by timeout (intended runtime + 4 hours):

And it's also a Hbond tripped result :(


This one is interesting as it was completed successfully by a second computer in less than half the time and both were run on Linux machines.

____________

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 59878 - Posted 28 Feb 2009 18:25:30 UTC - in response to Message ID 59877.

Very long WU (25000 seconds), probably ended by timeout (intended runtime + 4 hours):

And it's also a Hbond tripped result :(


This one is interesting as it was completed successfully by a second computer in less than half the time and both were run on Linux machines.


Maybe it's because I have a 64-bit Linux?

root

Joined: Feb 16 09
Posts: 6
ID: 301869
Credit: 24,387
RAC: 0
Message 59931 - Posted 2 Mar 2009 20:34:01 UTC

I'm getting this same error for nearly all WUs on two Linux boxes running FC8 and FC9 with kernels 2.6.23.1-42.fc8 and 2.6.25.14-108.fc9.x86_64, respectively.

In addition, I have a third Linux laptop running FC9 with no problems whatsoever. All 3 machines are running with leave_apps_in_memory=0.

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

Any ideas?
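
For reference, the three numbers in that message look like one value printed three ways: 193 in decimal, 0xc1 in hex, and -63 when the same byte is read as signed. A minimal Python sketch of that conversion (the helper name is made up for illustration and is not part of BOINC):

```python
def as_signed_byte(code: int) -> int:
    """Interpret the low 8 bits of an exit code as a signed byte."""
    low = code & 0xFF
    # Values 128..255 wrap around to -128..-1 in two's complement.
    return low - 256 if low >= 128 else low


if __name__ == "__main__":
    print(hex(193), as_signed_byte(193))
```

This only explains the notation; it says nothing about why the process exited with that code.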

hedera Profile

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 2,486,188
RAC: 1,390
Message 59948 - Posted 3 Mar 2009 19:46:19 UTC

I've had 2 Windows error messages in the last couple of days from Rosetta. This is on a Win XP Pro SP2 system. The last one was this morning. I looked at my results today and this WU has crashed at 15:13:50 UTC:

2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431

Checking my message log, I found these messages:


03/03/2009 6:00:54 AM|rosetta@home|Restarting task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 using rosetta_beta version 598
03/03/2009 6:01:41 AM|rosetta@home|Task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 exited with zero status but no 'finished' file
03/03/2009 6:01:41 AM|rosetta@home|If this happens repeatedly you may need to reset the project.


Identical messages repeated until 7:12 AM when I got this:


03/03/2009 7:12:14 AM|rosetta@home|Computation for task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 finished
03/03/2009 7:12:14 AM|rosetta@home|Output file 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2_0 for task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 absent
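
The message-log lines above appear to be three fields separated by `|` (timestamp, project, message). A rough Python sketch for picking out the repeated restarts from a saved copy of the log, assuming that layout holds for every line (the function names are hypothetical):

```python
def parse_log_line(line: str):
    """Split one message-log line into (timestamp, project, message)."""
    # Split only on the first two bars so a bar inside the message survives.
    stamp, project, message = line.rstrip("\n").split("|", 2)
    return stamp, project, message


def count_no_finished_file(lines):
    """Count 'exited with zero status but no 'finished' file' messages."""
    needle = "exited with zero status but no 'finished' file"
    return sum(1 for line in lines if needle in parse_log_line(line)[2])


if __name__ == "__main__":
    sample = [
        "03/03/2009 6:00:54 AM|rosetta@home|Restarting task X using rosetta_beta version 598",
        "03/03/2009 6:01:41 AM|rosetta@home|Task X exited with zero status but no 'finished' file",
    ]
    print(count_no_finished_file(sample))
```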


If you look at the task details for WU 209583003 on computer 272841, you'll see this error followed by a dump:


<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2834914

Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x008BB955 read attempt to address 0x09A9C000

Engaging BOINC Windows Runtime Debugger...

********************


I'm sure it isn't meant to do this...
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59955 - Posted 3 Mar 2009 22:53:51 UTC - in response to Message ID 59948.

I've had 2 Windows error messages in the last couple of days from Rosetta. This is on a Win XP Pro SP2 system. The last one was this morning. I looked at my results today and this WU has crashed at 15:13:50 UTC:

2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431

Checking my message log, I found these messages:


03/03/2009 6:00:54 AM|rosetta@home|Restarting task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 using rosetta_beta version 598
03/03/2009 6:01:41 AM|rosetta@home|Task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 exited with zero status but no 'finished' file
03/03/2009 6:01:41 AM|rosetta@home|If this happens repeatedly you may need to reset the project.


Could you check the results uploaded for this one and see if the results include any mention of lockfile problems?

Also, a few questions that may help pin down the problem:

1. Do the error messages shown above repeat several times, and do the lockfile error messages if any repeat several times?

2. What version of BOINC are you using?

3. Have you enabled the leave in memory option?

4. What percentage of CPU time do you let BOINC projects use? The 60% setting typical for laptops, the 100% setting typical for desktops, or something else?

5. Did this workunit start with graphics enabled? Did you enable graphics later? Did you then shut down graphics for it?

senatoralex85

Joined: Sep 27 05
Posts: 66
ID: 1329
Credit: 169,644
RAC: 0
Message 59959 - Posted 4 Mar 2009 2:43:32 UTC

Once in a while, I get a Microsoft Visual C++ Runtime Library Error. It is for minirosetta_1.54_windows_intelx86.exe. The error message reads "This application has requested the runtime to terminate it in an unusual way. Please contact the application's support team for more information."

Received it for this workunit. http://boinc.bakerlab.org/rosetta/result.php?resultid=232499308

Currently Using XP service pack 2 with Boinc version 5.10.45
____________

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 59961 - Posted 4 Mar 2009 4:14:04 UTC
Last modified: 4 Mar 2009 4:22:39 UTC

Task ID 232649967 isn't displaying graphics; instead it's displaying a black window. When I move my cursor around in the black window, a white block of what looks like unreadable text moves around under my cursor. It's a 2vik task. The task finished with a successful outcome. Task ID 232649968 also isn't displaying graphics; instead it's displaying a black window. When I try to close the black window it comes up with End Program; my options are End Now or Cancel. I chose End Now. The task finished with a successful outcome.

I am using XP Pro SP3 fully patched and Boinc 6.4.7 on a quad 2.66 with 2.87GB Ram. I'm not sure if that will make a difference or not.
Has anybody else had any of the above issues?
Thanks in advance for any information as to why this could be happening.
____________
Have a crunching good day!!

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59967 - Posted 4 Mar 2009 15:04:11 UTC
Last modified: 4 Mar 2009 15:50:43 UTC

The lockfile problem again:

http://boinc.bakerlab.org/rosetta/result.php?resultid=232787694

Starting work on structure: _1NRGA_7_00029
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting

Plus many more copies of the same error messages.
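
To put a number on "many more copies", the repeats can be counted from a saved copy of the stderr text. A small Python sketch, assuming the error line always reads exactly as quoted above (the function name is made up for illustration):

```python
LOCKFILE_ERROR = "Can't acquire lockfile - exiting"


def count_lockfile_errors(stderr_text: str) -> int:
    """Count how many lines of the stderr output are the lockfile error."""
    return sum(
        1 for line in stderr_text.splitlines() if line.strip() == LOCKFILE_ERROR
    )


if __name__ == "__main__":
    sample = (
        "BOINC:: Initializing ... ok.\n"
        "Can't acquire lockfile - exiting\n"
        "BOINC:: Initializing ... ok.\n"
        "Can't acquire lockfile - exiting\n"
    )
    print(count_lockfile_errors(sample))
```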

I run BOINC 6.2.28 at 95% CPU with the leave in memory option, under Vista SP1. I didn't enable graphics at all for this workunit.

A wingman, apparently with a shorter requested workunit length, completed only 9 decoys, but successfully.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 59978 - Posted 4 Mar 2009 21:31:15 UTC

@Robert,

A wild question ... did you enable or run graphics for any task for any project?

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 59979 - Posted 4 Mar 2009 22:16:07 UTC - in response to Message ID 59978.

@Robert,

A wild question ... did you enable or run graphics for any task for any project?


Hard to remember. I often go for days without using graphics on any BOINC project these days.

When I was testing whether graphics triggered the problem for minirosetta 1.58 over on RALPH@home, though, it seemed to be only graphics for a 1.58 workunit, not graphics for a 1.54 workunit, that triggered the problem, and only for the 1.58 workunit.

I probably used graphics for purposes unrelated to BOINC projects, though, which hasn't triggered such a problem for me in the past.

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60004 - Posted 6 Mar 2009 21:50:18 UTC
Last modified: 6 Mar 2009 21:50:58 UTC

Wow, this one has a 2.15MB result file for 24hrs of crunching. 100K is more what I am used to seeing. Task name is lrfrag_0_8_hb_t308__IGNORE_THE_REST_ 1M2OB_8_7783_69_0
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 60077 - Posted 11 Mar 2009 17:14:20 UTC

Mod.Sense
Another 0% progress Minirosetta task, on another computer. 84:25:24 time progress of a projected 10:17:32 duration. Windows XP, SP3, vintage computer. Boinc version 6.2.18.

Task is now aborted, Boinc upgraded to 6.4.7. Am I still the only one noticing this?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60078 - Posted 11 Mar 2009 17:26:56 UTC
Last modified: 11 Mar 2009 17:40:23 UTC

rembertw, I haven't heard any other reports. You seem to be translating the screen to English for me, and I appreciate that, but it's never entirely clear what you are referring to. On the English screen, there are three columns of interest, "CPU time", "Progress" (the percentage), and "To completion".

If I understand, what you are saying is that
CPU time was 84hrs
progress was 0%
and to completion was still 10 more hours?

Is this the task you had to abort?

Now that you have aborted that one, what does the next task show for the "to completion" before it starts?

If the above is the correct task, it looks like the host only has 256MB of memory, and 32MB of that is likely devoted to your graphics card. The current recommendation for running Rosetta is a machine with 512MB of memory or more. (They recently increased that from 256MB when they began running more tasks that require more memory.)

It looks like that machine has been having trouble earning credit for some time. I see you also do work for WCG. I've noticed that the rice project there runs in about 10MB of memory! So, perhaps that would run better on that machine.

I see you have a very large list of projects you do work for. Are all of your hosts using an account manager and dividing their resource share across all 8 projects? You also have 13 machines active, at least for Rosetta. You might want to create a separate account, or separate venue, to separate your P4 machines from your Core 2s. That way you could have some machines do more work for WCG, for example, and others do more for Rosetta, based on each machine's configuration.
____________
Rosetta Moderator: Mod.Sense

alpha Profile

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 868,017
RAC: 764
Message 60090 - Posted 12 Mar 2009 7:03:31 UTC

Visual C++ runtime error with this task after 51,711 seconds:

http://boinc.bakerlab.org/rosetta/result.php?resultid=234626846
____________

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 60095 - Posted 12 Mar 2009 11:44:29 UTC - in response to Message ID 60078.

If I understand, what you are saying is that
CPU time was 84hrs
progress was 0%
and to completion was still 10 more hours?

Mod.Sense, you understood correctly. Indeed, I sometimes have to guess how BOINC translated the English into Dutch. And indeed, it was that task.

I have all my computers under Gridrepublic. It would simply take too much time to micromanage every single computer so I don't even try to. Up until now I simply assumed that projects would not give work to computers that did not have the minimum requirements so I never bothered checking every project for that. From your reply I take it that Rosetta does not do such a test.

There is indeed a list of projects that I have active, but I never run them all at the same time. Right now there are only 2 projects active with a stable feed (Rosetta, WCG), 2 projects that send WUs when they have them (simap, LHC) and one that is only on a couple of computers, and set on NNT (orbit).

Now I set that last computer on NNT for Rosetta since it's got a limited configuration.

I realise that this does not belong here, but it would be interesting if there were a manager like Gridrepublic or BAM that looks at the connected computers, and divides the projects over the available processors. Let's say that for now I have 20 processors available, and Rosetta gets 10% resource share, then Rosetta would get 1 computer with 2 processors working only for Rosetta. All this without having me driving from location to location if I want to change settings. It would help, indeed, in available memory, available disk space and so on.
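
The arithmetic behind that example, as a small Python sketch. This is only an illustration of the idea; no such feature exists in GridRepublic or BAM as far as this thread says, and the function name is hypothetical:

```python
def processors_for_share(total_processors: int, share_percent: float) -> int:
    """Number of whole processors a project would get for a given resource share."""
    return round(total_processors * share_percent / 100)


if __name__ == "__main__":
    # 20 processors with a 10% share works out to 2 processors,
    # i.e. one dual-processor computer working only for that project.
    print(processors_for_share(20, 10))
```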

Since there is no such thing for now, I'll just go on as I'm used to: if there's a problem, then upgrade Boinc, and set Rosetta to NNT on the older computers. I can equal out a little by increasing the resource share for every computer set on NNT.

Steven Pletsch

Joined: Oct 17 07
Posts: 17
ID: 213103
Credit: 282,298
RAC: 0
Message 60097 - Posted 12 Mar 2009 14:24:19 UTC

Ran into a couple errors, both on the same machine.

I really don't know how to make heads or tails of the errors, but there is a lot of information there.

http://boinc.bakerlab.org/rosetta/result.php?resultid=234977428

http://boinc.bakerlab.org/rosetta/result.php?resultid=234904088

I believe it's something with the machine, since I'm not having errors on any others, and it's only been attached for about 24 hours.

I am curious if there is anything in the debug info that might point to a clue as to what is up with it.

Anyone that can provide some insight would be much appreciated.

Thanks
____________
"Every passing hour brings the Solar System forty-three thousand miles closer to Globular Cluster M13 in Hercules -- and still there are some misfits who insist that there is no such thing as progress." - Kurt Vonnegut

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60099 - Posted 12 Mar 2009 16:03:34 UTC

A 1.54 workunit that hit the lockfile problem:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=213726108

A wingman avoided the problem, apparently by choosing a workunit length short enough to end it before my copy hit the problem.

rembertw

Joined: Apr 21 07
Posts: 14
ID: 167471
Credit: 628,529
RAC: 0
Message 60123 - Posted 13 Mar 2009 7:25:17 UTC - in response to Message ID 60078.

Now that you have aborted that one, what does the next task show for the "to completion" before it starts?

I did not answer this question, but the answer would have been 10:17:32 time to completion if I accepted new tasks from Rosetta on that computer. (I do not). On other computers the time to completion values that I get vary from 9:something up to 12:something depending on the computer. Meaning that every computer has a different "time to completion" but different tasks on one computer have the same value.

Ivor Cogdell

Joined: Nov 7 06
Posts: 10
ID: 127694
Credit: 15,627
RAC: 0
Message 60165 - Posted 15 Mar 2009 17:59:48 UTC

Hi folks,
I try to run Minirosetta 1.54 (Windows XP Home SP3, BOINC Manager 6.4.5, wxWidgets version 2.8.7), but my Kaspersky 2009 Internet Security (Version 8.0.0.0.506) blocks it from running and throws up a black error message. I have tried to view the report, but that does not give me any information on how to rectify the problem.
The standard Rosetta program will run OK (as of the 22 Feb workunit). Any suggestions, please?

Ivor Cogdell
Birmingham, UK

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2562
ID: 98229
Credit: 958,139
RAC: 127
Message 60166 - Posted 15 Mar 2009 19:15:53 UTC

http://boinc.bakerlab.org/rosetta/results.php?hostid=267483

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60204 - Posted 17 Mar 2009 23:33:24 UTC - in response to Message ID 60166.

http://boinc.bakerlab.org/rosetta/results.php?hostid=267483


Looks like you've managed to make your queue of workunits so long you don't return them by the deadline, so other people run them and get credit for them before you do.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60205 - Posted 17 Mar 2009 23:34:06 UTC - in response to Message ID 60166.
Last modified: 17 Mar 2009 23:35:32 UTC

[duplicate deleted]

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 60209 - Posted 18 Mar 2009 6:01:10 UTC - in response to Message ID 60166.

http://boinc.bakerlab.org/rosetta/results.php?hostid=267483


I think the problem is not with the application but with your computer. Why are you always returning the tasks after the deadline?
____________

Ivor Cogdell

Joined: Nov 7 06
Posts: 10
ID: 127694
Credit: 15,627
RAC: 0
Message 60228 - Posted 19 Mar 2009 21:52:39 UTC

Kaspersky Internet Security does not allow the launch of the Rosetta Mini 1.54 program because it has a danger rating of 82 and no digital signature. Do you need to add this at your server before sending?

Regards,

Ivor
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 60236 - Posted 20 Mar 2009 9:43:37 UTC

I am unable to upload 236435348
lb_alnmatrix_5700_6000_hb_t293__IGNORE_THE_REST_2as0_269_8708_3_0

The Message says that the servers may be down.

This is strange because I have just transferred and reported three other work units.
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60254 - Posted 21 Mar 2009 13:41:36 UTC

A lockfile problem again:

http://boinc.bakerlab.org/rosetta/result.php?resultid=236718529

I'm still running at 95% CPU time in order to help Ralph@home track down the lockfile problem.

trick@planet3dnow

Joined: Feb 21 09
Posts: 8
ID: 302635
Credit: 53,370
RAC: 0
Message 60260 - Posted 22 Mar 2009 2:57:01 UTC
Last modified: 22 Mar 2009 3:01:21 UTC

hi!
as already posted here: http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4771 on my pc (this one here): http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1012657
i get lots of validate errors and several client errors (too many to link each of them here). the usual symptom is greatly increased processing time. the work units should run 3 hours, but they run for 7 hours.

when i notice that a work unit takes much too long, should i abort it? or let it run until it fails to validate after 7 hours?

Alberthuang

Joined: Dec 5 05
Posts: 6
ID: 30308
Credit: 61,178
RAC: 105
Message 60261 - Posted 22 Mar 2009 3:12:53 UTC

My computer's OS is Windows XP SP3, using BOINC manager version 5.10.45. It computed two workunits (1hz6A_BOINC_ABINITIO_IGNORE_THE_REST-MOO18-S25-9-S3-9--1hz6A-_7873_76 and lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652) with minirosetta version 1.54, and both of them ended with a compute error. Of course, both of these workunits were invalid.

The former one (workunit 1hz6A_BOINC_ABINITIO_IGNORE_THE_REST-MOO18-S25-9-S3-9--1hz6A-_7873_76) spent more than 4.5 hours of CPU time on my computer, and a Windows message reported a C++ Runtime error when this workunit crashed. When this happened, I was using the Mozilla Firefox browser V 3.0, and Firefox also closed unexpectedly at almost the same time. The task details follow:
Task ID 234173364
Name 1hz6A_BOINC_ABINITIO_IGNORE_THE_REST-MOO18-S25-9-S3-9--1hz6A-_7873_76_0
Workunit 213483545
Created 9 Mar 2009 7:21:46 UTC
Sent 9 Mar 2009 7:23:00 UTC
Received 17 Mar 2009 8:07:24 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 224205
Report deadline 19 Mar 2009 7:23:00 UTC
CPU time 17563.45
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 3-16 14:16:21:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _MOO18U9X9X_00001
# cpu_run_time_pref: 21600
Starting work on structure: _MOO18U9X9X_00002
Starting work on structure: _MOO18U9X9X_00003
Starting work on structure: _MOO18U9X9X_00004
BOINC:: Initializing ... ok.
[2009- 3-17 11:23:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 21600
Starting work on structure: _MOO18U9X9X_00004
Continuing computation from checkpoint: chk_S_MOO18U9X9X_00000004_ClassicAbinitio__stage_1 ... success!
Continuing computation from checkpoint: chk_S_MOO18U9X9X_00000004_ClassicAbinitio__stage_2 ... success!
Starting work on structure: _MOO18U9X9X_00005
Starting work on structure: _MOO18U9X9X_00006
Starting work on structure: _MOO18U9X9X_00007
Starting work on structure: _MOO18U9X9X_00008
Starting work on structure: _MOO18U9X9X_00009
Starting work on structure: _MOO18U9X9X_00010


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 03/17/09 16:01:02
Install Directory : C:\Program Files\BOINC\
Data Directory : C:\Program Files\BOINC
Project Symstore :
Loaded Library : C:\Program Files\BOINC\\dbghelp.dll
Loaded Library : C:\Program Files\BOINC\\symsrv.dll
Loaded Library : C:\Program Files\BOINC\\srcsrv.dll
LoadLibraryA( C:\Program Files\BOINC\\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\Program Files\BOINC\slots\1;C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta;srv*C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosettasymbols*http://msdl.microsoft.com/download/symbols;srv*C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosettasymbols*http://boinc.berkeley.edu/symstore


ModLoad: 00400000 00724000 C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta\minirosetta_1.54_windows_intelx86.exe (-nosymbols- Symbols Loaded)
Linked PDB Filename : D:\boinc_build\minirosetta_windows\mini\Visual Studio\BoincRelease\minirosetta_1.54_windows_intelx86.pdb

ModLoad: 7c920000 00094000 C:\WINDOWS\system32\ntdll.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntdll.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 7c800000 0011f000 C:\WINDOWS\system32\kernel32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : kernel32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 77d10000 0008f000 C:\WINDOWS\system32\USER32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : user32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 77ef0000 00049000 C:\WINDOWS\system32\GDI32.dll (5.1.2600.5698) (PDB Symbols Loaded)
Linked PDB Filename : gdi32.pdb
File Version : 5.1.2600.5698 (xpsp_sp3_gdr.081022-1932)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5698

ModLoad: 77da0000 000a7000 C:\WINDOWS\system32\ADVAPI32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 77e50000 00092000 C:\WINDOWS\system32\RPCRT4.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : rpcrt4.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2108)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77fc0000 00011000 C:\WINDOWS\system32\Secur32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : secur32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 76300000 0001d000 C:\WINDOWS\system32\IMM32.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : imm32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 621f0000 00009000 C:\WINDOWS\system32\LPK.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : lpk.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 73fa0000 0006b000 C:\WINDOWS\system32\USP10.dll (1.420.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : usp10.pdb
File Version : 1.0420.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Uniscribe Unicode script processor
Product Version : 1.0420.2600.5512

ModLoad: 76cb0000 00020000 C:\WINDOWS\system32\NTMARTA.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 77be0000 00058000 C:\WINDOWS\system32\msvcrt.dll (7.0.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.2600.5512

ModLoad: 76990000 0013d000 C:\WINDOWS\system32\ole32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ole32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2108)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 71b70000 00013000 C:\WINDOWS\system32\SAMLIB.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : samlib.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 76f30000 0002c000 C:\WINDOWS\system32\WLDAP32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : wldap32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512

ModLoad: 0b610000 00115000 C:\Program Files\BOINC\dbghelp.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5

ModLoad: 0b830000 00083000 C:\Program Files\BOINC\symsrv.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : symsrv.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5

ModLoad: 0b8c0000 0003a000 C:\Program Files\BOINC\srcsrv.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : srcsrv.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5

ModLoad: 77bd0000 00008000 C:\WINDOWS\system32\version.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 4199, Write: 0, Other 4119

- I/O Transfers Counters -
Read: 0, Write: 283156, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 29464, QuotaPeakPagedPoolUsage: 29484
QuotaNonPagedPoolUsage: 3856, QuotaPeakNonPagedPoolUsage: 5104

- Virtual Memory Usage -
VirtualSize: 288505856, PeakVirtualSize: 294109184

- Pagefile Usage -
PagefileUsage: 177410048, PeakPagefileUsage: 180875264

- Working Set Size -
WorkingSetSize: 44548096, PeakWorkingSetSize: 142151680, PageFaultCount: 4153040

*** Dump of thread ID 1256 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 929636736.000000, User Time: 118402555904.000000, Wait Time: 1696694.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024

- Registers -
eax=097646c8 ebx=097646cc ecx=038ffe20 edx=038ffe20 esi=097646a0 edi=00000000
eip=0055b8c1 esp=0012c02c ebp=0ab9f938
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010202

- Callstack -
ChildEBP RetAddr Args to Child
0012c048 0061138e 00000000 fa25f9aa 0b235e50 097646a0 minirosetta_1.54_windows_intelx!+0x0
0012c068 006113fe 0b235e50 fa25f98a 00000001 097646a0 minirosetta_1.54_windows_intelx!+0x0
00000000 00000000 00000000 00000000 00000000 00000000 minirosetta_1.54_windows_intelx!+0x0

*** Dump of thread ID 672 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 1902736.000000, User Time: 6909936.000000, Wait Time: 1696719.000000

- Registers -
eax=0164fb44 ebx=00000000 ecx=fa3739f2 edx=00000000 esi=00000000 edi=0164ff70
eip=7c92e4f4 esp=0164ff40 ebp=0164ff98
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
0164ff3c 7c92d1fc 7c8023f1 00000000 0164ff70 00000000 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0164ff40 7c8023f1 00000000 0164ff70 00000000 7c802446 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
0164ff98 7c802455 00000064 00000000 0164ffec 00411a7b kernel32!_SleepEx@8+0x0
0164ffa8 00411a7b 00000064 00000000 7c80b713 00000000 kernel32!_Sleep@4+0x0
0164ffec 00000000 00411a70 00000000 00000000 2f73fcd8 minirosetta_1.54_windows_intelx!+0x0

*** Dump of thread ID 1808 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 100144.000000, User Time: 0.000000, Wait Time: 1696642.000000

- Registers -
eax=0272fe28 ebx=021c4a01 ecx=0272e734 edx=00001f9a esi=00000000 edi=0272fdf8
eip=7c92e4f4 esp=0272fdc8 ebp=0272fe20
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
0272fdc4 7c92d1fc 7c8023f1 00000000 0272fdf8 00000122 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0272fdc8 7c8023f1 00000000 0272fdf8 00000122 09778748 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
0272fe20 7c802455 000007d0 00000000 7c802446 0079aa61 kernel32!_SleepEx@8+0x0
0272fe30 0079aa61 000007d0 f845c7b2 0012bfe0 021c4a38 kernel32!_Sleep@4+0x0
0272fe38 f845c7b2 0012bfe0 021c4a38 0272ff6c 021c4a38 minirosetta_1.54_windows_intelx!+0x0
0272fe3c 0012bfe0 021c4a38 0272ff6c 021c4a38 00000001 minirosetta_1.54_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'f845c7b2'
0272ff3c 7c937de9 7c937ea0 7c800000 0272ff7c 00000000 minirosetta_1.54_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '0012bfe0'
0272ffe0 7c80b71f 00000000 00000000 00000000 0041eb46 ntdll!_LdrpGetProcedureAddress@20+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c937de9'
0272ffe4 00000000 00000000 00000000 0041eb46 021c4a38 kernel32!_BaseThreadStart@8+0x0 FPO: [0,0,0] SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c80b71f'


*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 32.9406239634204
Granted credit 0
application version 1.54

The other one (workunit lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652) spent less than half an hour on my computer, but no error message was shown when it crashed. I was also using Mozilla Firefox 3.0 at the time; strangely, Firefox did not close unexpectedly at the same time. The task details follow:
Task ID 236172160
Name lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652_1
Workunit 215347031
Created 17 Mar 2009 8:05:59 UTC
Sent 17 Mar 2009 8:07:24 UTC
Received 20 Mar 2009 17:36:16 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 224205
Report deadline 27 Mar 2009 8:07:24 UTC
CPU time 1436.896
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 3-21 1: 5:10:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/mtyka_lr5_D_score12.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/mtyka_lr5_D_score12.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_2hsb.out.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/lr5_2hsb.out.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Initializing score function:
Initializing relax mover:
Starting protocol...
Silent Output Mode
Jobdist startup..
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- S_00002_0000216_0_test_6.0.out
Fullatom mode ..
# cpu_run_time_pref: 21600


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 03/21/09 01:34:14
Install Directory : C:\Program Files\BOINC\
Data Directory : C:\Program Files\BOINC
Project Symstore :
LoadLibraryA( C:\Program Files\BOINC\\dbghelp.dll ): GetLastError = 1455
LoadLibraryA( dbghelp.dll ): GetLastError = 1455
*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 10715, Write: 0, Other 3493

- I/O Transfers Counters -
Read: 0, Write: 200794, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 29464, QuotaPeakPagedPoolUsage: 29464
QuotaNonPagedPoolUsage: 4416, QuotaPeakNonPagedPoolUsage: 5664

- Virtual Memory Usage -
VirtualSize: 288079872, PeakVirtualSize: 296271872

- Pagefile Usage -
PagefileUsage: 192016384, PeakPagefileUsage: 208936960

- Working Set Size -
WorkingSetSize: 136130560, PeakWorkingSetSize: 213221376, PageFaultCount: 366777

*** Dump of thread ID 1164 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 93334208.000000, User Time: 14287143936.000000, Wait Time: 2525130.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024


*** Dump of thread ID 3344 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 300432.000000, User Time: 300432.000000, Wait Time: 2525124.000000


*** Dump of thread ID 2416 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 100144.000000, Wait Time: 2524973.000000



*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 2.76822531352161
Granted credit 2.76822531352161
application version 1.54

Another computer that ran this workunit also reported a compute error. Its message follows:
Task ID 236168980
Name lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652_0
Workunit 215347031
Created 17 Mar 2009 7:49:09 UTC
Sent 17 Mar 2009 7:50:56 UTC
Received 17 Mar 2009 8:05:56 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffff47)
Computer ID 868926
Report deadline 27 Mar 2009 7:50:56 UTC
CPU time 0
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
Input file minirosetta_1.54_windows_intelx86.exe missing or invalid: -163
</message>
]]>

Validate state Invalid
Claimed credit 0
Granted credit 0
application version 1.54
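
The exit statuses in these task details are shown both as signed decimals and as hex. A minimal sketch of how the two forms relate, assuming the status is a 32-bit two's-complement value (the helper name is made up for illustration and is not part of BOINC):

```python
def exit_status_to_hex(status: int) -> str:
    """Render a signed 32-bit exit status as unsigned hex."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned.
    return f"0x{status & 0xFFFFFFFF:08x}"


print(exit_status_to_hex(-1073741819))  # 0xc0000005
print(exit_status_to_hex(-185))         # 0xffffff47
```

The first value matches the access violation code in the debugger dump; the second matches the hex shown for the task whose executable was reported missing or invalid.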
____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60265 - Posted 22 Mar 2009 4:00:06 UTC - in response to Message ID 60260.
Last modified: 22 Mar 2009 4:13:49 UTC

hi!
as already posted here: http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4771 on my pc (this one here): http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1012657
i get lots of validate errors and several client errors (too many to link each of them here). the usual symptom is greatly increased processing time. the work units should run 3 hours, but they run for 7 hours.

when i notice that a work unit takes much too long, should i abort it? or let it run until it fails to validate after 7 hours?


I can only tell you that the v1.54 mini version now includes code both to end such tasks sooner, and to report information useful to help determine why those models are running so long. Prior to these enhancements, the watchdog would wait until the task ran for 3 or 4 times longer than the runtime preference, and the results when such a watchdog end was made were not as useful in studying what occurred.

I've been asking why such tasks are not receiving credit from the nightly credit granting script, but have not yet received any word.
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60268 - Posted 22 Mar 2009 7:38:05 UTC

I just tried resetting the Rosetta@home project and got these error messages (with no Rosetta@home workunit running, none downloaded but not run, and the last one already reported):

3/22/2009 2:03:10 AM|rosetta@home|Resetting project
3/22/2009 2:03:16 AM|rosetta@home|[error] Couldn't delete file projects/boinc.bakerlab.org_rosetta/minirosetta_1.54_windows_intelx86.exe

Attempts to delete the file manually also failed, with error messages about being unable to move it to the deleted items folder.

I currently have Rosetta@home on no new tasks, to keep it this way until you can give me some usable advice about how to finish the reset.

I run BOINC 6.2.28 under Vista SP1.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 60275 - Posted 22 Mar 2009 18:43:38 UTC
Last modified: 22 Mar 2009 18:45:12 UTC

this task 5croA_BOINC_ABINITIO_IGNORE_THE_REST-MOO56-S25-11-S3-13--5croA-_7876_63 crashed on 2 computers and did not reply on another.

I got a validate error, another person got a compute error and the third never replied with the task error or completion.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60279 - Posted 23 Mar 2009 0:48:30 UTC

robertmiles
Sounds like a reboot is in order to clear all of the locks. I've never heard of that happening before. Perhaps something like anti-virus software has taken a lock on the file to perform a scan?

Curious, why were you resetting the project?
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60285 - Posted 23 Mar 2009 14:43:41 UTC - in response to Message ID 60279.

robertmiles
Sounds like a reboot is in order to clear all of the locks. I've never heard of that happening before. Perhaps something like anti-virus software has taken a lock on the file to perform a scan?

Curious, why were you resetting the project?


A reboot may have helped - it was part of the procedure I described trying over on Ralph@home, and was able to remove the lockfiles for a while.

I was resetting the project because that's what the error messages from the lockfile problem suggest I may need to do. However, it doesn't seem to have helped enough, since the first Rosetta@home workunit my machine completed since the reset had the lockfile problem again:

http://boinc.bakerlab.org/rosetta/result.php?resultid=237629070

Two more Rosetta@home workunits that started later aren't finished, but at least don't seem to have run into the lockfile problem yet.

My antivirus program, and also my three antispyware programs, are able to finish scanning a file in much less time than it needs for Rosetta@home and Ralph@home workunits to fail due to too many restarts from a lockfile problem, so I'd expect a lock from any of them to cause lockfile error messages for only a short time, followed by a successful minirosetta restart.

A suggestion - modify minirosetta to check for the lockfile as it starts up (preferably before any effort to create one), report the results of this check if it can, and if this first check for the lockfile finds one, don't waste as much time restarting over and over before declaring the workunit failed.

Another suggestion - modify minirosetta to report which slot it ran in, since the problem looks like it may be specific to workunits assigned to specific slots, due to what looks like its inability to remove lockfiles left by previous workunits assigned to the same slot but already completed since the last reboot.
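
The two suggestions above can be sketched roughly as follows. This is only an illustration under stated assumptions: the lock is modeled as a plain file named boinc_lockfile in the slot directory, and the function names and report fields are hypothetical. The real BOINC locking mechanism may work differently.

```python
import os

LOCKFILE_NAME = "boinc_lockfile"  # assumed name; treated here as a plain marker file


def try_acquire_lock(slot_dir: str) -> bool:
    """Return True if we created the lockfile, False if one already exists."""
    path = os.path.join(slot_dir, LOCKFILE_NAME)
    try:
        # O_EXCL makes creation fail if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def startup_check(slot_dir: str, max_attempts: int = 3) -> dict:
    """Check for a leftover lockfile first, report the slot, and fail fast."""
    present = os.path.exists(os.path.join(slot_dir, LOCKFILE_NAME))
    report = {
        "slot": os.path.basename(os.path.normpath(slot_dir)),
        "lockfile_present_at_start": present,
        "attempts": 0,
        "acquired": False,
    }
    if present:
        # A lockfile found before we even tried suggests a stale one,
        # so do not spend many restarts on it.
        max_attempts = 1
    for attempt in range(1, max_attempts + 1):
        report["attempts"] = attempt
        if try_acquire_lock(slot_dir):
            report["acquired"] = True
            break
    return report
```

Writing the returned report to stderr would show both which slot was used and whether the lockfile was already there at startup.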

I leave BOINC running nearly 24 hours a day, often days between reboots, which may have something to do with why I'm seeing the lockfile problem as often as I do.

I'm still using BOINC 6.2.28 under 32-bit Vista SP1.

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 60287 - Posted 23 Mar 2009 17:51:00 UTC

This one crashed on me, I aborted it.

http://boinc.bakerlab.org/rosetta/result.php?resultid=236816502
____________

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 60294 - Posted 24 Mar 2009 6:00:33 UTC - in response to Message ID 60285.


You might be interested in this announcement by Bernd over at Einstein@home. He has made an Einstein Windows app specifically to collect more info on the CPU throttling=too many exits/can't acquire lockfile errors. Hopefully his discoveries will prove useful here on rosetta@home as well.

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60300 - Posted 24 Mar 2009 14:56:45 UTC
Last modified: 24 Mar 2009 14:58:33 UTC

This task is currently using 496MB on my machine. Max was 536MB. It is called 2P09A_BOINC_MPZN_vanilla_abrelax_9106_6681_0

What is the status now that the minimum recommended memory is 512MB? Are there still WUs created that will only go to systems with more? My machine has 2GB, but I was wondering if this task is using more than planned.

That task seems to be running normally otherwise. It is 22hrs into my 24hr preference.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60319 - Posted 25 Mar 2009 15:28:05 UTC

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338

Task ID:237330352

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60337 - Posted 27 Mar 2009 3:16:16 UTC

A workunit that ran for a while, then ran into the lockfile problem:

http://boinc.bakerlab.org/rosetta/result.php?resultid=238431267

Two of the five subdirectories under the slots directory contain a large number of files, and appear to be for the two workunits now in progress. Two are empty.

The other subdirectory contains only 3 files, and appears to be left over from this failed workunit.

File boinc_lockfile appears to be empty, since its size is zero. It's marked as still in use, though, so I can't check this.

The contents of stderr.txt start with this:

BOINC:: Initializing ... ok.
[2009- 3-25 22:55: 2:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _U9X3X_00001
# cpu_run_time_pref: 43200
Starting work on structure: _U9X3X_00002
Starting work on structure: _U9X3X_00003
Starting work on structure: _U9X3X_00004
Starting work on structure: _U9X3X_00005
Starting work on structure: _U9X3X_00006
Starting work on structure: _U9X3X_00007
Starting work on structure: _U9X3X_00008
Starting work on structure: _U9X3X_00009
Starting work on structure: _U9X3X_00010
Starting work on structure: _U9X3X_00011
Starting work on structure: _U9X3X_00012
Starting work on structure: _U9X3X_00013
Starting work on structure: _U9X3X_00014
Starting work on structure: _U9X3X_00015
Starting work on structure: _U9X3X_00016
Starting work on structure: _U9X3X_00017
Starting work on structure: _U9X3X_00018
Starting work on structure: _U9X3X_00019
Starting work on structure: _U9X3X_00020
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting


The contents of stdout.txt are:

Created shared memory segment
Created semaphore


Do these results mean that Rosetta@home never tries to clear up these three files for failed workunits? Should it? They appear to prevent any workunits from Rosetta@home or Ralph@home from being able to run in this slot until the next reboot - often meaning a few days for me. I haven't seen them have a similar effect on workunits from other BOINC projects, though.

Hammeh

Joined: Nov 11 08
Posts: 63
ID: 287579
Credit: 211,283
RAC: 0
Message 60341 - Posted 27 Mar 2009 19:50:11 UTC
Last modified: 27 Mar 2009 19:50:29 UTC

Can anyone shed some light on this WU? I just started crunching for Rosetta, and it didn't report any client-side errors.

217630163

Thanks

Chilean

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,238,180
RAC: 4,709
Message 60343 - Posted 27 Mar 2009 20:08:33 UTC

Is your CPU overclocked?
____________

Hammeh

Joined: Nov 11 08
Posts: 63
ID: 287579
Credit: 211,283
RAC: 0
Message 60344 - Posted 27 Mar 2009 20:32:06 UTC

Nope. Here is some system info:
AMD Phenom x4 9600 (not overclocked)
3GB RAM
Windows Vista Home Premium 32-bit
BOINC version 6.4.7

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 60388 - Posted 30 Mar 2009 16:54:44 UTC

Validate error on this workunit 218443282 on Mac.

cc_natcst_1_8_nocstinrelax_hb_t327__IGNORE_THE_REST_2FSWA_7_9505_20_1

An unlikely 99 decoys from 99 attempts: a wingman had the same problem.

Starting work on structure: _2FSWA_7_00098
Starting work on structure: _2FSWA_7_00099
======================================================
DONE :: 1 starting structures 145.451 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>


____________

Paul D. Buck

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60418 - Posted 31 Mar 2009 17:11:41 UTC

Too many restarts error

LizzieBarry

Joined: Feb 25 08
Posts: 76
ID: 243949
Credit: 201,862
RAC: 0
Message 60430 - Posted 1 Apr 2009 13:39:06 UTC
Last modified: 1 Apr 2009 13:45:56 UTC

frb_1_8_bestfrag_hb_t313___IGNORE_THE_REST_1F9TA_5_9696_15_0

7 hours running (3hr default), no decoys, Validate Error.

I've been noticing these "frb" WUs are singularly unsuccessful. What are the stats on their successful completion? I'd say they were minimal.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 60432 - Posted 1 Apr 2009 18:18:53 UTC - in response to Message ID 60430.

frb_1_8_bestfrag_hb_t313___IGNORE_THE_REST_1F9TA_5_9696_15_0

7 hours running (3hr default), no decoys, Validate Error.

I've been noticing these "frb" WUs are singularly unsuccessful. What are the stats on their successful completion? I'd say they were minimal.

Oh, I don't know...

frb_1_8_ecut_hb_t322___IGNORE_THE_REST_1VPMA_12_9712_12_0

# cpu_run_time_pref: 14400
CPU time 14099.2

Claimed credit 69.0173659142213
Granted credit 229.296476006251

No complaints here!!! :)
____________

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 60460 - Posted 2 Apr 2009 17:46:57 UTC
Last modified: 2 Apr 2009 17:50:13 UTC

Success & Error on the same WU

Hello all,

This WU:
frb_0_8_el_chosen_hb_t312___IGNORE_THE_REST_1XV2A_15_9667_54_0

has officially been reported as: Outcome = Success.
However the WU ran only for 4309.559 seconds, cpu_run_time_pref: 21600 and
ended with an error:
Starting work on structure: _1XV2A_15_00008
interpolate rotamers bin out of range: GLN -107.207 180 -7e-005 -6.1e-005 -5.1e-005
34 36 8 9 37 2 0.2793 0
ERROR:: Exit from: d:\boinc_build\minirosetta_windows\mini\src\core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
called boinc_finish

Have a nice day,
Path7.

Murasaki

Joined: Apr 20 06
Posts: 303
ID: 78284
Credit: 365,375
RAC: 94
Message 60461 - Posted 2 Apr 2009 19:25:31 UTC

Another WU with 99 successful decoys

ala_2he4_p40-1.ala.ppk_dock_random.xml_RANDOM12_BOUND_DOCK_9895_843_0

# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 6841.62 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================

My preferred run time is 6 hours, but this one completed in less than 2. Either this is an extremely quick model or something odd occurred.
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 60465 - Posted 3 Apr 2009 12:03:34 UTC - in response to Message ID 60461.

Another WU with 99 successful decoys

ala_2he4_p40-1.ala.ppk_dock_random.xml_RANDOM12_BOUND_DOCK_9895_843_0

My preferred run time is 6 hours, but this one completed in less than 2. Either this is an extremely quick model or something odd occurred.


This looks normal; I am getting through them at a rate of about 1.17 minutes per model. If my calculations are correct, you are 0.02 minutes faster per model.
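
The per-model figure can be checked from the numbers quoted in the log above (6841.62 CPU seconds for 99 decoys). A quick sketch of that arithmetic, with variable names chosen only for illustration:

```python
cpu_seconds = 6841.62      # "DONE :: 1 starting structures 6841.62 cpu seconds"
decoys = 99                # "generated 99 decoys from 99 attempts"
my_minutes_per_model = 1.17

# Convert total CPU seconds into minutes spent per decoy.
minutes_per_model = cpu_seconds / decoys / 60
difference = my_minutes_per_model - minutes_per_model

print(round(minutes_per_model, 2), round(difference, 2))  # 1.15 0.02
```

So roughly 1.15 minutes per model, which is consistent with a normal run that simply reached the 99-decoy limit well before the 6-hour preference.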

____________

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 60495 - Posted 5 Apr 2009 7:50:26 UTC

first error in a long time!
ran 100% and had a compute error at the end
abinitio_nohomfrag_129_B_1o73A_SAVE_ALL_OUT_7581_8721_1
Exit status -1073741819 (0xc0000005)
CPU time 11314.84
Starting work on structure: _U9X3X_00001
# cpu_run_time_pref: 14400
Starting work on structure: _U9X3X_00002
Starting work on structure: _U9X3X_00003
Starting work on structure: _U9X3X_00004
Starting work on structure: _U9X3X_00005
Starting work on structure: _U9X3X_00006
Starting work on structure: _U9X3X_00007
Starting work on structure: _U9X3X_00008
Starting work on structure: _U9X3X_00009
Starting work on structure: _U9X3X_00010


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00587042 write attempt to address 0x34A2BAB7

Engaging BOINC Windows Runtime Debugger...

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 60514 - Posted 6 Apr 2009 12:47:14 UTC - in response to Message ID 60432.

frb_1_8_bestfrag_hb_t313___IGNORE_THE_REST_1F9TA_5_9696_15_0

7 hours running (3hr default), no decoys, Validate Error.

I've been noticing these "frb" WUs are singularly unsuccessful. What are the stats on their successful completion? I'd say they were minimal.

Oh, I don't know...

frb_1_8_ecut_hb_t322___IGNORE_THE_REST_1VPMA_12_9712_12_0

# cpu_run_time_pref: 14400
CPU time 14099.2

Claimed credit 69.0173659142213
Granted credit 229.296476006251

No complaints here!!! :)

I spoke too soon...

frb_0_8_template_enriched_hb_t313___IGNORE_THE_REST_1CZ7A_7_9682_18_1

# cpu_run_time_pref: 14400
CPU time 17744.52 [1 decoy]

Claimed credit 86.8616680245843
Granted credit 9.36388194088631

:(
____________

Feet1st

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60524 - Posted 6 Apr 2009 22:46:11 UTC

This one cut off after a clean exit of BOINC and a reboot to install a MS fix. What wasn't clean was the restart. I forgot BOINC was in my Win startup folder and so ended up starting two of them. I then ended both, and 61 seconds after starting again, this task was ended. No messages, just that it finished. But it should have run another couple of hours.
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 60525 - Posted 7 Apr 2009 2:32:23 UTC

Task 241419982 failed on Mac: see below. Oddly, it then went out to someone on a Linux machine and completed fine.

Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>


____________

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 107,923
RAC: 0
Message 60526 - Posted 7 Apr 2009 4:45:16 UTC

Yet another task has stopped crunching, showing "Accepted Energy:1.#QNAN" and "Accepted RMSD:1.#QQ".
It is at 39.50% complete; Model: 11, Step: 7788. I have now suspended the task.

I can create a dump file. Should I?

Or is it already fixed in the next version?
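
For reference, "1.#QNAN" is how older Microsoft C runtimes print a quiet NaN (not a number), so the displayed energy is no longer a valid value. A rough sketch of how such readings could be flagged automatically; the function names are hypothetical, and treating any "#" marker as a non-finite value is an assumption made for this example:

```python
import math


def parse_reading(text: str) -> float:
    """Parse a displayed value; '#'-style strings are treated as NaN (assumption)."""
    if "#" in text:
        return math.nan
    return float(text)


def model_is_stuck(energy_text: str, rmsd_text: str) -> bool:
    """Flag a model whose energy or RMSD is no longer a finite number."""
    return any(
        not math.isfinite(parse_reading(t)) for t in (energy_text, rmsd_text)
    )


print(model_is_stuck("1.#QNAN", "1.#QQ"))  # True
print(model_is_stuck("-123.4", "2.5"))     # False
```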

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 60528 - Posted 7 Apr 2009 9:00:35 UTC

Error with this one 240746159


ERROR: in::file::boinc_wu_zip fragments_2hkv.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 108
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60529 - Posted 7 Apr 2009 12:07:58 UTC

Klimax, why don't you go ahead and take a dump and email it to me, along with details on what you observed with it as it ran. I will forward it to the Project Team.
____________
Rosetta Moderator: Mod.Sense

jswolf19

Joined: Apr 3 09
Posts: 3
ID: 309533
Credit: 824,515
RAC: 814
Message 60530 - Posted 7 Apr 2009 13:21:23 UTC

I'm also having an issue with no progress. Rosetta Beta runs fine, but Rosetta Mini (1.54) never registers any progress even after clocking hours of CPU time (the process I just aborted had clocked almost 17 hours). I have an Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (WinXP Professional SP3). It also won't switch out, so it never frees up a core for another BOINC (v6.4.7) process to run.

Klimax

Joined: Apr 27 07
Posts: 29
ID: 170261
Credit: 107,923
RAC: 0
Message 60540 - Posted 7 Apr 2009 19:26:17 UTC - in response to Message ID 60529.

Klimax, why don't you go ahead and take a dump and email it to me, along with details on what you observed with it as it ran. I will forward it to the Project Team.

Oops, didn't know :-(
Last time I reported it, I was told to let it finish and upload (IIRC).
Mail is being prepared.

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60554 - Posted 8 Apr 2009 16:02:15 UTC
Last modified: 8 Apr 2009 16:02:56 UTC

I have this error :

http://boinc.bakerlab.org/rosetta/result.php?resultid=240721682

any tips?

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60560 - Posted 9 Apr 2009 1:12:27 UTC - in response to Message ID 60554.

I have this error :

http://boinc.bakerlab.org/rosetta/result.php?resultid=240721682

any tips?


Looks like you've hit one of the errors still in 1.54 because it's too uncommon to debug quickly. Let's hope your results for that workunit help them finally debug it.

Looking at the rest of the jobs your machine has been working on lately, I'd say that you have a lower frequency of errors than I do because you've set up your machine well for aiming at a high score (probably selecting Rosetta@home as your only BOINC project on that machine, selecting leave in memory, and running at 100% CPU usage). I'm deliberately choosing settings aimed at helping debug problems with the program: giving other BOINC projects enough computer time that Rosetta@home workunits are unlikely to complete without being interrupted to give workunits from other projects a turn, and running at 95% CPU usage, although with leave in memory selected. However, is there any good reason for maintaining such a long queue of jobs waiting for your machine to choose them next, and therefore delaying any work at the Rosetta@home end on your results?

I can't tell if you've also tried a few other things I've also found good for getting a high score, such as:

1. Selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics.

2. If you see the lockfile problem in your results, suspend all projects, reboot the machine to clear any lockfiles left behind by failed workunits, then resume the projects.

3. Running the machine 24 hours a day, except when shutting BOINC down for Windows updates or other updates, running antivirus programs, running antispyware programs, and any needed reboots.

4. If you happen to need some update that doesn't require a reboot, such as most Windows Defender updates, only tell BOINC to suspend all jobs while you install the update, instead of shutting it down completely; then resume the projects after the update completes.

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60561 - Posted 9 Apr 2009 5:47:08 UTC

I'm interested to know how selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics, helps increase a work unit's score. Thanks in advance.
____________
Have a crunching good day!!

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60562 - Posted 9 Apr 2009 6:43:02 UTC - in response to Message ID 60560.
Last modified: 9 Apr 2009 6:47:09 UTC

I have this error :

http://boinc.bakerlab.org/rosetta/result.php?resultid=240721682

any tips?


Looks like you've hit one of the errors still in 1.54 because it's too uncommon to debug quickly. Let's hope your results for that workunit help them finally debug it.

Looking at the rest of the jobs your machine has been working on lately, I'd say that you have a lower frequency of errors than I do because you've set up your machine well for aiming at a high score (probably selecting Rosetta@home as your only BOINC project on that machine, selecting leave in memory, and running at 100% CPU usage). I'm deliberately choosing settings aimed at helping debug problems with the program: giving other BOINC projects enough computer time that Rosetta@home workunits are unlikely to complete without being interrupted to give workunits from other projects a turn, and running at 95% CPU usage, although with leave in memory selected. However, is there any good reason for maintaining such a long queue of jobs waiting for your machine to choose them next, and therefore delaying any work at the Rosetta@home end on your results?

I can't tell if you've also tried a few other things I've also found good for getting a high score, such as:

1. Selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics.

2. If you see the lockfile problem in your results, suspend all projects, reboot the machine to clear any lockfiles left behind by failed workunits, then resume the projects.

3. Running the machine 24 hours a day, except when shutting BOINC down for Windows updates or other updates, running antivirus programs, running antispyware programs, and any needed reboots.

4. If you happen to need some update that doesn't require a reboot, such as most Windows Defender updates, only tell BOINC to suspend all jobs while you install the update, instead of shutting it down completely; then resume the projects after the update completes.


Thanks for the reply.

If you look at the scores I achieve for WUs on the host which produced the error, I must tell you 2 important things:
1. The computer originally had a Q6600@3200. On 7 Apr 09 I replaced this CPU with a Q9550@3600. So it is safe to say that credits from 6 Apr 09 and older represent the Q6600, and those from 8 Apr 09 and newer represent the Q9550.
2. I am crunching Rosetta@home on all 4 cores, with GPUGRID on my GTX260. So in reality I run 5 threads under BOINC.

Also:
Ad 1. I don't use the BOINC screen saver, only the Windows logo screen saver, on my CRT NEC 2111SB.
Ad 2. I sometimes suspend to play some games....
Ad 3. I must shut down my PC for the night because it is too loud for me, so it usually crunches from 10 a.m. to 11-12 p.m.
Ad 4. Rosetta@home is very GUI friendly because there is no slowdown in the interface. GPUGRID is a real horror in that matter...
Running at 100% CPU usage is also set.
The leave in memory option was not selected, but today I selected it. I will see what happens :)

Also, I work in 32-bit XP with 2x2 GB CL4 DDR2 423 (846).

dcdc Profile

Joined: Nov 3 05
Posts: 1596
ID: 8948
Credit: 33,801,861
RAC: 17,327
Message 60563 - Posted 9 Apr 2009 7:36:08 UTC - in response to Message ID 60561.

I'm interested to know how selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics,
helps increase a work unit's score? Thanks in advance.

Just because your computer then doesn't have to do the computation for the graphics thread too.
____________

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60564 - Posted 9 Apr 2009 7:42:01 UTC - in response to Message ID 60563.

I'm interested to know how selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics,
helps increase a work unit's score?

Just because your computer then doesn't have to do the computation for the graphics thread too.

OK thanks
____________
Have a crunching good day!!

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60569 - Posted 9 Apr 2009 12:41:10 UTC - in response to Message ID 60561.

I'm interested to know how selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics,
helps increase a work unit's score? Thanks in advance.


Selecting a black screen, which only needs to be calculated once, cuts down on CPU time needed to calculate the graphics, and lets more of what's available be used for the scientific calculations. Since Rosetta@home uses the number of decoys produced as a more important factor in calculating how much credit to give you than the CPU time required to do it, this is likely to increase the number of decoys your computer produces for that workunit, and therefore the resulting score.

Also, something involving the graphics seems to be able to trigger the lockfile problem for a workunit, with the results then returned marked as invalid and therefore worth a score of zero. Once a lockfile problem occurs, 1.54 seems to be unable to erase the lockfile from the slot used by that workunit, and therefore lets the problems spread to any 1.54 workunits run later in the same slot but before the next reboot. My results for Ralph@home indicate that the 1.58 now being tested there has kept this same problem, and therefore needs more testing before the 1.54 used at Rosetta@home is replaced with a newer version.
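
For anyone who wants to check whether a lockfile has been left behind before rebooting, here is a minimal Python sketch. It assumes the lockfile is named boinc_lockfile and sits inside the numbered slot directories under the BOINC data directory; both the file name and the example path are assumptions to check against your own installation, and find_lockfiles is a made-up helper name. It only lists files and never deletes anything. A lockfile in the slot of a task that is currently running is expected, so only look at the output while all projects are suspended or BOINC is shut down.

```python
from pathlib import Path


def find_lockfiles(boinc_data_dir, lockfile_name="boinc_lockfile"):
    """Return the lockfiles found in the slot directories, sorted by path.

    boinc_data_dir and lockfile_name are assumptions about the local
    installation; adjust them to match your machine.
    """
    slots = Path(boinc_data_dir) / "slots"
    if not slots.is_dir():
        # No slots directory at this location: nothing to report.
        return []
    return sorted(slots.glob(f"*/{lockfile_name}"))


# Example path only (an assumption); replace with your own data directory.
for lockfile in find_lockfiles("C:/ProgramData/BOINC"):
    print(f"slot {lockfile.parent.name}: {lockfile}")
```

Reporting which slot numbers still hold a lockfile after a failed workunit, together with the operating system and BOINC version, may give the project more to go on than the invalid result alone.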

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 60570 - Posted 9 Apr 2009 12:54:21 UTC

Also, something involving the graphics seems to be able to trigger the lockfile problem for a workunit, with the results then returned marked as invalid


I turn the graphics on and off several times during the course of the day to check on the performance and I haven't encountered this lockfile problem for a long time now on both Rosetta and Ralph.

Having said that, Murphy's Law states 'watch this space'!!
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60573 - Posted 9 Apr 2009 14:23:20 UTC - in response to Message ID 60570.

Also, something involving the graphics seems to be able to trigger the lockfile problem for a workunit, with the results then returned marked as invalid


I turn the graphics on and off several times during the course of the day to check on the performance and I haven't encountered this lockfile problem for a long time now on both Rosetta and Ralph.

Having said that, Murphy's Law states 'watch this space'!!


The lockfile problem results could vary depending on what operating system version and what BOINC version is used; if so, my results could easily apply only when using BOINC 6.2.28 under Vista SP1. In other words, I suspect that results from just the two of us aren't enough; we need more people with access to other operating system versions and more versions of BOINC to test for graphics causing the lockfile problem and report the results, along with which operating system version and which BOINC version was used.

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60591 - Posted 10 Apr 2009 6:12:14 UTC

New error:

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out

For this task

The same task run on an XP machine ran for a long time and only failed on validate. Which is kind of interesting, it almost seems as if my machine (OS-X) tipped over on an assertion or parameter file error ... what is the difference in OS platform guys ...

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 60599 - Posted 10 Apr 2009 22:27:45 UTC

I have a 10-hour preferred runtime on my MacBook Pro. I spotted lb_all_multi_threshold.2.0_hb_t317__IGNORE_THE_REST_1I9SA_12_10355_4_0
still running at 10 hours and 20 minutes, so I opened the graphics window to check on it. It was on model 33, step 1920, stage unk. Checking on it later, it had run another CPU hour but failed to make any progress, so I shut down BOINC completely and restarted. It now showed 5 hours and 20 minutes of CPU time consumed, all the rest the same. Within a few seconds it returned to step 0 and apparently restarted model 33 from the beginning. I didn't catch exactly when it reached step 1920, but it would have been about 3-4 CPU minutes after the restart. It didn't get stuck this time but continued on its merry way. It had also moved out of the unk stage by the time I glanced at it 4+ minutes after restart. It has now finished successfully and validated with 58 models completed in 10 (non-stuck) hours.

Hope this helps.

Snags

jswolf19

Joined: Apr 3 09
Posts: 3
ID: 309533
Credit: 824,515
RAC: 814
Message 60605 - Posted 11 Apr 2009 14:08:57 UTC

I was looking through the RALPH minirosetta v1.54 bug thread and found an issue about setting day-of-week overrides (http://ralph.bakerlab.org/forum_thread.php?id=432&nowrap=true#4590). I had some set on network usage; when I cleared them and restarted BOINC (which I upgraded to 6.6.20), I started registering progress on a minirosetta task, as well as seeing stderr output progress past:

Initializing options.... ok

This appears to have been the cause of my problem.

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60620 - Posted 14 Apr 2009 10:59:53 UTC

Another bug:

http://boinc.bakerlab.org/rosetta/result.php?resultid=242587798

Gavin Shaw Profile
Avatar

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 60638 - Posted 14 Apr 2009 23:09:36 UTC

While not exactly a bug, this morning I had a rather large upload file...

Task 243404526 had a 6.8MB file to upload. The task only ran for about 50 minutes and my preference is set to 4 hours. It did 99 decoys from 99 attempts.

Thought admin might want to know...

____________
Never surrender and never give up. In the darkest hour there is always hope.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60639 - Posted 15 Apr 2009 0:20:20 UTC

Wow, good thing the watchdog only lets 99 models run. Just imagine how large it would have been with a 4 hour run!
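
As a rough back-of-the-envelope sketch of that, using the figures reported for task 243404526 (6.8MB, 99 decoys, about 50 minutes) and assuming that upload size grows linearly with the number of models and that models complete at a steady rate (neither assumption is confirmed by the project):

```python
# Figures reported in this thread for task 243404526.
upload_mb = 6.8        # size of the upload in MB
decoys = 99            # models completed when the 99-model limit stopped the task
runtime_min = 50       # approximate run time in minutes
target_min = 4 * 60    # the 4-hour runtime preference

mb_per_decoy = upload_mb / decoys

# Assumption: decoys accumulate at a constant rate and each one adds the
# same amount to the upload, so both quantities scale with run time.
projected_decoys = decoys * target_min / runtime_min
projected_mb = mb_per_decoy * projected_decoys

print(f"{mb_per_decoy * 1024:.0f} KB per decoy")
print(f"about {projected_decoys:.0f} decoys and {projected_mb:.1f} MB after 4 hours")
```

Under those assumptions an uncapped 4-hour run would have produced an upload several times larger than the one reported.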
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 60650 - Posted 15 Apr 2009 15:50:53 UTC

This task failed on Mac with an error in pairtermderiv that's been reported previously.

243548575

Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60651 - Posted 15 Apr 2009 16:41:26 UTC
Last modified: 15 Apr 2009 16:41:53 UTC

further notes to svincent's failed task

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out

...and only 198 seconds of runtime.
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60653 - Posted 15 Apr 2009 18:27:09 UTC
Last modified: 15 Apr 2009 18:27:52 UTC


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Alert about problem WUs.

Problem task names all begin with "res_careful_". For details on which proteins are known to have problems and should be aborted, and which will run OK and should be run normally, please see the link above.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

____________
Rosetta Moderator: Mod.Sense

l_mckeon

Joined: Jun 5 07
Posts: 44
ID: 182403
Credit: 180,717
RAC: 0
Message 60657 - Posted 15 Apr 2009 21:27:57 UTC

The following two tasks had shorter run times than usual (about 1:30 hrs and 1:50 hrs from memory) and their uploads totalled around 16MB.

rest3d85_ip40_1t4w.patchdock.6.pdb_0002_fa_dock.xml_score12_pert38_DOCK_10797_354_0_0
rest3d85_ip40_1t4w.patchdock.6.pdb_0002_fa_dock.xml_score12_pert38_DOCK_10797_354_0_0

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60659 - Posted 15 Apr 2009 22:45:20 UTC

l_mckeon, yes, those tasks hit the 99 model limit before reaching your normal runtime preference.
____________
Rosetta Moderator: Mod.Sense

Gavin Shaw Profile
Avatar

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 60660 - Posted 15 Apr 2009 23:29:56 UTC

Had another big one overnight.

Task 243710356 was another 6.8MB upload, again with 99 decoys.

Of course I have now seen a post about some problem with units, but it didn't help as the unit had already crunched :)

____________
Never surrender and never give up. In the darkest hour there is always hope.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 60661 - Posted 16 Apr 2009 0:14:14 UTC

Another Validation error with this job:

crys__BOINC_ABRELAX_R120G_CRYSTALLIN_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--crys_-_9344_11912_2

No errors reported within the Task Details of any of them.

Previous ones reported here and here.
____________

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60662 - Posted 16 Apr 2009 6:48:51 UTC

Looks like I might have gotten one of the problems:

ERROR: [ERROR] Unable to open constraints file: resample_outward0.05_ub0.1_lb0.02_median.t364_.cst
ERROR:: Exit from: ..\..\src\core\scoring\constraints\ConstraintIO.cc line: 330
BOINC:: Error reading and gzipping output datafile: default.out

task 243804881

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60678 - Posted 17 Apr 2009 3:21:20 UTC

This task http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658 made 99 decoys and the upload was about 7.14MB. Is this normal for these tasks?
____________
Have a crunching good day!!

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 60680 - Posted 17 Apr 2009 7:27:34 UTC - in response to Message ID 60678.

This task http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658 made 99 decoys and the upload was about 7.14MB. Is this normal for these tasks?


There is a limiter built into the program. It stops the crunching at 99 decoys.
This is normal.

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60685 - Posted 17 Apr 2009 9:12:28 UTC - in response to Message ID 60680.

This task http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658 made 99 decoys and the upload was about 7.14MB. Is this normal for these tasks?


There is a limiter built into the program. It stops the crunching at 99 decoys.
This is normal.

I'm aware of this, thanks. Greg_BE, I think you misunderstood the question. I was referring to the upload size of the work unit. Is 7.14MB the normal upload size for this type of work unit (http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658)?
____________
Have a crunching good day!!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60691 - Posted 17 Apr 2009 13:01:57 UTC

The more models completed, the larger the upload will be. The resulting increase in upload size is part of why mini put on the 99 model limit per task. So, it is normal, but will probably be reviewed and perhaps changed to run longer models in some way.
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60703 - Posted 17 Apr 2009 21:28:26 UTC

Validate error

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 60726 - Posted 19 Apr 2009 8:34:54 UTC
Last modified: 19 Apr 2009 8:41:14 UTC

This week's problems:

2 validate errors (non res_careful) and 3 res_careful errors (all listed as troubled from the res_careful thread)

rest3d85_ip40_2jkf.patchdock.25.pdb_0001_fa_dock.xml_score12_pert38_DOCK_10797_583_0
rest3d85_ip40_2v1l.patchdock.10.pdb_0001_fa_dock.xml_score12_pert38_DOCK_10797_499_0

no error message, just validate error.

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60728 - Posted 19 Apr 2009 10:18:57 UTC - in response to Message ID 60691.

The more models completed, the larger the upload will be. The resulting increase in upload size is part of why mini put on the 99 model limit per task. So, it is normal, but will probably be reviewed and perhaps changed to run longer models in some way.

Thank you for explaining.

____________
Have a crunching good day!!

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60746 - Posted 20 Apr 2009 5:57:52 UTC

Bug?

http://boinc.bakerlab.org/rosetta/result.php?resultid=243895936

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60747 - Posted 20 Apr 2009 10:23:59 UTC - in response to Message ID 60746.

BUG....
http://boinc.bakerlab.org/rosetta/result.php?resultid=244107786

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60753 - Posted 20 Apr 2009 15:28:29 UTC

TomaszPawel cites two cases where 99 models were completed in less than an hour with a 6.6.20 Win XP client, and resulted in a validate error from miniRosetta v1.54.

WU names
243895936 rest3d85_ip40_2oqk.patchdock.7.pdb_0003_fa_dock.xml_score12_pert38_DOCK_10797_652_0

244107786
rest3d85_ip40_2w4f.patchdock.1.pdb_0001_fa_dock.xml_score12_pert38_DOCK_10797_943_0
____________
Rosetta Moderator: Mod.Sense

Paul D. Buck Profile

Joined: Sep 17 05
Posts: 815
ID: 269
Credit: 1,812,737
RAC: 0
Message 60756 - Posted 20 Apr 2009 19:02:12 UTC

Validate error with 80 decoys, 10K seconds:

lb_all_multi_threshold.1.5_hb_t327__IGNORE_THE_REST_2F2EA_3_10393_1_2

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60793 - Posted 23 Apr 2009 11:47:57 UTC - in response to Message ID 60756.

http://boinc.bakerlab.org/rosetta/result.php?resultid=245909228

Reason: Divide by Zero (0xc000008e) at address 0x004E51A9

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60794 - Posted 23 Apr 2009 13:26:22 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=245909239

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 60797 - Posted 23 Apr 2009 20:09:13 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=245964014

William Kahler

Joined: Oct 26 06
Posts: 1
ID: 124989
Credit: 59,624
RAC: 64
Message 60814 - Posted 25 Apr 2009 1:09:39 UTC

MiniRosetta 1.54 constantly crashing after ~5 seconds
& (note to Bill G) w/Boinc 6.4.x & 6.6.x (Error Code 5).
It runs a little slow for first 5 seconds of CPU time
w/last stable Boinc 5.x & finishes ok.
No difference with protected app. or not.
Complete BOINC un/re-install & Rosetta de/re-attach no help.

Dell Core Duo 2 GHz w/2 Gig Ram.
WinXP Sp3 Home Edition (up to date).
24/7, no throttle, no graphics/screensaver, leave in memory.
Stand alone or with other projects.
Memtest x2/Prime95/Dell Diagnostics run fine.

thoughts? suggestions?


Gavin Shaw Profile
Avatar

Joined: Feb 1 07
Posts: 10
ID: 144828
Credit: 506,456
RAC: 0
Message 60817 - Posted 25 Apr 2009 7:49:42 UTC

And another big upload.

Task 246174559 ran for 4 hours with 82 decoys. File upload size was 8.9MB. Took a while to upload. Hate to see what it would have been if there were 99 decoys...

____________
Never surrender and never give up. In the darkest hour there is always hope.

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 60825 - Posted 25 Apr 2009 21:54:34 UTC

Hi there.

I got this on Ubuntu x64 this morning, haven't had any in a while.

That's 41min run time.

Docking_benchmark_unbound__1AVZ.unbound.mppk.pdb.gzdock_score12_hi.xml_11475_29_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=224594412

Over__Validate error__Done__2,496.64

======================================================
DONE :: 1 starting structures 2496.42 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================

pete.

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 60828 - Posted 26 Apr 2009 7:52:48 UTC

Hi me again.

This was a big one, 7.04MB result file for a six hour run.

Docking_benchmark_natives__1FIN.mppk.pdb.gzdock_score_docking_hi.xml_11477_209_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=224813196

======================================================
DONE :: 1 starting structures 21620.9 cpu seconds
This process generated 75 decoys from 75 attempts
======================================================
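
For comparing tasks like this one with the 41-minute run in the previous post, here is a small Python sketch that pulls the CPU time and decoy count out of the stderr summary and works out CPU seconds per decoy. The regular expressions are written only against the two summaries quoted in these posts, so treating them as covering every minirosetta summary is an assumption, and seconds_per_decoy is a made-up helper name.

```python
import re

# The two stderr summaries as posted in this thread.
stderr_samples = [
    "DONE :: 1 starting structures 2496.42 cpu seconds\n"
    "This process generated 99 decoys from 99 attempts",
    "DONE :: 1 starting structures 21620.9 cpu seconds\n"
    "This process generated 75 decoys from 75 attempts",
]

CPU_RE = re.compile(r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"generated (\d+) decoys from (\d+) attempts")


def seconds_per_decoy(text):
    """Return CPU seconds divided by decoys produced for one summary."""
    cpu_seconds = float(CPU_RE.search(text).group(1))
    decoys = int(DECOY_RE.search(text).group(1))
    return cpu_seconds / decoys


rates = [seconds_per_decoy(sample) for sample in stderr_samples]
for rate in rates:
    print(f"{rate:.1f} CPU seconds per decoy")
```

The first task averages roughly 25 CPU seconds per decoy and the second roughly 288, which is why the first one reached the 99-model limit so far ahead of the runtime preference.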

pete.

____________


Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 60836 - Posted 26 Apr 2009 23:34:12 UTC

Very few errors nowadays, but just came up with two compute errors:

Docking_benchmark_unbound__1ATN.unbound.mppk.pdb.gzdock_score_docking_hi.xml_11476_94_1
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005C1D7D read attempt to address 0xC49A08B0

res_careful_ourward_cst_chunk_0_8_hb_t342__IGNORE_THE_REST_1VKBA_5_10927_2_2
ERROR: [ERROR] Unable to open constraints file: resample_outward0.05_ub0.1_lb0.02_median.t342_.cst
ERROR:: Exit from: ..\..\src\core\scoring\constraints\ConstraintIO.cc line: 330
BOINC:: Error reading and gzipping output datafile: default.out


Running AMD9850 Vista64 8Gb RAM Boinc 6.6.20
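
For reading the hex codes in these Windows crash reports, here is a short Python lookup for the two codes that have shown up in this thread so far (0xc0000005 above and 0xc000008e in an earlier report). The names are the standard Windows NTSTATUS names; the table is deliberately limited to those two codes, and describe_status is a made-up helper name.

```python
# NTSTATUS codes seen in crash reports in this thread.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000008E: "STATUS_FLOAT_DIVIDE_BY_ZERO",
}


def describe_status(code_text):
    """Map a hex code string such as '0xc0000005' to its NTSTATUS name."""
    code = int(code_text, 16)
    return NTSTATUS_NAMES.get(code, "unknown (not in this small table)")


print(describe_status("0xc0000005"))
print(describe_status("0xc000008e"))
```

Note that 0xc000008e is the floating-point divide-by-zero code, which is distinct from the integer one.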
____________

Yaroslav Isakov

Joined: Nov 2 07
Posts: 11
ID: 217531
Credit: 98,027
RAC: 0
Message 60869 - Posted 28 Apr 2009 13:42:45 UTC

Hello, I have a problem: very long pending status in my last WUs:
1 2 3 4 5 6

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60870 - Posted 28 Apr 2009 14:38:46 UTC - in response to Message ID 60869.

Hello, I have a problem: very long pending status in my last WUs:
1 2 3 4 5 6


That would explain why credit has been dropping. The assimilator must be having a problem. I've EMailed the Project Team to look in to it when they arrive for the day in Seattle.
____________
Rosetta Moderator: Mod.Sense

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,545,746
RAC: 7,447
Message 60885 - Posted 29 Apr 2009 14:37:32 UTC - in response to Message ID 60870.
Last modified: 29 Apr 2009 14:38:12 UTC

Hello, I have a problem: very long pending status in my last WUs:
1 2 3 4 5 6

That would explain why credit has been dropping. The assimilator must be having a problem. I've EMailed the Project Team to look in to it when they arrive for the day in Seattle.

I'm assuming this is fixed now. 17 of my WUs have been allocated credit since the original post, but I have another 15 pending credit - 13 hours worth.

Just awaiting catch-up, I assume. The Server Status page is showing all systems 'Running'.

I also noticed credit was taking more than 4 minutes to come through in the days leading up to the outage, so the problem may've been building up for a few days.
____________

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60898 - Posted 29 Apr 2009 21:21:28 UTC
Last modified: 29 Apr 2009 21:23:04 UTC

BOINC v6.6.20 seems to be causing failures due to too many restarts.
http://boinc.bakerlab.org/rosetta/result.php?resultid=247095859
http://boinc.bakerlab.org/rosetta/result.php?resultid=246620233

It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.

I have a HT P4, so 2 CPUs. As the primary task cycles through periods with lower memory usage, it attempts to fire up the second core. Only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles in to another phase of higher memory usage.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60915 - Posted 30 Apr 2009 5:24:59 UTC
Last modified: 30 Apr 2009 5:35:33 UTC

This task has been in my pending list since 29 Apr 2009 22:36:17 UTC Since 30 Apr 2009 3:12:54 UTC &
Since 30 Apr 2009 3:12:54 UTC Any ideas as to why this is happening?
____________
Have a crunching good day!!

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 60921 - Posted 30 Apr 2009 12:04:12 UTC

frb_0_8_mike_chosen_cst_hb_t367__IGNORE_THE_REST_1UFBA_2_11071_831_0

Interesting task, IMO. It generated 99 decoys in a bit more than 20 minutes.
____________

WilMar

Joined: Mar 29 09
Posts: 1
ID: 308354
Credit: 1,984
RAC: 0
Message 60922 - Posted 30 Apr 2009 13:23:20 UTC

Hello !
Now, at the advent of version 1.64, I'm having difficulty uploading my last file crunched with version 1.54. I repeatedly get the following messages:
30/04/2009 13:40:31|rosetta@home|Started upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:19||Project communication failed: attempting access to reference site
30/04/2009 13:42:19|rosetta@home|Temporarily failed upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0: connect() failed
30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:21||Internet access OK - project servers may be temporarily down.
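
For anyone collecting these messages to report them, here is a minimal Python sketch that splits a pasted line into timestamp, project, and message, and reads the back-off delay out of it. It assumes the pipe-separated layout and the "Backing off ... on upload of ..." wording seen in the lines above; other BOINC versions may word or lay out their messages differently, and parse_backoff is a made-up helper name.

```python
import re
from datetime import timedelta

# Hours, minutes and seconds are each optional, since short delays
# may be reported without the "hr" part.
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(?:(\d+) sec )?on upload of (\S+)"
)


def parse_backoff(message):
    """Return (file_name, delay) for a back-off message, or None otherwise."""
    match = BACKOFF_RE.search(message)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups()[:3])
    return match.group(4), timedelta(hours=hours, minutes=minutes, seconds=seconds)


line = (
    "30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of "
    "lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0"
)
timestamp, project, message = line.split("|", 2)
file_name, delay = parse_backoff(message)
print(project, file_name, delay)
```

The delay only says when the client will retry on its own; it does not say anything about whether the upload has been lost.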

As seen on the server status page, all servers are running. So, why this problem and how to cure it ?

Martin

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60923 - Posted 30 Apr 2009 14:22:24 UTC - in response to Message ID 60898.
Last modified: 30 Apr 2009 15:18:34 UTC

BOINC v6.6.20 seems to be causing failures due to too many restarts.
http://boinc.bakerlab.org/rosetta/result.php?resultid=247095859
http://boinc.bakerlab.org/rosetta/result.php?resultid=246620233

It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.

I have a HT P4, so 2 CPUs. As the primary task cycles through periods with lower memory usage, it attempts to fire up the second core. Only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles in to another phase of higher memory usage.


BOINC 6.6.20 is working better for me, so let's compare our machines and settings. My newer machine, with BOINC 6.6.20 under 64-bit Vista SP1 with 8 GB of memory, does not appear to have any memory problems.

My 32-bit Vista SP1 machine, with BOINC 6.2.28, originally came with 1 GB of memory. I found that wasn't enough to even start running two minirosetta@home workunits at the same time. After enough other problems showed up which I decided were memory problems, I used this site to find out how much memory my motherboard could handle, and then order enough to raise it to the 2 GB limit for my motherboard:

http://www.crucial.com/

This was enough to allow it to start running two minirosetta workunits at once on my 2 CPU cores, but still not enough to run them well. Eventually, I raised both the amount of disk space BOINC is allowed to use and the amount of swap space BOINC is allowed to use. It's not clear which of the last two steps was actually needed, if not both of them, but that combination handled the memory problems on that machine.

At least some versions of BOINC do not divide up the available swap space in the most efficient way - they first divide it up into equal shares for each BOINC project you have subscribed to, then those shares into smaller shares for each CPU core. If these smaller shares aren't large enough, it can't preserve any work done since the last checkpoint by simply swapping one into the swap space on the hard drive.
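
To put numbers on that, here is a small Python sketch of the division as I've described it. It only illustrates the arithmetic of an even split by project and then by core; it is not taken from BOINC's code, the figures are hypothetical, and swap_share_mb is a made-up name.

```python
def swap_share_mb(total_swap_mb, n_projects, n_cores):
    """Per-task share if swap is split evenly by project, then by core."""
    return total_swap_mb / n_projects / n_cores


# Hypothetical numbers, chosen only for illustration.
share = swap_share_mb(total_swap_mb=2048, n_projects=4, n_cores=2)
print(f"{share:.0f} MB per task")
```

With those made-up numbers, 2 GB of swap space leaves each task only 256 MB, which may be less than a single workunit needs.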

Does the HT stand for hyperthreaded, a method of appearing to have twice as many CPU cores by giving each one of them an extra set of registers? If so, I've seen messages from other BOINC users saying that this does not increase the total throughput very much. Therefore, until you are able to handle the memory and swapfile problems, you may find it worthwhile to tell BOINC to use only one of the two apparent CPU cores on your machine.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60926 - Posted 30 Apr 2009 16:05:45 UTC

I've recently had two workunits with the lockfile problem:

http://boinc.bakerlab.org/rosetta/result.php?resultid=247527853

http://boinc.bakerlab.org/rosetta/result.php?resultid=247443039

Both were then completed successfully by someone else.

Could minirosetta be modified to check for the lockfile problem sooner, and at least produce more debug information about it instead of wasting CPU time first?

Feet1st Profile
Avatar

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60927 - Posted 30 Apr 2009 16:19:58 UTC

Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.

Yes, by HT, I meant hyperthreaded. But I believe setting number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time, with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT, then I'd be focusing all the resource on one task at a time, and not have the desire to support memory enough for two tasks.

I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.

I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60929 - Posted 30 Apr 2009 21:15:55 UTC - in response to Message ID 60922.

Hello!
I'm having difficulty uploading my last crunched file with version 1.54. I repeatedly get

I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
5/1/2009 8:51:54 AM rosetta@home Backing off 2 hr 52 min 32 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0
5/1/2009 8:51:54 AM rosetta@home Started upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:51:56 AM Internet access OK - project servers may be temporarily down.
5/1/2009 8:51:59 AM rosetta@home Finished upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.

Should I abort these transfers? I will wait for further instructions before I do anything to these.
____________
Have a crunching good day!!

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60930 - Posted 30 Apr 2009 21:26:28 UTC - in response to Message ID 60927.

Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.

Yes, by HT, I meant hyperthreaded. But I believe setting the number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT; then I'd be focusing all the resources on one task at a time, and wouldn't need enough memory for two tasks.

I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.

I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.


Do you have enough free disk space to let BOINC use more swap space, so that it can store partly completed work in a way that allows resuming it where it was interrupted? That way, BOINC could simply switch to projects with lower memory requirements while you need more memory for something else; for example, the POEM@HOME project requires less memory, but helps with an earlier step in medical research. The suspended tasks will move off the list of currently running tasks, but can later move back onto it and resume at the point of interruption, instead of being dropped entirely. Such tasks will need to go back to their last checkpoint if you reboot for any reason, though.

If you prefer to run mainly Rosetta@home, just keep the percentage of your CPU time assigned to these lower-memory projects below the percentage of time you actually need the machine to run with lower memory use. Also, ensuring that there is enough swap space for all the projects BOINC tries to keep running at once allows you to suspend all BOINC projects at once if you need to run something with even higher requirements. It seems that the default amount of swap space BOINC is allowed to use isn't enough if you attach to enough BOINC projects at once and even one of them is as memory-hungry as Rosetta@home.

http://boinc.fzk.de/poem/

Also, turning off one of a pair of hyperthreaded CPUs shouldn't cause you to get only half the credits, since it then allows you to run the other one at full speed, instead of at barely more than half the full speed. It would, however, give you only half the credits if you actually had two fully independent CPU cores instead of a hyperthreaded pair, or if you use an older version of BOINC that isn't aware that it needs to keep track of CPU core sharing between hyperthreaded pairs.
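The arithmetic behind that claim can be sketched in a few lines of Python. The 55% figure below is only an assumed stand-in for "barely more than half the full speed"; it is not a measurement, and the function name is made up for illustration.

```python
def relative_throughput(tasks, speed_per_task):
    """Total work per unit time, relative to one task at full speed."""
    return tasks * speed_per_task


# Assumed figure: each of two hyperthreaded tasks runs at 55% of full speed.
ht_two_tasks = relative_throughput(2, 0.55)
# One task with the core to itself runs at full speed.
one_task = relative_throughput(1, 1.0)

print(f"two HT tasks: {ht_two_tasks:.2f}, one task: {one_task:.2f}")
print(f"credit kept when running one task: {one_task / ht_two_tasks:.0%}")
```

Under that assumption, a single task keeps roughly nine tenths of the throughput of the hyperthreaded pair rather than half of it. The exact ratio depends on how much the two threads really slow each other down.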

If your main concern is credits for helping medical research and you happen to have one of the newer graphics boards GPUGRID can use (mainly recent Nvidia cards), consider adding GPUGRID to your list of BOINC projects. It will require switching to the newest version of BOINC I've read about, but then can run workunits on your graphics card instead of on your CPUs. Shouldn't interfere with your regular computer use if it isn't graphics-intensive.

http://www.gpugrid.net/

Also, check whether that web site I gave mentions how much memory your machine can handle and what the price is. I spent only about $50 (US) to reach the maximum amount this computer can use, but that was with me installing the new, faster memory myself.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 60931 - Posted 30 Apr 2009 21:32:24 UTC

Speedy, no, don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.
____________
Rosetta Moderator: Mod.Sense

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 60932 - Posted 30 Apr 2009 21:40:26 UTC - in response to Message ID 60931.

Speedy, no, don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.

Thank you. All my results that need to be uploaded have just been uploaded. All is good at my end. Thank you for your continued hard work.
____________
Have a crunching good day!!

Feet1st Profile

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60933 - Posted 30 Apr 2009 21:42:18 UTC

Robert, yes, I've had all the same thoughts, and have plenty of disk allowed to BOINC, and to my swap file. But I am finding that BOINC isn't smart enough to realize which projects require less memory. It cycles through all the work you currently have for the project it wants to repay debt to, and only after it gets about 2 minutes into every single downloaded Rosetta task will it try to run a 10MB WCG rice task. But if I don't happen to have any WCG work, it isn't smart enough to think about getting some rather than leaving a CPU idle.

I'd love if it were smart enough to run one Rosetta and one rice during the day when I'm using the machine, and then run dual Rosetta tasks at night when my machine is idle and I allow more memory to BOINC. But it's just not smart enough to do so without major manual adjustments.

I could keep a larger cache of work, and therefore help assure I always have something from each project, but then it would cycle through 10 Rosetta tasks, running each for 2 minutes, rather than just 6.

Hopefully with all the discussion on the client work fetch policies, something will shake out that will work better.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Feet1st Profile

Joined: Dec 30 05
Posts: 1740
ID: 44890
Credit: 2,350,568
RAC: 3,695
Message 60934 - Posted 30 Apr 2009 21:48:23 UTC

Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.
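The numbers above can be checked with a short Python sketch. It assumes, as the paragraph does, that the 3397M limit shown by Task Manager is the total the 75% setting applies to; whether BOINC actually uses that figure is not confirmed here, and the function name is made up for illustration.

```python
def swap_allowance_mb(total_mb, percent_allowed):
    """Swap space a percentage setting would allow, in MB."""
    return total_mb * percent_allowed / 100.0


limit_mb = 3397    # upper figure of the "commit charge" shown by Task Manager
in_use_mb = 1477   # lower figure: what the whole system has committed so far

allowed = swap_allowance_mb(limit_mb, 75)
print(f"allowance: {allowed:.0f} MB ({allowed / 1024:.2f} GB)")
print(f"system-wide use is below the allowance: {in_use_mb < allowed}")
```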
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60935 - Posted 30 Apr 2009 21:52:27 UTC - in response to Message ID 60929.

Hello!
I'm having difficulty uploading my last crunched file with version 1.54. I repeatedly get

I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
...
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.

Should I abort these transfers? I will wait for further instructions before I do anything to these.


If the transfers aren't too close to their deadlines, I'd just let BOINC keep trying. I've had workunits upload successfully after getting similar messages for days, when router problems kept me from reaching the internet at all for several days. However, it's occasionally useful in such circumstances to first start viewing the Rosetta@home web site to make sure the connection is open. Then, without closing your browser, start the BOINC manager program if it isn't already running, click on Advanced View if the simplified view appears first, click on the Transfers tab, click on Advanced, and click on Do network communication to make it retry the communications while your connection to the internet is still open.
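For anyone wondering why the "Backing off" times in the log differ so much from one file to the next, here is a minimal Python sketch of randomized exponential backoff, the general technique for spacing out retries. This is not BOINC's actual retry schedule; the function name, the 60-second base and the 4-hour cap are all assumptions chosen for illustration.

```python
import random


def backoff_seconds(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Randomized exponential backoff: the wait grows with each failure, up to cap."""
    ceiling = min(cap, base * (2 ** failures))
    # Pick a random delay in the upper half of the window so that many
    # clients do not all retry at the same moment.
    return rng.uniform(ceiling / 2, ceiling)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for failures in range(1, 9):
        print(failures, round(backoff_seconds(failures, rng=rng)), "seconds")
```

With a scheme like this, a file that has failed many times waits hours while one that has failed only a few times waits minutes, which is why forcing a retry by hand can help once the connection is known to be working.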

For some BOINC projects, even returning the results after their deadlines is useful, if you manage to return the results before anyone else does for the same workunit. Not all BOINC projects allow this, though.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 60948 - Posted 2 May 2009 9:55:40 UTC - in response to Message ID 60934.

Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.


At least some versions of Windows automatically expand the swap space if BOINC is allowed to use a large enough fraction of it to come close to the amount already provided. I'd expect "page file" to be the Windows name for what some people call the swap file.

I've set up my machines to start up with the swap file size already set to 30 GB, with no sign of coming close to that limit. That doesn't allow any further expansion, but should keep the disk head from needing to move very far when going from one place in the swap file to another.

I have seen signs that BOINC divides the available swap space equally among either the active slots or all the enabled BOINC projects before deciding how much to give to each workunit, and does not adjust this based on how much memory each BOINC project is expected to require. For that reason, if you have enough free disk space, allowing both the swap file and the disk space for each workunit to be significantly more than the average required is helpful for the applications with high requirements, such as minirosetta.
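A small Python sketch shows why an equal split hurts the memory-hungry application. It only models the behaviour described above, which is an observation rather than confirmed BOINC logic; the function names, task names and per-task demands are hypothetical.

```python
def equal_shares(total_mb, demands_mb):
    """Give every task the same share, regardless of what it needs."""
    share = total_mb / len(demands_mb)
    return {name: share for name in demands_mb}


def short_tasks(shares_mb, demands_mb):
    """Names of tasks whose share is smaller than their demand."""
    return sorted(n for n, need in demands_mb.items() if shares_mb[n] < need)


# Hypothetical per-task demands in MB; not measured values.
demands = {"minirosetta": 500, "small_task_a": 10, "small_task_b": 10}

# 600 MB would cover the combined demand of 520 MB, but an equal split
# gives each task only 200 MB, so the large task comes up short.
print(short_tasks(equal_shares(600, demands), demands))

# Only when the total is well above the average does the large task fit.
print(short_tasks(equal_shares(1500, demands), demands))
```

In this toy case the total has to be three times the large task's demand before an equal split covers it, which is the reason for allowing significantly more than the average.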

Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC