Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 96 ID: 5612 Credit: 2,190 RAC: 0
Hello All!
We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.
This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.
Features/Fixes:
1.54 Release CHANGELOG
Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.
Bug fix concerning intermittent crashes in relax benchmark jobs (_rlbd_), caused by a buggy input file reader.
Bug fix for a potential instability in handling text files (affects all types of WUs).
Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)
Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).
Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)
Added checkpointing to Looprelax.
The Watchdog has been checked and improved, and now returns information on aborted jobs to help us figure out how the remaining long-running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words, WUs should not overrun by more than around 4 hours. If they do, please let us know!
Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address the problems with excessively large uploads.
Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)
Fixed a strange problem in the options system leading to early crashes on some systems.
Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)
Generally implemented much better error reporting - many potential problems will now show up as meaningful error messages and not random segmentation faults.
NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload, but it tells us a lot about where things can still go wrong.
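As a rough illustration of the two limits described in the changelog above (abort once the runtime exceeds the preferred runtime plus 4 hours; end gracefully after 99 decoys), here is a minimal Python sketch. The function and constant names are hypothetical and this is not the actual minirosetta watchdog code, only the rule as stated in this post.

```python
# Hypothetical sketch of the limits described in the 1.54 changelog.
# Names are illustrative; this is not the minirosetta source.

WATCHDOG_GRACE_SECONDS = 4 * 60 * 60  # "preferred runtime + 4 hours"
MAX_DECOYS_PER_WU = 99                # limit on the number of decoys per WU


def watchdog_should_abort(runtime_seconds, preferred_runtime_seconds):
    """Return True once the run has exceeded preferred runtime + 4 hours."""
    return runtime_seconds > preferred_runtime_seconds + WATCHDOG_GRACE_SECONDS


def should_finish_gracefully(decoys_completed):
    """Return True once the decoy limit has been reached."""
    return decoys_completed >= MAX_DECOYS_PER_WU


if __name__ == "__main__":
    pref = 4 * 60 * 60  # a 4-hour runtime preference, in seconds
    print(watchdog_should_abort(7 * 3600, pref))  # 7 h is inside the 8 h cutoff
    print(watchdog_should_abort(9 * 3600, pref))  # 9 h is past the 8 h cutoff
```

With a 4-hour preference the cutoff in this sketch lands at 8 hours of runtime, which matches the "never more than 4 hours longer than the preferred time" expectation.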
Despite all these fixes there are, I'm sure, many problems left. Most of them now occur extremely rarely, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.
Please let us know how things work out there. In particular, I'd like to know about:
Stuck workunits
Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
Problems with checkpointing.
Any other strange behaviour.
Happy crunching - I'm very excited to see how this new version will pan out.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
The link in the news item that should bring you to this thread is truncated.
____________ Rosetta Moderator: Mod.Sense
ID: 59047
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
The news item also shows the year as 2008 (which is probably the last time you had enough coffee to be able to read the calendar!!). All these improvements are going to send TeraFLOPS much higher! Nice work Mike, and BakerLab. I can really see that you've come through for people here.
____________ Rosetta Moderator: Mod.Sense
Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors, and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
____________
ID: 59074
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...
I read about this in Einstein@Home and it seems to work for me ... YMMV ...
____________
I don't know about others but my Rosetta machines are running dry!!! The new minirosetta is stuck downloading at 89.25% and has been there for HOURS!!!!
I have had to attach to a different project until it gets sorted out. So far all machines, exact same problem, one a dual core, one a single core. If you look at my computers, they are not hidden, any task that says "outcome unknown" is because the mini-rosetta download ain't happening!!!! Message in Boinc says:
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
etc, etc, etc, etc forever!!!!
Another project now loves you!!
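One way to see how long each attempt lasted is to pair the "Started download" and "Temporarily failed download" lines in the message log. A minimal Python sketch, assuming the pipe-separated timestamp|project|message layout shown above; the function name is made up for illustration.

```python
from datetime import datetime

# Log lines copied from the messages above.
LOG = """\
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
"""

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def attempt_durations(log_text):
    """Return seconds between each 'Started download' and the next failure."""
    durations = []
    started_at = None
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # skip anything that is not timestamp|project|message
        stamp, _project, message = parts
        when = datetime.strptime(stamp, TIME_FORMAT)
        if message.startswith("Started download"):
            started_at = when
        elif message.startswith("Temporarily failed download") and started_at is not None:
            durations.append((when - started_at).total_seconds())
            started_at = None
    return durations


if __name__ == "__main__":
    print(attempt_durations(LOG))
```

On the lines quoted above, the first attempt runs for 308 seconds (a little over 5 minutes) before the HTTP error, and the second fails after 22 seconds.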
____________
ID: 59088
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your machines are stopping... is it on the same file? You have to download the new programs, which are several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?
Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.
Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.
Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
____________ Rosetta Moderator: Mod.Sense
ID: 59093
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 96 ID: 5612 Credit: 2,190 RAC: 0
Paul,
can you point me to the thing you read about Lockfile problems on Einstein !?
5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.
What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
What do you mean by 100% CPU ?
"computing preferences" configured on website for the venue of the machine. The setting is called "Use at most" at the bottom of the processor usage section.
Can also be configured via the BOINC Manager for a specific host.
____________ Rosetta Moderator: Mod.Sense
mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your machines are stopping... is it on the same file? You have to download the new programs, which are several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?
I do not use a proxy, just straight to the net. I use Comcast.
Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. The one I am looking at right now has been trying for 11:51:02 and is going to retry in 03:34:34, and counting.
Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.
Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
Yes I have, no luck, the file is stuck at 89.25, 89.26 or 89.27% depending on the pc. I am stuck at exactly 5.85 meg of 6.56 meg on all machines.
____________
ID: 59108
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>
If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
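For anyone who prefers to generate the file rather than edit it by hand, here is a minimal Python sketch that builds a cc_config.xml containing only the file transfer debug flag. It assumes the usual cc_config.xml layout with log flags inside a <log_flags> section; the other flags mentioned above are not reproduced here, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(file_xfer_debug=True):
    """Build a minimal cc_config.xml tree with the file transfer debug flag."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "file_xfer_debug")
    flag.text = "1" if file_xfer_debug else "0"
    return ET.ElementTree(root)


def has_http_1_0(tree):
    """Report whether an <http_1_0> element appears anywhere in the config."""
    return tree.getroot().find(".//http_1_0") is not None


def write_cc_config(tree, path):
    """Write the tree to the given path (your BOINC data directory)."""
    tree.write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    tree = build_cc_config()
    print(ET.tostring(tree.getroot(), encoding="unicode"))
    print("http_1_0 defined:", has_http_1_0(tree))
```

The has_http_1_0 helper can also be pointed at an existing file loaded with ET.parse to answer the question above without changing anything.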
____________ Rosetta Moderator: Mod.Sense
I'm seeing a validate error on task 224245929 , workunit 204213187, Mac OS X 10.4.11. The task name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_115354_1 : it ran twice as long as it was supposed to and I was the second person to get it. The original person to whom it was sent also got the same validate error: irritating after it took twice as long as it was supposed to. It seems to be one of these zinc-containing proteins that have a habit of doing this.
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 1:26:32:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish
</stderr_txt>
]]>
____________
ID: 59112
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, I don't know why I didn't think of this before...
and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
____________ Rosetta Moderator: Mod.Sense
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
Sorry, I should have mentioned there is a new rule. Mini will not produce more than 99 models. It will finish gracefully and grant full credit. The reason for this is that I want to prevent your individual uploads from getting too large. In the future there will be a better way to do this, such as checking that the output file size has not reached some limit.
It's just another safety hook that's been put in to prevent WUs from misbehaving.
Hello everyone.
I have had no problems receiving Minirosetta v1.54 WUs.
I received 17 WUs, due by February 6, 2009 at 21:28:04 (France time).
The first calculations should begin today (January 29), and if there are any problems I will report them here.
____________
ID: 59124
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Paul,
can you point me to the thing you read about Lockfile problems on Einstein !?
5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.
What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.
Mike
Two places to start are here and here ... I can also report that since I made that change I have been getting good results on Win XP systems ... I cannot see the high error rate I had in the past as the tasks have been purged ...
It seemed to me to be a problem I had on XP and it was most severe on the i7 where there are more things going on ...
and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.
Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!
Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!
____________
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
____________
Change #3....I downloaded and installed the latest version of DirectX, no changes noted.
Change #4....I installed Boinc 6.6.3, got this message "1/29/2009 8:28:31 AM|rosetta@home|Scheduler request completed: got 0 new tasks". I may have errored out all my available work for the day. No files downloading, so maybe it will take this time? No clue.
____________
mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>
If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
Okay, I have downloaded the file and put it in the Boinc\Data directory. I took out the asterisks and changed the <file_xfer_debug> line to a 1; it was a zero.
As for the http setting I use Firefox 3.0.5 and do not see that setting. I know it is/was in IE, but I do not see it in Firefox.
____________
ID: 59134
Scott A. Howard* Joined: Oct 16 05 Posts: 2 ID: 4994 Credit: 1,054,779 RAC: 2,192
Hello,
Here's the problem in a nutshell.
On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (for a total of 8 cores) running XP Pro SP3, almost every one of the Rosetta jobs (minirosetta version 1.54) fails. The typical failure mode is that they exceed their CPU time allocation. For example, if a job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress, other times they are stuck at zero.
Also, the exe is not removed from memory when the computer is in use.
I have reset the project and detached and attached again but it continues to happen.
Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines.
Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration - maybe a coding error showing up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 - 8).
I have a full development environment and debuggers if you want some traces.
Scott Howard
Addendum: Now that I've thought about it a little more, does the app use any global resource locking? E.g., mutexes, semaphores, file access? Maybe that's why the progress is halted: it's deadlocked. But then I am not sure why the task would continue to use CPU time. Just some random thoughts...
____________
ID: 59135
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.
Once you have the file in the directory, abort the transfer.
You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.
The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.
Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?
Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
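If it helps, the two checks can be scripted. This is a minimal Python sketch that builds the Windows command lines for ping and tracert and can optionally run them; the host name below is a placeholder, not one of Rosetta's actual download servers, so substitute whatever the file transfer debug messages report.

```python
import subprocess

# Placeholder (assumption): replace with the host name shown in the
# file transfer debug messages.
EXAMPLE_HOST = "download-server.example.org"


def diagnostic_commands(host):
    """Return the Windows ping and tracert command lines for a host."""
    return [
        ["ping", "-n", "4", host],  # -n 4: send four echo requests
        ["tracert", host],
    ]


def run_diagnostics(host):
    """Run each command and return its combined output, keyed by command name."""
    results = {}
    for command in diagnostic_commands(host):
        completed = subprocess.run(
            command, capture_output=True, text=True, check=False
        )
        results[command[0]] = completed.stdout + completed.stderr
    return results


if __name__ == "__main__":
    # Only print the commands here; call run_diagnostics() to execute them.
    for command in diagnostic_commands(EXAMPLE_HOST):
        print(" ".join(command))
```

The text returned by run_diagnostics can be pasted straight into a reply here.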
____________ Rosetta Moderator: Mod.Sense
Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors, and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
1.47 worked rather well for me, with perhaps one out of ten workunits giving an error. Not enough 1.54 workunits yet to say whether 1.54 is better. I'm asking for 14 hour workunits, so it will take me longer to run that many.
ID: 59137
Scott A. Howard* Joined: Oct 16 05 Posts: 2 ID: 4994 Credit: 1,054,779 RAC: 2,192
Here's a follow up.
I did the following:
1) detached from the project.
2) removed the Rosetta project folder from under \Boinc\...
3) removed all files from a slot that contained Rosetta data
4) reattached to the project
5) allowed for 50% of the cpus to be used (4 in this case)
6) allowed the four projects to run - each expected to take about 4 hours
Observed results: The status for the projects are "Running, high priority", each has used about 20 minutes of cpu time, the progress is 0.000%
Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.
It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles.
I see the thread that's consuming the CPU has a pretty regular call stack. Here is the call stack. If you have your debug symbols for your build, you should be able to locate the routine and line at which the program is hung...
ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
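To line the trace up against a symbol map, it can help to pull out just the frames that belong to the application. A minimal Python sketch, assuming the module+0xOFFSET and module!symbol+0xOFFSET frame formats shown above; the function name is made up for illustration.

```python
import re

# A few frames copied from the call stack above.
STACK = """\
ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
"""

# module name, optional !symbol, then +0x<hex offset>
FRAME = re.compile(
    r"^(?P<module>[^!+\s]+)(?:!(?P<symbol>[^+\s]+))?\+0x(?P<offset>[0-9a-fA-F]+)"
)


def app_frame_offsets(stack_text, module_prefix="minirosetta"):
    """Return offsets (as ints) of frames whose module starts with the prefix."""
    offsets = []
    for line in stack_text.splitlines():
        match = FRAME.match(line.strip())
        if match and match.group("module").startswith(module_prefix):
            offsets.append(int(match.group("offset"), 16))
    return offsets


if __name__ == "__main__":
    for offset in app_frame_offsets(STACK):
        print(hex(offset))
```

The first offset returned is the topmost application frame, which is the one flagged in the trace above.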
____________
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.
Once you have the file in the directory, abort the transfer.
You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.
The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.
Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?
Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
I changed the dual core settings to use both cores (this is a laptop and I do not like stressing it that much) and set the other project to no new work. I updated Rosetta and it proceeded to download new work. The same file stopped at the same place, 89.25%. I aborted it after all other files were done downloading, and no new entries showed up in the cc_config.xml file.
I was browsing thru the stdout.txt file and found this:
9:21:33 AM: Error: can't open file 'C:\Boinc\\RebootPending.txt' (error 2: the system cannot find the file specified.)
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init boinc_socket returned 444
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init_poll sock = 444
It is in there many, many times.
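For reference, Winsock error 10061 is WSAECONNREFUSED (connection refused), so these lines record connection attempts that were refused. To see how often each code shows up, here is a minimal Python sketch that tallies the codes, assuming the TRACE line format shown above; the small lookup table covers only a few common codes.

```python
import re
from collections import Counter

# Lines copied from the log excerpt above.
LOG = """\
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1
"""

# A few standard Winsock codes; not an exhaustive table.
WINSOCK_NAMES = {
    10054: "WSAECONNRESET",
    10060: "WSAETIMEDOUT",
    10061: "WSAECONNREFUSED",
}

ERROR_PATTERN = re.compile(r"Winsock error '(\d+)'")


def count_winsock_errors(log_text):
    """Count occurrences of each Winsock error code in the log text."""
    return Counter(int(code) for code in ERROR_PATTERN.findall(log_text))


if __name__ == "__main__":
    for code, count in count_winsock_errors(LOG).items():
        print(code, WINSOCK_NAMES.get(code, "unknown"), count)
```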
I do not see what server I am downloading from, and only use the Windows firewall, so unless I could block thru the Hosts files, I do not know how to block that particular server anyway.
Yes each retry deferral is about 4 minutes.
I did find one more thing in that stdoutgui.txt file:
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect on 524 returned -1
It is also in there many, many times. I did a search and found advice to change the attributes for the Boinc directory and all subdirectories. It was set to read only, and when I unchecked that and changed it for all subdirectories as well, Boinc would not run. It also defaults back to read only after it errors out. DO NOT DO THIS LAST PART! It crashed my whole Boinc setup and I had to delete the Boinc directory and all subdirectories, then reboot and reinstall Boinc from scratch. FORTUNATELY it did a repair install instead of a brand new install from scratch! I lost all workunits from all projects though!!!! I attached to Rosetta and guess what? The EXACT SAME FILE is stuck at the EXACT SAME PLACE!!! A TON of files are downloading besides just that one, but that one is stuck all over AGAIN!!!
____________
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
got any firewalls active?
No I use the Windows one, I have Windows XP Media Center on this laptop.
____________
and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.
What antivirus program do you have, and what version? Some antivirus programs don't fully turn off when you try to turn them off; they stop reporting that they have found a virus, but don't stop looking for a virus.
I'm also running Ad-Aware, but without this problem, so this antispyware program is less likely to be causing the problem.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!
I am also using the 4.8 Home version of Avast.
____________
ID: 59144
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Scott Howard:
Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.
There are many many BOINC settings possible and you've not described any of yours. When you set BOINC to run based on preference, you are telling it to only use CPU on the days and during the hours you've configured. If you've configured it to not be running at the current time or day of the week, it will suspend the currently active tasks. Any time a task is suspended, it will not make any progress. And there is a memory setting for whether or not tasks should remain "in memory" (virtual memory) while suspended. Doing so preserves the work done since the last checkpoint taken by the task.
...so major portions of what you are reporting may be exactly what you have configured BOINC to do.
You have 4 hosts; three are Windows XP and one is Win Vista. Which one is having problems? Is it this one? There are many failed tasks there with access violations. Are you overclocking this machine? Other than more CPUs and a different CPU type, what is different about this machine compared with your others that have been running fine?
____________ Rosetta Moderator: Mod.Sense
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
got any firewalls active?
No I use the Windows one, I have Windows XP Media Center on this laptop.
I also use the Windows firewall, but the Vista SP1 version.
Laptops sometimes have problems with overheating when running BOINC workunits while set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting for the fraction of CPU time to use?
I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.
I notice that your 8 core machine only has 3GB. That's a bit small for 8 rosetta tasks. In your BOINC preferences what percent of memory are you allowing the machine to use when the machine is/isn't in use? You might try setting both to 100% on that machine and see if it makes any difference.
I seem to have the same problem. No special settings in Rosetta preferences, all kind of computers under XP, and tasks running 100+ hours with 0% progress.
Reason: Access Violation (0xc0000005) at address 0x00467846 read attempt to address 0x11B524C4
This task was running fine but after I suspended it, rebooted my system, and restarted the task it terminated almost immediately with access violation. Maybe restarts don't work very well or something is flakey with my hard drive or system. Having some troubles with access violations on Einstein tasks as well. But I've run memtest86 and prime95 and CHKDSK and none of them indicate any local computer problems. I'm just shaking my head in disgust.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!
I am also using the 4.8 Home version of Avast.
I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.
My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?
Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
got any firewalls active?
No I use the Windows one, I have Windows XP Media Center on this laptop.
I also use the Windows firewall, but the Vista SP1 version.
Laptops sometimes have problems with overheating when running BOINC workunits while set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting for the fraction of CPU time to use?
I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.
I only run one core so the setting is to use 50% of the cpu's. Thus I do not have a problem with overheating on this laptop. I have it set to use 100% of the available cpu.
____________
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!
I am also using the 4.8 Home version of Avast.
I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.
My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?
This is an Intel T2300 dual core, only using one of them for Boinc, 1.6ghz machine.
Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.
Yeah me telling Comcast what to do isn't going to happen in this lifetime. I can download any file in the World EXCEPT this damned mini-rosetta file and then ONLY thru Boinc!!!! I download the same file thru a direct download, Boinc just won't recognize it. Yes I did put the file in the proper directory. We have been thru this already.
____________
ID: 59157 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
I am not sure that Comcast is the problem, as I use their AV software and have no problems downloading work ...
ID: 59159 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, have you tried a different version of BOINC?
____________ Rosetta Moderator: Mod.Sense
ID: 59160 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
____________ Rosetta Moderator: Mod.Sense
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
____________
ID: 59162 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.
If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
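The checksum check suggested above can also be done by hand on a manually downloaded copy. This is a minimal sketch, assuming the client compares an MD5 digest of the file against a value it has on record; the function names are made up for illustration, and the expected digest has to come from wherever you can find it (none is given in this thread).

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare the file's digest with an expected hex string (hypothetical helper)."""
    return file_md5(path) == expected_md5.strip().lower()
```

If the digest of the manual download differs from the expected value, the copy was damaged in transit, which would point at something between the server and the PC rather than at BOINC itself.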
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
____________
You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.
If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
Nope this is the only message regarding the file:
1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
I exited Boinc, deleted the old file, copied the new one into its location and then restarted the whole pc. Then when Boinc started up that message, along with a few dozen others, came up.
I appreciate all the help but I am done trying to make this work. I am on to another project and will try again another time. THANK YOU ALL!!!!
PS in the time it took me to type this I attached to Poem@Home and got 8 new units plus all the associated files and the pc is now happily crunching.
Thanks again for all your help, I still have a hard time believing it is my pc that can download just fine from any other project but just cannot download one file from Rosetta. Here is a partial list of files just downloaded:
1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86
1/29/2009 6:37:15 PM|Poem@Home|Started download of JParmJan97
1/29/2009 6:37:23 PM|Poem@Home|Sending scheduler request: To fetch work. Requesting 95475 seconds of work, reporting 0 completed tasks
1/29/2009 6:37:28 PM|Poem@Home|Scheduler request completed: got 8 new tasks
1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86
As you can see it works just fine!! I do see that the mini-rosetta file has a ".exe" at the end while the poem file does not. Could that be the problem, no clue, seems it has worked for all other users.
Thanks for the ride it has been loads of fun but I am getting off for now. I will still come back and read and reply in the forums until my credits don't let me anymore.
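For anyone comparing message logs like the ones quoted above, here is a rough sketch that lists downloads which were started but never reported as finished. It assumes the `timestamp|project|text` layout and the "Started download of" / "Finished download of" wording seen in these excerpts; the function names are illustrative only.

```python
from datetime import datetime

STARTED = "Started download of "
FINISHED = "Finished download of "


def parse_message(line):
    """Split one message line into (timestamp, project, text)."""
    stamp, project, text = line.strip().split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, text


def unfinished_downloads(lines):
    """Return sorted (project, filename) pairs that started but did not finish."""
    started, finished = set(), set()
    for line in lines:
        _, project, text = parse_message(line)
        if text.startswith(STARTED):
            started.add((project, text[len(STARTED):]))
        elif text.startswith(FINISHED):
            finished.add((project, text[len(FINISHED):]))
    return sorted(started - finished)
```

Run over a full session log, this would show at a glance whether the minirosetta executable is the only file that never completes.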
____________
ID: 59173 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So please monitor the new release thread and give it another try at that time.
____________ Rosetta Moderator: Mod.Sense
ID: 59174 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Sometimes it is best to take a breather and come back later ...
At times the problems go away on their own for no apparent reason ... other times they can be found.
I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...
I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So please monitor the new release thread and give it another try at that time.
I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
____________
Sometimes it is best to take a breather and come back later ...
At times the problems go away on their own for no apparent reason ... other times they can be found.
I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...
I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...
Oh I have 17 computers, I think, on line here at home right now. All are crunching for Boinc, plus I have 2 video cards doing the folding thing. I do ABC and Poem right now. But if you click on my name you will see I have crunched for a few projects and am not intending to stop anytime soon. In fact I have 2 new motherboard and dual core cpus to bring on line this weekend to replace 2 single core machines. I have already set them to no new work in preparation for the changeover.
____________
I detached yesterday and re-attached just now, so there is no way (for me) to see what applications were running. Also, my computers are too widely spread to start micromanagement. Anyway, I'll keep an eye on a couple of computers for a couple of days to see if they reattach successfully and if that problem is indeed solved.
I supposed it was a wrong batch (or application) and detach/reattach was the fastest way to have a full reset. If the problem shows up again, I'll let you know. If it doesn't... then thanks for the info!
Hello, everyone.
I don't understand why some people are having problems.
My desktop machine (Intel Core 2 CPU, Windows XP Home x86 SP2) has completed 27 WUs with v1.54 and 0 errors, with an average CPU time of 2.8 hours. I'm crossing my fingers that it continues as well...
I'll only note that somewhere between 80% and 85% of the work, the progress jumps directly to 100%.
With this new version I also notice that the processes generate more decoys and attempts (one example: 23 decoys from 23 attempts on one WU), and also that the average work per WU is greater than with v1.47.
Finally (although this is not the right forum), I should mention that one of my computers has been out of service for 8 days because of bad sectors on the hard drive, and it is being repaired. It holds 3 WUs that I have not been able to return before the deadline and that I think will be lost.
It would be good if some kind person could inform those in charge of the Rosetta project of this problem.
Thank you very much in advance...
Best regards...
____________
The 1.54 version seems to be in conflict with the Linux ABI in FreeBSD.
One machine I'm running boinc on is a FreeBSD one, boinc downloads and runs the Linux binaries through the Linuxulator. Version 1.47 worked flawlessly, but the 1.54 version crashes randomly on SIGILL. http://boinc.bakerlab.org/rosetta/results.php?hostid=973136 shows only one successful task, which was run with Rosetta beta rather than minirosetta; all of the minirosetta tasks crashed sooner or later.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.
____________ Rosetta Moderator: Mod.Sense
Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.
Hi.
Just as well it did finish after 99; I would hate to see the file size after 12 or 24 hours! :) I just returned another one the same size.
Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.
Is it something I did, a bug or just one of those things?
I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
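To make the arithmetic explicit: the 1.54 release notes say the watchdog aborts a task once its runtime exceeds the preferred runtime plus 4 hours. A tiny sketch of that rule follows; the function names are invented for illustration and this is not the actual minirosetta code.

```python
# Grace period taken from the 1.54 release notes: preferred runtime + 4 hours.
WATCHDOG_GRACE_HOURS = 4.0


def watchdog_limit_hours(preferred_runtime_hours):
    """Longest runtime the watchdog tolerates, per the release notes."""
    return preferred_runtime_hours + WATCHDOG_GRACE_HOURS


def watchdog_would_abort(runtime_hours, preferred_runtime_hours):
    """True once the task has run past the limit."""
    return runtime_hours > watchdog_limit_hours(preferred_runtime_hours)
```

With a 3 hour preference the limit works out to 7 hours, which matches a task that ends at roughly the 7 hour mark without any finished models.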
____________
Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.
Is it something I did, a bug or just one of those things?
I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
Thanks for copying here - I thought it was just a problem with the validator (the error message being the clue). You're right, there's no "Done" section after the first model starts until the boinc_finish, which is odd, but no mention of the watchdog cutting in, even though it does run a long time. But on the 1.47 WU there are 3 models done, so I'm not entirely convinced it's the same thing.
Usually long-running jobs get a default credit of 80, don't they? Looks like I missed out all ways. Oh well...
2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down
Server is up according to the webpage. One task was updated as complete.
There is something odd going on with the graphics of lr5_D_score12_rlbd_2hsh_IGNORE_THE_REST_DECOY_6246_424_0: the plot disappears completely at times, and so does the accepted energy. Then they reappear. It all seems to depend on the energy value of the moment. As far as I know this is not normal.
I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.
For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.
Right now 3 cores are running:
Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
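A back-of-the-envelope check of the reasoning above, using a deliberately simplified model in which work is handed out until the estimated runtimes cover the requested seconds. That model is an assumption for illustration, not the real scheduler logic, and the function name is made up.

```python
import math


def tasks_to_cover(request_seconds, estimated_hours_per_task):
    """Number of tasks whose estimated runtime covers the request (simplified model)."""
    per_task_seconds = estimated_hours_per_task * 3600
    return max(1, math.ceil(request_seconds / per_task_seconds))


# With an inflated estimate of about 150 hours per WU, one task covers the request.
print(tasks_to_cover(400_000, 150))
# With a realistic estimate of about 3 hours per WU, the same request needs many tasks.
print(tasks_to_cover(400_000, 3))
```

Under this model the inflated estimate alone is enough to explain getting a single WU per request.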
what version of boinc manager are you using?
it looked like you were using 5.10.45 which is quite old.
I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.
For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.
Right now 3 cores are running:
Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
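For those who would rather script the edit than do it in Notepad, here is a rough sketch of the same change. It assumes client_state.xml holds one `<project>` block per project, each containing a `<master_url>` and a `<duration_correction_factor>` line; the function name and the URL fragment used to pick the project are illustrative. As with the manual edit, BOINC must be shut down first, and keeping a backup copy of the file is a good idea.

```python
import re

DCF_RE = re.compile(
    r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
)
PROJECT_RE = re.compile(r"<project>.*?</project>", re.DOTALL)


def reset_dcf(state_text, project_url_fragment, new_value=1.0):
    """Set the duration correction factor for one project, leaving all else untouched."""

    def fix_project(match):
        block = match.group(0)
        if project_url_fragment not in block:
            return block  # some other project: do not change anything
        return DCF_RE.sub(
            lambda m: f"{m.group(1)}{new_value:.6f}{m.group(2)}", block
        )

    return PROJECT_RE.sub(fix_project, state_text)


# Hypothetical usage, with BOINC stopped:
# with open("client_state.xml", encoding="utf-8") as handle:
#     text = handle.read()
# with open("client_state.xml", "w", encoding="utf-8") as handle:
#     handle.write(reset_dcf(text, "bakerlab.org/rosetta"))
```

Working on the raw text instead of re-serializing the XML keeps every other line exactly as it was, in line with the advice not to change anything else in the file.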
____________
ID: 59254 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Compute error, though it looks more like a zip error ...
process exited with code 1 (0x1, -255)
Watchdog active.
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Not sure what to make of this error ... happened on the Mac Pro ...
That fixed it! Thanks, my duration was set at 55+.
I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.
For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.
Right now 3 cores are running:
Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down
Server is up according to the webpage. One task was updated as complete.
You have to wait and it will correct itself.
Maybe it has been a long time since your last Rosetta WU. During that time the project changed its web address, so BOINC needs to re-fetch the master file. Leave it alone and within 24 hours at most it will re-download it and resume working!
That fixed it! Thanks, my duration was set at 55+.
I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.
For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.
Right now 3 cores are running:
Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
Yeah, for some reason this has happened a lot lately.
____________
That could be related to the BOINC version (6.4.5 and higher). The complaints about the RDCF being completely off usually come from people who have installed it. A not uncommon opinion is that version 6.4.5 was made the recommended version too hastily, in order to get the CUDA capabilities out.
____________
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name, after all. So please monitor the new release thread and give it another try at that time.
I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.
____________
I just thought of something....I wonder if changing the setting for:
Skip image file verification? to yes would have let my Windows pc's download the file? Hmmmmm
____________
ID: 59365 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
The image verification can't occur until the download completes. So, that's not what's causing the download problem.
____________ Rosetta Moderator: Mod.Sense
Mod.Sense had asked me to post my results in here. A little history: I've been getting Compute Errors for every minirosetta WU I try to crunch; they usually crash and burn within the first 60 seconds or so. I am running a Q6600 with everything at stock speeds, but I was throttling my processor to use only 3 of 4 cores, so it was suggested that I let all 4 cores run unthrottled. Here's what happened:
I changed it to "On multiprocessor systems, use at most 100% of the processors" so that it would run completely unthrottled and use all 4 cores. I let it download minirosetta WUs and it got 5 of them. All failed: after 0:33, 1:39, 0:56 and 0:38, and the last one at 0:51 crashed with a Vista popup saying "minirosetta_1.54_windows_x86_64.exe has stopped working".
So it didn't seem to help. I don't know what else to try, but I'm a little ashamed of all the compute errors when you look at my results page, so I think I may have to give up on minirosetta and just stick to Beta WUs; they seem to work great when I'm not messing around with the BOINC client.
I think it may have something to do with Vista 64, because I have an E8500 running Vista 64 and they fail on there too. But the E8500 is throttled to 1 core and is OC'ed from 3.16GHz to 3.8GHz (I've been told OC'ing will affect minirosetta), and it is my gaming rig, so I don't mind if it doesn't crunch WUs because it's crunching games! :)
ID: 59371 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
And epcorian is not overclocked. Running BOINC version 6.4.5
They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.
Is it possible you've got something like an antivirus application that's conflicting on Vista?
The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here
____________ Rosetta Moderator: Mod.Sense
That's right, the Q6600 isn't overclocked, the system contains a Intel DQ35JO MB, Q6600 Processor, 4GB (2x2GB) Kingston Value Ram, Corsair HX-520W PS, 36GB WD Raptor HD, 2x750GB WD HD's in RAID 1, and a Zalman HSF running Vista 64 SP1, no external video card. I use it as a home file and print server and recently a BOINC cruncher as I leave it on 24/7. No issues with Beta WU's or SETI.
I do have NOD32 installed on there but I tried disabling it (I haven't gone as far to uninstall it) and they would still fail.
Maybe I should try an older version of the BOINC client, I will give it a go this weekend and post back.
And epcorian is not overclocked. Running BOINC version 6.4.5
They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.
Is it possible you've got something like an antivirus application that's conflicting on Vista?
The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here
He is running a 64 bit OS though, I read on one of the projects that you need to do something to make 32 bit units work on a 64 bit system, is that true with Rosetta units too? That is NOT true for all projects and I do not remember where I read it.
____________
ID: 59385 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Moved NewtonianRefractor's post here. They report a validation error on a task that had a visit from the watchdog. The task ended at target runtime plus 4 hours, but shows a validation error.
____________ Rosetta Moderator: Mod.Sense
rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
Same problem again on at least one of my computers. This time I have more details:
Application: Rosetta Mini 1.54
Task name: lr6_E_score12_rlbd_1ail_IGNORE_THE_REST_DECOY_6254_459_0
Total runtime before manual cancellation: 72:21:22
Total Progress: 0%
Time to go: 6:42:30 (as usual on my computers)
Any comments/ideas?
ID: 59394 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
____________ Rosetta Moderator: Mod.Sense
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though
Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
____________
First of all, apologies for writing in Spanish, but my English is not good enough.
Since August 2008, 99% of my Rosetta Mini tasks have been ending with a compute error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.
The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project does not offer the option of selecting subprojects, as many others do.
I would like to keep crunching for this project, but there is no way, and there is no point in throwing away wasted hours of computation. I hope this problem gets resolved soon. For my part, I will keep trying from time to time.
Hello.
Following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime 24h.
For now suspended. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.
I recently had a 1.54 workunit with a validate error for no reason I could spot in the Task ID details file. A wingman got a Success, but apparently with a much shorter preferred workunit length than the 14 hours I request.
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.
I only have one project per pc, but I will add a second if the first is having workunit issues. All machines have at least a 20 gig hard drive but most have a 100 gig or bigger hard drive. The one above is a laptop with a 50 gig hard drive with almost 30 gig free. I have Boinc setup to use no more than 50% of the free hard drive space and don't have any issues with space.
____________
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client and so far so good with the minis. I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
I think I spoke too soon...that first WU crunched successfully but only 1 other was WU successful out of the 8 WU's. 2/8, better but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity sake I had my P4 and Atom 330 PC's running 32-bit XP SP3 crunch some Mini's and they did just fine.
ID: 59428 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Hello.
Following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended as it has produced "accepted energy": QNAN(Not a Number?) and RMSD: QO.Model number 25 step 9518. Running time: 20h 2min 21sec.
Set runtime 24h.
For now suspended.No crash before.
OS:Windows 7 beta.I can create dump file using task manager.
Should I let it try to finish?
Thanks
I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
____________ Rosetta Moderator: Mod.Sense
ID: 59437 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Hello,
First of all, apologies for writing in Spanish, but my English is not good enough.
Since August 2008, 99% of my Rosetta Mini tasks have been ending with a compute error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.
The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project does not offer the option of selecting subprojects, as many others do.
I would like to keep crunching for this project, but there is no way, and there is no point in throwing away wasted hours of computation. I hope this problem gets resolved soon. For my part, I will keep trying from time to time.
Warm regards to everyone,
Juan
Hola Juan,
I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of mini.
Looking at his 2 failed tasks, they both have Exit status -226 and the Can't acquire lockfile errors.
He is running Win Vista x86.
I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html
____________ Rosetta Moderator: Mod.Sense
According to the graphics screen of these four WUs, every "accepted" step becomes the new low energy state. No matter if the energy value is smaller or higher...
ID: 59443 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
*I* cured the lock file problem by running with 100% time ... if he has opted to run at some lower percentage of CPU time this may be the issue. Something else to try ... and if it works we can report another success ... this is one of the issues that we have been trying to pin down in RALPH...
Hello.
Following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended as it has produced "accepted energy": QNAN(Not a Number?) and RMSD: QO.Model number 25 step 9518. Running time: 20h 2min 21sec.
Set runtime 24h.
For now suspended.No crash before.
OS:Windows 7 beta.I can create dump file using task manager.
Should I let it try to finish?
Thanks
I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
OK, set runtime at 8 hours, so watchdog would cut it at 24 hours. It has now uploaded and reported it. I have dump files as well, if somebody in the team is interested. (Captured at reported time and step)
And I see I was not alone... :-(
If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...
I read about this in Einstein@Home and it seems to work for me ... YMMV ...
I, too, was plagued by frequent R@H lock file problems. Setting CPU to 100% seems to have cured that.
And, as I have a quad-core CPU, I can limit BOINC usage by setting "On Multiprocessor Systems, use at most 51% of all processors". (If I run BOINC at 100% on all cores, my system gets too hot - more precisely, my fan gets too loud)
-- Andreas
First of all, apologies for writing in Spanish, but my English is not good enough.
Since August 2008, 99% of my Rosetta Mini tasks have been ending with a compute error. After a while I decided to stop crunching for this project. Even so, I try again from time to time, but nothing changes, even with the new versions of Rosetta Mini, including this latest one.
The thing is, Rosetta Beta tasks do not fail for me, but I am sent proportionally very few of those. It is a pity that this project does not offer the option of selecting subprojects, as many others do.
I would like to keep crunching for this project, but there is no way, and there is no point in throwing away wasted hours of computation. I hope this problem gets resolved soon. For my part, I will keep trying from time to time.
Warm regards to everyone,
Juan
Hola Juan,
I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of mini.
Looking at his 2 failed tasks, they both have Exit status -226 and the Can't acquire lockfile errors.
He is running Win Vista x86.
I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html
I never learned enough Spanish to do such a translation myself, so I tried asking that web site to translate all of your reply at once to Spanish, in preparation for writing an answer in English and doing the same to it. It appeared that the translation succeeded, but enough of it was hidden by advertisements that it was unusable.
Anyone know another automatic translation site that doesn't have this problem?
I've been trying to trigger that problem over on RALPH@home by setting my CPU time less than 100% and unable to actually get it less than 100%, so you might want to consider this: For anyone having this problem repeatedly, give them 1.54 workunits with extra debugging output enabled. Then have someone on the RALPH@home staff analyze the results and give them credits according to the RALPH@home standards instead of the Rosetta@home standards.
Hello, First of all, excuses to write in Castilian, but my English is insufficient. From August of 2008 me 99% of the tasks of Mini Rosetta with computational error are finalizing. After a time I decided not to continue processing in this project. Even so, sometimes I return to try it, but everything follows equal: even with the new versions of Mini Rosetta, including this last one. The case is that the tasks of Rosetta Beta do not fail to me, but of that one sends very few proporcinalmente to me. The pain is that in this project the possibility of selecting sub-projects, does not exist there is as if it in other many. I would like to continue processing for this project, but there is no way, and it is not question to throw low-achieving hours of computation. I hope that this problem is solved soon. As for me I will continue trying from time to time. A coridal greeting for all, Juan
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though
Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.
No solution as yet?
ID: 59518 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Mod.Sense
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though
Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.
No solution as yet?
I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?
Odd, the failed task with some time on it shows that your
core client version is 6.2.14, but your BOINC Windows Runtime Debugger Version is 6.5.0. Not sure how that would happen.
We're ready for a new update. I want to say thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue,but now supplemented with information from Rosetta@HOME.
This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.
Features/Fixes:
1.54 Release CHANGELOG
Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.
Bug fix concerning intermittent crashes in relax benchmark jobs (_rlbd_) jobs - caused by buggy input file reader.
Bug fix for a potential instability in handling text files (affects all types of WUs).
Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)
Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (basically at most as long as the time between checkpoints).
Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)
Added checkpointing to Looprelax.
The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!
Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.
Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)
Fixed a strange problem in the options system leading to early crashes on some systems.
Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)
Generally implemented much better error reporting - many, many potential problems will now show up as meaningful error messages and not random segmentation faults.
NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.
Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now though or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue and your time donations over on RALPH as well as error reports are still highly appreciated.
Please let us know how things work out there. In particular I'd like to know about
Stuck workunits
Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
Problems with checkpointing.
Any other strange behaviour.
Happy crunching - I'm very excited to see how this new version will pan out.
I have reached the end. Since your new patch, nothing works from your project. I keep resetting and still I get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.
____________
I have reached the end. Since your new patch, nothing works from your project. I keep resetting and still I get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.
Urgh - bad news :(
I notice you're using Boinc 6.2.19 with Vista64. Can you give it one last try and upgrade to 6.4.5? I had similar problems to you (not anywhere as bad) using Vista64 and these problems have disappeared for me after upgrading. It might make all the difference for you too.
____________
Do you 'overclock' your PC? In that case lowering the overclock might help.
____________
ID: 59527 | Rating: 0
Markus Joined: Feb 21 08 Posts: 1 ID: 243327 Credit: 28,072 RAC: 0
Good morning!
I reinstalled my complete system a few days ago and restarted crunching rosetta@home. Unfortunately I got some errors.
Here is what I got:
12.02.2009 05:37:59|rosetta@home|Restarting task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 using minirosetta version 154
12.02.2009 05:38:00|rosetta@home|Task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 exited with zero status but no 'finished' file
12.02.2009 05:38:00|rosetta@home|If this happens repeatedly you may need to reset the project.
As a result, two workunits aborted with a computation error. Maybe it's just an error on my system, but I wanted to post it.
I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?
In the meantime I have set that computer on NNT, and changed the preferred runtime. I will reactivate that computer, and evaluate Saturday or after the weekend. You'll be informed :)
Very good so far, zero error results on all machines for a long time. This 1.54 is much better than the prev versions, much more stable etc. Keep up the good work stamping out the bugs.
It's been a long time since I've reviewed the results on all my crunchers and found no compute errors. If things keep going the way they are, we might break 100 Tflops yet!
____________
Workunit 205979363
Task 228619747
Name loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t332__IGNORE_THE_REST_2FLIA_6_6646_10_1
Mac OS X 10.4.11
This failed after 216 seconds : tail of stderr below
Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.
interpolate rotamers bin out of range: ARG 1.43667e-05 nan nan nan nan nan
81 81 19 20 2147483649 22 1.43667e-06 nan
ERROR:: Exit from: src/core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
I got a couple of validate errors too: Task 228125280 Task 228133134
There's nothing more frustrating than completing a job ok only for it to go wrong when uploaded.
I notice yours are a bit different though.
The first ones just include the line:
hbond tripped
The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.
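For anyone comparing stderr tails like the ones quoted above, here is a minimal sketch of how the "Exit from" lines could be pulled out automatically. It assumes only the line format visible in these posts; the function names are made up for illustration and are not part of minirosetta or BOINC.

```python
import re

# Matches lines of the form seen in the stderr tails above:
#   ERROR:: Exit from: <source file> line: <number>
EXIT_RE = re.compile(r"ERROR:: Exit from: (?P<path>\S+) line: (?P<line>\d+)")


def find_exit_points(stderr_text):
    """Return (source_file, line_number) for each 'Exit from' line found."""
    return [(m.group("path"), int(m.group("line")))
            for m in EXIT_RE.finditer(stderr_text)]


def hbond_tripped(stderr_text):
    """True if the text mentions 'Hbond tripped' in any capitalisation."""
    return "hbond tripped" in stderr_text.lower()


# Sample copied from the stderr tail quoted in this thread.
sample = """\
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
"""

print(find_exit_points(sample))
print(hbond_tripped(sample))
```

Running this over several failed tasks would show whether "Hbond tripped" always comes with the same exit point or with different ones, which is the open question here.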
I think I spoke too soon...that first WU crunched successfully but only 1 other was WU successful out of the 8 WU's. 2/8, better but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity sake I had my P4 and Atom 330 PC's running 32-bit XP SP3 crunch some Mini's and they did just fine.
So this weekend I installed a fresh copy of XP x64, upgraded it to SP2, installed my x64 version of NOD32 antivirus, told BOINC to use "...use at most 75% of the processors" meaning 3 of 4 cores on my Q6600 and it's crunching Mini's and Beta's without a problem! 1 successful Beta, 5 successful Mini's with 4 more coming down the pipe. So it looks like Mini does not like Vista x64 and on my adventures on google, it turns out that XP x64 is actually based on the Server 2003 code tree while Vista is based on crap. :)
ID: 59610 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...
Does the system have an issue with too many decoys? The reissue has not returned ...
If I remember correctly, they have created a 99 model stop line to keep the tasks from running forever.
ID: 59615 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Yeah, the 99 stop limit was to avoid a problem with the file size that is zipped up and uploaded. However, I was just wondering if there is now a new companion problem that the validator does not properly handle those results... or, the result itself is somehow bad...
In that I have gone back to the 3rd of Feb and have at least a hundred (220) results with only three errors this is a puzzlement ...
{edit}
added number ..
Also I note that The runtime is only 145 seconds ... so that was fast work ... :)
I started running Rosetta this morning on a 64bit Vista machine and all seems to be working well. It's been working well on other projects too. Here is what I'm running:
Core i7 920 CPU
Asus P6T6 WS Revolution motherboard
6Gb DDR3 Triple Channel RAM
Vista Home Premium SP1 64bit
64bit BOINC 6.6.7
As I said, no problems yet and a number of WU's have completed already.
Ok, after a number of successful completions, I did see one that looks like it failed. Message as follows:
2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished
2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent
Don't know the cause of that one...
____________
ID: 59626 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Well, a couple hundred tasks and several with the same error, multiple systems (3 different), based on Xeon, Q9300, and i7 processors, various amounts of available RAM, though in common all are running Win XP Pro 32-Bit:
So... I completed a bunch more tasks successfully, then got a 2nd task where it said the output file was missing. Anyone else getting these?
2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished
2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent
I noticed that both tasks that gave the 'absent output file' message had names that started with the same first part:
ss-neg-1i17__7365_
perhaps a bug in that one?
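A rough sketch of how that comparison could be done on a saved copy of the BOINC message log: it picks out the "Output file ... absent" lines and reports the prefix the affected task names share. The regular expression is based only on the two messages quoted above, and the helper name is hypothetical.

```python
import os
import re

# Based on the message format quoted above:
#   Output file <file> for task <task> absent
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def tasks_with_absent_output(log_lines):
    """Return the task names from every 'Output file ... absent' message."""
    tasks = []
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            tasks.append(match.group(2))
    return tasks


# The four message lines reported in this thread.
log = [
    "2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished",
    "2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent",
    "2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished",
    "2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent",
]

failed = tasks_with_absent_output(log)
# os.path.commonprefix compares plain strings character by character.
print(failed)
print(os.path.commonprefix(failed))
```

With only two failures the shared prefix is suggestive rather than conclusive; more reports from other hosts would be needed to pin the problem on that batch.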
I had one of those fail too. Firewall blocked it from reporting the symbol tables :(
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
ID: 59633 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Looks like Pharrg actually had three of these fail
I had two more similar tasks on my machines, so I suspended others to try and run them.
I've got an ss-neg-1je9 that seems normal so far. But my other ss-neg-1i17 doesn't seem able to display graphics. Black window, no pane lines, on WinXP.
As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.
It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.
Is this something normal that just happened at an unusual time, or something more significant?
What is it showing for the estimated runtime, before the task starts?
There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%
I think my settings before were asking for about 6 hours runtime and now 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer to NNT for Rosetta, but I'm willing to wait a while longer.
ID: 59649
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
Three more errors ... this time two I have not seen before:
229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader
229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz
So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...
Please could someone in authority explain why there have been so many of these recently.
I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.
Keith
ID: 59651
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
rembertw, the maximum runtime preference possible is 24hrs, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28hrs. So, if you could, let it run at least 29hrs and if it is still running at that point, then abort it.
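The arithmetic behind those numbers, as a small sketch. The 4 hour grace period comes from the 1.54 changelog and the 24 hour cap from the preference limit mentioned above; the one hour margin before aborting by hand is only my reading of the "let it run at least 29hrs" advice, and the function names are hypothetical.

```python
MAX_RUNTIME_PREF_HOURS = 24  # largest runtime preference that can be set
WATCHDOG_GRACE_HOURS = 4     # watchdog aborts at preferred runtime + 4 hours


def watchdog_limit_hours(preferred_hours: float) -> float:
    """Runtime after which the 1.54 watchdog is expected to end the task."""
    capped = min(preferred_hours, MAX_RUNTIME_PREF_HOURS)
    return capped + WATCHDOG_GRACE_HOURS


def should_abort_manually(cpu_hours: float, preferred_hours: float,
                          margin_hours: float = 1.0) -> bool:
    """True once the task has outlived the watchdog limit plus a margin.

    Assumption: a one hour margin, matching the 29hrs suggested for a
    24hr preference.
    """
    return cpu_hours > watchdog_limit_hours(preferred_hours) + margin_hours


print(watchdog_limit_hours(24))         # 28
print(should_abort_manually(19, 24))    # False
print(should_abort_manually(29.5, 24))  # True
```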
I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?
____________ Rosetta Moderator: Mod.Sense
I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?
It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same Boinc version.
Some things I noticed:
- when a 0% task (only at Rosetta 1.54) gets paused manually after x hours and it gets restarted, also the time resets to 0.
- When the 1.54 task starts both processors get work (multiple projects). However, when one of the other project tasks stop, then the 2nd processor starts idling. It can not get another task to run from Rosetta or any other project despite the queue having multiple tasks ready to start or continue.
I aborted the 2 remaining Rosetta tasks that had not started yet and am letting the restarted task run. It had already run for 24h+, but because of a pause its time was reset. At this moment it is at 19h again. I will let it run until it gets past 31h runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.
[edit]Changed "all" in "both" and corrected a typo[/edit]
ID: 59677
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.
Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?
I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
____________ Rosetta Moderator: Mod.Sense
rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.
I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.
Is it possible BOINC is having trouble writing to disk (permissions?)? Have you checked the permissions on the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?
Standard setup with full authority running on a local hard drive. No fancy settings.
I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...
Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.
ID: 59686
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
robertmiles, if you were directing the question to me, I try to stay out of that one. And am only recommending a change to BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF problems reported on the 6.6 (which is the current test version) and I think 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (nothing against 6.2.28, but it's not listed anymore for some reason)
And am only recommending a change to BOINC version because problems are occurring with the version installed now.
I set up Boinc 6.4.5 on that computer, and it seems to be running fine with Rosetta. I still will wait for a general upgrade until there are new Boinc versions, I think.
robertmiles
"Current" is for me the version that the actual Boinc site gives as standard. Researching older versions and installing those is too much micromanagement for me. The same goes for posting on the boards... If this problem gets solved with 6.4.5 (and it seems to be solved) then I'm off again.
ID: 59752
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. Something specific to the 1i17, the other ss-neg's do not seem to be having any trouble.
Except for your last one on the list, it got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?
____________ Rosetta Moderator: Mod.Sense
ID: 59756
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
I noticed that with the minirosetta 1.54 the granted credit was very low in the Athlon X2 processors - sometimes half the claimed credit. This did not occur with the single core Athlon.
Problem solved. Updating the motherboard BIOS (F8 to F9) had caused a considerable loss of performance on the PCs with Athlon X2 processors. Restoring BIOS F8 returned the system to normal.
____________
Hi all,
Had the below error show up.
I initially DLd 3 WU, the first 2 bombed, I aborted the 3rd.. I then detached, re-attached, then DLed 11 new ones.
Every one of them went south..
Boinc mgr is 6.2.18
Free disk is 88g
Used by boinc is 4.81
Use at most 100g
Leave 0
Use up to 50% disk
Leave apps in memory.
Only other project (which was suspended) was CPDN at 55% @1004 hrs (do not want to lose this)
At this point, I will wait till next week (SIMAP starting soon with its monthly run :)) and will try again.
Don't want to keep trashing WUs for no reason.
I do have the messages from boinc stored if they would be useful, but here is one thing I see, but it may only be due to the process crashing:
2/26/2009 8:04:04 PM|rosetta@home|Starting lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0
2/26/2009 8:04:05 PM|rosetta@home|Starting task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 using minirosetta version 154
2/26/2009 8:04:19 PM|rosetta@home|Computation for task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 finished
2/26/2009 8:04:19 PM|rosetta@home|Output file lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0_0 for task lr8_A_score12_rlbd_2ci2_IGNORE_THE_REST_DECOY_SAVE_ALL_OUT_7089_1093_0 absent
Thanks
mike
(extra blank lines removed)
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 2-26 20:10: 2:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x009882EA
Engaging BOINC Windows Runtime Debugger...
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C910193 write attempt to address 0x0040118E
Engaging BOINC Windows Runtime Debugger...
</stderr_txt>
]]>
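In case the two numbers in the exit code line look unrelated: the negative decimal value and the hex value are the same 32-bit pattern, read once as a signed and once as an unsigned integer. A quick sketch of the conversion (the helper name is just for illustration):

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


code = -1073741819  # exit code as reported in the task result
print(hex(to_unsigned32(code)))  # 0xc0000005
```

0xc0000005 is the Windows access violation status, which matches the "Access Violation" lines in the stderr output.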
A few questions that may help pin down the problem:
Are you able to find BOINC 6.2.28, and willing to upgrade to it? That's the only version I have used since 5.10.45, and I don't have that problem.
Have you gone to any extra effort to tell BOINC that it could use more virtual memory than the default?
Have you gone to any extra effort to tell your copy of Windows to allow a bigger swap file than the default?
How many BOINC projects do you have your BOINC Manager set up to recognize? I've seen some so far rather indistinct signs that BOINC divides the disk space it is allowed to use into equal sections for each BOINC project it recognizes before it starts dividing those sections into smaller subsections for each workunit. Therefore, if one BOINC project is heavy on disk space use, workunits for that project might run out of disk space even if some other BOINC project doesn't need all that is reserved for it.
Does this site tell you how much memory your machine has now and what the maximum for that model of computer is?
I had problems getting my dual-core CPU to run two Rosetta@home workunits at the same time back when I had only 1 GB of memory to share between Vista and the two workunits, so I ordered an upgrade to the 2 GB maximum my model of computer can handle; now I can run two such workunits at once even while typing this.
Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. Something specific to the 1i17, the other ss-neg's do not seem to be having any trouble.
Except for your last one on the list, it got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?
Right, last was multifix from our "love" Microsoft....
The odd thing is that I had successfully finished 3 models a few days ago, and a couple before that (can't remember the version offhand, only 1 WU at a time), with no issues. I am attached to 7 projects but am not running them all. (I NNT the projects, and have a small buffer so as to not have to worry about having too much. Yeah, I know boinc manages it, but I want to make sure everything gets done quickly.)
When you mentioned boinc dividing the disk space, I am wondering if I had the non-active projects suspended, which I usually have done in the past..
I will retry after I get thru the SIMAP run (this is why I keep the tasks low), making sure my buffer is small so as to hopefully not grab 11 tasks.
I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine.
Last update: everything seems to be ok after I updated the Boinc version to 6.4.5. The exact reason for the 0% progress with Mini Rosetta is still a mystery but at least that computer is crunching again.
Another question that may help pin down the problem:
Did you have graphics enabled at any time during those runs? When I run minirosetta 1.58 for RALPH@home, it completes successfully if I never enable graphics, but fails if I have graphics enabled for a short time during the run.
First three of them have valid status and:
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
called boinc_finish
Another question that may help pin down the problem:
Did you have graphics enabled at any time during those runs? When I run minirosetta 1.58 for RALPH@home, it completes successfully if I never enable graphics, but fails if I have graphics enabled for a short time during the run.
No, did not have the graphics running, the process crashed immediately upon startup (or at least within a few seconds).
Interesting thing..
Normally I only have 1 to 3 projects un-suspended at one time. I had more than that un-suspended, but set to No new tasks..
I suspended ALL projects, shut down, and re-booted.
Started up boinc, set it to not keep projects in memory and to 50% CPU (to use the 1 core, non-HT), unsuspended Rosetta, said give me tasks, hit update. It gave me 6, and then I let it do its thing..
Guess what.. no issues..
I suspended 5 of the tasks to let the 1 run.
I also re-adjusted to 100% to use HT, re-started Docking, and had several Docking and 1 Rosetta finish..
Might be due to allocating memory among the active projects..
Am wondering if any of the other bugs I saw here, is the same issue with too many "active projects".
The programmer in me is suspecting that.. Not knowing what goes on in Boinc, etc could not tell (Besides, don't do C++ or later).
I've had 2 Windows error messages in the last couple of days from Rosetta. This is on a Win XP Pro SP2 system. The last one was this morning. I looked at my results today and this WU has crashed at 15:13:50 UTC:
03/03/2009 6:00:54 AM|rosetta@home|Restarting task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 using rosetta_beta version 598
03/03/2009 6:01:41 AM|rosetta@home|Task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 exited with zero status but no 'finished' file
03/03/2009 6:01:41 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
Identical messages repeated until 7:12 AM when I got this:
03/03/2009 7:12:14 AM|rosetta@home|Computation for task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 finished
03/03/2009 7:12:14 AM|rosetta@home|Output file 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2_0 for task 2p64__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2p64_-native_frag2__7622_431_2 absent
If you look at the task details for WU 209583003 on computer 272841, you'll see this error followed by a dump:
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2834914
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x008BB955 read attempt to address 0x09A9C000
Engaging BOINC Windows Runtime Debugger...
********************
I'm sure it isn't meant to do this...
____________
--hedera
Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.
Could you check the results uploaded for this one and see if the results include any mention of lockfile problems?
Also, a few questions that may help pin down the problem:
1. Do the error messages shown above repeat several times, and do the lockfile error messages if any repeat several times?
2. What version of BOINC are you using?
3. Have you enabled the leave in memory option?
4. What percentage of CPU time do you let BOINC projects use? The 60% setting typical for laptops, the 100% setting typical for desktops, or something else?
5. Did this workunit start with graphics enabled? Did you enable graphics later? Did you then shut down graphics for it?
Once in a while, I get a Microsoft Visual C++ Runtime Library error. It is for minirosetta_1.54_windows_intelx86.exe. The error message reads "This application has requested the runtime to terminate it in an unusual way. Please contact the application's support team for more information."
Received it for this workunit. http://boinc.bakerlab.org/rosetta/result.php?resultid=232499308
Currently Using XP service pack 2 with Boinc version 5.10.45
____________
Task ID 232649967 isn't displaying graphics; instead it's displaying a black window. When I move my cursor around in the black window, a white block of what looks like unreadable text moves around under my cursor. It's a 2vik task. The task finished with a successful outcome. Task ID 232649968 also displays a black window instead of graphics. When I try to close the black window, it comes up with End Program; my options are End Now or Cancel. I chose End Now. That task also finished with a successful outcome.
I am using XP Pro SP3 fully patched and Boinc 6.4.7 on a quad 2.66 with 2.87GB Ram. I'm not sure if that will make a difference or not.
Has anybody else had any of the above issues?
Thanks for any information as to why this could be happening in advance.
____________
Have a crunching good day!! Live in NZ y not join Smile City?
A wild question ... did you enable or run graphics for any task for any project?
Hard to remember. I often go for days without using graphics on any BOINC project these days.
When I was testing graphics triggering the problem for minirosetta 1.58 over on RALPH@home, though, it seemed that only graphics for a 1.58 workunit, not graphics for a 1.54 workunit, triggered the problem, and only for the 1.58 workunit.
I probably used graphics for purposes unrelated to BOINC projects, though, which hasn't triggered such a problem for me in the past.
Wow, this one has a 2.15MB result file for 24hrs of crunching. 100K is more what I am used to seeing. Task name is lrfrag_0_8_hb_t308__IGNORE_THE_REST_ 1M2OB_8_7783_69_0
Mod.Sense
Another 0% progress Minirosetta task, on another computer. 84:25:24 time progress of a projected 10:17:32 duration. Windows XP, SP3, vintage computer. Boinc version 6.2.18.
Task is now aborted, Boinc upgraded to 6.4.7. Am I still the only one noticing this?
ID: 60077
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
rembertw, I haven't heard any other reports. You seem to be translating the screen to English for me, and I appreciate that, but it's never entirely clear what you are referring to. On the English screen, there are three columns of interest, "CPU time", "Progress" (the percentage), and "To completion".
If I understand what you are saying is that
CPU time was 84hrs
progress was 0%
and to completion was still 10 more hours?
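To put that reading of the three columns into something checkable, here is a rough sketch that flags a task as probably stuck when progress is still 0% and the CPU time is well past the estimate. The factor of 2 is an arbitrary threshold picked for illustration, and the function names are hypothetical; nothing here reflects how BOINC itself decides anything.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stuck(cpu_time: str, progress_percent: float, estimate: str,
                factor: float = 2.0) -> bool:
    """True if progress is 0% and CPU time exceeds factor * estimate.

    Assumption: 'estimate' is the duration shown before the task started.
    """
    return (progress_percent == 0
            and hms_to_seconds(cpu_time) > factor * hms_to_seconds(estimate))


# The task reported above: 84:25:24 CPU time, 0%, 10:17:32 projected.
print(looks_stuck("84:25:24", 0, "10:17:32"))   # True
# A task halfway through its estimate with normal progress.
print(looks_stuck("05:00:00", 48, "10:17:32"))  # False
```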
Now that you have aborted that one, what does the next task show for the "To completion" before it starts?
If the above is the correct task, it looks like the host only has 256MB of memory. And 32MB of that is likely devoted to your graphics card. The current recommendation for running Rosetta is machines with 512MB of memory or more. (They recently increased that from 256MB when they began running more tasks that require more memory.)
It looks like that machine has been having trouble earning credit for some time. I see you also do work for WCG. I've noticed that the rice project there runs in about 10MB of memory! So, perhaps that would run better on that machine.
I see you have a very large list of projects you do work for. Are all of your hosts using an account manager and dividing their resource share across all 8 projects? You also have 13 machines active, at least for Rosetta. You might want to create a separate account, or separate venue, to separate your P4 machines from your Core 2's. That way you could have some machines doing more work for WCG, for example, and others doing more for Rosetta, based on each machine's configuration.
____________ Rosetta Moderator: Mod.Sense
If I understand what you are saying is that
CPU time was 84hrs
progress was 0%
and to completion was still 10 more hours?
Mod.Sense, you understood correctly. Indeed, I sometimes have to guess how Boinc translated the English into Dutch. And indeed, it was that task.
I have all my computers under Gridrepublic. It would simply take too much time to micromanage every single computer so I don't even try to. Up until now I simply assumed that projects would not give work to computers that did not have the minimum requirements so I never bothered checking every project for that. From your reply I take it that Rosetta does not do such a test.
There is indeed a list of projects that I have active, but I never run them all at the same time. Right now there are only 2 projects active with a stable feed (Rosetta, WCG), 2 projects that send WUs when they have them (simap, LHC) and one that is only on a couple of computers, and set on NNT (orbit).
Now I set that last computer on NNT for Rosetta since it's got a limited configuration.
I realise that this does not belong here, but it would be interesting if there were a manager like Gridrepublic or BAM that looks at the connected computers, and divides the projects over the available processors. Let's say that for now I have 20 processors available, and Rosetta gets 10% resource share, then Rosetta would get 1 computer with 2 processors working only for Rosetta. All this without having me driving from location to location if I want to change settings. It would help, indeed, in available memory, available disk space and so on.
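Just to illustrate the idea, a toy sketch of what such a manager might compute. No account manager is claimed to work this way; the host names, the greedy selection and both function names are invented for the example.

```python
def processors_for_share(total_processors: int, share_percent: float) -> int:
    """Number of processors a project would get for its resource share."""
    return round(total_processors * share_percent / 100)


def pick_hosts(hosts: dict, target: int) -> list:
    """Greedily pick whole hosts whose processor counts fit the target.

    hosts maps a (hypothetical) host name to its processor count.
    """
    chosen, remaining = [], target
    for name, cpus in sorted(hosts.items(), key=lambda item: -item[1]):
        if cpus <= remaining:
            chosen.append(name)
            remaining -= cpus
    return chosen


# 20 processors in total, Rosetta at 10% resource share -> 2 processors.
target = processors_for_share(20, 10)
print(target)  # 2

# Invented hosts: only the dual-core machine fits exactly.
hosts = {"quad": 4, "office-pc": 2, "old-p4": 1}
print(pick_hosts(hosts, target))  # ['office-pc']
```

A real manager would also have to look at memory and disk space per host, as mentioned above.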
Since there is no such thing for now, I'll just go on as I'm used to: if there's a problem, then upgrade Boinc, and set Rosetta to NNT on the older computers. I can even things out a little by increasing the resource share for every computer set on NNT.
I believe it's something with the machine, since I'm not having errors on any others, and it's only been attached for about 24 hours.
I am curious if there is anything in the debug info that might point to a clue as to what is up with it.
Anyone that can provide some insight would be much appreciated.
Thanks
____________
"Every passing hour brings the Solar System forty-three thousand miles closer to Globular Cluster M13 in Hercules -- and still there are some misfits who insist that there is no such thing as progress." - Kurt Vonnegut
Now that you have aborted that one, what does the next task show for the "To completion" before it starts?
I did not answer this question, but the answer would have been 10:17:32 time to completion if I accepted new tasks from Rosetta on that computer. (I do not). On other computers the time to completion values that I get vary from 9:something up to 12:something depending on the computer. Meaning that every computer has a different "time to completion" but different tasks on one computer have the same value.
Hi folks,
I try to run Minirosetta 1.54 (Windows XP Home SP3, BOINC Manager 6.4.5, wxWidgets version 2.8.7), but my Kaspersky 2009 Internet Security (Version 8.0.0.0.506) blocks it from running and throws up a black error message. I have tried to view the report but that does not give me any information on how to rectify the problem.
The standard Rosetta program will run ok (as of 22 feb workunit). Any suggestions please ?
Looks like you've managed to make your queue of workunits so long you don't return them by the deadline, so other people run them and get credit for them before you do.
Kaspersky Internet Security does not allow the launch of Rosetta Mini 1.54 program because it has a danger rating of 82 and no digital signature. Do you need to add this at your server before sending ?
The former one (workunit 1hz6A_BOINC_ABINITIO_IGNORE_THE_REST-MOO18-S25-9-S3-9--1hz6A-_7873_76) spent more than 4.5 hours of CPU time on my computer. A Windows message reported a Windows C++ Runtime error when this workunit crashed. At the time I was using the Mozilla Firefox browser v3.0, and Firefox also closed unexpectedly at almost the same moment. The task details follow:
Task ID 234173364
Name 1hz6A_BOINC_ABINITIO_IGNORE_THE_REST-MOO18-S25-9-S3-9--1hz6A-_7873_76_0
Workunit 213483545
Created 9 Mar 2009 7:21:46 UTC
Sent 9 Mar 2009 7:23:00 UTC
Received 17 Mar 2009 8:07:24 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 224205
Report deadline 19 Mar 2009 7:23:00 UTC
CPU time 17563.45
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 3-16 14:16:21:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _MOO18U9X9X_00001
# cpu_run_time_pref: 21600
Starting work on structure: _MOO18U9X9X_00002
Starting work on structure: _MOO18U9X9X_00003
Starting work on structure: _MOO18U9X9X_00004
BOINC:: Initializing ... ok.
[2009- 3-17 11:23:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 21600
Starting work on structure: _MOO18U9X9X_00004
Continuing computation from checkpoint: chk_S_MOO18U9X9X_00000004_ClassicAbinitio__stage_1 ... success!
Continuing computation from checkpoint: chk_S_MOO18U9X9X_00000004_ClassicAbinitio__stage_2 ... success!
Starting work on structure: _MOO18U9X9X_00005
Starting work on structure: _MOO18U9X9X_00006
Starting work on structure: _MOO18U9X9X_00007
Starting work on structure: _MOO18U9X9X_00008
Starting work on structure: _MOO18U9X9X_00009
Starting work on structure: _MOO18U9X9X_00010
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024
ModLoad: 7c920000 00094000 C:\WINDOWS\system32\ntdll.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntdll.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 7c800000 0011f000 C:\WINDOWS\system32\kernel32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : kernel32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 77d10000 0008f000 C:\WINDOWS\system32\USER32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : user32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 77ef0000 00049000 C:\WINDOWS\system32\GDI32.dll (5.1.2600.5698) (PDB Symbols Loaded)
Linked PDB Filename : gdi32.pdb
File Version : 5.1.2600.5698 (xpsp_sp3_gdr.081022-1932)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5698
ModLoad: 77da0000 000a7000 C:\WINDOWS\system32\ADVAPI32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 77e50000 00092000 C:\WINDOWS\system32\RPCRT4.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : rpcrt4.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2108)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
ModLoad: 77fc0000 00011000 C:\WINDOWS\system32\Secur32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : secur32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
ModLoad: 76300000 0001d000 C:\WINDOWS\system32\IMM32.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : imm32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
ModLoad: 621f0000 00009000 C:\WINDOWS\system32\LPK.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : lpk.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
ModLoad: 73fa0000 0006b000 C:\WINDOWS\system32\USP10.dll (1.420.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : usp10.pdb
File Version : 1.0420.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Uniscribe Unicode script processor
Product Version : 1.0420.2600.5512
ModLoad: 76cb0000 00020000 C:\WINDOWS\system32\NTMARTA.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 77be0000 00058000 C:\WINDOWS\system32\msvcrt.dll (7.0.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.2600.5512
ModLoad: 76990000 0013d000 C:\WINDOWS\system32\ole32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ole32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2108)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 71b70000 00013000 C:\WINDOWS\system32\SAMLIB.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : samlib.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
ModLoad: 76f30000 0002c000 C:\WINDOWS\system32\WLDAP32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : wldap32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Windows(R) Operating System
Product Version : 5.1.2600.5512
ModLoad: 0b610000 00115000 C:\Program Files\BOINC\dbghelp.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5
ModLoad: 0b830000 00083000 C:\Program Files\BOINC\symsrv.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : symsrv.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5
ModLoad: 0b8c0000 0003a000 C:\Program Files\BOINC\srcsrv.dll (6.6.7.5) (PDB Symbols Loaded)
Linked PDB Filename : srcsrv.pdb
File Version : 6.6.0007.5 (debuggers(dbg).051021-1446)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.6.0007.5
ModLoad: 77bd0000 00008000 C:\WINDOWS\system32\version.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512
*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0
Exiting...
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 32.9406239634204
Granted credit 0
application version 1.54
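For anyone wading through stderr dumps like the one above, a minimal Python sketch of a summary helper may save some scrolling. It only looks for strings that actually appear in these logs; the function name and the returned keys are my own invention, not part of BOINC or minirosetta.

```python
import re


def summarize_stderr(stderr_text):
    """Count restarts, models and known failure markers in a stderr dump."""
    # Every (re)start of the application prints one initialization line.
    starts = stderr_text.count("BOINC:: Initializing")
    structures = re.findall(r"Starting work on structure: (\S+)", stderr_text)
    return {
        "restarts": max(starts - 1, 0),
        # A model resumed after a restart is printed twice, so count unique names.
        "structures_started": len(set(structures)),
        "last_structure": structures[-1] if structures else None,
        "lockfile_failures": stderr_text.count("Can't acquire lockfile"),
        "access_violation": "Access Violation (0xc0000005)" in stderr_text,
    }
```

Run on the log above, this should report one restart and an access violation, which is the kind of short summary that is easier to post than the full module list.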
The other one (workunit lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652) spent less than half an hour on my computer, but no error message was shown when it crashed. I was also using the Mozilla Firefox browser v3.0 at the time; strangely, Firefox did not close unexpectedly this time. The task details follow:
Task ID 236172160
Name lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652_1
Workunit 215347031
Created 17 Mar 2009 8:05:59 UTC
Sent 17 Mar 2009 8:07:24 UTC
Received 20 Mar 2009 17:36:16 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 224205
Report deadline 27 Mar 2009 8:07:24 UTC
CPU time 1436.896
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 3-21 1: 5:10:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/mtyka_lr5_D_score12.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/mtyka_lr5_D_score12.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_2hsb.out.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/lr5_2hsb.out.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Initializing score function:
Initializing relax mover:
Starting protocol...
Silent Output Mode
Jobdist startup..
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: S_shuffle_00001 <--- S_00002_0000216_0_test_6.0.out
Fullatom mode ..
# cpu_run_time_pref: 21600
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024
Engaging BOINC Windows Runtime Debugger...
********************
BOINC Windows Runtime Debugger Version 6.5.0
Dump Timestamp : 03/21/09 01:34:14
Install Directory : C:\Program Files\BOINC\
Data Directory : C:\Program Files\BOINC
Project Symstore :
LoadLibraryA( C:\Program Files\BOINC\\dbghelp.dll ): GetLastError = 1455
LoadLibraryA( dbghelp.dll ): GetLastError = 1455
*** Dump of the Process Statistics: ***
- Working Set Size -
WorkingSetSize: 136130560, PeakWorkingSetSize: 213221376, PageFaultCount: 366777
*** Dump of thread ID 1164 (state: Waiting): ***
- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 93334208.000000, User Time: 14287143936.000000, Wait Time: 2525130.000000
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0055B8C1 write attempt to address 0x00000024
*** Dump of thread ID 3344 (state: Waiting): ***
- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 300432.000000, User Time: 300432.000000, Wait Time: 2525124.000000
*** Dump of thread ID 2416 (state: Waiting): ***
- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 100144.000000, Wait Time: 2524973.000000
*** Debug Message Dump ****
*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0
Exiting...
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 2.76822531352161
Granted credit 2.76822531352161
application version 1.54
Another computer that ran this workunit also ended with a compute error. Its task details follow:
Task ID 236168980
Name lr5_E_01_hbond_bb_sc_rlbd_2hsb_SAVE_ALL_OUT_8261_652_0
Workunit 215347031
Created 17 Mar 2009 7:49:09 UTC
Sent 17 Mar 2009 7:50:56 UTC
Received 17 Mar 2009 8:05:56 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffff47)
Computer ID 868926
Report deadline 27 Mar 2009 7:50:56 UTC
CPU time 0
stderr out
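The exit statuses in the task details above are shown both as a signed decimal and as hex, e.g. -1073741819 (0xc0000005) and -185 (0xffffff47). As a small sketch of how the two forms relate (the helper names are hypothetical), the decimal value is simply the 32-bit hex value read as a signed integer:

```python
def exit_status_to_hex(status):
    """Show a signed 32-bit exit status as its unsigned hex form."""
    return f"0x{status & 0xFFFFFFFF:08x}"


def hex_to_exit_status(hex_text):
    """Read a 32-bit hex code back as the signed decimal shown by BOINC."""
    value = int(hex_text, 16)
    # Values with the top bit set are negative when read as signed 32-bit.
    return value - 2**32 if value >= 2**31 else value
```

This makes it easier to search for a code when only one of the two forms is quoted in a post.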
When I notice that a work unit takes much too long, should I abort it? Or let it run until it fails to validate after 7 hours?
I can only tell you that the v1.54 mini version now includes code both to end such tasks sooner and to report information that helps determine why those models are running so long. Prior to these enhancements, the watchdog would wait until the task had run for 3 or 4 times longer than the runtime preference, and the results returned when the watchdog ended a task were not as useful in studying what occurred.
I've been asking why such tasks are not receiving credit from the nightly credit granting script, but have not yet received any word.
____________ Rosetta Moderator: Mod.Sense
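To put numbers on the watchdog change described above: the release notes at the top of this thread say the watchdog now aborts once the runtime exceeds the preferred runtime plus 4 hours, where previously it waited for 3 or 4 times the preference. The Python sketch below only illustrates that arithmetic; it is not the actual minirosetta code, and the function names are hypothetical.

```python
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "preferred runtime + 4 hours" per the 1.54 notes


def watchdog_limit(cpu_run_time_pref):
    """CPU seconds after which the 1.54 watchdog is described as aborting."""
    return cpu_run_time_pref + WATCHDOG_GRACE_SECONDS


def legacy_watchdog_limit(cpu_run_time_pref, factor=3):
    """Older behaviour as described in the post: 3 or 4 times the preference."""
    return cpu_run_time_pref * factor


def should_abort(cpu_seconds, cpu_run_time_pref):
    """True once a task has run past the 1.54 watchdog limit."""
    return cpu_seconds > watchdog_limit(cpu_run_time_pref)
```

For the cpu_run_time_pref of 21600 seconds (6 hours) seen in the logs in this thread, the new limit works out to 36000 seconds (10 hours), against 64800 to 86400 seconds under the old rule.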
I just tried resetting the Rosetta@home project and got these error messages (with no Rosetta@home workunit running, none downloaded but not run, and the last one already reported):
I got a validate error, another person got a compute error and the third never replied with the task error or completion.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
robertmiles
Sounds like a reboot is in order to clear all of the locks. I've never heard of that happening before. Perhaps something like anti-virus software has taken a lock on the file to perform a scan?
Curious, why were you resetting the project?
____________ Rosetta Moderator: Mod.Sense
A reboot may have helped - it was part of the procedure I described trying over on Ralph@home, and was able to remove the lockfiles for a while.
I was resetting the project because that's what the error messages from the lockfile problem suggest I may need to do. However, it doesn't seem to have helped enough, since the first Rosetta@home workunit my machine completed since the reset had the lockfile problem again:
Two more Rosetta@home workunits that started later aren't finished, but at least don't seem to have run into the lockfile problem yet.
My antivirus program, and also my three antispyware programs, are able to finish scanning a file in much less time than it needs for Rosetta@home and Ralph@home workunits to fail due to too many restarts from a lockfile problem, so I'd expect a lock from any of them to cause lockfile error messages for only a short time, followed by a successful minirosetta restart.
A suggestion - modify minirosetta to check for the lockfile as it starts up (preferably before any effort to create one), report the results of this check if it can, and if this first check for the lockfile finds one, don't waste as much time restarting over and over before declaring the workunit failed.
Another suggestion - modify minirosetta to report which slot it ran in, since the problem looks like it may be specific to workunits assigned to specific slots, due to what looks like its inability to remove lockfiles left by previous workunits assigned to the same slot but already completed since the last reboot.
I leave BOINC running nearly 24 hours a day, often days between reboots, which may have something to do with why I'm seeing the lockfile problem as often as I do.
I'm still using BOINC 6.2.28 under 32-bit Vista SP1.
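The first suggestion above (check for the lockfile at startup, report what was found, and give up quickly instead of restarting over and over) could look roughly like the Python sketch below. This is only an illustration of the idea, not how minirosetta or the BOINC API actually handle locking; the function names, the number of checks and the delay are assumptions. The file name boinc_lockfile is the one seen in the slot directory.

```python
import os
import time

LOCKFILE_NAME = "boinc_lockfile"  # file name as seen in the slot directory


def lockfile_present(slot_dir):
    """True if a lockfile already exists in the given slot directory."""
    return os.path.exists(os.path.join(slot_dir, LOCKFILE_NAME))


def wait_for_lockfile_release(slot_dir, max_checks=3, delay_seconds=1.0):
    """Return True once no lockfile is present, False after max_checks tries."""
    for attempt in range(max_checks):
        if not lockfile_present(slot_dir):
            return True
        # Report the slot, so failures can be tied to a specific slot later.
        print(f"slot {slot_dir}: lockfile present (check {attempt + 1}/{max_checks})")
        if attempt + 1 < max_checks:
            time.sleep(delay_seconds)
    return False
```

A False result would be the point at which the workunit is declared failed, after a few seconds rather than after many restarts, and the printed slot path covers the second suggestion as well.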
You might be interested in this announcement by Bernd over at Einstein@home. He has made an Einstein Windows app specifically to collect more info on the "CPU throttling = too many exits / can't acquire lockfile" errors. Hopefully his discoveries will prove useful here on Rosetta@home as well.
This task is currently using 496MB on my machine. Max was 536MB. It is called 2P09A_BOINC_MPZN_vanilla_abrelax_9106_6681_0
What is the status now that the minimum recommended memory is 512MB? Are there still WUs created that will only go to systems with more? My machine has 2GB, but I was wondering if this task is using more than planned.
That task seems to be running normally otherwise. It is 22 hrs into my 24 hr preference.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
Two of the five subdirectories under the slots directory contain a large number of files, and appear to be for the two workunits now in progress. Two are empty.
The other subdirectory contains only 3 files, and appears to be left over from this failed workunit.
File boinc_lockfile appears to be empty, since its size is zero. It's marked as still in use, though, so I can't check this.
The contents of stderr.txt start with this:
BOINC:: Initializing ... ok.
[2009- 3-25 22:55: 2:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _U9X3X_00001
# cpu_run_time_pref: 43200
Starting work on structure: _U9X3X_00002
Starting work on structure: _U9X3X_00003
Starting work on structure: _U9X3X_00004
Starting work on structure: _U9X3X_00005
Starting work on structure: _U9X3X_00006
Starting work on structure: _U9X3X_00007
Starting work on structure: _U9X3X_00008
Starting work on structure: _U9X3X_00009
Starting work on structure: _U9X3X_00010
Starting work on structure: _U9X3X_00011
Starting work on structure: _U9X3X_00012
Starting work on structure: _U9X3X_00013
Starting work on structure: _U9X3X_00014
Starting work on structure: _U9X3X_00015
Starting work on structure: _U9X3X_00016
Starting work on structure: _U9X3X_00017
Starting work on structure: _U9X3X_00018
Starting work on structure: _U9X3X_00019
Starting work on structure: _U9X3X_00020
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
BOINC:: Initializing ... ok.
Can't acquire lockfile - exiting
The contents of stdout.txt are:
Created shared memory segment
Created semaphore
Do these results mean that Rosetta@home never tries to clear up these three files for failed workunits? Should it? They appear to prevent any workunits from Rosetta@home or Ralph@home from being able to run in this slot until the next reboot - often meaning a few days for me. I haven't seen them have a similar effect on workunits from other BOINC projects, though.
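Until that question is answered, the observation above (a slot subdirectory holding only about 3 files, one of them boinc_lockfile) can be turned into a quick check. The following Python sketch lists such slots; it is a hypothetical diagnostic, not a BOINC tool, and the 3-file threshold is only the figure from this one case.

```python
import os

LOCKFILE_NAME = "boinc_lockfile"


def slots_with_leftover_lockfile(slots_dir, max_files=3):
    """Names of slot subdirectories that look abandoned but still hold a lockfile."""
    suspects = []
    for name in sorted(os.listdir(slots_dir)):
        slot = os.path.join(slots_dir, name)
        if not os.path.isdir(slot):
            continue
        files = os.listdir(slot)
        # A running workunit keeps many files in its slot; a leftover has only a few.
        if LOCKFILE_NAME in files and len(files) <= max_files:
            suspects.append(name)
    return suspects
```

Calling it with the path of the slots directory returns the slot numbers worth checking before the next Rosetta@home or Ralph@home workunit gets assigned there.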
An unlikely 99 decoys from 99 attempts: a wingman had the same problem.
Starting work on structure: _2FSWA_7_00098
Starting work on structure: _2FSWA_7_00099
======================================================
DONE :: 1 starting structures 145.451 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
</stderr_txt>
____________
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
has officially been reported as: Outcome = Success.
However the WU ran only for 4309.559 seconds, cpu_run_time_pref: 21600 and
ended with an error:
Starting work on structure: _1XV2A_15_00008
interpolate rotamers bin out of range: GLN -107.207 180 -7e-005 -6.1e-005 -5.1e-005
34 36 8 9 37 2 0.2793 0
ERROR:: Exit from: d:\boinc_build\minirosetta_windows\mini\src\core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
called boinc_finish
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 6841.62 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
My preferred run time is 6 hours, but this one completed in less than 2. Either this is an extremely quick model or something odd occurred.
____________
Rosetta Wiki: http://rosetta.wikia.com/wiki/Rosetta_Wiki
This appears to look normal, I am getting through them at the rate of about 1.17 minutes per model. If my calculations are correct you are .02 minutes faster per model.
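The per-model figure can be checked against the log quoted above, which reports 6841.62 cpu seconds for 99 decoys. A tiny Python sketch of that calculation (the function name is hypothetical):

```python
def minutes_per_model(cpu_seconds, models):
    """Average CPU minutes spent per completed model (decoy)."""
    return cpu_seconds / models / 60
```

For 6841.62 seconds and 99 decoys this gives roughly 1.15 minutes per model, i.e. about 0.02 minutes faster than the 1.17 minutes mentioned in the reply.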
First error in a long time!
It ran to 100% and had a compute error at the end: abinitio_nohomfrag_129_B_1o73A_SAVE_ALL_OUT_7581_8721_1
Exit status -1073741819 (0xc0000005)
CPU time 11314.84
Starting work on structure: _U9X3X_00001
# cpu_run_time_pref: 14400
Starting work on structure: _U9X3X_00002
Starting work on structure: _U9X3X_00003
Starting work on structure: _U9X3X_00004
Starting work on structure: _U9X3X_00005
Starting work on structure: _U9X3X_00006
Starting work on structure: _U9X3X_00007
Starting work on structure: _U9X3X_00008
Starting work on structure: _U9X3X_00009
Starting work on structure: _U9X3X_00010
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00587042 write attempt to address 0x34A2BAB7
This one cut off after a clean exit of BOINC and a reboot to install a MS fix. What wasn't clean was the restart. I forgot BOINC was in my Windows startup folder and so ended up starting two copies of it. I then ended both, and 61 seconds after starting again, this task was ended. No messages, just that it finished. But it should have run for another couple of hours.
Once again, a task has stopped making progress, showing "Accepted Energy:1.#QNAN" and "Accpeted RMSD:1.#QQ".
It is at 39.50% complete; Model: 11, Step: 7788. I have now suspended the task.
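Values like 1.#QNAN are how older Windows C runtimes print a not-a-number result, so a model showing them is unlikely to recover. As a rough Python sketch of flagging that condition from the displayed text (assuming, without confirmation from the project, that 1.#QQ is the same kind of marker; the function names are hypothetical):

```python
import math


def parse_reported_value(text):
    """Parse a displayed number; any '#' marker is treated as NaN here."""
    # Assumption: '1.#QNAN', '1.#QQ' and similar all mean the value is not a
    # usable number, so they are all mapped to NaN for this check.
    if "#" in text:
        return math.nan
    return float(text)


def model_is_stuck(energy_text, rmsd_text):
    """True if either displayed value is not a real number."""
    values = (parse_reported_value(energy_text), parse_reported_value(rmsd_text))
    return any(math.isnan(v) for v in values)
```

A check like this is only a hint that the task is worth suspending and reporting, as was done here.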
ERROR: in::file::boinc_wu_zip fragments_2hkv.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 108
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Klimax, why don't you go ahead and take a dump and email it to me, along with details on what you observed as it ran. I will forward it to the Project Team.
____________ Rosetta Moderator: Mod.Sense
I'm also having an issue with no progress. Rosetta Beta runs fine, but Rosetta Mini (1.54) never registers any progress even after clocking hours of CPU time (the process I just aborted had clocked almost 17 hours). I have an Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz (WinXP Professional SP3). It also won't switch off to free up a core for another BOINC (v6.4.7) process to run.
Oops, I didn't know :-(
Last time I reported it, I was told to let it finish and upload (IIRC).
Mail is being prepared.
Looks like you've hit one of the errors still in 1.54 because it's too uncommon to debug quickly. Let's hope your results for that workunit help them finally debug it.
Looking at the rest of the jobs your machine has been working on lately, I'd say that you have a lower frequency of errors than I do because you've set up your machine well for aiming at a high score (probably selecting Rosetta@home as your only BOINC project on that machine, selecting leave in memory, and running at 100% CPU usage), while I'm deliberately choosing settings aimed at helping debug problems with the program (giving other BOINC projects enough computer time to prevent workunits from Rosetta@home from being likely to complete without being interrupted to give workunits from other projects a turn, and running at 95% CPU usage, although with leave in memory selected). However, is there any good reason for maintaining such a long queue of jobs waiting for your machine to choose them next, and therefore delaying any work at the Rosetta@home end on your results?
I can't tell if you've also tried a few other things I've also found good for getting a high score, such as:
1. Selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics.
2. If you see the lockfile problem in your results, suspend all projects, reboot the machine to clear any lockfiles left behind by failed workunits, then resume the projects.
3. Running the machine 24 hours a day, except when shutting BOINC down for Windows updates or other updates, running antivirus programs, running antispyware programs, and any needed reboots.
4. If you happen to need some update that doesn't require a reboot, such as most Windows Defender updates, only tell BOINC to suspend all jobs while you install the update, instead of shutting it down completely; then resume the projects after the update completes.
I'm interested to know how "Selecting a black screen, instead of the BOINC graphics, as your screen saver, and avoiding activating the BOINC graphics" helps increase a work unit's score? Thanks in advance.
____________
Have a crunching good day!! Live in NZ y not join Smile City?
Thanks for the reply.
If you look at the scores I achieved for WUs on the host that produced the error, I must tell you 2 important things:
1. The computer originally had a Q6600@3200. On 7 Apr 09 I replaced that CPU with a Q9550@3600. So it is safe to say that credits from 6 Apr 09 and older represent the Q6600, and those from 8 Apr 09 and newer represent the Q9550.
2. I am crunching Rosetta@home on all 4 cores, with GPUGRID on my GTX260. So in reality I run 5 threads under BOINC.
Also:
Ad 1. I don't use the BOINC screen saver, only the Windows logo screen saver, on my CRT NEC 2111SB.
Ad 2. I sometimes suspend to play some games...
Ad 3. I must shut down my PC for the night because it is too loud for me, so it usually crunches from 10 a.m. to 11-12 p.m.
Ad 4. Rosetta@home is very GUI friendly because there is no slowdown in the interface. GPUGRID is a real horror in that respect...
Running at 100% CPU usage is also set.
The leave in memory option was not selected, but today I selected it. I will see what happens :)
Also, I work in 32-bit XP with 2x2 GB of CL4 DDR2 423 (846).
I'm interested to know how selecting a black screen instead of the BOINC graphics as your screen saver, and avoiding activating the BOINC graphics, helps increase a work unit's score. Thanks in advance.
Just because your computer then doesn't have to do the computation for the graphics thread too.
____________
Selecting a black screen, which only needs to be calculated once, cuts down on CPU time needed to calculate the graphics, and lets more of what's available be used for the scientific calculations. Since Rosetta@home uses the number of decoys produced as a more important factor in calculating how much credit to give you than the CPU time required to do it, this is likely to increase the number of decoys your computer produces for that workunit, and therefore the resulting score.
Also, something involving the graphics seems to be able to trigger the lockfile problem for a workunit, with the results then returned marked as invalid and therefore worth a score of zero. Once a lockfile problem occurs, 1.54 seems to be unable to erase the lockfile from the slot used by that workunit, and therefore lets the problems spread to any 1.54 workunits run later in the same slot but before the next reboot. My results for Ralph@home indicate that the 1.58 now being tested there has kept this same problem, and therefore needs more testing before the 1.54 used at Rosetta@home is replaced with a newer version.
Also, something involving the graphics seems to be able to trigger the lockfile problem for a workunit, with the results then returned marked as invalid
I turn the graphics on and off several times during the course of the day to check on the performance and I haven't encountered this lockfile problem for a long time now on both Rosetta and Ralph.
Having said that, Murphy's Law states 'watch this space'!!
____________
The lockfile problem results could vary depending on what operating system version and what BOINC version is used; if so, my results could easily apply only when using BOINC 6.2.28 under Vista SP1. In other words, I suspect that results from just the two of us aren't enough; we need more people with access to other operating system versions and more versions of BOINC to test for graphics causing the lockfile problem and report the results, along with which operating system version and which BOINC version was used.
ID: 60573 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
New error:
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
The same task run on an XP machine ran for a long time and only failed on validate. Which is kind of interesting, it almost seems as if my machine (OS-X) tipped over on an assertion or parameter file error ... what is the difference in OS platform guys ...
I have a 10-hour preferred runtime on my MacBook Pro. I spotted lb_all_multi_threshold.2.0_hb_t317__IGNORE_THE_REST_1I9SA_12_10355_4_0 still running at 10 hours and 20 minutes, so I opened the graphics window to check on it. It was on model 33, step 1920, stage unk. Checking on it later, it had run another CPU hour but failed to make any progress, so I shut down BOINC completely and restarted. It now showed 5 hours and 20 minutes of CPU time consumed, all the rest the same. Within a few seconds it returned to step 0 and apparently restarted model 33 from the beginning. I didn't catch exactly when it reached step 1920, but it would have been about 3-4 CPU minutes after the restart. It didn't get stuck this time but continued on its merry way. It had also moved out of the unk stage by the time I glanced at it 4+ minutes after restart. It has now finished successfully and validated, with 58 models completed in 10 (non-stuck) hours.
I was looking through the RALPH minirosetta v1.54 bug thread and found an issue about setting day-of-week overrides (http://ralph.bakerlab.org/forum_thread.php?id=432&nowrap=true#4590). I had some set on network usage; when I cleared them and restarted BOINC (which I upgraded to 6.6.20), I started registering progress on a minirosetta task, as well as seeing stderr progress past
Initializing options.... ok
This appears to have been the cause of my problem.
While not exactly a bug, this morning I had a rather large upload file...
Task 243404526 had a 6.8MB file to upload. The task only ran for about 50 minutes and my preference is set to 4 hours. It did 99 decoys from 99 attempts.
Thought admin might want to know...
____________
Never surrender and never give up. In the darkest hour there is always hope.
ID: 60638 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Wow, good thing the watchdog only lets 99 models run. Just imagine how large it would have been with a 4 hour run!
____________ Rosetta Moderator: Mod.Sense
Problem task names all begin with "res_careful_". For details on which proteins are known to have problems and should be aborted, and which will run OK and should be run normally, please see the link above.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
This task http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658 made 99 decoys and the upload was about 7.14MB. Is this normal for these tasks?
____________
Have a crunching good day!! Live in NZ y not join Smile City?
This task http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658 made 99 decoys and the upload was about 7.14MB. Is this normal for these tasks?
There is a limiter built into the program. It stops the crunching at 99 decoys.
This is normal.
I'm aware of this, thanks. greg be, I think you misunderstood the question. I was referring to the upload size of the work unit. Is 7.14MB the normal upload size for this type of work unit (http://boinc.bakerlab.org/rosetta/result.php?resultid=243902658)?
____________
Have a crunching good day!! Live in NZ y not join Smile City?
ID: 60685 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
The more models completed, the larger the upload will be. The resulting increase in upload size is part of why mini put on the 99 model limit per task. So, it is normal, but will probably be reviewed and perhaps changed to run longer models in some way.
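To put rough numbers on that, here is a minimal sketch that divides the upload sizes reported earlier in this thread by the number of models completed. It only restates the posters' figures; taking 1 MB as 1024 KB is an assumption, and the per-model size will certainly vary with the protein and the task type.

```python
# Figures reported in this thread: (upload size in MB, models completed).
reported_uploads = [
    (6.8, 99),   # task 243404526
    (7.14, 99),  # result 243902658
]


def kb_per_model(size_mb: float, models: int) -> float:
    """Average upload size per model in KB (assumes 1 MB = 1024 KB)."""
    return size_mb * 1024 / models


for size_mb, models in reported_uploads:
    per_model = kb_per_model(size_mb, models)
    print(f"{models} models, {size_mb} MB -> {per_model:.1f} KB per model")
```

Both reports work out to roughly 70 KB per model, which fits the explanation that upload size scales with the model count.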
____________ Rosetta Moderator: Mod.Sense
ID: 60691 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,383,092 RAC: 0
The more models completed, the larger the upload will be. The resulting increase in upload size is part of why mini put on the 99 model limit per task. So, it is normal, but will probably be reviewed and perhaps changed to run longer models in some way.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
TomaszPawel cites two cases where 99 models were completed in less than an hour with a 6.6.20 Win XP client, and resulted in a validate error from miniRosetta v1.54.
William Kahler Joined: Oct 26 06 Posts: 1 ID: 124989 Credit: 18,722 RAC: 15
MiniRosetta 1.54 constantly crashing after ~5 seconds
& (note to Bill G) w/Boinc 6.4.x & 6.6.x (Error Code 5).
It runs a little slow for first 5 seconds of CPU time
w/last stable Boinc 5.x & finishes ok.
No difference with protected app. or not.
Complete BOINC un/re-install & Rosetta de/re-attach no help.
Dell Core Duo 2 GHz w/2 Gig Ram.
WinXP Sp3 Home Edition (up to date).
24/7, no throttle, no graphics/screensaver, leave in memory.
Stand alone or with other projects.
Memtest x2/Prime95/Dell Diagnostics run fine.
Task 246174559 ran for 4 hours with 82 decoys. File upload size was 8.9MB. Took a while to upload. Hate to see what it would have been if there were 99 decoys...
____________
Never surrender and never give up. In the darkest hour there is always hope.
======================================================
DONE :: 1 starting structures 2496.42 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
======================================================
DONE :: 1 starting structures 21620.9 cpu seconds
This process generated 75 decoys from 75 attempts
======================================================
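The two result summaries above are easier to compare as CPU seconds per decoy. A minimal sketch, assuming the summary always has the two-line shape shown here (the regular expression is written for these lines only and is not taken from any minirosetta documentation):

```python
import re

SUMMARY = """\
DONE :: 1 starting structures 2496.42 cpu seconds
This process generated 99 decoys from 99 attempts
DONE :: 1 starting structures 21620.9 cpu seconds
This process generated 75 decoys from 75 attempts
"""

# Matches one "DONE" line followed by its "This process generated" line.
PATTERN = re.compile(
    r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated\s+(\d+) decoys from\s+(\d+) attempts"
)


def seconds_per_decoy(text):
    """Return a list of (cpu_seconds, decoys, seconds_per_decoy) tuples."""
    rows = []
    for cpu, decoys, _attempts in PATTERN.findall(text):
        cpu_s, n = float(cpu), int(decoys)
        if n > 0:  # skip results that produced no decoys
            rows.append((cpu_s, n, cpu_s / n))
    return rows


for cpu_s, n, per_decoy in seconds_per_decoy(SUMMARY):
    print(f"{n} decoys in {cpu_s:.0f} s -> {per_decoy:.1f} s per decoy")
```

The first task averages about 25 CPU seconds per decoy and the second about 288, which is why the first one hit the 99-decoy limit well short of an hour.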
Hello, I have a problem: very long pending status in my last WUs: 123456
ID: 60869 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Hello, I have a problem: very long pending status in my last WUs: 123456
That would explain why credit has been dropping. The assimilator must be having a problem. I've EMailed the Project Team to look in to it when they arrive for the day in Seattle.
____________ Rosetta Moderator: Mod.Sense
Hello, I have a problem: very long pending status in my last WUs: 123456
That would explain why credit has been dropping. The assimilator must be having a problem. I've EMailed the Project Team to look in to it when they arrive for the day in Seattle.
I'm assuming this is fixed now. 17 of my WUs have been allocated credit since the original post, but I have another 15 pending credit - 13 hours worth.
Just awaiting catch-up, I assume. The Server Status page is showing all systems 'Running'.
I also noticed credit was taking more than 4 minutes to come through in the days leading up to the outage, so the problem may've been building up for a few days.
____________
It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.
I have a HT P4, so 2 CPUs. As the primary task cycles through periods with lower memory usage, it attempts to fire up the second core. Only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles in to another phase of higher memory usage.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
Hello !
Now at the advent of version 1.64, I'm having difficulty uploading my last crunched file from version 1.54. I repeatedly get the following messages:
30/04/2009 13:40:31|rosetta@home|Started upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:19||Project communication failed: attempting access to reference site
30/04/2009 13:42:19|rosetta@home|Temporarily failed upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0: connect() failed
30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of lb_all_multi_threshold.0.5_hb_t311__IGNORE_THE_REST_1ZK8A_1_10279_7_2_0
30/04/2009 13:42:21||Internet access OK - project servers may be temporarily down.
As seen on the server status page, all servers are running. So why this problem, and how can I cure it?
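For anyone watching how long the client will wait before retrying, here is a minimal sketch that pulls the back-off delay out of a message line like the ones above and converts it to seconds. The pattern is an assumption based only on the "Backing off ... hr ... min ... sec" wording seen in this thread; other BOINC versions may word it differently.

```python
import re

# Hours and minutes are optional, e.g. "Backing off 12 min 18 sec".
BACKOFF = re.compile(r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec")


def backoff_seconds(line):
    """Return the back-off delay in seconds, or None if the line has none."""
    match = BACKOFF.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


line = "30/04/2009 13:42:19|rosetta@home|Backing off 1 hr 50 min 57 sec on upload of ..."
print(backoff_seconds(line))
```

A delay of that size means the client will not retry on its own for almost two hours unless it is told to retry manually.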
It suggests keeping tasks in memory. But I've always had it configured to do so. I've also limited the memory available to BOINC while computer is in use. This seems to cause BOINC to begin and then suspend the tasks numerous times during the day. When the task attempts to run and then exceeds memory bound, it goes to a status of waiting for memory. But it no longer appears in the Windows task list, hence was removed from memory.
I have a HT P4, so 2 CPUs. As the primary task cycles through periods with lower memory usage, it attempts to fire up the second core. Only to find it ends up short of memory again a few minutes later as the second task gears up and uses more, or the first cycles in to another phase of higher memory usage.
BOINC 6.6.20 is working better for me, so let's compare our machines and settings. My newer machine, with BOINC 6.6.20 under 64-bit Vista SP1 with 8 GB of memory, does not appear to have any memory problems.
My 32-bit Vista SP1 machine, with BOINC 6.2.28, originally came with 1 GB of memory. I found that wasn't enough to even start running two minirosetta@home workunits at the same time. After enough other problems showed up which I decided were memory problems, I used this site to find out how much memory my motherboard could handle, and then order enough to raise it to the 2 GB limit for my motherboard:
This was enough to allow it to start running two minirosetta workunits at once on my 2 CPU cores, but still not enough to run them well. Eventually, I raised both the amount of disk space BOINC is allowed to use, and the amount of swap space BOINC is allowed to use. It's not clear which of the last two steps were actually needed, if not both of them, but that combination handled the memory problems on that machine.
At least some versions of BOINC do not divide up the available swap space in the most efficient way - they first divide it up into equal shares for each BOINC project you have subscribed to, then those shares into smaller shares for each CPU core. If these smaller shares aren't large enough, it can't preserve any work done since the last checkpoint by simply swapping one into the swap space on the hard drive.
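As a rough illustration of why that kind of equal split can starve a memory-hungry project, here is a sketch of the arithmetic. It models only the behavior described in this post, which is an observation and not a documented BOINC algorithm; the function name and all the numbers are hypothetical.

```python
def swap_share_mb(total_swap_mb, boinc_fraction, n_projects, n_cores):
    """Swap available to one task under the equal-split behavior described above.

    Hypothetical model: BOINC's allowance is divided equally per project,
    then each project's share is divided equally per CPU core.
    """
    boinc_swap = total_swap_mb * boinc_fraction
    per_project = boinc_swap / n_projects
    return per_project / n_cores


# Illustrative numbers only: 4 GB of swap, BOINC allowed 75% of it,
# 4 attached projects, 2 CPU cores.
print(swap_share_mb(4096, 0.75, 4, 2))
```

Under these assumptions each task gets only a small slice of the total, so attaching to more projects shrinks the share available to a large minirosetta task even when most of the swap space sits unused.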
Does the HT stand for hyperthreaded, a method of appearing to have twice as many CPU cores by giving each one of them an extra set of registers? If so, I've seen messages from other BOINC users saying that this does not increase the total throughput very much. Therefore, until you are able to handle the memory and swapfile problems, you may find it worthwhile to tell BOINC to use only one of the two apparent CPU cores on your machine.
Both were then completed successfully by someone else.
Could minirosetta be modified to check for the lockfile problem sooner, and at least produce more debug information about it instead of wasting CPU time first?
Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.
Yes, by HT, I meant hyperthreaded. But I believe setting number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time, with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT, then I'd be focusing all the resource on one task at a time, and not have the desire to support memory enough for two tasks.
I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.
I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
Hello !
I´ve difficulties to load up my last crunched file with version 1.54. I get repeatedly
I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
5/1/2009 8:51:54 AM rosetta@home Backing off 2 hr 52 min 32 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0
5/1/2009 8:51:54 AM rosetta@home Started upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:51:56 AM Internet access OK - project servers may be temporarily down.
5/1/2009 8:51:59 AM rosetta@home Finished upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1FXWF_6_11644_1_0_0
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.
Should I abort these transfers? I will wait for further instructions before I do anything to these.
____________
Have a crunching good day!! Live in NZ y not join Smile City?
Robert, thanks for the comments. I have plenty of memory, but for 1/3 of the day I actually use it for a number of work applications and with the new increase in memory used by mini, I'm testing to see if BOINC is the cause of some sluggish behavior on my machine. Indeed it seems to be the case.
Yes, by HT, I meant hyperthreaded. But I believe setting number of CPUs to one on a machine configured with HT active would cut my credit roughly in half. I'd think that the other analysis you've read is comparing a machine with HT enabled running 2 tasks at a time, with the same machine with HT disabled running 1. Since my HT is enabled, running 2 tasks is the only way to break even. But yes, one option would be to disable HT, then I'd be focusing all the resource on one task at a time, and not have the desire to support memory enough for two tasks.
I was just trying to point out that 6.6.20 seems to be removing tasks from memory in some cases, even when configured to leave tasks in memory. And this can lead to cancelled WUs such as I reported. I wasn't limiting memory on my prior version of BOINC, so am unsure if this is new behavior or not.
I just saw another task suspended waiting for memory, but this time it remained in the task list. Could be BOINC saw it had 3 hours invested in it and didn't want to throw it away. I believe the tasks that are getting removed are actually only running for a couple of minutes.
Do you have enough free disk space to allow BOINC enough space to increase the swap space it can use to store any partly completed work in a way that allows resuming it where it was interrupted? That way, BOINC could simply switch to helping projects with lower memory requirements while you need more memory for something else; for example, the POEM@HOME project requires less memory, but helps an earlier step in medical research. That way, the suspended tasks will move off of the list of tasks currently running, but in a way that lets them move back onto this list and at the point of interruption later, instead of being dropped entirely. Such tasks will need to go back to the last checkpoint if you reboot for any reason, though. If you prefer to run mainly Rosetta@home, just keep the percentage of your CPU time assigned to these lower memory requirement projects less than the percentage of your CPU time you actually need to run with lower memory requirements. Also, ensuring that there is enough swap space for all the projects BOINC tries to keep running at once allows you to suspend all BOINC projects at once if you need to run something with even more requirements. It seems that the defaults for the amount of swap space BOINC is allowed to use aren't good enough if you attach to enough BOINC projects at once, and even one of them is as memory-hungry as Rosetta@home.
Also, turning off one of a pair of hyperthreaded CPUs shouldn't cause you to get only half the credits, since it then allows you to run the other one at full speed, instead of at barely more than half the full speed. It would, however, give you only half the credits if you actually had two fully independent CPU cores instead of a hyperthreaded pair, or if you use an older version of BOINC that isn't aware that it needs to keep track of CPU core sharing between hyperthreaded pairs.
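The hyperthreading argument can be sketched as simple arithmetic. This assumes, as the post does, that a lone task runs at full speed and that each of two hyperthreaded tasks runs at "barely more than half" of it; the 0.55 figure is a made-up illustration, not a measurement.

```python
def combined_speed(per_thread_speed: float, threads: int) -> float:
    """Total throughput relative to one task that has the core to itself (1.0)."""
    return per_thread_speed * threads


def loss_from_running_one_task(per_thread_speed: float) -> float:
    """Fraction of throughput given up by running 1 task instead of 2."""
    two_tasks = combined_speed(per_thread_speed, 2)
    one_task = 1.0  # assumption: a lone task runs at full speed
    return (two_tasks - one_task) / two_tasks


ASSUMED_HT_SPEED = 0.55  # hypothetical "barely more than half" per-thread speed
print(f"{loss_from_running_one_task(ASSUMED_HT_SPEED):.1%}")
```

With that assumed figure the loss from dropping to one task is roughly 9%, not 50%. The real number depends on the CPU and the workload, so it is worth measuring on your own machine before changing the setting.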
If your main concern is credits for helping medical research and you happen to have one of the newer graphics boards GPUGRID can use (mainly recent Nvidia cards), consider adding GPUGRID to your list of BOINC projects. It will require switching to the newest version of BOINC I've read about, but then can run workunits on your graphics card instead of on your CPUs. Shouldn't interfere with your regular computer use if it isn't graphics-intensive.
Also, check if that web site I gave mentions how much memory your machine can handle and what the price is. I spent only about $50 (US) to reach the maximum amount this computer can use, but that did have me as the person who installed the new and faster memory.
ID: 60930 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2988 ID: 106194 Credit: 0 RAC: 0
Speedy, no don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.
____________ Rosetta Moderator: Mod.Sense
Speedy, no don't abort them. I'm sure the problem with uploads must be related to the current problems with getting credit issued. When the back end file system is having problems, everything is having problems to some degree or another.
Thank you. All my results that need to be uploaded have just been uploaded. All is good at my end. Thank you for your continued hard work.
____________
Have a crunching good day!! Live in NZ y not join Smile City?
Robert, yes I've had all the same thoughts, and have plenty of disk allowed to BOINC, and to my swap file. But I am finding that BOINC isn't smart enough to realize which projects require less memory. It cycles through all the work you currently have for the project it wants to repay debt to, and only after it gets about 2 minutes in to every single downloaded Rosetta task will it try to run a 10MB WCG rice task. But if I don't happen to have any WCG work, it isn't smart enough to think about getting some rather than leaving a CPU idle.
I'd love if it were smart enough to run one Rosetta and one rice during the day when I'm using the machine, and then run dual Rosetta tasks at night when my machine is idle and I allow more memory to BOINC. But it's just not smart enough to do so without major manual adjustments.
I could keep a larger cache of work, and therefore help assure I always have something from each project, but then it would cycle through 10 Rosetta tasks, running each for 2 minutes, rather than just 6.
Hopefully with all the discussion on the client work fetch policies, something will shake out that will work better.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
Hello !
I´ve difficulties to load up my last crunched file with version 1.54. I get repeatedly
I'm getting the same type of messages too:
5/1/2009 8:51:54 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VYHA_4_11644_1_0_0: HTTP error
...
5/1/2009 8:52:53 AM Project communication failed: attempting access to reference site
5/1/2009 8:52:53 AM rosetta@home Temporarily failed upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0: HTTP error
5/1/2009 8:52:53 AM rosetta@home Backing off 12 min 18 sec on upload of less_careful_inward0.05_ub0.2_lb0.07_maxdist20_chunk_0_8_hb_t297__IGNORE_THE_REST_1VJGA_4_11644_1_0_0
5/1/2009 8:52:55 AM Internet access OK - project servers may be temporarily down.
Should I abort these transfers? I will wait for further instructions before I do anything to these.
If the transfers aren't too close to their deadlines, I'd just let BOINC keep trying. I've had workunits upload successfully after getting similar messages for days, when router problems kept me from reaching the internet at all for several days. However, it's occasionally useful in such circumstances to first start viewing the Rosetta@home web site to make sure the connection is open. Then, without closing your browser, start the BOINC manager program if it isn't already running, click on Advanced View if the simplified view appears first, click on the Transfers tab, click on Advanced, then click on Do network communication in order to make it retry the communications while your connection to the internet is still open.
For some BOINC projects, even returning the results after their deadlines is useful, if you manage to return the results before anyone else does for the same workunit. Not all BOINC projects allow this, though.
Guess I'd never noticed BOINC allows you to configure the amount of swap space (I thought you meant size of Win page file). It was set to 75%, and Win task manager shows my "commit charge" to be 1477M/3397M. So does that mean my swap file is 3.4GB? And so BOINC is allowed over 2GB of swap space, but my entire system hasn't reached that much.
At least some versions of Windows automatically expand the swap space if BOINC is allowed to use a large enough fraction of it to come close enough to the amount already provided. I'd expect the name page file to be what some people call the swap file.
I've set up my machines to start up with the swap file size already set to 30 GB, with no sign of coming close to that limit. That doesn't allow any further expansion, but should keep the disk head from needing to move very far when going from one place in the swap file to another.
I have seen signs that BOINC divides the available swap space equally among either the active slots or all the enabled BOINC projects before deciding how much to give to each workunit, and does not adjust this based on how much memory each BOINC project is expected to require. For that reason, if you have enough free disk space, allowing both the swap file and the disk space for each workunit to be significantly more than the average required is helpful for the applications with high requirements, such as minirosetta.