Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Hello All!
We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank-you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.
This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.
Features/Fixes:
1.54 Release CHANGELOG
Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.
Bug fix concerning intermittent crashes in relax benchmark (_rlbd_) jobs - caused by a buggy input file reader.
Bug fix for a potential instability in handling text files (affects all types of WUs).
Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)
Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).
Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)
Added checkpointing to Looprelax.
The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!
Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.
Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)
Fixed a strange problem in the options system leading to early crashes on some systems.
Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)
Generally implemented much better error reporting - many, many potential problems will now show up as meaningful error messages and not random segmentation faults.
NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.
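As a rough illustration of two of the limits listed in the changelog (the watchdog's "preferred runtime + 4 hours" rule and the cap of 99 decoys per WU), here is a minimal Python sketch. The function and constant names are hypothetical and are not taken from the minirosetta source; only the two numbers come from the changelog.

```python
# Sketch of two limits described in the 1.54 changelog.
# All names here are hypothetical; only the numbers come from the changelog.

WATCHDOG_GRACE_SECONDS = 4 * 3600  # "preferred runtime + 4 hours"
MAX_DECOYS_PER_WU = 99             # "limit on the number of decoys per WU: 99"


def watchdog_should_abort(runtime_seconds, preferred_runtime_seconds):
    """Return True once a task has run longer than the preferred runtime plus 4 hours."""
    return runtime_seconds > preferred_runtime_seconds + WATCHDOG_GRACE_SECONDS


def should_finish_gracefully(decoys_done):
    """Return True once the WU has produced the maximum number of decoys."""
    return decoys_done >= MAX_DECOYS_PER_WU


# Example: with a 4 hour preference (14400 s) the watchdog limit is 8 hours (28800 s).
print(watchdog_should_abort(28801, 14400))
print(should_finish_gracefully(99))
```

With a preferred runtime of 4 hours this gives a hard ceiling of 8 hours, which is the "should not overrun for more than around 4 hours" behaviour described above.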
Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.
Please let us know how things work out there. In particular I'd like to know about
Stuck workunits
Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
Problems with checkpointing.
Any other strange behaviour.
Happy crunching - I'm very excited to see how this new version will pan out.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
The link in the news item that should bring you to this thread is truncated.
____________ Rosetta Moderator: Mod.Sense
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
The news item also shows the year as 2008 (which is probably the last time you had enough coffee to be able to read the calendar!!). All these improvements are going to send TeraFLOPS much higher! Nice work Mike, and BakerLab. I can really see that you've come through for people here.
____________ Rosetta Moderator: Mod.Sense
Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
____________
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...
I read about this in Einstein@Home and it seems to work for me ... YMMV ...
____________
I don't know about others but my Rosetta machines are running dry!!! The new minirosetta is stuck downloading at 89.25% and has been there for HOURS!!!!
I have had to attach to a different project until it gets sorted out. So far all machines, exact same problem, one a dual core, one a single core. If you look at my computers, they are not hidden, any task that says "outcome unknown" is because the mini-rosetta download ain't happening!!!! Message in Boinc says:
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
etc, etc, etc, etc forever!!!!
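For anyone who wants to summarise a message log like the one above rather than read it line by line, here is a small Python sketch. It assumes only what the pasted lines show: each entry is `timestamp|project|message`. The helper names are made up for this example.

```python
from collections import Counter
from datetime import datetime

# The log lines exactly as pasted in the post above.
LOG = """\
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
"""


def parse_line(line):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def failure_reasons(log_text):
    """Count the reason given after 'Temporarily failed download of <file>: '."""
    reasons = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        _, _, message = parse_line(line)
        if message.startswith("Temporarily failed download of "):
            reasons[message.rsplit(": ", 1)[1]] += 1
    return reasons


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]
# Time between the first "Started download" and the first failure.
first_attempt = entries[1][0] - entries[0][0]
print(first_attempt, dict(failure_reasons(LOG)))
```

On these lines the first attempt ran for a little over five minutes (4:45:03 to 4:50:11) before failing with "HTTP error", and the retry failed with "connect() failed".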
Another project now loves you!!
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?
Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.
Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.
Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
____________ Rosetta Moderator: Mod.Sense
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Paul,
can you point me to the thing you read about lockfile problems on Einstein?
5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.
What do you mean by 100% CPU? If I can make this happen here on my machine I could learn more about what's going on.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
What do you mean by 100% CPU ?
"computing preferences" configured on website for the venue of the machine. The setting is called "Use at most" at the bottom of the processor usage section.
Can also be configured via the BOINC Manager for a specific host.
____________ Rosetta Moderator: Mod.Sense
mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?
I do not use a proxy, just straight to the net. I use Comcast.
Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. The one I am looking at right now has been trying for 11:51:02 and is going to retry in 03:34:34, and counting.
Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.
Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
Yes I have, no luck, the file is stuck at 89.25, 89.26 or 89.27% depending on the pc. I am stuck at exactly 5.85 meg of 6.56 meg on all machines.
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>
If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
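For reference, here is a hedged Python sketch that builds a cc_config.xml with the transfer-debug flag switched on. The `<cc_config>` / `<log_flags>` layout follows the BOINC client configuration format as I understand it. The "first three flags" are not shown in this thread, so `task`, `file_xfer` and `sched_ops` below are an assumption, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the "first three flags" are BOINC's usual default log flags.
ASSUMED_FIRST_THREE_FLAGS = ("task", "file_xfer", "sched_ops")


def build_cc_config(file_xfer_debug=True):
    """Build a cc_config tree with <file_xfer_debug> set inside <log_flags>."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in ASSUMED_FIRST_THREE_FLAGS:
        ET.SubElement(log_flags, name).text = "1"
    ET.SubElement(log_flags, "file_xfer_debug").text = "1" if file_xfer_debug else "0"
    return root


def to_xml_text(root):
    """Serialise the tree to a string (no pretty-printing)."""
    return ET.tostring(root, encoding="unicode")


print(to_xml_text(build_cc_config()))
```

The printed text would be saved as cc_config.xml in the BOINC data directory. The client has to re-read its configuration (or be restarted) before the extra messages appear.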
____________ Rosetta Moderator: Mod.Sense
I'm seeing a validate error on task 224245929 , workunit 204213187, Mac OS X 10.4.11. The task name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_115354_1 : it ran twice as long as it was supposed to and I was the second person to get it. The original person to whom it was sent also got the same validate error: irritating after it took twice as long as it was supposed to. It seems to be one of these zinc-containing proteins that have a habit of doing this.
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 1:26:32:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish
</stderr_txt>
]]>
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, I don't know why I didn't think of this before... download the file directly
and drop it into your Rosetta folder in your BOINC data directory, under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
____________ Rosetta Moderator: Mod.Sense
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
Sorry, I should have mentioned there is a new rule. Mini will not produce more than 99 models. It will finish gracefully and grant full credit. The reason for this is that I want to prevent your individual uploads from getting too large. In the future there will be a better way to do this, such as checking that the output file size has not reached some limit.
It's just another safety hook that's been put in to prevent WUs from misbehaving.
Hello all.
I had no problems receiving Minirosetta v1.54 WUs.
I received 17 WUs due by February 6, 2009 at 21:28:04 (France time).
The first calculations should begin today (January 29), and if there are problems I will report them here.
____________
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Paul,
can you point me to the thing you read about Lockfile problems on Einstein !?
5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.
What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.
Mike
Two places to start are here and here ... I can also report that since I made that change I have been getting good results on Win XP systems ... I cannot see the high error rate I had in the past as the tasks have been purged ...
It seemed to me to be a problem I had on XP and it was most severe on the i7 where there are more things going on ...
and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.
Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!
Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!
____________
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
____________
Change #3....I downloaded and installed the latest version of DirectX, no changes noted.
Change #4....I installed Boinc 6.6.3, got this message "1/29/2009 8:28:31 AM|rosetta@home|Scheduler request completed: got 0 new tasks". I may have errored out all my available work for the day. No files downloading, so maybe it will take this time? No clue.
____________
mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>
If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
Okay, I have downloaded the file and put it in the Boinc\Data directory. I took out the asterisks and changed the <file_xfer_debug> line to a 1; it was a zero.
As for the http setting I use Firefox 3.0.5 and do not see that setting. I know it is/was in IE, but I do not see it in Firefox.
____________
Scott A. Howard Joined: Oct 16 05 Posts: 2 ID: 4994 Credit: 307,651 RAC: 66
Hello,
Here's the problem in a nutshell.
On my Dell Precision T5400 with dual Xeon E5410 2.33 GHz chips (for a total of 8 cores) running XP Pro SP3, almost every one of the Rosetta jobs (minirosetta version 1.54) fails. The typical failure mode is that they exceed their CPU time allocation. For example, if a job is estimated to require 4 hours of CPU time, it is killed at something like 20 hours. Sometimes the tasks show progress, other times they are stuck at zero.
Also, the exe is not removed from memory when the computer is in use.
I have reset the project and detached and attached again but it continues to happen.
Nothing like this happens with the lhcathome, QMC@HOME, Docking@Home, or boincsimap tasks. I also don't see this behavior on any of my other machines.
Do you guys produce any diagnostic logs that might be of use in troubleshooting the problem? Maybe it's my configuration - maybe a coding error showing up when running 6 or 8 of these tasks simultaneously. (It appears to occur with any number running, from 1 to 8.)
I have a full development environment and debuggers if you want some traces.
Scott Howard
Addendum: Now that I've thought about it a little more, does the app use any global resource locking? E.g., mutexes, semaphores, file access? Maybe that's why the progress is halted: it's deadlocked - but I am not sure why the task would continue to use CPU time though. Just some random thoughts...
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.
Once you have the file in the directory, abort the transfer.
You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.
The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.
Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?
Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
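If it helps, the ping and tracert steps can be scripted. This is only a sketch: the function names are invented, and the host name must be whichever server shows up in the file transfer debug messages (none is named in this thread). `ping -n` and `tracert` are the Windows forms. On other systems the sketch falls back to `ping -c` and `traceroute`.

```python
import platform
import subprocess


def diagnostic_commands(host, system=None):
    """Return the ping and trace-route command lines for the given host."""
    system = system or platform.system()
    if system == "Windows":
        return [["ping", "-n", "4", host], ["tracert", host]]
    return [["ping", "-c", "4", host], ["traceroute", host]]


def run_diagnostics(host):
    """Run both commands and collect their text output, keyed by command name."""
    results = {}
    for cmd in diagnostic_commands(host):
        try:
            done = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
            results[cmd[0]] = done.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[cmd[0]] = f"could not run: {exc}"
    return results


# Usage (not executed here because it needs network access):
#   for name, output in run_diagnostics("<host name from the debug messages>").items():
#       print(name, output, sep="\n")
```

Pasting the two outputs into a reply covers the ping and tracert results asked for above.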
____________ Rosetta Moderator: Mod.Sense
Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...
1.47 worked rather well for me, with perhaps one out of ten workunits giving an error. Not enough 1.54 workunits yet to say whether 1.54 is better. I'm asking for 14 hour workunits, so it will take me longer to run that many.
Scott A. Howard Joined: Oct 16 05 Posts: 2 ID: 4994 Credit: 307,651 RAC: 66
Here's a follow up.
I did the following:
1) detached from the project.
2) removed the Rosetta project folder from under \Boinc\...
3) removed all files from a slot that contained Rosetta data
4) reattached to the project
5) allowed for 50% of the cpus to be used (4 in this case)
6) allowed the four projects to run - each expected to take about 4 hours
Observed results: The status for the projects are "Running, high priority", each has used about 20 minutes of cpu time, the progress is 0.000%
Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.
It looks like that's all I can do. If there are no suggestions from your end, I'll need to stay detached from the project so I don't waste cycles.
I see the thread that's consuming the CPU has a pretty regular call stack. Here is the call stack. If you have your debug symbols for your build, you should be able to locate the routine and line at which the program is hung...
ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
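To pull the application frames out of a stack dump like this one, a short Python sketch follows. It assumes only the `module!symbol+0xOFFSET` / `module+0xOFFSET` layout visible above. The function names are hypothetical, and mapping an offset to a source line still needs the matching debug symbols on the developers' side.

```python
import re

# The call stack exactly as posted above.
STACK = """\
ntkrnlpa.exe!KiSwapContext+0x2f
ntkrnlpa.exe!KiSwapThread+0x8a
ntkrnlpa.exe!KeWaitForSingleObject+0x1c2
ntkrnlpa.exe!KiSuspendThread+0x18
ntkrnlpa.exe!KiDeliverApc+0x124
hal.dll!HalpApcInterrupt+0xc6
minirosetta_1.54_windows_intelx86.exe+0x91a63 <------ look for problem here
minirosetta_1.54_windows_intelx86.exe+0x17d3
minirosetta_1.54_windows_intelx86.exe+0x1afcd
minirosetta_1.54_windows_intelx86.exe+0x9289e
minirosetta_1.54_windows_intelx86.exe+0x4a4bc3
minirosetta_1.54_windows_intelx86.exe+0xb0892
minirosetta_1.54_windows_intelx86.exe+0x3e0c24
"""

# module, optional !symbol, then +0x<hex offset>
FRAME = re.compile(r"^(?P<module>[^!+\s]+)(?:!(?P<symbol>\w+))?\+0x(?P<offset>[0-9a-fA-F]+)")


def parse_frames(text):
    """Return a list of (module, symbol or None, offset as int), top frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((match["module"], match["symbol"], int(match["offset"], 16)))
    return frames


def first_app_frame(frames, module_prefix="minirosetta"):
    """Return the topmost frame that belongs to the application module."""
    for frame in frames:
        if frame[0].startswith(module_prefix):
            return frame
    return None


frames = parse_frames(STACK)
module, symbol, offset = first_app_frame(frames)
print(module, hex(offset))
```

The topmost application frame is the one already marked in the post, at offset 0x91a63 from the start of the module.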
____________
mikey, we're not talking about the HTTP setting of your browser. We're talking about the http setting used by BOINC. If it were specifically set, it would have a line in that cc_config.xml file.
Once you have the file in the directory, abort the transfer.
You probably got no work because BOINC knew you already had enough coming. So you probably see a number of tasks in a "downloading" state.
The file transfer debug messages will appear in the messages tab. One of the things to note there is which of Rosetta's servers is currently being used to retrieve the file (the host name). I believe this will change from one retry to the next. But if not, you might try blocking outbound traffic to that server with a firewall, and this would then force the client to try the next server in the list.
Does each try go for 5 minutes before waiting again? Does any data come down in that period of time?
Once you determine which server is being used, could you do a ping and a tracert to that server's host name and report the results?
I changed the dual core settings to use both cores, this is a laptop and I do not like stressing it that much, and set the other project to no new work. I updated Rosetta and it proceeded to download new work. The same file stopped at the same place, 89.25%. I aborted it, after all other files were done downloading, and no new entries showed up in the cc_config.xml file.
I was browsing thru the stdout.txt file and found this:
9:21:33 AM: Error: can't open file 'C:\Boinc\\RebootPending.txt' (error 2: the system cannot find the file specified.)
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/27/09 09:21:34] TRACE [2064]: RPC_CLIENT::init connect on 444 returned -1
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init boinc_socket returned 444
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init connect returned -1
[01/27/09 09:21:35] TRACE [2064]: RPC_CLIENT::init_poll sock = 444
It is in there many, many times.
I do not see what server I am downloading from, and only use the Windows firewall, so unless I could block thru the Hosts files, I do not know how to block that particular server anyway.
Yes each retry deferral is about 4 minutes.
I did find one more thing in the stdoutgui.txt file:
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect 2: Winsock error '10061'
[01/29/09 11:10:31] TRACE [3932]: RPC_CLIENT::init connect on 524 returned -1
It is also in there many, many times. I did a search and found advice to change the attributes for the Boinc directory and all subdirectories. It was set to read-only, and when I unchecked that and changed it for all subdirectories as well, Boinc would not run. It also defaults back to read-only after it errors out. DO NOT DO THIS LAST PART! It crashed my whole Boinc setup and I had to delete the Boinc directory and all subdirectories, then reboot and reinstall Boinc from scratch. FORTUNATELY it did a repair install instead of a brand new install from scratch! I lost all workunits from all projects though!!!!
I attached to Rosetta and guess what? The EXACT SAME FILE is stuck at the EXACT SAME PLACE!!! A TON of files are downloading besides just that one, but that one is stuck all over AGAIN!!!
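A note on the Winsock error above: 10061 is WSAECONNREFUSED, "connection refused". The RPC_CLIENT lines most likely come from the BOINC Manager trying to reach the local BOINC client, so they are probably not related to the stalled download. Here is a minimal Python sketch for checking whether a local port accepts connections. The port 31416 is BOINC's default GUI RPC port as far as I know, and is an assumption here.

```python
import socket

# Assumption: the client listens on BOINC's default GUI RPC port.
BOINC_GUI_RPC_PORT = 31416


def can_connect(host="127.0.0.1", port=BOINC_GUI_RPC_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False if refused or timed out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage: run this while BOINC is running.
#   print(can_connect())
```

If this returns False while BOINC is supposed to be running, the Manager cannot reach the client, which would explain the repeated 10061 entries.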
____________
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
got any firewalls active?
No I use the Windows one, I have Windows XP Media Center on this laptop.
____________
and drop it in to your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
No difference, I downloaded the file, dropped it into the directory C:\Boinc\Data\Projects\boinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.
What antivirus program do you have, and what version? Some antivirus programs don't fully turn off when you try to turn them off; they stop reporting that they have found a virus, but don't stop looking for a virus.
I'm also running Ad-Aware, but without this problem, so this antispyware program is less likely to be causing the problem.
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
I also use the free version of avast!, version 4.8 Home Edition, without seeing that problem, so if it's that program, the problem is probably in a section specific to the operating system you are using. I use 32-bit Windows Vista SP1 with nearly all the updates applied; what operating system and what version are you using?
I am running 32 bit Windows XP Media Center and have been running Boinc on this thing ever since I bought it a couple of years ago. I was running Rosetta on it until this new mini-rosetta came out! I have run Malaria, I can run Poem on it right now! I have run ABC on it but the units take too long on this T2300 1.6ghz, 2 gig of ram machine. EVERYTHING runs except it just won't download, or recognize, the new mini-rosetta file! I downloaded the mini-rosetta file directly, put it in the proper directory, and it STILL wants to download that exact same file!!!
I am also using the 4.8 Home version of Avast.
____________
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Scott Howard:
Setting the activity back to "run based on preferences" results in each task no longer using cpu time but they are not removed from memory.
There are many many BOINC settings possible and you've not described any of yours. When you set BOINC to run based on preference, you are telling it to only use CPU on the days and during the hours you've configured. If you've configured it to not be running at the current time or day of the week, it will suspend the currently active tasks. Any time a task is suspended, it will not make any progress. And there is a memory setting for whether or not tasks should remain "in memory" (virtual memory) while suspended. Doing so preserves the work done since the last checkpoint taken by the task.
...so major portions of what you are reporting may be exactly what you have configured BOINC to do.
You have 4 hosts, three are Windows XP and one is Win Vista. Which one is having problems? Is it this one? There are many failed tasks there with access violations. Are you overclocking this machine? Other than more CPUs and a different CPU type, what is different about this machine compared with your others that have been running fine?
____________ Rosetta Moderator: Mod.Sense
It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.
My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.
I turned it off and nothing changed, I use the free version of Avast.
got any firewalls active?
No I use the Windows one, I have Windows XP Media Center on this laptop.
I also use the Windows firewall, but the Vista SP1 version.
Laptops sometimes have problems with overheating when running BOINC workunits while set to use 100% of the CPU time, and I think I've read that minirosetta is likely to have problems when set to run at less than 100% of the CPU time. What is your setting of what fraction of the CPU time to use?
I've installed the SpeedFan program on my machine to check for overheating, but don't have the file needed to show results with proper labels for my motherboard yet. The highest temperature it shows is 109F, though.
I notice that your 8 core machine only has 3GB. That's a bit small for 8 rosetta tasks. In your BOINC preferences what percent of memory are you allowing the machine to use when the machine is/isn't in use? You might try setting both to 100% on that machine and see if it makes any difference.
I seem to have the same problem. No special settings in Rosetta preferences, all kind of computers under XP, and tasks running 100+ hours with 0% progress.
Reason: Access Violation (0xc0000005) at address 0x00467846 read attempt to address 0x11B524C4
This task was running fine, but after I suspended it, rebooted my system, and restarted the task, it terminated almost immediately with an access violation. Maybe restarts don't work very well, or something is flaky with my hard drive or system. Having some troubles with access violations on Einstein tasks as well. But I've run memtest86 and prime95 and CHKDSK and none of them indicate any local computer problems. I'm just shaking my head in disgust.
I've recently run Poem workunits on one core of my HP Compaq Presario PC, model SR5125CL, and minirosetta 1.54 workunits at the same time on the other core, without problems. I previously ran Malaria workunits on one core and earlier minirosetta workunits on the other without problems, back when I could still get Malaria workunits. My machine also has 2 GB. I haven't tried ABC workunits, so you might want to try running yours a while with no ABC workunits active. I also run WCG workunits (all active projects there except beta test), with Ralph workunits and Boincsimap workunits when I can get them. I used to run Cels workunits, back when that project was active.
My CPU is an AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ 1.90 Ghz; what's yours?
Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.
I only run one core so the setting is to use 50% of the cpu's. Thus I do not have a problem with overheating on this laptop. I have it set to use 100% of the available cpu.
____________
This is an Intel T2300 dual core, only using one of them for Boinc, 1.6ghz machine.
Also, You may want to give your ISP the instructions for downloading the problem file with FTP, and ask them to test whether their antivirus software considers it to have a problem.
Yeah me telling Comcast what to do isn't going to happen in this lifetime. I can download any file in the World EXCEPT this damned mini-rosetta file and then ONLY thru Boinc!!!! I download the same file thru a direct download, Boinc just won't recognize it. Yes I did put the file in the proper directory. We have been thru this already.
____________
ID: 59157 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
I am not sure that Comcast is the problem, as I do use their AV software and have no problems with downloading work ...
ID: 59159 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, have you tried a different version of BOINC?
____________ Rosetta Moderator: Mod.Sense
ID: 59160 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
____________ Rosetta Moderator: Mod.Sense
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
____________
ID: 59162 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.
If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
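One way to test the corruption theory is to hash the manually downloaded copy and compare it with the checksum BOINC has on record. Below is a minimal Python sketch, assuming the expected value is an MD5 hex digest copied by hand from the file's entry in client_state.xml. The `<md5_cksum>` tag name is an assumption to verify against your own installation, and the function names are made up for this illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB pieces so a large .exe does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected hex string, ignoring case."""
    # `expected_hex` is assumed to come from the <md5_cksum> entry (unverified).
    return md5_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected` with the path of the copied .exe and the recorded digest returns False if the manual download was altered on the way.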
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
____________
You mentioned that when you manually copied the .exe file that you overwrote the half-downloaded file. Perhaps Boinc tried to resume the download without noticing the change. I think Boinc must be stopped, and must NOT have the file half-downloaded for this copying trick to work.
If you've avoided the above problem and Boinc is still trying to download the file, check the messages tab to see if Boinc is complaining about a bad checksum. It's possible that whatever is preventing Boinc from downloading the file could also be corrupting the file when you manually download it.
Nope this is the only message regarding the file:
1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
I exited Boinc, deleted the old file, copied the new one into its location and then restarted the whole pc. Then when Boinc started up that message, along with a few dozen others, came up.
I appreciate all the help but I am done trying to make this work. I am on to another project and will try again another time. THANK YOU ALL!!!!
PS in the time it took me to type this I attached to Poem@Home and got 8 new units plus all the associated files and the pc is now happily crunching.
Thanks again for all your help, I still have a hard time believing it is my pc that can download just fine from any other project but just cannot download one file from Rosetta. Here is a partial list of files just downloaded:
1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86
1/29/2009 6:37:15 PM|Poem@Home|Started download of JParmJan97
1/29/2009 6:37:23 PM|Poem@Home|Sending scheduler request: To fetch work. Requesting 95475 seconds of work, reporting 0 completed tasks
1/29/2009 6:37:28 PM|Poem@Home|Scheduler request completed: got 8 new tasks
1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86
As you can see it works just fine!! I do see that the mini-rosetta file has a ".exe" at the end while the poem file does not. Could that be the problem, no clue, seems it has worked for all other users.
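The message log lines quoted in this post share a simple `timestamp|project|message` layout, so stalled downloads can be picked out mechanically. Here is a rough Python sketch, assuming that pipe-separated layout and the "Started download of" / "Finished download of" wording seen above hold for every line. Other BOINC versions may word their messages differently, and the function name is made up for this illustration.

```python
STARTED = "Started download of "
FINISHED = "Finished download of "


def unfinished_downloads(log_lines):
    """Return (project, file) pairs that started downloading but never finished."""
    pending = []
    for line in log_lines:
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        _timestamp, project, message = parts
        if message.startswith(STARTED):
            key = (project, message[len(STARTED):])
            if key not in pending:
                pending.append(key)
        elif message.startswith(FINISHED):
            key = (project, message[len(FINISHED):])
            if key in pending:
                pending.remove(key)
    return pending


# A few of the lines quoted in this thread.
log = [
    "1/29/2009 6:29:05 PM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe",
    "1/29/2009 6:37:15 PM|Poem@Home|Started download of poem_1.0_windows_intelx86",
    "1/29/2009 6:37:29 PM|Poem@Home|Finished download of poem_1.0_windows_intelx86",
]
print(unfinished_downloads(log))
```

With these three lines, only the minirosetta executable is left in the pending list.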
Thanks for the ride it has been loads of fun but I am getting off for now. I will still come back and read and reply in the forums until my credits don't let me anymore.
____________
ID: 59173 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will have a different name after all. So, please monitor the new release thread and give it another try at that time.
____________ Rosetta Moderator: Mod.Sense
ID: 59174 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Sometimes it is best to take a breather and come back later ...
At times the problems go away on their own for no apparent reason ... other times they can be found.
I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...
I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will be a different name afterall. So, please monitor the new release thread and give another try at that time.
I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
____________
Sometimes it is best to take a breather and come back later ...
At times the problems go away on their own for no apparent reason ... other times they can be found.
I have been doing Ralph for a week or so now and all I can say is that I am impressed with how many issues we found in 1.54 and now in 1.55 ...
I know that we all regret we could not get you going. Besides POEM, WCG also does folding, and there are other projects that are related ... yell if you need us ... or want us ... or to say hi ...
Oh I have 17 computers, I think, on line here at home right now. All are crunching for Boinc, plus I have 2 video cards doing the folding thing. I do ABC and Poem right now. But if you click on my name you will see I have crunched for a few projects and am not intending to stop anytime soon. In fact I have 2 new motherboard and dual core cpus to bring on line this weekend to replace 2 single core machines. I have already set them to no new work in preparation for the changeover.
____________
I detached yesterday, and re-attached just now, so there is no way (for me) to see what applications were running. Also, my computers are too widely spread to start micromanagement. Anyway, I'll keep an eye on a couple of computers for a couple of days to see if they reattach successfully and if that problem is indeed solved.
I supposed it was a wrong batch (or application) and detach/reattach was the fastest way to have a full reset. If the problem shows up again, I'll let you know. If it doesn't... then thanks for the info!
Hello all.
I do not understand why some people are having problems.
My desktop machine (Intel Core 2 CPU, Windows XP Home x86 SP2) has completed 27 WUs with v1.54 and 0 errors, with an average CPU time of 2.8 hours. I am crossing my fingers that it continues this way...
I will only point out that somewhere between 80% and 85% of the work, the progress jumps directly to 100%.
With this new version I also notice that the tasks generate more decoys and attempts (one example: 23 decoys from 23 attempts on one WU), and that the average amount of work per WU seems larger than with v1.47.
To finish (although this is not the right forum), I should mention that one of my computers has been out of service for 8 days because of bad sectors on the hard drive, and it is in for repair. It holds 3 WUs that I have not been able to return before the deadline and that I think will be lost.
It would be good if some kind person could let those in charge of the Rosetta project know about this problem.
Thank you very much in advance...
Best regards...
____________
The 1.54 version seems to be in conflict with the Linux ABI in FreeBSD.
One machine I'm running boinc on is a FreeBSD one, boinc downloads and runs the Linux binaries through the Linuxulator. Version 1.47 worked flawlessly, but the 1.54 version crashes randomly on SIGILL. http://boinc.bakerlab.org/rosetta/results.php?hostid=973136 shows only one successful task, which was run with Rosetta beta rather than minirosetta; all of the minirosetta tasks crashed sooner or later.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy and then they can weigh that before releasing more similar tasks.
____________ Rosetta Moderator: Mod.Sense
ID: 59236 | Rating: 0
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 362,889 RAC: 796
Hi.
Just as well it did finish after 99; I would hate to see the file size after 12 or 24 hours! :) I just returned another one the same size.
Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.
Is it something I did, a bug or just one of those things?
I only checked the 1.54-task. You have a runtime-preference of 3 hours. This one ran for 7 hours, no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
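The arithmetic behind that guess follows the rule from the 1.54 release notes: the watchdog aborts once runtime exceeds the preferred runtime plus 4 hours. Here is a small Python sketch of the check. The function names are made up for illustration, and treating "reached the limit" as a trigger is an assumption, since the reported 7 hours is a rounded figure and the real watchdog code is not shown in this thread.

```python
WATCHDOG_GRACE_HOURS = 4  # from the 1.54 release notes: preferred runtime + 4 hours


def watchdog_limit_hours(preferred_hours, grace_hours=WATCHDOG_GRACE_HOURS):
    """Runtime (in hours) after which the watchdog is expected to step in."""
    return preferred_hours + grace_hours


def reached_watchdog_limit(cpu_hours, preferred_hours):
    """True if a task has run at least as long as the watchdog limit (assumed >=)."""
    return cpu_hours >= watchdog_limit_hours(preferred_hours)


# The task discussed above: 3 hour preference, roughly 7 hours of runtime.
print(watchdog_limit_hours(3))
print(reached_watchdog_limit(7, 3))
```

A 3 hour preference gives a 7 hour limit, which matches the runtime reported for this task.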
____________
Thanks for copying here - I thought it was just a problem with the validator (the error message being the clue). You're right, there's no "Done" section after the first model starts until the boinc_finish, which is odd, but no mention of the watchdog cutting in, even though it does run a long time. But on the 1.47 WU there are 3 models done, so I'm not entirely convinced it's the same thing.
Usually long-running jobs get a default credit of 80, don't they? Looks like I missed out all ways. Oh well...
2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down
Server is up according to the webpage. One task was updated as complete.
There is something odd going on with the graphics of lr5_D_score12_rlbd_2hsh_IGNORE_THE_REST_DECOY_6246_424_0: the plot disappears completely at times, and so does the accepted energy; then they reappear. It all seems to depend on the energy value of the moment. As far as I know this is not normal.
I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.
For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.
Right now 3 cores are running:
Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
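To see how far off those "time left" figures are, compare them with a plain linear extrapolation from the progress percentage. The Python sketch below assumes progress is proportional to CPU time, which is only an approximation for Rosetta tasks; the function name is made up for illustration.

```python
def linear_remaining_minutes(elapsed_minutes, fraction_done):
    """Estimate remaining minutes assuming progress stays linear in CPU time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes * (1 - fraction_done) / fraction_done


# (core, elapsed minutes, fraction done, remaining minutes shown by BOINC)
reported = [
    (1, 2 * 60 + 28, 0.824, 1 * 60 + 47),
    (2, 1 * 60 + 4, 0.341, 72 * 60 + 3),
    (3, 0 * 60 + 34, 0.155, 150 * 60 + 48),
]

for core, elapsed, fraction, shown in reported:
    estimate = linear_remaining_minutes(elapsed, fraction)
    print(f"Core {core}: linear estimate {estimate:.0f} min, "
          f"BOINC shows {shown} min ({shown / estimate:.1f}x)")
```

For cores 2 and 3 the displayed remaining time comes out tens of times larger than the linear estimate, which is consistent with the client asking for very little work.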
what version of boinc manager are you using?
it looked like you were using 5.10.45 which is quite old.
Sounds like your Duration Correction Factor is waaaay off. You can wait for Boinc to fix it, and it will all by itself. Or you can shut down Boinc and manually edit the file client_state.xml. Do this thru notepad or whatever, and go down until you find this Project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line, yours will have the numbers be like 70.?????????? or whatever. Change it to 1.000000, save the file and then restart Boinc. Boinc will then use those new numbers and get both new work and recalculate how long it will take to crunch your existing units. If you do a right click on the file name the EDIT option is listed, use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top, mine was only 65 lines down from the top.
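Before editing anything by hand, it can help to just list the current values. Here is a read-only Python sketch that pulls each project's duration correction factor out of the text of client_state.xml. It assumes every `<project>` block contains a `<master_url>` tag next to `<duration_correction_factor>`; the sample text, the 55.2 value, and the "greater than 10" warning threshold are illustrative assumptions, not taken from a real file.

```python
import re

PROJECT_RE = re.compile(r"<project>(.*?)</project>", re.DOTALL)
URL_RE = re.compile(r"<master_url>(.*?)</master_url>", re.DOTALL)
DCF_RE = re.compile(
    r"<duration_correction_factor>(.*?)</duration_correction_factor>", re.DOTALL
)


def duration_correction_factors(state_text):
    """Map each project's master URL to its duration correction factor."""
    factors = {}
    for block in PROJECT_RE.findall(state_text):
        url = URL_RE.search(block)
        dcf = DCF_RE.search(block)
        if url and dcf:
            factors[url.group(1).strip()] = float(dcf.group(1))
    return factors


# Made-up sample standing in for the contents of client_state.xml.
sample = """
<client_state>
<project>
    <master_url>http://boinc.bakerlab.org/rosetta/</master_url>
    <duration_correction_factor>55.200000</duration_correction_factor>
</project>
</client_state>
"""

for url, dcf in duration_correction_factors(sample).items():
    flag = "  <-- suspiciously large" if dcf > 10 else ""  # arbitrary threshold
    print(f"{url}: {dcf}{flag}")
```

The sketch only reads the text; the edit itself should still be made with BOINC shut down, as described above.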
____________
ID: 59254 | Rating: 0
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Compute error, though it looks more like a zip error ...
process exited with code 1 (0x1, -255)
Watchdog active.
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Not sure what to make of this error ... happened on the Mac Pro ...
That fixed it! Thanks, my duration was set at 55+.
2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down
Server is up according to the webpage. One task was updated as complete.
You have to wait and it will correct itself.
Maybe it has been a long time since your last Rosetta WU... in the meantime the project changed its web address, so BOINC needs to re-fetch the master file. Leave it alone and within 24 hours at most it will re-download it and resume working!
That fixed it! Thanks, my duration was set at 55+.
Yeah, for some reason this has happened a lot lately.
____________
That could be related to the BOINC version (6.4.5 and higher). The complaints about the RDCF being completely off usually come from people who have installed it. A not uncommon opinion is that version 6.4.5 was made the recommended version too hastily, in order to get the CUDA capabilities out.
____________
mikey, whatever the problem is, it stands a good chance of clearing itself when the next Rosetta version comes out. The .exe will be a different name afterall. So, please monitor the new release thread and give another try at that time.
I will, I like the premise of Rosetta and that is what brought me here in the first place. I will certainly try again in the future, probably when you put out a new version as you suggest.
Thanks for all your help
I just had a thought...NOT dangerous this time, I am off for a few days here and I have finally figured out how to make Ubuntu Linux work for me and crunch Boinc projects too. I will try switching one of the machines that won't download the Windows app to Linux and see if that works.
____________
I just thought of something....I wonder if changing the setting for:
Skip image file verification? to yes would have let my Windows pc's download the file? Hmmmmm
____________
ID: 59365 | Rating: 0
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
The image verification can't occur until the download completes. So, that's not what's causing the download problem.
____________ Rosetta Moderator: Mod.Sense
Mod.Sense had asked me to post my results in here. A little history: I've been getting Compute Errors for every Minirosetta WU I try to crunch; they usually crash and burn within the first 60 seconds or so. I am running a Q6600 with everything at stock speeds, but I was throttling my processor to use only 3 of 4 cores, so it was suggested that I let all 4 cores run unthrottled, and here's what happened:
I changed it to: "On multiprocessor systems, use at most 100% of the processors" so that it would run completely unthrottled and use all 4 cores. And I let it download minirosetta WU's and it got 5 of them and all failed after 0:33, 1:39, 0:56, 0:38, and last one at 0:51 crashed with a Vista popup saying "minirosetta_1.54_windows_x86_64.exe has stopped working"
So it didn't seem to help. I don't know what else to try, but I'm a little ashamed of all the compute errors when you look at my results page, so I think I may have to give up on minirosetta and just stick to Beta WUs; they seem to work great when I'm not messing around with the BOINC client.
I think it may have something to do with Vista 64, because I have an E8500 running Vista 64 and they fail on there too. The E8500 is throttled to 1 core and is OC'ed from 3.16GHz to 3.8GHz (I've been told OC'ing will affect minirosetta), but the E8500 is my gaming rig, so I don't mind if it doesn't crunch WUs because it's crunching games! :)
ID: 59371
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
And epcorian is not overclocked. Running BOINC version 6.4.5
They consistently fail with Access Violations on the Mini tasks. The "Rosetta Beta" tasks are the successes you will find.
Is it possible you've got something like an antivirus application that's conflicting on Vista?
The only other thought is to go back to the prior stable version of the BOINC client. There have been a number of fishy issues with the 6.4.x level. You can download older BOINC versions here
____________ Rosetta Moderator: Mod.Sense
That's right, the Q6600 isn't overclocked. The system contains an Intel DQ35JO MB, Q6600 processor, 4GB (2x2GB) Kingston Value RAM, Corsair HX-520W PS, 36GB WD Raptor HD, 2x750GB WD HDs in RAID 1, and a Zalman HSF, running Vista 64 SP1 with no external video card. I use it as a home file and print server and recently as a BOINC cruncher, as I leave it on 24/7. No issues with Beta WUs or SETI.
I do have NOD32 installed on there but I tried disabling it (I haven't gone as far to uninstall it) and they would still fail.
Maybe I should try an older version of the BOINC client, I will give it a go this weekend and post back.
He is running a 64 bit OS though, I read on one of the projects that you need to do something to make 32 bit units work on a 64 bit system, is that true with Rosetta units too? That is NOT true for all projects and I do not remember where I read it.
____________
ID: 59385
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Moved NewtonianRefractor's post here. They report a validation error on tasks that had a visit from the watchdog. The tasks ended at target runtime plus 4hrs, but show validation errors.
____________ Rosetta Moderator: Mod.Sense
rembertw
Please open the advanced view of the BOINC Manager, go to the tasks tab, and note the "application" name shown, this will have the application version. The only reports of tasks running that long are from the prior version. If it is not Rosetta mini 1.54, please select that task, and abort it with the button on the left. There were some problems like that on the prior version that are corrected now.
Same problem again on at least one of my computers. This time I have more details:
Application: Rosetta Mini 1.54
Task name: lr6_E_score12_rlbd_1ail_IGNORE_THE_REST_DECOY_6254_459_0
Total runtime before manual cancellation: 72:21:22
Total Progress: 0%
Time to go: 6:42:30 (as usual on my computers)
Any comments/ideas?
ID: 59394
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
____________ Rosetta Moderator: Mod.Sense
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis: I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though
Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis: I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
____________
Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop processing for this project. Even so, I try again from time to time, but nothing has changed, even with the new versions of Rosetta Mini, including this latest one.
Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.
mikey, have you tried a different version of BOINC?
Yes I was originally using version 6.4.3 but upgraded to version 6.6.3 and have since downgraded back to 6.4.3. Nothing has worked. I even had a couple of computers on 6.2.19 and they couldn't finish the download either!
Just a wild shot. ..
How is your disk space?
How about BOINC settings for disk space? Are you at BOINC's limit?
No I am fine on both. Boinc still has 10 gig available to it and there is over 20 gig total available.
How many BOINC projects do you have set up? I've seen signs that BOINC divides the available space equally among projects, even if some projects don't even try to use all of their share. I'm currently allowing BOINC to share up to 30 GB among 8 BOINC projects (not all making workunits available recently). I had problems getting Rosetta@home to run workunits on both cores of my dual-core CPU at the same time before that. Also, I believe I've seen a maximum percentage of the available free space on the hard drive BOINC is allowed to use, which can reduce the limits even further.
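To put numbers on the disk-space guess above, here is a minimal sketch of the arithmetic. It assumes the poster's hypothesis (an equal split of the allowed space among attached projects), which is not confirmed BOINC behavior; the function names are made up for illustration.

```python
def per_project_share_gb(total_allowed_gb: float, n_projects: int) -> float:
    """Share per project IF the allowed space is split equally (unconfirmed hypothesis)."""
    if n_projects < 1:
        raise ValueError("need at least one project")
    return total_allowed_gb / n_projects


def allowed_from_free_gb(free_gb: float, max_pct_of_free: float) -> float:
    """Space allowed under a 'use at most N% of free disk space' style preference."""
    return free_gb * max_pct_of_free / 100.0


# 30 GB shared among 8 projects, as described in the post above.
print(per_project_share_gb(30, 8))
```

If the hypothesis holds, each of the 8 projects would get well under 4 GB even though 30 GB is allowed in total, which is why the question about the number of attached projects matters.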
I recently had a 1.54 workunit with a validate error for no reason I could spot in the Task ID details file. A wingman got a Success, but apparently with a much shorter preferred workunit length than the 14 hours I request.
I only have one project per pc, but I will add a second if the first is having workunit issues. All machines have at least a 20 gig hard drive but most have a 100 gig or bigger hard drive. The one above is a laptop with a 50 gig hard drive with almost 30 gig free. I have Boinc setup to use no more than 50% of the free hard drive space and don't have any issues with space.
____________
So I took Mod.Sense's advice and downgraded to the 6.2.19 64-bit version of the BOINC client, and so far so good with the minis: I've crunched for 30 minutes thus far with no errors yet, much better than the 30-60 seconds I was getting before.
ALRIGHT!!! Glad you guys found the problem, I guess the reports of the newer versions being released without proper testing were true in your case.
I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8: better, but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some Minis and they did just fine.
ID: 59428
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Hello.
The following task (http://boinc.bakerlab.org/rosetta/result.php?resultid=225859224) is suspended, as it has produced "accepted energy": QNAN (Not a Number?) and RMSD: QO. Model number 25, step 9518. Running time: 20h 2min 21sec.
Set runtime: 24h.
Suspended for now. No crash before.
OS: Windows 7 beta. I can create a dump file using Task Manager.
Should I let it try to finish?
Thanks
I'd suggest allowing it to run normally. Was it still using CPU time? If you want to kind of cut it off, but get it to report in, let it run, then exit (not close) BOINC and restart it, let it run about 2 minutes, then exit again and restart, until you've done that 5 times and the task should be ended and report in with "too many restarts".
____________ Rosetta Moderator: Mod.Sense
ID: 59437
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop processing for this project. Even so, I try again from time to time, but nothing has changed, even with the new versions of Rosetta Mini, including this latest one.
I was able to translate his message and basically, he's been having problems with Mini, including the latest version. He wishes Rosetta had subprojects, so he could select to crunch only the RosettaBeta application instead of mini.
Looking at his 2 failed tasks, they both have Exit status -226 and the Can't acquire lockfile errors.
He is running Win Vista x86.
I know some of you have had these lock file problems as well. Were they always with WinVista? And I thought the v1.54 release of mini had resolved these issues. Can any of you that have had the problem suggest the best steps for Juan to take to resolve it? You might even convert your reply to Spanish as best we can using a tool like this: http://dictionary.reference.com/translate/text.html
____________ Rosetta Moderator: Mod.Sense
According to the graphics screen of these four WUs, every "accepted" step becomes the new low energy state. No matter if the energy value is smaller or higher...
ID: 59443
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
*I* cured the lock file problem by running with 100% time ... if he has opted to run at some lower percentage of CPU time this may be the issue. Something else to try ... and if it works we can report another success ... this is one of the issues that we have been trying to pin down in RALPH...
OK, I set the runtime to 8 hours, so the watchdog would cut it at 24 hours. It has now uploaded and reported. I have dump files as well, if somebody on the team is interested. (Captured at the reported time and step.)
And I see I was not alone... :-(
If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...
I read about this in Einstein@Home and it seems to work for me ... YMMV ...
I, too, was plagued by frequent R@H lock file problems. Setting CPU to 100% seems to have cured that.
And, as I have a quad-core CPU, I can limit BOINC usage by setting "On Multiprocessor Systems, use at most 51% of all processors". (If I run BOINC at 100% on all cores, my system gets too hot - more precisely, my fan gets too loud)
-- Andreas
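For anyone working out which percentage to enter, the "use at most N% of the processors" setting can be sketched as below. The round-down rule and the one-core minimum are assumptions on my part; they match the "75% meaning 3 of 4 cores" report elsewhere in this thread but are not taken from the BOINC source. The function name is illustrative only.

```python
import math


def cores_allowed(total_cores: int, max_pct: float) -> int:
    """Cores BOINC may use, ASSUMING it rounds down and never goes below one core."""
    return max(1, math.floor(total_cores * max_pct / 100))


# Quad-core examples with the percentages mentioned in this thread.
for pct in (51, 75, 100):
    print(pct, cores_allowed(4, pct))
```

Under that assumption, 51% on a quad-core gives 2 cores, each of which can still run at 100% CPU time, which is the combination reported above to avoid the lock file errors.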
I never learned enough Spanish to do such a translation myself, so I tried asking that web site to translate all of your reply at once to Spanish, in preparation for writing an answer in English and doing the same to it. It appeared that the translation succeeded, but enough of it was hidden by advertisements that it was unusable.
Anyone know another automatic translation site that doesn't have this problem?
I've been trying to trigger that problem over on RALPH@home by setting my CPU time less than 100% and unable to actually get it less than 100%, so you might want to consider this: For anyone having this problem repeatedly, give them 1.54 workunits with extra debugging output enabled. Then have someone on the RALPH@home staff analyze the results and give them credits according to the RALPH@home standards instead of the Rosetta@home standards.
Hello. First of all, apologies for writing in Spanish, but my English is insufficient. Since August 2008, 99% of my Rosetta Mini tasks have been ending with a computation error. After a while I decided to stop processing for this project. Even so, I try again from time to time, but nothing has changed, even with the new versions of Rosetta Mini, including this latest one. The thing is that Rosetta Beta tasks do not fail for me, but proportionally I am sent very few of those. The pity is that this project does not offer the option of selecting sub-projects, as many others do. I would like to keep processing for this project, but there is no way, and there is no point in throwing away hours of computation. I hope this problem is solved soon. For my part, I will keep trying from time to time. Kind regards to all, Juan
Rembertw, which machine are you having the problem with? What version of BOINC are you running? Was this a newly installed machine? Or was it working before?
- This specific problem occurs with computer ID 586996
- Boinc version 6.2.14 as on most of my computers currently
- Not newly installed, but hardly a prize winner with Rosetta. It crunches successfully for other projects though
Extra comments: I have the impression that it is Rosetta that crashes. This morning I noticed 2 other tasks at +7h progress and 0% progress. When cancelling these tasks I got the Windows crash notice where I can "inform microsoft of the problem". The only "special" about this computer is that it doesn't have 24/7 internet access.
No solution as yet?
ID: 59518
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?
Odd, the failed task with some time on it shows that your core client version is 6.2.14, but your BOINC Windows Runtime Debugger Version is 6.5.0. Not sure how that would happen.
We're ready for a new update. I want to say thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue,but now supplemented with information from Rosetta@HOME.
This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.
Features/Fixes:
1.54 Release CHANGELOG
Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.
Bug fix concerning intermittent crashes in relax benchmark jobs (_rlbd_) jobs - caused by buggy input file reader.
Bug fix for a potential instability in handling text files (affects all types of WUs).
Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)
Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (at most as long as the time between checkpoints).
Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)
Added checkpointing to Looprelax.
The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!
Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.
Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)
Fixed a strange problem in the options system leading to early crashes on some systems.
Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)
Generally implemented much better error reporting - many, many potential problems will now show up as meaningful error messages and not random segmentation faults.
NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is OK. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.
Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.
Please let us know how things work out there. Particularly I'd like to know about
Stuck workunits
Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
Problems with checkpointing.
Any other strange behaviour.
Happy crunching - I'm very excited to see how this new version will pan out.
I have reached the end. Since your new patch, nothing from your project works. I keep resetting and still I get no improvement. Until you patch your patch I am done. Sorry, I wanted to help.
____________
Urgh - bad news :(
I notice you're using Boinc 6.2.19 with Vista64. Can you give it one last try and upgrade to 6.4.5? I had similar problems to you (not anywhere as bad) using Vista64 and these problems have disappeared for me after upgrading. It might make all the difference for you too.
____________
Do you 'overclock' your PC? In that case lowering the overclock might help.
____________
ID: 59527
Markus Joined: Feb 21 08 Posts: 1 ID: 243327 Credit: 25,065 RAC: 0
Good morning!
I reinstalled my complete system a few days ago and restarted crunching rosetta@home. Unfortunately I got some errors.
Here is what i got
12.02.2009 05:37:59|rosetta@home|Restarting task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 using minirosetta version 154
12.02.2009 05:38:00|rosetta@home|Task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 exited with zero status but no 'finished' file
12.02.2009 05:38:00|rosetta@home|If this happens repeatedly you may need to reset the project.
As a result, two workunits aborted with a computation error. Maybe it is just an error on my system; I just wanted to post it.
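If this shows up repeatedly, the message log can be scanned for it automatically. The sketch below is an illustration only: it assumes the pipe-separated "timestamp|project|message" layout of the three lines quoted above and a day.month.year date order, and the helper names are made up.

```python
from datetime import datetime

# The three message-log lines quoted in the post above.
SAMPLE = """\
12.02.2009 05:37:59|rosetta@home|Restarting task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 using minirosetta version 154
12.02.2009 05:38:00|rosetta@home|Task cc_1_3_mamcstmix_cen_0.1_hb_t369__IGNORE_THE_REST_1RXQA_12_6836_46_0 exited with zero status but no 'finished' file
12.02.2009 05:38:00|rosetta@home|If this happens repeatedly you may need to reset the project.
"""

MARKER = "exited with zero status but no 'finished' file"


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    # Assumed date order: day.month.year
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), project, message


def tasks_without_finished_file(lines):
    """Names of tasks that exited with status zero but left no 'finished' file."""
    names = []
    for line in lines:
        _, _, message = parse_line(line)
        if MARKER in message:
            # Message looks like: "Task <name> exited with zero status ..."
            names.append(message.split()[1])
    return names


print(tasks_without_finished_file(SAMPLE.splitlines()))
```

Counting how often the same task name appears in that list would show whether it "happens repeatedly" in the sense of the client's own hint.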
I've not heard any other reports of the percent completed not increasing. What is it showing for the estimated runtime, before the task starts?
In the meantime I have set that computer on NNT, and changed the preferred runtime. I will reactivate that computer, and evaluate Saturday or after the weekend. You'll be informed :)
Very good so far, zero error results on all machines for a long time. This 1.54 is much better than the prev versions, much more stable etc. Keep up the good work stamping out the bugs.
Its been a long time since I've reviewed the results on all my crunchers and found no compute errors. If things keep going the way they are, we might break 100 Tflops yet!
____________
Workunit 205979363
Task 228619747
Name loopbuild_ref_tex_cst_hombench_loopbuild_tex_cst_t332__IGNORE_THE_REST_2FLIA_6_6646_10_1
Mac OS X 10.4.11
This failed after 216 seconds : tail of stderr below
Setting database description ...
Setting up checkpointing ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
Hbond tripped.
interpolate rotamers bin out of range: ARG 1.43667e-05 nan nan nan nan nan
81 81 19 20 2147483649 22 1.43667e-06 nan
ERROR:: Exit from: src/core/scoring/dunbrack/RotamericSingleResidueDunbrackLibrary.tmpl.hh line: 593
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
I got a couple of validate errors too: Task 228125280 Task 228133134
There's nothing more frustrating than completing a job ok only for it to go wrong when uploaded.
I notice yours are a bit different though.
The first ones just include the line:
hbond tripped
The other two show:
Starting work on structure: _1JUDA_2_00001
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
Not sure if one leads to the other but hbond tripped seems to be coming up in reports more regularly.
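To check whether "Hbond tripped" really goes together with a particular exit point, the stderr tails can be compared mechanically. This is a rough sketch that only relies on the "ERROR:: Exit from: <file> line: <n>" wording seen in the tails quoted above; the function names are illustrative, not part of minirosetta or BOINC.

```python
import re

# Matches lines such as:
#   ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
EXIT_RE = re.compile(r"ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)")


def exit_sites(stderr_text):
    """Return (source file, line number) for every 'Exit from' message."""
    return [(m["file"], int(m["line"])) for m in EXIT_RE.finditer(stderr_text)]


def hbond_tripped(stderr_text):
    """True if the tail mentions 'Hbond tripped' (case-insensitive)."""
    return "hbond tripped" in stderr_text.lower()


# Tail copied from the report above.
TAIL = """\
Starting work on structure: _1JUDA_2_00001
Hbond tripped.
ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
"""

print(hbond_tripped(TAIL), exit_sites(TAIL))
```

Running this over a batch of failed results would give a simple table of exit points with and without the "Hbond tripped" line, which is more reliable than eyeballing individual reports.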
I think I spoke too soon... that first WU crunched successfully, but only 1 other WU was successful out of the 8 WUs. 2/8: better, but still not good. I might try replacing Vista 64 with XP 64 another weekend when I'm bored. Just for curiosity's sake I had my P4 and Atom 330 PCs running 32-bit XP SP3 crunch some Minis and they did just fine.
So this weekend I installed a fresh copy of XP x64, upgraded it to SP2, installed my x64 version of NOD32 antivirus, told BOINC to use "...use at most 75% of the processors" meaning 3 of 4 cores on my Q6600 and it's crunching Mini's and Beta's without a problem! 1 successful Beta, 5 successful Mini's with 4 more coming down the pipe. So it looks like Mini does not like Vista x64 and on my adventures on google, it turns out that XP x64 is actually based on the Server 2003 code tree while Vista is based on crap. :)
ID: 59610
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Just noted that I have two tasks that failed. One had an exception, the other a validate error with 99 decoys ...
Does the system have an issue with too many decoys? The reissue has not returned ...
If I remember correctly, they have created a 99 model stop line to keep the tasks from running forever.
ID: 59615
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Yeah, the 99 stop limit was to avoid a problem with the file size that is zipped up and uploaded. However, I was just wondering if there is now a new companion problem that the validator does not properly handle those results... or, the result itself is somehow bad...
In that I have gone back to the 3rd of Feb and have at least a hundred (220) results with only three errors this is a puzzlement ...
{edit}
added number ..
Also I note that The runtime is only 145 seconds ... so that was fast work ... :)
I started running Rosetta this morning on a 64bit Vista machine and all seems to be working well. It's been working well on other projects too. Here is what I'm running:
Core i7 920 CPU
Asus P6T6 WS Revolution motherboard
6Gb DDR3 Triple Channel RAM
Vista Home Premium SP1 64bit
64bit BOINC 6.6.7
As I said, no problems yet and a number of WU's have completed already.
Ok, after a number of successful completions, I did see one that looks like it failed. Message as follows:
2/16/2009 7:49:12 PM rosetta@home Computation for task ss-neg-1i17__7365_4677_1 finished
2/16/2009 7:49:12 PM rosetta@home Output file ss-neg-1i17__7365_4677_1_0 for task ss-neg-1i17__7365_4677_1 absent
Don't know the cause of that one...
____________
ID: 59626
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Well, a couple hundred tasks and several with the same error, multiple systems (3 different), based on Xeon, Q9300, and i7 processors, various amounts of available RAM, though in common all are running Win XP Pro 32-Bit:
So... I completed a bunch more tasks successfully, then got a 2nd task where it said the output file was missing. Anyone else getting these?
2/17/2009 6:20:35 AM rosetta@home Computation for task ss-neg-1i17__7365_5964_0 finished
2/17/2009 6:20:35 AM rosetta@home Output file ss-neg-1i17__7365_5964_0_0 for task ss-neg-1i17__7365_5964_0 absent
I noticed that both tasks that gave the 'absent output file' message had a name that started with the same first part:
ss-neg-1i17__7365_
perhaps a bug in that one?
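Spotting a shared batch prefix like this can be done in one call once the failed task names are collected. A small sketch, using the two task names reported above; the wrapper function is just for illustration.

```python
import os


def shared_prefix(task_names):
    """Longest leading string common to all task names (character-wise)."""
    return os.path.commonprefix(list(task_names))


failed = [
    "ss-neg-1i17__7365_4677_1",
    "ss-neg-1i17__7365_5964_0",
]
print(shared_prefix(failed))
```

A long shared prefix across failures from different hosts points at the batch rather than at any one machine.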
I had one of those fail too. Firewall blocked it from reporting the symbol tables :(
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
ID: 59633
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Looks like Pharrg actually had three of these fail
I had two more similar tasks on my machines, so I suspended others to try and run them.
I've got an ss-neg-1je9 that seems normal so far. But my other ss-neg-1i17 doesn't seem able to display graphics. Black window, no pane lines, on WinXP.
____________
As soon as you bring up the graphic, which never gets beyond black, Windows task manager shows the graphic thread as "not responding".
____________
A few minutes ago when it was about 93% complete, I told it to display graphics (which I usually don't do). After about a minute, I closed the graphics window. Within another minute or two, that workunit decided it was finished.
It may or may not be significant that a few minutes before doing this, I had set the Activity to Suspend, also suspended the network communications, ran some antispyware programs, then set the Activity back to normal.
Is this something normal that just happened at an unusual time, or something more significant?
What is it showing for the estimated runtime, before the task starts?
There is a new task running on that same computer:
- Estimated runtime: 09:43:55
- current runtime: 18:03:14
- Progress: 0%
I think my settings before were asking for about 6 hours runtime and now 10 hours. Changing this did not solve the problem. For the sake of testing I will keep this task running for some more time. You can let me know what to do. In the worst case I'll set that computer on NNT for Rosetta but I'm willing to wait some longer.
ID: 59649
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
Three more errors ... this time two I have not seen before:
229353838 0 0x0056d881 SIGPIPE: write on a pipe with no reader
229355014 Reason: Access Violation (0xc0000005) at address 0x004E3308 read attempt to address 0x00000000
229435564 ERROR: ERROR: FragmentIO: could not open file cs_aa_1ji8A09_05.200_v1_3.gz
So, two shiny new errors and one old rusty access violation that quite a few of us have seen ...
Please could someone in authority explain why there have been so many of these recently.
I currently have Rosetta set to "No New Tasks", partly because of these. I am still accepting work from RALPH.
Keith
ID: 59651
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
rembertw, the maximum runtime preference possible is 24 hours, and if it is a v1.54 task, the watchdog should end it if it runs longer than 28 hours. So, if you could, let it run at least 29 hours and if it is still running at that point, then abort it.
I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?
____________ Rosetta Moderator: Mod.Sense
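The 28-hour figure follows from the 1.54 changelog rule (preferred runtime + 4 hours) applied to the 24-hour maximum preference. A minimal sketch of that arithmetic, assuming the watchdog compares elapsed runtime against the preference exactly as the changelog describes; the function names are hypothetical and are not taken from minirosetta:

```python
from datetime import timedelta

# Values quoted in this thread; the names are illustrative only.
WATCHDOG_GRACE = timedelta(hours=4)          # "preferred runtime + 4 hours"
MAX_PREFERRED_RUNTIME = timedelta(hours=24)  # maximum runtime preference


def watchdog_cutoff(preferred_runtime):
    """Return the runtime after which the watchdog is expected to abort."""
    if preferred_runtime > MAX_PREFERRED_RUNTIME:
        raise ValueError("preferred runtime cannot exceed 24 hours")
    return preferred_runtime + WATCHDOG_GRACE


def should_have_been_aborted(elapsed, preferred_runtime):
    """True if the elapsed runtime is already past the expected cutoff."""
    return elapsed > watchdog_cutoff(preferred_runtime)


if __name__ == "__main__":
    # Worst case: 24-hour preference gives a 28-hour cutoff.
    print(watchdog_cutoff(timedelta(hours=24)))
    # The task reported above: 18:03:14 elapsed, checked against a 10-hour preference.
    elapsed = timedelta(hours=18, minutes=3, seconds=14)
    print(should_have_been_aborted(elapsed, timedelta(hours=10)))
```

Which preference value a given task actually picked up is not known from the posts, which is why letting it run past the 28-hour worst case is the safe test.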
I still have not seen anyone else reporting such a problem, and you've got a score of other hosts running fine. What is different about this one that's having trouble? antivirus software? Windows service pack? age of machine? BOINC version?
It is strange indeed. My other computers seem to be running fine. About the computer: I have an identical computer that gives no problems. They both have the same antivirus software, same service pack, same age, same Boinc version.
Some things I noticed:
- when a 0% task (only at Rosetta 1.54) gets paused manually after x hours and it gets restarted, also the time resets to 0.
- When the 1.54 task starts both processors get work (multiple projects). However, when one of the other project tasks stop, then the 2nd processor starts idling. It can not get another task to run from Rosetta or any other project despite the queue having multiple tasks ready to start or continue.
I aborted the 2 remaining Rosetta tasks that had not yet started and am letting the restarted task run. It already had 24h+ before, but because of a pause its time was reset. At this moment it is at 19h again. I will let it run until it gets past 31h runtime. After that (tomorrow) I will set that computer to NNT for Rosetta so it can crunch for my other projects while I wait for your comment.
[edit]Changed "all" in "both" and corrected a typo[/edit]
ID: 59677
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.
Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?
I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
____________ Rosetta Moderator: Mod.Sense
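One quick way to test the disk-write question is to try creating a file in the data directory. A minimal sketch in Python, assuming you pass in the path to your own BOINC data directory (its location is not given in this thread, and the helper name is hypothetical); it should be run as the same user account BOINC runs under:

```python
import os
import tempfile


def can_write(directory):
    """Return True if a temporary file can be created and written in directory."""
    if not os.path.isdir(directory):
        return False
    try:
        # The file is deleted automatically when the with-block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"write test")
            handle.flush()
        return True
    except OSError:
        # Covers permission errors, read-only volumes and full disks.
        return False


if __name__ == "__main__":
    # Stand-in path for demonstration; replace with your BOINC data directory.
    print(can_write(tempfile.gettempdir()))
```

A False result points at permissions or the disk itself rather than at BOINC or minirosetta.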
rembertw, the rest of what you describe sounds like BOINC itself is having some problems. It should see the second task end and be starting another.
I agree, but this shows only when it started a "0%" Minirosetta task. To check this I put Rosetta on NNT for a while. When it runs only other projects there are no problems at all, making me think the problem is with Minirosetta.
Is it possible BOINC is having trouble writing to disk (authorities?)? Have you checked the authorities to the data directory and its contents? Is the data on the local hard drive of the machine, or off on a network somewhere?
Standard setup with full authority running on a local hard drive. No fancy settings.
I see from the one task that completed that you are running BOINC 6.2.14. Have you tried other BOINC versions on this machine?
Every now and again I do a Boinc upgrade on my machines. I heard some negative comments about the current Boinc version, which is why I considered waiting until Summer or so to upgrade. I guess now the time has come. To be certain I'll do a total Boinc uninstall on that computer followed by a cleanup before I download the current version. We'll see if this helped...
Which BOINC version do you consider current? I'm running 6.2.28 without seeing such a problem, but I've read some negative comments about the 6.4.* series.
ID: 59686
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
robertmiles, if you were directing the question to me, I try to stay out of that one. And am only recommending a change to BOINC version because problems are occurring with the version installed now. I know we've seen many work-fetch and DCF problems reported on the 6.6 (which is the current test version) and I think 6.4 series introduced those problems. So, if it were me, I'd try the 6.2.19 shown at the link below. I myself am on 6.2.18 and running well on WinXP. (nothing against 6.2.28, but it's not listed anymore for some reason)
And am only recommending a change to BOINC version because problems are occurring with the version installed now.
I set up Boinc 6.4.5 on that computer, and it seems to be running fine with Rosetta. I still will wait for a general upgrade until there are new Boinc versions, I think.
robertmiles
"Current" is for me the version that the actual Boinc site gives as standard. Researching older versions and installing those is too much micromanagement for me. The same goes for posting on the boards... If this problem gets solved with 6.4.5 (and it seems to be solved) then I'm off again.
ID: 59752
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2399 ID: 106194 Credit: 0 RAC: 0
Looks like all of these were the ss-neg-1i17s that most people have been having trouble with. The problem seems specific to the 1i17; the other ss-neg tasks do not seem to be having any trouble.
The exception is the last one on your list, which got a
"Too many restarts with no progress. Keep application in memory while preempted."
error. Perhaps you rebooted your machine several times in a row to install fixes or something?
____________ Rosetta Moderator: Mod.Sense
ID: 59756
Paul D. Buck Joined: Sep 17 05 Posts: 815 ID: 269 Credit: 1,023,621 RAC: 81
I noticed that with minirosetta 1.54 the granted credit was very low on the Athlon X2 processors - sometimes half the claimed credit. This did not occur with the single-core Athlon.