Problems with Minirosetta 1.75

Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 61665 - Posted: 11 Jun 2009, 4:16:16 UTC

This one includes a number of new features allowing us to analyse larger amounts of data for homolog modelling and has new features for Docking, and, as always, bugfixes and (hopefully) improved stability.

Please post issues here!

Mike

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 61665 · Rating: 0
Rayfen Windspear

Joined: 13 May 09
Posts: 6
Credit: 113,749
RAC: 0
Message 61670 - Posted: 11 Jun 2009, 6:11:19 UTC - in response to Message 61665.  

Not exactly an "issue" but if 1.75 is out how can I actually get it? My BOINC client is still running 1.71. Should I reset the project?
ID: 61670 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,264,668
RAC: 4,443
Message 61673 - Posted: 11 Jun 2009, 7:40:37 UTC - in response to Message 61670.  

Not exactly an "issue" but if 1.75 is out how can I actually get it? My BOINC client is still running 1.71. Should I reset the project?


If I remember correctly, as soon as you get a 1.75 workunit, the program for 1.75 will be downloaded automatically.
ID: 61673 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 61686 - Posted: 11 Jun 2009, 13:12:07 UTC - in response to Message 61673.  

Not exactly an "issue" but if 1.75 is out how can I actually get it? My BOINC client is still running 1.71. Should I reset the project?


If I remember correctly, as soon as you get a 1.75 workunit, the program for 1.75 will be downloaded automatically.


That's correct. It will download automatically.

If for some reason you wanted to explicitly download it in advance, the executables are always found in the project downloads directory here:
https://boinc.bakerlab.org/rosetta/download
Rosetta Moderator: Mod.Sense
ID: 61686 · Rating: 0
Hubis

Joined: 23 Apr 09
Posts: 1
Credit: 11,920
RAC: 0
Message 61688 - Posted: 11 Jun 2009, 15:46:02 UTC

Why are the servers off? Why don't you give any info that they are off, and why?
ID: 61688 · Rating: 0
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 61691 - Posted: 11 Jun 2009, 16:47:13 UTC

Sorry - they were off by mistake. They should be back up now. My bad.

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 61691 · Rating: 0
Cesium_133*

Joined: 1 Dec 08
Posts: 28
Credit: 225,332
RAC: 0
Message 61721 - Posted: 13 Jun 2009, 1:46:17 UTC

Had a glitch with a WU just downloaded and running on Mini 1.75... it's a similar thing that's been plaguing me with other, earlier Mini versions:

lb_dk_ksync__full_hb_t313__IGNORE_THE_REST_12649_620_2

As with some others, it seemed to hang and not compute for a couple of hours while I was away. I turned BOINC off and back on, and it now is working again. It did, however, reset its "time elapsed" and "to completion" figures to lower values. An earlier, similar issue with another WU also reset the percentage of the job completed to a lower value.

Can someone give me a heads-up as to what demon, devil, or wraith is causing this, and if an exorcism is possible... thx...
The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic.

ID: 61721 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,264,668
RAC: 4,443
Message 61725 - Posted: 13 Jun 2009, 5:45:06 UTC - in response to Message 61721.  
Last modified: 13 Jun 2009, 6:34:29 UTC

Had a glitch with a WU just downloaded and running on Mini 1.75... it's a similar thing that's been plaguing me with other, earlier Mini versions:

lb_dk_ksync__full_hb_t313__IGNORE_THE_REST_12649_620_2

As with some others, it seemed to hang and not compute for a couple of hours while I was away. I turned BOINC off and back on, and it now is working again. It did, however, reset its "time elapsed" and "to completion" figures to lower values. An earlier, similar issue with another WU also reset the percentage of the job completed to a lower value.

Can someone give me a heads-up as to what demon, devil, or wraith is causing this, and if an exorcism is possible... thx...


Looks likely to be the same lockfile problem I'm trying to track down. For more details, see the thread on 1.71 problems. If that's what it is, it's likely to ignore any exorcisms, so you'll have to wait for updates to the software causing the problem instead.

Do you have the Rosetta@home project set to be allowed to use 100% of the CPU time if nothing with a higher priority is trying to use that CPU time? That makes the problem less frequent, but is not recommended on many laptops and on computers that tend to overheat if set to use that much.

If you're unable to set the CPU time to 100%, you could join me in trying to send in a zipped up collection of the *.txt and *.old files from the BOINC data directory tree every time the problem occurs - but note that you'll probably have to stop the boinc.exe program to be allowed to copy all of these files. For me, the collection should contain over 400 files, many of them not related to the Rosetta@home project.

The loss of some time elapsed and percentage of completion is a normal result of restarting from a checkpoint, but turning BOINC off forces that, and restarting it cleans up the leftover lockfiles from failed workunits that make the problem cascade to any later workunits that try to run in the same slot.

Does anyone know of a ZIP program that can search a specific directory and all its subdirectories for *.txt and *.old files and copy them, along with the subdirectory structure, into a *.zip file, without disturbing the original files, even when that subdirectory tree is known to contain some other files that crash most ZIP programs that try to copy them (specifically, the lockfiles)? I already have three different programs for creating *.zip files, but none of them seems to be able to do this, and Vista SP2's search program does not offer any option to zip up the results of the search.
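For what it's worth, a small script can do the kind of selective copy described above. The following is only a minimal sketch in Python, not a tool supplied by the project: the function name `zip_matching_files` and its parameters are made up for illustration, and it assumes the lockfiles do not themselves match the *.txt or *.old patterns, so they are never opened. Originals are only opened for reading, and any matching file that cannot be read is skipped rather than aborting the whole archive.

```python
import fnmatch
import os
import zipfile


def zip_matching_files(root, zip_path, patterns=("*.txt", "*.old")):
    """Archive files under root whose names match patterns, keeping subdirectories.

    Hypothetical helper for illustration. Returns (added, skipped), two lists
    of paths relative to root.
    """
    added, skipped = [], []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Files that do not match (for example lockfiles) are never touched.
                if not any(fnmatch.fnmatch(name.lower(), p) for p in patterns):
                    continue
                full = os.path.join(dirpath, name)
                rel = os.path.relpath(full, root)
                try:
                    # Read-only access, so the original file is left as it was.
                    with open(full, "rb") as handle:
                        data = handle.read()
                except OSError:
                    # Locked or inaccessible file: note it and carry on.
                    skipped.append(rel)
                    continue
                # ZIP entry names use forward slashes regardless of platform.
                archive.writestr(rel.replace(os.sep, "/"), data)
                added.append(rel)
    return added, skipped
```

It would be called with the BOINC data directory as `root` and some path outside that tree as `zip_path`; both are left to the reader, since the exact data directory location varies by installation. Whether files that BOINC currently holds open can be read this way without stopping boinc.exe is not something this sketch settles.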

Also, is anyone here able to alter the software for creating the lockfile to make it read-only for system users instead of not accessible to them at all? If I remember correctly, it's a 0-byte file, so being able to copy it but not able to delete it should not cause any security problems. Another, but probably harder, idea is to modify BOINC so that if a workunit leaves a lockfile behind, no more workunits will be assigned to that slot and another slot will be created if needed.
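The second idea above (stop assigning workunits to a slot that still has a lockfile in it) could look roughly like the sketch below. This is purely an illustration in Python and not how the BOINC client is actually written: the lockfile name `boinc_lockfile`, the `slots` directory layout, and both function names are assumptions, and the list of idle slots is taken as an input because a lockfile alone cannot distinguish a leftover from one held by a running task.

```python
import os

# Assumed lockfile name; adjust if the real file in the slot is named differently.
LOCKFILE_NAME = "boinc_lockfile"


def slots_with_lockfile(slots_dir, lockfile_name=LOCKFILE_NAME):
    """Return the names of slot subdirectories that contain a lockfile."""
    flagged = []
    for entry in sorted(os.listdir(slots_dir)):
        slot = os.path.join(slots_dir, entry)
        if os.path.isdir(slot) and os.path.exists(os.path.join(slot, lockfile_name)):
            flagged.append(entry)
    return flagged


def pick_clean_slot(slots_dir, idle_slots, lockfile_name=LOCKFILE_NAME):
    """Pick the first idle slot without a leftover lockfile.

    idle_slots is the list of slot names the caller believes are not running
    anything. Returns None when every idle slot is tainted, meaning the caller
    would need to create a fresh slot instead.
    """
    tainted = set(slots_with_lockfile(slots_dir, lockfile_name))
    for name in idle_slots:
        if name not in tainted:
            return name
    return None
```

Even as a diagnostic only, listing the flagged slots after a hang would show whether a stale lockfile is present before BOINC is restarted.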
ID: 61725 · Rating: 0
Cesium_133*

Joined: 1 Dec 08
Posts: 28
Credit: 225,332
RAC: 0
Message 61727 - Posted: 13 Jun 2009, 6:29:11 UTC - in response to Message 61725.  

Looks likely to be the same lockfile problem I'm trying to track down. For more details, see the thread on 1.71 problems.


Whatever a lockfile problem is :) I'll read the 1.71 bug thread.

Do you have the Rosetta@home project set to be allowed to use 100% of the CPU time if nothing with a higher priority is trying to use that CPU time? That makes the problem less frequent, but is not recommended on many laptops and on computers that tend to overheat if set to use that much.


All 4 of my projects are set to run at 100% flop utilization. Rosetta has 85% of total CPU time allocated to it, so it's the BMOC if it's running. Everything else runs less of the time and dependent on when BOINC decides to run something other than Rosetta. Whenever another project is running, it also has 100% CPU access. I have everything set to change between programs every 90 minutes.

I can exploit the CPU like this because I have a good separate heat sink that keeps the drive at 45-48 C, the CPU around 50, and the core and GPU at 77-80. This represents 10-12 C off ordinary temperature for my computer, and is within acceptable parameters according to its website (I run an HP Pavilion 9600 or the like, 2 CPU's, actually...)

If you're unable to set the CPU time to 100%, you could join me in trying to send in a zipped up collection of the *.txt and *.old files from the BOINC data directory tree every time the problem occurs...


I can set the CPU to 100% and get it to run that way. The rest... are you saying you can prove Oswald did not act alone? Something about a tree on the grassy knoll? And you have the Zapruder film in an old zip file? Nice...

The loss of some time elapsed and percentage of completion is a normal result of restarting from a checkpoint... (it) cleans up the leftover lockfiles from failed workunits that make the problem cascade to any later workunits that try to run in the same slot.


Makes sense that it would have a self-fix algorithm that would work to repair itself on a reset... seems that's the way with most programs in modern computing, be they this or Vista. As for what makes a bug propagate, well, I know what the Domino Theory is in the abstract. However, I am a user-end whiz kid, not a programmer by birth. Thanks for the reply :)
The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic.

ID: 61727 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,264,668
RAC: 4,443
Message 61728 - Posted: 13 Jun 2009, 7:03:45 UTC - in response to Message 61727.  

I can exploit the CPU like this because I have a good separate heat sink that keeps the drive at 45-48 C, the CPU around 50, and the core and GPU at 77-80. This represents 10-12 C off ordinary temperature for my computer, and is within acceptable parameters according to its website (I run an HP Pavilion 9600 or the like, 2 CPU's, actually...)

Makes sense that it would have a self-fix algorithm that would work to repair itself on a reset... seems that's the way with most programs in modern computing, be they this or Vista. As for what makes a bug propagate, well, I know what the Domino Theory is in the abstract. However, I am a user-end whiz kid, not a programmer by birth. Thanks for the reply :)


Where can you find such websites giving temperature limits for HP computers? I'd like to find that information for the two I have.

I'm not a programmer by birth either, but I had to learn a few computer languages as part of my jobs. None likely to be used in any BOINC projects, though.
ID: 61728 · Rating: 0
Hammeh

Joined: 11 Nov 08
Posts: 63
Credit: 211,283
RAC: 0
Message 61731 - Posted: 13 Jun 2009, 11:03:41 UTC
Last modified: 13 Jun 2009, 11:04:31 UTC

I find that HP's given temperature ranges for their computers are never actually correct. It is much better to find out what hardware components are in your PC and check each one with the manufacturer as they will give much more accurate figures.
My HP Pavilion a6430.uk with an AMD Phenom x4 9600 and an nvidia 8500GT is good up to a CPU temperature of 72ish degrees C and the graphics card is good up to just over 100.
When running BOINC 24/7 + CUDA, the CPU cores run at a constant 55 degrees C and the GPU at around 68 degrees C.
ID: 61731 · Rating: 0
Cesium_133*

Joined: 1 Dec 08
Posts: 28
Credit: 225,332
RAC: 0
Message 61736 - Posted: 14 Jun 2009, 6:12:13 UTC - in response to Message 61728.  

Where can you find such websites giving temperature limits for HP computers? I'd like to find that information for the two I have.


Now that I look again, I can't find the confounded specs :( Though you might wish to start here:

http://www.shopping.hp.com/webapp/shopping/store_access.do?template_type=landing&landing=notebooks

I know I had to look and look the last time...

Wherever you find the info, I doubt it'll address the issue of software compatibility or problems along those lines with the dv9700 series...
The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic.

ID: 61736 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,264,668
RAC: 4,443
Message 61740 - Posted: 14 Jun 2009, 12:41:11 UTC

It looks like 1.75 can have the lockfile problem, too:

6/14/2009 7:24:31 AM|rosetta@home|Task tails_homnative_relaxed_9mgw_m10_p00_SAVE_ALL_OUT_12749_4637_1 exited with zero status but no 'finished' file
6/14/2009 7:24:31 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
6/14/2009 7:24:31 AM|rosetta@home|Restarting task tails_homnative_relaxed_9mgw_m10_p00_SAVE_ALL_OUT_12749_4637_1 using minirosetta version 175

Repeated over and over, with BOINC 6.2.28 under 32-bit Vista Home Premium SP2 on my machine set to use 95% CPU in order to test for the lockfile problem.
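To put a number on "over and over", the message log can be scanned for that specific line. The snippet below is a rough sketch in Python that assumes the log lines look exactly like the three quoted above; the function name `count_no_finished_restarts` is made up, and nothing here is part of BOINC itself.

```python
import re
from collections import Counter

# Matches the message quoted above and captures the task name.
NO_FINISHED = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_no_finished_restarts(log_lines):
    """Count, per task, how often the 'no finished file' message appears."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "6/14/2009 7:24:31 AM|rosetta@home|Task tails_homnative_relaxed_9mgw_m10_p00_SAVE_ALL_OUT_12749_4637_1 exited with zero status but no 'finished' file",
        "6/14/2009 7:24:31 AM|rosetta@home|If this happens repeatedly you may need to reset the project.",
    ]
    for task, hits in count_no_finished_restarts(sample).most_common():
        print(task, hits)
```

A task that shows up many times in a short span is a likely candidate for the restart loop described here.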

This workunit is supposed to be running, but isn't using a significant amount of CPU time.

Looks like I'll get another chance to copy the files needed to investigate.
ID: 61740 · Rating: 0
Cesium_133*

Joined: 1 Dec 08
Posts: 28
Credit: 225,332
RAC: 0
Message 61741 - Posted: 14 Jun 2009, 18:57:53 UTC - in response to Message 61740.  

It looks like 1.75 can have the lockfile problem, too:


Well, it can lockfile this -gives the Spaceballs salute- if it keeps deciding to stop computing on me. I had it do the same thing again, spin its wheels, on 2 WU's in the past 24 hours. It pulls this one more time, I'll go back to POEM or something similar until the Rosetta people can extricate their heads from you-know-where and devise some code on par with everyone else. This is silly, putting out what looks like an alpha replacement for an already-knackered 1.71... X-(
The lovely lady you see isn't I, but Hayley Westenra, a classical crossover singer from Christchurch, NZ. There is no known voice as hers. Check her out- she's seraphic.

ID: 61741 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,264,668
RAC: 4,443
Message 61745 - Posted: 14 Jun 2009, 23:12:55 UTC
Last modified: 14 Jun 2009, 23:25:37 UTC

This time, I got a chance to copy the files. Looks like I got all I needed from the top level BOINC data directory, but none from lower level directories.

Another pointer to the workunit with the lockfile problem:

https://boinc.bakerlab.org/rosetta/result.php?resultid=258596930

The address to email them to still seems to have DNS problems, though, if it still exists.

If you don't have a ZIP program capable of accepting wildcard selections of which files to copy, it still looks like you need to shut down the boinc.exe program before doing the copying.
ID: 61745 · Rating: 0
FalconFly

Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 61761 - Posted: 15 Jun 2009, 12:31:24 UTC
Last modified: 15 Jun 2009, 12:34:11 UTC

WorkUnit mentioned is this one here.

It may be worth mentioning that this system is seeing some strange validation problems which I'm still trying to pinpoint.

It has worked flawlessly so far for SETI, SIMAP, POEM and LHC, so seeing it fail to validate all of a sudden caught me off guard.

Hopefully a warning will soon be implemented in all validating projects that gives an adequate alert when a host starts to fail validation, as I requested at Berkeley a year ago.
ID: 61761 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 61764 - Posted: 15 Jun 2009, 13:55:22 UTC - in response to Message 61761.  

WorkUnit mentioned is this one here.

It may be worth mentioning that this system is seeing some strange validation problems which I'm still trying to pinpoint.


FalconFly, I've moved your post here to ensure the Project Team takes note. You seem to have completed 92 models for that task and received credit for only one or two for some reason. It looks as if it had a problem with a restart from a checkpoint, or something similar.

There is nothing on your end to look for or correct here.
Rosetta Moderator: Mod.Sense
ID: 61764 · Rating: 0
FalconFly

Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 61766 - Posted: 15 Jun 2009, 15:35:14 UTC - in response to Message 61764.  

There is nothing on your end to look for or correct here.


Okidok, no Problem :)
I just wasn't sure if that is a MiniRosetta problem or some other quirk.

ID: 61766 · Rating: 0
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 61769 - Posted: 15 Jun 2009, 19:21:53 UTC

Hi FalconFly -

I don't see the validation problems you're reporting over on RALPH. Could you join over there for a little while? Otherwise I cannot actually see the log files your machine is producing.

M

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 61769 · Rating: 0
FalconFly

Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 61773 - Posted: 15 Jun 2009, 22:25:23 UTC - in response to Message 61769.  
Last modified: 15 Jun 2009, 22:33:19 UTC

Hi FalconFly -

I don't see the validation problems you're reporting over on RALPH. Could you join over there for a little while? Otherwise I cannot actually see the log files your machine is producing.

M


I'm presently not attached to RALPH@Home, but if the problems persist after my current attempt for a fix, I'll join there.

The stderr_txt looks completely normal to me, except for ending with the invalid status.

I've exchanged the RAM, set Vcore to Auto again and improved CPU cooling a bit. Maybe that helps. I'll report back on that in about 24 hours; with a bit of luck I've smashed that elusive cause of failure (right now I'm betting on defective RAM).
ID: 61773 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org