Problems with Minirosetta v1.54

Message boards : Number crunching : Problems with Minirosetta v1.54

Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 59045 - Posted: 26 Jan 2009, 22:45:57 UTC
Last modified: 27 Jan 2009, 1:36:31 UTC

Hello All!

We're ready for a new update. I want to thank all of you who have helped over the last months to find and fix errors in minirosetta. A particular thank you goes to those who have donated their time over on RALPH and helped with their active feedback - we managed to find a number of difficult and rare bugs and put some new features into minirosetta that should help conserve computer time. Read about it here: http://ralph.bakerlab.org/forum_thread.php?id=431
and here: http://ralph.bakerlab.org/forum_thread.php?id=432
I should add that work over there will continue, but now supplemented with information from Rosetta@HOME.

This update is highly focused on bugfixing and stability issues - we have virtually no new science in it, but: We will hopefully now be able to run the science projects that have been in the pipeline waiting for BOINC - we're expecting quite a bit of work to go out very soon indeed. See Dr. Baker's journal for more details.


Features/Fixes:
1.54 Release CHANGELOG


  • Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

  • Bug fix concerning intermittent crashes in relax benchmark (_rlbd_) jobs - caused by a buggy input file reader.

  • Bug fix for a potential instability in handling text files (affects all types of WUs).

  • Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs)

  • Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. This will still happen, but the jumps should be much smaller (basically at most as long as the time between checkpoints).

  • Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

  • Added checkpointing to Looprelax.

  • The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. The watchdog will now abort if the runtime exceeds your preferred runtime + 4 hours. In other words the WUs should not overrun for more than around 4 hours. If they do please let us know !!

  • Added a limit on the number of decoys per WU: 99. The WU will end gracefully after that and give full credit. This should address issues with excessive upload problems.

  • Fixed a bug in the BOINC API concerned with unzipping the input data. (I will let the BOINC guys know about this)

  • Fixed a strange problem in the options system leading to early crashes on some systems.

  • Two nasty instabilities fixed deep in the FoldConstraints/abinitio protocol (cc_* tasks and other homology modelling tasks)

  • Generally implemented much better error reporting - many, many potential problems will now show up as meaningful error messages and not random segmentation faults.
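As a rough illustration of the watchdog rule in the list above (abort once the runtime exceeds the preferred runtime plus 4 hours), here is a minimal Python sketch. The function and constant names are hypothetical and are not taken from the minirosetta source; only the 4-hour margin comes from the changelog.

```python
# Hypothetical sketch of the watchdog rule from the changelog. Names are
# illustrative and not taken from the minirosetta source.
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "preferred runtime + 4 hours"


def watchdog_should_abort(elapsed_seconds, preferred_runtime_seconds):
    """Return True once a task has overrun the preferred runtime by more than 4 hours."""
    return elapsed_seconds > preferred_runtime_seconds + WATCHDOG_GRACE_SECONDS


if __name__ == "__main__":
    pref = 14400  # a 4-hour preferred runtime, in seconds
    print(watchdog_should_abort(7 * 3600, pref))  # 7 h is below the 8 h limit
    print(watchdog_should_abort(9 * 3600, pref))  # 9 h is above the 8 h limit
```

With a 4-hour preference the cut-off in this sketch is 8 hours of runtime, which matches the "should not overrun for more than around 4 hours" wording.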



NOTE: This new version still contains a lot of debug output. You will see that the stderr fills up with stuff - that is ok. It does not slow down the program nor cause much extra upload - but it tells us a lot about where things can still go wrong.


Despite all these fixes there are, I'm sure, many problems left. Most of them occur extremely rarely now, though, or are highly specific to particular machines. Thus we have decided to move the current version over from RALPH to Rosetta@HOME and give it a go on a much larger scale. Our efforts to keep the failure rate down will continue, and your time donations over on RALPH as well as error reports are still highly appreciated.

Please let us know how things work out there. In particular, I'd like to know about


  • Stuck workunits
  • Overrunning workunits (WUs should now, due to the new watchdog, never run more than 4 hours longer than the preferred user time)
  • Problems with checkpointing.
  • Any other strange behaviour.




Happy crunching - I'm very excited to see how this new version will pan out.

Mike


http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59047 - Posted: 26 Jan 2009, 23:40:44 UTC

The link in the news item that should bring you to this thread is truncated.
Rosetta Moderator: Mod.Sense

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59048 - Posted: 27 Jan 2009, 0:21:40 UTC
Last modified: 27 Jan 2009, 0:24:04 UTC

The news item also shows the year as 2008 (which is probably the last time you had enough coffee to be able to read the calendar!!). All these improvements are going to send TeraFLOPS much higher! Nice work Mike, and BakerLab. I can really see that you've come through for people here.
Rosetta Moderator: Mod.Sense

darengosse Jean-Paul
Joined: 9 Jun 06
Posts: 18
Credit: 259,459
RAC: 0
Message 59074 - Posted: 27 Jan 2009, 22:46:52 UTC

Hello, version 1.47 worked very well for me, with 151 workunits, 0 errors and an average CPU time of 2.8 hours. I hope the new version 1.54 will do as well...


Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59086 - Posted: 28 Jan 2009, 6:50:38 UTC

If you are seeing errors with lock-file problems try setting the cpu setting back to 100%. If you are running at 100% CPU preference and are getting this problem, I for one, am very interested. If you are getting the failures and change the CPU setting to 100% and that cures the issue ... well, we are interested in THAT too ...

I read about this in Einstein@Home and it seems to work for me ... YMMV ...

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,450
Message 59088 - Posted: 28 Jan 2009, 9:55:25 UTC

I don't know about others but my Rosetta machines are running dry!!! The new minirosetta is stuck downloading at 89.25% and has been there for HOURS!!!!
I have had to attach to a different project until it gets sorted out. So far all machines, exact same problem, one a dual core, one a single core. If you look at my computers, they are not hidden, any task that says "outcome unknown" is because the mini-rosetta download ain't happening!!!! Message in Boinc says:
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
1/28/2009 4:50:35 AM||Internet access OK - project servers may be temporarily down.
etc, etc, etc, etc forever!!!!
Another project now loves you!!
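For anyone wanting to summarise a message log like the one quoted in this post, a minimal Python sketch follows. It assumes the `timestamp|project|message` layout visible in the quoted lines; the function name and the sample constant are hypothetical, and real BOINC logs may contain other message shapes this does not handle.

```python
from collections import Counter

# Sample lines copied from the post above.
SAMPLE_LOG = """\
1/28/2009 4:45:03 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:11 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: HTTP error
1/28/2009 4:50:12 AM|rosetta@home|Started download of minirosetta_1.54_windows_intelx86.exe
1/28/2009 4:50:13 AM||Internet access OK - project servers may be temporarily down.
1/28/2009 4:50:34 AM||Project communication failed: attempting access to reference site
1/28/2009 4:50:34 AM|rosetta@home|Temporarily failed download of minirosetta_1.54_windows_intelx86.exe: connect() failed
"""

FAILED_PREFIX = "Temporarily failed download of "


def count_failed_downloads(log_text):
    """Count 'Temporarily failed download' messages per (filename, reason)."""
    failures = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        message = parts[2]
        if message.startswith(FAILED_PREFIX):
            filename, _, reason = message[len(FAILED_PREFIX):].partition(": ")
            failures[(filename, reason)] += 1
    return failures


if __name__ == "__main__":
    for (filename, reason), count in count_failed_downloads(SAMPLE_LOG).items():
        print(filename, reason, count)
```

Grouping by reason makes it easier to see whether the failures are all of one kind (for example only `connect() failed`) or a mix.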

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59093 - Posted: 28 Jan 2009, 12:27:44 UTC

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.
Rosetta Moderator: Mod.Sense

Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 59103 - Posted: 28 Jan 2009, 19:59:21 UTC

Paul,

can you point me to the thing you read about Lockfile problems on Einstein !?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU? If I can make this happen here on my machine I could learn more about what's going on.

Mike

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59106 - Posted: 28 Jan 2009, 20:20:22 UTC

What do you mean by 100% CPU ?


"computing preferences" configured on website for the venue of the machine. The setting is called "Use at most" at the bottom of the processor usage section.

Can also be configured via the BOINC Manager for a specific host.
Rosetta Moderator: Mod.Sense

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,450
Message 59108 - Posted: 28 Jan 2009, 22:10:00 UTC - in response to Message 59093.  

mikey, I haven't seen that problem myself, so it's not likely on the server side. At least not consistently. So it also seems odd that all of your servers are stopping... is it on the same file? You have to download the new programs, which is several MB. Are your machines all going through the same proxy or something that might be hung up on that particular file?

I do not use a proxy, just straight to the net. I use Comcast.

Could I ask you to check the transfers tab and see exactly which file and how much of it you've downloaded? Your hosts seem to have pretty good bandwidth.

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's. The one I am looking at right now has been trying for 11:51:02 and is going to retry in 03:34:34, and counting.

Is anyone else seeing such a problem? Given the increase in project TFLOPS, I am thinking it is rare at best.

Have you tried aborting the transfer on one of the machines? This may cause a couple of tasks to fail due to downloading error, but BOINC will recover and eventually try to pull a fresh copy of the problem file.

Yes I have, no luck, the file is stuck at 89.25, 89.26 or 89.27% depending on the pc. I am stuck at exactly 5.85 meg of 6.56 meg on all machines.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59110 - Posted: 28 Jan 2009, 22:21:08 UTC

mikey, if you would like to study this further, it would be helpful if you could create a cc_config.xml file and add the flag for debug of file transfers. You have to define the first three flags as shown, then just add a line for the:
<file_xfer_debug>1</file_xfer_debug>

If you already have such a file set up, do you have the <http_1_0> flag defined? Not asking you to do that one, just asking if you were already doing it. HTTP 1.0 does not have the ability to retry from the middle of the transfer (persistent file transfer is the term BOINC uses for this). It has to start over on each attempt. Then BOINC seems to only open the pipe for 5 minutes at a time. So if you can't get the whole thing in 5 minutes, it might never happen.
Rosetta Moderator: Mod.Sense
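For readers who have not set up a cc_config.xml before, here is a minimal Python sketch that builds one with the file transfer debug flag turned on. The `<cc_config>` root and `<log_flags>` section are an assumption about the usual file layout, since the example the post points to ("as shown") is not preserved here; `build_cc_config` is a hypothetical helper name. Where the file must be placed and how the client re-reads it depends on the BOINC version, so check the client documentation before relying on this.

```python
import xml.etree.ElementTree as ET


def build_cc_config(log_flags):
    """Return cc_config.xml text with the given log flags (assumed layout)."""
    root = ET.Element("cc_config")
    flags = ET.SubElement(root, "log_flags")
    for name, enabled in log_flags.items():
        # Each flag becomes an element such as <file_xfer_debug>1</file_xfer_debug>
        ET.SubElement(flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; write it to cc_config.xml yourself once you have checked it.
    print(build_cc_config({"file_xfer_debug": True}))
```

Generating the file rather than typing it by hand mainly guards against unbalanced tags, which are an easy mistake to make in a small hand-edited XML file.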

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 59112 - Posted: 28 Jan 2009, 23:24:06 UTC

I'm seeing a validate error on task 224245929, workunit 204213187, on Mac OS X 10.4.11. The task name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_cs_frags_6231_115354_1: it ran twice as long as it was supposed to, and I was the second person to get it. The original person to whom it was sent also got the same validate error: irritating after it took twice as long as it was supposed to. It seems to be one of these zinc-containing proteins that have a habit of doing this.

<core_client_version>6.2.18</core_client_version>
<![CDATA[
<stderr_txt>
BOINC:: Initializing ... ok.
[2009- 1-28 1:26:32:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing core...
Initializing options.... ok
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip
<unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./>
Firstarg=true; pp=-d./
firstarg: <-d./>
End of unzipping.
Setting database description ...
Setting up checkpointing ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish

</stderr_txt>
]]>
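When comparing several reports like this one, it can help to pull the key facts out of the stderr text automatically. The following Python sketch extracts the structures started, the `cpu_run_time_pref` value and whether `boinc_finish` was called. It assumes the line formats seen in the output above; `summarize_stderr` is a hypothetical helper, not part of BOINC or minirosetta.

```python
import re

# Excerpt copied from the stderr output above.
SAMPLE_STDERR = """\
Starting watchdog...
Starting work on structure: _00001
Watchdog active.
# cpu_run_time_pref: 14400
Starting work on structure: _00002
====>
called boinc_finish
"""


def summarize_stderr(text):
    """Extract a few facts from minirosetta stderr text (assumed line formats)."""
    structures = re.findall(
        r"^Starting work on structure: (\S+)$", text, flags=re.MULTILINE
    )
    pref = re.search(r"^# cpu_run_time_pref: (\d+)$", text, flags=re.MULTILINE)
    return {
        "structures_started": structures,
        "cpu_run_time_pref_seconds": int(pref.group(1)) if pref else None,
        "finished": "called boinc_finish" in text,
    }


if __name__ == "__main__":
    print(summarize_stderr(SAMPLE_STDERR))
```

Here the preference of 14400 seconds is 4 hours, so a task that "ran twice as long as it was supposed to" would have used roughly 8 hours of CPU time.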


Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 59116 - Posted: 29 Jan 2009, 0:11:48 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it into your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.
Rosetta Moderator: Mod.Sense

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 59119 - Posted: 29 Jan 2009, 2:38:00 UTC - in response to Message 59108.  

It is "minirosetta_1.54_windows_intelx86.exe" I have aborted, retried, everything, it is just stuck on ALL of my pc's.


My guess is that some anti-virus software either on your PC or at your ISP is blocking the download because the file is a .exe and it somehow looks suspicious to the anti-virus software.


Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 59121 - Posted: 29 Jan 2009, 2:51:39 UTC

Long-running model reported in the appropriate thread here




AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 59122 - Posted: 29 Jan 2009, 2:57:50 UTC


Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 59123 - Posted: 29 Jan 2009, 3:10:15 UTC - in response to Message 59122.  

I'm seeing a number of WUs ending at 99 models. They are ending normally, but they often take less than half my 12 hour (43,200 sec) preference.

Some examples:
https://boinc.bakerlab.org/rosetta/result.php?resultid=223957908
https://boinc.bakerlab.org/rosetta/result.php?resultid=223968996
https://boinc.bakerlab.org/rosetta/result.php?resultid=223981088
https://boinc.bakerlab.org/rosetta/result.php?resultid=223989528
https://boinc.bakerlab.org/rosetta/result.php?resultid=223997524
https://boinc.bakerlab.org/rosetta/result.php?resultid=224065056


Sorry, I should have mentioned there is a new rule. Mini will not produce more than 99 models. It will finish gracefully and grant full credit. The reason for this is that I want to prevent your individual uploads from getting too large. In the future there will be a better way to do this, such as checking that the output file size has not reached some limit.
It's just another safety hook that's been put in to prevent WUs from misbehaving.

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
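To see why a fast-running WU can end well short of the preferred runtime under the 99-model rule, here is a simplified Python sketch. The stop rule and the names are hypothetical and only mirror the description in this post; the real scheduling inside minirosetta is certainly more involved.

```python
MAX_DECOYS_PER_WU = 99  # limit stated in the post


def should_stop(models_done, elapsed_seconds, preferred_runtime_seconds):
    """Hypothetical stop rule: end at the model cap or at the preferred runtime."""
    return (
        models_done >= MAX_DECOYS_PER_WU
        or elapsed_seconds >= preferred_runtime_seconds
    )


def run_work_unit(seconds_per_model, preferred_runtime_seconds):
    """Simulate a WU whose models each take a fixed time; return (models, elapsed)."""
    models_done = 0
    elapsed = 0.0
    while not should_stop(models_done, elapsed, preferred_runtime_seconds):
        elapsed += seconds_per_model
        models_done += 1
    return models_done, elapsed


if __name__ == "__main__":
    # 200 s per model with a 12-hour (43,200 s) preference hits the cap first.
    print(run_work_unit(200, 43200))
```

In this simulation, models that take 200 seconds each reach the cap after 19,800 seconds, which is less than half of a 43,200-second preference and consistent with the behaviour reported in the quoted message.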

darengosse Jean-Paul
Joined: 9 Jun 06
Posts: 18
Credit: 259,459
RAC: 0
Message 59124 - Posted: 29 Jan 2009, 5:11:01 UTC

Hello everyone.
I had no problems receiving Minirosetta v1.54 WUs.
I received 17 WUs, due by February 6, 2009 at 21:28:04 (France time).
The first calculations should begin today (January 29), and if there are problems I will let you know here.


Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59126 - Posted: 29 Jan 2009, 6:49:25 UTC - in response to Message 59103.  

Paul,

can you point me to the thing you read about Lockfile problems on Einstein !?

5% of jobs fail in this way consistently. I would love to know if the problem is us or the clients or what, and get it resolved.

What do you mean by 100% CPU ? If i can make this happen here on my machine i could learn better about what's going on.

Mike


Two places to start are here and here ... I can also report that since I made that change I have been getting good results on Win XP systems ... I cannot see the high error rate I had in the past as the tasks have been purged ...

It seemed to me to be a problem I had on XP and it was most severe on the i7 where there are more things going on ...

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,450
Message 59128 - Posted: 29 Jan 2009, 11:34:33 UTC - in response to Message 59116.  
Last modified: 29 Jan 2009, 12:05:38 UTC

mikey, I don't know why I didn't think of this before...

Do a binary ftp of the file from here:

boinc.bakerlab.org/download/minirosetta_1.54_windows_intelx86.exe

and drop it into your Rosetta folder in your BOINC data directory under the projects subfolder. That will at least get you up and running, or on to the next file to see if similar problems continue.


No difference, I downloaded the file, dropped it into the directory C:BoincDataProjectsboinc.bakerlab.org_rosetta. It asked if I wanted to overwrite what was there, the new file was bigger, I said yes, exited Boinc, restarted Boinc, did an update of Rosetta and it is downloading the file AGAIN and is stuck at exactly the same place. I even turned off my ad-aware and anti-virus and no change.

Change #1....Just after I first posted this I did a total shutdown and then a restart, no change, Boinc is still trying to download that same file! I am about ready to detach and then reattach and see if that fixes it!

Change #2....I detached and then reattached. Started downloading all the Rosetta files again. I made sure everything Rosetta was gone out of the Boinc and all subdirectories, so downloading was not a surprise. It got thru all the files except the usual one, stopped at exactly the same place. I aborted the transfer and stopped Boinc. I then copied the file I had downloaded manually into the same place as before, and did another update of Rosetta. It asked for 36000 seconds of work and got none. It went into the communication deferred state and is now downloading the EXACT SAME FILE again!!!! It is also STUCK at the EXACT SAME PLACE!!!!
I have no clue how to fix this and other projects are working just fine. Frustrating to say the least!!!!!

©2024 University of Washington
https://www.bakerlab.org