Report problems with Rosetta version 5.34

Message boards : Number crunching : Report problems with Rosetta version 5.34

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30190 - Posted: 28 Oct 2006, 20:20:12 UTC
Last modified: 28 Oct 2006, 20:23:54 UTC

This task, running on this host, ran for about five hours (pref = 24h) and stopped: no DONE box in the output file, exit err 131.

This host seems to be having more than its fair share of errors since v5.32 and 5.34. It has the same hardware as two other boxes which are not seeing anywhere near such a high error rate.

This box has a heavy network load (it is a router internal to a LAN and masquerades about half a broadband bandwidth) so possibly the heartbeat error is caused by a peak load on the box's main mission?

This box has been OK on Rosetta until v 5.32. It recently ran for two days on LHC WU with no problems - and LHC is about the most fussy project there is for declaring WU invalid! - then back to Rosetta v 5.34 and the errors start again. This is not a complaint, just painting the picture for you.

Let me know if you'd like this box taken off Rosetta due to the error rates - it will stay on unless you say otherwise.

River~~
ID: 30190 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 30191 - Posted: 28 Oct 2006, 20:25:16 UTC
Last modified: 28 Oct 2006, 20:25:32 UTC

Maybe they should increase the "no heartbeat" time out from 30 seconds to ONE minute before it exits the daemon???? I'd think it likely your other TCP traffic was prohibiting Boinc from talking. Perhaps someone should ask Rom at his blog about these error messages and possible problems with high network traffic computers???
ID: 30191 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30192 - Posted: 28 Oct 2006, 20:38:19 UTC - in response to Message 30191.  

... I'd think it likely your other TCP traffic was prohibiting Boinc from talking. ...


Yes Tony, it certainly looks like it.

Only I'd have hoped that traffic from one net card to the other would be queued separately by linux from internal 'ip' traffic from localhost to localhost - especially as the two kinds of traffic are handled by different tables within iptables.

And then again, why was it not killing the older Rosetta versions, or LHC (which ran OK on that box while under similar network load)? As far as I know, all projects have the heartbeat check.

So yes, congestion within the linux network handling is the most plausible culprit, but I am not absolutely convinced yet.

R~~
ID: 30192 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 30193 - Posted: 28 Oct 2006, 21:03:21 UTC

If hardware tests come back saying the hardware is fine, connecting systems with high error rates to Ralph (at least giving them some time on Ralph) will help track down the source of the higher than average errors showing up on some systems.




ID: 30193 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 30194 - Posted: 28 Oct 2006, 21:43:42 UTC

River

I have Boinc switch projects every 2 hours, and when I looked at the messages the time it was showing was about right. It might be a problem with being left in memory when I restarted my PC in the morning, but I haven't had that problem before, so it's got me.

ID: 30194 · Rating: 0
frederick corse
Joined: 7 Oct 05
Posts: 10
Credit: 1,545,999
RAC: 0
Message 30197 - Posted: 28 Oct 2006, 22:45:08 UTC

Hello, I got an unrecoverable error on 1hz6A BOINC NATIVEJUMPS CLOSE CHAINBREAKS VARY ALL BOND ANGLES ALL BOND DISTANCES SAVE ALL OUT 1306 14672 0. Message: <file xfer error> <error code>161</error code>. It didn't clear the listing for it.
ID: 30197 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 30199 - Posted: 28 Oct 2006, 23:17:02 UTC - in response to Message 30191.  
Last modified: 28 Oct 2006, 23:18:02 UTC

Maybe they should increase the "no heartbeat" time out from 30 seconds to ONE minute before it exits the daemon???? I'd think it likely your other TCP traffic was prohibiting Boinc from talking. Perhaps someone should ask Rom at his blog about these error messages and possible problems with high network traffic computers???


Make sure that there is an accept statement listing the loopback interface *and* make sure that the statement below is fairly high in the table. It allows established TCP connections through without checking the packets every time.

-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

Even if your CPU is overcommitted, you still should not have timeouts within 5 seconds... never mind 30...
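
The two checks suggested above can be sketched in Python. This is only a rough illustration, assuming the firewall rules are available as text in `iptables-save` style; the function names and the `SAMPLE_RULES` listing are hypothetical, and only the ESTABLISHED,RELATED line is taken from the post.

```python
# Hypothetical rule listing in iptables-save style. Only the second rule
# comes from the post above; the others are illustrative assumptions.
SAMPLE_RULES = """\
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
"""


def _has_pair(tokens, flag, value):
    """True if `flag` is immediately followed by `value` in the rule."""
    return any(a == flag and b == value for a, b in zip(tokens, tokens[1:]))


def _accepts(tokens):
    """True if the rule ends with the ACCEPT target."""
    return tokens[-2:] == ["-j", "ACCEPT"]


def check_rules(rules_text):
    """Return (loopback_accept_present, index_of_established_rule_or_None)."""
    rules = [line.split() for line in rules_text.splitlines()
             if line.strip().startswith("-A ")]
    loopback_ok = any(_accepts(r) and _has_pair(r, "-i", "lo") for r in rules)
    established_at = next(
        (i for i, r in enumerate(rules)
         if _accepts(r) and _has_pair(r, "--state", "ESTABLISHED,RELATED")),
        None,
    )
    return loopback_ok, established_at


if __name__ == "__main__":
    loopback_ok, position = check_rules(SAMPLE_RULES)
    print("loopback accept rule present:", loopback_ok)
    print("position of ESTABLISHED,RELATED rule (0 = first):", position)
```

A small position means the rule is matched early, which is what "fairly high in the table" asks for.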



Looking for a team ??? Join BoincSynergy!!


ID: 30199 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 30200 - Posted: 28 Oct 2006, 23:17:39 UTC

Seems I was wrong. The no heartbeat message has nothing to do with the manager. I asked a question on the Dev mail list and got this back:

davea@ssl.berkeley.edu to me, boinc_dev

The manager is not involved.

Applications listen for "heartbeat" messages
(sent via shared memory) from the core client.
Normally it's sent once a second.
If the application doesn't get one in 30 secs,
it prints "no heartbeat" and quits

-- David
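
To make the timeout rule in David's reply concrete, here is a minimal Python sketch of the application-side check. It is not BOINC's code: the real heartbeat travels through shared memory, whereas this sketch simply passes timestamps to a hypothetical `HeartbeatMonitor` class. Only the 30-second limit and the one-per-second rate come from the message above.

```python
HEARTBEAT_TIMEOUT_SECS = 30.0  # limit quoted in the reply above


class HeartbeatMonitor:
    """Hypothetical application-side bookkeeping for heartbeat messages."""

    def __init__(self, now, timeout=HEARTBEAT_TIMEOUT_SECS):
        self.timeout = timeout
        self.last_seen = now  # treat start-up as the first heartbeat

    def record_heartbeat(self, now):
        """Call whenever a heartbeat from the core client is observed."""
        self.last_seen = now

    def should_quit(self, now):
        """True once more than `timeout` seconds passed without a heartbeat."""
        return (now - self.last_seen) > self.timeout


if __name__ == "__main__":
    monitor = HeartbeatMonitor(now=0.0)
    # Simulated clock: heartbeats arrive once a second for 10 s, then stop.
    for second in range(1, 11):
        monitor.record_heartbeat(float(second))
    for second in (20.0, 40.0, 41.0):
        status = "no heartbeat, quitting" if monitor.should_quit(second) else "ok"
        print(f"t={second:>5.1f}s: {status}")
```

Under this model a single delayed heartbeat is harmless; the application only gives up after a full 30 seconds of silence.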

ID: 30200 · Rating: 0
netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 30201 - Posted: 28 Oct 2006, 23:20:17 UTC - in response to Message 30200.  
Last modified: 28 Oct 2006, 23:20:50 UTC

Seems I was wrong. The no heartbeat message has nothing to do with the manager. I asked a question on the Dev mail list and got this back:

davea@ssl.berkeley.edu to me, boinc_dev

The manager is not involved.

Applications listen for "heartbeat" messages
(sent via shared memory) from the core client.
Normally it's sent once a second.
If the application doesn't get one in 30 secs,
it prints "no heartbeat" and quits

-- David



wow... why use shared memory for that... I mean that's what semaphores were for... I mean... shared memory has always been much slower than other methods... It's just damned convenient for shared data... (to which a heartbeat does *not* qualify)

Looking for a team ??? Join BoincSynergy!!


ID: 30201 · Rating: 0
PUDDIN TAME
Joined: 3 Oct 06
Posts: 13
Credit: 53,998
RAC: 0
Message 30202 - Posted: 28 Oct 2006, 23:45:29 UTC

What is with some of the new WUs? I just finished running one that took 11 hours. It produced only 2 models. The first ran in about 1 hour. The second model took 10 hours! The only reason I didn't abort it was that the step counter was advancing verrry slowly.
PUDDIN TAME
ID: 30202 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 30204 - Posted: 29 Oct 2006, 1:20:17 UTC - in response to Message 30202.  
Last modified: 29 Oct 2006, 1:22:23 UTC

Some of these workunits take a long time. The reason one of your models took significantly less time (1 hour vs. 10 hours) was because we have included a check -- if the model doesn't reach a low enough energy by a certain point, we prematurely exit, so that your client can have another shot from scratch.
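
The check described above can be illustrated with a short Python sketch. It is a toy model, not Rosetta's code: `run_model`, `checkpoint_step` and `energy_cutoff` are hypothetical names, and the energies are made-up numbers chosen only to show the two outcomes.

```python
def run_model(energies, checkpoint_step, energy_cutoff):
    """Walk through per-step energies and apply one early-exit check.

    If the best (lowest) energy seen by `checkpoint_step` is still above
    `energy_cutoff`, the model is abandoned so a fresh one can start.
    Returns a tuple (status, steps_used, best_energy).
    """
    best = float("inf")
    for step, energy in enumerate(energies, start=1):
        best = min(best, energy)
        if step == checkpoint_step and best > energy_cutoff:
            return ("abandoned", step, best)
    return ("completed", len(energies), best)


if __name__ == "__main__":
    # Made-up trajectories: the first never gets low enough by step 2,
    # the second passes the check and runs to the end.
    print(run_model([10.0, 8.0, 7.0, 6.0], checkpoint_step=2, energy_cutoff=5.0))
    print(run_model([10.0, 4.0, 3.0, 1.0], checkpoint_step=2, energy_cutoff=5.0))
```

This is why model run times can differ so sharply: a model that fails the check stops early, while one that passes continues for the full search.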

Both models are given equal credits -- so it's a kind of interesting lottery. If you pass the 1 hour-ish limit, your client will keep crunching. Although the model doesn't receive more credit for crunching more, that particular model will be a lot more scientifically valuable than if we stopped the search earlier. We are discussing ways to give more credit for models that required more computational power... but that won't happen soon. For now, I'm looking into ways that keep these workunits shorter and to keep the times per model more even!

I just posted a note over in another thread (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2495) to talk about these issues and get feedback.


What is with some of the new WUs? I just finished running one that took 11 hours. It produced only 2 models. The first ran in about 1 hour. The second model took 10 hours! The only reason I didn't abort it was that the step counter was advancing verrry slowly.


ID: 30204 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 30207 - Posted: 29 Oct 2006, 1:26:30 UTC - in response to Message 30160.  

The "DANGER!" messages are actually OK; I'll make sure they don't show up in the next version of Rosetta.

For those of you who are worried about how long it's taking for some of these workunits, please keep crunching. They're slow but the data coming back is pretty spectacular! I've canceled the send out of these kinds of WUs, so you won't get any more until the next software update (maybe this week, or next week), which should process these sorts of workunits significantly faster.

Rosetta 5.34 has a few new features to allow us to test more accurate energy functions and more interesting variations in the protein's bond geometry. Let us know if you see any problems -- especially if they are reproducible!


Here is a bit more of the stuff I am seeing with 5.34 ... It's another of the
1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES jobs

https://boinc.bakerlab.org/rosetta/result.php?resultid=44140906

Below is part of a stdout.txt file from Linux-2.6 ... The message that concerns me is below the URL... (the stdout URL lists many of these).... I am wondering... are these bad workunits????? and BTW.. It's past its bedtime...



http://web.hotiron.net/pics/johng/38941838-partial-stdout.txt

WARNING:: cant find phi but not a chainbreak?
======================================================
======================================================
======================================================
DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!!
======================================================
======================================================
======================================================
DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!! DANGER!!!!
======================================================
======================================================
======================================================
pose_minimize:: Big score_delta when turning on the nblist: 1.43719


ID: 30207 · Rating: 0
Keith Akins
Joined: 22 Oct 05
Posts: 176
Credit: 71,779
RAC: 0
Message 30215 - Posted: 29 Oct 2006, 6:47:52 UTC

Just a side note:

Some of the "...VARY_ALL_BOND_DISTANCES..." and "...VARY_ALL_BOND_ANGLES..." jobs appear to lose the native structure on the screen saver. Whether this anomaly is a symptom of something causing these WUs to process slowly or not, I'm not sure.

Anyone else noticing this?

ID: 30215 · Rating: 0
Faust
Joined: 7 Sep 06
Posts: 14
Credit: 49,559
RAC: 0
Message 30220 - Posted: 29 Oct 2006, 9:34:37 UTC
Last modified: 29 Oct 2006, 9:40:12 UTC

I just saw this :

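Sent: 27 Oct 2006 19:34:40 UTC · Reported: 28 Oct 2006 9:47:52 UTC · Server state: Over · Outcome: Client error · Client state: Compute error · CPU time (sec): 43,201.38 · Claimed credit: 91.06 · Granted credit: ---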

stderr out <core_client_version>5.4.11</core_client_version>
<message>
[unreadable characters] (0x80000003) - exit code -2147483645 (0x80000003)
</message>
<stderr_txt>
# random seed: 1226466
# cpu_run_time_pref: 10800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 43200.4 seconds. Greater than 4X preferred time: 10800 seconds
**********************************************************************
GZIP SILENT FILE: .xx1hz6.out
WARNING! attempt to gzip file .xx1hz6.out failed: file does not exist.


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77F767CD

Engaging BOINC Windows Runtime Debugger...

Then there's a huge dump file: https://boinc.bakerlab.org/rosetta/result.php?resultid=44236969

It has also happened here.






Faust.
ID: 30220 · Rating: 0
RichardJ
Joined: 19 Mar 06
Posts: 8
Credit: 73,014
RAC: 0
Message 30236 - Posted: 29 Oct 2006, 18:05:01 UTC
Last modified: 29 Oct 2006, 18:09:04 UTC

Same thing for 38887432:
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 44451.8 seconds. Greater than 4X preferred time: 10800 seconds

May also happen to 1ogw__BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_SAVE_ALL_OUT__1315_5448
minimum quorum 1
which has been running for over 2 hours, is stuck on 1% and has 10 hours still to run!
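
The watchdog lines quoted in the two reports above reduce to simple arithmetic: 4 × 10800 s = 43200 s, and both reported CPU times exceed that. A minimal Python sketch of that comparison follows. The factor of 4 and the figures come from the log text; the function name and the strict "greater than" comparison are assumptions, not Rosetta's actual implementation.

```python
WATCHDOG_FACTOR = 4  # "Greater than 4X preferred time" in the log text


def watchdog_should_end(cpu_time_secs, preferred_secs, factor=WATCHDOG_FACTOR):
    """True if CPU time exceeds `factor` times the preferred run time."""
    return cpu_time_secs > factor * preferred_secs


if __name__ == "__main__":
    preferred = 10800  # cpu_run_time_pref from the stderr output (3 hours)
    limit = WATCHDOG_FACTOR * preferred
    print(f"limit = {limit} seconds ({limit / 3600:.0f} hours)")
    for cpu_time in (43200.4, 44451.8):  # values reported in this thread
        print(cpu_time, "->", watchdog_should_end(cpu_time, preferred))
```

With a 3-hour preference, a model that runs past 12 hours of CPU time is therefore cut off, which matches both reports.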
ID: 30236 · Rating: 0
dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 30256 - Posted: 29 Oct 2006, 23:06:53 UTC
Last modified: 29 Oct 2006, 23:07:41 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=44151269
1hz6A_BOINC_NATIVEJUMPS_CLOSE_CHAINBREAKS_VARY_ALL_BOND_ANGLES_SAVE_ALL_OUT__1306_26796_0

sin value out of range [-1,+1]
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 30256 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 30268 - Posted: 30 Oct 2006, 6:14:09 UTC - in response to Message 30215.  

Yes, I was noticing this on my own client. Seems to be occurring for other jobs too. I'll ask David Kim to look into it.

Just a side note:

Some of the "...VARY_ALL_BOND_DISTANCES..." and "...VARY_ALL_BOND_ANGLES..." jobs appear to lose the native structure on the screen saver. Whether this anomaly is a symptom of something causing these WUs to process slowly or not, I'm not sure.

Anyone else noticing this?


ID: 30268 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 30269 - Posted: 30 Oct 2006, 6:19:59 UTC - in response to Message 30204.  

An update to those of you who were helping me out with this issue of super-long workunits. I'm testing a new app on ralph that has some tricks to accelerate the workunits without sacrificing too much in terms of finding low energies. Workunits appear to be at least three-fold faster!

So in the next application update, we'll try those workunits again, and hopefully not see the same sorts of issues. The update may not occur for another week -- we also want to incorporate a cool new mode that models the fibrils that are correlated with Alzheimer's and other neuro-degenerative diseases, and that's going to take some optimization!

Some of these workunits take a long time. The reason one of your models took significantly less time (1 hour vs. 10 hours) was because we have included a check -- if the model doesn't reach a low enough energy by a certain point, we prematurely exit, so that your client can have another shot from scratch.

Both models are given equal credits -- so it's a kind of interesting lottery. If you pass the 1 hour-ish limit, your client will keep crunching. Although the model doesn't receive more credit for crunching more, that particular model will be a lot more scientifically valuable than if we stopped the search earlier. We are discussing ways to give more credit for models that required more computational power... but that won't happen soon. For now, I'm looking into ways that keep these workunits shorter and to keep the times per model more even!

I just posted a note over in another thread (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2495) to talk about these issues and get feedback.


What is with some of the new WUs? I just finished running one that took 11 hours. It produced only 2 models. The first ran in about 1 hour. The second model took 10 hours! The only reason I didn't abort it was that the step counter was advancing verrry slowly.



ID: 30269 · Rating: 0
Soren Hedberg
Joined: 30 Oct 06
Posts: 25
Credit: 3,653
RAC: 0
Message 30323 - Posted: 30 Oct 2006, 22:28:58 UTC

I'm having a problem with R@H as well. My problem is that I set the program up to use 100% of the CPU (which, according to SpeedFan, it is doing), and I leave the program to do its work on a job that it says should take 4 hours CPU time. However, after about 30 minutes of working at full load, it says that it has only completed 1 minute 30 seconds worth of CPU time. What's up with that? I have ZoneAlarm and AVG Antivirus working on my computer at the same time. Is it a conflict with one of these programs?

ID: 30323 · Rating: 0
R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 30333 - Posted: 31 Oct 2006, 3:35:50 UTC - in response to Message 30323.  

I'm having a problem with R@H as well. My problem is that I set the program up to use 100% of the CPU (which, according to SpeedFan, it is doing), and I leave the program to do its work on a job that it says should take 4 hours CPU time. However, after about 30 minutes of working at full load, it says that it has only completed 1 minute 30 seconds worth of CPU time. What's up with that? I have ZoneAlarm and AVG Antivirus working on my computer at the same time. Is it a conflict with one of these programs?

Soren, welcome to Rosetta!

I. (Assuming that you meant to say that after 30 minutes of CPU time, the task was showing only a bit more than 1% complete):
Some time ago, longer tasks would appear to be "stuck" at one percent, so the project developers changed the Rosetta application to increase the percent complete from 1.000% by small amounts so that people would not become concerned that the task was "stuck". If this is the case, on the graphic display/screensaver you will see that the task is still working on Model 1. The percentage complete will be updated more realistically after the first model is completed.

II. If you actually *did* intend to say that the task has used only 1 minute 30 seconds of CPU time in a half hour (as measured by a watch or clock), then you can use the Windows Task Manager under the "Processes" tab to check the Rosetta task to see if it's using CPU. Also, check the "General Preferences" and "Rosetta preferences" from your Rosetta Account web page. In particular, check the General preference for "Do work while Computer is in use". If this is set to "No", then Rosetta will suspend itself (stop working) any time you are working with the computer (typing, using the mouse), and for some time afterward. There are also other conditions that must be satisfied in order for Rosetta to run. These are to provide limitations on Rosetta, if necessary, so that it cannot interfere with other work by, say, slowing response times. However, I have Rosetta set to run all the time and have never seen any significant slow-down on other work I do. (Well, perhaps rendering video might be affected, but few more typical tasks like web browsing, word processing, or email.)

Note: you can also use the BOINC Manager "Activity" tab to tell Rosetta to "Run Always". This overrides the "Preferences".

There are many, many people here that really want to help you perform your best and have fun, too. Always feel free to post questions and comments!

Again, welcome! Happy Rosetta crunching!
ID: 30333 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org