minirosetta 2.17

Message boards : Number crunching : minirosetta 2.17

Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68412 - Posted: 5 Nov 2010, 9:31:07 UTC
Last modified: 5 Nov 2010, 9:35:01 UTC

A few more examples of the Rossmann2x3_abinitio tasks having problems, running until the watchdog nails them, and spitting out gobs of "OVERFLOW ERROR: Error writing" messages.

376887103
376878933
377023057


Not all of these tasks are failing - here is a Rossmann2x3_abinitio task which ran normally:

376993800

However, when one of these tasks does decide to go renegade and run all the way out to watchdog territory, it can justifiably be reclassified as demon-spawn. I have watched several, and they suck up every spare byte of memory on the system like a tax collector on steroids. I just watched one which had nearly 2 GB of memory allocated and resident.

Ouch!

This effectively shut out all other BOINC tasks until it completed. No other tasks were able to start until this task was purged and the memory released.
[AF>france>pas-de-calais]symaski62

Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 68415 - Posted: 5 Nov 2010, 12:01:17 UTC - in response to Message 68409.  

Moved Chris' post.
It appears these are failing at startup; here are some direct links:

This one ran for more than 7 hours before failing.
Overflow error: 376994899


https://boinc.bakerlab.org/rosetta/result.php?resultid=376884639

ERROR: Unable to open file: minirosetta_database/chemical/residue_type_sets/faaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/residue_types.txt

ERROR:: Exit from: src/core/chemical/ResidueTypeSet.cc line: 96
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish



Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68428 - Posted: 6 Nov 2010, 12:10:57 UTC

Come on guys - I find it hard to believe that I am the only one seeing these Rossmann2X3 tasks chew up their systems. Some complete, some fail, all are running long and are using nearly 2 gig per task. And all spit out the ominous "OVERFLOW ERROR: Error writing" repeatedly.

Here are two which finished - generating just 1 decoy for eight hours of run time:

377289713
376887410

And here is one which did not. (Googling the error message suggests it is trying to create a string longer than the system or compiler allows.)

377281598
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68432 - Posted: 6 Nov 2010, 14:07:42 UTC

Transient -

You could be right about it being a problem unique to Linux and OSX (Darwin) - in both cases they very well may be built using the same compiler (GCC?) and it is possible they have stumbled on an awkward spot.

I have no way of knowing - in preparation for the purification ceremonies required to reach a higher state of karma and grace, I no longer own or run a Windows system :)
Profile Ace Casino

Joined: 16 Jul 07
Posts: 17
Credit: 11,392,749
RAC: 13,634
Message 68433 - Posted: 6 Nov 2010, 20:20:34 UTC

Getting computational errors.
Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 68436 - Posted: 6 Nov 2010, 20:59:36 UTC - in response to Message 68433.  

Getting computational errors.



You're getting file transfer errors,
error -161 to be precise:
<message>
<file_xfer_error>
<file_name>TEMP_0.01_control_1shfA_SAVE_ALL_OUT_22400_68_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

Your system is processing the tasks just fine, but there is a problem when it comes to writing the data.
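
For anyone who wants to pull the file name and error code out of a stderr excerpt like the one above, here is a minimal sketch. It assumes the excerpt is only a fragment (the `<message>` tag is never closed in what was posted), so it uses a regular expression rather than an XML parser. The function name `find_xfer_errors` is made up for this example and is not part of BOINC.

```python
import re

# Fragment copied from the post above. It is not a complete XML document,
# so a regular expression is used instead of an XML parser.
STDERR_FRAGMENT = """<message>
<file_xfer_error>
<file_name>TEMP_0.01_control_1shfA_SAVE_ALL_OUT_22400_68_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
"""

# Matches one <file_xfer_error> block and captures its name and code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"</file_xfer_error>"
)


def find_xfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in text."""
    return [(m.group("name"), int(m.group("code")))
            for m in XFER_ERROR.finditer(text)]


for name, code in find_xfer_errors(STDERR_FRAGMENT):
    print(f"{name}: error {code}")
```

Running this over the stderr of several failed results would show quickly whether they all report the same error code.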
AtHomer
Joined: 26 Jan 10
Posts: 13
Credit: 7,145,229
RAC: 0
Message 68437 - Posted: 6 Nov 2010, 21:19:59 UTC
Last modified: 6 Nov 2010, 21:20:32 UTC

I have had two of those "Rossmann" WUs today and they both "crashed". They just kept on running for hours, the last checkpoint having been over three hours ago. I have spent over 12 hours of crunching today on these runaway tasks. Such a waste of resources! Is there no way to prevent this?

When a task has had its last checkpoint a long time in the past, it would be better to stop it right away and download a new one, right? Whenever I see a task like this I abort it manually.
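
The manual check described above (abort when the last checkpoint is too far in the past) could be sketched roughly as follows. Everything here is hypothetical: the `TaskStatus` record, its field names, and the 3-hour threshold (taken loosely from the "over three hours ago" observation) are assumptions for illustration, not anything BOINC or Rosetta provides.

```python
from dataclasses import dataclass


@dataclass
class TaskStatus:
    """Hypothetical snapshot of a running task, in CPU hours."""
    name: str
    cpu_hours: float
    last_checkpoint_cpu_hours: float


def hours_since_checkpoint(task):
    """CPU hours that would be lost if the task were restarted now."""
    return task.cpu_hours - task.last_checkpoint_cpu_hours


def looks_stalled(task, max_gap_hours=3.0):
    """Flag a task whose last checkpoint is older than the assumed threshold."""
    return hours_since_checkpoint(task) > max_gap_hours


tasks = [
    TaskStatus("example_task_a", cpu_hours=6.5, last_checkpoint_cpu_hours=3.0),
    TaskStatus("example_task_b", cpu_hours=2.0, last_checkpoint_cpu_hours=1.8),
]
for task in tasks:
    print(task.name, "stalled" if looks_stalled(task) else "ok")
```

A slow but healthy task can also go a long time between checkpoints, so any fixed threshold like this trades wasted hours against aborting good work.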
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 68438 - Posted: 6 Nov 2010, 23:08:29 UTC - in response to Message 68437.  

Such a waste of resources! Is there no way to prevent this?


The watchdog should shut the task down automatically when you reach
4 hours past your preferred run time. For example, if you have a runtime
of 10 hours a task will terminate at 14 hours if it has not been able to
checkpoint before then.

Is this a waste of resources? Yes, but it is seen as a reasonable
balance between stopping rogue tasks that aren't working properly and
not wasting good tasks that are just being a little slow in reaching a
checkpoint.

Is there a better way to achieve that balance? Perhaps, but I
personally don't have a good answer.
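
As a worked version of the rule above (preferred run time plus 4 hours), here is a small sketch. It only models the arithmetic as described in this post, not the actual watchdog code in minirosetta, and the function names are invented for the example.

```python
# Grace period quoted in the post: 4 hours past the preferred run time.
WATCHDOG_GRACE_HOURS = 4


def watchdog_cutoff_hours(preferred_runtime_hours):
    """Run time at which the watchdog is described as stepping in."""
    return preferred_runtime_hours + WATCHDOG_GRACE_HOURS


def watchdog_would_stop(cpu_hours, preferred_runtime_hours):
    """True once a task has reached the cutoff for the given preference."""
    return cpu_hours >= watchdog_cutoff_hours(preferred_runtime_hours)


print(watchdog_cutoff_hours(10))        # 10 hour preference -> 14
print(watchdog_would_stop(13.5, 10))    # still under the cutoff -> False
print(watchdog_would_stop(14.0, 10))    # at the cutoff -> True
```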
Michael Gould

Joined: 3 Feb 10
Posts: 39
Credit: 14,668,842
RAC: 5,713
Message 68440 - Posted: 7 Nov 2010, 7:09:45 UTC - in response to Message 68432.  


You could be right about it being a problem unique to Linux and OSX (Darwin)...


Chris, you obviously run many more WU's than I do, but I haven't had any errors at all running them on my OS X machine. There is a Ross2X3 running as I type this. And I only have 2 gig of total ram installed.

Perhaps only certain WU's are problematic? The larger molecules, I guess.
Profile Ace Casino

Joined: 16 Jul 07
Posts: 17
Credit: 11,392,749
RAC: 13,634
Message 68441 - Posted: 7 Nov 2010, 10:40:21 UTC

@Greg_BE,
If you look at my compute errors you will see that after the WU was sent out to a second party, it errored out again. So, not a problem on my side.
Profile AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 68442 - Posted: 7 Nov 2010, 10:46:53 UTC - in response to Message 68440.  


You could be right about it being a problem unique to Linux and OSX (Darwin)...


Chris, you obviously run many more WU's than I do, but I haven't had any errors at all running them on my OS X machine. There is a Ross2X3 running as I type this. And I only have 2 gig of total ram installed.

Perhaps only certain WU's are problematic? The larger molecules, I guess.


And I haven't had any errors on my Linux machine. Even one of Chris' Linux machines has no problems with them. Could it be machine specific?

Adeb
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68444 - Posted: 7 Nov 2010, 14:59:11 UTC
Last modified: 7 Nov 2010, 15:04:04 UTC

AdeB wondered:

Even one of Chris' Linux machines has no problems with them. Could it be machine specific?


The machine you pointed to has had the issue - although the task did not end in error, it did eat up all of the memory in sight, run until the watchdog killed it, and spit out repeated "OVERFLOW ERROR: Error writing" messages.

Just because the task runs to completion does not mean it's not a problem task. Extreme memory usage + runtime can be issues when one of these tasks pretty much shuts down the other 3 (or 5) cores on a system.

And it is not AMD specific - it also happens on my Xeon-based Mac Pro.

But I do appreciate you taking the time to look at it and offer suggestions, I really do.

A couple of sample tasks from the system AdeB pointed to:

377124278
377297655
Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 68455 - Posted: 8 Nov 2010, 13:10:35 UTC

In several posts Chris wrote:
A few more examples of the Rossmann2x3_abinitio tasks having problems, running until the watchdog nails them, and spitting out gobs of "OVERFLOW ERROR: Error writing" messages.

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_226_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_182_0 - Darwin 6.10.58
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_1096_0 - Linux 6.10.56

...

Come on guys - I find it hard to believe that I am the only one seeing these Rossmann2X3 tasks chew up their systems. Some complete, some fail, all are running long and are using nearly 2 gig per task. And all spit out the ominous "OVERFLOW ERROR: Error writing" repeatedly.

Here are two which finished - generating just 1 decoy for eight hours of run time:

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_1024_1 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_007_22515_256_0 - Linux 6.10.56

...

You could be right about it being a problem unique to Linux and OSX (Darwin) - in both cases they very well may be built using the same compiler (GCC?) and it is possible they have stumbled on an awkward spot.

I have no way of knowing - in preparation for the purification ceremonies required to reach a higher state of karma and grace, I no longer own or run a Windows system :)

...

AdeB wondered:
Even one of Chris' Linux machines has no problems with them. Could it be machine specific?

The machine you pointed to has had the issue - although the task did not end in error, it did eat up all of the memory in sight, run until the watchdog killed it, and spit out repeated "OVERFLOW ERROR: Error writing" messages.

Just because the task runs to completion does not mean it's not a problem task. Extreme memory usage + runtime can be issues when one of these tasks pretty much shuts down the other 3 (or 5) cores on a system.

And it is not AMD specific - it also happens on my Xeon-based Mac Pro.

But I do appreciate you taking the time to look at it and offer suggestions, I really do.

A couple of sample tasks from the system AdeB pointed to:

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56

I checked a few days ago and I really didn't see any of this, so I've assumed it was OS specific or machine specific, as suggested, but I just glanced at a long-running watchdog-truncated job and find I had the same experience on my W7 x64 laptop.

I've modified Chris's earlier links to show the job names, OS & BOINC version just in case it reveals a more specific pattern of tasks. My task was slightly different in that it does seem to have checkpointed several times before the watchdog cut in at 8+4 hours.

Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_005_22515_1974_0 - Windows 7 64-bit 6.10.58

So the pattern is more specifically "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_" if that helps.
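
To make the pattern easier to check, here is a rough sketch that takes the "job name - OS BOINC-version" lines from this post, keeps those starting with the prefix above, and lists which operating systems show up. The line format and the helper names `parse_report` and `summarize` are assumptions made for this example only.

```python
PREFIX = "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_"

# Lines copied from this post, in the form "job name - OS BOINC-version".
REPORTS = """
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_226_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_182_0 - Darwin 6.10.58
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_008_22515_1096_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_001_22515_1024_1 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_007_22515_256_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_004_22515_1706_0 - Linux 6.10.56
Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_005_22515_1974_0 - Windows 7 64-bit 6.10.58
"""


def parse_report(line):
    """Split one report line into (job name, OS name, BOINC version)."""
    name, platform = line.split(" - ", 1)
    os_name, boinc_version = platform.rsplit(" ", 1)
    return name, os_name, boinc_version


def summarize(text):
    """Return the distinct matching job names and the OS names they ran on."""
    names, systems = set(), set()
    for line in text.strip().splitlines():
        name, os_name, _ = parse_report(line.strip())
        if name.startswith(PREFIX):
            names.add(name)
            systems.add(os_name)
    return names, systems


names, systems = summarize(REPORTS)
print(len(names), "distinct tasks on", sorted(systems))
```

Using sets means a job name listed twice is only counted once, and the OS list makes it plain that the problem is not limited to one platform.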
Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 68460 - Posted: 8 Nov 2010, 17:17:17 UTC
Last modified: 8 Nov 2010, 17:21:02 UTC

PCS_PGR122A_v1.frag_18-51_SAVE_ALL_OUT_22518_71_0
Outcome Client error
Client state Compute error
Exit status 1 (0x1)

CPU time 14.30529

stderr out <core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
...
ERROR: First parameter of SVD_Solver constructor MUST be larger than the second parameter
ERROR:: Exit from: ..\..\src\numeric\SVD\SVD_Solver.cc line: 202
BOINC:: Error reading and gzipping output datafile: default.out

Same error from the wingman too.
Profile AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 68462 - Posted: 8 Nov 2010, 19:37:51 UTC - in response to Message 68455.  

In several posts Chris wrote: ...

...So the pattern is more specifically "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_" if that helps.


Of course I only did a quick scan, and missed the problematic tasks on Chris' machine.
Sid's approach clearly shows a pattern, nice catch.

AdeB
Sid Celery

Joined: 11 Feb 08
Posts: 1966
Credit: 38,188,338
RAC: 11,005
Message 68464 - Posted: 8 Nov 2010, 22:43:58 UTC - in response to Message 68462.  

In several posts Chris wrote: ...

...So the pattern is more specifically "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_" if that helps.


Of course I only did a quick scan, and missed the problematic tasks on Chris' machine.
Sid's approach clearly shows a pattern, nice catch.

It was a possibility it was OS related while no-one else reported differently. It was the "write errors" that made me realise I had the same issue on a different OS.

Also, my error's on an Intel-based laptop, not my AMD desktop (yet), so it's not tied to AMD processors either. It seems to be the task itself (though most went through ok, as Chris originally reported). One for the coders to ponder.
Profile Chris Holvenstot
Joined: 2 May 10
Posts: 220
Credit: 9,106,918
RAC: 0
Message 68468 - Posted: 8 Nov 2010, 23:16:20 UTC

Thanks a lot, Sid - since it happened on both my Intel-based Xeon processor and my AMD Phenoms, I was pretty sure the issue was not silicon - however, I could not defend the OS.

Darwin's kernel is BSD at the core, and BSD and Linux share a compiler and many run-times, so knowing it also happened with Windows was key. Thanks for taking the time to review your tasks.

I had two systems whose queues were just packed to the gills with these tasks - since they used enough memory to bollix up the whole system I had a big abort party after work today.

However, speaking of long-running / low decoy count tasks, while scanning a few other users' task lists I was in awe of these PCS* / PCT tasks which came into the queue over the past few days.

I don't really care that they are watchdog bait, at least they don't seem to be bringing my system to its knees with a 2 gigabyte memory requirement.

Pardner

Joined: 31 Oct 10
Posts: 6
Credit: 3,442
RAC: 0
Message 68471 - Posted: 9 Nov 2010, 1:57:07 UTC - in response to Message 68384.  



Hi Snagletooth & Mod. Sense,

Snags you are the BOMB!!

I looked at the statement in Messages for what you suggested... "General prefs: from rosetta@home (last modified 31-Oct-2010 15:20:47)".
What I found was "General prefs: from SETI@home (last modified 21-Sep-2010 09:26:17)".

I logged into SETI to update my global "Computing Preferences". Made one change and clicked Update to see if it would work. I exited the BOINC app and then restarted BOINC, looked for the "General prefs" message and it had not changed. (Maybe because the SETI project is currently down)

I then decided to log into Rosetta and modified 1 item in the "Computing Preferences" and clicked Update. I then exited BOINC and restarted it.
Lo and behold I saw "11/3/2010 10:58:12 AM rosetta@home General prefs: from rosetta@home (last modified 03-Nov-2010 10:51:03)".

I'm now running just fine on both PCs with no "CPU usage too high" messages.

Thanks very much to you, Mod.Sense & Murasaki for taking the time with this and providing input. It is VERY much appreciated!!!! I was going through "crunch withdrawal".

Pardner


Well, Pardner, I don't know quite why that worked but nonetheless I am very happy it did. (And very glad Mod.Sense quickly caught my omission of the "update" step. Details, details are everything!)

Happy crunching,
Snags

If you are willing to satisfy my curiosity (or, more likely, risk provoking it further) you could say whether you ever saw the "Reading preferences override file".


Hi Snags... Just checked back to see if there were any further comments. And yes I do see the "Reading preferences override file" statement in my Messages. Hope that doesn't cause any "provoking". Thanks again for your help.

Pardner
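
The check that solved this (looking at which project the "General prefs: from ..." line names) can be sketched as below. The two sample lines are copied from the messages quoted in this post. The regular expression only assumes the wording shown there, and `prefs_source` is a made-up helper name, not a BOINC function.

```python
import re

# Matches the wording quoted in this post, for example:
#   General prefs: from SETI@home (last modified 21-Sep-2010 09:26:17)
PREFS_LINE = re.compile(
    r"General prefs: from (?P<source>\S+) \(last modified (?P<modified>[^)]+)\)"
)


def prefs_source(message):
    """Return (project, last_modified) from a message line, or None."""
    match = PREFS_LINE.search(message)
    if match is None:
        return None
    return match.group("source"), match.group("modified")


before = "General prefs: from SETI@home (last modified 21-Sep-2010 09:26:17)"
after = ("11/3/2010 10:58:12 AM rosetta@home General prefs: "
         "from rosetta@home (last modified 03-Nov-2010 10:51:03)")

print(prefs_source(before))
print(prefs_source(after))
```

If the project named here is not the one where the preferences were last edited, that is a hint the client is still using older settings from another project.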
Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 173
Message 68634 - Posted: 16 Nov 2010, 4:06:04 UTC
Last modified: 16 Nov 2010, 4:15:59 UTC

Both of the following tasks completed successfully. My runtime pref is 3 hours.
379301586 took 5.89 hours (credit 158.14/184.97) and 379301549 took 5.64 hours (credit 151.65/184.97). Both are from batch 1FPW_R2 and ran on a stock i7 980X with HT on. I'm just passing the info on, nothing more, nothing less.
Edit: getting the links to work correctly.
Have a crunching good day!!
cleaner

Joined: 22 Aug 10
Posts: 6
Credit: 26,245
RAC: 0
Message 68637 - Posted: 16 Nov 2010, 10:06:18 UTC

I am getting a lot of "output file absent" messages lately. It seems almost every work unit is now spitting out that message. Is anyone else having the same issue?
©2024 University of Washington
https://www.bakerlab.org