Message boards : Number crunching : minirosetta 2.17
Author | Message |
---|---|
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
A few more examples of the Rossmann2x3_abinitio tasks having problems, running until the watchdog nails them, and spitting out gobs of "OVERFLOW ERROR: Error writing" messages: 376887103, 376878933, 377023057. Not all of these tasks are failing - here is a Rossmann2x3_abinitio task which ran normally: 376993800. However, when one of these tasks does decide to go renegade and run all the way out to watchdog territory, it can be justifiably reclassified as demon-spawn - I have watched several and they suck up every spare byte of memory on the system like a tax collector on steroids - I just watched one which had nearly 2 gig of memory allocated and resident. Ouch! This effectively shut out all other BOINC tasks until it completed. No other tasks were able to start until this task was purged and the memory released. |
[AF>france>pas-de-calais]symaski62 Joined: 19 Sep 05 Posts: 47 Credit: 33,871 RAC: 0 |
Moved Chris' post, https://boinc.bakerlab.org/rosetta/result.php?resultid=376884639 ERROR: Unable to open file: minirosetta_database/chemical/residue_type_sets/faaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa/residue_types.txt ERROR:: Exit from: src/core/chemical/ResidueTypeSet.cc line: 96 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish |
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Come on guys - I find it hard to believe that I am the only one seeing these Rossmann2X3 tasks chew up their systems. Some complete, some fail, all are running long and are using nearly 2 gig per task. And all spit out the ominous "OVERFLOW ERROR: Error writing" repeatedly. Here are two which finished - generating just 1 decoy for eight hours of run time: 377289713, 376887410. And here is one which did not (Google the error message and it seems like it is trying to create a string longer than the system / compiler allows): 377281598 |
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Transient - You could be right about it being a problem unique to Linux and OSX (Darwin) - in both cases they very well may be built using the same compiler (GCC?) and it is possible they have stumbled on an awkward spot. I have no way of knowing - in preparation for the purification ceremonies required to reach a higher state of karma and grace, I no longer own or run a Windows system :) |
Joined: 16 Jul 07 Posts: 18 Credit: 16,197,811 RAC: 0 |
Getting computational errors. |
Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 1 |
Getting computational errors. You're getting file transfer errors, error -161 to be precise: <message> <file_xfer_error> <file_name>TEMP_0.01_control_1shfA_SAVE_ALL_OUT_22400_68_1_0</file_name> <error_code>-161</error_code> </file_xfer_error> Your system is processing the tasks just fine, but when it comes to writing the data there is a problem. |
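For anyone who wants to pull the file name and error code out of a pasted task report rather than read them by eye, here is a minimal sketch. It assumes the report contains `file_xfer_error` blocks shaped like the fragment quoted above; the function name `transfer_errors` is made up for illustration, and a regular expression is used because the rest of a task report is not guaranteed to be well-formed XML.

```python
import re

# Fragment copied from the post above. The enclosing <message> tag is left out
# because the post does not show where it closes.
STDERR_FRAGMENT = """
<file_xfer_error>
  <file_name>TEMP_0.01_control_1shfA_SAVE_ALL_OUT_22400_68_1_0</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
"""

# Hypothetical helper: matches one file_xfer_error block at a time.
_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)</error_code>\s*</file_xfer_error>",
    re.DOTALL,
)


def transfer_errors(text):
    """Return a list of (file_name, error_code) pairs found in text."""
    return [(name, int(code)) for name, code in _XFER_ERROR.findall(text)]


for name, code in transfer_errors(STDERR_FRAGMENT):
    print(name, code)
```

This only reports what is in the pasted text; it does not say what caused the transfer error.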
AtHomer Joined: 26 Jan 10 Posts: 13 Credit: 7,145,229 RAC: 0 |
I have had two of those "Rossmann" WUs today and they both "crashed". They just kept on running for hours, the last checkpoint having been over three hours ago. I have spent over 12 hours of crunching today on these runaway tasks. Such a waste of resources! Is there no way to prevent this? When a task has had its last checkpoint a long time in the past, it would be better to stop it right away and download a new one, right? Whenever I see a task like this I abort it manually. |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Such a waste of resources! Is there no way to prevent this? The watchdog should shut the task down automatically when you reach 4 hours past your preferred run time. For example, if you have a runtime of 10 hours a task will terminate at 14 hours if it has not been able to checkpoint before then. Is this a waste of resources? Yes, but it is seen as a reasonable balance between stopping rogue tasks that aren't working properly and not wasting good tasks that are just being a little slow in reaching a checkpoint. Is there a better way to achieve that balance? Perhaps, but I personally don't have a good answer. |
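The watchdog rule described above is simple arithmetic, and can be sketched as follows. This is only an illustration of the rule as stated in the post (preferred run time plus 4 hours), not the project's actual watchdog code; the names `WATCHDOG_GRACE_HOURS`, `watchdog_deadline` and `watchdog_would_stop` are invented for the example.

```python
# Grace period taken from the post above: "4 hours past your preferred run time".
WATCHDOG_GRACE_HOURS = 4.0


def watchdog_deadline(preferred_runtime_hours):
    """Elapsed hours at which the watchdog is expected to stop a task."""
    return preferred_runtime_hours + WATCHDOG_GRACE_HOURS


def watchdog_would_stop(elapsed_hours, preferred_runtime_hours):
    """True once a task has run to or past its watchdog deadline."""
    return elapsed_hours >= watchdog_deadline(preferred_runtime_hours)


# The 10-hour example from the post: the deadline works out to 14 hours.
print(watchdog_deadline(10))
# A task at 13.9 hours on a 10-hour preference has not hit the deadline yet.
print(watchdog_would_stop(13.9, 10))
```

So with an 8-hour preference a runaway task can be expected to cost up to 12 hours, which is why aborting by hand can look tempting.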
Michael Gould Joined: 3 Feb 10 Posts: 39 Credit: 16,053,885 RAC: 0 |
Chris, you obviously run many more WUs than I do, but I haven't had any errors at all running them on my OS X machine. There is a Ross2X3 running as I type this. And I only have 2 gig of total RAM installed. Perhaps only certain WUs are problematic? The larger molecules, I guess. |
Joined: 16 Jul 07 Posts: 18 Credit: 16,197,811 RAC: 0 |
@Greg_BE, if you look at my compute errors you will see that after the WU was sent out to a second party, it errored out again. So, not a problem on my side. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
And I haven't had any errors on my Linux machine. Even one of Chris' Linux machines has no problems with them. Could it be machine specific? AdeB |
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
AdeB wondered: Even one of Chris' Linux machines has no problems with them. Could it be machine specific? The machine you pointed to has had the issue - although the task did not end in error, it did eat up all of the memory in sight, run until the watchdog killed it, and spit out repeated "OVERFLOW ERROR: Error writing" messages. Just because the task runs to completion does not mean it's not a problem task. Extreme memory usage + runtime can be issues when one of these tasks pretty much shuts down the other 3 (or 5) cores on a system. And it is not AMD specific - it also happens on my Xeon-based Mac Pro too. But I do appreciate you taking the time to look at it and offer suggestions, I really do. A couple of sample tasks from the system AdeB pointed to: 377124278, 377297655 |
Sid Celery Joined: 11 Feb 08 Posts: 2474 Credit: 46,503,340 RAC: 3,457 |
In several posts Chris wrote: A few more examples of the Rossmann2x3_abinitio tasks having problems, running until the watchdog nails them, and spitting out gobs of "OVERFLOW ERROR: Error writing" messages. I checked a few days ago and I really didn't see any of this, so I've assumed it was OS specific or machine specific, as suggested, but I just glanced at a long-running watchdog-truncated job and find I had the same experience on my W7 x64 laptop. I've modified Chris's earlier links to show the job names, OS & BOINC version just in case it reveals a more specific pattern of tasks. My task was slightly different in that it does seem to have checkpointed several times before the watchdog cut in at 8+4 hours. Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_005_22515_1974_0 - Windows 7 64-bit 6.10.58. So the pattern is more specifically "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_" if that helps. |
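If you want to check your own task list against that pattern before deciding what to abort, a prefix match is enough. The sketch below is only an illustration: the three task names are the ones quoted in this thread, while `SUSPECT_PREFIX` and `matches_suspect_pattern` are names invented for the example. It says nothing about whether a matching task will actually misbehave, since most of them reportedly ran fine.

```python
# Prefix quoted in the post above as the more specific pattern.
SUSPECT_PREFIX = "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_"


def matches_suspect_pattern(task_name, prefix=SUSPECT_PREFIX):
    """True if the task name starts with the suspect batch prefix."""
    return task_name.startswith(prefix)


# Task names taken from posts in this thread.
tasks = [
    "Rossmann2x3_abinitio_SAVE_ALL_OUT_design_or28_w_csfrags_005_22515_1974_0",
    "PCS_PGR122A_v1.frag_18-51_SAVE_ALL_OUT_22518_71_0",
    "TEMP_0.01_control_1shfA_SAVE_ALL_OUT_22400_68_1_0",
]

suspect = [name for name in tasks if matches_suspect_pattern(name)]
for name in suspect:
    print(name)
```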
Sid Celery Joined: 11 Feb 08 Posts: 2474 Credit: 46,503,340 RAC: 3,457 |
PCS_PGR122A_v1.frag_18-51_SAVE_ALL_OUT_22518_71_0 - Outcome: Client error. Same error from the wingman too. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
In several posts Chris wrote: Of course I only did a quick scan, and missed the problematic tasks on Chris' machine. Sid's approach clearly shows a pattern, nice catch. AdeB |
Sid Celery Joined: 11 Feb 08 Posts: 2474 Credit: 46,503,340 RAC: 3,457 |
In several posts Chris wrote: ... It was a possibility it was OS related while no-one else reported differently. It was the "write errors" that made me realise I had the same issue on a different OS. Also, my error's on an Intel-based laptop, not my AMD desktop (yet), so it's not tied to AMD processors either. It seems to be the task itself (though most went through ok, as Chris originally reported). One for the coders to ponder. |
Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Thanks a lot Sid - since it happened on both my Intel-based Xeon processor and AMD Phenoms I was pretty sure the issue was not silicon - however I could not defend the OS. Darwin's kernel is BSD at the core, and BSD and Linux share a compiler and many run-times, so knowing it also happened with Windows was key. Thanks for taking the time to review your tasks. I had two systems whose queues were just packed to the gills with these tasks - since they used enough memory to bollix up the whole system I had a big abort party after work today. However, speaking of long-running / low decoy count tasks, while scanning a few other users' task lists I was in awe of these PCS* / PCT tasks which came into the queue over the past few days. I don't really care that they are watchdog bait, at least they don't seem to be bringing my system to its knees with a 2 gigabyte memory requirement. |
Pardner Joined: 31 Oct 10 Posts: 6 Credit: 3,442 RAC: 0 |
Hi Snags... Just checked back to see if there were any further comments. And yes I do see the "Reading preferences override file" statement in my Messages. Hope that doesn't cause any "provoking". Thanks again for your help. Pardner |
Speedy Joined: 25 Sep 05 Posts: 163 Credit: 826,597 RAC: 0 |
Both of the following tasks completed successfully. My runtime pref is 3 hours. 379301586 took 5.89 hours, credit 158.14/184.97, & 379301549 took 5.64 hours, credit 151.65/184.97. Both are from batch 1FPW_R2. Ran on a stock i7 980X with HT on. I'm just passing info on, nothing more, nothing less. Edits = getting links to work correctly. Have a crunching good day!! |
cleaner Joined: 22 Aug 10 Posts: 6 Credit: 26,245 RAC: 0 |
I am getting a lot of "output file absent" messages lately. It seems almost every work unit now is spitting out that message. Anyone else having the same issue? |
©2025 University of Washington
https://www.bakerlab.org