Rosetta@home

minirosetta 2.14

Yifan Song
Forum moderator
Project administrator
Project developer
Project scientist

Joined: May 26 09
Posts: 62
ID: 318024
Credit: 7,322
RAC: 0
Message 66050 - Posted 10 May 2010 19:48:20 UTC

More CASP related updates.

Felix Profile

Joined: Nov 10 08
Posts: 2
ID: 287385
Credit: 107,587
RAC: 0
Message 66062 - Posted 11 May 2010 2:05:44 UTC - in response to Message ID 66050.

More CASP related updates.


Question: Is there ever an end to the Rosetta workunits, or do you guys keep sending out the same ones thousands of times to be run with slightly different parameters?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66063 - Posted 11 May 2010 2:27:27 UTC

Different parameters, different techniques, different proteins: Rosetta is a development project. There is no fixed amount of predetermined work to be done from a list.
____________
Rosetta Moderator: Mod.Sense

Brutall

Joined: May 4 10
Posts: 1
ID: 379428
Credit: 64,339
RAC: 0
Message 66085 - Posted 12 May 2010 9:06:00 UTC

Seems like minirosetta 2.14 workunits are far more complicated? My CPU processes them slower than workunits from 2.11.

Chilean Profile
Avatar

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,238,180
RAC: 4,709
Message 66101 - Posted 13 May 2010 2:57:23 UTC - in response to Message ID 66085.

Seems like minirosetta 2.14 workunits are far more complicated? My CPU processes them slower than workunits from 2.11.


Doubt it. The CPU spends a fixed amount of time (set by you, default 3 hrs I believe) on each WU, independently of its version.
____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 66105 - Posted 13 May 2010 8:50:41 UTC - in response to Message ID 66101.

Seems like minirosetta 2.14 workunits are far more complicated? My CPU processes them slower than workunits from 2.11.


Doubt it. The CPU spends a fixed amount of time (set by you, default 3 hrs I believe) on each WU, independently of its version.


True, but the number of models generated could be lower in that timespan, because of higher complexity.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66108 - Posted 13 May 2010 17:37:23 UTC

CASP often sends very challenging targets. These are larger proteins, made up of more amino acids. These larger proteins generally take longer per model to process than those typically processed otherwise.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66131 - Posted 15 May 2010 16:38:30 UTC

Task 338821144 failed on W7

rb_05_13_148_531_rs_stg0_lrlxcst_t000__casp9_SAVE_ALL_OUT_20582_2038_1

Seems a previous cruncher had the same issue

ERROR: CORE ERROR: You must use the ThreadingJobInputter with the LoopRelaxThreadingMover - did you forget the -in:file:template_pdb option?
ERROR:: Exit from: ..\..\src\protocols\loops\LoopRelaxThreadingMover.cc line: 80
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66141 - Posted 16 May 2010 3:09:12 UTC
Last modified: 16 May 2010 3:12:28 UTC

There's also another aspect to increasing task complexity which, depending on people's machines, might or might not affect them. The bigger the dataset and the more complex the job, the more RAM it can use. The current task I'm working on is using 511 MB of RAM by itself. Now this might or might not seem like much with today's computers, but also remember that today's processors have either 2 or even 4 cores on one CPU, which means that each core is running a separate task, each taking up its own pool of RAM. If someone has a quad core and is running 4 Rosetta tasks, they're really using 511 MB x 4 = 2,044 MB, or 2,044/1024 = 1.996 GB of RAM over and above Windows (Vista or 7, on today's comps).

Now thinking of it another way, many of today's computers come with 2 or 4 GB of RAM, which on a quad core works out to 512 MB or 1 GB per core respectively, if one were to break it up that way. And though you might not care in your web browser (which is what many OEMs are thinking about with pre-built systems), on BOINC you would...

As things become more crowded, their comps might swap a little more, and increased paging activity (as the memory pool usage grows larger) can slow things down for that reason.
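
The arithmetic above can be checked with a short sketch. The 511 MB per-task figure and the 2 GB / 4 GB machine sizes are the ones quoted in this post; the function names are made up for illustration, and nothing here measures real memory use.

```python
# Back-of-the-envelope RAM budget for running one Rosetta task per core.
# All inputs are the figures quoted in the post; nothing is measured.

MB_PER_GB = 1024


def total_task_ram_gb(mb_per_task: float, tasks: int) -> float:
    """RAM used by `tasks` concurrent tasks, in GB."""
    return mb_per_task * tasks / MB_PER_GB


def ram_per_core_mb(installed_gb: float, cores: int) -> float:
    """Installed RAM split evenly across cores, in MB (ignores the OS share)."""
    return installed_gb * MB_PER_GB / cores


if __name__ == "__main__":
    print(f"4 tasks x 511 MB = {total_task_ram_gb(511, 4):.3f} GB")
    for installed in (2, 4):
        share = ram_per_core_mb(installed, 4)
        print(f"{installed} GB on a quad core = {share:.0f} MB per core")
```

On the 2 GB machine the per-core share is barely above what a single 511 MB task wants, before Windows takes anything, which is the squeeze described above.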
____________

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66150 - Posted 16 May 2010 15:48:56 UTC

Has anyone else noticed that with this version of minirosetta, tasks are running to completion instead of cutting off as they should? I had one last night which was on model 0, beyond the target time set in preferences. I went out to work, and when I came home it was on to model 1 (the 2nd model), and it crunched the entire WU. I've noticed a couple of others like this.

Other models are completing within the selected time, but the early completion of WUs once the target time has elapsed seems to be gone. I think the watchdog, or whatever is responsible for telling the WU "you've crunched enough and are done", is not working as it should in this version.
____________

dgnuff Profile
Avatar

Joined: Nov 1 05
Posts: 347
ID: 8170
Credit: 23,006,762
RAC: 6,734
Message 66153 - Posted 16 May 2010 16:38:20 UTC - in response to Message ID 66141.

There's also another aspect to increasing task complexity which, depending on people's machines, might or might not affect them. The bigger the dataset and the more complex the job, the more RAM it can use. The current task I'm working on is using 511 MB of RAM by itself.


Unless it gets to the point where the processes start thrashing, this isn't much of an issue. Modern OSes have very effective virtual memory systems which simply page out less-used portions of the working set to disk. Looking at the four Rosetta tasks running on my system now, they all have a virtual size in the 400 to 450 MB range. However, they all have a working set size of only 250 MB to 300 MB, with hardly any page faults happening. So Windows is doing a first-class job of figuring out which parts of the process aren't needed right now, and paging them out.

And I suspect that as the workload on my system increases, the working set size of each Rosetta task will decrease.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66156 - Posted 16 May 2010 18:00:42 UTC

Nuadormrac, the watchdog tries to leave things alone and not interrupt useful work. So it will only end a task when it has gone 4 hours past the target runtime user preference. I believe what you are observing is a combination of long runtime per model, and variance in runtime between one model and the next.

If I run one model in 90 minutes with a 3 hour target runtime, the task will begin a second model on the thought that it will complete within the target. However, if the second model then takes 120 minutes to complete, the task will end a half hour past the target instead. This is normal behavior, and part of the reason why the watchdog has the patience I described above.
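
A toy sketch of the timing described above, for anyone who wants to play with the numbers. The 4-hour watchdog allowance and the 90/120-minute example come from this post; the rule used to decide whether another model gets started (the previous model's runtime must still fit inside the target) is only an assumption, and `simulate_task` is a made-up name, not anything in minirosetta.

```python
# Toy model of target runtime vs. watchdog. NOT the real minirosetta logic.

WATCHDOG_GRACE_MIN = 4 * 60  # watchdog waits 4 hours past the target


def simulate_task(model_minutes, target_min):
    """Return (finish_time_min, models_completed, ended_by_watchdog)."""
    elapsed = 0
    done = 0
    last = None
    for duration in model_minutes:
        # Assumption: another model is started only if, judging by the
        # previous model's runtime, it should still fit inside the target.
        if last is not None and elapsed + last > target_min:
            break
        # A model that would run past target + grace is cut off by the watchdog.
        if elapsed + duration > target_min + WATCHDOG_GRACE_MIN:
            return target_min + WATCHDOG_GRACE_MIN, done, True
        elapsed += duration
        done += 1
        last = duration
    return elapsed, done, False


if __name__ == "__main__":
    # The example from the post: 3 hour target, models of 90 then 120 minutes.
    finish, models, killed = simulate_task([90, 120, 90], target_min=180)
    print(finish - 180, "minutes past target after", models, "models")
```

With the numbers from the post this ends 30 minutes past the target after two models, without the watchdog ever stepping in.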
____________
Rosetta Moderator: Mod.Sense

coturnix

Joined: Oct 8 09
Posts: 4
ID: 353496
Credit: 729,171
RAC: 0
Message 66171 - Posted 17 May 2010 11:59:12 UTC

Segmentation faults for both tasks:

rb_05_14_151_539_rs_stg0_lrlx_t000__casp9_SAVE_ALL_OUT.IGNORE_THE_REST_A_20677_4793
rb_05_13_148_531_rs_stg0_lrlx_t000__casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_20583_1252

Kyle

Joined: Jun 9 07
Posts: 1
ID: 183024
Credit: 15,292
RAC: 0
Message 66183 - Posted 17 May 2010 23:22:35 UTC

I've been having a problem on 2 of the 3 PCs I run BOINC on, where Rosetta hangs (or stops, what have you). WUs stop progressing toward their finish, but the PC is still trying to fold. Using my older folder as an example: I would get WUs that take around a day for this PC to finish. If I didn't restart the PC at least twice a day, the WU would hang or pause while still counting elapsed time, and the completion time would also start to increase. I've had WUs stuck for over 13 hrs of elapsed time showing a 60 hr completion time. My 3rd system only runs Rosetta and does so without any hassles. All systems are running Win XP Pro SP3. The one that's working fine is an Intel dual-core laptop. The problem systems are desktops, one running an old AMD Athlon XP 1600+ and the other an Athlon 64 X2. Sorry for the long post, but I would like to try and get this fixed; the 1600+ was a full-time Rosetta rig. It's currently running SETI because I am tired of babysitting this one system.

kashi Profile

Joined: Nov 23 07
Posts: 2
ID: 222338
Credit: 346,280
RAC: 0
Message 66185 - Posted 18 May 2010 10:44:36 UTC

I had 4 tasks fail with the same error message as svincent's above. They only run for 11-13 seconds before erroring. All of these tasks have casp9 in the name; the other tasks without casp9 in the name gave no trouble.
338419916
338419899
338418442
38410731

I thought perhaps they exceeded the memory capacity of my computer because they used 400-500MB of ram each but 3 of them also errored on others' computers when they were sent out again. The 4th one that was resent has not been completed yet.

One "casp9" task completed successfully however, so they don't all error on my computer.

I have 4MB of ram, but a VM takes 900MB so only 3.1MB is available. With 8 cores to feed and these casp9 tasks taking 500MB each and Windows 7 taking a fair chunk too, I am thinking my computer does not have sufficient resources to process the current minirosetta tasks when 2 casp9 tasks try to run at the one time.

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66187 - Posted 18 May 2010 11:01:10 UTC - in response to Message ID 66156.
Last modified: 18 May 2010 11:01:50 UTC

Nuadormrac, the watchdog tries to leave things alone and not interrupt useful work. So it will only end a task when it has gone 4 hours past the target runtime user preference. I believe what you are observing is a combination of long runtime per model, and variance in runtime between one model and the next.

If I run one model in 90 minutes with a 3 hour target runtime, the task will begin a second model on the thought that it will complete within the target. However, if the second model then takes 120 minutes to complete, the task will end a half hour past the target instead. This is normal behavior, and part of the reason why the watchdog has the patience I described above.


Well I had a 2 hour run time, and it was at 2.5 hours of runtime when model 0 was complete.... Which is why I was surprised to come back and see it still chugging away at the task; I inspected it in "show graphics" again and saw it was on model 1 rather than 0. It was on model 0 when I left.
____________

Nuadormrac

Joined: Sep 27 05
Posts: 37
ID: 1352
Credit: 75,798
RAC: 0
Message 66188 - Posted 18 May 2010 11:16:54 UTC - in response to Message ID 66153.
Last modified: 18 May 2010 11:27:51 UTC

There's also another aspect to increasing task complexity which, depending on people's machines, might or might not affect them. The bigger the dataset and the more complex the job, the more RAM it can use. The current task I'm working on is using 511 MB of RAM by itself.


Unless it gets to the point where the processes start thrashing, this isn't much of an issue. Modern OSes have very effective virtual memory systems which simply page out less-used portions of the working set to disk.


Actually the "conservative swap feature" was a setting which was restricted to Windows 9x branded operating systems, as the winNT/2k/XP line did things a little differently. Windows Vista is essentially an offshoot of Windows Server 2003...

I can speak from experience running the Windows Vista 64 beta and release candidates on an Athlon 64 which had 1 GB of RAM at the time; my experience was this.

- Vista booted up allocating about 900 MB at desktop, winXP allocated around 200-340 MB at desktop (before loading apps). Typical bloatware, we're all familiar with that.

- When Vista 64 (and I do think some of this was a 64-bit OS on a 64-bit CPU) was allocating less RAM than the machine physically had, the OS was snappier and more responsive (though tbh some paging goes on even when physical RAM doesn't require it, part of the difference in how the winNT line of OSes and its successors deal with paging vs how win98 dealt with it).

- Of course, keep in mind that physical memory also holds the HD cache, which the above isn't taking into account. But needless to say, extra RAM is good, especially if one has write caching enabled for the drives, and not just read caching.

- But anyhow, the experience was that as soon as allocated RAM went beyond physical RAM by even a small amount, say just 1/10th of a GB, or 110% of physical on that box, the responsiveness degraded, and the OS seemed slower to respond than even winXP. Of course there's also a reason many downgraded to XP :p This sort of thing can be especially noticeable with any form of computer gaming, where real-time response can be an issue, especially in some intensive situations (be it from an FPS standpoint, or an MMO standpoint if one's in a large raid, with a lot going on at once which must be responded to with as little delay as possible).

- When left to themselves, the swapfiles in win2k, XP, Vista, and I would imagine win7, have one fatal flaw in how they "grow" if the initial swapfile size is exceeded. They do so very conservatively, and this can also result in a fragmentation problem with the swapfile. This is also why utilities such as Diskeeper and the like introduced a defrag pagefile option (and later on an option to defrag the MFT). People in the know, however, don't go with the Windows default setting; they set a fixed swapfile size, with the initial and max sizes the same, and follow MS's recommendation of making it at least 1.5x physical memory. (More on how this line of OSes handles paging vs how win9x handled it.) TBH, if speed and efficiency were the only concern I think win98 did pagefiles a little better (arguably), though this line of OSes does have other things it can do with pagefiles, such as a degree of error handling through them.

Vista would not count as old, and would very much count as a "modern OS" even though Windows 7 is now out. And all I can say is that Vista, on this box here, with 2 GB RAM and a dual core, has some of that same general sluggishness, which can leave me wanting to curse Vista at times :laugh: I wouldn't exactly call it the most responsive and snappy thing out there. And tbh, if I had the memory in this box, one of the changes I would want to make would be to impose a "conservative swap" feature like in win9x, except Vista doesn't allow for that. Some things it does allow for, and that I would do, are to go into regedt32 and alter some of the memory management settings to disable paging executive (one wants enough extra RAM for that change though) as well as enable large system cache. There are some other tweaks one can make, if the computer isn't bogged down that is, relative to its own physical RAM.
____________

kashi Profile

Joined: Nov 23 07
Posts: 2
ID: 222338
Credit: 346,280
RAC: 0
Message 66191 - Posted 18 May 2010 12:58:21 UTC - in response to Message ID 66185.

Oops, my previous post should read 4GB of ram and 3.1GB available. I trust you all knew what I meant and have overlooked my error.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 66193 - Posted 18 May 2010 13:46:00 UTC
Last modified: 18 May 2010 13:53:10 UTC

Casp 9 task that died

rs_stg0_lrlxjcst_t512__casp8_SAVE_ALL_OUT_20673_2315_0

Compute error
Exit status -177 (0xffffff4f)
CPU time 7788.453

Maximum elapsed time exceeded

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E


and another one earlier

rb_05_13_148_531_rs_stg0_lrlxcst_t000__casp9_SAVE_ALL_OUT_20582_1917_1


Compute error
Exit status 1 (0x1)
Cpu time 15.54688

<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>


ERROR: CORE ERROR: You must use the ThreadingJobInputter with the LoopRelaxThreadingMover - did you forget the -in:file:template_pdb option?
ERROR:: Exit from: ..\..\src\protocols\loops\LoopRelaxThreadingMover.cc line: 80
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

and one I missed at the beginning of the month

rb_05_04_128_339_rs_stg0_lrlx_t000__casp9_SAVE_ALL_OUT.IGNORE_THE_REST_A_20282_2406_0

Client state Compute error
Exit status -177 (0xffffff4f)
CPU time 9525.484

<message>
Maximum elapsed time exceeded
</message>


- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E
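
A side note on reading these reports: the exit status appears both as a signed decimal and as a 32-bit hex value, and the two are the same number. A minimal sketch of the conversion, using only the values shown in this post (the helper names are made up for illustration):

```python
# Convert between the signed decimal and 32-bit hex forms of an exit status.

def as_unsigned32_hex(status: int) -> str:
    """Signed exit status -> 32-bit two's-complement hex string."""
    return f"0x{status & 0xFFFFFFFF:08x}"


def as_signed32(value: int) -> int:
    """32-bit hex value -> signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


if __name__ == "__main__":
    print(as_unsigned32_hex(-177))   # the "Maximum elapsed time exceeded" tasks
    print(as_signed32(0xFFFFFF4F))
    print(as_unsigned32_hex(1))      # the task that failed after 15 seconds
```

So "-177 (0xffffff4f)" is one value written two ways, not two separate codes.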

coturnix

Joined: Oct 8 09
Posts: 4
ID: 353496
Credit: 729,171
RAC: 0
Message 66214 - Posted 19 May 2010 9:24:25 UTC

Quite a few work units recently failed with segmentation faults on my Linux machine. However, most of these work units seem to succeed on Windows.

Here is another work unit that segfaulted on both Linux and Windows:
rb_05_17_152_540_rs_stg0_lrlx_t000__casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_20851_1555

Rui Pinheiro Profile

Joined: Feb 6 10
Posts: 3
ID: 369234
Credit: 85,400
RAC: 45
Message 66215 - Posted 19 May 2010 9:41:39 UTC - in response to Message ID 66214.

Hi, sorry to post this here, but I've been searching for a while and still can't find the answers, simple as they may be.

1 - How do I update to the 2.14 version?

2 - How can I get some info about the workunit I'm working on? Like, what are they related to?

As a suggestion, I would say it would be nice if users could pick their path, choosing the way their computer time is spent.

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 66221 - Posted 19 May 2010 13:20:58 UTC - in response to Message ID 66215.

2.14 is the latest program version that will be put on your computer automatically with the workunit(s) you download.

As to your other question, I know buried somewhere in this forum is a thread that gives the link to a website where you can enter a protein name and get information about it. I don't recall where that post is, maybe Mod remembers.


Hi, sorry to post this here, but I've been searching for a while and still can't find the answers, simple as they may be.

1 - How do I update to the 2.14 version?

2 - How can I get some info about the workunit I'm working on? Like, what are they related to?

As a suggestion, I would say it would be nice if users could pick their path, choosing the way their computer time is spent.

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66224 - Posted 19 May 2010 15:24:18 UTC - in response to Message ID 66215.



2 - How can I get some info about the workunit I'm working on? Like, what are they related to?



In the protein name somewhere there will be a sequence of 4 characters bracketed by underscores (e.g. 1mvo). Paste this sequence into the search field at

http://www.rcsb.org/pdb/home/home.do

and it'll tell you about the protein

HTH

Rui Pinheiro Profile

Joined: Feb 6 10
Posts: 3
ID: 369234
Credit: 85,400
RAC: 45
Message 66227 - Posted 19 May 2010 17:30:53 UTC - in response to Message ID 66224.

Thank you, Greg.

Thank you, Vincent.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 66230 - Posted 19 May 2010 17:44:38 UTC - in response to Message ID 66215.

1 - How do I update to the 2.14 version?

Looking at your most recent completed jobs, you're already running 2.14 WUs.
____________

coturnix

Joined: Oct 8 09
Posts: 4
ID: 353496
Credit: 729,171
RAC: 0
Message 66276 - Posted 22 May 2010 8:28:24 UTC

gunn_fragments_SAVE_ALL_OUT_-1wtyA__20675_2524
gunn_fragments_SAVE_ALL_OUT_-1lveA__20675_212

ERROR: ct == final_atoms
ERROR:: Exit from: src/core/scoring/rms_util.cc line: 410
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

rb_05_13_148_531_rs_stg0_lrlx_t000__casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_20583_1252

ERROR: CORE ERROR: You must use the ThreadingJobInputter with the LoopRelaxThreadingMover - did you forget the -in:file:template_pdb option?
ERROR:: Exit from: ..\..\src\protocols\loops\LoopRelaxThreadingMover.cc line: 80
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66358 - Posted 30 May 2010 1:00:59 UTC

Task 342193444 (int2_centerfirst2b_1fAc_2bmv_ProteinInterfaceDesign_23May2010_21231_77_1) failed immediately on Mac OS X

ERROR: Cannot open patchdock file: 1fAc_2bmv.patchdock
ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/read_patchdock.cc line: 101
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

A previous cruncher had the same problem on Windows Vista

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66371 - Posted 30 May 2010 16:45:53 UTC

Task 341066734 (lrm_jorj_combined_torsion_it06_run01_A_rlbn_1bmg_SAVE_ALL_OUT_IGNORE_THE_REST_NATIVE_NOCON_21225_6_1) failed on W7

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

A wingman also had the same problem

H KYLE

Joined: Sep 18 09
Posts: 3
ID: 345956
Credit: 431,342
RAC: 61
Message 66390 - Posted 1 Jun 2010 3:15:47 UTC
Last modified: 1 Jun 2010 3:26:22 UTC

This task was only 8% through after 15 hours of computation: Task 342533081 int2_centerfirst2b_1fAc_2hhz_ProteinInterfaceDesign_23May2010_21231_149_0

I am only on the default runtime which I believe is 3 hours. Occasionally workunits take up to 6-7 hours to complete but I leave them be, this however had around 30 hours or so remaining on it so I aborted it.

In the task details it says it only had 1485 seconds of CPU time which is only 25 minutes... Odd.

I run a dual core CPU on Win7 x86; other Rosetta tasks have been crunching fine on the other core, and Collatz tasks have been running fine on the GPU.

No red error messages in message log.

Murasaki
Avatar

Joined: Apr 20 06
Posts: 303
ID: 78284
Credit: 365,375
RAC: 94
Message 66409 - Posted 1 Jun 2010 19:52:15 UTC

int2_centerfirst2b_1fAc_2oc5_ProteinInterfaceDesign_23May2010_21231_163_1

This one errored out after 12 seconds as my Firewall challenged it. For some reason it was trying to ask for more access than a normal 2.14 unit. No other Rosetta work unit has triggered a Firewall alert for me this year.

According to my firewall logs the file causing the problem was minirosetta_2.14_windows_intelx86.exe.

Looking at the work unit, it crashed for the previous cruncher after 10 seconds.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66416 - Posted 2 Jun 2010 6:03:29 UTC

When a task encounters a crash, it may directly attempt to pull the symbol table from the Rosetta website to aid in preparing the crash report. Normally all file downloads would be done by BOINC Manager, not minirosetta, but in some error scenarios minirosetta does attempt to do direct internet access. If it is not available, it returns a crash report without the detail possible with the symbol table.

So, I believe it was because the task crashed that it attempted to do a fairly unusual internet access which was trapped by your firewall. However, the resulting failure shouldn't have caused any further problem.

I'm trying to say that a failure caused the firewall trip, and that the fact that the firewall denied access was not the cause of a failure. I hope that makes sense.
____________
Rosetta Moderator: Mod.Sense

Murasaki
Avatar

Joined: Apr 20 06
Posts: 303
ID: 78284
Credit: 365,375
RAC: 94
Message 66419 - Posted 2 Jun 2010 9:49:11 UTC - in response to Message ID 66416.

I hope that makes sense.


That makes perfect sense. Thank you for the explanation.
____________

Chris Holvenstot Profile
Avatar

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 66420 - Posted 2 Jun 2010 9:58:50 UTC
Last modified: 2 Jun 2010 10:02:28 UTC

It was suggested that I post this information here - it is sort of a duplication of the information I posted in a separate thread "Error opening file - anyone else?"

I have had six of these failures in the past few days across five systems. It is always the patchdock file that is reported as having the error. It is not always the same exact filename. Three of my systems did not see this error at all. Several hundred work units were processed without error during the same time frame.

All of these systems are dedicated to BOINC and run 64 bit Linux on AMD Phenom II processors.

If the failure occurs, it is always seen during the first few seconds of processing.

Here is a list of the failures - I prepended my hostnames to each record to aid me in "getting back there" in case there were further questions about any specific incident.

Popeye: ERROR: Cannot open patchdock file: 1fAc_2j44.patchdock
Proteus: ERROR: Cannot open patchdock file: 1fAc_2odh.patchdock
Neptune: ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
Poseidon: ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
Poseidon: ERROR: Cannot open patchdock file: 1fAc_2boo.patchdock
Sinbad: ERROR: Cannot open patchdock file: 1fAc_2j44.patchdock
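
For what it's worth, a quick tally of the six reports above by file, to show which inputs fail on more than one host. This is only a sketch over the lines as posted; `hosts_by_file` is a made-up helper name.

```python
from collections import defaultdict

# The six failure records exactly as listed above.
REPORTS = """\
Popeye: ERROR: Cannot open patchdock file: 1fAc_2j44.patchdock
Proteus: ERROR: Cannot open patchdock file: 1fAc_2odh.patchdock
Neptune: ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
Poseidon: ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
Poseidon: ERROR: Cannot open patchdock file: 1fAc_2boo.patchdock
Sinbad: ERROR: Cannot open patchdock file: 1fAc_2j44.patchdock
"""

MARKER = ": ERROR: Cannot open patchdock file: "


def hosts_by_file(report_text: str) -> dict[str, list[str]]:
    """Map each patchdock filename to the hosts that reported it."""
    result = defaultdict(list)
    for line in report_text.splitlines():
        host, sep, filename = line.partition(MARKER)
        if sep:  # skip lines that are not failure records
            result[filename.strip()].append(host.strip())
    return dict(result)


if __name__ == "__main__":
    for filename, hosts in hosts_by_file(REPORTS).items():
        print(f"{filename}: {', '.join(hosts)}")
```

Two of the four files show up on two different hosts each, which points at the inputs rather than at any one machine.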

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66423 - Posted 2 Jun 2010 15:28:51 UTC

Chris, I happened to stumble on a host with such errors and confirmed that the same error occurred on the next machine the tasks were sent to. This would be great confirmation that there is no problem specific to your machines. Rather, something must be up with how these tasks were created.

Here are two such links:
int2_centerfirst2b_1fAc_2j44_ProteinInterfaceDesign_23May2010_21231_274
int2_centerfirst2b_1fAc_2huj_ProteinInterfaceDesign_23May2010_21231_252

____________
Rosetta Moderator: Mod.Sense

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 66463 - Posted 4 Jun 2010 15:26:31 UTC

I have a runtime preference of 8 hours. So when a task runs for much less than 8 hours, I assume the 100 decoy limit was reached. This task (CASP8?) ran for considerably less than an hour and generated only 5 decoys. That made me curious. Are there any explanations?

http://boinc.bakerlab.org/rosetta/result.php?resultid=343474661
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66465 - Posted 4 Jun 2010 17:04:08 UTC

Your "wingman" (another machine that was sent the same task; his result had a validate error, so another copy was generated and sent to you) had only 5 models as well. It is possible that this batch of tasks was created with a 5 decoy limit rather than the usual 99.
____________
Rosetta Moderator: Mod.Sense

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 66468 - Posted 5 Jun 2010 1:31:19 UTC

I got one of these too. 8 hour runtime, 40 mins CPU time, 5 models only. Similar job-type.

rs_stg0_lrlx_t415__casp8_SAVE_ALL_OUT_20790_2334_0
____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 66470 - Posted 5 Jun 2010 8:47:29 UTC

I agree with Mod.Sense. I'm guessing because of CASP targets with a short deadline, the project team decided to limit those tasks to 5 decoys. It would be nice, I guess, to get a confirmation on that from the project team. :)
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66475 - Posted 5 Jun 2010 16:18:37 UTC

Task 343301757 (td-only-2-BcR103A_9-15_20163_36_0) failed on Mac OS X

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 4,993,416
RAC: 5,988
Message 66484 - Posted 6 Jun 2010 6:39:25 UTC

I'm seeing a lot of these lately in my results. Anything to worry about?


ERROR: ERROR: Residue not supported by Placement coordinate constraint machinery
ERROR:: Exit from: ..\..\src\protocols\ProteinInterfaceDesign\movers\PlaceUtils.cc line: 281
called boinc_finish

____________

Chris Holvenstot Profile
Avatar

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 66486 - Posted 6 Jun 2010 11:16:52 UTC - in response to Message ID 66484.

I'm seeing a lot of these lately in my results. Anything to worry about?


ERROR: ERROR: Residue not supported by Placement coordinate constraint machinery
ERROR:: Exit from: ..\..\src\protocols\ProteinInterfaceDesign\movers\PlaceUtils.cc line: 281
called boinc_finish


I'm getting them here too - so I am going to guess that it is not a problem with your system. Interesting that they are declared a "success" even though they end with an error.

Task 343850781

ERROR: ERROR: Residue not supported by Placement coordinate constraint machinery
ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/movers/PlaceUtils.cc line: 281


Task 343814609

ERROR: ERROR: Residue not supported by Placement coordinate constraint machinery
ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/movers/PlaceUtils.cc line: 281


I am running X64 Linux on systems based on AMD Phenom II CPUs




Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 66495 - Posted 6 Jun 2010 16:04:48 UTC - in response to Message ID 66486.

Thanks for your posts! It looks like a problem with the MDMX protocol that I put up earlier. I've now eliminated these jobs from the queue until I figure out what's wrong with these (but some jobs might still be rolling out on your machines). I'll let you know once I figure this out!

Thanks again, Sarel.

____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 66497 - Posted 6 Jun 2010 16:34:12 UTC - in response to Message ID 66495.

I found the problem, it was an error in one of the input files that would sporadically show up (so it didn't come up on my local tests). I've now fixed it and tested it locally and will gradually resubmit the jobs to the queue (the new jobs will be dated 6Jun2010 rather than 4Jun2010). Please let me know if you have more problems with these new jobs.

Thanks for mentioning the specific task that gave you problems! It made tracking down the problem very easy.

Best, Sarel.

____________

Chris Holvenstot Profile
Avatar

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 66499 - Posted 6 Jun 2010 20:57:58 UTC

I'm still getting the occasional failure with the ProteinInterfaceDesign task and its "patchdock" file after only a few seconds of processing - the example task, whose output is posted below, was created today (June 6th)

Task 312328443

ERROR: Cannot open patchdock file: 1fAc_2vg9.patchdock
ERROR:: Exit from: src/protocols/ProteinInterfaceDesign/read_patchdock.cc line: 101

Chris Holvenstot Profile
Avatar

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 66508 - Posted 7 Jun 2010 11:37:39 UTC

Long running job just finished - 28894 seconds of CPU, one decoy finished. Killed by watchdog. It continued to take checkpoints throughout the run. SegFault on completion. I have several other jobs across a few systems which appear to be heading down the same path.

All seem to have similar task names: rs_stg0_lrlx_t"xyz"__casp8_SAVE_ALL_OUT

Output follows:

Task ID 344004739
Name rs_stg0_lrlx_t447__casp8_SAVE_ALL_OUT_20806_3438_0
Workunit 314064997
Created 6 Jun 2010 19:33:49 UTC
Sent 6 Jun 2010 20:11:19 UTC
Received 7 Jun 2010 11:25:52 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1290176
Report deadline 16 Jun 2010 20:11:19 UTC
CPU time 28896.33
stderr out

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6- 6 22:21:50:] :: BOINC:: Initializing ... ok.
[2010- 6- 6 22:21:50:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t447__casp8.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
BOINC:: CPU time: 28894.3s, 14400s + 14400s[2010- 6- 7 6:24:23:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28894.3 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
SIGSEGV: segmentation violation
Stack trace (25 frames):
[0x992e4a3]
[0x9958378]
[0xf77eb400]
[0x8c1ac97]
[0x8e26032]
[0x8e2646e]
[0x93d1812]
[0x93d3094]
[0x93d511e]
[0x93d1195]
[0x80dac5e]
[0x80d8f91]
[0x810386e]
[0x858db3f]
[0x815324a]
[0x81755cf]
[0x80ace21]
[0x85379f7]
[0x812b7aa]
[0x812c94d]
[0x878038b]
[0x82ff325]
[0x804989b]
[0x99b42dc]
[0x8048121]

Exiting...

</stderr_txt>
]]>

Validate state Valid
Claimed credit 179.245358151844
Granted credit 95.9047142739811
application version 2.14

Chris Holvenstot Profile
Avatar

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 66510 - Posted 7 Jun 2010 12:06:42 UTC

Here is the output from a second task - of the same "family" as the one reported in my previous post - differences: this one ended on its own after running an hour over the preferred time, not killed by watchdog, and no SegFault (could the SegFault have been caused by watchdog killing the task?)

20572 CPU seconds - only 2 decoys.

(both tasks were declared as "success" and both generated reasonable credit)
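For anyone comparing runs like these two, the summary lines in stderr can be pulled out with a few lines of Python. This is only a sketch: the regular expressions are based on the log format posted in this thread, and the function name `summarize_stderr` is made up for illustration.

```python
import re

# Patterns based on the summary lines visible in the stderr output posted
# in this thread; other application versions may format them differently.
DONE_RE = re.compile(r"DONE :: +(\d+) starting structures +([\d.]+) cpu seconds")
DECOY_RE = re.compile(r"This process generated +(\d+) decoys from +(\d+) attempts")


def summarize_stderr(text):
    """Return cpu seconds, decoy count and seconds per decoy, or None."""
    done = DONE_RE.search(text)
    decoys = DECOY_RE.search(text)
    if not done or not decoys:
        return None
    cpu_seconds = float(done.group(2))
    n_decoys = int(decoys.group(1))
    # Avoid dividing by zero when a task finished without any decoy.
    per_decoy = cpu_seconds / n_decoys if n_decoys else None
    return {
        "cpu_seconds": cpu_seconds,
        "decoys": n_decoys,
        "attempts": int(decoys.group(2)),
        "seconds_per_decoy": per_decoy,
    }


sample = """
DONE :: 2 starting structures 20572.2 cpu seconds
This process generated 2 decoys from 2 attempts
"""
print(summarize_stderr(sample))
```

Run against the second task, this gives a little over 10,000 CPU seconds per decoy, which is the figure worth watching when reporting long-running models.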



Task ID 344034237
Name rs_stg0_lrlx_t436__casp8_SAVE_ALL_OUT_20802_3787_0
Workunit 314090933
Created 6 Jun 2010 23:01:21 UTC
Sent 6 Jun 2010 23:13:10 UTC
Received 7 Jun 2010 11:47:34 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1290176
Report deadline 16 Jun 2010 23:13:10 UTC
CPU time 20572.51
stderr out

<core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6- 7 1: 1:57:] :: BOINC:: Initializing ... ok.
[2010- 6- 7 1: 1:57:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/rs_stg0_lrlx_t436__casp8.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 20572.2 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 127.612292738641
Granted credit 97.0638812927757
application version 2.14

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 66563 - Posted 13 Jun 2010 7:03:25 UTC

3 tasks died recently with errors

these two just say : Maximum elapsed time exceeded
no cpu time shown and no debug output

T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_8-17_21314_677_0
T0561_whole_SAVE_ALL_OUT_IGNORE_THE_REST_3-6_21314_594_0

this one: int2_centerfirst2b_1fAc_2qwt_ProteinInterfaceDesign_23May2010_21231_230_0 is the patchdock error.

VO Profile
Avatar

Joined: Nov 4 05
Posts: 6
ID: 9090
Credit: 1,739,319
RAC: 144
Message 66578 - Posted 15 Jun 2010 15:54:28 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66582 - Posted 16 Jun 2010 2:47:03 UTC - in response to Message ID 66578.
Last modified: 16 Jun 2010 2:47:38 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=313445716


(Notes for Project Team)
Validation errors, with no apparent cause on:
rb_06_02_188_708_t000__t0571_IGNORE_THE_REST_04_05_21338

Resends all failed as well.
____________
Rosetta Moderator: Mod.Sense

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 4,993,416
RAC: 5,988
Message 66583 - Posted 16 Jun 2010 6:01:57 UTC

I have one work unit that is crashing the minirosetta214 executable in Windows and Linux:

Windows TASKID (with debug info): 345248849
Linux TASKID: 345967972

Workunit:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=315213770

WU Name:

rb_06_10_202_765_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21404_3249_1

____________

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 66592 - Posted 17 Jun 2010 8:16:04 UTC
Last modified: 17 Jun 2010 8:16:36 UTC

MiniRosetta 2.14 memory use seems extremely high. I noticed another process in the "Waiting for memory" state, something I don't believe I have seen before. Upon investigation, MiniRosetta was using 800+k.

Is this intentional, or is something not being free'd?
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66597 - Posted 17 Jun 2010 15:11:13 UTC

adrianxw, some tasks use protocols that do require more memory. These are only sent to machines that have more than the minimum memory required. I see both of your machines are reporting 4 CPUs and 2GB of memory. That's only 512MB per CPU, but I believe the check for high-memory tasks is not sensitive to the number of CPUs. So if you happened to get several high-memory tasks at the same time, that would explain the waiting for memory message.
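To put rough numbers on that, the sketch below divides the reported RAM across the CPUs and checks how many tasks of a given size fit at once. The 800MB figure is taken from the report above. The function names, and the assumption that every running task needs the same amount, are illustrative only; BOINC's actual memory accounting also depends on the memory limits set in your computing preferences.

```python
def memory_per_cpu_mb(total_mb, n_cpus):
    """Evenly divide the reported RAM across the CPUs BOINC may use."""
    return total_mb / n_cpus


def tasks_that_fit(total_mb, task_mb):
    """How many tasks of task_mb can be resident at the same time."""
    return total_mb // task_mb


total_mb = 2 * 1024   # 2GB machine, as reported
n_cpus = 4

print(memory_per_cpu_mb(total_mb, n_cpus))   # 512.0
print(tasks_that_fit(total_mb, 800))          # 2
```

So on a 4 CPU machine, two tasks of that size would already leave under 500MB for everything else, which would be consistent with another project sitting in "Waiting for memory".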

You mentioned seeing Mini using more than 800... I assume you meant MB :) Was that just one task or were several running at the same time with that usage? Task names would be helpful.
____________
Rosetta Moderator: Mod.Sense

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 66599 - Posted 18 Jun 2010 7:52:27 UTC

The job is finished and gone now, so I don't know which it was. This one is running right now, and has ~500M, (yes, M, that dates me a bit huh?). There are no processes waiting on here at the moment. Rosetta has quite a high work share value on both my machines so it crunches them fairly quickly; I wouldn't like to guess which wu it was that was causing the event yesterday. As I recall, it was the only Rosetta wu on the machine at that time; it was Climate Prediction that was in the "Waiting for memory" state, not Rosetta.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Murasaki
Avatar

Joined: Apr 20 06
Posts: 303
ID: 78284
Credit: 365,375
RAC: 94
Message 66600 - Posted 18 Jun 2010 13:09:14 UTC
Last modified: 18 Jun 2010 13:09:46 UTC

I have noticed the same "waiting for memory" messages several times on my system in recent weeks and I have got one right now. For me they only pop up with Rosetta CASP9 WUs with huge protein structures.

Looking at your task history, adrianxw, you were probably processing rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_1071_0.

The one eating up my memory today is rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_19046_0 from the same batch as yours.

There is nothing to worry about with these as they free up the memory again as soon as they are completed.
____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 66605 - Posted 19 Jun 2010 1:41:40 UTC

This errored after 22 sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=316550658

Sat 19 Jun 2010 11:20:51 EST|rosetta@home|Output file rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0_0 for task rb_06_15_205_770_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21438_15952_0 absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 66607 - Posted 19 Jun 2010 7:39:20 UTC

Another error. This ran for 1hr 59min; I have a four hour run time set, with two hour switching between projects.

It ran the first two hours O.K.; when it restarted it failed.

eed_4_eed_1fm4_ProteinInterfaceDesign_7Jun2010_21383_177_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=316587259

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (11 frames):
[0x992e4a3]
[0x9958378]
[0xffffe500]
[0x84bd3da]
[0x882dfff]
[0x812b7aa]
[0x812c94d]
[0x878038b]
[0x8049a2a]
[0x99b42dc]
[0x8048121]

Exiting...

</stderr_txt>

____________


adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 535
ID: 402
Credit: 1,057,641
RAC: 1,674
Message 66638 - Posted 22 Jun 2010 15:24:46 UTC

Same again...

rb_06_21_217_781_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21462_3794

... two other projects stopped "Waiting for memory" 882M in use.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 4,993,416
RAC: 5,988
Message 66685 - Posted 24 Jun 2010 20:28:41 UTC

Can one of the mods please look into the low-credit issues lately?

See this thread:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=5366
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66693 - Posted 25 Jun 2010 3:01:55 UTC

Moderators do not have access to any credit information beyond what you see on the task and WU. The Project Team maintains all of the BOINC databases and so on, but they are pretty busy with CASP at the moment.

I can only assure you that credit is granted based on models completed, and that hooks were placed in the code to report CPU time on a per model basis so that specific protocols or proteins that have a high variability in CPU time between models can be reviewed in more detail.

Generally when credit is that dramatically low, it is the result of a long running model. In other words, if models typically take 10 minutes of CPU time, and your machine runs for an hour and has completed 6 models, and then the 7th takes 3 hours (or more and perhaps is eventually ended by the watchdog) then the credit granted is going to be on par with 70 minutes of processing rather than the 4 hours that was actually spent. This is why there is a thread for reporting long-running models.
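The arithmetic in that example can be written out as below. It is a simplified illustration only: it assumes credit is strictly proportional to models completed at a typical per-model CPU time, and the names are invented for the example, not taken from the project's code.

```python
def credited_minutes(models_completed, typical_minutes_per_model):
    """CPU minutes that the granted credit roughly corresponds to."""
    return models_completed * typical_minutes_per_model


typical = 10                      # minutes of CPU per typical model
model_minutes = [10] * 6 + [180]  # six normal models, then one 3 hour model

actual = sum(model_minutes)                               # 240 minutes = 4 hours
credited = credited_minutes(len(model_minutes), typical)  # 70 minutes

print(actual, credited, round(credited / actual, 2))      # 240 70 0.29
```

In this made-up case the task is credited for under a third of the CPU time it actually used, which is the kind of gap between claimed and granted credit being reported.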

Over time, as revisions are made and new protocols become accepted for future use, changes are found which reduce the number of such outlying long-running models. But if a new protocol is not found to produce better results than prior methods, it will not be run in the future anyway, and so tracking down the 1% outliers ends up consuming resources that could be invested into developing another new protocol.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 66699 - Posted 26 Jun 2010 19:35:41 UTC

Three recent failures on W7

Task 347583183 ab_06_19_d000_top_broker_server_models_21455_46857_0
Task 347583182 ab_06_19_d000_top_broker_server_models_21455_46856_0
Task 347583171 ab_06_19_d000_top_broker_server_models_21455_46845_0

all failed as follows

Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: Option file open failed for: ab_06_19_d000_top_broker_server_models.flags

</stderr_txt>
]]>

billy ewell 1931

Joined: Mar 30 07
Posts: 10
ID: 160868
Credit: 3,008,779
RAC: 0
Message 66708 - Posted 29 Jun 2010 22:39:10 UTC

Task ID 349192260: This is a ProteinDesignInterface unit that consumed 7.5 hours of cpu time on an Intel quad 2.66 with 4 gigs of memory. There were 98 starting structures, 98 attempts and 98 decoys resulting. It really irritates me to see the scoring results when a claimed credit amount of 130.12 was reduced to a granted amount of 36.82. This seems to be quite a COMMON result when processing the PDI work units. I am NOT a points chaser but a dedicated supporter of research science and its potential impact for mankind and the world. BUT I still wonder if perhaps 10% or more of my fairly high-quality computing power is going to waste. Three of my computers, an i7 930 and two 9400 2.66 quads, were purchased and run 24/7 solely in support of projects like rosetta and other BOINC research initiatives.

Am I terribly wrong here or do I have a legitimate concern as a dedicated and loyal supporter of Rosetta and the current CASP?

My account is 160868

I appreciate so very much the dedicated professional designers of this project and the loyal crunchers who particularly make it possible.

Bill: Austin, Texas USA

mhhall

Joined: Mar 28 06
Posts: 7
ID: 68866
Credit: 5,322,769
RAC: 404
Message 66709 - Posted 29 Jun 2010 23:51:34 UTC

Hi folks,
My system is currently executing WU 317305089.

BOINC is showing following properties that would seem to
indicate process is stuck and not checkpointing properly.

CPU Time at last checkpoint: 13:18:13
CPU Time : 15:20:20

Fraction done: 98.925%

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.

Jochen

Joined: Jun 6 06
Posts: 133
ID: 91626
Credit: 3,847,433
RAC: 0
Message 66713 - Posted 30 Jun 2010 11:33:23 UTC - in response to Message ID 66709.

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.

Does this task still create CPU-load? If yes, leave it running, if not try restarting the BOINC-manager (make sure, the client processes will be stopped as well). If it still doesn't create CPU-load after restarting the manager, you should abort it.
cu

Joe


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66714 - Posted 30 Jun 2010 14:40:41 UTC

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.
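A toy version of that averaging, for illustration. It assumes the credit granted per model is simply the mean of the credit claimed per model over all results reported so far for a batch; the real server-side calculation is not shown in this thread and may differ. All numbers and names are invented.

```python
def average_claim_per_model(reports):
    """reports: list of (claimed_credit, models_completed) tuples."""
    total_claimed = sum(claimed for claimed, _ in reports)
    total_models = sum(models for _, models in reports)
    return total_claimed / total_models


# Nine ordinary results: 60 credits claimed for 6 models each.
normal = [(60.0, 6)] * 9
# One result with a long-running model: 240 credits claimed for 7 models.
long_running = (240.0, 7)

before = average_claim_per_model(normal)
after = average_claim_per_model(normal + [long_running])

print(before)                 # 10.0 credits per model
print(round(after, 2))        # 12.79 credits per model
print(round(7 * after, 1))    # 89.5 granted for the long result, versus 240 claimed
```

Under these assumptions everyone's per-model credit rises from 10 to about 12.8, a gain spread thinly across all results, while the one long result is visibly granted far less than it claimed.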

Keep crunching.
____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 66715 - Posted 30 Jun 2010 14:42:01 UTC - in response to Message ID 66713.

Would hate to kill a job so close to completion,
but I've got to wonder if this is really going to
complete.

Does this task still create CPU-load? If yes, leave it running, if not try restarting the BOINC-manager (make sure, the client processes will be stopped as well). If it still doesn't create CPU-load after restarting the manager, you should abort it.
cu

Joe



That would be my advice as well. Sounds like a long-running model, like billy ewell 1931 just reported as well. But if it is still using CPU time, then it should take care of itself without any tinkering.
____________
Rosetta Moderator: Mod.Sense

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 66719 - Posted 30 Jun 2010 16:50:27 UTC - in response to Message ID 66714.

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.

Sometimes I understand these things and sometimes not, so please pull me up if I'm getting this wrong, but...

Some WUs seem to be of the type where 500 steps are attempted for a decoy\model and if nothing useful seems to be coming up it gets ended very quickly and moves onto the next. Then the next, then the next etc.

Eventually a decoy\model comes up that looks promising and rather than stopping at 500 it seems to go on (and on) until the watchdog cuts in.

So, perversely, you either have (say) 1000 models taking (say) 2h 50m or 1001 taking 7 hours. It's when the task over-runs that it's working on the most valuable stuff, which then gets a low credit award.

Alternatively, the task is getting into a loop and is going nowhere very slowly indeed, as we've seen recently.

An example of a long running WU for me is simIF2_1f0s_1PBV_ProteinInterfaceDesign_28Jun2010_21501_4_0

These simIF tasks seem to be particularly susceptible.
____________

billy ewell 1931

Joined: Mar 30 07
Posts: 10
ID: 160868
Credit: 3,008,779
RAC: 0
Message 66720 - Posted 30 Jun 2010 19:24:26 UTC - in response to Message ID 66714.

billy ewell 1931, thanks for crunching. Regardless of credit granted, the results are valuable. So I cannot agree with your comment about any CPU time being wasted. I would simply say that ideally the runtime per model of these tasks could be more consistent. Rest assured that for every long-running model, there are more credits granted per normal one. This is how the credit system works. Your long-running model result causes the average credit claimed per model to increase, and so as other users report results they (and you) are granted slightly more than if no such long-running model had occurred. The problem is that it is much harder to see that you get a fraction more, and easy to see when you get 50+% less than claimed.

Keep crunching.


MS: Thanks for the reply; I principally understand but wish to emphasize as I stated previously that "I am not a credits chaser" but a dedicated supporter of scientific research and extremely happy to do so. All that having been said, the work unit recently completed and reported below highlighted my concerns brilliantly. I last checked this work unit at about 6.6 hours of completion time and the last check point at that time was 00:26:13. I started and stopped this unit at least 15 times without effect. My main concern is that my 21 cpus are being used efficiently and your answer has reassured me. Again, thanks.

Bill

Task ID 349219039
Name fc_A_noSmallMvs_fc6x_2hwx_ProteinInterfaceDesign_20Jun2010_21458_97_0
Workunit 318903612
Created 29 Jun 2010 8:01:24 UTC
Sent 29 Jun 2010 8:03:08 UTC
Received 30 Jun 2010 18:46:19 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 1273687
Report deadline 9 Jul 2010 8:03:08 UTC
CPU time 28848.84
stderr out <core_client_version>6.10.56</core_client_version>
<![CDATA[
<stderr_txt>
[2010- 6-30 3:17: 1:] :: BOINC:: Initializing ... ok.
[2010- 6-30 3:17: 1:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400
BOINC:: CPU time: 28846.8s, 14400s + 14400s[2010- 6-30 11:27:54:] :: BOINC
InternalDecoyCount: 64
======================================================
DONE :: 2 starting structures 28846.8 cpu seconds
This process generated 64 decoys from 64 attempts
======================================================
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 184.044030472935
Granted credit 16.2806625211898
application version 2.14

Jochen

Joined: Jun 6 06
Posts: 133
ID: 91626
Credit: 3,847,433
RAC: 0
Message 66721 - Posted 30 Jun 2010 20:08:18 UTC - in response to Message ID 66720.

...I started and stopped this unit at least 15 times...

AFAIR this is not a good idea, as long as the task is not kept in memory.
If you suspend a task and keep it in memory, it does not have any effect at all. If you suspend a task and don't keep it in memory you will lose any work done since the last checkpoint.
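As a concrete illustration, using the two times mhhall posted above (CPU time 15:20:20, last checkpoint at 13:18:13), the work that would be redone if such a task were removed from memory can be estimated like this. The helper names are made up for the example.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Convert a number of seconds back to 'HH:MM:SS'."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


cpu_time = hms_to_seconds("15:20:20")
last_checkpoint = hms_to_seconds("13:18:13")

# CPU time spent since the last checkpoint, lost if the task restarts.
at_risk = cpu_time - last_checkpoint
print(at_risk, seconds_to_hms(at_risk))   # 7327 02:02:07
```

That is a little over two hours of CPU time that would have to be repeated after each restart from the checkpoint.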

Again I would recommend to just leave the tasks running, as long as they create CPU load. This is probably the best you can do. And again AFAIR this is the best way to assure that no CPU time is wasted.

Joe, the jinx
____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 66736 - Posted 1 Jul 2010 23:35:53 UTC

This failed after 11min.

td-only-2-ARF1_4-15_21413_114_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=319298520

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

Starting work on structure: _00001
# cpu_run_time_pref: 14400

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 66828 - Posted 9 Jul 2010 9:04:17 UTC

This one completed successfully but stopped well short of its 6 hour run.
ab_07_06_T0581_21_136_homs_h004__SAVE_ALL_OUT.IGNORE_THE_REST_10_11_21556_1


ERROR: expected to read 18 libraries from Dun02, but read 0
ERROR:: Exit from: ..\..\src\core\scoring\dunbrack\RotamerLibrary.cc line: 865

____________

Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 67060 - Posted 1 Aug 2010 21:20:22 UTC

Result ID 356014820 errored out after 28 minutes; default runtime is 10 hours.
Error output
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>
____________
Have a crunching good day!!

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67091 - Posted 4 Aug 2010 2:38:36 UTC

Hi.

Someone might want to have a look at this one. I got a Validate error; none of the other copies have been returned. I can't see a problem with it.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=321742035

ab_07_08_T0606_27_169_h001_disulf_SAVE_ALL_OUT.IGNORE_THE_REST_06_07_21584_248_2


Starting work on structure: _00001
# cpu_run_time_pref: 14400
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
======================================================
DONE :: 1 starting structures 13766.2 cpu seconds
This process generated 17 decoys from 17 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67098 - Posted 4 Aug 2010 15:25:27 UTC

P.P.L. it looks like you received the third issue of that specific work unit. The first two never reported back. So the third resulted in too many tasks, as two is the configured maximum. So, yes, the BOINC server should not send out such results that are doomed to failure in the first place. It is a bug that I believe was recently fixed, so the next time the Project Team upgrades the servers, it shouldn't happen anymore. It only happens when some very rare circumstances combine. That was part of what made it hard for Berkeley to track it down.

So, your machine completed it ok, i.e. no computation errors. But the validator discovered three reports for something with a maximum of two and hence produces the validation error.
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67108 - Posted 5 Aug 2010 1:12:22 UTC

Hi Mod Sense.

Yes, a bug indeed. I received credit for it anyway so that's O.K. :)


____________


12kpp

Joined: Jul 4 09
Posts: 2
ID: 324824
Credit: 225,504
RAC: 0
Message 67112 - Posted 6 Aug 2010 2:05:16 UTC

Hi !
I have the same error. Validate error.

324097911

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67113 - Posted 6 Aug 2010 3:42:32 UTC

This one errored after 2 min.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=326440378

cs-only-2-DinI_3-14_20161_242_0

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>


( Left out the bits in between )


Starting work on structure: _00001
# cpu_run_time_pref: 14400

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67123 - Posted 6 Aug 2010 18:45:46 UTC
Last modified: 6 Aug 2010 18:51:03 UTC

Reposting speedy's comments for the Project Team to investigate. Speedy's tasks report:

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178


357381146, 357381134 & 357381125: all tasks start with lrm_jorj_combined_tlrm_jorj_combined_torsion. All tasks end with Compute error. I'm thinking lrm_jorj_combined_tlrm_jorj_combined_torsion is a bad batch of tasks.

____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67133 - Posted 7 Aug 2010 22:00:28 UTC

Another one failed after 2 sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=326517114

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1mgw__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_221_0

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67135 - Posted 8 Aug 2010 3:25:14 UTC
Last modified: 8 Aug 2010 3:26:01 UTC

And another one, this went for 12 sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=326653711


lrm_jorj_combined_torsion_it06_run01_A_rlbd_1o4w__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_430_1

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
____________


Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 67137 - Posted 8 Aug 2010 5:52:30 UTC

Task 357811058
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>

( left out lines in middle )

Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr13_2iiy.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Validate state Invalid Ran for 7 seconds.
____________
Have a crunching good day!!

[DPC]NGS~StugIII

Joined: Mar 8 06
Posts: 2
ID: 64413
Credit: 58,616
RAC: 0
Message 67141 - Posted 8 Aug 2010 16:22:09 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=355266280

This one failed but it took 59251 seconds.

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0057DC0B write attempt to address 0x00D0954C


____________

duftkerze

Joined: Jul 7 06
Posts: 2
ID: 99027
Credit: 637,181
RAC: 1
Message 67161 - Posted 11 Aug 2010 3:23:01 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=357781244
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67162 - Posted 11 Aug 2010 4:03:58 UTC
Last modified: 11 Aug 2010 4:05:11 UTC

Moved duftkerze's post here. Their result has the "Unable to open weights" error that is reported in the prior posts here.
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67167 - Posted 12 Aug 2010 3:56:36 UTC

This one ran for 4sec, failed same as others.


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=327013475

lrm_jorj_combined_torsion_it06_run01_A_rlbd_2cmx__SAVE_ALL_OUT_IGNORE_THE_RESTlr13_DECOY_21224_830_1

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 67170 - Posted 12 Aug 2010 10:02:08 UTC
Last modified: 12 Aug 2010 10:10:01 UTC

Also a weights issue: lrm_jorj_combined_torsion_it06_run01_A_rlbd_1bmg__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_996_1

also task lrm_jorj_combined_torsion_it06_run01_A_rlbd_1bmg__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_913_0

and quite a few more; it would take too long to post them all here.


ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,358,915
RAC: 2,105
Message 67173 - Posted 12 Aug 2010 15:54:36 UTC

T0624_refinement_1_5_topology_broker_SAVE_ALL_OUT.IGNORE_THE_REST_2_21730_1889_0

ERROR: ERROR: ArrayPool array size cannot be changed unless the ArrayPool is empty
ERROR:: Exit from: src/core/graph/ArrayPool.hh line: 296
BOINC:: Error reading and gzipping output datafile: default.out


AdeB
____________

Jochen

Joined: Jun 6 06
Posts: 133
ID: 91626
Credit: 3,847,433
RAC: 0
Message 67174 - Posted 12 Aug 2010 17:22:24 UTC

cs-only-2-RrR43_9-13_20161_106_1

ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out


gunn_fragments_SAVE_ALL_OUT_-1lveA__20675_4867_1
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 410
BOINC:: Error reading and gzipping output datafile: default.out


gunn_fragments_SAVE_ALL_OUT_-1lveA__20675_4642_1
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 410
BOINC:: Error reading and gzipping output datafile: default.out


____________

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 194,329
RAC: 0
Message 67271 - Posted 18 Aug 2010 19:41:11 UTC
Last modified: 18 Aug 2010 19:42:02 UTC

I have preferences set to 24 hours crunching but 327442297 has now been crunching for 38 hours and is stuck at 60.034% although still clocking time up and the time to go is rising in proportion.

Should this be terminated? Is there any way it can be made to terminate without losing the results?

Mike

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67273 - Posted 18 Aug 2010 20:06:50 UTC - in response to Message ID 67271.

I have preferences set to 24 hours crunching but 327442297 has now been crunching for 38 hours and is stuck at 60.034% although still clocking time up and the time to go is rising in proportion.

Should this be terminated? Is there any way it can be made to terminate without losing the results?

Mike


The properties of that task will show you the CPU time used. Check it, jot down the time, then a minute later check it again. Did it use any CPU time during that minute? I am doubtful the task is getting any CPU time. If it were, the watchdog would have already caught the problem and reported the task result. So, I'm guessing you may have to suspend and then resume the task to get it back to using CPU time. It will probably then run through its originally expected 24 hour runtime (i.e. another 8-10 hours).
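The "check, wait a minute, check again" test above can be sketched in a few lines of Python. This is only an illustration: `used_cpu_between`, `looks_stalled`, and the one-second threshold are hypothetical names and values, not part of BOINC or minirosetta, and the CPU-time reader is passed in by the caller (for example, a function that returns the figure you read from the task's properties).

```python
import time
from typing import Callable


def used_cpu_between(read_cpu_seconds: Callable[[], float],
                     wait_seconds: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Return the CPU seconds a task consumed during the wait window.

    read_cpu_seconds is a caller-supplied function (hypothetical) that
    returns the task's accumulated CPU time in seconds.
    """
    before = read_cpu_seconds()   # first reading: "jot down the time"
    sleep(wait_seconds)           # wait about a minute
    after = read_cpu_seconds()    # second reading
    return after - before


def looks_stalled(cpu_delta: float, threshold: float = 1.0) -> bool:
    """Treat the task as stalled if it used less than `threshold` CPU seconds.

    The threshold is an assumed value chosen for illustration.
    """
    return cpu_delta < threshold
```

If `looks_stalled` returns True for a task that still shows as running, that matches the situation described above, where suspending and resuming the task may be needed.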
____________
Rosetta Moderator: Mod.Sense

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 194,329
RAC: 0
Message 67275 - Posted 18 Aug 2010 21:09:15 UTC - in response to Message ID 67273.

I have preferences set to 24 hours crunching but 327442297 has now been crunching for 38 hours and is stuck at 60.034% although still clocking time up and the time to go is rising in proportion.

Should this be terminated? Is there any way it can be made to terminate without losing the results?

Mike


The properties of that task will show you the CPU time used. Check it, jot down the time, then a minute later check it again. Did it use any CPU time during that minute? I am doubtful the task is getting any CPU time. If it were, the watchdog would have already caught the problem and reported the task result. So, I'm guessing you may have to suspend and then resume the task to get it back to using CPU time. It will probably then run through its originally expected 24 hour runtime (i.e. another 8-10 hours).


Many thanks. After suspending and resuming several times it finally restarted using CPU time, although it had been hogging one core for the last day without using CPU time.

Mike

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 67289 - Posted 20 Aug 2010 21:28:42 UTC

Frag_relax_2i9cA-20_PRODUCTIVE_SAVE_ALL_OUT_21745_32_0
Incorrect function. (0x1) - exit code 1 (0x1)
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67306 - Posted 23 Aug 2010 4:38:44 UTC

This one erred after 12sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=329845625

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1dzo__SAVE_ALL_OUT_IGNORE_THE_RESTlr13_DECOY_21224_8_1

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Jochen

Joined: Jun 6 06
Posts: 133
ID: 91626
Credit: 3,847,433
RAC: 0
Message 67308 - Posted 23 Aug 2010 12:16:06 UTC

This one failed after 80 seconds:

Frag_relax_1c8cA-14_PRODUCTIVE_SAVE_ALL_OUT_21745_208_1


ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out

____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67321 - Posted 25 Aug 2010 5:11:10 UTC
Last modified: 25 Aug 2010 5:11:31 UTC

Hi.

Is anyone else seeing a problem with these tasks? They seem to be running longer than they should and not checkpointing much.

THESE//intSpin2_1f0s_2gbn_ProteinInterfaceDesign_20Aug2010_


I have a few running now that have run for over four hours; the last checkpoint was at 30 min, and one has done over 140 models.

I have had others that have run for over six hours where the last checkpoint was at ~3 hrs, after many models.
____________


Jochen

Joined: Jun 6 06
Posts: 133
ID: 91626
Credit: 3,847,433
RAC: 0
Message 67326 - Posted 25 Aug 2010 8:49:27 UTC

Yes, some of the ...ProteinInterfaceDesign...-tasks show this behaviour.
You might want to have a look at this thread. It's more or less the same question.

cu Joe
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 67348 - Posted 25 Aug 2010 22:21:14 UTC

I also have the unable to open weights problem with this one:

lrm_jorj_combined_torsion_it06_run01_A_rlbd_2uzr__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_54
____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 67426 - Posted 29 Aug 2010 8:50:40 UTC

two - unable to open weights:
lrm_jorj_combined_torsion_it06_run01_A_rlbn_1b3a_SAVE_ALL_OUT_IGNORE_THE_REST_NATIVE_21225_89

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1unr__SAVE_ALL_OUT_IGNORE_THE_RESTlr13_DECOY_21224_89

and this one failed after 330 seconds

cs-only-2-TR80_8-7_20161_123

ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 67455 - Posted 30 Aug 2010 8:39:01 UTC

another one here:
cs-td-2-ARF1_5-11_20162_130

ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out

____________

laneferm

Joined: Aug 18 10
Posts: 1
ID: 391057
Credit: 146,295
RAC: 0
Message 67502 - Posted 1 Sep 2010 0:16:19 UTC

What is a credit unit based on?
I have around 527. Thanks,

H KYLE

Joined: Sep 18 09
Posts: 3
ID: 345956
Credit: 431,342
RAC: 61
Message 67505 - Posted 1 Sep 2010 2:47:26 UTC - in response to Message ID 67502.

What is a credit unit based on?
I have around 527. Thanks,


Visit http://boinc.bakerlab.org/rosetta/cert1.php (make sure you are logged in on top right).

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67518 - Posted 1 Sep 2010 14:52:22 UTC - in response to Message ID 67502.

What is a credit unit based on?
I have around 527. Thanks,


What is Credit
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67560 - Posted 3 Sep 2010 3:44:09 UTC

This failed after 12 sec.

lrm_jorj_combined_torsion_it06_run01_A_rlbd_2hl7__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_318

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331004480

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr10_2hl7.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67575 - Posted 3 Sep 2010 21:50:43 UTC

This failed after 14 sec.

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1r26__SAVE_ALL_OUT_IGNORE_THE_RESTlr8_DECOY_21224_522


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331161671

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr8_1r26.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>


____________


Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 67581 - Posted 4 Sep 2010 2:45:26 UTC - in response to Message ID 67575.


lrm_jorj_combined_torsion_it06_run01_A_rlbd_2hl7__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_318

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1r26__SAVE_ALL_OUT_IGNORE_THE_RESTlr8_DECOY_21224_522

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Exact same error for me in the following tasks:

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1h75__SAVE_ALL_OUT_IGNORE_THE_RESTlr13_DECOY_21224_147_1
lrm_jorj_combined_torsion_it06_run01_A_rlbd_1o73__SAVE_ALL_OUT_IGNORE_THE_RESTlr8_DECOY_21224_118_0
lrm_jorj_combined_torsion_it06_run01_A_rlbd_2uzr__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_120_0
lrm_jorj_combined_torsion_it06_run01_A_rlbd_1s12__SAVE_ALL_OUT_IGNORE_THE_RESTlr8_DECOY_21224_213_0
lrm_jorj_combined_torsion_it06_run01_A_rlbd_2i1u__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_169_1

Also, the following error in the following tasks:
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out

td-only-2-Alg13_8-10_21413_155_1
td-only-2-RrR43_7-10_21413_131_1
td-only-2-DsbA_10-12_21413_137_1
____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67583 - Posted 4 Sep 2010 8:09:15 UTC

Another failed after 2sec, same problem as others.


lrm_jorj_combined_torsion_it06_run01_A_rlbd_1xd6__SAVE_ALL_OUT_IGNORE_THE_RESTlr5_DECOY_21224_606

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331225615

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr5_1xd6.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67601 - Posted 6 Sep 2010 1:59:12 UTC

This one failed after 4sec, same as others.


lrm_jorj_combined_torsion_it06_run01_A_rlbd_2iiy__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_814_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331916208

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>

( left out bits in middle )

Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr10_2iiy.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>



____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67602 - Posted 6 Sep 2010 5:31:34 UTC

And another one, took 12sec to die.

lrm_jorj_combined_torsion_it06_run01_A_rlbd_1l6p__SAVE_ALL_OUT_IGNORE_THE_RESTlr13_DECOY_21224_797

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331893230


<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>


Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev36507.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lrm_jorj_combined_torsion_it06_run01_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr13_1l6p.fix.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database/scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 67629 - Posted 7 Sep 2010 17:58:29 UTC

thought you guys fixed the weights problem???

lrm_jorj_combined_torsion_it06_run01_A_rlbd_2i1u__SAVE_ALL_OUT_IGNORE_THE_RESTlr10_DECOY_21224_584_0

ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 178
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67634 - Posted 7 Sep 2010 22:04:42 UTC

Hi.

The first copy of this task errored. I can't see a problem with mine, so I don't know why it got a validate error.

cs-td-2-LkR15_5-5_20162_213_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=331107668

Server state__Over
Outcome__Validate error
Client state__Done
Exit status__0 (0x0)

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<stderr_txt>

Starting work on structure: _00023
======================================================
DONE :: 1 starting structures 14182.3 cpu seconds
This process generated 23 decoys from 23 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

____________


P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67694 - Posted 10 Sep 2010 21:52:08 UTC

This one has failed twice, mine after 14sec.

SAXS-score-1egaB_SAVE_ALL_OUT_21827_871_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=332831626


<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>


Starting work on structure: _00001

ERROR: Assertion failure: runtime_assert( ( begin + size - 1 ) <= pose.total_residue() );
ERROR:: Exit from: src/protocols/abinitio/FragmentMover.cc line: 250
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 67702 - Posted 11 Sep 2010 2:40:02 UTC

A strange one:

T0605_tjrs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21824_1219_1

ERROR: Error in traceback: pointer doesn't go anywhere!

ERROR:: Exit from: ..\..\src\core\sequence\Aligner.cc line: 79
BOINC:: Error reading and gzipping output datafile: default.out

____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67706 - Posted 11 Sep 2010 7:50:11 UTC

This ran for 3min.

fix_disulf_v4_NMR_1j0t_DISULF__BOINC_abrelax.v1_SAVE_ALL_OUT_21861_87_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=333047271

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

Starting work on structure: _00001
# cpu_run_time_pref: 14400

ERROR: rsd_type_list.size()
ERROR:: Exit from: src/core/fragment/Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


Speedy
Avatar

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 67711 - Posted 11 Sep 2010 9:32:38 UTC

P.P.L, your core client (BOINC Manager) is a little outdated. Try updating your core client to the recommended version. This may help.
____________
Have a crunching good day!!

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,590,569
RAC: 2,277
Message 67712 - Posted 11 Sep 2010 10:53:54 UTC

That BOINC error above is caused by an error while processing that task. I'm not convinced updating BOINC will help prevent that error.
____________

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67717 - Posted 11 Sep 2010 23:00:05 UTC

Hi.

As transient says, I think that "process exited with code 1 (0x1, -255)" is a generic error code they/BOINC use.

I'll stop posting that bit in future.

As far as BOINC versions go, as they say, if it ain't broke don't fix it!

____________


Levent TERLEMEZ

Joined: Dec 7 05
Posts: 18
ID: 31895
Credit: 118,709
RAC: 0
Message 67726 - Posted 13 Sep 2010 7:56:31 UTC

I bought a brand new AMD Phenom(tm) II X4 925 processor and returned to my BOINC projects. But something interesting began to happen. Whatever project is running (SETI, Einstein, Rosetta or any other), and whichever WU it is among those downloaded that day, there is always one calculation error in the session. What might it be?
Machine Specs:
XP Pro SP3
AMD Phenom(tm) II X4 925 Processor
2 GB DDR3 Ram
BOINC Ver. 6.10.58
Thanks for any answers or tips from anyone who has observed the same or a similar error before.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67728 - Posted 13 Sep 2010 17:45:22 UTC

Levent TERLEMEZ
Looks like their task reported back with this:

ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


The task was restarted five times. If the task was unable to reach a checkpoint in that time, then the task is aborted for you. But I would expect a message about too many restarts with no progress rather than the one you got.
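The restart rule described above can be sketched roughly as follows. This is an assumption-laden illustration, not the actual minirosetta or BOINC logic: the function name `should_abort`, the event names, and the exact limit of five restarts without a checkpoint are all hypothetical.

```python
def should_abort(events, max_restarts=5):
    """Return True once a task has restarted `max_restarts` times in a row
    without reaching a checkpoint in between.

    `events` is a sequence of the strings "restart" and "checkpoint"
    (hypothetical event names used only for this sketch).
    """
    restarts_since_checkpoint = 0
    for event in events:
        if event == "checkpoint":
            # Progress was saved, so the counter starts over.
            restarts_since_checkpoint = 0
        elif event == "restart":
            restarts_since_checkpoint += 1
            if restarts_since_checkpoint >= max_restarts:
                return True
    return False
```

Under this sketch, a task that checkpoints at least once between restarts never hits the limit, which is why the limit is meant to catch only tasks making no progress.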
____________
Rosetta Moderator: Mod.Sense

Levent TERLEMEZ

Joined: Dec 7 05
Posts: 18
ID: 31895
Credit: 118,709
RAC: 0
Message 67729 - Posted 13 Sep 2010 20:49:26 UTC - in response to Message ID 67728.

Levent TERLEMEZ
Looks like their task reported back with this:
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


The task was restarted five times. If the task was unable to reach a checkpoint in that time, then the task is aborted for you. But I would expect a message about too many restarts with no progress rather than the one you got.


Thanks for the reply. Sorry for taking the easy way and asking more: is it possible for the task to be corrupted while downloading? Thanks again.


____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67732 - Posted 14 Sep 2010 14:42:50 UTC

...is it possible to be corrupted while downloading.


It is possible for corruption to occur to any data that passes over a network. However, BOINC has signatures that double check the integrity of the files you receive. When a signature mismatch is found, the error is reported differently and the task is not run.

Generally, the error about gzipping appears because the output file was never produced, so there is nothing to zip. That happens when the error occurred before any output was written.
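Both points can be illustrated with a short Python sketch. It is not BOINC's implementation: SHA-256 stands in for whatever checksum or signature scheme the client really uses, and `file_matches_digest` and `gzip_output` are hypothetical helper names. The first function shows the idea of rejecting a download whose digest does not match; the second shows why a missing default.out leads to a secondary "gzipping" message.

```python
import gzip
import hashlib
import os
import shutil


def file_matches_digest(path, expected_hex):
    """Return True if the file's SHA-256 digest equals expected_hex.

    SHA-256 is an assumption for this sketch, not necessarily what BOINC uses.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex


def gzip_output(path):
    """Compress `path` to `path + '.gz'`.

    Returns None on success, or an error string if the output file was
    never produced (for example, because the task failed at startup).
    """
    if not os.path.exists(path):
        name = os.path.basename(path)
        return f"Error reading and gzipping output datafile: {name}"
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    return None
```

In the logs posted in this thread, the gzip line always follows an earlier ERROR line, which fits the second function: the earlier error is the real cause, and the gzip message is only a consequence of the missing file.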
____________
Rosetta Moderator: Mod.Sense

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 67737 - Posted 15 Sep 2010 1:48:49 UTC

I've run a few of these already with no problem; this is a different error.

Ran for 17sec.

T0585_tj_rs_stg0_lrlxjcst_t000__casp9_SAVE_ALL_OUT_21908_3066_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=333706692

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>


____________


Michael*

Joined: Apr 20 10
Posts: 2
ID: 377909
Credit: 1,334,106
RAC: 0
Message 67747 - Posted 16 Sep 2010 11:12:05 UTC

I have a recurring problem with Rosetta. One or more workunits get stuck at some random percentage to completion. Restarting BOINC seems to get the stuck WUs going again but I hate to see so much of my processing potential wasted.

With 8 threads, 4 are usually doing SIMAP with each using 12 or 13 percent processing power and 4 threads doing rosetta with each using 12 or 13 percent. Right now one of the rosetta threads is using 0 percent processing power. One of the workunits is stuck at 83.030% and has been running for 11 hours and 39 minutes. Rosetta WUs never take more than 3 hours. The only mention of this WU in the messages is the one where computation started.

Just now a restart set that WU back to 34% but at least it is moving again.

I don't think it is a memory problem. I've checked the messages and there is no mention of memory running out or any other problems. BOINC uses less than half of my available RAM (6GB) and I have it set to use 70% max while the computer is active.

Any solutions or ideas about this problem would be greatly appreciated.

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 67748 - Posted 16 Sep 2010 12:27:37 UTC

Does this look like a minirosetta 2.14 problem that triggered problems in several workunits from other BOINC projects?

9/15/2010 10:13:46 PM rosetta@home Sending scheduler request: To fetch work.
9/15/2010 10:13:46 PM rosetta@home Requesting new tasks for CPU and GPU
9/15/2010 10:13:49 PM rosetta@home Scheduler request completed: got 1 new tasks
9/15/2010 10:13:51 PM rosetta@home Started download of old_targets_calbindin_pcs_files4.zip
9/15/2010 10:14:14 PM rosetta@home Finished download of old_targets_calbindin_pcs_files4.zip
9/15/2010 10:15:46 PM rosetta@home Starting calbindin_old_targets_PCS_SAVE_ALL_OUT_21968_479_0
9/15/2010 10:16:20 PM rosetta@home Starting task calbindin_old_targets_PCS_SAVE_ALL_OUT_21968_479_0 using minirosetta version 214
9/15/2010 10:16:20 PM QMC@HOME Task qasino_b3lyp-E26_iso34.896_0 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM QMC@HOME If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:20 PM Docking Task 1g2k1ebw_mod0014crossdockinghiv1_7120_130310_0 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM Docking If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:20 PM World Community Grid Task BETA_E200366_495_A.24.C19H12N2OS2.250.1.set1d06_0 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM World Community Grid If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:20 PM malariacontrol.net Task wu_760_234_219331_0_1284580689_1 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM malariacontrol.net If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:20 PM ibercivis Task 1bm7opt_fix_gridmaps.7z__ZINC06701282_1284586816_S08_E05_0 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM ibercivis If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:20 PM boincsimap Task 10090101.156326_1 exited with zero status but no 'finished' file
9/15/2010 10:16:20 PM boincsimap If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:21 PM PrimeGrid Task pps_sr2sieve_1941162_0 exited with zero status but no 'finished' file
9/15/2010 10:16:21 PM PrimeGrid If this happens repeatedly you may need to reset the project.
9/15/2010 10:16:21 PM ibercivis Computation for task 1bm7opt_fix_gridmaps.7z__ZINC06722361_1284587828_S08_E05_0 finished

Most of the other workunits recovered enough to finish, apparently successfully.
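
To make the timing in the log above easier to see, here is a rough Python sketch that groups the "exited with zero status but no 'finished' file" messages by timestamp. The regular expression and the function name `group_abnormal_exits` are assumptions for illustration only, based on the message layout shown in this post; this is not a BOINC tool.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: "<date> <time> <AM/PM> <project> Task <name> exited with ..."
PATTERN = re.compile(
    r"^(\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) (.+?) Task (\S+) "
    r"exited with zero status but no 'finished' file$"
)

def group_abnormal_exits(lines):
    """Map each timestamp to the projects whose task exited abnormally at that time."""
    groups = defaultdict(list)
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")
            groups[stamp].append(match.group(2))
    return dict(groups)

# A few lines copied from the log above.
SAMPLE = [
    "9/15/2010 10:16:20 PM rosetta@home Starting task calbindin_old_targets_PCS_SAVE_ALL_OUT_21968_479_0 using minirosetta version 214",
    "9/15/2010 10:16:20 PM QMC@HOME Task qasino_b3lyp-E26_iso34.896_0 exited with zero status but no 'finished' file",
    "9/15/2010 10:16:20 PM World Community Grid Task BETA_E200366_495_A.24.C19H12N2OS2.250.1.set1d06_0 exited with zero status but no 'finished' file",
    "9/15/2010 10:16:21 PM PrimeGrid Task pps_sr2sieve_1941162_0 exited with zero status but no 'finished' file",
]

for stamp, projects in sorted(group_abnormal_exits(SAMPLE).items()):
    print(stamp, projects)
```

Run over the full log, this would show the abnormal exits from the other projects falling within about one second of the minirosetta task starting.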

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 67751 - Posted 16 Sep 2010 16:18:32 UTC

robertmiles That SHOULD not be possible. But you certainly have some highly suspicious circumstantial evidence to assert otherwise. The only thing the various projects have in common that should be capable of causing a cascading crash like that is... well the BOINC client. My instinct is that BOINC had a problem at that time and took 'em all out.

Michael* I can't offer any suggestions. As you pointed out, suspend and resume of the task doesn't even seem to kick it to start, at least when tasks are kept in memory, so full restart of BOINC seems to be the only way to get CPU allocated to the task again. I can only confirm that others have observed this as well, and that it seems to be rather rare.

I haven't seen what happens if BOINC reschedules that task on its own. I mean if you suspend it, it will begin another task. If you then release it, BOINC will eventually try to come back to it. At that time does it successfully get CPU time? Or does it get no CPU while BOINC still says it is running? Something to try anyway.
____________
Rosetta Moderator: Mod.Sense

robertmiles Profile

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 67752 - Posted 16 Sep 2010 17:58:44 UTC
Last modified: 16 Sep 2010 18:06:24 UTC

Could be, although if so, it left no other evidence I can see on what went wrong. I do seem to have had problems with the SuperFetch feature of Windows Vista for some time, though - something not adequately documented so that I can see how to fix it. Need some information on how to control WHAT TYPE of information SuperFetch stores; I already have enough information on how to turn it off entirely.

On Michael*'s problem: Could that indicate that restarting from what's left in the main memory does not work adequately for that problem, but restarting from the last checkpoint on the hard drive does?

Michael*

Joined: Apr 20 10
Posts: 2
ID: 377909
Credit: 1,334,106
RAC: 0
Message 67758 - Posted 16 Sep 2010 21:00:08 UTC - in response to Message ID 67751.

Michael* I can't offer any suggestions. As you pointed out, suspend and resume of the task doesn't even seem to kick it to start, at least when tasks are kept in memory, so full restart of BOINC seems to be the only way to get CPU allocated to the task again. I can only confirm that others have observed this as well, and that it seems to be rather rare.

I haven't seen what happens if BOINC reschedules that task on its own. I mean if you suspend it, it will begin another task. If you then release it, BOINC will eventually try to come back to it. At that time does it successfully get CPU time? Or does it get no CPU while BOINC still says it is running? Something to try anyway.


robertmiles On Michael*'s problem: Could that indicate that restarting from what's left in the main memory does not work adequately for that problem, but restarting from the last checkpoint on the hard drive does?


If I suspend all computation and then resume, the WUs stay in memory and the stuck WU is not fixed. Completely closing BOINC and restarting it does fix the stuck WU. I have not tried suspending an individual stuck WU and resuming it later, but I'll attempt that the next time this happens. I have been getting a stuck Rosetta WU about every other day for the past couple of weeks. The only change I've made to BOINC in that time is adding a configuration file that suspends computation while a specific application is running on my PC: <exclusive_app>filename.exe</exclusive_app>
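
For anyone wanting to reproduce that setup, here is a minimal Python sketch that writes out the configuration text. It assumes the <exclusive_app> element sits inside <options> within a <cc_config> root, which is my understanding of the BOINC client configuration layout; `build_cc_config` is a hypothetical helper name and filename.exe is just the placeholder from above.

```python
import xml.etree.ElementTree as ET

def build_cc_config(exclusive_apps):
    """Return client configuration XML listing each executable as an exclusive app."""
    root = ET.Element("cc_config")
    # Assumption: exclusive_app entries belong under the <options> element.
    options = ET.SubElement(root, "options")
    for exe in exclusive_apps:
        ET.SubElement(options, "exclusive_app").text = exe
    return ET.tostring(root, encoding="unicode")

print(build_cc_config(["filename.exe"]))
```

Removing the element again and restarting BOINC would be one way to check whether this change is related to the stuck WUs.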

Levent TERLEMEZ

Joined: Dec 7 05
Posts: 18
ID: 31895
Credit: 118,709
RAC: 0
Message 67760 - Posted 17 Sep 2010 7:59:26 UTC - in response to Message ID 67732.

...is it possible to be corrupted while downloading.


It is possible for corruption to occur to any data that passes over a network. However, BOINC has signatures that double check the integrity of the files you receive. When a signature mismatch is found, the error is reported differently and the task is not run.

Generally the error about gzipping is due to the output file not being produced. So it isn't there to zip. And this is because the error occurred before any output was produced.


Ok thanks for the reply.
____________


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC