minirosetta 2.05

Message boards : Number crunching : minirosetta 2.05

Admin

Joined: 13 Apr 07
Posts: 42
Credit: 260,782
RAC: 0
Message 65300 - Posted: 12 Feb 2010, 15:59:56 UTC

Compute error - exit status 1 lrmixclus_opt_.1hz6.1hz6.SAVE_ALL_OUT_IGNORE_THE_REST.c.20.2.pdb.pdb.JOB_17816_1_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=317250268

ERROR: start_res != middle_res
ERROR:: Exit from: ..\..\src\protocols\moves\KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65302 - Posted: 12 Feb 2010, 20:35:24 UTC

This one failed after just 14 sec.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289171483

lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0

Fri 12 Feb 2010 21:40:02 EST|rosetta@home|Output file lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0_0 for task absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
SIGSEGV: segmentation violation
Stack trace (8 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fd1420]
[0x80a8721]
[0x808fcc1]
[0x804985f]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>




P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65310 - Posted: 14 Feb 2010, 1:53:36 UTC

This one ran for 11min.

lr15clus_3fa_opt_.1bm8.1bm8.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.6.pdb.pdb.JOB_17967_1_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289719139

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>


AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,422,404
RAC: 8
Message 65311 - Posted: 14 Feb 2010, 10:24:55 UTC

The same error as P.P.L. and Admin.

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Task 317684657

AdeB

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,451,304
RAC: 1,152
Message 65313 - Posted: 14 Feb 2010, 16:39:05 UTC - in response to Message 65165.  
Last modified: 14 Feb 2010, 16:47:19 UTC

Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.

However, I would expect that once the target runtime is exceeded, and a model is completed, that work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.


The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.


Recently I have received a lot of tasks that ignore the Target CPU Time set in preferences.
Most of them seem to be of the type *boinc_filtered_loopbuild_threading*.
Examples of such tasks:
t380__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2906_0 - 15002.5 cpu seconds, 2 decoys

t347__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_9452_0 - 20591.2 cpu seconds, 2 decoys

t330__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_3175_0 - 16323.3 cpu seconds, 2 decoys

t322__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_3299_0 - 21789.4 cpu seconds, 3 decoys

In all of these examples cpu_run_time_pref was set to 7200 sec, and every task generated 2 or more decoys (for 2 of them I saw that the 1st model took about 2 hr or more), so the program could have stopped correctly after the 1st decoy. For some reason it did not do so.
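As a rough illustration of the stopping rule described in the quoted reply above, here is a minimal Python sketch. The function names, and the use of the average time per completed model as the prediction for the next one, are assumptions made for illustration; this is not minirosetta's actual code.

```python
# Grace period over the target runtime before the watchdog steps in,
# taken from the quoted reply ("more than 4 hours").
WATCHDOG_GRACE_SECONDS = 4 * 3600


def should_start_next_model(elapsed_cpu_seconds, models_completed, target_seconds):
    """Decide whether another model should be started (hypothetical helper)."""
    if models_completed == 0:
        # Nothing finished yet, so at least one model has to be attempted.
        return True
    if elapsed_cpu_seconds >= target_seconds:
        # Target runtime already exceeded: report back instead of continuing.
        return False
    # Assumption: predict the next model from the average of the finished ones.
    average_per_model = elapsed_cpu_seconds / models_completed
    return elapsed_cpu_seconds + average_per_model <= target_seconds


def watchdog_should_terminate(elapsed_cpu_seconds, target_seconds):
    """True once the run is more than the grace period past the target."""
    return elapsed_cpu_seconds > target_seconds + WATCHDOG_GRACE_SECONDS
```

With a 7200 sec preference, a first model that takes a little over an hour already makes `should_start_next_model` return False in this sketch, which matches the expectation in the quoted reply. The watchdog limit works out to 7200 + 14400 = 21600 sec.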

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,836,395
RAC: 0
Message 65323 - Posted: 15 Feb 2010, 6:17:47 UTC
Last modified: 15 Feb 2010, 6:18:02 UTC

I had a computation error on this task. My 'wingman' also got a computation error on this task. The task errored out after 17 seconds.
lr15clusfa_opt_.1wd6.1wd6.SAVE_ALL_OUT_IGNORE_THE_REST.c.0.10.pdb.pdb.JOB_17747_2

banditwolf
Joined: 10 Jan 06
Posts: 28
Credit: 139,737
RAC: 0
Message 65329 - Posted: 15 Feb 2010, 15:46:14 UTC

I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 65331 - Posted: 15 Feb 2010, 18:27:47 UTC - in response to Message 65329.  

Hello, if these are the Protein-interface Design jobs then this is expected, since they work with very large complexes of proteins. If you turn on the graphics you'll see that the protein systems are much larger than the typical ones on Rosetta@home. These jobs are sent out with a requirement for 512 MB of memory to ensure that large-memory jobs are not sent out to low-resource machines.

Best, Sarel.

I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.



svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,537,396
RAC: 5
Message 65333 - Posted: 15 Feb 2010, 20:41:29 UTC

Task 317544195, lr15clus_opt_.1a32.1a32.IGNORE_THE_REST.c.2.8.pdb.pdb.JOB_17418_5_1, behaved strangely on Mac OS X. It hung at Model 2: step 0 and had to be aborted. In the Searching... pane in the graphics window the protein was compressed into a furball; the other protein displays seemed pretty normal.

Craig Dickinson
Joined: 7 May 07
Posts: 8
Credit: 893,174
RAC: 0
Message 65338 - Posted: 15 Feb 2010, 23:30:05 UTC - in response to Message 65286.  


I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again all other Rosetta files and the first work unit have successfully downloaded.

Where do you want me to send the Wireshark trace report?


Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th tries are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed.

The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.

The only other thing I can think of to try: the downloads started out looking good, so after that first one gets most of the file and fails, shut down BOINC, disable the LAN adapter, re-enable it, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a workaround. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.

Is anyone aware of any specific TCP fixes for Win7?

Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display.



I am using a router, so I re-loaded the firmware as suggested, and this has fixed the problem. I didn't think of the router as the cause, since I would have expected that to cause problems with other BOINC project updates or other software auto-updaters as well.
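For anyone repeating this kind of trace analysis, the port-based display filter from the notes above can be built for all four connections at once. This is only a small sketch; the helper name `port_filter` is made up for illustration, and the port list is simply the one quoted in the notes.

```python
# Local ports listed in the quoted notes (one per download attempt).
LOCAL_PORTS = [51221, 51229, 51235, 51241]


def port_filter(ports):
    """Join per-port tests into one Wireshark display filter expression."""
    return " or ".join(f"tcp.port == {port}" for port in ports)


if __name__ == "__main__":
    # Paste the printed expression into the Wireshark display filter bar.
    print(port_filter(LOCAL_PORTS))
```

Filtering on a single port isolates one attempt; the combined expression shows the original download and its retries together.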

pvh
Joined: 7 Feb 10
Posts: 3
Credit: 2,487,638
RAC: 0
Message 65346 - Posted: 16 Feb 2010, 16:50:58 UTC

I am seeing WUs that seem to be "stuck". If you look at the properties of the WU, you typically see something like:

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

If you look at the graphics, you see that the protein is not changing shape at all and the energy and RMSD are perfectly constant. These jobs run on for around 25,000 seconds and (I assume) are then terminated by the watchdog. You get very low credit for these jobs. I assume this is a bug in the code. If so, please fix it quickly since it is wasting lots of CPU time.

When I see such a WU, should I abort it, or is it better to leave it running?

This is with Rosetta Mini 2.05 on a 64-bit Linux system. I have seen this on both of my OpenSUSE 11.2 systems with the 2.6.31.8-0.1-desktop kernel (so hardware problems are ruled out). I have so far not seen this on my dual-core OpenSUSE 11.0 system with a custom 2.6.28.2-vanilla kernel. The latter is by far the least performant machine, so there is a (small) chance that this is just random chance. I do not see an obvious pattern which WUs suffer from this.
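To put a number on "stuck", the gap between the two times in the task properties can be computed as below. This is a minimal sketch: the one-hour threshold is an arbitrary assumption, not a value used by BOINC or Rosetta, and the function names are hypothetical.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string, as shown in task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_since_checkpoint(cpu_time, checkpoint_time):
    """CPU time accumulated since the last checkpoint was written."""
    return hms_to_seconds(cpu_time) - hms_to_seconds(checkpoint_time)


def looks_stuck(cpu_time, checkpoint_time, threshold_seconds=3600):
    """Flag a task whose checkpoint is older than the (assumed) threshold."""
    return seconds_since_checkpoint(cpu_time, checkpoint_time) > threshold_seconds
```

For the values quoted above, `seconds_since_checkpoint("06:02:29", "00:35:26")` is 19623 seconds, about five and a half hours of CPU time without a checkpoint.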

Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65348 - Posted: 16 Feb 2010, 18:38:46 UTC

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

Try closing down BOINC and re-opening it. That seems to do the trick.

pvh
Joined: 7 Feb 10
Posts: 3
Credit: 2,487,638
RAC: 0
Message 65349 - Posted: 16 Feb 2010, 20:57:50 UTC - in response to Message 65348.  

Try closing down BOINC and re-opening it. That seems to do the trick.


Thanks for the tip, but why did you remove my post?

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 65365 - Posted: 19 Feb 2010, 13:11:40 UTC

lr15clusfa_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.8.pdb.pdb.JOB_17715_7_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=317120311

Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 15.39063
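The negative exit status and the hex value in parentheses are the same number viewed as signed and unsigned 32-bit integers. The sketch below shows the conversion. 0xc0000005 is the Windows STATUS_ACCESS_VIOLATION code; that this is what happened in the task above is an assumption based only on the number. The "minus 256" value reproduces the third number in the "process exited with code" lines earlier in the thread; whether BOINC computes it that way is also an assumption. The function name is hypothetical.

```python
# Well-known Windows status value; the mapping here is deliberately tiny.
KNOWN_WINDOWS_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe_exit_status(status):
    """Return (hex form as unsigned 32-bit, status - 256, known name or None)."""
    unsigned32 = status & 0xFFFFFFFF  # reinterpret a negative value as unsigned
    name = KNOWN_WINDOWS_STATUS.get(unsigned32)
    return (f"0x{unsigned32:x}", status - 256, name)
```

For the codes seen in this thread, 193 gives 0xc1 and -63, 1 gives 0x1 and -255, and -1073741819 gives 0xc0000005.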

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65393 - Posted: 22 Feb 2010, 21:05:31 UTC

Hi.

I don't know if this is a task problem or a result of the validator problems, but I'll put it here anyway.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=291130717

2cgq_Jan28_2cgq_3cp0_ProteinInterfaceDesign_15Feb2010_18083_187_0

Validate error

# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14487.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
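
When comparing many results like the one above, the summary lines in the stderr text can be pulled out with a few regular expressions. This is a sketch that assumes the exact line wording shown in this post; the function name and the returned field names are made up for illustration.

```python
import re


def summarize_stderr(stderr_text):
    """Extract runtime preference, CPU seconds, and decoy counts, or None."""
    pref = re.search(r"cpu_run_time_pref:\s*(\d+)", stderr_text)
    done = re.search(
        r"DONE ::\s*(\d+) starting structures\s+([\d.]+) cpu seconds", stderr_text
    )
    made = re.search(r"generated\s+(\d+) decoys from\s+(\d+) attempts", stderr_text)
    if not (pref and done and made):
        return None  # the task did not reach its normal end-of-run summary
    target = int(pref.group(1))
    cpu_seconds = float(done.group(2))
    decoys = int(made.group(1))
    return {
        "target_seconds": target,
        "cpu_seconds": cpu_seconds,
        "decoys": decoys,
        "attempts": int(made.group(2)),
        # Average CPU time per decoy; None if no decoy was produced.
        "seconds_per_decoy": cpu_seconds / decoys if decoys else None,
        "seconds_over_target": cpu_seconds - target,
    }
```

For this task the summary shows a run that ended normally about 88 seconds past the 14400 sec preference, so the validate error does not appear to come from the runtime itself.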


sam_spade
Joined: 2 Dec 08
Posts: 1
Credit: 453,056
RAC: 0
Message 65394 - Posted: 22 Feb 2010, 22:08:14 UTC
Last modified: 22 Feb 2010, 22:11:10 UTC

For almost a week I have been getting an error while downloading the app:
[error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_2.05_windows_x86_64.exe
What can I do? I already tried to reset the project.
The rosetta_beta version_598 app works well.

[AF>Le_Pommier>MacBidouille.com] BlueG3
Joined: 16 Mar 08
Posts: 1
Credit: 43,585
RAC: 0
Message 65403 - Posted: 23 Feb 2010, 22:21:11 UTC
Last modified: 23 Feb 2010, 22:24:47 UTC

ProteinInterfaceDesign tasks seem to finish with a validate error:
error 1
error 2
error 3
error 4

markj
Joined: 21 Jun 08
Posts: 6
Credit: 18,060,229
RAC: 0
Message 65409 - Posted: 24 Feb 2010, 9:53:19 UTC
Last modified: 24 Feb 2010, 9:54:12 UTC

All, or at least most, of the ProteinInterface jobs cause validate errors. Would it be possible to fix this and post in this thread when the fix is in place? It occurs on three different computers (PC, Mac), so it appears to be platform-independent. In the meantime, I am aborting all ProteinInterface jobs and leaving the others, which run OK.
markj

J
Joined: 23 Feb 10
Posts: 4
Credit: 68,995
RAC: 0
Message 65420 - Posted: 26 Feb 2010, 3:16:43 UTC

http://img80.imageshack.us/img80/4378/roserr1.jpg

Haven't been on this project long. No noticeable problems outside of punching 'ok'. Briefly searched the forums for c++ runtime error and didn't find anything, so cheers, here's a pic.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65424 - Posted: 26 Feb 2010, 14:42:10 UTC
Last modified: 26 Feb 2010, 14:43:33 UTC

Looks like "J" has had a few compute errors reported on Win XP running BOINC 6.5.0:

abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_1wjgA_SAVE_ALL_OUT_17405_1983_0
abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_2hx5A_SAVE_ALL_OUT_17405_468_0
lrmixclus_opt_.5cro.5cro.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.6.pdb.pdb.JOB_17886_5_1

The first was completed ok by a wingman
The second is out being worked on right now
The third failed on a wingman as well after 2 min. with an error: The system cannot find the path specified. (0x3) - exit code 3 (0x3)

Rosetta Moderator: Mod.Sense



©2023 University of Washington
https://www.bakerlab.org