minirosetta 2.05

Message boards : Number crunching : minirosetta 2.05

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65310 - Posted: 14 Feb 2010, 1:53:36 UTC

This one ran for 11min.

lr15clus_3fa_opt_.1bm8.1bm8.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.6.pdb.pdb.JOB_17967_1_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289719139

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 65311 - Posted: 14 Feb 2010, 10:24:55 UTC

The same error as P.P.L. and Admin.

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Task 317684657

AdeB

Mad_Max
Joined: 31 Dec 09
Posts: 207
Credit: 23,296,955
RAC: 10,649
Message 65313 - Posted: 14 Feb 2010, 16:39:05 UTC - in response to Message 65165.  
Last modified: 14 Feb 2010, 16:47:19 UTC

Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.

However, I would expect that once the target runtime is exceeded, and a model is completed, that work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I should have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.


The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.


Recently I have received a lot of tasks that ignore the Target CPU Time set in preferences.
Most of them seem to be of the type *boinc_filtered_loopbuild_threading*.
Examples of such tasks:
t380__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2906_0 - 15002.5 cpu seconds, 2 decoys

t347__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_9452_0 - 20591.2 cpu seconds, 2 decoys

t330__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_3175_0 - 16323.3 cpu seconds, 2 decoys

t322__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_3299_0 - 21789.4 cpu seconds, 3 decoys

In all of these examples cpu_run_time_pref was set to 7200 sec, and every task generated 2 or more decoys (for 2 of them I saw that the 1st model took about 2 hours or more), so the program could have stopped correctly after the 1st decoy. But for some reason it did not do so.
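
The stopping rule described in the quoted reply can be sketched roughly as follows. This is only an illustration of the behaviour as it is described in this thread, not minirosetta's actual code: the function names are hypothetical, and using the average model time so far as the prediction for the next model is an assumption.

```python
# Sketch of the runtime rule as described in this thread (not project code).
# Assumption: the next model is predicted to take the average time per model so far.

WATCHDOG_GRACE_SEC = 4 * 3600  # "more than 4 hours" past the target runtime


def should_start_next_model(elapsed_sec, models_done, target_sec):
    """Return True if another model is expected to fit within the target runtime."""
    if models_done == 0:
        return True  # at least one model is always attempted
    if elapsed_sec >= target_sec:
        return False  # target already exceeded, report the WU back
    avg_per_model = elapsed_sec / models_done
    return elapsed_sec + avg_per_model <= target_sec


def watchdog_should_kill(elapsed_sec, target_sec):
    """Return True once the task has overrun the target by more than the grace period."""
    return elapsed_sec > target_sec + WATCHDOG_GRACE_SEC


# With a 7200 sec preference, a first model that took 7300 sec leaves no room for a second.
print(should_start_next_model(7300, 1, 7200))
# A first model of 3700 sec (just over an hour) predicts 7400 sec for two models.
print(should_start_next_model(3700, 1, 7200))
```

Under this reading, each of the tasks listed above should have stopped after its first decoy if that decoy took about 2 hours.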

banditwolf
Joined: 10 Jan 06
Posts: 28
Credit: 139,737
RAC: 0
Message 65329 - Posted: 15 Feb 2010, 15:46:14 UTC

I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.

Sarel
Joined: 11 May 06
Posts: 51
Credit: 81,712
RAC: 0
Message 65331 - Posted: 15 Feb 2010, 18:27:47 UTC - in response to Message 65329.  

Hello, if these are the Protein-interface Design jobs then this is expected since they work with very large complexes of proteins. If you turn on the graphics you'll see that the protein systems are much larger than the typical ones on Rosetta@home. These jobs are sent out with a requirement for 512 MB of memory to ensure that large-memory jobs are not sent out to low-resource machines.

Best, Sarel.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 65333 - Posted: 15 Feb 2010, 20:41:29 UTC

Task 317544195, lr15clus_opt_.1a32.1a32.IGNORE_THE_REST.c.2.8.pdb.pdb.JOB_17418_5_1, behaved strangely on Mac OS X. It hung at Model 2: step 0 and had to be aborted. In the Searching... pane of the graphics window the protein was compressed into a furball; the other protein displays seemed pretty normal.

Craig Dickinson
Joined: 7 May 07
Posts: 8
Credit: 924,823
RAC: 666
Message 65338 - Posted: 15 Feb 2010, 23:30:05 UTC - in response to Message 65286.  


I have the Wireshark trace from the re-attaching of the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again all other Rosetta files and the first work unit have successfully downloaded.

Where do you want me to send the Wireshark trace report?


Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th try are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed.

The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.

The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shutdown BOINC, and shutdown the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a work around. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.

Is anyone aware of any specific TCP fixes for Win7?

Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display.



I am using a router so re-loaded the firmware as suggested and this has fixed the problem. Didn't think about the router being the cause as I would have expected that to have caused problems with other BOINC project updates or other software auto updaters.
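
For anyone repeating this kind of diagnosis, the per-port display filters mentioned in the quoted notes can be combined into one expression. The snippet below is a small sketch that only builds the filter text to paste into Wireshark; the helper name is hypothetical, the port numbers are the ones from the notes above, and adding `tcp.analysis.retransmission` to narrow the view to resent frames is a suggestion rather than something taken from the original trace review.

```python
# Build Wireshark display-filter strings for the local ports noted in the thread.
# port_filter is a hypothetical helper; it only produces text, it does not run Wireshark.

LOCAL_PORTS = [51221, 51229, 51235, 51241]


def port_filter(ports):
    """Join one 'tcp.port == N' clause per port with the display-filter OR operator."""
    return " || ".join(f"tcp.port == {p}" for p in ports)


all_attempts = port_filter(LOCAL_PORTS)
# Restrict the same connections to frames Wireshark flags as retransmissions.
resent_only = f"({all_attempts}) && tcp.analysis.retransmission"

print(all_attempts)
print(resent_only)
```

Filtering on a single port isolates one download attempt, while the combined filter shows all four attempts side by side.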

pvh
Joined: 7 Feb 10
Posts: 3
Credit: 2,487,638
RAC: 0
Message 65346 - Posted: 16 Feb 2010, 16:50:58 UTC

I am seeing WUs that seem to be "stuck". If you look at the properties of the WU, you typically see something like:

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

If you look at the graphics, you see that the protein is not changing shape at all and the energy and RMSD are perfectly constant. These jobs run on for around 25,000 seconds and (I assume) are then terminated by the watchdog. You get very low credit for these jobs. I assume this is a bug in the code. If so, please fix it quickly since it is wasting lots of CPU time.

When I see such a WU, should I abort it, or is it better to leave it running?

This is with Rosetta Mini 2.05 on a 64-bit Linux system. I have seen this on both of my OpenSUSE 11.2 systems with the 2.6.31.8-0.1-desktop kernel (so hardware problems are ruled out). I have so far not seen this on my dual-core OpenSUSE 11.0 system with a custom 2.6.28.2-vanilla kernel. The latter is by far the least performant machine, so there is a (small) chance that this is just a coincidence. I do not see an obvious pattern in which WUs suffer from this.
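
The symptom above, a large gap between "CPU time" and "CPU time at last checkpoint", can be checked mechanically. The following is a minimal sketch using the two values quoted in this post; the function names and the one-hour threshold are assumptions chosen for illustration, not anything defined by BOINC or Rosetta.

```python
# Flag a task whose CPU time has run far past its last checkpoint.
# The 3600 sec threshold is an arbitrary assumption for this sketch.

def hms_to_sec(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def checkpoint_gap(cpu_time, last_checkpoint):
    """Seconds of CPU time accumulated since the last checkpoint."""
    return hms_to_sec(cpu_time) - hms_to_sec(last_checkpoint)


def looks_stuck(cpu_time, last_checkpoint, max_gap_sec=3600):
    """Return True if the gap since the last checkpoint exceeds the threshold."""
    return checkpoint_gap(cpu_time, last_checkpoint) > max_gap_sec


gap = checkpoint_gap("06:02:29", "00:35:26")
print(gap)  # 19623 seconds, i.e. 5 h 27 min 3 s without a checkpoint
print(looks_stuck("06:02:29", "00:35:26"))
```

A long gap alone does not prove a task is hung, since some models may simply checkpoint rarely, so it is best combined with the graphics check described above.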

Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 65348 - Posted: 16 Feb 2010, 18:38:46 UTC

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

Try closing down BOINC and re-opening it. That seems to do the trick.

pvh
Joined: 7 Feb 10
Posts: 3
Credit: 2,487,638
RAC: 0
Message 65349 - Posted: 16 Feb 2010, 20:57:50 UTC - in response to Message 65348.  

Try closing down BOINC and re-opening it. That seems to do the trick.


Thanks for the tip, but why did you remove my post?

Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,696,573
RAC: 1,935
Message 65365 - Posted: 19 Feb 2010, 13:11:40 UTC

lr15clusfa_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.8.pdb.pdb.JOB_17715_7_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=317120311

Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 15.39063
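
The two forms of the exit status above are the same 32-bit value shown as a signed integer and as hexadecimal. A short sketch of the conversion follows (the function name is hypothetical); 0xC0000005 is the Windows status code for an access violation.

```python
# Convert the signed exit status reported by BOINC to its unsigned 32-bit hex form.

def to_unsigned32(code):
    """Reinterpret a signed 32-bit integer as unsigned by masking to 32 bits."""
    return code & 0xFFFFFFFF


exit_status = -1073741819
print(hex(to_unsigned32(exit_status)))  # 0xc0000005
```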

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65393 - Posted: 22 Feb 2010, 21:05:31 UTC

Hi.

I don't know if this is a task problem or because of the validator problems, but I'll put it here anyway.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=291130717

2cgq_Jan28_2cgq_3cp0_ProteinInterfaceDesign_15Feb2010_18083_187_0

Validate error

# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14487.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
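
When comparing many results like this one, the summary lines in the stderr output can be pulled out with a couple of regular expressions. This is a rough sketch that assumes the two lines always have the wording shown above; the function name and the returned field names are made up for illustration.

```python
import re

# The two summary lines as they appear in the stderr excerpt above.
STDERR_TAIL = """\
DONE :: 2 starting structures 14487.9 cpu seconds
This process generated 2 decoys from 2 attempts
"""


def summarize(text):
    """Extract CPU seconds, decoys and attempts; return None if either line is missing."""
    cpu = re.search(r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds", text)
    made = re.search(r"generated\s+(\d+) decoys from\s+(\d+) attempts", text)
    if not cpu or not made:
        return None
    cpu_sec = float(cpu.group(1))
    decoys = int(made.group(1))
    return {
        "cpu_sec": cpu_sec,
        "decoys": decoys,
        "attempts": int(made.group(2)),
        # Average CPU seconds per decoy; None when no decoy was produced.
        "sec_per_decoy": cpu_sec / decoys if decoys else None,
    }


print(summarize(STDERR_TAIL))
```

For this task the run finished just past the 14400 sec preference with two decoys, so the run itself looks normal and only the validation step failed.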

sam_spade
Joined: 2 Dec 08
Posts: 1
Credit: 453,056
RAC: 0
Message 65394 - Posted: 22 Feb 2010, 22:08:14 UTC
Last modified: 22 Feb 2010, 22:11:10 UTC

For almost a week I have been getting an error while downloading the app:
[error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_2.05_windows_x86_64.exe
What can I do? I already tried to reset the project.
The rosetta_beta version_598 app works well.

[AF>Le_Pommier>MacBidouille.com] BlueG3
Joined: 16 Mar 08
Posts: 1
Credit: 43,585
RAC: 0
Message 65403 - Posted: 23 Feb 2010, 22:21:11 UTC
Last modified: 23 Feb 2010, 22:24:47 UTC

ProteinInterfaceDesign tasks seem to finish with a validate error:
error 1
error 2
error 3
error 4

markj
Joined: 21 Jun 08
Posts: 6
Credit: 18,060,229
RAC: 0
Message 65409 - Posted: 24 Feb 2010, 9:53:19 UTC
Last modified: 24 Feb 2010, 9:54:12 UTC

All, or at least most, of the ProteinInterface jobs cause validate errors. Would it be possible to fix this and post in this thread when the fix is in place? It occurs on three different computers (PC, Mac), so it appears to be platform-independent. In the meantime, I am aborting all ProteinInterface jobs and leaving the others, which run OK.
markj

J
Joined: 23 Feb 10
Posts: 4
Credit: 68,995
RAC: 0
Message 65420 - Posted: 26 Feb 2010, 3:16:43 UTC

http://img80.imageshack.us/img80/4378/roserr1.jpg

Haven't been on this project long. No noticeable problems outside of punching 'ok'. Briefly searched the forums for c++ runtime error and didn't find anything, so cheers, here's a pic.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65424 - Posted: 26 Feb 2010, 14:42:10 UTC
Last modified: 26 Feb 2010, 14:43:33 UTC

Looks like "J" has had a few compute errors reported on Win XP running BOINC 6.5.0:

abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_1wjgA_SAVE_ALL_OUT_17405_1983_0
abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_2hx5A_SAVE_ALL_OUT_17405_468_0
lrmixclus_opt_.5cro.5cro.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.6.pdb.pdb.JOB_17886_5_1

The first was completed ok by a wingman
The second is out being worked on right now
The third failed on a wingman as well after 2 min. with an error: The system cannot find the path specified. (0x3) - exit code 3 (0x3)

Rosetta Moderator: Mod.Sense

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,840,411
RAC: 1,575
Message 65429 - Posted: 27 Feb 2010, 3:00:02 UTC - in response to Message 65152.  
Last modified: 27 Feb 2010, 3:55:14 UTC

I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (climate prediction, malaria control or world community grid).


Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.


Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot.

https://boinc.bakerlab.org/rosetta/result.php?resultid=320652086

64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though)

t311__boinc_filtered_loopbuild_threading type workunit

Before the reboot, showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, not using any CPU time

Rebooted, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again.

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,840,411
RAC: 1,575
Message 65430 - Posted: 27 Feb 2010, 3:32:00 UTC - in response to Message 65229.  
Last modified: 27 Feb 2010, 3:53:38 UTC

Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 linux system, a pretty powerful one. Looking at my tasks log, I have had about 120 WUs assigned until today, but only 3-4 of them completed successfully. The others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which the FAQ tells me is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph and cosmology, and with the exception of einstein tasks, which also seem to end up in computation errors, every other program is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).

Thanks
Neo2


One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions.
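
A quick way to do that search is sketched below. The function name is hypothetical, and the sample lines are placeholders rather than real BOINC log output; in practice the lines would come from opening whichever log file your client writes (boinc.log in the quoted post).

```python
# Find every line of a log that mentions boinc_lockfile, with its line number.

def find_lockfile_lines(lines, needle="boinc_lockfile"):
    """Return (line_number, text) pairs for lines containing the search string."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


# Placeholder input for demonstration only (not real log output).
sample = [
    "placeholder line A\n",
    "placeholder line mentioning boinc_lockfile\n",
    "placeholder line C\n",
]

for number, text in find_lockfile_lines(sample):
    print(number, text)

# With a real file (path is an assumption):
# with open("boinc.log", encoding="utf-8", errors="replace") as handle:
#     hits = find_lockfile_lines(handle)
```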

Minardi
Joined: 19 Jan 10
Posts: 1
Credit: 1,117,527
RAC: 0
Message 65460 - Posted: 5 Mar 2010, 2:11:31 UTC

I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines.

©2024 University of Washington
https://www.bakerlab.org