minirosetta 2.05

Author	Message
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 65310 - Posted: 14 Feb 2010, 1:53:36 UTC This one ran for 11 min. lr15clus_3fa_opt_.1bm8.1bm8.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.6.pdb.pdb.JOB_17967_1_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=289719139 <core_client_version>6.2.14</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 14400 ERROR: start_res != middle_res ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt>

AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0	Message 65311 - Posted: 14 Feb 2010, 10:24:55 UTC The same error as P.P.L. and Admin. ERROR: start_res != middle_res ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish Task 317684657 AdeB

Mad_Max Joined: 31 Dec 09 Posts: 207 Credit: 23,409,009 RAC: 12,179	Message 65313 - Posted: 14 Feb 2010, 16:39:05 UTC - in response to Message 65165. Last modified: 14 Feb 2010, 16:47:19 UTC Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference. However, I would expect that once the target runtime is exceeded, and a model is completed, that work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I should have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well. Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful. The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.
I have recently had a lot of tasks that ignore the Target CPU Time in preferences. It seems most of them belong to the type *boinc_filtered_loopbuild_threading*. Examples of such tasks: t380__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2906_0 - 15002.5 cpu seconds, 2 decoys; t347__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_9452_0 - 20591.2 cpu seconds, 2 decoys; t330__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_3175_0 - 16323.3 cpu seconds, 2 decoys; t322__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_3299_0 - 21789.4 cpu seconds, 3 decoys. In all the examples cpu_run_time_pref was set at 7200 sec. All of them generated 2 or more decoys (and for 2 of them I saw that the 1st model took about 2 hr or more), so the program should have been able to stop correctly after the 1st decoy. But for some reason it did not do so.
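The stopping rule described in the quoted reply can be sketched roughly as below. This is only an illustration of the behaviour as it is described in this thread, not the actual minirosetta code: the function names are made up, and predicting the next model's cost from the average of the completed models is an assumption.

```python
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "more than 4 hours" past target, per the quoted reply


def should_start_next_model(cpu_used, models_done, runtime_pref):
    """Return True if another model is expected to fit in the runtime preference.

    Hypothetical helper: the next model's cost is predicted as the average
    CPU time of the models completed so far (an assumption).
    """
    if models_done == 0:
        return True  # nothing finished yet, so the first model always runs
    if cpu_used >= runtime_pref:
        return False  # target already exceeded: end the WU and report back
    predicted_finish = cpu_used + cpu_used / models_done
    return predicted_finish <= runtime_pref


def watchdog_should_kill(cpu_used, runtime_pref):
    """The watchdog is described as firing once the target is exceeded by 4 hours."""
    return cpu_used > runtime_pref + WATCHDOG_GRACE_SECONDS


# With a 7200 s preference, a first model that took 4000 s should end the WU.
print(should_start_next_model(4000, 1, 7200))
print(watchdog_should_kill(15002.5, 7200))
```

Under these assumptions, a task with a 7200 s preference should never start a second model once the first has taken more than an hour, and the watchdog limit would sit at 21600 s. The runtimes listed above are well past the first threshold, which is what makes them look wrong.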

banditwolf Joined: 10 Jan 06 Posts: 28 Credit: 139,737 RAC: 0	Message 65329 - Posted: 15 Feb 2010, 15:46:14 UTC I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.

Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0	Message 65331 - Posted: 15 Feb 2010, 18:27:47 UTC - in response to Message 65329. Hello, if these are the Protein-interface Design jobs then this is expected, since they work with very large complexes of proteins. If you turn on the graphics you'll see that the protein systems are much larger than the typical ones on Rosetta@home. These jobs are sent out with a requirement for 512 MB of memory to ensure that large-memory jobs are not sent out to low-resource machines. Best, Sarel. I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.

svincent Joined: 30 Dec 05 Posts: 219 Credit: 11,805,838 RAC: 0	Message 65333 - Posted: 15 Feb 2010, 20:41:29 UTC Task 317544195, lr15clus_opt_.1a32.1a32.IGNORE_THE_REST.c.2.8.pdb.pdb.JOB_17418_5_1, behaved strangely on Mac OS X. It hung at Model 2: step 0 and had to be aborted. In the Searching... pane in the graphics window the protein was compressed into a furball; the other protein displays seemed pretty normal.

Craig Dickinson Joined: 7 May 07 Posts: 8 Credit: 951,722 RAC: 2,606	Message 65338 - Posted: 15 Feb 2010, 23:30:05 UTC - in response to Message 65286. I have the wireshark trace from the re-attaching of the Rosetta project to the Boinc client showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again all other Rosetta files and the first work unit have successfully downloaded. Where do you want me to send the wireshark trace report? Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th try are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed. The only times I've seen such fundamental things not work properly are when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter needs an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent. The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shut down BOINC, and shut down the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a workaround.
I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter. Is anyone aware of any specific TCP fixes for Win7? Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display. I am using a router, so I re-loaded the firmware as suggested and this has fixed the problem. I didn't think about the router being the cause, as I would have expected that to have caused problems with other BOINC project updates or other software auto updaters.
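For anyone repeating this kind of trace review, the per-port filters mentioned in the notes can be combined into a single Wireshark display filter. A minimal sketch follows; the helper name `port_filter` is made up for illustration, and the port numbers are simply the ones noted in the post above.

```python
def port_filter(ports, retransmissions_only=False):
    """Build a Wireshark display filter that matches any of the given TCP ports."""
    clause = " || ".join(f"tcp.port == {p}" for p in ports)
    if retransmissions_only:
        # tcp.analysis.retransmission is the display filter field Wireshark
        # sets on segments it identifies as retransmitted
        return f"({clause}) && tcp.analysis.retransmission"
    return clause


local_ports = [51221, 51229, 51235, 51241]  # local ports from the notes above
print(port_filter(local_ports))
print(port_filter(local_ports, retransmissions_only=True))
```

The second form narrows the view to the resent frames, which is the symptom described in the trace review (the same frame sent about 10 times without an acknowledgement).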

pvh Joined: 7 Feb 10 Posts: 3 Credit: 2,487,638 RAC: 0	Message 65346 - Posted: 16 Feb 2010, 16:50:58 UTC I am seeing WUs that seem to be "stuck". If you look at the properties of the WU, you typically see something like: CPU time at last checkpoint 00:35:26 CPU time 06:02:29 If you look at the graphics, you see that the protein is not changing shape at all and the energy and RMSD are perfectly constant. These jobs run on for around 25,000 seconds and (I assume) are then terminated by the watchdog. You get very low credit for these jobs. I assume this is a bug in the code. If so, please fix it quickly since it is wasting lots of CPU time. When I see such a WU, should I abort it, or is it better to leave it running? This is with Rosetta Mini 2.05 on a 64-bit Linux system. I have seen this on both of my OpenSUSE 11.2 systems with the 2.6.31.8-0.1-desktop kernel (so hardware problems are ruled out). I have so far not seen this on my dual-core OpenSUSE 11.0 system with a custom 2.6.28.2-vanilla kernel. The latter is by far the least performant machine, so there is a (small) chance that this is just random chance. I do not see an obvious pattern which WUs suffer from this.
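The symptom described here (CPU time running far ahead of the last checkpoint) is easy to check by hand from the task properties. A rough sketch of that check is below; the one-hour threshold is an arbitrary assumption, not a BOINC or Rosetta setting, and some models may legitimately go a long time between checkpoints.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stuck(cpu_time, last_checkpoint, max_gap_seconds=3600):
    """Flag a task whose CPU time has run far ahead of its last checkpoint.

    max_gap_seconds is an assumed threshold chosen for illustration only.
    """
    gap = hms_to_seconds(cpu_time) - hms_to_seconds(last_checkpoint)
    return gap > max_gap_seconds


# Values from the post above: 06:02:29 of CPU time, last checkpoint at 00:35:26.
print(looks_stuck("06:02:29", "00:35:26"))
```

For the values quoted in the post the gap is about five and a half hours, so a task like that would be flagged with any reasonable threshold.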

Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0	Message 65348 - Posted: 16 Feb 2010, 18:38:46 UTC CPU time at last checkpoint 00:35:26 CPU time 06:02:29 Try closing down BOINC and re-opening it. That seems to do the trick.

pvh Joined: 7 Feb 10 Posts: 3 Credit: 2,487,638 RAC: 0	Message 65349 - Posted: 16 Feb 2010, 20:57:50 UTC - in response to Message 65348. Try closing down BOINC and re-opening it. That seems to do the trick. Thanks for the tip, but why did you remove my post?

Greg_BE Joined: 30 May 06 Posts: 5664 Credit: 5,711,666 RAC: 1,996	Message 65365 - Posted: 19 Feb 2010, 13:11:40 UTC lr15clusfa_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.8.pdb.pdb.JOB_17715_7_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=317120311 Outcome: Client error. Client state: Compute error. Exit status: -1073741819 (0xc0000005). CPU time: 15.39063.
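The two exit status values in this report are the same number shown two ways: the negative figure is the signed 32-bit reading of the hex code. A short sketch of the conversion (the function name is made up for illustration):

```python
def exit_status_to_hex(status):
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit hex code."""
    return f"0x{status & 0xFFFFFFFF:08x}"


print(exit_status_to_hex(-1073741819))  # prints 0xc0000005
```

On Windows, 0xC0000005 is the status code for an access violation, so this task crashed on an invalid memory access rather than exiting on its own.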

P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0	Message 65393 - Posted: 22 Feb 2010, 21:05:31 UTC Hi. I don't know if this is a task problem or because of the validator problems; I'll put it here anyway. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=291130717 2cgq_Jan28_2cgq_3cp0_ProteinInterfaceDesign_15Feb2010_18083_187_0 Validate error # cpu_run_time_pref: 14400 ====================================================== DONE :: 2 starting structures 14487.9 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt>

sam_spade Joined: 2 Dec 08 Posts: 1 Credit: 453,056 RAC: 0	Message 65394 - Posted: 22 Feb 2010, 22:08:14 UTC Last modified: 22 Feb 2010, 22:11:10 UTC For almost a week I have been getting an error while downloading the app: [error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_2.05_windows_x86_64.exe What can I do? I already tried to reset the project. The rosetta_beta version_598 app works well.

[AF>Le_Pommier>MacBidouille.com] BlueG3 Joined: 16 Mar 08 Posts: 1 Credit: 43,585 RAC: 0	Message 65403 - Posted: 23 Feb 2010, 22:21:11 UTC Last modified: 23 Feb 2010, 22:24:47 UTC ProteinInterfaceDesign tasks seem to finish with a validate error: error 1, error 2, error 3, error 4

markj Joined: 21 Jun 08 Posts: 6 Credit: 18,060,229 RAC: 0	Message 65409 - Posted: 24 Feb 2010, 9:53:19 UTC Last modified: 24 Feb 2010, 9:54:12 UTC All, or at least most, of the ProteinInterface jobs cause validate errors. Would it be possible to fix this and post in this thread when the fix is performed? It occurs on three different computers (PC, Mac), so it appears to be platform-independent. In the meantime, I am aborting all ProteinInterface jobs, leaving the others, which run ok. markj

J Joined: 23 Feb 10 Posts: 4 Credit: 68,995 RAC: 0	Message 65420 - Posted: 26 Feb 2010, 3:16:43 UTC http://img80.imageshack.us/img80/4378/roserr1.jpg Haven't been on this project long. No noticeable problems outside of punching 'ok'. Briefly searched the forums for C++ runtime error and didn't find anything, so cheers, here's a pic.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 65424 - Posted: 26 Feb 2010, 14:42:10 UTC Last modified: 26 Feb 2010, 14:43:33 UTC Looks like "J" has had a few compute errors reported on Win XP running BOINC 6.5.0: abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_1wjgA_SAVE_ALL_OUT_17405_1983_0, abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_2hx5A_SAVE_ALL_OUT_17405_468_0, lrmixclus_opt_.5cro.5cro.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.6.pdb.pdb.JOB_17886_5_1. The first was completed ok by a wingman. The second is out being worked on right now. The third failed on a wingman as well after 2 min. with an error: The system cannot find the path specified. (0x3) - exit code 3 (0x3) Rosetta Moderator: Mod.Sense

robertmiles Joined: 16 Jun 08 Posts: 1225 Credit: 13,867,813 RAC: 2,618	Message 65429 - Posted: 27 Feb 2010, 3:00:02 UTC - in response to Message 65152. Last modified: 27 Feb 2010, 3:55:14 UTC I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (climate prediction, malaria control or world community grid). Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me. Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot. https://boinc.bakerlab.org/rosetta/result.php?resultid=320652086 64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though). It is a t311__boinc_filtered_loopbuild_threading type workunit. Before the reboot, it showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, and was not using any CPU time. After the reboot, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again.

robertmiles Joined: 16 Jun 08 Posts: 1225 Credit: 13,867,813 RAC: 2,618	Message 65430 - Posted: 27 Feb 2010, 3:32:00 UTC - in response to Message 65229. Last modified: 27 Feb 2010, 3:53:38 UTC Hi! I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February. I'm running on an amd64 linux system, a pretty powerful one. Looking at my tasks log, I had about 120 WUs assigned until today, but only 3-4 of them completed successfully. Others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which, according to the FAQ, is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph, cosmology and with the exception of einstein tasks, which seem to end up in computation errors also, every other program is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc). Thanks Neo2 One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions.

Minardi Joined: 19 Jan 10 Posts: 1 Credit: 1,117,527 RAC: 0	Message 65460 - Posted: 5 Mar 2010, 2:11:31 UTC I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines.