Report Problems with Rosetta Version 5.16 II

Sam Miorelli Joined: 16 Feb 06 Posts: 7 Credit: 1,303,044 RAC: 0	Message 17390 - Posted: 30 May 2006, 19:41:25 UTC
I have a Prescott-based machine that I run Rosetta, Einstein, LHC, and SETI on. None of the other projects have any problems, but about 50% of my Rosetta results end in errors. Today alone I had two:
Unrecoverable error for result HOMOLOG_ABRELAX_hom007_t283__505_33607_1 ( - exit code-1073741811 (0xc000000d))
Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_FORCESTRAND_t285__SAVE_ALL_OUT_550_36800_0 ( - exit code-1073741811 (0xc000000d))
The errors only seem to come up when the screensaver is running, not when Rosetta is just running in the background as I do other things. My machine runs two units at a time and has 512MB RAM. Anyone have any idea why I'm getting these problems in Rosetta?

Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0	Message 17395 - Posted: 30 May 2006, 21:16:03 UTC - in response to Message 17390. Last modified: 30 May 2006, 21:17:13 UTC
"Anyone have any idea why I'm getting these problems in Rosetta?"
Yup, I have an AMD64 3700 which has the same problems with Ralph 5.12 and 5.16, and Rosetta 5.16. When I run the screensaver and leave the machine alone: Windows fatal error. If I keep working with it, so that the screensaver never comes on: successful results. When I turn OFF the screensaver: successful results. Turn OFF the screensaver. They are aware of this error and have asked for Rom Walton's assistance. I offered Rom my help, which he's not accepted.
tony
PS, I have two other puters which work just fine with the screensaver on.

Bob Guy Joined: 7 Oct 05 Posts: 39 Credit: 24,895 RAC: 0	Message 17396 - Posted: 30 May 2006, 21:21:28 UTC
I had two recent errors because I used the BOINC view graphics button. The WUs complete successfully if I never view the graphics - I have the BOINC screensaver turned off.
21418496 21418531

XS_Duc Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0	Message 17413 - Posted: 31 May 2006, 9:55:05 UTC Last modified: 31 May 2006, 9:56:46 UTC
It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. Resultid 21972244 (Workunit 18429646)

pieface Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0	Message 17419 - Posted: 31 May 2006, 13:01:03 UTC
I also had watchdog knock down one of those pdbblast guys: resultid like XS DUC's.

Enno Ruijters Joined: 23 Sep 05 Posts: 2 Credit: 3,194,827 RAC: 0	Message 17420 - Posted: 31 May 2006, 13:04:39 UTC
My linux machine got its first error: result 22071081:
Wed 31 May 2006 02:42:39 PM CEST|rosetta@home|Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 (process exited with code 131 (0x83))
I'm using boinc version 5.4.9 on x86_64 linux 2.6.15.

Result ID: 22071081
Name: JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0
Workunit: 18520787
Created: 30 May 2006 3:35:03 UTC
Sent: 30 May 2006 5:48:44 UTC
Received: 31 May 2006 12:47:19 UTC
Server state: Over
Outcome: Client error
Client state: Computing
Exit status: 131 (0x83)
Computer ID: 70238
Report deadline: 6 Jun 2006 5:48:44 UTC
CPU time: 6945.15
stderr out:
<core_client_version>5.4.9</core_client_version>
<message>
process exited with code 131 (0x83)
</message>
<stderr_txt>
# random seed: 2797211
# cpu_run_time_pref: 10800
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8836a6b] [0x884f74c] [0xffffe500] [0x860e7a9] [0x85ff1f8] [0x809364c] [0x860ff95] [0x8610bb0] [0x87dca0f] [0x8728e50] [0x872a6bb] [0x80a3a75] [0x85c3a13] [0x842093e] [0x85f1ffb] [0x8496132] [0x8498c8f] [0x88aec34] [0x8048111]
Exiting...
</stderr_txt>
Validate state: Invalid
Claimed credit: 12.3736072419708
Granted credit: 0
application version: 5.16

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 17422 - Posted: 31 May 2006, 13:14:38 UTC - in response to Message 17419.
"I also had watchdog knock down one of those pdbblast guys: resultid like XS DUC's."
It seems there is a problem with these WUs: FRA_t297

Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0	Message 17425 - Posted: 31 May 2006, 14:04:04 UTC - in response to Message 17413.
"It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. Resultid 21972244 (Workunit 18429646)"
It means that the watchdog function put in place to catch stuck WUs IS WORKING as it is supposed to. I imagine that that unit, and the others where the watchdog is terminating the processing (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger.
"This and no other is the root from which a Tyrant springs; when he first appears he is a protector." Plato

David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0	Message 17440 - Posted: 31 May 2006, 16:17:56 UTC - in response to Message 17422.
"I also had watchdog knock down one of those pdbblast guys: resultid like XS DUC's."
"It seems there is a problem with these WUs: FRA_t297"
Thanks for the heads up--we'll look into this right away.

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0	Message 17481 - Posted: 31 May 2006, 23:05:51 UTC - in response to Message 17425.
Thanks to your helpful messages, we've tracked down the rare bug in the code that's causing this and fixed it. The fix will be included in the next release. Great job all! Luckily we only sent out 5000 of these bad WUs (a very small number compared to the 120,000 done every day) and about a third of them were affected. You will still get credits for those jobs killed by the watchdog when our credit-grantor runs nightly!
"It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. Resultid 21972244 (Workunit 18429646)"
"It means that the watchdog function put in place to catch stuck WUs IS WORKING as it is supposed to. I imagine that that unit, and the others where the watchdog is terminating the processing (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger."
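For a sense of scale, the figures in Bin Qian's post can be worked through in a few lines. This is only a back-of-the-envelope sketch: "about a third" is taken as exactly 1/3, and the function names are made up for illustration.

```python
def batch_share(bad_wus: int, daily_wus: int) -> float:
    """Fraction of one day's work that the bad batch represents."""
    return bad_wus / daily_wus


def affected_count(bad_wus: int, affected_fraction: float) -> float:
    """Estimated number of WUs in the bad batch that actually failed."""
    return bad_wus * affected_fraction


bad_wus = 5000            # "we only sent out 5000 of these bad WUs"
daily_wus = 120_000       # "the 120,000 done every day"
affected_fraction = 1 / 3  # assumption: "about a third" taken as exactly 1/3

affected = affected_count(bad_wus, affected_fraction)
print(f"Bad batch as a share of one day's work: {batch_share(bad_wus, daily_wus):.1%}")
print(f"Affected WUs: about {affected:.0f} ({affected / daily_wus:.1%} of one day's work)")
```

So the bad batch is roughly 4% of a day's work, and the WUs actually hit by the bug come to a bit over 1%.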

dag Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0	Message 17561 - Posted: 2 Jun 2006, 21:30:17 UTC
This one hit 100% at around 12 hours (normal), then stayed there for 6 hours, occupying a slot but not running -- still at 100%, with no more CPU time being used.
https://boinc.bakerlab.org/rosetta/result.php?resultid=22292990
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=18724964
T0283_CONTACTS_CONSERVATIVE_CALPHA_HALFHB_MAP_FROM_hom024_593_11206
dag --Finding aliens is cool, but understanding the structure of proteins is useful.

Ian Joined: 14 Apr 06 Posts: 29 Credit: 416,883 RAC: 0	Message 17562 - Posted: 3 Jun 2006, 0:55:51 UTC Last modified: 3 Jun 2006, 0:57:00 UTC
Errors from a day or two ago that I only just spotted:
https://boinc.bakerlab.org/rosetta/result.php?resultid=22302203
https://boinc.bakerlab.org/rosetta/result.php?resultid=22240155
Touch wood (or at least wood veneer), very few errors lately.
Ian Cundell, St Albans, UK

Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0	Message 17584 - Posted: 3 Jun 2006, 19:54:21 UTC Last modified: 3 Jun 2006, 19:59:27 UTC
I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300MB RAM and the system was out of memory. I got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd).
Results: 22632143 22629987 22628762 22626932 22625240 22624321 22623307 22593482 22585478
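The negative exit codes reported in this thread are Windows NTSTATUS values printed as signed 32-bit integers, which is why each one appears next to a hex form. A minimal sketch of the conversion follows. The lookup table covers only the three codes seen in this thread, with their names from the Windows `ntstatus.h` header, and the function names are made up for illustration.

```python
# NTSTATUS names for the codes that appear in this thread (from ntstatus.h).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format an exit code as 'signed -> hex NAME' using the small table above."""
    status = to_ntstatus(exit_code)
    name = KNOWN_NTSTATUS.get(status, "not in this small table")
    return f"{exit_code} -> 0x{status:08x} {name}"


for code in (-1073741811, -1073741819, -1073741571):
    print(describe(code))
```

The names describe how the process died (invalid parameter, access violation, stack overflow). On their own they neither confirm nor rule out the out-of-memory guess above.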

rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0	Message 17595 - Posted: 4 Jun 2006, 5:12:54 UTC - in response to Message 17584.
"I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300MB RAM and the system was out of memory. I got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd). Results: 22632143 22629987 22628762 22626932 22625240 22624321 22623307 22593482 22585478"
Looks like it might be prudent to take this computer off of Rosetta. It has well below the recommended minimum of 500MB, and it is probably not fair to the project to have all these units abort because of that... Just a thought!
Regards, Bob P.

Vester Joined: 2 Nov 05 Posts: 259 Credit: 4,625,443 RAC: 0	Message 17646 - Posted: 5 Jun 2006, 4:36:10 UTC
This one is running well, but it is using more memory than any that I have observed: a peak of 318 MB.
[img=http://img213.imageshack.us/img213/7604/capture050620060029159zm.th.png]
Thumbnail. You may have to click on the larger image to see it clearly.

Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0	Message 17719 - Posted: 6 Jun 2006, 0:51:56 UTC - in response to Message 17595. Last modified: 6 Jun 2006, 0:53:54 UTC
"Looks like it might be prudent to take this computer off of Rosetta. It has well below the recommended minimum of 500MB, and it is probably not fair to the project to have all these units abort because of that... Just a thought!"
Yes, I took it off Rosetta immediately, along with another computer with 256MB RAM. New workunits are taking A LOT of RAM. I've seen Rosetta using 375MB recently. So I recommend that everyone with less than 512MB be careful.

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 17760 - Posted: 6 Jun 2006, 7:31:42 UTC
If you're getting repeated errors like that, perhaps it'd be a good idea to run Ralph on that machine. Let them track down the source of the errors - and either correct the code, or have Rosetta instantly fail such WUs and state, "not enough RAM for this WU." Someday, we'll hopefully have a client that will tell the server how much RAM we have on our machine, and get WUs that will run with that amount of RAM.
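The idea in the post above, a client that reports its RAM so the server only hands out WUs that fit, could look something like the sketch below. Everything here is hypothetical: the `WorkUnit` class, the `est_peak_ram_mb` field, the `reserve_mb` allowance for the rest of the system, and the example queue are invented for illustration and are not part of BOINC or Rosetta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    name: str             # hypothetical label, not a real Rosetta WU name
    est_peak_ram_mb: int  # hypothetical server-side estimate of peak RAM use


def eligible_work_units(work_units, host_ram_mb, concurrent_tasks=1, reserve_mb=128):
    """Return the WUs whose estimated peak RAM fits the host's per-task budget.

    reserve_mb is an assumed allowance for the OS and other programs.
    """
    per_task_budget_mb = (host_ram_mb - reserve_mb) / concurrent_tasks
    return [wu for wu in work_units if wu.est_peak_ram_mb <= per_task_budget_mb]


# Made-up queue; 375 echoes the peak usage reported earlier in the thread.
queue = [
    WorkUnit("small_wu", 120),
    WorkUnit("medium_wu", 190),
    WorkUnit("large_wu", 375),
]

# 256MB host, one task at a time: (256 - 128) / 1 = 128MB per task.
print([wu.name for wu in eligible_work_units(queue, host_ram_mb=256)])
# 512MB host running two tasks at once: (512 - 128) / 2 = 192MB per task.
print([wu.name for wu in eligible_work_units(queue, host_ram_mb=512, concurrent_tasks=2)])
```

Dividing the budget by the number of concurrent tasks matters for machines like the 512MB one at the top of the thread, which runs two units at a time.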

Jimi@0wned.org.uk Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0	Message 17993 - Posted: 7 Jun 2006, 19:36:40 UTC
My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?

Result ID: 23165576
Name: t304__CASP7_ABRELAX_SAVE_ALL_OUT_cterm2_hom001__654_16007_0
Workunit: 19503478
Created: 7 Jun 2006 11:57:47 UTC
Sent: 7 Jun 2006 13:27:18 UTC
Received: 7 Jun 2006 19:29:58 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 241725
Report deadline: 14 Jun 2006 13:27:18 UTC
CPU time: 14133.575291
stderr out:
<core_client_version>5.3.6</core_client_version>
<stderr_txt>
# random seed: 1986744
SIGSEGV: segmentation violation
Stack trace (16 frames):
[0x8836a6b] [0x884f74c] [0xffffe500] [0x88d0170] [0x88d1a29] [0x88a0767] [0x88a2b51] [0x81eb08b] [0x87298fc] [0x87d2f38] [0x8313d95] [0x80e49ed] [0x849682f] [0x8498c8f] [0x88aec34] [0x8048111]
Exiting...
# random seed: 1986744
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (21 frames):
[0x8836a6b] [0x884f74c] [0xffffe500] [0x882e6bc] [0x8625638] [0x83671a9] [0x8361a84] [0x8729051] [0x84cea28] [0x84cedc4] [0x84cfb67] [0x84de8b1] [0x84e06f1] [0x87d42c3] [0x86afa6b] [0x86b2089] [0x80e5111] [0x849682f] [0x8498c8f] [0x88aec34] [0x8048111]
Exiting...
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 20 (nstruct) times
# This process generated 20 decoys from 20 attempts
</stderr_txt>
Validate state: Valid

truckpuller Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0	Message 18059 - Posted: 8 Jun 2006, 4:16:57 UTC
Since I started running 5.16, my RAC on one of my computers has dropped from 250 down to 225. Can anyone shed any light on what to look for as to why? This computer is a 2500+ Barton with 1 gig of memory, and I have a 1.6 Duron with 512MB of memory outproducing it. Thanks in advance.
Visit us at Christianboards.org

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 18060 - Posted: 8 Jun 2006, 4:29:07 UTC
I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.)