Message boards : Number crunching : Report Problems with Rosetta Version 5.16 II
Author | Message |
---|---|
Sam Miorelli Joined: 16 Feb 06 Posts: 7 Credit: 1,303,044 RAC: 0 |
I have a Prescott-based machine on which I run Rosetta, Einstein, LHC, and SETI. None of the other projects has any problems, but about 50% of my Rosetta results end in errors. Today alone I had two: Unrecoverable error for result HOMOLOG_ABRELAX_hom007_t283__505_33607_1 ( - exit code-1073741811 (0xc000000d)) Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_FORCESTRAND_t285__SAVE_ALL_OUT_550_36800_0 ( - exit code-1073741811 (0xc000000d)) The errors only seem to come up when the screensaver is running, not when Rosetta is just running in the background as I do other things. My machine runs two units at a time and has 512 MB of RAM. Anyone have any idea why I'm getting these problems in Rosetta? |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Anyone have any idea why I'm getting these problems in Rosetta? Yup, I have an AMD64 3700 which has the same problems with Ralph 5.12 and 5.16, and Rosetta 5.16. When I run the screensaver and leave the machine alone: Windows fatal error. If I keep working with it, so that the screensaver never comes on: successful results. When I turn OFF the screensaver: successful results. Turn OFF the screensaver. They are aware of this error and have asked for Rom Walton's assistance. I offered Rom my help, which he's not accepted. tony PS, I have two other puters which work just fine with the screensaver on. |
Bob Guy Joined: 7 Oct 05 Posts: 39 Credit: 24,895 RAC: 0 |
|
XS_Duc Joined: 30 Dec 05 Posts: 17 Credit: 310,471 RAC: 0 |
It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. Result ID 21972244 (Workunit 18429646) |
pieface Joined: 20 Sep 05 Posts: 17 Credit: 797,661 RAC: 0 |
|
Enno Ruijters Joined: 23 Sep 05 Posts: 2 Credit: 3,194,827 RAC: 0 |
My linux machine got its first error: result 22071081: Wed 31 May 2006 02:42:39 PM CEST|rosetta@home|Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 (process exited with code 131 (0x83)) I'm using boinc version 5.4.9 on x86_64 linux 2.6.15. Result ID 22071081 Name JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 Workunit 18520787 Created 30 May 2006 3:35:03 UTC Sent 30 May 2006 5:48:44 UTC Received 31 May 2006 12:47:19 UTC Server state Over Outcome Client error Client state Computing Exit status 131 (0x83) Computer ID 70238 Report deadline 6 Jun 2006 5:48:44 UTC CPU time 6945.15 stderr out <core_client_version>5.4.9</core_client_version> <message> process exited with code 131 (0x83) </message> <stderr_txt> # random seed: 2797211 # cpu_run_time_pref: 10800 No heartbeat from core client for 31 sec - exiting SIGSEGV: segmentation violationStack trace (19 frames): [0x8836a6b] [0x884f74c] [0xffffe500] [0x860e7a9] [0x85ff1f8] [0x809364c] [0x860ff95] [0x8610bb0] [0x87dca0f] [0x8728e50] [0x872a6bb] [0x80a3a75] [0x85c3a13] [0x842093e] [0x85f1ffb] [0x8496132] [0x8498c8f] [0x88aec34] [0x8048111] Exiting... </stderr_txt> Validate state Invalid Claimed credit 12.3736072419708 Granted credit 0 application version 5.16 |
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0 |
I also had watchdog knock down one of those pdbblast guys: resultid It seems there is a problem with these WUs: FRA_t297 |
Jose Joined: 28 Mar 06 Posts: 820 Credit: 48,297 RAC: 0 |
It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. It means that the watchdog function, put in place to prevent WUs from getting stuck, IS WORKING as it is supposed to. I imagine that that unit, and the others the watchdog has terminated (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger. This and no other is the root from which a Tyrant springs; when he first appears he is a protector. - Plato |
David Baker Volunteer moderator Project administrator Project developer Project scientist Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
I also had watchdog knock down one of those pdbblast guys: resultid Thanks for the heads up--we'll look into this right away |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
Thanks to your helpful messages, we've tracked down the rare bug in the code that's causing this and fixed it. The fix will be included in the next release. Great job all! Luckily we only sent out 5000 of these bad WUs (a very small number compared to the 120,000 done every day) and about a third of them were affected. You will still get credit for the jobs killed by the watchdog when our credit-grantor runs nightly! It's been ages since I had another error to report, but this morning I noticed one... never seen that one before. |
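To put the numbers in the post above in proportion, here is a quick back-of-the-envelope calculation. It uses only the figures quoted there (5000 bad WUs, about a third affected, 120,000 results per day); the variable names are made up for illustration.

```python
# Figures quoted in the post above; names are illustrative only.
bad_wus_sent = 5000          # work units sent out with the bug
daily_results = 120_000      # results completed per day
affected_fraction = 1 / 3    # "about a third of them were affected"

# Share of one day's output that the bad batch represents.
share_of_daily = bad_wus_sent / daily_results
# Approximate number of work units actually hit by the bug.
affected = bad_wus_sent * affected_fraction

print(f"Bad batch vs. daily output: {share_of_daily:.1%}")
print(f"Roughly affected WUs: {round(affected)}")
print(f"Affected vs. daily output: {affected / daily_results:.1%}")
```

So the bad batch is on the order of 4% of a single day's work, and the WUs that actually failed are closer to 1.4%.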
dag Joined: 16 Dec 05 Posts: 106 Credit: 1,000,020 RAC: 0 |
Hit 100% at around 12 hours (normal). Then stayed there using a slot and not running for 6 hours -- still 100%, no more cpu time being used. https://boinc.bakerlab.org/rosetta/result.php?resultid=22292990 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=18724964 T0283_CONTACTS_CONSERVATIVE_CALPHA_HALFHB_MAP_FROM_hom024_593_11206 dag --Finding aliens is cool, but understanding the structure of proteins is useful. |
Ian Joined: 14 Apr 06 Posts: 29 Credit: 344,294 RAC: 165 |
Errors from a day or two ago that I only just spotted: https://boinc.bakerlab.org/rosetta/result.php?resultid=22302203 https://boinc.bakerlab.org/rosetta/result.php?resultid=22240155 Touch wood (or at least wood veneer), very few errors lately. Ian Cundell, St Albans, UK |
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300 MB of RAM and the system was out of memory. I've got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd). Results: 22632143 22629987 22628762 22626932 22625240 22624321 22623307 22593482 22585478 |
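For anyone puzzled by how the negative exit codes in this thread relate to the hex values in parentheses: Windows reports the exit code as a signed 32-bit number, and the hex form is the same bits read as unsigned. A minimal Python sketch of the conversion follows. The function name `decode_exit_code` is made up for illustration, and the table lists only the standard NTSTATUS names for the three codes seen in this thread; it does not say what in Rosetta triggered them.

```python
# Standard NTSTATUS names for the three codes reported in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}

def decode_exit_code(code):
    """Return (hex string, status name) for a signed 32-bit exit code."""
    # Masking with 0xFFFFFFFF reinterprets the signed value as unsigned.
    unsigned = code & 0xFFFFFFFF
    return f"0x{unsigned:08x}", KNOWN_STATUS.get(unsigned, "unknown")

for code in (-1073741811, -1073741819, -1073741571):
    hex_form, name = decode_exit_code(code)
    print(code, hex_form, name)
```

The name only identifies the class of failure (bad memory access, invalid parameter, stack overflow); finding the actual cause still needs the project's debugging.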
rbpeake Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300 MB of RAM and the system was out of memory. I've got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd). Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500 MB recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought! Regards, Bob P. |
Vester Joined: 2 Nov 05 Posts: 258 Credit: 3,651,260 RAC: 2 |
This one is running well, but it is using more memory than any that I have observed: A peak of 318 MB. [img=http://img213.imageshack.us/img213/7604/capture050620060029159zm.th.png] Thumbnail. You may have to click on the larger image to see it clearly. |
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500 MB recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought! Yes, I took it off Rosetta immediately, along with another computer that has 256 MB of RAM. New workunits are taking A LOT of RAM. I've seen Rosetta using 375 MB recently. So I recommend everyone with less than 512 MB to be careful. |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
If you're getting repeated errors like that, perhaps it'd be a good idea to run Ralph on that machine. Let them track down the source of the errors, and either correct the code or have Rosetta instantly fail such WUs and state, "not enough RAM for this WU." Someday, we'll hopefully have a client that will tell the server how much RAM we have on our machine, and get WUs that will run with that amount of RAM. |
Jimi@0wned.org.uk Joined: 10 Mar 06 Posts: 29 Credit: 335,252 RAC: 0 |
My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint? Result ID 23165576 Name t304__CASP7_ABRELAX_SAVE_ALL_OUT_cterm2_hom001__654_16007_0 Workunit 19503478 Created 7 Jun 2006 11:57:47 UTC Sent 7 Jun 2006 13:27:18 UTC Received 7 Jun 2006 19:29:58 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 241725 Report deadline 14 Jun 2006 13:27:18 UTC CPU time 14133.575291 stderr out <core_client_version>5.3.6</core_client_version> <stderr_txt> # random seed: 1986744 SIGSEGV: segmentation violationStack trace (16 frames): [0x8836a6b] [0x884f74c] [0xffffe500] [0x88d0170] [0x88d1a29] [0x88a0767] [0x88a2b51] [0x81eb08b] [0x87298fc] [0x87d2f38] [0x8313d95] [0x80e49ed] [0x849682f] [0x8498c8f] [0x88aec34] [0x8048111] Exiting... # random seed: 1986744 # cpu_run_time_pref: 14400 SIGSEGV: segmentation violationStack trace (21 frames): [0x8836a6b] [0x884f74c] [0xffffe500] [0x882e6bc] [0x8625638] [0x83671a9] [0x8361a84] [0x8729051] [0x84cea28] [0x84cedc4] [0x84cfb67] [0x84de8b1] [0x84e06f1] [0x87d42c3] [0x86afa6b] [0x86b2089] [0x80e5111] [0x849682f] [0x8498c8f] [0x88aec34] [0x8048111] Exiting... # cpu_run_time_pref: 14400 # DONE :: 1 starting structures built 20 (nstruct) times # This process generated 20 decoys from 20 attempts </stderr_txt> Validate state Valid |
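When comparing results like the one above, it can help to summarize the stderr text instead of reading the raw stack traces. The following is a minimal Python sketch, not anything the project provides. It assumes the stderr text looks like the excerpt in the post, and `summarize_stderr` is a made-up helper name. It only counts what the log shows; it cannot tell whether the run resumed from a checkpoint or whether the data is good.

```python
import re

def summarize_stderr(stderr_txt):
    """Count SIGSEGV traces and check for the DONE marker in stderr text."""
    # Each crash appears as "SIGSEGV: segmentation violation" followed by
    # "Stack trace (N frames)"; the log excerpt has no space between them.
    frames = [
        int(n)
        for n in re.findall(
            r"SIGSEGV: segmentation violation\s*Stack trace \((\d+) frames\)",
            stderr_txt,
        )
    ]
    return {
        "segfaults": len(frames),
        "frames_per_trace": frames,
        "finished": "# DONE" in stderr_txt,
    }

# Shortened version of the stderr quoted in the post above.
sample = (
    "# random seed: 1986744 SIGSEGV: segmentation violationStack trace "
    "(16 frames): [0x8836a6b] Exiting... # random seed: 1986744 "
    "# cpu_run_time_pref: 14400 SIGSEGV: segmentation violationStack trace "
    "(21 frames): [0x8836a6b] Exiting... # cpu_run_time_pref: 14400 "
    "# DONE :: 1 starting structures built 20 (nstruct) times"
)
print(summarize_stderr(sample))
```

For the excerpt above this reports two segfault traces followed by a DONE marker, which matches what the post describes: two crashes, then a run that finished and validated.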
truckpuller Joined: 5 Nov 05 Posts: 40 Credit: 229,134 RAC: 0 |
Since running 5.16, my RAC on one of my computers has dropped from 250 down to 225. Can anyone shed any light on what to look for as to why? This computer is a 2500+ Barton with 1 GB of memory, and I have a 1.6 Duron with 512 MB of memory outproducing it. Thanks in advance Visit us at Christianboards.org |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.) |
©2025 University of Washington
https://www.bakerlab.org