Message boards : Number crunching : Minirosetta v1.47 bug thread.
Author | Message |
---|---|
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
"I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems." I'm seeing somewhat similar problems under Windows Vista SP1. 12/21/2008 7:18:31 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_ccr19_olange_5604_39348_0 using minirosetta version 147 Moving the mouse had no particular effect, but the graphics window stayed blank and shutting it down gave some error messages before it finally worked. I normally let minirosetta run without graphics. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Hi all! I'm back connected with the internet. Sadly to find more errors - we'll be back to debugging after the holidays. Quick comments for the major issues reported above: - The graphics problems with cs_noe_* jobs. This is very strange. We have NOT updated the graphics app - so these jobs must be doing something funny that the graphics app doesn't like. I'll ask the person submitting these to try and run the graphics app locally to see if we can reproduce this error. - The normal_relax_rlb[dn]_* jobs validator error. I thought I had fixed this, so this must be something else then. Yes, the validator will reject the WU if it has produced more than some number of decoys (like around 128 or so per hour). Now, this is pointing to some other problem - evidently it's racing through decoys and not doing anything with them, thereby producing thousands of results. How that can happen on a sporadic basis (< 1/1000 WUs it seems) is puzzling me. I'll have to look into that one. - Virus Scanners: Aehm - not really a bug. We have no control over what virus scanners seem to "recognise" in it as malware/virus. They won't tell us either - they have been wholly unhelpful in this matter. The only solution I see right now is to set exceptions in your virus scanner to ignore apps coming from ralph.bakerlab.org and boinc.bakerlab.org Has anyone seen any new Lockfile problems? Or are these finally a thing of the past? Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 215936807; Workunit 194706499; Name 1dsvA_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1dsvA-_5479_5614_1; crashed on Mac OS X 10.4.11 after 4 secs (thankfully) <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> SIGSEGV: segmentation violation Crashed executable name: minirosetta_1.47_i686-apple-darwin built using BOINC library version 6.5.0 Machine type Intel 80486 (32-bit executable) System version: Macintosh OS 10.4.11 build 8S2167 Sat Dec 20 23:23:58 2008 Thread 0 Crashed: 0 ...etta_1.47_i686-apple-darwin 0x0022f77f __ZNK4core10kinematics8AtomTree20torsion_angle_dof_idERKNS_2id6AtomIDES5_S5_S5_Rd + 139 1 ...etta_1.47_i686-apple-darwin 0x0023415a __ZNK4core10kinematics8AtomTree13torsion_angleERKNS_2id6AtomIDES5_S5_S5_ + 284 2 ...etta_1.47_i686-apple-darwin 0x00022b1c __ZN4core12conformation12Conformation15setup_atom_treeEv + 1384 3 ...etta_1.47_i686-apple-darwin 0x00025055 __ZN4core12conformation12Conformation9fold_treeERKNS_10kinematics8FoldTreeE + 4167 4 ...etta_1.47_i686-apple-darwin 0x00984800 __ZNK9protocols8abinitio16KinematicControl25prepare_pose_for_samplingERN4core4pose4PoseE + 32 5 ...etta_1.47_i686-apple-darwin 0x0060a1d3 __ZN9protocols8abinitio17KinematicAbinitio5applyERN4core4pose4PoseE + 5277 6 ...etta_1.47_i686-apple-darwin 0x0060d8fd __ZN9protocols8abinitio29JumpingFoldConstraintsWrapper5applyERN4core4pose4PoseE + 3927 7 ...etta_1.47_i686-apple-darwin 0x001b13ee __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 4468 8 ...etta_1.47_i686-apple-darwin 0x001b7fad __ZN9protocols8abinitio18AbrelaxApplication3runEv + 1137 9 ...etta_1.47_i686-apple-darwin 0x00008cc8 _main + 4078 10 ...etta_1.47_i686-apple-darwin 0x00001bce __start + 216 11 ...etta_1.47_i686-apple-darwin 0x00001af5 start + 41 etc. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
"My 1.47 cc2_1_8_mammoth-tasks have all crashed on Ralph, now my 1.47 cc2_1_8_native-tasks are crashing on Rosetta." Aehm - I can't see your RALPH failure for this job. I had one result come back and it was a success: http://ralph.bakerlab.org/rah_queue_ops/db_action.php?table=result&id=1228006 http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
"After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours." We run a number of very different jobs on R@home covering a number of different problems in structure prediction and now also protein design. Thus, depending on the type of workunit, runtimes may vary hugely. The rlbd jobs do indeed run very quickly (requiring something like 25 minutes per decoy). What was your very first job? I think we will put a limit into the code that will abort jobs running over 6 hours in the next update. Watch this space.. http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
I appreciate that, thanks. I'll try and keep you guys up to date; your feedback is pretty indispensable for our debugging. http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
3 errors: 1. This one has failed twice: 4.3 sec 216056173 - exit code -1073741819 (0xc0000005) </message> <stderr_txt> Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00476D2D read attempt to address 0x00000000 2. 216056174 6.5 sec Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... 3. This one also failed twice. .02 sec 216056175 - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00476D2D read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
CPU type GenuineIntel Intel(R) Pentium(R) 4 CPU 2.60GHz [Family 15 Model 2 Stepping 9] Number of CPUs 2 Operating System Linux 2.6.24-22-generic process exited with code 193 (0xc1, -63) Stack trace (22 frames): [0x8b979b7] [0x8bc20b0] [0xb7f03420] [0x83c53bc] [0x84356a0] [0x83c4fa3] [0x83ba6f8] [0x85c2f4e] [0x80cf524] [0x80de98f] [0x83376f7] [0x8337100] [0x8243364] [0x82a246c] [0x818e15a] [0x819bae3] [0x819b3aa] [0x8127771] [0x8129a1a] [0x804b9c8] [0x8c1dbac] [0x8048111] https://boinc.bakerlab.org/rosetta/result.php?resultid=215801702 process exited with code 193 (0xc1, -63) SIGSEGV: segmentation violation Stack trace (20 frames): [0x8b979b7] [0x8bc20b0] [0xb7fa5420] [0x83c4fa3] [0x83ba6f8] [0x85c2f4e] [0x80cf1ff] [0x80de98f] [0x83376f7] [0x8337100] [0x8243364] [0x82a246c] [0x818e15a] [0x819bae3] [0x819b3aa] [0x8127771] [0x8129a1a] [0x804b9c8] [0x8c1dbac] [0x8048111] https://boinc.bakerlab.org/rosetta/result.php?resultid=215414530 process exited with code 193 (0xc1, -63) SIGSEGV: segmentation violation Stack trace (23 frames): [0x8b979b7] [0x8bc20b0] [0xb7f48420] [0x8ace23a] [0x84348d3] [0x8ace5f6] [0x8acd739] [0x83b1c55] [0x862a631] [0x83f65af] [0x80cece6] [0x80de98f] [0x82c37e4] [0x82b897a] [0x82c16c1] [0x818d6ee] [0x819bae3] [0x819b3aa] [0x8127771] [0x8129a1a] [0x804b9c8] [0x8c1dbac] [0x8048111] https://boinc.bakerlab.org/rosetta/result.php?resultid=215035006 What's going on with the Rosetta Linux App ? Sometimes it works , sometimes it's duff ? Machine NOT overclocked in the slightest Cheers |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
two more that wasted my cpu time crashing halfway. edit - more of the same type of task errored out https://boinc.bakerlab.org/rosetta/result.php?resultid=215554911 t071_1_RDC_NMR_NESG_5480_119941_0 state Compute error Exit status -1073741819 (0xc0000005) CPU time 9361.141 https://boinc.bakerlab.org/rosetta/result.php?resultid=215583938 t072_1_RDC_NMR_NESG_5481_100236_0 state Compute error Exit status -1073741819 (0xc0000005) CPU time 4056.126 i am aborting the remaining t071 and t072 tasks due to 4 errors in 5-6 hours. wasting my time with that junk. another note: these 2 tasks did not respond to a suspend command in the sense that the time to completion continued to count even though the actual running time had stopped and the status showed as suspended. hope the t073 tasks are better |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
i think you guys should recheck the code or whatever of the t071 and t072 tasks, as I see someone before me had one of these series of tasks and ran into a compute error of the same nature as what i reported. i aborted that task since i am not interested in wasting my cpu time on a compute error bugged task. |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
"I'm seeing problems when attempting to show graphics on workunits with names such as cs_noe* on Mac OS X 10.4.11. It seems like several other people are seeing similar problems." Another workunit with graphics problems: 12/21/2008 11:27:13 AM|rosetta@home|Resuming task cs_noe_fullw_nolin_homo_bench_cs_noe_abrelax_cs_flua_olange_5605_35210_0 using minirosetta version 147 The previous one seemed to complete successfully despite the graphics problem. |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
"After a 1 week hiatus I downloaded v1.47 and 4 tasks. The first task showed a completion time of 12 hours which corresponds to my chosen runtime. The other 3 tasks, all _rlbd_ tasks, showed completion times of only 1 hour. What's up with that? It suggests that the staff provided an estimated task runtime of something like 45 minutes instead of the customary 8 hours." What effect will that have on users who have chosen default workunit times over 6 hours? Is this 6 hours per decoy or 6 hours for the whole workunit? If it only aborts one decoy, will the other decoys still continue, with credit for the decoys that completed successfully both before and after this aborted decoy? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"What effect will that have on users who have chosen default workunit times over 6 hours? Is this 6 hours per decoy or 6 hours for the whole workunit? If it only aborts one decoy, will the other decoys still continue, with credit for the decoys that completed successfully both before and after this aborted decoy?" Yes, he's talking about per model. If any models that run that long are cut off, it would help assure a more consistent runtime in line with each person's stated preference. Not perfect, but better than having some specific models haul off and run for 12 hours. So, yes, if time remains for the task, another model may begin. I won't comment on credit, because it's not my decision, and so far as I know no specific decision has been made yet. But the project has always maintained that even "failures" provide information valuable to advancing the project. At present, the model would run for (sometimes) as much as 12 hours or more, and you'd get the same credit average as those that are running models with the more average runtime under 3hrs, so if nothing else, just cutting it off at 6 hours (or whatever length is deemed appropriate) is preventing you from running for more than that, for essentially zero credit. So, this approach limits your credit loss, if nothing else. Rosetta Moderator: Mod.Sense |
ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
My 1.47 cc2_1_8_mammoth-tasks have all crashed on Ralph, now my 1.47 cc2_1_8_native-tasks are crashing on Rosetta. I believe I am not allowed access to rah_queue_ops ;-) so I cannot check your observation. However, my Ralph mammoth-failures flourish, the ultimate example: cc2_1_8_mammoth_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_7_6585_1_0 When this is said I seem to have reconciled with Rosetta by rebooting the computer in question. Why this was suddenly necessary on a computer with no new program installations, no new configurations, no system upgrades, no separate computing on the side, and successfully computing 1.47-tasks 24 hours earlier, I am unable to explain. Even the subsequently installed Boinc 6.5 works like a charm. So I am loaded with tasks for a peaceful Christmas session and hope for the best until reporting time next weekend. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
come on guys, you say this stuff is tested and ok and then it bombs on a windows machine. can someone tell me if this is a program error or an error caused by too high of an OC speed? being that not all the tasks I get error out it would seem more of a case of a bad program and not the OC speed. see below for a series of tasks that died part of the way through. https://boinc.bakerlab.org/rosetta/result.php?resultid=215716365 cc2_1_8_native_cen_cst_hb_t311__IGNORE_THE_REST_2B5AA_7_5843_16_0 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 4999.172 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> ------ https://boinc.bakerlab.org/rosetta/result.php?resultid=215736070 Name t074_1_RDC_NMR_NESG_5568_92427_0 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 9133.313 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 ---------------- https://boinc.bakerlab.org/rosetta/result.php?resultid=215742498 Name 1wjbA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjbA-_5478_130_1 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 2.984375 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> ------------ https://boinc.bakerlab.org/rosetta/result.php?resultid=215811069 Name t073_1_RDC_NMR_NESG_5563_143956_0 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 12305.66 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 --------------- https://boinc.bakerlab.org/rosetta/result.php?resultid=215833987 Name t073_1_RDC_NMR_NESG_5563_146392_0 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 8922.172 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 -------------------- |
xsc2 Joined: 9 Jul 08 Posts: 4 Credit: 62,354 RAC: 0 |
Exit status -1073741819 (0xc0000005) https://boinc.bakerlab.org/rosetta/result.php?resultid=216178769 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This makes 10 tasks in a day's time that have died with the 0xc error. COME ON! This ran to within 10 minutes of completion and died. Gees! Then you insult me with no credit granted for a 99% completed task. https://boinc.bakerlab.org/rosetta/result.php?resultid=216155882 1g47A_BOINC_MPZN_vanilla_abrelax_5901_6856_0 Workunit 196996323 Outcome Client error Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 13796 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 |
A Few Good Men Joined: 25 Mar 07 Posts: 14 Credit: 2,031,382 RAC: 0 |
Well... The latest attempt to effectively utilize @home computers to further mankind in medical fields has reduced my last machine into a power-wasting room heater. Just for the fun of it, go to a Rosetta server acquiring results from the last 2 versions and search "Outcome Client error". I'll check back after a few months to see if things are any better here. |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
wuid=196939593 <core_client_version>6.2.15</core_client_version> <![CDATA[ <message> process got signal 8 </message> <stderr_txt> # cpu_run_time_pref: 7200 ********************************************************************** Rosetta is going too long. Watchdog is ending the run! CPU time: 26914 seconds. Greater than 3X preferred time: 7200 seconds ********************************************************************** called boinc_finish </stderr_txt> ]]> |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
your vanilla task died at 2hrs and 23 mins. this makes about 12 failures now in 2 days. https://boinc.bakerlab.org/rosetta/result.php?resultid=216178144 1g47A_BOINC_MPZN_vanilla_abrelax_5901_7554_0 Client state Compute error Exit status -1073741819 (0xc0000005) CPU time 8912.25 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 |
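
Two numeric details recur in the reports in this thread: the paired exit codes (`-1073741819 (0xc0000005)` on Windows, `193 (0xc1, -63)` on Mac and Linux) and the watchdog message "Greater than 3X preferred time". The sketch below only illustrates the arithmetic behind those numbers. The function names are hypothetical and are not part of BOINC or minirosetta, and the strict `>` comparison in the watchdog check is an assumption based on the wording of the log message.

```python
def to_unsigned(value: int, bits: int) -> int:
    """Reinterpret a (possibly negative) integer as an unsigned value of the given width."""
    return value & ((1 << bits) - 1)


def to_signed(value: int, bits: int) -> int:
    """Reinterpret an integer as a two's-complement signed value of the given width."""
    value = to_unsigned(value, bits)
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def watchdog_should_end(cpu_seconds: float, preferred_seconds: float, factor: int = 3) -> bool:
    """Hypothetical check mirroring the log text 'Greater than 3X preferred time'.

    The factor of 3 comes from the stderr output quoted in the thread; the strict
    comparison is an assumption.
    """
    return cpu_seconds > factor * preferred_seconds


if __name__ == "__main__":
    # Windows reports: "Exit status -1073741819 (0xc0000005)" are the same 32-bit value.
    print(hex(to_unsigned(-1073741819, 32)))  # 0xc0000005

    # Mac/Linux reports: "exited with code 193 (0xc1, -63)" are the same 8-bit value.
    print(hex(193), to_signed(193, 8))  # 0xc1 -63

    # Watchdog report: CPU time 26914 s against cpu_run_time_pref 7200 s (3 x 7200 = 21600).
    print(watchdog_should_end(26914, 7200))  # True
```

Read this way, the decimal and hexadecimal forms in each report are a single status value shown as signed and unsigned, and the one watchdog report in the thread exceeded its 21600-second limit by 5314 seconds of CPU time.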
©2024 University of Washington
https://www.bakerlab.org