Minirosetta v1.45 bug thread

Author	Message
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 57612 - Posted: 5 Dec 2008, 1:52:42 UTC Please post bugs and issues regarding minirosetta version 1.45. This update includes fixes for long runtimes in 'relax' jobs, validation errors, checkpoint recovery issues, and numerical instability in hydrogen-bond scoring. We think we may have fixed the preemption problem, so please keep an eye out for it. The "can't acquire lockfile" issue might also be related. If you are having lockfile problems, please make sure no other BOINC applications are running in the same slot. If necessary, shut down the client, make sure no BOINC apps are still running, and then restart the client.

googloo Joined: 15 Sep 06 Posts: 137 Credit: 24,022,414 RAC: 24	Message 57625 - Posted: 5 Dec 2008, 14:52:06 UTC Last modified: 5 Dec 2008, 14:52:37 UTC Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important.

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 57626 - Posted: 5 Dec 2008, 16:02:43 UTC - in response to Message 57625. "Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important." Oh, sorry about that. Will do.

ChiTownDale Joined: 10 Dec 05 Posts: 3 Credit: 57,428 RAC: 0	Message 57639 - Posted: 6 Dec 2008, 1:11:01 UTC This problem is getting very frustrating. I run Rosetta as well as nine other BOINC tasks, so I try to balance approximate compute times across all but the Climate Prediction task, which takes 2000 CPU minutes per task but allows two years to complete each task. So when a Rosetta task runs away for 18-22 hours of CPU time, I end up aborting it, since it says a little more than 9 minutes left out of the roughly 6 hours it was estimated to run. All other BOINC-based projects are fairly accurate, and none have come close to 3-4 times the initial estimate as these have. Right now I have two Rosetta tasks running, and both have hung, one at a little more than nine minutes remaining and the other at ten. So until this problem is resolved, I have no choice but to suspend all further Rosetta tasks. I feel bad having to abort those 5-6 tasks previously, since that is over 100 hours of CPU time wasted, with two still suspended. That comes to over 160 hours of CPU time wasted, when that amount of time could have completed dozens of tasks for other projects. Even SETI hasn't had a task that took more than 40 hours, while the LHCAT tasks only take about two hours each, so I could have processed 80 of them with the CPU time wasted on Rosetta 1.40 tasks. So I do hope they can fix what is wrong so I can get back to processing Rosetta tasks again.

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 57640 - Posted: 6 Dec 2008, 1:13:49 UTC - in response to Message 57639. "This problem is getting very frustrating. I run Rosetta as well as nine other BOINC tasks, so I try to balance approximate compute times across all but the Climate Prediction task, which takes 2000 CPU minutes per task but allows two years to complete each task. So when a Rosetta task runs away for 18-22 hours of CPU time, I end up aborting it, since it says a little more than 9 minutes left out of the roughly 6 hours it was estimated to run. All other BOINC-based projects are fairly accurate, and none have come close to 3-4 times the initial estimate as these have. Right now I have two Rosetta tasks running, and both have hung, one at a little more than nine minutes remaining and the other at ten. So until this problem is resolved, I have no choice but to suspend all further Rosetta tasks. I feel bad having to abort those 5-6 tasks previously, since that is over 100 hours of CPU time wasted, with two still suspended. That comes to over 160 hours of CPU time wasted, when that amount of time could have completed dozens of tasks for other projects. Even SETI hasn't had a task that took more than 40 hours, while the LHCAT tasks only take about two hours each, so I could have processed 80 of them with the CPU time wasted on Rosetta 1.40 tasks. So I do hope they can fix what is wrong so I can get back to processing Rosetta tasks again." Do you still have the names of those problem tasks? Can you try our recently updated version and see if you have the same problems?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 57644 - Posted: 6 Dec 2008, 4:28:21 UTC Last modified: 6 Dec 2008, 4:31:32 UTC Looks like they were all v1.40 tasks so far. https://boinc.bakerlab.org/rosetta/results.php?hostid=812687 ChiTownDale, have you seen problems like this with v1.45? It includes changes that should eliminate the long-running models and the unpredictable completion times. Rosetta Moderator: Mod.Sense

JChojnacki Joined: 17 Sep 05 Posts: 71 Credit: 11,951,807 RAC: 3	Message 57645 - Posted: 6 Dec 2008, 6:10:02 UTC This WU failed: 212238609

Guus Gerritsen van der Hoop Joined: 7 Feb 06 Posts: 1 Credit: 2,010,139 RAC: 0	Message 57646 - Posted: 6 Dec 2008, 8:38:58 UTC I run Rosetta on two computers and seem to be unable to get new work due to the following error. What could be the problem? Gus. 6-12-2008 8:48:22|rosetta@home|Sending scheduler request: To fetch work. Requesting 30240 seconds of work, reporting 0 completed tasks 6-12-2008 8:48:27|rosetta@home|Scheduler request succeeded: got 0 new tasks 6-12-2008 8:48:27|rosetta@home|Message from server: Server error: can't attach shared memory

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 57647 - Posted: 6 Dec 2008, 9:01:59 UTC - in response to Message 57646. "I run Rosetta on two computers and seem to be unable to get new work due to the following error. What could be the problem? Gus. 6-12-2008 8:48:22|rosetta@home|Sending scheduler request: To fetch work. Requesting 30240 seconds of work, reporting 0 completed tasks 6-12-2008 8:48:27|rosetta@home|Scheduler request succeeded: got 0 new tasks 6-12-2008 8:48:27|rosetta@home|Message from server: Server error: can't attach shared memory" Read this thread for more info.

rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0	Message 57648 - Posted: 6 Dec 2008, 12:40:04 UTC problem here https://boinc.bakerlab.org/rosetta/results.php?hostid=267483&offset=20

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 57649 - Posted: 6 Dec 2008, 12:46:03 UTC - in response to Message 57648. "problem here https://boinc.bakerlab.org/rosetta/results.php?hostid=267483&offset=20" These should be reported in the 1.40 thread. Very odd: some of them were aborted, then detached, and then completed OK. 3 different users.

RottenMutt Joined: 2 Jan 07 Posts: 2 Credit: 249,397 RAC: 0	Message 57650 - Posted: 6 Dec 2008, 13:46:01 UTC - in response to Message 57649. Last modified: 6 Dec 2008, 13:51:15 UTC lots of mini 1.45 compute errors here, with just a few successes. no problems with beta 5.98.

Mattia Verga Joined: 15 Jul 06 Posts: 3 Credit: 124,357 RAC: 0	Message 57652 - Posted: 6 Dec 2008, 15:14:42 UTC Error code 193 here: 212518199

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 57653 - Posted: 6 Dec 2008, 15:22:50 UTC - in response to Message 57650. "lots of mini 1.45 compute errors here, with just a few successes. no problems with beta 5.98." Are you OC'd at all? If so, try lowering your speed a bit. I've found that some of these tasks are speed sensitive.

svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0	Message 57659 - Posted: 6 Dec 2008, 19:43:17 UTC Task 212334548, workunit 193576676 failed on my iMac2 10.4.11 after about half an hour. It seems to have been completed successfully by someone on an XP system.
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
SIGBUS: bus error
Crashed executable name: minirosetta_1.45_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Fri Dec 5 12:23:43 2008
Thread 0 Crashed:
0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41
1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403
2 ...etta_1.45_i686-apple-darwin 0x000830f3 __ZN4core4pose4Pose13scoring_beginEN7utility7pointer10owning_ptrINS_7scoring17ScoreFunctionInfoEEE + 1329
3 ...etta_1.45_i686-apple-darwin 0x000fa4e2 __ZNK4core7scoring13ScoreFunctionclERNS_4pose4PoseE + 4686
4 ...etta_1.45_i686-apple-darwin 0x001938b1 __ZNK9protocols8abinitio18AbrelaxApplication13process_decoyERN4core4pose4PoseERKNS2_7scoring13ScoreFunctionESsRNS2_2io6silent12SilentStructE + 35
5 ...etta_1.45_i686-apple-darwin 0x001afe27 __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 9651
6 ...etta_1.45_i686-apple-darwin 0x001b5381 __ZN9protocols8abinitio18AbrelaxApplication3runEv + 881
7 ...etta_1.45_i686-apple-darwin 0x00009a87 _main + 3941
8 ...etta_1.45_i686-apple-darwin 0x0000292e __start + 216
9 ...etta_1.45_i686-apple-darwin 0x00002855 start + 41
Thread 1:
0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7
1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398
2 /usr/lib/libSystem.B.dylib 0x9003a222 _usleep + 82
3 ...etta_1.45_i686-apple-darwin 0x00516bd1 __Z11boinc_sleepd + 197
4 ...etta_1.45_i686-apple-darwin 0x001f8583 __Z12timer_threadPv + 77
5 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84
Thread 2:
0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7
1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398
2 /usr/lib/libSystem.B.dylib 0x900377d9 _sleep + 121
3 ...etta_1.45_i686-apple-darwin 0x0051f4c0 __ZN9protocols5boinc8watchdog13main_watchdogEPv + 548
4 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84
Thread 0 crashed with X86 Thread State (32-bit):
eax: 0x00000000 ebx: 0x00000000 ecx: 0x00000000 edx: 0x00000000
edi: 0x00000000 esi: 0x00000000 ebp: 0xbfffb0a8 esp: 0x00000000
ss: 0x00000000 efl: 0x00000000 eip: 0x0096e325 cs: 0x00000000
ds: 0x00000000 es: 0x00000000 fs: 0x00000000 gs: 0x00000000
Binary Images Description:
0x1000 - 0x12b2fff /Library/Application Support/BOINC Data/slots/0/../../projects/boinc.bakerlab.org_rosetta/minirosetta_1.45_i686-apple-darwin
0x162c000 - 0x170afff /usr/lib/libxml2.2.dylib
0x90000000 - 0x90171fff /usr/lib/libSystem.B.dylib

JChojnacki Joined: 17 Sep 05 Posts: 71 Credit: 11,951,807 RAC: 3	Message 57662 - Posted: 6 Dec 2008, 22:42:51 UTC This WU failed with exit code -1073741819 (0xc0000005): 212340099

Sid Celery Joined: 11 Feb 08 Posts: 2584 Credit: 47,220,881 RAC: 79	Message 57664 - Posted: 7 Dec 2008, 4:25:16 UTC Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative). In addition, similar to RottenMutt and JChojnacki, task 212336883 errored out with:
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
This is the same exit code I reported under the Mini 1.34 thread here but with an access violation at a different address this time. Hope that helps.
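The exit codes quoted in this thread appear in more than one form: -1073741819 alongside 0xc0000005 in the Windows reports, and 193 alongside 0xc1 and -63 in the Mac report. A minimal Python sketch of how those forms relate is below. It is not taken from BOINC or Rosetta code; the helper names `to_unsigned32` and `to_signed8` are hypothetical, and the assumption is simply that the signed values are two's-complement readings of the same bits.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def to_signed8(code: int) -> int:
    """Reinterpret the low byte of an exit code as a signed 8-bit value."""
    low = code & 0xFF
    return low - 0x100 if low >= 0x80 else low


# Windows reports in this thread: -1073741819 is shown next to 0xc0000005.
print(hex(to_unsigned32(-1073741819)))

# Mac report in this thread: "process exited with code 193 (0xc1, -63)".
print(hex(193), to_signed8(193))
```

Read this way, every Windows failure listed here with exit status -1073741819 carries the same code, 0xc0000005, which the stderr output labels as an Access Violation; the differing part between reports is the faulting address.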

AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0	Message 57665 - Posted: 7 Dec 2008, 4:59:43 UTC I'm getting a bunch of errors from cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=212352758 https://boinc.bakerlab.org/rosetta/result.php?resultid=212299725 https://boinc.bakerlab.org/rosetta/result.php?resultid=212268137 https://boinc.bakerlab.org/rosetta/result.php?resultid=212215308 https://boinc.bakerlab.org/rosetta/result.php?resultid=212192548 After running a while, the WUs exit with code 193 and a stack trace. Note that this is on 4 different Linux nodes (all of which were running well with version 1.40, except for the NANs problem).

David Ball Joined: 25 Nov 05 Posts: 25 Credit: 1,439,333 RAC: 0	Message 57668 - Posted: 7 Dec 2008, 8:35:17 UTC
Vista 64 bit on stock HP machine with Q6600 CPU and 5 GB memory - no OC
BOINC 6.2.19
App: Mini 1.45
Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hr1958_olange_5387_12341_1
Ran for around 4 hours and exited with
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
Stack trace is in the result https://boinc.bakerlab.org/rosetta/result.php?resultid=212406604
Have you read a good Science Fiction book lately?

Rifleman Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0	Message 57670 - Posted: 7 Dec 2008, 8:58:23 UTC I have had 3 WUs error out on me, but it seems to be much more stable than it was:
https://boinc.bakerlab.org/rosetta/result.php?resultid=212602945
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 20:17:24 UTC
CPU time 18577.93
stderr out
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007FA877 read attempt to address 0x1F59DCA6
Engaging BOINC Windows Runtime Debugger...
******************
https://boinc.bakerlab.org/rosetta/result.php?resultid=212495875
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 10:40:05 UTC
CPU time 6441.172
stderr out
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
Engaging BOINC Windows Runtime Debugger...
https://boinc.bakerlab.org/rosetta/result.php?resultid=212434493
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 4:04:00 UTC
CPU time 13200.43
stderr out
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
Engaging BOINC Windows Runtime Debugger...
******************