Message boards : Number crunching : Minirosetta v1.45 bug thread
Author | Message |
---|---|
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Please post bugs and issues regarding minirosetta version 1.45. This update includes fixes to long runtimes for 'relax' jobs, validation errors, checkpoint recovery issues, and numerical instability in hydrogen-bond scoring. We think we might have fixed the preemption problem, so please keep an eye out for this. The "can't acquire lockfile" issue might also be related. If you are having lockfile problems, please make sure there are no other BOINC applications running in the same slot. If necessary, turn off the client, make sure no BOINC apps are running, and then restart the client. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,783,789 RAC: 5,547 |
Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
googloo wrote: "Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important." Oh, sorry about that. Will do. |
ChiTownDale Send message Joined: 10 Dec 05 Posts: 3 Credit: 57,428 RAC: 0 |
This problem is getting very frustrating. I run Rosetta as well as nine other BOINC tasks, so I try to balance approximate compute times across all but the Climate Prediction task, which takes 2000 CPU minutes per task but allows two years to complete each task. So when a Rosetta task runs away for 18-22 hours of CPU time, I end up aborting it, since it says a little more than 9 minutes left out of the roughly 6 hours it was estimated to run. All other BOINC-based projects are fairly accurate, and none have come close to 3-4 times the initial estimate as these have. Right now I have two Rosetta tasks running, and both have hung, one at a little more than nine minutes and the other at ten. So until this problem is resolved I have no choice but to suspend all further Rosetta tasks. I feel bad about having to abort those 5-6 tasks previously, since that is over 100 hours of CPU time wasted, with two still suspended. That comes to over 160 hours of CPU time wasted, when that amount of time could have completed dozens of tasks for other projects. Even SETI hasn't had a task that took more than 40 hours, while the LHCAT tasks only take about two hours each, so I could have processed 80 of them with the wasted CPU time from Rosetta 1.40 tasks. So I do hope they can fix what is wrong so I can get back to processing Rosetta tasks again. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
ChiTownDale wrote: "This problem is getting very frustrating. I run Rosetta as well as nine other BOINC tasks, so I try to balance approximate compute times across all but the Climate Prediction task, which takes 2000 CPU minutes per task but allows two years to complete each task." Do you still have the names of those problem tasks? Can you try our recently updated version and see if you have the same problems? |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Looks like they were all v1.40 tasks so far. https://boinc.bakerlab.org/rosetta/results.php?hostid=812687 ChiTownDale, have you seen problems like this with v1.45? It includes changes that should eliminate the long-running models and unpredictable completion times. Rosetta Moderator: Mod.Sense |
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,711,593 RAC: 6,960 |
|
Guus Gerritsen van der Hoop Send message Joined: 7 Feb 06 Posts: 1 Credit: 2,010,139 RAC: 0 |
I run Rosetta on two computers and seem to be unable to get new work due to the following error. What could be the problem? Gus. 6-12-2008 8:48:22|rosetta@home|Sending scheduler request: To fetch work. Requesting 30240 seconds of work, reporting 0 completed tasks 6-12-2008 8:48:27|rosetta@home|Scheduler request succeeded: got 0 new tasks 6-12-2008 8:48:27|rosetta@home|Message from server: Server error: can't attach shared memory |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Guus Gerritsen van der Hoop wrote: "I run Rosetta on two computers and seem to be unable to get new work due to the following error. What could be the problem?" Read this thread for more info. |
rochester new york Send message Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
problem here these should be reported in the 1.40 thread. very odd, aborted and then detached and then completed ok on some of them. 3 different users. |
RottenMutt Send message Joined: 2 Jan 07 Posts: 2 Credit: 249,397 RAC: 0 |
|
Mattia Verga Send message Joined: 15 Jul 06 Posts: 3 Credit: 124,357 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"lots of mini 1.45 compute errors here, with just a few successes." Are you OC'd at all? If so, try lowering your speed a bit. I've found that some of these tasks are speed sensitive. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 212334548, workunit 193576676 failed on my iMac2 10.4.11. after about half an hour. It seems to have been completed successfully by someone on an XP system. <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # cpu_run_time_pref: 14400 SIGBUS: bus error Crashed executable name: minirosetta_1.45_i686-apple-darwin built using BOINC library version 6.5.0 Machine type Intel 80486 (32-bit executable) System version: Macintosh OS 10.4.11 build 8S2167 Fri Dec 5 12:23:43 2008 Thread 0 Crashed: 0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41 1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403 2 ...etta_1.45_i686-apple-darwin 0x000830f3 __ZN4core4pose4Pose13scoring_beginEN7utility7pointer10owning_ptrINS_7scoring17ScoreFunctionInfoEEE + 1329 3 ...etta_1.45_i686-apple-darwin 0x000fa4e2 __ZNK4core7scoring13ScoreFunctionclERNS_4pose4PoseE + 4686 4 ...etta_1.45_i686-apple-darwin 0x001938b1 __ZNK9protocols8abinitio18AbrelaxApplication13process_decoyERN4core4pose4PoseERKNS2_7scoring13ScoreFunctionESsRNS2_2io6silent12SilentStructE + 35 5 ...etta_1.45_i686-apple-darwin 0x001afe27 __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 9651 6 ...etta_1.45_i686-apple-darwin 0x001b5381 __ZN9protocols8abinitio18AbrelaxApplication3runEv + 881 7 ...etta_1.45_i686-apple-darwin 0x00009a87 _main + 3941 8 ...etta_1.45_i686-apple-darwin 0x0000292e __start + 216 9 ...etta_1.45_i686-apple-darwin 0x00002855 start + 41 Thread 1: 0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7 1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398 2 /usr/lib/libSystem.B.dylib 0x9003a222 _usleep + 82 3 ...etta_1.45_i686-apple-darwin 0x00516bd1 __Z11boinc_sleepd + 197 4 ...etta_1.45_i686-apple-darwin 0x001f8583 __Z12timer_threadPv + 77 5 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84 Thread 2: 0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7 1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398 2 /usr/lib/libSystem.B.dylib 0x900377d9 _sleep + 121 3 ...etta_1.45_i686-apple-darwin 0x0051f4c0 __ZN9protocols5boinc8watchdog13main_watchdogEPv + 548 4 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84 Thread 0 crashed with X86 Thread State (32-bit): eax: 0x00000000 ebx: 0x00000000 ecx: 0x00000000 edx: 0x00000000 edi: 0x00000000 esi: 0x00000000 ebp: 0xbfffb0a8 esp: 0x00000000 ss: 0x00000000 efl: 0x00000000 eip: 0x0096e325 cs: 0x00000000 ds: 0x00000000 es: 0x00000000 fs: 0x00000000 gs: 0x00000000 Binary Images Description: 0x1000 - 0x12b2fff /Library/Application Support/BOINC Data/slots/0/../../projects/boinc.bakerlab.org_rosetta/minirosetta_1.45_i686-apple-darwin 0x162c000 - 0x170afff /usr/lib/libxml2.2.dylib 0x90000000 - 0x90171fff /usr/lib/libSystem.B.dylib |
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,711,593 RAC: 6,960 |
|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 16,102 |
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative). In addition, similar to RottenMutt and JChojnacki, task 212336883 errored out with: Outcome Client error This is the same exit code I reported under the Mini 1.34 thread here but with an access violation at a different address this time. Hope that helps. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I'm getting a bunch of errors from cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs: https://boinc.bakerlab.org/rosetta/result.php?resultid=212352758 https://boinc.bakerlab.org/rosetta/result.php?resultid=212299725 https://boinc.bakerlab.org/rosetta/result.php?resultid=212268137 https://boinc.bakerlab.org/rosetta/result.php?resultid=212215308 https://boinc.bakerlab.org/rosetta/result.php?resultid=212192548 After running a while, the WUs exit with code 193 and a stack trace. Note that this is on 4 different Linux nodes, (all of which were running well with version 1.40, except for the NANs problem). |
David Ball Send message Joined: 25 Nov 05 Posts: 25 Credit: 1,439,333 RAC: 0 |
Vista 64 bit on stock HP machine with Q6600 CPU and 5 GB memory - no OC BOINC 6.2.19 App: Mini 1.45 Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hr1958_olange_5387_12341_1 Ran for around 4 hours and exited with - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000 Stack trace is in the result https://boinc.bakerlab.org/rosetta/result.php?resultid=212406604 Have you read a good Science Fiction book lately? |
Rifleman Send message Joined: 19 Nov 08 Posts: 17 Credit: 139,408 RAC: 0 |
I have had 3 WUs error out on me, but it seems to be much more stable than it was: https://boinc.bakerlab.org/rosetta/result.php?resultid=212602945 Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 948562 Report deadline 16 Dec 2008 20:17:24 UTC CPU time 18577.93 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 28800 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x007FA877 read attempt to address 0x1F59DCA6 Engaging BOINC Windows Runtime Debugger... ******************** https://boinc.bakerlab.org/rosetta/result.php?resultid=212495875 Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 948562 Report deadline 16 Dec 2008 10:40:05 UTC CPU time 6441.172 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 28800 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... https://boinc.bakerlab.org/rosetta/result.php?resultid=212434493 Client error Client state Compute error Exit status -1073741819 (0xc0000005) Computer ID 948562 Report deadline 16 Dec 2008 4:04:00 UTC CPU time 13200.43 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 28800 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... ******************** |
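
A note for anyone comparing the numbers in the reports above: the paired values such as "-1073741819 (0xc0000005)" and "193 (0xc1, -63)" appear to be the same exit status shown as signed and unsigned integers. The following is only a minimal sketch of that arithmetic; the helper names `as_unsigned32` and `as_signed8` are made up for illustration and are not part of BOINC or minirosetta.

```python
def as_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def as_signed8(code: int) -> int:
    """Reinterpret the low byte of an exit code as a signed 8-bit value."""
    low = code & 0xFF
    return low - 0x100 if low >= 0x80 else low


# Windows reports: exit status -1073741819 is listed next to 0xc0000005.
print(hex(as_unsigned32(-1073741819)))

# Mac report: exit code 193 is listed next to 0xc1 and -63.
print(hex(193), as_signed8(193))
```

Read this way, every Windows failure quoted in this thread carries the same status, 0xc0000005, which the logs themselves label "Access Violation", so the faulting address is the detail that differs between reports.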