Minirosetta v1.45 bug thread

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 57612 - Posted: 5 Dec 2008, 1:52:42 UTC

Please post bugs and issues regarding minirosetta version 1.45.

This update includes fixes for long runtimes in 'relax' jobs, validation errors, checkpoint recovery issues, and numerical instability in hydrogen-bond scoring.

We think we might have fixed the preemption problem, so please keep an eye out for it. The "can't acquire lockfile" issue might also be related. If you are having lockfile problems, please make sure no other BOINC applications are running in the same slot. If necessary, shut down the client, make sure no BOINC apps are still running, and then restart the client.

googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,813,645
RAC: 3,531
Message 57625 - Posted: 5 Dec 2008, 14:52:06 UTC
Last modified: 5 Dec 2008, 14:52:37 UTC

Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 57626 - Posted: 5 Dec 2008, 16:02:43 UTC - in response to Message 57625.  

oh, sorry about that. will do.

ChiTownDale
Joined: 10 Dec 05
Posts: 3
Credit: 57,428
RAC: 0
Message 57639 - Posted: 6 Dec 2008, 1:11:01 UTC

This problem is getting very frustrating. I run Rosetta as well as nine other BOINC tasks, so I try to balance approximate compute times across all of them except the Climate Prediction task, which takes 2000 CPU minutes per task but allows two years to complete each task.
So when a Rosetta task runs away for 18-22 hours of CPU time, I end up aborting it, since it says a little more than 9 minutes are left out of the roughly 6 hours it was estimated to run.

All other BOINC-based projects are fairly accurate, and none have come close to 3-4 times the initial estimate as these have. Right now I have two Rosetta tasks running, and both have hung, one at a little more than nine minutes and the other at ten.
So until this problem is resolved I have no choice but to suspend all further Rosetta tasks. I feel bad about having aborted those 5-6 tasks previously, since that is over 100 hours of CPU time wasted, with two more still suspended. That comes to over 160 hours of CPU time wasted, when that amount of time could have completed dozens of tasks for other projects. Even SETI hasn't had a task that took more than 40 hours, while the LHC@home tasks only take about two hours each, so I could have processed 80 of them with the CPU time wasted on Rosetta 1.40 tasks.
So I do hope they can fix what is wrong so I can get back to processing Rosetta tasks again.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 57640 - Posted: 6 Dec 2008, 1:13:49 UTC - in response to Message 57639.  

do you still have the names of those problem tasks? can you try our recently updated version and see if you have the same problems?

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 57644 - Posted: 6 Dec 2008, 4:28:21 UTC
Last modified: 6 Dec 2008, 4:31:32 UTC

Looks like they were all v1.40 tasks so far.
https://boinc.bakerlab.org/rosetta/results.php?hostid=812687

ChiTownDale, have you seen problems like this with v1.45? It includes changes that should eliminate the long-running models and unpredictable completion times.
Rosetta Moderator: Mod.Sense

JChojnacki
Joined: 17 Sep 05
Posts: 71
Credit: 10,747,694
RAC: 4,384
Message 57645 - Posted: 6 Dec 2008, 6:10:02 UTC

This WU failed:
212238609


Guus Gerritsen van der Hoop
Joined: 7 Feb 06
Posts: 1
Credit: 2,010,139
RAC: 0
Message 57646 - Posted: 6 Dec 2008, 8:38:58 UTC

I run Rosetta on two computers and seem to be unable to get new work due to the following error. What could be the problem?
Gus.

6-12-2008 8:48:22|rosetta@home|Sending scheduler request: To fetch work. Requesting 30240 seconds of work, reporting 0 completed tasks
6-12-2008 8:48:27|rosetta@home|Scheduler request succeeded: got 0 new tasks
6-12-2008 8:48:27|rosetta@home|Message from server: Server error: can't attach shared memory
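
For anyone reading log lines like these, here is a minimal sketch of pulling them apart. It assumes each line is three pipe-separated fields (timestamp, project, message) and that the timestamp is day-month-year, which matches the date of this post; the function name `parse_boinc_line` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

def parse_boinc_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    # Split at most twice so any '|' inside the message text is preserved.
    stamp, project, message = line.split("|", 2)
    # Assumption: day-month-year order, so 6-12-2008 is 6 Dec 2008.
    when = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
    return when, project, message

log = [
    "6-12-2008 8:48:22|rosetta@home|Sending scheduler request: To fetch work. "
    "Requesting 30240 seconds of work, reporting 0 completed tasks",
    "6-12-2008 8:48:27|rosetta@home|Scheduler request succeeded: got 0 new tasks",
    "6-12-2008 8:48:27|rosetta@home|Message from server: Server error: "
    "can't attach shared memory",
]

parsed = [parse_boinc_line(entry) for entry in log]

# Messages that the server itself flagged as errors.
server_errors = [msg for _, _, msg in parsed if "Server error" in msg]

# The amount of work requested, converted from seconds to hours.
requested_hours = 30240 / 3600

for msg in server_errors:
    print(msg)
print(f"requested about {requested_hours:.1f} hours of work")
```

Read this way, the scheduler request itself succeeded and the error text came back from the server, which suggests the problem is on the project side rather than with the client setup.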


Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57647 - Posted: 6 Dec 2008, 9:01:59 UTC - in response to Message 57646.  

read this thread for more info.

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57648 - Posted: 6 Dec 2008, 12:40:04 UTC


Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57649 - Posted: 6 Dec 2008, 12:46:03 UTC - in response to Message 57648.  

problem here


https://boinc.bakerlab.org/rosetta/results.php?hostid=267483&offset=20



these should be reported in the 1.40 thread. very odd, aborted and then detached and then completed ok on some of them. 3 different users.

RottenMutt
Joined: 2 Jan 07
Posts: 2
Credit: 249,397
RAC: 0
Message 57650 - Posted: 6 Dec 2008, 13:46:01 UTC - in response to Message 57649.  
Last modified: 6 Dec 2008, 13:51:15 UTC

lots of mini 1.45 compute errors here, with just a few successes.

no problems with beta 5.98.

Mattia Verga
Joined: 15 Jul 06
Posts: 3
Credit: 124,357
RAC: 0
Message 57652 - Posted: 6 Dec 2008, 15:14:42 UTC

Error code 193 here:

212518199



Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57653 - Posted: 6 Dec 2008, 15:22:50 UTC - in response to Message 57650.  

Are you OC'd at all? If so, try lowering your speed a bit.
I've found that some of these tasks are speed sensitive.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 57659 - Posted: 6 Dec 2008, 19:43:17 UTC

Task 212334548, workunit 193576676 failed on my iMac2 (10.4.11) after about half an hour. It seems to have been completed successfully by someone on an XP system.


<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
SIGBUS: bus error

Crashed executable name: minirosetta_1.45_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Fri Dec 5 12:23:43 2008

Thread 0 Crashed:
0 ...etta_1.45_i686-apple-darwin 0x00226273 __ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_ + 41
1 ...etta_1.45_i686-apple-darwin 0x0001596f __ZNK4core12conformation12Conformation17update_domain_mapERN9ObjexxFCL8FArray1DIiEE + 403
2 ...etta_1.45_i686-apple-darwin 0x000830f3 __ZN4core4pose4Pose13scoring_beginEN7utility7pointer10owning_ptrINS_7scoring17ScoreFunctionInfoEEE + 1329
3 ...etta_1.45_i686-apple-darwin 0x000fa4e2 __ZNK4core7scoring13ScoreFunctionclERNS_4pose4PoseE + 4686
4 ...etta_1.45_i686-apple-darwin 0x001938b1 __ZNK9protocols8abinitio18AbrelaxApplication13process_decoyERN4core4pose4PoseERKNS2_7scoring13ScoreFunctionESsRNS2_2io6silent12SilentStructE + 35
5 ...etta_1.45_i686-apple-darwin 0x001afe27 __ZN9protocols8abinitio18AbrelaxApplication4foldEv + 9651
6 ...etta_1.45_i686-apple-darwin 0x001b5381 __ZN9protocols8abinitio18AbrelaxApplication3runEv + 881
7 ...etta_1.45_i686-apple-darwin 0x00009a87 _main + 3941
8 ...etta_1.45_i686-apple-darwin 0x0000292e __start + 216
9 ...etta_1.45_i686-apple-darwin 0x00002855 start + 41

Thread 1:
0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7
1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398
2 /usr/lib/libSystem.B.dylib 0x9003a222 _usleep + 82
3 ...etta_1.45_i686-apple-darwin 0x00516bd1 __Z11boinc_sleepd + 197
4 ...etta_1.45_i686-apple-darwin 0x001f8583 __Z12timer_threadPv + 77
5 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84

Thread 2:
0 /usr/lib/libSystem.B.dylib 0x90037b57 _mach_wait_until + 7
1 /usr/lib/libSystem.B.dylib 0x9003799e _nanosleep + 398
2 /usr/lib/libSystem.B.dylib 0x900377d9 _sleep + 121
3 ...etta_1.45_i686-apple-darwin 0x0051f4c0 __ZN9protocols5boinc8watchdog13main_watchdogEPv + 548
4 /usr/lib/libSystem.B.dylib 0x90024227 __pthread_body + 84

Thread 0 crashed with X86 Thread State (32-bit):
eax: 0x00000000 ebx: 0x00000000 ecx: 0x00000000 edx: 0x00000000
edi: 0x00000000 esi: 0x00000000 ebp: 0xbfffb0a8 esp: 0x00000000
ss: 0x00000000 efl: 0x00000000 eip: 0x0096e325 cs: 0x00000000
ds: 0x00000000 es: 0x00000000 fs: 0x00000000 gs: 0x00000000

Binary Images Description:
0x1000 - 0x12b2fff /Library/Application Support/BOINC Data/slots/0/../../projects/boinc.bakerlab.org_rosetta/minirosetta_1.45_i686-apple-darwin
0x162c000 - 0x170afff /usr/lib/libxml2.2.dylib
0x90000000 - 0x90171fff /usr/lib/libSystem.B.dylib
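
The mangled names in the crashed thread can be partly decoded by hand. The sketch below is a deliberately simplified reader for plain nested names of the form `_ZN<len><name>...E`; it ignores arguments, templates, operators and substitutions, so it is not a real demangler (a tool such as `c++filt` does the full job). The helper name `nested_name` is made up for illustration.

```python
import re

def nested_name(symbol):
    """Return the '::'-joined scope of a simple mangled nested name, or None."""
    # The Mac trace shows an extra leading underscore, so strip them all.
    s = symbol.lstrip("_")
    if not s.startswith("ZN"):
        return None  # not a nested name (e.g. main, or a free function)
    pos = 2
    if s[pos:pos + 1] == "K":
        pos += 1  # 'K' marks a const member function
    parts = []
    while True:
        # Each component is a decimal length followed by that many characters.
        m = re.match(r"\d+", s[pos:])
        if m is None:
            break
        length = int(m.group())
        start = pos + m.end()
        parts.append(s[start:start + length])
        pos = start + length
    return "::".join(parts) if parts else None

frames = [
    "__ZNK4core10kinematics8AtomTree17update_domain_mapERN9ObjexxFCL8FArray1DIiEERKNS_2id10AtomID_MapIbEESA_",
    "__ZN9protocols8abinitio18AbrelaxApplication4foldEv",
    "__ZN9protocols8abinitio18AbrelaxApplication3runEv",
]

for frame in frames:
    print(nested_name(frame))
```

With this reading, frame 0 (where the bus error was raised) is `core::kinematics::AtomTree::update_domain_map`, reached from `protocols::abinitio::AbrelaxApplication::fold`.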


JChojnacki
Joined: 17 Sep 05
Posts: 71
Credit: 10,747,694
RAC: 4,384
Message 57662 - Posted: 6 Dec 2008, 22:42:51 UTC

This WU failed with exit code -1073741819 (0xc0000005)

212340099
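
The decimal and hex forms of these exit codes are the same bit pattern read as signed or unsigned. A minimal sketch of the conversion follows; the helper names are made up for illustration, and the bit widths (32 for the Windows code, 8 for the "193 (0xc1, -63)" form reported elsewhere in this thread) are assumptions based on how the numbers line up.

```python
def as_unsigned(code, bits=32):
    """Reinterpret a (possibly negative) exit code as an unsigned value."""
    return code & ((1 << bits) - 1)

def as_signed(value, bits=32):
    """Reinterpret an unsigned value as a two's-complement signed integer."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value

# The Windows exit status from this post, shown both ways.
print(hex(as_unsigned(-1073741819)))  # 0xc0000005
print(as_signed(0xC0000005))          # -1073741819

# The Mac/Linux code 193, read as a signed 8-bit value.
print(hex(193), as_signed(193, bits=8))  # 0xc1 -63
```

So -1073741819 and 0xc0000005 are one and the same code, which Windows uses for an access violation.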

Sid Celery
Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 10,612
Message 57664 - Posted: 7 Dec 2008, 4:25:16 UTC

Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).

In addition, similar to RottenMutt and JChojnacki, task 212336883 errored out with:

Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000

This is the same exit code I reported under the Mini 1.34 thread here but with an access violation at a different address this time.

Hope that helps.
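
In case it helps with comparing reports, here is a rough sketch that pulls the fields out of a "Reason:" line like the one above. The regular expression is written against the lines quoted in this thread only; accepting "write" as well as "read" is an assumption, and the names `REASON` and `parse_reason` are made up for illustration.

```python
import re

# Pattern modelled on the "Reason:" lines quoted in this thread.
REASON = re.compile(
    r"Reason: (?P<kind>.+?) \((?P<code>0x[0-9a-fA-F]+)\) at address "
    r"(?P<pc>0x[0-9a-fA-F]+) (?P<access>read|write) attempt to address "
    r"(?P<target>0x[0-9a-fA-F]+)"
)

def parse_reason(line):
    """Return the fields of a 'Reason:' line as a dict, or None if no match."""
    m = REASON.search(line)
    if m is None:
        return None
    info = m.groupdict()
    # A target address of zero is what a null pointer access would look like.
    info["target_is_null"] = int(info["target"], 16) == 0
    return info

line = ("Reason: Access Violation (0xc0000005) at address 0x004EAD47 "
        "read attempt to address 0x00000000")
info = parse_reason(line)
print(info["kind"], info["pc"], info["access"], info["target_is_null"])
```

Two reports with the same `pc` value and a null target are likely hitting the same spot in the code, which makes the faulting address a handy key for grouping results.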

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 57665 - Posted: 7 Dec 2008, 4:59:43 UTC

I'm getting a bunch of errors from cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs WUs:

https://boinc.bakerlab.org/rosetta/result.php?resultid=212352758
https://boinc.bakerlab.org/rosetta/result.php?resultid=212299725
https://boinc.bakerlab.org/rosetta/result.php?resultid=212268137
https://boinc.bakerlab.org/rosetta/result.php?resultid=212215308
https://boinc.bakerlab.org/rosetta/result.php?resultid=212192548

After running a while, the WUs exit with code 193 and a stack trace.

Note that this is on 4 different Linux nodes (all of which were running well with version 1.40, except for the NANs problem).

David Ball
Joined: 25 Nov 05
Posts: 25
Credit: 1,439,333
RAC: 0
Message 57668 - Posted: 7 Dec 2008, 8:35:17 UTC

Vista 64 bit on stock HP machine with Q6600 CPU and 5 GB memory - no OC
BOINC 6.2.19
App: Mini 1.45

Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hr1958_olange_5387_12341_1

Ran for around 4 hours and exited with
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000

Stack trace is in the result

https://boinc.bakerlab.org/rosetta/result.php?resultid=212406604
Have you read a good Science Fiction book lately?

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 57670 - Posted: 7 Dec 2008, 8:58:23 UTC

I have had 3 WUs error out on me, but it seems to be much more stable than it was:
https://boinc.bakerlab.org/rosetta/result.php?resultid=212602945
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 20:17:24 UTC
CPU time 18577.93
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007FA877 read attempt to address 0x1F59DCA6

Engaging BOINC Windows Runtime Debugger...



********************



https://boinc.bakerlab.org/rosetta/result.php?resultid=212495875
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 10:40:05 UTC
CPU time 6441.172
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...



https://boinc.bakerlab.org/rosetta/result.php?resultid=212434493
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 4:04:00 UTC
CPU time 13200.43
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...



********************
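
To put the three failures above in perspective, here is a small sketch that converts each CPU time to hours and compares it with the run-time preference. It assumes `cpu_run_time_pref` is the target run time in seconds (so 28800 would be 8 hours); the function name `summarize` is made up for illustration.

```python
# (result id, CPU seconds at failure, cpu_run_time_pref in seconds),
# copied from the three reports above.
reports = [
    (212602945, 18577.93, 28800),
    (212495875, 6441.172, 28800),
    (212434493, 13200.43, 28800),
]

def summarize(cpu_seconds, pref_seconds):
    """Return (hours of CPU used, percent of the preferred run time)."""
    hours = cpu_seconds / 3600
    percent = 100 * cpu_seconds / pref_seconds
    return hours, percent

for result_id, cpu, pref in reports:
    hours, percent = summarize(cpu, pref)
    print(f"{result_id}: failed after {hours:.2f} h "
          f"({percent:.0f}% of the {pref / 3600:.0f} h target)")
```

Under that assumption, all three tasks crashed partway through their run rather than at the very start or at the end.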



