Problems with version 5.96

Profile nouqraz
Joined: 8 Apr 08
Posts: 6
Credit: 328,006
RAC: 462
Message 53579 - Posted: 7 Jun 2008, 14:00:39 UTC

Woke up this morning to an error message on my screen - rosetta_beta_5.96_windows_intelx86.exe caused an error and needs to close blah blah...

This task: https://boinc.bakerlab.org/rosetta/result.php?resultid=169429650

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2005994


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BF05EB read attempt to address 0x0E3AF000

Engaging BOINC Windows Runtime Debugger...


.... full debugging info in link to task.
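
For readers comparing the two numbers in the exit-code line: they are the same 32-bit value, once read as a signed integer and once as unsigned hex. 0xc0000005 is the Windows status code for an access violation, which matches the exception record above. A minimal sketch of the conversion (the function name `exit_code_to_hex` is made up for this illustration, not part of BOINC):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    # Masking with 0xFFFFFFFF maps negative values onto their two's-complement form.
    return f"0x{code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # 0xc0000005
```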

Profile (_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 53583 - Posted: 7 Jun 2008, 19:40:43 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=169050697
https://boinc.bakerlab.org/rosetta/result.php?resultid=168985785


Profile David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53587 - Posted: 7 Jun 2008, 23:32:04 UTC
Last modified: 7 Jun 2008, 23:34:03 UTC

And here we have a 24 hour validate error:

resultid=169434889

The stderr_txt has basically nothing to say as to why this might be invalid...
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 53592 - Posted: 8 Jun 2008, 19:05:56 UTC

two validate errors with no information. full run of 4 hrs.
FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14377_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=168398596

FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_14380_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=168398613

Profile Azurrio
Joined: 20 Feb 06
Posts: 8
Credit: 237,979
RAC: 0
Message 53616 - Posted: 10 Jun 2008, 9:24:34 UTC - in response to Message 53592.  

Validate error on this which ran 24h.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=153656004

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53637 - Posted: 11 Jun 2008, 23:22:15 UTC

The WU t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173 crunched for the normal length of time, but was marked "invalid".

From the stderr:

<message><file_xfer_error>
<file_name>t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>

The other person to crunch this WU got the same error.
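
For anyone collecting these reports, here is a rough sketch of pulling the file name and error code out of a stderr excerpt like the one above. The excerpt as posted is not complete XML (the `<message>` tag is never closed), so this uses a regular expression rather than an XML parser. The function name `parse_file_xfer_errors` is an assumption made for this example, and the sketch does not try to interpret what the error code means.

```python
import re

# Excerpt copied from the post above.
FRAGMENT = """<message><file_xfer_error>
<file_name>t407__CASP8_JUMPAB_SAMPLE1_res1to79_SAVE_ALL_OUT_BARCODE_hom001__3653_173_1_0</file_name>
<error_code>-161</error_code>
<error_message></error_message>
</file_xfer_error>"""

XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>.*?)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*"
    r"<error_message>(?P<msg>.*?)</error_message>",
    re.DOTALL,
)


def parse_file_xfer_errors(text):
    """Return a list of (file_name, error_code, error_message) tuples."""
    return [(m["name"], int(m["code"]), m["msg"]) for m in XFER_RE.finditer(text)]


for name, code, msg in parse_file_xfer_errors(FRAGMENT):
    print(name, code, repr(msg))
```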

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53671 - Posted: 13 Jun 2008, 16:40:43 UTC

The WU FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_4598_1 crunched for the normal length of time, produced 1277 decoys, and the stderr looks normal, but it was marked "invalid".

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53685 - Posted: 14 Jun 2008, 11:17:28 UTC

Here's another FRA_t411_CASP8_1UAR_2_IGNORE_THE_RESTt411_2_aa1uar.man_align1.template_0010_3670_ WU that had a validate error for both crunchers.

Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 53689 - Posted: 14 Jun 2008, 16:38:10 UTC

this FRA_t401_CASP8_2PRV_2ICG_1_IGNORE_THE_RESTt401_1_aaT0401_2ICGA_13_0001_3601_17465_1 crashed and burned with a validate error on my system and one before me.

The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 53691 - Posted: 14 Jun 2008, 20:15:52 UTC

FRA_t421_CASP8_ROB_1_IGNORE_THE_RESTt421_1_t411_.finalmodel05.noNCterm_3667_452_0

CPU time 11039.21
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 2726968
==
</stderr_txt>
]]>


Validate state Invalid
Claimed credit 46.989849356256
Granted credit 0
application version 5.96

Bill Hepburn
Joined: 18 Sep 05
Posts: 14
Credit: 14,631,953
RAC: 4,633
Message 53713 - Posted: 16 Jun 2008, 15:18:36 UTC

This one ran to about 28%, then sat there "running" but consuming no CPU cycles for about 12 hours or so, until I noticed it and aborted it. That hasn't happened in ages.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156585874

ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53714 - Posted: 16 Jun 2008, 15:22:28 UTC

I have been getting a number of errors on 64-bit SMP Linux (Fedora 8):
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***


These appear to freeze BOINC, because they continue after restarting BOINC.
The tasks have the prefix t405_CASP8_JUMPAB, e.g.
t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_356_0

The stdout.txt files contain the following message many times:
res 13 and var 1 at position 1 is not a proper Nterm variant
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 7.83973
1 1 8.21224 0.746619
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 4.71364
2 1 4.14663 0.817297
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 9.48273
3 1 -10.3568 -1.28817
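
To make reports like this easier to compare between machines, the two kinds of messages can be tallied with a short script. This is only a sketch based on the excerpts above: the regular expressions assume the exact line formats shown in this post, and the function name `summarize` is made up for the example.

```python
import re
from collections import Counter

# Lines copied from the excerpts above (shortened to a few samples).
SAMPLE = """*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
*** glibc detected *** double free or corruption (!prev): 0x0959d100 ***
*** glibc detected *** free(): invalid next size (normal): 0x0959d100 ***
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 7.83973
STOP:: Pose::set_coords(): mismatch between  N,CA,C,O coords between full-coord and Eposition. dev= 4.71364
"""

# Captures the glibc complaint text that precedes the reported address.
GLIBC_RE = re.compile(r"\*\*\* glibc detected \*\*\* (.+?): 0x[0-9a-fA-F]+ \*\*\*")
# Captures the deviation value printed at the end of a set_coords STOP line.
DEV_RE = re.compile(r"Pose::set_coords\(\).*dev=\s*([0-9.]+)")


def summarize(log):
    """Count glibc complaints by kind and collect the set_coords deviations."""
    kinds = Counter(GLIBC_RE.findall(log))
    devs = [float(value) for value in DEV_RE.findall(log)]
    return kinds, devs


kinds, devs = summarize(SAMPLE)
for kind, count in kinds.items():
    print(count, kind)
print("deviations:", devs)
```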



mikus
Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 53717 - Posted: 16 Jun 2008, 17:29:26 UTC

Looked at my multi-core AMD machine (64-bit Linux client) and saw that four cores were idle. In each case, the WU had "completed" (that is, had used up as much as it should of the time allocated), but had FAILED to tell the BOINC client that the task was done. As a result, the BOINC client kept that task dispatched on a CPU core -- but since the task had finished, the corresponding core was idle. [Boincmgr showed 100% complete for these tasks, and showed that these "running" tasks were NOT accumulating any more CPU time.] Unfortunately, the only way I had to get my CPUs running again was to abort these tasks (and that in turn caused their results to be thrown away -- my system had crunched these tasks uselessly).

I've encountered this failure before with Rosetta - but that was on a different hardware system, then using the 32-bit Linux client. My conclusion is that there is a problem (at least on multi-core hardware running Linux) with how the Rosetta executable notifies the BOINC client that an application's workunit has finished.
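
The symptom described above (a task listed as running whose CPU time has stopped increasing) can be checked mechanically. The sketch below is a hypothetical illustration, not something BOINC provides: it assumes you have already recorded (wall-clock seconds, CPU seconds) pairs for a task by some means, and the function name `is_stalled` and the 600-second threshold are arbitrary choices for the example.

```python
def is_stalled(samples, min_idle_seconds=600.0):
    """Return True if CPU time has not advanced for at least min_idle_seconds.

    samples: list of (wall_clock_seconds, cpu_seconds) tuples, oldest first.
    """
    if len(samples) < 2:
        return False  # not enough history to judge
    last_wall, last_cpu = samples[-1]
    idle_since = last_wall
    # Walk backwards to find the earliest sample that already shows the latest CPU time.
    for wall, cpu in reversed(samples[:-1]):
        if cpu < last_cpu:
            break
        idle_since = wall
    return last_wall - idle_since >= min_idle_seconds


# A task that stops accumulating CPU time after the second sample.
hung = [(0, 100.0), (300, 200.0), (600, 200.0), (900, 200.0), (1200, 200.0)]
# A task that is still making progress.
busy = [(0, 100.0), (300, 200.0), (600, 300.0)]
print(is_stalled(hung), is_stalled(busy))  # True False
```

A check like this only identifies the hung task. It does not recover the result, which, as noted above, was lost when the task had to be aborted.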

Profile netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53718 - Posted: 16 Jun 2008, 18:17:15 UTC - in response to Message 53717.  

Myself as well. I am seeing a lot of these. It does not seem to matter what the processor is, but it seems to be happening on the CASP8 tasks. Any reason that CASP8 is being run on a BETA version???

I will be aborting about a dozen units here in a few minutes.



Looking for a team ??? Join BoincSynergy!!


Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 53721 - Posted: 16 Jun 2008, 18:27:44 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=170401850
just validate error with 1313 decoys generated(?)

ConflictingEmotions
Joined: 5 Jun 08
Posts: 10
Credit: 3,081,990
RAC: 0
Message 53723 - Posted: 16 Jun 2008, 18:38:48 UTC - in response to Message 53714.  

After multiple restarts of the boinc client, I have terminated these t405_CASP8_JUMPAB tasks. There clearly are some major bugs in rosetta and boinc involved here!

Profile netwraith
Joined: 3 Sep 06
Posts: 80
Credit: 13,483,227
RAC: 0
Message 53724 - Posted: 16 Jun 2008, 19:26:06 UTC - in response to Message 53718.  
Last modified: 16 Jun 2008, 19:27:13 UTC

Here are a few of the aborts... All the machines with trouble are NetBurst or Dual Core systems... none of my uni-processor machines are showing these symptoms.

The NetBurst MP systems have 4 or 8GB ram. The Dual-Cores are all 2GB ram.. so memory is probably not an issue.

These all seem to be t404 or t405 CASP8 units....

https://boinc.bakerlab.org/rosetta/result.php?resultid=171381753
https://boinc.bakerlab.org/rosetta/result.php?resultid=171425636
https://boinc.bakerlab.org/rosetta/result.php?resultid=171446390
https://boinc.bakerlab.org/rosetta/result.php?resultid=171523734

https://boinc.bakerlab.org/rosetta/result.php?resultid=171536681

https://boinc.bakerlab.org/rosetta/result.php?resultid=171139128

Profile David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 53725 - Posted: 16 Jun 2008, 19:38:03 UTC
Last modified: 16 Jun 2008, 19:40:31 UTC

Compute error here at about 20 minutes CPU time:

resultid=171712340

Large debugger output at the link.

I was the wingman, both computers had similar issues.

Allan Hojgaard
Joined: 4 May 08
Posts: 9
Credit: 591,749
RAC: 19
Message 53727 - Posted: 16 Jun 2008, 20:59:00 UTC

Here are my problems with the 5.96 beta version:
WU 156587340: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_38843_0
WU 156567311: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_28688_0
WU 156416544: t405__CASP8_JUMPAB_RES81to192_SAVE_ALL_OUT_BARCODE__3758_10358_0

Those WUs were all completed (100%), but BOINC still listed them as running and the processes were not consuming any CPU %. I had to abort them in order for BOINC to pick the next WUs in line.

All 3 had Exit status -197 (0xffffff3b). Zero credit received.

I am using BOINC 5.10.45 on Ubuntu 8.04 with an Intel Core2 Duo T7300 @ 2GHz in a Mobile Intel® 965 Express Chipset with 2GB of RAM.