Report Problems with Rosetta Version 5.25

Vester
Joined: 2 Nov 05
Posts: 257
Credit: 3,145,987
RAC: 8,914
Message 23867 - Posted: 20 Aug 2006, 19:12:52 UTC
Last modified: 20 Aug 2006, 19:21:09 UTC

This happened when I opened BOINC Manager to Run benchmarks using BOINC Manager 5.4.11. I have 550 MB free RAM of 1 GB installed, AMD Barton core at 2113 MHz and stable for a long time without any errors. Running Windows Vista Beta 2 build 5384.
8/20/2006 10:50:22 AM|rosetta@home|Resuming task 1wit__BOINC_BACKBONE_O_PENALTY_ABRELAX_SAVE_ALL_OUT__1176_756_0 using rosetta version 525
8/20/2006 2:52:41 PM|rosetta@home|Unrecoverable error for result 1wit__BOINC_BACKBONE_O_PENALTY_ABRELAX_SAVE_ALL_OUT__1176_756_0 ( - exit code -1073741819 (0xc0000005))

Here is the result: https://boinc.bakerlab.org/rosetta/result.php?resultid=33562744. My display didn't blink and I'd have missed the event if not looking for my unoptimized benchmarks. Edit for clarification: This is a new installation that has never been optimized.
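
For reference, the negative exit code in the log line above and the hexadecimal value in parentheses are the same 32-bit number, read once as signed and once as unsigned. The following is a minimal Python sketch of that conversion. The helper names are made up for illustration, and the lookup table holds only one entry, 0xC0000005, which Windows defines as STATUS_ACCESS_VIOLATION. It says nothing about why Rosetta hit that condition.

```python
# Hypothetical helpers, not part of BOINC or Rosetta.

# Only one well-known Windows status code is listed; extend as needed.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe_exit_code(exit_code: int) -> str:
    """Format an exit code as hex and append a name if one is known."""
    value = to_unsigned32(exit_code)
    name = KNOWN_STATUS.get(value, "unknown")
    return f"0x{value:08x} ({name})"


# The exit code reported in the log above.
print(describe_exit_code(-1073741819))
```

An access violation can come from a bug in the application or from faulty hardware, so the code alone does not settle which of the two happened here.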

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 23984 - Posted: 21 Aug 2006, 1:22:42 UTC
Last modified: 21 Aug 2006, 1:25:01 UTC

Just noticed this one. Happened about 24 hours ago:

https://boinc.bakerlab.org/rosetta/result.php?resultid=33412986

Charlie

-Charlie

Alan Roberts
Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 23989 - Posted: 21 Aug 2006, 1:54:41 UTC

Hi,

Across the machines I have running Rosetta, I've seen a handful of failures recently. The majority reported an "Incorrect function" error in dock_structure.cc, with at least one in pack.cc.

Seems to be very similar to what others have reported recently. A bit surprising since I don't remember 5.25 throwing any errors during CASP.

Is it valuable to document the WUs here? My first thought is that the project team must have tools to filter error results from all the returned results for investigation, and documentation here wouldn't be necessary. If that is incorrect and it would help, I'll hang another reply onto the thread.

Cheers,
Alan

Ethan
Volunteer moderator
Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 23990 - Posted: 21 Aug 2006, 2:02:38 UTC - in response to Message 23989.  

Is it valuable to document the WUs here?


Yes! See this thread https://boinc.bakerlab.org/forum_thread.php?id=2144. Rhiju is the scientist who submitted the WU.

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 23991 - Posted: 21 Aug 2006, 2:06:59 UTC
Last modified: 21 Aug 2006, 2:07:15 UTC

And another. Same error, and (I think) a similar WU to those I and others have reported. Could there be a bad batch of WUs?

https://boinc.bakerlab.org/rosetta/result.php?resultid=33514109

Charlie

-Charlie

Alan Roberts
Joined: 7 Jun 06
Posts: 61
Credit: 6,901,926
RAC: 0
Message 23994 - Posted: 21 Aug 2006, 2:25:24 UTC

Per Ethan's reply:


I'm sure I've seen pack.cc errors as well, but these seem to have fallen off the back end of the results list.

Cheers,
Alan


BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 24010 - Posted: 21 Aug 2006, 5:15:41 UTC

1opd__BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_174_0
<core_client_version>5.4.9</core_client_version>
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 1669857
# cpu_run_time_pref: 86400
ERROR:: Exit at: .dock_structure.cc line:401

</stderr_txt>
Result ID: 33445584
Seems to be a few of them popping up (that's 3 of the 10 WUs that have run on this machine this last week).
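
Since several reports in this thread end with the same "ERROR:: Exit at:" line, it can help to pull the file name and line number out of each stderr dump and count them. Below is a rough Python sketch of that idea. The function names and the regular expression are assumptions based only on the stderr text pasted above, not on any documented Rosetta or BOINC output format.

```python
import re
from collections import Counter

# Pattern guessed from the stderr text pasted in this thread, e.g.
#   ERROR:: Exit at: .dock_structure.cc line:401
# The leading dot before the file name is treated as optional.
EXIT_RE = re.compile(r"ERROR:: Exit at: \.?(?P<file>\S+) line:(?P<line>\d+)")


def find_exit_site(stderr_text):
    """Return (file, line) for the first exit message, or None if absent."""
    match = EXIT_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))


def tally_exit_sites(stderr_texts):
    """Count how often each (file, line) exit site appears across reports."""
    sites = (find_exit_site(text) for text in stderr_texts)
    return Counter(site for site in sites if site is not None)


sample = """# random seed: 1669857
# cpu_run_time_pref: 86400
ERROR:: Exit at: .dock_structure.cc line:401
"""

print(find_exit_site(sample))
print(tally_exit_sites([sample, sample, "# random seed: 1\n"]))
```

A tally like this only shows that failures cluster at one exit point; it cannot say whether the cause is the work unit or the host.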

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 24013 - Posted: 21 Aug 2006, 5:31:11 UTC

I got one too:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=29059205

Result: https://boinc.bakerlab.org/rosetta/result.php?resultid=33499101

stderr out

<core_client_version>5.5.13</core_client_version>
<![CDATA[
<message>
Forkert funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 1690643
# cpu_run_time_pref: 10800
ERROR:: Exit at: .dock_structure.cc line:401

</stderr_txt>
]]>


"I'm trying to maintain a shred of dignity in this world." - Me

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 24110 - Posted: 21 Aug 2006, 15:52:38 UTC - in response to Message 24010.  
Last modified: 21 Aug 2006, 16:29:30 UTC

Incorrect function. (0x1) - exit code 1 (0x1)

Seems to be a few of them popping up.. (that's 3 of the 10 WUs that have run on this machine this last week.)


Me too.

https://boinc.bakerlab.org/rosetta/results.php?hostid=288399

Got 5 of them on this host this morning. The new credit system gives me zero credit for them too. They all seem to have been resubmitted to other hosts. I am running 5.4.11 on this machine.

Edit to say that I have another machine with identical hardware running 5.4.11 and it has zero failed WUs. Maybe a config problem? I dunno.
Team Starfire World BOINC

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24121 - Posted: 21 Aug 2006, 17:07:10 UTC - in response to Message 24110.  

The new credit system gives me zero credit for them too.


You seem to be getting credit for them. The script only runs once a day, though, so you haven't gotten credit for the ones returned today yet. Also, the credit doesn't show in the list, but it does show if you look at the result.

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 24123 - Posted: 21 Aug 2006, 17:14:22 UTC - in response to Message 24110.  

Incorrect function. (0x1) - exit code 1 (0x1)

Seems to be a few of them popping up.. (that's 3 of the 10 WUs that have run on this machine this last week.)


Me too.

https://boinc.bakerlab.org/rosetta/results.php?hostid=288399

Got 5 of them on this host this morning. The new credit system gives me zero credit for them too. They all seem to have been resubmitted to other hosts. I am running 5.4.11 on this machine.

Edit to say that I have another machine with identical hardware running 5.4.11 and it has zero failed WUs. Maybe a config problem? I dunno.

Do I understand this right: In the old system you even got credit for invalid results? Why should this be?

NJMHoffmann
Joined: 17 Dec 05
Posts: 45
Credit: 45,891
RAC: 0
Message 24142 - Posted: 21 Aug 2006, 18:31:34 UTC - in response to Message 24123.  
Last modified: 21 Aug 2006, 18:33:36 UTC

Do I understand this right: In the old system you even got credit for invalid results? Why should this be?

Because here at Rosetta@home the software is tested. Part of the data sent to you with a new WU is code to be tested. So bugs in the software should not affect credit.

Norbert

PS: It's not new. IIRC Seti has done this for years: aborted WUs get credit. At Seti it's corrupt (or useless) data that causes the aborts (Error 9??).

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 24143 - Posted: 21 Aug 2006, 18:33:17 UTC - in response to Message 24123.  

Do I understand this right: In the old system you even got credit for invalid results? Why should this be?

Saenger, just before Ralph started (and probably the reason for Ralph's existence), they were having many WUs fail most of the way through or get stuck at 1% for days. People were screaming about tying up their computers for that period and not getting some form of reward. They started handing out credit as claimed for the run time up to a certain limit (can't remember the limit, 300 I think). Now that Ralph is here, the incidence of failed WUs is very low, and I hope they stop that practice.

tony

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 24145 - Posted: 21 Aug 2006, 18:36:32 UTC - in response to Message 24142.  
Last modified: 21 Aug 2006, 18:37:40 UTC

Do I understand this right: In the old system you even got credit for invalid results? Why should this be?

Because here at Rosetta@home the software is tested. Part of the data sent to you with a new WU is code to be tested. So bugs in the software should not affect credit.

Norbert

PS: It's not new. IIRC Seti has done this for years: aborted WUs get credit. At Seti it's corrupt (or useless) data that causes the aborts (Error 9??).

OK, I understand.
But those results are not marked "invalid", that's the difference.
The only other project I know that grants anything for "invalid" is LHC. There you get half of the credit granted to those with valid results.

So I think the labelling should be changed, as it's also possible that a result is really invalid, for example when the hardware is faulty and delivers no useful results.

NJMHoffmann
Joined: 17 Dec 05
Posts: 45
Credit: 45,891
RAC: 0
Message 24146 - Posted: 21 Aug 2006, 18:37:26 UTC - in response to Message 24143.  

Now that Ralph is here, the incidence of failed WUs is very low, and I hope they stop that practice.
No. (see my answer to Saenger for my argument).

Norbert

carl.h
Joined: 28 Dec 05
Posts: 555
Credit: 183,449
RAC: 0
Message 24147 - Posted: 21 Aug 2006, 18:37:28 UTC

I hope the practice continues. If the WU is what is wrong, it has nothing to do with your system, and you have spent say 23 hours of a 24-hour unit working, why should you not get credits?
Not all Czech`s bounce but I`d like to try with Barbar ;-)

Make no mistake This IS the TEDDIES TEAM.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 24148 - Posted: 21 Aug 2006, 18:38:29 UTC - in response to Message 24142.  
Last modified: 21 Aug 2006, 18:39:14 UTC

Do I understand this right: In the old system you even got credit for invalid results? Why should this be?

Because here at Rosetta@home the software is tested. Part of the data sent to you with a new WU is code to be tested. So bugs in the software should not affect credit.

Norbert

PS: It's not new. IIRC Seti has done this for years: aborted WUs get credit. At Seti it's corrupt (or useless) data that causes the aborts (Error 9??).

Norbert, yes, -9 "result overflow" is the ONLY error that will get credit at Seti. It just means there was too much RFI in the signal. There's a limit, and when it is reached the WU is aborted and you get proportional credit for it. It's usually only a matter of a minute or two of runtime before it terminates, though.

Here are 3 examples from my file:
361014709 86507692 1 Aug 2006 2:45:43 UTC 2 Aug 2006 19:13:29 UTC Over Success Done 138.39 0.11 0.11 2.8615 2.8615
361337990 86585100 1 Aug 2006 21:05:59 UTC 2 Aug 2006 19:13:29 UTC Over Success Done 139.64 0.12 0.12 3.0937 3.0937
360389054 86356170 30 Jul 2006 11:05:05 UTC 30 Jul 2006 17:05:44 UTC Over Success Done 58.78 0.11 0.11 6.7370 6.7370

As you can see, they only ran 138 seconds, 139 seconds, and 58 seconds respectively. Now there are a few that run into hours before failing.

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 24149 - Posted: 21 Aug 2006, 18:39:00 UTC - in response to Message 24147.  

I hope the practice continues. If the WU is what is wrong, it has nothing to do with your system, and you have spent say 23 hours of a 24-hour unit working, why should you not get credits?

That's right, and that's why they grant something over @LHC.
But how is it determined that it was the software, and not the hardware?

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 24150 - Posted: 21 Aug 2006, 18:40:20 UTC - in response to Message 24147.  
Last modified: 21 Aug 2006, 18:48:20 UTC

I hope the practice continues. If the WU is what is wrong, it has nothing to do with your system, and you have spent say 23 hours of a 24-hour unit working, why should you not get credits?



My opinion is that if the "decoys" are OK you should get credit for them.

Anders n

Edit: I assume that if the computer has done 5 decoys and fails on no. 6, it reports the 5 that were OK?!

NJMHoffmann
Joined: 17 Dec 05
Posts: 45
Credit: 45,891
RAC: 0
Message 24151 - Posted: 21 Aug 2006, 18:43:08 UTC - in response to Message 24145.  

So I think the labelling should be changed, as it's also possible that a result is really invalid, for example when the hardware is faulty and delivers no useful results.
It would be difficult to decide: is the result invalid because the computer failed? Or is it invalid because the "routines / parameter combination" used doesn't work? The second is a very useful result for Rosetta.

Norbert



©2024 University of Washington
https://www.bakerlab.org