Problems with version 5.96

Message boards : Number crunching : Problems with version 5.96

Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53950 - Posted: 24 Jun 2008, 6:14:39 UTC - in response to Message 53948.  

It is a shame that the t405 cancellation wasn't mentioned when it was done; I could have resumed Rosetta then instead of now! The news flow is certainly a bit stagnant. I appreciate CASP is a busy time, but a couple of lines in the news column on the front page would not take long.


I'm sorry about that. I didn't think about posting it in the news because there were no more tasks queued by the time they were cancelled, but I should have. I'm still working on a fix for that particular protocol because we need to run similar jobs soon for another CASP target. I'll definitely post something when we update the app with the fix, likely within the next day or two.

Profile adrianxw
Joined: 18 Sep 05
Posts: 652
Credit: 11,660,246
RAC: 1,175
Message 53954 - Posted: 24 Jun 2008, 8:07:29 UTC
Last modified: 24 Jun 2008, 8:09:03 UTC

Yes, I saw the comments added; that is what comes through the RSS feed. Good luck with the fix. Crunching Rosetta again now.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 53968 - Posted: 24 Jun 2008, 21:24:59 UTC

The t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_8528_0 errored out:

Incorrect function. (0x1) - exit code 1 (0x1)
# cpu_run_time_pref: 14400
# random seed: 2586117
ERROR:: Exit from: .refold.cc line: 338


Peter

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 53970 - Posted: 24 Jun 2008, 21:49:11 UTC
Last modified: 24 Jun 2008, 21:56:48 UTC

I just got this.


6/25/2008 7:43:44 AM|rosetta@home|Output file for task t434_1_NMRREF_1_t434_1_T0434_2QPWA_2JV0_hybridIGNORE_THE_REST_truncated_4104_3025_0 absent

Edit// just to add this.

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 2591620
ERROR:: Exit from: .refold.cc line: 338

</stderr_txt>

pete.

Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 53971 - Posted: 24 Jun 2008, 21:53:14 UTC


Jean-David Beyer
Joined: 2 Nov 05
Posts: 172
Credit: 5,648,282
RAC: 3,305
Message 53973 - Posted: 24 Jun 2008, 22:02:19 UTC - in response to Message 52532.  

On about half of the jobs, when I reach around 95% completed, progress simply crawls. The to-completion time stops, but the percentage increments extremely slowly. I assume the job is progressing, but I don't know.


That would be normal. Especially if you are still on the first model, and/or have a short preferred runtime specified in your Rosetta preferences.


I guess this is normal, but sometimes, like today, it bugs me.

I have two hyperthreaded Xeons (32-bit) and 8 GBytes RAM running Linux kernel 2.6.18-92.1.1.el5PAE on one machine and two Pentium III processors and 512 MBytes RAM running Linux kernel 2.6.9-67.0.15.ELsmp on the other machine. In each case, Rosetta runs up to about 96% complete in a relatively short period of time, and time remaining is usually in the order of 10 minutes. Right now, it has used up about 8 hours since getting to 96% complete (and it took only about three hours to get to 96%). This is time actually consumed by the process, not wall-clock time.

I just wish the time remaining would more accurately reflect the time needed to complete.

Rosetta is not the worst offender in this regard. Some projects have the time remaining actually increasing as the time consumed increases.
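
The "10 minutes remaining" figure reported above is about what a purely linear extrapolation would give. The sketch below is only an illustration under that assumption; it is not a confirmed description of how the BOINC client or Rosetta actually computes time remaining, and `linear_time_remaining` is a hypothetical helper name.

```python
def linear_time_remaining(elapsed_hours: float, fraction_done: float) -> float:
    """Naive estimate: assume the rest of the task runs at the average rate so far.

    This is an assumed model for illustration, not the client's real algorithm.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Figures from the post above: about 3 hours of CPU time to reach 96%.
remaining_minutes = linear_time_remaining(3.0, 0.96) * 60
print(f"{remaining_minutes:.1f} minutes")  # 7.5 minutes
```

Under this model the estimate can only be right if the last 4% of reported progress costs the same per percent as the first 96%. When the final stretch takes roughly 8 hours instead, as described above, a linear estimate will stay far too low.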

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 53974 - Posted: 24 Jun 2008, 22:12:26 UTC

I've got three more of these t434_ tasks; all have failed on other hosts. What a waste.

pete.



Profile Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 53976 - Posted: 24 Jun 2008, 23:51:53 UTC - in response to Message 53974.  
Last modified: 25 Jun 2008, 0:09:13 UTC

I've got three more of these t434_ tasks; all have failed on other hosts. What a waste.

pete.



Rosetta Beta 5.96 t434 is doing terribly on my hosts too. Ugh.

I remember when I first joined 2 years ago - I could let my hosts go for weeks without checking on them. Now I feel the need to make sure they are ok twice a day, and will be suspending Rosetta while I leave for vacation this weekend! What a tragic shame...

BrnmccO1
Joined: 26 Jun 07
Posts: 17
Credit: 578,825
RAC: 0
Message 53978 - Posted: 25 Jun 2008, 0:32:47 UTC

I have also had a rash of compute errors over the last two weeks. Mostly the aforementioned t405s and a few t434s as well.

Here's a list of the failed WUs:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=158023723
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=155316236
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=155266537
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156920807 <-- had to manually abort, was 'stuck'
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156498712

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=158046502
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=157548537 <-- Mini Rosetta, was successful on someone else's computer, though.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=155266298 <-- T409
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=156219608 <-- t405 had to manually abort

Other than the recent troubles, things have been pretty good the past year for me, so I'll keep plugging away!


Cheers,

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 53979 - Posted: 25 Jun 2008, 2:08:38 UTC

Well, two more t434_ tasks failed, one after 3 hrs 23 min, the other after 15 min. More to go.

pete.


Profile TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 27,904,257
RAC: 4
Message 53980 - Posted: 25 Jun 2008, 3:47:20 UTC

I am close to being out of here! I started crunching Rosetta because it would run for weeks without any attention; it sure wasn't because of the way low BOINC credit.

Now I have many stuck jobs, have had to abort plenty of jobs and am running out of patience.

Jim

Profile hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,150,900
RAC: 601
Message 53982 - Posted: 25 Jun 2008, 4:38:25 UTC

I've only been back from vacation a few days and I've already got one of these:

WU 173161638

This one seems to have completed correctly on another computer.
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.


P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 53985 - Posted: 25 Jun 2008, 7:42:15 UTC

This is an odd one: the first host failed on it, but mine ran OK and finished!

It's one of the t434 tasks.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=158008942

pete.


Jean-David Beyer
Joined: 2 Nov 05
Posts: 172
Credit: 5,648,282
RAC: 3,305
Message 53986 - Posted: 25 Jun 2008, 10:40:37 UTC - in response to Message 53980.  

I am close to being out of here! I started crunching Rosetta because it would run for weeks without any attention; it sure wasn't because of the way low BOINC credit.

Now I have many stuck jobs, have had to abort plenty of jobs and am running out of patience.

Jim


I have gotten a few "stuck jobs", if by that you mean some that get to 100% complete, time remaining: --, but still running for quite a while. I just assumed this was similar to those that run 2x or 3x longer for the last 4% than they took for the first 96%, so I let them continue to run for a while. They ultimately finished. I have not checked if they finished correctly or with an error.

Profile adrianxw
Joined: 18 Sep 05
Posts: 652
Credit: 11,660,246
RAC: 1,175
Message 53987 - Posted: 25 Jun 2008, 11:00:40 UTC

I had only just resumed Rosetta after the t405 problem, then straight away t434 strikes!
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 53997 - Posted: 25 Jun 2008, 21:01:04 UTC


Profile (_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 54032 - Posted: 27 Jun 2008, 19:07:43 UTC
Last modified: 27 Jun 2008, 19:10:12 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=173123569
Incorrect function. (0x1) - exit code 1 (0x1)
ERROR:: Exit from: .refold.cc line: 338

https://boinc.bakerlab.org/rosetta/result.php?resultid=172523452
ERROR:: Unable to determine sequence length from pdb file

https://boinc.bakerlab.org/rosetta/result.php?resultid=172522701
ERROR:: Unable to determine sequence length from pdb file
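
Several reports in this thread end with the same `ERROR:: Exit from:` line, so it can help to tally them by source file and line number. The following is a minimal sketch, assuming stderr text in the format pasted in this thread; `exit_locations` is a hypothetical helper, and dropping the leading dot from the file name is a choice made here, not something the project documents.

```python
import re
from collections import Counter

# Matches lines such as "ERROR:: Exit from: .refold.cc line: 338".
# The optional leading dot is dropped from the captured file name (an assumption).
EXIT_RE = re.compile(r"ERROR:: Exit from: \.?(?P<file>\S+) line: (?P<line>\d+)")


def exit_locations(stderr_text: str) -> Counter:
    """Count how often each (source file, line number) exit point appears."""
    return Counter(
        (m["file"], int(m["line"])) for m in EXIT_RE.finditer(stderr_text)
    )


# Sample lines copied from reports in this thread.
sample = """\
ERROR:: Exit from: .refold.cc line: 338
ERROR:: Exit from: .refold.cc line: 338
ERROR:: Exit from: .loop_relax.cc line: 1863
"""

for (source, line), count in exit_locations(sample).most_common():
    print(f"{source}:{line} x{count}")
```

Messages without an exit location, such as "Unable to determine sequence length from pdb file", are not matched by this pattern and would need their own tally.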

Profile Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,694,601
RAC: 1,923
Message 54043 - Posted: 28 Jun 2008, 14:45:48 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=172358241
Client Error
Compute error

CPU Time: 15.85938
stderr out

<core_client_version>6.2.6</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 2802237
ERROR:: Exit from: .loop_relax.cc line: 1863

</stderr_txt>
]]>

Profile Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,694,601
RAC: 1,923
Message 54044 - Posted: 28 Jun 2008, 14:49:09 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=172355994
FRA_t426_CASP8_2JASA_51_IGNORE_THE_RESTt426_51_mdT0421_2JASA_5.Cterm_0001_3821_590_0

Big debugger dump on this one.

CPU time 2852.328
stderr out

<core_client_version>6.2.6</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 2715239


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00A4F68A read attempt to address 0x86842CA4

Wasted CPU time on this one and crashed my system.

I was testing my OC speed when this one died... maybe that had something to do with it?
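
The negative exit code and the hexadecimal value in the log above are the same 32-bit number read two ways: signed and unsigned. A small sketch of the conversion (`to_unsigned32` is a hypothetical helper name used only for illustration):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073741819  # as printed in the <message> block above
print(hex(to_unsigned32(exit_code)))  # 0xc0000005
```

The result matches the "Access Violation (0xc0000005)" reason given in the unhandled exception record.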

Profile Greg_BE
Joined: 30 May 06
Posts: 5659
Credit: 5,694,601
RAC: 1,923
Message 54047 - Posted: 28 Jun 2008, 16:04:32 UTC - in response to Message 54044.  

Edit: It certainly does, as I crashed a few other work units as well. Sorry, folks.



©2024 University of Washington
https://www.bakerlab.org