Rosetta@home

Problems with Rosetta version 5.78

Rhiju
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jan 8 06
Posts: 223
ID: 48256
Credit: 3,546
RAC: 0
Message 45712 - Posted 2 Sep 2007 23:06:33 UTC

Not too much different in this app from previous version. Thanks for continuing to post problems!
____________

adhc.com.au Profile

Joined: Feb 10 06
Posts: 34
ID: 57862
Credit: 70,117
RAC: 0
Message 45719 - Posted 3 Sep 2007 11:09:54 UTC

Work unit 94392699 on computer 551987 has been stuck at 97.756% finished with about 00:9:54 to go for most of today. Unlike the last time this occurred to me, the CPU is at 100% use. However, the CPU time (done) is still only showing a bit over 7 hours.

Is this a real problem?
____________



Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 45724 - Posted 3 Sep 2007 17:39:52 UTC

Mike, is that task still working on model 1? What is your work unit runtime preference? (the default is 3hrs).
...sounding normal so far.
____________
Rosetta Moderator: Mod.Sense

adhc.com.au Profile

Joined: Feb 10 06
Posts: 34
ID: 57862
Credit: 70,117
RAC: 0
Message 45738 - Posted 4 Sep 2007 10:51:43 UTC

No, that one finished after I went to bed. :-) No other problems so far. I expect a bit of a pause around the 10 minutes to go mark; this one just seemed to go longer. Perhaps it snuck in a work unit from another project while I wasn't looking. Didn't see any in the messages though.
____________



M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 45746 - Posted 4 Sep 2007 15:36:35 UTC

Result ID 104053613
Name profilin2_BOINC_MFR_ABRELAX_PICKED_2062_29191_0
Workunit 94455620
Created 3 Sep 2007 9:01:12 UTC
Sent 3 Sep 2007 9:01:24 UTC
Received 4 Sep 2007 12:25:47 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 510574
Report deadline 13 Sep 2007 9:01:24 UTC
CPU time 0
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
ERROR:: Unable to obtain total_residue & sequence.
start pdb file must be provided.
ERROR:: Exit from: .\input_pdb.cc line: 2956

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 0
Granted credit 0
application version 5.78

AMD4800 dual core on W SP2 Home

____________

P . P . L .

Joined: Aug 20 06
Posts: 359
ID: 105843
Credit: 356,627
RAC: 665
Message 45784 - Posted 9 Sep 2007 6:03:45 UTC
Last modified: 9 Sep 2007 7:03:08 UTC

I had two W.U.s of the same type finish short of time on my two systems. Both have the runtime set for 8hrs, and both stopped after only 4hrs. I have the projects switch every 2hrs. Anyway, they haven't uploaded yet.

Edit/ added: 1gidA_BOINC_MG_CHAINBREAK5_LRSCOREFIX_RNA_**********

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94629940

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94566211

Pete.


____________


drghughes

Joined: Apr 27 07
Posts: 7
ID: 170018
Credit: 5,557
RAC: 0
Message 45824 - Posted 9 Sep 2007 14:29:47 UTC

I also have a work unit 94604566 that has been stuck at around 97.2% progress for several hours of CPU time. It has now been running for 5:47 compared to a normal run time of about 3 hours.

I suspended it when the Rosetta problems started. Should I start it up again and let it run or should I abort it?

mikus

Joined: Nov 7 05
Posts: 58
ID: 10139
Credit: 700,115
RAC: 0
Message 45825 - Posted 9 Sep 2007 14:30:32 UTC

Had a problem with <http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94715507> (not reported yet, since Rosetta is not yet accepting uploads). Noticed in gkrellm that one of my CPUs was idle (though boincmgr said that the workunit on that CPU was "running").

(If you can tell me where to send it, I have a tar of the slot directory.)
Here is a copy of the stderr.txt from that slot directory:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1285195
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...
SIGABRT: abort called
Stack trace (23 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8db0514]
[0x8dc53df]
[0x8dca445]
[0x8dca723]
[0x8d9b171]
[0x8d9cb99]
[0x83f92c1]
[0x8db0a5f]
[0x8d45152]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...


Would prefer it if applications which terminated abnormally would go away, rather than making the BOINC client (Linux 32-bit 5.10.8) believe they are still "running".
.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 45855 - Posted 9 Sep 2007 17:18:25 UTC

Peter & drghughes: Some of the recent tasks sent out have long run times per model, some up to about 4 hours on 3GHz machines. So if your runtime preference is 8hrs and your first model took 4.5hrs to complete, then beginning a second model would be predicted to take you over the 8hr preference by a significant amount. Rosetta therefore ends that task early rather than beginning the next model, which would almost certainly take longer.

So Peter, it is normal for it to end early.
drghughes, it is normal for tasks to sometimes take longer than your shorter runtime preference. A task can't be marked as finished until it completes at least one model. The time to completion is really just an estimate based on your 3hr preference. Once tasks get down to <10min left, they just try to continue to show about 10min remaining, because they have no more accurate idea of when that first model will complete. Please let it run.

mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen, there still seem to be occasional problems there as well.

It appears from the number of tasks outstanding that the project is accepting uploads and issuing downloads. I just had an upload go through about an hour ago. Keep in mind there are about 50,000 PCs out there that are all trying to report completed results and get more work. We just have to let it keep chugging and working through the backlog. Thanks for your patience.
____________
Rosetta Moderator: Mod.Sense
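
The early-exit rule described in this message can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Rosetta's actual code: the function name, the use of the average model time as the prediction for the next model, and the exact comparison against the preference are all assumptions.

```python
def should_start_next_model(elapsed_s, models_done, runtime_pref_s):
    """Decide whether to begin another model (hypothetical helper)."""
    if models_done == 0:
        # At least one model must finish before the task can be reported,
        # so the first model always runs, however long it takes.
        return True
    avg_model_s = elapsed_s / models_done
    # Assumption: the next model takes about as long as the average so far.
    return elapsed_s + avg_model_s <= runtime_pref_s


HOUR = 3600
# The example from the post: 8hr preference, first model took 4.5hrs.
print(should_start_next_model(4.5 * HOUR, 1, 8 * HOUR))  # prints False
```

With an 8hr preference and a 4.5hr first model, a second model would be predicted to end at about 9hrs, so the sketch stops after one model, which matches the "ended after about 4hrs" reports above.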

P . P . L .

Joined: Aug 20 06
Posts: 359
ID: 105843
Credit: 356,627
RAC: 665
Message 45879 - Posted 9 Sep 2007 22:25:42 UTC

Mod Sense.

Fair enough answer, thanks.

Pete.

____________


mikus

Joined: Nov 7 05
Posts: 58
ID: 10139
Credit: 700,115
RAC: 0
Message 45886 - Posted 10 Sep 2007 1:35:07 UTC - in response to Message ID 45855.

mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.

From this, should I conclude that Rosetta is not interested that one of its applications experienced a segmentation violation, nor that, whereas I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 45888 - Posted 10 Sep 2007 2:05:07 UTC - in response to Message ID 45886.

mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.

From this, should I conclude that Rosetta is not interested that one of its applications experienced a segmentation violation, nor that, whereas I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
.


Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly.
____________
Rosetta Moderator: Mod.Sense

mikus

Joined: Nov 7 05
Posts: 58
ID: 10139
Credit: 700,115
RAC: 0
Message 45892 - Posted 10 Sep 2007 3:59:49 UTC - in response to Message ID 45888.

mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.
From this, should I conclude that Rosetta is not interested that one of its applications experienced a segmentation violation, nor that, whereas I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly.

It may well be that BOINC code needs to be upgraded to handle this unusual situation - an application task "dispatched" by BOINC which does not use any CPU.

BUT it is likely that the existing BOINC code expected that an application task which (according to the task's stderr.txt) had received SIGSEGV and then SIGABRT would perform a "final exit". My question is: did the Rosetta application task do that? (If yes, then BOINC dropped the ball; but if no, then it was the application that did not do what BOINC expected.) That is why I would like to send the snapshot of the slot directory to someone at Rosetta (if I knew where to send it), so Rosetta people can check how far the application had gotten.

mikus


p.s. By the way, I now see that when I "aborted" the task to get it out of the ready queue, only the "abort" shows in the result's stderr field - overwriting the task's previously accumulated stderr output.

Also, I believe boincmgr is merely the 'GUI' to the BOINC client - the client can (and does) run perfectly well if boincmgr has been closed. So while the BOINC manager *can* control the application tasks (I issued the "abort" from boincmgr), it is the client which performs the details of task scheduling. Unfortunately, I believe the principal means the client has to keep track of what the tasks are doing is to track their CPU consumption. When faced with a task that does not consume CPU, I think the current BOINC *will* lose track.
.
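
The distinction raised here, between a task that has truly exited and one that is merely idle, can be illustrated from the supervising side. The sketch below is not BOINC client code; it only shows, using Python's `subprocess` convention that a negative return code on POSIX means "terminated by that signal", how a parent could classify a child by its exit status instead of by CPU consumption. The helper name `describe_exit` is hypothetical.

```python
import signal


def describe_exit(returncode):
    """Classify a child's state from a subprocess-style return code."""
    if returncode is None:
        # Popen.poll() returns None while the child has not exited.
        return "still running"
    if returncode < 0:
        # POSIX convention in subprocess: -N means killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


print(describe_exit(None))
print(describe_exit(1))
```

A task that caught SIGSEGV in its own handler and then never exited would keep reporting "still running" to such a check, which is consistent with what boincmgr showed.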

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 45922 - Posted 10 Sep 2007 12:36:06 UTC

mikus, you can EMail your files to me at the moderator contact EMail address, and I will forward them to the project team for you.

Yes, my terminology needs a little refinement. Most users do not know the difference between the two BOINC pieces, so they don't notice my misuse of terms.

Two questions for you, perhaps just include them in the EMail. What is your runtime preference? (actually that probably shows in the output file), and do you have any idea how long it was in the "running" state, but not using CPU time?
____________
Rosetta Moderator: Mod.Sense

drghughes

Joined: Apr 27 07
Posts: 7
ID: 170018
Credit: 5,557
RAC: 0
Message 45930 - Posted 10 Sep 2007 13:46:06 UTC - in response to Message ID 45855.


drghughes: The time to completion is really just an estimate based on your 3hr preference. Once they get down to <10min left they start to just try to continue to show about 10min remaining, because they've got no more accurate idea when that first model will complete. Please let it run.



Mod.Sense,

Thanks. I let it run and it finished at about 5 h 57 mins.

Perhaps you could include a sticky note telling people about the "10 minutes to completion" rule. That would have been useful to know.

Also, the latest work unit that I've received has an initial "To completion" of 5 h 57 mins. Is this coincidence or do new work units take the CPU Time of the last work unit as their initial To completion estimate? Again, this would be useful to know since it would explain why the actual run time might not match the estimate.
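
The "10 minutes to completion" behaviour explained in Message 45855 could be pictured roughly as follows. This is a guess at the shape of the rule, not the application's real estimator: the linear "preference minus elapsed" estimate and the function name are assumptions, and only the roughly 10 minute floor comes from the moderator's description.

```python
def displayed_time_remaining(elapsed_s, runtime_pref_s, floor_s=600):
    """Hypothetical 'To completion' estimate while the first model is unfinished."""
    # Naive estimate: whatever is left of the runtime preference.
    remaining = runtime_pref_s - elapsed_s
    # Once that drops below about 10 minutes, keep showing about 10 minutes,
    # because there is no better idea of when the first model will finish.
    return max(remaining, floor_s)


pref = 3 * 3600                                         # default 3hr preference
print(displayed_time_remaining(0, pref))                # prints 10800
print(displayed_time_remaining(5 * 3600 + 47 * 60, pref))  # prints 600
```

Under this sketch a task at 5:47 of CPU time with a 3hr preference would still show about 10 minutes to go, which is the situation reported earlier in the thread.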

M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 45933 - Posted 10 Sep 2007 14:23:42 UTC

Result ID 104434245
Name t030__BOINC_CAPRI14_DOCK_FIXBACKBONE-t030_-nosillyloop_nodimerloop_plexinmonomer__2066_697_0
Workunit 94766131
Created 9 Sep 2007 23:57:14 UTC
Sent 10 Sep 2007 0:01:53 UTC
Received 10 Sep 2007 14:05:17 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 20 Sep 2007 0:01:53 UTC
CPU time 13821.375
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1280434
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -218.075 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xxt030.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 56.4278965225222
Granted credit 20
application version 5.78

____________
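
For readers unfamiliar with the message in the stderr output above, a watchdog of this kind can be sketched as below. This is an assumption-laden illustration, not the Rosetta watchdog: the class name, the exact-equality test on the score, and the way time is passed in are all hypothetical. Only the 900 second limit is taken from the log.

```python
class ScoreWatchdog:
    """Ends a run when the score has not changed for stuck_limit_s seconds."""

    def __init__(self, stuck_limit_s=900.0):
        self.stuck_limit_s = stuck_limit_s
        self.last_score = None
        self.last_change_t = None

    def observe(self, score, now_s):
        """Record a score sample; return True if the run should be ended."""
        if self.last_score is None or score != self.last_score:
            # The score moved (or this is the first sample): restart the timer.
            self.last_score = score
            self.last_change_t = now_s
            return False
        return (now_s - self.last_change_t) >= self.stuck_limit_s


wd = ScoreWatchdog()
print(wd.observe(-218.075, 0))    # prints False
print(wd.observe(-218.075, 900))  # prints True
```

A rule like this cannot tell a genuinely hung run from a legitimately long step during which the score is not updated, which may be why healthy tasks were being ended early.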

Christoph Jansen Profile

Joined: Jun 6 06
Posts: 248
ID: 91851
Credit: 267,153
RAC: 0
Message 45936 - Posted 10 Sep 2007 15:18:39 UTC

Same here too:

"Rosetta score is stuck or going too long. Watchdog is ending the run!"

On these WUs:

wuid=94910696
wuid=94910692
wuid=94910691
wuid=94770968

Ian_D Profile

Joined: Sep 21 05
Posts: 50
ID: 757
Credit: 3,215,911
RAC: 2,222
Message 45953 - Posted 10 Sep 2007 20:02:26 UTC
Last modified: 10 Sep 2007 20:25:31 UTC

Is this sort of thing supposed to be happening frequently as, at the moment, my four machines are doing quite a bit of work < 6.5 hrs and then coming up with

<core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1276748
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 303.464 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out

</stderr_txt>
]]>

Taken from Here

and giving next to nothing in credit (not that that bothers me, just wondering if there's something amiss !!)

Anyone else ?

Now message has been moved I see there are others.

____________


BitSpit

Joined: Nov 5 05
Posts: 33
ID: 9581
Credit: 4,144,202
RAC: 0
Message 45963 - Posted 10 Sep 2007 22:40:00 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=104493983
http://boinc.bakerlab.org/rosetta/result.php?resultid=104493982
http://boinc.bakerlab.org/rosetta/result.php?resultid=104511814
http://boinc.bakerlab.org/rosetta/result.php?resultid=104511813

Watchdog killed these after the score got stuck for 900 seconds. It only seemed to affect the Windows machines. The Linux ones ran just fine.
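
Anyone collecting these reports could pull the relevant numbers out of the stderr text with a small script. The sketch below assumes only the message format shown in the posts above ("Stuck at score X for N seconds"); the function name is hypothetical and the sample text is copied from an earlier post in this thread.

```python
import re

# Matches e.g. "Stuck at score -218.075 for 900 seconds"
STUCK_RE = re.compile(r"Stuck at score (-?\d+(?:\.\d+)?) for (\d+) seconds")


def parse_watchdog(stderr_text):
    """Return (score, seconds) if the watchdog ended the run, else None."""
    match = STUCK_RE.search(stderr_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


sample = """# cpu_run_time_pref: 21600
# random seed: 1280434
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -218.075 for 900 seconds
"""
print(parse_watchdog(sample))  # prints (-218.075, 900)
```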

Zxian Profile

Joined: May 17 07
Posts: 18
ID: 177811
Credit: 1,173,075
RAC: 0
Message 45967 - Posted 11 Sep 2007 0:28:48 UTC

I've also had several WU's come out with only 20 granted credit, regardless of how long the WU actually ran for.

This is on several different computers with different versions of Windows (XP, 2003).

Beezlebub

Joined: Oct 18 05
Posts: 40
ID: 5335
Credit: 260,375
RAC: 0
Message 45970 - Posted 11 Sep 2007 1:07:10 UTC

I also have 8 WU's showing:

Result ID 104616934
Name 1he8__BOINC_CAPRI14_DOCK_FIXBACKBONE-1he8_-nosillyloop_plexinmonomer__2067_8410_0
Workunit 94936831
Created 10 Sep 2007 10:42:43 UTC
Sent 10 Sep 2007 10:43:28 UTC
Received 10 Sep 2007 22:03:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 341092
Report deadline 20 Sep 2007 10:43:28 UTC
CPU time 16852.023625
stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1272171
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -173.421 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out

</stderr_txt>
]]>

Validate state Valid
Claimed credit 48.3512209341757
Granted credit 20
application version 5.78
____________
e6600 quad @ 2.5ghz
2418 floating point
5227 integer

e6750 dual @ 3.71ghz
3598 floating point
7918 integer


P . P . L .

Joined: Aug 20 06
Posts: 359
ID: 105843
Credit: 356,627
RAC: 665
Message 45976 - Posted 11 Sep 2007 3:39:12 UTC

I've had the same problem.

It was a 1he8_**** W.U.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94800922

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1278145
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -202.375 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out

</stderr_txt>

____________


BitSpit

Joined: Nov 5 05
Posts: 33
ID: 9581
Credit: 4,144,202
RAC: 0
Message 45992 - Posted 11 Sep 2007 11:09:42 UTC

More that got stuck and the watchdog killed. Still Windows only that's hanging.

http://boinc.bakerlab.org/rosetta/result.php?resultid=104511812
http://boinc.bakerlab.org/rosetta/result.php?resultid=104511810
http://boinc.bakerlab.org/rosetta/result.php?resultid=104493978
http://boinc.bakerlab.org/rosetta/result.php?resultid=104493977

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 45996 - Posted 11 Sep 2007 12:22:58 UTC

The 20 credits sounds like the nightly credit granting script for failed WUs. I realize they probably show as "success", but they didn't end normally. Some details here.
____________
Rosetta Moderator: Mod.Sense

M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 46001 - Posted 11 Sep 2007 16:19:12 UTC

Result ID 104435631
Name 1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE-1g4u_-nosillyloop_plexinmonomer__2067_760_0
Workunit 94767401
Created 10 Sep 2007 0:01:42 UTC
Sent 10 Sep 2007 0:01:53 UTC
Received 11 Sep 2007 16:12:47 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 20 Sep 2007 0:01:53 UTC
CPU time 13657.84375
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1279871
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -223.806 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1g4u.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 55.7602549562382
Granted credit 20
application version 5.78





____________

googloo

Joined: Sep 15 06
Posts: 42
ID: 112667
Credit: 317,024
RAC: 170
Message 46005 - Posted 11 Sep 2007 17:21:05 UTC - in response to Message ID 45996.

The 20 credits sounds like the nightly credit granting script for failed WUs. I realize they probably show as "success", but they didn't end normally. Some details here.


Had that problem with these results: here and here.
____________

uNiUs Profile

Joined: Apr 12 06
Posts: 3
ID: 75575
Credit: 2,439,966
RAC: 1,643
Message 46028 - Posted 11 Sep 2007 20:49:13 UTC
Last modified: 11 Sep 2007 20:54:51 UTC

Same problem:

104582448 94905137 10 Sep 2007 7:53:36 UTC 11 Sep 2007 20:16:51 UTC Over Success Done 86,268.45 530.82 514.59

104582610 94905257 10 Sep 2007 7:57:48 UTC 11 Sep 2007 20:16:51 UTC Over Success Done 86,175.33 530.25 530.43

104589787 94911730 10 Sep 2007 8:28:09 UTC 11 Sep 2007 20:16:51 UTC Over Success Done 21,587.09 132.83 20.00
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46031 - Posted 11 Sep 2007 21:27:21 UTC

similar problem but with a new twist

104452274 94782474 10 Sep 2007 0:41:23 UTC 10 Sep 2007 17:38:48 UTC Over Success Done 21,550.19 61.05 20.00
104452273 94782473 10 Sep 2007 0:41:23 UTC 10 Sep 2007 19:59:44 UTC Over Success Done 7,042.41 19.95 20.00 <-- weird
____________

hugothehermit

Joined: Sep 26 05
Posts: 238
ID: 1310
Credit: 314,507
RAC: 2
Message 46043 - Posted 12 Sep 2007 4:54:03 UTC

I aborted this WU, nothing was wrong with it as far as I know, I just couldn't finish it in time so I didn't start it.

Jim Profile

Joined: Oct 15 06
Posts: 18
ID: 119359
Credit: 2,341,297
RAC: 0
Message 46044 - Posted 12 Sep 2007 4:55:19 UTC - in response to Message ID 45712.
Last modified: 12 Sep 2007 4:56:05 UTC

I'm the second person to get this WU: 94462214
It seems to be missing a file, PROF2.pdb; the download will not finish,
and I just get an error message, "file not found".

____________

Ricky@SETI.USA

Joined: Dec 13 05
Posts: 16
ID: 36732
Credit: 40,598
RAC: 0
Message 46069 - Posted 12 Sep 2007 15:50:03 UTC

9/12/2007 05:19:59||Suspending network activity - user request
9/12/2007 07:04:30|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit
9/12/2007 07:25:19|rosetta@home|[error] rosetta_beta not responding to screensaver, killing it
9/12/2007 07:25:24|rosetta@home|Restarting task 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-rxplxn_0472plexinmonomer__2074_62_0 using rosetta_beta version 578
9/12/2007 10:26:29|rosetta@home|Computation for task 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-rxplxn_0472plexinmonomer__2074_62_0 finished
9/12/2007 11:28:53||Resuming network activity

Never seen this error before!

____________
"Life is like an Ice Cream cone, just when you think you got it licked, it drips all over you!"

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46089 - Posted 12 Sep 2007 21:06:26 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=104452274
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -164.509 for 900 seconds
____________

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2526
ID: 89694
Credit: 1,121,417
RAC: 67
Message 46095 - Posted 12 Sep 2007 22:18:34 UTC
Last modified: 12 Sep 2007 22:21:17 UTC

1he8__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1he8_-rxplxn_1030plexinmonomer__2074_1759_0

CPU time 14594.392753

stderr out

<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3919342
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -467.27 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out

</stderr_txt>
]]>

Validate state Valid
Claimed credit 62.7858592995201
Granted credit 20
application version 5.78

Rhiju
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jan 8 06
Posts: 223
ID: 48256
Credit: 3,546
RAC: 0
Message 46098 - Posted 12 Sep 2007 22:30:31 UTC
Last modified: 12 Sep 2007 22:31:57 UTC

Thanks to everyone for posting. I think I know how to fix this (the watchdog problem)! I have removed these jobs from the queue for now, and when they are sent out again, we should see fewer premature exits...

____________

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2526
ID: 89694
Credit: 1,121,417
RAC: 67
Message 46099 - Posted 12 Sep 2007 22:31:31 UTC - in response to Message ID 46098.
Last modified: 12 Sep 2007 22:34:03 UTC

Ok, don't want to beat a dead horse, but just noticed

this also...

Rhiju
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jan 8 06
Posts: 223
ID: 48256
Credit: 3,546
RAC: 0
Message 46100 - Posted 12 Sep 2007 22:43:47 UTC - in response to Message ID 46099.

One more question -- did you happen to notice if the screen looked totally stuck before the crash?
(Probably too much to ask.)

Ok, don't want to beat a dead horse, but just noticed

this also...


____________

The_Bad_Penguin

Joined: Jun 5 06
Posts: 2526
ID: 89694
Credit: 1,121,417
RAC: 67
Message 46101 - Posted 12 Sep 2007 22:48:56 UTC - in response to Message ID 46100.
Last modified: 12 Sep 2007 22:50:26 UTC

sorry, didn't notice.

i have the quad-core running on its own as a (more or less) dedicated cruncher, and am using the A64 3800+ for i-net / e-mail / ms office / etc.

so, really don't look at Rosie running, just check my results page every so often to make sure i'm seeing about what i expect to see...

greg_be ???

One more question -- did you happen to notice if the screen looked totally stuck before the crash?
(Probably too much to ask.)

Jmarks Profile

Joined: Jul 16 07
Posts: 132
ID: 191202
Credit: 98,025
RAC: 0
Message 46117 - Posted 13 Sep 2007 11:53:39 UTC
Last modified: 13 Sep 2007 11:54:30 UTC

I am having the same problems. I usually crunch 830 credits, but now 50% of my 5.78 WUs are bad. I do not use the PC for anything else, or for any other project, so there is no monitor on which to see whether the PC is acting strangely while this is happening.
104430347 94762391 9 Sep
104430349 94762393 9 Sep
104430354 94762398 9 Sep
104430355 94762399 9 Sep
104430357 94762401 9 Sep
104430364 94762408 9 Sep
104430359 94762403 9 Sep
104430358 94762402 9 Sep
104430366 94762410 9 Sep
104430373 94762416 9 Sep
104430372 94762415 9 Sep
104430376 94762419 9 Sep
____________
Jmarks

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 2391
ID: 106194
Credit: 0
RAC: 0
Message 46120 - Posted 13 Sep 2007 13:26:24 UTC
Last modified: 13 Sep 2007 13:27:03 UTC

Jmarks, sorry for all the failed WUs. Rhiju has pulled those WUs and is working on a fix that will improve things there. Otherwise, about all you can do is cut your runtime preference. The theory is that if your normal credit per task is close to 20, then a failure granted 20 will not have such an impact.
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46143 - Posted 13 Sep 2007 19:55:47 UTC - in response to Message ID 46101.

sorry, didn't notice.

i have the quad-core running on its own as a (more or less) dedicated cruncher, and am using the A64 3800+ for i-net / e-mail / ms office / etc.

so, really don't look at Rosie running, just check my results page every so often to make sure i'm seeing about what i expect to see...

greg_be ???
** so i see you're moving up... congrats penguin **
One more question -- did you happen to notice if the screen looked totally stuck before the crash?
(Probably too much to ask.)



____________

TimL

Joined: Sep 16 06
Posts: 13
ID: 112884
Credit: 2,090,115
RAC: 6,913
Message 46149 - Posted 13 Sep 2007 20:33:00 UTC

Got these messages this morning

14/09/2007 6:28:49 AM|rosetta@home|Scheduler RPC succeeded
14/09/2007 6:28:49 AM|rosetta@home|Message from server: Project encountered internal error: shared memory
14/09/2007 6:28:49 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
14/09/2007 6:28:49 AM|rosetta@home|Reason: project is down

____________

Wits End

Joined: Apr 16 07
Posts: 4
ID: 165531
Credit: 1,034
RAC: 0
Message 46160 - Posted 14 Sep 2007 0:03:14 UTC - in response to Message ID 45712.

I'm running v5.78 on a 600 MHz machine. Three of the last nine WUs that I've uploaded reported a "Validate error" (103801478/94222135; 104683771/94998439; and, 104848402/95151762). In total, these three WUs represent just shy of 19 CPU hours, with a combined credit claim of just over 58.

As you might imagine, wasting 19 hours of CPU time because every 1 out of 3 WUs is rejected has me a bit frustrated with R@H! (I'm also running World Community Grid and Seti@Home, neither of which are producing errors.) Is anyone else experiencing similar problems with v5.78?

Jmarks Profile

Joined: Jul 16 07
Posts: 132
ID: 191202
Credit: 98,025
RAC: 0
Message 46183 - Posted 14 Sep 2007 11:20:35 UTC

Here is another one

104430230 94762288 9 Sep
____________
Jmarks

David Emigh Profile

Joined: Mar 13 06
Posts: 158
ID: 65176
Credit: 417,178
RAC: 0
Message 46188 - Posted 14 Sep 2007 12:23:55 UTC

I realize that 5.78 is (unpleasant) history now, but as a historical record, and in the hope that this information will make future versions of Rosetta more stable, I offer the following:

Five out of six Capri WUs killed by the watchdog, the casualties were:

WU 95336696
WU 95336686
WU 95336658
WU 95336625
WU 95366073

I won't try to link them all here, but they are all individually linked in my posts in this thread.

The sole survivor of the batch was:

WU 95336673

Even though it was a success, and this is a thread for problems, I have linked it here since there doesn't seem to be a place to post the (rare) successes in v5.78...

This represents about 90 hours on a pretty decent computer, about half of which was accomplished AFTER the problem with 5.78 was identified and (we hope) fixed.

When/If a situation like this arises again, I would specifically request a notice on the home page asking users to ABORT the WUs in question. I suspect the project would be better served thereby.

Respectfully,
David Emigh
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!

Wits End

Joined: Apr 16 07
Posts: 4
ID: 165531
Credit: 1,034
RAC: 0
Message 46204 - Posted 14 Sep 2007 16:30:28 UTC - in response to Message ID 45712.
Last modified: 14 Sep 2007 16:32:16 UTC


v5.80 appears to have inherited the same problem (Message 46203)!

hugothehermit

Joined: Sep 26 05
Posts: 238
ID: 1310
Credit: 314,507
RAC: 2
Message 46248 - Posted 15 Sep 2007 4:09:07 UTC

WU

**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -457.996 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46312 - Posted 15 Sep 2007 23:33:36 UTC
Last modified: 15 Sep 2007 23:36:02 UTC

same problem with this WU

**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 117.804 for 900 seconds
**********************************************************************

that's 3 work units this week that got stuck and gave me only 20 points.
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46334 - Posted 16 Sep 2007 9:02:27 UTC
Last modified: 16 Sep 2007 9:05:09 UTC

this wu got stuck
# cpu_run_time_pref: 21600
# random seed: 1279344
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -231.574 for 900 seconds
**********************************************************************

yet another 20 instead of actual points, 4 of them now out of over 20 wu's
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46335 - Posted 16 Sep 2007 9:05:58 UTC

more errors

104452248 94782448 10 Sep 2007 0:41:23 UTC 15 Sep 2007 1:03:19 UTC Over Validate error Done 21,553.84 --- ---
104452246 94782446 10 Sep 2007 0:41:23 UTC 15 Sep 2007 1:03:19 UTC Over Validate error Done 21,609.22 --- ---
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4350
ID: 85645
Credit: 772,384
RAC: 674
Message 46469 - Posted 17 Sep 2007 20:00:50 UTC

1he8__BOINC_CAPRI14_DOCK_FIXBACKBONE-1he8_-nodimerloop_plexinmonomer__2067_1639_0 validate error
____________


Copyright © 2010 University of Washington

Last Modified: 3 Dec 2007 20:36:19 UTC