Problems with Rosetta version 5.78

Message boards : Number crunching : Problems with Rosetta version 5.78

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 45712 - Posted: 2 Sep 2007, 23:06:33 UTC

Not too much different in this app from previous version. Thanks for continuing to post problems!

m.mitch

Joined: 10 Feb 06
Posts: 34
Credit: 1,928,904
RAC: 0
Message 45719 - Posted: 3 Sep 2007, 11:09:54 UTC

Work unit 94392699 on computer 551987 has been stuck at 97.756% finished with about 00:9:54 to go for most of today. Unlike the last time this occurred to me, the CPU is at 100% use. However, the CPU time (done) is still only showing a bit over 7 hours.

Is this a real problem?



Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45724 - Posted: 3 Sep 2007, 17:39:52 UTC

Mike, is that task still working on model 1? What is your work unit runtime preference? (the default is 3hrs).
...sounding normal so far.
Rosetta Moderator: Mod.Sense

m.mitch

Joined: 10 Feb 06
Posts: 34
Credit: 1,928,904
RAC: 0
Message 45738 - Posted: 4 Sep 2007, 10:51:43 UTC

No, that one finished after I went to bed. :-) No other problems so far. I expect a bit of a pause around the 10-minutes-to-go mark; this one just seemed to go on longer. Perhaps it snuck in a work unit from another project while I wasn't looking. Didn't see any in the messages though.



M.L.

Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 45746 - Posted: 4 Sep 2007, 15:36:35 UTC

Result ID 104053613
Name profilin2_BOINC_MFR_ABRELAX_PICKED_2062_29191_0
Workunit 94455620
Created 3 Sep 2007 9:01:12 UTC
Sent 3 Sep 2007 9:01:24 UTC
Received 4 Sep 2007 12:25:47 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 510574
Report deadline 13 Sep 2007 9:01:24 UTC
CPU time 0
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
ERROR:: Unable to obtain total_residue & sequence.
start pdb file must be provided.
ERROR:: Exit from: .input_pdb.cc line: 2956

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 0
Granted credit 0
application version 5.78

AMD4800 dual core on W SP2 Home


P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 45784 - Posted: 9 Sep 2007, 6:03:45 UTC
Last modified: 9 Sep 2007, 7:03:08 UTC

I had two WUs of the same type finish short of their run time on my two systems. Both have the runtime set to 8hrs, and both stopped after only 4hrs. I have the projects switch every 2hrs. Anyway, they haven't uploaded yet.

Edit/ added: 1gidA_BOINC_MG_CHAINBREAK5_LRSCOREFIX_RNA_**********

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94629940

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94566211

Pete.



drghughes

Joined: 27 Apr 07
Posts: 7
Credit: 6,346
RAC: 0
Message 45824 - Posted: 9 Sep 2007, 14:29:47 UTC

I also have a work unit 94604566 that has been stuck at around 97.2% progress for several hours of CPU time. It has now been running for 5:47 compared to a normal run time of about 3 hours.

I suspended it when the Rosetta problems started. Should I start it up again and let it run or should I abort it?


mikus

Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 45825 - Posted: 9 Sep 2007, 14:30:32 UTC

Had a problem with <https://boinc.bakerlab.org/rosetta/workunit.php?wuid=94715507> (not reported yet, since Rosetta is not yet accepting uploads). Noticed in gkrellm that one of my CPUs was idle (though boincmgr said that the workunit on that CPU was "running").

(If you can tell me where to send it, I have a tar of the slot directory.)
Here is a copy of the stderr.txt from that slot directory:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1285195
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...
SIGABRT: abort called
Stack trace (23 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8db0514]
[0x8dc53df]
[0x8dca445]
[0x8dca723]
[0x8d9b171]
[0x8d9cb99]
[0x83f92c1]
[0x8db0a5f]
[0x8d45152]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]

Exiting...


Would prefer it if applications which terminated abnormally would go away, rather than making the boinc client (Linux 32-bit 5.10.8) believe they are still "running".
.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45855 - Posted: 9 Sep 2007, 17:18:25 UTC

Peter & drghughes: Some of the recent tasks sent out have long run times per model, some up to about 4 hours on 3 GHz machines. So if your runtime preference is 8hrs, and your first model took 4.5hrs to complete, then beginning a second model would be predicted to take you over the 8hr preference by a significant amount, so Rosetta ends that task early rather than beginning the next model, which would almost certainly take longer.
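
The decision described above can be sketched roughly as follows. This is only an illustration, not Rosetta's actual code: the function name, its parameters, and the use of the average completed-model time as the prediction for the next model are all assumptions.

```python
def should_start_next_model(elapsed_s, model_times_s, runtime_pref_s):
    """Hypothetical check: is another model expected to fit in the preference?

    elapsed_s      -- CPU seconds used by the task so far
    model_times_s  -- CPU seconds taken by each completed model
    runtime_pref_s -- the user's runtime preference in seconds
    """
    if not model_times_s:
        # At least one model must complete before the task can end.
        return True
    predicted_s = sum(model_times_s) / len(model_times_s)
    return elapsed_s + predicted_s <= runtime_pref_s
```

With an 8hr preference (28800 s) and a first model that took 4.5hrs (16200 s), the predicted finish of a second model is 9hrs, so the sketch returns False and the task would end after one model.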

So Peter, it is normal for it to end early.
drghughes, it is normal for them to sometimes take longer than your shorter runtime preference. A task can't be marked as finished until it completes at least one model. The time to completion is really just an estimate based on your 3hr preference. Once they get down to <10min left, they just try to continue to show about 10min remaining, because they've got no more accurate idea of when that first model will complete. Please let it run.
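
As a rough illustration of why the display sits at about 10 minutes, here is a minimal sketch. It is not the real estimator: the function name and parameters are hypothetical, and the only figure taken from the explanation above is the roughly 10-minute (600 s) floor.

```python
def displayed_remaining_s(elapsed_s, runtime_pref_s, models_done, floor_s=600):
    """Hypothetical 'to completion' estimate shown to the user.

    The estimate is simply preference minus elapsed time. While the first
    model is still unfinished, it is never allowed to drop below floor_s.
    """
    remaining_s = runtime_pref_s - elapsed_s
    if models_done == 0 and remaining_s < floor_s:
        return floor_s
    return max(remaining_s, 0)
```

For a task with a 3hr preference that has run 5:47 without finishing its first model, this sketch still reports 600 seconds remaining.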

mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.

It appears from the number of tasks outstanding that the project is accepting uploads and issuing downloads. I just had an upload go through about an hour ago. Keep in mind there are about 50,000 PCs out there that all are trying to report completed results and get more work. We just have to let it keep chugging and working through the backlog. Thanks for your patience.
Rosetta Moderator: Mod.Sense

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 45879 - Posted: 9 Sep 2007, 22:25:42 UTC

Mod Sense.

Fair enough answer, thanks.

Pete.


mikus

Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 45886 - Posted: 10 Sep 2007, 1:35:07 UTC - in response to Message 45855.  

Mod.Sense wrote: "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well."

From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segmentation violation, nor in the fact that, although I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45888 - Posted: 10 Sep 2007, 2:05:07 UTC - in response to Message 45886.  

Mod.Sense wrote: "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well."

mikus wrote: "From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segmentation violation, nor in the fact that, although I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?"
.


Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly.
Rosetta Moderator: Mod.Sense

mikus

Joined: 7 Nov 05
Posts: 58
Credit: 700,115
RAC: 0
Message 45892 - Posted: 10 Sep 2007, 3:59:49 UTC - in response to Message 45888.  

Mod.Sense wrote: "mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well."

mikus wrote: "From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segmentation violation, nor in the fact that, although I expected a failing task to go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?"

Mod.Sense wrote: "Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly."

It may well be that BOINC code needs to be upgraded to handle this unusual situation - an application task "dispatched" by BOINC which does not use any CPU.

BUT it is likely that the existing BOINC code expected that an application task which (according to the task's stderr.txt) had received (SIGSEGV + SIGABRT) would perform a "final exit". My question is: did the Rosetta application task do that? (If yes, then BOINC dropped the ball; but if no, then it was the application that did not do what BOINC expected.) That is why I would like to send the snapshot of the slot directory to someone at Rosetta (if I knew where to send it), so Rosetta people can check how far the application had gotten.

mikus


p.s. By the way, I now see that when I "aborted" the task to get it out of the ready queue, only the "abort" shows in the result's stderr field - overwriting the task's previously accumulated stderr output.

Also, I believe boincmgr is merely the 'GUI' to the BOINC client - the client can (and does) run perfectly well if boincmgr has been closed. So while the BOINC manager *can* control the application tasks (I issued the "abort" from boincmgr), it is the client which performs the details of task scheduling. Unfortunately, I believe the principal means the client has to keep track of what the tasks are doing is to track their CPU consumption. When faced with a task that does not consume CPU, I think the current BOINC *will* lose track.
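
To make the idea of "a task that does not consume CPU" concrete, here is a minimal sketch of how a stalled task could be flagged from periodic readings of its cumulative CPU time. This is not taken from the BOINC client source; the function name, the sampling approach, and the three-sample threshold are assumptions for illustration only.

```python
def is_stalled(cpu_time_samples_s, min_samples=3):
    """Hypothetical stall check on cumulative CPU-time readings.

    cpu_time_samples_s -- cumulative CPU seconds, oldest reading first
    Returns True when the last min_samples readings show no CPU progress.
    """
    if len(cpu_time_samples_s) < min_samples:
        return False  # not enough history to judge
    recent = cpu_time_samples_s[-min_samples:]
    return max(recent) == min(recent)
```

A task whose readings go 100, 160, 160, 160 would be flagged, while one that keeps accumulating CPU time would not.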
.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45922 - Posted: 10 Sep 2007, 12:36:06 UTC

mikus, you can email your files to me at the moderator contact email address, and I will forward them to the project team for you.

Yes, my terminology needs a little refinement. Most users do not know the difference between the two BOINC pieces, so they don't notice my misuse of terms.

Two questions for you; perhaps just include the answers in the email. What is your runtime preference? (Actually, that probably shows in the output file.) And do you have any idea how long it was in the "running" state but not using CPU time?
Rosetta Moderator: Mod.Sense

drghughes

Joined: 27 Apr 07
Posts: 7
Credit: 6,346
RAC: 0
Message 45930 - Posted: 10 Sep 2007, 13:46:06 UTC - in response to Message 45855.  


Mod.Sense wrote: "drghughes, [...] The time to completion is really just an estimate based on your 3hr preference. Once they get down to <10min left they start to just try to continue to show about 10min remaining, because they've got no more accurate idea when that first model will complete. Please let it run."



Mod.Sense,

Thanks. I let it run and it finished at about 5 h 57 mins.

Perhaps you could include a sticky note telling people about the "10 minutes to completion" rule. That would have been useful to know.

Also, the latest work unit that I've received has an initial "To completion" of 5 h 57 mins. Is this coincidence or do new work units take the CPU Time of the last work unit as their initial To completion estimate? Again, this would be useful to know since it would explain why the actual run time might not match the estimate.


M.L.

Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 45933 - Posted: 10 Sep 2007, 14:23:42 UTC

Result ID 104434245
Name t030__BOINC_CAPRI14_DOCK_FIXBACKBONE-t030_-nosillyloop_nodimerloop_plexinmonomer__2066_697_0
Workunit 94766131
Created 9 Sep 2007 23:57:14 UTC
Sent 10 Sep 2007 0:01:53 UTC
Received 10 Sep 2007 14:05:17 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 20 Sep 2007 0:01:53 UTC
CPU time 13821.375
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1280434
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -218.075 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .xxt030.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 56.4278965225222
Granted credit 20
application version 5.78
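
The watchdog message in the stderr output above suggests a check along the following lines. This is a minimal sketch, not the Rosetta watchdog itself: the class and method names are hypothetical, and the only figure taken from the log is the 900-second limit.

```python
class ScoreWatchdog:
    """Hypothetical watchdog: flags a run whose score has stopped changing."""

    def __init__(self, stuck_limit_s=900):
        self.stuck_limit_s = stuck_limit_s
        self.last_score = None
        self.last_change_t = None

    def observe(self, t, score):
        """Record the score seen at time t (seconds).

        Returns True if the score has been unchanged for at least
        stuck_limit_s seconds, meaning the run should be ended.
        """
        if self.last_score is None or score != self.last_score:
            # First reading or a new score: restart the stuck timer.
            self.last_score = score
            self.last_change_t = t
            return False
        return t - self.last_change_t >= self.stuck_limit_s
```

Fed the score -218.075 at 0, 450, and 900 seconds, the sketch would only signal that the run should end on the third reading.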


Christoph Jansen

Joined: 6 Jun 06
Posts: 248
Credit: 267,153
RAC: 0
Message 45936 - Posted: 10 Sep 2007, 15:18:39 UTC

Same here too:

"Rosetta score is stuck or going too long. Watchdog is ending the run!"

On these WUs:

wuid=94910696
wuid=94910692
wuid=94910691
wuid=94770968

Ian_D

Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 45953 - Posted: 10 Sep 2007, 20:02:26 UTC
Last modified: 10 Sep 2007, 20:25:31 UTC

Is this sort of thing supposed to be happening frequently? At the moment, my four machines are doing quite a bit of work (< 6.5 hrs) and then coming up with:

<core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1276748
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 303.464 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .xx1he8.out

</stderr_txt>
]]>

Taken from Here

and giving next to nothing in credit (not that that bothers me, just wondering if there's something amiss !!)

Anyone else ?

Now that this message has been moved, I see there are others.




BitSpit

Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 45963 - Posted: 10 Sep 2007, 22:40:00 UTC


Zxian

Joined: 17 May 07
Posts: 18
Credit: 1,173,075
RAC: 0
Message 45967 - Posted: 11 Sep 2007, 0:28:48 UTC

I've also had several WU's come out with only 20 granted credit, regardless of how long the WU actually ran for.

This is on several different computers with different versions of Windows (XP, 2003).



©2024 University of Washington
https://www.bakerlab.org