Work unit 94392699 on computer 551987 has been stuck at 97.756% finished with about 00:09:54 to go for most of today. Unlike the last time this happened to me, the CPU is at 100% use. However, the CPU time (done) is still only showing a bit over 7 hours.
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
Mike, is that task still working on model 1? What is your work unit runtime preference? (the default is 3hrs).
...sounding normal so far.
____________ Rosetta Moderator: Mod.Sense
No, that one finished after I went to bed. :-) No other problems so far. I expect a bit of a pause around the 10 minutes to go mark; this one just seemed to go longer. Perhaps it snuck in a work unit from another project while I wasn't looking. I didn't see any in the messages, though.
____________
Result ID 104053613
Name profilin2_BOINC_MFR_ABRELAX_PICKED_2062_29191_0
Workunit 94455620
Created 3 Sep 2007 9:01:12 UTC
Sent 3 Sep 2007 9:01:24 UTC
Received 4 Sep 2007 12:25:47 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 510574
Report deadline 13 Sep 2007 9:01:24 UTC
CPU time 0
stderr out <core_client_version>5.10.13</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600
ERROR:: Unable to obtain total_residue & sequence.
start pdb file must be provided.
ERROR:: Exit from: .\input_pdb.cc line: 2956
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 0
Granted credit 0
application version 5.78
AMD 4800 dual core on W SP2 Home
____________
ID: 45746
P . P . L . Joined: Aug 20 06 Posts: 359 ID: 105843 Credit: 356,627 RAC: 665
I had two W.U.'s of the same type finish short of time on my two systems.
They both have the runtime set for 8hrs and they both stopped after only
4hrs. I have the projects switch every 2hrs. Anyway, they haven't uploaded.
I also have a work unit 94604566 that has been stuck at around 97.2% progress for several hours of CPU time. It has now been running for 5:47 compared to a normal run time of about 3 hours.
I suspended it when the Rosetta problems started. Should I start it up again and let it run or should I abort it?
Had a problem with <http://boinc.bakerlab.org/rosetta/workunit.php?wuid=94715507> (not reported yet, since Rosetta is not yet accepting uploads). Noticed in gkrellm that one of my CPUs was idle (though boincmgr said that the workunit on that CPU was "running").
(If you can tell me where to send it, I have a tar of the slot directory.)
Here is a copy of the stderr.txt from that slot directory:
Graphics are disabled due to configuration...
# cpu_run_time_pref: 28800
# random seed: 1285195
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8d45107]
[0x8d3fefc]
[0x40000420]
[0x8bb4bb4]
[0x8c96f34]
[0x84b6ee1]
[0x80d8665]
[0x85efeb3]
[0x871f807]
[0x871f8b2]
[0x8da9454]
[0x8048111]
Would prefer it if applications which terminated abnormally would go away, rather than making the BOINC client (Linux 32-bit 5.10.8) believe they are still "running".
.
ID: 45825
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
Peter & drhhughes: Some of the recent tasks sent out have long run times per model, some up to about 4 hours on 3 GHz machines. So if your runtime preference is 8hrs, and your first model took 4.5hrs to complete, then beginning a second model would be predicted to take you over the 8hr preference by a significant amount, so Rosetta ends that task early rather than beginning the next model, which would almost certainly take longer.
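To make the 8hr example concrete, here is a minimal sketch of that decision. It assumes the prediction is simply "time used so far plus the time the last model took" and that no overrun at all is tolerated; the real Rosetta rule is not spelled out in this thread, and the function name is hypothetical.

```python
HOUR = 3600  # seconds


def should_start_next_model(elapsed_s, last_model_s, runtime_pref_s):
    """Return True if one more model is predicted to fit in the preference.

    Assumption (not from the thread): the next model is predicted to take
    as long as the last one, and any predicted overrun ends the task.
    """
    predicted_total_s = elapsed_s + last_model_s
    return predicted_total_s <= runtime_pref_s


# The case described above: 8hr preference, first model took 4.5hrs.
print(should_start_next_model(4.5 * HOUR, 4.5 * HOUR, 8 * HOUR))

# A task whose models take 2hrs each still has room for another one.
print(should_start_next_model(2 * HOUR, 2 * HOUR, 8 * HOUR))
```

Under these assumptions the first call declines to start a second model, which matches a task with an 8hr preference ending after one long model.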
So Peter, it is normal for it to end early. drhhughes, it is normal for them to sometimes take longer than your shorter runtime preference. But a task can't be marked as finished until you complete at least one model. The time to completion is really just an estimate based on your 3hr preference. Once tasks get down to <10min left, they just try to keep showing about 10min remaining, because they have no more accurate idea of when that first model will complete. Please let it run.
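The "about 10min remaining" behaviour can be pictured with a small sketch. It assumes the displayed estimate is just the runtime preference minus the time already used, held at a 10 minute floor; the actual formula is not given in this thread, and the function name and `floor_s` parameter are hypothetical.

```python
def displayed_remaining_s(elapsed_s, runtime_pref_s, floor_s=600):
    """Estimate the 'to completion' value in seconds.

    Assumption: preference minus elapsed time, never dropping below
    floor_s (10 minutes) while the first model is still running.
    """
    return max(runtime_pref_s - elapsed_s, floor_s)


# 3hr preference, 1hr used: plenty of time still shown.
print(displayed_remaining_s(1 * 3600, 3 * 3600))

# 3hr preference, 5h47m used on a long first model: held at the floor.
print(displayed_remaining_s(5 * 3600 + 47 * 60, 3 * 3600))
```

This is why a task can sit near 97% with roughly 10 minutes to go for hours: the estimate has nothing better to show until the first model completes.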
mikus, yes, the BOINC manager still seems to have some quirks on Linux. I can only suggest you keep current on the BOINC updates. But last I've seen there still seem to be occasional problems there as well.
It appears from the number of tasks outstanding that the project is accepting uploads and issuing downloads. I just had an upload go through about an hour ago. Keep in mind there are about 50,000 PCs out there that are all trying to report completed results and get more work. We just have to let it keep chugging and working through the backlog. Thanks for your patience.
____________ Rosetta Moderator: Mod.Sense
ID: 45855
P . P . L . Joined: Aug 20 06 Posts: 359 ID: 105843 Credit: 356,627 RAC: 665
From this, should I conclude that Rosetta is not interested in one of its applications experiencing a segmentation violation, nor in the fact that, whereas my expectation is that a failing task would go away, this one just sat there (holding a CPU resource that ought to have been released to other ready tasks)?
.
ID: 45886
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
Not at all. I was simply trying to point out that BOINC manager controls which tasks run and when, so I'm not sure there is anything that the Rosetta team can do to improve things. It's probably going to require a fix to the BOINC code so that it can pick up and schedule other work accordingly.
____________ Rosetta Moderator: Mod.Sense
It may well be that BOINC code needs to be upgraded to handle this unusual situation - an application task "dispatched" by BOINC which does not use any CPU.
BUT it is likely that the existing BOINC code expected that an application task which (according to the task's stderr.txt) had received (SIGSEGV + SIGABRT) would perform a "final exit". My question is - did the Rosetta application task do that ? (If yes, then BOINC dropped the ball; but if no, then it was the application that did not do what BOINC expected.) That is why I would like to send the snapshot of the slot directory to someone at Rosetta (if I knew where to send it), so Rosetta people can check for how far the application had gotten.
mikus
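The "final exit" question can be illustrated from the parent's side. The sketch below is not BOINC code and says nothing about what the Rosetta task actually did; it only shows, on a POSIX system, how a parent process can tell a child that was killed by a signal from one that exited with a status. The helper name `describe_exit` is hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short description.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Child that sends SIGSEGV to itself, standing in for a crashing task.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    child = subprocess.run([sys.executable, "-c", crash])
    print(describe_exit(child.returncode))
```

If a crashed application really terminates, the parent sees it immediately this way. A task that logs SIGSEGV in stderr.txt but stays listed as "running" suggests the process had not actually gone away, which is the case mikus is asking about.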
p.s. By the way, I now see that when I "aborted" the task to get it out of the ready queue, only the "abort" shows in the result's stderr field - overwriting the task's previously accumulated stderr output.
Also, I believe boincmgr is merely the 'GUI' to the BOINC client - the client can (and does) run perfectly well if boincmgr has been closed. So while the BOINC manager *can* control the application tasks (I issued the "abort" from boincmgr), it is the client which performs the details of task scheduling. Unfortunately, I believe the principal means the client has to keep track of what the tasks are doing is to track their CPU consumption. When faced with a task that does not consume CPU, I think the current BOINC *will* lose track.
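For illustration only, here is one way a supervisor could notice a task that is "running" but not consuming CPU. This is a sketch under assumed thresholds (10 minutes of wall clock, under 1 second of CPU gained), not how the BOINC client works; the function and parameter names are hypothetical.

```python
def is_stalled(samples, window_s=600.0, min_cpu_s=1.0):
    """Decide whether a task has stopped using CPU.

    samples: chronological list of (wall_clock_s, cpu_time_s) readings.
    Returns True if, over at least window_s of wall-clock time ending at
    the latest reading, the task gained less than min_cpu_s of CPU time.
    Returns False when there is not yet enough history to judge.
    """
    if len(samples) < 2:
        return False
    latest_wall, latest_cpu = samples[-1]
    # Walk backwards to the newest reading that is old enough to compare.
    for wall, cpu in reversed(samples[:-1]):
        if latest_wall - wall >= window_s:
            return (latest_cpu - cpu) < min_cpu_s
    return False


healthy = [(0, 0.0), (300, 295.0), (600, 590.0)]
idle = [(0, 100.0), (300, 100.0), (600, 100.2)]
print(is_stalled(healthy), is_stalled(idle))
```

A check like this would let a scheduler hand the CPU to another ready task instead of waiting on one that has stopped making progress.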
.
ID: 45892
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
mikus, you can EMail your files to me at the moderator contact EMail address, and I will forward them to the project team for you.
Yes, my terminology needs a little refinement. Most users do not know the difference between the two BOINC pieces, so they don't notice my misuse of terms.
Two questions for you, perhaps just include them in the EMail. What is your runtime preference? (actually that probably shows in the output file), and do you have any idea how long it was in the "running" state, but not using CPU time?
____________ Rosetta Moderator: Mod.Sense
drhhughes The time to completion is really just an estimate based on your 3hr preference. Once they get down to <10min left they start to just try to continue to show about 10min remaining, because they've got no more accurate idea when that first model will complete. Please let it run.
Mod.Sense,
Thanks. I let it run and it finished at about 5 h 57 mins.
Perhaps you could include a sticky note telling people about the "10 minutes to completion" rule. That would have been useful to know.
Also, the latest work unit that I've received has an initial "To completion" of 5 h 57 mins. Is this coincidence or do new work units take the CPU Time of the last work unit as their initial To completion estimate? Again, this would be useful to know since it would explain why the actual run time might not match the estimate.
Result ID 104434245
Name t030__BOINC_CAPRI14_DOCK_FIXBACKBONE-t030_-nosillyloop_nodimerloop_plexinmonomer__2066_697_0
Workunit 94766131
Created 9 Sep 2007 23:57:14 UTC
Sent 10 Sep 2007 0:01:53 UTC
Received 10 Sep 2007 14:05:17 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 20 Sep 2007 0:01:53 UTC
CPU time 13821.375
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1280434
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -218.075 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xxt030.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 56.4278965225222
Granted credit 20
application version 5.78
Is this sort of thing supposed to be happening frequently as, at the moment, my four machines are doing quite a bit of work < 6.5 hrs and then coming up with
<core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1276748
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 303.464 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out
Result ID 104616934
Name 1he8__BOINC_CAPRI14_DOCK_FIXBACKBONE-1he8_-nosillyloop_plexinmonomer__2067_8410_0
Workunit 94936831
Created 10 Sep 2007 10:42:43 UTC
Sent 10 Sep 2007 10:43:28 UTC
Received 10 Sep 2007 22:03:20 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 341092
Report deadline 20 Sep 2007 10:43:28 UTC
CPU time 16852.023625
stderr out
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1272171
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -173.421 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 48.3512209341757
Granted credit 20
application version 5.78
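For anyone comparing several of these results, a small parser can pull the relevant numbers out of the stderr text. This is a minimal sketch based only on the lines visible in the dumps above; the function name and the returned keys are hypothetical, and other Rosetta versions may print different messages.

```python
import re

PREF_RE = re.compile(r"#\s*cpu_run_time_pref:\s*(\d+)")
STUCK_RE = re.compile(
    r"Stuck at score\s+(-?\d+(?:\.\d+)?)\s+for\s+(\d+)\s+seconds"
)


def summarize_stderr(text):
    """Extract runtime preference and watchdog details from stderr text."""
    pref = PREF_RE.search(text)
    stuck = STUCK_RE.search(text)
    return {
        # cpu_run_time_pref is in seconds; 28800 is 8 hours.
        "runtime_pref_hours": int(pref.group(1)) / 3600 if pref else None,
        "watchdog_fired": "Watchdog is ending the run" in text,
        "stuck_score": float(stuck.group(1)) if stuck else None,
        "stuck_seconds": int(stuck.group(2)) if stuck else None,
    }


sample = """# cpu_run_time_pref: 28800
# random seed: 1272171
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -173.421 for 900 seconds
"""
print(summarize_stderr(sample))
```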
____________ e6600 quad @ 2.5ghz
2418 floating point
5227 integer
e6750 dual @ 3.71ghz
3598 floating point
7918 integer
ID: 45970
P . P . L . Joined: Aug 20 06 Posts: 359 ID: 105843 Credit: 356,627 RAC: 665
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 1278145
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -202.375 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
The 20 credits sounds like the nightly credit granting script for failed WUs. I realize they probably show as "success", but they didn't end normally. Some details here.
____________ Rosetta Moderator: Mod.Sense
Result ID 104435631
Name 1g4u__BOINC_CAPRI14_DOCK_FIXBACKBONE-1g4u_-nosillyloop_plexinmonomer__2067_760_0
Workunit 94767401
Created 10 Sep 2007 0:01:42 UTC
Sent 10 Sep 2007 0:01:53 UTC
Received 11 Sep 2007 16:12:47 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 510574
Report deadline 20 Sep 2007 0:01:53 UTC
CPU time 13657.84375
stderr out <core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1279871
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -223.806 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1g4u.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 55.7602549562382
Granted credit 20
application version 5.78
The 20 credits sounds like the nightly credit granting script for failed WUs. I realize they probably show as "success", but they didn't end normally. Some details here.
Had that problem with these results: here and here.
____________
I aborted this WU, nothing was wrong with it as far as I know, I just couldn't finish it in time so I didn't start it.
ID: 46043
Jim Joined: Oct 15 06 Posts: 18 ID: 119359 Credit: 2,341,297 RAC: 0
I'm the second person to get this WU: 94462214
It seems to be missing a file, PROF2.pdb; it will not finish the download,
just an error message, "file not found".
9/12/2007 05:19:59||Suspending network activity - user request
9/12/2007 07:04:30|rosetta@home|[error] rosetta_beta not responding to screensaver, requesting exit
9/12/2007 07:25:19|rosetta@home|[error] rosetta_beta not responding to screensaver, killing it
9/12/2007 07:25:24|rosetta@home|Restarting task 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-rxplxn_0472plexinmonomer__2074_62_0 using rosetta_beta version 578
9/12/2007 10:26:29|rosetta@home|Computation for task 1g4u__BOINC_MINIMIZE2_SCORE12_CAPRI14_DOCK_FIXBACKBONE-1g4u_-rxplxn_0472plexinmonomer__2074_62_0 finished
9/12/2007 11:28:53||Resuming network activity
Never seen this error before!
____________
"Life is like an Ice Cream cone, just when you think you got it licked, it drips all over you!"
http://boinc.bakerlab.org/rosetta/result.php?resultid=104452274
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -164.509 for 900 seconds
____________
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 3919342
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -467.27 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out
</stderr_txt>
]]>
Validate state Valid
Claimed credit 62.7858592995201
Granted credit 20
application version 5.78
ID: 46095
Rhiju Forum moderator Project administrator Project developer Project scientist Joined: Jan 8 06 Posts: 223 ID: 48256 Credit: 3,546 RAC: 0
Thanks to everyone for posting. I think I know how to fix this (the watchdog problem)! I have removed these jobs from the queue for now, and when they are sent out again, we should see fewer premature exits...
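The "Stuck at score ... for 900 seconds" messages suggest a check along the following lines. This is only a guess at the general shape, covering the "stuck" half of the message and not the "going too long" half; the actual watchdog code and the fix being prepared are not described in this thread, and the class name is hypothetical.

```python
class ScoreWatchdog:
    """End a run when the score has not changed for limit_s seconds."""

    def __init__(self, limit_s=900.0):
        self.limit_s = limit_s
        self.last_score = None
        self.last_change_s = None

    def update(self, now_s, score):
        """Record a score reading; return True if the run should end."""
        if self.last_score is None or score != self.last_score:
            # Score moved (or first reading): restart the clock.
            self.last_score = score
            self.last_change_s = now_s
            return False
        return now_s - self.last_change_s >= self.limit_s


wd = ScoreWatchdog()
for t, score in [(0, -218.075), (600, -218.075), (900, -218.075)]:
    print(t, wd.update(t, score))
```

A check of this kind fires on any job type whose score legitimately stays flat for longer than the limit, which is one way such a watchdog could end runs prematurely.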
I am having the same problems. I usually crunch 830 credits, but now 50% of my 5.78 WUs are bad. I do not use the PC for anything else or for any other project, so there is no monitor to see whether the PC is acting strangely while this is happening.
104430347 94762391 9 Sep
104430349 94762393 9 Sep
104430354 94762398 9 Sep
104430355 94762399 9 Sep
104430357 94762401 9 Sep
104430364 94762408 9 Sep
104430359 94762403 9 Sep
104430358 94762402 9 Sep
104430366 94762410 9 Sep
104430373 94762416 9 Sep
104430372 94762415 9 Sep
104430376 94762419 9 Sep
____________ Jmarks
ID: 46117
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2391 ID: 106194 Credit: 0 RAC: 0
Jmarks, sorry for all the failed WUs. Rhiju has pulled those WUs and is working on a fix that will improve things there. Otherwise, about all you can do is cut your runtime preference. The theory is that if your normal credit per task is close to 20, then a failure granted 20 will not have such an impact.
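The arithmetic behind that suggestion can be worked through with the claimed credits posted earlier in this thread. The sketch assumes every watchdog-ended task is granted a flat 20 and treats the claimed credit as what the task would otherwise have earned, which is not guaranteed; the function names are hypothetical.

```python
FAILED_GRANT = 20.0  # flat credit granted to the watchdog-ended tasks above

# Claimed credits from four results posted in this thread.
claimed = [
    56.4278965225222,
    48.3512209341757,
    55.7602549562382,
    62.7858592995201,
]


def shortfall(claimed_credits, granted=FAILED_GRANT):
    """Total credit claimed but not granted across failed tasks."""
    return sum(c - granted for c in claimed_credits)


def relative_loss(claimed_credit, granted=FAILED_GRANT):
    """Fraction of one task's claimed credit that is lost."""
    return max(claimed_credit - granted, 0.0) / claimed_credit


print(round(shortfall(claimed), 2))
print([round(relative_loss(c), 2) for c in claimed])
print(relative_loss(20.0))
```

A task that would normally claim about 20 loses nothing under this model, while the longer tasks above lose well over half of their claim, which is the reasoning for shortening the runtime preference while the problem lasts.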
____________ Rosetta Moderator: Mod.Sense
I'm running v5.78 on a 600 MHz machine. Three of the last nine WUs that I've uploaded reported a "Validate error" (103801478/94222135; 104683771/94998439; and, 104848402/95151762). In total, these three WUs represent just shy of 19 CPU hours, with a combined credit claim of just over 58.
As you might imagine, wasting 19 hours of CPU time because 1 out of every 3 WUs is rejected has me a bit frustrated with R@H! (I'm also running World Community Grid and Seti@Home, neither of which is producing errors.) Is anyone else experiencing similar problems with v5.78?
I realize that 5.78 is (unpleasant) history now, but as a historical record, and in the hope that this information will make future versions of Rosetta more stable, I offer the following:
Five out of six Capri WUs killed by the watchdog, the casualties were:
Even though it was a success, and this is a thread for problems, I have linked it here since there doesn't seem to be a place to post the (rare) successes in v5.78...
This represents about 90 hours on a pretty decent computer, about half of which was accomplished AFTER the problem with 5.78 was identified and (we hope) fixed.
When/If a situation like this arises again, I would specifically request a notice on the home page asking users to ABORT the WUs in question. I suspect the project would be better served thereby.
Respectfully,
David Emigh
____________
Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -457.996 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1he8.out
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 117.804 for 900 seconds
**********************************************************************
That's 3 work units this week that got stuck and gave me only 20 points.
____________
this wu got stuck
# cpu_run_time_pref: 21600
# random seed: 1279344
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -231.574 for 900 seconds
**********************************************************************
Yet another 20 instead of actual points; that's 4 of them now out of over 20 WUs.
____________