Posts by Pepo

1) Message boards : Number crunching : Problems with Minirosetta Version 1.71 (Message 61549)
Posted 2 Jun 2009 by Pepo
Post:
One failed task lb_alnmatrix_threading_alncap__hb_t325__IGNORE_THE_REST_12581_1162_0 without any apparent reason - exit 0, but invalid result.
Maybe a failed computation restart.

Win XP SP3, BOINC 6.6.23.

Peter
2) Message boards : Number crunching : Report long-running models here (Message 57097)
Posted 20 Nov 2008 by Pepo
Post:
...and some rhetorical fighting...

:-)

Peter
3) Message boards : Number crunching : Report long-running models here (Message 57092)
Posted 20 Nov 2008 by Pepo
Post:
I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen?

Peter
4) Message boards : Number crunching : What? I thought this had stopped (Team Founder Change) (Message 56921)
Posted 13 Nov 2008 by Pepo
Post:
I modified the foundership change page so that it now just says it's disabled and to contact me if you have a legitimate request. [...] I'll take the requests case by case and decide on what action to take.

Perfect (for the time being), David, thanks.
I believe FluffyChicken will be at least partially satisfied.

Peter
5) Message boards : Number crunching : What? I thought this had stopped (Team Founder Change) (Message 56909)
Posted 13 Nov 2008 by Pepo
Post:
Peter, I checked with BOINC. They said it's up to Rosetta, that the transfer option has been enabled with BOINC.

Of course, Rosetta sends you an email, confirming that you've applied for the position, and stating you need to wait 60 days for the current team founder to respond to their email query.

Then, every time you log in, and look at your team page, it tells you that you've applied for team founder, and on what date you can make that transfer, if the current founder does not respond.

Only after the 60 day wait, only when you click on the "assume founder" button, are you informed that the team founder transfer has been disabled.

You don't have to worry about me slowing down, Peter. ;)

Yes, that's Rosetta's failure: they thought just of the "bad guys" and did not inform the "legitimate guys" immediately that the whole 60-day process had been switched off.

It was a fast response to the situation, but an incomplete solution.

Peter
6) Message boards : Number crunching : What? I thought this had stopped (Team Founder Change) (Message 56900)
Posted 13 Nov 2008 by Pepo
Post:
Not only that, he would gain access to the email addresses our team members have trusted me with. I'm sure some of the lawyers in my team could sort some sort of charges out against you if it did happen, on privacy grounds.
Pissed off and upset with Rosetta@home :(

I hope if you slow down a bit, you could understand this is nothing Rosetta@home-specific. You are fighting against the wrong party.

Peter
7) Message boards : Number crunching : Minirosetta v1.39 bug thread (Message 56834)
Posted 11 Nov 2008 by Pepo
Post:
Please, please, please notify us of new versions in
Rosetta Application Version Release Log.

There's one catch with this. For about a year now, older BOINC threads occasionally get marked as unread, completely, all messages. No idea why. Even if the thread is marked for notifications, none will come until you read the thread :-(

A bit newer forum SW (than the one Rosetta is using) overcomes this issue by sending a notification not just for the first unread post, but for each newer post.

For instance, I got no notification for the last two messages in the Rosetta Application Version Release Log thread.

Peter
8) Message boards : Number crunching : Report long-running models here (Message 56827)
Posted 11 Nov 2008 by Pepo
Post:
On a slow Duron CPU, Mini 1.39 task 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4653_6486_0 was interrupted after nearly 10 hours because of "going too long". The preference was increased from 1 to 2 hours during the run, as an attempt to see how the model would cope with such a slow machine. (It is probably not able to finish a decoy at all on such a slow host.) It was checkpointing.

During the run, the progress went very fast to some 80-90% and then advanced in 0.1% steps over hours...

Peter
9) Message boards : Number crunching : Minirosetta v1.34 bug thread (Message 56402)
Posted 17 Oct 2008 by Pepo
Post:
By full cycle do you mean one whole day/ 24 hours?

Nooo, not the whole day. I meant one full row of messages between two restarts - it is around 17 minutes in your case. (You said that "the same messages repeat".)
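The length of one such cycle can be read off the message log. Below is a minimal sketch, assuming the log lines have the `timestamp|project|message` form quoted in this thread, with an English day/month name and a trailing time zone abbreviation. The function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted log, without the time zone.
STAMP_FORMAT = "%a %d %b %Y %I:%M:%S %p"


def restart_times(lines):
    """Collect the timestamps of all 'Restarting task' messages."""
    times = []
    for line in lines:
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        stamp, _project, message = parts
        if not message.startswith("Restarting task"):
            continue
        # Drop the trailing time zone abbreviation (e.g. "PDT") before parsing.
        stamp_without_tz = stamp.rsplit(" ", 1)[0]
        times.append(datetime.strptime(stamp_without_tz, STAMP_FORMAT))
    return times


def restart_gaps_minutes(lines):
    """Return the gaps between consecutive restarts, in minutes."""
    times = restart_times(lines)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
```

Feeding it a saved copy of the messages gives one number per cycle, which makes it easy to see whether the restarts are regular.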

The log file is going to be huge. Updates every couple of seconds due to CPU throttling (set at 85%)

OMG :-) understand!

Or is this what you were expecting:
Thu 16 Oct 2008 11:09:25 AM PDT||[cpu_sched] Suspending - CPU throttle
Thu 16 Oct 2008 11:09:25 AM PDT|rosetta@home|[cpu_sched] Preempting hombench_mtyka_foldcst_simple_foldcst_simple_t305___4612_2720_0 (left in memory)
Thu 16 Oct 2008 11:09:25 AM PDT|rosetta@home|[task_debug] task_state=SUSPENDED for hombench_mtyka_foldcst_simple_foldcst_simple_t305___4612_2720_0 from suspend
Thu 16 Oct 2008 11:09:26 AM PDT||[cpu_sched] Resuming - CPU throttle

I was expecting to see anything but throttling :-) If your machine can manage it for a few minutes, could you please temporarily switch it off?

Peter
10) Message boards : Number crunching : Minirosetta v1.34 bug thread (Message 56397)
Posted 16 Oct 2008 by Pepo
Post:
Tue 14 Oct 2008 09:11:13 AM PDT|rosetta@home|Restarting task hombench_mtyka_foldcst_simple_foldcst_simple_t305___4612_2720_0 using minirosetta version 134
Tue 14 Oct 2008 09:30:10 AM PDT|rosetta@home|Restarting task hombench_mtyka_foldcst_simple_foldcst_simple_t305___4612_2720_0 using minirosetta version 134

Removed some lines to save some space here, but the same messages repeat

Could you maybe leave at least one full cycle unstripped?

Found my default cc_config.xml
I assume you mean you want us to switch to
<cpu_sched>1</cpu_sched>
<cpu_sched_debug>1</cpu_sched_debug>

Maybe also <task_debug>?

Peter
11) Message boards : Number crunching : Minirosetta v1.32 bug thread (Message 56376)
Posted 15 Oct 2008 by Pepo
Post:
This is about problems with Minirosetta v1.34 as there does not appear to be a "thread" for it!

Sure there is one ;-) --> Minirosetta v1.34 bug thread

Peter
12) Message boards : Number crunching : why always short of space? (Message 56270)
Posted 7 Oct 2008 by Pepo
Post:
To aid solving these problems, BOINC Manager 6.3.x divides the "total disk usage" pie chart into four parts - dividing the former free part into "free, available to BOINC" and "free, not available to BOINC".

Peter
13) Message boards : Number crunching : Expired deadline (Message 56141)
Posted 1 Oct 2008 by Pepo
Post:
My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results".


That's just a different wording. The term "task" was introduced at a later time, to describe what is assigned to and running on a host.

Previously there were just WUs, consisting of results (which were to be returned to server after being crunched). But it sounded weird if "a result was running on my host"...

Once the files are returned to the server, they are just plain "results of computation". But the official wording might indeed be "tasks" now.

By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task.

But we have to. The scientists use these values to set up how the server should behave during the WU's lifetime.

By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't,

Sure, it does not. Usually in that moment, there are two results: one failed (1 error is fulfilled) and one just resent. (There should be no second additional resent task, because max is 2.)

it terminates upon receiving more than 1, that is 2, error results.

Exactly. If either 2 successful or 2 error results are back, it suddenly no longer fits the (1,2,1) form and the WU is declared as failed.

And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone.

That is up to the devs to comment on. Anyway, they are still able to grant (semi-manually?) credit to any successful result, regardless of the WU state. This way it is often done on beta projects.
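The (1,2,1) reading used in this post can be written down as a small check. This is only a sketch of the interpretation given here (a WU fails once a count exceeds its limit), not the BOINC server code; `WuLimits` and `wu_is_failed` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WuLimits:
    """Hypothetical container for 'max # of error/total/success tasks'."""
    max_error: int
    max_total: int
    max_success: int


def wu_is_failed(limits: WuLimits, errors: int, successes: int, total: int) -> bool:
    """Return True once any count exceeds its limit (simplified model)."""
    return (
        errors > limits.max_error
        or successes > limits.max_success
        or total > limits.max_total
    )


rosetta = WuLimits(max_error=1, max_total=2, max_success=1)

# One failed result plus one resent task: still within (1, 2, 1).
print(wu_is_failed(rosetta, errors=1, successes=0, total=2))
# Two error results back: the error count exceeds 1.
print(wu_is_failed(rosetta, errors=2, successes=0, total=2))
# Two successful results back: the success count exceeds 1.
print(wu_is_failed(rosetta, errors=0, successes=2, total=2))
```

Under this model a third replication would also fail the WU, because the total would exceed 2.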

Peter
14) Message boards : Number crunching : Minirosetta v1.34 bug thread (Message 56138)
Posted 1 Oct 2008 by Pepo
Post:
You could try to use the mentioned debugging flags for a couple of minutes, to see whether they reveal something...

Where is this cc_config.xml file located? I can't seem to find it.

It does not exist if you have not yet created it manually. Its description is here. As Mod.Sense said, the exact place is best described in the first messages.

You need to put just (some of?) the following tags in:
<cc_config>
<log_flags>
<cpu_sched>1</cpu_sched>
<cpu_sched_debug>1</cpu_sched_debug>
<checkpoint_debug>1</checkpoint_debug>
<task_debug>1</task_debug>
</log_flags>
</cc_config>

and then let the client read it (BOINC Mgr / Advanced / Read config file); stopping the client is not necessary.

It will possibly generate a lot of output (maybe not), which will be similar across the restarts. You could select some text from some last checkpoint prior to a restart, until the next subsequent checkpoint after the same restart. We will see... (maybe we will not see anything obvious :-)
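If you would rather not type the tags by hand, the same file content can be generated. A minimal sketch, using only the flag names listed above; the helper name `build_cc_config` is made up, and where to save the result is left to you, because the location of cc_config.xml is not assumed here.

```python
import xml.etree.ElementTree as ET

# Flag names copied from the tags listed in this post.
DEBUG_FLAGS = ["cpu_sched", "cpu_sched_debug", "checkpoint_debug", "task_debug"]


def build_cc_config(flags):
    """Return a cc_config document with each given log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(DEBUG_FLAGS))
```

Pass a shorter list if you want just some of the flags, then save the printed text as cc_config.xml and let the client reread it.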

Peter
15) Message boards : Number crunching : Expired deadline (Message 56134)
Posted 30 Sep 2008 by Pepo
Post:
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified

The trac changeset 276 describes a different case, where a third task is erroneously reissued although total=2.

Would the problem mentioned in this thread be solved using 2-2-2 limit settings?
minimum quorum 1
initial replication 1
max # of error/total/success tasks 2, 2, 2


(Surely it could take longer to discard the WU from server.)

Peter
16) Message boards : Number crunching : Expired deadline (Message 56117)
Posted 30 Sep 2008 by Pepo
Post:
You could let them run but then the other user crunches but gets no credit.
Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy.

The WU has been reassigned because your computer did not finish before the deadline which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise to be able to crunch the reassigned WU until his new deadline.)

I'm taking my words back. It has nothing to do with BOINC server-side software, it is just Rosetta's tight and intolerant settings:

minimum quorum 1
initial replication 1
max # of error/total/success tasks 1, 2, 1


You are right: "poor second guy" ;-)

Peter
17) Message boards : Number crunching : Expired deadline (Message 56113)
Posted 30 Sep 2008 by Pepo
Post:
You should abort tasks that have passed the deadline.

A rule of thumb: you can immediately abort the task if it has not yet started crunching.

If it is already being crunched... then it depends. It is simpler to decide when the tasks take days and you need just the last few hours to finish. Then you can be sure that the reassigned task will finish later.

Rosetta's tasks are usually much shorter, so a reassigned task can be finished at any moment (like a hidden thread :-) So yes - abort it. (Your nice farm will not notice it ;-)

You could let them run but then the other user crunches but gets no credit.
Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy.

The WU has been reassigned because your computer did not finish before the deadline which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise to be able to crunch the reassigned WU until his new deadline.)

Peter
18) Message boards : Number crunching : Problems with Rosetta version 5.98 (Message 56112)
Posted 30 Sep 2008 by Pepo
Post:
not when BOINC is installed in the protected mode (as a service).

You say 'protected mode', thing is I installed Boinc mgr in its standard form, I did not select anything different than the defaults. So is 'protected mode' the default? If so, how do you change it so that 5.98 will show the graphics?

Yes, the default form for 6.x is a service. In this mode, science apps run under the newly added boinc_project account and have no access to your (the logged-in user's) desktop. (See also How to install BOINC as a service (BOINC 6 series) on Windows?.)

Run the installation again and when on the BOINC Configuration page, press the "Advanced" button and then switch off the "Protected application execution" (a.k.a. "Service") mode checkbox. The client and applications will then run under your account, started directly as the Manager's child processes (the "good old way").

Peter
19) Message boards : Number crunching : Minirosetta v1.34 bug thread (Message 56107)
Posted 30 Sep 2008 by Pepo
Post:
Can anyone explain why I might be getting this kind of repeating error?

9/28/2008 11:17:09 PM|rosetta@home|Task hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t302___4585_718_0 exited with zero status but no 'finished' file
9/28/2008 11:17:09 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
9/28/2008 11:17:09 PM|rosetta@home|Restarting task hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t302___4585_718_0 using minirosetta version 134
9/28/2008 11:17:50 PM|rosetta@home|Task hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t302___4585_718_0 exited with zero status but no 'finished' file
9/28/2008 11:17:50 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
9/28/2008 11:17:50 PM|rosetta@home|Restarting task hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t302___4585_718_0 using minirosetta version 134
9/28/2008 11:18:31 PM|rosetta@home|Task hombench_mtyka_foldcst_boinc_test3_foldcst_simple_t302___4585_718_0 exited with zero status but no 'finished' file
9/28/2008 11:18:31 PM|rosetta@home|If this happens repeatedly you may need to reset the project.


This is only an excerpt of the series of messages. I've had a few WUs do this recently.

Either the client or the system is busy (and the client then fails to deliver heartbeat messages to the Rosetta app, which in turn quits after 30 seconds), or the application has some problem of its own and keeps terminating for an unknown reason. I'd bet the reason is the same as DanieI described in Message 56087, although the behavior differs.

You could try to use the mentioned debugging logs for a couple of minutes, to see whether they reveal something...
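The heartbeat idea can be pictured with a small watchdog. This is just an illustrative sketch, not the BOINC API: the class name and its methods are invented, and only the 30-second timeout comes from the description above.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds, the value mentioned above


class HeartbeatWatchdog:
    """Hypothetical app-side watchdog: quit if the client goes silent."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock  # injectable, so the logic can be tested without waiting
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat message from the client has arrived."""
        self._last_beat = self._clock()

    def expired(self):
        """Return True if no heartbeat arrived within the timeout."""
        return self._clock() - self._last_beat > self._timeout
```

An app built this way would call `beat()` whenever a heartbeat arrives and exit once `expired()` becomes true, which on a busy system looks like the task quitting by itself.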

Peter
20) Message boards : Number crunching : Minirosetta v1.34 bug thread (Message 56105)
Posted 30 Sep 2008 by Pepo
Post:
The task is being restarted every hour. Your cpu_run_time_pref seems to be 3 hours, so it is probably something different.

The task is still not reported. Could you please try to add <cpu_sched> and/or <cpu_sched_debug> (and possibly also <task_debug>?) to your cc_config.xml and let the client reread it (BOINC Mgr / Advanced / Read config file) without stopping the client?

Don't confuse the WU runtime preference with the other BOINC preferences, such as how frequently to switch between applications.

I'm not. It was just an idea about one of the possible reasons for the task being restarted - something happening in the application algorithm at the end of preferred time interval...

Basically, the message about restarting is an indication that the task was suspended and is now beginning to run again. Your BOINC preferences basically dictate what conditions would suspend a running task. The simplest being that another task from another project begins running.

...Another was that the client is restarting the task (the default time slot is 1 hour), but the lack of any other messages suggested that this is not the case. The messages suggested that the application is rather terminating itself :-) (But looking at the time stamps, there really seem to be gaps a few minutes long, where some small app could fit in. If there is any.)

If you've included all of the messages, then that is not the case here.

I failed to ask whether the message list is complete at all. I simply did not think of this. Why? Because DanieI currently seems to be a dedicated Rosetta cruncher (sure, his CPIDs could be out of sync).

Perhaps you have set up BOINC to not run while the computer is in use? If so, then each time you step up to use it, BOINC suspends the tasks. Then once the machine is idle for the configured period of time, it resumes what it was doing.

I was hoping the additional logging flags should help to reveal this. But possibly just adding the discarded messages could solve it (a "constructed mystery").

Peter





©2021 University of Washington
https://www.bakerlab.org