1)
Message boards :
Number crunching :
Problems with Minirosetta Version 1.71
(Message 61549)
Posted 2 Jun 2009 by Pepo Post: One failed task lb_alnmatrix_threading_alncap__hb_t325__IGNORE_THE_REST_12581_1162_0 without any apparent reason: exit 0, but invalid result. Maybe a failed computation restart. Win XP SP3, BOINC 6.6.23. Peter
2)
Message boards :
Number crunching :
Report long-running models here
(Message 57097)
Posted 20 Nov 2008 by Pepo Post: ...and some rhetorical fighting... :-) Peter
3)
Message boards :
Number crunching :
Report long-running models here
(Message 57092)
Posted 20 Nov 2008 by Pepo Post: I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen? Peter
4)
Message boards :
Number crunching :
What? I thought this had stopped (Team Founder Change)
(Message 56921)
Posted 13 Nov 2008 by Pepo Post: I modified the foundership change page so that it now just says it's disabled and to contact me if you have a legitimate request. [...] I'll take the requests case by case and decide on what action to take. Perfect (for the time being), David, thanks. I believe FluffyChicken will be at least partially satisfied. Peter
5)
Message boards :
Number crunching :
What? I thought this had stopped (Team Founder Change)
(Message 56909)
Posted 13 Nov 2008 by Pepo Post: Peter, I checked with BOINC. They said it's up to Rosetta, that the transfer option has been enabled with BOINC. Yes, that's Rosetta's failure: thinking just of the "bad guys" and not informing the "legitimate guys" immediately that the whole 60-day process was being switched off. It was a fast response to the situation, but an incomplete solution. Peter
6)
Message boards :
Number crunching :
What? I thought this had stopped (Team Founder Change)
(Message 56900)
Posted 13 Nov 2008 by Pepo Post: Not only that, he would gain access to the email addresses our team members have trusted me with. I'm sure some of the lawyers in my team could sort some sort of charges out against you if it did happen, on privacy grounds. I hope that if you slow down a bit, you will understand this is nothing Rosetta@home-specific. You are fighting against the wrong party. Peter
7)
Message boards :
Number crunching :
Minirosetta v1.39 bug thread
(Message 56834)
Posted 11 Nov 2008 by Pepo Post: Please, please, please notify us of new versions in There's one catch with this. For about a year now, older BOINC threads occasionally get marked as unread, completely, all messages. No idea why. Even if the thread is marked for notifications, none will arrive until you read the thread :-( Somewhat newer forum software (than Rosetta is using) overcomes this issue by sending a notification not just for the first unread post, but for each newer post. For instance, I got no notification for the last two messages in the Rosetta Application Version Release Log thread. Peter
8)
Message boards :
Number crunching :
Report long-running models here
(Message 56827)
Posted 11 Nov 2008 by Pepo Post: On a slow Duron CPU, Mini 1.39 task 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4653_6486_0 was interrupted after nearly 10 hours because of "going too long". The preference was increased from 1 to 2 hours during the run, as an attempt to see how the model would cope with such a slow machine. (Probably not able at all to finish a decoy on such a slow host.) It was checkpointing. During the run, the progress went very fast to some 80-90% and then advanced in 0.1% steps over hours... Peter
9)
Message boards :
Number crunching :
Minirosetta v1.34 bug thread
(Message 56402)
Posted 17 Oct 2008 by Pepo Post: By full cycle do you mean one whole day/ 24 hours? Nooo, not the whole day. I meant one full row of messages between two restarts - it is around 17 minutes in your case. (You said that "the same messages repeat".) The log file is going to be huge. Updates every couple of seconds due to CPU throttling (set at 85%) OMG :-) understand! Or is this what you were expecting: I was expecting to see anything but throttling :-) If your machine can manage it for a few minutes, could you please temporarily switch it off? Peter
10)
Message boards :
Number crunching :
Minirosetta v1.34 bug thread
(Message 56397)
Posted 16 Oct 2008 by Pepo Post: Tue 14 Oct 2008 09:11:13 AM PDT|rosetta@home|Restarting task hombench_mtyka_foldcst_simple_foldcst_simple_t305___4612_2720_0 using minirosetta version 134 Could you maybe leave at least one full cycle unstripped? Found my default cc_config.xml Maybe also <task_debug>? Peter
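The <task_debug> flag suggested above is a log flag in the client's cc_config.xml. What follows is a minimal Python sketch of generating such a file. It assumes the usual layout of a <log_flags> section inside <cc_config>; the helper name build_cc_config is hypothetical, and <task_debug> is the only flag actually named in the post.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given log flags.

    Assumption: each flag is a child of <log_flags> with the value 1.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    # Pretty-print where the interpreter supports it (Python 3.9+).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Only task_debug is taken from the post; add other flags at your own risk.
config_text = build_cc_config(["task_debug"])
print(config_text)
```

The printed text would be saved as cc_config.xml in the BOINC data directory and then re-read by the client.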
11)
Message boards :
Number crunching :
Minirosetta v1.32 bug thread
(Message 56376)
Posted 15 Oct 2008 by Pepo Post: This is about problems with Minirosetta v1.34 as there does not appear to be a "thread" for it! Sure there is one ;-) --> Minirosetta v1.34 bug thread Peter
12)
Message boards :
Number crunching :
why always short of space?
(Message 56270)
Posted 7 Oct 2008 by Pepo Post: To aid solving these problems, BOINC Manager 6.3.x divides the "total disk usage" pie chart into four parts - dividing the former free part into "free, available to BOINC" and "free, not available to BOINC". Peter
13)
Message boards :
Number crunching :
Expired deadline
(Message 56141)
Posted 1 Oct 2008 by Pepo Post: My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results".

That's just a different wording. The term "task" was introduced at a later time, to describe what is assigned to and running on a host. Previously there were just WUs, consisting of results (which were to be returned to the server after being crunched). But it sounded weird if "a result was running on my host"... Once the files are returned to the server, they are just plain "results of computation". But the official wording might indeed be "tasks" now.

By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task.

But we have to. The scientists use these values to set up how the server should behave during the WU's lifetime.

By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't,

Sure, it does not. Usually at that moment, there are two results: one failed (1 error is fulfilled) and one just resent. (There should be no second additional resent task, because max is 2.)

it terminates upon receiving more than 1, that is 2, error results.

Exactly. If either 2 successful or 2 error results are back, suddenly it does not fit in the (1,2,1) form and the WU is declared as failed.

And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone.

That is up to the devs to comment on. Anyway, they are still able to grant (semi-manually?) credit to any successful result, regardless of the WU state. This is how it is often done on beta projects. Peter
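The (1,2,1) reading discussed in this post can be written down as a small check. This is only a simplified Python model of the interpretation given here, not the BOINC server's actual logic; the names Limits, exceeds_limits and may_send_another are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """The "max # of error/total/success" triple of a workunit."""
    max_error: int
    max_total: int
    max_success: int


def exceeds_limits(limits, errors, successes):
    """True if the returned results no longer fit the limits (WU failed)."""
    return (
        errors > limits.max_error
        or successes > limits.max_success
        or errors + successes > limits.max_total
    )


def may_send_another(limits, issued):
    """True if one more replication may still be sent out."""
    return issued < limits.max_total


rosetta = Limits(max_error=1, max_total=2, max_success=1)

# One error plus one resent copy still fits; a third copy must not be sent.
one_error_ok = not exceeds_limits(rosetta, errors=1, successes=0)
third_copy_allowed = may_send_another(rosetta, issued=2)

# Two errors, or two successes, break the (1,2,1) form.
two_errors_fail = exceeds_limits(rosetta, errors=2, successes=0)
two_successes_fail = exceeds_limits(rosetta, errors=0, successes=2)

print(one_error_ok, third_copy_allowed, two_errors_fail, two_successes_fail)
```

Under this model, a second successful result arriving after a reissue is exactly the case that pushes the WU over its limits, which is the situation the thread is complaining about.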
14)
Message boards :
Number crunching :
Minirosetta v1.34 bug thread
(Message 56138)
Posted 1 Oct 2008 by Pepo Post: You could try to use the mentioned debugging flags for a couple of minutes, to see whether it will reveal something... It does not exist if you have not yet created it manually. Its description is here. As Mod.Sense said, the exact place is best described in the first messages. You need to put just (some of?) the following tags in: <cc_config> and then let the client read it (BOINC Mgr / Advanced / Read config file); stopping the client is not necessary. It will possibly generate a lot of output (maybe not), which will be similar across the restarts. You could select some text from the last checkpoint prior to a restart until the next checkpoint after the same restart. We will see... (maybe we will not see anything obvious :-) Peter
15)
Message boards :
Number crunching :
Expired deadline
(Message 56134)
Posted 30 Sep 2008 by Pepo Post: This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified The changeset [trac]changeset:276[/trac] describes a different case, where a third task is being erroneously reissued, although the total=2. Would the problem mentioned in this thread be solved using 2-2-2 limit settings? minimum quorum 1 (Surely it could take longer to discard the WU from the server.) Peter
16)
Message boards :
Number crunching :
Expired deadline
(Message 56117)
Posted 30 Sep 2008 by Pepo Post: You could let them run but then the other user crunches but gets no credit. I'm taking my words back. It has nothing to do with BOINC server-side software, it is just Rosetta's tight and intolerant settings: minimum quorum 1 You are right: "poor second guy" ;-) Peter
17)
Message boards :
Number crunching :
Expired deadline
(Message 56113)
Posted 30 Sep 2008 by Pepo Post: You should abort tasks that have passed the deadline.

A rule of thumb: you can immediately abort the task if it has not yet started crunching. If it is already being crunched... then it depends. It is simpler to decide when the tasks take days and you need the last few hours to finish. Then you can be sure that the reassigned task will surely finish later. Rosetta's tasks are usually much shorter, a reassigned task can be finished at any moment (like a hidden thread :-) So yes - abort it. (Your nice farm will not notice it ;-)

You could let them run but then the other user crunches but gets no credit. The WU has been reassigned because your computer did not finish before the deadline which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit.

This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise to be able to crunch the reassigned WU until his new deadline.) Peter
18)
Message boards :
Number crunching :
Problems with Rosetta version 5.98
(Message 56112)
Posted 30 Sep 2008 by Pepo Post: not when BOINC is installed in the protected mode (as a service). Yes, the default form for 6.x is a service. In this mode, science apps run under the newly added boinc_project account and have no access to your (the logged-in user's) desktop. (See also How to install BOINC as a service (BOINC 6 series) on Windows?.) Run the installation again and, when on the BOINC Configuration page, press the "Advanced" button and then switch off the "Protected application execution" (a.k.a. "Service") mode checkbox. The client and applications will then run under your account, started directly as the Manager's child processes (the "good old way"). Peter
19)
Message boards :
Number crunching :
Minirosetta v1.34 bug thread
(Message 56107)
Posted 30 Sep 2008 by Pepo Post: Can anyone explain why I might be getting this kind of repeating error? Either the client or the system is busy (and the client then fails to deliver heartbeat messages to the Rosetta app, which in turn quits after 30 seconds), or the application has some problem of its own and keeps terminating for an unknown reason. I'd bet the reason is the same as DanieI described in Message 56087, although the behavior differs. You could try to use the mentioned debugging logs for a couple of minutes, to see whether it will reveal something... Peter
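The heartbeat behaviour described here (the app quits when it has heard nothing from the client for 30 seconds) can be illustrated with a small watchdog. This is a Python sketch of the general idea only, not the BOINC API; the class name HeartbeatWatchdog and the injectable clock are assumptions made for the example, and the 30-second figure is taken from the post.

```python
import time


class HeartbeatWatchdog:
    """Tracks the time since the last heartbeat from the client."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable so the demo needs no sleeping
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat message has just arrived."""
        self.last_beat = self.clock()

    def expired(self):
        """True once the client has been silent for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout


# Demo with a fake clock: a busy client delivers one heartbeat, then stalls.
now = [0.0]
watchdog = HeartbeatWatchdog(timeout=30.0, clock=lambda: now[0])

now[0] = 29.0
alive_before_beat = not watchdog.expired()   # 29 s of silence: still fine
watchdog.beat()

now[0] = 60.0
gave_up = watchdog.expired()                 # 31 s since the last beat: quit

print(alive_before_beat, gave_up)
```

A client or system that is too busy to deliver heartbeats in time would trip such a check repeatedly, which matches the repeating restarts reported in the thread.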
20)
Message boards :
Number crunching :
Minirosetta v1.34 bug thread
(Message 56105)
Posted 30 Sep 2008 by Pepo Post: The task is being restarted every hour. Your cpu_run_time_pref seems to be 3 hours, so it is probably something different.

I'm not. It was just an idea about one of the possible reasons for the task being restarted - something happening in the application algorithm at the end of the preferred time interval...

Basically, the message about restarting is an indication that the task was suspended and is now beginning to run again. Your BOINC preferences basically dictate what conditions would suspend a running task. The simplest being that another task from another project begins running.

...Another was that the client is restarting the task (the default time slot is 1 hour), but the lack of any other messages suggested that this is not the case. The messages suggested the application is rather terminating itself :-) (But looking at the time stamps, there really seem to be gaps a few minutes long, where some small app could fit in. If there is any.)

If you've included all of the messages, then that is not the case here.

I failed to ask whether the message list is complete at all. I simply did not think of this. Why? Because DanieI currently seems to be a dedicated Rosetta cruncher (sure, his CPIDs could be out of sync).

Perhaps you have set up BOINC to not run while the computer is in use? If so, then each time you step up to use it, BOINC suspends the tasks. Then once the machine is idle for the configured period of time, it resumes what it was doing.

I was hoping the additional logging flags would help to reveal this. But possibly just adding the discarded messages could solve it (a "constructed mystery"). Peter