minirosetta 2.05

Message boards : Number crunching : minirosetta 2.05

robertmiles
Joined: 16 Jun 08
Posts: 1218
Credit: 13,491,596
RAC: 4,502
Message 65429 - Posted: 27 Feb 2010, 3:00:02 UTC - in response to Message 65152.  
Last modified: 27 Feb 2010, 3:55:14 UTC

I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single-core processors and two have AMD 64 dual-core processors. The Rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (Climate Prediction, Malaria Control or World Community Grid).


Try suspending and resuming the WU, or even shutting BOINC down entirely and restarting it, before aborting the WU. I have found this works often, though not always, for me.


Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot.

https://boinc.bakerlab.org/rosetta/result.php?resultid=320652086

64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though)

t311__boinc_filtered_loopbuild_threading type workunit

Before the reboot, showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, not using any CPU time

Rebooted, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again.
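
The symptom described here (elapsed time far ahead of CPU time) can be checked with a little arithmetic. The following is only a rough sketch: the function names and the 0.5 threshold are assumptions made for illustration, not anything BOINC or Rosetta provides.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown by the BOINC Manager, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stalled(cpu_time, elapsed_time, min_ratio=0.5):
    """Return True when CPU time is a small fraction of elapsed time.

    min_ratio is a hypothetical threshold, not a BOINC setting.
    """
    elapsed = hms_to_seconds(elapsed_time)
    if elapsed == 0:
        return False
    return hms_to_seconds(cpu_time) / elapsed < min_ratio


# Figures from the post above: CPU time 03:39:05, elapsed time 20:29:26.
print(looks_stalled("03:39:05", "20:29:26"))
```

With the figures quoted above, the task had used roughly 18% of its elapsed time as CPU time, which is why it stands out as stalled.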

robertmiles
Joined: 16 Jun 08
Posts: 1218
Credit: 13,491,596
RAC: 4,502
Message 65430 - Posted: 27 Feb 2010, 3:32:00 UTC - in response to Message 65229.  
Last modified: 27 Feb 2010, 3:53:38 UTC

Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 Linux system, a pretty powerful one. Looking at my tasks log, I have had about 120 WUs assigned until today, but only 3-4 of them completed successfully. The others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which the FAQ says is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph and cosmology, and with the exception of the einstein tasks, which also seem to end up in computation errors, every other program is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).

Thanks
Neo2


One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions.
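
As a rough illustration of that search, the sketch below scans log lines for the string boinc_lockfile. The function name and the sample lines are hypothetical and invented for the example; they are not real BOINC output.

```python
def find_lockfile_lines(log_lines, needle="boinc_lockfile"):
    """Return (line_number, text) pairs for log lines mentioning the needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if needle in line
    ]


# Hypothetical log excerpt, made up for illustration only.
sample_log = [
    "first example line\n",
    "example line that mentions boinc_lockfile\n",
    "example line about an absent output file\n",
]

for number, text in find_lockfile_lines(sample_log):
    print(f"line {number}: {text}")
```

Since an open text file yields its lines when iterated, the same function could be given a file object for whatever log the client writes (for example the boinc.log mentioned in the quoted post).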

Minardi
Joined: 19 Jan 10
Posts: 1
Credit: 1,117,527
RAC: 0
Message 65460 - Posted: 5 Mar 2010, 2:11:31 UTC

I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines.

ramostol
Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 65480 - Posted: 7 Mar 2010, 11:42:31 UTC

My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 22
Message 65481 - Posted: 7 Mar 2010, 21:53:11 UTC - in response to Message 65480.  
Last modified: 7 Mar 2010, 22:00:19 UTC

My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.


There is nothing wrong on your end. This is a very old (and rare) bug in the boinc server software. Take a look here.
Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server-code.

AdeB

Greg_BE
Joined: 30 May 06
Posts: 5652
Credit: 5,622,096
RAC: 0
Message 65482 - Posted: 7 Mar 2010, 22:49:27 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=322413556
tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
CPU time 4.4375
stderr out

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65492 - Posted: 9 Mar 2010, 0:22:41 UTC

Looks like there are still problems with this app, same task

it just restarted near the end and i got it in the neck, not impressed.

tyrsim_3gbn_1c81_Protein_interface_design_25Feb2010_18415_410_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=294414088


# cpu_run_time_pref: 14400
======================================================
DONE :: 348 starting structures 14397.5 cpu seconds
This process generated 348 decoys from 348 attempts
======================================================


# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14498.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid

Claimed credit 102.297287162446

Granted credit 0.384433279143336

application version 2.05
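
The two DONE summaries in this stderr output can be pulled out mechanically, which makes a restart like the one described here easier to spot. This is a minimal sketch: the regular expression is an assumption based only on the lines quoted in this thread, and the function name is hypothetical.

```python
import re

# Pattern inferred from the summary lines quoted in this thread.
DONE_PATTERN = re.compile(
    r"DONE :: +(\d+) starting structures +([\d.]+) cpu seconds"
)


def parse_done_lines(stderr_text):
    """Return a list of (starting_structures, cpu_seconds) tuples."""
    return [
        (int(count), float(seconds))
        for count, seconds in DONE_PATTERN.findall(stderr_text)
    ]


# The two summary lines quoted in the post above.
stderr_text = """
DONE :: 348 starting structures 14397.5 cpu seconds
DONE :: 2 starting structures 14498.9 cpu seconds
"""

runs = parse_done_lines(stderr_text)
print(runs)
if len(runs) > 1:
    print("More than one DONE summary: the task may have restarted.")
```

In the output above, the last summary reports 2 decoys while the earlier one reports 348, which fits the poster's reading that the task restarted near the end.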


apohawk
Joined: 13 Sep 08
Posts: 5
Credit: 28,734,325
RAC: 40,033
Message 65530 - Posted: 12 Mar 2010, 10:55:16 UTC

This work unit reports "success" despite having errors at the end.

https://boinc.bakerlab.org/rosetta/result.php?resultid=323517090

application: minirosetta 2.05
name of work unit: ina2inaN_to_NOE__18638_5045_0
Outcome: Success
Exit status: 0 (0x0)

CPU time: 2212.594

but at the end of the result we got:
# cpu_run_time_pref: 7200

ERROR: Unrecognized edge type!
ERROR:: Exit from: ..\..\src\core\kinematics\util.cc line: 1422
called boinc_finish


CPU: Phenom II 945
OS: WinXP 64 SP2

Duzz
Joined: 14 Nov 05
Posts: 1
Credit: 13,148
RAC: 0
Message 65544 - Posted: 13 Mar 2010, 13:16:48 UTC
Last modified: 13 Mar 2010, 13:17:53 UTC

Over the last few days I have had several WUs go idle after some time of computation. The Windows XP Task Manager shows no CPU activity. If one does not notice this, many hours of WU processing are lost, which is very unproductive for the project.

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 22
Message 65547 - Posted: 13 Mar 2010, 22:39:05 UTC

In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error:
ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 397
BOINC:: Error reading and gzipping output datafile: default.out

AdeB

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,886,349
RAC: 9,799
Message 65555 - Posted: 15 Mar 2010, 3:44:52 UTC

Today I got strange validation errors: "Task was reported too late to validate"
But there are 4 days until deadline (19 Mar)!
Links to the tasks:
https://boinc.bakerlab.org/rosetta/result.php?resultid=323161767
https://boinc.bakerlab.org/rosetta/result.php?resultid=323181972
https://boinc.bakerlab.org/rosetta/result.php?resultid=323205144

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,836,395
RAC: 0
Message 65560 - Posted: 15 Mar 2010, 17:35:23 UTC - in response to Message 65555.  

Today I got strange validation errors: "Task was reported too late to validate"
But there are 4 days until deadline (19 Mar)!


I see more things happened during the weekend. It looks like you detached or something similar, which might cause all incomplete units to be reported. Did you perhaps restore a backup which tried to continue from an earlier point?

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65561 - Posted: 15 Mar 2010, 23:09:15 UTC

What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out.
Rosetta Moderator: Mod.Sense

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,886,349
RAC: 9,799
Message 65564 - Posted: 16 Mar 2010, 2:38:42 UTC - in response to Message 65560.  


I see more things happened during the weekend. It looks like you detached or something similar, which might cause all incomplete units to be reported. Did you perhaps restore a backup which tried to continue from an earlier point?


The "detached" error is BOINC-related.
Actually I have not detached from the project, but rather connected a new computer. After that, the BOINC client initially went mad: first it started downloading to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+); then at some point it thought better of it, registered the new computer on the server under a new ID, and deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".)

Note: there was no transfer of any BOINC-related files from the old computer to the new one. The new client was a clean install from the distribution. So I do not know what caused this behavior. Maybe the fact that the computer connects to the internet from the same IP?

Hmm, now I think that in principle such a validate error could happen because of this. If one computer "cancels" the (mistakenly downloaded) tasks while the second is still working on them, could the server issue the same WU to another volunteer's computer and shift the deadline?

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,836,395
RAC: 0
Message 65565 - Posted: 16 Mar 2010, 5:33:06 UTC

You still would've gotten credits if you had managed to report before the other computer. :) Anyway, from what you're saying about the other computer, I do think the "too late to validate" error was more likely related to the new PC than to a bug in the science application. Maybe a problem with the BOINC manager itself?

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65567 - Posted: 16 Mar 2010, 16:32:04 UTC

True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found.

Certainly many users that have multiple machines are connecting from the same IP address (I'm talking about the router's public IP address that the project servers see). And many other users come in via dynamic IPs, so their address is always different. My understanding is that BOINC uses many factors to determine whether a given machine is the same as an existing registered one, to keep it all straight and separated correctly. Factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... so a fresh install should not have caused the client to "go mad" on either machine. Indeed many users have identically configured machines at the same site coming in via the same IP.
Rosetta Moderator: Mod.Sense

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 65570 - Posted: 17 Mar 2010, 3:07:31 UTC

This took 8 hrs 2 min on my 3 GHz Intel, with a four-hour run time preference.

aqp9__boinc_aqp9_fast_run01_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18658_1421_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=296064742

# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_2B6OA_15_0001_Remodel__loop_1_0_0_S ... success!
BOINC:: CPU time: 28914.7s, 14400s + 14400s[2010- 3-17 13:39:17:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28914.7 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fe9420]
[0x91d6455]
[0x842671e]
[0x83e85d3]
[0x80a7840]
[0x84381fe]
[0x812a54a]
[0x812b82d]
[0x86aa16b]
[0x8243cf5]
[0x8049897]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>
]]>
Validate state Valid
Claimed credit 69.3077894676244
Granted credit 25.52312719487 -- for 8 hrs.
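
The numbers in the log line "CPU time: 28914.7s, 14400s + 14400s" can be checked by hand. The sketch below assumes that the first 14400s is the run time preference (cpu_run_time_pref) and that the second is an extra allowance added on top of it. That reading is an assumption drawn from the log text alone, and the function names are hypothetical.

```python
def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600.0


def over_allowance(cpu_seconds, run_time_pref, extra_allowance):
    """Return True when CPU time exceeds the preference plus the allowance."""
    return cpu_seconds > run_time_pref + extra_allowance


# Values copied from the log line in the post above.
cpu_seconds = 28914.7
run_time_pref = 14400      # cpu_run_time_pref from the log
extra_allowance = 14400    # assumed meaning of the second 14400s

print(round(hours(cpu_seconds), 2))
print(over_allowance(cpu_seconds, run_time_pref, extra_allowance))
```

Under that assumption the task ran for about 8.03 hours of CPU time, slightly past the combined 8-hour total, which matches the "8 hrs 2 min" reported above.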




Sid Celery
Joined: 11 Feb 08
Posts: 1923
Credit: 36,212,006
RAC: 23,867
Message 65572 - Posted: 17 Mar 2010, 3:22:49 UTC
Last modified: 17 Mar 2010, 3:24:29 UTC

On this desktop I got a Compute error Exit status -177 (0xffffff4f) in the following task:
aqp9__boinc_aqp9_fast_run01_blast_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18653_30510_0
<message>
Maximum disk usage exceeded
</message>

I did notice while it was running it was about 2 hours over my 8 hour runtime, on Model 6 Step 19051, but it reported 0 CPU time in the end.

I allow 10 GB of disk space for BOINC and have about 581 MB in use on 5 current or waiting tasks, with 9.43 GB free.

Also, on this laptop I got a validate error on the following task a few days back:
t290__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_8451_0

Mad_Max
Joined: 31 Dec 09
Posts: 206
Credit: 19,886,349
RAC: 9,799
Message 65575 - Posted: 17 Mar 2010, 14:13:21 UTC

To Mod.Sense:
Yes, it is certainly not a problem with minirosetta 2.05. It looks like some rare bug in the BOINC server, probably connected with the fact that the computer had the same IP (not only the "external" router IP, but the internal one too) and the same network name. The new computer was a replacement for the old one, so I gave the new one the same name as the previous one, after first renaming the old one. Actually, this should not be a factor, because BOINC identifies hosts by an internal ID (such as 1211592) and not by Windows names. But a bug is a bug, and something did not go as intended :)
In any case, I have not come across any more such errors, so I think this can be forgotten.

To Sid Celery:
I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error, and every task of this type overran the target CPU time. There were also strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models).
So now I cancel all jobs of this type if I see them in the job queue.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65576 - Posted: 17 Mar 2010, 20:51:24 UTC

Sid, each task also has a configured maximum disk space. So that must be the limit that was hit by the task you mention. This is just one more failsafe that is in place to help assure things keep running smoothly.
Rosetta Moderator: Mod.Sense
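
To illustrate the idea of a per-task disk limit, here is a minimal sketch that totals the file sizes under one directory and compares the total with a limit. The function names, the temporary directory and the 1024-byte limit are all made up for the example; the actual limit configured for a Rosetta task is not given in this thread.

```python
import os
import tempfile


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.isfile(full_path):
                total += os.path.getsize(full_path)
    return total


def exceeds_limit(path, limit_bytes):
    """Return True when the directory uses more than limit_bytes."""
    return directory_size_bytes(path) > limit_bytes


# Demonstration on a throwaway directory; limit_bytes is a made-up value,
# not the limit configured for any real Rosetta task.
with tempfile.TemporaryDirectory() as task_dir:
    with open(os.path.join(task_dir, "default.out"), "wb") as handle:
        handle.write(b"x" * 2048)
    print(exceeds_limit(task_dir, limit_bytes=1024))
```

This also shows why the 10 GB overall allowance Sid mentions does not help: a check of this kind looks only at the space used by the one task.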