Message boards : Number crunching : minirosetta 2.05
Author | Message |
---|---|
Joined: 16 Jun 08 Posts: 1218 Credit: 13,479,644 RAC: 4,179 |
I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single-core processors and two have AMD 64 dual-core processors. The Rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (climate prediction, malaria control or World Community Grid). Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot. https://boinc.bakerlab.org/rosetta/result.php?resultid=320652086 64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though). Workunit type: t311__boinc_filtered_loopbuild_threading. Before the reboot, it showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, and was not using any CPU time. After the reboot, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again. |
Joined: 16 Jun 08 Posts: 1218 Credit: 13,479,644 RAC: 4,179 |
Hi! One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions. |
Minardi Joined: 19 Jan 10 Posts: 1 Credit: 1,117,527 RAC: 0 |
I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines. |
ramostol Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
My first Protein_interface (validation related?) error as far as I know - MacOS 10.5: tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2 Outcome Success Client state Done Exit status 0 (0x0) CPU time 21540.8 <core_client_version>6.10.36</core_client_version> <![CDATA[ <stderr_txt> [...] # cpu_run_time_pref: 21600 ====================================================== DONE :: 327 starting structures 21540.3 cpu seconds This process generated 327 decoys from 327 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Workunit error - check skipped One of two wingmen validated successfully after his deadline, but with far fewer decoys completed. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 27 |
My first Protein_interface (validation related?) error as far as I know - MacOS 10.5: There is nothing wrong on your end. This is a very old (and rare) bug in the boinc server software. Take a look here. Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server-code. AdeB |
Joined: 30 May 06 Posts: 5652 Credit: 5,622,096 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=322413556 tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1 Outcome Client error Client state Compute error Exit status 1 (0x1) CPU time 4.4375 stderr out <core_client_version>6.10.18</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Looks like there are still problems with this app. The same task restarted near the end and I got it in the neck; not impressed. tyrsim_3gbn_1c81_Protein_interface_design_25Feb2010_18415_410_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=294414088 # cpu_run_time_pref: 14400 ====================================================== DONE :: 348 starting structures 14397.5 cpu seconds This process generated 348 decoys from 348 attempts ====================================================== # cpu_run_time_pref: 14400 ====================================================== DONE :: 2 starting structures 14498.9 cpu seconds This process generated 2 decoys from 2 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> Validate state Valid Claimed credit 102.297287162446 Granted credit 0.384433279143336 application version 2.05 |
Joined: 13 Sep 08 Posts: 5 Credit: 28,670,845 RAC: 41,730 |
This work unit reports "success" despite having errors at the end. https://boinc.bakerlab.org/rosetta/result.php?resultid=323517090 application: minirosetta 2.05 name of work unit: ina2inaN_to_NOE__18638_5045_0 Outcome: Success Exit status: 0 (0x0) CPU time: 2212.594 but at the end of the result we got: # cpu_run_time_pref: 7200 ERROR: Unrecognized edge type! ERROR:: Exit from: ....srccorekinematicsutil.cc line: 1422 called boinc_finish CPU: Phenom II 945 OS: WinXP 64 SP2 |
Duzz Joined: 14 Nov 05 Posts: 1 Credit: 13,148 RAC: 0 |
During the last few days I have had several WUs sit idle after some time of computation. The Windows XP task manager shows no CPU activity. If one does not notice this, many hours of WU processing get lost, which is very unproductive for the project. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 27 |
In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error: ERROR: ct == final_atoms AdeB |
Mad_Max Joined: 31 Dec 09 Posts: 206 Credit: 19,868,493 RAC: 9,982 |
Today I got strange validation errors: "Task was reported too late to validate". But there are still 4 days until the deadline (19 Mar)! Links to the tasks: https://boinc.bakerlab.org/rosetta/result.php?resultid=323161767 https://boinc.bakerlab.org/rosetta/result.php?resultid=323181972 https://boinc.bakerlab.org/rosetta/result.php?resultid=323205144 |
transient Joined: 30 Sep 06 Posts: 376 Credit: 10,836,395 RAC: 0 |
Today I got strange validation errors: "Task was reported too late to validate" I see more things happened during the weekend. I'm seeing you detached or something which might cause all incomplete units to report. Did you perhaps restore a backup which tried to continue from an earlier point? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out. Rosetta Moderator: Mod.Sense |
Mad_Max Joined: 31 Dec 09 Posts: 206 Credit: 19,868,493 RAC: 9,982 |
The "detached" error is BOINC related. Actually I have not detached from the project, but rather connected a new computer. After that the BOINC client initially went mad: first it started downloading to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+); then at some point it thought better of it, registered the new computer on the server under a new ID, and then deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".) Note: there was no transfer of any BOINC-related files from the old computer to the new one. The new client was a clean install from the distribution. So I do not know what caused this behavior. Maybe the fact that the computer is connected to the internet under the same IP? Hmm, now I think that in principle such a validate error could happen because of it. If one computer "cancels" the (mistakenly downloaded) tasks while the second is still working on them, could the server issue the same WU to another volunteer's computer and shift the deadline? |
transient Joined: 30 Sep 06 Posts: 376 Credit: 10,836,395 RAC: 0 |
You still would've gotten credits if you had managed to report before the other computer. :) Anyway, from what you're saying about the other computer, I do think the "too late to validate" error was more likely related to the new PC than to a bug in the science application. Maybe a problem with the BOINC manager itself? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found. Certainly many users that have multiple machines are connecting from same IP address (I'm talking the router's public IP address that the project servers see). And many other users come in via dynamic IPs, and so it is always different. My understanding is that BOINC uses many factors to determine if a given machine is the same as an existing registered one to keep it all straight and separated correctly. Factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... so a fresh install should not have caused the client to "go mad" on either machine. Indeed many users have identically configured machines at same site coming in via same IP. Rosetta Moderator: Mod.Sense |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
This took 8 hrs 2 min on my 3 GHz Intel, with a four-hour target run time. aqp9__boinc_aqp9_fast_run01_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18658_1421_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=296064742 # cpu_run_time_pref: 14400 Continuing computation from checkpoint: chk_S_2B6OA_15_0001_Remodel__loop_1_0_0_S ... success! BOINC:: CPU time: 28914.7s, 14400s + 14400s[2010- 3-17 13:39:17:] :: BOINC InternalDecoyCount: 0 ====================================================== DONE :: 1 starting structures 28914.7 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== called boinc_finish SIGSEGV: segmentation violation Stack trace (15 frames): [0x96c49b3] [0x96ee888] [0xb7fe9420] [0x91d6455] [0x842671e] [0x83e85d3] [0x80a7840] [0x84381fe] [0x812a54a] [0x812b82d] [0x86aa16b] [0x8243cf5] [0x8049897] [0x974c15c] [0x8048121] Exiting... </stderr_txt> ]]> Validate state Valid Claimed credit 69.3077894676244 Granted credit 25.52312719487 -- for 8 hrs. |
Sid Celery Joined: 11 Feb 08 Posts: 1920 Credit: 36,162,210 RAC: 23,629 |
On this desktop I got a Compute error, Exit status -177 (0xffffff4f), in the following task: aqp9__boinc_aqp9_fast_run01_blast_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18653_30510_0 <message> I did notice while it was running that it was about 2 hours over my 8 hour runtime, on Model 6 Step 19051, but it reported 0 CPU time in the end. I allow 10 GB of disk space for BOINC and have about 581 MB in use on 5 current or waiting tasks, 9.43 GB free. Also, on this laptop I got a validate error on the following task a few days back: t290__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_8451_0 |
Mad_Max Joined: 31 Dec 09 Posts: 206 Credit: 19,868,493 RAC: 9,982 |
To Mod.Sense: Yes, it is certainly not a problem with minirosetta 2.05. It looks like some rare bug in the BOINC server, probably connected with the fact that the computer had the same IP (not only the "external" router IP, but the internal one too) and the same network name. The new computer was a replacement for the old one, so I gave the new one the same name as the previous one, after first renaming the old one. Actually, this should not be a factor, because BOINC identifies hosts by an internal ID (such as 1211592) and not by Windows names. But a bug is a bug, and something did not go as intended :) In any case, no more such errors have come up, so I think this can be forgotten. To Sid Celery: I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error, and all tasks of this type exceeded the target CPU time, plus there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models). So now I am canceling all jobs of this type if I see them in the job queue. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Sid, each task also has a configured maximum disk space. So that must be the limit that was hit by the task you mention. This is just one more failsafe that is in place to help assure things keep running smoothly. Rosetta Moderator: Mod.Sense |
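
Several posts in this thread describe the same symptom: a task whose elapsed time keeps climbing while its CPU time stays flat (for example, CPU time 03:39:05 against elapsed time 20:29:26). A minimal sketch of how one might flag that pattern from the two times shown in the BOINC Manager is given below. The function names and the thresholds `min_elapsed` and `max_ratio` are hypothetical, chosen only for illustration; they are not BOINC or Rosetta settings, and a low ratio can also have harmless causes such as a heavily loaded machine.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_stalled(cpu_time, elapsed_time, min_elapsed=3600, max_ratio=0.5):
    """Return True when CPU time is a small fraction of elapsed time.

    min_elapsed and max_ratio are hypothetical thresholds chosen for
    illustration; they are not BOINC or Rosetta settings.
    """
    cpu = parse_hms(cpu_time)
    elapsed = parse_hms(elapsed_time)
    if elapsed < min_elapsed:
        return False  # too early in the run to judge
    return cpu / elapsed < max_ratio


# Times reported earlier in the thread for the stuck workunit.
print(looks_stalled("03:39:05", "20:29:26"))
```

With the times quoted above, CPU time is under a fifth of the elapsed time, so the check flags the task. A task that is flagged this way is only a candidate for a closer look, not proof of a hang.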
©2023 University of Washington
https://www.bakerlab.org