Message boards : Number crunching : minirosetta 2.17
Author | Message |
---|---|
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
Two more, both newly issued overnight: T0590_boinc_nmr_max40_rerun_abrelax_cs_frags_negative_tex_IGNORE_THE_REST_23856_3040 first copy sent 5 Apr 2011 3:07:32 UTC T0569_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_6710 first copy sent 5 Apr 2011 5:21:21 UTC All copies ended with: process exited with code 1 (0x1, -255) ERROR: ERROR: FragmentIO: could not open file cs_frags.9mers.gz ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 258 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
Looks much like a problem I've seen before. If so, the version of gzip sent as part of the workunit works properly in one direction but not the other. |
Jesse Viviano Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0 |
Here is another corrupt work unit: 376739708 While writing this post up, I remembered that one other project, Docking@home, once spewed corrupt work units all over the place because the disk drives on the work unit generation server got completely filled up. Therefore, the work unit generator was creating work units that were zero bytes long. The client software tried to crunch these corrupt work units and went nowhere, and failed to declare compute errors, wasting its volunteers' time and electricity until everyone was told to abort the corrupt work units. Could one of the Rosetta@home servers generating work units have a full hard drive? I am wondering if the same situation at Docking@home is happening here except that the client code is smart enough to declare a computation error when it encounters a corrupt work unit. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
A couple of new protein interface design tasks failing immediately with a computation error on Mac. Sample: Task 414128879 (dck_rhoA_rhoA_2nr7_final_ProteinInterfaceDesign_11Apr2011_25012_119_0) ERROR: Option matching -docking:no_filters not found in command line top-level context |
moody Volunteer moderator Project developer Project scientist Joined: 8 Jun 10 Posts: 11 Credit: 88,068 RAC: 0 |
In response to Message 70021: "A couple of new protein interface design tasks failing immediately with a computation error on Mac. Sample: Task 414128879 (dck_rhoA_rhoA_2nr7_final_ProteinInterfaceDesign_11Apr2011_25012_119_0) ERROR: Option matching -docking:no_filters not found in command line top-level context" This was due to a Rosetta option that was recently renamed on our end but not in the version of Rosetta currently on Boinc. It should be fixed now. We apologize for any inconvenience. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task 414671655 was puzzling in a couple of respects. First, it took about 7 hours (against a requested run time of 3 hours) to complete 1 decoy. This would be explicable if the model were particularly large, but the log indicates that some other error was occurring: Watchdog active. BOINC:: CPU time: 25609.9s, 14400s + 10800s[2011- 4-15 1:37: 2:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 25609.9 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== called boinc_finish </stderr_txt> |
bobgoblin Joined: 15 Oct 05 Posts: 2 Credit: 1,616,056 RAC: 0 |
I've noticed that both my i7 machines have been taking 12+ hours to complete WUs. When I look in pending tasks once they are reported, they all show less than 3 hours. My i5 machine is still crunching them in less than 3 hours. So, I've disabled Rosetta on the i7s for now. Any idea what may be causing that? All machines are running Win7, though the i7s were upgraded last December from Vista64; the i5 had Win7 installed when it was built. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"I've noticed that both my i7 machines have been taking 12+ hours to complete WUs. When I look in pending tasks once they are reported, they all show less than 3 hours. My i5 machine is still crunching them in less than 3 hours. So, I've disabled Rosetta on the i7s for now. Any idea what may be causing that?" Your main concern (because you've "disabled Rosetta") seems to be whether things are running properly, or doing harm to your machines. Nothing you've described implies any harm. In fact, the tasks complete in the 3 hours of CPU time that you've (likely) set (or defaulted to) in your R@h preferences. I think what you are saying is that "wall clock" time is over 12 hours, but actual CPU time is around 3 hours. So the question boils down to asking why tasks might not be receiving CPU time when they are trying to run. This could be due to other tasks on the machine demanding CPU (as BOINC runs at the lowest possible priority, and will yield to other tasks). It seems fairly likely that, with one 8-core machine running in 6GB of memory and the other 8-core machine running in 8GB of memory, you would see "waiting for memory" as the status of several tasks rather than "running". This causes BOINC to stop giving those tasks CPU time until the total memory of other active tasks comes back down to within the memory preferences set in your BOINC Manager. So, when memory becomes constrained, BOINC is no longer using all of the CPUs of the machine (or all of the CPUs BOINC is configured to use). This is likely not occurring on your 4-core machine because it has 6GB of memory (at least 50% more per core than the other machines). This thread has a number of ideas and descriptions of what to expect and what actions you might take to help things run better. Rosetta Moderator: Mod.Sense |
BerlinTomek Joined: 11 Mar 09 Posts: 3 Credit: 15,530,617 RAC: 15 |
What does this mean? Can you tell me whether these errors are caused by a hardware fault? I can't believe they are, because my i7 (overclocked to 3.74 GHz) never reaches a temperature above 60°C. So what's wrong? "Task ID 415317758 Name T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_18211_0 Workunit 379356995 Created 17 Apr 2011 12:39:28 UTC Sent 17 Apr 2011 12:42:55 UTC Received 17 Apr 2011 12:49:46 UTC Server state Over Outcome Client error Client state Compute error Exit status 1 (0x1) Computer ID 1388706 Report deadline 27 Apr 2011 12:42:55 UTC CPU time 3.296875 stderr out <core_client_version>6.10.58</core_client_version> <![CDATA[ <message> Unzulässige Funktion. (0x1) - exit code 1 (0x1) </message> <stderr_txt> [2011- 4-17 14:44:30:] :: BOINC:: Initializing ... ok. [2011- 4-17 14:44:30:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... ERROR: ERROR: FragmentIO: could not open file cs_frags.9mers.gz ERROR:: Exit from: ..\..\src\core\fragment\FragmentIO.cc line: 258 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 0.022264471690409 Granted credit 0 application version 2.17" |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"ERROR: FragmentIO: could not open file cs_frags.9mers.gz" It sounds like there is a problem with work units starting with T0471 and T0475. When you got the error, another copy was created for processing, and in each case the other person got the same error that you did. So, it sounds like the work unit has a problem, not your machine. Rosetta Moderator: Mod.Sense |
BerlinTomek Joined: 11 Mar 09 Posts: 3 Credit: 15,530,617 RAC: 15 |
"ERROR: FragmentIO: could not open file cs_frags.9mers.gz" OK, thanks for the quick answer... I just thought something bad was going on with my CPU! Hope they will fix the problem as fast as possible. My machine hates working on useless BOINC units ;-) |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
I may have a defective workunit. Rosetta Mini 2.17 T0533_rH_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910 Elapsed 20:56:54, Progress 20.026% and not changing, To completion 40:53:28 BOINC thinks it is running, but it's using no CPU time at all. No error messages seen. CPU time at last checkpoint 03:11:48 CPU time 03:12:15 I've selected workunits expected to last 12 hours. No relevant messages in the BOINC log file since: 4/17/2011 7:40:44 AM rosetta@home Restarting task T0533_rH_rs_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910_0 using minirosetta version 217 I've restarted BOINC to see if having it restart from the last checkpoint will help. Now showing 05:29:07 elapsed, 20.026% progress, 17:45:40 to completion, and Waiting to run. |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,537,115 RAC: 17,468 |
For the last few days I have been getting a large batch of "Compute error" results on tasks whose names start with "T0xxx_". These tasks end with errors a few seconds after starting. Some examples: T0475_boinc_nmr_homology_max10_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_25329_232_0 T0462_boinc_nmr_homology_max10_abrelax_cs_frags_tex_IGNORE_THE_REST_25326_218_0 T0462_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_12436_1 T0589_symm_cm_runs_soeding_alns_relax_default_repeat_2_fix_csts_25310_1377_1 T0462_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_12293_1 T0569_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_11706_0 T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_11263_1 And 4 WUs with validate errors, all of the same type (ProteinG_abinitio_SAVE_ALL_OUT_design_relax_): ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g007_003_25073_196_1 ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g002_008_25063_189_1 ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g006_001_25071_111_0 ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g003_008_25065_94_1 All other types of WUs run normally on this machine. P.S. My wingmen on these WUs have received the same errors. |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
Rosetta Mini 2.17 T0533_rH_stg0_lrlxjcst_t000__casp9_w_symm_fm_2qmx_2_SAVE_ALL_OUT_25095_1910 The next morning: 10:48:20 elapsed, 20.021% progress, 25:26:47 to completion. Now aborted. |
Tex1954 Joined: 3 Apr 11 Posts: 9 Credit: 3,394,752 RAC: 1 |
"For the last few days I have been getting a large batch of "Compute error" results on tasks whose names start with "T0xxx_"." Yup, I just had a batch of 8 errors from 4 computers myself. That averages 1 error per day per computer (2 laptops, 2 desktops). In fact, I just noticed in the history that my main computer had about 5 fail in quick succession too. *************************************************************************** T0471_boinc_nmr_homology_max10_abrelax_cs_frags_nocst_tex_IGNORE_THE_REST_25328_918_0 00:01:21 (00:00:01) 4/16/2011 8:08:27 PM 4/16/2011 8:10:29 PM Reported: Computation error (1,) T0471_boinc_nmr_homology_max10_abrelax_cs_frags_tex_IGNORE_THE_REST_25326_931_0 00:01:13 (00:00:01) 4/16/2011 8:06:45 PM 4/16/2011 8:10:29 PM Reported: Computation error (1,) T0471_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_15071_1 00:01:27 (00:00:02) 4/16/2011 8:05:13 PM 4/16/2011 8:06:45 PM Reported: Computation error (1,) T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_14479_0 00:01:42 (00:00:02) 4/16/2011 5:04:57 PM 4/16/2011 5:07:01 PM Reported: Computation error (1,) T0475_boinc_nmr_max40_rerun_abrelax_cs_frags_permuted_tex_IGNORE_THE_REST_23858_14438_1 00:01:37 (00:00:02) 4/16/2011 4:08:01 PM 4/16/2011 4:12:33 PM Reported: Computation error (1,) *************************************************************************** It isn't OUR fault and they only waste a few seconds. I'm sure the PTBs are on it. 8-) Tex1954 |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
"For the last few days I have been getting a large batch of "Compute error" results on tasks whose names start with "T0xxx_". These tasks end with errors a few seconds after starting. Some examples:" To his list of those ending "in only a few seconds" with the error message "file cs_frags.9mers.gz" you can add: 415450587 415451112 415451774 415525056 415086428 415038519 415038519 415008687 415001340 415384421 415376541 415367010 415339775 415326968 415043715 415337799 415300558 415299126 415289036 415068958 415020984 414880550 415302737 415245580 415216583 415210228 415605253 415563904 415554009 415542857 415540472 415533975 415533394 415519909 415503828 415487923 415485024 415473737 415472523 415466575 415465650 415068765 415064287 415062582 415062402 415053058 415343211 415335164 415333379 415278070 This does NOT represent a COMPLETE listing of what I have seen on my systems - I just listed the FIRST 50 or so that have FAILED with this error so far TODAY. And it is still early. These errors have been going on for AT LEAST 2 WEEKS and have been the topic of discussion in another thread on this board. These are all "fresh" tasks, having been issued by the Rosetta server in the last day or two. |
Tex1954 Joined: 3 Apr 11 Posts: 9 Credit: 3,394,752 RAC: 1 |
CUT... Any way to know WHEN this group of tasks was generated on the server? Could it be it's an old batch and we just need to work through it? Or is it possible the errors themselves are significant in the process? 8-) Tex1954 |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Tex1954 said: Any way to know WHEN this group of tasks was generated on the server? Could it be it's an old batch and we just need to work through it? Good question - I just looked at a few of them on my previous list and the task creation dates were 16 April and 18 April. So this is NOT a case of just letting "old" jobs work their way through the system. |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
You can add to Mad Max's list of failing tasks (with matching wingman results) whose name is in the form of: ProteinG_abinitio_SAVE_ALL_OUT_design_relax 415571046 414989706 414921629 415131368 415102869 415091934 414802441 415091930 415171797 415008017 This is not an exhaustive list of this type of error found on my systems – these were all “fresh” tasks with creation dates between 16 April and 18 April. |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
I am seeing validate errors (with matching wingman results) on tasks whose name has the form of: T0590_boinc_nmr_homology_max10_loopbuild_threading_cst_relax_tex A few samples would be: 414980981 414994609 414957506 414950332 415065606 |
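
Mod.Sense's explanation of the slow i7 machines earlier in the thread comes down to two small pieces of arithmetic: how much memory each core gets, and what fraction of wall-clock time a task actually spends on the CPU. The sketch below works both out using only the figures quoted in that exchange (6GB and 8GB on the 8-core machines, 6GB on the 4-core machine, roughly 3 hours of CPU time against 12 hours elapsed). The function names are hypothetical and are not part of BOINC or Rosetta; this is a rough illustration, not a diagnostic tool.

```python
# Hypothetical helpers for illustration only; not part of BOINC or Rosetta.

def memory_per_core_gb(total_gb: float, cores: int) -> float:
    """Memory available per core if every core runs one task."""
    return total_gb / cores


def cpu_efficiency(cpu_hours: float, wall_hours: float) -> float:
    """Fraction of elapsed (wall-clock) time the task spent on the CPU."""
    return cpu_hours / wall_hours


# Machine figures as quoted in the thread: (memory in GB, core count).
machines = {
    "i7, 6GB, 8 cores": (6, 8),
    "i7, 8GB, 8 cores": (8, 8),
    "i5, 6GB, 4 cores": (6, 4),
}

for name, (total_gb, cores) in machines.items():
    print(f"{name}: {memory_per_core_gb(total_gb, cores):.2f} GB per core")

# About 3 hours of CPU time spread over 12 hours of wall-clock time.
print(f"CPU efficiency: {cpu_efficiency(3, 12):.0%}")
```

On these figures the 4-core machine has 1.5GB per core against 0.75GB and 1GB on the two 8-core machines, and a task that needs 12 hours of wall-clock time for 3 hours of CPU time is running only a quarter of the time, which is consistent with tasks sitting in "waiting for memory".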
©2024 University of Washington
https://www.bakerlab.org