Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote]Update, the sortilin tasks still have issues.[/quote] Yep, but at least they are all producing Valid work. Unlike the dna_binder_round2 Work Units. About half a dozen Valid results, and over 20 that crashed and burned in 30 secs or less. Not a good ratio. Grant Darwin NT |
motov Joined: 8 Apr 20 Posts: 4 Credit: 4,568,429 RAC: 0 |
Is the credit system broken? Only 3 credits after 12 hours? Task: 1251422739 Computer: 4683810 Sent: 31 Aug 2020, 11:02:41 UTC Time reported: 1 Sep 2020, 6:04:59 UTC Status: Completed and validated Run time(sec): 43,781.06 CPU time(sec): 43,720.89 Credit: 3.62 Application: Rosetta v4.20 |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote]Is the credit system broken? Only 3 credits after 12 hours?[/quote] There is something odd with the benchmarks on those systems. [quote]Measured floating point speed 1850.43 million ops/sec Measured integer speed 69467.74 million ops/sec[/quote] That's an even greater difference between Floating point & Integer numbers than usual, which the Credit system takes offense with (as it was previously used to boost Credit awarded). Over time it generally settles down to more usual credit values. And to add to that, there was something odd that occurred with that particular Task. Std_error output from normal Result, <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_aarch64-unknown-linux-gnu -in:file:native 00001.pdb -relax::default_repeats 5 -frag9 00001.200.9mers.index -abinitio::rg_reweight 0.5 -abinitio::use_filters false -out:file:silent default.out -beta 1 -ex1 1 -frag3 00001.200.3mers.index -abinitio::rsd_wt_loop 0.5 -silent_gz 1 -abinitio::fastrelax 1 -abinitio::rsd_wt_helix 0.5 -ex2aro 1 -abinitio::increase_cycles 10 -in:file:boinc_wu_zip JHR_bd4_01401_c_0000200006_0000026_0_fragments_data.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3192653 Using database: database_357d5d93529_n_methyl/minirosetta_database ====================================================== DONE :: 1 starting structures 41446.9 cpu seconds This process generated 22 decoys from 22 attempts ====================================================== BOINC :: WS_max 4.9501e+08 09:25:03 (11864): called boinc_finish(0) </stderr_txt> ]]> Std_error output from low Credit Result, <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> command: 
../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_aarch64-unknown-linux-gnu -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 5 -silent_gz 1 -out:file:silent default.out -abinitio::fastrelax 1 -frag9 00001.200.9mers.index -beta 1 -ex1 1 -ex2aro 1 -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -abinitio::rg_reweight 0.5 -abinitio::use_filters false -frag3 00001.200.3mers.index -in:file:native 00001.pdb -in:file:boinc_wu_zip JHR_bd5_01448_c_0000100001_0000018_0_fragments_data.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1991338 Using database: database_357d5d93529_n_methyl/minirosetta_database ====================================================== DONE :: 1 starting structures 42023.8 cpu seconds This process generated 31 decoys from 31 attempts ====================================================== BOINC :: WS_max 4.85376e+08 21:28:45 (13322): called boinc_finish(0) ====================================================== DONE :: 1 starting structures 43720.9 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 4.84512e+08 22:03:46 (14192): called boinc_finish(0) </stderr_txt> ]]> Grant Darwin NT |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
We’ve seen this before: a task somehow runs and finishes twice (two different process IDs in the ‘called boinc_finish’ message), the second time reporting only one decoy and receiving little credit. |
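The double-finish pattern described here can be spotted mechanically in a stderr dump. The following is only a rough sketch, assuming the log lines look like the ones pasted earlier in this thread; the function names `summarize_stderr` and `finished_twice` are made up for illustration and are not part of BOINC or Rosetta.

```python
import re

# Matches lines such as "09:25:03 (11864): called boinc_finish(0)".
FINISH_RE = re.compile(r"\((\d+)\): called boinc_finish\((-?\d+)\)")
# Matches "This process generated 22 decoys from 22 attempts".
DECOY_RE = re.compile(r"This process generated (\d+) decoys from (\d+) attempts")


def summarize_stderr(text):
    """Return (pids, decoy_counts) found in a Rosetta stderr dump."""
    pids = [int(m.group(1)) for m in FINISH_RE.finditer(text)]
    decoys = [int(m.group(1)) for m in DECOY_RE.finditer(text)]
    return pids, decoys


def finished_twice(text):
    """True when boinc_finish was called by more than one process ID."""
    pids, _ = summarize_stderr(text)
    return len(set(pids)) > 1


# Excerpt of the low-credit result quoted in this thread.
sample = (
    "This process generated 31 decoys from 31 attempts "
    "BOINC :: WS_max 4.85376e+08 21:28:45 (13322): called boinc_finish(0) "
    "This process generated 1 decoys from 1 attempts "
    "BOINC :: WS_max 4.84512e+08 22:03:46 (14192): called boinc_finish(0)"
)
print(summarize_stderr(sample))
print(finished_twice(sample))
```

For the low-credit task above this picks out the two process IDs (13322 and 14192) and the decoy counts of 31 and 1, which matches the description of a second run reporting a single decoy.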
motov Joined: 8 Apr 20 Posts: 4 Credit: 4,568,429 RAC: 0 |
[quote][quote]Is the credit system broken? Only 3 credits after 12 hours?[/quote] There is something odd with the benchmarks on those systems. Measured floating point speed 1850.43 million ops/sec Measured integer speed 69467.74 million ops/sec That's an even greater difference between Floating point & Integer numbers than usual, which the Credit system takes offense with (as it was previously used to boost Credit awarded). Over time it generally settles down to more usual credit values.[/quote] This is an 8g RPI running Ubuntu 18.04 and has similar benchmarks as my other 8g RPI 4 and both of my other 4g RPI4s. Is the Boinc client creating odd benchmarks on ARM processors? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote]This is an 8g RPI running Ubuntu 18.04 and has similar benchmarks as my other 8g RPI 4 and both of my other 4g RPI4s. Is the Boinc client creating odd benchmarks on ARM processors?[/quote] Most likely. Found another similar system, running Linux with similar Benchmark numbers. Although according to the post from Brian, the glitch with your Credit has occurred previously. But the poster didn't say which system it occurred on, so I'm not sure if it's a BOINC thing, a Linux application issue, or related to your hardware (or all of the above). Grant Darwin NT |
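To put numbers on the benchmark gap and the low credit discussed above, here is a small worked calculation. It only uses figures quoted in this thread; it makes no claim about how the Credit system actually weighs them.

```python
# Benchmark figures quoted in the thread, in million ops/sec.
float_mops = 1850.43
int_mops = 69467.74

# How many times faster the integer benchmark is than the floating point one.
ratio = int_mops / float_mops
print(f"integer / floating point = {ratio:.1f}x")

# Credit rate for the low-credit task (values from the task report).
credit = 3.62
run_time_s = 43781.06
credit_per_hour = credit / (run_time_s / 3600)
print(f"credit per hour = {credit_per_hour:.2f}")
```

The integer figure comes out between 37 and 38 times the floating point figure, and the task earned roughly 0.3 credits per hour of run time.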
Keith Myers Joined: 29 Mar 20 Posts: 96 Credit: 322,693 RAC: 1,374 |
I would like to know if anyone has been successful in crunching any of the epcam_breaker_graft_v1_SAVE_ALL_OUT_IGNORE_THE_REST tasks. Every one I have attempted fails with major errors. [ ERROR ]: Caught exception: File: src/protocols/motif_grafting/movers/MotifGraftMover.cc:537 For this scaffold there are not suitable scaffold grafts within your constrains ------------------------ Begin developer's backtrace ------------------------- |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,575,849 RAC: 20,380 |
[quote]I would like to know if anyone has been successful in crunching any of the epcam_breaker_graft_v1_SAVE_ALL_OUT_IGNORE_THE_REST tasks. Every one I have attempted fails with major errors.[/quote] I don't know how to get that much detail, but I can tell you I've got several validated, and none rejected at either my end or the server end. |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
I’ve not been sent any of those yet, but I’ve had 33 epcam_dimer_graft tasks running for over 12 hours (I increased target run time from the default), so far without problems (which is to say: they haven’t failed, but on closer inspection I see they are spewing out exception messages like the one you mention). You have had a couple of epcam_breaker_grafts succeed (though the output also contains those error messages), but also b3x and b4k tasks fail, which are not new types. The signs are pointing to a problem with your machine. Has anything changed recently? |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,575,849 RAC: 20,380 |
[quote]I’ve not been sent any of those yet, but I’ve had 33 epcam_dimer_graft tasks running for over 12 hours (I increased target run time from the default), so far without problems. You have had a couple of epcam_breaker_grafts succeed (though the output also contains those error messages), but also b3x and b4k tasks fail, which are not new types. The signs are pointing to a problem with your machine. Has anything changed recently?[/quote] I left mine on the default of 8 hours, yet I had one run for exactly 16 hours. Just the one though. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote]I would like to know if anyone has been successful in crunching any of the epcam_breaker_graft_v1_SAVE_ALL_OUT_IGNORE_THE_REST tasks. Every one I have attempted fails with major errors.[/quote] My stderr output files are full of error messages, but all of my epcam_breaker_graft_v1 Tasks have run to completion time & Validated. [ ERROR ]: Caught exception: File: ..\..\..\src\protocols\motif_grafting\movers\MotifGraftMover.cc:537 For this scaffold there are not suitable scaffold grafts within your constrains ------------------------ Begin developer's backtrace ------------------------- BACKTRACE: ------------------------- End developer's backtrace -------------------------- AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS. Repeat over & over again, and then at the end, ====================================================== DONE :: 2593 starting structures 28790.2 cpu seconds This process generated 2593 decoys from 2593 attempts ====================================================== BOINC :: WS_max 1.30979e+09 01:51:26 (8932): called boinc_finish(0) There are lots of Signal 11 error Tasks from that system.
E.g. rb_09_05_37264_36423_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_03_08_1009119_35 (While it ran to completion & Validated, it too produced a lot of messages in Stderr output- "*** Error in `../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu': free(): invalid pointer: 0x00000000067bd783 ***") <core_client_version>7.17.0</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @rb_09_05_37264_36423_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_09_05_37264_36423_ab_t000__robetta.zip -frag3 rb_09_05_37264_36423_ab_t000__robetta.200.3mers.index.gz -fragA rb_09_05_37264_36423_ab_t000__robetta.200.8mers.index.gz -fragB rb_09_05_37264_36423_ab_t000__robetta.200.3mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1520304 Using database: database_357d5d93529_n_methyl/minirosetta_database </stderr_txt> ]]> Here's one that completed ok with no errors in the stderr output on another system, but crashed on yours. b3x_3751_fold_SAVE_ALL_OUT_953871_1496 While these particular Tasks don't use a lot of RAM (for Rosetta), they still use more than many other projects' Tasks. And if you've got some other large RAM Tasks (I've had plenty using 3GB+ over the last few days), then you'll be using RAM you may not normally use. Dodgy module, or gone past the limit of its overclock and now those affected addresses are being used? Grant Darwin NT |
Keith Myers Joined: 29 Mar 20 Posts: 96 Credit: 322,693 RAC: 1,374 |
[quote]I’ve not been sent any of those yet, but I’ve had 33 epcam_dimer_graft tasks running for over 12 hours (I increased target run time from the default), so far without problems (which is to say: they haven’t failed, but on closer inspection I see they are spewing out exception messages like the one you mention). You have had a couple of epcam_breaker_grafts succeed (though the output also contains those error messages), but also b3x and b4k tasks fail, which are not new types. The signs are pointing to a problem with your machine. Has anything changed recently?[/quote] No, haven't changed anything with my machine. Don't have much luck with Rosetta tasks. About a 50% success rate. No other issues with any other projects' cpu tasks for some reason. Guess the Rosetta tasks are harder than TN-Grid, Einstein or Universe tasks. [Edit] Do the Rosetta apps still expect you to run with the vsyscall=emulate kernel command-line parameter? |
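On the vsyscall question: whether the Rosetta apps still need that parameter is not answered in this thread, but checking whether a Linux kernel was booted with it is straightforward, since the boot parameters are exposed in /proc/cmdline. A minimal sketch follows; the helper names and the example command line are hypothetical.

```python
def has_vsyscall_emulate(cmdline):
    """Return True if the kernel command line contains vsyscall=emulate."""
    return "vsyscall=emulate" in cmdline.split()


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the boot parameters of the running Linux kernel."""
    with open(path) as f:
        return f.read().strip()


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet vsyscall=emulate"
print(has_vsyscall_emulate(example))
```

On a real machine, `has_vsyscall_emulate(read_kernel_cmdline())` would report the setting for the kernel that is currently running.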
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,575,849 RAC: 20,380 |
[quote]I’ve not been sent any of those yet, but I’ve had 33 epcam_dimer_graft tasks running for over 12 hours (I increased target run time from the default), so far without problems (which is to say: they haven’t failed, but on closer inspection I see they are spewing out exception messages like the one you mention). You have had a couple of epcam_breaker_grafts succeed (though the output also contains those error messages), but also b3x and b4k tasks fail, which are not new types. The signs are pointing to a problem with your machine. Has anything changed recently?[/quote] Are you talking about your Ryzen 9?! That's like 1000 times more advanced than some of mine, and I don't have any Rosetta problems. Must be some freaky bug in the Rosetta programming that only affects that type of CPU? |
Keith Myers Joined: 29 Mar 20 Posts: 96 Credit: 322,693 RAC: 1,374 |
[quote]I’ve not been sent any of those yet, but I’ve had 33 epcam_dimer_graft tasks running for over 12 hours (I increased target run time from the default), so far without problems (which is to say: they haven’t failed, but on closer inspection I see they are spewing out exception messages like the one you mention). You have had a couple of epcam_breaker_grafts succeed (though the output also contains those error messages), but also b3x and b4k tasks fail, which are not new types. The signs are pointing to a problem with your machine. Has anything changed recently?[/quote] Yes, I run Rosetta on my Ryzen 9 and the Nvidia Jetson Nano. Don't have the issues on my Nano. But as stated, no issues on any other cpu tasks I run. I also can run Google memory stressapptest for 24 hours with no errors. So I don't think there are any basic memory issues at play. And I do successfully crunch half of the Rosetta tasks that get sent to me. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote]And I do successfully crunch half of the Rosetta tasks that get sent to me.[/quote] Which shows it is a system issue. Other systems running the same Linux application are processing WUs that fail on yours, and theirs run to their Target CPU time & Validate. And they don't have computation errors in their Tasks list. Maybe late to Validate, or Cancelled by server, but no computation errors. Grant Darwin NT |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
The epcam tasks seem to be using over 1 GB of memory each. If your machine is loaded with memory-hungry tasks, and some are failing strangely, it points to faulty RAM or a problem with swap space. |
Keith Myers Joined: 29 Mar 20 Posts: 96 Credit: 322,693 RAC: 1,374 |
[quote][quote]And I do successfully crunch half of the Rosetta tasks that get sent to me.[/quote] Which shows it is a system issue.[/quote] Well, I have no clue how to troubleshoot the issue. As I stated, no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks. [Edit] While typing this response something just grabbed all my memory and used the 6GB swap file for some reason for the first time. Couldn't catch which application was the memory hog. It was only there for about 5 seconds. [Edit 2] Well it was this Rosetta task kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_201. It is grabbing all the memory and the swap file every five minutes or so. |
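For anyone trying to catch a short-lived memory hog like the one described above, one option on Linux is to sample the resident memory of every process from /proc and list the largest. This is only a sketch under that assumption; `rss_by_process` is an illustrative name, and it would have to be run in a loop (or from a scheduled job) to catch a spike that lasts only a few seconds.

```python
import os


def rss_by_process(proc_root="/proc"):
    """Return [(rss_kib, pid, name)] for every readable process, largest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        name, rss = "?", None
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    if line.startswith("Name:"):
                        name = line.split(None, 1)[1].strip()
                    elif line.startswith("VmRSS:"):
                        rss = int(line.split()[1])  # value is reported in kB
        except (OSError, ValueError, IndexError):
            continue  # process exited or is not readable
        if rss is not None:
            rows.append((rss, int(entry), name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        # Print the five largest processes by resident memory.
        for rss, pid, name in rss_by_process()[:5]:
            print(f"{rss / 1024:8.0f} MiB  pid {pid}  {name}")
```

Kernel threads have no VmRSS line and are skipped, so the list only contains ordinary processes.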
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote][Edit] While typing this response something just grabbed all my memory and used the 6GB swap file for some reason for the first time. Couldn't catch which application was the memory hog. It was only there for about 5 seconds.[/quote] I've had some RB tasks using 3.9GB of RAM. The presently running Epcams are using around 1.2GB each. With 12 cores/threads I've seen up to 18GB (56%) of system RAM in use on occasions over the last week. Over 14GB (44%) in use at present. Also unused RAM is used for caching, so as more RAM is used, the area used for caching moves up as well and if a call is made to data that's cached in a dodgy address... Simplest method for sorting out system issues if a system is overclocked/overvolted or undervolted: set things back to the default values & see if the errors go away. Grant Darwin NT |
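The memory figures in this post can be sanity-checked with a little arithmetic. The sketch below assumes 32 GB of system RAM, which is inferred from "18GB (56%)" and is not stated in the post itself.

```python
total_ram_gb = 32  # assumption: inferred from "18GB (56%)", not stated in the post
threads = 12
epcam_gb_each = 1.2  # approximate per-task figure quoted in the post

# Memory needed if every thread runs an epcam task at the same time.
epcam_total_gb = threads * epcam_gb_each
print(f"{threads} epcam tasks: {epcam_total_gb:.1f} GB")

# Share of system RAM for the two usage figures quoted in the post.
for used_gb in (18, 14):
    print(f"{used_gb} GB in use = {used_gb / total_ram_gb:.0%} of {total_ram_gb} GB")
```

Under that assumption the quoted percentages line up, and twelve epcam tasks at about 1.2 GB each would account for roughly 14.4 GB on their own, consistent with "Over 14GB" in use.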
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Another user reporting tasks going wild with memory (Also Ryzen, also Linux – may just be coincidental) |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1663 Credit: 17,329,447 RAC: 24,445 |
[quote][Edit 2] Well it was this Rosetta task kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_201. It is grabbing all the memory and the swap file every five minutes or so.[/quote] It's a resend; this is what the first system got with it. Outcome: Computation error. Client state: Compute error. Exit status: 1 (0x00000001) Unknown error code. Computer ID: 5159178. Run time: 19 min 44 sec. CPU time: 18 min 38 sec. Validate state: Invalid. Credit: 0.00. Device peak FLOPS: 3.28 GFLOPS. Application version: Rosetta v4.20 windows_x86_64. Stderr output: <core_client_version>7.0.80</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3873245 Using database: database_357d5d93529_n_methyl\minirosetta_database ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump! ERROR:: Exit from: ..\..\..\src\core\kinematics\FoldTree.cc line: 436 BOINC:: Error reading and gzipping output datafile: default.out 16:00:04 (3796): called boinc_finish(1) </stderr_txt> ]]> Grant Darwin NT |
©2024 University of Washington
https://www.bakerlab.org