Rosetta 4.1+ and 4.2+

Message boards : Number crunching : Rosetta 4.1+ and 4.2+

Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1481
Credit: 14,616,844
RAC: 15,729
Message 94660 - Posted: 17 Apr 2020, 7:29:08 UTC - in response to Message 94657.  

Some problems with "12v1n_" wus.
I've processed these ones with no problems so far.
12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_39_0
12v1n_al_12mer_design_00062_014552_0001_SAVE_ALL_OUT_913824_41_0
12v1n_al_12mer_design_00166_018161_0001_SAVE_ALL_OUT_914183_55_0
12v1n_al_12mer_design_00178_008639_0001_SAVE_ALL_OUT_914209_113_0
12v1n_al_12mer_design_00329_016075_0001_SAVE_ALL_OUT_914468_22_0

They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.



I have 4hr WUs (in my profile), but these are crunching for over 10hrs with NO checkpoint.
This is an example, 1152133806:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75C1E062
It's been sent to another system, so we'll see how it goes.
Grant
Darwin NT
ID: 94660
Profile Stret

Joined: 18 Mar 20
Posts: 7
Credit: 529,664
RAC: 0
Message 94690 - Posted: 17 Apr 2020, 15:38:20 UTC

Please move to the relevant forum; I was struggling to find a help section.

One of my work units has been running for over a day (unusual in and of itself) and is not giving up at all; it says it is 10 minutes from finishing, but that hasn't changed in over 12 hours.

I suspect, based on my rudimentary programming knowledge, that it has hit an infinite loop.

What is the best way forward? There's no point in it hitting its deadline and doing the same on another machine.

Copy and paste from the properties of the WU:

Application: Rosetta 4.15
Name: 12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_17
State: Running
Received: 15/04/2020 07:16:00
Report deadline: 18/04/2020 07:16:02
Estimated computation size: 80,000 GFLOPs
CPU time: 1d 05:34:42
CPU time since checkpoint: 1d 05:34:42
Elapsed time: 1d 05:55:59
Estimated time remaining: 00:10:07
Fraction done: 99.439%
Virtual memory size: 244.53 MB
Working set size: 48.34 MB
Directory: slots/5
Process ID: 26968
Progress rate: 3.240% per hour
Executable: rosetta_4.15_windows_x86_64.exe
ID: 94690
Profile Charles Dennett

Joined: 27 Sep 05
Posts: 102
Credit: 2,070,914
RAC: 0
Message 94691 - Posted: 17 Apr 2020, 16:52:02 UTC - in response to Message 94690.  

Yeah, I had a similar one whose name started the same way as yours. It ran over a day and a half before I aborted it. Apparently others have reported issues with tasks with names like that. Just abort it.

-Charlie
ID: 94691
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 94719 - Posted: 18 Apr 2020, 1:35:54 UTC

Task: 1150978005
Task: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58
CPU time: 15:01:25
CPU time since checkpoint: 15:01:25
Elapsed time: 15:28:46
Estimated time remaining: 00:10:18 (which varies between 00:10:17-00:10:20)
Fraction done: 98.901% (Which has gradually increased over last 1/2 hr or so from 98.893% or so)

The original estimated time was 8 hrs, so it is now 7.5 hrs over that. Shouldn't the watchdog have stopped processing, as it is now over 4 hrs longer than the estimated processing time? Fraction done is slowly rising, so I'm reluctant to abort at this point. It's concerning that the only checkpoint was when the task first started. If the BOINC manager stops or suspends, I'm afraid the task will have to start over from scratch! I'll keep an eye on this for now, and if there's no change in an hour or so, I'll likely need to abort it.
ID: 94719
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94728 - Posted: 18 Apr 2020, 4:03:36 UTC - in response to Message 94719.  

The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager. Once the WU reports back it will show the runtime preference it was run with. But you are correct, no checkpoints for over an hour is not a good sign, and your other work units seem to be running with the default 8 hour preference.
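
To put that rule in concrete terms, here is a minimal sketch (illustrative only, not the actual Rosetta or BOINC code, and all values are made up): the watchdog trigger is the runtime preference plus a fixed grace allowance, and the estimate shown in the BOINC Manager plays no part in it.

[code]
#include <cstdio>

int main() {
    // Hypothetical values, for illustration only.
    const double runtime_pref_s   = 8 * 3600.0;  // user's "Target CPU run time" preference
    const double watchdog_grace_s = 4 * 3600.0;  // extra allowance before the watchdog acts
    const double manager_estimate = 6 * 3600.0;  // estimate shown in BOINC Manager (not used)

    const double watchdog_limit_s = runtime_pref_s + watchdog_grace_s;

    std::printf("Manager estimate: %.0f s (ignored by the watchdog)\n", manager_estimate);
    std::printf("Watchdog acts after roughly %.0f s of CPU time\n", watchdog_limit_s);
    return 0;
}
[/code]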

Is this with your i5 Windows 7 Professional machine? It looks like your i3, also with Win7, has already run a few similar tasks with dozens of models completed in the same period of time, even with less memory per core. Are the BOINC settings the same for both systems?
Rosetta Moderator: Mod.Sense
ID: 94728
rsNeutrino

Joined: 22 Mar 20
Posts: 10
Credit: 3,969,284
RAC: 8,148
Message 94733 - Posted: 18 Apr 2020, 5:50:56 UTC
Last modified: 18 Apr 2020, 5:54:55 UTC

Task 1152764941 also ran into the 12 hour timeout, reaching 98% only around 10 minutes before that.

<core_client_version>7.16.5</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe @rb_04_16_21806_21365_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 2 2 2 1 1 1 1 2 1 1 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_16_21806_21365_ab_t000__robetta.zip -frag3 rb_04_16_21806_21365_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_16_21806_21365_ab_t000__robetta.200.4mers.index.gz -fragB rb_04_16_21806_21365_ab_t000__robetta.200.7mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 5000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1285868
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43301.9s, 14400s + 28800s[2020- 4-18  6:26:34:] :: BOINC 
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE ::     1 starting structures  43301.9 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
06:26:34 (10032): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Task 1153161145 is probably going to end up the same, 34% at 4h 10min.
ID: 94733
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1481
Credit: 14,616,844
RAC: 15,729
Message 94734 - Posted: 18 Apr 2020, 5:53:54 UTC - in response to Message 94733.  
Last modified: 18 Apr 2020, 5:55:27 UTC

Task 1152764941 also ran into the 12 hour timeout.
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
Looks like there was a file transfer problem there as well.
Grant
Darwin NT
ID: 94734
rsNeutrino

Joined: 22 Mar 20
Posts: 10
Credit: 3,969,284
RAC: 8,148
Message 94735 - Posted: 18 Apr 2020, 5:58:04 UTC - in response to Message 94734.  
Last modified: 18 Apr 2020, 6:02:52 UTC

Looks like there was a file transfer problem there as well.

Maybe because it didn't have any checkpoint or result file to upload, since it wasn't finished with its first decoy. That's probably also the reason for the long runtime: it HAS to finish one before it can shut down, else it keeps going until the watchdog kills it.
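
A toy simulation of that behaviour (only a sketch of the logic described here, not Rosetta's real code; the 13-hour model cost is invented): the run-time check, and with it the chance to checkpoint and exit cleanly, only happens between models, so one slow model keeps the task going well past the target.

[code]
#include <cstdio>

int main() {
    const double target_s         = 8 * 3600.0;             // cpu_run_time preference (8 h)
    const double watchdog_limit_s = target_s + 4 * 3600.0;  // target plus 4 h grace
    const double model_cost_s     = 13 * 3600.0;            // hypothetical: one decoy takes 13 h

    double cpu_s = 0.0;
    int decoys = 0;
    while (cpu_s < target_s) {      // the target is only checked BETWEEN models...
        cpu_s += model_cost_s;      // ...so this whole model must finish first,
        ++decoys;                   // with no checkpoint written inside it
    }
    std::printf("decoys=%d, cpu=%.0f s (watchdog limit was %.0f s)\n",
                decoys, cpu_s, watchdog_limit_s);
    return 0;
}
[/code]

With these numbers the loop only gets to test the target after 46,800 s of CPU time, already past the 43,200 s watchdog limit; in the real task the separate watchdog is what eventually ends it.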
ID: 94735
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1481
Credit: 14,616,844
RAC: 15,729
Message 94737 - Posted: 18 Apr 2020, 6:03:10 UTC - in response to Message 94735.  
Last modified: 18 Apr 2020, 6:07:49 UTC

Looks like there was a file transfer problem there as well.
Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished.
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time, then the next time it goes to checkpoint the Watchdog ends it and it is considered finished.

======================================================
DONE ::     1 starting structures  43301.9 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
06:26:34 (10032): called boinc_finish(0)
If it had returned the result, it would have (or at least should have) Validated.
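
The numbers in that stderr excerpt line up with this reading. A quick arithmetic check (assuming the "14400s + 28800s" in the log is the 4-hour watchdog allowance plus the 8-hour cpu_run_time from the command line):

[code]
#include <cstdio>

int main() {
    const double pref_s  = 28800.0;   // -cpu_run_time from the WU command line (8 h)
    const double grace_s = 14400.0;   // 4 h watchdog allowance shown in the log
    const double used_s  = 43301.9;   // CPU seconds reported at boinc_finish

    const double limit_s = pref_s + grace_s;   // 43200 s
    std::printf("limit=%.0f s, used=%.1f s, overrun=%.1f s\n",
                limit_s, used_s, used_s - limit_s);
    return 0;
}
[/code]

That puts the end of the run roughly 100 seconds past the watchdog limit, which fits the "run to target plus 4 hours, then end it at the next opportunity" behaviour described above.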
Grant
Darwin NT
ID: 94737
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 94739 - Posted: 18 Apr 2020, 7:15:20 UTC - in response to Message 94728.  

The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager.
Both computers are set with the default CPU runtime of 8 hrs (28,800 seconds).
Is this with your i5 Windows 7 Professional machine?
Yes, the times etc. I quoted in my previous post were for the i5 Windows 7 PC.

As of 07:00 UTC:
Task: 1150978005
Task Name: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58
CPU time: 19:29:08
CPU time since checkpoint: 19:29:08
Elapsed time: 20:03:42
Estimated time remaining: 00:10:17
Fraction done: 99.152%

Fraction done has moved up slightly in the last 4.5 hrs, though the estimated time remaining has stayed the same. CPU runtime is also over 11 hrs past the default/set time. I doubt this task would be valid even if the watchdog stops processing, because it has only the one checkpoint from the start of processing. I'll probably need to abort it if the BOINC manager/project doesn't stop processing.
ID: 94739
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1481
Credit: 14,616,844
RAC: 15,729
Message 94740 - Posted: 18 Apr 2020, 7:21:42 UTC - in response to Message 94737.  

Looks like there was a file transfer problem there as well.
Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished.
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time, then the next time it goes to checkpoint the Watchdog ends it and it is considered finished.
... and then I get 2 tasks where the watchdog hasn't kicked in at all, even after more than an hour past the 4 hrs, with multiple checkpoints in that time.
Grant
Darwin NT
ID: 94740
rsNeutrino

Joined: 22 Mar 20
Posts: 10
Credit: 3,969,284
RAC: 8,148
Message 94741 - Posted: 18 Apr 2020, 8:19:49 UTC - in response to Message 94737.  
Last modified: 18 Apr 2020, 8:21:25 UTC

From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time, then the next time it goes to checkpoint the Watchdog ends it and it is considered finished.

My target time is 8 hours.
The task reached 12 hours, so it did run for 4 extra hours.
I had my eye on that task before it ended, and BOINC told me in the task properties that "CPU time since checkpoint" was equal to the "CPU time" of that task.
Which means there wasn't even one checkpoint saved in the 12h since the start of that task.
The second task shows the same symptoms at the moment, CPU time 04:40:xx, CPU time since checkpoint 04:40:xx, Elapsed time 04:45:xx.

My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results:
18.04.2020 06:26:41 | Rosetta@home | Output file rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0 for task rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0 absent

Also, the watchdog seems to look at "cpu seconds", i.e. "CPU time", not the slightly longer elapsed time.

The point is, it seems to me that there are some models that are either buggy or need much more time to produce even a single result, and the watchdog doesn't like that.
If the model can't be changed to fit in an 8h timeslot, raising the watchdog timeout could be a necessary option, which MAY have already happened in your and James' cases, but not in mine.
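
For anyone unfamiliar with the distinction being drawn here, a small standalone demo (nothing to do with Rosetta itself) of why elapsed time is always a bit longer than CPU time: wall-clock time keeps running while a process sleeps or waits for I/O, CPU time does not.

[code]
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

int main() {
    // Note: std::clock() reports processor time on most platforms
    // (some Windows runtimes report wall time instead).
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    // Burn a little CPU...
    volatile double x = 0.0;
    for (long i = 0; i < 50000000; ++i) x += i * 1e-9;

    // ...then wait: elapsed time keeps growing, CPU time (mostly) does not.
    std::this_thread::sleep_for(std::chrono::seconds(2));

    const double cpu_s  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    const double wall_s = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - wall_start).count();
    std::printf("CPU time: %.2f s, elapsed time: %.2f s\n", cpu_s, wall_s);
    return 0;
}
[/code]
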
ID: 94741
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1861
Credit: 8,161,875
RAC: 8,131
Message 94775 - Posted: 18 Apr 2020, 16:14:12 UTC - in response to Message 94660.  

Some problems with "12v1n_" wus.

I've processed these ones with no problems so far.
They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.

Now they seem fine for me too...
1153276213
ID: 94775
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1861
Credit: 8,161,875
RAC: 8,131
Message 94777 - Posted: 18 Apr 2020, 16:35:56 UTC - in response to Message 94741.  
Last modified: 18 Apr 2020, 16:36:35 UTC

My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results:
18.04.2020 06:26:41 | Rosetta@home | Output file rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0 for task rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0 absent

The point is, it seems to me that there are some models that are either buggy or need much more time to produce even a single result, and the watchdog doesn't like it.

Problems with cstwt WUs are well known.
ID: 94777
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 94786 - Posted: 18 Apr 2020, 17:46:26 UTC

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs.
ID: 94786
Sid Celery

Joined: 11 Feb 08
Posts: 1982
Credit: 38,462,565
RAC: 15,158
Message 94789 - Posted: 18 Apr 2020, 17:53:30 UTC - in response to Message 94786.  

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs.

Really? That doesn't seem a great move.
The first problem, it seems to me, is that "CPU time since checkpoint" is equal to the "CPU time" of the task.
That is, even after the requested runtime PLUS the existing 4hr watchdog, the task hasn't checkpointed at all.
The watchdog is there for tasks that've gone "rogue", not to wait for the first and only checkpoint.
OK, if the task has completed several decoys already but the last one is taking an unexpectedly long time, a longer watchdog may be appropriate, but is that what's being reported?
ID: 94789
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94792 - Posted: 18 Apr 2020, 18:15:51 UTC - in response to Message 94789.  
Last modified: 18 Apr 2020, 18:52:13 UTC

I would think it means they really need to see some of these extreme models completed, and that all models might take a long time. Your scenario, with early models completed and then one long one, doesn't sound like the reason one would make the change described.

Rest assured that my experience with the project has always been that model runtimes, and the consistency of model runtimes, improve with updates to the specific protocols. But in the meantime, extending the watchdog sounds like the fastest way for them to get some results.

{edit}
I don't mean to sound like I am refuting any of the desirable attributes of WUs that Sid mentioned. The Project Team is very aware of the desirability of checkpoints, fast and consistent model runtimes, etc. The fact that they chose to extend the watchdog to 10 hours really tells me that it came down to either doing this or not getting the data needed to continue the COVID study. I'm confident it will not be a permanent change.
Rosetta Moderator: Mod.Sense
ID: 94792
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 94793 - Posted: 18 Apr 2020, 18:36:44 UTC - in response to Message 94775.  
Last modified: 18 Apr 2020, 18:40:35 UTC

Some problems with "12v1n_" wus.

I've processed these ones with no problems so far.
They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.

Now they seem fine for me too...
1153276213



Not all of the 12v1n's are having issues, but I've had the issue mentioned above with long run times (the task seems to stall).

(Task linked below.) The first time I ran it, it went over 24 hours and got to 99.4%, and then for other reasons I rebooted my machine. After the reboot, the same task reset to 0% and started over. I let it run 12+ hours the second time, and it exhibited the same behavior. I quit BOINC and relaunched, and again the same task reset to 0% and started over. I aborted it. Now it's on an Android machine, so we'll see if it goes anywhere.

Aborted task: https://boinc.bakerlab.org/rosetta/result.php?resultid=1151472990
Where it lives now: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1035999438

I'm intrigued to see if the second computer is able to finish it.

/edit: I should add that I have finished several of the 12v1n tasks without issue, so it's not widespread.
ID: 94793
rsNeutrino

Joined: 22 Mar 20
Posts: 10
Credit: 3,969,284
RAC: 8,148
Message 94798 - Posted: 18 Apr 2020, 23:16:29 UTC - in response to Message 94786.  
Last modified: 18 Apr 2020, 23:22:14 UTC

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take affect on new jobs.

After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145
Did the watchdog end it?
BOINC:: CPU time: 43719.1s, 14400s + 28800s[2020- 4-18 16: 7:48:] :: BOINC 
ID: 94798
Mod.Sense
Volunteer moderator

Send message
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94801 - Posted: 18 Apr 2020, 23:49:46 UTC - in response to Message 94798.  

After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145
Did the watchdog end it?
BOINC:: CPU time: 43719.1s, 14400s + 28800s[2020- 4-18 16: 7:48:] :: BOINC 


Yes, this looks like a good example of why, for future WUs, the watchdog will be set to kick in only 10 hours after the preferred runtime (versus the prior setting of 4 hours past).
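
In concrete terms (a small worked comparison using the figures quoted in this thread; the actual server-side mechanism isn't shown anywhere here), the change moves the cut-off for the default 8-hour preference from 43,200 s to 64,800 s of CPU time:

[code]
#include <cstdio>

int main() {
    const double pref_s      = 8 * 3600.0;            // default runtime preference (28800 s)
    const double old_limit_s = pref_s + 4 * 3600.0;   // previous watchdog window: 43200 s
    const double new_limit_s = pref_s + 10 * 3600.0;  // new watchdog window: 64800 s
    const double used_s      = 43719.1;               // CPU time of the task quoted above

    std::printf("old limit %.0f s, new limit %.0f s\n", old_limit_s, new_limit_s);
    std::printf("that task went %.1f s past the old limit; "
                "under the new one it would have had %.1f s left\n",
                used_s - old_limit_s, new_limit_s - used_s);
    return 0;
}
[/code]
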
Rosetta Moderator: Mod.Sense
ID: 94801