Rosetta 4.1+ and 4.2+

Message boards : Number crunching : Rosetta 4.1+ and 4.2+

Grant (SSSF)
Joined: 28 Mar 20
Posts: 969
Credit: 10,410,017
RAC: 22,691
Message 94660 - Posted: 17 Apr 2020, 7:29:08 UTC - in response to Message 94657.  

Some problems with "12v1n_" wus.
I've processed these ones with no problems so far.
12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_39_0
12v1n_al_12mer_design_00062_014552_0001_SAVE_ALL_OUT_913824_41_0
12v1n_al_12mer_design_00166_018161_0001_SAVE_ALL_OUT_914183_55_0
12v1n_al_12mer_design_00178_008639_0001_SAVE_ALL_OUT_914209_113_0
12v1n_al_12mer_design_00329_016075_0001_SAVE_ALL_OUT_914468_22_0

They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.



I have 4 hr WUs (set in my profile), but these have been crunching for over 10 hrs with NO checkpoint.
This is an example, 1152133806:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75C1E062
It's been sent to another system, so we'll see how it goes.
Grant
Darwin NT
ID: 94660
Stret
Joined: 18 Mar 20
Posts: 7
Credit: 529,664
RAC: 0
Message 94690 - Posted: 17 Apr 2020, 15:38:20 UTC

Please move to relevant forum, I was struggling to find a help section.

One of my work units has been running for over a day (unusual in and of itself) and its progress is not going up at all. It says it is 10 minutes from finishing, but that hasn't changed in over 12 hours.

I suspect, based on my rudimentary programming knowledge, that it has hit an infinite loop.

What is the best way forward? There's no point in it hitting its deadline and doing the same on another machine.

Copy and paste from the properties of the WU:

Application: Rosetta 4.15
Name: 12v1n_al_12mer_design_00003_000575_0001_SAVE_ALL_OUT_913418_17
State: Running
Received: 15/04/2020 07:16:00
Report deadline: 18/04/2020 07:16:02
Estimated computation size: 80,000 GFLOPs
CPU time: 1d 05:34:42
CPU time since checkpoint: 1d 05:34:42
Elapsed time: 1d 05:55:59
Estimated time remaining: 00:10:07
Fraction done: 99.439%
Virtual memory size: 244.53 MB
Working set size: 48.34 MB
Directory: slots/5
Process ID: 26968
Progress rate: 3.240% per hour
Executable: rosetta_4.15_windows_x86_64.exe
ID: 94690
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,826
RAC: 0
Message 94691 - Posted: 17 Apr 2020, 16:52:02 UTC - in response to Message 94690.  

Yeah, I had a similar one whose name started the same way as yours. It ran over a day and a half before I aborted it. Apparently others have reported issues with tasks with names like that. Just abort it.

-Charlie
ID: 94691
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 94719 - Posted: 18 Apr 2020, 1:35:54 UTC

Task: 1150978005
Task Name: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58
CPU time: 15:01:25
CPU time since checkpoint: 15:01:25
Elapsed time: 15:28:46
Estimated time remaining: 00:10:18 (which varies between 00:10:17-00:10:20)
Fraction done: 98.901% (Which has gradually increased over last 1/2 hr or so from 98.893% or so)

Original estimated time was 8 hrs, so now is 7.5 hrs over this. Shouldn't Watchdog have stopped processing, as now over 4 hrs longer than estimated processing time? Fraction done is slowly rising, so reluctant to abort at this point. Concerning that only checkpoint was when task first started. If BOINC manager stops or suspends, afraid task will want to start over from scratch! I'll keep an eye on this for now and if no change in hour or so, will likely need to abort at that time.
ID: 94719
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94728 - Posted: 18 Apr 2020, 4:03:36 UTC - in response to Message 94719.  

The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager. Once the WU reports back it will show the runtime preference it was run with. But you are correct, no checkpoints for over an hour is not a good sign, and your other work units seem to be running with the default 8 hour preference.
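As a rough illustration of the rule described above (a sketch only, not project code), the cutoff can be treated as the runtime preference plus a fixed grace period. The function names here are hypothetical, and the 4-hour grace value is simply the figure quoted in this thread.

```python
from datetime import timedelta

# Assumption: the watchdog allows a fixed grace period on top of the
# user's runtime preference (4 hours, as stated in this thread).
WATCHDOG_GRACE = timedelta(hours=4)


def watchdog_cutoff(runtime_preference, grace=WATCHDOG_GRACE):
    """Return the CPU time after which the watchdog is expected to act."""
    return runtime_preference + grace


def is_past_cutoff(cpu_time, runtime_preference, grace=WATCHDOG_GRACE):
    """True if the task has used more CPU time than preference + grace."""
    return cpu_time > watchdog_cutoff(runtime_preference, grace)


default_pref = timedelta(hours=8)  # default runtime preference
print(watchdog_cutoff(default_pref))  # 12:00:00

# CPU time of 15:01:25 reported for the task discussed above.
reported = timedelta(hours=15, minutes=1, seconds=25)
print(is_past_cutoff(reported, default_pref))  # True
```

With the default 8-hour preference, a task at 15 hours of CPU time is already past the expected 12-hour cutoff, which is why the lack of a watchdog stop is worth questioning.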

Is this with your i5 Windows 7 Professional machine? It looks like your i3, also with Win7, has already run a few similar tasks with dozens of models completed in the same period of time, even with less memory per core. Are the BOINC settings the same for both systems?
Rosetta Moderator: Mod.Sense
ID: 94728
rsNeutrino
Joined: 22 Mar 20
Posts: 8
Credit: 2,017,538
RAC: 3,696
Message 94733 - Posted: 18 Apr 2020, 5:50:56 UTC
Last modified: 18 Apr 2020, 5:54:55 UTC

Task 1152764941 also drove into the 12 hour timeout, reaching 98% around 10 minutes before that.

<core_client_version>7.16.5</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe @rb_04_16_21806_21365_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 2 2 2 1 1 1 1 2 1 1 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_04_16_21806_21365_ab_t000__robetta.zip -frag3 rb_04_16_21806_21365_ab_t000__robetta.200.3mers.index.gz -fragA rb_04_16_21806_21365_ab_t000__robetta.200.4mers.index.gz -fragB rb_04_16_21806_21365_ab_t000__robetta.200.7mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 5000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1285868
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43301.9s, 14400s + 28800s[2020- 4-18  6:26:34:] :: BOINC 
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE ::     1 starting structures  43301.9 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
06:26:34 (10032): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Task 1153161145 is probably going to end up the same, 34% at 4h 10min.
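For what it's worth, the guess above can be written as a naive linear extrapolation. This is only a sketch: it assumes "Fraction done" grows linearly with CPU time, which other reports in this thread suggest is often not the case, and the function name is made up for illustration.

```python
from datetime import timedelta


def projected_total(cpu_time, fraction_done):
    """Naively extrapolate total CPU time from the current fraction done.

    Assumption: progress is linear in CPU time, which is not guaranteed.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return cpu_time / fraction_done


# 34% done after 4h 10min, as reported above.
total = projected_total(timedelta(hours=4, minutes=10), 0.34)

# Assumed cutoff: 8 h target plus the 4 h watchdog grace period.
cutoff = timedelta(hours=8) + timedelta(hours=4)
print(total > cutoff)  # True
```

The projection comes out at roughly 12 h 15 min, just over the 12-hour cutoff, so the task would be expected to run into the watchdog before finishing on its own.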
ID: 94733
Grant (SSSF)
Joined: 28 Mar 20
Posts: 969
Credit: 10,410,017
RAC: 22,691
Message 94734 - Posted: 18 Apr 2020, 5:53:54 UTC - in response to Message 94733.  
Last modified: 18 Apr 2020, 5:55:27 UTC

Task 1152764941 also drove into the 12 hour timeout.
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
Looks like there was a file transfer problem there as well.
Grant
Darwin NT
ID: 94734
rsNeutrino
Joined: 22 Mar 20
Posts: 8
Credit: 2,017,538
RAC: 3,696
Message 94735 - Posted: 18 Apr 2020, 5:58:04 UTC - in response to Message 94734.  
Last modified: 18 Apr 2020, 6:02:52 UTC

Looks like there was a file transfer problem there as well.

Maybe because it didn't have any checkpoint or result file to upload, since it wasn't finished with its first decoy. That's probably also the reason for the long runtime: it HAS to finish one before shutdown, otherwise it keeps going until the watchdog kills it.
ID: 94735
Grant (SSSF)
Joined: 28 Mar 20
Posts: 969
Credit: 10,410,017
RAC: 22,691
Message 94737 - Posted: 18 Apr 2020, 6:03:10 UTC - in response to Message 94735.  
Last modified: 18 Apr 2020, 6:07:49 UTC

Looks like there was a file transfer problem there as well.
Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished.
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished.

======================================================
DONE ::     1 starting structures  43301.9 cpu seconds
This process generated      1 decoys from       1 attempts
======================================================
06:26:34 (10032): called boinc_finish(0)
If it had returned the result, it would have (or at least should have) Validated.
Grant
Darwin NT
ID: 94737
James W
Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 94739 - Posted: 18 Apr 2020, 7:15:20 UTC - in response to Message 94728.  

The watchdog kicks in 4 hours after the runtime preference, not the estimated runtime shown in the BOINC Manager.
Both computers are set with the default CPU runtime of 8 hrs (28800 seconds).
Is this with your i5 Windows 7 Professional machine?
Yes, the times etc. I quoted in my previous post were for the i5 Windows 7 PC.

As of 07:00 UTC:
Task: 1150978005
Task Name: 12v1n_al_12mer_design_00026_019077_0001_SAVE_ALL_OUT_913633_58
CPU time: 19:29:08
CPU time since checkpoint: 19:29:08
Elapsed time: 20:03:42
Estimated time remaining: 00:10:17
Fraction done: 99.152%

Fraction done moved up slightly in last 4.5 hrs, though estimated time remaining has remained the same. CPU runtime also over 11 hrs over default/set time. Doubt this task would be valid, even if Watchdog stops processing because of having only the one checkpoint at start of processing. Probably will need to abort if BOINC manager/project doesn't stop processing.
ID: 94739
Grant (SSSF)
Joined: 28 Mar 20
Posts: 969
Credit: 10,410,017
RAC: 22,691
Message 94740 - Posted: 18 Apr 2020, 7:21:42 UTC - in response to Message 94737.  

Looks like there was a file transfer problem there as well.
Maybe because it didn't have any checkpoint or result file to upload since it wasn't finished.
From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished.
... and then I get 2 Tasks where the watchdog hasn't kicked in at all, even after 1hr plus over the 4hrs and multiple checkpoints in that time.
Grant
Darwin NT
ID: 94740
rsNeutrino
Joined: 22 Mar 20
Posts: 8
Credit: 2,017,538
RAC: 3,696
Message 94741 - Posted: 18 Apr 2020, 8:19:49 UTC - in response to Message 94737.  
Last modified: 18 Apr 2020, 8:21:25 UTC

From the couple I've had, it looks like the Watchdog will let it run for up to an extra 4 hours, and if it doesn't finish in that time then the next time it goes to Checkpoint, the Watchdog then ends it and it is considered finished.

My target time is 8 hours.
The task reached 12 hours, so it did run for 4 extra hours.
I had my eye on that task before it ended, and BOINC told me in the task properties that "CPU time since checkpoint" was equal to the "CPU time" of that task.
Which means there wasn't even one checkpoint saved in the 12h since the start of that task.
The second task shows the same symptoms at the moment, CPU time 04:40:xx, CPU time since checkpoint 04:40:xx, Elapsed time 04:45:xx.
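The symptom described here (no checkpoint since the task started) can be checked mechanically from the two values shown in the task properties. The following is only a sketch: the helper names are hypothetical, and the duration format is assumed to be either "HH:MM:SS" or "Nd HH:MM:SS", as seen in the properties pasted in this thread.

```python
import re
from datetime import timedelta

# Assumed duration format from the BOINC Manager task properties:
# "HH:MM:SS" or "Nd HH:MM:SS" (e.g. "04:40:12" or "1d 05:34:42").
_DURATION_RE = re.compile(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})")


def parse_duration(text):
    """Convert a properties-style duration string to a timedelta."""
    match = _DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return timedelta(days=int(days or 0), hours=int(hours),
                     minutes=int(minutes), seconds=int(seconds))


def never_checkpointed(cpu_time, cpu_time_since_checkpoint):
    """True if all CPU time so far has accrued since the last checkpoint."""
    cpu = parse_duration(cpu_time)
    since = parse_duration(cpu_time_since_checkpoint)
    return cpu > timedelta(0) and since >= cpu


print(never_checkpointed("1d 05:34:42", "1d 05:34:42"))  # True
print(never_checkpointed("04:40:00", "00:02:00"))        # False
```

A task that returns True here would lose all of its work if BOINC were restarted, since there is nothing to resume from.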

My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results:
18.04.2020 06:26:41 | Rosetta@home | Output file rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0 for task rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0 absent

Also, the watchdog seems to look at "cpu seconds", i.e. "CPU time", not the slightly longer Elapsed time.
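To make that comparison concrete, here is a small sketch that pulls the three numbers out of the "BOINC:: CPU time" line in the stderr output. It is an assumption on my part that the second and third values are the watchdog grace period and the target CPU time; 14400 s and 28800 s do match 4 h and the `-cpu_run_time 28800` flag, but the log itself does not label them. The function name is hypothetical.

```python
import re

# Matches e.g. "BOINC:: CPU time: 43301.9s, 14400s + 28800s[2020- 4-18 ..."
_LINE_RE = re.compile(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s")


def parse_watchdog_line(line):
    """Extract CPU seconds and the two limits from a watchdog log line.

    Assumption: the second number is the grace period and the third is
    the target CPU time; the log does not label them explicitly.
    """
    match = _LINE_RE.search(line)
    if match is None:
        return None
    cpu_seconds = float(match.group(1))
    grace_seconds = int(match.group(2))
    target_seconds = int(match.group(3))
    return {
        "cpu_seconds": cpu_seconds,
        "grace_seconds": grace_seconds,
        "target_seconds": target_seconds,
        "exceeded": cpu_seconds > grace_seconds + target_seconds,
    }


sample = ("BOINC:: CPU time: 43301.9s, 14400s + 28800s"
          "[2020- 4-18  6:26:34:] :: BOINC")
result = parse_watchdog_line(sample)
print(result["exceeded"])  # True
```

In the sample line the task had used 43301.9 CPU seconds against a combined limit of 43200 s, so it was ended shortly after crossing the 12-hour mark in CPU time rather than elapsed time.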

The point is, it seems to me that there are some models that are either buggy or need much more time to produce even a single result, and the watchdog doesn't like it.
If the model can't be changed to fit in an 8h timeslot, raising the watchdog timeout could be a necessary option, which MAY have already happened in your and James' cases, but not in mine.
ID: 94741
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1386
Credit: 5,723,887
RAC: 3,361
Message 94775 - Posted: 18 Apr 2020, 16:14:12 UTC - in response to Message 94660.  

Some problems with "12v1n_" wus.

I've processed these ones with no problems so far.
They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.

Now seems well for me too...
1153276213
ID: 94775
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1386
Credit: 5,723,887
RAC: 3,361
Message 94777 - Posted: 18 Apr 2020, 16:35:56 UTC - in response to Message 94741.  
Last modified: 18 Apr 2020, 16:36:35 UTC

My understanding is that the watchdog is there to kill the task at target time + 4h, regardless of whether there are any results:
18.04.2020 06:26:41 | Rosetta@home | Output file rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0_r1479539452_0 for task rb_04_16_21806_21365_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_918009_202_0 absent

The point is, it seems to me that there are some models that are either buggy or need much more time to produce even a single result, and the watchdog doesn't like it.

Problems with cstwt wus are well known
ID: 94777
Admin
Project administrator
Joined: 1 Jul 05
Posts: 5545
Credit: 0
RAC: 0
Message 94786 - Posted: 18 Apr 2020, 17:46:26 UTC

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs.
ID: 94786
Sid Celery
Joined: 11 Feb 08
Posts: 1659
Credit: 30,332,716
RAC: 23,270
Message 94789 - Posted: 18 Apr 2020, 17:53:30 UTC - in response to Message 94786.  

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs.

Really? That doesn't seem a great move.
The first problem seems to me that "CPU time since checkpoint" is equal to the "CPU time" of the task.
That is, even after the requested runtime PLUS the existing 4hr watchdog, the task hasn't checkpointed at all.
The watchdog is there for tasks that've gone "rogue", not to wait for a single and first checkpoint.
Ok, if the task has completed several decoys already but the last one is taking an unexpectedly long time, a longer watchdog is maybe appropriate, but is that what's being reported?
ID: 94789
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94792 - Posted: 18 Apr 2020, 18:15:51 UTC - in response to Message 94789.  
Last modified: 18 Apr 2020, 18:52:13 UTC

I would think that it means they really need to see some of these extreme models completed. That all models might take a long time. Your scenario with early models completed and then one long one doesn't sound like the reason one would make the change described.

Rest assured that my experience with the project has always been that model runtimes and the consistency of model runtimes improve with updates to the specific protocols. But, in the meantime, extending the watchdog sounds like the fastest way for them to get some results.

{edit}
I don't mean to sound like I am refuting any of the desirable attributes of WUs that Sid mentioned. The Project Team is very aware of the desirability of checkpoints, fast consistent model runtimes, etc. The fact that they chose to extend the watchdog to 10 hours really tells me that we're down to either do this, or don't get the data you need to continue your COVID study. I'm confident it will not be a permanent change.
Rosetta Moderator: Mod.Sense
ID: 94792
CIA
Joined: 3 May 07
Posts: 100
Credit: 21,056,786
RAC: 60
Message 94793 - Posted: 18 Apr 2020, 18:36:44 UTC - in response to Message 94775.  
Last modified: 18 Apr 2020, 18:40:35 UTC

Some problems with "12v1n_" wus.

I've processed these ones with no problems so far.
They finish early (done in 3hrs with 8hr Target CPU time), but all are Valid.

Now seems well for me too...
1153276213



Not all of the 12v1n's are having issues, but I've had the issue mentioned above about long run time (seems to stall).

(Task linked below) First time I ran it, it went over 24 hours, got to 99.4%, and then for other reasons I rebooted my machine. The same task after reboot reset to 0% and started over. I let it run 12+ hours the second time, and it was exhibiting the same behavior. I quit BOINC and relaunched, and again the same task reset to 0% and started over. I aborted it. Now it's on an Android machine, we'll see if it goes anywhere.

Aborted task: https://boinc.bakerlab.org/rosetta/result.php?resultid=1151472990
Where it lives now: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1035999438

I'm intrigued to see if the second computer is able to finish it.

/edit. Should add, I have finished several of the 12v1n tasks without issue, so it's not widespread.
ID: 94793
rsNeutrino
Joined: 22 Mar 20
Posts: 8
Credit: 2,017,538
RAC: 3,696
Message 94798 - Posted: 18 Apr 2020, 23:16:29 UTC - in response to Message 94786.  
Last modified: 18 Apr 2020, 23:22:14 UTC

There very well could be jobs that have long run times per model. I increased the watchdog to 10 hours. This will only take effect on new jobs.

After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145
Did the watchdog end it?
BOINC:: CPU time: 43719.1s, 14400s + 28800s[2020- 4-18 16: 7:48:] :: BOINC 
ID: 94798
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94801 - Posted: 18 Apr 2020, 23:49:46 UTC - in response to Message 94798.  

After 12h 8min CPU time this one finished successfully with 1 decoy: 1153161145
Did the watchdog end it?
BOINC:: CPU time: 43719.1s, 14400s + 28800s[2020- 4-18 16: 7:48:] :: BOINC 


Yes, this looks like a good example why, in future WUs, the watchdog will be set to only kick in 10 hours after the preferred runtime (versus the prior 4 hours past setting).
Rosetta Moderator: Mod.Sense
ID: 94801


©2021 University of Washington
https://www.bakerlab.org