Computation error - WS_max

Message boards : Number crunching : Computation error - WS_max

[AF>Libristes] Dudumomo

Joined: 30 Nov 06
Posts: 6
Credit: 10,826,140
RAC: 1
Message 90549 - Posted: 22 Mar 2019, 6:01:28 UTC

Hello,
I am getting a lot of errors on many machines. The target run time is left at the default (8 hours), and out of 500 results, 200 failed, which is a lot for me.

It is always the same error, on different machines, with Rosetta v4.08 x86_64-pc-linux-gnu

<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @rb_03_20_1958_2097_ab_t000__h001_robetta_FLAGS -in::file::fasta t000__h001.fasta -psipred_ss2 t000__h001.spider3_ss2 -kill_hairpins t000__h001.nobuformat.spider3_ss2 -abinitio::use_filters true -in:file:boinc_wu_zip rb_03_20_1958_2097_ab_t000__h001_robetta.zip -frag3 rb_03_20_1958_2097_ab_t000__h001_robetta.200.3mers.index.gz -fragA rb_03_20_1958_2097_ab_t000__h001_robetta.200.4mers.index.gz -fragB rb_03_20_1958_2097_ab_t000__h001_robetta.200.7mers.index.gz -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3837233
Starting watchdog...
Watchdog active.
======================================================
DONE ::     1 starting structures    28077 cpu seconds
This process generated     25 decoys from      25 attempts
======================================================
BOINC :: WS_max 2.82124e+08

BOINC :: Watchdog shutting down...
05:08:52 (30439): called boinc_finish(0)

</stderr_txt>
]]>
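As a reading aid for the log above, here is a minimal Python sketch that pulls the summary numbers out of the stderr text and compares them with the requested run time. The excerpt is copied from the log with the command line shortened to the one relevant flag; the function name `summarize` and the dictionary keys are hypothetical. It suggests, without proving it, that the science application ended normally (about 28077 of 28800 seconds used, 25 decoys), so the failure may come from how the client handles the finished task and not from the computation itself.

```python
import re

# Excerpt of the stderr text quoted above; the command line is shortened
# to the single flag this sketch needs.
STDERR = """\
command: ... -nstruct 10000 -cpu_run_time 28800 -watchdog ...
DONE ::     1 starting structures    28077 cpu seconds
This process generated     25 decoys from      25 attempts
BOINC :: WS_max 2.82124e+08
"""


def summarize(stderr: str) -> dict:
    """Extract target run time, CPU seconds, and decoy count (hypothetical helper)."""
    target = re.search(r"-cpu_run_time\s+(\d+)", stderr)
    done = re.search(
        r"DONE ::\s+\d+ starting structures\s+([\d.]+) cpu seconds", stderr
    )
    decoys = re.search(r"generated\s+(\d+) decoys from\s+(\d+) attempts", stderr)
    if not (target and done and decoys):
        raise ValueError("expected summary lines not found")
    cpu_s = float(done.group(1))
    n_decoys = int(decoys.group(1))
    return {
        "target_s": int(target.group(1)),
        "cpu_s": cpu_s,
        "decoys": n_decoys,
        # Average CPU time per decoy; None if no decoy was produced.
        "avg_s_per_decoy": cpu_s / n_decoys if n_decoys else None,
    }


summary = summarize(STDERR)
print(summary)
```

With an average of roughly 1123 seconds per decoy, one more decoy would not have fit in the remaining time before the 28800-second target, which is consistent with the task stopping where it did.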


Any idea how to fix this?

Thank you
MyUneo, the Cupid of Services
Trotador

Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 90552 - Posted: 22 Mar 2019, 19:01:22 UTC - in response to Message 90549.  

This error is common on machines with a high number of threads: when several threads finish crunching tasks at the same time, the BOINC client cannot handle all the requests, and some units fail with this message.

Your machines do not have that many threads, but maybe you started all the units at the same time, and the error happens when they finish close together. You can see that all the units failed with the same timestamp, which is indicative of this behaviour.

Try starting the units at separate times and you should see the errors decrease.
rjs5

Joined: 22 Nov 10
Posts: 272
Credit: 21,059,557
RAC: 18,044
Message 90554 - Posted: 22 Mar 2019, 20:11:42 UTC - in response to Message 90552.  


Starting them at different times does not guarantee finishing at different times in the future. 8-)

I was watching when my 18C/36T machine finished a single 4.08 WU. The time remaining value clocked down to zero. The job did not identify as finished, but went into the Waiting mode. After some time, it restarted and was marked as a Compute Error and aborted. That seems to conflict with the "simultaneous finish" theory. It was curious that the WU went into the wait mode with no time remaining.

Could be a BOINC bug though ...

I sorted the WUs by finish time, and a couple of the failures had another WU that finished at the same time. Maybe the other failing ones did not coincide with a Rosetta WU finishing, but with a WU from another project. This machine has an Nvidia 2080 Ti and is finishing WUs frequently.

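Columns: task ID, work unit ID, sent, time reported, status, run time (sec), CPU time (sec), credit, application

1063313212 956527517 17 Mar 2019, 7:56:26 UTC 17 Mar 2019, 16:34:05 UTC Completed and validated 29,041.25 28,562.28 280.28 Rosetta Mini v3.78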
1063259978 957819313 16 Mar 2019, 20:52:15 UTC 17 Mar 2019, 7:56:26 UTC Error while computing 24,301.37 23,899.48 --- Rosetta v4.08
1063255548 957815416 16 Mar 2019, 20:18:24 UTC 17 Mar 2019, 7:56:26 UTC Error while computing 26,449.68 25,963.64 --- Rosetta v4.08

1063592535 958114526 19 Mar 2019, 2:13:38 UTC 19 Mar 2019, 10:01:22 UTC Completed and validated 28,038.34 27,607.97 379.43 Rosetta v4.08


1063704726 958214732 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:31:10 UTC Completed and validated 29,219.93 28,565.12 307.87 Rosetta v4.08
1063704662 958214725 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Error while computing 29,352.41 28,678.44 --- Rosetta v4.08
1063704674 958214749 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Completed and validated 29,336.58 28,735.34 295.1 Rosetta v4.08

1063704370 958214412 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 19:12:16 UTC Completed and validated 29,197.75 28,517.11 296.96 Rosetta v4.08

1063877290 958373683 20 Mar 2019, 17:46:57 UTC 21 Mar 2019, 3:07:19 UTC Completed and validated 29,343.53 28,530.64 281.28 Rosetta v4.08
1063857276 958355886 20 Mar 2019, 16:18:21 UTC 21 Mar 2019, 3:10:03 UTC Error while computing 29,451.14 28,604.48 --- Rosetta v4.08
1063877630 958374044 20 Mar 2019, 17:46:57 UTC 21 Mar 2019, 3:12:16 UTC Completed and validated 29,308.20 28,470.42 275.65 Rosetta v4.08


1064060668 958528619 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:41:55 UTC Completed and validated 44,962.40 43,453.92 20 Rosetta v4.08
1064059944 958528620 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Completed and validated 44,850.18 43,373.49 20 Rosetta v4.08
1064060662 958528613 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Error while computing 44,762.00 43,410.41 --- Rosetta v4.08

1064059821 958528500 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 15:24:30 UTC Completed and validated 45,084.65 43,506.61 20 Rosetta v4.08
1064059792 958528470 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 16:05:34 UTC Completed and validated 44,989.24 43,510.94 116.49 Rosetta v4.08
1064060259 958528209 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 18:32:44 UTC Error while computing 45,186.26 43,625.06 --- Rosetta v4.08
1064073516 958535377 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 19:47:31 UTC Completed and validated 44,952.97 43,484.82 291 Rosetta v4.08
1064060604 958528555 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 19:54:22 UTC Error while computing 44,777.94 43,274.49 --- Rosetta v4.08
1064035288 958507870 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:11:20 UTC Completed and validated 27,686.13 26,899.81 311.93 Rosetta v4.07
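
The manual check described above (sorting by finish time and looking for coinciding WUs) can be sketched in Python. This is only an illustration on four rows copied from the list; the regular expression and the function names `group_by_reported` and `classify_errors` are hypothetical. One caveat: the second timestamp in each row is the time the result was reported to the server, which is not necessarily the moment the task finished, so identical values may only mean the results were reported together.

```python
import re
from collections import defaultdict

# Four rows copied from the task list above.
ROWS = """\
1063704726 958214732 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:31:10 UTC Completed and validated 29,219.93 28,565.12 307.87 Rosetta v4.08
1063704662 958214725 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Error while computing 29,352.41 28,678.44 --- Rosetta v4.08
1063704674 958214749 20 Mar 2019, 1:45:29 UTC 20 Mar 2019, 18:40:22 UTC Completed and validated 29,336.58 28,735.34 295.1 Rosetta v4.08
1063857276 958355886 20 Mar 2019, 16:18:21 UTC 21 Mar 2019, 3:10:03 UTC Error while computing 29,451.14 28,604.48 --- Rosetta v4.08
"""

# Task ID, work unit ID, sent time, reported time, status.
ROW_RE = re.compile(
    r"^(?P<task>\d+) (?P<wu>\d+) "
    r"(?P<sent>\d+ \w+ \d{4}, [\d:]+ UTC) "
    r"(?P<reported>\d+ \w+ \d{4}, [\d:]+ UTC) "
    r"(?P<status>Completed and validated|Error while computing) "
)


def group_by_reported(text):
    """Map each reported timestamp to the (task, status) pairs sharing it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            groups[m["reported"]].append((m["task"], m["status"]))
    return groups


def classify_errors(groups):
    """Split failed tasks by whether another task shares their timestamp."""
    shared, alone = [], []
    for tasks in groups.values():
        for task, status in tasks:
            if status == "Error while computing":
                (shared if len(tasks) > 1 else alone).append(task)
    return shared, alone


shared, alone = classify_errors(group_by_reported(ROWS))
print("errors sharing a report time:", shared)
print("errors with a unique report time:", alone)
```

On this sample, one failure shares its timestamp with another task and one does not, which matches the mixed picture in the list.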
Trotador

Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 90556 - Posted: 22 Mar 2019, 20:46:33 UTC

Sure, it will not make them all disappear. My experience, however, is that I get fewer errors with that approach, but you could be totally right. I've never seen a WU in "waiting mode"; it is very curious, and I do not have an explanation. The "finish file present too long" error happens to me in almost every BOINC CPU project at some point, and I ended up associating it with this type of conflict, but I am not sure of the actual reason. When I searched and asked (some years ago), I did not find a solution.
rjs5

Joined: 22 Nov 10
Posts: 272
Credit: 21,059,557
RAC: 18,044
Message 90557 - Posted: 22 Mar 2019, 21:02:55 UTC - in response to Message 90556.  


Wouldn't you know it. Just after I posted, I had a burst of 12 errors. Many were sent at the same time and had different run times, yet they finished at the same time. All of these seem to be a WU problem: each processed only 1 decoy and failed with default.out problems.

BOINC:: CPU time: 43297.4s, 14400s + 28800s[2019- 3-22 13:13: 5:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43297.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
13:13:05 (20611): called boinc_finish(0)

</stderr_txt>
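
The first line of the log above can be checked with a little arithmetic. One reading, which is an assumption and not confirmed anywhere in this thread, is that the line compares the CPU time used with the sum of a 14400-second allowance and the 28800-second target run time (the same value as `-cpu_run_time` in the first post). The Python sketch below only does that sum; the function name `parse_cpu_line` is hypothetical.

```python
import re

# Line copied verbatim from the stderr output above.
LINE = "BOINC:: CPU time: 43297.4s, 14400s + 28800s[2019- 3-22 13:13: 5:] :: BOINC"


def parse_cpu_line(line: str):
    """Return (cpu_s, limit_s) where limit_s is the sum of the two logged values."""
    m = re.search(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s", line)
    if m is None:
        raise ValueError("line does not match the expected format")
    cpu_s = float(m.group(1))
    # Assumed meaning: extra allowance + target run time.
    limit_s = int(m.group(2)) + int(m.group(3))
    return cpu_s, limit_s


cpu_s, limit_s = parse_cpu_line(LINE)
over_by = cpu_s - limit_s
print(f"CPU time {cpu_s} s vs. {limit_s} s: over by {over_by:.1f} s")
```

If that reading is right, the task was stopped shortly after passing 43200 seconds with a single decoy written, which would fit the CPU times of roughly 43,200 to 43,700 seconds in the list that follows.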

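Columns: task ID, work unit ID, sent, time reported, status, run time (sec), CPU time (sec), credit, application

1064060602 958528553 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 13:12:34 UTC Completed and validated 39,821.69 38,579.93 116.28 Rosetta v4.08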
1064060668 958528619 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:41:55 UTC Completed and validated 44,962.40 43,453.92 20 Rosetta v4.08
1064059944 958528620 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Completed and validated 44,850.18 43,373.49 20 Rosetta v4.08
1064060662 958528613 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 14:53:37 UTC Error while computing 44,762.00 43,410.41 --- Rosetta v4.08

1064059821 958528500 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 15:24:30 UTC Completed and validated 45,084.65 43,506.61 20 Rosetta v4.08
1064059792 958528470 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 16:05:34 UTC Completed and validated 44,989.24 43,510.94 116.49 Rosetta v4.08
1064060259 958528209 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 18:32:44 UTC Error while computing 45,186.26 43,625.06 --- Rosetta v4.08

1064073516 958535377 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 19:47:31 UTC Completed and validated 44,952.97 43,484.82 291 Rosetta v4.08
1064060604 958528555 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 19:54:22 UTC Error while computing 44,777.94 43,274.49 --- Rosetta v4.08

1064035288 958507870 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:11:20 UTC Completed and validated 27,686.13 26,899.81 311.93 Rosetta v4.07
1064035273 958507856 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:19:38 UTC Completed and validated 27,818.37 27,026.80 248.57 Rosetta v4.07
1064035277 958507860 21 Mar 2019, 16:25:45 UTC 22 Mar 2019, 2:20:09 UTC Completed and validated 26,983.10 26,201.80 184.35 Rosetta v4.07
1064060670 958528621 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 20:03:51 UTC Error while computing 44,845.19 43,253.75 --- Rosetta v4.08
1064073472 958535368 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:09:03 UTC Error while computing 44,966.90 43,476.75 --- Rosetta v4.08

1064073428 958535289 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:12:04 UTC Completed and validated 44,720.47 43,254.19 395.21 Rosetta v4.08
1064073481 958535376 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:14:51 UTC Error while computing 44,777.87 43,297.95 --- Rosetta v4.08
1064073479 958535374 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,082.91 43,578.67 --- Rosetta v4.08
1064073494 958535355 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,129.88 43,609.00 --- Rosetta v4.08
1064073496 958535357 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,082.15 43,594.03 --- Rosetta v4.08
1064073508 958535369 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,177.07 43,657.32 --- Rosetta v4.08
1064073408 958535304 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 45,057.32 43,556.66 --- Rosetta v4.08
1064073440 958535301 22 Mar 2019, 6:56:25 UTC 22 Mar 2019, 20:44:20 UTC Error while computing 44,919.55 43,403.97 --- Rosetta v4.08

1064060666 958528617 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 3:09:31 UTC Completed and validated 17,404.94 16,926.27 2.44 Rosetta v4.08
1064060548 958528499 21 Mar 2019, 20:39:44 UTC 22 Mar 2019, 3:18:50 UTC Completed and validated 17,603.57 17,097.12 2.63 Rosetta v4.08
Trotador

Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 90558 - Posted: 23 Mar 2019, 0:03:52 UTC

These last errors of yours are specific to Rosetta. The units causing this "process got signal 11" error message were supposed to have been retired, but they continue to appear.




©2024 University of Washington
https://www.bakerlab.org