Some of the erroneous WUs are listed as successful and even pass validation

Message boards : Number crunching : Some of the erroneous WUs are listed as successful and even pass validation

Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,094,416
RAC: 13,124
Message 101729 - Posted: 5 May 2021, 7:05:21 UTC

I spotted some WUs that failed completely but have a "Success" status; the server even validates them and grants credit despite the critical errors.

Name pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y_1391022_1_0
Workunit 1229929395
Created 5 May 2021, 2:48:08 UTC
Sent 5 May 2021, 4:49:41 UTC
Report deadline 8 May 2021, 4:49:41 UTC
Received 5 May 2021, 5:48:12 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x00000000)
Run time 7 min 18 sec
CPU time 6 min 49 sec
Validate state Valid
Credit 3.27



Yet the logs clearly show fatal errors:
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_3cm5iv5y.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2578584
Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: [ERROR] Unable to open constraints file: bc1dd6b031238f177cab303f1b5a3aef_0001.MSAcst
ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457
08:41:23 (4920): called boinc_finish(0)

</stderr_txt>
]]>
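
The mismatch in this log can be checked mechanically: the stderr text contains `ERROR:` lines, yet the application called `boinc_finish(0)`. The snippet below is only a rough sketch of such a check. The helper name `looks_like_false_success` and the shortened sample string are made up for illustration and are not part of BOINC or Rosetta.

```python
import re

# Hypothetical helper, not part of BOINC or Rosetta: flags stderr output
# that reports an error but still ends with a zero finish code.
FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")

def looks_like_false_success(stderr_text: str) -> bool:
    """Return True if stderr has ERROR lines but boinc_finish(0) was called."""
    has_error = any(line.lstrip().startswith("ERROR:")
                    for line in stderr_text.splitlines())
    match = FINISH_RE.search(stderr_text)
    finished_ok = match is not None and int(match.group(1)) == 0
    return has_error and finished_ok

# Shortened copy of the stderr text quoted in this post.
sample = """Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: [ERROR] Unable to open constraints file: bc1dd6b031238f177cab303f1b5a3aef_0001.MSAcst
08:41:23 (4920): called boinc_finish(0)
"""
print(looks_like_false_success(sample))
```

A check like this only says that the log and the exit code disagree; it does not say whether any decoys were actually produced.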


Links to tasks

https://boinc.bakerlab.org/rosetta/result.php?resultid=1376543173

https://boinc.bakerlab.org/rosetta/result.php?resultid=1375927242

https://boinc.bakerlab.org/rosetta/result.php?resultid=1375899817
Tomcat雄猫

Joined: 20 Dec 14
Posts: 180
Credit: 5,364,639
RAC: 0
Message 102323 - Posted: 31 Jul 2021, 10:01:17 UTC - in response to Message 101729.  

It appears to be a known issue with the pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST tasks.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,316,055
RAC: 16,250
Message 102324 - Posted: 31 Jul 2021, 10:14:09 UTC - in response to Message 101729.  

Mad_Max wrote: I spotted some WUs that failed completely but have a "Success" status; the server even validates them and grants credit despite the critical errors.

Even though it errored out before it was due to finish, the work done up to that point was valid, so it validated and was awarded credit for that work.
Grant
Darwin NT
Mad_Max

Joined: 31 Dec 09
Posts: 207
Credit: 23,094,416
RAC: 13,124
Message 102326 - Posted: 31 Jul 2021, 17:35:20 UTC

It did not do any useful work.
Such WUs error out just a few minutes after starting, before the very first decoy/model has been computed and its results saved. The logs suggest this happens because the R@H app cannot open/load all the input data it needs for the computation (possibly a configuration error at the WU generation stage?).
Yet they still somehow slip through the server validator and get marked as Success/Valid.
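
One way to spot such tasks from the outside is to compare the CPU time actually used with the `-cpu_run_time` value on the command line, assuming that value is the target run time in seconds (an assumption, not something confirmed in this thread). The function names and the 5% threshold below are arbitrary choices for illustration.

```python
import re

def target_cpu_seconds(command: str):
    """Extract the -cpu_run_time value from a command line, or None if absent."""
    match = re.search(r"-cpu_run_time\s+(\d+)", command)
    return int(match.group(1)) if match else None

def ran_suspiciously_short(command: str, cpu_seconds: float,
                           min_fraction: float = 0.05) -> bool:
    """True if the task used less than min_fraction of its target CPU time.

    min_fraction is an arbitrary illustrative threshold, not a project setting.
    """
    target = target_cpu_seconds(command)
    if target is None:
        return False  # nothing to compare against
    return cpu_seconds < min_fraction * target

# Excerpt of the command line from the first post in this thread.
command = ("rosetta_4.20_windows_x86_64.exe -nstruct 10000 "
           "-cpu_run_time 28800 -boinc:max_nstruct 20000")
cpu_used = 6 * 60 + 49  # "CPU time 6 min 49 sec" from the task page
print(ran_suspiciously_short(command, cpu_used))
```

A short run alone does not prove failure, so a heuristic like this would only be useful alongside the error lines in the log.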
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,162,917
RAC: 9,188
Message 102328 - Posted: 1 Aug 2021, 22:14:44 UTC - in response to Message 102326.  
Last modified: 1 Aug 2021, 22:15:46 UTC

This was first reported to the admins in late May:
I've finally got round to examining this issue involving "ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457"
I didn't realise I'd been getting this error as much as everyone else and I've now reported it, probably for the first time, which is likely why it's been going on for so long.
I thought there were two types of error resulting from this, but there are actually three.

The one you quote above gives a gzip error, reports in the task list as a Computation Error - boinc_finish(1) - and usually has a CPU runtime of 15 seconds or less, so it awards no credit.
There's another with no gzip error that has a CPU runtime of 3 or 4 minutes, reports a Validate Error - boinc_finish(0) - and awards a few credits when the daily cleanup job runs.
The third one also has no gzip error, has a runtime of 6-9 minutes and reports as Completed and Validated, as if it had run correctly, but obviously doesn't run fully either.
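
The three variants can be told apart from the gzip error, the `boinc_finish()` code and the reported status alone. The sketch below encodes that. The `Task` fields and the `classify` function are illustrative names, not a BOINC schema, and treating the third variant as `boinc_finish(0)` is an assumption based on the log quoted at the start of this thread.

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Field names are illustrative; they do not mirror any BOINC schema.
    gzip_error: bool   # stderr contains a gzip error
    finish_code: int   # argument passed to boinc_finish()
    outcome: str       # status shown in the task list

def classify(task: Task) -> int:
    """Map a task to failure type 1, 2 or 3 as described above; 0 if none match.

    Only meaningful for tasks already known to hit the ConstraintIO.cc error.
    """
    if task.gzip_error and task.finish_code == 1 and task.outcome == "Computation Error":
        return 1  # fails within seconds, no credit
    if not task.gzip_error and task.finish_code == 0:
        if task.outcome == "Validate Error":
            return 2  # 3-4 minutes of CPU time, a few credits later
        if task.outcome == "Completed and Validated":
            return 3  # 6-9 minutes, looks like a normal success
    return 0

print(classify(Task(gzip_error=False, finish_code=0, outcome="Completed and Validated")))
```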

My main PC is reporting 3 computation errors, 3 validate errors and 50 validating properly, but 2 of them are running short, so 8 of 56 have a problem - 1 in 7.
All bar 3 are awarding credit for runtime, and those 3 account for less than a minute of CPU runtime in total across all 16 cores, so it's a far bigger issue for the project than it is for any user, and I've reported it as a project issue on that basis.
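
For anyone checking the tally, the "1 in 7" figure follows directly from the counts given above. The variable names are just labels for those counts.

```python
from fractions import Fraction

computation_errors = 3
validate_errors = 3
validated = 50
short_but_validated = 2  # counted among the 50 validated tasks

total = computation_errors + validate_errors + validated
problems = computation_errors + validate_errors + short_but_validated

print(total, problems, Fraction(problems, total))  # 56 8 1/7
```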


Their response is summarised here:
They're already aware of these bad tasks.
Because picking them out will likely take out some good tasks at the same time, and the impact on users is so minimal (if there's any impact at all - my view) they're going to be left to error out rather than distract researchers for little or no benefit to anyone.
So they're not going to be fixed. Just don't worry about them.




