New WUs failing

Message boards : Number crunching : New WUs failing

Trotador
Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 89451 - Posted: 27 Aug 2018, 21:36:54 UTC

These are all failing:
DESIG_HYBRID_1_...
ACT_1XA4_HYBRID_...
ID: 89451 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 89465 - Posted: 31 Aug 2018, 22:27:30 UTC

These are failing at a high rate on Linux hosts.

PF12228.7_nojumps_aivan

PF12228.7_jumps_aivan

They do not respect the default computing time, and when they finish they fail with an error in the output file and a "Stream information inconsistent" message.

====================================================
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.nojumps.flags -in:file:boinc_wu_zip PF12228.7.nojumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2966551
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43210.5s, 14400s + 28800s[2018- 8-31 7: 4:54:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43210.5 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
07:04:54 (10851): called boinc_finish(0)
pure virtual method called
terminate called without an active exception

</stderr_txt>
]]>
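
As a worked check of the "CPU time" line in the dump above, here is a small Python sketch that pulls out the three numbers and compares them. Reading the 14400s as a grace period added on top of the 28800s target (giving a cutoff of 43200s) is an assumption based only on the figures in the log, not on documented watchdog behaviour.

```python
import re

# Line copied from the stderr output above.
LINE = "BOINC:: CPU time: 43210.5s, 14400s + 28800s[2018- 8-31 7: 4:54:] :: BOINC"

def parse_cpu_time(line):
    """Return (cpu_seconds, grace_seconds, target_seconds) from a 'CPU time' line."""
    m = re.search(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s", line)
    if m is None:
        raise ValueError("not a CPU time line")
    return float(m.group(1)), int(m.group(2)), int(m.group(3))

cpu, grace, target = parse_cpu_time(LINE)
limit = grace + target    # assumed cutoff: grace period plus requested run time
overrun = cpu - target    # how far the task ran past the requested run time
print(f"ran {cpu:.1f}s, target {target}s, over by {overrun:.1f}s, assumed limit {limit}s")
```

If that reading is right, the task ran about four hours past the requested 8 hours and stopped just after the assumed cutoff, which fits the "do not respect the default computing time" observation.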
ID: 89465 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 89468 - Posted: 1 Sep 2018, 8:19:33 UTC

All DESIG_HYBRID_1_... WUs keep crashing on all systems.

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_DESIG_HYBRID_1_ACT_1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_DESIG_HYBRID_1_ACT_1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1125040
Starting watchdog...
Watchdog active.

ERROR: Unable to open weights/patch file. None of (./)stage1.wts or (./)stage1.wts.wts or minirosetta_database/scoring/weights/stage1.wts or minirosetta_database/scoring/weights/stage1.wts.wts exist
ERROR:: Exit from: src/core/scoring/ScoreFunction.cc line: 2748
BACKTRACE:
[0x5a62de6]
[0x4370f34]
[0x4380a82]
[0x439a000]
[0x439bbb5]
[0x3698c75]
[0x373a7d2]
[0x3740317]
[0x378c123]
[0x378d621]
[0x382ba98]
[0x382b5a3]
[0x413771]
[0x5fff8cc]
[0x610b97]
BOINC:: Error reading and gzipping output datafile: default.out
10:16:13 (1370): called boinc_finish(1)

</stderr_txt>
]]>
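
The error above names four places where the application looked for the weights file. A minimal Python sketch for checking the same four paths by hand follows; the directory passed to `find_weights` is a stand-in for the task's BOINC slot directory, and both helper names are made up for this illustration.

```python
from pathlib import Path

def weights_candidates(name, database="minirosetta_database"):
    """Mirror the four paths listed in the error message for a weights name."""
    return [
        Path(name),
        Path(name + ".wts"),
        Path(database) / "scoring" / "weights" / name,
        Path(database) / "scoring" / "weights" / (name + ".wts"),
    ]

def find_weights(slot_dir, name="stage1.wts"):
    """Return the candidate paths that exist under slot_dir (empty list if none)."""
    base = Path(slot_dir)
    return [p for p in weights_candidates(name) if (base / p).is_file()]

if __name__ == "__main__":
    # "." stands in for the slot directory of the failing task (hypothetical).
    print(find_weights(".") or "no stage1.wts candidate found")
```

An empty result on a host that otherwise runs Rosetta tasks fine would point at the work unit's input package rather than at the host.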
ID: 89468 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,054,380
RAC: 17,800
Message 89471 - Posted: 1 Sep 2018, 15:15:10 UTC - in response to Message 89468.  

It looks like you have 2 machines running. One running Ubuntu 18.04 (457 errors) and the other running Ubuntu 16.04 (4 errors).

The latest failing Ubuntu 18.04 tasks seem to be unable to open files. The error messages seem to say the files do not exist or are short.

Do you have enough free space on the disk partition that BOINC uses? The 16.04 machine seems fine.
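
One quick way to answer the free-space question is a short Python check of the partition that holds the BOINC data directory. The path below is an assumption (a common location for the Ubuntu boinc-client package); substitute your own data directory if it differs.

```python
import shutil

def free_gib(path):
    """Free space, in GiB, on the partition that holds path."""
    return shutil.disk_usage(path).free / 2**30

def has_room(path, needed_gib):
    """True if the partition holding path has at least needed_gib free."""
    return free_gib(path) >= needed_gib

if __name__ == "__main__":
    # Assumed data directory; adjust to wherever BOINC stores its files.
    data_dir = "/var/lib/boinc-client"
    try:
        print(f"{data_dir}: {free_gib(data_dir):.1f} GiB free")
    except FileNotFoundError:
        print(f"{data_dir} not found; pass your own BOINC data directory")
```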
ID: 89471 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 89473 - Posted: 1 Sep 2018, 18:15:16 UTC

Sometimes files go "missing" due to anti-virus as well, so another thing to check.
Rosetta Moderator: Mod.Sense
ID: 89473 · Rating: 0
rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 21,054,380
RAC: 17,800
Message 89474 - Posted: 1 Sep 2018, 19:15:00 UTC - in response to Message 89473.  

Sometimes files go "missing" due to anti-virus as well, so another thing to check.


I think that is true for Windows machines, but I have not heard of the problem on Linux. On Linux machines, a more common problem is the automatic partitioning of the disk. The default partitioning does not allocate enough space to the partition where BOINC puts its directory.
ID: 89474 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 89476 - Posted: 1 Sep 2018, 22:27:59 UTC

Hi, thanks for your post

The Ubuntu 18.04 host is the one that has been crunching Rosetta almost full time for the last half month. It needed the "hack" to allow Rosetta 4.07 application WUs to run without crashing all units. I don't think it is related to the errors.

There is no antivirus and there is enough free disk space; we know Rosetta is good at detecting insufficient disk space when downloading. So I do not think these are the issues.

Going to the type of units:
- I have not found anyone who has successfully crunched a DESIG_HYBRID_1_... or an ACT_1XA4_HYBRID_... WU; all my wingmen, Linux or Windows, errored as well. It looks like a WU fault to me. I'm aborting them whenever I find one.
- Some of the PFxxxxx.x_(no)jumps_aivan_... units (e.g. PF12228.7_jumps_) are a problem for Linux systems, more precisely Ubuntu; on Windows hosts they seem to crunch without problems.
- PF12228.7_jumps_... units, for example, do not respect the crunching time (I've tried with the default and with a 4-hour duration). They apparently finish crunching the WU OK, but when the unit closes it is declared invalid in some cases and no credit is awarded.

I've seen this on many other hosts from other crunchers.

Example of a valid unit:
https://boinc.bakerlab.org/result.php?resultid=1025270853

=====================================================================
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.jumps.flags -in:file:boinc_wu_zip PF12228.7.jumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2826017
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43755.4s, 14400s + 28800s[2018- 8-31 12:26:34:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43755.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
12:26:34 (12361): called boinc_finish(0)

</stderr_txt>
]]>


Example of an invalid unit:
https://boinc.bakerlab.org/result.php?resultid=1025347236

===================================================
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @PF12228.7.jumps.flags -in:file:boinc_wu_zip PF12228.7.jumps.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2797347
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43599.6s, 14400s + 28800s[2018- 8-31 21:42:23:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43599.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
21:42:23 (15153): called boinc_finish(0)

</stderr_txt>
]]>
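
The two dumps above are nearly identical: both reach `called boinc_finish(0)`, and the only visible difference is the `process got signal 11` message in the invalid one. A small Python sketch for pulling those markers out of a pasted dump is below; the function name and the shortened sample strings are illustrative, not part of any BOINC tool.

```python
import re

def summarize(result_text):
    """Pull the exit reason and boinc_finish status out of a task's stderr dump."""
    signal = re.search(r"process got signal (\d+)", result_text)
    exit_code = re.search(r"process exited with code (\d+)", result_text)
    finish = re.search(r"called boinc_finish\((\d+)\)", result_text)
    return {
        "signal": int(signal.group(1)) if signal else None,
        "exit_code": int(exit_code.group(1)) if exit_code else None,
        "boinc_finish": int(finish.group(1)) if finish else None,
    }

# Shortened excerpts of the valid and invalid dumps quoted above.
valid = "Writing W_0000001\n12:26:34 (12361): called boinc_finish(0)\n"
invalid = "process got signal 11</message>\n21:42:23 (15153): called boinc_finish(0)\n"

print(summarize(valid))
print(summarize(invalid))
```

Run over a batch of saved dumps, this makes it easy to count how many tasks finished their work but still died with a signal on exit.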
ID: 89476 · Rating: 0
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 89484 - Posted: 4 Sep 2018, 2:25:16 UTC - in response to Message 89465.  

Looks to be similar to the problem I just reported for Windows 10 with PF... tasks.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
ID: 89484 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 89485 - Posted: 4 Sep 2018, 7:35:51 UTC

The summer holidays are over.
I hope the R@H team starts to review the code... and participates in the forum!!
ID: 89485 · Rating: 0
Trotador
Joined: 30 May 09
Posts: 108
Credit: 272,283,990
RAC: 1,873
Message 89496 - Posted: 8 Sep 2018, 6:41:54 UTC
Last modified: 8 Sep 2018, 6:42:57 UTC

cis_paper_simulation_1 units failing in all systems, mine and also wingmen's


<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_x86_64-pc-linux-gnu @cis_paper_simulation_1_2_4_5_nmet.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3695217
ERROR: Illegal value for integer option -cyclic_peptide:n_methyl_positions specified: 1_2_4_5

</stderr_txt>
]]>
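
The error says the option expects integers but received the single string `1_2_4_5`. A guess, and only a guess, is that the positions 1, 2, 4 and 5 were meant and the underscore-joined form from the task name ended up in the flags file. The Python sketch below only illustrates that distinction; it is not Rosetta's option parser, and both function names are made up.

```python
def is_plain_integer(raw):
    """True only for a bare run of decimal digits, as an integer option expects."""
    return raw.isdecimal()

def parse_positions(raw):
    """Split a value such as '1_2_4_5' into the integers it appears to encode."""
    parts = raw.split("_")
    if not all(p.isdecimal() for p in parts):
        raise ValueError(f"not a list of integers: {raw!r}")
    return [int(p) for p in parts]

print(is_plain_integer("1_2_4_5"))
print(parse_positions("1_2_4_5"))
```

Since the value is rejected before any computation starts, every host would fail the same way regardless of operating system.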
ID: 89496 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 89501 - Posted: 8 Sep 2018, 20:40:03 UTC - in response to Message 89496.  
Last modified: 8 Sep 2018, 20:41:47 UTC

cis_paper_simulation_1 units failing in all systems, mine and also wingmen's


Same here with my Win10, but with two different errors on cis_paper
1
ERROR: Cannot open file "native.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 332
BOINC:: Error reading and gzipping output datafile: default.out
22:29:31 (1992): called boinc_finish(1)

2
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @cis_paper_simulation_1_3_4_nmet.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3690271
ERROR: Illegal value for integer option -cyclic_peptide:n_methyl_positions specified: 1_3_4



And also an error on Design_Hybrid:
ERROR: Unable to open weights/patch file. None of (./)stage1.wts or (./)stage1.wts.wts or minirosetta_database\scoring/weights/stage1.wts or minirosetta_database\scoring/weights/stage1.wts.wts exist
ERROR:: Exit from: ..\..\..\src\core\scoring\ScoreFunction.cc line: 2748
BOINC:: Error reading and gzipping output datafile: default.out
22:31:45 (10164): called boinc_finish(1)

ID: 89501 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 89532 - Posted: 11 Sep 2018, 18:49:52 UTC - in response to Message 89501.  

cis_paper_simulation_1 units failing in all systems, mine and also wingmen's


Same here with my Win10, but with two different errors on cis_paper


Again, all cis_paper tasks fail.
Do the admins read the forum? It's frustrating.
Why not test them on Ralph?
ID: 89532 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,184,495
RAC: 10,704
Message 89534 - Posted: 11 Sep 2018, 23:08:17 UTC - in response to Message 89532.  

cis_paper_simulation_1 units failing in all systems, mine and also wingmen's

Same here with my Win10, but with two different errors on cis_paper

Again, all cis_paper tasks fail.
Do the admins read the forum? It's frustrating.
Why not test them on Ralph?

Sorry I didn't notice these posts before. I reported the same in the Rosetta 4.0x pinned thread the other day.
Fortunately they fail within seconds, but that's still a mass of wasted downloads.
ID: 89534 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 89535 - Posted: 12 Sep 2018, 7:07:19 UTC - in response to Message 89534.  

Sorry I didn't notice these posts before. I reported the same in the Rosetta 4.0x pinned thread the other day.

Oh, well, it's not a problem
I think admins have not read either thread :-(
ID: 89535 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1966
Credit: 38,184,495
RAC: 10,704
Message 89537 - Posted: 12 Sep 2018, 11:02:07 UTC - in response to Message 89535.  
Last modified: 12 Sep 2018, 11:02:47 UTC

Things will quieten down for a while as it seems like most PF & cis tasks have cleared from my buffers now and pretty much every job is completing successfully again, for the loss of about 500 on my RAC...
ID: 89537 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1847
Credit: 7,994,764
RAC: 8,835
Message 89544 - Posted: 12 Sep 2018, 20:32:28 UTC - in response to Message 89537.  

for the loss of about 500 on my RAC...


It's not only a question of RAC. It's a question of "respect" for volunteers.
For example, see the "glibc problem": for months, users with a recent Ubuntu distro have had to change some parameters because the version of glibc in Rosetta is old. Why not fix the problem?
They should want to attract new volunteers, not push them away.
ID: 89544 · Rating: 0
BelgianEnthousiast
Joined: 25 May 15
Posts: 5
Credit: 1,023,045
RAC: 0
Message 89702 - Posted: 6 Oct 2018, 20:01:34 UTC

Is anyone else seeing faulty WUs on the Windows 10 platform?
Since April 1st, I have crunched 366 WUs in total. Up until September 9th, I crunched 176 WUs without any errors.
Since September 9th, I have crunched 190 WUs, but have gradually racked up 33 failed WUs (as of today, Oct 6th).

I'm not sure how to find out which ones failed. Can anyone help? I'd like to dig a little deeper.

Apart from that, any comments as to why all of a sudden so many WUs fail?

I'm running 2 cores for Rosetta and 5 cores for LHC (5-core WUs). I do not observe any issues on LHC, so I'm pretty sure it's not my rig that's having issues.

Many thanks for your advice !

BE.
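
For scale, the counts in this post work out as follows. The Python snippet is just the arithmetic on the figures quoted above; the helper name is made up.

```python
def failure_rate(failed, total):
    """Fraction of work units that failed (0.0 when there were none at all)."""
    return failed / total if total else 0.0

before = failure_rate(0, 176)           # 1 April to 9 September
after = failure_rate(33, 190)           # 9 September to 6 October
overall = failure_rate(33, 176 + 190)   # all 366 WUs

print(f"before: {before:.1%}, after: {after:.1%}, overall: {overall:.1%}")
```

So roughly one in six WUs has failed since September 9th, against none in the 176 before that date.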
ID: 89702 · Rating: 0
LarryMajor
Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 89704 - Posted: 7 Oct 2018, 20:50:39 UTC - in response to Message 89702.  

Apart from that, any comments as to why all of a sudden so many WUs fail?


Log on to your Rosetta account and click the "view" link next to TASKS. On the next screen, you can view tasks by completion state.

Most of your errors occurred on Oct 3 because the WUs did not complete before the deadline. A few things can cause this: resetting the project, not processing jobs for a period of time, and other reasons.
You probably want to keep an eye on the deadlines of the jobs you have queued up, in case your account settings need adjusting, but that doesn't appear to be the case.

When you look at the reason for the errors, many times you can tell if there was just something wrong with the WU (you had some of these) or a local problem.

Hope this helps.
ID: 89704 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org