Computation errors: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_xxxx

biodoc

Joined: 19 Feb 06
Posts: 14
Credit: 26,749,810
RAC: 833
Message 91793 - Posted: 27 Feb 2020, 13:17:28 UTC
Last modified: 27 Feb 2020, 13:19:16 UTC

I've had ~20 of these tasks fail after 8 hours of computation time: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_04_900260_6_0.

Example: https://boinc.bakerlab.org/rosetta/result.php?resultid=1124284684

I've aborted the others in my queue.

Linux
3900X processor
64 GB RAM
Trotador

Joined: 30 May 09
Posts: 108
Credit: 194,035,479
RAC: 178,480
Message 91797 - Posted: 28 Feb 2020, 13:02:14 UTC

Yes, this is an "old" issue here: the WUs containing "cstwt_5.0_FT" are prone to fail often on Linux and perform better on Windows. They exceed the computing time set in the user preferences and either finish OK through the watchdog or get a "signal 11" and fail to validate. However, it is not deterministic: some batches complete almost entirely OK, others fail almost entirely. The units containing just "cstwt_5.0" complete OK.
Trotador

Joined: 30 May 09
Posts: 108
Credit: 194,035,479
RAC: 178,480
Message 91799 - Posted: 28 Feb 2020, 17:45:26 UTC

Looking at my running tasks, I see over 100 units of this type. Let's see what happens; most of them have already gone beyond the 8 hours of processing time.
Trotador

Joined: 30 May 09
Posts: 108
Credit: 194,035,479
RAC: 178,480
Message 91805 - Posted: 29 Feb 2020, 8:24:04 UTC

30 of them have failed with "signal 11". Something to be checked by investigators.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1444
Credit: 5,894,659
RAC: 3,559
Message 91828 - Posted: 2 Mar 2020, 8:43:27 UTC - in response to Message 91805.  

Something to be checked by investigators.

Waiting for Godot
Trotador

Joined: 30 May 09
Posts: 108
Credit: 194,035,479
RAC: 178,480
Message 91850 - Posted: 3 Mar 2020, 18:40:01 UTC

So, 30 failed units of this type on 29/02, 20 units on 01/03, 26 units on 02/03 and 20 units so far today. Tomorrow there will be fewer, as I've moved all hosts but one to other projects. Let's hope it is solved or explained by the time I come back to crunch again with more resources.
Luigi R.

Joined: 7 Feb 14
Posts: 38
Credit: 1,973,947
RAC: 848
Message 91857 - Posted: 4 Mar 2020, 8:21:34 UTC

I'm on Linux, and BOINC froze because of task rb_02_21_16595_16419_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_05_896595_65_1.

After 8 hours, I found that my host was idle! This is very bad.

Please check it; it's not acceptable that a task blocks crunching on all DC projects.
Sadly, I have set R@H to receive no more work.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1444
Credit: 5,894,659
RAC: 3,559
Message 91859 - Posted: 4 Mar 2020, 10:13:42 UTC - in response to Message 91857.  

Please check it; it's not acceptable that a task blocks crunching on all DC projects.
Sadly, I have set R@H to receive no more work.

Do you see admins here? Do you see news about code?
Luigi R.

Joined: 7 Feb 14
Posts: 38
Credit: 1,973,947
RAC: 848
Message 91860 - Posted: 4 Mar 2020, 10:19:56 UTC

Well, I see that Admin answers in Number Crunching threads.

As a volunteer, I can spend my time putting together a solution to abort all tasks named like "*cstwt*".
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1444
Credit: 5,894,659
RAC: 3,559
Message 91861 - Posted: 4 Mar 2020, 13:35:23 UTC - in response to Message 91860.  

Well, I see that Admin answers in Number Crunching threads.

Do you mean Mod.Sense? He is a great guy, but he is NOT an admin.
Admin posts only "Predictor of the day" and "News".
David E.K. - latest post is March 2019.
David Baker - latest post is December 2017.

If you read the forums, the "cstwt_5.0" WUs have had problems since February 2019.
Luigi R.

Joined: 7 Feb 14
Posts: 38
Credit: 1,973,947
RAC: 848
Message 91863 - Posted: 4 Mar 2020, 14:04:00 UTC - in response to Message 91861.  
Last modified: 4 Mar 2020, 14:06:54 UTC

Do you mean Mod.Sense? He is a great guy, but he is NOT an admin.
Admin posts only "Predictor of the day" and "News".
David E.K. - latest post is March 2019.
David Baker - latest post is December 2017.
Here he is.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13510&postid=91696#91696
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13510&postid=91703#91703


If you read the forums, the "cstwt_5.0" WUs have had problems since February 2019.
I see.

I think I have already encountered this issue, but I didn't remember it at all.

The BOINC client stops responding and you can't even kill it. Although my client is standalone and runs as a user process (not a service), you have to kill it as superuser.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1444
Credit: 5,894,659
RAC: 3,559
Message 91864 - Posted: 4 Mar 2020, 17:04:29 UTC - in response to Message 91863.  

Admin posts only "Predictor of the day" and "News".

Here he is.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13510&postid=91696#91696
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13510&postid=91703#91703

These are not news about bugs.
One is a news item, the other is info about RAM usage.
I don't know which developer is working on bugs, but everything here seems frozen.
Luigi R.

Joined: 7 Feb 14
Posts: 38
Credit: 1,973,947
RAC: 848
Message 91882 - Posted: 6 Mar 2020, 15:30:23 UTC - in response to Message 91863.  

I'm going to abort all *cstwt_5.0* tasks with a bash script on Linux to safeguard my contribution to R@H.

Here is my script:
https://pastebin.com/RKdZKhGx
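
For readers who cannot open the link, here is a minimal Python sketch of the same idea. It is not a copy of the linked script, whose contents are not shown in this thread. It assumes the `boinccmd` tool is on the PATH, that `boinccmd --get_tasks` prints a `name:` line and a `project URL:` line for each task (check this against your client's output), and that tasks are aborted with `boinccmd --task URL NAME abort`. The function names `matching_tasks`, `abort_commands` and `run` are hypothetical.

```python
import subprocess

PATTERN = "cstwt_5.0"  # substring of the task names reported in this thread


def matching_tasks(get_tasks_output, pattern=PATTERN):
    """Return (project_url, task_name) pairs whose task name contains pattern.

    Assumes each task block lists a "name:" line followed later by a
    "project URL:" line; adjust the prefixes if your client prints differently.
    """
    matches = []
    name = None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("project URL:") and name is not None:
            url = line.split(":", 1)[1].strip()
            if pattern in name:
                matches.append((url, name))
            name = None  # wait for the next task block
    return matches


def abort_commands(matches):
    """Build one `boinccmd --task URL NAME abort` argument list per match."""
    return [["boinccmd", "--task", url, name, "abort"] for url, name in matches]


def run(dry_run=True):
    """Print the abort commands; execute them only when dry_run is False."""
    listing = subprocess.run(
        ["boinccmd", "--get_tasks"], capture_output=True, text=True, check=True
    ).stdout
    for cmd in abort_commands(matching_tasks(listing)):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run()` only prints the commands it would issue; `run(dry_run=False)` actually aborts the matching tasks, so it is worth reviewing the dry-run output first.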
Luigi R.

Joined: 7 Feb 14
Posts: 38
Credit: 1,973,947
RAC: 848
Message 91889 - Posted: 7 Mar 2020, 12:17:54 UTC

On Ubuntu 18.04 there are no problems running *cstwt_5.0* tasks.

Ubuntu 18.04.4 LTS, kernel 4.15.0-88-generic
BOINC v7.9.3
OK

Ubuntu 14.04.6 LTS, kernel 4.4.0-142-generic
BOINC v7.2.42
Dangerous
adrianxw

Joined: 18 Sep 05
Posts: 614
Credit: 10,477,561
RAC: 5,516
Message 92028 - Posted: 17 Mar 2020, 13:24:20 UTC
Last modified: 17 Mar 2020, 14:11:25 UTC

4 failures so far today:

rb_03_16_18638_18457_ab_t000__h002_robetta_IGNORE_THE_REST_12_10_902207_18_0
rb_03_16_18636_18455_ab_t000__h001_robetta_IGNORE_THE_REST_10_13_902209_11_0
rb_03_16_18636_18455_ab_t000__h002_robetta_IGNORE_THE_REST_05_15_902210_2_0
rb_03_16_18637_18451_ab_t000__h002_robetta_IGNORE_THE_REST_09_19_902203_14_0

No mention of the cstwt_5.0_FT there. Windows 8.1 x64.

<edit>
5 now...

rb_03_16_18639_18459_ab_t000__h002_robetta_IGNORE_THE_REST_05_15_902222_16_0
</edit>
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
adrianxw

Joined: 18 Sep 05
Posts: 614
Credit: 10,477,561
RAC: 5,516
Message 92042 - Posted: 18 Mar 2020, 3:01:54 UTC

Another of these; the title is a bit different, though. Was there not a recent server upgrade?

9v1nm_gb_c3143_9mer_gb_001352_SAVE_ALL_OUT_892356_222_0
adrianxw

Joined: 18 Sep 05
Posts: 614
Credit: 10,477,561
RAC: 5,516
Message 92055 - Posted: 18 Mar 2020, 15:55:36 UTC

9v1nm_gb_c3143_9mer_gb_001352_SAVE_ALL_OUT_892356_222_0

Similar to the last one I mentioned.
©2021 University of Washington
https://www.bakerlab.org