Posts by Trotador

1) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 100978)
Posted 1 Apr 2021 by Trotador
Post:
Hello, in the last few hours I had 22 WUs that stopped with an error a few seconds after starting.
Examples:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217359050
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217354682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1217342042

Is there some kind of underlying problem? I didn't change anything on my system in the last few weeks, and some WUs look fine.

Sorry, I created a separate thread before reading the instruction to post in this thread.



My reply on your other thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14525&postid=100960


278 fails here
2) Message boards : Number crunching : Computation errors: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_xxxx (Message 91850)
Posted 3 Mar 2020 by Trotador
Post:
So, 30 failed units of this type on 29/02, 20 units on 01/03, 26 units on 02/03 and 20 units so far today. Tomorrow there will be fewer, as I've moved all hosts but one to other projects. Let's hope it is solved or explained by the time I come back to crunch with more resources.
3) Message boards : Number crunching : Computation errors: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_xxxx (Message 91805)
Posted 29 Feb 2020 by Trotador
Post:
30 of them have failed with "signal 11". Something for the investigators to check.
4) Message boards : Number crunching : Computation errors: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_xxxx (Message 91799)
Posted 28 Feb 2020 by Trotador
Post:
Looking at my running tasks, I see over 100 units of this type. Let's see what happens; most of them have already gone beyond the 8-hour processing time.
5) Message boards : Number crunching : Stalled downloads (Message 91798)
Posted 28 Feb 2020 by Trotador
Post:
This issue keeps occurring every day, but it is especially annoying today. All hosts are blocked from downloading new units, and some of them are ending up idle.

Has it been looked at on the project side?
6) Message boards : Number crunching : Computation errors: rb_02_25_16883_16706_ab_t000__robetta_cstwt_5.0_xxxx (Message 91797)
Posted 28 Feb 2020 by Trotador
Post:
Yes, it is an "old" issue here: the WUs containing "cstwt_5.0_FT" are prone to fail often on Linux, with better performance on Windows. They exceed the computing time set in the user preferences and either finish OK through the watchdog or get a "signal 11" and fail to validate. However, it is not deterministic: some batches complete almost entirely OK, others fail almost entirely. The units containing just "cstwt_5.0" complete OK.
7) Message boards : Number crunching : bh0200xx_MonomerDesign2019_ units failing (Message 91752)
Posted 19 Feb 2020 by Trotador
Post:
The units that failed were mostly crunched with application Rosetta v4.07 i686-pc-linux-gnu

The units that completed but just received 20 points mostly used application Rosetta v4.08 x86_64-pc-linux-gnu

Some examples out of the over 60 units of this type crunched by my hosts. Hope it helps

https://boinc.bakerlab.org/rosetta/result.php?resultid=1122875789
https://boinc.bakerlab.org/rosetta/result.php?resultid=1122890588
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011402918
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011384104
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011413743
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011391746
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011390304
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011411592
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011385693
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011386374
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011344088
https://boinc.bakerlab.org/rosetta/result.php?resultid=1122877802
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011408682
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011387452
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011386983
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011396733
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011397432
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011391280
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1011411960
8) Message boards : Number crunching : No "finished" file (Message 91746)
Posted 19 Feb 2020 by Trotador
Post:
Yes, it is somehow related to disk speed and occurs much less frequently on SSDs, but it still occurs sometimes even on SSDs.
On an HDD with a lot of concurrent R@H WUs running, it happens much more often.

It looks like the root of the problem is a really old bug somewhere in the Rosetta software that causes the app to crash if it cannot write to disk immediately, instead of just waiting a few seconds while the disk is busy handling other requests.
But the devs have not bothered to track it down and fix it, so it has kept crashing the app and wasting generated results for years now.

Moving data to SSDs, enabling the disk write cache, reducing max_concurrent tasks running, etc. are all just partial workarounds (they help mitigate the problem, but not 100%); they do not fix the problem itself.


I had thought that the above recommendations related to another classic BOINC error, "finish file present too long", which is more prone to occur on hosts with many cores/threads.
9) Message boards : Number crunching : bh0200xx_MonomerDesign2019_ units failing (Message 91742)
Posted 19 Feb 2020 by Trotador
Post:
They crunch for 12 hours and most of them fail with signal 11; a few manage to complete and validate (getting 20 credits for 12 hours of compute time).

Unit failed:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol template.xml -corrections::beta_nov16 -out:prefix bh020073 @bh020073.flags -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip bh020073.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3946742
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43708.9s, 14400s + 28800s[2020- 2-19 6: 8:53:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43708.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
06:08:53 (4877): called boinc_finish(0)

</stderr_txt>
]]>

Unit validated:

<core_client_version>7.14.1</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol template.xml -corrections::beta_nov16 -out:prefix bh020019 @bh020019.flags -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip bh020019.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3945123
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 43765.8s, 14400s + 28800s[2020- 2-19 9:20: 2:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 43765.8 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
09:20:02 (4095): called boinc_finish(0)

</stderr_txt>
]]>
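
For anyone who wants to check their own stderr output for this kind of overrun, here is a minimal Python sketch. It pulls the `-cpu_run_time` value from the command line and the `BOINC:: CPU time:` figure from the log, then reports how far the task ran past its target. The function name `cpu_overrun` and the sample string are made up for illustration. Reading "14400s + 28800s" as a grace period on top of the target is only my assumption, so the sketch ignores it and compares against the target alone.

```python
import re


def cpu_overrun(stderr_text):
    """Return (cpu_seconds, target_seconds, overrun_seconds), or None if either value is missing."""
    # Target run time as passed on the Rosetta command line.
    target = re.search(r"-cpu_run_time\s+(\d+)", stderr_text)
    # CPU time actually used, as reported near the end of the log.
    used = re.search(r"BOINC:: CPU time:\s*([\d.]+)s", stderr_text)
    if not (target and used):
        return None
    cpu_seconds = float(used.group(1))
    target_seconds = int(target.group(1))
    return cpu_seconds, target_seconds, cpu_seconds - target_seconds


# Shortened sample built from the two relevant lines of the first log above.
sample = (
    "command: rosetta -nstruct 10000 -cpu_run_time 28800 -watchdog\n"
    "BOINC:: CPU time: 43708.9s, 14400s + 28800s[2020- 2-19 6: 8:53:] :: BOINC\n"
)

result = cpu_overrun(sample)
if result is not None:
    cpu, target, over = result
    print(f"used {cpu:.0f}s against a {target}s target ({over / 3600:.1f} h over)")
```

Run over a folder of saved stderr files, something like this would show quickly whether every unit in a batch exceeds its target or only some of them.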
10) Message boards : Number crunching : Stalled downloads (Message 91715)
Posted 16 Feb 2020 by Trotador
Post:
Same problem here. I haven't had to restart hosts, but on some of them I do have to restart BOINC to be able to download WUs again.
11) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 91701)
Posted 14 Feb 2020 by Trotador
Post:
Yes, same here. Stalled downloads can only be fixed by manual intervention (retry or abort) and are therefore a big pain when trying to keep crunching the project. They require continuous attention, which is not sustainable.
12) Message boards : Number crunching : Huge RAM usage by some of latest WUs (Message 91681)
Posted 12 Feb 2020 by Trotador
Post:
Yes, I see the same behavior for all rb_02_08_15652_15556__xx units
13) Message boards : Number crunching : BWSRV2 Down?? (Message 91638)
Posted 31 Jan 2020 by Trotador
Post:
Well, solved :)
14) Message boards : Number crunching : BWSRV2 Down?? (Message 91637)
Posted 31 Jan 2020 by Trotador
Post:
+1
15) Message boards : Number crunching : Out of work (Message 91616)
Posted 25 Jan 2020 by Trotador
Post:
https://boinc.bakerlab.org/rosetta/results.php?hostid=3727721&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3753440&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3753396&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3753394&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3745546&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3758038&offset=0&show_names=0&state=6&appid=
https://boinc.bakerlab.org/rosetta/results.php?hostid=3753439&offset=0&show_names=0&state=6&appid=
etc.
16) Message boards : Number crunching : Out of work (Message 91611)
Posted 24 Jan 2020 by Trotador
Post:
Lots of download errors

It appears that there is something seriously wrong with your Internet. They are downloaded and completed by other people.

I practically never get download errors.


Thanks for your interest. Please take into account that the number of units you download per day is very low compared with mine. If you check the top cruncher hosts you will find a lot of download errors on almost every one. So I do not think it is my internet connection, or at least not only mine :).
17) Message boards : Number crunching : Out of work (Message 91608)
Posted 24 Jan 2020 by Trotador
Post:
Lots of download errors
18) Message boards : Number crunching : Out of work (Message 91551)
Posted 15 Jan 2020 by Trotador
Post:
No tasks available apparently
19) Message boards : Number crunching : BWSRV2 Down?? (Message 91310)
Posted 28 Oct 2019 by Trotador
Post:
+1
20) Message boards : Number crunching : WU crash after some hours (Message 91009)
Posted 9 Aug 2019 by Trotador
Post:
It is simple to understand: some (many) of the current robetta_08_07_xxx WUs use 2 GB of RAM or more. If you are "lucky" enough to process 8 of them simultaneously you will be over your 16 GB, and even if BOINC is configured to use no more than 95% of your available memory, you will experience slowdowns and most probably WU crashes, host crashes, or both. It is quite a well-known problem here: the investigators do not always set the WU memory requirements correctly; they are human too, after all.

It has just happened to me on both a host with 64 GB but 72 threads and another one with 124 GB and 112 threads. So I've set BOINC to use 50% of the CPUs while the storm passes :).
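
The arithmetic behind this can be sketched in a few lines of Python. This is only a rough worst-case estimate, assuming every running task is one of the 2 GB units and that BOINC may use 95% of RAM. The function name `max_safe_tasks` and the host labels are made up for illustration.

```python
def max_safe_tasks(ram_gb, per_task_gb=2.0, usable_fraction=0.95):
    """How many tasks of per_task_gb fit in the share of RAM BOINC is allowed to use."""
    return int((ram_gb * usable_fraction) // per_task_gb)


# (RAM in GB, threads) for the hosts mentioned in the post.
hosts = {
    "16 GB / 8 threads": (16, 8),
    "64 GB / 72 threads": (64, 72),
    "124 GB / 112 threads": (124, 112),
}

for name, (ram_gb, threads) in hosts.items():
    fit = max_safe_tasks(ram_gb)
    status = "OK" if fit >= threads else "over budget"
    print(f"{name}: {fit} tasks fit, {threads} threads -> {status}")
```

Under those assumptions none of the three hosts can run a 2 GB task on every thread, which is why capping the number of CPUs in use helps until the big units are gone.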





©2021 University of Washington
https://www.bakerlab.org