I'm getting lots of errors with Rosetta v4.07

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89262 - Posted: 12 Jul 2018, 16:26:25 UTC
Last modified: 12 Jul 2018, 17:01:55 UTC

I think I will abort all v4.07 tasks from now on. This is how the stderr log looks:

Stderr log
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @T1000.flags -in:file:boinc_wu_zip T1000.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1202697
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 22193.3s, 14400s + 7200s[2018- 7-12 18:11:32:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22194.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
18:11:32 (10080): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5924_0_r2041150700_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
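
The `<message>` block at the end of the log is where the failure is actually reported. A minimal sketch, assuming the log has been saved as plain text, of pulling the file name and error code out of it with a regular expression. The helper name `parse_upload_failure` is hypothetical, and the pattern is based only on the block shown above, not on any documented BOINC format.

```python
import re

# Pattern derived only from the <file_xfer_error> block in the log above.
XFER_ERROR = re.compile(
    r"<file_name>(?P<file>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>.*?)\)</error_code>"
)


def parse_upload_failure(stderr_text):
    """Return (file_name, error_code, reason), or None if no upload failure is found."""
    m = XFER_ERROR.search(stderr_text)
    if m is None:
        return None
    return m.group("file"), int(m.group("code")), m.group("reason")


# Sample copied from the log above.
sample = """upload failure: <file_xfer_error>
<file_name>T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5924_0_r2041150700_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>"""

print(parse_upload_failure(sample))
```

In this log, the earlier warning that default.out.gz could not be opened and the later stat() failure on upload both suggest that an expected output file was not found, though the log alone does not say why.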

It seems to be only this type of work unit:

Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5951_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5959_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5971_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5972_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5991_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_4411_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_4297_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_4122_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_4081_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_3741_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_3374_0
Name T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_2579_0
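
Every failing task in the list shares one name stem, so grouping task names by that stem is a quick way to see which batch is affected. A minimal sketch, assuming the last two underscore-separated numbers in a name are a per-task index and a trailing counter. That reading of the naming scheme is a guess from the names above, and `batch_stem` is a hypothetical helper.

```python
import re
from collections import Counter


def batch_stem(task_name):
    """Strip the last two numeric fields (assumed: task index and counter)."""
    m = re.match(r"^(.*)_\d+_\d+$", task_name)
    return m.group(1) if m else task_name


# Three of the names listed above.
names = [
    "T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5951_0",
    "T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_5959_0",
    "T1000_full_aivan_SAVE_ALL_OUT_03_09_677708_2579_0",
]

counts = Counter(batch_stem(n) for n in names)
for stem, count in counts.items():
    print(stem, count)
```

Queued tasks whose stem matches one that has already failed are the ones worth watching or aborting.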

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89277 - Posted: 14 Jul 2018, 12:38:58 UTC

Once again I got aivan tasks that have been running for 2.5 hours and are estimated to run for 3 to 4 hours more, even though my Rosetta setting for Target CPU run time is 2 hours.
Should I abort them?
I have aborted all the other aivan tasks, as my experience is that they run for a long time and all end up with an error.

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89279 - Posted: 14 Jul 2018, 16:41:42 UTC

Yup!
Same error as always: 4 work units and 20 hours of wasted computing in total. Luckily I aborted all the other aivan work units before they started running and wasted even more resources.

Stderr log
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_intelx86.exe @T1000.3.flags -in:file:boinc_wu_zip T1000.3.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3683498
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 22129s, 14400s + 7200s[2018- 7-14 18:20:52:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22129 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
18:20:52 (10344): called boinc_finish(0)

</stderr_txt>
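
The `BOINC:: CPU time` line in both logs gives a CPU time followed by two further figures. A minimal sketch of checking that arithmetic, assuming the two figures add up to a run-time limit that the watchdog enforces. That is an interpretation and not something the log states. `parse_cpu_line` is a hypothetical helper.

```python
import re


def parse_cpu_line(line):
    """Return (cpu_seconds, first_figure, second_figure) from a 'BOINC:: CPU time' line."""
    m = re.search(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s", line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


# Line copied from the second log above.
line = "BOINC:: CPU time: 22129s, 14400s + 7200s[2018- 7-14 18:20:52:] :: BOINC"
cpu, first, second = parse_cpu_line(line)
limit = first + second  # assumed to be a run-time limit
print(f"CPU time {cpu:.0f}s vs {first}s + {second}s = {limit}s (over by {cpu - limit:.0f}s)")
```

For this log the sum is 21600 s against 22129 s of CPU time. The 7200 s figure equals the 2-hour target run time mentioned earlier in the thread, which would fit the task being stopped once it ran well past that target, but the log does not confirm this.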

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 7,029
Message 89281 - Posted: 14 Jul 2018, 20:32:57 UTC - in response to Message 89279.  

WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22129 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
18:20:52 (10344): called boinc_finish(0)

</stderr_txt>


+1
Same error on all my T1000_aivan tasks.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 89295 - Posted: 16 Jul 2018, 0:09:53 UTC

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well. The likelihood of the current WUs running long is not related to the estimated runtime of the BOINC Manager. If the name of the current tasks has the same prefix as those that you had trouble with, that would be a better indicator for you.
Rosetta Moderator: Mod.Sense
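
To illustrate the point about history-based estimates, here is a toy model, not BOINC's actual estimator, in which the estimate for a new task is simply the mean of recently completed runtimes. The function name and the sample runtimes are made up for illustration.

```python
def estimate_runtime_hours(recent_runtimes_hours):
    """Toy estimate: the mean of recently completed runtimes, in hours."""
    if not recent_runtimes_hours:
        raise ValueError("need at least one completed task")
    return sum(recent_runtimes_hours) / len(recent_runtimes_hours)


# Hypothetical history: the preference is 2 hours, but recent tasks overran to about 6.
history = [6.0, 6.2, 6.1, 5.9]
estimate = estimate_runtime_hours(history)
print(f"estimated runtime for the next task: {estimate:.1f} h")
```

The estimate depends only on past runtimes, so in this model a new task from a healthy batch would still be shown with a long estimate until enough normal results have accumulated.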

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 7,029
Message 89298 - Posted: 16 Jul 2018, 8:38:36 UTC - in response to Message 89295.  

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well.


For me the problem is not the runtime of the WUs (I know about the decoy issue), but the validation error.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,465,703
RAC: 16,826
Message 89301 - Posted: 16 Jul 2018, 18:13:18 UTC - in response to Message 89298.  

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well.


For me the problem is not the runtime of the WUs (I know about the decoy issue), but the validation error.


I found 1 "Invalid" run for which you were granted 587.11 credits. Were there others that were a problem for you?
It seems the 587 credits are similar to what the other valid jobs received.



name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_5874
application Rosetta
created 14 Jul 2018, 8:17:20 UTC
canonical result 1015258491
granted credit 587.11

https://boinc.bakerlab.org/workunit.php?wuid=914821393

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89302 - Posted: 16 Jul 2018, 21:57:03 UTC - in response to Message 89301.  

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well.


For me the problem is not the runtime of the WUs (I know about the decoy issue), but the validation error.


I found 1 "Invalid" run for which you were granted 587.11 credits. Were there others that were a problem for you?
It seems the 587 credits are similar to what the other valid jobs received.



name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_5874
application Rosetta
created 14 Jul 2018, 8:17:20 UTC
canonical result 1015258491
granted credit 587.11

https://boinc.bakerlab.org/workunit.php?wuid=914821393


The work units that were marked as 'Invalid' are in my first post in this thread, and I had 5 or 6 more of the same later.
Why you can't find them I have no idea. Ask the staff; maybe they can help you.

The latest work units of this kind are here:

https://boinc.bakerlab.org/result.php?resultid=1015234868
https://boinc.bakerlab.org/result.php?resultid=1015234886
https://boinc.bakerlab.org/result.php?resultid=1015234755
https://boinc.bakerlab.org/result.php?resultid=1015234804

They were the first units I ran in both my first and second attempts to crunch a batch of maybe 100 work units, but because the first 4 I ran all ended up as 'Invalid', I aborted all the others.
I have not received any more of these "aivan" work units lately, and I hope I won't.
The credit is totally irrelevant in this case. The problem, in my opinion, is that resources are wasted when hours of crunching end in a result that is invalid.
Luckily I spotted them early and wasted only 40 hours instead of 500 hours.

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89303 - Posted: 16 Jul 2018, 22:09:25 UTC - in response to Message 89295.  
Last modified: 16 Jul 2018, 22:11:32 UTC

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well. The likelihood of the current WUs running long is not related to the estimated runtime of the BOINC Manager. If the name of the current tasks has the same prefix as those that you had trouble with, that would be a better indicator for you.


The main issue here is not the runtime or the credit; it is that a lot of your crunchers' resources and YOUR resources are wasted when many hours of crunching end in a result that is invalid.

Simplex0
Joined: 13 Jun 18
Posts: 14
Credit: 1,714,717
RAC: 0
Message 89304 - Posted: 17 Jul 2018, 4:41:03 UTC - in response to Message 89295.  

I have no details on the specific WUs or issues they are having. But I wanted everyone to know that BOINC Manager's "estimated runtime" is really based on history, not the present. So, regardless of the name or likely success of current WUs, if BOINC Manager has a recent history with WUs taking 3 or 4 hours longer than the runtime preference, it will "estimate" future WUs will take 3 to 4 hours longer as well. The likelihood of the current WUs running long is not related to the estimated runtime of the BOINC Manager. If the name of the current tasks has the same prefix as those that you had trouble with, that would be a better indicator for you.


I have now checked more than 1000 work units that finished successfully, and only 1 of them took 4 hours, while ALL of the invalid aivan work units took more than 6 hours to finish.
Anyway, it seems that I do not get any more of this kind of work unit, so hopefully the problem has already been spotted and taken care of by the staff.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 7,029
Message 89306 - Posted: 17 Jul 2018, 6:25:22 UTC - in response to Message 89301.  

name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_5874
application Rosetta
created 14 Jul 2018, 8:17:20 UTC
canonical result 1015258491
granted credit 587.11

https://boinc.bakerlab.org/workunit.php?wuid=914821393


https://boinc.bakerlab.org/result.php?resultid=1015258491
I got 0 credits for this WU, but that is not a problem.
I killed the other "_aivan_" tasks.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 21,465,703
RAC: 16,826
Message 89313 - Posted: 17 Jul 2018, 17:15:18 UTC - in response to Message 89306.  

name T1000_full3_aivan_SAVE_ALL_OUT_03_09_677955_5874
application Rosetta
created 14 Jul 2018, 8:17:20 UTC
canonical result 1015258491
granted credit 587.11

https://boinc.bakerlab.org/workunit.php?wuid=914821393


https://boinc.bakerlab.org/result.php?resultid=1015258491
I got 0 credits for this WU, but that is not a problem.
I killed the other "_aivan_" tasks.


I thought the "granted credits" were issued manually by the project staff for WUs that ran a long time, had a problem caused by Rosetta or the researcher, and did not receive credit.

Hmmm. I only do the work for the credits. I think I will have enough to retire soon. 8-)


It is good to report bad WUs like the aivan jobs so that Rosetta can clean out the pipeline, inform the researcher of the mistake, AND then fix the bug in Rosetta. Researchers should not be able to set controls that cause problems; the software should filter out such a job before it starts work.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1866
Credit: 8,186,159
RAC: 7,029
Message 89314 - Posted: 18 Jul 2018, 5:42:18 UTC - in response to Message 89313.  

Hmmm. I only do the work for the credits. I think I will have enough to retire soon. 8-)


I hope you will not stop crunching on R@H and writing on the forum.
You would miss us.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 89319 - Posted: 18 Jul 2018, 15:47:47 UTC - in response to Message 89313.  


I thought the "granted credits" were issued manually by the project staff for WUs that ran a long time, had a problem caused by Rosetta or the researcher, and did not receive credit.


A program runs daily that grants credit to these tasks. The credit reflects the value to the Project Team of understanding what is not working, so things can improve.
Rosetta Moderator: Mod.Sense

©2024 University of Washington
https://www.bakerlab.org