Problems and Technical Issues with Rosetta@home

JohnDK
Joined: 6 Apr 20
Posts: 33
Credit: 2,390,240
RAC: 0
Message 104948 - Posted: 18 Feb 2022, 17:49:11 UTC

Don't like those python WUs, but with the movingstub problems, I will have to continue with them.
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104951 - Posted: 18 Feb 2022, 21:19:59 UTC

Total queued jobs: 4,144,753

Someone at Rosetta has great faith in our ability to run these, though I don't know why.

But I have found that the pythons do not suspend so much if I run only 50% of the cores.
Or maybe they have changed them.
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,396,094
RAC: 19,482
Message 104955 - Posted: 18 Feb 2022, 21:51:13 UTC - in response to Message 104951.  

Total queued jobs: 4,144,753

Someone at Rosetta has great faith in our ability to run these, though I don't know why.
2 million Tasks for LINUX systems, Instant crash and burn on Windows.
So it's going to take a while to clear them unless the project pulls them & fixes & then re-issues them, or puts in a flag with the Scheduler to only allocate them to LINUX systems. No hope of the second, very slight hope for the first.
Grant
Darwin NT
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104958 - Posted: 18 Feb 2022, 22:31:05 UTC

movingstub problems
I have two almost identical systems:
P5Q Deluxe motherboard with Q9450 CPU, 8GB RAM, Win7: it's a crash test dummy. 1*
P5Q Deluxe motherboard with Q9550 CPU, 8GB RAM, Linux Mint: just another day at the office, crunchin' on regardless.
Funny old world
----------------------------------------
1* so I feel like being a git
Set cache setting to 10 days and try and trash as many of them as possible :-), until it gets backoffd
Yes that is what I have done
see how many I can get rid of
But , keep an eye on it in case any good work arrives.
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,396,094
RAC: 19,482
Message 104959 - Posted: 18 Feb 2022, 22:34:30 UTC - in response to Message 104958.  
Last modified: 18 Feb 2022, 22:36:49 UTC

Set cache setting to 10 days and try and trash as many of them as possible :-), until it gets backoffd
Yes that is what I have done
see how many I can get rid of
Not that many (compared to how many there are to get through).
For a particular application, for every error you return, you have the amount of work you can download reduced by 1 until you get to the point you will only be able to get 1 Task per 24 hours.
Returning Valid work increases the amount of work you can download per day.
Grant
Darwin NT
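The quota behaviour Grant describes above can be sketched as a toy model. This is only an illustration in Python, not the BOINC server code: the class and method names are made up, the starting limit and ceiling are arbitrary, and the recovery rule (doubling per valid result) is an assumption, since the post only says that valid work increases the limit.

```python
from dataclasses import dataclass


@dataclass
class DailyQuota:
    """Toy per-application daily task quota (hypothetical names and numbers)."""
    limit: int      # tasks the host may currently download per day
    ceiling: int    # assumed project-wide maximum
    floor: int = 1  # the post says the quota bottoms out at 1 task per 24 hours

    def record_error(self) -> None:
        # Each errored task lowers the quota by one, never below the floor.
        self.limit = max(self.floor, self.limit - 1)

    def record_valid(self) -> None:
        # ASSUMPTION: recovery is modelled as doubling, capped at the ceiling.
        self.limit = min(self.ceiling, self.limit * 2)


quota = DailyQuota(limit=30, ceiling=100)
for _ in range(500):      # trash 500 tasks in a row
    quota.record_error()
print(quota.limit)        # 1

quota.record_valid()      # one valid result starts the recovery
print(quota.limit)        # 2
```

Under these assumptions, erroring out hundreds of tasks simply pins the host at one task per day for that application, so it clears very little of the queue.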
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104961 - Posted: 18 Feb 2022, 22:49:39 UTC
Last modified: 18 Feb 2022, 22:53:59 UTC

On the win cruncher all its pythons go zombie so I don't let it do them , normal R4.2 is ok
now It will only let me have 29 at a time to trash ,
And I am already getting 4 hours `go away and don't be silly` time.
nnnn poot .
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,396,094
RAC: 19,482
Message 104963 - Posted: 18 Feb 2022, 22:55:32 UTC - in response to Message 104961.  
Last modified: 18 Feb 2022, 22:55:54 UTC

And I am already getting 4 hours `go away and don't be silly` time.
That will happen every time Tasks error out (it could be anything from 3min to well over 4 hours, the more that error out between Scheduler contacts the larger the backoff tends to be).
Grant
Darwin NT
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104964 - Posted: 18 Feb 2022, 23:34:08 UTC - in response to Message 104946.  

After getting all Computation Errors on all the WU's I set my PC to No New Tasks until you folks can figure out why Windows crashes with the "movingstub" Work units



Unless one of our experts here reaches out to his contact at UW, there is no one that will see these posts.
So "you folks" is a pointless thing.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104965 - Posted: 18 Feb 2022, 23:34:51 UTC

24 movingstubs were in my queue. Aborted them instantly.
Not going to up my error count for stupidness from their end.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104966 - Posted: 18 Feb 2022, 23:36:35 UTC - in response to Message 104961.  

.clair. you've done all the different things we as a group have talked about for python?
downgrade boinc and vbox and check your virtualization setting on your motherboard?
If python still dies on you after all that, that is weird.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104967 - Posted: 18 Feb 2022, 23:41:18 UTC - in response to Message 104944.  

Ricky - you're just getting the movingstubs garbage. Watch your queue and abort them as soon as you see them.
They do not work.

If you want to do Vbox stuff, then you can run python tasks

4.2 is a mishmash of stuff, but movingstubs is trash and rb_02_16_213037 has a bug
Also in python the aagb-PHE stuff is buggy
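The batches named in the post above could be spotted in a task list with a simple name filter. The sketch below is only an illustration: the table holds the name fragments as reported in this thread (not an official list), the function name and the example task names are hypothetical, and treating the rb and aagb-PHE reports as platform-independent is an assumption because the thread does not say.

```python
from typing import Optional

# Name fragments and notes as reported in this thread (not an official list).
# Each entry: fragment -> (platforms affected, or None if not stated; note).
REPORTED = {
    "movingstub": ({"windows"}, "errors out within seconds on Windows"),
    "rb_02_16_213037": (None, "reported as buggy (Rosetta 4.2)"),
    "aagb-PHE": (None, "reported as buggy (python tasks)"),
}


def flag_task(name: str, platform: str) -> Optional[str]:
    """Return the thread's note for a task name, or None if nothing matches."""
    for fragment, (platforms, note) in REPORTED.items():
        if fragment in name and (platforms is None or platform.lower() in platforms):
            return note
    return None


# Hypothetical task names, for illustration only.
for name, platform in [
    ("movingstub_example_0001", "Windows"),
    ("movingstub_example_0001", "Linux"),
    ("rb_02_16_213037_example_0002", "Linux"),
]:
    print(name, platform, "->", flag_task(name, platform))
```

A filter like this only identifies the tasks; whether to abort them is a separate decision, since aborted tasks are reported in this thread to count as errors.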
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 104968 - Posted: 18 Feb 2022, 23:41:53 UTC

I have noticed that in 4.2 the rb_02_16_213037..... has a bug
Also in python the aagb-PHE... stuff is buggy
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,396,094
RAC: 19,482
Message 104969 - Posted: 19 Feb 2022, 0:10:27 UTC - in response to Message 104965.  

24 movingstubs were in my queue. Aborted them instantly.
Not going to up my error count for stupidness from their end.
Aborted Tasks count as errors.
Grant
Darwin NT
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 2,014
Message 104970 - Posted: 19 Feb 2022, 0:21:33 UTC - in response to Message 104959.  
Last modified: 19 Feb 2022, 0:21:50 UTC

Set cache setting to 10 days and try and trash as many of them as possible :-), until it gets backoffd
Yes that is what I have done
see how many I can get rid of
Not that many (compared to how many there are to get through).
For a particular application, for every error you return, you have the amount of work you can download reduced by 1 until you get to the point you will only be able to get 1 Task per 24 hours.
Returning Valid work increases the amount of work you can download per day.

For at least some BOINC projects, if every computer that tries a workunit returns a computation error, BOINC decides that the workunit is defective and no longer counts any of the failures against the computers that tried it.
Sid Celery
Joined: 11 Feb 08
Posts: 2141
Credit: 41,528,568
RAC: 10,278
Message 104972 - Posted: 19 Feb 2022, 0:29:57 UTC - in response to Message 104927.  

I only run on one Linux and one Windows 7 atm and all the Windows error out within 20 seconds. All the Ubuntu ones were good until now. The 14 movingstubs units ran the requested 8 hours on my 3900X. I can see that a new team member successfully completed his on Linux, but all were 3 hours. W3670. CPU and OS related?

I've reported this too. Your W7, my W10 and someone else's W11 all fail while Ubuntu and Linux all run ok.
Hopefully it means something to them.
Tasks haven't been withdrawn yet (that I've noticed) and I've had no direct reply yet either.
Sid Celery
Joined: 11 Feb 08
Posts: 2141
Credit: 41,528,568
RAC: 10,278
Message 104973 - Posted: 19 Feb 2022, 0:43:28 UTC - in response to Message 104964.  
Last modified: 19 Feb 2022, 0:46:26 UTC

After getting all Computation Errors on all the WU's I set my PC to No New Tasks until you folks can figure out why Windows crashes with the "movingstub" Work units

Unless one of our experts here reaches out to his contact at UW, there is no one that will see these posts.
So "you folks" is a pointless thing.

I'm not 100% on the ball at the moment, Greg, but I am reporting things within 24 hours of seeing a post about them here.
Specifically I mean the movingstub tasks' computation error, and now the update that they're running ok on Ubuntu/Linux but not on any version of Windows.
I haven't got my head around what the specific issue is with rb tasks crashing out, but I do see people reporting problems with those too.
If someone describes the rb issue to me in a way I can pass on, I'll follow up with that too.
Is it much the same as movingstubs tasks or are they crashing on Ubuntu as well?

Edit
Also in python the aagb-PHE stuff is buggy

Just spotted your mention of this too. I'm away from my one PC running Python tasks for another few days, so I'll try to confirm that issue as well
.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104974 - Posted: 19 Feb 2022, 0:51:20 UTC
Last modified: 19 Feb 2022, 0:57:36 UTC

Well it's time for bed so I have undone my movingtargets silliness
Got 500 of them in the bin.
Though I did have a long running one , it lasted a full 30 seconds .
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1725
Credit: 18,396,094
RAC: 19,482
Message 104975 - Posted: 19 Feb 2022, 2:27:25 UTC

Well, the project was down for a while there, but the movingstub crash and burns are still there since it came back online.
Grant
Darwin NT
Tomcat雄猫
Joined: 20 Dec 14
Posts: 180
Credit: 5,386,173
RAC: 0
Message 104976 - Posted: 19 Feb 2022, 3:01:36 UTC - in response to Message 104975.  
Last modified: 19 Feb 2022, 3:02:41 UTC

SO! MANY! ERRORS!

First off are the known bad movingstub tasks that crash almost instantly.

Then there are the broken rb_ tasks that don't crash as fast.

rb_02_16_213031_208962_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_04_2906559_35_1
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_02_16_213031_208962_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 3 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_02_16_213031_208962_ab_t000__robetta.zip -frag3 rb_02_16_213031_208962_ab_t000__robetta.200.3mers.index.gz -fragA rb_02_16_213031_208962_ab_t000__robetta.200.4mers.index.gz -fragB rb_02_16_213031_208962_ab_t000__robetta.200.6mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3477109
Using database: database_357d5d93529_n_methylminirosetta_database

[ ERROR ]: Caught exception:


File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
 ------------------------ Begin developer's backtrace ------------------------- 
BACKTRACE:
 ------------------------- End developer's backtrace -------------------------- 


AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.



</stderr_txt>
]]>



rb_02_16_213027_208958_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_08_2906555_55_1
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_02_16_213027_208958_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 3 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_02_16_213027_208958_ab_t000__robetta.zip -frag3 rb_02_16_213027_208958_ab_t000__robetta.200.3mers.index.gz -fragA rb_02_16_213027_208958_ab_t000__robetta.200.8mers.index.gz -fragB rb_02_16_213027_208958_ab_t000__robetta.200.6mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3485351
Using database: database_357d5d93529_n_methylminirosetta_database

[ ERROR ]: Caught exception:


File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
 ------------------------ Begin developer's backtrace ------------------------- 
BACKTRACE:
 ------------------------- End developer's backtrace -------------------------- 


AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.



</stderr_txt>
]]>



rb_02_16_213026_208957_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_09_2906558_22_1
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_02_16_213026_208957_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 3 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_02_16_213026_208957_ab_t000__robetta.zip -frag3 rb_02_16_213026_208957_ab_t000__robetta.200.3mers.index.gz -fragA rb_02_16_213026_208957_ab_t000__robetta.200.9mers.index.gz -fragB rb_02_16_213026_208957_ab_t000__robetta.200.6mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3477932
Using database: database_357d5d93529_n_methylminirosetta_database

[ ERROR ]: Caught exception:


File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
 ------------------------ Begin developer's backtrace ------------------------- 
BACKTRACE:
 ------------------------- End developer's backtrace -------------------------- 


AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.



</stderr_txt>
]]>


Windows 11
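All three stderr dumps above fail at the same source line with the same message, and the value printed is not a number. A minimal Python sketch of pulling those fields out of such a dump is shown below; the function names and the regular expression are illustrative assumptions, not part of Rosetta or BOINC, and the sample text is copied from the dumps as posted. It also shows why a NaN can never pass a range check like the one in the message.

```python
import re

# Matches "File: <path>:<line>" followed by the message on the next line.
ERROR_RE = re.compile(
    r"File:\s*(?P<file>\S+):(?P<line>\d+)[ \t]*\r?\n(?P<message>[^\r\n]+)"
)

SAMPLE = """[ ERROR ]: Caught exception:


File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)
"""


def extract_error(stderr_text):
    """Return file, line, message and offending value, or None if no match."""
    match = ERROR_RE.search(stderr_text)
    if match is None:
        return None
    message = match.group("message").strip()
    value = message.rsplit(":", 1)[-1].strip()  # text after the last colon
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "message": message,
        "value": value,
        "is_nan": "nan" in value.lower(),
    }


def in_range(angle):
    """Same kind of check as in the message: every comparison with NaN is False."""
    return -180.0 <= angle <= 180.0


info = extract_error(SAMPLE)
print(info["line"], info["value"], info["is_nan"])
print(in_range(90.0), in_range(float("nan")))
```

The sketch says nothing about where the NaN comes from; it only shows that once a chi angle is NaN, the range check has to fail, which matches the identical error in all three tasks.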
6dj72cn8
Joined: 18 Apr 06
Posts: 5
Credit: 207,684
RAC: 0
Message 104977 - Posted: 19 Feb 2022, 3:55:07 UTC - in response to Message 104955.  

So it's going to take a while to clear them unless the project . . . puts in a flag with the Scheduler to only allocate them to LINUX systems.

Linux or Mac. It's only Windows that's choking on them.



©2024 University of Washington
https://www.bakerlab.org