Problems and Technical Issues with Rosetta@home

kotenok2000
Joined: 22 Feb 11
Posts: 232
Credit: 332,990
RAC: 1,945
Message 109043 - Posted: 27 Mar 2024, 13:28:25 UTC

I see this in stderr.txt
command: projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.04_windows_x86_64.exe @7ahall_e_hal_7aa_15545_d239_0001.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Extracting in project directory: database_0f7f01a1b07.zip
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/rotamer/bbdep02.May.sortlib
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/rotamer/peptoid_rotlibs/001.rotlib
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.R.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.6.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.1.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.D.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.V.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/chemical/pdb_components/components.4.cif
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/disulfide_jump_database_wip.dat
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/fragpicker_rama_tables/L_QP.counts.gz
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/sampling/vall.jul19.2011.torsions.gz
        Permission denied
error:  cannot delete old E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07/database/protocol_data/tensorflow_graphs/gcn_test_model/gcn_test_model_plot.png
        Permission denied
Extracting in slot directory: minirosetta_database.zip
Using database: database

Looks like each task tried to extract the database to E:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta/database_0f7f01a1b07 all at once, gave up, and extracted to the slot directory instead.
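
A minimal sketch of one way concurrent tasks could avoid fighting over the same shared directory: each process extracts into its own temporary directory and then renames it into place, so only one wins and the others reuse the result. This is an illustration in Python, not what the Rosetta application actually does; `extract_once` and the paths passed to it are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile


def extract_once(zip_path, dest_dir):
    """Extract zip_path into dest_dir unless it has already been populated.

    Returns True if this call created dest_dir, False if it already existed
    (for example because another task finished extracting first).
    """
    if os.path.isdir(dest_dir):
        return False

    parent = os.path.dirname(os.path.abspath(dest_dir))
    # Private scratch directory next to the target, so the rename below
    # stays on the same filesystem.
    tmp_dir = tempfile.mkdtemp(prefix="extract_", dir=parent)
    try:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        try:
            os.rename(tmp_dir, dest_dir)
        except OSError:
            # Another process renamed its copy into place first.
            if os.path.isdir(dest_dir):
                return False
            raise
        return True
    finally:
        # No-op after a successful rename; cleans up on failure or lost race.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With this pattern no task ever deletes or overwrites files that another task may have open, which is the situation the "cannot delete old ... Permission denied" lines point to.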

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,597,879
RAC: 17,772
Message 109055 - Posted: 2 Apr 2024, 1:28:54 UTC

It's a strange thing, but every time tasks run out recently, another ~million seem to be added to the queue to keep us going.

I realise many have given up a bit on expecting reliability from Rosetta, but it almost seems like someone is paying a little attention on the quiet.

Or maybe I'm just wishing that was the case.

Either way, it's appreciated.
And there are still enough people around to blast through and return them quickly too.
(Comparatively) good times...

kotenok2000
Joined: 22 Feb 11
Posts: 232
Credit: 332,990
RAC: 1,945
Message 109057 - Posted: 2 Apr 2024, 10:47:24 UTC

When I run the graphics app, it closes immediately.
I see this in stderrgfx.txt:

user@ubuntu:/var/lib/boinc/slots/19$ cat stderrgfx.txt

ERROR: Unable to open file: /var/lib/boinc/projects/boinc.bakerlab.org_rosetta/../database/chemical/residue_type_sets/fa_standard/residue_types.txt

ERROR:: Exit from: src/core/chemical/GlobalResidueTypeSet.cc line: 145
13:38:44 (25733): called boinc_finish(0)
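
The path in that error contains a `..` component, so it helps to see where it actually points. A small sketch using Python's lexical path normalisation (it ignores symlinks, so this assumes none are involved):

```python
import posixpath

# Path copied verbatim from the stderrgfx.txt error above.
reported = (
    "/var/lib/boinc/projects/boinc.bakerlab.org_rosetta/../database/"
    "chemical/residue_type_sets/fa_standard/residue_types.txt"
)

# Collapse the ".." component to see which directory is really being opened.
resolved = posixpath.normpath(reported)
print(resolved)
```

The normalised path starts with `/var/lib/boinc/projects/database/`, i.e. a `database` directory directly under `projects/` rather than inside the Rosetta project directory or the slot directory.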

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1867
Credit: 8,195,494
RAC: 6,295
Message 109058 - Posted: 2 Apr 2024, 12:50:40 UTC - in response to Message 109055.  

And there are still enough people around to blast through and return them quickly too.
(Comparatively) good times...


+1
But after months and hundreds of thousands of WUs, maybe it's time to let the app out of the beta stage.

kotenok2000
Joined: 22 Feb 11
Posts: 232
Credit: 332,990
RAC: 1,945
Message 109059 - Posted: 2 Apr 2024, 13:48:52 UTC

Tasks finish in 3 hours for me.

I have set "Target CPU run time" to "not selected"

just1vet
Joined: 13 Nov 05
Posts: 4
Credit: 2,997,412
RAC: 22,827
Message 109062 - Posted: 2 Apr 2024, 21:41:05 UTC

Big problems with Rosetta on Linux Mint 20 and 21. I had to remove the project from the client on both of my machines. It would freeze the computers to the point where they had to be rebooted, only to lock up again as soon as Rosetta started.
I narrowed it down to the Rosetta project after replacing the hard drive, motherboard and RAM. The machines run fine on the other projects. This has been going on for a while. Rosetta runs fine on my Windows computers.

Any ideas?

kotenok2000
Joined: 22 Feb 11
Posts: 232
Credit: 332,990
RAC: 1,945
Message 109063 - Posted: 2 Apr 2024, 21:43:47 UTC - in response to Message 109062.  

Maybe you can reduce the number of CPUs allocated from 100% to 75%?

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,597,879
RAC: 17,772
Message 109064 - Posted: 2 Apr 2024, 23:47:57 UTC - in response to Message 109059.  

Tasks finish in 3 hours for me.

I have set "Target CPU run time" to "not selected"

Has something changed then? I noticed this elsewhere too.

Doesn't apply to me - since tasks have been harder to come by I've changed my default to 12hrs.
Maybe it's time to be explicit on runtime and change it to 8hrs, rather than let it run at a dodgy default value.
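
The command line that a task logs in stderr.txt shows the run time that task was actually given, which is one way to check what a task received. A small sketch that pulls `-cpu_run_time` out of such a line and converts it to hours; the sample command is a shortened copy of the one quoted earlier in this thread, and `flag_value` is a hypothetical helper.

```python
import shlex

# Shortened copy of the "command:" line from the stderr.txt posted above.
command = (
    "projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.04_windows_x86_64.exe "
    "-nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 "
    "-checkpoint_interval 120 -boinc::cpu_run_timeout 36000"
)


def flag_value(command_line, flag):
    """Return the token following `flag`, or None if the flag is absent."""
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens[:-1]):
        if token == flag:
            return tokens[index + 1]
    return None


run_time_s = int(flag_value(command, "-cpu_run_time"))
run_time_h = run_time_s / 3600
print(f"-cpu_run_time {run_time_s} s = {run_time_h:g} h")
```

For that particular Windows task the value is 28800 seconds, i.e. 8 hours. Whether the same value applies when the preference is left at "not selected" is not something this log alone can confirm.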

just1vet
Joined: 13 Nov 05
Posts: 4
Credit: 2,997,412
RAC: 22,827
Message 109065 - Posted: 3 Apr 2024, 3:28:05 UTC - in response to Message 109063.  

Right now they are 32-core with 16 GB of RAM, which should be enough for crunching.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1498
Credit: 14,728,001
RAC: 16,307
Message 109066 - Posted: 3 Apr 2024, 4:52:03 UTC - in response to Message 109065.  
Last modified: 3 Apr 2024, 4:53:10 UTC

Right now they are 32 core with 16gb of RAM. Which should be enough for crunching.
Some Rosetta 4.20 Tasks require over 2GB of RAM.
32 × 2 GB = 64 GB, which is way more than 16 GB.
The larger-RAM Tasks have been very few and far between, though; 500 MB to 1 GB has been the usual range for Rosetta 4.20 Tasks lately. And 32 × 0.5 GB = 16 GB, which is all your RAM.

16GB RAM on a system with 64 cores/threads is way, way, way too little.
Grant
Darwin NT
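
The arithmetic above can be turned around to estimate how many tasks fit in RAM and what "use at most N% of the CPUs" setting that corresponds to. A rough sketch; the per-task figures are the ones quoted in the post, while `reserved_gb` (memory left for the OS and everything else) is an assumed value and both function names are hypothetical.

```python
def max_concurrent_tasks(total_ram_gb, ram_per_task_gb, reserved_gb=2.0):
    """How many tasks of a given size fit in RAM after reserving some for the OS."""
    usable_gb = max(total_ram_gb - reserved_gb, 0.0)
    return int(usable_gb // ram_per_task_gb)


def cpu_percent_for(tasks, logical_cpus):
    """CPU percentage that limits the client to `tasks` concurrent tasks."""
    return min(100.0, 100.0 * tasks / logical_cpus)


total_ram_gb = 16
logical_cpus = 32

for ram_per_task_gb in (0.5, 1.0, 2.0):
    tasks = max_concurrent_tasks(total_ram_gb, ram_per_task_gb)
    percent = cpu_percent_for(tasks, logical_cpus)
    print(f"{ram_per_task_gb} GB/task -> {tasks} tasks, about {percent:.0f}% of CPUs")
```

Under these assumptions even the small 0.5 GB tasks leave no room to run on all 32 CPUs at once, and 2 GB tasks would fit only a handful at a time.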

RDTSC
Joined: 29 Jan 24
Posts: 2
Credit: 67,077
RAC: 194
Message 109067 - Posted: 3 Apr 2024, 12:08:31 UTC - in response to Message 109009.  
Last modified: 3 Apr 2024, 12:10:22 UTC

A flock of work units arrived recently that are behaving oddly, well, all but one of them...
I have one machine, a workstation Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz / Arch Linux, which crunches Rosetta and WCG packets fine. A few months ago, added a really old dual Intel(R) Xeon(TM) CPU 2.80GHz machine (old Dell server, latest Ubuntu server LTS.) The old machine was getting Rosetta Beta workunits and choking on them; error, error, error... it was able to crunch through several non-beta workunits though. Thought it was the old CPUs, like an unsupported instruction or something. Reading this, now thinking it was bad workunits.

Sid Celery
Joined: 11 Feb 08
Posts: 2003
Credit: 38,597,879
RAC: 17,772
Message 109068 - Posted: 3 Apr 2024, 22:10:16 UTC
Last modified: 3 Apr 2024, 22:10:38 UTC

And we're back...

Looks like the whole website went down for about 10 hours today.
Couldn't even get to the Rosetta home page, let alone upload results.
Everything is going through fine now.

GDB
Joined: 5 Oct 17
Posts: 1
Credit: 3,610,013
RAC: 6,683
Message 109069 - Posted: 4 Apr 2024, 1:54:49 UTC

All my units are getting validate errors now.

MStenholm
Joined: 18 Apr 20
Posts: 17
Credit: 22,788,671
RAC: 24,677
Message 109070 - Posted: 4 Apr 2024, 4:52:17 UTC

GDB: you are not alone in having all returned results get validate errors. The top 10 CPUs I checked, plus my own, got the same verdict.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1498
Credit: 14,728,001
RAC: 16,307
Message 109071 - Posted: 4 Apr 2024, 5:19:12 UTC

Yep, the Validator is borked.

For me, anything returned from 3 Apr 2024, 22:02:46 UTC fails, and a quick look at the top computers shows the same thing: everything going back at present fails Validation.

If someone could get the Project's attention?
Grant
Darwin NT

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1867
Credit: 8,195,494
RAC: 6,295
Message 109072 - Posted: 4 Apr 2024, 7:12:50 UTC - in response to Message 109071.  

If someone could get the Projects attention?

+1
After over 60 WUs failed some hours ago, I'm ready to upload about ten WUs.
Do I have to stop the upload?

Daniel Graf
Joined: 2 Nov 05
Posts: 10
Credit: 59,484,059
RAC: 32,402
Message 109073 - Posted: 4 Apr 2024, 7:41:35 UTC

Let's see if these work units are still credited. But I have the feeling that after calculating they will go straight into the trash can. Unfortunately, one computer will be running until this afternoon and will probably only produce garbage before I can separate it from Rosetta.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1498
Credit: 14,728,001
RAC: 16,307
Message 109074 - Posted: 4 Apr 2024, 8:04:07 UTC - in response to Message 109073.  

Let's see if these work units are still credited.
If it's not Valid, there is no Credit.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1498
Credit: 14,728,001
RAC: 16,307
Message 109075 - Posted: 4 Apr 2024, 8:06:52 UTC - in response to Message 109072.  

Have i to stop the upload?
Stopping uploads will also stop you from getting new work, but it is the only way to keep returned work from failing Validation until the project fixes the issue.
They could also re-run the validation of the presently failed Tasks, but I don't like the odds of that actually happening.
Grant
Darwin NT
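
For anyone who prefers to hold finished work back from the command line, here is a sketch that builds the relevant `boinccmd` invocations. `--set_network_mode` and the `--project <URL> nomorework` operation are standard `boinccmd` options; the project URL is an assumption inferred from the `boinc.bakerlab.org_rosetta` directory name, and the helper names are hypothetical. By default the script only prints the commands and runs nothing.

```python
import subprocess

# Assumption: URL inferred from the project directory name; check your client.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def hold_back_commands(project_url=PROJECT_URL):
    """Commands to stop fetching new work and suspend network activity."""
    return [
        ["boinccmd", "--project", project_url, "nomorework"],
        ["boinccmd", "--set_network_mode", "never"],
    ]


def resume_commands(project_url=PROJECT_URL):
    """Commands to undo hold_back_commands once validation works again."""
    return [
        ["boinccmd", "--set_network_mode", "auto"],
        ["boinccmd", "--project", project_url, "allowmorework"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(hold_back_commands())
```

Note that the network mode applies to the whole client, so uploads and downloads for every attached project pause, not just Rosetta's, and tasks close to their deadline may still expire while held back.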

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1867
Credit: 8,195,494
RAC: 6,295
Message 109077 - Posted: 4 Apr 2024, 9:50:31 UTC - in response to Message 109075.  
Last modified: 4 Apr 2024, 9:52:13 UTC

it will stop you from getting new work, but it is the only way to stop returned work from not Validating until the project fixes the issue.

The problem is that some of these WUs are near the deadline.
It's a pity to throw away the work done... (I don't care a lot about points, I care about science.)


©2024 University of Washington
https://www.bakerlab.org