Problems and Technical Issues with Rosetta@home




Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109374 - Posted: 12 Jun 2024, 19:05:03 UTC - in response to Message 109371.  

Tasks starting with RosettaVS run for 8 hours for me.

Great, but my point isn't about the ones that run as expected; it's about all those that don't, of which there seem to be many.
Also, I don't recall seeing any RosettaVS tasks. I don't know how they behave.
ID: 109374
Bryn Mawr

Joined: 26 Dec 18
Posts: 380
Credit: 11,424,327
RAC: 11,278
Message 109375 - Posted: 13 Jun 2024, 6:51:24 UTC - in response to Message 109367.  

Now out of new work

This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down.
It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe is a mistake and which contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less often, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.
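
As a rough check of the "almost treble" figure, here is a minimal sketch of the arithmetic. It assumes, as the post does, that credit scales roughly linearly with CPU run time; the function name is hypothetical and not part of BOINC or Rosetta@home.

```python
def runtime_ratio(new_hours: float, old_hours: float) -> float:
    """Return how many times longer a task runs at the new target run time."""
    if old_hours <= 0:
        raise ValueError("old_hours must be positive")
    return new_hours / old_hours


# 3hr apparent default vs. an explicit 8hr target (figures from the post)
ratio = runtime_ratio(8, 3)
print(f"Tasks run about {ratio:.2f}x as long")
```

The ratio works out to about 2.67, which is where "almost treble" comes from. A batch of a given size would also last roughly that much longer, on the same assumption.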


I've always figured it's best to leave it on default, as the project scientists who set it up know their requirements better than I do.
ID: 109375
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1558
Credit: 16,050,762
RAC: 18,359
Message 109376 - Posted: 13 Jun 2024, 7:51:00 UTC

New batch of work over at Ralph, with new errors.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_148_16902_5_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 8, in <module>
    import torch
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorch__init__.py", line 124, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchlibcaffe2_detectron_ops_gpu.dll" or one of its dependencies.

</stderr_txt>
]]>




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_195_16901_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 698, in <module>
    b.write(base64.b64decode(f.read()))
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libbase64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (65) cannot be 1 more than a multiple of 4

</stderr_txt>
]]>
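
The base64 failure in the traceback above can be reproduced in isolation. Base64 encodes 3 bytes as 4 characters, so a string with 65 data characters (1 more than a multiple of 4) can never be valid, which usually suggests a truncated or corrupted input. A minimal sketch, using a made-up 65-character string rather than the actual file that predict.py read; both helper names are hypothetical:

```python
import base64
import binascii


def can_be_valid_base64_length(n_data_chars: int) -> bool:
    """A base64 payload can never have 4k + 1 data characters."""
    return n_data_chars % 4 != 1


def try_decode(text: str):
    """Return the decoded bytes, or None if the input is not decodable base64."""
    try:
        return base64.b64decode(text)
    except binascii.Error as err:
        print(f"decode failed: {err}")
        return None


sample = "A" * 65  # hypothetical stand-in for a truncated input
print(can_be_valid_base64_length(len(sample)))  # 65 % 4 == 1, so never valid
print(try_decode(sample))                       # decoding is rejected
print(len(try_decode("A" * 64)))                # 64 characters decode to 48 bytes
```

If the input file is at fault, re-downloading it (or resetting the project so BOINC fetches it again) would be the obvious thing to try, though that is a guess from the error text alone.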




RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_119_16902_6_1

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid.
 (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 708, in <module>
    pred.predict(out_name+f'_{n}', 
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 551, in predict
    logit_s, logit_aa_s, logit_pae, logit_pde, p_bind, pred_crds, alpha, pred_allatom, pred_lddt_binned,                msa_prev, pair_prev, state_prev = self.model(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaRoseTTAFoldModel.py", line 358, in forward
    msa, pair, xyz, alpha_s, xyz_allatom, state, symmsub = self.simulator(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 1106, in forward
    msa, pair, xyz, state, alpha, symmsub = self.main_block[i_m](msa, pair,
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 929, in forward
    xyz, state, alpha = self.str2str(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchcudaampautocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 503, in forward
    shift = self.se3(G, node.reshape(B*L, -1, 1), l1_feats, edge_feats)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaSE3_network.py", line 96, in forward
    return self.se3(G, node_features, edge_features)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodeltransformer.py", line 185, in forward
    node_feats = self.graph_modules(node_feats, edge_feats, graph=graph, basis=basis)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodeltransformer.py", line 47, in forward
    input = module(input, *args, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersattention.py", line 162, in forward
    fused_key_value = self.to_key_value(node_features, edge_features, graph, basis)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 347, in forward
    out += self.conv_in[str(degree_in)](feature, invariant_edge_feats,
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 186, in forward
    radial_weights = self.radial_func(invariant_edge_feats[e_i:e_j]) 
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 118, in forward
    return self.net(features)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulescontainer.py", line 139, in forward
    input = module(input)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmoduleslinear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnfunctional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: [enforce fail at ..c10coreCPUAllocator.cpp:79] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes.

</stderr_txt>]]>
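
The three stderr excerpts above fail for different reasons: the page file was too small to load a torch DLL, a base64 input had an impossible length, and a 536870912-byte (512 MiB) allocation failed. Here is a minimal sketch of how such excerpts might be sorted automatically. The category labels and the function name are made up for illustration, and the patterns are taken only from the messages quoted in this post.

```python
import re

# Patterns come from the stderr excerpts in this post; the labels are hypothetical.
PATTERNS = [
    (re.compile(r"WinError 1455"), "page file too small"),
    (re.compile(r"binascii\.Error: Invalid base64-encoded string"), "corrupt base64 input"),
    (re.compile(r"not enough memory: you tried to allocate (\d+) bytes"), "out of memory"),
]


def classify_stderr(stderr_text: str) -> str:
    """Return a short label for the first known failure pattern found."""
    for pattern, label in PATTERNS:
        match = pattern.search(stderr_text)
        if match is None:
            continue
        if match.groups():
            # Report the failed allocation size in MiB for readability.
            mib = int(match.group(1)) / (1024 * 1024)
            return f"{label} ({mib:.0f} MiB requested)"
        return label
    return "unknown"


line = "DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes."
print(classify_stderr(line))
```

Both the first and third failures are memory-related, so they may share a cause (too little RAM or page file for the task), but the logs alone don't confirm that.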

Grant
Darwin NT
ID: 109376
kotenok2000

Joined: 22 Feb 11
Posts: 243
Credit: 435,550
RAC: 802
Message 109377 - Posted: 13 Jun 2024, 12:29:17 UTC
Last modified: 13 Jun 2024, 13:06:40 UTC

Did they port the Rosetta Python projects to native Windows?
Try increasing the pagefile size.
It helped with the GPUGRID Python project.
It even uses the GPU.
ID: 109377
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109379 - Posted: 13 Jun 2024, 23:42:54 UTC - in response to Message 109375.  

What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe is a mistake and which contradicts the forced Boinc setting of 8hrs.
As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected".
This will almost treble how long tasks run and extend the life of work batches so that we run out less often, if at all, while almost trebling the credit we get for tasks too.

This should be considered a high priority for everyone imo.

I've always figured it's best to leave it on default, as the project scientists who set it up know their requirements better than I do.

While generally true, it's clear imo this 3hr target runtime is an error as it's inconsistent with what Rosetta tells Boinc.
It only ever slips through when a new version of the app comes out.
Istr it happened once before and was corrected in the days when the admins paid more attention to us.
If the 8hr default ever changes I think something would be said - and seeing as no-one's saying anything these days I doubt it ever will change without a very specific reason.
ID: 109379
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109380 - Posted: 14 Jun 2024, 3:20:41 UTC

Ooh, 360k tasks. We live to fight another day (or two)
ID: 109380
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,887,981
RAC: 10,948
Message 109383 - Posted: 15 Jun 2024, 6:48:29 UTC
Last modified: 15 Jun 2024, 6:48:45 UTC

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

ID: 109383
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109385 - Posted: 15 Jun 2024, 9:32:32 UTC - in response to Message 109383.  

Today, a lot of "classical" errors:

ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

Yes, but they fail very quickly, so I'm not too worried by them.

More concerning are two Validate errors on tasks that ran to completion:
hal_8a_i_hal_8aa_2jp5597_d99_0001_SAVE_ALL_OUT_2978378_13_0
hal_8a_i_hal_8aa_2jp1316_d224_0001_SAVE_ALL_OUT_2978378_13_0
ID: 109385
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109387 - Posted: 17 Jun 2024, 20:29:44 UTC - in response to Message 109380.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.
ID: 109387
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109389 - Posted: 19 Jun 2024, 21:01:17 UTC - in response to Message 109387.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.

While I know most people will have finished up their outstanding tasks already, I managed to sneak 4 extra returned tasks today and now discover that the validators running under boinc-process are down again.
Better now than at other times, I guess
ID: 109389
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1558
Credit: 16,050,762
RAC: 18,359
Message 109390 - Posted: 20 Jun 2024, 6:15:28 UTC
Last modified: 20 Jun 2024, 6:15:54 UTC

That boinc-process server has developed a habit of regularly falling over; it was well past due for another crash.
Grant
Darwin NT
ID: 109390
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109391 - Posted: 20 Jun 2024, 7:51:27 UTC - in response to Message 109389.  

Ooh, 360k tasks. We live to fight another day (or two)

Turned into 3+ days, but we're out again.

While I know most people will have finished up their outstanding tasks already, I managed to sneak 4 extra returned tasks today and now discover that the validators running under boinc-process are down again.
Better now than at other times, I guess

Or maybe not better now, as 660k tasks are newly available
ID: 109391
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1914
Credit: 8,887,981
RAC: 10,948
Message 109396 - Posted: 20 Jun 2024, 20:10:55 UTC - in response to Message 109391.  

Or maybe not better now, as 660k tasks are newly available


0 WUs and a lot of daemons are down...
ID: 109396
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109397 - Posted: 20 Jun 2024, 23:19:40 UTC - in response to Message 109396.  
Last modified: 20 Jun 2024, 23:26:14 UTC

Or maybe not better now, as 660k tasks are newly available

0 WUs and a lot of daemons are down...

Yup. I would've expected 660k to last at least 2 days, but I'm not sure it lasted much more than 15hrs, unless tasks got pulled.
Front page figures borked on top of boinc-process server borked

Edit: Actually, I'm now thinking tasks did get pulled.

Unvalidated tasks were about 20k before the new batch arrived, now 160k (+140k)
In progress tasks were about 30k, now 112k (+82k)
That implies 222k tasks were grabbed

But the front page is locked at 7am with 660k queued, so roughly 440k (660k - 222k) have gone missing, presumed pulled
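
The estimate above can be laid out as a quick calculation. A minimal sketch, using only the approximate figures quoted in this post (all in thousands of tasks) and assuming every task that left the queue now shows up as either unvalidated or in progress:

```python
# Approximate figures from the post, in thousands of tasks
unvalidated_before, unvalidated_after = 20, 160
in_progress_before, in_progress_after = 30, 112
queued_at_last_update = 660

# Tasks handed out = growth in unvalidated + growth in in-progress
grabbed = (unvalidated_after - unvalidated_before) + (in_progress_after - in_progress_before)

# Whatever was queued but not handed out is unaccounted for
missing = queued_at_last_update - grabbed

print(f"grabbed: {grabbed}k, unaccounted for: {missing}k")
```

This gives 222k grabbed and 438k unaccounted for, in line with the rough 440k figure. Tasks that were returned and already validated in the meantime are not counted here, so the real number grabbed could be somewhat higher.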
ID: 109397
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109400 - Posted: 21 Jun 2024, 9:20:01 UTC - in response to Message 109397.  

Or maybe not better now, as 660k tasks are newly available

0 WUs and a lot of daemons are down...

Yup. I would've expected 660k to last at least 2 days, but I'm not sure it lasted much more than 15hrs, unless tasks got pulled.
Front page figures borked on top of boinc-process server borked

Still the same - now nudged

Edit while posting: site went down, back 5mins later, no apparent change yet but might be shortly (fingers-crossed)
ID: 109400
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1558
Credit: 16,050,762
RAC: 18,359
Message 109401 - Posted: 21 Jun 2024, 9:51:54 UTC
Last modified: 21 Jun 2024, 9:53:20 UTC

boinc-process server still dead, front page Server Status numbers still not updated (Last update, 07:04 UTC, yesterday).
Grant
Darwin NT
ID: 109401
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109403 - Posted: 21 Jun 2024, 11:48:26 UTC - in response to Message 109401.  

boinc-process server still dead, front page Server Status numbers still not updated (Last update, 07:04 UTC, yesterday).

Add it to the very long list of things I'm completely wrong about... <sigh>
I've asked. We wait.
ID: 109403
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1558
Credit: 16,050,762
RAC: 18,359
Message 109404 - Posted: 21 Jun 2024, 22:54:26 UTC

Just heard the fans in my system wind up.
Checked BOINC & lo and behold- Rosetta has work again.


Now if they could just get that boinc-process server that's been dead for a while now up and running again then all would be good.
Grant
Darwin NT
ID: 109404
Sid Celery

Joined: 11 Feb 08
Posts: 2031
Credit: 39,986,263
RAC: 20,918
Message 109405 - Posted: 21 Jun 2024, 23:00:01 UTC - in response to Message 109404.  

Just heard the fans in my system wind up.
Checked BOINC & lo and behold- Rosetta has work again.

Now if they could just get that boinc-process server that's been dead for a while now up and running again then all would be good.

Both you and this PC were ahead of me.
The rest, still just as you say.

In a way, knowing if there are tasks or not, and whether they give credit or not, or how long they'll last, isn't massively different
ID: 109405
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1558
Credit: 16,050,762
RAC: 18,359
Message 109406 - Posted: 22 Jun 2024, 1:32:44 UTC

Server Status on the front page is yet to update, but all the servers on the Server Status page are now green and work is still flowing.
Grant
Darwin NT
ID: 109406




©2024 University of Washington
https://www.bakerlab.org