Problems and Technical Issues with Rosetta@home

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 109342 - Posted: 5 Jun 2024, 10:56:08 UTC - in response to Message 109341.
They probably rebooted it. It'd be nice if they fixed whatever it was that keeps causing it to die so they don't need to keep rebooting it.
Grant
Darwin NT

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109350 - Posted: 7 Jun 2024, 9:22:24 UTC - in response to Message 109342.
> They probably rebooted it. It'd be nice if they fixed whatever it was that keeps causing it to die so they don't need to keep rebooting it.
It is very odd - it never used to happen. Anyway, glad it got sorted before too long and they didn't need a nudge this time, seeing as I'm 2 days late in finding out.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 109363 - Posted: 11 Jun 2024, 7:49:15 UTC
New work at Ralph, with new errors. So some work has been done, but it looks like there's still quite a way to go.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_d_pred_188_16900_2_1
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid. (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv1rf2aapredict.py", line 733, in <module>
    with zipfile.ZipFile(args.z) as z:
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libzipfile.py", line 1268, in __init__
    self._RealGetContents()
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libzipfile.py", line 1335, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
</stderr_txt>
]]>

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_d_pred_60_16900_5_1
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
The access code is invalid. (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
'C:ProgramDataBOINC/projects/ralph.bakerlab.orgev0Scriptsactivate.bat' is not recognized as an internal or external command, operable program or batch file.
</stderr_txt>
]]>
Grant
Darwin NT
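For anyone who wants to look at a failing input on their own machine, here is a minimal sketch of a pre-check for the BadZipFile case above. It is not part of the Ralph application: the function name `describe_payload` and the file name in the usage line are hypothetical, and it only uses Python's standard `zipfile` module to say whether a file is missing, empty, not a zip archive, or a zip archive with a corrupt member.

```python
import zipfile
from pathlib import Path


def describe_payload(path):
    """Return a short diagnosis of a file that is expected to be a zip archive.

    Hypothetical helper for illustration; not taken from predict.py.
    """
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    # is_zipfile() checks for the zip end-of-central-directory signature.
    if not zipfile.is_zipfile(str(p)):
        return "not a zip file"
    with zipfile.ZipFile(str(p)) as z:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = z.testzip()
    return "ok" if bad_member is None else f"corrupt member: {bad_member}"


if __name__ == "__main__":
    # "input.zip" is a placeholder; point this at the file the task was given.
    print(describe_payload("input.zip"))
```

A result of "empty" or "not a zip file" would suggest the download or an earlier decoding step went wrong before `zipfile.ZipFile` was ever called, though the log alone doesn't say which.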

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109364 - Posted: 11 Jun 2024, 14:51:39 UTC
Total queued jobs on the front page are down to 222k. Advance warning: we may be out of new tasks in the next 24hrs unless we get lucky again. Fingers crossed.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 109365 - Posted: 12 Jun 2024, 7:36:48 UTC Last modified: 12 Jun 2024, 7:39:06 UTC
Now out of new work. Also, although the Server status shows all green, there is a backlog of Tasks waiting on Validation: 3,078 at the moment.
Grant
Darwin NT

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 109366 - Posted: 12 Jun 2024, 9:55:29 UTC - in response to Message 109365.
> Also, although the Server status shows all green, there is a backlog of Tasks waiting on Validation. 3,078 at the moment.
Whatever was going on before, the backlog has now cleared.
Grant
Darwin NT

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109367 - Posted: 12 Jun 2024, 11:43:19 UTC - in response to Message 109365.
> Now out of new work
This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down. It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.
What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs. As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected". This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too. This should be considered a high priority for everyone imo.

RDTSC Joined: 29 Jan 24 Posts: 4 Credit: 3,208,026 RAC: 0
Message 109368 - Posted: 12 Jun 2024, 12:16:14 UTC
https://boinc.bakerlab.org/rosetta/
Their home page could do with some updates; last post almost two years ago. I get it, web hosting and administration is expensive, along with preparing, running, and maintaining massive job servers. It just seems to me that a little grease, at the right points of this machine, would greatly help it function.

kotenok2000 Joined: 22 Feb 11 Posts: 291 Credit: 543,048 RAC: 0
Message 109369 - Posted: 12 Jun 2024, 12:19:21 UTC
Hal jobs run for three hours because subtasks are short and produce many results per task. Other jobs run for 8 hours.

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109370 - Posted: 12 Jun 2024, 12:45:17 UTC - in response to Message 109369.
> Hal jobs run for three hours because subtasks are short and produce many results per task. Other jobs run for 8 hours.
No. All mine run for 12hrs because I set them to run for 12hrs. They don't hit a top limit of decoys and end because some internal limit has been reached.
Rosetta Beta 6.04 tasks wrongly default to 3hrs CPU runtime while Rosetta v4.20 rightly default to 8hrs. So set the Rosetta@home Target CPU Runtime explicitly to 8hrs so that CPU runtime matches what Boinc is told to assume, and not to 'not selected'.
Do more work, get more credits, Boinc schedules more correctly and sooner, batches of tasks issued by Rosetta last longer. Rosetta tasks run out less often. <Everyone> wins.
The alternative is what we have now - no new tasks. Everyone loses.
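The "almost treble" figure can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: that every task really runs for its target CPU runtime (which is disputed in this thread), that the number of queued tasks does not depend on the runtime setting, and that the whole pool of volunteers delivers a fixed number of CPU-hours per day. The daily throughput value is made up for illustration; the 222k queue size is the front-page figure quoted earlier in the thread.

```python
def runtime_ratio(default_hours: float, target_hours: float) -> float:
    """How many times longer a task runs at the target setting than at the default."""
    return target_hours / default_hours


def days_until_empty(queued_tasks: int, hours_per_task: float,
                     cpu_hours_per_day: float) -> float:
    """Days a fixed queue lasts if every task consumes hours_per_task CPU-hours."""
    return queued_tasks * hours_per_task / cpu_hours_per_day


if __name__ == "__main__":
    # 8hrs versus the 3hr default discussed above.
    print(round(runtime_ratio(3, 8), 2))  # 2.67

    # ASSUMPTION: hypothetical aggregate throughput, not a project statistic.
    hypothetical_cpu_hours_per_day = 240_000
    for hours in (3, 8):
        days = days_until_empty(222_000, hours, hypothetical_cpu_hours_per_day)
        print(f"{hours}hr tasks: queue lasts about {days:.1f} days")
```

Whatever throughput is plugged in, the queue lifetime scales by the same 8/3 factor, which is the whole of the argument; it says nothing about whether the 3hr default is intentional.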

kotenok2000 Joined: 22 Feb 11 Posts: 291 Credit: 543,048 RAC: 0
Message 109371 - Posted: 12 Jun 2024, 12:48:37 UTC
Tasks starting with RosettaVS run for 8 hours for me.

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109374 - Posted: 12 Jun 2024, 19:05:03 UTC - in response to Message 109371.
> Tasks starting with RosettaVS run for 8 hours for me.
Great, but I don't say this for the ones that run as expected, but for all those that don't, of which there seem to be many.
Also, I don't recall seeing any RosettaVS tasks. I don't know how they behave.

Bryn Mawr Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 4
Message 109375 - Posted: 13 Jun 2024, 6:51:24 UTC - in response to Message 109367.
>> Now out of new work
> This has been the best run we've had for a couple of years - bound to end at some point once everyone's offline cache runs down. It's at this point my 12hr runtime setting ekes out my remaining work as far as possible.
> What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs. As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected". This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too. This should be considered a high priority for everyone imo.
I've always figured I should leave it on the default, as the project scientists who set the tasks up know their requirements better than I do.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 109376 - Posted: 13 Jun 2024, 7:51:00 UTC
New batch of work over at Ralph, with new errors.

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_148_16902_5_1
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid. (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 8, in <module>
    import torch
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorch__init__.py", line 124, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchlibcaffe2_detectron_ops_gpu.dll" or one of its dependencies.
</stderr_txt>
]]>

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_e_pred_195_16901_6_1
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid. (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 698, in <module>
    b.write(base64.b64decode(f.read()))
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libbase64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Invalid base64-encoded string: number of data characters (65) cannot be 1 more than a multiple of 4
</stderr_txt>
]]>

RF_SAVE_ALL_OUT_NOJRAN_IGNORE_THE_REST_validation_env_f_pred_119_16902_6_1
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
The access code is invalid. (0xc) - exit code 12 (0xc)</message>
<stderr_txt>
Traceback (most recent call last):
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 708, in <module>
    pred.predict(out_name+f'_{n}',
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aapredict.py", line 551, in predict
    logit_s, logit_aa_s, logit_pae, logit_pde, p_bind, pred_crds, alpha, pred_allatom, pred_lddt_binned, msa_prev, pair_prev, state_prev = self.model(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaRoseTTAFoldModel.py", line 358, in forward
    msa, pair, xyz, alpha_s, xyz_allatom, state, symmsub = self.simulator(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 1106, in forward
    msa, pair, xyz, state, alpha, symmsub = self.main_block[i_m](msa, pair,
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 929, in forward
    xyz, state, alpha = self.str2str(
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchcudaampautocast_mode.py", line 141, in decorate_autocast
    return func(*args, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaTrack_module.py", line 503, in forward
    shift = self.se3(G, node.reshape(BL, -1, 1), l1_feats, edge_feats)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aaSE3_network.py", line 96, in forward
    return self.se3(G, node_features, edge_features)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodeltransformer.py", line 185, in forward
    node_feats = self.graph_modules(node_feats, edge_feats, graph=graph, basis=basis)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodeltransformer.py", line 47, in forward
    input = module(input, *args, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersattention.py", line 162, in forward
    fused_key_value = self.to_key_value(node_features, edge_features, graph, basis)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 347, in forward
    out += self.conv_in[str(degree_in)](feature, invariant_edge_feats,
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 186, in forward
    radial_weights = self.radial_func(invariant_edge_feats[e_i:e_j])
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgcv2rf2aa/SE3Transformerse3_transformermodellayersconvolution.py", line 118, in forward
    return self.net(features)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulescontainer.py", line 139, in forward
    input = module(input)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmodulesmodule.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnmoduleslinear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:ProgramDataBOINCprojectsralph.bakerlab.orgev0libsite-packagestorchnnfunctional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: [enforce fail at ..c10coreCPUAllocator.cpp:79] data. DefaultCPUAllocator: not enough memory: you tried to allocate 536870912 bytes.
</stderr_txt>
]]>
Grant
Darwin NT

kotenok2000 Joined: 22 Feb 11 Posts: 291 Credit: 543,048 RAC: 0
Message 109377 - Posted: 13 Jun 2024, 12:29:17 UTC Last modified: 13 Jun 2024, 13:06:40 UTC
Did they port the Rosetta Python projects to native Windows? Try increasing the pagefile size. It helped with the GPUGRID Python project. It even uses the GPU.
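To put the failed allocation from Message 109376 in perspective, the byte count in the RuntimeError can be converted to more familiar units. The sketch below is plain arithmetic; the only assumption is the 4-bytes-per-element (float32) figure used for the element count, which the log does not state.

```python
# Size reported by "DefaultCPUAllocator: not enough memory" in Message 109376.
FAILED_ALLOCATION_BYTES = 536_870_912


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


def float32_elements(n_bytes: int) -> int:
    """Number of 4-byte values that fit; ASSUMES float32, not confirmed by the log."""
    return n_bytes // 4


if __name__ == "__main__":
    print(to_mib(FAILED_ALLOCATION_BYTES))            # 512.0
    print(float32_elements(FAILED_ALLOCATION_BYTES))  # 134217728
```

So a single tensor allocation of exactly 512 MiB failed, which fits the "paging file is too small" error on the other task: both point at the host running out of committable memory rather than at a corrupt input.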

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109379 - Posted: 13 Jun 2024, 23:42:54 UTC - in response to Message 109375.
>> What I'd re-emphasise is that the default runtime for tasks has fallen to 3hrs for some reason, which I believe to be a mistake and which contradicts the forced Boinc setting of 8hrs. As such, people should go into Boinc's Your Account option, select Rosetta@home preferences and change Target CPU run time to an explicit 8hrs rather than "not selected". This will almost treble how long tasks run and extend the life of work batches so that we run out less, if at all, while almost trebling the credit we get for tasks too. This should be considered a high priority for everyone imo.
> I've always figured to leave it on default as the project scientists who set them up know their requirements better than I do.
While generally true, it's clear imo that this 3hr target runtime is an error, as it's inconsistent with what Rosetta tells Boinc. It only ever slips through when a new version of the app comes out. ISTR it happened once before and was corrected in the days when the admins paid more attention to us.
If the 8hr default ever changes I think something would be said - and seeing as no-one's saying anything these days, I doubt it ever will change without a very specific reason.

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109380 - Posted: 14 Jun 2024, 3:20:41 UTC
Ooh, 360k tasks. We live to fight another day (or two).

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2206 Credit: 13,720,774 RAC: 4
Message 109383 - Posted: 15 Jun 2024, 6:48:29 UTC Last modified: 15 Jun 2024, 6:48:45 UTC
Today, a lot of "classical" errors:
ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
BOINC:: Error reading and gzipping output datafile: default.out
08:16:19 (5164): called boinc_finish(1)

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109385 - Posted: 15 Jun 2024, 9:32:32 UTC - in response to Message 109383.
> Today a lot of "classical" error
> ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictpplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT.
> ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2442
> BOINC:: Error reading and gzipping output datafile: default.out
> 08:16:19 (5164): called boinc_finish(1)
Yes, but they fail very quickly, so I'm not too worried by them.
More concerning are two Validate errors after running to completion:
hal_8a_i_hal_8aa_2jp5597_d99_0001_SAVE_ALL_OUT_2978378_13_0
hal_8a_i_hal_8aa_2jp1316_d224_0001_SAVE_ALL_OUT_2978378_13_0

Sid Celery Joined: 11 Feb 08 Posts: 2599 Credit: 47,220,881 RAC: 0
Message 109387 - Posted: 17 Jun 2024, 20:29:44 UTC - in response to Message 109380.
> Ooh, 360k tasks. We live to fight another day (or two)
Turned into 3+ days, but we're out again.