Constant computation errors.

Bryn

Joined: 28 Nov 20
Posts: 3
Credit: 626,197
RAC: 50
Message 105453 - Posted: 15 Mar 2022, 23:12:26 UTC

Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem?

Application
Rosetta 4.20
Name
preetham_gen_26749_0001_0001_0_SAVE_ALL_OUT_2911458_130
State
Computation error
Received
16/03/2022 12:04:15
Report deadline
19/03/2022 12:04:09
Estimated computation size
80,000 GFLOPs
CPU time
00:00:02
Elapsed time
00:00:16
Executable
rosetta_4.20_windows_x86_64.exe

Jean-David Beyer

Joined: 2 Nov 05
Posts: 66
Credit: 3,945,650
RAC: 109
Message 105454 - Posted: 16 Mar 2022, 0:05:48 UTC - in response to Message 105453.  

Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem?


The same reason I do, perhaps? It is my guess that they set up a huge group of work units that are defective, and did not test them before letting them loose on us. Either by incompetence, or by some terrible mistake. And perhaps they have inadequate staff to watch how things were going.

On my machine, all recent work units of this batch failed. I disabled getting any new ones for an hour or so, and then tried again for a little while. I now have a 100% failure rate on over 300 work units, so I stopped getting new work units. Most of mine have also been sent to some other machine, and they all failed there too. My machine has an Intel Xeon processor running Red Hat Enterprise Linux 5.4. The other users whose machines failed on my work units run either Linux or Windows. Sooner or later, my guess is, they will get back from the long weekend, notice the 100% failure rate, and do something about it.

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1351
Credit: 13,624,788
RAC: 0
Message 105455 - Posted: 16 Mar 2022, 5:50:13 UTC - in response to Message 105453.  

Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem?
The last batch that died like that only did so on Windows systems, but processed OK on Linux.
Looks like they fixed the problem with those Tasks so these new ones now fail on all systems.
Grant
Darwin NT

adrianxw

Joined: 18 Sep 05
Posts: 643
Credit: 11,145,078
RAC: 19
Message 105459 - Posted: 16 Mar 2022, 7:30:44 UTC
Last modified: 16 Mar 2022, 8:16:25 UTC

All failing on both my Windows 8.1 x64 systems after about 70 seconds. The exit code is "1 (0x00000001) Unknown error code", which is not much help.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

mmonnin

Joined: 2 Jun 16
Posts: 51
Credit: 8,823,488
RAC: 4
Message 105460 - Posted: 16 Mar 2022, 9:36:35 UTC

Error like this?

process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @preetham_gen_38675_0001_0001_0.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1371728
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in simple_cycpcp_predict app read_sequence() function! The minimum number of residues for a cyclic peptide is 4. (GenKIC requires three residues, plus a fourth to serve as an anchor).
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2264
BOINC:: Error reading and gzipping output datafile: default.out
16:47:07 (139426): called boinc_finish(1)
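
For anyone who wants to pull the useful part out of a log like the one above, here is a minimal Python sketch. It only assumes the line shapes visible in this excerpt (`ERROR: ...`, `ERROR:: Exit from: ... line: ...`, `called boinc_finish(n)`); the function name `summarize_stderr` is made up for illustration and is not part of BOINC or Rosetta.

```python
import re

# Excerpt copied from the stderr output quoted above.
STDERR = """\
ERROR: Error in simple_cycpcp_predict app read_sequence() function! The minimum number of residues for a cyclic peptide is 4. (GenKIC requires three residues, plus a fourth to serve as an anchor).
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2264
BOINC:: Error reading and gzipping output datafile: default.out
16:47:07 (139426): called boinc_finish(1)
"""


def summarize_stderr(text):
    """Extract the first error message, its source location, and the exit status.

    Hypothetical helper: any field whose pattern is not found is returned as None.
    """
    message = re.search(r"^ERROR: (.+)$", text, re.MULTILINE)
    location = re.search(r"^ERROR:: Exit from: (\S+) line: (\d+)$", text, re.MULTILINE)
    finish = re.search(r"called boinc_finish\((-?\d+)\)", text)
    return {
        "message": message.group(1) if message else None,
        "file": location.group(1) if location else None,
        "line": int(location.group(2)) if location else None,
        "exit_status": int(finish.group(1)) if finish else None,
    }


summary = summarize_stderr(STDERR)
print(summary["file"], summary["line"], summary["exit_status"])
```

Read this way, the failure is in the task's input (a sequence with fewer than 4 residues), not in the volunteer's machine.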

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105462 - Posted: 16 Mar 2022, 11:15:51 UTC - in response to Message 105455.  

The last batch that died like that only did so on Windows systems, but processed OK on LINUX.
Looks like they fixed the problem with those Tasks so these new ones now fail on all systems.


:-P

Bryn

Joined: 28 Nov 20
Posts: 3
Credit: 626,197
RAC: 50
Message 105498 - Posted: 17 Mar 2022, 5:51:29 UTC - in response to Message 105454.  

All from this person are corrupt. Who do I contact to tell them?
Application
Rosetta 4.20
Name
preetham_gen_10036_0001_0001_0_SAVE_ALL_OUT_2912745_775
State
Computation error
Received
17/03/2022 17:01:55
Report deadline
20/03/2022 17:01:54
Estimated computation size
80,000 GFLOPs
CPU time
00:01:48
Elapsed time
00:02:06
Executable
rosetta_4.20_windows_x86_64.exe

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1351
Credit: 13,624,788
RAC: 0
Message 105499 - Posted: 17 Mar 2022, 6:07:28 UTC - in response to Message 105498.  

All from this person are corrupt. Who do I contact to tell them?
Nobody.
In case you haven't been paying attention, it was the project that released a batch of faulty Tasks. And they decided to do nothing about it and just let them error out.
That batch is now gone, although there will be plenty of resends over the next week and a half or so that will continue to error out until they are all gone.
Grant
Darwin NT

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105500 - Posted: 17 Mar 2022, 9:42:38 UTC - in response to Message 105499.  

Nobody.
In case you haven't been paying attention, it was the project that released a batch of faulty Tasks. And they decided to do nothing about it and just let them error out.
That batch is now gone, although there will be plenty of resends over the next week and a half or so that will continue to error out until they are all gone.


As usual, I wrote a tweet to the R@H account.

But my questions are: are the R@H servers unattended? Is it possible, in the BOINC server, to activate triggers that warn admins in case of problems?
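
To make the idea of a "trigger" concrete, here is a rough Python sketch of an error-rate check. It is not BOINC server code and does not use any BOINC API; the function names, the outcome labels `"error"` and `"success"`, the 50% threshold, and the minimum sample size are all assumptions chosen for illustration.

```python
from collections import Counter


def failure_rate(outcomes):
    """Fraction of reported results whose outcome is "error" (0.0 if there are none)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["error"] / total


def should_alert(outcomes, threshold=0.5, min_results=20):
    """Return True once enough results are in and the failure rate reaches the threshold.

    The min_results guard avoids raising an alert on a handful of early failures.
    """
    outcomes = list(outcomes)
    return len(outcomes) >= min_results and failure_rate(outcomes) >= threshold


# A batch like the one described in this thread: every returned result is an error.
print(should_alert(["error"] * 300))
# A healthy batch with a few failures.
print(should_alert(["success"] * 95 + ["error"] * 5))
```

A real deployment would still need someone to receive the alert and act on it, which is a separate problem from detecting the failure rate.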

Dr Who Fan

Joined: 28 May 06
Posts: 60
Credit: 222,177
RAC: 4
Message 105524 - Posted: 18 Mar 2022, 19:46:24 UTC - in response to Message 105500.  

As usual, i wrote a tweet to R@H account.

But my questions are: are R@H servers unattended? Is possible, in Boinc server, activate triggers to warn admins in case of problems??

I think you will be waiting indefinitely for a reply.


Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,531,989
RAC: 1,409
Message 105545 - Posted: 19 Mar 2022, 19:03:21 UTC - in response to Message 105500.  
Last modified: 19 Mar 2022, 19:06:23 UTC

Nobody.
In case you haven't been paying attention, it was the project that released a batch of faulty Tasks. And they decided to do nothing about it and just let them error out.
That batch is now gone, although there will be plenty of resends over the next week and a half or so that will continue to error out until they are all gone.


As usual, i wrote a tweet to R@H account.

But my questions are: are R@H servers unattended? Is possible, in Boinc server, activate triggers to warn admins in case of problems??



Haven't you been paying attention to what I have said in other posts about this exact same thing?
This project is NOT monitored.
NO ONE watches Twitter (I hammered them in a tweet and they did nothing). NO ONE watches the boards here. They ignore all emails.
Grant, I think, used to have an in, but not any more.

If they get a 50% return, i.e. Linux and not Windows or the other way around, then they just have a smaller data set to work with, but still data.
If stuff craps out because they don't do the code right, oh well, no need to fix it. Correct it later and resubmit it.

So don't waste your time on Twitter or email. They will all be ignored.

We figure it out on our own, and if not, you're SOL.

As for triggers... who are you kidding? That's too advanced for this group.
They can barely write protein software correctly; do you really think they know anything about the code behind Rosetta? Or how to set alerts based on errors received?

Sorry man, you get what you get good or bad and that's how it is. We just roll with it and bitch about it. But nothing we can do.

Sid Celery

Joined: 11 Feb 08
Posts: 1851
Credit: 34,016,177
RAC: 3,934
Message 105570 - Posted: 20 Mar 2022, 2:14:00 UTC - in response to Message 105499.  

All from this person are corrupt. Who do I contact to tell them?
Nobody.
In case you haven't been paying attention, it was the project that released a batch of faulty Tasks. And they decided to do nothing about it and just let them error out.
That batch is now gone, although there will be plenty of resends over the next week and a half or so that will continue to error out until they are all gone.

Just to say it publicly: I left for France last Monday and returned to London on Thursday night, where the PC I use there was merrily running RB tasks, so I didn't appreciate what Grant PM'd me about while I was away. I have only now (Sunday am) returned home, while there are no Rosetta 4.20 tasks to run, so I missed everything that was going wrong. Sorry about that.
Not sure if it's a plus, but I'll be stuck in one place for the next few months, so hopefully I'll pick up on problems a lot sooner and pass them on.

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105584 - Posted: 20 Mar 2022, 11:35:35 UTC - in response to Message 105545.  
Last modified: 20 Mar 2022, 11:36:00 UTC

As for triggers...who you kidding. That's to advanced for this group.
They can barely write protein software correctly, you really think they know anything about the code behind Rosetta? Or how to set alerts based on error received?


Misunderstanding.
The "trigger functionality" may be introduced by BOINC's developers, not by Rosetta admins.
Alerts based on errors have to be mandatory during server installation/configuration.

Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,531,989
RAC: 1,409
Message 105585 - Posted: 20 Mar 2022, 11:59:24 UTC - in response to Message 105584.  

As for triggers...who you kidding. That's to advanced for this group.
They can barely write protein software correctly, you really think they know anything about the code behind Rosetta? Or how to set alerts based on error received?


Misunderstanding.
The "trigger functionality" may be introduced by Boinc's developers, not by Rosetta admins.
Alerts based on errors have to be mandatory during server installation/configuration


OK, interesting. But the new version is not ready yet.
Someone got it in an uncompiled personal release for Linux, so in the meantime we still have this situation.
But then again, it's just an alert. R@H could just as easily ignore those alerts, just as we ignore constant non-threatening alerts on our systems, unless it blocks their screens from allowing any new input until acknowledged. But they could just click 'ok' or whatever and be done with the problem.

I don't have much faith in their internal production despite all the old rah rah on the homepage and other locations. It's more like the outside sources produce more reliable tasks than internal.

It looks more and more like the outside sources are where the action is now rather than the lab itself. The lab studies the results and reports back the data. They created the backbone that a lot of the programs use now.

I find it interesting the way this is all structured. You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into Robetta, Rosetta and Foldit. A lot of names that are just sub-units of something. It's almost like a circus juggling act.

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105636 - Posted: 22 Mar 2022, 8:58:23 UTC - in response to Message 105585.  

You have the the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit, A lot of names that are just sub units of something. It's almost like a circus juggling act.


I understand you're tired of this situation (like me), but I think you're being a little bit impolite.
This project is not "a circus". It's science, and every kind of help, from simple CPU time to Foldit volunteers, is given with a purpose.

rjs5

Joined: 22 Nov 10
Posts: 268
Credit: 18,911,327
RAC: 16,465
Message 105638 - Posted: 22 Mar 2022, 10:35:16 UTC - in response to Message 105636.  

You have the the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit, A lot of names that are just sub units of something. It's almost like a circus juggling act.


I understand you're tired of this situation (like me), but i think you're a little bit impolite.
This project is not "a circus", it's science and every kind of help, from simply cpu time to Foldit volunteers, is done with a purpose.


Rosetta may not be a "circus", BUT the person integrating the "science program" with the "real world machines" is unqualified to do the job. There are simple warning messages and parameter testing limits that can be implemented that could screen out most of the error situations before they reach volunteer machines.

Simple things like a "Set the ALLOW computer detail switch to enable Python jobs" message. There are many of these informational messages that could be added, but the integrator is unqualified or simply lazy.
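
As an illustration of the kind of parameter check meant here, this is a minimal Python sketch of a pre-submission test. The minimum of 4 residues comes from the error message quoted earlier in this thread; everything else (the function name `check_sequence`, the whitespace-separated residue format) is an assumption and not how Rosetta actually reads its input.

```python
# Limit taken from the error text quoted earlier in this thread:
# "The minimum number of residues for a cyclic peptide is 4."
MIN_CYCLIC_PEPTIDE_RESIDUES = 4


def check_sequence(sequence_text):
    """Return a list of human-readable problems; an empty list means the check passed.

    Assumes (for illustration only) that residues are separated by whitespace.
    """
    residues = sequence_text.split()
    problems = []
    if len(residues) < MIN_CYCLIC_PEPTIDE_RESIDUES:
        problems.append(
            f"sequence has {len(residues)} residues; "
            f"at least {MIN_CYCLIC_PEPTIDE_RESIDUES} are required"
        )
    return problems


# A too-short sequence is caught before it ever becomes a work unit.
print(check_sequence("ALA GLY"))
print(check_sequence("ALA GLY PRO SER"))
```

Run against each input file before a batch is queued, a check like this would turn a 100% failure rate on volunteer machines into a single message on the submitter's side.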

My suggestion: require each researcher submitting WUs to the public to have an identifier embedded in the WU name. Make incompetence public and traceable, and give researchers CREDIT for their successes and failures.
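
A sketch of how such an identifier could be used, in Python. It assumes the leading token of a WU name (for example `preetham` in the names posted earlier in this thread) identifies the submitter or batch; that is a guess from the names seen here, not a documented naming rule, and `example_batch_0001` is an invented name.

```python
from collections import defaultdict


def batch_prefix(wu_name):
    """Return the text before the first underscore (assumed to identify the submitter)."""
    return wu_name.split("_", 1)[0]


def failure_rate_by_prefix(results):
    """Map each prefix to its failure rate, given (wu_name, succeeded) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for wu_name, succeeded in results:
        prefix = batch_prefix(wu_name)
        totals[prefix] += 1
        if not succeeded:
            failures[prefix] += 1
    return {prefix: failures[prefix] / totals[prefix] for prefix in totals}


results = [
    # The two failed tasks reported earlier in this thread.
    ("preetham_gen_26749_0001_0001_0_SAVE_ALL_OUT_2911458_130", False),
    ("preetham_gen_10036_0001_0001_0_SAVE_ALL_OUT_2912745_775", False),
    # Invented name standing in for a task from some other batch.
    ("example_batch_0001", True),
]
rates = failure_rate_by_prefix(results)
print(rates)
```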

8-)

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105643 - Posted: 22 Mar 2022, 13:58:29 UTC - in response to Message 105638.  

My suggestion: require each researcher submitting WU to the public have an identifier embedded in the WU name. Make incompetence public, traceable and give researchers CREDIT for their successes and failures.


Constructive criticism is always welcome!!
In the past, they sometimes put the name of the researcher into the WU names.

P.S.
I'm still waiting for them to introduce your suggestions about CPU optimization (SSEx, AVX).

:-P

Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,531,989
RAC: 1,409
Message 105652 - Posted: 22 Mar 2022, 19:55:21 UTC - in response to Message 105638.  

You have the the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit, A lot of names that are just sub units of something. It's almost like a circus juggling act.


I understand you're tired of this situation (like me), but i think you're a little bit impolite.
This project is not "a circus", it's science and every kind of help, from simply cpu time to Foldit volunteers, is done with a purpose.


Rosetta may not be a "circus", BUT the person integrating the "science program" with the "real world machines" is unqualified to do the job. There are simple warning messages and parameter testing limits that can be implemented that could screen out most of the error situations before they reach volunteer machines.

Simple things like a "Set the ALLOW computer detail switch to enable Python jobs" message. There are many of these informational messages that could be added, but the integrator is unqualified or simply lazy.

My suggestion: require each researcher submitting WU to the public have an identifier embedded in the WU name. Make incompetence public, traceable and give researchers CREDIT for their successes and failures.

8-)


That is what RALPH is for. Alpha and Beta testing before release to Rosetta.
But probably since hardly anyone signs up on RALPH they just try them on their end and if they work, toss them out. Again, if 50% return is valid, I think that is a good enough data set for this person.
You can track non-Python stuff back to Robetta if you have the patience to go through all the listings and see who the submitter is. But that is a lot of work.

Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,531,989
RAC: 1,409
Message 105653 - Posted: 22 Mar 2022, 20:28:57 UTC

The science is good.
The execution to the PC group is bad.
The organization reads like a flow chart.
And if the bottom of the flow chart, Rosetta and Baker Lab, cannot get their act together, make tasks that work on both Linux and Windows, and communicate or listen and respond to people telling them things, then it is a circus act in its function.

The ignorance of the lab in not assigning someone to monitor the boards, to monitor bad tasks, to communicate that they know there are issues, and to fix them or offer a way to fix them, just shows more of the same.
They are in their world of nice chains of proteins and we are out here struggling to understand why tasks do not work so as to give them nice chains of proteins to work with.

It is as if they really don't care what goes on out here, as long as they get results.

I have been with this project since its early stages. Back when Dr. B took the time to write interesting things.
When DEK took care of technical things and watched here for issues popping up.
Or a MOD (grad student or whatever) that also monitored things here and reported back to DEK and also wrote up information on what we were crunching.

This all disappeared long ago back when they started adding names to their group.

We used to read about how the protein chain we had just finished analyzing was then done in crystal and how the results were close or exactly what the computers had come up with.

Now it's just a lot of dead air and a figure-it-out-yourself mentality, and suggestions written to them are ignored. Comments via Twitter or other means are ignored. Emails to the project are ignored.
If you're not a scientist or a whatever dealing with science, they don't want to hear from you.

I keep going because this is my first project. But there are always other interesting, steadier, more technically stable projects that I have also joined.

There are many little things that they could do to make this project so much better. But no. That is not of interest. But the science is good.

[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1621
Credit: 6,457,705
RAC: 65
Message 105669 - Posted: 23 Mar 2022, 12:42:05 UTC - in response to Message 105652.  

That is what RALPH is for. Alpha and Beta testing before release to Rosetta.
But probably since hardly anyone signs up on RALPH they just try them on their end and if they work, toss them out.


Are you kidding?
When, very rarely, they released work on RALPH, it was finished after a few hours.
There are a lot of volunteers ready to test WUs...



©2022 University of Washington
https://www.bakerlab.org