Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home


Previous · 1 . . . 118 · 119 · 120 · 121 · 122 · 123 · 124 . . . 274 · Next

Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 114,374,163
RAC: 52,951
Message 102376 - Posted: 10 Aug 2021, 7:23:32 UTC - in response to Message 102375.  
Last modified: 10 Aug 2021, 7:24:30 UTC

Their job is to do research, not to create meaningless work, so sometimes there is down-time.

It is especially understandable at the moment, with such big improvements being made in the field since AlphaFold 2's showing at CASP14 and, more recently, since the release of its source code.

It would be good to have more information posted about what is happening behind the scenes and the different versions of Rosetta though, especially the vbox versions.
ID: 102376
Albert H.

Joined: 31 Jan 14
Posts: 2
Credit: 8,463,543
RAC: 3,686
Message 102379 - Posted: 10 Aug 2021, 19:38:50 UTC

I've got lots of work at the moment.

Albert
ID: 102379
UBT - wbiz

Joined: 5 Feb 21
Posts: 6
Credit: 695,856
RAC: 861
Message 102380 - Posted: 10 Aug 2021, 21:29:06 UTC - in response to Message 102376.  

Their job is to do research, not to create meaningless work, so sometimes there is down-time.

It is especially understandable at the moment, with such big improvements being made in the field since AlphaFold 2's showing at CASP14 and, more recently, since the release of its source code.

It would be good to have more information posted about what is happening behind the scenes and the different versions of Rosetta though, especially the vbox versions.


I thought RoseTTAFold was one step ahead of AlphaFold 2 now?
ID: 102380
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 114,374,163
RAC: 52,951
Message 102383 - Posted: 11 Aug 2021, 8:48:10 UTC - in response to Message 102380.  

I don't think so - I think RoseTTAFold was closing the gap, but the reality is probably more complicated than that. I would guess that there are lots of areas where there are differences, like improving the training, as well as the modelling.
ID: 102383
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,322,361
RAC: 16,418
Message 102384 - Posted: 12 Aug 2021, 7:49:18 UTC

And it looks like we're out of work again.
Grant
Darwin NT
ID: 102384
Profile Greg_BE

Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 102394 - Posted: 13 Aug 2021, 19:35:59 UTC - in response to Message 102384.  
Last modified: 13 Aug 2021, 20:11:52 UTC

Tasks ready to send 0
Tasks in progress 169215

And now that it is the weekend, nothing will change.
RAH is beginning to make me wonder if it's stable or not.
Bugs, no work for days, etc.
This is not the RAH I started with.
ID: 102394
Profile Cobra

Joined: 9 Nov 05
Posts: 7
Credit: 16,119,677
RAC: 3,323
Message 102401 - Posted: 16 Aug 2021, 3:39:52 UTC - in response to Message 102394.  
Last modified: 16 Aug 2021, 3:40:15 UTC

Tasks ready to send 0
Tasks in progress 169215

And now that it is the weekend, nothing will change.
RAH is beginning to make me wonder if it's stable or not.
Bugs, no work for days, etc.
This is not the RAH I started with.

We've been told to expect periods of down time. It's even at the top of the project home page.

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=14290
https://boinc.bakerlab.org/rosetta/
ID: 102401
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 102414 - Posted: 17 Aug 2021, 18:51:51 UTC
Last modified: 17 Aug 2021, 18:53:09 UTC

I've been getting several batches of Rosetta work, but they all crash immediately they start.

The error message I'm getting is this one - is it only my system or everyone?

17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip


I'd like to report it quickly, but I've had several problems with my PC recently and don't want to say something that's only a local problem to me.
ID: 102414
Falconet

Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 102415 - Posted: 17 Aug 2021, 18:56:31 UTC - in response to Message 102414.  

I've got 9 CD98 tasks. 8 have been running fine for several hours now.

Maybe a project reset could work?
ID: 102415
Profile robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 102416 - Posted: 17 Aug 2021, 19:57:14 UTC - in response to Message 102414.  

I've been getting several batches of Rosetta work, but they all crash immediately they start.

The error message I'm getting is this one - is it only my system or everyone?

17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip


I'd like to report it quickly, but I've had several problems with my PC recently and don't want to say something that's only a local problem to me.

I looked at the log files. All the errors appear to be due to a problem with the database file shared by all of the failed tasks.

I'd recommend a project reset to replace all of the shared files, since there doesn't appear to be a way of replacing just one of the shared files.
ID: 102416
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 102417 - Posted: 18 Aug 2021, 0:35:18 UTC - in response to Message 102416.  

I've been getting several batches of Rosetta work, but they all crash immediately they start.

The error message I'm getting is this one - is it only my system or everyone?

17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip


I'd like to report it quickly, but I've had several problems with my PC recently and don't want to say something that's only a local problem to me.

I looked at the log files. All the errors appear to be due to a problem with the database file shared by all of the failed tasks.

I'd recommend a project reset to replace all of the shared files, since there doesn't appear to be a way of replacing just one of the shared files.

Each time I grab more tasks that file gets replaced. It hadn't been helping up to now.
I haven't reset the project, but I did a re-boot and when Boinc re-started it reported the following without bringing new tasks down:
17/08/2021 23:17:35 | Rosetta@home | Resetting file projects/boinc.bakerlab.org_rosetta/database_357d5d93529_n_methyl.zip: RSA key check failed for file
17/08/2021 23:17:37 | Rosetta@home | Started download of database_357d5d93529_n_methyl.zip
17/08/2021 23:18:59 | Rosetta@home | Finished download of database_357d5d93529_n_methyl.zip
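
For anyone wanting to check their own event log for the same symptom, here is a minimal sketch that scans log lines for the two failure messages quoted in this thread and reports which files were affected. The patterns are assumptions based only on the messages shown above; other client versions may word them differently, and the helper names are made up for illustration.

```python
import re
from collections import Counter

# Event log lines copied from the posts in this thread.
LOG = """\
17/08/2021 19:47:28 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip
17/08/2021 23:17:35 | Rosetta@home | Resetting file projects/boinc.bakerlab.org_rosetta/database_357d5d93529_n_methyl.zip: RSA key check failed for file
17/08/2021 23:17:37 | Rosetta@home | Started download of database_357d5d93529_n_methyl.zip
17/08/2021 23:18:59 | Rosetta@home | Finished download of database_357d5d93529_n_methyl.zip
"""

# Assumption: only these two message shapes indicate a bad file.
FAILURE_PATTERNS = [
    re.compile(r"Signature verification failed for (?P<file>\S+)"),
    re.compile(r"Resetting file (?P<file>[^:\s]+): RSA key check failed"),
]


def failed_files(log_text):
    """Count signature/RSA failures per file name (directory part stripped)."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split(" | ", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue
        message = parts[2]
        for pattern in FAILURE_PATTERNS:
            match = pattern.search(message)
            if match:
                counts[match.group("file").rsplit("/", 1)[-1]] += 1
                break
    return counts


def redownloaded(log_text, file_name):
    """True if the log shows a completed download of file_name."""
    return f"Finished download of {file_name}" in log_text


if __name__ == "__main__":
    for name, count in failed_files(LOG).items():
        status = "re-downloaded" if redownloaded(LOG, name) else "not re-downloaded"
        print(f"{name}: {count} failure(s), {status}")
```

If the same file keeps showing up with failures and no later "Finished download", that would point towards a reset rather than waiting.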


I've now grabbed some more tasks (with over a million appearing in the queue on the front page, I now notice) and the first 3 tasks are running ok up to 7 minutes - no immediate computation errors - so I've got my fingers crossed that it's righted itself.

Thanks for double-checking me
ID: 102417
kennnnnnneth

Joined: 20 Jan 20
Posts: 2
Credit: 17,110
RAC: 0
Message 102418 - Posted: 19 Aug 2021, 14:31:52 UTC

Almost every RAH task I have received for the last couple months has a deadline of less than three days. I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit. Other times, there appears to be no work available at all. I have set my compute preferences to store at most 0.5 days of work, yet I still get mega-tasks from RAH that will take 5-10x that long and cannot possibly be completed before the deadline even if I make my PC a dedicated BOINC server. These tasks are seriously 5-10x as long as anything I have ever received from another project. Since I don't appear to receive credit for tasks that exceed the deadline, I end up aborting 90% of RAH tasks.

This is comically absurd. I understand there are periods of no work, which indicates that the RAH community is supplying an overabundance of resources to the RAH team, but is there no way to spread the work out more evenly? Rather than weeks of no work followed by 8GB tasks due in 2 days that I'd need a supercomputer to crunch in time, maybe break those tasks down into tasks 1/10th the size and release them over a longer period?

I run BOINC on my network services server, my backup server, my wife's graphics workstation (juicy dual GPUs that idle most of the day), and in the background on my laptop when it's on AC power. I, like most contributors, do not have a fluid-cooled Xeon server stack dedicated to crunching data. I am also a member of a half dozen other projects with which I have no issues (WCG, LHC, MLC, etc). I crank out 10-20 tasks per day across my computers on those projects. RAH on the other hand hasn't seen a single drop of work from me in almost 2 weeks because of this problem.

I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?
ID: 102418
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 102419 - Posted: 19 Aug 2021, 14:42:39 UTC - in response to Message 102418.  
Last modified: 19 Aug 2021, 14:44:38 UTC

I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?

You are relatively new. It is always feast or famine here. A shortage during the summer when the researchers are on vacation is nothing unusual.

The larger issue, which most people have avoided or are unaware of, is whether the new AI work will be done in-house or sent out to us.
There may be less work in the future, or maybe even more. You can stay to find out, or leave.

PS - Don't wait for the project to communicate any of that to us. They don't.
ID: 102419
kennnnnnneth

Joined: 20 Jan 20
Posts: 2
Credit: 17,110
RAC: 0
Message 102420 - Posted: 19 Aug 2021, 14:45:28 UTC - in response to Message 102419.  

thank you, i will direct my cycles to a better-managed project.
ID: 102420
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,322,361
RAC: 16,418
Message 102422 - Posted: 20 Aug 2021, 6:47:38 UTC - in response to Message 102418.  
Last modified: 20 Aug 2021, 6:55:25 UTC

Almost every RAH task I have received for the last couple months has a deadline of less than three days.
Yes, all Rosetta Tasks have a deadline of 3 days.
The sooner the Project gets the results back, the sooner that information can be used in the real world, for medical research. So unlike many other projects the results are actually time critical. Sooner is better.
Hence the 3 day deadlines (and having 3 days to do 8 hours of work is not a big ask IMHO).

Ideally there is no need for a cache at all, but if you feel the need, 0.5 days + 0.01 additional days is plenty. If you run more than one project, 0 cache is best.



I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit.
Or you could have posted here and got help sorting out whatever is wrong with your system.
The default processing time for a Rosetta Task is 8 hours. Some may take longer, some may finish sooner but 95%+ will run for the Target CPU time set.

The only projects showing on your system are Rosetta & LHC.
And both projects show issues with your computer being overcommitted.

An LHC Task
    Name CMS_2587858_1629250354.094329_0
Run time 13 hours 56 min 33 sec
CPU time  6 hours 37 min  5 sec
Taking 14 hours to do 6 and a half hours work is quite ridiculous.

A Rosetta Task
    Name cd98_again_graft2_bcov_v1_xaj_SAVE_ALL_OUT_IGNORE_THE_REST_9tw1ev3j_1728867_2_0
Run time 8 hours 34 min 18 sec
CPU time 6 hours 15 min 30 sec
And taking 8 and a half hours to do just over 6 hours work isn't good either.

Here is one from one of my systems
    Name cd98_again_graft2_bcov_v1_xab_SAVE_ALL_OUT_IGNORE_THE_REST_9un7be8a_1728858_24_0
Run time 8 hours  0 min  3 sec
CPU time 7 hours 58 min 58 sec
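
To put numbers on the comparison, here is a small sketch that computes CPU efficiency (CPU time divided by run time) for the three tasks listed above. The durations are copied from this post; the function names are just for illustration.

```python
import re


def to_seconds(text):
    """Convert a string like '13 hours 56 min 33 sec' to seconds."""
    match = re.fullmatch(
        r"\s*(\d+)\s*hours?\s+(\d+)\s*min\s+(\d+)\s*sec\s*", text
    )
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(run_time, cpu_time):
    """Fraction of wall-clock run time actually spent on the CPU."""
    return to_seconds(cpu_time) / to_seconds(run_time)


# (run time, CPU time) as listed in the post.
TASKS = {
    "LHC CMS task": ("13 hours 56 min 33 sec", "6 hours 37 min  5 sec"),
    "Rosetta cd98 task": ("8 hours 34 min 18 sec", "6 hours 15 min 30 sec"),
    "Rosetta cd98 task (Grant's system)": ("8 hours  0 min  3 sec", "7 hours 58 min 58 sec"),
}

if __name__ == "__main__":
    for name, (run, cpu) in TASKS.items():
        print(f"{name}: {cpu_efficiency(run, cpu):.1%}")
```

An efficiency well below 100% on a CPU-only task suggests the cores are being shared with something else.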

Are you running Folding at Home on the system? If so, you need to limit the number of cores/threads BOINC can use, so they're not trying to do FAH & BOINC work at the same time. If FAH needs 3, then allow BOINC to use only 5.
If you are running some other CPU intensive software on the system, then limit the amount of cores/threads BOINC uses. If that CPU heavy software needs 2 cores, then limit BOINC to only 6.
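
As a rough sketch of the arithmetic, the snippet below works out how many threads to leave for BOINC and the matching "use at most N% of the CPUs" value. It assumes an 8-thread machine, as the figures above imply. The XML fragment assumes the `max_ncpus_pct` tag inside `global_prefs_override.xml`; check it against your own client before relying on it, since the same limit can be set from the Manager's computing preferences instead.

```python
def boinc_cpu_share(total_threads, reserved_threads):
    """Return (threads left for BOINC, percentage of CPUs BOINC may use)."""
    if not 0 <= reserved_threads < total_threads:
        raise ValueError("reserved_threads must be >= 0 and < total_threads")
    usable = total_threads - reserved_threads
    return usable, 100.0 * usable / total_threads


def override_fragment(cpu_percent):
    """Build a preferences override fragment (tag names are an assumption)."""
    return (
        "<global_preferences>\n"
        f"   <max_ncpus_pct>{cpu_percent:g}</max_ncpus_pct>\n"
        "</global_preferences>\n"
    )


if __name__ == "__main__":
    # Hypothetical 8-thread host with 3 threads reserved for other software.
    threads, percent = boinc_cpu_share(total_threads=8, reserved_threads=3)
    print(f"BOINC threads: {threads}, CPU limit: {percent:g}%")
    print(override_fragment(percent))
```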



Then the Rosetta work will complete in time, no missed deadlines, no ridiculous processing times (and the same will occur for your other project, LHC. No missed deadlines & much improved performance due to no wasted CPU time). And whatever other software you are running will also perform much better as CPU cores won't be trying to run more than one application at a time.


NB - part of the reason for your lack of Credit for Rosetta is that your system hasn't run the BOINC Benchmarks, which are used for determining Credit, so it is using the default values, which tend to be much lower than the actual values.
Grant
Darwin NT
ID: 102422
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 102426 - Posted: 21 Aug 2021, 2:47:11 UTC - in response to Message 102418.  
Last modified: 21 Aug 2021, 2:55:31 UTC

Almost every RAH task I have received for the last couple months has a deadline of less than three days.

They're not less than 3 days. They're exactly 3 days, to the second. 72 hours each.
They're also 8 hours long each, which is somewhat less than 72 hours.

I always have to abort them and refresh or I'm wasting loads of cycles on something that will grant 0 credit. Other times, there appears to be no work available at all. I have set my compute preferences to store at most 0.5 days of work, yet I still get mega-tasks from RAH that will take 5-10x that long and cannot possibly be completed before the deadline even if I make my PC a dedicated BOINC server. These tasks are seriously 5-10x as long as anything I have ever received from another project. Since I don't appear to receive credit for tasks that exceed the deadline, I end up aborting 90% of RAH tasks.

I don't know why you have to abort them, but looking at the ones you have aborted for some unknown reason, they all awarded credits for the time you gave them.
If you allocate 0.5 days of cache, you'll receive 12 8hr tasks at most, even if you have no other tasks from other projects. One will run on each CPU core, so 8 will run first, taking up 8hrs, followed by the 4 remaining tasks on your 8 cores. This will take 16 hours out of the 72 hour deadline, leaving 56 hours to do anything else you like (or nothing if you wish).
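
The arithmetic above can be sketched as follows. This is a simplified model, not how the BOINC scheduler actually decides what to fetch: it assumes the cache is filled with exactly cache_days x cores of core-hours, every task runs for the full 8 hours, and nothing else competes for the cores.

```python
import math


def cache_plan(cache_days, cores, task_hours=8, deadline_hours=72):
    """Return (tasks fetched, hours to clear them, spare hours before deadline)."""
    core_hours = cache_days * 24 * cores          # total work the cache holds
    tasks = int(core_hours // task_hours)         # whole tasks that fit
    waves = math.ceil(tasks / cores)              # batches of one task per core
    busy_hours = waves * task_hours
    return tasks, busy_hours, deadline_hours - busy_hours


if __name__ == "__main__":
    tasks, busy, spare = cache_plan(cache_days=0.5, cores=8)
    print(f"{tasks} tasks, done in {busy} hours, {spare} hours to spare")
```

With a 0.5 day cache on 8 cores this gives 12 tasks, 16 hours of crunching and 56 hours to spare, matching the figures above.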

This is comically absurd. I understand there are periods of no work, which indicates that the RAH community is supplying an overabundance of resources to the RAH team, but is there no way to spread the work out more evenly? Rather than weeks of no work followed by 8GB tasks due in 2 days that I'd need a supercomputer to crunch in time, maybe break those tasks down into tasks 1/10th the size and release them over a longer period?

It may or may not be absurd, but take up scheduling issues with Boinc rather than any of the projects.
It's true to say that you don't need a supercomputer to run tasks here (as long as you have sufficient RAM & disk space) as tasks report back however much or little you're able to process in 8hrs of CPU time. There isn't a defined amount of work you need to process at Rosetta, which might be shorter on a faster CPU and longer on a slow CPU. Rather, there's a defined amount of time you dedicate your cores to run the tasks at whatever pace your CPU can do so.

As far as deadlines go, that's the project's business. Users (the tail) don't tell projects (the dog) when they need their work returned by.
As far as task runtime goes, this can be a user-defined setting to reduce the default runtime, but seeing as I don't understand why you struggle to return any 8hr tasks within a 72hr deadline I'm reluctant to encourage it. It would also mean you get more shorter tasks to run within your cache size, which seems to be the opposite of what you want to do, which also defeats what you're complaining about.

I run BOINC on my network services server, my backup server, my wife's graphics workstation (juicy dual GPUs that idle most of the day), and in the background on my laptop when it's on AC power. I, like most contributors, do not have a fluid-cooled Xeon server stack dedicated to crunching data. I am also a member of a half dozen other projects with which I have no issues (WCG, LHC, MLC, etc). I crank out 10-20 tasks per day across my computers on those projects. RAH on the other hand hasn't seen a single drop of work from me in almost 2 weeks because of this problem.

If you're permanently connected to the internet, there's no real reason to have anything more than Boinc's default offline cache size, which I think is 0.25 days, and let Boinc schedule which tasks you get from which project to meet the resource share you've set up. It would also reduce the number of tasks that come down to, I think, 1 default length task per core for Rosetta.

I guess my real question: is this project dead? Is it worth keeping RAH on my clients or should I put my cycles toward a project with cohesive administration?

I don't like to be blunt (this is a lie I often tell) but the administration that isn't down to Boinc is entirely down to you, so I don't think you're going to find "a more cohesive project". Just make your settings appropriate to the projects you run and the time you're prepared to allow your computers to run Boinc.

tl;dr reduce your cache size back to the default 0.25 days or less and I'm pretty sure all your "problems" go away

Edit: the "Store up to an additional..." need be no longer than the default 0.1 days - I suspect the other half of your "problem" is that you have this set inappropriately too.
Point being, if the combined figures come too close to the shortest deadline, your settings plan to fail, so change them so you plan to succeed instead.
ID: 102426
wolfman1360

Joined: 18 Feb 17
Posts: 72
Credit: 18,450,036
RAC: 0
Message 102460 - Posted: 26 Aug 2021, 3:06:37 UTC

Several of these tasks are running for twice my set computation time and not checkpointing to boot. I hope I get some sort of credit for these.


Application: Rosetta 4.20
Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676
State: Running
Received: 8/24/2021 1:28:51 PM
Report deadline: 8/27/2021 1:28:51 PM
Estimated computation size: 80,000 GFLOPs
CPU time: 16:45:42
CPU time since checkpoint: 16:45:42
Elapsed time: 16:44:31
Estimated time remaining: 01:14:20
Fraction done: 93.109%
Virtual memory size: 955.36 MB
Working set size: 802.86 MB
Directory: slots/3
Process ID: 2596365
Progress rate: 5.400% per hour
Executable: rosetta_4.20_x86_64-pc-linux-gnu
ID: 102460
Profile robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 102461 - Posted: 26 Aug 2021, 4:25:22 UTC - in response to Message 102460.  

Several of these tasks are running for twice my set computation time and not checkpointing to boot. I hope I get some sort of credit for these.


Application: Rosetta 4.20
Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676

[snip]

Rosetta@Home tasks have sections known as decoys. The decision on whether to end the task normally occurs only at the end of a decoy.

Your very long run time looks like you got at least one task with a very long time per decoy.
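
To illustrate why one slow decoy can blow past the target time, here is a toy simulation. It is not the Rosetta code: it simply assumes a decoy always runs to completion and that the "have we reached the target CPU time?" check happens only between decoys. The decoy durations are made up.

```python
def simulate_task(decoy_hours, target_hours=8.0):
    """Return (decoys completed, CPU hours used) for a list of decoy durations.

    Assumption: the stop decision is made only at the end of each decoy.
    """
    elapsed = 0.0
    completed = 0
    for hours in decoy_hours:
        elapsed += hours            # a decoy is never interrupted part-way
        completed += 1
        if elapsed >= target_hours:  # check only at the decoy boundary
            break
    return completed, elapsed


if __name__ == "__main__":
    # Hypothetical task with short decoys: stops soon after the 8 hour target.
    print(simulate_task([2.5] * 10))
    # Hypothetical task whose first decoy takes 17 hours: no chance to stop earlier.
    print(simulate_task([17.0, 17.0]))
```

In the second case the task completes a single decoy and uses 17 hours, roughly the situation described above.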

I have no information on whether checkpoints are also written only at the ends of decoys. However, if so, this is probably why you also had the long time with no checkpoints.

You might want to read the log file from that task to check whether it completed only one decoy.

Also check if you can read the log files from any other tasks for that workunit. If all of them were that slow, expect to get some credit as long as you either returned it by the deadline, or returned it before the quorum was met.
ID: 102461
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,322,361
RAC: 16,418
Message 102462 - Posted: 26 Aug 2021, 6:35:13 UTC - in response to Message 102460.  

Several of these tasks are running for twice my set computation time and not checkpointing to boot.
The default Target CPU time is 8 hours. There is a watchdog timer that kicks in 10 hours after the target time if a Task overruns it.
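
As a quick sketch of what that means for the task in question, the snippet below works out how much CPU time is left before the watchdog would step in, assuming the cutoff is simply target time plus 10 hours (8 + 10 = 18 hours with the default target). The CPU time of 16:45:42 is taken from the task properties posted earlier in the thread.

```python
from datetime import timedelta


def parse_hms(text):
    """Parse 'HH:MM:SS' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def time_until_watchdog(cpu_time, target_hours=8, grace_hours=10):
    """CPU time remaining before the assumed watchdog cutoff (never negative)."""
    cutoff = timedelta(hours=target_hours + grace_hours)
    return max(cutoff - parse_hms(cpu_time), timedelta(0))


if __name__ == "__main__":
    print(time_until_watchdog("16:45:42"))  # 1:14:18
```

That is within a couple of seconds of the "Estimated time remaining" of 01:14:20 shown for the task, which suggests the estimate may be counting down to the watchdog rather than to the end of the decoy.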



I hope I get some sort of credit for these.
Credit is being given for them, although it is very, very, very low paying.
Grant
Darwin NT
ID: 102462
MStenholm

Joined: 18 Apr 20
Posts: 15
Credit: 21,989,507
RAC: 34,470
Message 102463 - Posted: 26 Aug 2021, 8:45:57 UTC - in response to Message 102460.  
Last modified: 26 Aug 2021, 8:46:33 UTC

Several of these tasks are running for twice my set computation time and not checkpointing to boot. I hope I get some sort of credit for these.


Application: Rosetta 4.20
Name: rb_08_23_108315_111529_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_06_05_1729195_676

I noticed 5 in the same series that had exceeded, or were about to exceed, the 8 hours last night. I aborted them and around 30 others as well. Today I noticed that I did get points for the ones that did run 6-10 hours, but only up to the 8 hours.
ID: 102463



©2024 University of Washington
https://www.bakerlab.org