Rosetta needs 6675.72 MB RAM: is the restriction really needed?




Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101840 - Posted: 17 May 2021, 8:58:15 UTC - in response to Message 101839.  

I'm running Rosetta on a machine with 16gb of ram and Rosetta is running 8 tasks at once and 2 other projects are using the other 2 available cores and I'm not having any problems getting and returning tasks.
Since you make use of only half of your available cores/threads then it's not surprising that you're not having issues. If you were to use all of your cores & threads, then with so little RAM that system would be having issues just like all the others are.
Surely the point is how many cores are used for Rosetta, not how many cores are in use overall.
I thought that was obvious as the problem relates to doing Rosetta work, so i didn't explicitly state that i was referring to using half or all cores/threads for Rosetta.



I wouldn’t consider filling either machine with just Rosetta with or without the current config problem because of the L3 cache requirements.
?
For all of the arguments about using how many cores or threads for what, that just isn't on the list.
Grant
Darwin NT
mikey

Joined: 5 Jan 06
Posts: 1895
Credit: 8,923,505
RAC: 1,107
Message 101841 - Posted: 17 May 2021, 10:27:40 UTC - in response to Message 101837.  

I'm running Rosetta on a machine with 16gb of ram and Rosetta is running 8 tasks at once and 2 other projects are using the other 2 available cores and I'm not having any problems getting and returning tasks.


Since you make use of only half of your available cores/threads then it's not surprising that you're not having issues. If you were to use all of your cores & threads, then with so little RAM that system would be having issues just like all the others are.


That's what I was thinking as well and why an upgrade to 32gb of ram is in my amazon wish list
Bryn Mawr

Joined: 26 Dec 18
Posts: 387
Credit: 11,777,758
RAC: 2,495
Message 101843 - Posted: 17 May 2021, 12:33:57 UTC - in response to Message 101840.  


I wouldn’t consider filling either machine with just Rosetta with or without the current config problem because of the L3 cache requirements.
?
For all of the arguments about using how many cores or threads for what, that just isn't on the list.


It has been mentioned in several threads I’ve been involved in but maybe not on this forum?

From memory the main types needing large amounts of cache are MIP on WCG and Rosetta (with mention of shared code?), Africa Rainfall on WCG and CPDN.
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101845 - Posted: 18 May 2021, 2:37:55 UTC - in response to Message 101836.  

And the other major factor is that, if they have a particular batch of work which makes particularly large resource demands because of the nature of the questions they want answered, they're not going to stop asking those questions just because 50% or more hosts don't have the capacity to assist in answering them. Because up to 50% will have the capacity and the bottom line is getting the answer to their question and nothing else, even if that means it takes a little longer to do.
That's all well and good- but it's absolutely insane to set such high minimum requirements for Tasks that don't come anywhere close to using the amounts they are requiring as it stops many systems from being able to process them, or results in cores/threads going unused by Rosetta that are available for its use.
As I mentioned before- we have had high RAM requirement Tasks on the project before- Tasks that required more than double the amount of RAM of any Task I have seen since this excessive configuration value issue started. And people were able to continue processing the existing Tasks at the time without issue as none of them had excessive minimum RAM or disk space requirements above & beyond what they actually required.

Setting a limit that is double what is actually required, just in case, is one thing. But to have a requirement that is 17 times larger than the largest value ever used is beyond ridiculous, and results in them having less resources to process the work they want done. If they really want this work processed, then they should make use of the resources that are available & not block systems that are capable of processing it by having unrealistic & excessive configuration values.

I know what you're saying. It looks like a huge amount of processing resource unnecessarily taken off the table for no reason that's apparent at our end.

I haven't fed back at all following the last changes that were made, so I've now done that and raised your point on top.

Now I've reviewed what it was I said before, it is something I pointed out originally, but it seems to have got lost in focussing on the changes we've had, so it's worth highlighting again now instead of me just guessing.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101846 - Posted: 18 May 2021, 8:27:49 UTC - in response to Message 101843.  


I wouldn’t consider filling either machine with just Rosetta with or without the current config problem because of the L3 cache requirements.
?
For all of the arguments about using how many cores or threads for what, that just isn't on the list.


It has been mentioned in several threads I’ve been involved in but maybe not on this forum?

From memory the main types needing large amounts of cache are MIP on WCG and Rosetta (with mention of shared code?), Africa Rainfall on WCG and CPDN.
Yes, there are programmes where caches can have a massive impact on performance (because the data being worked on is small enough to fit within the cache). And those where they can still have a significant effect. Then there are those (such as Rosetta) where the size of the caches has little if any impact on processing performance- whether just one core is used or all.


When i first started here there was heated discussion on the issue, until someone actually did some monitoring of Rosetta running & the end result was that on current hardware, even with large multicore/thread systems, the caches have very little impact on Rosetta's compute performance- the present caches are sufficient even when all threads/cores are put to use.
Grant
Darwin NT
Bryn Mawr

Joined: 26 Dec 18
Posts: 387
Credit: 11,777,758
RAC: 2,495
Message 101848 - Posted: 18 May 2021, 13:25:23 UTC - in response to Message 101846.  


I wouldn’t consider filling either machine with just Rosetta with or without the current config problem because of the L3 cache requirements.
?
For all of the arguments about using how many cores or threads for what, that just isn't on the list.


It has been mentioned in several threads I’ve been involved in but maybe not on this forum?

From memory the main types needing large amounts of cache are MIP on WCG and Rosetta (with mention of shared code?), Africa Rainfall on WCG and CPDN.
Yes, there are programmes where caches can have a massive impact on performance (because the data being worked on is small enough to fit within the cache). And those where they can still have a significant effect. Then there are those (such as Rosetta) where the size of the caches has little if any impact on processing performance- whether just one core is used or all.


When i first started here there was heated discussion on the issue, until someone actually did some monitoring of Rosetta running & the end result was that on current hardware, even with large multicore/thread systems, the caches have very little impact on Rosetta's compute performance- the present caches are sufficient even when all threads/cores are put to use.


Thank you, that’s very good to know and I’ll remove my restriction on Rosetta concurrency.
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101854 - Posted: 19 May 2021, 2:19:37 UTC - in response to Message 101845.  

And the other major factor is that, if they have a particular batch of work which makes particularly large resource demands because of the nature of the questions they want answered, they're not going to stop asking those questions just because 50% or more hosts don't have the capacity to assist in answering them. Because up to 50% will have the capacity and the bottom line is getting the answer to their question and nothing else, even if that means it takes a little longer to do.
That's all well and good - but it's absolutely insane to set such high minimum requirements for Tasks that don't come anywhere close to using the amounts they are requiring as it stops many systems from being able to process them, or results in cores/threads going unused by Rosetta that are available for its use.
As I mentioned before- we have had high RAM requirement Tasks on the project before - Tasks that required more than double the amount of RAM of any Task I have seen since this excessive configuration value issue started. And people were able to continue processing the existing Tasks at the time without issue as none of them had excessive minimum RAM or disk space requirements above & beyond what they actually required.

Setting a limit that is double what is actually required, just in case, is one thing. But to have a requirement that is 17 times larger than the largest value ever used is beyond ridiculous, and results in them having less resources to process the work they want done. If they really want this work processed, then they should make use of the resources that are available & not block systems that are capable of processing it by having unrealistic & excessive configuration values.
I know what you're saying. It looks like a huge amount of processing resource unnecessarily taken off the table for no reason that's apparent at our end.

I haven't fed back at all following the last changes that were made, so I've now done that and raised your point on top.

Now I've reviewed what it was I said before, it is something I pointed out originally, but it seems to have got lost in focussing on the changes we've had, so it's worth highlighting again now instead of me just guessing.

Grant, I asked the question and I wasn't happy with what I asked because 1) it was a bit insipid on my part and 2) it didn't properly represent the point you're making.
So I re-wrote my question to better reflect your point and sent that as well.

And I've had a reply, which didn't respond to your point and offered something I wasn't asking for as it doesn't solve the root cause of the issue, so I'm going to press your point further.
I'm really trying to get you an answer you'll be happy with, while I'm anxious not to become a nuisance and jeopardise the fantastic cooperation I've had up to now, so I'm going to consider how to frame it first.

In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary.

Related to that, one part of the reply I had stated "the Robetta structure prediction server, which did submit jobs with varying memory requirements more in tune with the actual usage, is using less of R@h lately" so that's a factor, while also indicating it is possible to tune memory req'ts closer to actual usage. That may be my best route into the reply I write.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101855 - Posted: 19 May 2021, 8:10:32 UTC - in response to Message 101854.  

In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary.
Lately there has been much more of a mix of Tasks than there was previously, so i suspect many of the newer Tasks do have lower memory requirement configuration values than the ones that started all these problems, so more people are able to get more work running at any given time.

But if the work mix changes back towards predominately Tasks with the high RAM requirement values, then the amount of work being done will drop off significantly again.
Ideally the configuration value would be no more than 1.5x the highest amount actually used. But as long as it's less than 3 times that maximum actually used value then even the RAM limited systems will continue to be able to do Rosetta work (even if some Tasks end up waiting for memory on & off at least they'll still be able to process some Tasks). And over the next couple of months the last of the original batch of Tasks that began all these issues will be gone from the system.

Basically if they just go back to whatever it was they were doing before they made these RAM/storage configuration changes then everything will come out OK.
If someone does come up with some Tasks that do require considerably more RAM/storage space than at present, then that would be the time to revisit these changes (but just for that particular batch of Tasks, not all batches of them...).
Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101862 - Posted: 21 May 2021, 1:14:24 UTC - in response to Message 101855.  

In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary.
Lately there has been much more of a mix of Tasks than there was previously, so i suspect many of the newer Tasks do have lower memory requirement configuration values than the ones that started all these problems, so more people are able to get more work running at any given time.

But if the work mix changes back towards predominately Tasks with the high RAM requirement values, then the amount of work being done will drop off significantly again.

And that's the case right now. More pre_helical_bundles asking for 6675MB right now and IP (In-Progress) down to 430k already.

Ideally the configuration value would be no more than 1.5x the highest amount actually used. But as long as it's less than 3 times that maximum actually used value then even the RAM limited systems will continue to be able to do Rosetta work (even if some Tasks end up waiting for memory on & off at least they'll still be able to process some Tasks). And over the next couple of months the last of the original batch of Tasks that began all these issues will be gone from the system.

Basically if they just go back to whatever it was they were doing before they made these RAM/storage configuration changes then everything will come out OK.
If someone does come up with some Tasks that do require considerably more RAM/storage space than at present, then that would be the time to revisit these changes (but just for that particular batch of Tasks, not all batches of them...).

I did about 10 drafts of my reply, was about to send it, checked another few things, did another 2 or 3 drafts, drawing on everything that's been said before finally sending it.
I saw that the "cd98" tasks were calling for the new 3.5GB setting and using 1.3GB and running fine. That's the closest RAM-used got to RAM-allocated - still a long way off, but not a 6.6GB setting using 500MB, or another I saw with an 8.6GB setting only using 400MB. These latter 2 examples are what's causing the entirety of the problems, made worse by the fact there were probably 20 million pre_helical_bundles tasks demanding 6.6GB (now down to 12 or 13m so a way to go yet).

Anyway, see next message
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101863 - Posted: 21 May 2021, 2:28:11 UTC

Maybe I'm too squeamish.

I have been.

The 4th core is running a relatively new task "f60030e2d399cf97bd574292ff707fcd_fae0a51cf659d300dc90ab2264960253_barrel6_L2L5L5L7L5L7L3_relax_SAVE_ALL_OUT_1393099_4" (should I call this "barrel6"?) whose Virtual memory size is only 380MB, but looking at my "client_state.xml" file, it's set up to ask for 8583MB RAM and 1907MB Disk space. You might call this misconfigured too, but given it's after the adjustments made, it would be deliberate, so who's to call it misconfigured?
Me.

Bottom line? You're right.

I think the only point I'm making is that if you think it might get better after the huge number of pre_helical_bundles tasks are worked through, I wouldn't personally bank on it.
Which means things won't get any better than they are now, and may even get worse if we get greater numbers of Tasks that are configured for such ridiculously excessive amounts of RAM above & beyond the maximum that they will actually use.

It's apparent now this was a particularly large batch of tasks.
The original offer I got was that when a large batch was being issued, they'd be reviewed to ensure the RAM & Disk demands were appropriately sized.
This whole thing has blown up because they weren't just inappropriate, but on another planet, preventing default users from running them at all - only nerds like you and me.

I wasn't happy this would be looked at by exception for large batches, so I'm now told it's going to be the norm for all batches, including external researchers who submit tasks to the queue, tasks for which the RAM and Disk req'ts aren't currently known at all at the project.

Essentially, this is why RAM and Disk req'ts have been so large. They were a kind of catch-all amount to ensure tasks that were coming from all over would have sufficient. Not because they did need it, but because it wasn't possible to know what they needed.

The fact would be, for each researcher with a batch to submit, who'd no doubt run a few locally first to ensure they're doing what they're supposed to do, finding out what the resource req'ts actually are would be trivial - it's reported for each task. Find the peak amount, add a margin for safety, whether a %age or a default amount (or both) and set it to be that.
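As a rough illustration of the "peak plus margin" sizing described above, here is a minimal Python sketch. The function name, the 50% margin and the 256 MB floor are assumptions picked purely for illustration; they are not values the project is known to use.

```python
def suggest_memory_bound_mb(peak_usages_mb, margin_pct=50.0, min_margin_mb=256.0):
    """Suggest a RAM requirement from a few local test runs.

    peak_usages_mb: peak RAM (in MB) reported by each local test task.
    The margin is the larger of a percentage of the peak and a fixed
    amount, i.e. the "%age or a default amount (or both)" idea.
    All default values here are hypothetical.
    """
    if not peak_usages_mb:
        raise ValueError("need at least one local test run")
    peak = max(peak_usages_mb)
    margin = max(peak * margin_pct / 100.0, min_margin_mb)
    return peak + margin


# Three hypothetical local runs of one batch.
print(suggest_memory_bound_mb([380.0, 440.0, 560.0]))
```

With the peaks shown, the percentage margin (280 MB) is larger than the fixed floor, so the suggestion is the 560 MB peak plus 280 MB.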

As much the point is pushing this back to the researchers themselves.
When I come along complaining about some task or other, the admin has to search through for that batch of tasks, see what the setting is, see what it's using and put that right. Then I'll complain about another, the same. Then another, the same. It's not long before a setting large enough for <everything> is put in so the admin can get on with his own work. Sound familiar?

And it's no better if the external researcher submits the resource req't with the batch for the admin to put in. That's permanent extra work for little benefit.

So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions. Tasks get through to users who can run them, without limiting it to a select few with abundant resources, everyone gets tasks they can run in greater numbers than before, the researcher gets their batch completed earlier than they used to. Everyone's a winner with no-one having artificial hurdles put in their way.

That's the theory anyway. Let's see how it turns out over the next few weeks.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101866 - Posted: 21 May 2021, 6:31:38 UTC - in response to Message 101863.  
Last modified: 21 May 2021, 6:35:13 UTC

So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions.
That's the ideal- they run a few Tasks locally- they can set it to a 1 hour or even 30min runtime to speed things up -and see just how much RAM they actually do use. Bump it up by 50% and use that value for the batch when it's released here.


The next best thing would take some work to set up, but it would look after itself from then on in.

For all new work set the default RAM value to 2GB, disk space to 500MB.
Run a Cron job once every 24hrs that checks the max RAM values actually used- by batch -of work that has been completed over the last 24hrs. Then run a script to set the required RAM value to the max * 1.5 for each batch of Tasks queued up. Next time the script is run, if the new value is less than the old value, don't change it. If it's more, change it to the new value.
I suspect that after 3 days you'd pretty much have the highest possible value, so no need to run the script on that batch again.

Set up a file to keep track of how many times a batch has had the script run, adding new batches to that list as they are submitted. That way you reduce the database load by only running the script to update RAM values for batches that need it. Once it's been done 3 times (by my WAG) they won't need it done again (by then the likelihood of the actually used RAM values being any higher than the current max RAM * 1.5 value would be bugger all IMHO).
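The daily update described above could be sketched roughly as follows. This is only an illustration of the logic in this post, not the project's actual script: the function, the dictionaries standing in for the database and the tracking file, and the batch names are all hypothetical.

```python
SAFETY_FACTOR = 1.5        # bound = max observed RAM * 1.5
MAX_RUNS = 3               # stop adjusting a batch after 3 passes (the WAG above)
DEFAULT_BOUND_MB = 2048.0  # 2GB default for new work


def update_batch_bounds(bounds_mb, run_counts, completed_peaks_mb):
    """One daily pass over batches that returned work in the last 24 hours.

    bounds_mb:          batch name -> current RAM bound in MB
    run_counts:         batch name -> number of passes already done
    completed_peaks_mb: batch name -> peak RAM (MB) of tasks completed
                        in the last 24 hours
    """
    for batch, peaks in completed_peaks_mb.items():
        bounds_mb.setdefault(batch, DEFAULT_BOUND_MB)
        runs = run_counts.get(batch, 0)
        if runs >= MAX_RUNS or not peaks:
            continue  # batch is settled, or nothing was returned
        candidate = max(peaks) * SAFETY_FACTOR
        # First pass replaces the default; later passes only ever raise it.
        if runs == 0 or candidate > bounds_mb[batch]:
            bounds_mb[batch] = candidate
        run_counts[batch] = runs + 1
    return bounds_mb, run_counts


bounds, counts = {}, {}
update_batch_bounds(bounds, counts, {"batch_a": [560.0, 400.0]})
update_batch_bounds(bounds, counts, {"batch_a": [300.0]})
print(bounds, counts)
```

In a real setup the two dictionaries would be the work unit database and the tracking file; keeping the run count per batch is what lets settled batches be skipped.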




As for the disk space required- the only time i've seen a large amount of space used by a Task was when we had some that were erroring out, and producing 100MB+ result/std_error files. Even on my system that has multiple Rosetta & Mini Rosetta application versions, with 8 hour caches & on a 6c/12thread CPU, the most disk space i have seen Rosetta require is 2.5GB. So 500MB per Task is way more disk space than will ever be needed, but it's not a huge ask & shouldn't cause issues except for those with very small amounts of storage allocated as available to Rosetta & BOINC.
Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101874 - Posted: 22 May 2021, 1:33:37 UTC - in response to Message 101866.  
Last modified: 22 May 2021, 1:41:05 UTC

So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions.
That's the ideal- they run a few Tasks locally- they can set it to a 1 hour or even 30min runtime to speed things up -and see just how much RAM they actually do use. Bump it up by 50% and use that value for the batch when it's released here.

The next best thing would take some work to set up, but it would look after itself from then on in.

For all new work set the default RAM value to 2GB, disk space to 500MB.
Run a Cron job once every 24hrs that checks the max RAM values actually used- by batch -of work that has been completed over the last 24hrs. Then run a script to set the required RAM value to the max * 1.5 for each batch of Tasks queued up. Next time the script is run, if the new value is less than the old value, don't change it. If it's more, change it to the new value.
I suspect that after 3 days you'd pretty much have the highest possible value, so no need to run the script on that batch again.

And I suspect you're exactly right, from what I've just read.
Rather than any manual intervention at all, completed jobs for each batch are going to be reviewed and reflected back into the vast majority of the queue for that batch - bar the 25k that are ready to send.
So, less of me guessing. I'm going to shut up now and see what's changed in about a week.
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101876 - Posted: 22 May 2021, 2:15:26 UTC - in response to Message 101874.  

I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
The pre_helical_bundles values won't change unless they run a script on them- i'm pretty sure they were the initial batch of 20 million that started the whole problem and i'd be surprised if they have released any more since that initial batch (that's one of the things that makes it difficult to keep track of what's going on- we have the dates for when Tasks are sent out & returned etc. But there is no date stamp of when the batch was released to the project to be turned in to Work Units to process).

Looking at my present Tasks, most of the other Tasks being released have much less extreme values (although they are still much much higher than they need to be). I suspect that's why the amount of work in progress has improved, but still not recovered to its previous levels. And why it tends to drop significantly every now & then (although at least to nowhere near as low as it was when the problem first occurred) and then build up for a while to drop again.

<name>pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5kb7db1y_1389919_2</name>
<rsc_memory_bound>7000000000.000000</rsc_memory_bound>
  <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
   Peak working set size 560 MB
         Peak disk usage 2.7 MB


<name>455bf3e6f13e9a5ea3a74cb191a168ca_6f0dae8d86802072ff4c6a67db2e865a_1kq1A_L2L6L4L2L5L2_fold_SAVE_ALL_OUT_1393099_64</name>
<rsc_memory_bound>900000000.000000</rsc_memory_bound>
 <rsc_disk_bound>2000000000.000000</rsc_disk_bound>
  Peak working set size 440 MB
        Peak disk usage 2.2 MB


<name>SL_2e6p_1_A_trim_SEQ_3.5_3.0_43_L2L4L2L6L2.folded.pdb.rd1_fragments_abinitio_SAVE_ALL_OUT_1393561_333</name>
<rsc_memory_bound>3500000000.000000</rsc_memory_bound>
  <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
   Peak working set size 455 MB
          Peak disk usage 22 MB


<name>rb_05_21_76950_74914_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_09_1393730_101</name>
<rsc_memory_bound>3500000000.000000</rsc_memory_bound>
  <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
 Peak working set size 1,100 MB
           Peak disk usage 8 MB



Comparing the configured values with the actually used values gives a pretty good idea of just how excessive they are (even the presently improved ones). Even more so when you look at the storage values (4GB required, 22MB actually used. If i've got enough zeros in the right places, that's 182 times more than what was actually needed...).
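To make the comparison above concrete, here is a small Python sketch that works out how many times larger each configured bound is than the peak actually used, using only the four examples listed in this post. It assumes decimal units (1 MB = 1,000,000 bytes) to match the 4GB vs 22MB arithmetic; the short task labels and the helper name are made up for readability.

```python
# (label, rsc_memory_bound bytes, rsc_disk_bound bytes, peak RAM MB, peak disk MB)
TASKS = [
    ("pre_helical_bundles", 7_000_000_000, 9_000_000_000, 560, 2.7),
    ("1kq1A fold",            900_000_000, 2_000_000_000, 440, 2.2),
    ("SL_2e6p abinitio",    3_500_000_000, 4_000_000_000, 455, 22),
    ("rb robetta",          3_500_000_000, 4_000_000_000, 1100, 8),
]


def overprovision(bound_bytes, used_mb):
    """How many times larger the configured bound is than the peak used."""
    return bound_bytes / (used_mb * 1_000_000)


for label, mem_b, disk_b, ram_mb, disk_mb in TASKS:
    print(f"{label}: RAM x{overprovision(mem_b, ram_mb):.1f}, "
          f"disk x{overprovision(disk_b, disk_mb):.0f}")
```

For the abinitio example the disk figure comes out at roughly 182, matching the number quoted above.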
Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101878 - Posted: 22 May 2021, 13:05:07 UTC - in response to Message 101876.  

I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
The pre_helical_bundles values won't change unless they run a script on them

I know you run a small cache.
Take another look.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101881 - Posted: 22 May 2021, 21:55:32 UTC - in response to Message 101878.  
Last modified: 22 May 2021, 22:15:30 UTC

I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
The pre_helical_bundles values won't change unless they run a script on them

I know you run a small cache.
Take another look.
<rsc_memory_bound>653095368.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>

<rsc_memory_bound>525204451.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
Ah! That's good to see.

So it looks like they are still sending out new pre_helical_bundles Tasks, but with improved RAM values. Unfortunately they're still a small percentage of the total number of those Task types (i've got about two of them out of the dozen pre_helical_bundles on my system), but bit by bit the older ones are being cleared out till they're well in to the minority & eventually gone completely. It helps explain why the peaks in the In progress numbers have been gradually getting higher.
If they could use similarly appropriate RAM values for other Task types & bring the disk requirements down to more realistic levels for all Tasks, things should be back to normal in next to no time (and any new people that join up won't be asking why they need to give Rosetta so much disk space).
Those with RAM limited systems will be back to where they were before- able to process most Tasks & only run in to issues with those that actually need large amounts of RAM to run. That should give Rosetta's compute resources a good lift back to where they were, and keep them up there.
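The rsc_memory_bound and rsc_disk_bound values quoted above are in bytes; dividing by 1,048,576 gives the MB figures mentioned in this thread (7,000,000,000 bytes is the 6675.72 MB in the title). A minimal Python sketch of reading them out follows. The sample XML is a trimmed, made-up stand-in for client_state.xml, the `<workunit>` wrapper and the work unit names are assumptions, and a real file holds far more than this.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for client_state.xml.
SAMPLE = """<client_state>
  <workunit>
    <name>old_example</name>
    <rsc_memory_bound>7000000000.000000</rsc_memory_bound>
    <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
  </workunit>
  <workunit>
    <name>new_example</name>
    <rsc_memory_bound>653095368.000000</rsc_memory_bound>
    <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
  </workunit>
</client_state>"""

BYTES_PER_MB = 1024 * 1024


def bounds_in_mb(xml_text):
    """Return {work unit name: (RAM bound MB, disk bound MB)}."""
    root = ET.fromstring(xml_text)
    result = {}
    for wu in root.iter("workunit"):
        name = wu.findtext("name")
        mem = float(wu.findtext("rsc_memory_bound")) / BYTES_PER_MB
        disk = float(wu.findtext("rsc_disk_bound")) / BYTES_PER_MB
        result[name] = (round(mem, 2), round(disk, 2))
    return result


for name, (mem_mb, disk_mb) in bounds_in_mb(SAMPLE).items():
    print(f"{name}: RAM bound {mem_mb} MB, disk bound {disk_mb} MB")
```

To try it on a real file, read the file's text and pass it to `bounds_in_mb` in place of `SAMPLE`.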




Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past).

pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h_1389892_2_0

<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1649175
Using database: database_357d5d93529_n_methylminirosetta_database


ERROR: [ERROR] Unable to open constraints file: fe4dbf3cfd9598bae6400704d6426ef3_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
BOINC:: Error reading and gzipping output datafile: default.out
01:19:23 (3828): called boinc_finish(1)

</stderr_txt>

Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101882 - Posted: 22 May 2021, 22:44:00 UTC - in response to Message 101881.  
Last modified: 22 May 2021, 22:46:06 UTC

I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
The pre_helical_bundles values won't change unless they run a script on them

I know you run a small cache.
Take another look.
<rsc_memory_bound>653095368.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>

<rsc_memory_bound>525204451.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
Ah! That's good to see.

So it looks like they are still sending out new pre_helical_bundles Tasks, but with improved RAM values. Unfortunately they're still a small percentage of the total number of those Task types (i've got about two of them out of the dozen pre_helical_bundles on my system), but bit by bit the older ones are being cleared out till they're well in to the minority & eventually gone completely.

It took effect a lot quicker than I expected. I'm guessing (again) that the 25k tasks "Ready to Send" are left as is and the adjustment is made to the rest of the 13m queue. I don't know exactly what's happening, but let's see what happens after several days. It may be recursive, so it improves further over time, but again I'm guessing, based on this:

I started working on some logic that can update the rsc_memory_bound in our queue based on memory usage reported back from completed jobs.
Hopefully this will help but it won’t be perfect.
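
To make the idea in the quote concrete, here is a rough sketch of how a memory bound could be derived from the peak memory reported by completed jobs. This is not the project's actual logic, which hasn't been posted. The function name suggest_memory_bound, the 95th-percentile choice, the 25% headroom and the 100 MB floor are all assumptions made up for illustration.

```python
import math

def suggest_memory_bound(peak_bytes, headroom=1.25, floor_bytes=100 * 1024 * 1024):
    """Hypothetical: propose an rsc_memory_bound (in bytes) from the peak
    memory usage reported by completed jobs of one batch."""
    if not peak_bytes:
        return None  # nothing reported yet, so keep the existing bound
    ordered = sorted(peak_bytes)
    # Nearest-rank 95th percentile, so one odd outlier doesn't set the bound.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]
    # Add headroom and never drop below the (assumed) floor.
    return max(floor_bytes, int(p95 * headroom))

# Example: twenty jobs that each peaked at 400,000,000 bytes.
print(suggest_memory_bound([400_000_000] * 20))
```

Re-running something like this as more results come in would match the guess that the values could keep improving over time.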


It helps explain why the peaks in the In progress numbers have been gradually getting higher.

It will, but I don't think so just yet. Let's see after a day or two more.

If they could use similarly appropriate RAM values for other Task types & bring the disk requirements down to more realistic levels for all Tasks, things should be back to normal in next to no time (and any new people that join up won't be asking why they need to give Rosetta so much disk space).
Those with RAM limited systems will be back to where they were before- able to process most Tasks & only run in to issues with those that actually need large amounts of RAM to run. That should give Rosetta's compute resources a good lift back to where they were, and keep them up there.

I'd hope it extended to all task-types, but it may be limited to pre_helical_bundles for now.
And while RAM is the primary limitation, the same logic should apply to Disk space too

Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past).

I didn't realise that was still happening tbh - is it?
I certainly haven't mentioned it, so if anything's changed there it's more likely coincidence.

Is it connected to pre_helical_bundles tasks?
My impression was it popped up on a variety of other tasks too, but very irregularly. Am I wrong?
ID: 101882
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1633
Credit: 16,775,951
RAC: 13,112
Message 101884 - Posted: 22 May 2021, 23:10:52 UTC - in response to Message 101882.  

Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past).

I didn't realise that was still happening tbh - is it?
I certainly haven't mentioned it, so if anything's changed there it's more likely coincidence.

Is it connected to pre_helical_bundles tasks?
My impression was it popped up on a variety of other tasks too, but very irregularly. Am I wrong?
Pretty sure that particular error is just with the pre_helical_bundles, and it's still occurring. Just not as much as it was. Hence why i was thinking that the new Tasks might have the issue sorted and the errors are from the original release Tasks.
If you check my Tasks you'll see 1 Validate error on one system & 2 Compute errors on the other. The Compute errors usually occur in under a minute, the Validate errors after a few minutes. You've got 3 Compute errors on your Ryzen- all pre_helical_bundles, same stderr output as me.
ERROR: [ERROR] Unable to open constraints file: feab7864d25907b78eb5173513455954_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
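
For anyone wanting to check their own failed Tasks for the same thing, a small sketch that pulls the missing constraints file name out of saved stderr text. The function name missing_constraints_files is made up for illustration, and it assumes the error line is worded exactly as in the output above.

```python
import re

# Matches the error line shown above and captures the file name.
CST_ERROR = re.compile(r"Unable to open constraints file: (\S+)")

def missing_constraints_files(stderr_text):
    """Return every constraints file name the stderr text says could not be opened."""
    return CST_ERROR.findall(stderr_text)

sample = (
    "ERROR: [ERROR] Unable to open constraints file: "
    "feab7864d25907b78eb5173513455954_0001.MSAcst\n"
)
print(missing_constraints_files(sample))
```

An empty list means the Task failed for some other reason.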

Grant
Darwin NT
ID: 101884
Sid Celery
Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101887 - Posted: 23 May 2021, 2:12:34 UTC - in response to Message 101881.  

I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious.
The pre_helical_bundles values won't change unless they run a script on them

I know you run a small cache.
Take another look.
<rsc_memory_bound>653095368.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>

<rsc_memory_bound>525204451.000000</rsc_memory_bound>
 <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
Ah! That's good to see.

I've only just realised what this is saying. Too many digits to work it out in my head.

The original figure was 7*10^9 (7 followed by 9 zeros).
Divide by 1024 twice to convert to MB: 6675.72MB

653095368 converts to 622.84MB RAM
525204451 converts to 500.87MB RAM
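
The arithmetic above can be checked with a few lines of Python. This is just a sketch of the divide-by-1024-twice conversion; the function name bytes_to_mb is my own.

```python
def bytes_to_mb(n_bytes):
    """Convert bytes to MB by dividing by 1024 twice (1 MB = 1,048,576 bytes)."""
    return n_bytes / 1024 / 1024

# The original bound and the two new rsc_memory_bound values from the thread.
for bound in (7_000_000_000, 653_095_368, 525_204_451):
    print(f"{bound} bytes = {bytes_to_mb(bound):.2f} MB")
```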

I'd thought they were 10x higher, for only a small reduction (6.53GB & 5.25GB)
They've gone from the hardest tasks to download and run to the easiest. Every host will handle them easily.
And no more "Waiting for memory" with these in the mix.
They might even be asking for too little RAM to run successfully (I'll keep that thought to myself for now)
ID: 101887
Sid Celery
Joined: 11 Feb 08
Posts: 2074
Credit: 40,613,760
RAC: 5,140
Message 101888 - Posted: 23 May 2021, 2:26:23 UTC - in response to Message 101884.  

Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past).

I didn't realise that was still happening tbh - is it?
I certainly haven't mentioned it, so if anything's changed there it's more likely coincidence.

Is it connected to pre_helical_bundles tasks?
My impression was it popped up on a variety of other tasks too, but very irregularly. Am I wrong?
Pretty sure that particular error is just with the pre_helical_bundles, and it's still occurring. Just not as much as it was. Hence why i was thinking that the new Tasks might have the issue sorted and the errors are from the original release Tasks.
If you check my Tasks you'll see 1 Validate error on one system & 2 Compute errors on the other. The Compute errors usually occur in under a minute, the Validate errors after a few minutes. You've got 3 Compute errors on your Ryzen- all pre_helical_bundles, same stderr output as me.
ERROR: [ERROR] Unable to open constraints file: feab7864d25907b78eb5173513455954_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457

Confirmed.
Very little runtime wasted, but I'll get round to mentioning it by the end of the weekend now I'm back home
ID: 101888
Kissagogo27
Joined: 31 Mar 20
Posts: 86
Credit: 2,796,243
RAC: 1,953
Message 101890 - Posted: 23 May 2021, 8:33:07 UTC

Hi, I've got seven "pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_" WUs on each 2GB computer.

The first will start in a few minutes ;)
ID: 101890



©2024 University of Washington
https://www.bakerlab.org