Message boards : Number crunching : Rosetta needs 6675.72 MB RAM: is the restriction really needed?
Author | Message |
---|---|
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
I thought that was obvious as the problem relates to doing Rosetta work, so i didn't explicitly state that i was referring to using half or all cores/threads for Rosetta. Surely the point is how many cores are used for Rosetta, not how many cores are in use overall. I'm running Rosetta on a machine with 16GB of RAM and Rosetta is running 8 tasks at once and 2 other projects are using the other 2 available cores and I'm not having any problems getting and returning tasks. Since you make use of only half of your available cores/threads then it's not surprising that you're not having issues. If you were to use all of your cores & threads, then with so little RAM that system would be having issues just like all the others are. I wouldn’t consider filling either machine with just Rosetta with or without the current config problem because of the L3 cache requirements. ? For all of the arguments about using how many cores or threads for what, that just isn't on the list. Grant Darwin NT |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,217,610 RAC: 822 |
I'm running Rosetta on a machine with 16GB of RAM and Rosetta is running 8 tasks at once and 2 other projects are using the other 2 available cores and I'm not having any problems getting and returning tasks. That's what I was thinking as well, and why an upgrade to 32GB of RAM is on my Amazon wish list. |
Bryn Mawr Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
It has been mentioned in several threads I’ve been involved in but maybe not on this forum? From memory the main types needing large amounts of cache are MIP on WCG and Rosetta (with mention of shared code?), Africa Rainfall on WCG and CPDN. |
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
And the other major factor is that, if they have a particular batch of work which makes particularly large resource demands because of the nature of the questions they want answered, they're not going to stop asking those questions just because 50% or more hosts don't have the capacity to assist in answering them. Because up to 50% will have the capacity and the bottom line is getting the answer to their question and nothing else, even if that means it takes a little longer to do. That's all well and good- but it is absolutely insane to set such high minimum requirements for Tasks that don't come anywhere close to using the amounts they are requiring as it stops many systems from being able to process them, or results in cores/threads going unused by Rosetta that are available for its use. I know what you're saying. It looks like a huge amount of processing resource unnecessarily taken off the table for no reason that's apparent at our end. I haven't fed back at all following the last changes that were made, so I've now done that and raised your point on top. Now I've reviewed what it was I said before, it is something I pointed out originally, but it seems to have got lost in focussing on the changes we've had, so it's worth highlighting again now instead of me just guessing. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
Yes, there are programmes where caches can have a massive impact on performance (because the data being worked on is small enough to fit within the cache). And those where they can still have a significant effect. Then there are those (such as Rosetta) where the size of the caches has little if any impact on processing performance- whether just one core is used or all. When i first started here there was heated discussion on the issue, until someone actually did some monitoring of Rosetta running & the end result was that on current hardware, even with large multicore/thread systems, the caches have very little impact on Rosetta's compute performance- the present caches are sufficient even when all threads/cores are put to use. Grant Darwin NT |
Bryn Mawr Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
Yes, there are programmes where caches can have a massive impact on performance (because the data being worked on is small enough to fit within the cache). And those where they can still have a significant effect. Then there are those (such as Rosetta) where the size of the caches has little if any impact on processing performance- whether just one core is used or all. Thank you, that’s very good to know and I’ll remove my restriction on Rosetta concurrency. |
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
I know what you're saying. It looks like a huge amount of processing resource unnecessarily taken off the table for no reason that's apparent at our end. And the other major factor is that, if they have a particular batch of work which makes particularly large resource demands because of the nature of the questions they want answered, they're not going to stop asking those questions just because 50% or more hosts don't have the capacity to assist in answering them. Because up to 50% will have the capacity and the bottom line is getting the answer to their question and nothing else, even if that means it takes a little longer to do. That's all well and good - but it's absolutely insane to set such high minimum requirements for Tasks that don't come anywhere close to using the amounts they are requiring as it stops many systems from being able to process them, or results in cores/threads going unused by Rosetta that are available for its use. Grant, I asked the question and I wasn't happy with what I asked because 1) it was a bit insipid on my part and 2) it didn't properly represent the point you're making. So I re-wrote my question to better reflect your point and sent that as well. And I've had a reply, which didn't respond to your point and offered something I wasn't asking for as it doesn't solve the root cause of the issue, so I'm going to press your point further. I'm really trying to get you an answer you'll be happy with, while I'm anxious not to become a nuisance and jeopardise the fantastic cooperation I've had up to now, so I'm going to consider how to frame it first. In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary.
Related to that, one part of the reply I had stated "the Robetta structure prediction server, which did submit jobs with varying memory requirements more in tune with the actual usage, is using less of R@h lately" so that's a factor, while also indicating it is possible to tune memory req'ts closer to actual usage. That may be my best route into the reply I write. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary. Lately there has been much more of a mix of Tasks than there was previously, so i suspect many of the newer Tasks do have lower memory requirement configuration values than the ones that started all these problems, so more people are able to get more work running at any given time. But if the work mix changes back towards predominantly Tasks with the high RAM requirement values, then the amount of work being done will drop off significantly again. Ideally the configuration value would be no more than 1.5x the highest amount actually used. But as long as it's less than 3 times that maximum actually used value then even the RAM limited systems will continue to be able to do Rosetta work (even if some Tasks end up waiting for memory on & off at least they'll still be able to process some Tasks). And over the next couple of months the last of the original batch of Tasks that began all these issues will be gone from the system. Basically if they just go back to whatever it was they were doing before they made these RAM/storage configuration changes then everything will come out OK. If someone does come up with some Tasks that do require considerably more RAM/storage space than at present, then that would be the time to revisit these changes (but just for that particular batch of Tasks, not all batches of them...). Grant Darwin NT |
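The 1.5x / 3x rule of thumb in the post above can be written down as a small check. This is only a sketch: the function name, the band labels, and the example numbers are illustrative assumptions, not anything the project itself uses.

```python
def classify_bound(configured_mb: float, peak_used_mb: float) -> str:
    """Compare a configured RAM bound with the highest RAM actually used.

    Bands follow the rule of thumb from the post:
      <= 1.5x peak -> "ideal", < 3x peak -> "workable", otherwise "excessive".
    A bound below the observed peak is flagged as "too low".
    """
    if peak_used_mb <= 0:
        raise ValueError("peak_used_mb must be positive")
    ratio = configured_mb / peak_used_mb
    if ratio < 1.0:
        return "too low"
    if ratio <= 1.5:
        return "ideal"
    if ratio < 3.0:
        return "workable"
    return "excessive"


# Hypothetical task whose observed peak is 1000 MB.
for configured in (1400, 2500, 6600):
    print(configured, classify_bound(configured, 1000))
```
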
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
In the meantime, In-Progress tasks are now up at 467k - only 15% off the March peak - but I know that could go down as easily as it goes up depending on the mix of tasks we get, in the way you described earlier, so I can't be satisfied if it only turns out to be temporary. Lately there has been much more of a mix of Tasks than there was previously, so i suspect many of the newer Tasks do have lower memory requirement configuration values than the ones that started all these problems, so more people are able to get more work running at any given time. And that's the case right now. More pre_helical_bundles asking for 6675MB right now and In-Progress down to 430k already. Ideally the configuration value would be no more than 1.5x the highest amount actually used. But as long as it's less than 3 times that maximum actually used value then even the RAM limited systems will continue to be able to do Rosetta work (even if some Tasks end up waiting for memory on & off at least they'll still be able to process some Tasks). And over the next couple of months the last of the original batch of Tasks that began all these issues will be gone from the system. I did about 10 drafts of my reply, was about to send it, checked another few things, did another 2 or 3 drafts, drawing on everything that's been said before finally sending it. I saw that the "cd98" tasks were calling for the new 3.5GB setting and using 1.3GB and running fine. That's the closest RAM-used got to RAM-allocated - still a long way off but not a 6.6GB setting using 500MB or another I saw with an 8.6GB setting only using 400MB. These latter 2 examples are what's causing the entirety of the problems, made worse by the fact there were probably 20 million pre_helical_bundles tasks demanding 6.6GB (now down to 12 or 13m so a way to go yet). Anyway, see next message |
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
Maybe I'm too squeamish. I have been. The 4th core is running a relatively new task "f60030e2d399cf97bd574292ff707fcd_fae0a51cf659d300dc90ab2264960253_barrel6_L2L5L5L7L5L7L3_relax_SAVE_ALL_OUT_1393099_4" (should I call this "barrel6"?) whose Virtual memory size is only 380MB, but looking at my "client_state.xml" file, it's set up to ask for 8583MB RAM and 1907MB Disk space. You might call this misconfigured too, but given it's after the adjustments made, it would be deliberate, so who's to call it misconfigured? Me. Bottom line? You're right. I think the only point I'm making is that if you think it might get better after the huge number of pre_helical_bundles tasks are worked through, I wouldn't personally bank on it. Which means things won't get any better than they are now, and may even get worse if we get greater numbers of Tasks that are configured for such ridiculously excessive amounts of RAM above & beyond the maximum that they will actually use. It's apparent now this was a particularly large batch of tasks. The original offer I got was that when a large batch was being issued, they'd be reviewed to ensure the RAM & Disk demands were appropriately sized. This whole thing has blown up because they weren't just inappropriate, but on another planet, preventing default users from running them at all - only nerds like you and me. I wasn't happy this would be looked at by exception for large batches, so I'm now told it's going to be the norm for all batches, including external researchers who submit tasks to the queue, tasks for which the RAM and Disk req'ts aren't currently known at all at the project. Essentially, this is why RAM and Disk req'ts have been so large. They were a kind of catch-all amount to ensure tasks that were coming from all over would have sufficient. Not because they did need it, but because it wasn't possible to know what they needed.
The fact is, for each researcher with a batch to submit, who'd no doubt run a few locally first to ensure they're doing what they're supposed to do, finding out what the resource req'ts actually are would be trivial - it's reported for each task. Find the peak amount, add a margin for safety, whether a %age or a default amount (or both) and set it to be that. Just as much, the point is to push this back to the researchers themselves. When I come along complaining about some task or other, the admin has to search through for that batch of tasks, see what the setting is, see what it's using and put that right. Then I'll complain about another, the same. Then another, the same. It's not long before a setting large enough for <everything> is put in so the admin can get on with his own work. Sound familiar? And it's no better if the external researcher submits the resource req't with the batch for the admin to put in. That's permanent extra work for little benefit. So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions. Tasks get through to users who can run them, without limiting it to a select few with abundant resources, everyone gets tasks they can run in greater numbers than before, the researcher gets their batch completed earlier than they used to. Everyone's a winner with no-one having artificial hurdles put in their way. That's the theory anyway. Let's see how it turns out over the next few weeks. |
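The "find the peak, add a margin, whether a %age or a default amount (or both)" step could look something like the sketch below. The function name and the default margins (50% and 256 MB) are assumptions picked for illustration; the thread does not say what values the project would use.

```python
def suggested_bound_mb(peaks_mb, pct_margin=0.5, min_margin_mb=256.0):
    """Suggest a RAM bound from the peaks seen in a few local test runs.

    Takes the highest observed peak and adds whichever safety margin is
    larger: a percentage of the peak or a fixed minimum amount.
    """
    if not peaks_mb:
        raise ValueError("need at least one observed peak")
    peak = max(peaks_mb)
    margin = max(peak * pct_margin, min_margin_mb)
    return peak + margin


# Small tasks get the fixed margin, larger ones the percentage margin.
print(suggested_bound_mb([400, 440, 380]))   # peak 440 MB + 256 MB fixed margin
print(suggested_bound_mb([1100, 900]))       # peak 1100 MB + 50% margin
```

Taking the larger of the two margins keeps very small tasks from getting a bound that sits only a few MB above their peak.
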
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions. That's the ideal- they run a few Tasks locally- they can set it to a 1 hour or even 30min runtime to speed things up -and see just how much RAM they actually do use. Bump it up by 50% and use that value for the batch when it's released here. The next best thing would take some work to set up, but it would look after itself from then on in. For all new work set the default RAM value to 2GB, disk space to 500MB. Run a Cron job once every 24hrs that checks the max RAM values actually used- by batch -of work that has been completed over the last 24hrs. Then run a script to set the required RAM value to the max * 1.5 for each batch of Tasks queued up. Next time the script is run, if the new value is less than the old value, don't change it. If it's more, change it to the new value. I suspect that after 3 days you'd pretty much have the highest possible value, so no need to run the script on that batch again. Set up a file to keep track of how many times a batch has had the script run, adding new batches to that list as they are submitted. That way you reduce the database load by only running the script to update RAM values for batches that need it. Once it's been done 3 times (by my WAG) they won't need it done again (by then the likelihood of the actually used RAM values being any higher than the current max RAM * 1.5 value would be bugger all IMHO). As for the disk space required- the only time i've seen a large amount of space used by a Task was when we had some that were erroring out, and producing 100MB+ result/std_error files. Even on my system that has multiple Rosetta & Mini Rosetta application versions, with 8 hour caches & on a 6c/12thread CPU, the most disk space i have seen Rosetta require is 2.5GB.
So 500MB per Task is way more disk space than will ever be needed, but it's not a huge ask & shouldn't cause issues except for those with very small amounts of storage allocated as available to Rosetta & BOINC. Grant Darwin NT |
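The daily cron idea described above can be sketched as follows. Everything here is hypothetical: the `Batch` record, `update_batch`, and the in-memory bookkeeping stand in for whatever database tables and scripts the project actually has. The first run replaces the 2GB default with max * 1.5, later runs only ever raise the value, and a batch is left alone after 3 runs.

```python
from dataclasses import dataclass

DEFAULT_RAM_MB = 2048.0  # "2GB" starting value suggested in the post
HEADROOM = 1.5           # bound = highest observed peak * 1.5
MAX_RUNS = 3             # the post's guess for when the value has settled


@dataclass
class Batch:
    name: str
    ram_bound_mb: float = DEFAULT_RAM_MB
    runs: int = 0  # how many times the update script has adjusted this batch


def update_batch(batch: Batch, peaks_mb: list[float]) -> Batch:
    """Apply one daily update using peaks reported over the last 24 hours."""
    if batch.runs >= MAX_RUNS or not peaks_mb:
        return batch  # settled already, or nothing completed yet
    candidate = max(peaks_mb) * HEADROOM
    if batch.runs == 0:
        batch.ram_bound_mb = candidate  # replace the catch-all default
    else:
        batch.ram_bound_mb = max(batch.ram_bound_mb, candidate)  # only raise
    batch.runs += 1
    return batch


demo = Batch("demo_batch")
for daily_peaks in ([560.0, 500.0], [400.0], [700.0], [2000.0]):
    update_batch(demo, daily_peaks)
    print(demo.runs, demo.ram_bound_mb)
```

One consequence of the fixed run limit shows up in the last iteration: a late task with a much higher peak no longer moves the bound, which is the trade-off the post accepts in exchange for lower database load.
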
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
So, push it back to the researcher, add a stage for them to set appropriate req'ts for their batch and add to the queue themselves as they currently do. No admin interventions. That's the ideal- they run a few Tasks locally- they can set it to a 1 hour or even 30min runtime to speed things up -and see just how much RAM they actually do use. Bump it up by 50% and use that value for the batch when it's released here. And I suspect you're exactly right, from what I've just read. Rather than any manual intervention at all, completed jobs for each batch are going to be reviewed and reflected back into the vast majority of the queue for that batch - bar the 25k that are ready to send. So, less of me guessing. I'm going to shut up now and see what's changed in about a week. I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. The pre_helical_bundles values won't change unless they run a script on them- i'm pretty sure they were the initial batch of 20 million that started the whole problem and i'd be surprised if they have released any more since that initial batch (that's one of the things that makes it difficult to keep track of what's going on- we have the dates for when Tasks are sent out & returned etc. But there is no date stamp of when the batch was released to the project to be turned in to Work Units to process). Looking at my present Tasks, most of the other Tasks being released have much less extreme values (although they are still much much higher than they need to be). I suspect that's why the amount of work in progress has improved, but still not recovered to its previous levels. And why it tends to drop significantly every now & then (although at least to nowhere near as low as it was when the problem first occurred) and then build up for a while to drop again.
<name>pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5kb7db1y_1389919_2</name>
<rsc_memory_bound>7000000000.000000</rsc_memory_bound>
<rsc_disk_bound>9000000000.000000</rsc_disk_bound>
Peak working set size 560 MB
Peak disk usage 2.7 MB
<name>455bf3e6f13e9a5ea3a74cb191a168ca_6f0dae8d86802072ff4c6a67db2e865a_1kq1A_L2L6L4L2L5L2_fold_SAVE_ALL_OUT_1393099_64</name>
<rsc_memory_bound>900000000.000000</rsc_memory_bound>
<rsc_disk_bound>2000000000.000000</rsc_disk_bound>
Peak working set size 440 MB
Peak disk usage 2.2 MB
<name>SL_2e6p_1_A_trim_SEQ_3.5_3.0_43_L2L4L2L6L2.folded.pdb.rd1_fragments_abinitio_SAVE_ALL_OUT_1393561_333</name>
<rsc_memory_bound>3500000000.000000</rsc_memory_bound>
<rsc_disk_bound>4000000000.000000</rsc_disk_bound>
Peak working set size 455 MB
Peak disk usage 22 MB
<name>rb_05_21_76950_74914_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_09_1393730_101</name>
<rsc_memory_bound>3500000000.000000</rsc_memory_bound>
<rsc_disk_bound>4000000000.000000</rsc_disk_bound>
Peak working set size 1,100 MB
Peak disk usage 8 MB
Comparing the configured values with the actually used values gives a pretty good idea of just how excessive they are (even the presently improved ones). Even more so when you look at the storage values (4GB required, 22MB actually used. If i've got enough zeros in the right places, that's 182 times more than what was actually needed...). Grant Darwin NT |
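The configured-versus-used comparison above can be reproduced with a few lines of Python. This is a sketch under assumptions: the sample XML mimics the `<workunit>` entries of a BOINC `client_state.xml` using only the three tags quoted in the post, the workunit names are shortened stand-ins, and the peak figures are copied by hand from the post because they come from the task result pages rather than the XML. Bounds are divided by 10^6 (decimal MB), which is the convention that gives the post's "182 times" figure.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical names; the bound values are the ones quoted in the post.
SAMPLE = """
<client_state>
  <workunit>
    <name>pre_helical_bundles_example</name>
    <rsc_memory_bound>7000000000.000000</rsc_memory_bound>
    <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
  </workunit>
  <workunit>
    <name>rd1_fragments_abinitio_example</name>
    <rsc_memory_bound>3500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>4000000000.000000</rsc_disk_bound>
  </workunit>
</client_state>
"""

# (peak working set MB, peak disk usage MB) as reported for each task.
PEAKS_MB = {
    "pre_helical_bundles_example": (560.0, 2.7),
    "rd1_fragments_abinitio_example": (455.0, 22.0),
}


def overprovision(xml_text, peaks_mb):
    """Return {name: (ram_ratio, disk_ratio)}: configured bound / observed peak."""
    ratios = {}
    for wu in ET.fromstring(xml_text).iter("workunit"):
        name = wu.findtext("name")
        if name not in peaks_mb:
            continue
        ram_mb = float(wu.findtext("rsc_memory_bound")) / 1e6
        disk_mb = float(wu.findtext("rsc_disk_bound")) / 1e6
        peak_ram, peak_disk = peaks_mb[name]
        ratios[name] = (round(ram_mb / peak_ram, 1), round(disk_mb / peak_disk, 1))
    return ratios


for name, (ram_x, disk_x) in overprovision(SAMPLE, PEAKS_MB).items():
    print(f"{name}: RAM {ram_x}x, disk {disk_x}x")
```
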
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. The pre_helical_bundles values won't change unless they run a script on them. I know you run a small cache. Take another look. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. The pre_helical_bundles values won't change unless they run a script on them. <rsc_memory_bound>653095368.000000</rsc_memory_bound> <rsc_disk_bound>9000000000.000000</rsc_disk_bound> <rsc_memory_bound>525204451.000000</rsc_memory_bound> <rsc_disk_bound>9000000000.000000</rsc_disk_bound> Ah! That's good to see. So it looks like they are still sending out new pre_helical_bundles Tasks, but with improved RAM values. Unfortunately they're still a small percentage of the total number of those Task types (i've got about two of them out of the dozen pre_helical_bundles on my system), but bit by bit the older ones are being cleared out till they're well in to the minority & eventually gone completely. It helps explain why the peaks in the In progress numbers have been gradually getting higher. If they could use similarly appropriate RAM values for other Task types, and bring the disk requirements down to more realistic levels for all Tasks, then things should be back to normal in next to no time (and any new people that join up won't be asking why they need to give Rosetta so much disk space). Those with RAM limited systems will be back to where they were before- able to process most Tasks & only run in to issues with those that actually need large amounts of RAM to run. That should give Rosetta's compute resources a good lift back to where they were, and keep them up there. Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past). pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h_1389892_2_0 <message> Incorrect function.
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_5ds2cy0h.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1649175 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: [ERROR] Unable to open constraints file: fe4dbf3cfd9598bae6400704d6426ef3_0001.MSAcst ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 01:19:23 (3828): called boinc_finish(1) </stderr_txt> Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. The pre_helical_bundles values won't change unless they run a script on them. It took effect a lot quicker than I expected. I'm guessing (again) that the 25k tasks "Ready to Send" are left as is and the adjustment is made to the rest of the 13m queue. I don't know exactly what's happening, but let's see what happens after several days. It may be recursive, so it improves further over time, but again I'm guessing based on this: "I started working on some logic that can update the rsc_memory_bound in our queue based on memory usage reported back from completed jobs." It helps explain why the peaks in the In progress numbers have been gradually getting higher. It will, but I don't think so just yet. Let's see after a day or two more. If they could use similarly appropriate RAM values for other Task types, bring the disk requirements down to more realistic levels for all Tasks & things should be back to normal in next to no time (and any new people that join up won't be asking why they need to give Rosetta so much disk space). I'd hope it extended to all task-types, but it may be limited to pre_helical_bundles for now. And while RAM is the primary limitation, the same logic should apply to Disk space too. Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past). I didn't realise that was still happening tbh - is it? I certainly haven't mentioned it, so if anything's changed there it's more likely coincidence. Is it connected to pre_helical_bundles tasks? My impression was it popped up on a variety of other tasks too, but very irregularly. Am I wrong? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1734 Credit: 18,532,940 RAC: 17,945 |
Pretty sure that particular error is just with the pre_helical_bundles, and it's still occurring. Just not as much as it was. Hence why i was thinking that the new Tasks might have the issue sorted and the errors are from the original release Tasks. Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past). If you check my Tasks you'll see 1 Validate error on one system & 2 Compute errors on the other. The Compute errors usually occur in under a minute, the Validate errors after a few minutes. You've got 3 Compute errors on your Ryzen- all pre_helical_bundles, same std_err output as me. ERROR: [ERROR] Unable to open constraints file: feab7864d25907b78eb5173513455954_0001.MSAcst ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457 Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
I dare say the easiest way of knowing is by looking at any changes to pre_helical_bundles tasks as it's their settings that are the most egregious. The pre_helical_bundles values won't change unless they run a script on them. I've only just realised what this is saying. Too many digits to work it out in my head. The original figure was 7*10^9 - 7 followed by 9 zeros. Divide by 1024 twice to convert to MB = 6675.72MB. 653095368 converts to 622.84MB RAM. 525204451 converts to 500.87MB RAM. I thought they were 10x higher for only a small reduction (6.53GB & 5.25GB). They've gone from the hardest tasks to download and run to the easiest. Every host will handle them easily. And no more "Waiting for memory" with these in the mix. They might even be asking for too little RAM to run successfully (I'll keep that thought to myself for now) |
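The "divide by 1024 twice" arithmetic is easy to check in code. A minimal sketch, where the helper name is made up for illustration and the three input values are the `rsc_memory_bound` figures quoted in the thread:

```python
BYTES_PER_MB = 1024 * 1024  # "divide by 1024 twice"


def bound_to_mb(bound_bytes: float) -> float:
    """Convert an rsc_memory_bound value in bytes to MB, rounded to 2 decimals."""
    return round(bound_bytes / BYTES_PER_MB, 2)


for bound in (7_000_000_000, 653_095_368, 525_204_451):
    print(f"{bound:>13,} bytes = {bound_to_mb(bound):8.2f} MB")
```

The first value is where the 6675.72 MB in the thread title comes from.
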
Sid Celery Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
Pretty sure that particular error is just with the pre_helical_bundles, and it's still occurring. Just not as much as it was. Hence why i was thinking that the new Tasks might have the issue sorted and the errors are from the original release Tasks. Edit- i wonder if these new pre_helical_bundles Tasks resolved the Compute/Validation error issues that occur soon after starting? The number of such errors does appear to be lower over the last few days than it was (although it has been very variable in the past). Confirmed. Very little runtime wasted, but I'll get round to mentioning it by the end of the weekend now I'm back home. |
Kissagogo27 Joined: 31 Mar 20 Posts: 86 Credit: 2,981,693 RAC: 1,241 |
hi, i've got seven of "pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_" WU per 2GB computer , first to start in few minutes ;) |
©2024 University of Washington
https://www.bakerlab.org