better data compression and caching of large jobs

Message boards : Number crunching : better data compression and caching of large jobs

sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77036 - Posted: 19 Jul 2014, 14:08:38 UTC
Last modified: 19 Jul 2014, 14:23:19 UTC

there are jobs which deliver a file of some 80 megs each. i'd think that adds up to perhaps 10s-100s of gigabytes of those 80 meg files pushed across the Internet, considering the number of jobs sent (imagine the workload on the server and the network jams)

is boinc-client / minirosetta app able to cache large files across different jobs?

apparently rosetta app and database can be cached locally which makes sense, it avoids repeat transfers of large files choking up (internet) bandwidth.

But could the same perhaps be applied to jobs? e.g. if the jobs reference common objects, say in their own 'library', perhaps that 'library'/database can be cached locally so as to reduce the number of repeat 80 meg transmissions

the other thing obviously is data compression, perhaps rosetta / boinc would like to use better compression methods

7-zip (http://en.wikipedia.org/wiki/7-Zip), XZ (http://en.wikipedia.org/wiki/7-Zip), AdvanceCOMP (http://en.wikipedia.org/wiki/AdvanceCOMP)
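
As a rough way to see what a stronger compressor might buy, here is a minimal sketch using Python's standard zlib module (DEFLATE, as used by gzip/zip) and lzma module (the LZMA family behind XZ and 7-Zip). The payload is made-up repetitive text, not a real job file, so whatever sizes it prints say nothing about actual Rosetta downloads.

```python
import lzma
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size of `data` raw, DEFLATE-compressed and XZ-compressed."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data, 9)),  # level 9 = best DEFLATE effort
        "xz": len(lzma.compress(data)),       # default preset, .xz container
    }


# Hypothetical stand-in for a job input file: one text line repeated many times.
sample = b"ATOM      1  N   MET A   1      27.340  24.430   2.614\n" * 20000

sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>4}: {size} bytes")

# Both formats are lossless: decompressing returns the original bytes.
assert zlib.decompress(zlib.compress(sample, 9)) == sample
assert lzma.decompress(lzma.compress(sample)) == sample
```

Running the same function over a real job file would show whether a switch is worth it; files that are already compressed archives usually gain little from a second pass.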

reducing network bandwidth is essential in the Internet context, as not all boinc/rosetta@home volunteers/participants have unlimited or low-cost bandwidth access. some (e.g. perhaps mobile) internet access could be expensive and/or have a hard cap/limit (e.g. 1GB), and the volume of bandwidth boinc consumes can easily hit up against that limit/cap
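
To put a number on the cap concern, a tiny worked example: the 1GB cap and the 80 meg file size come from the posts above; treating 1GB as 1000 MB and ignoring uploads and all other traffic are simplifying assumptions.

```python
def downloads_before_cap(cap_mb: int, file_mb: int) -> int:
    """How many full-size file downloads fit under a data cap."""
    return cap_mb // file_mb


cap_mb = 1000   # the 1GB cap, taken as 1000 MB (assumption)
file_mb = 80    # the ~80 meg file that ships with a job
print(downloads_before_cap(cap_mb, file_mb))  # 12
```

So about a dozen uncached downloads of such a file would use up the whole allowance.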

this is unlike local lan clusters or even local supercomputer clusters, where 'bandwidth' is essentially an 'internal' affair (e.g. the 'bandwidth' costs may be little more than the electricity costs needed to run the network clusters)

i'd think scientists / researchers may need to split the jobs into high network bandwidth / low compute and low network bandwidth / high compute portions.

those that are high network bandwidth may need to run in clusters that are *local* (e.g. a campus cluster or institution supercomputer cluster). i'm not sure if boinc can cater to such bandwidth 'localization'

just 2 cents
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77037 - Posted: 19 Jul 2014, 14:58:36 UTC
Last modified: 19 Jul 2014, 15:07:53 UTC

there are obviously other, more 'advanced' approaches. today it seems to be pretty much 'client-server'

among these is the use of peer to peer networking (http://www.bittorrent.org/introduction.html). that relieves server load, but data compression and 'differential' transmission are still needed to reduce the network bandwidth.

other concepts can most likely be 'borrowed' from the hottest craze in town, namely bitcoin/litecoin (or other cryptocoin) mining (boinc is very similar to that after all). it started with frustrations with the 'getwork' protocol, which generates huge amounts of *network traffic*, and someone invented the 'stratum-mining' protocol to relieve network traffic and server load.
https://mining.bitcoin.cz/stratum-mining

a feature of stratum-mining is that it is part of p2pool (peer-to-peer pool nodes) and the leaf (client) nodes *generate work* (i.e. the leaf nodes generate work based on certain specs). this could push the envelope of 'differential' transmissions and network bandwidth reduction.

i guess these would need 'leapfrogging' changes to boinc & even perhaps rosetta? :o :D lol

just 2 cents
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77040 - Posted: 19 Jul 2014, 19:35:05 UTC
Last modified: 19 Jul 2014, 19:39:41 UTC

is boinc-client / minirosetta app able to cache large files across different jobs?


Yes. This is done in the BOINC client. If required files are already on the client machine, they are not downloaded again. This is one advantage to keeping at least a small cache of work on-hand. It improves the odds that one of your recent tasks required the same file.

You can see this sometimes when you've only got the one large file left downloading, yet have more than one task in the list showing a status of "downloading".

The BOINC client also supports the application designating specific files as "sticky", so that even if there are no tasks left for R@h (perhaps work from other BOINC projects has filled your cache), those files are not removed. This allows the researchers to identify files that are very likely to be used in the future, and avoid repeated downloads.
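
The retention rule described above can be sketched as a toy model. This is not the BOINC client's actual code; the class, method names, and file names are hypothetical, and it only illustrates "keep a file while some task needs it, or while it is sticky".

```python
class ClientFileStore:
    """Toy model: keep a file while a task needs it or while it is sticky."""

    def __init__(self):
        self.needed_by = {}   # file name -> set of task ids that use the file
        self.sticky = set()   # file names the project marked as sticky
        self.downloads = 0    # how many real transfers happened

    def add_task(self, task_id, file_names, sticky=()):
        self.sticky.update(sticky)
        for name in file_names:
            if name not in self.needed_by:
                self.downloads += 1          # not on disk yet: fetch it once
                self.needed_by[name] = set()
            self.needed_by[name].add(task_id)

    def finish_task(self, task_id):
        for name in list(self.needed_by):
            self.needed_by[name].discard(task_id)
            # Delete files no remaining task uses, unless they are sticky.
            if not self.needed_by[name] and name not in self.sticky:
                del self.needed_by[name]


store = ClientFileStore()
store.add_task("t1", ["big_80mb.zip"])
store.add_task("t2", ["big_80mb.zip"])               # same file: no second download
store.add_task("t3", ["db.zip"], sticky=["db.zip"])
for task in ("t1", "t2", "t3"):
    store.finish_task(task)
store.add_task("t4", ["db.zip", "big_80mb.zip"])
print(store.downloads)  # 3: the big file is fetched twice, the sticky one once
```

In this model the non-sticky file is downloaded again once every task that used it has finished, which is the case a larger work cache or a sticky flag is meant to avoid.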

I believe all contact to the server is done using a form of compressed streaming. So the size you see in the file transfers tab may not be the actual required transmission size.

Some suggestions to reduce overall network usage when running R@h:
* Keep a cache of at least a day of additional work. Use a larger cache when using longer runtime preference. The objective is to keep 10 or 20 tasks in the task list to help ensure you have many of the files required for the current active set of tasks.
* Set the runtime preference (via the website in the R@h preferences) to the largest value you are comfortable with (24hrs is the maximum selection).
* Rotate between BOINC projects, rather than always running all projects with a fractional resource share.
* Set your preferences to download during off-peak times (the server is going 24/7 serving machines from around the world, but your local off-peak).

Another thing you can do is set up a caching proxy server. That way if you do end up completing all of the tasks that require a certain large file, and thus the file is removed, and then later you get assigned a retry or additional work that requires the same file, it will be pulled directly from the cache. All files of a given name are unique and static. So if the same name is called for later, your existing copy of that file will be identical. This idea is especially good when you've got multiple machines that can all utilize the same proxy cache.
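
A minimal sketch of why the "unique and static" naming matters for such a proxy: a cache keyed only by file name can hand back its stored copy without checking with the server. The class and the fake server function below are hypothetical stand-ins, not a real proxy or the project's download API.

```python
from typing import Callable, Dict


class NameKeyedCache:
    """Toy stand-in for a caching proxy: one stored copy per file name."""

    def __init__(self, fetch_from_project: Callable[[str], bytes]):
        self._fetch = fetch_from_project
        self._store: Dict[str, bytes] = {}
        self.upstream_fetches = 0

    def get(self, name: str) -> bytes:
        # Safe only because a given name always maps to identical content.
        if name not in self._store:
            self._store[name] = self._fetch(name)
            self.upstream_fetches += 1
        return self._store[name]


def fake_project_server(name: str) -> bytes:
    # Hypothetical stand-in for the real download; returns dummy bytes.
    return b"contents of " + name.encode()


proxy = NameKeyedCache(fake_project_server)
for machine in ("pc1", "pc2", "pc3"):
    proxy.get("large_input.zip")   # three machines ask for the same file
print(proxy.upstream_fetches)      # 1
```

Only the first request goes out to the project; the other two are served locally, and the copy survives even after every client has deleted its own.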

And while it will not reduce total bandwidth, you can also throttle the up and download speed the BOINC Client attempts to use. This can help ensure that BOINC network usage does not fill your network connection, and thus ensure some bandwidth is always available for other things running on your machine.

Perhaps others have additional suggestions?
Rosetta Moderator: Mod.Sense
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77041 - Posted: 19 Jul 2014, 19:59:58 UTC
Last modified: 19 Jul 2014, 20:35:07 UTC

hi Mod.Sense,

thanks much for the response! i guess this in a way also needs the scientist/researcher to define jobs/tasks that can work as such. i.e. distinguishing between repeatedly used common bulk reference data vs task specific differential data.

it is also interesting to note that the size implied in the download tabs need not be the actual size that's downloaded.

however, i do hope that scientists/researchers can make use of these (caching) features, especially when they are sending out jobs with large files. perhaps this is specific to a particular batch, but it seemed that for those, it's pushing out full file sizes each time. to give the benefit of the doubt, i'd think there may be technical constraints that force re-sending large files with each job, but preferably, as mentioned, try to use the features that could reduce the network bandwidth fanning out from the server with each job.

for 'leaf' node end users (i'm one), we couldn't really benefit from provider-side proxies: even though proxies cache data between the wan links, the internet access provider still counts/charges all downstream data volume for a customer, even if it is streamed from the provider's own proxy servers. however, native cache features, when used in boinc/rosetta, would really help. hmm, i'd guess we can set up a local proxy cache on the pc; would explore that too :)

(note that proxying would basically fail if the contents (including the bulk reference data) are packaged in zip archives, named differently and sent in each job)
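
a small sketch of that failure mode, with made-up file names and dummy data: identical bytes shipped under two job-specific names look like two different objects to a name-keyed proxy, while a key derived from the content (here a sha256 digest) would show they are the same.

```python
import hashlib


def content_key(data: bytes) -> str:
    """Key a blob by what it contains instead of what it is called."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical: the same bulk reference data shipped under two job-specific names.
bulk = b"common reference data" * 1000
jobs = {"job_0001_input.zip": bulk, "job_0002_input.zip": bulk}

by_name = set(jobs)                                   # what a name-keyed proxy sees
by_content = {content_key(d) for d in jobs.values()}  # what a content-keyed store sees
print(len(by_name), len(by_content))  # 2 1
```

this only helps if the bulk data travels as its own file; once it is zipped together with task-specific data, the archive bytes differ per job and even a content key no longer matches.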

thanks! :)
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77042 - Posted: 19 Jul 2014, 21:54:22 UTC
Last modified: 19 Jul 2014, 22:10:32 UTC

I have noticed more large downloads lately as well. But it is a different large file each time. You'll get a few tasks that use that file, but the next time you go for work there seems to be a new batch that uses a different file. This is simply due to the types of things being studied. It is rather difficult to optimize every client machine to align future work with files that already exist locally.

What I meant by a caching proxy server is one that you could set up yourself for use within your own local network. It would essentially extend the BOINC Manager object cache, which only holds on to things as long as they are "sticky" or as long as you have a task that needs the file. It works great for people that have several machines going. Now you only actually download when the first machine is assigned a task of that type, and later, when other machines are assigned tasks of the same type, the required files are already locally available, even if the first machine has completed the task and removed the associated files.

It also helps even if there is only a single machine: it would essentially be an area of disk space to check before sending requests for files out on the network. It helps when you get 3 tasks that use an 80MB file, complete them all, and then a few days later are assigned some work that uses the same file. With the 10 day deadlines, the expired tasks then get reassigned, so sometimes there is a gap between the initial assignment of tasks using the file and the assignment of the expired tasks ten days later. It all just depends on the mix of tasks they are working on at any given point in time.

The one benefit to R@h hosting all of their own server download capacity is that if they should accidentally change something that causes a large increase in downloads, they will see it right away. :)

I believe the general feeling is that most participants prefer to get files from the project they are supporting and not have intermediate third parties serving files. Much the same way you probably prefer to download your software updates for PDF viewers from Adobe rather than some host that is unknown to you.
Rosetta Moderator: Mod.Sense




©2024 University of Washington
https://www.bakerlab.org