Message boards : Number crunching : WU Compression
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
I suggest that lowering the WU size could both reduce server traffic and attract new users to the project. GZip alone is not good enough at compressing these files; a good compression ratio can be reached by adding a conversion stage. E.g. the 44-byte string 1gox _ 264 A L -64.197 161.100 173.653 can be converted to an 18-byte record [4 char][1 char][word][char][char][int, scaled by 100][int][int] and then gzipped. I suspect a very good compression rate.
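A minimal sketch of that packing in C, assuming the layout above. The field meanings (position, chain, residue, three angles) are my guesses, and the scale factor needs to be 1000 rather than 100 to keep all three decimal places; a value scaled by 1000 still fits in three bytes:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Store a signed integer in 3 little-endian bytes (range +/-8388607,
 * enough for an angle in degrees scaled by 1000). */
static void put24(uint8_t *p, int32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
}

/* Pack one 44-byte text line into an 18-byte record:
 * [4-char id][1 char][uint16 position][char][char][3 x int24 angle]. */
static int pack_line(const char *line, uint8_t out[18])
{
    char id[5], flag, chain, aa;
    unsigned pos;
    double a1, a2, a3;   /* guessed to be torsion angles */

    if (sscanf(line, "%4s %c %u %c %c %lf %lf %lf",
               id, &flag, &pos, &chain, &aa, &a1, &a2, &a3) != 8)
        return -1;

    memcpy(out, id, 4);
    out[4] = (uint8_t)flag;
    out[5] = (uint8_t)(pos & 0xff);        /* position as little-endian uint16 */
    out[6] = (uint8_t)((pos >> 8) & 0xff);
    out[7] = (uint8_t)chain;
    out[8] = (uint8_t)aa;
    put24(out + 9,  (int32_t)lround(a1 * 1000.0));
    put24(out + 12, (int32_t)lround(a2 * 1000.0));
    put24(out + 15, (int32_t)lround(a3 * 1000.0));
    return 0;
}

int main(void)
{
    uint8_t rec[18];
    if (pack_line("1gox _ 264 A L -64.197 161.100 173.653", rec) == 0)
        printf("packed 44 text bytes into %zu binary bytes\n", sizeof rec);
    return 0;
}
```

The client would run the inverse function after download, so the science code would still see the original text file.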
thom217 Joined: 21 Oct 05 Posts: 5 Credit: 0 RAC: 0
This could be useful for members with dial-up connections.
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
Link to my suggestion in the Suggestions forum - I put it there some days ago (just using RAR as an example). Hi blackbird :D Team mauisun.org
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,450 RAC: 13
I think the thread is originally discussing WU "size", meaning the physical size in MB that is transferred. Being on cable, this part doesn't bother me at all.

The thread also seems to be taking on the WU "length" issue - how long it takes to crunch one WU - and I feel this may SOON become a major issue for attracting/keeping participants. I say this because I (and many others) have some "slow" machines. It may be POSSIBLE to run something like CPDN on them, because of the long deadline, or Einstein, with a shorter (but still long) time and a shorter deadline... but it isn't "fun". Instant gratification, you bet! One of the things that attracted me to Rosetta in the first place was results that could be done in around an hour on my fastest machines. The latest WUs take about twice as long, but that's still okay.

I have SETI and Predictor on my very slowest Mac iBook G3 (only because of that 31-hour monster from Rosetta; as soon as another Mac release comes out, I'll switch that Mac back). Those projects, on that machine, take 8 to 20 hours, which is about the maximum I like to put up with. Einstein took days. SETI is about to make their WUs take up to 10x as long - bye bye, slowest machines... If Rosetta can keep the WUs "short", at least "significantly shorter than SETI", then it's a point in your favor for many of us.

My understanding of the way Rosetta WUs "work" is that you run "x" chunks of work in each. It would be easy to do "2x" or "3x" I suppose, but I would encourage you not to, as long as the servers can handle the traffic. As projects go to longer and longer WUs, being the "short" guy can be an advantage!

(Hm. Just looked at the Mac Mini; the current Rosetta WU estimates 19 hours total at 70% done... that's longer than Einstein, much less SETI... oops.)
Vester Joined: 2 Nov 05 Posts: 258 Credit: 3,651,260 RAC: 521
I believe that blackbird's concern is for users on slow dial-up connections who also pay for service based on bandwidth, as he does in Russia. Blackbird is a spokesman for the number one team at Find-a-Drug, TSC! Russia, which has 1832 members.
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0
blackbird wrote: "I suggest that lowering the WU size could both reduce server traffic and attract new users to the project. GZip alone is not good enough at compressing these files; a good compression ratio can be reached by adding a conversion stage. [...]"

A few points to note.

Zlib is being used for compression, and it gets close to the best compression possible on the raw data. Without changing the source data, the best we could hope for is an extra 1/2% or so. And the problem with many other compression methods is legal: as an example, the LZ compression in GIF was patented by Unisys, which led to all sorts of trouble.

Modifying the source data is probably the best way to go, and there are a couple of things to note. Data is presented in the WU I'm looking at as groups of three lines. The first and second fields are always the same within a group, and the third field is always three sequential integers, so that could be reduced. Getting rid of duplicate spaces makes some difference as well.

The numbers on the end might yield some savings if converted to a binary format; I'll provide the details if anyone asks. Suffice it to say that they currently occupy about 68 bits each and compress to 23 bits each with the current system. I'd estimate they'd take between 18 and 19 bits each as binary data, saving about 4 bits per number. That's 12 bits per line, or something in the area of 64K over the entire size of the file.

However, this last change would probably require a major rework of the code that parses the file, since it's almost certain that the current system is a text-only, line-based parser, which would break quite badly with this proposed change.
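A rough sketch of that first reduction, assuming the group layout described above (the second and third example lines are invented for illustration, and the real client would need the inverse transform):

```c
#include <stdio.h>

/* Collapse one 3-line group. Input lines look like
 *   1gox _ 264 A L -64.197 161.100 173.653
 * where fields 1 and 2 repeat across the group and field 3 runs
 * n, n+1, n+2, so a single "1gox _ 264" header carries all three. */
static void compact_group(const char *lines[3], FILE *out)
{
    char id[16], flag[8], rest[128];
    unsigned pos;

    /* Shared prefix, taken from the first line and written once. */
    if (sscanf(lines[0], "%15s %7s %u", id, flag, &pos) != 3)
        return;
    fprintf(out, "%s %s %u\n", id, flag, pos);

    for (int i = 0; i < 3; i++) {
        /* Drop the three redundant prefix fields, keep the payload. */
        if (sscanf(lines[i], "%*s %*s %*u %127[^\n]", rest) == 1)
            fprintf(out, "%s\n", rest);    /* e.g. "A L -64.197 ..." */
    }
}

int main(void)
{
    const char *group[3] = {
        "1gox _ 264 A L -64.197 161.100 173.653",
        "1gox _ 265 A H -70.912 150.331 179.002",  /* invented lines */
        "1gox _ 266 A E -61.554 148.790 175.221",
    };
    compact_group(group, stdout);
    return 0;
}
```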
Andrew Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0
There's also 7-Zip's LZMA SDK, which is supposed to be a pretty good compression library. It's available under a few licenses as well, namely LGPL, CPL, a simplified license, and a proprietary license.
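For what it's worth, a rough sketch of how the SDK could be slotted in. This assumes the LzmaLib.h convenience wrapper from the SDK; the exact signature and parameter defaults should be checked against the header that ships with it, and lzma_pack is just a hypothetical helper name:

```c
#include <stdio.h>
#include <stdlib.h>
#include "LzmaLib.h"   /* from the 7-Zip LZMA SDK */

/* Compress buf into a malloc'd buffer. Returns the compressed size, or 0
 * on error. The 5 "props" bytes must be kept with the data, because the
 * decompressor (LzmaUncompress) needs them. */
static size_t lzma_pack(const unsigned char *buf, size_t len,
                        unsigned char **out, unsigned char props[5])
{
    size_t outLen = len + len / 3 + 128;   /* generous worst-case guess */
    size_t propsSize = 5;

    *out = malloc(outLen);
    if (*out == NULL)
        return 0;

    /* level 9, 16 MB dictionary, documented defaults for the rest */
    if (LzmaCompress(*out, &outLen, buf, len, props, &propsSize,
                     9, 1 << 24, 3, 0, 2, 32, 1) != SZ_OK) {
        free(*out);
        return 0;
    }
    return outLen;
}
```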
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
First, only a small piece of client code is required to handle the converted WU file - the part that turns the binary data back into a text file. This also means that no major code rewriting is required. There are some other ways to increase the homogeneity of the raw data, and therefore the compression ratio, e.g. grouping by columns instead of by lines.
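A sketch of that column grouping (a transpose, really), assuming a fixed number of whitespace-separated fields per line; the field count and buffer sizes here are arbitrary:

```c
#include <stdio.h>
#include <string.h>

#define NFIELDS  8            /* fields per line, as in the example above */
#define COLBYTES (1 << 20)    /* 1 MB per column; arbitrary */

static char columns[NFIELDS][COLBYTES];

/* Read whitespace-separated lines and emit all of column 1, then all of
 * column 2, and so on, so similar values end up adjacent and gzip can
 * find longer matches. The client would apply the inverse transform. */
static void transpose(FILE *in, FILE *out)
{
    char line[256];
    size_t used[NFIELDS] = {0};

    while (fgets(line, sizeof line, in)) {
        int f = 0;
        for (char *tok = strtok(line, " \t\n"); tok && f < NFIELDS;
             tok = strtok(NULL, " \t\n"), f++) {
            size_t n = strlen(tok);
            if (used[f] + n + 1 > COLBYTES)
                return;        /* full; a real version would grow */
            memcpy(columns[f] + used[f], tok, n);
            columns[f][used[f] + n] = '\n';
            used[f] += n + 1;
        }
    }
    for (int f = 0; f < NFIELDS; f++)
        fwrite(columns[f], 1, used[f], out);
}

int main(void)
{
    transpose(stdin, stdout);  /* pipe the output through gzip -9 */
    return 0;
}
```

Grouping by columns pays off because values in the same column share format and magnitude, so the compressor's window sees much longer repeats.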
nasher Joined: 5 Nov 05 Posts: 98 Credit: 618,288 RAC: 0
Well, I agree it would be nice to have shorter (transmission-wise) jobs, but for me it doesn't make much difference (I'm another person on a cable modem). I also would like to see shorter work units if possible (yes, I do have one computer running about 45% of the time on ClimatePrediction). I like to be sure, for myself at least, that I am sending results back for most of the projects I do at least once per day (I understand that in blackbird's case it probably isn't as good). Of course, any optimizations you make would make me feel better, since it shows that people here care about their users and are trying to make a better product for us to spend our CPU time on.
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
I have written some code in Pascal (67 lines) to test the idea. Column grouping was applied between the text markers 'Position: xxx Neighbors: yyy'. Results (sizes in bytes):

aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1di2_09_05.200_v1_3.gz 1585695 - gzipped WU

With grouping:

aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1dcj_09_05.200_v1_3.grpc 1664680 - converted WU
aa1dcj_09_05.200_v1_3.grpc.gz 775250 - gzipped (gzip -9) converted WU

You can see a twofold decrease in WU size compared to the plain gzipped file. Of course, better grouping code could be used.