Message boards : Number crunching : WU Compression
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
I suggest that lowering the WU size would both reduce server traffic and attract new users to the project. Gzip alone does not compress these files well; a much better compression rate can be reached by adding a conversion stage first. E.g. the 44-byte string 1gox _ 264 A L -64.197 161.100 173.653 can be converted to an 18-byte record [4 chars][1 char][16-bit word][char][char][int, scaled by 1000][int][int] and then gzipped. I would expect a very good compression rate.
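A minimal sketch of the conversion blackbird describes, in Python for illustration (the field names, the 16-bit residue number, and the x1000 scaling are assumptions inferred from the layout above, not project code):

```python
import struct

# Hypothetical packer for one 44-byte text line, following the proposed
# layout [4 chars][1 char][uint16][char][char][3 x 24-bit int, scaled x1000].
def pack_line(line):
    pdb, chain, resnum, aa, ss, a1, a2, a3 = line.split()
    out = struct.pack("<4scHcc", pdb.encode(), chain.encode(),
                      int(resnum), aa.encode(), ss.encode())
    for a in (a1, a2, a3):
        # x1000 keeps three decimal places; 3 bytes hold +/-8388.608
        out += round(float(a) * 1000).to_bytes(3, "little", signed=True)
    return out

def unpack_line(buf):
    pdb, chain, resnum, aa, ss = struct.unpack("<4scHcc", buf[:9])
    nums = [int.from_bytes(buf[9 + 3 * i:12 + 3 * i], "little", signed=True) / 1000
            for i in range(3)]
    return " ".join([pdb.decode(), chain.decode(), str(resnum),
                     aa.decode(), ss.decode()] +
                    ["%.3f" % n for n in nums])

packed = pack_line("1gox _ 264 A L -64.197 161.100 173.653")
print(len(packed))            # 18 bytes, down from 44
print(unpack_line(packed))    # round-trips to the original text
```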
thom217 Joined: 21 Oct 05 Posts: 5 Credit: 0 RAC: 0
This would be useful for members on dial-up connections.
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
Link to my suggestion in the Suggestions forum. I put it there some days ago (and just used rar as an example there). Hi blackbird :D Team mauisun.org
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9
I think the thread is originally discussing WU "size", meaning the physical size in MB that is transferred. Being on cable, this part bothers me not at all. The thread seems to be also taking on the WU "length" issue, how long it takes to crunch one WU, and I feel that this may SOON become a major issue for attracting/keeping participants.

I say this because I (and many others) have some "slow" machines. It may be POSSIBLE to do something like CPDN on them, because of the long deadline, or Einstein, with a shorter (but still long) time and a shorter deadline... but it isn't "fun". Instant gratification, you bet! One of the things that attracted me to Rosetta in the first place was results that could be done in around an hour on my fastest machines. The latest WUs take about twice as long, but that's still okay.

I have SETI and Predictor on my very slowest Mac iBook G3 (only because of that 31-hour monster from Rosetta; as soon as another Mac release comes out, I'll switch that Mac back). Those projects, on that machine, take 8 to 20 hours, which is about the maximum I like to put up with. Einstein took days. SETI is about to make their WUs take up to 10x as long, so bye-bye slowest machines... if Rosetta can keep the WUs "short", at least "significantly shorter than SETI", then it's a point in your favor for many of us.

My understanding of the way Rosetta WUs "work" is that you run "x" chunks of work in each. It would be easy to do "2x" or "3x" I suppose, but I would encourage you not to do that, as long as the servers can handle the traffic. As projects go to longer and longer WUs, being the "short" guy can be an advantage! (Hm. Just looked at the Mac Mini; the current Rosetta WU estimates 19 hours total at 70% done... that's longer than Einstein, much less SETI... oops.)
Vester Joined: 2 Nov 05 Posts: 258 Credit: 3,651,260 RAC: 775
I believe that blackbird's concern is for users on slow dial-up connections who also pay for service based on bandwidth, as he does in Russia. Blackbird is a spokesman for the number one team at Find-a-Drug, TSC! Russia, which has 1,832 members.
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0
> I can suggest that lowering WU size can both reduce server traffic and attract new users to the project. GZip is not good enough in compressing such files, and good compression rate can be reached with converting stage. E.g. 44-bytes string

A few points to note. Zlib1 is being used for compression, and it gets close to the best compression possible on the raw data. Without changing the source data, the best we could hope for is an extra 1/2% or so. And the problem with many other compression methods is that there are legal problems with them. As an example, the LZ compression in GIF was patented by UNISYS, which led to all sorts of trouble.

Modifying the source data is probably the best way to go; there are a couple of things to note. Data is presented in the WU I'm looking at as groups of three lines. The first and second fields are always the same within a group, and the third field is always three sequential integers. So that could be reduced. Getting rid of duplicate spaces makes some difference as well.

The numbers on the end might yield some savings if converted to a binary format. I'll provide the details if anyone asks. Suffice it to say that they currently occupy about 68 bits each, and compress to 23 bits each with the current system. I'd estimate they'd take between 18 and 19 bits each as binary data, saving 4 bits per number. That's 12 bits per line, or something in the area of 64K over the entire size of the file. However, this last change will probably require a major rework of the code that parses the file, since it's almost certain that the current system is a text-only, line-based parser, which would break quite badly with this proposed change.
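A rough check of dgnuff's bit estimate (an illustrative Python calculation, assuming the trailing numbers are angles bounded by +/-180.000 with three decimal places, which matches the sample line above):

```python
import math

# Bits needed to represent any integer in the range [lo, hi]
def bits_needed(lo, hi):
    return math.ceil(math.log2(hi - lo + 1))

# Scaled by 1000, an angle spans -180000..+180000:
print(bits_needed(-180000, 180000))   # 19 bits as raw binary
# As formatted text, "-64.197 " is 8 characters, i.e. 64 bits, and wider
# fields push the average toward dgnuff's ~68 bits. Versus the ~23 bits
# gzip reaches per number, binary saves ~4 bits per number, 12 bits per
# 3-number line, on the order of 64K over the whole file.
```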
Andrew Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0
There's also 7-Zip's LZMA SDK, which is supposed to be a pretty good compression library. It's available under a few licenses as well, namely: LGPL, CPL, a simplified license, and a proprietary license.
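A quick way to gauge what LZMA would buy over gzip on a WU file (a sketch using Python's stdlib bindings as a stand-in for the C SDK; the filename is taken from blackbird's test below):

```python
import gzip
import lzma

# Compare gzip -9 against LZMA preset 9 on one raw WU file.
raw = open("aa1dcj_09_05.200_v1_3", "rb").read()
print("original:", len(raw))
print("gzip -9: ", len(gzip.compress(raw, compresslevel=9)))
print("lzma -9: ", len(lzma.compress(raw, preset=9)))
```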
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
First, only a small piece of extra code is required on the client: a converter that turns the downloaded binary data back into the text file. This also means that no major code rewriting is required. There are some other ways to increase the homogeneity of the raw data, and therefore the compression ratio, e.g. grouping by columns instead of by lines.
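A minimal sketch of the column-grouping idea (hypothetical Python; blackbird's actual test below was 67 lines of Pascal). Writing each column contiguously places similar values next to each other, which the compressor can exploit far better than the row-by-row layout:

```python
# Transpose whitespace-separated rows into columns and back.
def group_columns(text):
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return "\n".join(" ".join(col) for col in zip(*rows))

def ungroup_columns(grouped):
    cols = [line.split() for line in grouped.splitlines()]
    return "\n".join(" ".join(row) for row in zip(*cols))

# Second line is made up for illustration.
sample = ("1gox _ 264 A L -64.197 161.100 173.653\n"
          "1gox _ 265 A L -60.512 150.381 170.224")
assert ungroup_columns(group_columns(sample)) == sample
```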
nasher Joined: 5 Nov 05 Posts: 98 Credit: 618,288 RAC: 0
Well, I agree it would be nice to have shorter (transmission-wise) jobs, but for me it doesn't make much difference (another person on a cable modem). I also would like to see shorter work units if possible (yes, I do have one computer running about 45% of the time on ClimatePrediction). I like to be sure, for myself at least, that I am sending results back for most of the projects I do at least once per day (I understand that in blackbird's case it probably isn't as good). Of course, any optimizations you make would make me feel better, since it shows that the people here care about their users and are trying to make a better product for us to spend our CPU time on.
blackbird Joined: 4 Nov 05 Posts: 15 Credit: 93,414 RAC: 0
I have written some code in Pascal (67 lines) to test the idea. Column grouping between the 'Position: xxx Neighbors: yyy' markers was used. Results (sizes in bytes):

aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1di2_09_05.200_v1_3.gz 1585695 - gzipped WU

With grouping:

aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1dcj_09_05.200_v1_3.grpc 1664680 - converted WU
aa1dcj_09_05.200_v1_3.grpc.gz 775250 - gzipped (gzip -9) converted WU

You can see a twofold decrease in WU size: 775,250 bytes downloaded instead of the 1,585,695 bytes of the plain gzipped WU. Of course, better grouping code could be used.