WU Compression

blackbird
Joined: 4 Nov 05
Posts: 15
Credit: 93,414
RAC: 0
Message 4578 - Posted: 28 Nov 2005, 17:42:11 UTC

I suggest that reducing WU size would both cut server traffic and attract new users to the project. Gzip alone does not compress such files well enough; a good compression ratio can be reached by adding a conversion stage first. E.g. the 44-byte string

1gox _ 264 A L -64.197 161.100 173.653

can be converted to an 18-byte string

[4 char][1 char][word][char][char][int - by *100][int][int]

and then gzipped. I expect a very good compression ratio.
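
A minimal sketch of such a conversion stage in Python, assuming each line has exactly the eight whitespace-separated fields shown above. The field widths (16-bit word, 32-bit ints) and the names `RECORD`, `pack_line` and `unpack_record` are illustrative assumptions, not taken from the Rosetta code. With these assumed widths the record comes to 21 bytes; the exact widths behind the 18-byte figure are not spelled out above.

```python
import struct

# Assumed layout (field widths are guesses, not taken from the Rosetta code):
# 4-char name, 1-char flag, unsigned 16-bit position, two 1-char codes,
# three signed 32-bit integers holding the angles multiplied by a scale factor.
RECORD = struct.Struct("<4scHcc3i")
SCALE = 100  # factor suggested in the post; 1000 would keep all three decimals


def pack_line(line: str, scale: int = SCALE) -> bytes:
    """Convert one whitespace-separated text line into a fixed-size binary record."""
    name, flag, pos, c1, c2, a, b, c = line.split()
    angles = [round(float(x) * scale) for x in (a, b, c)]
    return RECORD.pack(name.encode("ascii"), flag.encode("ascii"), int(pos),
                       c1.encode("ascii"), c2.encode("ascii"), *angles)


def unpack_record(record: bytes, scale: int = SCALE) -> str:
    """Rebuild a text line from a record (original column padding is not restored)."""
    name, flag, pos, c1, c2, a, b, c = RECORD.unpack(record)
    head = [name.decode("ascii"), flag.decode("ascii"), str(pos),
            c1.decode("ascii"), c2.decode("ascii")]
    return " ".join(head + [f"{v / scale:.3f}" for v in (a, b, c)])


line = "1gox _ 264 A L -64.197 161.100 173.653"
packed = pack_line(line, scale=1000)
print(len(line), "->", len(packed), "bytes")
print(unpack_record(packed, scale=1000))
```

Note that a factor of 100 drops the third decimal place, so the example call uses 1000 to get the text back unchanged.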







ID: 4578 · Rating: 0
thom217
Joined: 21 Oct 05
Posts: 5
Credit: 0
RAC: 0
Message 4579 - Posted: 28 Nov 2005, 17:44:32 UTC

This can be useful for members with dialup connections.
ID: 4579 · Rating: 0
FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 4611 - Posted: 28 Nov 2005, 20:13:13 UTC
Last modified: 28 Nov 2005, 20:14:07 UTC

Link to my suggestion in suggestions

I put it in suggestions some days ago

(and just using rar as an example there)



Hi blackbird :D
Team mauisun.org
ID: 4611 · Rating: 0
Divide Overflow
Joined: 17 Sep 05
Posts: 82
Credit: 921,382
RAC: 0
Message 4615 - Posted: 28 Nov 2005, 20:53:48 UTC - in response to Message 4578.  


ID: 4615 · Rating: -1
Johnathon
Joined: 5 Nov 05
Posts: 120
Credit: 138,226
RAC: 0
Message 4620 - Posted: 28 Nov 2005, 21:19:28 UTC - in response to Message 4615.  


ID: 4620 · Rating: 0
UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 4622 - Posted: 28 Nov 2005, 21:25:30 UTC - in response to Message 4620.  


ID: 4622 · Rating: 0
Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,504,680
RAC: 1,117
Message 4626 - Posted: 28 Nov 2005, 21:46:52 UTC

I think the thread is originally discussing WU "size", meaning the physical size in MB that is transferred. Being on cable, this part bothers me not at all. The thread seems to be also taking on the WU "length" issue, how long it takes to crunch one WU - and I feel that this may SOON become a major issue for attracting/keeping participants.

I say this because I (and many others) have some "slow" machines. It may be POSSIBLE to do something like CPDN on them, because of the long deadline, or Einstein, with a shorter (but still long) time and a shorter deadline... but it isn't "fun". Instant gratification, you bet!

One of the things that attracted me to Rosetta in the first place was results that could be done in around an hour on my fastest machines. The latest WUs take about twice as long, but that's still okay. I have SETI and Predictor on my very slowest Mac iBook G3 (only because of that 31-hour monster from Rosetta, as soon as another Mac release comes out, I'll switch that Mac back). Those projects, on that machine, take 8 to 20 hours, which is about the maximum I like to put up with. Einstein took days. SETI is about to make their WUs take up to 10x as long - bye bye slowest machines... if Rosetta can keep the WUs "short", at least "significantly shorter than SETI", then it's a point in your favor for many of us. My understanding of the way Rosetta WUs "work", is that you run "x" chunks of work in each. It would be easy to do "2x" or "3x" I suppose, but I would encourage you not to do that, as long as the servers can handle the traffic.

As projects go to longer and longer WUs, being the "short" guy can be an advantage!

(Hm. Just looked at the Mac Mini: the current Rosetta WU estimates 19 hours total at 70% done... that's longer than Einstein, much less SETI... oops.)

ID: 4626 · Rating: 0
Vester
Joined: 2 Nov 05
Posts: 257
Credit: 3,358,764
RAC: 15,775
Message 4635 - Posted: 28 Nov 2005, 23:12:33 UTC
Last modified: 28 Nov 2005, 23:13:32 UTC

I believe that blackbird's concern is for users on slow dial-up connections who also pay for service based on bandwidth as he does in Russia.

Blackbird is a spokesman for the number one team at Find-a-Drug, TSC! Russia, which has 1832 members.
ID: 4635 · Rating: 0
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 4639 - Posted: 29 Nov 2005, 0:20:11 UTC - in response to Message 4578.  

I suggest that reducing WU size would both cut server traffic and attract new users to the project. Gzip alone does not compress such files well enough; a good compression ratio can be reached by adding a conversion stage first. E.g. the 44-byte string

1gox _ 264 A L -64.197 161.100 173.653

can be converted to an 18-byte string

[4 char][1 char][word][char][char][int - by *100][int][int]

and then gzipped. I expect a very good compression ratio.



A few points to note. Zlib1 is being used for compression, and it gets close to the best compression possible on the raw data. Without changing the source data, the best we could hope for is an extra 0.5% or so. The problem with many other compression methods is that they come with legal issues. As an example, the LZW compression used in GIF was patented by Unisys, which led to all sorts of trouble.

Modifying the source data is probably the best way to go, and there are a couple of things to note. Data is presented in the WU I'm looking at as groups of three lines. The first and second fields are always the same within a group, and the third field is always three sequential integers, so that could be reduced. Getting rid of duplicate spaces makes some difference as well.

The numbers on the end might yield some savings if converted to a binary format. I'll provide the details if anyone asks. Suffice it to say that they currently occupy about 68 bits each, and compress to 23 bits each with the current system. I'd estimate they'd take between 18 and 19 bits each as binary data, saving 4 bits per number. That's 12 bits per line, or something in the area of 64K over the entire size of the file. However, this last change would probably require a major rework of the code they have to parse the file, since it's almost certain that the current system is text-only and line-based, which would break quite badly with this proposed change.
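
One way to arrive at an estimate in that range, sketched in Python. The assumption here (not stated above) is that each number is an angle between -180.000 and 180.000 stored to three decimals; the 68-bit and 23-bit figures are the ones quoted in this post.

```python
import math

# Assumption: each value is an angle in [-180.000, 180.000] with 0.001 resolution.
distinct_values = 360 * 1000 + 1
binary_bits = math.log2(distinct_values)  # minimum bits needed per value

text_bits = 68  # figure quoted above for the plain-text form
gzip_bits = 23  # figure quoted above after the current compression

# Conservative saving: round the binary size up to whole bits.
saving_per_value = gzip_bits - math.ceil(binary_bits)
saving_per_line = 3 * saving_per_value  # three numbers per line

print(f"bits per value as binary: {binary_bits:.2f}")
print(f"saving per value: {saving_per_value} bits, per line: {saving_per_line} bits")
```

Under that assumption the binary size lands between 18 and 19 bits, which is consistent with a saving of about 4 bits per number and 12 bits per line.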
ID: 4639 · Rating: 1
Andrew
Joined: 19 Sep 05
Posts: 162
Credit: 105,512
RAC: 0
Message 4696 - Posted: 29 Nov 2005, 16:38:26 UTC
Last modified: 29 Nov 2005, 16:39:39 UTC

There's also 7-Zip's LZMA SDK, which is supposed to be a pretty good compression library.

It's available under a few licenses as well, namely: LGPL, CPL, a simplified license, and a proprietary license.
ID: 4696 · Rating: 1
blackbird
Joined: 4 Nov 05
Posts: 15
Credit: 93,414
RAC: 0
Message 4699 - Posted: 29 Nov 2005, 18:08:58 UTC

First, only a small piece of code is required to convert the downloaded WU file: it turns the binary data back into a text file. This also means that no major code rewrite is required.

There are some other ways to increase the homogeneity of the raw data and therefore the compression ratio, e.g. grouping the data by columns instead of by lines.
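
A minimal sketch of column grouping in Python, assuming every line has the same number of single-space-separated fields. The helper names `group_columns` and `ungroup_columns` are hypothetical, and the input is synthetic stand-in data rather than a real WU file, so the printed sizes say nothing about real work units.

```python
import gzip
import random


def group_columns(lines):
    """Transpose whitespace-separated rows so that each column is stored contiguously."""
    rows = [line.split() for line in lines]
    columns = zip(*rows)  # assumes every row has the same number of fields
    return "\n".join(" ".join(col) for col in columns)


def ungroup_columns(text):
    """Inverse of group_columns: rebuild the original rows."""
    columns = [col.split(" ") for col in text.split("\n")]
    return [" ".join(row) for row in zip(*columns)]


# Synthetic stand-in data (hypothetical; not a real WU file).
rng = random.Random(0)
lines = [
    f"1gox _ {264 + i // 3} A L "
    f"{rng.uniform(-180, 180):.3f} {rng.uniform(-180, 180):.3f} {rng.uniform(-180, 180):.3f}"
    for i in range(3000)
]

by_line = "\n".join(lines).encode("ascii")
by_column = group_columns(lines).encode("ascii")
print("gzip, line order:  ", len(gzip.compress(by_line, compresslevel=9)))
print("gzip, column order:", len(gzip.compress(by_column, compresslevel=9)))
```

The transform is lossless for this kind of input, so the client only needs the inverse step before handing the text file to the existing parser.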
ID: 4699 · Rating: 0
nasher
Joined: 5 Nov 05
Posts: 98
Credit: 618,288
RAC: 0
Message 4707 - Posted: 29 Nov 2005, 18:51:54 UTC

Well, I agree it would be nice to have shorter (transmission-wise) jobs, but for me it doesn't make much difference (another person on a cable modem).

I would also like to see shorter work units if possible (yes, I do have one computer running about 45% of the time on climate predictor). I like to be sure, for myself at least, that I am sending results back at least once per day for most of the projects I do (I understand that in blackbird's case it probably isn't as good).

Of course, any optimizations you make would make me feel better, since it shows me that people here care about their users and are trying to make a better product for us to spend our CPU time on.
ID: 4707 · Rating: 0
blackbird
Joined: 4 Nov 05
Posts: 15
Credit: 93,414
RAC: 0
Message 4810 - Posted: 30 Nov 2005, 17:32:07 UTC

I have written some code in Pascal (67 lines) to test the idea. Column grouping was applied to the data between the 'Position: xxx Neighbors: yyy' lines.

Results: (sizes in bytes)
aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1di2_09_05.200_v1_3.gz 1585695 - gzipped WU

With grouping:
aa1dcj_09_05.200_v1_3 5294250 - original WU
aa1dcj_09_05.200_v1_3.grpc 1664680 - converted WU
aa1dcj_09_05.200_v1_3.grpc.gz 775250 - gzipped (gzip -9) converted WU

You can see a twofold decrease in the gzipped WU size. Of course, better grouping code could be used.
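
For reference, the ratios implied by the sizes listed above can be worked out as follows, assuming the gzip-only figure refers to the same WU as the other entries.

```python
# Sizes in bytes, copied from the results above.
original = 5_294_250         # original WU
gzipped = 1_585_695          # gzip only
grouped_gzipped = 775_250    # column grouping, then gzip -9

ratio_gzip = original / gzipped
ratio_grouped = original / grouped_gzipped
gain = gzipped / grouped_gzipped

print(f"gzip only:          {ratio_gzip:.2f}x smaller than the original")
print(f"grouping + gzip:    {ratio_grouped:.2f}x smaller than the original")
print(f"gain from grouping: {gain:.2f}x")
```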

ID: 4810 · Rating: 0
