-
Notifications
You must be signed in to change notification settings - Fork 18
Main page
Run "java -jar kanzi.jar -h" so see the help general page and "java -jar kanzi.jar -c -h" or "java -jar kanzi.jar -d -h" to see the help page for compression and decompression respectively.
The easiest way to compress is to use one of the existing pre-defined levels (0 to 9) using "-l".
The input file or directory must be provided with "-i".
The output file or directory can be provided with "-o".
Use "-b" to specify a block size (if omitted the default block size is selected).
The compression or decompression can be done concurrently by providing the number of jobs using "-j".
Instead of a specific compression level, one can specify the transforms and entropy to use by using "-t" and "-e".
See the help pages for the list of transforms and entropy codecs available.
The order of the parameters on the command line does not matter.
Examples:
java -jar kanzi.jar -c -i foo.txt -o none -b 4m -l 4 -v 3
Do a dry-run compression of foo.txt using block of 4 MB and the level 4 with a verbosity set to 3.
java -jar kanzi.jar -c -i foo.txt -f -t BWT+MTFT+ZRLT -b 4m -e FPAQ -j 4
Compress foo.txt to foo.txt.knz and overwrite the output file if it already exists. Use the Burrows-Wheeler, Move-to-front and Zero Run Length transforms followed by the FPAQ entropy codec. Use 4 jobs to compress the file.
java -jar kanzi.jar -d -i foo.knz -f -j 2
Decompress the file foo.knz to foo.knz.bak and force an overwrite. Use 2 jobs to decompress the file.
This project offers Java code for manipulation and lossless compression of data. Kanzi contains the most versatile all-purpose data compressor for Java (and the fastest block compressor in Java).
Other utilities include lossless compression codecs (Huffman, Range, LZ, ROLZ, PAQ, Asymmetric Numeral System, Context Model), bit stream manipulation, Burrows-Wheeler (BWT) and the bijective version BWTS, Move-To-Front transform, run length coding, etc ...
kanzi
- app
- bitstream
- entropy
- function
- io
- test
- transform
- util
- util/hash
- util/sort
Description:
- kanzi: top level including common classes and interfaces
- app: contains applications (E.G. block compressor)
- bitstream: utilities to manipulate a stream of data at the bit level
- entropy: implementation of several common entropy codecs (process bits)
- io: implementation of InputStream, OutputStream with block codec
- test: contains many classes to test the utility classes
- transform: implementation of common functions (input and output sizes are identical) such as Burrows-Wheeler, SortByRank
- util: utility classes
- util/hash: implementations of various hash functions (MurMurHash, xxHash32, xxHash64, SipHash_2_4)
- util/sort: implementation of the most common sorting algorithms: QuickSort, Radix, MergeSort, BucketSort, etc...
There are no static dependencies to other jar files.
jdeps kanzi.jar (filtered for external dependencies only) kanzi (kanzi.jar) -> java.io -> java.lang -> java.nio -> java.nio.file -> java.util kanzi.app (kanzi.jar) -> java.io -> java.lang -> java.nio.file -> java.nio.file.attribute -> java.util -> java.util.concurrent kanzi.bitstream (kanzi.jar) -> java.io -> java.lang kanzi.entropy (kanzi.jar) -> java.io -> java.lang -> java.util kanzi.io (kanzi.jar) -> java.io -> java.lang -> java.util -> java.util.concurrent -> java.util.concurrent.atomic kanzi.transform (kanzi.jar) -> java.io -> java.lang -> java.util -> java.util.concurrent kanzi.util (kanzi.jar) -> java.io -> java.lang -> java.nio.charset -> java.util kanzi.util.hash (kanzi.jar) -> java.lang kanzi.util.sort (kanzi.jar) -> java.io -> java.lang -> java.util -> java.util.concurrent
Java 7 or higher is required.
Usage:
java -jar kanzi.jar --help Kanzi 2.0 (c) Frederic Langlet Credits: Matt Mahoney, Yann Collet, Jan Ondrus, Yuta Mori, Ilya Muravyov, Neal Burns, Fabian Giesen, Jarek Duda, Ilya Grebnov -h, --help display this message -i, --input=<inputName> mandatory name of the input file or directory or 'stdin' When the source is a directory, all files in it will be processed. Provide /. at the end of the directory name to avoid recursion (EG: myDir/. => no recursion) -o, --output=<outputName> optional name of the output file or directory (defaults to <inputName.knz>) or 'none' or 'stdout'. 'stdout' is not valid when the number of jobs is greater than 1. -b, --block=<size> size of blocks (default 4|8|16 MB based on level, max 1 GB, min 1 KB). -l, --level=<compression> set the compression level [0..9] Providing this option forces entropy and transform. 0=None&None (store), 1=TEXT+LZ&HUFFMAN, 2=TEXT+FSD+LZX&HUFFMAN 3=TEXT+FSD+ROLZ, 4=TEXT+FSD+ROLZX, 5=TEXT+BWT+RANK+ZRLT&ANS0 6=TEXT+BWT+SRT+ZRLT&FPAQ, 7=LZP+TEXT+BWT+LZP&CM, 8=X86+RLT+TEXT&TPAQ 9=X86+RLT+TEXT&TPAQX -e, --entropy=<codec> entropy codec [None|Huffman|ANS0|ANS1|Range|FPAQ|TPAQ|TPAQX|CM] (default is ANS0) -t, --transform=<codec> transform [None|BWT|BWTS|LZ|LZX|LZP|ROLZ|ROLZX|RLT|ZRLT] [MTFT|RANK|SRT|TEXT|FSD|X86] EG: BWT+RANK or BWTS+MTFT (default is BWT+RANK+ZRLT) -x, --checksum enable block checksum -s, --skip copy blocks with high entropy instead of compressing them. -j, --jobs=<jobs> maximum number of jobs the program may start concurrently (default is 1, maximum is 64). -v, --verbose=<level> 0=silent, 1=default, 2=display details, 3=display configuration, 4=display block size and timings, 5=display extra information Verbosity is reduced to 1 when files are processed concurrently Verbosity is reduced to 0 when the output is 'stdout' -f, --force overwrite the output file if it already exists EG. java -jar kanzi.jar -c -i foo.txt -o none -b 4m -l 4 -v 3 EG. java -jar kanzi.jar -c -i foo.txt -f -t BWT+MTFT+ZRLT -b 4m -e FPAQ -v 3 -j 4 EG. java -jar kanzi.jar --compress --input=foo.txt --output=foo.knz --block=4m --force --transform=BWT+MTFT+ZRLT --entropy=FPAQ --verbose=3 --jobs=4 EG. java -jar kanzi.jar -d -i foo.knz -f -v 2 -j 2 EG. java -jar kanzi.jar --decompress --input=foo.knz --force --verbose=2 --jobs=2
The Block compressor cuts the input file into chunks of 1 MB (or the size provided on the command line with the 'block' option up to 512MB). Optionally, a checksum for the chunk of data can be computed and stored in the output.
As a first step, it applies a transform (default is BWT+MTFT+ZRLT) to turn the block into a smaller number of bytes (byte transform). If BWT(S) is chosen, a post processing transform can be provided (EG. BWTS+RANK or BWT+MTFT). Usually, the BWT(s) and the selected GST are followed by a ZRLT (to remove the runs of 0). Up to 4 transforms (input size = output size) or functions (input size != output size) can be provided at first stage. EG: BWT+RANK+ZRLT.
As a second step, entropy coding is performed (to turn the block into a smaller number of bits).
Each step can be bypassed based on command line options.
The decompressor extracts all necessary information from the header of the bitstream (input file) such as entropy type, transform type, block size, checksum enabled/disabled, etc... before applying appropriate entropy decoder followed by the inverse transform for each block. Optionally, a checksum is computed and checked against the one stored in the bitstream (based on original data).
The 2 step process allows either very fast compression/decompression (LZ+no entropy or LZ+Huffman) or high compression ratio (BWT(S) + CM or PAQ or TPAQ + block size > 1MB).
- BWT: Burrows Wheeler Transform (https://en.wikipedia.org/wiki/Burrows-Wheeler_transform) is a transform that reorders symbols in a (reversible) way that is more amenable to entropy coding. This implementation uses a linear time foward transform and parallel inverse tranform.
- BWTS: Burrows Wheeler Transform by Scott is a bijective variant of the BWT (slower).
- LZ: Lempel Ziv (https://en.wikipedia.org/wiki/LZ77_and_LZ78). An implementation of the dictionary based LZ77 transform that removes redundancy in the data.
- RLT: Run Length Transform (https://en.wikipedia.org/wiki/Run-length_encoding). A simple transform that replaces runs of similar symbols with a compact representation.
- ZRLT: Zero Run Length Transform. Similar to RLT but only processes runs of 0. Usually used post BWT.
- MTFT: Move-To-Front Transform (https://en.wikipedia.org/wiki/Move-to-front_transform). A transform that reduces entropy by assigning shorter symbols to recent data (works like a LRU cache). Usually used post BWT.
- RANK: Rank Transform: A transform that that reduces entropy by assigning shorter symbols based on symbol frequency ranks. Usually used post BWT.
- X86: A transform that reduces the entropy of executable files by replacing relative jump addresses with absolute ones.
- TEXT: A text transform that uses a dictionary to replace common words with their index.
- ROLZ: Reduced Offset Lempel Ziv. An implementation of LZ that replaces match offsets with indexes of match offsets. It yields a more compact output at the cost of an extra indirection (hence slower decoding speeds).
- ROLZX: Extended ROLZ with more match searches and a more compact encoding.
- SRT: Sorted Rank Transform: A transform that that reduces entropy by assigning shorter symbols based on symbol frequency ranks. Usually used post BWT.
- LZP: Lempel Ziv Prediction can be described as an LZ implementation with only one possible match (hence the offset is not even emitted). It is pretty weak (but fast) and usually used prior to a BWT to remove some redundancy and speed up coding.
- FSD: Fixed Shift Delta is a fast transform that removes redundancy in correlated channels in some multimedia files (EG. wav, pnm).
Several entropy codecs have been implemented (sorted by increasing compression):
- Huffman: The codec is a fast canonical Huffman implementation. Both encoder and decoder use tables to compute code lengths and values instead of a tree (for speed purpose).
- RANGE: A fast implementation that uses pre-computed block statistics.
- ANS: Based on Range Asymmetric Numeral Systems by Jarek Duda (specifically an implementation by Fabian Giesen). Works in a similar fashion to the Range encoder but uses only 1 state (instead of bottom and range) and does encoding in reverse byte order.
- FPAQ: A binary arithmetic codec based on FPAQ1 by Matt Mahoney. Uses a simple, adaptive order 0 predictor based on frequencies. Fast and compact code.
- CM: A binary arithmetic codec based on BCM by Ilya Muravyov. Uses context mixing of counters to generate a prediction. Efficient and decently fast.
- TPAQ: A binary arithmetic codec based on Tangelo 2.4 (itself derived from FPAQ8). Uses context mixing of predictions produced by one layer neural networks. The initial code has been heavily tuned to improve compression ratio and speed. Slowest but usually best compression ratio.
bits | optional? | role | value | comment |
32 | no | magic number | 0x4B414E5A | 'KANZ' |
4 | no | bit stream version | 2 | |
1 | no | block checksum ? | 0 or 1 | |
5 | no | entropy | see table | one among 32 |
48 | no | transforms (8*6 bits) | see table | up to 8 among 64 |
28 | no | block size / 16 | in bytes | |
6 | no | number of blocks(*) | 0 to 63 | hint for decompressor |
4 | no | reserved | ignored | for future use |
(*) Number of blocks: If the size of the data is not available to the compressor, 0 is written. Otherwise the compressor divides the input size by the block size (up to 63). This value is used by the decompressor as a hint to optimize the memory allocation in multi-threaded scenarios.
entropy:
- NONE_TYPE = 0; No compression
- HUFFMAN_TYPE = 1; Huffman
- FPAQ_TYPE = 2; Fast PAQ (order 0)
- PAQ_TYPE = 3; PAQ (Obsolete, not supported)
- RANGE_TYPE = 4; Range
- ANS0_TYPE = 5; Asymmetric Numerical System order 0
- CM_TYPE = 6; Context Model
- TPAQ_TYPE = 7; Tangelo PAQ
- ANS1_TYPE = 8; Asymmetric Numerical System order 1
- TPAQX_TYPE = 9; Tangelo PAQ Extra
transform:
- NONE_TYPE = 0; Copy
- BWT_TYPE = 1; Burrows Wheeler
- BWTS_TYPE = 2; Burrows Wheeler Scott
- LZ_TYPE = 3; Lempel Ziv
- SNAPPY_TYPE = 4; Snappy (obsolete, not supported)
- RLT_TYPE = 5; Run Length
- ZRLT_TYPE = 6; Zero Run Length
- MTFT_TYPE = 7; Move To Front
- RANK_TYPE = 8; Rank
- X86_TYPE = 9; X86 codec
- DICT_TYPE = 10; Text codec
- ROLZ_TYPE = 11; ROLZ codec
- ROLZX_TYPE = 12; ROLZ Extra codec
- SRT_TYPE = 13; Sorted Rank
- LZP_TYPE = 14; Lempel Ziv Predict
- FSD_TYPE = 15; Fix Shift Delta codec
- LZX_TYPE = 16; Lempel Ziv Extra
bits | optional? | role | comment |
5 | no | log(cbs) - 3 | block size <= 1GB |
log(cbs) | no | cbs | compressed block size, if 0 last block |
8*cbs | no | block data | if cbs = 0, no data |
bits | optional? | role | comment |
8 | no | block mode | see format below |
8 | yes | block transform skip flags | present if more than 4 transforms only |
8*ds | no | data | ds = 1+((mode>>5)&0x03) |
32 | yes | block checksum |
// mode | 0b10000000 => copy block // | 0b0yy00000 => size(size(block))-1 // | 0b000y0000 => 1 if more than 4 transforms // case 4 transforms or less // | 0b0000yyyy => transform sequence skip flags (1 means skip) // case more than 4 transforms, read next byte // | 0b00000000 // then 0byyyyyyyy => transform sequence skip flags (1 means skip)
To decode a block
- Decode the block header
- Read the block data
- Entropy decode the data
- Run sequentially every inverse transform on the decoded data (except those with a skip flag set to 1)
- If a checksum was present, checksum the transformed data and compare with the checksum provided in the bitstream