Pcodec (or Pco, pronounced "pico") losslessly compresses and decompresses numerical sequences with high compression ratio and moderately fast speed.
Use cases include:
- columnar data
- long-term time series data
- serving numerical data to web clients
- low-bandwidth communication
Data types:
u16
, u32
, u64
, i16
, i32
, i64
, f16
, f32
, f64
Use the CLI (also supports benchmarking)
Pco is designed specifically for numerical data, whereas alternatives rely on general-purpose (LZ) compressors that were designed for string or binary data. Pco uses a holistic, 3-step approach:
- modes.
Pco identifies an approximate structure of the numbers called a
mode and then applies it to all the numbers.
As an example, if all numbers are approximately multiples of 777, int mult mode
decomposes each number
x
into latent variablesl_0
andl_1
such thatx = 777 * l_0 + l_1
. Most natural data uses classic mode, which simply matchesx = l_0
. - delta enoding. Pco identifies whether certain latent variables would be better compressed as consecutive deltas (or deltas of deltas, or so forth). If so, it takes consecutive differences.
- binning. This is the heart and most novel part of Pco. Pco represents each (delta-encoded) latent variable as an approximate, entropy-coded bin paired an exact offset into that bin. This nears the Shannon entropy of any smooth distribution very efficiently.
These 3 steps cohesively capture most entropy of numerical data without waste.
In contrast, LZ compressors are only effective for patterns like repeating exact sequences of numbers. Such patterns constitute just a small fraction of most numerical data's entropy.
Pco is designed to be easily wrapped into another format. It provides a powerful wrapped API with the building blocks to interleave it with the wrapping format. This is useful if the wrapping format needs to support things like nullability, multiple columns, random access or seeking.
The standalone format is a minimal implementation of a wrapped format. It supports batched decompression only with no other niceties. It is mainly recommended for quick proofs of concept and benchmarking.
Pco has a hierarchy of multiple batches per page; multiple pages per chunk; and multiple chunks per file.
unit of ___ | size for good compression | |
---|---|---|
chunk | compression | >10k numbers |
page | interleaving w/ wrapping format | >1k numbers |
batch | decompression | 256 numbers (fixed) |
You will get disappointing results from Pco if your data:
- combines semantically different sequences into a single chunk, or
- contains fewer numbers per chunk or page than recommended (see above table).
Example: the NYC taxi dataset has f64
columns for passenger_base_fare
and
tolls
.
Suppose we assign these as fare[0...n]
and tolls[0...n]
respectively, where
n=50,000
.
- separate chunk for each column => good compression
- single chunk
fare[0], ... fare[n-1], toll[0], ... toll[n-1]
=> mediocre compression - single chunk
fare[0], toll[0], ... fare[n-1], toll[n-1]
=> poor compression
Similarly, we could compress images by making a separate chunk for each flattened channel (red, green, blue). Though dedicated formats like webp likely compress natural images better.