readme and format updates (#249)
mwlon authored Nov 9, 2024
1 parent 29a9fcf commit feb8449
Showing 7 changed files with 285 additions and 338 deletions.
43 changes: 21 additions & 22 deletions README.md
@@ -1,5 +1,8 @@
<div style="text-align:center">
<img alt="Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand" src="images/logo.svg" width="160px">
<img
alt="Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand" src="images/logo.svg"
width="160px"
>
</div>
[![crates.io][crates-badge]][crates-url]
@@ -48,20 +51,21 @@ numerical sequences with
## How is Pco so much better than alternatives?

Pco is designed specifically for numerical data, whereas alternatives rely on
general-purpose (LZ) compressors that were designed for string or binary data.
general-purpose (LZ) compressors that target string or binary data.
Pco uses a holistic, 3-step approach:

* **modes**.
Pco identifies an approximate structure of the numbers called a
mode and then applies it to all the numbers.
mode and then uses it to split numbers into "latents".
As an example, if all numbers are approximately multiples of 777, int mult mode
decomposes each number `x` into latent variables `l_0` and
splits each number `x` into latent variables `l_0` and
`l_1` such that `x = 777 * l_0 + l_1`.
Most natural data uses classic mode, which simply matches `x = l_0`.
* **delta enoding**.
* **delta encoding**.
Pco identifies whether certain latent variables would be better compressed as
consecutive deltas (or deltas of deltas, or so forth).
If so, it takes consecutive differences.
deltas between consecutive elements (or deltas of deltas, or deltas with
lookback).
If so, it takes differences.
* **binning**.
This is the heart and most novel part of Pco.
Pco represents each (delta-encoded) latent variable as an approximate,
Expand All @@ -79,11 +83,11 @@ entropy.

### Wrapped or Standalone

Pco is designed to be easily wrapped into another format.
Pco is designed to embed into wrapping formats.
It provides a powerful wrapped API with the building blocks to interleave it
with the wrapping format.
This is useful if the wrapping format needs to support things like nullability,
multiple columns, random access or seeking.
multiple columns, random access, or seeking.

The standalone format is a minimal implementation of a wrapped format.
It supports batched decompression only with no other niceties.
Expand All @@ -102,24 +106,19 @@ multiple chunks per file.

### Mistakes to Avoid

You will get disappointing results from Pco if your data:
You may get disappointing results from Pco if your data in a single chunk

* combines semantically different sequences into a single chunk, or
* contains fewer numbers per chunk or page than recommended (see above table).
* combines semantically different sequences, or
* is inherently 2D or higher.

Example: the NYC taxi dataset has `f64` columns for `passenger_base_fare` and
`tolls`.
Suppose we assign these as `fare[0...n]` and `tolls[0...n]` respectively, where
Example: the NYC taxi dataset has `f64` columns for `fare` and
`trip_miles`.
Suppose we assign these as `fare[0...n]` and `trip_miles[0...n]` respectively, where
`n=50,000`.

* separate chunk for each column => good compression
* single chunk `fare[0], ... fare[n-1], toll[0], ... toll[n-1]` => mediocre
compression
* single chunk `fare[0], toll[0], ... fare[n-1], toll[n-1]` => poor compression

Similarly, we could compress images by making a separate chunk for each
flattened channel (red, green, blue).
Though dedicated formats like webp likely compress natural images better.
* single chunk `fare[0], ... fare[n-1], trip_miles[0], ... trip_miles[n-1]` => bad compression
* single chunk `fare[0], trip_miles[0], ... fare[n-1], trip_miles[n-1]` => bad compression
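The layout advice above can be illustrated with a toy measurement. This is a sketch with hypothetical data (not the real taxi dataset, and not the pco API): total absolute consecutive difference is a rough proxy for how compressible a sequence is under delta encoding and binning, and interleaving two columns of different scales inflates it badly.

```rust
// Rough proxy for delta-encoding friendliness: the total absolute
// consecutive difference of a sequence. Smaller is better.
fn roughness(xs: &[f64]) -> f64 {
    xs.windows(2).map(|w| (w[1] - w[0]).abs()).sum()
}

// Interleave two equal-length columns into one sequence,
// as in the "bad compression" layouts above.
fn interleave(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).flat_map(|(&x, &y)| [x, y]).collect()
}
```

On smooth synthetic `fare`-like and `trip_miles`-like columns, the interleaved sequence's roughness exceeds the sum of the per-column roughnesses by orders of magnitude, since every consecutive delta jumps between the two scales.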

## Extra

151 changes: 90 additions & 61 deletions docs/format.md
@@ -8,18 +8,24 @@ Bit packing a component is completed by filling the rest of the byte with 0s.

Let `dtype_size` be the data type's number of bits.
A "raw" value for a number is a `dtype_size` value that maps to the number
via [its `from_unsigned` function](#numbers---latents).
via [its `from_unsigned` function](#Modes).

## Wrapped Format Components

<img alt="Pco wrapped format diagram" title="Pco wrapped format" src="../images/wrapped_format.svg" />

The wrapped format consists of 3 components: header, chunk metadata, and data
pages.
Wrapping formats may encode these components any place they wish.

Pco is designed to have one header per file, possibly multiple chunks per
header, and possibly multiple data pages per chunk.
header, and possibly multiple pages per chunk.

[Plate notation](https://en.wikipedia.org/wiki/Plate_notation) for chunk
metadata component:

<img alt="Pco wrapped chunk meta plate notation" src="../images/wrapped_chunk_meta_plate.svg" width="500px"/>

Plate notation for page component:

<img alt="Pco wrapped page plate notation" src="../images/wrapped_page_plate.svg" width="500px"/>

### Header

Expand All @@ -34,11 +40,12 @@ The header simply consists of

So far, these format versions exist:

| format version | first Rust version | deviations from next format version |
|----------------|--------------------|-----------------------------------------------|
| 0 | 0.0.0 | int mult mode unsupported |
| 1 | 0.1.0 | float quant mode and 16-bit types unsupported |
| 2 | 0.3.0 | - |
| format version | first Rust version | deviations from next format version |
|----------------|--------------------|----------------------------------------------|
| 0 | 0.0.0 | IntMult mode unsupported |
| 1 | 0.1.0 | FloatQuant mode and 16-bit types unsupported |
| 2 | 0.3.0 | delta variants and Lookback unsupported |
| 3 | 0.4.0 | - |

### Chunk Metadata

Expand All @@ -47,22 +54,41 @@ metadata is out of range.
For example, if the sum of bin weights does not equal the tANS size; or if a
bin's offset bits exceed the data type size.

Each chunk metadata consists of
Each chunk meta consists of

* [4 bits] `mode`, using this table:

| value | mode | n latent variables | 2nd latent uses delta? | `extra_mode_bits` |
|-------|--------------|--------------------|------------------------|-------------------|
| 0 | classic | 1 | | 0 |
| 1 | int mult | 2 | no | `dtype_size` |
| 2 | float mult | 2 | no | `dtype_size` |
| 3 | float quant | 2 | no | 8 |
| 4-15 | \<reserved\> | | | |
| value | mode | n latent variables | `extra_mode_bits` |
|-------|--------------|--------------------|-------------------|
| 0 | Classic | 1 | 0 |
| 1 | IntMult | 2 | `dtype_size` |
| 2 | FloatMult | 2 | `dtype_size` |
| 3 | FloatQuant | 2 | 8 |
| 4-15 | \<reserved\> | | |

* [`extra_mode_bits` bits] for certain modes, extra data is parsed. See the
mode-specific formulas below for how this is used, e.g. as the `mult` or `k`
values.
* [3 bits] the delta encoding order `delta_order`.
* per latent variable,
* [4 bits] `delta_encoding`, using this table:

| value | delta encoding | n latent variables | `extra_delta_bits` |
|-------|----------------|--------------------|--------------------|
| 0 | None | 0 | 0 |
| 1 | Consecutive | 0 | 4 |
| 2 | Lookback | 1 | 10 |
| 3-15 | \<reserved\> | | |

* [`extra_delta_bits` bits]
* for `consecutive`, this is 3 bits for `order` from 1-7, and 1 bit for
whether the mode's secondary latent is delta encoded.
An order of 0 is considered a corruption.
Let `state_n = order`.
* for `lookback`, this is 5 bits for `window_n_log - 1`, 4 for
`state_n_log`, and 1 for whether the mode's secondary latent is delta
encoded.
Let `state_n = 1 << state_n_log`.
* per latent variable (ordered by delta latent variables followed by mode
latent variables),
* [4 bits] `ans_size_log`, the log2 of the size of its tANS table.
This may not exceed 14.
* [15 bits] the count of bins
Expand All @@ -77,17 +103,17 @@ Based on chunk metadata, 4-way interleaved tANS decoders should be initialized
using
[the simple `spread_state_tokens` algorithm from this repo](../pco/src/ans/spec.rs).

### Data Page
### Page

If there are `n` numbers in a data page, it will consist of `ceil(n / 256)`
If there are `n` numbers in a page, it will consist of `ceil(n / 256)`
batches. All but the final batch will contain 256 numbers, and the final
batch will contain the rest (<= 256 numbers).
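As a quick sketch of the batch split (the helper below is illustrative, not part of the format):

```rust
// Split a page of n numbers into batches: all of size 256
// except possibly a smaller final batch.
fn batch_sizes(n: usize) -> Vec<usize> {
    let n_batches = (n + 255) / 256; // ceil(n / 256)
    let mut sizes = vec![256; n_batches];
    if n % 256 != 0 {
        *sizes.last_mut().unwrap() = n % 256;
    }
    sizes
}
```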

Each data page consists of
Each page consists of

* per latent variable,
* if delta encoding is applicable, for `i in 0..delta_order`,
* [`dtype_size` bits] the `i`th delta moment
* if delta encoding is applicable, for `i in 0..state_n`,
* [`dtype_size` bits] the `i`th delta state
* for `i in 0..4`,
* [`ans_size_log` bits] the `i`th interleaved tANS state index
* [0-7 bits] 0s until byte-aligned
@@ -117,31 +143,59 @@ It consists of
* [8 bits] a byte for the data type
* [24 bits] 1 less than `chunk_n`, the count of numbers in the chunk
* a wrapped chunk metadata
* a wrapped data page of `chunk_n` numbers
* a wrapped page of `chunk_n` numbers
* [8 bits] a magic termination byte (0).

## Processing Formulas

<img alt="Pco compression and decompression steps" title="compression and decompression steps" src="../images/processing.svg" />
In order of the decompression steps within a batch:

### Bin Indices and Offsets -> Latents

To produce latents, we simply do `l[i] = bin[i].lower + offset[i]`.
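A minimal sketch of this step (not the actual pco implementation), using the example bins with lowers 7 and 10 from the bin table further below:

```rust
// A bin as used during decompression; only the lower bound matters here.
struct Bin {
    lower: u64,
}

// l[i] = bin[i].lower + offset[i]
fn latents(bins: &[Bin], bin_idxs: &[usize], offsets: &[u64]) -> Vec<u64> {
    bin_idxs
        .iter()
        .zip(offsets)
        .map(|(&b, &o)| bins[b].lower + o)
        .collect()
}
```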

### Delta Encodings

Depending on `delta_encoding`, the mode latents are further decoded.
Note that the delta latent variable, if it exists, is never delta encoded
itself.

#### None

No additional processing is applied.

#### Consecutive

### Numbers <-> Latents
Latents are decoded by taking a cumulative sum repeatedly.
The delta state is interpreted as delta moments, which are used to initialize
each cumulative sum, and get modified for the next batch.

Based on the mode, unsigneds are decomposed into latents.
For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
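A simplified sketch that reproduces the worked example above (the real codec updates the moments across batches rather than decoding a whole sequence at once):

```rust
// Undo consecutive delta encoding: one cumulative-sum pass per order,
// each seeded by the corresponding delta moment.
fn undo_consecutive(moments: &[u64], deltas: &[u64]) -> Vec<u64> {
    let mut seq = deltas.to_vec();
    for &moment in moments.iter().rev() {
        let mut acc = moment;
        let mut out = Vec::with_capacity(seq.len() + 1);
        out.push(acc);
        for &d in &seq {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
        seq = out;
    }
    seq
}
```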

#### Lookback

Let `lookback` be the delta latent variable.
Mode latents are decoded via `l[i] += l[i - lookback[i]]`.
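A sketch of that formula (assumption: `state` holds the `state_n` values that precede the batch, so early lookbacks can reach behind the first element; this is an illustration, not the exact pco implementation):

```rust
// Undo lookback delta encoding: each decoded value adds the value
// `lookback` positions behind it, possibly reaching into prior state.
fn undo_lookback(state: &[u64], deltas: &[u64], lookbacks: &[usize]) -> Vec<u64> {
    let mut all = state.to_vec();
    for (&d, &lb) in deltas.iter().zip(lookbacks) {
        let prev = all[all.len() - lb];
        all.push(prev.wrapping_add(d));
    }
    all[state.len()..].to_vec()
}
```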

### Modes

Based on the mode, latents are joined into the finalized numbers.
Let `l0` and `l1` be the primary and secondary latents respectively.
Let `MID` be the middle value for the latent type (e.g. 2^31 for `u32`).

| mode | decoding formula |
|-------------|------------------------------------------------------------------------|
| classic | `from_latent_ordered(l0)` |
| int mult | `from_latent_ordered(l0 * mult + l1)` |
| float mult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
| float quant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1)` |
| mode | decoding formula |
|------------|------------------------------------------------------------------------|
| Classic | `from_latent_ordered(l0)` |
| IntMult | `from_latent_ordered(l0 * mult + l1)` |
| FloatMult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
| FloatQuant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1))` |

Here ULP refers to [unit in the last place](https://en.wikipedia.org/wiki/Unit_in_the_last_place).
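As an illustrative sketch, IntMult joining for `i32` (whose latent type is `u32`) composes the table's formula with the integer bijection shown below; the helper names here are for illustration only:

```rust
// The order-preserving bijection from u32 latents back to i32.
fn from_latent_ordered(l: u32) -> i32 {
    i32::MIN.wrapping_add(l as i32)
}

// IntMult decoding: x = mult * l0 + l1, then map back to the number type.
fn decode_int_mult(l0: u32, l1: u32, mult: u32) -> i32 {
    from_latent_ordered(l0.wrapping_mul(mult).wrapping_add(l1))
}
```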

Each data type has an order-preserving bijection to an unsigned data type.
For instance, floats have their first bit toggled, and the rest of their bits
bits toggled if the float was originally negative:
toggled if the float was originally negative:

```rust
fn from_unsigned(unsigned: u32) -> f32 {
    if unsigned & (1 << 31) > 0 {
        // the float was originally non-negative: toggle the sign bit back
        f32::from_bits(unsigned ^ (1 << 31))
    } else {
        // the float was originally negative: toggle all bits back
        f32::from_bits(!unsigned)
    }
}
```

@@ -163,28 +217,3 @@ fn from_unsigned(unsigned: u32) -> i32 {

```rust
fn from_unsigned(unsigned: u32) -> i32 {
    i32::MIN.wrapping_add(unsigned as i32)
}
```

### Latents <-> Deltas

Latents are converted to deltas by taking consecutive differences
`delta_order` times, and decoded by taking a cumulative sum repeatedly.
Delta moments are emitted during encoding and consumed during decoding to
initialize the cumulative sum.

For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.

### Deltas <-> Bin Indices and Offsets

To dissect the deltas, we find the bin that contains each delta `x` and compute
its offset as `x - bin.lower`.
For instance, suppose we have these bins, where we have computed the upper bound
for convenience:

| bin idx | lower | offset bits | upper (inclusive) |
|---------|-------|-------------|-------------------|
| 0 | 7 | 2 | 10 |
| 1 | 10 | 3 | 17 |

Then 8 would be in bin 0 with offset 1, and 15 would be in bin 1 with offset 5.
10 could be encoded either as bin 0 with offset 3 or bin 1 with offset 0.
