readme and format updates (#249)
mwlon authored Nov 9, 2024
1 parent 29a9fcf commit feb8449
Showing 7 changed files with 285 additions and 338 deletions.
43 changes: 21 additions & 22 deletions README.md
@@ -1,5 +1,8 @@
<div style="text-align:center">
<img alt="Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand" src="images/logo.svg" width="160px">
<img
alt="Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand" src="images/logo.svg"
width="160px"
>
</div>
[![crates.io][crates-badge]][crates-url]
@@ -48,20 +51,21 @@ numerical sequences with
## How is Pco so much better than alternatives?

Pco is designed specifically for numerical data, whereas alternatives rely on
general-purpose (LZ) compressors that were designed for string or binary data.
general-purpose (LZ) compressors that target string or binary data.
Pco uses a holistic, 3-step approach:

* **modes**.
Pco identifies an approximate structure of the numbers called a
mode and then applies it to all the numbers.
mode and then uses it to split numbers into "latents".
As an example, if all numbers are approximately multiples of 777, int mult mode
decomposes each number `x` into latent variables `l_0` and
splits each number `x` into latent variables `l_0` and
`l_1` such that `x = 777 * l_0 + l_1`.
Most natural data uses classic mode, which simply matches `x = l_0`.
* **delta enoding**.
* **delta encoding**.
Pco identifies whether certain latent variables would be better compressed as
consecutive deltas (or deltas of deltas, or so forth).
If so, it takes consecutive differences.
deltas between consecutive elements (or deltas of deltas, or deltas with
lookback).
If so, it takes differences.
* **binning**.
This is the heart and most novel part of Pco.
Pco represents each (delta-encoded) latent variable as an approximate,
Expand All @@ -79,11 +83,11 @@ entropy.

### Wrapped or Standalone

Pco is designed to be easily wrapped into another format.
Pco is designed to embed into wrapping formats.
It provides a powerful wrapped API with the building blocks to interleave it
with the wrapping format.
This is useful if the wrapping format needs to support things like nullability,
multiple columns, random access or seeking.
multiple columns, random access, or seeking.

The standalone format is a minimal implementation of a wrapped format.
It supports batched decompression only with no other niceties.
Expand All @@ -102,24 +106,19 @@ multiple chunks per file.

### Mistakes to Avoid

You will get disappointing results from Pco if your data:
You may get disappointing results from Pco if your data in a single chunk

* combines semantically different sequences into a single chunk, or
* contains fewer numbers per chunk or page than recommended (see above table).
* combines semantically different sequences, or
* is inherently 2D or higher.

Example: the NYC taxi dataset has `f64` columns for `passenger_base_fare` and
`tolls`.
Suppose we assign these as `fare[0...n]` and `tolls[0...n]` respectively, where
Example: the NYC taxi dataset has `f64` columns for `fare` and
`trip_miles`.
Suppose we assign these as `fare[0...n]` and `trip_miles[0...n]` respectively, where
`n=50,000`.

* separate chunk for each column => good compression
* single chunk `fare[0], ... fare[n-1], toll[0], ... toll[n-1]` => mediocre
compression
* single chunk `fare[0], toll[0], ... fare[n-1], toll[n-1]` => poor compression

Similarly, we could compress images by making a separate chunk for each
flattened channel (red, green, blue).
Though dedicated formats like webp likely compress natural images better.
* single chunk `fare[0], ... fare[n-1], trip_miles[0], ... trip_miles[n-1]` => bad compression
* single chunk `fare[0], trip_miles[0], ... fare[n-1], trip_miles[n-1]` => bad compression
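The layout advice above can be illustrated with a toy measurement. This is a sketch with hypothetical data (not the real taxi dataset, and not the pco API): total absolute consecutive difference is a rough proxy for how compressible a sequence is under delta encoding and binning, and interleaving two columns of different scales inflates it badly.

```rust
// Rough proxy for delta-encoding friendliness: the total absolute
// consecutive difference of a sequence. Smaller is better.
fn roughness(xs: &[f64]) -> f64 {
    xs.windows(2).map(|w| (w[1] - w[0]).abs()).sum()
}

// Interleave two equal-length columns into one sequence,
// as in the "bad compression" layouts above.
fn interleave(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).flat_map(|(&x, &y)| [x, y]).collect()
}
```

On smooth synthetic `fare`-like and `trip_miles`-like columns, the interleaved sequence's roughness exceeds the sum of the per-column roughnesses by orders of magnitude, since every consecutive delta jumps between the two scales.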

## Extra

151 changes: 90 additions & 61 deletions docs/format.md
@@ -8,18 +8,24 @@ Bit packing a component is completed by filling the rest of the byte with 0s.

Let `dtype_size` be the data type's number of bits.
A "raw" value for a number is a `dtype_size` value that maps to the number
via [its `from_unsigned` function](#numbers---latents).
via [its `from_unsigned` function](#Modes).

## Wrapped Format Components

<img alt="Pco wrapped format diagram" title="Pco wrapped format" src="../images/wrapped_format.svg" />

The wrapped format consists of 3 components: header, chunk metadata, and data
pages.
Wrapping formats may encode these components any place they wish.

Pco is designed to have one header per file, possibly multiple chunks per
header, and possibly multiple data pages per chunk.
header, and possibly multiple pages per chunk.

[Plate notation](https://en.wikipedia.org/wiki/Plate_notation) for chunk
metadata component:

<img alt="Pco wrapped chunk meta plate notation" src="../images/wrapped_chunk_meta_plate.svg" width="500px"/>

Plate notation for page component:

<img alt="Pco wrapped page plate notation" src="../images/wrapped_page_plate.svg" width="500px"/>

### Header

Expand All @@ -34,11 +40,12 @@ The header simply consists of

So far, these format versions exist:

| format version | first Rust version | deviations from next format version |
|----------------|--------------------|-----------------------------------------------|
| 0 | 0.0.0 | int mult mode unsupported |
| 1 | 0.1.0 | float quant mode and 16-bit types unsupported |
| 2 | 0.3.0 | - |
| format version | first Rust version | deviations from next format version |
|----------------|--------------------|----------------------------------------------|
| 0 | 0.0.0 | IntMult mode unsupported |
| 1 | 0.1.0 | FloatQuant mode and 16-bit types unsupported |
| 2 | 0.3.0 | delta variants and Lookback unsupported |
| 3 | 0.4.0 | - |

### Chunk Metadata

Expand All @@ -47,22 +54,41 @@ metadata is out of range.
For example, if the sum of bin weights does not equal the tANS size; or if a
bin's offset bits exceed the data type size.

Each chunk metadata consists of
Each chunk meta consists of

* [4 bits] `mode`, using this table:

| value | mode | n latent variables | 2nd latent uses delta? | `extra_mode_bits` |
|-------|--------------|--------------------|------------------------|-------------------|
| 0 | classic | 1 | | 0 |
| 1 | int mult | 2 | no | `dtype_size` |
| 2 | float mult | 2 | no | `dtype_size` |
| 3 | float quant | 2 | no | 8 |
| 4-15 | \<reserved\> | | | |
| value | mode | n latent variables | `extra_mode_bits` |
|-------|--------------|--------------------|-------------------|
| 0 | Classic | 1 | 0 |
| 1 | IntMult | 2 | `dtype_size` |
| 2 | FloatMult | 2 | `dtype_size` |
| 3 | FloatQuant | 2 | 8 |
| 4-15 | \<reserved\> | | |

* [`extra_mode_bits` bits] for certain modes, extra data is parsed. See the
mode-specific formulas below for how this is used, e.g. as the `mult` or `k`
values.
* [3 bits] the delta encoding order `delta_order`.
* per latent variable,
* [4 bits] `delta_encoding`, using this table:

| value | delta encoding | n latent variables | `extra_delta_bits` |
|-------|----------------|--------------------|--------------------|
| 0 | None | 0 | 0 |
| 1 | Consecutive | 0 | 4 |
| 2 | Lookback | 1 | 10 |
| 3-15 | \<reserved\> | | |

* [`extra_delta_bits` bits]
* for `consecutive`, this is 3 bits for `order` from 1-7, and 1 bit for
whether the mode's secondary latent is delta encoded.
An order of 0 is considered a corruption.
Let `state_n = order`.
* for `lookback`, this is 5 bits for `window_n_log - 1`, 4 for
`state_n_log`, and 1 for whether the mode's secondary latent is delta
encoded.
Let `state_n = 1 << state_n_log`.
* per latent variable (ordered by delta latent variables followed by mode
latent variables),
* [4 bits] `ans_size_log`, the log2 of the size of its tANS table.
This may not exceed 14.
* [15 bits] the count of bins
Expand All @@ -77,17 +103,17 @@ Based on chunk metadata, 4-way interleaved tANS decoders should be initialized
using
[the simple `spread_state_tokens` algorithm from this repo](../pco/src/ans/spec.rs).

### Data Page
### Page

If there are `n` numbers in a data page, it will consist of `ceil(n / 256)`
If there are `n` numbers in a page, it will consist of `ceil(n / 256)`
batches. All but the final batch will contain 256 numbers, and the final
batch will contain the rest (<= 256 numbers).
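As a quick sketch of the batch split (the helper below is illustrative, not part of the format):

```rust
// Split a page of n numbers into batches: all of size 256
// except possibly a smaller final batch.
fn batch_sizes(n: usize) -> Vec<usize> {
    let n_batches = (n + 255) / 256; // ceil(n / 256)
    let mut sizes = vec![256; n_batches];
    if n % 256 != 0 {
        *sizes.last_mut().unwrap() = n % 256;
    }
    sizes
}
```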

Each data page consists of
Each page consists of

* per latent variable,
* if delta encoding is applicable, for `i in 0..delta_order`,
* [`dtype_size` bits] the `i`th delta moment
* if delta encoding is applicable, for `i in 0..state_n`,
* [`dtype_size` bits] the `i`th delta state
* for `i in 0..4`,
* [`ans_size_log` bits] the `i`th interleaved tANS state index
* [0-7 bits] 0s until byte-aligned
@@ -117,31 +143,59 @@ It consists of
* [8 bits] a byte for the data type
* [24 bits] 1 less than `chunk_n`, the count of numbers in the chunk
* a wrapped chunk metadata
* a wrapped data page of `chunk_n` numbers
* a wrapped page of `chunk_n` numbers
* [8 bits] a magic termination byte (0).

## Processing Formulas

<img alt="Pco compression and decompression steps" title="compression and decompression steps" src="../images/processing.svg" />
In order of the decompression steps within a batch:

### Bin Indices and Offsets -> Latents

To produce latents, we simply do `l[i] = bin[i].lower + offset[i]`.
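A minimal sketch of this step (not the actual pco implementation), using the example bins with lowers 7 and 10 from the bin table further below:

```rust
// A bin as used during decompression; only the lower bound matters here.
struct Bin {
    lower: u64,
}

// l[i] = bin[i].lower + offset[i]
fn latents(bins: &[Bin], bin_idxs: &[usize], offsets: &[u64]) -> Vec<u64> {
    bin_idxs
        .iter()
        .zip(offsets)
        .map(|(&b, &o)| bins[b].lower + o)
        .collect()
}
```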

### Delta Encodings

Depending on `delta_encoding`, the mode latents are further decoded.
Note that the delta latent variable, if it exists, is never delta encoded
itself.

#### None

No additional processing is applied.

#### Consecutive

### Numbers <-> Latents
Latents are decoded by taking a cumulative sum repeatedly.
The delta state is interpreted as delta moments, which are used to initialize
each cumulative sum, and get modified for the next batch.

Based on the mode, unsigneds are decomposed into latents.
For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.
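A simplified sketch that reproduces the worked example above (the real codec updates the moments across batches rather than decoding a whole sequence at once):

```rust
// Undo consecutive delta encoding: one cumulative-sum pass per order,
// each seeded by the corresponding delta moment.
fn undo_consecutive(moments: &[u64], deltas: &[u64]) -> Vec<u64> {
    let mut seq = deltas.to_vec();
    for &moment in moments.iter().rev() {
        let mut acc = moment;
        let mut out = Vec::with_capacity(seq.len() + 1);
        out.push(acc);
        for &d in &seq {
            acc = acc.wrapping_add(d);
            out.push(acc);
        }
        seq = out;
    }
    seq
}
```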

#### Lookback

Let `lookback` be the delta latent variable.
Mode latents are decoded via `l[i] += l[i - lookback[i]]`.
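A sketch of that formula (assumption: `state` holds the `state_n` values that precede the batch, so early lookbacks can reach behind the first element; this is an illustration, not the exact pco implementation):

```rust
// Undo lookback delta encoding: each decoded value adds the value
// `lookback` positions behind it, possibly reaching into prior state.
fn undo_lookback(state: &[u64], deltas: &[u64], lookbacks: &[usize]) -> Vec<u64> {
    let mut all = state.to_vec();
    for (&d, &lb) in deltas.iter().zip(lookbacks) {
        let prev = all[all.len() - lb];
        all.push(prev.wrapping_add(d));
    }
    all[state.len()..].to_vec()
}
```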

### Modes

Based on the mode, latents are joined into the finalized numbers.
Let `l0` and `l1` be the primary and secondary latents respectively.
Let `MID` be the middle value for the latent type (e.g. 2^31 for `u32`).

| mode | decoding formula |
|-------------|------------------------------------------------------------------------|
| classic | `from_latent_ordered(l0)` |
| int mult | `from_latent_ordered(l0 * mult + l1)` |
| float mult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
| float quant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1)` |
| mode | decoding formula |
|------------|------------------------------------------------------------------------|
| Classic | `from_latent_ordered(l0)` |
| IntMult | `from_latent_ordered(l0 * mult + l1)` |
| FloatMult | `int_float_from_latent(l0) * mult + (l1 + MID) ULPs` |
| FloatQuant | `from_latent_ordered((l0 << k) + (l0 << k >= MID ? l1 : 2^k - 1 - l1))` |

Here ULP refers to [unit in the last place](https://en.wikipedia.org/wiki/Unit_in_the_last_place).
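As an illustrative sketch, IntMult joining for `i32` (whose latent type is `u32`) composes the table's formula with the integer bijection shown below; the helper names here are for illustration only:

```rust
// The order-preserving bijection from u32 latents back to i32.
fn from_latent_ordered(l: u32) -> i32 {
    i32::MIN.wrapping_add(l as i32)
}

// IntMult decoding: x = mult * l0 + l1, then map back to the number type.
fn decode_int_mult(l0: u32, l1: u32, mult: u32) -> i32 {
    from_latent_ordered(l0.wrapping_mul(mult).wrapping_add(l1))
}
```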

Each data type has an order-preserving bijection to an unsigned data type.
For instance, floats have their first bit toggled, and the rest of their bits
bits toggled if the float was originally negative:
toggled if the float was originally negative:

```rust
fn from_unsigned(unsigned: u32) -> f32 {
    if unsigned & (1 << 31) > 0 {
        // the float was originally non-negative: toggle the sign bit back
        f32::from_bits(unsigned ^ (1 << 31))
    } else {
        // the float was originally negative: toggle all bits back
        f32::from_bits(!unsigned)
    }
}
```

@@ -163,28 +217,3 @@ fn from_unsigned(unsigned: u32) -> i32 {

```rust
fn from_unsigned(unsigned: u32) -> i32 {
    i32::MIN.wrapping_add(unsigned as i32)
}
```

### Latents <-> Deltas

Latents are converted to deltas by taking consecutive differences
`delta_order` times, and decoded by taking a cumulative sum repeatedly.
Delta moments are emitted during encoding and consumed during decoding to
initialize the cumulative sum.

For instance, with 2nd order delta encoding, the delta moments `[1, 2]`
and the deltas `[0, 10, 0]` would decode to the latents `[1, 3, 5, 17, 29]`.

### Deltas <-> Bin Indices and Offsets

To dissect the deltas, we find the bin that contains each delta `x` and compute
its offset as `x - bin.lower`.
For instance, suppose we have these bins, where we have computed the upper bound
for convenience:

| bin idx | lower | offset bits | upper (inclusive) |
|---------|-------|-------------|-------------------|
| 0 | 7 | 2 | 10 |
| 1 | 10 | 3 | 17 |

Then 8 would be in bin 0 with offset 1, and 15 would be in bin 1 with offset 5.
10 could be encoded either as bin 0 with offset 3 or bin 1 with offset 0.
