Skip to content

Latest commit

 

History

History
251 lines (181 loc) · 9.72 KB

011.md

File metadata and controls

251 lines (181 loc) · 9.72 KB

BIPF.tinySSB, a Serialization Format for JSON-like Data Types

Author: cft

Date: 2023-07-16

License: CC0-1.0

URL: https://github.com/ssbc/sips/blob/master/XXX.md

Abstract

BIPF (Binary In-Place Format) is a serialization format for JSON-like data types that is optimized for in-place memory access. It is used in some versions of Secure Scuttlebut (SSB), including tinySSB. This document describes BIPF as used in tinySSB.

Motivation

SSB.classic uses human-readable JSON as representation format: append-only log entries are stored as JSON-encoding of the respective SSB data structure. Using JSON, which has no support for binary data, mandates that byte arrays are specially encoded (as ASCII strings according to BASE64), which further bloats a log entry. By choosing a binary representation as in BIPF, considerable space savings can be achieved and unnecessary transformations of the representation can be avoided. CBOR encoding also provides these features but was not considered because of the possibility of "indefinite lists encoding" which prevents the skipping of a list without parsing its elements. A link to a comprehensive comparison of serialization approaches can be found in the References section; see also the entry for BIPF.rationale.

In order to explain the difference between "JSON data types" and BIPF's "JSON-like data types" we provide two BNF grammers. The first shows a simplified grammar for JSON (that ommits low-level rules e.g. for integers or floating point numbers and strings):

JSON_VAL   ::= JSON_ATOM / JSON_LIST / JSON_DICT
JSON_ATOM  ::= 'true' / 'false' / 'null' / INT_VAL / DOUBLE_VAL / STR_VAL
JSON_LIST  ::= '[' ']' / '[' JSON_VAL *(',' JSON_VAL) ']'
JSON_DICT  ::= '{' '}' / '{' STR_VAL ':' JSON_VAL *(',' STR_VAL ':' JSON_VAL) '}'

BIPF is more expressive in that it also supports byte arrays. Moreover, any atom can be used as a key for a dictionary, not only strings. The following grammer for BIPF permits a direct comparison with the previous grammer for JSON:

BIPF_VAL   ::= BIPF_ATOM / BIPF_LIST / BIPF_DICT
BIPF_ATOM  ::= 'true' / 'false' / 'null' / INT_VAL / DOUBLE_VAL / STR_VAL / BYTES_VAL
BIPF_LIST  ::= '[' ']' / '[' BIPF_VAL *(',' BIPF_VAL) ']'
BIPF_DICT  ::= '{' '}' / '{' BIPF_ATOM ':' BIPF_VAL *(',' BIPF_ATOM ':' BIPF_VAL) '}'

This grammer can be used as a human-readable format of BIPF data items for documentation purposes. However, BIPF is a binary format wherefore integer values, for example, are not represented as digits drawn from the ASCII character set, but as a series of bytes that encode this value. The binary BIPF represenation of the value 123 will be the byte array #7b#, instead of "123" which as a string would translate to the bytes #313233#.

The Specification section below provides the complete list of encoding rules for the BIPF-supported data types.

Terminology

data type       Some abstract data type that we wish to encode such
                that instanves of this data type can be transferred and
                reconstructed by a remote peer. These are integers,
                strings or lists, for example.

encoding rules  A set or reversible mappings for turning potentially
                structured in-memory data values into a flat series
                of bytes that can be stored and/or transferred.

external representation, sometimes also called "transfer syntax"
                A series of bytes that represent the (internal)
                values of a data type. From that series of bytes, a
                remote peer is able to create its own internal
                value of the data.

serialization   Turning an internal instance of a BIPF-supported data
                type item into a series of bytes that can be transferred.

deserialization Turning a received BIPF series of bytes into an
                internal replica of the original data item.

human readable  A format expressed in ASCII characters and, if supported
                by the rendering, of UTF8 unicode characters in strings.
                This is in contrast with BIPF's binary representation where
                encoded values cannot be easily parsed by humans.

Specification

For each of the 7 atomic and 2 structured types supported by BIPF, a bit pattern is assigned. These type identifiers are later used in the encoding of a corresponding value:

STRING  : 0 (000) // utf8 encoded string
BYTES   : 1 (001) // raw byte sequence
INT     : 2 (010) // little endian, two's complement, minimal number of bytes
DOUBLE  : 3 (011) // IEEE 754-encoded double precision floating point
LIST    : 4 (100) // sequence of bipf-encoded values
DICT    : 5 (101) // sequence of alternating bipf-encoded key and value
BOOLNULL: 6 (110) // 1 = true, 0 = false, no value means null
EXTENDED: 7 (111) // custom type. Specific type should be indicated by varint at start of buffer

Note that the BOOLNULL bit pattern is used for three different atomic types.

BIPF values are serialized with a TYPE-LENGTH-VALUE (TLV) encoding. To this end, T and L are combined into a single integer value called tag which is encoded with unsigned LEB128 (also called varint). The encoded tag is then prepended to the bytes of the encoding of the value V proper, as shown in the following pseudo code:

bipf_enc(V)  ::= concat( tag(V.type,V.length), V.bytes )
tag(typ,len) ::= LEB128( len<<3 + typ )

LEB128 encoding (used for encoding the tag) is done by producing one encoded byte for each 7 input bit group, starting with the least significant group of 7 bits. All encoded bytes will have the most significant bit set, except the last byte. Zero is encoded as byte 0x00.

Integer values are encoded with the minimally required number of bytes in little-endian order using two's complement representation.

Lists are encoded by prepending to the concatenation of BIPF-encoded elements a tag with typ=4 and a len value which is the sum of the lengths of the BIPF-encoded elements:

bipf([e1, e2]) = concat( tag(4,len(bipf(e1))+len(bipf(e2)))), bipf(e1), bipf(e2) )

A dictionary is encoded as if one would encode an alternating list of key and value elements, except that the typ value is DICT (5).

bipf({e1: e2}) = concat(tag(5,len(bipf(e1))+len(bipf(e2)))), bipf(e1), bipf(e2))

Following the JSON tradition, a BIPF library will typicall provide at least these two methods:

bipf_dumps(V)  turns an in-memory data item V into a series of bytes
bipf_loads(S)  turns a series of bytes S into the corresponding in-memory data item

One can expect BIPF libraries to have additional convenience methods like bipf_predict_encoded_length(V) etc.

Considerations

(a) BIPF.original (see the References section) departs from this document in the way integer values are encoded. In https://github.com/ssbc/bipf-spec, the authors use a fixed-length encoding for integer values (4 bytes, little endian, two's complement). With space concerns in mind, tinySSB formats integers as little endian, two's complement, and retaining only the minimum number of bytes needed.

(b) As pointed out in the Motivation section, a straight-forward human-readable representation of BIPF exists that is very close to JSON. The only syntactic extension needed is a delimiter for byte arrays where we chose the '#' hash sign as used in S-expressions (which is another serialization format from 1997).

(c) We provide the following test vectors where for each human-readable BIPF-encoding we give its binary representation.

data item as human-   its BIPF serialization
readable BIPF         as byte array
-------------------   ------------------------

null                  06

false                 0e00

true                  0e01

123                   0a7b

-123                  0a85

"¥€$!"                39c2a5e282ac2421

#ABCD#                11abcd

[123,true]            240a7b0e01

{123:false}           250a7b0e00

{#ABCD#:[123,null]}  3d11abcd1c0a7b06

References

Normative

unsigned LEB128 (varint): https://en.wikipedia.org/wiki/LEB128,

Informative

BASE64: https://www.rfc-editor.org/rfc/rfc4648

BIPF.original: https://github.com/ssbc/bipf-spec

BIPF.rationale: https://github.com/ssbc/bipf

BNF: Section 2 of https://www.rfc-editor.org/rfc/rfc822

CBOR: https://www.rfc-editor.org/rfc/rfc8949

Comparison: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats

JSON: https://www.rfc-editor.org/rfc/rfc8259

S-exp: https://web.archive.org/web/20230223024606/http://people.csail.mit.edu/rivest/Sexp.txt

SSB.classic: https://ssbc.github.io/scuttlebutt-protocol-guide/

UTF8: https://www.rfc-editor.org/rfc/rfc3629

varint in ProtoBuf: https://protobuf.dev/programming-guides/encoding/#varints

Implementations

BIPF.tinySSB:

BIPF.classic: