DAPHNE defines its own binary representation for the serialization of in-memory data objects (matrices/frames). This representation is intended to be used by default whenever we need to transfer or persistently store these in-memory objects, e.g., for
- the data transfer in the distributed runtime
- a custom binary file format
- the eviction of in-memory data to secondary storage
Disclaimer: The current specification is a first draft and will likely be refined as we proceed. At the moment, we focus on the case of a single block per data object.
Endianess: For now, we assume little endian.
Images: In the images below, all addresses and sizes are specified in bytes ([B]
).
The binary representation of a data object (matrix/frame) starts with a header containing general and data type-specific information. The data object is partitioned into rectangular blocks (in the extreme case, this can mean a single block). All blocks are represented individually (see binary representation of a single block below) and stored along with their position in the data object.
+--------+------+
| header | body |
+--------+------+
The header consists of the following information:
- DAPHNE binary format version number (
1
for now) (uint8) - data type
dt
(uint8) - number of rows
#r
(uint64) - number of columns
#c
(uint64)
We currently support the following data types:
code | data type |
---|---|
0 |
(reserved) |
1 |
DenseMatrix |
2 |
CSRMatrix |
3 |
Frame |
We currently support the following value types:
code | C++ value type |
---|---|
0 | (reserved) |
1 | uint8_t |
2 | uint16_t |
3 | uint32_t |
4 | uint64_t |
5 | int8_t |
6 | int16_t |
7 | int32_t |
8 | int64_t |
9 | float |
10 | double |
Depending on the data type, there are more information in the header:
For DenseMatrix
and CSRMatrix
:
- value type
vt
(uint8)
addr[B] 0 0 1 1 2 9 10 17 18 18
+---+----+----+-----+-----+
| 1 | dt | #r | #c | vt |
+---+----+----+-----+-----+
size[B] 1 1 8 8 1
For Frame
:
- value type
vt
(uint8), for each column - length of the label
len
(uint16) and labellbl
(character string), for each column
addr[B] 0 0 1 1 2 9 10 17 18 18+#c-1 18+#c *
+---+----+----+-----+-------+ +----------+--------+--------+ +-----------+-----------+
| 1 | dt | #r | #c | vt[0] | ... | vt[#c-1] | len[0] | lbl[0] | ... | len[#c-1] | lbl[#c-1] |
+---+----+----+-----+-------+ +----------+--------+--------+ +-----------+-----------+
size[B] 1 1 8 8 1 1 2 len[0] 2 len[#c-1]
The body consists of a sequence of:
- a pair of
- row index
rx
(uint64) - column index
cx
(uint64)
- row index
- a binary block representation
For the special case of a single block, this looks as follows:
addr[B] 0 7 8 15 16 *
+---+----+----------+
| 0 | 0 | block[0] |
+---+----+----------+
8 8 *
size[B]
A single data block is a rectangular partition of a data object. In the extreme case, a single block can span the entire data object in both dimensions (one block per data object).
General block header
- number of rows
#r
(uint32) - number of columns
#c
(uint32) - block type
bt
(uint8) - block type-specific information (see below)
addr[B] 0 3 4 7 8 8 9 *
+----+----+----+--------------------------+
| #r | #c | bt | block type-specific info |
+----+----+----+--------------------------+
size[B] 4 4 1 *
We define different block types to allow for a space-efficient representation depending on the data.
When serializing a data object, the block types are not required to match the in-memory representation (e.g., the blocks of a DenseMatrix
could use the sparse binary representation).
Moreover, different blocks may be represented as different block types (e.g., some blocks might use the dense binary representation and others the sparse one).
We currently support the following block types:
code | block type |
---|---|
0 |
empty |
1 |
dense |
2 |
sparse (CSR) |
3 |
ultra-sparse (COO) |
Most block types store their value type as part of the block type-specific information.
Note that the value type used for the binary representation is not required to match the value type of the in-memory object (e.g., DenseMatrix<uint64_t>
may be represented as a dense block with value type uint8_t
, if the value range permits).
Furthermore, each block may be represented using its individual value type.
This block type is used to represent blocks that contain only zeros of the respective value type very space-efficiently.
Block type-specific information: none
addr[B] 0 3 4 7 8 8
+----+----+---+
| #r | #c | 0 |
+----+----+---+
size[B] 4 4 1
Block type-specific information:
- value type
vt
(uint8) - values
v
in row-major (value typevt
)
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 10+#r*#c*S
+----+----+---+----+---------+---------+ +---------------+
| #r | #c | 1 | vt | v[0, 0] | v[0, 1] | ... | v[#r-1, #c-1] |
+----+----+---+----+---------+---------+ +---------------+
size[B] 4 4 1 1 S S S
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint64) - for each row
- number of non-zeros in the row
#nzr
(uint32) - for each non-zero in the row
- column index
cx
(uint32) - value
v
(value typevt
)
- column index
- number of non-zeros in the row
Note that both a row and the entire block might contain no non-zeros.
Below, S
denotes the size (in bytes) of a single value of type vt
.
18 + 4*#r +
addr[B] 0 3 4 7 8 8 9 9 10 17 18 #nzb*(4+S)
+----+----+---+----+------+--------+ +--------+ +-----------+
| #r | #c | 2 | vt | #nzb | row[0] | ... | row[i] | ... | row[#r-1] |
+----+----+---+----+------+--------+ +--------+ +-----------+
size[B] 4 4 1 1 8 4+#nzr[i]*(4+S)
______________________________________|_______________________
/ \
+---------+----------+ +----------+ +------------------+
| #nzr[i] | nz[i, 0] | ... | nz[i, j] | ... | nz[i, #nzr[i]-1] |
+---------+----------+ +----------+ +------------------+
4 4+S 4+S 4+S
______________|____________
/ \
+----------+----------------+
| cx[i, j] | v[i, cx[i, j]] |
+----------+----------------+
4 S
Ultra-sparse blocks contain almost no non-zeros, so we want to keep the overhead of the meta data low. Thus, we distinguish blocks with a single column (where we don't need to store the column index) and blocks with more than one column.
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint32) - for each non-zero
- row index
rx
(uint32) - value
v
(value typevt
)
- row index
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 13 14 14+#nzb*(4+S)
+----+----+---+----+------+-------+ +-------+ +------------+
| #r | #c | 3 | vt | #nzb | nz[0] | ... | nz[i] | ... | nz[#nzb-1] |
+----+----+---+----+------+-------+ +-------+ +------------+
size[B] 4 4 1 1 4 4+S 4+S 4+S
__________|__________
/ \
+-------+-------------+
| rx[i] | v[rx[i], 0] |
+-------+-------------+
4 S
Block type-specific information:
- value type
vt
(uint8) - number of non-zeros in the block
#nzb
(uint32) - for each non-zero
- row index
rx
(uint32) - column index
cx
(uint32) - value
v
(value typevt
)
- row index
Below, S
denotes the size (in bytes) of a single value of type vt
.
addr[B] 0 3 4 7 8 8 9 9 10 13 14 14+#nzb*(8+S)
+----+----+---+----+------+-------+ +-------+ +------------+
| #r | #c | 3 | vt | #nzb | nz[0] | ... | nz[i] | ... | nz[#nzb-1] |
+----+----+---+----+------+-------+ +-------+ +------------+
size[B] 4 4 1 1 4 8+S 8+S 8+S
________________|________________
/ \
+-------+-------+-----------------+
| rx[i] | cx[i] | v[rx[i], cx[i]] |
+-------+-------+-----------------+
4 4 S