Node group-based node table storage #1802
Conversation
Force-pushed from f26db56 to 0477f05.
I am getting a seg fault while loading the ldbc-100 comment csv on my Mac. It looks like the ReadFile operator is trying to access a ValueVector which has a null dataChunk state.
I'm in the middle of my review. I'll continue tonight but in case you can start on this, here's my initial set of comments.
src/include/catalog/catalog.h
Outdated
protected:
    std::unique_ptr<CatalogContent> catalogContentForReadOnlyTrx;
    std::unique_ptr<CatalogContent> catalogContentForWriteTrx;
    storage::WAL* wal;
    std::unique_ptr<storage::BMFileHandle> nodeGroupsMetaFH;
Is this the right place to store this FH? It looks like this might belong to a class in StorageManager. Also, the name Meta does not sound correct. Maybe kzMetadataFileFH, metadataFileFH, or metadataFH. I have a few suggestions around this (e.g., in constants.h or storagemanager.h). Whatever you decide about these file names, just be consistent in every field/variable/constant etc.
Still kept in Catalog for now. Honestly, I'm not sure whether Catalog or StorageManager is the best place. I want to revisit this as we move on.
Why? Catalog is clearly not a place to keep track of disk-related fields, such as file handles. In addition, this field already exists (as a raw pointer) in node_table.h, which makes sense.
My understanding was that, regardless of whether it is logical (schema) or physical (metadata), this information is all metadata around the physically stored table data. It depends on whether we should separate logical and physical or not. I was a bit indecisive, but now I think separating them makes more sense, so I will make the change.
I tried to move metadataFH to StorageManager. The part I like is that it avoids having Catalog interact with the metadata file; the annoying part is that during WAL replaying for recovery there is no StorageManager present, so we would need to somehow (re)construct a metadataFH, which I don't think is a correct design.
Following this (not related to this PR), I've been quite confused about why we chose to do recovery before we construct the Catalog and StorageManager objects. (I can't quite remember why we have to do that.) Is this still a design we must keep?
src/storage/wal_replayer.cpp
Outdated
auto tableSchema = catalogForCheckpointing->getReadOnlyVersion()->getTableSchema(tableID);
auto property = tableSchema->getProperty(propertyID);
if (tableSchema->isNodeTable) {
    WALReplayerUtils::initPropertyMetaDAsOnDisk(
This looks a bit wrong to me. I was expecting you to go through the regular WAL-version-of-pages mechanism to checkpoint the necessary pages on disk, and then call something like a "checkpointInMemory()" function on NodeTable to create the NodeColumns (and the InMemDiskArrays) that are being created as part of the add property DDL statement. I am even curious whether we are already doing the WAL-version-of-pages way of checkpointing as well as WALReplayerUtils::initPropertyMetaDAsOnDisk; it's not obvious to me that we are not doing such "double checkpointing" on disk.
I'm separating this change into another PR.
Here are the rest of my comments. Let's do another iteration and also discuss certain things in person if you need.
@@ -97,6 +132,20 @@ void NodeTable::prepareRollback() {
    }
}

void NodeTable::checkpointInMemory() {
I know this code is also used when there is a copy to a node table, so all properties and the PK are being updated. But if I understand correctly, it is also used when a new property was added, right? So in that case as well, we go through this coarse way of checkpointing every component of a NodeTable in memory. In the wal_replayer we should probably have a mechanism to keep track of not just the nodeTables but also the individual properties that require in-memory checkpointing.
You don't have to do it here, but I'm adding this comment to record the problem.
I believe we should keep track of these directly inside Transaction, which seems easier and makes more sense to me.
Force-pushed from df1bb2c to b7352e4.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #1802 +/- ##
==========================================
- Coverage 91.11% 89.62% -1.49%
==========================================
Files 813 821 +8
Lines 29317 30160 +843
==========================================
+ Hits 26711 27032 +321
- Misses 2606 3128 +522
☔ View full report in Codecov by Sentry.
Force-pushed from df07db5 to 00eacad.
This is the first major PR towards node group-based storage.
The implementation is detailed as follows:
Column Chunk
ColumnChunk is used during COPY. It buffers all values of one column within a node group. The basic methods for setting/getting and copying values into the ColumnChunk are essentially the same as in InMemColumnChunk, which is still used to populate columns in rel tables. Eventually, these two will be merged together.
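To make this concrete, here is a minimal sketch of such a per-column buffer for a fixed-width type. The class name, fields, and methods below are illustrative only, not kuzu's actual ColumnChunk API:

```cpp
// Illustrative sketch (not kuzu's ColumnChunk): an in-memory buffer holding all
// values of a single fixed-width column for one node group during COPY.
#include <cstdint>
#include <cstring>
#include <vector>

class ColumnChunkSketch {
public:
    ColumnChunkSketch(uint32_t elemSize, uint64_t capacity)
        : elemSize{elemSize}, buffer(elemSize * capacity), nullMask(capacity, false) {}

    // Copy numValues raw values into the chunk starting at posInChunk.
    void setValues(const uint8_t* values, uint64_t posInChunk, uint64_t numValues) {
        std::memcpy(buffer.data() + posInChunk * elemSize, values, numValues * elemSize);
    }

    void setNull(uint64_t posInChunk, bool isNull) { nullMask[posInChunk] = isNull; }

    // Raw bytes handed off when the node group is flushed to disk.
    const uint8_t* getData() const { return buffer.data(); }
    uint64_t getBufferSize() const { return buffer.size(); }

private:
    uint32_t elemSize;            // size of one value in bytes
    std::vector<uint8_t> buffer;  // one fixed-width slot per node offset in the group
    std::vector<bool> nullMask;   // nulls tracked separately (flushed as their own chunk)
};
```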
Node Group
NodeGroup is a wrapper of all ColumnChunks from node properties with the same set of node offsets. The class is also used during COPY.
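Continuing the sketch above (it reuses ColumnChunkSketch from the previous snippet; names are again hypothetical), a node group is essentially one such chunk per node property plus a row counter:

```cpp
// Illustrative sketch (not kuzu's NodeGroup): one ColumnChunk per node property,
// all covering the same contiguous range of node offsets.
#include <cstdint>
#include <memory>
#include <vector>

class NodeGroupSketch {
public:
    explicit NodeGroupSketch(std::vector<std::unique_ptr<ColumnChunkSketch>> chunks)
        : chunks{std::move(chunks)} {}

    ColumnChunkSketch* getChunk(uint32_t propertyIdx) { return chunks[propertyIdx].get(); }

    // COPY appends a batch of rows to every chunk, then advances the counter.
    void advance(uint64_t numRowsAppended) { numNodes += numRowsAppended; }

    uint64_t getNumNodes() const { return numNodes; }
    bool isFull(uint64_t nodeGroupCapacity) const { return numNodes == nodeGroupCapacity; }

private:
    std::vector<std::unique_ptr<ColumnChunkSketch>> chunks; // one per node property
    uint64_t numNodes = 0;                                  // rows currently buffered
};
```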
Copy Node
The COPY node pipeline consists of two operators, ReadFile and CopyNode. The former is responsible for reading chunks from files and feeding them to CopyNode, while CopyNode populates the node table correspondingly.
The logic of CopyNode is as follows:
- Each thread appends data coming from ReadFile into a local node group. Once full, it is flushed out and reset afterwards (CopyNode::appendNodeGroupToTableAndPopulateIndex). When ReadFile is exhausted, data left in the local node group is merged into a shared one among all threads (to avoid flushing many non-full node groups). Finally, the last thread guarantees that all data left in the shared node group is flushed to disk (CopyNode::finalize).
- Each node group is assigned a node group idx (CopyNodeSharedState::getNextNodeGroupIdx) when it is flushed out (NodeTable::appendNodeGroup). The start node offset of the node group is calculated from the given node group idx (nodeGroupIdx << NODE_GROUP_SIZE_LOG2). Also, the node group idx is used to access its metadata in the meta disk array (columnChunksMetaDA->get(nodeGroupIdx), used in NodeColumn). The assignment of node group idx is coordinated through the shared state of CopyNode; see the sketch after this section.
Note that this design is not order preserving, meaning that the ordering of nodes in the internal storage is not guaranteed to be the same as in the original csv/parquet files.
Order-preserving COPY will be added later along with the fix of SERIAL.
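As a rough sketch of the idx assignment and offset math described in the list above, assuming a hypothetical shared state with an atomic counter and an assumed NODE_GROUP_SIZE_LOG2 of 17 (the real CopyNodeSharedState and constant may differ):

```cpp
// Sketch of node group idx assignment and start-offset calculation during COPY.
// The atomic counter stands in for CopyNodeSharedState::getNextNodeGroupIdx;
// NODE_GROUP_SIZE_LOG2 = 17 (128K nodes per group) is an assumed value.
#include <atomic>
#include <cstdint>

static constexpr uint64_t NODE_GROUP_SIZE_LOG2 = 17;

struct CopySharedStateSketch {
    std::atomic<uint64_t> nextNodeGroupIdx{0};
    // Each flushed node group grabs a unique, monotonically increasing idx.
    uint64_t getNextNodeGroupIdx() { return nextNodeGroupIdx.fetch_add(1); }
};

// The start node offset of a node group is fully determined by its idx.
inline uint64_t startOffsetOfNodeGroup(uint64_t nodeGroupIdx) {
    return nodeGroupIdx << NODE_GROUP_SIZE_LOG2;
}
```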
Column Chunk Metadata
Catalog holds the file handle of the metadata file, which consists of disk arrays of ColumnChunkMetadata for each column. ColumnChunkMetadata keeps track of the starting page idx and the number of pages for a column chunk, so we can correctly read it back from disk. Besides, Property holds MetaDiskArrayHeaderInfo, which keeps track of the disk arrays related to this property (there can be multiple disk arrays because we separate data, null, and nested children into different column chunks).
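Conceptually, each entry in such a disk array only needs to record where the flushed chunk lives in the data file, roughly like the struct below (field names are illustrative, not the actual ColumnChunkMetadata definition):

```cpp
// Illustrative per-node-group metadata for one column chunk: where the chunk
// starts in the column data file and how many pages it occupies.
// One such entry per node group, stored in a disk array indexed by node group idx.
#include <cstdint>

struct ColumnChunkMetadataSketch {
    uint64_t pageIdx = 0;   // first page of the chunk in the data file
    uint64_t numPages = 0;  // number of pages occupied by the chunk
};
```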
Node Column
Serves similar functionality as Column; eventually these two should be merged into one. A node column holds a disk array of column chunk metadata.
Given a starting offset, a read starts by calculating the node group idx (nodeOffset >> common::StorageConstants::NODE_GROUP_SIZE_LOG2), then accesses the column chunk metadata to get the starting page idx of the column chunk in the data file, and finally reads the necessary pages accordingly (same logic as before this PR). The same holds for writes: given any node offset, we first need to get the corresponding column chunk metadata to figure out the starting page idx; then the write logic is the same.
(Note: for scans, we can optimize this later to avoid repeatedly calculating and accessing the column chunk metadata. If we change the morsel to a node group, the scan operator should keep a scan state and only fetch the column chunk metadata while scanning the first vector within a node group; subsequent scans within the same node group can reuse the metadata.)
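Putting the pieces together, the read path described above boils down to roughly the following (reusing ColumnChunkMetadataSketch and NODE_GROUP_SIZE_LOG2 from the earlier sketches; the helper and its signature are hypothetical):

```cpp
// Sketch of locating a value in a node-group-based column:
// node offset -> node group idx -> column chunk metadata -> page within the chunk.
#include <cstdint>
#include <vector>

struct PageCursorSketch {
    uint64_t pageIdx;    // absolute page idx in the column data file
    uint32_t posInPage;  // element position within that page
};

// metaDA stands in for the per-column disk array of metadata (columnChunksMetaDA).
inline PageCursorSketch locateValue(uint64_t nodeOffset, uint32_t elemSize, uint64_t pageSize,
    const std::vector<ColumnChunkMetadataSketch>& metaDA) {
    auto nodeGroupIdx = nodeOffset >> NODE_GROUP_SIZE_LOG2;   // which node group
    const auto& chunkMeta = metaDA[nodeGroupIdx];             // where its chunk starts on disk
    auto offsetInGroup = nodeOffset - (nodeGroupIdx << NODE_GROUP_SIZE_LOG2);
    auto elemsPerPage = pageSize / elemSize;
    return PageCursorSketch{chunkMeta.pageIdx + offsetInGroup / elemsPerPage,
        static_cast<uint32_t>(offsetInGroup % elemsPerPage)};
}
```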
TODOs
Here is a list of things that are broken right now and on the way to being fixed:
Immediate follow-up work for this PR includes (not ordered here):