Node group-based node table storage #1802
Conversation
Force-pushed from f26db56 to 0477f05.
I am getting a seg fault while loading the ldbc-100 comment csv on my Mac. It looks like the ReadFile operator is trying to access a ValueVector which has a null dataChunk state.
I'm in the middle of my review. I'll continue tonight but in case you can start on this, here's my initial set of comments.
src/include/catalog/catalog.h
Outdated
protected:
    std::unique_ptr<CatalogContent> catalogContentForReadOnlyTrx;
    std::unique_ptr<CatalogContent> catalogContentForWriteTrx;
    storage::WAL* wal;
    std::unique_ptr<storage::BMFileHandle> nodeGroupsMetaFH;
Is this the right place to store this FH? It looks like this might belong to a class in StorageManager. Also, the name Meta does not sound correct. Maybe kzMetadataFileFH, metadataFileFH, or metadataFH. I have a few suggestions around this (e.g., in constants.h or storagemanager.h). Whatever you decide about these file names, just be consistent in every field/variable/constant etc.
Still kept in Catalog for now. Honestly, I'm not sure whether Catalog or StorageManager is the best place. I want to revisit this as we move on.
Why? Catalog is clearly not a place to keep track of disk-related fields, such as file handles. In addition, this field already exists (as a raw pointer) in node_table.h, which makes sense.
My understanding was that, regardless of whether it is logical (schema) or physical (metadata), this information is all metadata around the physically stored table data. It depends on whether we should separate logical and physical or not. I was a bit indecisive, but now I think separating them makes more sense, so I will make the change.
I tried to move metadataFH to StorageManager. The part I like is that it avoids having Catalog interact with the metadata file; the annoying part is that during WAL replaying for recovery there is no StorageManager present, so we would need to somehow (re)construct a metadataFH, which I don't think is a correct design.
Following this (not related to this PR), I've been quite confused about why we chose to do recovery before we construct the Catalog and StorageManager objects. (I can't quite remember why we have to do that.) Is this still a design we must keep?
src/storage/wal_replayer.cpp
Outdated
auto tableSchema = catalogForCheckpointing->getReadOnlyVersion()->getTableSchema(tableID);
auto property = tableSchema->getProperty(propertyID);
if (tableSchema->isNodeTable) {
    WALReplayerUtils::initPropertyMetaDAsOnDisk(
This looks a bit wrong to me. I was expecting you to go through the regular WAL-version-of-pages mechanism to checkpoint the necessary pages on disk, and then call something like a "checkpointInMemory()" function on NodeTable to create the NodeColumns (and the InMemDiskArrays) that are being created as part of the add property DDL statement. I am even curious whether we are already doing the WAL-version-of-pages way of checkpointing as well as WALReplayerUtils::initPropertyMetaDAsOnDisk; it's not obvious to me that we are not doing such "double checkpointing" on disk.
I'm separating this change into another PR.
Here are the rest of my comments. Let's do another iteration and also discuss certain things in person if you need.
@@ -97,6 +132,20 @@ void NodeTable::prepareRollback() {
    }
}

void NodeTable::checkpointInMemory() {
I know this code is also used when there is a copy to a node table, so all properties and the PK are being updated. But if I understand correctly, it is also used when a new property was added, right? So in that case as well, we go through this coarse way of checkpointing every component of a NodeTable in memory. In the wal_replayer we should probably have a mechanism to keep track of not just the nodeTables but also the individual properties that require in-memory checkpointing.
You don't have to do it here, but I'm adding this comment to record the problem.
I believe we should keep track of these directly inside Transaction, which seems easier and makes more sense to me.
Force-pushed from df1bb2c to b7352e4.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #1802 +/- ##
==========================================
- Coverage 91.11% 89.62% -1.49%
==========================================
Files 813 821 +8
Lines 29317 30160 +843
==========================================
+ Hits 26711 27032 +321
- Misses 2606 3128 +522
☔ View full report in Codecov by Sentry.
Force-pushed from df07db5 to 00eacad.
This is the first major PR towards node group-based storage.
The implementation is detailed as follows:
Column Chunk
ColumnChunk is used during COPY. It buffers all values of one column within a node group. The basic methods for setting/getting and copying values into the ColumnChunk are essentially the same as in InMemColumnChunk, which is still used to populate columns in rel tables. Eventually, these two will be merged together.
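To make this concrete, here is a minimal sketch of such a per-column buffer for a fixed-width type. The class name, fields, and methods below are illustrative only, not kuzu's actual ColumnChunk API:

```cpp
// Illustrative sketch (not kuzu's ColumnChunk): an in-memory buffer holding all
// values of a single fixed-width column for one node group during COPY.
#include <cstdint>
#include <cstring>
#include <vector>

class ColumnChunkSketch {
public:
    ColumnChunkSketch(uint32_t elemSize, uint64_t capacity)
        : elemSize{elemSize}, buffer(elemSize * capacity), nullMask(capacity, false) {}

    // Copy numValues raw values into the chunk starting at posInChunk.
    void setValues(const uint8_t* values, uint64_t posInChunk, uint64_t numValues) {
        std::memcpy(buffer.data() + posInChunk * elemSize, values, numValues * elemSize);
    }

    void setNull(uint64_t posInChunk, bool isNull) { nullMask[posInChunk] = isNull; }

    // Raw bytes handed off when the node group is flushed to disk.
    const uint8_t* getData() const { return buffer.data(); }
    uint64_t getBufferSize() const { return buffer.size(); }

private:
    uint32_t elemSize;            // size of one value in bytes
    std::vector<uint8_t> buffer;  // one fixed-width slot per node offset in the group
    std::vector<bool> nullMask;   // nulls tracked separately (flushed as their own chunk)
};
```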
Node Group
NodeGroup is a wrapper of all ColumnChunks from node properties with the same set of node offsets. The class is also used during COPY.
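Continuing the sketch above (it reuses ColumnChunkSketch from the previous snippet; names are again hypothetical), a node group is essentially one such chunk per node property plus a row counter:

```cpp
// Illustrative sketch (not kuzu's NodeGroup): one ColumnChunk per node property,
// all covering the same contiguous range of node offsets.
#include <cstdint>
#include <memory>
#include <vector>

class NodeGroupSketch {
public:
    explicit NodeGroupSketch(std::vector<std::unique_ptr<ColumnChunkSketch>> chunks)
        : chunks{std::move(chunks)} {}

    ColumnChunkSketch* getChunk(uint32_t propertyIdx) { return chunks[propertyIdx].get(); }

    // COPY appends a batch of rows to every chunk, then advances the counter.
    void advance(uint64_t numRowsAppended) { numNodes += numRowsAppended; }

    uint64_t getNumNodes() const { return numNodes; }
    bool isFull(uint64_t nodeGroupCapacity) const { return numNodes == nodeGroupCapacity; }

private:
    std::vector<std::unique_ptr<ColumnChunkSketch>> chunks; // one per node property
    uint64_t numNodes = 0;                                  // rows currently buffered
};
```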
Copy Node
The COPY node pipeline consists of two operators, ReadFile and CopyNode. The former is responsible for reading chunks from files and feeding them to CopyNode, while CopyNode populates the node table correspondingly.
The logic of CopyNode is as follows:
- Each thread appends data coming from ReadFile into a local node group. Once full, it is flushed out and reset afterwards (CopyNode::appendNodeGroupToTableAndPopulateIndex). When ReadFile is exhausted, data left in the local node group is merged into a shared one among all threads (to avoid flushing many non-full node groups). Finally, the last thread guarantees that all data left in the shared node group is flushed to disk (CopyNode::finalize).
- Each node group is assigned a node group idx (CopyNodeSharedState::getNextNodeGroupIdx) when it is flushed out (NodeTable::appendNodeGroup). The start node offset of the node group is calculated from the given node group idx (nodeGroupIdx << NODE_GROUP_SIZE_LOG2). Also, the node group idx is used to access its metadata in the meta disk array (columnChunksMetaDA->get(nodeGroupIdx), used in NodeColumn). The assignment of node group idx is coordinated through the shared state of CopyNode; see the sketch after this section.
Note that this design is not order preserving, meaning that the ordering of nodes in the internal storage is not guaranteed to be the same as in the original csv/parquet files.
Order-preserving COPY will be added later along with the fix of SERIAL.
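As a rough sketch of the idx assignment and offset math described in the list above, assuming a hypothetical shared state with an atomic counter and an assumed NODE_GROUP_SIZE_LOG2 of 17 (the real CopyNodeSharedState and constant may differ):

```cpp
// Sketch of node group idx assignment and start-offset calculation during COPY.
// The atomic counter stands in for CopyNodeSharedState::getNextNodeGroupIdx;
// NODE_GROUP_SIZE_LOG2 = 17 (128K nodes per group) is an assumed value.
#include <atomic>
#include <cstdint>

static constexpr uint64_t NODE_GROUP_SIZE_LOG2 = 17;

struct CopySharedStateSketch {
    std::atomic<uint64_t> nextNodeGroupIdx{0};
    // Each flushed node group grabs a unique, monotonically increasing idx.
    uint64_t getNextNodeGroupIdx() { return nextNodeGroupIdx.fetch_add(1); }
};

// The start node offset of a node group is fully determined by its idx.
inline uint64_t startOffsetOfNodeGroup(uint64_t nodeGroupIdx) {
    return nodeGroupIdx << NODE_GROUP_SIZE_LOG2;
}
```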
Column Chunk Metadata
Catalog holds the file handle of the metadata file, which consists of disk arrays of ColumnChunkMetadata for each column. ColumnChunkMetadata keeps track of the starting page idx and the number of pages for a column chunk, so we can correctly read it back from disk. Besides, Property holds MetaDiskArrayHeaderInfo, which keeps track of the disk arrays related to this property (there can be multiple disk arrays because we separate data, null, and nested children into different column chunks).
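Conceptually, each entry in such a disk array only needs to record where the flushed chunk lives in the data file, roughly like the struct below (field names are illustrative, not the actual ColumnChunkMetadata definition):

```cpp
// Illustrative per-node-group metadata for one column chunk: where the chunk
// starts in the column data file and how many pages it occupies.
// One such entry per node group, stored in a disk array indexed by node group idx.
#include <cstdint>

struct ColumnChunkMetadataSketch {
    uint64_t pageIdx = 0;   // first page of the chunk in the data file
    uint64_t numPages = 0;  // number of pages occupied by the chunk
};
```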
Node Column
Serves similar functionality as Column; eventually these two should be merged into one. A node column holds a disk array of column chunk metadata.
Given a starting offset, a read starts by calculating the node group idx (nodeOffset >> common::StorageConstants::NODE_GROUP_SIZE_LOG2), then accesses the column chunk metadata to get the starting page idx of the column chunk in the data file, and finally reads the necessary pages accordingly (same logic as before this PR). The same holds for writes: given any node offset, we first need to get the corresponding column chunk metadata to figure out the starting page idx; then the write logic is the same.
(Note: for scans, we can optimize this later to avoid repeatedly calculating and accessing the column chunk metadata. If we change the morsel to a node group, the scan operator should keep a scan state and only fetch the column chunk metadata while scanning the first vector within a node group; subsequent scans within the same node group can reuse the metadata.)
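Putting the pieces together, the read path described above boils down to roughly the following (reusing ColumnChunkMetadataSketch and NODE_GROUP_SIZE_LOG2 from the earlier sketches; the helper and its signature are hypothetical):

```cpp
// Sketch of locating a value in a node-group-based column:
// node offset -> node group idx -> column chunk metadata -> page within the chunk.
#include <cstdint>
#include <vector>

struct PageCursorSketch {
    uint64_t pageIdx;    // absolute page idx in the column data file
    uint32_t posInPage;  // element position within that page
};

// metaDA stands in for the per-column disk array of metadata (columnChunksMetaDA).
inline PageCursorSketch locateValue(uint64_t nodeOffset, uint32_t elemSize, uint64_t pageSize,
    const std::vector<ColumnChunkMetadataSketch>& metaDA) {
    auto nodeGroupIdx = nodeOffset >> NODE_GROUP_SIZE_LOG2;   // which node group
    const auto& chunkMeta = metaDA[nodeGroupIdx];             // where its chunk starts on disk
    auto offsetInGroup = nodeOffset - (nodeGroupIdx << NODE_GROUP_SIZE_LOG2);
    auto elemsPerPage = pageSize / elemSize;
    return PageCursorSketch{chunkMeta.pageIdx + offsetInGroup / elemsPerPage,
        static_cast<uint32_t>(offsetInGroup % elemsPerPage)};
}
```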
TODOs
Here is a list of things that are broken right now and on the way to being fixed:
Immediate follow-up work for this PR includes (not ordered here):