feat: expr analyzer for buffer to filter table chunks #25866
Had a quick look through, but I think a call would be good 😁
influxdb3_write/src/lib.rs
Outdated
///
/// - determine if there are any filters on the `time` column, in which case, attempt to derive
/// an interval that defines the boundaries on `time` from the query.
/// - determine if there are any _literal guarantees_ on tag columns contained in the filter
I don't think it should be limited to tag columns. The user can choose to have any column indexed. The default is all tags, but they could override that to index only a single tag, or a field. The index is just a unique value -> file id. And it's actually not even a unique value, it's the xxhash of the value to file id.
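For illustration, the index shape described in this comment (hash of a column value mapped to file ids) could be sketched as below. This is a minimal sketch under assumptions, not the actual InfluxDB code: `ColumnIndex`, `hash_value`, `insert`, and `lookup` are hypothetical names, and the standard library's `DefaultHasher` stands in for xxhash so the example needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeSet, HashMap};
use std::hash::{Hash, Hasher};

/// Hypothetical per-column index: hash of a string value -> ids of files
/// that contain that value. Only the hash is stored, not the value itself.
#[derive(Default)]
pub struct ColumnIndex {
    postings: HashMap<u64, BTreeSet<u64>>,
}

/// Assumption: `DefaultHasher` is used here in place of xxhash.
fn hash_value(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl ColumnIndex {
    /// Record that `file_id` contains `value` for this column.
    pub fn insert(&mut self, value: &str, file_id: u64) {
        self.postings
            .entry(hash_value(value))
            .or_default()
            .insert(file_id);
    }

    /// Return the sorted file ids that may contain `value`.
    /// A hash collision can yield extra candidates, never missing ones.
    pub fn lookup(&self, value: &str) -> Vec<u64> {
        self.postings
            .get(&hash_value(value))
            .map(|ids| ids.iter().copied().collect())
            .unwrap_or_default()
    }
}
```

Because the key is a hash rather than the value, a lookup result is a set of candidate files that still has to be checked against the actual predicate.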
literals,
} in literal_guarantees
{
// We are only interested in literal guarantees on tag columns for the buffer index:
Should use the index columns from the table definition, which may or may not be tags and may not be the entire set of tags.
continue;
};

// We are only interested in string literals with respect to tag columns:
This is still true, the index is scoped to string fields or tags.
// Update the guarantees on this column. We handle multiple guarantees here, i.e.,
// if there are multiple Expr's that lead to multiple guarantees on a given column.
guarantees
I'm not sure how these work without pointers to the index. In Enterprise, the part that walks the expression tree for index matches pulls the list of file ids (i.e. the posting list) and then does actual intersection and unique with the list of files that match the expression. The resulting set of IDs are the ones that potentially apply to the query.
I'm not quite understanding what this is doing without access to the posting list that Enterprise uses.
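The posting-list step described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the Enterprise implementation: `union_postings` and `intersect_postings` are hypothetical names, file ids are modelled as `u64`, and it is assumed that values inside one `IN` list are unioned while separate predicates are AND-ed together.

```rust
use std::collections::BTreeSet;

/// Union the posting lists for the values of a single predicate
/// (e.g. the values of one `IN` list). Duplicates collapse in the set.
pub fn union_postings(lists: &[Vec<u64>]) -> BTreeSet<u64> {
    lists.iter().flatten().copied().collect()
}

/// Intersect the candidate sets of several AND-ed predicates.
/// Returns `None` when there are no indexed predicates at all, meaning
/// the index cannot narrow the set of files.
pub fn intersect_postings(per_predicate: &[BTreeSet<u64>]) -> Option<BTreeSet<u64>> {
    let mut sets = per_predicate.iter();
    let first = sets.next()?.clone();
    Some(sets.fold(first, |acc, set| acc.intersection(set).copied().collect()))
}
```

The resulting ids are only the files that potentially apply to the query; the expression itself still has to be evaluated against their contents.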
Oh wait, looking further down you have an actual row index in the buffer. Might be better to talk through this one on a call.
I had a row index in a long ago iteration of the buffer, but I don't think I brought it along with a number of different refactorings
Need to log a few follow-on issues:
Related to https://github.com/influxdata/influxdb_pro/issues/436
This PR updates the filter handling in the `WriteBuffer` so that sets of `Expr`s provided in a query will better prune both chunks from the in-memory buffer, as well as the set of parquet file chunks that are forwarded to DataFusion, for query execution.

New `BufferFilter` type

This introduces the `BufferFilter` type. This converts a set of `Expr`s from a logical query plan into a filter that can be used to:

- […] `time` boundary from both the buffer and parquet
- […] `WHERE tag = 'a'` or `WHERE tag IN ['a', 'b']`

This type is exposed such that it will be easy to use from replicated buffers and from the compactor when producing `Arc<dyn QueryChunk>`s in Enterprise.

Tests

- […] `table_buffer` module were updated to use the `WriteValidator`. This allows construction of rows based on line protocol directly, and in cleaning up the tests a bit, allowed me to extend some of the test cases in this test.
- […] `PersistedFiles` […] `test_`[…].
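To make the two inputs that `BufferFilter` derives (a `time` interval and literal guarantees such as `tag = 'a'`) more concrete, here is a minimal sketch of how such a filter could decide whether a chunk can be skipped. It is not the type added in this PR: `SketchFilter`, `ChunkSummary`, and `may_match` are hypothetical names, and the per-chunk summary of min/max time and distinct string values is an assumption made for the example.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical, simplified stand-in for a filter derived from query `Expr`s.
pub struct SketchFilter {
    /// Inclusive bounds on `time`; `None` means unbounded on that side.
    pub time_min: Option<i64>,
    pub time_max: Option<i64>,
    /// Column name -> string literals the column is guaranteed to be one of,
    /// e.g. from `tag = 'a'` or an `IN` list.
    pub guarantees: HashMap<String, BTreeSet<String>>,
}

/// Hypothetical summary of a chunk: its time range and, for some columns,
/// the distinct string values it contains.
pub struct ChunkSummary {
    pub min_time: i64,
    pub max_time: i64,
    pub values: HashMap<String, BTreeSet<String>>,
}

impl SketchFilter {
    /// Returns `false` only when the chunk provably cannot contain matching
    /// rows; `true` means it must still be scanned.
    pub fn may_match(&self, chunk: &ChunkSummary) -> bool {
        // Prune on time: the chunk's range must overlap the filter interval.
        if self.time_min.is_some_and(|lo| chunk.max_time < lo) {
            return false;
        }
        if self.time_max.is_some_and(|hi| chunk.min_time > hi) {
            return false;
        }
        // Prune on literals: every guaranteed column must share at least one
        // value with the chunk, when the chunk has a summary for that column.
        self.guarantees.iter().all(|(column, allowed)| {
            match chunk.values.get(column) {
                Some(present) => !present.is_disjoint(allowed),
                None => true, // no summary for this column, cannot prune
            }
        })
    }
}
```

With a filter for `time` in `[100, 200]` and `tag IN ['a', 'b']`, a chunk covering `[150, 300]` that contains tag values `b` and `c` is kept, while a chunk that only contains `c`, or one whose time range ends at `99`, is pruned.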