feat: expr analyzer for buffer to filter table chunks #25866

hiltontj · 2025-01-20T02:31:00Z

Related to https://github.com/influxdata/influxdb_pro/issues/436

This PR updates the filter handling in the WriteBuffer so that sets of Exprs provided in a query will better prune both chunks from the in-memory buffer, as well as the set of parquet file chunks that are forwarded to DataFusion, for query execution.

New `BufferFilter` type

This introduces the BufferFilter type. This converts a set of Exprs from a logical query plan into a filter that can be used to:

prune chunks based on a provided lower/upper time boundary from both the buffer and parquet
prune chunks from the buffer based on any literal guarantees predicated on tag columns in the query, e.g., WHERE tag = 'a' or WHERE tag IN ['a', 'b']

This type is exposed such that it will be easy to use from replicated buffers and from the compactor when producing Arc<dyn QueryChunk>s in Enterprise.

Tests

Tests in the table_buffer module were updated to use the WriteValidator. This allows construction of rows based on line protocol directly, and in cleaning up the tests a bit, allowed me to extend some of the test cases in this test.
I added a test that checks the buffer chunk index filtering for expressions against multiple tag columns.
Added a test that checks time pruning
Added a test that checks time pruning in PersistedFiles
I renamed several tests to start with test_.

pauldix

Had a quick look through, but I think a call would be good 😁

pauldix · 2025-01-20T20:46:17Z

influxdb3_write/src/lib.rs

+    ///
+    /// - determine if there are any filters on the `time` column, in which case, attempt to derive
+    ///   an interval that defines the boundaries on `time` from the query.
+    /// - determine if there are any _literal guarantees_ on tag columns contained in the filter


I don't think it should be limited to tag columns. The user can choose to have any column indexed. The default is all tags, but they could override that to index only a single tag, or a field. The index is just a unique value -> file id. And it's actually not even a unique value, it's the xxhash of the value to file id.

pauldix · 2025-01-20T20:48:51Z

influxdb3_write/src/lib.rs

+                literals,
+            } in literal_guarantees
+            {
+                // We are only interested in literal guarantees on tag columns for the buffer index:


Should use the index columns from the table definition, which may or may not be tags and may not be the entire set of tags.

pauldix · 2025-01-20T20:49:31Z

influxdb3_write/src/lib.rs

+                    continue;
+                };
+
+                // We are only interested in string literals with respect to tag columns:


This is still true, the index is scoped to string fields or tags.

pauldix · 2025-01-20T20:53:43Z

influxdb3_write/src/lib.rs

+
+                // Update the guarantees on this column. We handle multiple guarantees here, i.e.,
+                // if there are multiple Expr's that lead to multiple guarantees on a given column.
+                guarantees


I'm not sure how these work without pointers to the index. In Enterprise, the part that walks the expression tree for index matches pulls the list of file ids (i.e. the posting list) and then does actual intersection and unique with the list of files that match the expression. The resulting set of IDs are the ones that potentially apply to the query.

I'm not quite understanding what this is doing without access to the posting list that Enterprise uses.

Oh wait, looking further down you have an actual row index in the buffer. Might be better to talk through this one on a call.

I had a row index in a long ago iteration of the buffer, but I don't think I brought it along with a number of different refactorings

hiltontj · 2025-01-21T00:58:06Z

Need to log a few follow-on issues:

BufferFilter should leverage the index_columns when filtering out which columns to analyze for literal guarantees (see comment) https://github.com/influxdata/influxdb_pro/issues/444
The BufferGuarantee should store the XXHash of the string literal as a u64 here, and downstream of that, the buffer index should use the XXHash of the literal values as well, instead of string literals. Use a XXHash of literals in BufferFilter and BufferIndex #25875
The BufferFilter should short-circuit on NotIn guarantees and let DataFusion do the filtering. Do not handle NotIn guarantees in the BufferFilter #25881

hiltontj added the v3 label Jan 20, 2025

hiltontj self-assigned this Jan 20, 2025

hiltontj force-pushed the hiltontj/buffer-time-pruning branch 2 times, most recently from 2eef092 to daa3fe7 Compare January 20, 2025 02:51

hiltontj mentioned this pull request Jan 20, 2025

test: interval analysis unit tests apache/datafusion#14189

Merged

hiltontj force-pushed the hiltontj/buffer-time-pruning branch from daa3fe7 to 559414e Compare January 20, 2025 17:27

feat: expr analyzer for buffer to filter table chunks

bab428f

hiltontj force-pushed the hiltontj/buffer-time-pruning branch from 559414e to bab428f Compare January 20, 2025 19:50

hiltontj marked this pull request as ready for review January 20, 2025 20:40

hiltontj requested a review from a team January 20, 2025 20:41

pauldix reviewed Jan 20, 2025

View reviewed changes

pauldix approved these changes Jan 21, 2025

View reviewed changes

refactor: code cleanup

18b087a

hiltontj mentioned this pull request Jan 21, 2025

Use a XXHash of literals in BufferFilter and BufferIndex #25875

Closed

hiltontj merged commit b9a7927 into main Jan 21, 2025
13 checks passed

hiltontj deleted the hiltontj/buffer-time-pruning branch January 21, 2025 01:20

This was referenced Jan 21, 2025

Do not handle NotIn guarantees in the BufferFilter #25881

Open

feat: use u64 hash in buffer index instead of str literal #25883

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: expr analyzer for buffer to filter table chunks #25866

feat: expr analyzer for buffer to filter table chunks #25866

hiltontj commented Jan 20, 2025 •

edited

Loading

pauldix left a comment

pauldix Jan 20, 2025

pauldix Jan 20, 2025

pauldix Jan 20, 2025

pauldix Jan 20, 2025

pauldix Jan 20, 2025

pauldix Jan 20, 2025

hiltontj commented Jan 21, 2025 •

edited

Loading

feat: expr analyzer for buffer to filter table chunks #25866

feat: expr analyzer for buffer to filter table chunks #25866

Conversation

hiltontj commented Jan 20, 2025 • edited Loading

New BufferFilter type

Tests

pauldix left a comment

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

pauldix Jan 20, 2025

Choose a reason for hiding this comment

hiltontj commented Jan 21, 2025 • edited Loading

hiltontj commented Jan 20, 2025 •

edited

Loading

New `BufferFilter` type

hiltontj commented Jan 21, 2025 •

edited

Loading