Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expr analyzer for buffer to filter table chunks #25866

Merged
merged 2 commits into from
Jan 21, 2025

Conversation

hiltontj
Copy link
Contributor

@hiltontj hiltontj commented Jan 20, 2025

Related to https://github.com/influxdata/influxdb_pro/issues/436

This PR updates the filter handling in the WriteBuffer so that sets of Exprs provided in a query will better prune both chunks from the in-memory buffer, as well as the set of parquet file chunks that are forwarded to DataFusion, for query execution.

New BufferFilter type

This introduces the BufferFilter type. This converts a set of Exprs from a logical query plan into a filter that can be used to:

  • prune chunks based on a provided lower/upper time boundary from both the buffer and parquet
  • prune chunks from the buffer based on any literal guarantees predicated on tag columns in the query, e.g., WHERE tag = 'a' or WHERE tag IN ['a', 'b']

This type is exposed such that it will be easy to use from replicated buffers and from the compactor when producing Arc<dyn QueryChunk>s in Enterprise.

Tests

  • Tests in the table_buffer module were updated to use the WriteValidator. This allows construction of rows based on line protocol directly, and in cleaning up the tests a bit, allowed me to extend some of the test cases in this test.
  • I added a test that checks the buffer chunk index filtering for expressions against multiple tag columns.
  • Added a test that checks time pruning
  • Added a test that checks time pruning in PersistedFiles
  • I renamed several tests to start with test_.

@hiltontj hiltontj added the v3 label Jan 20, 2025
@hiltontj hiltontj self-assigned this Jan 20, 2025
@hiltontj hiltontj force-pushed the hiltontj/buffer-time-pruning branch 2 times, most recently from 2eef092 to daa3fe7 Compare January 20, 2025 02:51
@hiltontj hiltontj force-pushed the hiltontj/buffer-time-pruning branch from daa3fe7 to 559414e Compare January 20, 2025 17:27
@hiltontj hiltontj force-pushed the hiltontj/buffer-time-pruning branch from 559414e to bab428f Compare January 20, 2025 19:50
@hiltontj hiltontj marked this pull request as ready for review January 20, 2025 20:40
@hiltontj hiltontj requested a review from a team January 20, 2025 20:41
Copy link
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a quick look through, but I think a call would be good 😁

///
/// - determine if there are any filters on the `time` column, in which case, attempt to derive
/// an interval that defines the boundaries on `time` from the query.
/// - determine if there are any _literal guarantees_ on tag columns contained in the filter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be limited to tag columns. The user can choose to have any column indexed. The default is all tags, but they could override that to index only a single tag, or a field. The index is just a unique value -> file id. And it's actually not even a unique value, it's the xxhash of the value to file id.

literals,
} in literal_guarantees
{
// We are only interested in literal guarantees on tag columns for the buffer index:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use the index columns from the table definition, which may or may not be tags and may not be the entire set of tags.

continue;
};

// We are only interested in string literals with respect to tag columns:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still true, the index is scoped to string fields or tags.


// Update the guarantees on this column. We handle multiple guarantees here, i.e.,
// if there are multiple Expr's that lead to multiple guarantees on a given column.
guarantees
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how these work without pointers to the index. In Enterprise, the part that walks the expression tree for index matches pulls the list of file ids (i.e. the posting list) and then does actual intersection and unique with the list of files that match the expression. The resulting set of IDs are the ones that potentially apply to the query.

I'm not quite understanding what this is doing without access to the posting list that Enterprise uses.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, looking further down you have an actual row index in the buffer. Might be better to talk through this one on a call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a row index in a long ago iteration of the buffer, but I don't think I brought it along with a number of different refactorings

@hiltontj
Copy link
Contributor Author

hiltontj commented Jan 21, 2025

Need to log a few follow-on issues:

@hiltontj hiltontj merged commit b9a7927 into main Jan 21, 2025
13 checks passed
@hiltontj hiltontj deleted the hiltontj/buffer-time-pruning branch January 21, 2025 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants