feat: add support for ngram indices #3468

Open
wants to merge 2 commits into
base: main

westonpace
Contributor

Ngram indices speed up various string filters. To start with, they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (a short sequence of characters) that appears in the indexed values, where each bitmap records which rows contain that ngram. For example, consider an index of 1-grams. This would create one bitmap per character, such as one for each letter of the alphabet. Then, at query time, we can look up the bitmaps for the ngrams in the search string to narrow down which strings could potentially satisfy the query.
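
To make the idea concrete, here is a minimal sketch in Rust, not the code from this PR. It assumes character-based ngrams and uses a `BTreeSet<usize>` of row ids as a stand-in for a real bitmap. The names `RowSet`, `ngrams`, `build_index`, and `candidate_rows` are hypothetical and chosen only for illustration.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a bitmap: the set of row ids that contain an ngram.
type RowSet = BTreeSet<usize>;

/// Split `s` into its overlapping character ngrams of length `n`.
fn ngrams(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Build one row set per distinct ngram found in `values`.
fn build_index(values: &[&str], n: usize) -> HashMap<String, RowSet> {
    let mut index: HashMap<String, RowSet> = HashMap::new();
    for (row, value) in values.iter().enumerate() {
        for gram in ngrams(value, n) {
            index.entry(gram).or_default().insert(row);
        }
    }
    index
}

/// Rows that could contain `substr`: the intersection of the row sets of
/// every ngram in `substr`. This is a superset of the true matches.
fn candidate_rows(
    index: &HashMap<String, RowSet>,
    substr: &str,
    n: usize,
    num_rows: usize,
) -> RowSet {
    let grams = ngrams(substr, n);
    if grams.is_empty() {
        // The query is shorter than one ngram, so the index cannot narrow anything.
        return (0..num_rows).collect();
    }
    let mut result: Option<RowSet> = None;
    for gram in grams {
        // An ngram missing from the index means no row can match.
        let rows = index.get(&gram).cloned().unwrap_or_default();
        result = Some(match result {
            None => rows,
            Some(acc) => acc.intersection(&rows).cloned().collect(),
        });
    }
    result.unwrap_or_default()
}
```

With the values `["cat", "act", "dog"]` and 1-grams, searching for `"cat"` yields rows 0 and 1 as candidates, because `"act"` contains the same letters. Longer ngrams narrow the candidates further, but in general the result can still contain false positives.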

This is the first scalar index that requires a "recheck" step. The index doesn't tell us exactly which rows satisfy the query; it only narrows down the list of candidates. Other indices that might behave like this are bloom filters and zone maps. As a result, we still need to apply the filter to the results of the index search. A good portion of this PR adds support for this concept to the scanner.
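
The recheck step can be sketched as follows. This is an illustration under assumptions, not the scanner code in this PR: the `IndexResult` enum and the `apply_contains` function are hypothetical names, and rows are plain `usize` offsets into an in-memory slice.

```rust
/// Hypothetical result of an index search: either the exact matching rows,
/// or a superset of them that still needs the filter applied.
#[derive(Debug)]
enum IndexResult {
    Exact(Vec<usize>),
    AtMost(Vec<usize>),
}

/// Resolve an index result into the rows that truly satisfy
/// `contains(value, substr)`.
fn apply_contains(values: &[&str], result: IndexResult, substr: &str) -> Vec<usize> {
    match result {
        // An exact index already returned the final answer.
        IndexResult::Exact(rows) => rows,
        // An inexact index only narrowed the list, so recheck each candidate.
        IndexResult::AtMost(rows) => rows
            .into_iter()
            .filter(|&row| values[row].contains(substr))
            .collect(),
    }
}
```

The point of the design is that the expensive string comparison runs only on the candidate rows, while the result stays correct even when the index returns false positives.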

@github-actions (bot) added the enhancement and python labels on Feb 21, 2025
@codecov-commenter

Codecov Report

Attention: Patch coverage is 55.23560% with 342 lines in your changes missing coverage. Please review.

Project coverage is 78.63%. Comparing base (cca98fc) to head (511edc5).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/ngram.rs | 28.35% | 184 Missing and 8 partials ⚠️ |
| rust/lance-index/src/scalar/expression.rs | 37.03% | 65 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 15.00% | 17 Missing ⚠️ |
| rust/lance-index/src/scalar.rs | 69.23% | 11 Missing and 1 partial ⚠️ |
| rust/lance-index/src/scalar/btree.rs | 38.46% | 6 Missing and 2 partials ⚠️ |
| rust/lance-index/src/scalar/label_list.rs | 52.94% | 7 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/scanner.rs | 94.24% | 0 Missing and 8 partials ⚠️ |
| rust/lance-core/src/utils/mask.rs | 0.00% | 7 Missing ⚠️ |
| rust/lance/src/io/exec/scalar_index.rs | 40.00% | 3 Missing and 3 partials ⚠️ |
| rust/lance-core/src/datatypes/schema.rs | 63.63% | 4 Missing ⚠️ |

... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3468      +/-   ##
==========================================
- Coverage   78.82%   78.63%   -0.20%     
==========================================
  Files         251      252       +1     
  Lines       92866    93506     +640     
  Branches    92866    93506     +640     
==========================================
+ Hits        73202    73524     +322     
- Misses      16686    16994     +308     
- Partials     2978     2988      +10     
| Flag | Coverage Δ |
|---|---|
| unittests | 78.63% <55.23%> (-0.20%) ⬇️ |

