feat: add support for ngram indices #3468

westonpace · 2025-02-21T00:33:03Z

Ngram indices speed up various string filters. To start with, they will speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (a short sequence of characters) that appears in a value; the bitmap records which rows contain that ngram. For example, an index of 1-grams would hold one bitmap per distinct character, such as each letter of the alphabet. At query time, we break the search string into ngrams and intersect their bitmaps to narrow down which strings could potentially satisfy the query.
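To make the idea concrete, here is a minimal sketch of a 1-gram index in plain Python. It is not the code in this PR: Python sets stand in for bitmaps, and the names ngrams, build_index, and candidates are hypothetical helpers invented for illustration.

```python
from collections import defaultdict

def ngrams(text, n):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_index(values, n):
    """Map each ngram to the set of row ids whose value contains it.

    A Python set stands in for the per-ngram bitmap (an assumption made
    for this sketch only).
    """
    index = defaultdict(set)
    for row_id, value in enumerate(values):
        for gram in ngrams(value, n):
            index[gram].add(row_id)
    return index

def candidates(index, num_rows, substr, n):
    """Intersect the row sets of every ngram in substr.

    The result is a superset of the rows that really contain substr.
    """
    grams = ngrams(substr, n)
    if not grams:
        # Query shorter than n: the index cannot narrow anything down.
        return set(range(num_rows))
    result = None
    for gram in grams:
        rows = index.get(gram, set())
        result = set(rows) if result is None else result & rows
    return result

values = ["apple", "banana", "grape", "pea"]
index = build_index(values, 1)
cands = candidates(index, len(values), "ap", 1)
print(sorted(cands))  # [0, 2, 3]
```

Row 3 ("pea") is a candidate because it contains both "a" and "p", even though it does not contain "ap". The index narrows the search but cannot answer the filter by itself.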

This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query; it only narrows down the list of candidates. This means that we still need to apply the filter to the results of the index search. Other indices that might behave like this are bloom filters and zone maps. A good portion of this PR adds support for this concept to the scanner.
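The recheck flow can be sketched as follows. This is an illustration under stated assumptions, not the scanner code in this PR: inexact_search is a hypothetical stand-in for any inexact index (it scans every value, which a real index would avoid), and scan_with_recheck is likewise an invented name.

```python
def inexact_search(values, substr):
    """Hypothetical stand-in for an inexact index.

    Keeps every row whose value contains each character of substr, so it
    may return false positives but never drops a true match.
    """
    needed = set(substr)
    return [i for i, v in enumerate(values) if needed <= set(v)]

def scan_with_recheck(values, substr):
    """Use the index to pick candidates, then re-apply the real filter."""
    candidate_rows = inexact_search(values, substr)
    # Recheck: evaluate the original predicate on the candidates only.
    return [i for i in candidate_rows if substr in values[i]]

values = ["apple", "banana", "grape", "pea"]
candidate_rows = inexact_search(values, "ap")
matching_rows = scan_with_recheck(values, "ap")
print(candidate_rows)  # [0, 2, 3]
print(matching_rows)   # [0, 2]
```

The recheck only evaluates the predicate on rows the index could not rule out, so the work saved depends on how selective the index is.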

add support for ngram indices

codecov-commenter · 2025-02-21T01:23:59Z

Codecov Report

Attention: Patch coverage is 55.23560% with 342 lines in your changes missing coverage. Please review.

Project coverage is 78.63%. Comparing base (cca98fc) to head (511edc5).

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/ngram.rs	28.35%	184 Missing and 8 partials ⚠️
rust/lance-index/src/scalar/expression.rs	37.03%	65 Missing and 3 partials ⚠️
rust/lance-index/src/scalar/inverted/index.rs	15.00%	17 Missing ⚠️
rust/lance-index/src/scalar.rs	69.23%	11 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/btree.rs	38.46%	6 Missing and 2 partials ⚠️
rust/lance-index/src/scalar/label_list.rs	52.94%	7 Missing and 1 partial ⚠️
rust/lance/src/dataset/scanner.rs	94.24%	0 Missing and 8 partials ⚠️
rust/lance-core/src/utils/mask.rs	0.00%	7 Missing ⚠️
rust/lance/src/io/exec/scalar_index.rs	40.00%	3 Missing and 3 partials ⚠️
rust/lance-core/src/datatypes/schema.rs	63.63%	4 Missing ⚠️
... and 6 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3468      +/-   ##
==========================================
- Coverage   78.82%   78.63%   -0.20%     
==========================================
  Files         251      252       +1     
  Lines       92866    93506     +640     
  Branches    92866    93506     +640     
==========================================
+ Hits        73202    73524     +322     
- Misses      16686    16994     +308     
- Partials     2978     2988      +10

Flag	Coverage Δ
unittests	`78.63% <55.23%> (-0.20%)`	⬇️

Flags with carried forward coverage won't be shown.


add support for inexact indices which need a recheck

d22be8a

add support for ngram indices

github-actions bot added the enhancement (New feature or request) and python labels on Feb 21, 2025

Add missing license header

511edc5
