
feat: support dynamic vocab len #30

Merged — 2 commits merged into main on Jan 15, 2025
Conversation

silver-ymz
Member

Closes #25

Now, none of the index operations rely on the vocab length. Maybe we can remove the TOKENIZER_NAME GUC and change tokenizer(text) to tokenizer(text, model_name).
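A hedged sketch of the before/after this proposal would imply. The exact GUC name as set via `SET`, and the `'unicode'` model name, are illustrative assumptions, not confirmed by this thread:

```sql
-- Before (illustrative): the tokenizer is chosen globally via a GUC
SET bm25.tokenizer_name = 'unicode';
SELECT tokenizer(passage) FROM documents;

-- After (proposed): the tokenizer model is passed explicitly per call,
-- so no session-level GUC is needed
SELECT tokenizer(passage, 'unicode') FROM documents;
```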

@silver-ymz silver-ymz requested a review from VoVAllen January 13, 2025 04:58
@VoVAllen
Member

Maybe we can remove the TOKENIZER_NAME GUC and change tokenizer(text) to tokenizer(text, model_name).

Looks good to me. Better to use a verb, like tokenize(text). And do we need to change the syntax to

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('PostgreSQL')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

or

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', 'PostgreSQL', tokenizer_name) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

@VoVAllen
Member

And is this related to the posting cursor refactor?

@silver-ymz
Member Author

And do we need to change the syntax to

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('PostgreSQL')) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

or

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', 'PostgreSQL', tokenizer_name) AS rank
FROM documents
ORDER BY rank
LIMIT 10;

Yes. Here is an example of the updated API:

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT
);

INSERT INTO documents (passage) VALUES
('PostgreSQL is a powerful, open-source object-relational database system.');

ALTER TABLE documents ADD COLUMN embedding bm25vector;

UPDATE documents SET embedding = tokenize(passage, 'unicode'); -- specify the tokenizer here (single quotes: a SQL string literal, not a quoted identifier)

CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);

SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', 'PostgreSQL', 'unicode') AS rank -- also specify the tokenizer for to_bm25query
FROM documents
ORDER BY rank
LIMIT 10;
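A small hedged variation on the query above, combining the BM25 ranking with an ordinary text filter. The query text and the ILIKE predicate are illustrative; the `<&>` operator and `to_bm25query` signature follow the example in this comment:

```sql
-- Rank only the rows that pass a plain-SQL filter (illustrative usage)
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', 'open-source database', 'unicode') AS rank
FROM documents
WHERE passage ILIKE '%PostgreSQL%'
ORDER BY rank
LIMIT 5;
```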

And is this related to the posting cursor refactor?

No, it doesn't include the posting cursor refactor from #18.

@VoVAllen
Member

The updated API looks good to me. Please update the README accordingly when finished. You can merge it when ready.

Signed-off-by: Mingzhuo Yin <[email protected]>
@silver-ymz silver-ymz merged commit 496a6cc into main Jan 15, 2025
5 checks passed
@silver-ymz silver-ymz deleted the feat/dynamic-vocab-size branch January 15, 2025 05:35
Successfully merging this pull request may close these issues.

feat: Support dynamic vocab size