Unable to create index on svector without dimension #21
Comments
The documentation definitely needs updating, but in the meantime you can try something like this:
-- create a bm25 matrix
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'hf', 'google-bert/bert-base-uncased', 0.75, 1.2);
-- convert a string to a sparse vector to get the dimension (the number after the '}/')
SELECT bm25_document_to_svector('documents_passage_bm25', 'Some test string');
-- {24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111
-- add embedding column
ALTER TABLE documents ADD COLUMN embedding svector(28111);
-- create index
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);
-- embed column using specified bm25 matrix
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', documents.passage)::svector;
-- Query
-- get the query's vector
SELECT bm25_query_to_svector('documents_passage_bm25', 'Where did Brooklyn Sudano''s mother die?');
-- {2927:0.76834136, 5132:0.76834136, 6102:0.76834136, 8652:0.76834136, 11558:0.76834136, 11560:0.76834136, 18712:0.76834136, 22788:0.76834136, 24841:0.76834136, 27195:0.76834136}/28111
-- find 10 most relevant documents
SELECT d.passage, 1 - (d.embedding <=> '{2927:0.056427535, 5132:0.021093048, 6102:0.045897257, 8652:0.24935675, 11558:0.037319094, 11560:0.15588555, 18712:0.12755758, 22788:0.013146327, 24841:0.26317492, 27195:0.030141948}/28111') as score
FROM documents d
ORDER BY score DESC
LIMIT 10;
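To make the two key steps above concrete, here is a minimal Python sketch that reads the dimension out of a sparse-vector literal and computes the `1 - distance` score client-side. It assumes the text format shown in the output above (`{index:value, ...}/dimension`) and assumes `<=>` is cosine distance. The helper names `parse_svector` and `cosine_score` are hypothetical and not part of any library.

```python
import math


def parse_svector(text):
    """Parse '{index:value, ...}/dimension' into (entries dict, dimension)."""
    body, _, dim = text.strip().rpartition("/")
    inner = body.strip()[1:-1].strip()  # drop the surrounding braces
    entries = {}
    if inner:
        for pair in inner.split(","):
            index, value = pair.split(":")
            entries[int(index)] = float(value)
    return entries, int(dim)


def cosine_score(a, b):
    """Cosine similarity of two sparse dicts, i.e. 1 - cosine distance."""
    dot = sum(v * b[i] for i, v in a.items() if i in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


doc, dim = parse_svector("{24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111")
print(dim)  # the value to use in svector(<dim>)
```

The number after the final `/` is what goes into the `svector(...)` column declaration; the entries themselves are only needed if you want to check scores outside the database.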
Forgive me, for I don't exactly follow; from https://blog.pgvecto.rs/pgbestmatchrs-elevate-your-postgresql-text-queries-with-bm25 I was led to believe that However, then you speak of Perhaps this library is not the solution I thought it would be for implementing hybrid search on top of pgvectors?
Falls apart! I'm working with a Ukrainian dataset, and I do get Perhaps we should just wait for this end-to-end solution you're talking about, or try
@tucnak Would you be able to share your SQL? The dimensions come from the number of unique tokens present in the column used to build the BM25 matrix.
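As a rough illustration of where that dimension comes from, the sketch below counts the unique tokens across a column of passages. It uses a plain lowercase whitespace split as a stand-in tokenizer, which is an assumption for illustration only: the real BM25 matrix is built with whichever tokenizer was passed to `bm25_create`, so actual counts will differ. `vocabulary_size` is a hypothetical helper name.

```python
def vocabulary_size(passages):
    """Count unique tokens across all passages (stand-in whitespace tokenizer)."""
    vocab = set()
    for passage in passages:
        vocab.update(passage.lower().split())
    return len(vocab)


passages = ["the cat sat", "the dog sat down"]
print(vocabulary_size(passages))  # 5 unique tokens: the, cat, sat, dog, down
```

Under this reading, adding documents with previously unseen tokens would change the dimension, which is why it is read back from the built matrix rather than chosen up front.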
@tucnak Can you try one of the newly added tokenizers, like tiktoken o200k? It should be a multilingual one.
I'm trying to follow the tutorial in the README with the most recent pgvecto.rs; however, I ran into this issue:
Are we even supposed to pick a dimension for svector, and if yes, then how?