Unable to create index on svector without dimension #21

Open
tucnak opened this issue Oct 13, 2024 · 7 comments
Labels
question Further information is requested

Comments

@tucnak

tucnak commented Oct 13, 2024

I'm trying to follow the tutorial in the README with the most recent pgvecto.rs, but I ran into this issue:

CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- ERROR: pgvecto.rs: Dimensions type modifier of a vector column is needed for building the index.

Are we even supposed to pick a dimension for svector, and if so, how?

@jwnz
Contributor

jwnz commented Oct 15, 2024

The documentation definitely needs to be updated, but in the meantime, you can try something like this:

-- create a bm25 matrix
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'hf', 'google-bert/bert-base-uncased', 0.75, 1.2);

-- convert a string to a sparse vector to get the dimension (the number after the '}/')
SELECT bm25_document_to_svector('documents_passage_bm25', 'Some test string');
 -- {24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111

-- add embedding column
ALTER TABLE documents ADD COLUMN embedding svector(28111);

-- create index
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- embed column using specified bm25 matrix
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', documents.passage)::svector;

-- Query
-- get the query's vector
SELECT bm25_query_to_svector('documents_passage_bm25', 'Where did Brooklyn Sudano''s mother die?');
-- {2927:0.76834136, 5132:0.76834136, 6102:0.76834136, 8652:0.76834136, 11558:0.76834136, 11560:0.76834136, 18712:0.76834136, 22788:0.76834136, 24841:0.76834136, 27195:0.76834136}/28111

-- find 10 most relevant documents
SELECT d.passage, 1 - (d.embedding <=> '{2927:0.056427535, 5132:0.021093048, 6102:0.045897257, 8652:0.24935675, 11558:0.037319094, 11560:0.15588555, 18712:0.12755758, 22788:0.013146327, 24841:0.26317492, 27195:0.030141948}/28111') as score
FROM documents d
ORDER BY score DESC
LIMIT 10;
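
Editor's sketch: the dimension in the step above is read off the text form of the sparse vector, i.e. the integer after `}/`. The Python helper below only illustrates that text format; `parse_svector` is a hypothetical name and is not part of pgvecto.rs or pg_bestmatch.rs.

```python
def parse_svector(text: str) -> tuple[dict[int, float], int]:
    """Split '{index:value, ...}/dims' into its entries and its dimension.

    Hypothetical helper for illustration; it mirrors the text output
    shown in this thread, nothing more.
    """
    body, _, dims = text.strip().rpartition("/")
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"not an svector literal: {text!r}")
    entries: dict[int, float] = {}
    inner = body[1:-1].strip()
    if inner:  # '{}' means no non-zero entries
        for pair in inner.split(","):
            index, _, value = pair.partition(":")
            entries[int(index)] = float(value)
    return entries, int(dims)


entries, dims = parse_svector(
    "{24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111"
)
print(dims)  # the value to use in svector(<dims>)
```

The printed value is what goes into the column type, as in `svector(28111)` above.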

@tucnak
Author

tucnak commented Oct 15, 2024

Forgive me, I don't quite follow. From https://blog.pgvecto.rs/pgbestmatchrs-elevate-your-postgresql-text-queries-with-bm25 I was led to believe that pg_bestmatch.rs is a complementary extension, in the sense that it introduces BM25 full-text search capability as a means to hybrid search. Is that right? I personally found the idea appealing: using sparse vectors alongside the dense vector type I already use via pgvecto.rs for embeddings.

However, you then mention google-bert/bert-base-uncased. Isn't BERT a completely different method altogether? The document vectors that pg_bestmatch.rs generated for me with the README code all end in /489. Is this a constant of some kind, or how else would I derive it? I couldn't find it by grepping the code.

Perhaps this library is not the solution I thought it would be for implementing hybrid search on top of pgvecto.rs?

@gaocegege gaocegege added the question Further information is requested label Oct 15, 2024
@VoVAllen
Member

VoVAllen commented Oct 15, 2024

@tucnak Hi, can you reproduce the example provided by @jwnz? This extension makes BM25 search partially possible inside Postgres, but it is not an end-to-end solution yet. We're writing a brand-new one that tries to solve this end to end. Hopefully we can have it ready in mid-November.

@tucnak
Author

tucnak commented Oct 15, 2024

It falls apart! I'm working with a Ukrainian dataset, and I get {...}/1641 for English documents and {}/1641, i.e. empty, for Ukrainian documents. I still don't understand where the dimensions come from, or why any of this is necessary for BM25, which is a pretty simple statistical model / ranking function, is it not?

Perhaps we should just wait for this end-to-end solution you're talking about, or try pg_search in the meantime.
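
Editor's sketch, on why BM25 ends up as sparse vectors at all: the BM25 score is a sum over the query's terms, so it can be written as a dot product between a document vector (the term-frequency part) and a query vector (the IDF part), each with one slot per vocabulary token. The Python below is a toy under stated assumptions: a whitespace tokenizer, the standard Okapi BM25 formula, and `k1 = 1.2`, `b = 0.75` taken to be what the `1.2` and `0.75` passed to `bm25_create` mean. How pg_bestmatch.rs actually splits the weights between the two vectors is not stated in this thread.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumption: the 1.2 and 0.75 given to bm25_create

corpus = ["the cat sat on the mat", "the dog chased the cat", "birds sing at dawn"]
docs = [p.split() for p in corpus]  # toy whitespace tokenizer
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}
n_docs = len(docs)
avgdl = sum(len(d) for d in docs) / n_docs
df = Counter(t for d in docs for t in set(d))  # document frequency per token


def idf(term):
    return math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)


def doc_weight(tf, doc_len):
    return tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avgdl))


def doc_vector(tokens):
    # One non-zero slot per distinct token in the document.
    return {vocab[t]: doc_weight(f, len(tokens)) for t, f in Counter(tokens).items()}


def query_vector(tokens):
    # One non-zero slot per distinct query token known to the vocabulary.
    return {vocab[t]: idf(t) for t in set(tokens) if t in vocab}


def dot(a, b):
    return sum(w * b[i] for i, w in a.items() if i in b)


def bm25(query_tokens, doc_tokens):
    # Textbook BM25, computed directly for comparison.
    tf = Counter(doc_tokens)
    return sum(idf(t) * doc_weight(tf[t], len(doc_tokens))
               for t in set(query_tokens) if tf[t] > 0)


query = "cat mat".split()
scores = [dot(query_vector(query), doc_vector(d)) for d in docs]
best = max(range(n_docs), key=scores.__getitem__)
print(len(vocab), best)  # vector dimension, index of the top-ranked document
```

In this toy, the dot product and the direct formula agree for every document, and the vector dimension is simply the vocabulary size.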

@jwnz
Contributor

jwnz commented Oct 15, 2024

@tucnak Would you be able to share your SQL? The dimensions come from the number of unique tokens present in the column used to build the BM25 matrix.
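
Editor's sketch of the statement above that the dimension equals the number of unique tokens in the source column. This Python toy uses a lowercase whitespace tokenizer and skips tokens that are not in the vocabulary; both choices are assumptions made for illustration, and `build_vocab` and `to_sparse` are hypothetical names, not functions of the extension. It shows one way an empty vector with an unchanged dimension can arise, without claiming that this is what happened with the Ukrainian documents.

```python
def build_vocab(passages):
    # One slot per unique token across the whole column.
    tokens = sorted({tok for p in passages for tok in p.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_sparse(text, vocab):
    # Count known tokens; unknown tokens are skipped (an assumption).
    counts = {}
    for tok in text.lower().split():
        if tok in vocab:
            counts[vocab[tok]] = counts.get(vocab[tok], 0) + 1
    body = ", ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return "{" + body + "}/" + str(len(vocab))


vocab = build_vocab(["the cat sat", "the dog ran"])
print(len(vocab))                       # dimension = number of unique tokens
print(to_sparse("the cat", vocab))      # two known tokens
print(to_sparse("кіт сидить", vocab))   # no known tokens: empty, same dimension
```

The dimension therefore changes whenever the vocabulary changes, which is consistent with the different values (28111, 489, 1641) reported in this thread for different data and tokenizers.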

@VoVAllen
Member

VoVAllen commented Oct 15, 2024

@tucnak Can you try one of the newly added tokenizers, such as tiktoken o200k? It should be multilingual.

@VoVAllen
Member

SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'tiktoken', 'o200k_base', 0.75, 1.2);

4 participants