Unable to create index on svector without dimension #21

tucnak · 2024-10-13T20:07:53Z

I'm trying to follow the tutorial in the README with the most recent pgvecto.rs, but I ran into this issue:

CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- ERROR: pgvecto.rs: Dimensions type modifier of a vector column is needed for building the index.

Are we even supposed to pick a dimension for svector, and if so, how?


jwnz · 2024-10-15T03:58:06Z

The documentation definitely needs to be updated, but in the meantime, you can try something like this:

-- create a bm25 matrix
SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'hf', 'google-bert/bert-base-uncased', 0.75, 1.2);

-- convert a string to a sparse vector to get the dimension (the number after the '}/')
SELECT bm25_document_to_svector('documents_passage_bm25', 'Some test string');
-- {24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111

-- add embedding column
ALTER TABLE documents ADD COLUMN embedding svector(28111);

-- create index
CREATE INDEX ON documents USING vectors (embedding svector_dot_ops);

-- embed column using specified bm25 matrix
UPDATE documents SET embedding = bm25_document_to_svector('documents_passage_bm25', documents.passage)::svector;

-- Query
-- get the query's vector
SELECT bm25_query_to_svector('documents_passage_bm25', 'Where did Brooklyn Sudano''s mother die?');
-- {2927:0.76834136, 5132:0.76834136, 6102:0.76834136, 8652:0.76834136, 11558:0.76834136, 11560:0.76834136, 18712:0.76834136, 22788:0.76834136, 24841:0.76834136, 27195:0.76834136}/28111

-- find 10 most relevant documents
SELECT d.passage, 1 - (d.embedding <=> '{2927:0.056427535, 5132:0.021093048, 6102:0.045897257, 8652:0.24935675, 11558:0.037319094, 11560:0.15588555, 18712:0.12755758, 22788:0.013146327, 24841:0.26317492, 27195:0.030141948}/28111') as score
FROM documents d
ORDER BY score desc
limit 10;
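
The dimension lookup in the steps above can also be done programmatically. The following is a minimal Python sketch, assuming the text form `{index:value, ...}/dimension` shown in the sample outputs; `parse_svector` is a hypothetical helper written for illustration and is not part of pg_bestmatch.rs or pgvecto.rs.

```python
def parse_svector(literal: str) -> tuple[dict[int, float], int]:
    """Parse '{i:v, j:w}/dim' into ({i: v, j: w}, dim)."""
    # Everything after the last '/' is the dimension.
    body, _, dim_text = literal.strip().rpartition("/")
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"not a sparse vector literal: {literal!r}")
    entries = {}
    for pair in body[1:-1].split(","):
        pair = pair.strip()
        if not pair:
            continue  # an empty vector such as '{}/1641' has no entries
        index_text, _, value_text = pair.partition(":")
        entries[int(index_text)] = float(value_text)
    return entries, int(dim_text)


entries, dim = parse_svector(
    "{24058:0.7689637, 24688:0.7689637, 25455:0.7689637}/28111"
)
# Build the column definition from the dimension that was read back.
print(f"ALTER TABLE documents ADD COLUMN embedding svector({dim});")
```

The printed statement is the same `ALTER TABLE` used above, with the dimension taken from the sample vector rather than typed by hand.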

tucnak · 2024-10-15T08:33:41Z

Forgive me, but I don't quite follow. From https://blog.pgvecto.rs/pgbestmatchrs-elevate-your-postgresql-text-queries-with-bm25 I was led to believe that pg_bestmatch.rs is a complementary extension, in the sense that it introduces BM25 full-text search capability as a means to hybrid search. I found the idea appealing: use sparse vectors for BM25 alongside the dense vector type I already use via pgvecto.rs for embeddings.

However, you then mention google-bert/bert-base-uncased. Isn't BERT a completely different method altogether? The document vectors pg_bestmatch.rs generated for me with the README code all end in /489. Is this a constant of some kind, or how else would I derive it? I couldn't find it by grepping the code.

Perhaps this library is not the solution I thought it would be for implementing hybrid search on top of pgvecto.rs?

VoVAllen · 2024-10-15T10:48:54Z

@tucnak Hi, can you reproduce the example provided by @jwnz? This extension partially makes BM25 search possible inside Postgres, but it is not an end-to-end solution yet. We're writing a brand-new one that tries to solve this end to end. Hopefully we can have it ready in mid-November.

tucnak · 2024-10-15T15:20:14Z

Falls apart! I'm working with a Ukrainian dataset, and I get {...}/1641 for English documents and {}/1641, i.e. empty, for Ukrainian documents. I still don't understand where the dimensions are coming from, or why any of this is necessary for BM25, which is a pretty simple statistical model / ranking function, is it not?

Perhaps we should just wait for this end-to-end solution you're talking about, or try pg_search in the meantime.
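
One way to see why a vector dimension appears at all: a BM25 score is a sum over query terms, so it can be written as a dot product between two sparse vectors indexed by vocabulary position. The Python sketch below illustrates this under stated assumptions. It uses the common Okapi BM25 formula, assumes that the 0.75 and 1.2 passed to `bm25_create` are the usual `b` and `k1`, and uses a toy corpus. How pg_bestmatch.rs actually splits the weights between document and query vectors is not confirmed by this thread.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # assumed to correspond to the 1.2 and 0.75 in bm25_create

# Toy, already-tokenized corpus (illustrative data only).
docs = [["cat", "sat", "mat"], ["dog", "sat"], ["cat", "cat", "dog"]]
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
avgdl = sum(len(d) for d in docs) / len(docs)
df = Counter(t for d in docs for t in set(d))  # document frequency per term


def idf(term):
    n = df[term]
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1.0)


def doc_vector(doc):
    """Sparse vector of length-normalized term-frequency weights."""
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return {vocab[t]: f * (K1 + 1) / (f + norm) for t, f in tf.items()}


def query_vector(query):
    """Sparse vector of IDF weights; unknown tokens are dropped."""
    return {vocab[t]: idf(t) for t in set(query) if t in vocab}


def dot(a, b):
    return sum(w * b[i] for i, w in a.items() if i in b)


def bm25_direct(query, doc):
    """Textbook BM25 sum over query terms, for comparison."""
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return sum(
        idf(t) * tf[t] * (K1 + 1) / (tf[t] + norm)
        for t in set(query)
        if t in tf
    )


query = ["cat", "bird"]  # "bird" is not in the vocabulary
for d in docs:
    print(d, dot(query_vector(query), doc_vector(d)), bm25_direct(query, d))
```

Both columns printed for each document agree up to floating-point rounding, and every index in either vector is smaller than `len(vocab)`, which plays the role of the dimension after the `/`.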

jwnz · 2024-10-15T16:08:07Z

@tucnak Would you be able to share your SQL? The dimensions come from the number of unique tokens present in the column used to build the BM25 matrix.
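
A rough Python sketch of that idea, for illustration only: the vocabulary is the set of unique tokens seen in the column, its size is the dimension, and text whose tokens are not in the vocabulary maps to an empty vector. The `tokenize` function here is a deliberately simple stand-in that only recognizes ASCII words. It is an assumption made for the example, not the tokenizer the extension uses, and the real cause of the empty Ukrainian vectors may differ.

```python
import re


def tokenize(text):
    # Hypothetical stand-in tokenizer: lowercase ASCII words only.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(passages):
    """Assign each unique token the next free index."""
    vocab = {}
    for passage in passages:
        for token in tokenize(passage):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse(text, vocab):
    """Return (index -> count, dimension); unknown tokens are skipped."""
    counts = {}
    for token in tokenize(text):
        if token in vocab:
            idx = vocab[token]
            counts[idx] = counts.get(idx, 0) + 1
    return counts, len(vocab)


passages = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(passages)

print(len(vocab))                       # 6 unique tokens, so dimension 6
print(to_sparse("the cat", vocab))      # ({0: 1, 1: 1}, 6)
print(to_sparse("кіт сидить", vocab))   # ({}, 6): no known tokens
```

With a tokenizer that covers the language of the documents, the last call would yield non-empty entries, provided those tokens were present when the vocabulary was built.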

VoVAllen · 2024-10-15T17:15:48Z

@tucnak Can you try one of the newly added tokenizers, such as tiktoken's o200k_base? It should be multilingual.

VoVAllen · 2024-10-15T17:17:22Z

SELECT bm25_create('documents', 'passage', 'documents_passage_bm25', 'tiktoken', 'o200k_base', 0.75, 1.2);

gaocegege added the "question" label (Further information is requested) · Oct 15, 2024
