-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce derived vector source via stored fields #2449
Introduce derived vector source via stored fields #2449
Conversation
Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Work is still WIP but rootobject is working okay. Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
c61bd14
to
c7db83e
Compare
Signed-off-by: John Mazanec <[email protected]>
c7db83e
to
927b6de
Compare
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took the first Pass , A really good shoutout to
- Clean Interfaces .
- Good use of abstraction on Visitor and clean injection.
3.Code has good modularity . - Function interface and supplier.
@Override | ||
public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext) | ||
throws IOException { | ||
List<FieldInfo> derivedVectorFields = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we short circuit the code to return early in case of setting is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the reader format, due to SPI, we do not have access to KNNSettings. So we cant check if the setting is set or not. I add a shortcircuit if no fields have the attribute: https://github.com/opensearch-project/k-NN/pull/2449/files#diff-f8a9ebad33a21a479b30eb0dfa0bcc6aa7ddfcb6c464eca0371b60d3c3a38e77R49
@Override | ||
public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo segmentInfo, FieldInfos fieldInfos, IOContext ioContext) | ||
throws IOException { | ||
List<FieldInfo> derivedVectorFields = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we lazy create derivedVectorFields else we will have empty list of array even though setting is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure I can add that in the below for loop.
private boolean shouldInject = true; | ||
|
||
@Override | ||
public void document(int docId, StoredFieldVisitor storedFieldVisitor) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The document method directly couples vector injection with DerivedSourceVectorInjector. This makes it harder to extend or modify the injection logic. Could we abstracts or implement some loose coupling here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand completely - we do need to create a custom stored fields visitor in order to get access to the source, so I needed to pass it in there. I could add the logics around the fieldsvisitor casting into the DerivedSourceStoredFieldVisitor.
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the reader
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java
Outdated
Show resolved
Hide resolved
|
||
@Override | ||
public StoredFieldsReader clone() { | ||
return new DerivedSourceStoredFieldsReader(delegate.clone(), derivedSourceVectorInjector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will clone affect the refcounts for delegate in any way? Are we sure it will be closed when its supposed to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me double check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is supposed to heavily delegate. So, when a certain method is called, we want to call the delegate's method. Hence, I believe this is correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsReader.java
Show resolved
Hide resolved
@Override | ||
public StoredFieldsFormat storedFieldsFormat() { | ||
DerivedSourceReadersSupplier derivedSourceReadersSupplier = new DerivedSourceReadersSupplier( | ||
(segmentReadState) -> knnVectorsFormat().fieldsReader(segmentReadState), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: If this is for Lucene engine, you will also open index inputs for graph files, might be a bit unnecessary. you might be able to do
() -> new Lucene99FlatVectorsFormat(FlatVectorScorerUtil.getLucene99FlatVectorsScorer()).fieldsReader
The only caveat is that will changing lucene versions it might be hard to keep a trace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be any penalty incurred for opening indexInput but not reading from it? My hesitancy to use Lucene99FlatVectorsFormat is that I worry we dont explicitly mention thats how we are storing vectors. So Im leaning towards leaving how it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this change and ended up getting an error - so might hold off:
[2025-01-28T20:09:15,944][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-0] [original-enable-testnestedmultidocbasiczcfrqz][0] marking and sending shard failed due to [shard failure, reason [refresh failed source[schedule]]]
org.apache.lucene.index.CorruptIndexException: Problem reading index from store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)) (resource=store(ByteSizeCachingDirectory(HybridDirectory@/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/GfG-oBsOSoy75aVhmKoErg/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@47495223)))
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:165) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:179) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:221) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:545) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:607) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1865) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1844) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4705) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1300) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1444) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159) ~[opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:955) [opensearch-2.19.0-SNAPSHOT.jar:2.19.0-SNAPSHOT]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.io.FileNotFoundException: No sub-file with id .vemf found in compound file "_0.cfs" (fileName=_0.vemf files: [_Lucene912_0.tmd, _Lucene912_0.psm, .fnm, _Lucene90_0.dvd, .kdd, _Lucene912_0.tip, .kdm, _Lucene90_0.dvm, _Lucene912_0.tim, _Lucene912_0.doc, .kdi, _NativeEngines990KnnVectorsFormat_0.vec, .fdm, .fdx, _NativeEngines990KnnVectorsFormat_0.vemf, .fdt])
at org.apache.lucene.codecs.lucene90.Lucene90CompoundReader.openInput(Lucene90CompoundReader.java:170) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.opensearch.knn.index.codec.KNN80Codec.KNN80CompoundDirectory.openInput(KNN80CompoundDirectory.java:50) ~[?:?]
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:156) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.readMetadata(Lucene99FlatVectorsReader.java:90) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsReader.<init>(Lucene99FlatVectorsReader.java:65) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.apache.lucene.codecs.lucene99.Lucene99FlatVectorsFormat.fieldsReader(Lucene99FlatVectorsFormat.java:95) ~[lucene-core-9.12.1.jar:9.12.1 7a97a05a239d6fb9f1f347aa09bfa52e875be092 - 2024-12-09 16:47:48]
at org.opensearch.knn.index.codec.KNN9120Codec.KNN9120Codec.lambda$storedFieldsFormat$0(KNN9120Codec.java:74) ~[?:?]
at org.opensearch.knn.index.codec.derivedsource.DerivedSourceReadersSupplier.getReaders(DerivedSourceReadersSupplier.java:36) ~[?:?]
at org.opensearch.knn.index.codec.derivedsource.DerivedSourceVectorInjector.<init>(DerivedSourceVectorInjector.java:54) ~[?:?]
at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.createDerivedSourceVectorInjector(DerivedSourceStoredFieldsReader.java:63) ~[?:?]
at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:59) ~[?:?]
at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsReader.<init>(DerivedSourceStoredFieldsReader.java:44) ~[?:?]
at org.opensearch.knn.index.codec.KNN9120Codec.DerivedSourceStoredFieldsFormat.fieldsReader(DerivedSourceStoredFieldsFormat.java:57) ~[?:?]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, Thanks for trying. Maybe added a TODO for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be any penalty incurred for opening indexInput but not reading from it?
If its already open, then no since its mmapped. But otherwise it will. Was just trying to avoid it
Signed-off-by: John Mazanec <[email protected]>
a6fc86e
to
74387f9
Compare
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
6e104f2
to
399d281
Compare
public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException { | ||
KNNVectorValues<?> vectorValues = vectorValuesSupplier.get(); | ||
if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) { | ||
sourceAsMap.put(fieldInfo.name, vectorValues.getVector()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need conditional clone vector here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in general, I shouldnt need to clone the vector because it get serialized (and copied) later. Im going to remove it from above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@navneet1v can you double check my reasoning here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shatejas and @jmazanec15 yes we will need a conditional clone here for the vector because vector values will give vector with same reference always. and For a Map<String, Object> sourceAsMap
the map will just store the reference of the vector and won't do any serialization during map insertions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @navneet1v - need to clone
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsFormat.java
Show resolved
Hide resolved
@@ -72,6 +74,37 @@ public static <T> KNNVectorValues<T> getVectorValues(final FieldInfo fieldInfo, | |||
return getVectorValues(FieldInfoExtractor.extractVectorDataType(fieldInfo), vectorValuesIterator); | |||
} | |||
|
|||
/** | |||
* Returns a {@link KNNVectorValues} for the given {@link FieldInfo} and {@link LeafReader} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit-pick] the java doc is not correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
*/ | ||
public static <T> KNNVectorValues<T> getVectorValues( | ||
final FieldInfo fieldInfo, | ||
final DocValuesProducer docValuesProducer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need docValuesProducer here? Since this feature will be applicable on newer indexes right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we completely get rid of ability to use docValuesProducer for vectors? I thought it could still be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the DV is used for older indices and this setting is never going to be enabled for new indices so we don't need this.
@@ -117,6 +128,12 @@ private LuceneFieldMapper( | |||
this.vectorFieldType = null; | |||
} | |||
|
|||
if (isDerivedSourceEnabled) { | |||
this.fieldType = new FieldType(this.fieldType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need a copying of fieldType for LuceneFieldMapper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It gets frozen in Lucene -
return KnnByteVectorField.createFieldType(dimension / Byte.SIZE, VectorSimilarityFunction.EUCLIDEAN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah my bad. For lucene we freeze the field early.
@@ -363,7 +370,8 @@ public Mapper.Builder<?> parse(String name, Map<String, Object> node, ParserCont | |||
modelDaoSupplier.get(), | |||
parserContext.indexVersionCreated(), | |||
null, | |||
null | |||
null, | |||
KNNSettings.isKNNDerivedSourceEnabled(parserContext.getSettings()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question]
should enable this check for specific versions or for all the older versions this setting will be false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will be false for all older versions. Will validate manually by de-registering the setting, creating an index, and then checking this value.
public void inject(int docId, Map<String, Object> sourceAsMap) throws IOException { | ||
KNNVectorValues<?> vectorValues = vectorValuesSupplier.get(); | ||
if (vectorValues.docId() == docId || vectorValues.advance(docId) == docId) { | ||
sourceAsMap.put(fieldInfo.name, vectorValues.getVector()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shatejas and @jmazanec15 yes we will need a conditional clone here for the vector because vector values will give vector with same reference always. and For a Map<String, Object> sourceAsMap
the map will just store the reference of the vector and won't do any serialization during map insertions
/** | ||
* Factory for creating {@link PerFieldDerivedVectorInjector} instances. | ||
*/ | ||
public class PerFieldDerivedVectorInjectorFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make this as package private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will change
*/ | ||
@Override | ||
public StoredFieldsReader getMergeInstance() { | ||
try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just give the this
instance why we are creating a new instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that for merging, the reader is used. So, if we use this, it will add the vectors back into source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add this as a java doc here. This seems like a good case where someone needs to know why we need to create merge instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think covered in abvove comment
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsWriter.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/KNN9120Codec/DerivedSourceStoredFieldsWriter.java
Show resolved
Hide resolved
src/main/java/org/opensearch/knn/index/codec/derivedsource/DerivedSourceVectorInjector.java
Show resolved
Hide resolved
...n/java/org/opensearch/knn/index/codec/derivedsource/NestedPerFieldDerivedVectorInjector.java
Show resolved
Hide resolved
if (fieldInfo.hasNorms() && derivedSourceReaders.getNormsProducer() != null) { // the field indexes norms | ||
iterator = derivedSourceReaders.getNormsProducer().getNorms(fieldInfo); | ||
} else if (fieldInfo.getVectorDimension() != 0 && derivedSourceReaders.getKnnVectorsReader() != null) { // the field indexes vectors | ||
switch (fieldInfo.getVectorEncoding()) { | ||
case FLOAT32: | ||
iterator = derivedSourceReaders.getKnnVectorsReader().getFloatVectorValues(fieldInfo.name); | ||
break; | ||
case BYTE: | ||
iterator = derivedSourceReaders.getKnnVectorsReader().getByteVectorValues(fieldInfo.name); | ||
break; | ||
} | ||
} else if (fieldInfo.getDocValuesType() != DocValuesType.NONE && derivedSourceReaders.getDocValuesProducer() != null) { // the field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we would have 3 types of field indexes? when this is just vector field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessarily the vector field. This is for a field in one child before adding back the vector. We are trying to figure out which docId that child maps to. To do it, I get the first field on/after the offset that contains that field.
So, this is basically field exists - https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FieldExistsQuery.java
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
Signed-off-by: John Mazanec <[email protected]>
|
||
public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_KEY = "knn-derived-source-enabled"; | ||
public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_TRUE_VALUE = "true"; | ||
public static final String DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE = "false"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit-pick] this constant is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VijayanB mentioned this. Will remove in another review
); | ||
// setting it explicitly false here to ensure that when flatmapper is used Lucene based Vector field is not created. | ||
this.useLuceneBasedVectorField = false; | ||
this.perDimensionValidator = selectPerDimensionValidator(vectorDataType); | ||
this.fieldType = new FieldType(KNNVectorFieldMapper.Defaults.FIELD_TYPE); | ||
this.fieldType.setDocValuesType(DocValuesType.BINARY); | ||
if (isDerivedSourceEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we adding only if it is true? is that intentional? if so why do we have DERIVED_VECTOR_FIELD_ATTRIBUTE_FALSE_VALUE ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Im going to remove FALSE in another PR.
} else if (docValuesProducer != null) { | ||
docIdSetIterator = docValuesProducer.getBinary(fieldInfo); | ||
} else { | ||
throw new IllegalArgumentException("Field does not have vector values and DocValues"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Field should have either vector values or DocValues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update in another PR
Looks good overall |
// Setup the iterator. Return if no parent | ||
String childFieldName = ParentChildHelper.getChildField(childFieldInfo.name); | ||
String parentFieldName = ParentChildHelper.getParentField(childFieldInfo.name); | ||
if (parentFieldName == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: out of curiosity, is this just being defensive or are we handling a case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
being defensive
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-main main
# Navigate to the new working tree
cd .worktrees/backport-main
# Create a new branch
git switch --create backport/backport-2449-to-main
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 59b8e6bb24124c4f624e4711cf11f89948ffc594
# Push it to GitHub
git push --set-upstream origin backport/backport-2449-to-main
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-main Then, create a pull request where the |
Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Currently, feature is experimental and behind a feature flag via index setting. In the future, we need to iterate to improve performance and stability for nested/object portions. Signed-off-by: John Mazanec <[email protected]> (cherry picked from commit 59b8e6b)
…#2449) Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Currently, feature is experimental and behind a feature flag via index setting. In the future, we need to iterate to improve performance and stability for nested/object portions. Signed-off-by: John Mazanec <[email protected]>
Generates the vector source in the source field from the KnnVectorsFormat or BVD. It does this by adding StoredFieldsFormat to our existing custom codec. Currently, feature is experimental and behind a feature flag via index setting. In the future, we need to iterate to improve performance and stability for nested/object portions. Signed-off-by: John Mazanec <[email protected]>
Description
This PR introduces derived source for flat vector field mapper via the approach outlined in #2377.
First, for some quick background for reviewers: In OpenSearch, the source refers to a per-document StoredField (key = "_source") that stores the json representation of the document. This field is not searchable - instead, it used during the fetch phase in order to return fields of documents that matched the search request to end users. For instance, a user may search an index using a knn search over a field called passage_embedding, but they really only want to get back the field passage_text. During fetch, the source field of the k-Nearest Neighbors would be fetched and the passage_text field parsed out and sent back. In addition to this, _source may used to reindex documents and or update and delete by queries.
The goal of this PR/feature is to remove the vectors from this _source field on disk, but inject them back from other data formats into the _source when needed.
To implement this, from a high level, we introduce a custom StoredFieldsFormat. On write, this StoredFieldsFormat removes the vector fields from source. On read, it puts them back if necessary.
In its current state, this PR supports:
Here are the following todo items:
Some leftovers after this PR. Leaving out now because feature is experimental for 2.19 and properly isolated behind index setting
Related Issues
#2377
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.