Added more detailed error messages for KNN model training #2378

anntians · 2025-01-09T18:16:27Z

Description

A consistent piece of feedback we get about PQ and IVF is that there is limited visibility into failure cases. Part of the reason is that the errors are thrown on the Faiss side and we don't return stack traces in the REST response, which makes PQ and IVF difficult to use. This PR improves the error messages by adding explicit checks for the most common errors:

[ ] For PQ, explicitly check in OpenSearch for an invalid configuration where m does not divide the dimension.
[ ] For PQ/IVF, check that the number of training points meets the minimum clustering criteria defined in Faiss.
[ ] If there is not enough memory, explicitly say that there is not enough memory.

Adding these 3 checks will cover 90% of the training failures that occur.
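
For illustration, the first two checks could look roughly like the sketch below. It is not the code in this PR: the class and method names are hypothetical, and the minimum-point rule (at least one training point per centroid, i.e. nlist for IVF and 2^code_size for PQ) is an assumption about what "minimum clustering criteria" means here. The memory check is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not a class from the k-NN plugin.
final class TrainingPreChecks {
    private TrainingPreChecks() {}

    /** PQ splits each vector into m equal sub-vectors, so m must divide the dimension. */
    static boolean pqMDividesDimension(int dimension, int m) {
        return m > 0 && dimension % m == 0;
    }

    /**
     * Assumed rule: clustering needs at least one training point per centroid.
     * IVF trains nlist centroids; PQ trains 2^codeSize centroids per sub-quantizer.
     */
    static long minTrainingPoints(int nlist, int codeSize) {
        return Math.max((long) nlist, 1L << codeSize);
    }

    /** Returns one message per failed check; an empty list means the request passed. */
    static List<String> validate(int dimension, int m, int nlist, int codeSize, long trainingPoints) {
        List<String> errors = new ArrayList<>();
        if (!pqMDividesDimension(dimension, m)) {
            errors.add(String.format(Locale.ROOT, "Dimension %d is not divisible by PQ m=%d", dimension, m));
        }
        long min = minTrainingPoints(nlist, codeSize);
        if (trainingPoints < min) {
            errors.add(String.format(Locale.ROOT, "Number of training points (%d) should be at least %d", trainingPoints, min));
        }
        return errors;
    }
}
```

Returning a list of messages rather than throwing on the first failure lets a single response report every problem with the training request at once.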

Related Issues

Resolves #2268

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

martin-gaievski · 2025-01-09T18:19:36Z

src/main/java/org/opensearch/knn/plugin/transport/TrainingJobRouterTransportAction.java

@@ -134,6 +138,30 @@ protected void getTrainingIndexSizeInKB(TrainingModelRequest trainingModelReques
                trainingVectors = trainingModelRequest.getMaximumVectorCount();
            }

+            long minTrainingVectorCount = 1000;


can you make the 1000 value a class level constant?

martin-gaievski · 2025-01-09T18:19:56Z

src/main/java/org/opensearch/knn/plugin/transport/TrainingJobRouterTransportAction.java

+
+            if (trainingVectors < minTrainingVectorCount) {
+                ValidationException exception = new ValidationException();
+                exception.addValidationError("Number of training points should be greater than " + minTrainingVectorCount);


Use String.format for concatenation
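
Taken together, the two suggestions above (class-level constant, String.format) might be applied along these lines. This is a standalone sketch: the class name is hypothetical, and it returns an Optional instead of populating a ValidationException so that it compiles without OpenSearch on the classpath.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in for the check in TrainingJobRouterTransportAction.
final class MinTrainingVectorCheck {
    // Class-level constant instead of a local literal; 1000 is the value from the diff.
    static final long MIN_TRAINING_VECTOR_COUNT = 1000;

    private MinTrainingVectorCheck() {}

    /** Returns the validation message when there are too few training vectors. */
    static Optional<String> validate(long trainingVectors) {
        if (trainingVectors < MIN_TRAINING_VECTOR_COUNT) {
            return Optional.of(
                String.format(Locale.ROOT, "Number of training points should be greater than %d", MIN_TRAINING_VECTOR_COUNT)
            );
        }
        return Optional.empty();
    }
}
```

Note that with a `<` comparison exactly 1000 vectors pass the check, so "at least" may describe the behavior more precisely than "greater than".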

martin-gaievski · 2025-01-09T19:08:28Z

src/main/java/org/opensearch/knn/plugin/transport/TrainingModelRequest.java

+        if (knnMethodContext.getMethodComponentContext().getParameters().containsKey(ENCODER_PARAMETER_PQ_M)
+            && knnMethodConfigContext.getDimension() % (Integer) knnMethodContext.getMethodComponentContext()
+                .getParameters()
+                .get(ENCODER_PARAMETER_PQ_M) != 0) {


I'm not sure whether that parameter is always present, but if it's optional, this line can throw a runtime exception when the parameter is absent.

I believe Java has short-circuit evaluation, so if containsKey(ENCODER_PARAMETER_PQ_M) returns false then the second expression will not be evaluated. So a runtime exception shouldn't be thrown.
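
A small standalone sketch of the short-circuit behavior being discussed. The parameter map and the constant value "m" are assumptions made for the example; the condition mirrors the shape of the one in the diff.

```java
import java.util.Map;

final class ShortCircuitDemo {
    // Assumed value of the constant, for illustration only.
    static final String ENCODER_PARAMETER_PQ_M = "m";

    private ShortCircuitDemo() {}

    /** True only when m is present and does not divide the dimension. */
    static boolean mDoesNotDivideDimension(Map<String, Object> parameters, int dimension) {
        // && evaluates its right-hand side only if the left-hand side is true,
        // so get(...) is never unboxed when the key is missing.
        return parameters.containsKey(ENCODER_PARAMETER_PQ_M)
            && dimension % (Integer) parameters.get(ENCODER_PARAMETER_PQ_M) != 0;
    }
}
```

The guard covers only a missing key. A key mapped to null, a value that is not an Integer, or m = 0 would still raise a runtime exception (NullPointerException, ClassCastException, and ArithmeticException respectively).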

jmazanec15

Thanks @anntians. I think this is a good first step. What we should do next is move all of the specific checks around the parameters behind the engine method abstraction. See https://github.com/opensearch-project/k-NN/tree/main/src/main/java/org/opensearch/knn/index/engine.

Here is my idea for it: add a method to KNNLibraryIndexingContext called something like getTrainingConfigValidationSetup() that returns a function that takes as input the number of training vectors (or a more general object) and performs some kind of validation.

Then, we can use this to hide the method-specific validations inside the engine abstraction, which will be clean and maintainable. For instance, we can implement checks for IVFPQ in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/FaissIVFPQEncoder.java, etc.
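
A minimal sketch of the shape this proposal describes, assuming the simplest possible signature (vector count in, error message or null out). The interface here is a simplified stand-in for KNNLibraryIndexingContext, and the IVF rule is an assumed example, not the plugin's actual validation.

```java
import java.util.function.Function;

final class ValidationSetupSketch {
    private ValidationSetupSketch() {}

    /** Simplified stand-in for KNNLibraryIndexingContext; only the proposed method is modeled. */
    interface IndexingContext {
        // Maps the number of training vectors to an error message, or null when valid.
        Function<Long, String> getTrainingConfigValidationSetup();
    }

    /** Hypothetical IVF-backed context: nlist never leaves this factory method. */
    static IndexingContext ivfContext(int nlist) {
        return () -> vectorCount -> vectorCount < nlist
            ? "Number of training points should be at least nlist=" + nlist
            : null;
    }
}
```

A caller such as the training request only invokes `getTrainingConfigValidationSetup().apply(count)` and never needs to know which parameters the method uses.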

jmazanec15 · 2025-01-14T19:29:58Z

src/main/java/org/opensearch/knn/plugin/transport/TrainingModelRequest.java

@@ -283,6 +285,15 @@ public ActionRequestValidationException validate() {
            exception.addValidationError("Description exceeds limit of " + KNNConstants.MAX_MODEL_DESCRIPTION_LENGTH + " characters");
        }

+        // Check if ENCODER_PARAMETER_PQ_M is divisible by vector dimension
+        if (knnMethodContext.getMethodComponentContext().getParameters().containsKey(ENCODER_PARAMETER_PQ_M)


You can remove these checks here now, correct?

jmazanec15 · 2025-01-14T19:31:29Z

src/main/java/org/opensearch/knn/index/engine/KNNLibraryIndexingContextImpl.java

@@ -52,4 +57,33 @@ public PerDimensionValidator getPerDimensionValidator() {
    public PerDimensionProcessor getPerDimensionProcessor() {
        return perDimensionProcessor;
    }
+
+    @Override
+    public BiFunction<Long, KNNMethodContext, TrainingConfigValidationOutput> getTrainingConfigValidationSetup() {


This is the right direction. However, the function that is returned should be passed in via the builder and be KNN method specific. See https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/AbstractFaissMethod.java#L68-L90 as an example.

The main purpose of this is to not expose parameters outside of the method class unless explicitly necessary.
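
One way to read this suggestion, sketched with hypothetical, simplified types: the context implementation only stores whatever function the builder was given, and the method class is the only place that sees its own parameters.

```java
import java.util.function.Function;

// Hypothetical, simplified stand-in for KNNLibraryIndexingContextImpl.
final class IndexingContextSketch {
    private final Function<Long, String> trainingConfigValidationSetup;

    private IndexingContextSketch(Builder builder) {
        this.trainingConfigValidationSetup = builder.trainingConfigValidationSetup;
    }

    Function<Long, String> getTrainingConfigValidationSetup() {
        return trainingConfigValidationSetup;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        // Default accepts everything, for methods that need no training validation.
        private Function<Long, String> trainingConfigValidationSetup = count -> null;

        Builder trainingConfigValidationSetup(Function<Long, String> validation) {
            this.trainingConfigValidationSetup = validation;
            return this;
        }

        IndexingContextSketch build() {
            return new IndexingContextSketch(this);
        }
    }
}

// Hypothetical method class: it alone knows about nlist and supplies the check.
final class IvfMethodSketch {
    private IvfMethodSketch() {}

    static IndexingContextSketch buildContext(int nlist) {
        return IndexingContextSketch.builder()
            .trainingConfigValidationSetup(count -> count < nlist ? "Need at least " + nlist + " training points" : null)
            .build();
    }
}
```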

jmazanec15 · 2025-01-24T16:17:33Z

src/main/java/org/opensearch/knn/index/engine/AbstractKNNMethod.java

@@ -108,6 +114,55 @@ protected PerDimensionProcessor doGetPerDimensionProcessor(
        return PerDimensionProcessor.NOOP_PROCESSOR;
    }

+    protected Function<TrainingConfigValidationInput, TrainingConfigValidationOutput> doGetTrainingConfigValidationSetup() {


This is in the right direction, but ideally we want this to be handled per algo type. For instance, pq validation should happen in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/AbstractFaissPQEncoder.java. IVF validation should happen in https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/engine/faiss/FaissIVFMethod.java, etc.

If you think this change wouldn't be possible to make in time for code freeze (Monday), I think we shouldn't block this PR. However, this last refactor should be taken up as a follow-up.

Got it, will merge this PR and create a separate PR for followup. Thanks
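
For the follow-up, per-algo validation could be composed roughly as sketched below. All names are hypothetical and the Input class is a trimmed-down stand-in for TrainingConfigValidationInput; the centroid-count rules are assumptions, not confirmed Faiss criteria.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class PerAlgoValidationSketch {
    private PerAlgoValidationSketch() {}

    /** Hypothetical, trimmed-down stand-in for TrainingConfigValidationInput. */
    static final class Input {
        final long vectorCount;
        final int dimension;

        Input(long vectorCount, int dimension) {
            this.vectorCount = vectorCount;
            this.dimension = dimension;
        }
    }

    /** Checks that would live with the PQ encoder; assumes m > 0. */
    static Function<Input, List<String>> pqEncoderValidator(int m, int codeSize) {
        return in -> {
            List<String> errors = new ArrayList<>();
            if (in.dimension % m != 0) {
                errors.add("Dimension " + in.dimension + " is not divisible by m=" + m);
            }
            long centroids = 1L << codeSize; // assumed minimum: one point per centroid
            if (in.vectorCount < centroids) {
                errors.add("PQ needs at least " + centroids + " training points");
            }
            return errors;
        };
    }

    /** Check that would live with the IVF method. */
    static Function<Input, List<String>> ivfMethodValidator(int nlist) {
        return in -> {
            List<String> errors = new ArrayList<>();
            if (in.vectorCount < nlist) {
                errors.add("IVF needs at least " + nlist + " training points");
            }
            return errors;
        };
    }

    /** A method that wraps an encoder runs its own check, then the encoder's. */
    static Function<Input, List<String>> combine(
        Function<Input, List<String>> first,
        Function<Input, List<String>> second
    ) {
        return in -> {
            List<String> errors = new ArrayList<>(first.apply(in));
            errors.addAll(second.apply(in));
            return errors;
        };
    }
}
```

With this layout, an IVF method using a PQ encoder would expose `combine(ivfMethodValidator(nlist), pqEncoderValidator(m, codeSize))`, and neither component needs to read the other's parameters.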

opensearch-trigger-bot · 2025-01-24T21:55:39Z

The backport to main failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-main main
# Navigate to the new working tree
cd .worktrees/backport-main
# Create a new branch
git switch --create backport/backport-2378-to-main
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 4058a53c53d6d4ec397b8bd8b6c079f1b584dfc0
# Push it to GitHub
git push --set-upstream origin backport/backport-2378-to-main
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-main

Then, create a pull request where the base branch is main and the compare/head branch is backport/backport-2378-to-main.

…-project#2378)

* Add more detailed error messages for KNN model training
  Signed-off-by: AnnTian Shao <[email protected]>
* Add validation check for training parameters in engine method abstraction
  Signed-off-by: AnnTian Shao <[email protected]>
* Fixes for bwc and IT tests
  Signed-off-by: AnnTian Shao <[email protected]>

---------

Signed-off-by: AnnTian Shao <[email protected]>
Co-authored-by: AnnTian Shao <[email protected]>
(cherry picked from commit 4058a53)

anntians requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan, luyuncheng, shatejas and 0ctopus13prime as code owners January 9, 2025 18:16

anntians force-pushed the errorMessages branch from a9c0dbd to ef0e7e5 Compare January 9, 2025 18:18

martin-gaievski reviewed Jan 9, 2025

jmazanec15 reviewed Jan 10, 2025

jmazanec15 reviewed Jan 14, 2025

anntians force-pushed the errorMessages branch 3 times, most recently from e61b5c7 to 1da10cf Compare January 23, 2025 07:24

anntians changed the base branch from main to 2.x January 23, 2025 22:35

anntians changed the base branch from 2.x to main January 23, 2025 23:27

AnnTian Shao added 2 commits January 23, 2025 16:16

Add more detailed error messages for KNN model training

585b373

Signed-off-by: AnnTian Shao <[email protected]>

Add validation check for training parameters in engine method abstrac…

e9d4807

…tion Signed-off-by: AnnTian Shao <[email protected]>

anntians force-pushed the errorMessages branch from 1da10cf to bee8983 Compare January 24, 2025 00:32

anntians changed the base branch from main to 2.x January 24, 2025 00:32

anntians force-pushed the errorMessages branch from bee8983 to cd1b553 Compare January 24, 2025 01:00

Fixes for bwc and IT tests

3b87975

Signed-off-by: AnnTian Shao <[email protected]>

anntians force-pushed the errorMessages branch from cd1b553 to 3b87975 Compare January 24, 2025 02:16

jmazanec15 approved these changes Jan 24, 2025

jmazanec15 added the backport main label Jan 24, 2025

navneet1v approved these changes Jan 24, 2025

navneet1v merged commit 4058a53 into opensearch-project:2.x Jan 24, 2025
102 checks passed

anntians mentioned this pull request Jan 29, 2025

Update training validation to be handled per algo type #2462

Open

5 tasks
