Retrieve remote model id from registration response in IT to avoid flaky #3244

zane-neo · 2024-11-29T06:19:11Z

Description

Retrieve the remote model id from the registration response in the integration tests (ITs) to avoid flakiness. An example of the flaky failure:

RestBedRockInferenceIT > test_bedrock_multimodal_model_empty_imageInput_null_textInput FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:45675/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}
        at __randomizedtesting.SeedInfo.seed([7D8B7621F7A9A3F2:CB22567814C8B651]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:501)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:359)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:182)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:155)
        at app//org.opensearch.ml.utils.TestHelper.makeRequest(TestHelper.java:144)
        at app//org.opensearch.ml.rest.RestMLRemoteInferenceIT.deployRemoteModel(RestMLRemoteInferenceIT.java:1220)
        at app//org.opensearch.ml.rest.MLCommonsRestTestCase.registerRemoteModel(MLCommonsRestTestCase.java:1011)
        at app//org.opensearch.ml.rest.RestBedRockInferenceIT.test_bedrock_multimodal_model_empty_imageInput_null_textInput(RestBedRockInferenceIT.java:216)
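
The `/_plugins/_ml/models/null/_deploy` URI in the trace suggests that the model id was still null when the deploy request was built. A minimal sketch of the idea in the title, assuming the registration response has already been parsed into a `Map` and carries a `model_id` field; the class and method names here are hypothetical and are not taken from the ml-commons test code:

```java
import java.util.Map;

// Hypothetical helper, for illustration only; not the actual MLCommonsRestTestCase code.
final class RegisterResponseExample {

    /**
     * Reads the model id straight from an already-parsed registration response.
     * Assumption: the response contains a "model_id" string field.
     * Fails fast so that "null" never ends up in a later request URI.
     */
    static String modelIdFrom(Map<String, Object> registerResponse) {
        Object modelId = registerResponse.get("model_id");
        if (!(modelId instanceof String) || ((String) modelId).isEmpty()) {
            throw new IllegalStateException("registration response has no model_id: " + registerResponse);
        }
        return (String) modelId;
    }

    /** Builds the deploy URI seen in the stack trace for a known model id. */
    static String deployUri(String modelId) {
        return "/_plugins/_ml/models/" + modelId + "/_deploy";
    }
}
```

Failing at the point where the id is read makes the test report a missing `model_id` directly, instead of a 404 on a later deploy call.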

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: zane-neo <[email protected]>

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

…aky (#3244) (#3249) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671) Co-authored-by: zane-neo <[email protected]>

Following opensearch-project#3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

Signed-off-by: Brian Flores <[email protected]>

…MTest (#3253)

* Fix unneeded call to the task API to get model_id within RestMLGuardrailsIT

  Following #3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

  Signed-off-by: Brian Flores <[email protected]>

* Fix ToolIntegrationWithLLMTest model undeploy race condition

  Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

  Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
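
The undeploy wait described in this commit message can be sketched as a bounded polling loop. This is an illustration under stated assumptions, not the actual `checkForModelUndeployedStatus` implementation: the class name, the `Supplier` that reports the model state, and the `"UNDEPLOYED"` state string are all assumed for the example.

```java
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not the actual ToolIntegrationWithLLMTest code.
final class UndeployWaitExample {

    /**
     * Polls the model state up to maxRetries times, sleeping delayMillis after each
     * unsuccessful check. Returns once the state equals "UNDEPLOYED" (assumed name);
     * throws if the retries are exhausted, which points to a deeper problem.
     */
    static void waitUntilUndeployed(Supplier<String> modelState, int maxRetries, long delayMillis) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if ("UNDEPLOYED".equals(modelState.get())) {
                return; // safe to delete the model now
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while waiting for undeploy", e);
            }
        }
        throw new IllegalStateException("model not undeployed after " + maxRetries + " retries");
    }
}
```

With the values from the commit message, a caller would pass `5` for `maxRetries` and `1000L` for `delayMillis`, then delete the model only after the method returns.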

…MTest (opensearch-project#3253)

* Fix unneeded call to the task API to get model_id within RestMLGuardrailsIT

  Following opensearch-project#3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

  Signed-off-by: Brian Flores <[email protected]>

* Fix ToolIntegrationWithLLMTest model undeploy race condition

  Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

  Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
(cherry picked from commit 1a659c8)

…aky (opensearch-project#3244) Signed-off-by: zane-neo <[email protected]>

…MTest (opensearch-project#3253)

* Fix unneeded call to the task API to get model_id within RestMLGuardrailsIT

  Following opensearch-project#3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

  Signed-off-by: Brian Flores <[email protected]>

* Fix ToolIntegrationWithLLMTest model undeploy race condition

  Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

  Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>

…aky (opensearch-project#3244) Signed-off-by: zane-neo <[email protected]> Signed-off-by: tkykenmt <[email protected]>

…MTest (opensearch-project#3253)

* Fix unneeded call to the task API to get model_id within RestMLGuardrailsIT

  Following opensearch-project#3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

  Signed-off-by: Brian Flores <[email protected]>

* Fix ToolIntegrationWithLLMTest model undeploy race condition

  Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

  Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>
Signed-off-by: tkykenmt <[email protected]>

…tegrationWithLLMTest (#3263)

* Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest (#3253)

  * Fix unneeded call to the task API to get model_id within RestMLGuardrailsIT

    Following #3244, this IT called the task API to check the model id again, which is redundant. Instead, the model_id can be read directly upon creating the model group. Manual testing confirmed that the behavior is intact. This should reduce the number of calls within an IT and make it less flaky.

    Signed-off-by: Brian Flores <[email protected]>

  * Fix ToolIntegrationWithLLMTest model undeploy race condition

    Previously the test class attempted to delete a model without knowing whether the model had been undeployed in time. This change adds a wait of up to 5 retries, 1 second apart, that checks the status of the model and proceeds to delete it only once it is undeployed. When the number of retries is exceeded, it throws an error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method.

    Signed-off-by: Brian Flores <[email protected]>

  ---------

  Signed-off-by: Brian Flores <[email protected]>
  (cherry picked from commit 1a659c8)

* Add retry according to how many REST clients are in an IT cluster

  Signed-off-by: Brian Flores <[email protected]>

* Fix retry initialization

  The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size.

  Signed-off-by: Brian Flores <[email protected]>

---------

Signed-off-by: Brian Flores <[email protected]>

mingshl · 2025-01-10T22:24:15Z

@pyek-bot @xinyual what versions should this fix go back to?

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

Retrieve remote model id from registration response in IT to avoid flaky

3cd13ee

Signed-off-by: zane-neo <[email protected]>

zane-neo requested review from b4sjoo, dhrubo-os, jngz-es, model-collapse, rbhavna, ylwu-amzn, Zhangxunmt, austintlee, HenryL27 and xinyual as code owners November 29, 2024 06:19

zane-neo had a problem deploying to ml-commons-cicd-env November 29, 2024 06:19 — with GitHub Actions Failure

zane-neo temporarily deployed to ml-commons-cicd-env November 29, 2024 06:19 — with GitHub Actions Inactive

zane-neo temporarily deployed to ml-commons-cicd-env November 29, 2024 07:16 — with GitHub Actions Inactive

zane-neo had a problem deploying to ml-commons-cicd-env December 2, 2024 20:48 — with GitHub Actions Failure

zane-neo temporarily deployed to ml-commons-cicd-env December 2, 2024 21:21 — with GitHub Actions Inactive

dhrubo-os approved these changes Dec 3, 2024

dhrubo-os added the backport 2.x label Dec 3, 2024

xinyual approved these changes Dec 3, 2024

xinyual merged commit 1d30671 into opensearch-project:main Dec 3, 2024
9 checks passed

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 3, 2024

Retrieve remote model id from registration response in IT to avoid fl…

5902709

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

opensearch-trigger-bot bot mentioned this pull request Dec 3, 2024

[Backport 2.x] Retrieve remote model id from registration response in IT to avoid flaky #3249

Merged

brianf-aws mentioned this pull request Dec 3, 2024

[BUG]-(flaky tests) ITs involving models have race conditions #3237

Closed

nathaliellenaa mentioned this pull request Dec 4, 2024

[BUG]-(flaky tests) RestMLInferenceSearchResponseProcessorIT Model NOT Found Exception #3228

Closed

brianf-aws mentioned this pull request Dec 5, 2024

Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3253

Merged

5 tasks

tkykenmt pushed a commit to tkykenmt/ml-commons that referenced this pull request Dec 15, 2024

Retrieve remote model id from registration response in IT to avoid fl…

f7a966d

…aky (opensearch-project#3244) Signed-off-by: zane-neo <[email protected]>

dhrubo-os added backport 2.15 backport 2.16 backport 2.17 backport 2.18 labels Jan 10, 2025

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2025

Retrieve remote model id from registration response in IT to avoid fl…

6510323

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

opensearch-trigger-bot bot mentioned this pull request Jan 10, 2025

[Backport 2.15] Retrieve remote model id from registration response in IT to avoid flaky #3369

Open

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2025

Retrieve remote model id from registration response in IT to avoid fl…

f263108

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

opensearch-trigger-bot bot mentioned this pull request Jan 10, 2025

[Backport 2.16] Retrieve remote model id from registration response in IT to avoid flaky #3370

Open

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2025

Retrieve remote model id from registration response in IT to avoid fl…

61098dd

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

opensearch-trigger-bot bot mentioned this pull request Jan 10, 2025

[Backport 2.17] Retrieve remote model id from registration response in IT to avoid flaky #3371

Open

opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 10, 2025

Retrieve remote model id from registration response in IT to avoid fl…

8595385

…aky (#3244) Signed-off-by: zane-neo <[email protected]> (cherry picked from commit 1d30671)

opensearch-trigger-bot bot mentioned this pull request Jan 10, 2025

[Backport 2.18] Retrieve remote model id from registration response in IT to avoid flaky #3372

Open

pyek-bot mentioned this pull request Jan 10, 2025

[BUG] Flaky tests: RestMLInferenceSearchRequestProcessorIT.testMLInferenceProcessorRemoteModelRewriteQueryString #3348

Closed
