-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(Backport 2.x) Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3263
(Backport 2.x) Fixes Two Flaky IT classes RestMLGuardrailsIT & ToolIntegrationWithLLMTest #3263
Conversation
But this UT already removed in PR #3159, @dhrubo-os can you check if this PR backported to 2.x ? |
Yes: https://github.com/opensearch-project/ml-commons/pull/3163/files @brianf-aws do you have all the updates from 2.x branch? |
Yes good callout I havent updated my local 2.x branch in awhile. will remedy this ASAP |
…MTest (opensearch-project#3253) * fix uneeded call to get model_id for task api within RestMLGuardrailsIT Following opensearch-project#3244 this IT called the task api to check the model id again however this is redundant. Instead one can directly pull the model_id upon creating the model group. Manual testing was done to see that the behavior is intact, this should help reduce the calls within a IT to make it less flaky Signed-off-by: Brian Flores <[email protected]> * fix ToolIntegrationWithLLMTest model undeploy race condition Previously the test class attempted to delete a model without fully knowing if the model was undeployed in time. This change adds a waiting for 5 retries each 1 second to check the status of the model and only when undeployed will it proceed to delete the model. When the number of retries are exceeded it throws a error indicating a deeper problem. Manual testing was done to check that the model is undeployed by searching for the specific model via the checkForModelUndeployedStatus method. Signed-off-by: Brian Flores <[email protected]> --------- Signed-off-by: Brian Flores <[email protected]> (cherry picked from commit 1a659c8)
7fdd0f2
to
ee4845e
Compare
I am seeing that this is failing besides with the retry; after researching I can gather the following
Here there are 6 rest clients (http://[::1]:40677, http://127.0.0.1:43017,/ http://[::1]:34869, http://127.0.0.1:38417,/ http://[::1]:37987, http://127.0.0.1:40451]) The current amount of retries is hardcoded to 5. My speculation is that the rest client that had the correct result was not hit in time. During this specific test it hits the addresses by port 37987, 38417, 34869, 43017, 40677 (5 in total before throwing the exception it took to long.) Normally when running the reproduce with ... it will launch a simple cluster from what I see is 2 rest clients. I also checked that the test class VisualizationsToolIT follows the extension flow There exists a method within OpenSearchTestCase protected final List<HttpHost> getClusterHosts() {
return clusterHosts;
} Which we can use as the amount of retires instead of a hardcoded amount of 5. This way we can account for multinode clusters of any length. To further this point On the main branch if you look at a recent build pass you will see that this specific test ran a cluster with 2 rest clients and passed
|
Signed-off-by: Brian Flores <[email protected]>
The MAX_RETRIES variable had to wait for the cluster to form before it could call to get the cluster size Signed-off-by: Brian Flores <[email protected]>
|
This backport (
2.x
) is to improve two Flaky classes (RestMLGuardrailsIT, ToolIntegrationWithLLMTest). This was a result of #3253Testing
./gradlew :opensearch-ml-plugin:compileJava
./gradlew test