[BUG] Likely race condition between model creation and deletion #2312
Comments
@dblock thanks for creating the issue. I think the k-NN plugin should throw an exception when a delete is requested while the model is still in training. I remember some work happening around this two years or so back. @naveentatikonda, if I am not wrong, you were working on something like that, though I might be completely wrong here. Can you please check whether this relates to the work you were doing earlier?
@dblock @navneet1v I think what is happening in this test case is that the delete model request was triggered as soon as the training request was submitted, so the model details had not yet been stored in the model system index. The delete model request first tries to fetch the model information from the model system index using the GET model API. Here it fails to find the model info and throws "Unable to delete model [model-1]. Model does not exist". If the delete model request is triggered a few seconds later, once the model info has been stored in the system index and the model has entered the TRAINING state, it can fetch the model info and rejects the request with a "Cannot delete model [model-1]. Model is still in training" exception. If metadata already exists for a model ID, a new train model request with the same model ID is expected to be rejected. PR - #424
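A client-side workaround consistent with the explanation above would be to poll the model until it is visible and no longer training, and only then delete it. The following is a minimal sketch, not the plugin's or the test suite's code: `delete_model_when_ready`, `get_state`, and `delete` are hypothetical names, the callables stand in for whatever issues the GET model and delete model requests, and the `"training"` state string is an assumption about what the GET model call reports.

```python
import time
from typing import Callable, Optional


def delete_model_when_ready(
    get_state: Callable[[], Optional[str]],
    delete: Callable[[], None],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Wait until the model is visible and not training, then delete it.

    get_state: hypothetical callable returning the model state, or None
               when the model is not (yet) found in the model system index.
    delete:    hypothetical callable that issues the delete model request.
    Returns the last state observed before the delete was issued.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        # Assumed state name: "training". Any other non-None state is
        # treated as terminal and therefore safe to delete.
        if state is not None and state != "training":
            delete()
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model not ready for deletion, last state: {state!r}")
        sleep(interval_s)
```

Injecting `get_state`, `delete`, and `sleep` keeps the sketch independent of any particular HTTP client and makes it easy to exercise with fakes. It only avoids the window described above; it does not address whether the server should handle that window itself.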
@naveentatikonda thanks for sharing the details. Yes, I do think it is possible that the model information is not yet stored in the index when a delete call comes in. This raises the question of whether we should make this GET model call more consistent so that it can handle these cases too, or do you think that is not needed for now? @dblock, what are your thoughts on this?
This was added to ensure that the tests are successful.
Based on the issue description, the reported errors are either This implies that even if the user waits for a long time, they are unable to create or delete One possible explanation is that the model data is inconsistently stored, either in the cluster state or the index but not in both, leading to failures during both creation and deletion.
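To make the hypothesis above concrete, here is a toy in-memory model of it. This is purely illustrative and is not how the k-NN plugin is implemented: `ToyModelStore`, its two containers, and the order of the two writes are all assumptions, and the wording of the "already exists" message is invented. Only the two delete messages are taken from this thread.

```python
class ToyModelStore:
    """Toy stand-in for two places where model metadata might live."""

    def __init__(self) -> None:
        self.cluster_state_ids = set()  # assumed: model IDs tracked in cluster state
        self.index_docs = {}            # assumed: model ID -> state in the system index

    def begin_train(self, model_id: str) -> None:
        # Creation is rejected if either store already knows the ID.
        if model_id in self.cluster_state_ids or model_id in self.index_docs:
            raise ValueError(f"Model [{model_id}] already exists")  # hypothetical wording
        self.cluster_state_ids.add(model_id)  # first write (assumed order)

    def finish_register(self, model_id: str) -> None:
        # Second write; if this never happens the two stores disagree.
        self.index_docs[model_id] = "training"

    def delete(self, model_id: str) -> None:
        # Deletion only consults the index, mirroring the GET-then-delete flow.
        state = self.index_docs.get(model_id)
        if state is None:
            raise LookupError(f"Unable to delete model [{model_id}]. Model does not exist")
        if state == "training":
            raise RuntimeError(f"Cannot delete model [{model_id}]. Model is still in training")
        del self.index_docs[model_id]
        self.cluster_state_ids.discard(model_id)
```

In this toy, calling `begin_train("model-1")` without a following `finish_register("model-1")` leaves the ID in one store only, so a second `begin_train` fails with "already exists" while `delete` fails with "does not exist", and waiting does not help. Whether the real plugin can reach such a state is exactly what the comment above is speculating about.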
What is the bug?
This comes from opensearch-project/opensearch-api-specification#708, where I added a test that creates a model by ID and then deletes it quickly, without waiting for the model to finish training, and just retries.
I ended up in a weird state where creating the model fails because the model already exists, yet deleting the model returns that it doesn't exist. There's a race condition somewhere.
What is the expected behavior?
It should not be possible to get into this state.
What is your host/environment?
2.18 docker, Mac OSX
Do you have any additional context?
Here's the log from this docker run.