Properly designate model state for actively training models when nodes crash or leave cluster #1317
Conversation
Codecov Report
Attention:
Additional details and impacted files
@@ Coverage Diff @@
## main #1317 +/- ##
============================================
- Coverage 85.15% 85.00% -0.16%
- Complexity 1216 1241 +25
============================================
Files 160 161 +1
Lines 4958 5067 +109
Branches 457 473 +16
============================================
+ Hits 4222 4307 +85
- Misses 538 555 +17
- Partials 198 205 +7
☔ View full report in Codecov by Sentry.
public void clusterChanged(ClusterChangedEvent event) {
    if (event.localNodeClusterManager()) {
        if (event.isNewCluster()) {
            // When the cluster is first created, the cluster manager will update models that are still marked as training.
In which scenario can this happen? How would there be a training job when the cluster is first created?
If the cluster crashes completely, the model will still be marked as training even though the background job isn't running.
How about the case where the index is restored?
I'm not familiar with how the restoration code works. Is it possible to overwrite system indices?
Let's skip the restoring case here to move things forward.
Please test this scenario and make sure we mark the state as FAILED.
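For reference, here is a minimal, self-contained sketch of the cluster-manager-side handling discussed in this thread. It is illustrative only: the class name and the updateModelsNewCluster / updateModelsNodesRemoved helpers are assumptions, not the PR's actual implementation.

import org.opensearch.cluster.ClusterChangedEvent;
import org.opensearch.cluster.ClusterStateListener;
import org.opensearch.cluster.node.DiscoveryNode;

// Illustrative sketch only; helper names are assumptions, not the PR's actual code.
public class TrainingStateCleanupListener implements ClusterStateListener {

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Only the elected cluster manager performs the cleanup, so the state
        // transition happens exactly once regardless of cluster size.
        if (!event.localNodeClusterManager()) {
            return;
        }
        if (event.isNewCluster()) {
            // A brand-new cluster cannot have a live training job, so any model still
            // marked TRAINING must be left over from a full-cluster crash.
            updateModelsNewCluster();
        } else if (event.nodesRemoved()) {
            // A departed node cannot finish its training job; fail the models assigned to it.
            for (DiscoveryNode removed : event.nodesDelta().removedNodes()) {
                updateModelsNodesRemoved(removed);
            }
        }
    }

    // Hypothetical stand-ins for the model index/metadata updates made by the PR.
    private void updateModelsNewCluster() { /* mark all TRAINING models as FAILED */ }

    private void updateModelsNodesRemoved(DiscoveryNode node) { /* fail models assigned to node */ }
}

The design point worth noting is the localNodeClusterManager() guard: by restricting the cleanup to the elected cluster manager, stale TRAINING states are corrected exactly once rather than by every node that observes the event.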
@heemin32 @navneet1v @ryanbogan Discussing with Ryan offline, it seems that it will be difficult for a node that drops and rejoins to properly detect that it has dropped. Therefore, I think my opinion has come back to the following: on the training node, before we serialize the model after training in the JNI completes, we just need to check whether the current state of the model (based on either the UUID or the combination of training node assignment and model name) in the metadata (or in the system index for now) is FAILED or is missing, and, if so, cancel serialization.

If a node drops and the cluster manager detects it, the cluster state (or model index) will be updated to FAILED for that model. When the node rejoins, it will get this updated cluster state and see that the model is missing or FAILED. If the node drops and the cluster manager does not detect it, it doesn't matter: there is no need to cancel the job, because the model will not be marked as FAILED and the cluster will still think that it is TRAINING.

It's not perfect for sure, but I think it's good enough for this particular use case for now. In general, the cluster may behave weirdly if nodes are going up and down anyway. As long as we can get it into a consistent state eventually, we should be okay. We can do this manually by either restarting the node, or by deleting the model that was trained during instability and asking the user to re-train with a more stable cluster state.
Good catch. I think it was either "cancel serialization on rejoin" or "cancel serialization on invalid state". Somehow we ended up using both of them, but I agree that using one of them should suffice. "Cancel serialization on invalid state" would be simpler to implement and test than "cancel serialization on rejoin".
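To make the "cancel serialization on invalid state" idea concrete, here is a self-contained sketch. The ModelStateReader and ModelWriter interfaces are hypothetical stand-ins for the plugin's model DAO, not its real API; only the TRAINING/FAILED/ZOMBIE states come from the discussion above.

// Hypothetical types standing in for the plugin's model metadata access.
enum ModelState { CREATED, TRAINING, ZOMBIE, FAILED }

interface ModelStateReader {
    // Returns the model's current state from the model system index, or null if the model is gone.
    ModelState currentState(String modelId);
}

interface ModelWriter {
    void persist(String modelId, byte[] serializedModel);
}

final class SerializationGuard {
    // Called on the training node after JNI training completes, immediately before persisting.
    void serializeIfStillTraining(String modelId, byte[] serializedModel,
                                  ModelStateReader reader, ModelWriter writer) {
        ModelState state = reader.currentState(modelId);
        if (state == null || state == ModelState.FAILED || state == ModelState.ZOMBIE) {
            // The node dropped out and rejoined: the cluster manager has already failed or
            // removed this model, so writing the stale training result would resurrect it.
            return;
        }
        // The cluster still considers the model TRAINING, so it is safe to persist the result.
        writer.persist(modelId, serializedModel);
    }
}

Because the check happens right before persistence, a node that was briefly partitioned sees the cluster manager's FAILED marking once it rejoins and simply drops its stale result; if the drop was never detected, the model is still TRAINING and serialization proceeds as usual.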
LGTM thanks! Make sure to add labels to the PR (bug fixes, v2.12.0 and backport-2.x)
LGTM. Thanks.
Manual testing was conducted using a Python script covering the following scenarios:
Single node cluster crash:
Multi-node cluster crash:
Node leaving while cluster is still running:
LGTM
The backport to 2.x failed.
To backport manually, run these commands in your terminal:
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1317-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 33da521e0f98317b4700b62807e1d21b11f54a71
# Push it to GitHub
git push --set-upstream origin backport/backport-1317-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x
Then, create a pull request where the base branch is 2.x and the compare branch is backport/backport-1317-to-2.x.
Description
There is currently a bug where models will be stuck in the TRAINING state when a node crashes or leaves the cluster. Since there is a write block on training models, they cannot be removed even though they are not actually training. This PR sets such models to their proper state (either ZOMBIE or FAILED) when a node crashes or leaves the cluster, so that the zombie models can be deleted.
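As a concrete illustration of why the stuck state matters (names below are hypothetical, not the plugin's actual delete path): model deletion is blocked while a model is TRAINING, so a crashed model that never leaves TRAINING becomes undeletable until its state is corrected.

// Minimal sketch of the kind of write block described above (hypothetical names).
enum ModelState { CREATED, TRAINING, ZOMBIE, FAILED }

final class ModelDeleteGuard {
    // A genuinely training model must not be deleted out from under its job, so deletion
    // is only allowed once the model has left the TRAINING state. After this PR, crashed
    // models move to ZOMBIE or FAILED, so this guard no longer blocks their deletion.
    static boolean canDelete(ModelState state) {
        return state != ModelState.TRAINING;
    }
}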
Issues Resolved
#837
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.