
Properly designate model state for actively training models when nodes crash or leave cluster #1317

Merged
merged 40 commits into main
Dec 12, 2023

Conversation

ryanbogan
Member

@ryanbogan ryanbogan commented Nov 20, 2023

Description

There is currently a bug where models get stuck in the TRAINING state when a node crashes or leaves the cluster. Since there is a write block on training models, they cannot be removed even though they are not actually training. This PR marks such models with their proper state (either ZOMBIE or FAILED) when a node crashes or leaves the cluster, so that zombie models can be deleted.
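The intended remapping can be sketched as a small state rule. This is a hedged illustration only: `ModelState` and `resolve_state_on_node_left` are hypothetical names, and the real PR decides between ZOMBIE and FAILED based on conditions not modeled here.

```python
from enum import Enum

class ModelState(Enum):
    TRAINING = "training"
    CREATED = "created"
    FAILED = "failed"
    ZOMBIE = "zombie"

def resolve_state_on_node_left(state: ModelState, training_node_left: bool) -> ModelState:
    # A model still marked TRAINING whose training node has left the
    # cluster is not actually training; remap it so it can be deleted.
    if state is ModelState.TRAINING and training_node_left:
        return ModelState.ZOMBIE
    return state
```

With a rule like this, a model whose node dropped mid-training no longer sits in TRAINING forever, so it is no longer protected by the training write block and can be deleted.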

Issues Resolved

#837

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Nov 20, 2023

Codecov Report

Attention: 29 lines in your changes are missing coverage. Please review.

Comparison is base (2e3ab95) 85.15% compared to head (b6b85a9) 85.00%.

Files Patch % Lines
.../knn/training/TrainingJobClusterStateListener.java 78.26% 12 Missing and 3 partials ⚠️
...org/opensearch/knn/training/TrainingJobRunner.java 22.22% 6 Missing and 1 partial ⚠️
...java/org/opensearch/knn/indices/ModelMetadata.java 88.88% 1 Missing and 3 partials ⚠️
.../main/java/org/opensearch/knn/index/IndexUtil.java 66.66% 1 Missing and 1 partial ⚠️
...plugin/transport/TrainingModelTransportAction.java 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1317      +/-   ##
============================================
- Coverage     85.15%   85.00%   -0.16%     
- Complexity     1216     1241      +25     
============================================
  Files           160      161       +1     
  Lines          4958     5067     +109     
  Branches        457      473      +16     
============================================
+ Hits           4222     4307      +85     
- Misses          538      555      +17     
- Partials        198      205       +7     


Signed-off-by: Ryan Bogan <[email protected]>
@ryanbogan ryanbogan requested a review from jmazanec15 December 6, 2023 17:10
public void clusterChanged(ClusterChangedEvent event) {
if (event.localNodeClusterManager()) {
if (event.isNewCluster()) {
// When the cluster is first created, the cluster manager will update models that are still marked as training.
Collaborator

In which scenario can this happen? How would there be a training job when the cluster is first created?

Member Author

If the cluster crashes completely, the model will still be marked as training even though the background job isn't running.

Collaborator

What about the case where the index is restored?

Member Author

I'm not familiar with how the restoration code works. Is it possible to overwrite system indices?

Collaborator

Let's skip the restoring case here to move things forward.
Please test this scenario and make sure we mark state as failed.
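The listener behavior discussed in this thread can be sketched in Python as a simplified stand-in for the Java `TrainingJobClusterStateListener` (the `models` dict and its field names are hypothetical): on a brand-new cluster, every model still marked training is stale; on a node departure, only the models assigned to the departed nodes are.

```python
def on_cluster_changed(models, is_new_cluster, removed_node_ids):
    # models: dict of model_id -> {"state": ..., "node_id": ...}.
    # Marks stale TRAINING entries as failed, mirroring the
    # cluster-manager-side handling discussed above.
    for meta in models.values():
        if meta["state"] != "training":
            continue
        if is_new_cluster or meta["node_id"] in removed_node_ids:
            meta["state"] = "failed"
    return models
```

Only the elected cluster manager should run this kind of update, which is why the real listener first checks `event.localNodeClusterManager()`.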

@jmazanec15
Copy link
Member

@heemin32 @navneet1v @ryanbogan Discussing with Ryan offline, it seems it will be difficult for a node that drops and rejoins to properly detect that it dropped.

Therefore, I think my opinion has come back to the following: on the training node, before we serialize the model after training in the JNI completes, we just need to check whether the current state of the model (based on either the UUID or the combination of training node assignment and model name) in the metadata (or in the system index for now) is FAILED or missing, and, if so, cancel serialization.

If a node drops and the cluster manager detects it, the cluster state (or model index) will be updated to FAILED for that model. When the node rejoins, it will get this updated cluster state and see that the model is missing or FAILED. If the node drops and the cluster manager does not detect it, it doesn't matter: there is no need to cancel the job, because the model will not be marked as FAILED and the cluster will still think it is TRAINING.

It's not perfect for sure, but I think it's good enough for this particular use case for now. In general, the cluster may behave strangely if nodes are going up and down anyway. As long as we can eventually get it into a consistent state, we should be okay. We can do this manually by either restarting the node, or deleting the model that was trained during instability and asking the user to re-train with a more stable cluster.
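The proposal above reduces to a guard that the training node evaluates just before serialization. Sketched here in Python; `should_serialize` and the `metadata` dict are hypothetical stand-ins for the model system index lookup.

```python
def should_serialize(model_id, metadata):
    # metadata: dict of model_id -> state string. If the entry was
    # removed or marked failed while this node was out of the cluster,
    # the freshly trained model must not be written.
    state = metadata.get(model_id)
    return state is not None and state != "failed"
```

If the node was partitioned and the cluster manager marked the model FAILED, the rejoined node sees that state and skips the write; if the drop went undetected, the state is still training and serialization proceeds, which is the desired behavior in both cases.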

@heemin32
Copy link
Collaborator

heemin32 commented Dec 7, 2023

Good catch. I think it was either "cancel serialization on rejoin" or "cancel serialization on invalid state". Somehow we ended up using both of them, but I agree that one of them should suffice. "Cancel serialization on invalid state" would be simpler to implement and test than "cancel serialization on rejoin".

jmazanec15
jmazanec15 previously approved these changes Dec 7, 2023
Member

@jmazanec15 jmazanec15 left a comment

LGTM thanks! Make sure to add labels to the PR (bug fixes, v2.12.0 and backport-2.x)

@ryanbogan ryanbogan added Bug Fixes Changes to a system or product designed to handle a programming bug/glitch backport 2.x v2.12.0 labels Dec 7, 2023
Collaborator

@heemin32 heemin32 left a comment

LGTM. Thanks.

@ryanbogan ryanbogan requested a review from jmazanec15 December 8, 2023 16:00
@ryanbogan
Copy link
Member Author

ryanbogan commented Dec 12, 2023

Manual testing was conducted using the following python script:

import random
import sys
import time
import json
from opensearchpy import OpenSearch, RequestsHttpConnection


def _get_model_body():
    return {
                "name": "hnsw",
                "engine": "faiss",
                "space_type": "l2",
                "parameters": {
                    "m": 16,
                    "ef_construction": 128,
                    "encoder": {
                        "name": "pq",
                        "parameters": {
                            "code_size": 8,
                            "m": 32
                        }
                    }
                }
            }
# def _get_model_body():
#     return {
#                 "name": "ivf",
#                 "engine": "faiss",
#                 "space_type": "l2",
#                 "parameters": {
#                     "nlist": 4096,
#                     "nprobes": 64,
#                     "encoder": {
#                         "name": "pq",
#                         "parameters": {
#                             "code_size": 8,
#                             "m": 48
#                         }
#                     }
#                 }
#             }

def _get_test_body(field_name: str, dimension: int, model_id: str):
    return {
        'mappings': {
            'properties': {
                field_name: {
                    'type': 'knn_vector',
                    'dimension': dimension,
                    'model_id': model_id
                }
            }
        },
        'settings': {
            'index': {
                'knn': True,
            },
            'number_of_shards': 200,
            'number_of_replicas': 0,
        }
    }


def _get_train_body(field_name: str, dimension: int):
    return {
      'mappings': {
        'properties': {
            field_name: {
            'dimension': dimension,
            'type': 'knn_vector'
          }
        }
      },
      'settings': {
        'index': {
          'refresh_interval': '30s',
        },
        'number_of_shards': 1,
        'number_of_replicas': 0,
      }
    }


def create_index(os_client: OpenSearch, index_name: str, field_name: str, dimension: int, model_id: str = None):
    os_client.indices.delete(index=index_name, ignore=[400, 404])
    if model_id is None:
        os_client.indices.create(index=index_name, body=_get_train_body(field_name, dimension))
        return

    os_client.indices.create(index=index_name, body=_get_test_body(field_name, dimension, model_id))


def ingest_docs(os_client: OpenSearch, index_name: str, field_name: str, dimension: int, doc_count: int):
    bulk_size = 100

    def create_header(doc_id):
        return {'index': {'_index': index_name, '_id': doc_id}}

    def _bulk_transform(partition, offset: int):
        actions = []
        _ = [
            actions.extend([create_header(_id + offset), None]) for _id in range(len(partition))
        ]
        actions[1::2] = [_build_index_doc(vec) for vec in partition]
        return actions


    def _salt_vector(vec):
        return [v + random.random() for v in vec]


    def _build_index_doc(vec):
        return {field_name: _salt_vector(vec)}

    for i in range(0, doc_count, bulk_size):
        vectors = [[random.random() for _ in range(dimension)] for _ in range(bulk_size)]
        body = _bulk_transform(vectors, i)
        os_client.bulk(index=index_name, body=body)


def train_model(os_client: OpenSearch, train_index_name: str, train_field_name: str, dimension: int, model_id: str):
    timeout = 2400
    print(_get_model_body())
    body = {
        'training_index': train_index_name,
        'training_field': train_field_name,
        'description': "blah",
        'dimension': dimension,
        'method': _get_model_body(),
    }

    method = "POST"
    model_uri = "/_plugins/_knn/models/{}".format(model_id)
    os_client.transport.perform_request(method, "{}/_train".format(model_uri), body=body)

    start_time = time.time()
    while time.time() < start_time + timeout:
        time.sleep(1)
        model_response = os_client.transport.perform_request("GET", model_uri)
        print(model_response)
        if 'state' not in model_response:
            continue

        if model_response['state'] == 'created':
            return
        print(model_response['state'])

        if model_response['state'] == 'failed':
            raise Exception("Failed to create model: {}".format(model_response))

    raise Exception('Failed to create model: {} within timeout {} seconds'
                    .format(model_id, timeout))


def search_index(os_client: OpenSearch, index_name: str, field_name: str, dimension: int, query_count: int):
    def get_body(vec):
        return {
            'size': 10,
            'query': {
                'knn': {
                    field_name: {
                        'vector': vec,
                        'k': 10
                    }
                }
            }
        }

    for i in range(query_count):
        print("Query count {}".format((i+1)))
        query_response = os_client.search(index=index_name,
                                          body=get_body([random.random() for _ in range(dimension)]),
                                          request_timeout=100)
        print(query_response)


def _get_opensearch_client(endpoint: str, port: int):
    return OpenSearch(
        hosts=[{
            'host': endpoint,
            'port': port
        }],
        use_ssl=False,
        verify_certs=False,
        connection_class=RequestsHttpConnection,
        timeout=60,
    )


def main(args):
    TRAIN_INDEX_NAME = "train_index"
    TRAIN_FIELD_NAME = "train_field"
    MODEL_ID = "test_model"
    TEST_INDEX_NAME = "test_index"
    TEST_FIELD_NAME = "test_field"
    DIMENSION = 128
    DOC_COUNT = 5000

    QUERY_COUNT = 1

    step = args[1]
    os_client = _get_opensearch_client("localhost", 9200)

    if step == "train_setup":
        create_index(os_client, TRAIN_INDEX_NAME, TRAIN_FIELD_NAME, DIMENSION)
        ingest_docs(os_client, TRAIN_INDEX_NAME, TRAIN_FIELD_NAME, DIMENSION, DOC_COUNT)
        os_client.indices.refresh(index=TRAIN_INDEX_NAME)
        return

    if step == "train":
        train_model(os_client, TRAIN_INDEX_NAME, TRAIN_FIELD_NAME, DIMENSION, MODEL_ID)
        return

    if step == "ingest":
        create_index(os_client, TEST_INDEX_NAME, TEST_FIELD_NAME, DIMENSION, model_id=MODEL_ID)
        ingest_docs(os_client, TEST_INDEX_NAME, TEST_FIELD_NAME, DIMENSION, DOC_COUNT)
        os_client.indices.refresh(index=TEST_INDEX_NAME)
        return

    if step == "search":
        search_index(os_client, TEST_INDEX_NAME, TEST_FIELD_NAME, DIMENSION, QUERY_COUNT)
        return


if __name__ == "__main__":
    main(sys.argv)

Single node cluster crash:

  1. In terminal 1, ./gradlew run
  2. In a separate terminal, python3 test.py train_setup
  3. In terminal 2, python3 test.py train
  4. In terminal 1, control + C to crash cluster
  5. In terminal 1, ./gradlew run --preserve-data
  6. Once the cluster is up and running, use curl or Postman to hit the get model API. The model should be marked FAILED and able to be deleted.

Multi-node cluster crash:

  1. Same steps as above but for each ./gradlew run, add -PnumNodes=3

Node leaving while cluster is still running:

  1. Open /etc/pf.conf and add the following rule to the bottom of the file, which will block transport traffic to the specified port:
    1. block in quick inet proto { tcp, udp } from any to any port 9300
  2. sudo pfctl -f /etc/pf.conf
  3. Add a log statement in TrainingJobClusterStateListener to print the node ephemeral ID in the clusterChanged() method
  4. In terminal 1, ./gradlew run -PnumNodes=3
  5. In terminal 2, python3 test.py train_setup
  6. In terminal 2, python3 test.py train
  7. Ensure that the node assignment printed out by the train script is the same as integ-test0 ephemeral id in terminal 1.
    1. If not, control + C and restart from step 4
  8. In terminal 3, sudo pfctl -e
  9. At this point, cluster communication fails, and once the checks fail three times, a new cluster manager node is elected.
  10. Once there is a new cluster manager node, the model will be marked as FAILED, which can be validated via curl/Postman.
  11. In terminal 3, sudo pfctl -d
  12. The cluster will stabilize as the node rejoins the cluster.
  13. Once the training completes, the log for “Skipping serialization of model” is printed in terminal 1.
  14. The model is still marked as failed, and can be deleted.

Member

@jmazanec15 jmazanec15 left a comment

LGTM

@ryanbogan ryanbogan merged commit 33da521 into opensearch-project:main Dec 12, 2023
48 of 49 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-1317-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 33da521e0f98317b4700b62807e1d21b11f54a71
# Push it to GitHub
git push --set-upstream origin backport/backport-1317-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-1317-to-2.x.
