Fix model not deploy issue under intensive prediction tasks #1903

Merged (1 commit) on Jan 26, 2024

Conversation

zane-neo
Collaborator

Description

Under intensive prediction load, a "model not deployed" error can occasionally occur, as shown below:

[ERROR] Cannot execute-test. Error in load generator [0]
        Cannot run task [bulk]: Request returned an error. Error type: bulk, Description: HTTP status: 400, message: Model not ready yet. Please run this first: POST /_plugins/_ml/models/Cc36M40Bxy1iWleaz3Bl/_deploy

The cause is that SyncUpJob cleans up timed-out model deployment tasks by iterating over all tasks, without filtering out prediction tasks. A prediction task can be removed from the cache as soon as its prediction is done. If the corresponding taskCache entry has already been cleared, SyncUpJob hits an NPE and returns an error response for the gatherInfoRequest. In that case SyncUpJob sends a clearRoutingTable request, which clears all routingTable info, so the next prediction task finds no model in the cache and throws the "model needs deploy" exception.
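
The failure mode and the guard against it can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `SyncUpSketch`, `TaskType`, `CachedTask`, and `findTimedOutDeployTasks` are hypothetical names chosen for illustration. It only shows the general idea of skipping task ids whose cache entry has already been removed and considering only deploy tasks when looking for timed-out deployments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the ml-commons implementation.
final class SyncUpSketch {

    // Hypothetical task types; the real plugin has its own enum.
    enum TaskType { DEPLOY_MODEL, PREDICTION }

    // Hypothetical cache entry holding the task type and its creation time.
    static final class CachedTask {
        final TaskType type;
        final long createdAtMillis;

        CachedTask(TaskType type, long createdAtMillis) {
            this.type = type;
            this.createdAtMillis = createdAtMillis;
        }
    }

    // Returns the ids of deploy tasks that have exceeded the timeout.
    static List<String> findTimedOutDeployTasks(
            List<String> runningTaskIds,
            Map<String, CachedTask> taskCache,
            long timeoutMillis,
            long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (String taskId : runningTaskIds) {
            CachedTask task = taskCache.get(taskId);
            if (task == null) {
                // The entry was removed between listing and lookup
                // (e.g. a prediction finished); skip it instead of failing.
                continue;
            }
            if (task.type != TaskType.DEPLOY_MODEL) {
                // Prediction tasks are not subject to the deploy timeout.
                continue;
            }
            if (nowMillis - task.createdAtMillis > timeoutMillis) {
                expired.add(taskId);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        Map<String, CachedTask> cache = new ConcurrentHashMap<>();
        cache.put("deploy-1", new CachedTask(TaskType.DEPLOY_MODEL, 0L));
        cache.put("predict-1", new CachedTask(TaskType.PREDICTION, 0L));
        // "predict-2" is listed as running but is already gone from the cache.
        List<String> running = List.of("deploy-1", "predict-1", "predict-2");
        System.out.println(findTimedOutDeployTasks(running, cache, 1_000L, 5_000L));
    }
}
```

In this sketch only `deploy-1` is reported as timed out; the old prediction task and the already-removed task id are both ignored, so the lookup never dereferences a missing entry.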

Issues Resolved

NA

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov bot commented Jan 23, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (54c788a) 82.61% compared to head (9d7b9dc) 82.63%.
Report is 1 commits behind head on main.

Files                                                    Patch %   Lines
.../java/org/opensearch/ml/utils/RestActionUtils.java    75.00%    4 Missing ⚠️
.../ml/action/syncup/TransportSyncUpOnNodeAction.java    0.00%     1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1903      +/-   ##
============================================
+ Coverage     82.61%   82.63%   +0.01%     
- Complexity     5383     5388       +5     
============================================
  Files           521      521              
  Lines         21715    21727      +12     
  Branches       2210     2212       +2     
============================================
+ Hits          17940    17954      +14     
+ Misses         2878     2872       -6     
- Partials        897      901       +4     
Flag Coverage Δ
ml-commons 82.63% <68.42%> (+0.01%) ⬆️

@zane-neo zane-neo merged commit 521b880 into opensearch-project:main Jan 26, 2024
10 of 13 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 26, 2024
ylwu-amzn pushed a commit that referenced this pull request Jan 26, 2024
…1930)

Signed-off-by: zane-neo <[email protected]>
(cherry picked from commit 521b880)

Co-authored-by: zane-neo <[email protected]>
austintlee pushed a commit to austintlee/ml-commons that referenced this pull request Mar 19, 2024
3 participants