Change priority for scheduling reroute during timeout #16445

Open
wants to merge 12 commits into
base: main
from

imRishN
Member

@imRishN imRishN commented Oct 23, 2024

Description

This PR lowers the priority of the reroute that is scheduled after an allocator timeout from HIGH to NORMAL, because a steady stream of HIGH priority reroutes can starve NORMAL priority tasks. NORMAL is the right priority for a healthy cluster. For a cluster in a degraded state, where NORMAL priority tasks are themselves being starved, this PR adds a new dynamic cluster setting that raises the priority of the reroute task so that shards still get allocated.
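
To illustrate the starvation concern, here is a minimal sketch rather than the actual cluster-manager code. It assumes a single thread draining a priority-ordered queue, with FIFO order inside one priority level, and a reroute that times out and reschedules itself every time it runs. The class `RerouteStarvationSketch`, its `Priority` enum, and `reroutesBeforeNormalTask` are hypothetical names made up for this example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical model, not OpenSearch code: one thread draining a priority-ordered task queue.
final class RerouteStarvationSketch {

    // Declared most urgent first, so ordinal() doubles as the queue order.
    enum Priority { HIGH, NORMAL }

    static final class Task {
        final Priority priority;
        final long seq;          // insertion order, used as the FIFO tie-breaker
        final boolean isReroute;

        Task(Priority priority, long seq, boolean isReroute) {
            this.priority = priority;
            this.seq = seq;
            this.isReroute = isReroute;
        }
    }

    /**
     * Counts how many reroutes execute before one waiting NORMAL task gets its turn,
     * assuming every reroute times out and reschedules itself at reroutePriority.
     * Returns maxSteps if the NORMAL task did not run within maxSteps executions.
     */
    static int reroutesBeforeNormalTask(Priority reroutePriority, int maxSteps) {
        PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparingInt((Task t) -> t.priority.ordinal()).thenComparingLong(t -> t.seq)
        );
        long seq = 0;
        queue.add(new Task(reroutePriority, seq++, true));   // the first reroute
        queue.add(new Task(Priority.NORMAL, seq++, false));  // an unrelated NORMAL task
        int reroutes = 0;
        for (int step = 0; step < maxSteps; step++) {
            Task next = queue.poll();
            if (!next.isReroute) {
                return reroutes; // the NORMAL task finally ran
            }
            reroutes++;
            // The allocator timed out again, so another reroute is queued.
            queue.add(new Task(reroutePriority, seq++, true));
        }
        return reroutes;
    }

    public static void main(String[] args) {
        System.out.println("HIGH:   " + reroutesBeforeNormalTask(Priority.HIGH, 1000));
        System.out.println("NORMAL: " + reroutesBeforeNormalTask(Priority.NORMAL, 1000));
    }
}
```

Under these assumptions, a reroute rescheduled at HIGH always jumps ahead of the waiting NORMAL task, so the task is still waiting when the 1000-step budget runs out. A reroute rescheduled at NORMAL queues behind it, and the task runs after a single reroute.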

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [ ] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for 5e83a92: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Comment on lines 346 to 347
- "reroute after existing shards allocator timed out",
- Priority.HIGH,
+ "reroute after existing shards allocator [R] timed out",
+ Priority.NORMAL,
Collaborator

Should we have a separate priority for primary vs replica?

Member Author

NORMAL also seems right for PSA. But when there are genuine issues in the cluster, which appropriate monitoring can identify, we might need to raise it to HIGH. I will update the PR with a setting for ESA, similar to the one for BSA, to raise the reroute priority. Wdyt?

Collaborator

@Bukhtawar Bukhtawar left a comment

Let's update the PR description

@imRishN
Member Author

imRishN commented Oct 23, 2024

Let's update the PR description

Updated

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 6a448d0: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 825a983: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 5368e7f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

❌ Gradle check result for 2ba604d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for 2ba604d: SUCCESS

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 88.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.29%. Comparing base (f98f426) to head (1c12935).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../allocation/allocator/BalancedShardsAllocator.java 78.57% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16445      +/-   ##
============================================
+ Coverage     72.24%   72.29%   +0.05%     
+ Complexity    65305    65300       -5     
============================================
  Files          5301     5301              
  Lines        303774   303798      +24     
  Branches      44016    44018       +2     
============================================
+ Hits         219458   219632     +174     
+ Misses        66272    66099     -173     
- Partials      18044    18067      +23     

☔ View full report in Codecov by Sentry.

Contributor

✅ Gradle check result for 7329867: SUCCESS

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

✅ Gradle check result for 33ffefb: SUCCESS

Setting.Property.NodeScope,
Setting.Property.Dynamic
);

Collaborator

This logic seems redundant

Member Author

Do you mean the logic that parses the reroute priority?

Collaborator

yes

Member Author

The exception thrown has a different message. This could be rearranged, but the current form also looks cleaner.
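
For readers following this thread, the kind of parsing logic under discussion might look roughly like the sketch below. It is not the code from this PR. The class `ReroutePriorityParserSketch`, the simplified `Priority` enum, the method `parseReroutePriority`, the accepted values (NORMAL and HIGH), and the exception message are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical sketch, not the PR's code: validate the value of a reroute priority setting.
final class ReroutePriorityParserSketch {

    // Simplified stand-in for the real priority type.
    enum Priority { HIGH, NORMAL, LOW }

    /**
     * Parses a setting value case-insensitively and accepts only NORMAL or HIGH
     * (an assumption made for this example). Anything else is rejected with a
     * message that names the offending value.
     */
    static Priority parseReroutePriority(String value) {
        if (value == null) {
            throw new IllegalArgumentException("reroute priority must not be null");
        }
        String normalized = value.trim().toUpperCase(Locale.ROOT);
        switch (normalized) {
            case "NORMAL":
                return Priority.NORMAL;
            case "HIGH":
                return Priority.HIGH;
            default:
                throw new IllegalArgumentException(
                    "priority [" + value + "] not supported for reroute, expected one of [NORMAL, HIGH]"
                );
        }
    }

    public static void main(String[] args) {
        System.out.println(parseReroutePriority("high"));
        System.out.println(parseReroutePriority("NORMAL"));
    }
}
```

Keeping the validation in one dedicated method is one way to give each setting its own error message, which is the trade-off raised in this thread.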

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Dec 11, 2024
@imRishN imRishN removed the stalled Issues that have stalled label Jan 14, 2025
Contributor

❌ Gradle check result for 7596691: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata <[email protected]>
Contributor

✅ Gradle check result for 1c12935: SUCCESS

@imRishN imRishN requested a review from cwperks as a code owner January 20, 2025 10:17
Contributor

❌ Gradle check result for bf24a5b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

3 participants