fix: Prevent leader checker from generating excessive duplicate leader tasks #39000

Open
wants to merge 1 commit into
base: master

weiliu1031
Contributor

@weiliu1031 weiliu1031 commented Jan 6, 2025

issue: #39001
Background:
Segment Load Version: Each segment load request assigns a timestamp as its version. When multiple copies of a segment are loaded on different QueryNodes, the leader checker uses this version to identify the latest copy and updates the routing table in the leader view to point to it.

Delegator Router Version: When a delegator builds a route to a QueryNode that has loaded a segment, it also records the segment's version.
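
To make the role of the segment load version concrete, here is a minimal Go sketch of picking the latest copy of a segment by its load version. The `SegmentCopy` type and `LatestCopy` function are hypothetical names invented for illustration and are not taken from the Milvus code base.

```go
package main

import "fmt"

// SegmentCopy is a hypothetical record of one loaded copy of a segment.
type SegmentCopy struct {
	SegmentID int64
	NodeID    int64 // QueryNode holding this copy
	Version   int64 // timestamp assigned by the load request
}

// LatestCopy returns the copy with the highest load version.
// ok is false when copies is empty.
func LatestCopy(copies []SegmentCopy) (latest SegmentCopy, ok bool) {
	for _, c := range copies {
		if !ok || c.Version > latest.Version {
			latest, ok = c, true
		}
	}
	return latest, ok
}

func main() {
	copies := []SegmentCopy{
		{SegmentID: 1, NodeID: 10, Version: 100},
		{SegmentID: 1, NodeID: 11, Version: 250},
	}
	if latest, ok := LatestCopy(copies); ok {
		// The routing table would be pointed at the node holding the latest copy.
		fmt.Printf("route segment %d to node %d\n", latest.SegmentID, latest.NodeID)
	}
}
```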

Router Table Update Logic: If the leader checker detects that the version of a segment in the routing table does not match the version in the worker, it updates the routing table to point to the QueryNode with the latest version. Additionally, it updates the segment's load version in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync the routing table to a new delegator. This sync operation modifies the segment's load version, which invalidates the routing in the old delegator. Subsequently, the leader checker updates the routing table in the old delegator, breaking the routing in the new delegator. This cycle continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:

  1. Use NodeID to verify whether the delegator's routing table needs an update, avoiding unnecessary modifications.
  2. Ensure compatibility by using the latest segment's load version as the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker from generating excessive duplicate tasks, ensuring routing stability across delegators during load balancing.

@sre-ci-robot sre-ci-robot requested review from sunby and yah01 January 6, 2025 03:31
@sre-ci-robot sre-ci-robot added the size/M Denotes a PR that changes 30-99 lines. label Jan 6, 2025
@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: weiliu1031
To complete the pull request process, please assign jiaoew1991 after the PR has been reviewed.
You can assign the PR to them by writing /assign @jiaoew1991 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mergify mergify bot added the dco-passed DCO check passed. label Jan 6, 2025
Contributor

mergify bot commented Jan 6, 2025

@weiliu1031

Invalid PR Title Format Detected

Your PR submission does not adhere to our required standards. To ensure clarity and consistency, please meet the following criteria:

  1. Title Format: The PR title must begin with one of these prefixes:
  • feat: for introducing a new feature.
  • fix: for bug fixes.
  • enhance: for improvements to existing functionality.
  • test: for adding tests to existing functionality.
  • doc: for modifying documentation.
  • auto: for pull requests from bots.
  2. Description Requirement: The PR must include a non-empty description, detailing the changes and their impact.

Required Title Structure:

[Type]: [Description of the PR]

Where Type is one of feat, fix, enhance, test or doc.

Example:

enhance: improve search performance significantly 

Please review and update your PR to comply with these guidelines.

@weiliu1031 weiliu1031 changed the title from "Fix: Prevent Leader Checker from Generating Excessive Duplicate Leader Tasks" to "fix: Prevent Leader Checker from Generating Excessive Duplicate Leader Tasks" Jan 6, 2025
@mergify mergify bot added kind/bug Issues or changes related a bug and removed do-not-merge/invalid-pr-format labels Jan 6, 2025
@weiliu1031 weiliu1031 changed the title from "fix: Prevent Leader Checker from Generating Excessive Duplicate Leader Tasks" to "fix: Prevent leader checker from generating excessive duplicate leader tasks" Jan 6, 2025
Contributor

mergify bot commented Jan 6, 2025

@weiliu1031 go-sdk check failed; comment "rerun go-sdk" to trigger the job again.


codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.17%. Comparing base (d6206ad) to head (316f3a6).
Report is 1 commit behind head on master.

@@             Coverage Diff             @@
##           master   #39000       +/-   ##
===========================================
+ Coverage   69.64%   81.17%   +11.52%     
===========================================
  Files         296     1390     +1094     
  Lines       26633   196782   +170149     
===========================================
+ Hits        18548   159728   +141180     
- Misses       8085    31466    +23381     
- Partials        0     5588     +5588     

Components   Coverage Δ
Client       79.53% <ø> (∅)
Core         69.64% <ø> (ø)
Go           83.11% <100.00%> (∅)

Files with missing lines                           Coverage Δ
internal/querycoordv2/checkers/leader_checker.go   96.91% <100.00%> (ø)

... and 1093 files with indirect coverage changes

@weiliu1031 weiliu1031 force-pushed the fix_too_much_leader_task branch 2 times, most recently from 45d467e to 749f41a January 7, 2025 06:08
Contributor

mergify bot commented Jan 7, 2025

@weiliu1031 go-sdk check failed; comment "rerun go-sdk" to trigger the job again.

@mergify mergify bot added the ci-passed label Jan 7, 2025
…r Tasks
Signed-off-by: Wei Liu <[email protected]>
@weiliu1031 weiliu1031 force-pushed the fix_too_much_leader_task branch from 749f41a to 316f3a6 January 9, 2025 07:53
@sre-ci-robot sre-ci-robot added size/S Denotes a PR that changes 10-29 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jan 9, 2025
@mergify mergify bot added ci-passed and removed ci-passed labels Jan 9, 2025
Labels
ci-passed dco-passed DCO check passed. kind/bug Issues or changes related a bug size/S Denotes a PR that changes 10-29 lines.