Fix flaky test in failover2 #1645

Closed
wants to merge 1 commit into from

Conversation

sungming2
Contributor

@sungming2 sungming2 commented Jan 29, 2025

Issue #1640

Problem

The election reset does not take effect when replicas have failover elections in progress for the same epoch and none of them can reach a majority of votes.

For example (diagram attached to the original PR: "Untitled Diagram drawio (2)"):

See: https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3218-L3223
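
The difference between resetting only on a higher epoch and resetting on an equal or higher epoch can be sketched as follows. This is a minimal illustration, not the code in cluster_legacy.c: the helper `should_reset_election` and the `reset_policy` enum are hypothetical names, and the sketch assumes the reset decision comes down to a single comparison between the sender's epoch and the local epoch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy switch, used only for this illustration. */
typedef enum {
    RESET_ON_HIGHER_EPOCH,         /* strict comparison: sender > local */
    RESET_ON_EQUAL_OR_HIGHER_EPOCH /* inclusive comparison: sender >= local */
} reset_policy;

/* Hypothetical helper, not the Valkey implementation: decide whether a
 * replica with an election in progress should abandon it after seeing a
 * claim from another node. */
static bool should_reset_election(uint64_t sender_epoch,
                                  uint64_t local_epoch,
                                  reset_policy policy) {
    /* Guard against an out-of-range policy value. */
    assert(policy == RESET_ON_HIGHER_EPOCH ||
           policy == RESET_ON_EQUAL_OR_HIGHER_EPOCH);

    if (policy == RESET_ON_HIGHER_EPOCH) {
        return sender_epoch > local_epoch;
    }
    return sender_epoch >= local_epoch;
}
```

With both candidates at epoch 9, as in the logs in this PR, the strict policy returns false (the election keeps waiting for votes) and the inclusive policy returns true (the election is reset).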

Replica 1 (failover attempt timed out, then the election restarted):

654224:S 30 Jan 2025 16:19:37.074 * Starting a failover election for epoch 9, node config epoch is 4
654224:S 30 Jan 2025 16:19:37.180 * Needed quorum: 4. Number of votes received so far: 0
654224:S 30 Jan 2025 16:19:37.214 * Node 717ba037a15381ba7d4d34d67d5f431147cd08e2 () reported node 01e298f58ab7bf4c27e39dde0a8858224882e761 () as not reachable.
654224:S 30 Jan 2025 16:19:38.095 * Currently unable to failover: Waiting for votes, but majority still not reached.
654224:S 30 Jan 2025 16:19:38.095 * Needed quorum: 4. Number of votes received so far: 3
654224:S 30 Jan 2025 16:19:39.104 * Currently unable to failover: Waiting for votes, but majority still not reached.
...
654224:S 30 Jan 2025 16:19:46.995 * Currently unable to failover: Failover attempt expired.
654224:S 30 Jan 2025 16:19:46.995 * Needed quorum: 4. Number of votes received so far: 3
654224:S 30 Jan 2025 16:19:47.096 * Currently unable to failover: Failover attempt expired.
654224:S 30 Jan 2025 16:19:47.096 * Needed quorum: 4. Number of votes received so far: 3
...
654224:S 30 Jan 2025 16:19:58.033 * Currently unable to failover: Waiting the delay before I can start a new failover.
654224:S 30 Jan 2025 16:19:58.472 * Starting a failover election for epoch 11, node config epoch is 4
654224:S 30 Jan 2025 16:19:58.585 * Currently unable to failover: Waiting for votes, but majority still not reached.
654224:S 30 Jan 2025 16:19:58.625 * Needed quorum: 4. Number of votes received so far: 0
654224:S 30 Jan 2025 16:19:58.849 * Failover election won: I'm the new primary.
654224:S 30 Jan 2025 16:19:58.849 * configEpoch set to 11 after successful failover

Replica 2 (failover attempt timed out, then the election restarted):

654043:S 30 Jan 2025 16:19:37.076 * Starting a failover election for epoch 9, node config epoch is 7
654043:S 30 Jan 2025 16:19:39.110 * Currently unable to failover: Waiting for votes, but majority still not reached.
654043:S 30 Jan 2025 16:19:39.110 * Needed quorum: 4. Number of votes received so far: 2
654043:S 30 Jan 2025 16:19:40.042 * Currently unable to failover: Waiting for votes, but majority still not reached.
654043:S 30 Jan 2025 16:19:40.042 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:47.124 * Currently unable to failover: Failover attempt expired.
654043:S 30 Jan 2025 16:19:47.124 * Needed quorum: 4. Number of votes received so far: 2
654043:S 30 Jan 2025 16:19:48.034 * Currently unable to failover: Failover attempt expired.
654043:S 30 Jan 2025 16:19:48.034 * Needed quorum: 4. Number of votes received so far: 2
...
654043:S 30 Jan 2025 16:19:57.876 * Starting a failover election for epoch 10, node config epoch is 7
654043:S 30 Jan 2025 16:19:58.206 * Currently unable to failover: Waiting for votes, but majority still not reached.
654043:S 30 Jan 2025 16:19:58.206 * Needed quorum: 4. Number of votes received so far: 0
654043:S 30 Jan 2025 16:19:58.440 * Failover election won: I'm the new primary.
654043:S 30 Jan 2025 16:19:58.440 * configEpoch set to 10 after successful failover
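
The split vote in the logs above (quorum 4, candidates stuck at 3 and 2 votes) can be checked with a small sketch. It assumes the quorum is a simple majority of the voting primaries, floor(n / 2) + 1; the function names are hypothetical, and the voter count of 7 used in the note below is an assumption, since the logs only report the needed quorum.

```c
#include <assert.h>
#include <stdbool.h>

/* Majority quorum for `voters` voting primaries.
 * Assumption: simple majority, floor(voters / 2) + 1. */
static int needed_quorum(int voters) {
    assert(voters > 0);
    return voters / 2 + 1;
}

/* A candidate wins only when its vote count reaches the quorum. */
static bool election_won(int votes_received, int quorum) {
    return votes_received >= quorum;
}
```

Under these assumptions, `needed_quorum(7)` is 4, and neither `election_won(3, 4)` nor `election_won(2, 4)` is true. If the votes for an epoch are already divided between two candidates this way, neither count can grow, so both elections can only end by expiring.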

Test

Ran the failover test hundreds of times to verify that the fix works for the same-epoch case.

Replica 1 (won the election):

921683:S 31 Jan 2025 13:50:21.166 * Starting a failover election for epoch 9, node config epoch is 2
921683:S 31 Jan 2025 13:50:21.223 * Currently unable to failover: Waiting for votes, but majority still not reached.
921683:S 31 Jan 2025 13:50:21.228 * Failover election won: I'm the new primary.
921683:S 31 Jan 2025 13:50:21.228 * configEpoch set to 9 after successful failover
921683:S 31 Jan 2025 13:50:21.228 * Setting myself to primary in shard d5c2c3e1da89876791838e3a33ddd58e1ff03371 after failover; my old primary is a791ce5864d7d60ea1ff930f5301ea55b8da524b ()

Replica 2 (reset its election immediately, then was elected):

921725:S 31 Jan 2025 13:50:21.182 * Starting a failover election for epoch 9, node config epoch is 7

921725:S 31 Jan 2025 13:50:21.270 * Process election resetting for the same epoch. Sender: 9, Server: 9
921725:S 31 Jan 2025 13:50:21.270 # Failover election in progress for epoch 9, but received a claim from node 774080c54d83590622caa352e28a69f871a57a65 () with an equal or higher epoch 9. Resetting the election since we cannot win an election in the past.
...
921725:S 31 Jan 2025 13:50:21.271 * Start of election delayed for 762 milliseconds (rank #0, primary rank #0, offset 14).
921725:S 31 Jan 2025 13:50:22.076 * Starting a failover election for epoch 10, node config epoch is 7
...
921725:S 31 Jan 2025 13:50:22.120 * Failover election won: I'm the new primary.
921725:S 31 Jan 2025 13:50:22.120 * configEpoch set to 10 after successful failover
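
To compare how long Replica 2 waits between its first and second election start in the two sets of logs, the timestamps can be subtracted directly. The helpers below are a hypothetical sketch for reading these log excerpts, not part of Valkey; they assume both timestamps fall on the same day.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "HH:MM:SS.mmm" into milliseconds since midnight.
 * Returns -1 if the input does not match that layout. */
static long parse_log_time_ms(const char *ts) {
    int h, m, s, ms;
    if (sscanf(ts, "%2d:%2d:%2d.%3d", &h, &m, &s, &ms) != 4) {
        return -1;
    }
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

/* Elapsed milliseconds between two same-day log timestamps. */
static long elapsed_ms(const char *start, const char *end) {
    long a = parse_log_time_ms(start);
    long b = parse_log_time_ms(end);
    assert(a >= 0 && b >= a);
    return b - a;
}
```

For Replica 2, `elapsed_ms("16:19:37.076", "16:19:57.876")` gives 20800 ms in the Problem logs, while `elapsed_ms("13:50:21.182", "13:50:22.076")` gives 894 ms in the Test logs. The two figures come from separate test runs, so they indicate the size of the difference rather than a controlled measurement.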


codecov bot commented Jan 29, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.98%. Comparing base (12ec3d5) to head (3cab65e).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable    #1645   +/-   ##
=========================================
  Coverage     70.98%   70.98%           
=========================================
  Files           121      121           
  Lines         65176    65177    +1     
=========================================
+ Hits          46264    46266    +2     
+ Misses        18912    18911    -1     
| Files with missing lines | Coverage Δ |
|---|---|
| src/cluster_legacy.c | 87.09% <100.00%> (-0.24%) ⬇️ |

... and 13 files with indirect coverage changes

@hpatro hpatro requested a review from enjoy-binbin January 30, 2025 00:04
@sungming2 sungming2 changed the title Increase timeout for flaky test in failover2 Fix flaky test in failover2 Jan 30, 2025
@sungming2 sungming2 force-pushed the fix-flaky-failover2 branch from 7e53720 to 3ac9049 Compare January 30, 2025 21:52
@sungming2 sungming2 requested a review from hpatro January 30, 2025 22:18
@sungming2 sungming2 marked this pull request as draft January 31, 2025 04:16
@sungming2 sungming2 closed this Jan 31, 2025
@sungming2 sungming2 deleted the fix-flaky-failover2 branch January 31, 2025 04:28
@sungming2 sungming2 restored the fix-flaky-failover2 branch January 31, 2025 04:35
@sungming2 sungming2 reopened this Jan 31, 2025
@sungming2 sungming2 force-pushed the fix-flaky-failover2 branch from 3ac9049 to 74585b7 Compare January 31, 2025 05:37
Signed-off-by: Seungmin Lee <[email protected]>
@sungming2 sungming2 force-pushed the fix-flaky-failover2 branch from 74585b7 to 3cab65e Compare January 31, 2025 05:39
@sungming2 sungming2 marked this pull request as ready for review January 31, 2025 22:10
@sungming2 sungming2 requested a review from hpatro January 31, 2025 22:10
@sungming2 sungming2 marked this pull request as draft February 1, 2025 06:28
@sungming2
Contributor Author

Closing this PR since this case needs further discussion.

@sungming2 sungming2 closed this Feb 1, 2025
2 participants