
Fix flaky test SegmentReplicationIT.testReplicaAlreadyAtCheckpoint #17216

Status: Open
Wants to merge 1 commit into base: main
Conversation

skumawat2025
Contributor

@skumawat2025 skumawat2025 commented Jan 31, 2025

Description

The SegmentReplicationIT.testReplicaAlreadyAtCheckpoint test starts three nodes: one primary and two replicas. After ingesting documents into the primary shard, the test stops the primary node without first verifying that segment replication to both replicas has completed. When replication has not yet finished, the test fails intermittently, as in the trace below:

1> org.opensearch.transport.NodeDisconnectedException: [node_t0][127.0.0.1:41957][disconnected] disconnected
  1> [2025-01-15T22:22:47,061][INFO ][o.o.c.c.FollowersChecker ] [node_t0] FollowerChecker{discoveryNode={node_t1}{jpNpHksYQ7-Fb8W2sTp8Sg}{RSZGRaXJQfWjT8jbA4p4iA}{127.0.0.1}{127.0.0.1:42245}{d}{shard_indexing_pressure_enabled=true}, failureCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=3} marking node as faulty
  1> [2025-01-15T22:22:47,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t3] [shardId [test-idx-1][0]] [replication id 5] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=1, FILE_DIFF=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
  1> 	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) [?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:84) [main/:?]
  1> 	at org.opensearch.core.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:65) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [main/:?]
  1> 	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) [main/:?]
  1> 	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) [main/:?]
  1> 	at org.opensearch.transport.TransportService$8.run(TransportService.java:1357) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:932) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeDisconnectedException: [node_t1][127.0.0.1:42245][internal:index/shard/replication/get_segment_files] disconnected
  1> [2025-01-15T22:22:47,057][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t2] [shardId [test-idx-1][0]] [replication id 6] Replication failed, timing data: {INIT=0, GET_CHECKPOINT_INFO=1, FILE_DIFF=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [main/:?]
  1> 	at java.base/java.util.ArrayList.forEach(ArrayList.java:1597) [?:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [main/:?]
  1> 	at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:84) [main/:?]
  1> 	at org.opensearch.core.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:65) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:75) [main/:?]
  1> 	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) [main/:?]
  1> 	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1505) [main/:?]
  1> 	at org.opensearch.transport.TransportService$8.run(TransportService.java:1357) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:932) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeDisconnectedException: [node_t1][127.0.0.1:42245][internal:index/shard/replication/get_segment_files] disconnected
  1> [2025-01-15T22:22:47,061][WARN ][o.o.i.r.OngoingSegmentReplications] [node_t1] Cancelling replications for allocationIds [nVfWi-IOQ06hLbKTnK05VQ]
  1> [2025-01-15T22:22:47,065][WARN ][o.o.c.r.a.AllocationService] [node_t0] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
  1> [2025-01-15T22:22:47,064][ERROR][o.o.i.r.SegmentReplicationTargetService] [node_t3] [shardId [test-idx-1][0]] [replication id 7] Replication failed, timing data: {INIT=0, REPLICATING=0}
  1> org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
  1> 	at org.opensearch.indices.replication.SegmentReplicator$2.onFailure(SegmentReplicator.java:154) [main/:?]
  1> 	at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:104) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) [main/:?]
  1> 	at org.opensearch.action.StepListener.whenComplete(StepListener.java:95) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:179) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicator.start(SegmentReplicator.java:137) [main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicator$ReplicationRunner.doRun(SegmentReplicator.java:123) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:991) [main/:?]
  1> 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [main/:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
  1> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
  1> 	at java.base/java.lang.Thread.run(Thread.java:1575) [?:?]
  1> Caused by: org.opensearch.transport.NodeNotConnectedException: [node_t1][127.0.0.1:42245] Node not connected
  1> 	at org.opensearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:223) ~[main/:?]
  1> 	at org.opensearch.test.transport.StubbableConnectionManager.getConnection(StubbableConnectionManager.java:93) ~[framework-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  1> 	at org.opensearch.transport.TransportService.getConnection(TransportService.java:898) ~[main/:?]
  1> 	at org.opensearch.transport.TransportService.sendRequest(TransportService.java:857) ~[main/:?]
  1> 	at org.opensearch.indices.replication.PrimaryShardReplicationSource.getCheckpointMetadata(PrimaryShardReplicationSource.java:66) ~[main/:?]
  1> 	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:177) ~[main/:?]
  1> 	... 7 more

With this change, the test waits for segment replication to finish on both replicas before stopping the primary node.
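The fix relies on the test framework's waitForSearchableDocs helper, which polls each node until it reports the expected number of searchable documents. As a minimal, self-contained sketch of that polling pattern (the names and per-node counters here are hypothetical stand-ins, not the actual OpenSearch implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the polling pattern behind waitForSearchableDocs: repeatedly
// check a condition on every node until it holds or a timeout expires.
public class WaitForDocs {

    // Hypothetical per-node searchable doc counts, for illustration only.
    static final Map<String, Long> docCounts = new ConcurrentHashMap<>();

    static boolean waitUntil(Supplier<Boolean> condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) return true;
            Thread.sleep(50); // back off briefly between checks
        }
        return condition.get();
    }

    // Analogue of waitForSearchableDocs(count, nodes...): every node must
    // report at least `count` searchable docs before the test proceeds.
    static boolean waitForSearchableDocs(long count, List<String> nodes, long timeoutMillis)
            throws InterruptedException {
        return waitUntil(
            () -> nodes.stream().allMatch(n -> docCounts.getOrDefault(n, 0L) >= count),
            timeoutMillis
        );
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> nodes = List.of("primary", "replica1", "replica2");
        // Simulate replication catching up on each node shortly after indexing.
        for (String n : nodes) {
            new Thread(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
                docCounts.put(n, 1L);
            }).start();
        }
        boolean done = waitForSearchableDocs(1, nodes, 5_000);
        System.out.println(done ? "replicated" : "timed out");
    }
}
```

Waiting on a per-node condition rather than sleeping for a fixed duration is what removes the race: the primary is only stopped once every replica has observably caught up.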

Related Issues

Resolves #14328

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run good first issue Good for newcomers Storage Issues and PRs relating to data and metadata storage Storage:Remote labels Jan 31, 2025
@skumawat2025 skumawat2025 marked this pull request as ready for review January 31, 2025 08:05
Contributor

✅ Gradle check result for eed1d1b: SUCCESS


codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.31%. Comparing base (1bf8b9c) to head (eed1d1b).
Report is 1 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17216      +/-   ##
============================================
- Coverage     72.34%   72.31%   -0.03%     
- Complexity    65731    65747      +16     
============================================
  Files          5318     5318              
  Lines        305743   305741       -2     
  Branches      44350    44350              
============================================
- Hits         221182   221095      -87     
- Misses        66394    66573     +179     
+ Partials      18167    18073      -94     

☔ View full report in Codecov by Sentry.

@@ -1892,6 +1892,7 @@ public void testReplicaAlreadyAtCheckpoint() throws Exception {
     // index a doc.
     client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", randomInt()).get();
     refresh(INDEX_NAME);
+    waitForSearchableDocs(1, primaryNode, replicaNode, replicaNode2);
Member

At which step does the test fail? Can you share more details about the failure? I could not get that clearly from the description.

Labels
autocut flaky-test Random test failure that succeeds on second run good first issue Good for newcomers skip-changelog Storage:Remote Storage Issues and PRs relating to data and metadata storage >test-failure Test failure from CI, local build, etc.
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SegmentReplicationIT
2 participants