
Supports Force Committing Segments in Batches #14811

Merged 81 commits into apache:master on Feb 1, 2025

Conversation

@noob-se7en (Contributor) commented Jan 14, 2025

Problem Statement
The Force Commit API can cause high ingestion lag and slower queries because it triggers a simultaneous commit of all consuming segments. This happens because:

  1. If N is the number of partition groups a server is consuming from, the API will cause all N consuming segments to commit. Hence N consumer threads will rush to acquire the segment build semaphore. If the semaphore allows only M permits, only M consuming segments are in the segment build stage while the remaining (N - M) consumer threads wait on the semaphore. While those (N - M) consumer threads are waiting, the consumption lag can become substantial.
  2. Since M consuming segments are built in parallel, queries can become slower on poorly sized servers due to high memory consumption.

Solution
Adds an optional batchSize parameter to the forceCommit API (default = Integer.MAX_VALUE, i.e. commit as many segments as possible at once if no batchSize is provided).

Three new query parameters are added to the forceCommit API (see the example request after the list):

  • batchSize (integer, optional):
    Max number of consuming segments to commit at once.
    Default: Integer.MAX_VALUE
    Example: batchSize=100

  • batchStatusCheckIntervalSec (integer, optional):
    How often (in seconds) to check whether the current batch of segments has been successfully committed.
    Default: 5
    Example: batchStatusCheckIntervalSec=10

  • batchStatusCheckTimeoutSec (integer, optional):
    Timeout (in seconds) after which the controller stops checking the forceCommit status and throws an exception.
    Default: 180
    Example: batchStatusCheckTimeoutSec=300
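
For example, combining all three parameters (the controller host, port, and table name below are hypothetical), a batched force commit could be triggered with:

POST http://localhost:9000/tables/myTable_REALTIME/forceCommit?batchSize=100&batchStatusCheckIntervalSec=10&batchStatusCheckTimeoutSec=300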

@codecov-commenter commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 48.86364% with 45 lines in your changes missing coverage. Please review.

Project coverage is 63.72%. Comparing base (59551e4) to head (5b78c1b).
Report is 1659 commits behind head on master.

Files with missing lines Patch % Lines
.../core/realtime/PinotLLCRealtimeSegmentManager.java 40.67% 30 Missing and 5 partials ⚠️
...ller/api/resources/PinotRealtimeTableResource.java 0.00% 5 Missing ⚠️
...pinot/spi/utils/retry/AttemptFailureException.java 60.00% 4 Missing ⚠️
...ntroller/api/resources/ForceCommitBatchConfig.java 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14811      +/-   ##
============================================
+ Coverage     61.75%   63.72%   +1.97%     
- Complexity      207     1482    +1275     
============================================
  Files          2436     2713     +277     
  Lines        133233   152138   +18905     
  Branches      20636    23510    +2874     
============================================
+ Hits          82274    96949   +14675     
- Misses        44911    47902    +2991     
- Partials       6048     7287    +1239     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.65% <48.86%> (+1.94%) ⬆️
java-21 63.61% <48.86%> (+1.99%) ⬆️
skip-bytebuffers-false 63.67% <48.86%> (+1.92%) ⬆️
skip-bytebuffers-true 63.59% <48.86%> (+35.87%) ⬆️
temurin 63.72% <48.86%> (+1.97%) ⬆️
unittests 63.72% <48.86%> (+1.97%) ⬆️
unittests1 56.22% <66.66%> (+9.32%) ⬆️
unittests2 34.05% <47.72%> (+6.32%) ⬆️



@Jackie-Jiang (Contributor)

This solves #11950

@Jackie-Jiang (Contributor) left a comment

Seems you are pushing the batch throttling to the server side. What will happen if some replicas decide to commit, and others get throttled? Even worse, could this cause deadlock?

@siddharthteotia (Contributor)

Is it not possible to solve the problem on the controller / coordinate from the controller? Pushing this down to the individual servers will likely lead to error-prone situations.

@noob-se7en (Contributor, Author)

@Jackie-Jiang I don't quite get what is meant by

and others get throttled?

Regarding deadlock or any edge case: the server will use the same logic that /tables/forceCommitStatus/{jobId} uses to check the status of the batch, so there should be no deadlock.

@Jackie-Jiang (Contributor) left a comment

Mostly good

@@ -213,6 +222,7 @@ public PinotLLCRealtimeSegmentManager(PinotHelixResourceManager helixResourceMan
controllerConf.getDeepStoreRetryUploadParallelism()) : null;
_deepStoreUploadExecutorPendingSegments =
_isDeepStoreLLCSegmentUploadRetryEnabled ? ConcurrentHashMap.newKeySet() : null;
_forceCommitExecutorService = Executors.newFixedThreadPool(4);
Contributor:

Having a fixed-size pool could actually cause problems when there are multiple force commit requests. Since it is waiting most of the time, I'd actually suggest making a single-thread pool for each request, same as the last version. It is not on the query path, so the overhead should be fine.
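
A minimal sketch of that suggestion (hypothetical names, not the PR's actual code): create a dedicated single-thread executor per forceCommit request and release it once the status polling finishes.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ForceCommitExecutorSketch {
  // Sketch only: one single-thread executor per request instead of a shared fixed-size pool.
  public static void runStatusCheckAsync(Runnable statusPollingTask) {
    ExecutorService executor =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "force-commit-status-check"));
    try {
      executor.submit(statusPollingTask);
    } finally {
      // shutdown() lets the already-submitted task run to completion, then frees the thread
      executor.shutdown();
    }
  }
}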


Map<String, Map<String, String>> segmentNameToInstanceToStateMap = idealState.getRecord().getMapFields();
for (String segmentName : segmentNameToInstanceToStateMap.keySet()) {
if (!targetConsumingSegments.contains(segmentName)) {
Contributor:

Let's loop over targetConsumingSegments instead of ideal state. Ideal state should always contain targetConsumingSegments because they are extracted from ideal state.

Comment on lines 1965 to 1971
instanceToConsumingSegments.compute(instance, (key, value) -> {
if (value == null) {
value = new LinkedList<>();
}
value.add(segmentName);
return value;
});
Contributor:

Suggested change
instanceToConsumingSegments.compute(instance, (key, value) -> {
if (value == null) {
value = new LinkedList<>();
}
value.add(segmentName);
return value;
});
instanceToConsumingSegments.computeIfAbsent(instance, k -> new LinkedList<>()).add(segmentName);

Comment on lines 1962 to 1963
for (String instance : instanceToStateMap.keySet()) {
String state = instanceToStateMap.get(instance);
Contributor:

Use entrySet() to reduce lookup


while (segmentsRemaining) {
segmentsRemaining = false;
// pick segments in round-robin fashion to parallelize
Contributor:

Smart!

String segmentName = queue.poll();
// there might be a segment replica hosted on
// another instance added before
if (segmentsAdded.contains(segmentName)) {
Contributor:

We can reduce a lookup by

Suggested change
if (segmentsAdded.contains(segmentName)) {
if (!segmentsAdded.add(segmentName)) {

// pick segments in round-robin fashion to parallelize
// forceCommit across max servers
for (Queue<String> queue : instanceToConsumingSegments.values()) {
if (!queue.isEmpty()) {
Contributor:

We can remove the queue when it is empty to avoid checking it again and again. You may use an iterator to remove the entry without an extra lookup.
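
Putting the suggestions above together, a minimal sketch of the round-robin batching (hypothetical method and variable names, not the PR's actual getSegmentBatchList) could look like:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SegmentBatchSketch {
  // Sketch only: round-robin over instances so each batch spreads segments across
  // as many servers as possible. Note: consumes (mutates) the input map.
  public static List<Set<String>> toBatches(Map<String, Queue<String>> instanceToConsumingSegments,
      int batchSize) {
    List<Set<String>> batches = new ArrayList<>();
    Set<String> segmentsAdded = new HashSet<>();  // a replica may be listed under several instances
    Set<String> currentBatch = new HashSet<>();
    while (!instanceToConsumingSegments.isEmpty()) {
      Iterator<Queue<String>> it = instanceToConsumingSegments.values().iterator();
      while (it.hasNext()) {
        Queue<String> queue = it.next();
        String segmentName = queue.poll();
        if (queue.isEmpty()) {
          it.remove();  // drop exhausted instances so they are not re-checked every round
        }
        if (segmentName == null || !segmentsAdded.add(segmentName)) {
          continue;  // empty queue, or segment already scheduled via another replica
        }
        currentBatch.add(segmentName);
        if (currentBatch.size() == batchSize) {
          batches.add(currentBatch);
          currentBatch = new HashSet<>();
        }
      }
    }
    if (!currentBatch.isEmpty()) {
      batches.add(currentBatch);
    }
    return batches;
  }
}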


try {
Thread.sleep(FORCE_COMMIT_STATUS_CHECK_INTERVAL_MS);
} catch (InterruptedException ignored) {
Contributor:

Ignoring the interrupt could be risky (holding a long-running thread). Let's wrap it in a RuntimeException and throw it. We may log an error when catching it.
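
A minimal sketch of the suggested handling (hypothetical helper, not the PR's exact code):

public class InterruptHandlingSketch {
  // Sketch only: re-assert the interrupt flag and surface the failure as a
  // RuntimeException instead of silently swallowing it in a long-running polling loop.
  static void sleepBetweenStatusChecks(long intervalMs) {
    try {
      Thread.sleep(intervalMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting for force-commit batch to complete", e);
    }
  }
}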

}

int attemptCount = 0;
final Set<String>[] segmentsYetToBeCommitted = new Set[]{new HashSet<>()};
Contributor:

Suggested change
final Set<String>[] segmentsYetToBeCommitted = new Set[]{new HashSet<>()};
final Set<String>[] segmentsYetToBeCommitted = new Set[1];

@@ -152,6 +157,9 @@ public class PinotLLCRealtimeSegmentManager {

// Max time to wait for all LLC segments to complete committing their metadata while stopping the controller.
private static final long MAX_LLC_SEGMENT_METADATA_COMMIT_TIME_MILLIS = 30_000L;
private static final int FORCE_COMMIT_STATUS_CHECK_INTERVAL_MS = 15000;
Contributor:

Let's take the check interval also from the REST API, because different use cases might want different intervals; we might also want to add a timeout and take that from the REST API as well. The retry count can be calculated from the timeout and the interval.
We can provide default values (e.g. 5s, 3min) in case they are not provided. IMO a 15s interval is too long, because it means we will wait at least 15s for each batch.

Contributor (Author):

Sure.
(The default case (5s interval, 3m timeout) might be a little expensive, as we are waiting for segment builds to complete and there will be at most 180s / 5s = 36 status calls to ZK per batch.)

@sajjad-moradi (Contributor) left a comment

How are we going to come up with a good value for the batch size param?
Since we only want M segment commits on each server, maybe the controller can decide which servers commit which partitions, so that for each batch one server commits at most M segments?

Comment on lines 1918 to 1919
LOGGER.error(errorMsg, e);
throw new RuntimeException(e);
Contributor:

If an exception is thrown, there's no need to log. Add the errorMsg to the runtime exception.

Contributor:

We do need to log the message. This is running in a thread, and we are not handling the exception

int maxAttempts = (batchStatusCheckTimeoutMs + batchStatusCheckIntervalMs - 1) / batchStatusCheckIntervalMs;
RetryPolicy retryPolicy =
RetryPolicies.fixedDelayRetryPolicy(maxAttempts, batchStatusCheckIntervalMs);
int attemptCount = 0;
Contributor:

No need for this variable. Both AttemptsExceededException and RetriableOperationException have a getAttempts() method.

@noob-se7en (Contributor, Author) commented Jan 30, 2025

How are we going to come up with a good value for batch size param?

Hmm... IMO it should be <= _serverConfig.getProperty(MAX_PARALLEL_SEGMENT_BUILDS) * num_of_server_instances. To begin with, we can start with this.

A more accurate value might be ~min(_numPartitions, _serverConfig.getProperty(MAX_PARALLEL_SEGMENT_BUILDS) * (num_of_server_instances - RF)).
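
For example, with hypothetical values MAX_PARALLEL_SEGMENT_BUILDS = 4, 10 server instances, a replication factor of 3, and 64 partitions, that works out to min(64, 4 * (10 - 3)) = 28.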

@Jackie-Jiang (Contributor) left a comment

LGTM otherwise

private final int _batchStatusCheckIntervalMs;
private final int _batchStatusCheckTimeoutMs;

private ForceCommitBatchConfig(Integer batchSize, Integer batchStatusCheckIntervalMs,
Contributor:

Add a @Nullable annotation for parameters that can be null. Same for other places.

Contributor (Author):

This method has int params only now

sendForceCommitMessageToServers(tableNameWithType, targetConsumingSegments);

List<Set<String>> segmentBatchList =
getSegmentBatchList(idealState, targetConsumingSegments, forceCommitBatchConfig.getBatchSize());
Contributor:

To reduce overhead, can we use the old way when batch size is non-positive?

Contributor:

Actually we can still keep a positive-only batch size, but first check whether batch size >= targetConsumingSegments.size() and fall back.

Comment on lines 173 to 175
@ApiParam(value = "Max number of consuming segments to commit at once (default = Integer.MAX_VALUE)")
@QueryParam("batchSize")
Integer batchSize,
Contributor:

We can directly put default in the API:

Suggested change
@ApiParam(value = "Max number of consuming segments to commit at once (default = Integer.MAX_VALUE)")
@QueryParam("batchSize")
Integer batchSize,
@ApiParam(value = "Max number of consuming segments to commit at once (default = Integer.MAX_VALUE)")
@QueryParam("batchSize") @DefaultValue(Integer.toString(Integer.MAX_VALUE))
int batchSize,

@noob-se7en (Contributor, Author) commented Jan 31, 2025

We can only put a compile-time constant here, hence this won't work.
Refactoring Integer to int and treating 0 as Integer.MAX_VALUE.
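
A minimal sketch of what that refactor could look like (hypothetical resource class and path, not necessarily the PR's exact code):

import javax.ws.rs.DefaultValue;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.QueryParam;

@Path("/tables")
public class ForceCommitResourceSketch {
  // Sketch only: @DefaultValue needs a compile-time constant string, so the param
  // defaults to "0" and 0 is then treated as Integer.MAX_VALUE (commit everything at once).
  @POST
  @Path("/{tableName}/forceCommit")
  public String forceCommit(@PathParam("tableName") String tableName,
      @QueryParam("batchSize") @DefaultValue("0") int batchSize) {
    int effectiveBatchSize = batchSize > 0 ? batchSize : Integer.MAX_VALUE;
    return "forceCommit scheduled with batchSize=" + effectiveBatchSize;
  }
}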

@noob-se7en (Contributor, Author)

The integration test seems to be flaky, and I am unable to reproduce it locally. We might have to revert that test in the future.

@Jackie-Jiang merged commit 6747ad0 into apache:master on Feb 1, 2025.
20 of 21 checks passed