Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

try to acquire segmentLock before taking segment snapshot #14179

Merged
merged 5 commits into from
Oct 8, 2024
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,6 @@
import javax.annotation.Nullable;
import javax.annotation.concurrent.ThreadSafe;
import org.apache.helix.HelixManager;
import org.apache.pinot.common.Utils;
import org.apache.pinot.common.metrics.ServerGauge;
import org.apache.pinot.common.metrics.ServerMeter;
import org.apache.pinot.common.metrics.ServerMetrics;
Expand Down Expand Up @@ -879,6 +878,8 @@ protected void doTakeSnapshot() {
// segments. Because the valid docs as tracked by the existing validDocIds snapshots can only get less. That no
// overlap of valid docs among segments with snapshots is required by the preloading to work correctly.
Set<ImmutableSegmentImpl> segmentsWithoutSnapshot = new HashSet<>();
TableDataManager tableDataManager = _context.getTableDataManager();
boolean skippedSegment = false;
for (IndexSegment segment : _trackedSegments) {
if (!(segment instanceof ImmutableSegmentImpl)) {
numConsumingSegments++;
Expand All @@ -889,31 +890,70 @@ protected void doTakeSnapshot() {
numUnchangedSegments++;
continue;
}
// Try to acquire the segmentLock when taking snapshot for the segment because the segment directory can be
// modified, e.g. a new snapshot file can be added to the directory. If not taking the lock, the Helix task
// thread replacing the segment could fail. For example, we found FileUtils.cleanDirectory() failed due to
// DirectoryNotEmptyException because a new snapshot file got added into the segment directory just between two
// major cleanup steps in the cleanDirectory() method.
String segmentName = segment.getSegmentName();
Lock segmentLock = tableDataManager.getSegmentLock(segmentName);
boolean locked = segmentLock.tryLock();
if (!locked) {
// Try to get the segmentLock in a non-blocking manner to avoid deadlock. The Helix task thread takes
// segmentLock first and then the snapshot RLock when replacing a segment. However, the consuming thread has
// already acquired the snapshot WLock when reaching here, and if it has to wait for segmentLock, it may
// enter deadlock with the Helix task threads waiting for snapshot RLock.
// If we can't get the segmentLock, we'd better skip taking snapshot for the tracked segment. Because the
// validDocIds of the tracked segment might be wrong for the new segment. It's better to have no snapshot
// file on disk than having a wrong one.
// The numMissedSegments metrics below can help monitor such cases.
_logger.warn("Could not get segmentLock to take snapshot for segment: {}, skipping", segmentName);
skippedSegment = true;
continue;
klsince marked this conversation as resolved.
Show resolved Hide resolved
}
try {
ImmutableSegmentImpl immutableSegment = (ImmutableSegmentImpl) segment;
if (!immutableSegment.hasValidDocIdsSnapshotFile()) {
segmentsWithoutSnapshot.add(immutableSegment);
continue;
}
immutableSegment.persistValidDocIdsSnapshot();
_updatedSegmentsSinceLastSnapshot.remove(segment);
numImmutableSegments++;
numPrimaryKeysInSnapshot += immutableSegment.getValidDocIds().getMutableRoaringBitmap().getCardinality();
} catch (Exception e) {
_logger.warn("Caught exception while taking snapshot for segment: {}, skipping", segment.getSegmentName(), e);
Utils.rethrowException(e);
_logger.warn("Caught exception while taking snapshot for segment: {}, skipping", segmentName, e);
klsince marked this conversation as resolved.
Show resolved Hide resolved
skippedSegment = true;
} finally {
segmentLock.unlock();
}
}
for (ImmutableSegmentImpl segment : segmentsWithoutSnapshot) {
try {
segment.persistValidDocIdsSnapshot();
numImmutableSegments++;
numPrimaryKeysInSnapshot += segment.getValidDocIds().getMutableRoaringBitmap().getCardinality();
} catch (Exception e) {
_logger.warn("Caught exception while taking snapshot for segment: {}, skipping", segment.getSegmentName(), e);
Utils.rethrowException(e);
// If we skipped any segments with existing snapshot files in the for-loop above, then we should not take snapshots
// for segments w/o snapshot files yet, i.e. skip the next for-loop to not add new snapshot files on disk. This
// ensures that the validDocIds snapshots kept on disk are all disjoint as to the PKs tracked by them, in order for
// segment preloading to work correctly.
if (!skippedSegment) {
for (ImmutableSegmentImpl segment : segmentsWithoutSnapshot) {
String segmentName = segment.getSegmentName();
Lock segmentLock = tableDataManager.getSegmentLock(segmentName);
boolean locked = segmentLock.tryLock();
if (!locked) {
_logger.warn("Could not get segmentLock to take snapshot for segment: {} w/o snapshot, skipping",
segmentName);
continue;
}
try {
segment.persistValidDocIdsSnapshot();
_updatedSegmentsSinceLastSnapshot.remove(segment);
numImmutableSegments++;
numPrimaryKeysInSnapshot += segment.getValidDocIds().getMutableRoaringBitmap().getCardinality();
} catch (Exception e) {
_logger.warn("Caught exception while taking snapshot for segment: {} w/o snapshot, skipping", segmentName, e);
} finally {
segmentLock.unlock();
}
}
}
_updatedSegmentsSinceLastSnapshot.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there guarantee that all segments are cleared here? Seems we don't remove segments when they are deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. To be simple, I'll do retainAll(_trackedSegments) here to avoid keeping removed segments around

// Persist TTL watermark after taking snapshots if TTL is enabled, so that segments out of TTL can be loaded with
// updated validDocIds bitmaps. If the TTL watermark is persisted first, segments out of TTL may get loaded with
// stale bitmaps or even no bitmap snapshots to use.
Expand Down
Loading