try to acquire segmentLock before taking segment snapshot #14179

Merged: 5 commits, Oct 8, 2024

klsince (Contributor) commented Oct 7, 2024

This PR tries to fix a race condition between the consuming thread taking a snapshot for an upsert table and the Helix task thread replacing a segment.
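The idea in the PR title, trying to acquire the segment lock before taking a snapshot, might look roughly like the sketch below. This is not the code from the PR: `Segment`, `takeSnapshots`, and `demo` are hypothetical names, the per-segment lock is modeled with a plain `ReentrantLock`, and the snapshot itself is reduced to a placeholder. It only illustrates the pattern of skipping a segment whose lock is currently held by another thread, for example a thread that is replacing that segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class SnapshotWithSegmentLockSketch {

    // Hypothetical stand-in for a segment together with its per-segment lock.
    static final class Segment {
        final String name;
        final Lock lock = new ReentrantLock();

        Segment(String name) {
            this.name = name;
        }
    }

    // Snapshots only the segments whose lock can be acquired right now and
    // returns their names. Segments locked by another thread are skipped.
    static List<String> takeSnapshots(List<Segment> segments) {
        List<String> snapshotted = new ArrayList<>();
        for (Segment segment : segments) {
            if (!segment.lock.tryLock()) {
                continue; // lock is busy: leave this segment for a later round
            }
            try {
                // Placeholder for writing the snapshot file into the segment directory.
                snapshotted.add(segment.name);
            } finally {
                segment.lock.unlock();
            }
        }
        return snapshotted;
    }

    // Simulates a replacing thread that holds the lock of seg_b while snapshots are taken.
    static List<String> demo() throws InterruptedException {
        Segment a = new Segment("seg_a");
        Segment b = new Segment("seg_b");
        CountDownLatch locked = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread replacer = new Thread(() -> {
            b.lock.lock();
            try {
                locked.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                b.lock.unlock();
            }
        });
        replacer.start();
        locked.await(); // make sure seg_b is locked before snapshotting
        List<String> result = takeSnapshots(List.of(a, b));
        release.countDown();
        replacer.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints "[seg_a]"
    }
}
```

Using `tryLock()` rather than `lock()` means the snapshot thread never blocks behind a segment replacement; the trade-off is that a skipped segment has to be picked up again in a later snapshot round.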

When replacing a segment, FileUtils.deleteDirectory(indexDir) was called but failed with a DirectoryNotEmptyException. The stack trace looked like this:

Caused by: java.nio.file.DirectoryNotEmptyException:  <path to a segment directory>
        at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:289)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:104)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at org.apache.commons.io.FileUtils.delete(FileUtils.java:1222)
        at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1242)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.moveSegment(BaseTableDataManager.java:864)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegmentFromDeepStore(BaseTableDataManager.java:801)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadSegment(BaseTableDataManager.java:746)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.downloadAndLoadSegment(BaseTableDataManager.java:389)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.replaceSegmentIfCrcMismatch(BaseTableDataManager.java:380)
        at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.doAddOnlineSegment(RealtimeTableDataManager.java:424)
        at org.apache.pinot.core.data.manager.BaseTableDataManager.addOnlineSegment(BaseTableDataManager.java:313)
        at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addOnlineSegment(HelixInstanceDataManager.java:275)
        at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromConsuming(SegmentOnlineOfflineStateModelFactory.java:88)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)

The cause is a race condition: the consuming thread put a new snapshot file into the folder between the two major cleanup steps of the deleteDirectory() method shown below.

FileUtils.java:
    public static void deleteDirectory(final File directory) throws IOException {
        ...
        if (!isSymlink(directory)) {
            cleanDirectory(directory); // step 1: clean up files and subfolders in the folder
        }
        // The consuming thread might drop a new snapshot file into the folder here,
        // which makes the delete() call below fail.
        delete(directory); // step 2: remove the folder, which is supposed to be empty by now
    }
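The failure can be reproduced deterministically with plain java.nio, without Pinot or Commons IO. The sketch below is an illustration under assumptions: the class, method, and file names are made up, and the "consuming thread" is simulated by creating a file on the same thread at exactly the point between the two steps. On the usual Unix and Windows file systems, `Files.delete` on a non-empty directory throws `DirectoryNotEmptyException`, as in the stack trace above.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class DeleteDirectoryRaceDemo {

    // Step 1 of the two-step delete: remove everything inside the directory.
    // Handles only a flat directory, which is enough for this demo.
    static void cleanDirectory(Path dir) throws IOException {
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                Files.delete(child);
            }
        }
    }

    // Returns true if removing the directory failed because a file appeared
    // between the clean step and the final delete.
    static boolean deleteFailsWhenFileAppears() throws IOException {
        Path segmentDir = Files.createTempDirectory("segment");
        Files.createFile(segmentDir.resolve("index.data"));

        cleanDirectory(segmentDir);
        // Simulates the consuming thread writing a snapshot at the worst possible moment.
        Path snapshot = Files.createFile(segmentDir.resolve("snapshot.file"));
        try {
            Files.delete(segmentDir); // step 2: expects an empty directory
            return false;
        } catch (DirectoryNotEmptyException e) {
            return true;
        } finally {
            // Clean up the temporary files created by the demo.
            Files.deleteIfExists(snapshot);
            Files.deleteIfExists(segmentDir);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("delete failed: " + deleteFailsWhenFileAppears());
    }
}
```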

Jackie-Jiang (Contributor) left a comment

LGTM otherwise

codecov-commenter commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 77.41935% with 7 lines in your changes missing coverage. Please review.

Project coverage is 63.95%. Comparing base (59551e4) to head (baf7243).
Report is 1150 commits behind head on master.

Files with missing lines | Patch % | Lines
...cal/upsert/BasePartitionUpsertMetadataManager.java | 77.41% | 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14179      +/-   ##
============================================
+ Coverage     61.75%   63.95%   +2.19%     
- Complexity      207     1536    +1329     
============================================
  Files          2436     2621     +185     
  Lines        133233   144103   +10870     
  Branches      20636    22039    +1403     
============================================
+ Hits          82274    92154    +9880     
- Misses        44911    45141     +230     
- Partials       6048     6808     +760     
Flag | Coverage Δ
custom-integration1 | 100.00% <ø> (+99.99%) ⬆️
integration | 100.00% <ø> (+99.99%) ⬆️
integration1 | 100.00% <ø> (+99.99%) ⬆️
integration2 | 0.00% <ø> (ø)
java-11 | 63.91% <77.41%> (+2.21%) ⬆️
java-21 | 63.75% <77.41%> (+2.13%) ⬆️
skip-bytebuffers-false | 63.93% <77.41%> (+2.18%) ⬆️
skip-bytebuffers-true | 63.70% <77.41%> (+35.98%) ⬆️
temurin | 63.95% <77.41%> (+2.19%) ⬆️
unittests | 63.94% <77.41%> (+2.19%) ⬆️
unittests1 | 55.53% <3.22%> (+8.63%) ⬆️
unittests2 | 34.46% <77.41%> (+6.73%) ⬆️


tibrewalpratik17 (Contributor) left a comment

Lgtm!

}
}
_updatedSegmentsSinceLastSnapshot.clear();
Contributor

Is there a guarantee that all segments are cleared here? It seems we don't remove segments when they are deleted.

Contributor Author

Good catch. To keep it simple, I'll call retainAll(_trackedSegments) here to avoid keeping removed segments around.
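As a rough illustration of what retainAll does in this situation, the sketch below uses plain sets of segment names. The class and method names are hypothetical, and the real fields in the PR may well hold segment objects rather than strings; only the standard `Set.retainAll` behavior is shown, namely that entries for segments which are no longer tracked get dropped while entries for still-tracked segments stay.

```java
import java.util.HashSet;
import java.util.Set;

class RetainTrackedSegmentsSketch {

    // Drops entries for segments that are no longer tracked and returns the
    // same (mutated) set, so stale segments do not accumulate in it.
    static Set<String> pruneRemoved(Set<String> updatedSinceLastSnapshot, Set<String> trackedSegments) {
        updatedSinceLastSnapshot.retainAll(trackedSegments);
        return updatedSinceLastSnapshot;
    }

    public static void main(String[] args) {
        Set<String> updated = new HashSet<>(Set.of("seg_1", "seg_2", "seg_3"));
        Set<String> tracked = Set.of("seg_1", "seg_3"); // seg_2 has been deleted
        Set<String> remaining = pruneRemoved(updated, tracked);
        System.out.println(remaining.contains("seg_2")); // prints "false"
    }
}
```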

@klsince klsince force-pushed the persist_snapshot_race_condition branch from 6530186 to 1eeab6f Compare October 8, 2024 05:31
@klsince klsince merged commit 2a443ee into apache:master Oct 8, 2024
21 checks passed