[Kernel] Load the protocol and metadata from the CRC files when available #4077

Merged: 62 commits into delta-io:master on Jan 31, 2025

Conversation

huan233usc
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Cherry-pick the CRC loading code from the "kernel-20250115-crc-optimization" branch into the master branch.
PR created using git cherry-pick 7d32f66d9efd1dca41ed2a45cf259525cbdd8952

How was this patch tested?

Does this PR introduce any user-facing changes?

No

… in DeltaLog

Initial read current version CRC w test to verify

wip

end-2-end working + tests

Co-authored-by:  Allison Portis <[email protected]>
Co-authored-by: Venki Korukanti <[email protected]>
Collaborator

@scottsand-db scottsand-db left a comment

Looks great! Left some comments.

ChecksumReader.getVersionStats(
    engine, logSegment.logPath, snapshotVersion, crcSearchLowerBound);
if (versionStatsOpt.isPresent()) {
  // We found the protocol and metadata for the version we are looking for
Collaborator

Can we add a check that versionStatsOpt.get.version is > (or >=, whichever is correct) the snapshot hint?
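
The check requested here could look roughly like the sketch below. It is only an illustration: VersionStats is a hypothetical stand-in with a single version field, usableStats is an invented helper name, and the choice of >= (rather than >) against the hint is an assumption, since the comment leaves that open.

```java
import java.util.Optional;

/** Hypothetical stand-in for the stats read from a .crc file; only the version matters here. */
final class VersionStats {
  final long version;

  VersionStats(long version) {
    this.version = version;
  }
}

final class CrcVersionCheck {
  private CrcVersionCheck() {}

  /**
   * Keeps the CRC stats only when they are present, not older than the snapshot hint,
   * and not newer than the snapshot being loaded.
   * Assumption: ">=" against the hint; the review comment leaves ">" vs ">=" undecided.
   */
  static Optional<VersionStats> usableStats(
      Optional<VersionStats> versionStatsOpt, long hintVersion, long snapshotVersion) {
    return versionStatsOpt.filter(
        stats -> stats.version >= hintVersion && stats.version <= snapshotVersion);
  }
}
```

With this shape, a CRC that is older than the hint is simply treated as absent, so the caller can fall back to the hint or to log replay.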

Comment on lines 72 to 77
try (CloseableIterator<FileStatus> crcFiles =
    engine.getFileSystemClient().listFrom(lowerBoundFilePath.toString())) {
  List<FileStatus> crcFilesList = new ArrayList<>();
  crcFiles
      .filter(file -> isChecksumFile(new Path(file.getPath())))
      .forEachRemaining(crcFilesList::add);
Collaborator

@scottsand-db I wonder if this can be refactored to use any of the log listing methods in your PR?
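
For readers following the listing snippet quoted above, here is a minimal, self-contained sketch of the underlying idea: given the file names returned by a listing, keep only checksum files and pick the latest version inside a search window. It works on plain strings rather than Kernel's FileStatus, and ChecksumListingSketch, checksumVersion, and latestChecksumVersion are hypothetical names, not methods from this PR. It assumes checksum files are named with a 20-digit zero-padded version followed by .crc.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChecksumListingSketch {
  // Assumed naming scheme: <20-digit zero-padded version>.crc
  private static final Pattern CRC_NAME = Pattern.compile("(\\d{20})\\.crc");

  private ChecksumListingSketch() {}

  /** Returns the version encoded in a checksum file name, or empty for any other name. */
  static OptionalLong checksumVersion(String fileName) {
    Matcher m = CRC_NAME.matcher(fileName);
    if (!m.matches()) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(m.group(1)));
    } catch (NumberFormatException e) {
      // 20 digits can exceed the range of long; treat such names as non-checksum files.
      return OptionalLong.empty();
    }
  }

  /** Latest checksum version within [lowerBound, upperBound], if any file qualifies. */
  static OptionalLong latestChecksumVersion(
      List<String> fileNames, long lowerBound, long upperBound) {
    return fileNames.stream()
        .map(ChecksumListingSketch::checksumVersion)
        .filter(OptionalLong::isPresent)
        .mapToLong(OptionalLong::getAsLong)
        .filter(v -> v >= lowerBound && v <= upperBound)
        .max();
  }
}
```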

Collaborator

@scottsand-db scottsand-db left a comment

LGTM!

Comment on lines 172 to 177
public static <T> List<T> toFilteredList(
    CloseableIterator<T> iterator, Function<T, Boolean> filter) {
  List<T> result = new ArrayList<>();
  iterator.filter(filter).forEachRemaining(result::add);
  return result;
}
Collaborator

Sorry, nit: I think it's easier if this is just toList and we do the iterator filtering in the outer call. That way we can reuse it even when we don't want to filter,

i.e. toList(iterator.filter(filter))
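
A rough sketch of that suggestion follows. It uses java.util.Iterator as a stand-in for Kernel's CloseableIterator so that it compiles on its own, which means closing the iterator is not shown; IteratorLists and checksumNames are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class IteratorLists {
  private IteratorLists() {}

  /** Drains the iterator into a new list; no filtering happens here. */
  static <T> List<T> toList(Iterator<T> iterator) {
    List<T> result = new ArrayList<>();
    iterator.forEachRemaining(result::add);
    return result;
  }

  /** Example caller: the filter is applied before the iterator is handed to toList. */
  static List<String> checksumNames(List<String> fileNames) {
    return toList(fileNames.stream().filter(name -> name.endsWith(".crc")).iterator());
  }
}
```

Keeping the helper filter-free means the same method serves both the filtered and unfiltered cases, which is the reuse argument made in the comment.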

Collaborator

@allisonport-db allisonport-db left a comment

Just a few minor remaining comments then LGTM

@scottsand-db scottsand-db self-requested a review January 28, 2025 21:51
Collaborator

@scottsand-db scottsand-db left a comment

Looks great! Unfortunately, takeWhile definitely needs to be its own PR.

@scottsand-db scottsand-db self-requested a review January 30, 2025 21:11
@scottsand-db scottsand-db merged commit 92a8a22 into delta-io:master Jan 31, 2025
19 checks passed
4 participants