Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel]Cache the crc info for a snapshot if its version's checksum file is read #4113

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

huan233usc
Copy link
Collaborator

@huan233usc huan233usc commented Feb 1, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Persist CRCInfo for a logreplay.java if its version's checksum file is read. This will be used for calculate the new CRC info for a new commit. No actual functional changes to loading crc.

Follow up PR will be #4116 which introduces crc_simple post commit action to write CRC file

How was this patch tested?

Unit test

Does this PR introduce any user-facing changes?

No

@huan233usc huan233usc marked this pull request as ready for review February 3, 2025 01:54
@huan233usc huan233usc self-assigned this Feb 3, 2025
Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Requested some changes

// We found a CRCInfo of a version (a) older than the one we are looking for (snapshotVersion)
// but (b) newer than the current hint. Use this CRCInfo to create a new hint.
return new Tuple2<>(
Optional.of(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have duplicate creations of snapshot hints from the CRC Info. I think you could make a SnapshotHint::fromCrcInfo static helper in SnapshotHint.java

@@ -392,4 +372,57 @@ private Map<String, DomainMetadata> loadDomainMetadataMap(Engine engine) {
throw new UncheckedIOException("Could not close iterator", ex);
}
}

// Calculates the latest snapshot hint before current snapshot version, returns the CRCInfo if
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts:

  1. Please use /** */ for multi-line methoddocs, no //

  2. Calculates the latest snapshot hint before current snapshot version --> below, you also return it if snapshotHint.version == snapshotVersion. So I don't think "before" is correct, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to "before or at the current snapshot version". Thanks for catching this boundary case.

s"INSERT INTO delta.`$tablePath` SELECT 1L as id"
)
assert(
Files.deleteIfExists(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth making a helper function that deletes CRC at a given version in a given table path?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could probably add it to TestUtils

@@ -17,18 +17,21 @@ package io.delta.kernel.defaults

import java.io.File
import java.util.Optional

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: undo these deletes plz

@@ -384,7 +384,7 @@ class LogReplayEngineMetricsSuite extends AnyFunSuite with TestUtils {
// Tests for loading P & M through checksums files //
/////////////////////////////////////////////////////////////////////////////////////////////////

Seq(-1L, 3L, 4L).foreach { version => // -1 means latest version
Seq(-1L, 0L, 3L, 4L).foreach { version => // -1 means latest version
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@scottsand-db scottsand-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please implement the one small change I've asked for, but then LGTM!

Optional<CRCInfo> crcInfoOpt =
ChecksumReader.getCRCInfo(
engine, logSegment.getLogPath(), snapshotVersion, crcSearchLowerBound);
if (crcInfoOpt.isPresent()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's reduce indentation by checking and exiting earlier.

if (!crcInfoOpt.isPresent()) {
  return new Tuple2<>(snapshotHint, Optional.empty());
}

// the rest of the code, but now it is not indented!

@@ -165,6 +172,11 @@ public long getVersion() {
return logSegment.getVersion();
}

/** Returns the crc info for the current snapshot if the checksum file is read */
public Optional<CRCInfo> getCurrentCrcInfo() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason that we only save it if it's the current version? in the future won't we also want to still save the info for an earlier one? (to pass along as much info as possible)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants