feat: v2 fast rebuilding #278

shuo-wu · 2024-12-20T09:04:16Z

Which issue(s) this PR fixes:

The related docs that can help understand this feature:
V2 Data Engine Fast Rebuilding diagram flow
V2 Data Engine Fast Rebuilding illustration and animation

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai · 2024-12-20T09:04:27Z

Note

Reviews paused

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This pull request introduces several modifications to the SPDK (Storage Performance Development Kit) related code, focusing on updates to the Dockerfile.dapper, simplifications in the SPDKClient interface, enhancements in error handling and logging within the Engine struct, and improvements to the Replica management logic. Changes span multiple files, including client and server implementations, type definitions, and test scripts, aimed at refining the functionality and usability of the SPDK system.

Changes

File	Change Summary
`Dockerfile.dapper`	Updated `SPDK_COMMIT_ID` from `002cd0679259c4e4b782d9540441a46e1228d484` to `a7421a6e59f1d099294af6f65d73ebec4afebfc5` and `GO_SPDK_HELPER_COMMIT_ID` from `6a324e95979662be592bfda0e867d2678ecbc756` to `e50a23ddfd64ebc267064e5aa07e0c3f9366ffbc`.
`pkg/client/client.go`	- Removed `ReplicaRebuildingSrcAttach` and `ReplicaRebuildingSrcDetach` methods. - Modified `ReplicaRebuildingSrcShallowCopyStart` to include `dstRebuildingLvolAddress` and changed return type to `error`. - Simplified `ReplicaRebuildingSrcShallowCopyCheck` by removing `shallowCopyOpId` parameter.
`pkg/spdk/engine.go`	- Enhanced error handling in `ReplicaAdd` and updated logging. - Improved shallow copy management in `replicaShallowCopy` with timer and ticker.
`pkg/spdk/replica.go`	- Added `SnapshotChecksumEnabled` and `constructRequired` fields to `Replica` struct. - Updated shallow copy methods to handle new logic for rebuilding operations.
`pkg/spdk/server.go`	- Removed `ReplicaRebuildingSrcAttach` and `ReplicaRebuildingSrcDetach` methods. - Modified `ReplicaRebuildingSrcShallowCopyStart` to return an empty response and added `DstRebuildingLvolAddress` parameter. - Added validation for `EngineReplicaAdd`.
`pkg/spdk/types.go`	- Added constants: `ReplicaExpiredLvolSuffix`, `MaxShallowCopyWaitTime`, `ShallowCopyCheckInterval`. - Added functions: `GenerateReplicaExpiredLvolName`, `IsReplicaExpiredLvol`. - Updated `Lvol` struct with `SnapshotChecksum` field.
`pkg/spdk_test.go`	- Added `TestSPDKMultipleThreadFastRebuilding` test. - Updated `defaultTestSnapChecksumWaitInterval` and `defaultTestSnapChecksumWaitCount`. - Modified `revertSnapshot` and `WaitForReplicaRebuildingComplete` methods.
`scripts/test`	Increased test timeout from `30m` to `40m`.
`pkg/api/types.go`	Added `SnapshotChecksum` field to `Lvol` struct and updated conversion functions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server
    participant Engine
    participant Replica

    Client->>Server: ReplicaRebuildingSrcShallowCopyStart
    Server->>Engine: Initiate Shallow Copy
    Engine->>Replica: Prepare Rebuilding Lvol
    Replica-->>Engine: Lvol Prepared
    Engine->>Replica: Start Shallow Copy
    Replica-->>Engine: Shallow Copy Status
    Engine-->>Server: Shallow Copy Progress
    Server-->>Client: Operation Complete

This sequence diagram illustrates the high-level flow of the replica rebuilding shallow copy process, highlighting the interactions between the client, server, engine, and replica components.

Possibly related PRs

fix: reconstruct ActiveChain and SnapshotLvolMap while creating replica #284: The changes in this PR involve modifications to the Replica struct and its methods in pkg/spdk/replica.go, which are related to the management of replicas, including the construction of the SnapshotLvolMap. This is relevant to the main PR's focus on updating environment variables in the Dockerfile, as both involve aspects of the SPDK system's configuration and management.

Suggested reviewers

ChanYiLin
c3y1huang

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

codecov · 2024-12-20T09:17:53Z

Codecov Report

Attention: Patch coverage is 2.11640% with 555 lines in your changes missing coverage. Please review.

Project coverage is 8.69%. Comparing base (7081bda) to head (7844558).
Report is 18 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/spdk/replica.go	0.00%	477 Missing ⚠️
pkg/spdk/engine.go	0.00%	57 Missing ⚠️
pkg/spdk/server.go	0.00%	10 Missing ⚠️
pkg/spdk/types.go	0.00%	6 Missing ⚠️
pkg/client/client.go	66.66%	4 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##            main    #278      +/-   ##
========================================
- Coverage   9.17%   8.69%   -0.48%     
========================================
  Files         23      23              
  Lines       8536    8713     +177     
========================================
- Hits         783     758      -25     
- Misses      7646    7856     +210     
+ Partials     107      99       -8

Flag	Coverage Δ
unittests	`8.69% <2.11%> (-0.48%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (14)

pkg/client/client.go (3)

266-266: Add more context to error messages.

This error is clear about the missing parameter. Consider augmenting the message with the function name or caller context for quicker troubleshooting.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests

284-285: Add concurrency notes in docstring.

The new docstring is good. Consider clarifying if the method is safe to call concurrently from multiple goroutines, or if it requires external synchronization.

298-300: Pass-through request parameters.

All request fields are properly assigned. When refactoring in future, consider centralizing these parameter assignments to reduce duplication.

pkg/spdk/server.go (2)

717-717: Handle the error gracefully.

If r.RebuildingSrcShallowCopyStart fails, consider adding logs with details about the replica's current state to diagnose potential concurrency or environment issues.

736-736: Possible optimization.

When reading the shallow copy status, consider caching results or throttling requests to avoid excessive gRPC calls.

pkg/spdk/replica.go (6)

262-264: Consider more logging.

When the state is "Stopped" but constructRequired is set, you skip here. A debug log might help diagnose scenarios in which this branch triggers unexpectedly.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests

266-266: Add context to error message.

The message does not mention whether the state is invalid or if the logic disallows transitions. More detail could help debugging if user logs are minimal.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests

561-562: Document fallback logic.

Lines 561-562 handle the case where the chain might require removal of the nil head. Document the reasoning to help future maintainers.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests

570-570: Add error handling.

If logging for “Replica cloned a new head lvol” fails or returns partial success, consider capturing it in debug logs for later.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests

575-575: Provide meaningful logs.

This line logs “Replica created a new head lvol.” Possibly add parent lvol alias in the log to clarify the relationship.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests

655-655: Handle error more gracefully.

A user might attempt unexpected revert chains. Log or handle partial rest states if getRootLvolName is missing.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by tests

pkg/spdk/engine.go (3)

1166-1166: Ensure correct client cleanup.
Good practice to log any close failures. In some scenarios, a retry mechanism might be beneficial.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests

1196-1198: Initiator resume failure handling.
Properly wrapping errors is good. Consider adding a fallback or re-try logic for cases where resume fails, if possible.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests

1344-1351: Finalizing shallow copy.
The code gracefully logs partial progress if it's incomplete. Ensure any partial state is cleaned up if needed.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c84ca84 and 8c4f187.

⛔ Files ignored due to path filters (5)

go.mod is excluded by !go.mod
go.sum is excluded by !**/*.sum, !go.sum
vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk_grpc.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
vendor/modules.txt is excluded by !vendor/**

📒 Files selected for processing (8)

Dockerfile.dapper (1 hunks)
pkg/client/client.go (1 hunks)
pkg/spdk/engine.go (4 hunks)
pkg/spdk/replica.go (26 hunks)
pkg/spdk/server.go (4 hunks)
pkg/spdk/types.go (4 hunks)
pkg/spdk_test.go (5 hunks)
scripts/test (1 hunks)

🧰 Additional context used

🪛 GitHub Check: codecov/patch

pkg/client/client.go

[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests

[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests

[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests

pkg/spdk/engine.go

[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests

[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests

[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests

[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests

[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests

[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests

[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests

[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests

[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests

[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests

[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests

[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests

[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests

[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests

pkg/spdk/replica.go

[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests

[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests

[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests

[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests

[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests

[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests

[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests

[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests

[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by tests

[warning] 769-778: pkg/spdk/replica.go#L769-L778
Added lines #L769 - L778 were not covered by tests

[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests

[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests

[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests

🪛 GitHub Check: CodeFactor

pkg/spdk/replica.go

[notice] 1435-1518: pkg/spdk/replica.go#L1435-L1518
Complex Method

[notice] 1466-1466: pkg/spdk/replica.go#L1466
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

[notice] 1929-2066: pkg/spdk/replica.go#L1929-L2066
Complex Method

🔇 Additional comments (31)

pkg/client/client.go (3)

261-264: Ensure consistent parameter error handling.

These checks correctly enforce required parameters for the shallow copy start method. Make sure all error messages are consistently phrased to help with debugging.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests

273-276: Validate call arguments.

This logic constructs the gRPC request to the server. All parameters are consistently used. Make sure to handle potential client-side timeouts or partial failures.

279-281: Error wrapping is good.

Using errors.Wrapf provides clear traceability. Ensure you’ve tested large snapshots or long snapshot names to confirm the error message does not become truncated in logs.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests

pkg/spdk/server.go (3)

693-702: Validate request parameters.

These lines correctly verify that the request has valid fields. Check any logs or monitoring to see if more context is needed when logging invalid arguments.

713-713: Check for concurrency issues.

This method call might happen concurrently with other operations on the same replica. Ensure there's adequate locking or checks to avoid races.

1088-1090: Validate arguments for EngineReplicaAdd.

It's good to ensure both replicaName and replicaAddress are present. If either is missing, an error is returned early. This is consistent with the rest of the codebase.

pkg/spdk_test.go (1)

1203-1583: Comprehensive multi-thread test coverage.

This new test provides a thorough scenario exercising fast rebuilding with multiple replicas and snapshots. Consider the following for further improvement:
• Test random node failures or abrupt restarts to ensure resilience.
• Add short descriptive logging or sub-tests to pinpoint any failing scenario quickly.

pkg/spdk/replica.go (7)

72-74: Track constructRequired carefully.

Marking a replica as requiring construction when it errors out can help with recovery logic. Ensure code paths that toggle constructRequired handle concurrency properly.

545-559: Clean separation of chain logic.

Creating or cloning the head from the parent lvol is well structured. Confirm you have robust test coverage for nil parent or mismatch in SpecSize.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests

565-566: Perform size validation.

Here, you resize if parent size and spec size differ. Make sure the parent is always a valid reference to avoid unexpected resizing.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests

651-653: Check snapshot references.

Distinguishing between parent snapshots and the backing image is correct. Ensure references to root are consistent if a new snapshot is introduced at run time.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests

846-848: Track state transitions.

When previous state is error and you set constructRequired, consider whether an immediate re-construct attempt is desired or if a backoff is beneficial.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests

878-878: Graceful concurrency check.

Immediately stopping or cleaning up the source building cache can race other code calls. Evaluate an atomic operation to avoid incomplete states.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests

906-906: Double-check logic for partial leftover lvols.

If leftover snapshots remain from partial merges, ensure they don’t hamper future new replicas. Additional log statements might help.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests

scripts/test (1)

61-61: Extended test timeout is beneficial.

Increasing the timeout to 40 minutes allows more comprehensive tests to complete successfully, especially with complex slow I/O scenarios. Ensure CI infrastructure can handle this longer duration.

pkg/spdk/types.go (4)

26-26: Looks good for new suffix usage.
The new "expired" suffix is clear and consistent with existing suffix conventions.

38-39: Reasonable default intervals.
Setting a maximum wait time of 72 hours and a 3-second check interval seems balanced for long-running copy processes. Confirm these values are suitable for all operational scenarios.

200-202: Appropriate name generator.
Clearly constructs a unique LVOL name for expired replicas using a short UUID. Ensure collisions are handled upstream if there's any possibility of reusing suffixes.

212-214: Simple check for expired suffix.
Implementation is straightforward and aligns with previously mentioned naming conventions.

pkg/spdk/engine.go (11)

1127-1129: Handle previous mode tracking carefully.
Maintaining the previous mode in a local variable is a good approach for logging rollback scenarios. Consider verifying if there's any edge case (e.g., uninitialized map or missing key).

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests

1131-1136: Assigning ERR mode with address.
Setting a replica to ERR mode while preserving its address helps for debugging. Consider if partial transitions might require special handling.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests

1138-1139: Comprehensive logging.
Attaching both the entire replica status map and the error ensures thorough observability.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests

1169-1169: Symmetry in client closure logging.
Matches the pattern used for source replica.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests

1175-1175: Warn vs Error logging.
Currently using Errorf for shallow copy failure. This is appropriate if it indicates a critical state that blocks normal operations.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests

1177-1178: Error logging reason.
Clear enough messages to differentiate between “replica error” vs “engine error.”

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests

1181-1182: Finish step cleanup.
Makes sense to attempt finishing even on errors, to avoid resource leakage.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests

1303-1304: Log message clarity.
This readies the logs with enough details to know source and destination.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests

1311-1315: Timer and ticker approach.
Combining a hard cap with periodic checks is a practical method to limit the shallow copy duration.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests

1324-1338: Robust retry logic.
Incrementing a local counter and checking it against maxNumRetries helps avoid indefinite retries. Ensure maxNumRetries is large enough for potential slow I/O conditions.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests

1340-1342: Error path exit.
Immediate exit on copying error is correct. Potentially store partial progress for diagnostics.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests

Dockerfile.dapper (1)

20-20: Updating commit ID for go-spdk-helper.
This likely addresses compatibility with the new replica rebuilding flow. Consider verifying the updated commit for any backward-incompatible changes.

pkg/spdk/replica.go

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (7)

pkg/spdk/types.go (2)

26-26: Additional constants
Defining ReplicaExpiredLvolSuffix, MaxShallowCopyWaitTime, and ShallowCopyCheckInterval is clear. Consider adding docstrings so users know how these values are derived.

Also applies to: 37-39

203-205: Short UUID usage
Truncating the UUID to 8 characters can cause potential collisions in longer-running systems. Consider using a larger substring or a distinct suffix strategy.

pkg/spdk_test.go (1)

1205-1588: Comprehensive multi-thread fast rebuilding test
This new test suite thoroughly exercises snapshot creation, reverts, and concurrent rebuild scenarios. Consider adding negative tests (e.g., verifying behavior under insufficient disk space) and capturing logs for deeper debugging.
pkg/spdk/replica.go (4)
236-253: Consider using a goroutine pool for checksum registration.

The TODO comment suggests using a goroutine pool, which is a good optimization. Currently, each snapshot checksum registration spawns a new goroutine, which could lead to resource exhaustion with many snapshots.

Would you like me to propose an implementation using a worker pool pattern?

1467-1527: Consider breaking down the complex shallow copy logic.

The method RebuildingSrcShallowCopyStart is quite complex with nested defers and goroutine management. Consider extracting the goroutine management into a separate method for better maintainability.

Example refactor:
func (r *Replica) RebuildingSrcShallowCopyStart(...) (err error) {
    // ... existing locking code ...
    
    if err = r.rebuildingSrcDetachNoLock(spdkClient); err != nil {
        return err
    }
    if err = r.rebuildingSrcAttachNoLock(spdkClient); err != nil {
        return err
    }
    
+   return r.startShallowCopyAndWait(spdkClient, snapshotName)
}

+func (r *Replica) startShallowCopyAndWait(spdkClient *spdkclient.Client, snapshotName string) error {
+   // Extract goroutine management logic here
}
🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

1496-1499: Remove unnecessary break statement.

The break statement after setting stopWaiting is redundant as the loop will exit naturally when stopWaiting is checked.
case <-timer.C:
    r.log.Errorf("Timeout waiting for the src replica %s shallow copy %v complete before detaching the rebuilding lvol of the dst replica %s, will give up", r.Name, shallowCopyOpID, r.rebuildingSrcCache.dstReplicaName)
    stopWaiting = true
-   break
🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

1961-2096: Consider splitting the complex rebuilding preparation logic.

The method rebuildingDstShallowCopyPrepare handles multiple scenarios and has deep nesting. Consider splitting it into smaller, focused methods:

Handling existing snapshot lvols

Managing expired lvols

Creating new lvols

Would you like me to propose a refactored structure that improves readability while maintaining the same functionality?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4f187 and 0b0258a.

📒 Files selected for processing (9)

Dockerfile.dapper (1 hunks)
pkg/api/types.go (3 hunks)
pkg/client/client.go (1 hunks)
pkg/spdk/engine.go (4 hunks)
pkg/spdk/replica.go (30 hunks)
pkg/spdk/server.go (4 hunks)
pkg/spdk/types.go (7 hunks)
pkg/spdk_test.go (7 hunks)
scripts/test (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

scripts/test
Dockerfile.dapper
pkg/spdk/engine.go

🧰 Additional context used

🪛 golangci-lint (1.62.2)

pkg/spdk/types.go

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)

🪛 GitHub Check: CodeFactor

pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🔇 Additional comments (15)

pkg/client/client.go (4)

261-266: Add parameter validation for consistency
You've introduced a new required parameter, dstRebuildingLvolAddress. The checks for empty strings look good, but consider documenting valid formats or expected address patterns to help users avoid passing malformed values.

273-281: Good usage of error wrapping and final return
The flow for invoking the RPC and returning either a wrapped error or nil is consistent. This is an excellent pattern for error tracing.

284-285: Accurate inline documentation
The docstring now correctly matches the adjusted parameters and behavior for the shallow copy check.

298-300: Confirm removal of shallow copy op ID usage
With the old shallowCopyOpId removed, verify that no references remain in the codebase and all callers rely solely on the three parameters.

pkg/api/types.go (2)

43-43: New field addition looks appropriate
Adding SnapshotChecksum to the Lvol struct helps track data integrity. Ensure it is assigned consistently in all code paths.

60-60: Propagation of SnapshotChecksum
The lines correctly serialize and deserialize SnapshotChecksum. This keeps the in-memory representation in sync with the Proto definitions.

Also applies to: 78-78

pkg/spdk/types.go (4)

79-79: Maintaining consistency with the API
The SnapshotChecksum field in Lvol aligns well with the API structures. Good job on consistent naming.

121-121: Snapshot checksum annotation
Assigning the field here is coherent with the rest of the code for consistent marshalling.

🧰 Tools

🪛 golangci-lint (1.62.2)

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)

155-155: Validate existence of xattr
Ensure you handle the case where SnapshotChecksum might not exist in Xattrs. Otherwise, key lookups can result in empty strings or unexpected outcomes.

215-217: Simple prefix check
The naming convention with "expired" is straightforward. This function is simple and works well given the suffix-based approach.

pkg/spdk/server.go (3)

693-693: Check for mandatory address parameter
The new logic verifying DstRebuildingLvolAddress is essential. Returning gRPC InvalidArgument is consistent with existing error handling patterns. The follow-up call and final return are appropriately structured.

Also applies to: 700-702, 713-713, 717-717

736-736: Removed shallow copy operation ID
Invoking r.RebuildingSrcShallowCopyCheck using only SnapshotName aligns with the updated interface. This simplifies the check logic.

1088-1090: Stronger validation in EngineReplicaAdd
Ensuring both ReplicaName and ReplicaAddress are provided helps prevent incomplete engine states and reduces potential runtime errors.

pkg/spdk/replica.go (2)

70-74: LGTM: New fields for snapshot checksum and rebuilding.

The new fields enhance the functionality:

SnapshotChecksumEnabled provides control over snapshot checksum registration
constructRequired helps manage errored replica state transitions
rebuildingLvolAddress tracks rebuilding lvol location

Also applies to: 96-98

878-880: LGTM: Error state handling improvement.

The addition of setting constructRequired when transitioning from error state improves replica recovery handling.

shuo-wu · 2024-12-25T08:13:09Z

Updated this PR for lvol checksum API & lvol detach parent API integration.

coderabbitai

Actionable comments posted: 2

🔭 Outside diff range comments (1)

pkg/spdk/replica.go (1)

Line range hint 1864-1960: Enhance error handling for critical cleanup operations.

While aggregating errors is good for cleanup, some critical failures should stop the process to prevent inconsistent state.

  func (r *Replica) doCleanupForRebuildingDst(spdkClient *spdkclient.Client) error {
    aggregatedErrors := []error{}
+   criticalError := false
    
    if r.rebuildingDstCache.externalSnapshotBdevName != "" {
      if err := disconnectNVMfBdev(spdkClient, r.rebuildingDstCache.externalSnapshotBdevName); err != nil {
        r.log.WithError(err).Error("Failed to disconnect external snapshot bdev")
        aggregatedErrors = append(aggregatedErrors, err)
+       criticalError = true
+       return util.CombineErrors(aggregatedErrors...)
      }
      // ... rest of cleanup
    }
    
    // Continue with non-critical cleanup only if no critical errors
+   if !criticalError {
      // ... rest of method
+   }
    
    return util.CombineErrors(aggregatedErrors...)
  }

🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🧹 Nitpick comments (6)

pkg/spdk/replica.go (4)
236-253: Consider adding goroutine pool for checksum registration.

The current implementation launches unlimited goroutines for checksum registration. Consider:

Using a worker pool to limit concurrent goroutine count

Adding a mechanism to track/wait for completion

Implementing proper error propagation

Example implementation:
+ var checksumWorkerPool = make(chan struct{}, 10) // Limit concurrent workers
  
  if r.SnapshotChecksumEnabled {
    for _, bdevLvol := range bdevLvolMap {
      if !bdevLvol.DriverSpecific.Lvol.Snapshot {
        continue
      }
      if bdevLvol.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum] != "" {
        continue
      }
+     checksumWorkerPool <- struct{}{} // Acquire worker
      go func(bdevLvol *spdktypes.BdevInfo) {
+       defer func() { <-checksumWorkerPool }() // Release worker
        logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0])
        _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0])
        if err != nil {
          logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err)
        }
      }(bdevLvol)
    }
  }
284-288: Improve error message clarity.

The error message could be more descriptive about why construction is not allowed in this state.
-       return fmt.Errorf("invalid state %s with construct required flag %v for replica %s construct", r.State, r.constructRequired, r.Name)
+       return fmt.Errorf("cannot construct replica %s in state %s: construction is only allowed when constructRequired is true", r.Name, r.State)
1961-2096: Address TODOs for delta shallow copy implementation.

The method contains several TODOs related to delta shallow copy implementation. I can help implement:

Reusing expired lvols for rebuilding

Handling corrupted/outdated snapshots

Implementing delta shallow copy logic

Would you like me to:

Create a GitHub issue to track these improvements?

Generate an implementation proposal for the delta shallow copy feature?

467-475: Optimize snapshot checksum comparison logic.

The current implementation can be made more efficient and clearer.
- if prev.SnapshotChecksum == "" {
-   prev.SnapshotChecksum = cur.SnapshotChecksum
- }
- if cur.SnapshotChecksum == "" {
-   prev.SnapshotChecksum = ""
- }
- if prev.SnapshotChecksum != cur.SnapshotChecksum {
+ // Only compare checksums if both exist
+ if prev.SnapshotChecksum != "" && cur.SnapshotChecksum != "" && 
+    prev.SnapshotChecksum != cur.SnapshotChecksum {
    return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", 
      cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name)
  }
+ // Update prev checksum if it's empty
+ if prev.SnapshotChecksum == "" {
+   prev.SnapshotChecksum = cur.SnapshotChecksum
+ }
pkg/spdk_test.go (2)
1205-1262: Consider breaking down the test function into smaller sub-tests.

The test function is quite large (>300 lines) and handles multiple test scenarios. Consider using Go's subtests to break it down into more manageable pieces:
func (s *TestSuite) TestSPDKMultipleThreadFastRebuilding(c *C) {
    testCases := []struct {
        name string
        test func(*C)
    }{
        {"BasicFastRebuilding", testBasicFastRebuilding},
        {"RebuildingWithInvalidSnapshots", testRebuildingWithInvalidSnapshots},
        // ... more test cases
    }
    
    for _, tc := range testCases {
        c.Run(tc.name, tc.test)
    }
}
1624-1661: Enhance error handling in waitReplicaSnapshotChecksumTimeout.

While the function works correctly, consider improving error handling:

Add context to the assertion errors

Consider returning errors instead of using assertions for better reusability
-func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) {
+func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) error {
     // ... existing code ...
-    c.Assert(err, IsNil)
+    if err != nil {
+        return fmt.Errorf("failed to get replica %s: %v", replicaName, err)
+    }
     // ... rest of the function
}

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0258a and 77a2637.

📒 Files selected for processing (9)

Dockerfile.dapper (1 hunks)
pkg/api/types.go (3 hunks)
pkg/client/client.go (1 hunks)
pkg/spdk/engine.go (4 hunks)
pkg/spdk/replica.go (30 hunks)
pkg/spdk/server.go (4 hunks)
pkg/spdk/types.go (7 hunks)
pkg/spdk_test.go (7 hunks)
scripts/test (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

scripts/test
pkg/api/types.go
Dockerfile.dapper
pkg/spdk/engine.go

👮 Files not reviewed due to content moderation or server errors (3)

pkg/spdk/types.go
pkg/client/client.go
pkg/spdk/server.go

🧰 Additional context used

🪛 GitHub Check: CodeFactor

pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🪛 golangci-lint (1.62.2)

pkg/spdk/types.go

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)

🔇 Additional comments (5)

pkg/spdk/replica.go (2)

70-74: LGTM! Well-structured field additions.

The new fields enhance the replica's functionality:

SnapshotChecksumEnabled provides control over checksum verification
constructRequired improves error handling when stopping errored replicas

1499-1499: LGTM! Break statement is correct.

The break statement is necessary to exit the select statement after timeout. Static analysis flagged this as a false positive.

🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

pkg/spdk_test.go (3)

42-47: LGTM! Constants are well-defined and appropriately sized.

The constants are logically organized and their values are appropriate for testing purposes:

Disk and volume sizes are properly scaled using MiB units
Wait intervals and counts are reasonable for test scenarios

Also applies to: 62-65

1370-1376: Well-documented snapshot tree structure.

The ASCII diagrams clearly illustrate the snapshot tree structure, making it easier to understand the test scenarios and expected outcomes.

Also applies to: 1506-1512

1674-1674: LGTM! Good improvements to existing functions.

The changes improve the functions by:

Using actual volume size instead of hardcoded values in revertSnapshot
Adding timeout functionality to WaitForReplicaRebuildingComplete

Also applies to: 1682-1682, 1697-1697, 1707-1710

pkg/spdk/replica.go

pkg/spdk_test.go

derekbit · 2024-12-26T00:28:58Z

Will be fixed by the longhorn/types update.

pkg/api/types.go:60:24: l.SnapshotChecksum undefined (type *spdkrpc.Lvol has no field or method SnapshotChecksum)
pkg/api/types.go:78:3: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

derekbit · 2024-12-26T00:53:10Z

pkg/spdk/types.go

 		UserCreated:       bdev.DriverSpecific.Lvol.Xattrs[spdkclient.UserCreated] == strconv.FormatBool(true),
 		SnapshotTimestamp: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotTimestamp],
+		SnapshotChecksum:  bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum],


Shall we need to check if the xattrs exist?

NO. The upper layer should guarantee that this bdev is a lvol, which means bdev.DriverSpecific.Lvol not nil. Then if Xattrs[spdkclient.SnapshotChecksum] exists or not does not matter.

pkg/spdk/types.go

mergify · 2024-12-26T02:06:46Z

This pull request is now in conflict. Could you fix it @shuo-wu? 🙏

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

pkg/spdk/server.go (1)
Line range hint 910-910: Consider logging cleanup errors.

Using _ to ignore the return value from doCleanupForRebuildingDst could mask potential cleanup issues. Consider logging any errors:
-_ = r.doCleanupForRebuildingDst(spdkClient)
+if err := r.doCleanupForRebuildingDst(spdkClient); err != nil {
+    r.log.WithError(err).Warn("Failed to cleanup rebuilding destination, continuing anyway")
+}
pkg/spdk/replica.go (2)

1961-2096: LGTM with suggestion for future improvement.

The method is well structured with thorough error handling and clear documentation. However, consider breaking it down into smaller functions in the future for better maintainability:

prepareRebuildingLvol

handleExistingSnapshot

reuseExpiredLvol

236-253: Implement goroutine pool for checksum registration.

The current implementation launches unlimited goroutines for checksum registration which could cause resource issues with many snapshots. Consider:

Implementing a worker pool to limit concurrent checksum registrations

Adding a mechanism to track and report failed registrations

Implementing retry logic for failed registrations

Would you like me to help implement a worker pool solution for the checksum registration?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77a2637 and 94c6e34.

📒 Files selected for processing (2)

pkg/spdk/replica.go (30 hunks)
pkg/spdk/server.go (4 hunks)

🧰 Additional context used

🪛 GitHub Check: CodeFactor

pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🔇 Additional comments (4)

pkg/spdk/server.go (2)

Line range hint 692-716: LGTM! Input validation and error handling look good.

The changes to include dstRebuildingLvolAddress parameter and modify the return type are well implemented with proper validation and error handling.

Line range hint 735-745: LGTM! Function simplification improves maintainability.

The function has been simplified to focus on snapshot name validation while maintaining proper error handling.

pkg/spdk/replica.go (2)

70-74: LGTM! New fields are well documented.

The addition of SnapshotChecksumEnabled and constructRequired fields with clear documentation improves the code's maintainability.

1467-1549: 🛠️ Refactor suggestion

Refactor complex method to improve maintainability.

The method has high cyclomatic complexity and could benefit from refactoring:

Extract monitoring logic into a separate method
Add proper error handling for goroutine panics
Remove redundant break statement in select

+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+    defer func() {
+        if r := recover(); r != nil {
+            r.log.Errorf("Panic in shallow copy monitoring: %v", r)
+        }
+    }()
+    
+    timer := time.NewTimer(MaxShallowCopyWaitTime)
+    defer timer.Stop()
+    ticker := time.NewTicker(ShallowCopyCheckInterval)
+    defer ticker.Stop()
+    
+    for {
+        select {
+        case <-timer.C:
+            r.log.Error("Timeout waiting for shallow copy")
+            return
+        case <-ticker.C:
+            r.Lock()
+            if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+                r.Unlock()
+                return
+            }
+            status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+            r.Unlock()
+            if err != nil || status.State == types.ProgressStateComplete {
+                return
+            }
+        }
+    }
+}

Likely invalid or redundant comment.

🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

pkg/spdk/replica.go

pkg/spdk/engine.go

derekbit · 2024-12-26T07:12:52Z

pkg/spdk/engine.go

+
+		timer.Reset(MaxShallowCopyWaitTime)
+		continuousRetryCount := 0
+		for finished := false; !finished; {


Can we use

for { select { case <-timer.C: ... case <-ticker.C: ... if shallowCopyStatus.State == helpertypes.ShallowCopyStateComplete { ... break } } }

and remove finished flag?

NO. The break in your code means leaving from select rather than the whole for loop

pkg/spdk/replica.go

coderabbitai

Actionable comments posted: 0

🔭 Outside diff range comments (1)

pkg/spdk/types.go (1)
Line range hint 469-477: Improve checksum comparison logic.

The current checksum comparison logic might lead to data inconsistency. If both checksums are empty, they're considered equal. Consider adding a warning log in this case.
 	if prev.SnapshotChecksum == "" {
+		logrus.Warnf("Empty checksum found for lvol %s during comparison", prev.Name)
 		prev.SnapshotChecksum = cur.SnapshotChecksum
 	}
 	if cur.SnapshotChecksum == "" {
+		logrus.Warnf("Empty checksum found for lvol %s during comparison", cur.Name)
 		prev.SnapshotChecksum = ""
 	}

🧹 Nitpick comments (6)

pkg/spdk/types.go (1)
38-39: Consider using time.Duration for time constants.

For better type safety and clarity, consider using time.Duration for time-related constants.
-	MaxShallowCopyWaitTime   = 72 * time.Hour
-	ShallowCopyCheckInterval = 3 * time.Second
+	MaxShallowCopyWaitTime   time.Duration = 72 * time.Hour
+	ShallowCopyCheckInterval time.Duration = 3 * time.Second
pkg/spdk/replica.go (5)
238-255: Consider using a goroutine pool for checksum registration.

The non-blocking checksum registration is good, but spawning unlimited goroutines could lead to resource exhaustion under heavy load.

Consider implementing a worker pool pattern:
+var checksumWorkerPool chan struct{}
+
+func init() {
+    checksumWorkerPool = make(chan struct{}, 10) // Adjust pool size as needed
+}

-// TODO: Use a goroutine pool
-go func() {
+go func(bdevLvol *spdktypes.BdevInfo) {
+    checksumWorkerPool <- struct{}{} // Acquire worker
+    defer func() { <-checksumWorkerPool }() // Release worker
     logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0])
     _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0])
     if err != nil {
         logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err)
     }
-}()
+}(bdevLvol)
469-478: Enhance error messages for checksum mismatches.

The checksum comparison logic is correct, but the error message could be more descriptive.

Consider this improvement:
-return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name)
+return fmt.Errorf("snapshot checksum mismatch for lvol %s: current='%v' previous='%v'", prev.Name, cur.SnapshotChecksum, prev.SnapshotChecksum)
1963-2098: Consider breaking down the complex method.

The method is quite long and handles multiple responsibilities. Consider extracting some logic into helper methods.

Suggested structure:

Extract snapshot validation logic

Extract rebuilding lvol cleanup logic

Extract expired lvol reuse logic

Example:
+func (r *Replica) validateRebuildingSnapshot(snapshotName string) (*api.Lvol, error) {
+    srcSnapSvcLvol := r.rebuildingDstCache.rebuildingSnapshotMap[snapshotName]
+    if srcSnapSvcLvol == nil {
+        return nil, fmt.Errorf("cannot find snapshot %s in rebuilding list", snapshotName)
+    }
+    return srcSnapSvcLvol, nil
+}

+func (r *Replica) cleanupExistingRebuildingLvol(spdkClient *spdkclient.Client) error {
+    // Extract cleanup logic here
+}
1686-1689: Track implementation of delta shallow copy.

Multiple TODO comments indicate planned implementation of delta shallow copy functionality. This will optimize the rebuilding process.

Would you like me to create a GitHub issue to track the implementation of delta shallow copy? The issue would include:

Implementation of lvol checksum API

Implementation of delta shallow copy API

Integration with the existing rebuilding logic

Also applies to: 1742-1753

Line range hint 1866-1961: Enhance error logging during cleanup.

While the error aggregation is good, the error logging could be more detailed to aid in debugging.

Consider adding more context to error logs:
-r.log.WithError(err).Errorf("Failed to disconnect the external src snapshot bdev %s for rebuilding dst cleanup, will continue", r.rebuildingDstCache.externalSnapshotBdevName)
+r.log.WithError(err).WithFields(logrus.Fields{
+    "replica": r.Name,
+    "snapshot": r.rebuildingDstCache.externalSnapshotName,
+    "bdev": r.rebuildingDstCache.externalSnapshotBdevName,
+}).Errorf("Failed to disconnect external src snapshot bdev during rebuilding dst cleanup, continuing with remaining cleanup steps")
🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82f564c and 933cfc6.

📒 Files selected for processing (4)

pkg/api/types.go (3 hunks)
pkg/spdk/replica.go (30 hunks)
pkg/spdk/types.go (7 hunks)
pkg/spdk_test.go (7 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/api/types.go
pkg/spdk_test.go

🧰 Additional context used

🪛 GitHub Check: CodeFactor

pkg/spdk/replica.go

[notice] 1470-1553: pkg/spdk/replica.go#L1470-L1553
Complex Method

[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method

🔇 Additional comments (3)

pkg/spdk/replica.go (3)

70-74: LGTM: New fields enhance replica functionality.

The new fields improve the replica's capabilities:

SnapshotChecksumEnabled adds data integrity verification through snapshot checksums
constructRequired improves error handling when stopping errored replicas

284-290: LGTM: Improved state validation logic.

The changes enhance error handling by properly validating the constructRequired flag and ensuring correct state transitions.

1469-1529: 🛠️ Refactor suggestion

Refactor monitoring logic into a separate method.

The method is complex and handles multiple responsibilities. Consider extracting the monitoring logic.

Suggested refactoring:

+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+    timer := time.NewTimer(MaxShallowCopyWaitTime)
+    defer timer.Stop()
+    ticker := time.NewTicker(ShallowCopyCheckInterval)
+    defer ticker.Stop()
+    
+    for {
+        select {
+        case <-timer.C:
+            r.log.Errorf("Timeout waiting for shallow copy %v complete", shallowCopyOpID)
+            return
+        case <-ticker.C:
+            r.Lock()
+            if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+                r.Unlock()
+                return
+            }
+            status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+            r.Unlock()
+            if err != nil || status.State == types.ProgressStateError || status.State == types.ProgressStateComplete {
+                return
+            }
+        }
+    }
+}

Also, remove the useless break statement at line 1501 and use return instead.

Likely invalid or redundant comment.

🧰 Tools

🪛 GitHub Check: CodeFactor

[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

mergify · 2024-12-27T01:47:10Z

This pull request is now in conflict. Could you fix it @shuo-wu? 🙏

derekbit · 2024-12-27T02:15:28Z

@coderabbitai pause

coderabbitai · 2024-12-27T02:15:35Z

✅ Actions performed

Reviews paused.

derekbit

In general, LGTM. Only some comments and conflicts.

pkg/spdk/engine.go

pkg/spdk/replica.go

derekbit · 2024-12-27T05:16:43Z

pkg/spdk/replica.go

@@ -182,6 +191,8 @@ func NewReplica(ctx context.Context, replicaName, lvsName, lvsUUID string, specS
 		SpecSize:        roundedSpecSize,
 		State:           types.InstanceStatePending,

+		SnapshotChecksumEnabled: true,


Will we allow users to disable snapshot checksum?

In the future, YES.
I haven't checked if the v1 checksum setting fits v2 or not. Hence we directly enabled it by default for now.

pkg/spdk/replica.go

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

…lding start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Signed-off-by: Shuo Wu <[email protected]>

The 1st defer invocation e.initiator.Resume() should not overwrite the existing error. Otherwise, the next defer call e.replicaShallowCopy() will be excuted. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

To reuse possible existing snapshots, the online rebuilding flow is refactored. More specifically: 1. Dst replica will re-create a rebuilding lvol under the correct snap before each shallow copy start 2. During dst replica rebuilding lvol preparation, the dst replica will try to reuse the existing snapshots, non-empty previous head, rebuilding lvols, and expired lvols. 3. During dst replica snapshot shallow copy, the shallow copy can be skipped if the snapshot is there and intact. 4. During dst replica snapshot creation stage, it will blindly clean up the rebuilding lvol and correct the current rebuilding snapshot parent. 5. The dest replica rebuilding cleanup is responisble for cleaning up redundant lvols only (after rebuilding finish) 6. Src replica will blindly attach the exposed rebuilding lvol at the beginning of each shallow copy, then detach it in the end. 7. The replica deletion function is responsible for cleaning up all related lvols when Longhorn no longer wants (to reuse) this replica. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

…after shallow copy start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

…timeout Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Since the delta shallow copy API is not available yet. Need to revert this commit when the API is ready. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Signed-off-by: Shuo Wu <[email protected]>

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

Longhorn 8666, 9488 Signed-off-by: Shuo Wu <[email protected]>

Signed-off-by: Shuo Wu <[email protected]>

derekbit

LGTM

shuo-wu requested review from innobead, ChanYiLin, derekbit, PhanLe1010 and DamiaSan December 20, 2024 09:04

shuo-wu changed the title ~~feature: fast rebuilding~~ feat: v2 fast rebuilding Dec 20, 2024

shuo-wu force-pushed the reuse branch 2 times, most recently from 793fe70 to 8c4f187 Compare December 21, 2024 02:17

shuo-wu marked this pull request as ready for review December 24, 2024 06:32

coderabbitai bot reviewed Dec 24, 2024

View reviewed changes

pkg/spdk/replica.go Outdated Show resolved Hide resolved

This was referenced Dec 24, 2024

[FEATURE] Support v2 volume delta replica rebuilding based on snapshot checksum: data plane and control plane longhorn/longhorn#9488

Closed

fix(deps): update patch digest dependencies (main) #279

Merged

shuo-wu force-pushed the reuse branch from 8c4f187 to 0b0258a Compare December 25, 2024 08:06

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

shuo-wu force-pushed the reuse branch from 0b0258a to 77a2637 Compare December 25, 2024 08:11

coderabbitai bot reviewed Dec 25, 2024

View reviewed changes

pkg/spdk/replica.go Show resolved Hide resolved

pkg/spdk_test.go Show resolved Hide resolved

derekbit mentioned this pull request Dec 25, 2024

fix: reconstruct ActiveChain and SnapshotLvolMap while creating replica #284

Merged

derekbit reviewed Dec 26, 2024

View reviewed changes

pkg/spdk/types.go Show resolved Hide resolved

coderabbitai bot reviewed Dec 26, 2024

View reviewed changes

derekbit reviewed Dec 26, 2024

View reviewed changes

pkg/spdk/replica.go Outdated Show resolved Hide resolved

ChanYiLin mentioned this pull request Dec 26, 2024

fix: delete the backing image with unmatched UUID before creating #286

Merged

derekbit mentioned this pull request Dec 26, 2024

fix(revert): revert the dependency updates related to delta rebuilding #287

Merged

derekbit reviewed Dec 26, 2024

View reviewed changes

shuo-wu force-pushed the reuse branch from 94c6e34 to 82f564c Compare December 26, 2024 19:57

shuo-wu force-pushed the reuse branch from 933cfc6 to 524faf4 Compare December 26, 2024 20:05

coderabbitai bot reviewed Dec 26, 2024

View reviewed changes

derekbit reviewed Dec 27, 2024

View reviewed changes

shuo-wu added 6 commits December 26, 2024 23:58

fix: update dependencies for v2 fast rebuilding feature

de65b99

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

feat: replica retains valid lvols and removes redundancy during rebui…

41a4110

…lding start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

feat: replica removes orphans during rebuilding finish

d8d83f5

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

chore: rename variables

431646e

Signed-off-by: Shuo Wu <[email protected]>

fix: engine should not start shallow copy when ReplicaAdd errors out

22e6867

The 1st defer invocation e.initiator.Resume() should not overwrite the existing error. Otherwise, the next defer call e.replicaShallowCopy() will be excuted. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

shuo-wu force-pushed the reuse branch 2 times, most recently from 0aa2daa to c4579c9 Compare December 27, 2024 08:05

shuo-wu added 9 commits December 27, 2024 00:07

feat: add a new unit test for fast rebuilding

fa10869

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

fix: update commit IDs for dapper file

f5f824b

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

fix: replica rebuilding src caches the status and waits for complete …

74610dd

…after shallow copy start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

fix: engine wait for snapshot shallow copy with a fixed interval and …

5fc5e7c

…timeout Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

feat: avoid non-empty rebuilding lvol during rebuilding

e5d9a2f

Since the delta shallow copy API is not available yet. Need to revert this commit when the API is ready. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

feat: reconstruct caches when starting a previously failed replica

2cf05c0

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

fix: replicas do clone from a correct lvol during head prepare

a912c06

Signed-off-by: Shuo Wu <[email protected]>

fix: improve replica rebuilding dst cache cleanup logic

6d491cc

Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>

feat: introduce snapshot checksum

31f905f

Longhorn 8666, 9488 Signed-off-by: Shuo Wu <[email protected]>

shuo-wu force-pushed the reuse branch from c4579c9 to d034334 Compare December 27, 2024 08:08

refactor: use API spdkClient.BdevLvolGetByName

7844558

Signed-off-by: Shuo Wu <[email protected]>

shuo-wu force-pushed the reuse branch from d034334 to 7844558 Compare December 27, 2024 08:16

shuo-wu requested a review from derekbit December 27, 2024 08:25

derekbit approved these changes Dec 27, 2024

View reviewed changes

derekbit merged commit 884e1cb into longhorn:main Dec 27, 2024
7 of 10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: v2 fast rebuilding #278

feat: v2 fast rebuilding #278

shuo-wu commented Dec 20, 2024

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

Reviews paused

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

codecov bot commented Dec 20, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

shuo-wu commented Dec 25, 2024

coderabbitai bot left a comment

derekbit commented Dec 26, 2024 •

edited

Loading

derekbit Dec 26, 2024

shuo-wu Dec 27, 2024 •

edited

Loading

mergify bot commented Dec 26, 2024

coderabbitai bot left a comment

derekbit Dec 26, 2024

shuo-wu Dec 27, 2024

coderabbitai bot left a comment

mergify bot commented Dec 27, 2024

derekbit commented Dec 27, 2024

coderabbitai bot commented Dec 27, 2024

derekbit left a comment •

edited

Loading

derekbit Dec 27, 2024

shuo-wu Dec 27, 2024

derekbit left a comment

feat: v2 fast rebuilding #278

feat: v2 fast rebuilding #278

Conversation

shuo-wu commented Dec 20, 2024

Which issue(s) this PR fixes:

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

coderabbitai bot commented Dec 20, 2024 • edited Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

codecov bot commented Dec 20, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

shuo-wu commented Dec 25, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

derekbit commented Dec 26, 2024 • edited Loading

derekbit Dec 26, 2024

Choose a reason for hiding this comment

shuo-wu Dec 27, 2024 • edited Loading

Choose a reason for hiding this comment

mergify bot commented Dec 26, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

derekbit Dec 26, 2024

Choose a reason for hiding this comment

shuo-wu Dec 27, 2024

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

mergify bot commented Dec 27, 2024

derekbit commented Dec 27, 2024

coderabbitai bot commented Dec 27, 2024

derekbit left a comment • edited Loading

Choose a reason for hiding this comment

derekbit Dec 27, 2024

Choose a reason for hiding this comment

shuo-wu Dec 27, 2024

Choose a reason for hiding this comment

derekbit left a comment

Choose a reason for hiding this comment

coderabbitai bot commented Dec 20, 2024 •

edited

Loading

codecov bot commented Dec 20, 2024 •

edited

Loading

derekbit commented Dec 26, 2024 •

edited

Loading

shuo-wu Dec 27, 2024 •

edited

Loading

derekbit left a comment •

edited

Loading