Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: v2 fast rebuilding #278

Merged
merged 16 commits into from
Dec 27, 2024
Merged

feat: v2 fast rebuilding #278

merged 16 commits into from
Dec 27, 2024

Conversation

shuo-wu
Copy link
Contributor

@shuo-wu shuo-wu commented Dec 20, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#9488

The related docs that can help understand this feature:
V2 Data Engine Fast Rebuilding diagram flow
V2 Data Engine Fast Rebuilding illustration and animation

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

Copy link

coderabbitai bot commented Dec 20, 2024

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This pull request introduces several modifications to the SPDK (Storage Performance Development Kit) related code, focusing on updates to the Dockerfile.dapper, simplifications in the SPDKClient interface, enhancements in error handling and logging within the Engine struct, and improvements to the Replica management logic. Changes span multiple files, including client and server implementations, type definitions, and test scripts, aimed at refining the functionality and usability of the SPDK system.

Changes

File Change Summary
Dockerfile.dapper Updated SPDK_COMMIT_ID from 002cd0679259c4e4b782d9540441a46e1228d484 to a7421a6e59f1d099294af6f65d73ebec4afebfc5 and GO_SPDK_HELPER_COMMIT_ID from 6a324e95979662be592bfda0e867d2678ecbc756 to e50a23ddfd64ebc267064e5aa07e0c3f9366ffbc.
pkg/client/client.go - Removed ReplicaRebuildingSrcAttach and ReplicaRebuildingSrcDetach methods.
- Modified ReplicaRebuildingSrcShallowCopyStart to include dstRebuildingLvolAddress and changed return type to error.
- Simplified ReplicaRebuildingSrcShallowCopyCheck by removing shallowCopyOpId parameter.
pkg/spdk/engine.go - Enhanced error handling in ReplicaAdd and updated logging.
- Improved shallow copy management in replicaShallowCopy with timer and ticker.
pkg/spdk/replica.go - Added SnapshotChecksumEnabled and constructRequired fields to Replica struct.
- Updated shallow copy methods to handle new logic for rebuilding operations.
pkg/spdk/server.go - Removed ReplicaRebuildingSrcAttach and ReplicaRebuildingSrcDetach methods.
- Modified ReplicaRebuildingSrcShallowCopyStart to return an empty response and added DstRebuildingLvolAddress parameter.
- Added validation for EngineReplicaAdd.
pkg/spdk/types.go - Added constants: ReplicaExpiredLvolSuffix, MaxShallowCopyWaitTime, ShallowCopyCheckInterval.
- Added functions: GenerateReplicaExpiredLvolName, IsReplicaExpiredLvol.
- Updated Lvol struct with SnapshotChecksum field.
pkg/spdk_test.go - Added TestSPDKMultipleThreadFastRebuilding test.
- Updated defaultTestSnapChecksumWaitInterval and defaultTestSnapChecksumWaitCount.
- Modified revertSnapshot and WaitForReplicaRebuildingComplete methods.
scripts/test Increased test timeout from 30m to 40m.
pkg/api/types.go Added SnapshotChecksum field to Lvol struct and updated conversion functions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Server
    participant Engine
    participant Replica

    Client->>Server: ReplicaRebuildingSrcShallowCopyStart
    Server->>Engine: Initiate Shallow Copy
    Engine->>Replica: Prepare Rebuilding Lvol
    Replica-->>Engine: Lvol Prepared
    Engine->>Replica: Start Shallow Copy
    Replica-->>Engine: Shallow Copy Status
    Engine-->>Server: Shallow Copy Progress
    Server-->>Client: Operation Complete
Loading

This sequence diagram illustrates the high-level flow of the replica rebuilding shallow copy process, highlighting the interactions between the client, server, engine, and replica components.

Possibly related PRs

  • fix: reconstruct ActiveChain and SnapshotLvolMap while creating replica #284: The changes in this PR involve modifications to the Replica struct and its methods in pkg/spdk/replica.go, which are related to the management of replicas, including the construction of the SnapshotLvolMap. This is relevant to the main PR's focus on updating environment variables in the Dockerfile, as both involve aspects of the SPDK system's configuration and management.

Suggested reviewers

  • ChanYiLin
  • c3y1huang

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 2.11640% with 555 lines in your changes missing coverage. Please review.

Project coverage is 8.69%. Comparing base (7081bda) to head (7844558).
Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
pkg/spdk/replica.go 0.00% 477 Missing ⚠️
pkg/spdk/engine.go 0.00% 57 Missing ⚠️
pkg/spdk/server.go 0.00% 10 Missing ⚠️
pkg/spdk/types.go 0.00% 6 Missing ⚠️
pkg/client/client.go 66.66% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #278      +/-   ##
========================================
- Coverage   9.17%   8.69%   -0.48%     
========================================
  Files         23      23              
  Lines       8536    8713     +177     
========================================
- Hits         783     758      -25     
- Misses      7646    7856     +210     
+ Partials     107      99       -8     
Flag Coverage Δ
unittests 8.69% <2.11%> (-0.48%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@shuo-wu shuo-wu changed the title feature: fast rebuilding feat: v2 fast rebuilding Dec 20, 2024
@shuo-wu shuo-wu force-pushed the reuse branch 2 times, most recently from 793fe70 to 8c4f187 Compare December 21, 2024 02:17
@shuo-wu shuo-wu marked this pull request as ready for review December 24, 2024 06:32
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (14)
pkg/client/client.go (3)

266-266: Add more context to error messages.

This error is clear about the missing parameter. Consider augmenting the message with the function name or caller context for quicker troubleshooting.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests


284-285: Add concurrency notes in docstring.

The new docstring is good. Consider clarifying if the method is safe to call concurrently from multiple goroutines, or if it requires external synchronization.


298-300: Pass-through request parameters.

All request fields are properly assigned. When refactoring in future, consider centralizing these parameter assignments to reduce duplication.

pkg/spdk/server.go (2)

717-717: Handle the error gracefully.

If r.RebuildingSrcShallowCopyStart fails, consider adding logs with details about the replica's current state to diagnose potential concurrency or environment issues.


736-736: Possible optimization.

When reading the shallow copy status, consider caching results or throttling requests to avoid excessive gRPC calls.

pkg/spdk/replica.go (6)

262-264: Consider more logging.

When the state is "Stopped" but constructRequired is set, you skip here. A debug log might help diagnose scenarios in which this branch triggers unexpectedly.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests


266-266: Add context to error message.

The message does not mention whether the state is invalid or if the logic disallows transitions. More detail could help debugging if user logs are minimal.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests


561-562: Document fallback logic.

Lines 561-562 handle the case where the chain might require removal of the nil head. Document the reasoning to help future maintainers.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests


570-570: Add error handling.

If logging for “Replica cloned a new head lvol” fails or returns partial success, consider capturing it in debug logs for later.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests


575-575: Provide meaningful logs.

This line logs “Replica created a new head lvol.” Possibly add parent lvol alias in the log to clarify the relationship.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests


655-655: Handle error more gracefully.

A user might attempt unexpected revert chains. Log or handle partial rest states if getRootLvolName is missing.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by tests

pkg/spdk/engine.go (3)

1166-1166: Ensure correct client cleanup.
Good practice to log any close failures. In some scenarios, a retry mechanism might be beneficial.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests


1196-1198: Initiator resume failure handling.
Properly wrapping errors is good. Consider adding a fallback or re-try logic for cases where resume fails, if possible.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests


1344-1351: Finalizing shallow copy.
The code gracefully logs partial progress if it's incomplete. Ensure any partial state is cleaned up if needed.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c84ca84 and 8c4f187.

⛔ Files ignored due to path filters (5)
  • go.mod is excluded by !go.mod
  • go.sum is excluded by !**/*.sum, !go.sum
  • vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk_grpc.pb.go is excluded by !**/*.pb.go, !**/generated/**, !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (8)
  • Dockerfile.dapper (1 hunks)
  • pkg/client/client.go (1 hunks)
  • pkg/spdk/engine.go (4 hunks)
  • pkg/spdk/replica.go (26 hunks)
  • pkg/spdk/server.go (4 hunks)
  • pkg/spdk/types.go (4 hunks)
  • pkg/spdk_test.go (5 hunks)
  • scripts/test (1 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
pkg/client/client.go

[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests


[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests


[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests

pkg/spdk/engine.go

[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests


[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests


[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests


[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests


[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests


[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests


[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests


[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests


[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests


[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests


[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests


[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests


[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests


[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests

pkg/spdk/replica.go

[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests


[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests


[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests


[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests


[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests


[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests


[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests


[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests


[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by tests


[warning] 769-778: pkg/spdk/replica.go#L769-L778
Added lines #L769 - L778 were not covered by tests


[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests


[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests


[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests

🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go

[notice] 1435-1518: pkg/spdk/replica.go#L1435-L1518
Complex Method


[notice] 1466-1466: pkg/spdk/replica.go#L1466
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


[notice] 1929-2066: pkg/spdk/replica.go#L1929-L2066
Complex Method

🔇 Additional comments (31)
pkg/client/client.go (3)

261-264: Ensure consistent parameter error handling.

These checks correctly enforce required parameters for the shallow copy start method. Make sure all error messages are consistently phrased to help with debugging.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests


273-276: Validate call arguments.

This logic constructs the gRPC request to the server. All parameters are consistently used. Make sure to handle potential client-side timeouts or partial failures.


279-281: Error wrapping is good.

Using errors.Wrapf provides clear traceability. Ensure you’ve tested large snapshots or long snapshot names to confirm the error message does not become truncated in logs.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests

pkg/spdk/server.go (3)

693-702: Validate request parameters.

These lines correctly verify that the request has valid fields. Check any logs or monitoring to see if more context is needed when logging invalid arguments.


713-713: Check for concurrency issues.

This method call might happen concurrently with other operations on the same replica. Ensure there's adequate locking or checks to avoid races.


1088-1090: Validate arguments for EngineReplicaAdd.

It's good to ensure both replicaName and replicaAddress are present. If either is missing, an error is returned early. This is consistent with the rest of the codebase.

pkg/spdk_test.go (1)

1203-1583: Comprehensive multi-thread test coverage.

This new test provides a thorough scenario exercising fast rebuilding with multiple replicas and snapshots. Consider the following for further improvement:
• Test random node failures or abrupt restarts to ensure resilience.
• Add short descriptive logging or sub-tests to pinpoint any failing scenario quickly.

pkg/spdk/replica.go (7)

72-74: Track constructRequired carefully.

Marking a replica as requiring construction when it errors out can help with recovery logic. Ensure code paths that toggle constructRequired handle concurrency properly.


545-559: Clean separation of chain logic.

Creating or cloning the head from the parent lvol is well structured. Confirm you have robust test coverage for nil parent or mismatch in SpecSize.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests


565-566: Perform size validation.

Here, you resize if parent size and spec size differ. Make sure the parent is always a valid reference to avoid unexpected resizing.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests


651-653: Check snapshot references.

Distinguishing between parent snapshots and the backing image is correct. Ensure references to root are consistent if a new snapshot is introduced at run time.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests


846-848: Track state transitions.

When previous state is error and you set constructRequired, consider whether an immediate re-construct attempt is desired or if a backoff is beneficial.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests


878-878: Graceful concurrency check.

Immediately stopping or cleaning up the source building cache can race other code calls. Evaluate an atomic operation to avoid incomplete states.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests


906-906: Double-check logic for partial leftover lvols.

If leftover snapshots remain from partial merges, ensure they don’t hamper future new replicas. Additional log statements might help.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests

scripts/test (1)

61-61: Extended test timeout is beneficial.

Increasing the timeout to 40 minutes allows more comprehensive tests to complete successfully, especially with complex slow I/O scenarios. Ensure CI infrastructure can handle this longer duration.

pkg/spdk/types.go (4)

26-26: Looks good for new suffix usage.
The new "expired" suffix is clear and consistent with existing suffix conventions.


38-39: Reasonable default intervals.
Setting a maximum wait time of 72 hours and a 3-second check interval seems balanced for long-running copy processes. Confirm these values are suitable for all operational scenarios.


200-202: Appropriate name generator.
Clearly constructs a unique LVOL name for expired replicas using a short UUID. Ensure collisions are handled upstream if there's any possibility of reusing suffixes.


212-214: Simple check for expired suffix.
Implementation is straightforward and aligns with previously mentioned naming conventions.

pkg/spdk/engine.go (11)

1127-1129: Handle previous mode tracking carefully.
Maintaining the previous mode in a local variable is a good approach for logging rollback scenarios. Consider verifying if there's any edge case (e.g., uninitialized map or missing key).

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests


1131-1136: Assigning ERR mode with address.
Setting a replica to ERR mode while preserving its address helps for debugging. Consider if partial transitions might require special handling.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests


1138-1139: Comprehensive logging.
Attaching both the entire replica status map and the error ensures thorough observability.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests


1169-1169: Symmetry in client closure logging.
Matches the pattern used for source replica.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests


1175-1175: Warn vs Error logging.
Currently using Errorf for shallow copy failure. This is appropriate if it indicates a critical state that blocks normal operations.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests


1177-1178: Error logging reason.
Clear enough messages to differentiate between “replica error” vs “engine error.”

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests


1181-1182: Finish step cleanup.
Makes sense to attempt finishing even on errors, to avoid resource leakage.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests


1303-1304: Log message clarity.
This readies the logs with enough details to know source and destination.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests


1311-1315: Timer and ticker approach.
Combining a hard cap with periodic checks is a practical method to limit the shallow copy duration.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests


1324-1338: Robust retry logic.
Incrementing a local counter and checking it against maxNumRetries helps avoid indefinite retries. Ensure maxNumRetries is large enough for potential slow I/O conditions.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests


1340-1342: Error path exit.
Immediate exit on copying error is correct. Potentially store partial progress for diagnostics.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests

Dockerfile.dapper (1)

20-20: Updating commit ID for go-spdk-helper.
This likely addresses compatibility with the new replica rebuilding flow. Consider verifying the updated commit for any backward-incompatible changes.

pkg/spdk/replica.go Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (7)
pkg/spdk/types.go (2)

26-26: Additional constants
Defining ReplicaExpiredLvolSuffix, MaxShallowCopyWaitTime, and ShallowCopyCheckInterval is clear. Consider adding docstrings so users know how these values are derived.

Also applies to: 37-39


203-205: Short UUID usage
Truncating the UUID to 8 characters can cause potential collisions in longer-running systems. Consider using a larger substring or a distinct suffix strategy.

pkg/spdk_test.go (1)

1205-1588: Comprehensive multi-thread fast rebuilding test
This new test suite thoroughly exercises snapshot creation, reverts, and concurrent rebuild scenarios. Consider adding negative tests (e.g., verifying behavior under insufficient disk space) and capturing logs for deeper debugging.

pkg/spdk/replica.go (4)

236-253: Consider using a goroutine pool for checksum registration.

The TODO comment suggests using a goroutine pool, which is a good optimization. Currently, each snapshot checksum registration spawns a new goroutine, which could lead to resource exhaustion with many snapshots.

Would you like me to propose an implementation using a worker pool pattern?


1467-1527: Consider breaking down the complex shallow copy logic.

The method RebuildingSrcShallowCopyStart is quite complex with nested defers and goroutine management. Consider extracting the goroutine management into a separate method for better maintainability.

Example refactor:

func (r *Replica) RebuildingSrcShallowCopyStart(...) (err error) {
    // ... existing locking code ...
    
    if err = r.rebuildingSrcDetachNoLock(spdkClient); err != nil {
        return err
    }
    if err = r.rebuildingSrcAttachNoLock(spdkClient); err != nil {
        return err
    }
    
+   return r.startShallowCopyAndWait(spdkClient, snapshotName)
}

+func (r *Replica) startShallowCopyAndWait(spdkClient *spdkclient.Client, snapshotName string) error {
+   // Extract goroutine management logic here
}
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


1496-1499: Remove unnecessary break statement.

The break statement after setting stopWaiting is redundant as the loop will exit naturally when stopWaiting is checked.

case <-timer.C:
    r.log.Errorf("Timeout waiting for the src replica %s shallow copy %v complete before detaching the rebuilding lvol of the dst replica %s, will give up", r.Name, shallowCopyOpID, r.rebuildingSrcCache.dstReplicaName)
    stopWaiting = true
-   break
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


1961-2096: Consider splitting the complex rebuilding preparation logic.

The method rebuildingDstShallowCopyPrepare handles multiple scenarios and has deep nesting. Consider splitting it into smaller, focused methods:

  1. Handling existing snapshot lvols
  2. Managing expired lvols
  3. Creating new lvols

Would you like me to propose a refactored structure that improves readability while maintaining the same functionality?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c4f187 and 0b0258a.

📒 Files selected for processing (9)
  • Dockerfile.dapper (1 hunks)
  • pkg/api/types.go (3 hunks)
  • pkg/client/client.go (1 hunks)
  • pkg/spdk/engine.go (4 hunks)
  • pkg/spdk/replica.go (30 hunks)
  • pkg/spdk/server.go (4 hunks)
  • pkg/spdk/types.go (7 hunks)
  • pkg/spdk_test.go (7 hunks)
  • scripts/test (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • scripts/test
  • Dockerfile.dapper
  • pkg/spdk/engine.go
🧰 Additional context used
🪛 golangci-lint (1.62.2)
pkg/spdk/types.go

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)

🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method


[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🔇 Additional comments (15)
pkg/client/client.go (4)

261-266: Add parameter validation for consistency
You've introduced a new required parameter, dstRebuildingLvolAddress. The checks for empty strings look good, but consider documenting valid formats or expected address patterns to help users avoid passing malformed values.


273-281: Good usage of error wrapping and final return
The flow for invoking the RPC and returning either a wrapped error or nil is consistent. This is an excellent pattern for error tracing.


284-285: Accurate inline documentation
The docstring now correctly matches the adjusted parameters and behavior for the shallow copy check.


298-300: Confirm removal of shallow copy op ID usage
With the old shallowCopyOpId removed, verify that no references remain in the codebase and all callers rely solely on the three parameters.

pkg/api/types.go (2)

43-43: New field addition looks appropriate
Adding SnapshotChecksum to the Lvol struct helps track data integrity. Ensure it is assigned consistently in all code paths.


60-60: Propagation of SnapshotChecksum
The lines correctly serialize and deserialize SnapshotChecksum. This keeps the in-memory representation in sync with the Proto definitions.

Also applies to: 78-78

pkg/spdk/types.go (4)

79-79: Maintaining consistency with the API
The SnapshotChecksum field in Lvol aligns well with the API structures. Good job on consistent naming.


121-121: Snapshot checksum annotation
Assigning the field here is coherent with the rest of the code for consistent marshalling.

🧰 Tools
🪛 golangci-lint (1.62.2)

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)


155-155: Validate existence of xattr
Ensure you handle the case where SnapshotChecksum might not exist in Xattrs. Otherwise, key lookups can result in empty strings or unexpected outcomes.


215-217: Simple prefix check
The naming convention with "expired" is straightforward. This function is simple and works well given the suffix-based approach.

pkg/spdk/server.go (3)

693-693: Check for mandatory address parameter
The new logic verifying DstRebuildingLvolAddress is essential. Returning gRPC InvalidArgument is consistent with existing error handling patterns. The follow-up call and final return are appropriately structured.

Also applies to: 700-702, 713-713, 717-717


736-736: Removed shallow copy operation ID
Invoking r.RebuildingSrcShallowCopyCheck using only SnapshotName aligns with the updated interface. This simplifies the check logic.


1088-1090: Stronger validation in EngineReplicaAdd
Ensuring both ReplicaName and ReplicaAddress are provided helps prevent incomplete engine states and reduces potential runtime errors.

pkg/spdk/replica.go (2)

70-74: LGTM: New fields for snapshot checksum and rebuilding.

The new fields enhance the functionality:

  • SnapshotChecksumEnabled provides control over snapshot checksum registration
  • constructRequired helps manage errored replica state transitions
  • rebuildingLvolAddress tracks rebuilding lvol location

Also applies to: 96-98


878-880: LGTM: Error state handling improvement.

The addition of setting constructRequired when transitioning from error state improves replica recovery handling.

@shuo-wu
Copy link
Contributor Author

shuo-wu commented Dec 25, 2024

Updated this PR for lvol checksum API & lvol detach parent API integration.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
pkg/spdk/replica.go (1)

Line range hint 1864-1960: Enhance error handling for critical cleanup operations.

While aggregating errors is good for cleanup, some critical failures should stop the process to prevent inconsistent state.

  func (r *Replica) doCleanupForRebuildingDst(spdkClient *spdkclient.Client) error {
    aggregatedErrors := []error{}
+   criticalError := false
    
    if r.rebuildingDstCache.externalSnapshotBdevName != "" {
      if err := disconnectNVMfBdev(spdkClient, r.rebuildingDstCache.externalSnapshotBdevName); err != nil {
        r.log.WithError(err).Error("Failed to disconnect external snapshot bdev")
        aggregatedErrors = append(aggregatedErrors, err)
+       criticalError = true
+       return util.CombineErrors(aggregatedErrors...)
      }
      // ... rest of cleanup
    }
    
    // Continue with non-critical cleanup only if no critical errors
+   if !criticalError {
      // ... rest of method
+   }
    
    return util.CombineErrors(aggregatedErrors...)
  }
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🧹 Nitpick comments (6)
pkg/spdk/replica.go (4)

236-253: Consider adding goroutine pool for checksum registration.

The current implementation launches unlimited goroutines for checksum registration. Consider:

  1. Using a worker pool to limit concurrent goroutine count
  2. Adding a mechanism to track/wait for completion
  3. Implementing proper error propagation

Example implementation:

+ var checksumWorkerPool = make(chan struct{}, 10) // Limit concurrent workers
  
  if r.SnapshotChecksumEnabled {
    for _, bdevLvol := range bdevLvolMap {
      if !bdevLvol.DriverSpecific.Lvol.Snapshot {
        continue
      }
      if bdevLvol.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum] != "" {
        continue
      }
+     checksumWorkerPool <- struct{}{} // Acquire worker
      go func(bdevLvol *spdktypes.BdevInfo) {
+       defer func() { <-checksumWorkerPool }() // Release worker
        logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0])
        _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0])
        if err != nil {
          logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err)
        }
      }(bdevLvol)
    }
  }

284-288: Improve error message clarity.

The error message could be more descriptive about why construction is not allowed in this state.

-       return fmt.Errorf("invalid state %s with construct required flag %v for replica %s construct", r.State, r.constructRequired, r.Name)
+       return fmt.Errorf("cannot construct replica %s in state %s: construction is only allowed when constructRequired is true", r.Name, r.State)

1961-2096: Address TODOs for delta shallow copy implementation.

The method contains several TODOs related to delta shallow copy implementation. I can help implement:

  1. Reusing expired lvols for rebuilding
  2. Handling corrupted/outdated snapshots
  3. Implementing delta shallow copy logic

Would you like me to:

  1. Create a GitHub issue to track these improvements?
  2. Generate an implementation proposal for the delta shallow copy feature?

467-475: Optimize snapshot checksum comparison logic.

The current implementation can be made more efficient and clearer.

- if prev.SnapshotChecksum == "" {
-   prev.SnapshotChecksum = cur.SnapshotChecksum
- }
- if cur.SnapshotChecksum == "" {
-   prev.SnapshotChecksum = ""
- }
- if prev.SnapshotChecksum != cur.SnapshotChecksum {
+ // Only compare checksums if both exist
+ if prev.SnapshotChecksum != "" && cur.SnapshotChecksum != "" && 
+    prev.SnapshotChecksum != cur.SnapshotChecksum {
    return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", 
      cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name)
  }
+ // Update prev checksum if it's empty
+ if prev.SnapshotChecksum == "" {
+   prev.SnapshotChecksum = cur.SnapshotChecksum
+ }
pkg/spdk_test.go (2)

1205-1262: Consider breaking down the test function into smaller sub-tests.

The test function is quite large (>300 lines) and handles multiple test scenarios. Consider using Go's subtests to break it down into more manageable pieces:

func (s *TestSuite) TestSPDKMultipleThreadFastRebuilding(c *C) {
    testCases := []struct {
        name string
        test func(*C)
    }{
        {"BasicFastRebuilding", testBasicFastRebuilding},
        {"RebuildingWithInvalidSnapshots", testRebuildingWithInvalidSnapshots},
        // ... more test cases
    }
    
    for _, tc := range testCases {
        c.Run(tc.name, tc.test)
    }
}

1624-1661: Enhance error handling in waitReplicaSnapshotChecksumTimeout.

While the function works correctly, consider improving error handling:

  1. Add context to the assertion errors
  2. Consider returning errors instead of using assertions for better reusability
-func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) {
+func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) error {
     // ... existing code ...
-    c.Assert(err, IsNil)
+    if err != nil {
+        return fmt.Errorf("failed to get replica %s: %v", replicaName, err)
+    }
     // ... rest of the function
}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0258a and 77a2637.

📒 Files selected for processing (9)
  • Dockerfile.dapper (1 hunks)
  • pkg/api/types.go (3 hunks)
  • pkg/client/client.go (1 hunks)
  • pkg/spdk/engine.go (4 hunks)
  • pkg/spdk/replica.go (30 hunks)
  • pkg/spdk/server.go (4 hunks)
  • pkg/spdk/types.go (7 hunks)
  • pkg/spdk_test.go (7 hunks)
  • scripts/test (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • scripts/test
  • pkg/api/types.go
  • Dockerfile.dapper
  • pkg/spdk/engine.go
👮 Files not reviewed due to content moderation or server errors (3)
  • pkg/spdk/types.go
  • pkg/client/client.go
  • pkg/spdk/server.go
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method


[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🪛 golangci-lint (1.62.2)
pkg/spdk/types.go

121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

(typecheck)

🔇 Additional comments (5)
pkg/spdk/replica.go (2)

70-74: LGTM! Well-structured field additions.

The new fields enhance the replica's functionality:

  • SnapshotChecksumEnabled provides control over checksum verification
  • constructRequired improves error handling when stopping errored replicas

1499-1499: LGTM! Break statement is correct.

The break statement is necessary to exit the select statement after timeout. Static analysis flagged this as a false positive.

🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

pkg/spdk_test.go (3)

42-47: LGTM! Constants are well-defined and appropriately sized.

The constants are logically organized and their values are appropriate for testing purposes:

  • Disk and volume sizes are properly scaled using MiB units
  • Wait intervals and counts are reasonable for test scenarios

Also applies to: 62-65


1370-1376: Well-documented snapshot tree structure.

The ASCII diagrams clearly illustrate the snapshot tree structure, making it easier to understand the test scenarios and expected outcomes.

Also applies to: 1506-1512


1674-1674: LGTM! Good improvements to existing functions.

The changes improve the functions by:

  1. Using actual volume size instead of hardcoded values in revertSnapshot
  2. Adding timeout functionality to WaitForReplicaRebuildingComplete

Also applies to: 1682-1682, 1697-1697, 1707-1710

pkg/spdk/replica.go Show resolved Hide resolved
pkg/spdk_test.go Show resolved Hide resolved
@derekbit
Copy link
Member

derekbit commented Dec 26, 2024

Will be fixed by the longhorn/types update.

pkg/api/types.go:60:24: l.SnapshotChecksum undefined (type *spdkrpc.Lvol has no field or method SnapshotChecksum)
pkg/api/types.go:78:3: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol

Comment on lines 153 to +155
UserCreated: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.UserCreated] == strconv.FormatBool(true),
SnapshotTimestamp: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotTimestamp],
SnapshotChecksum: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we need to check if the xattrs exist?

Copy link
Contributor Author

@shuo-wu shuo-wu Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO. The upper layer should guarantee that this bdev is a lvol, which means bdev.DriverSpecific.Lvol not nil. Then if Xattrs[spdkclient.SnapshotChecksum] exists or not does not matter.

Copy link

mergify bot commented Dec 26, 2024

This pull request is now in conflict. Could you fix it @shuo-wu? 🙏

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
pkg/spdk/server.go (1)

Line range hint 910-910: Consider logging cleanup errors.

Using _ to ignore the return value from doCleanupForRebuildingDst could mask potential cleanup issues. Consider logging any errors:

-_ = r.doCleanupForRebuildingDst(spdkClient)
+if err := r.doCleanupForRebuildingDst(spdkClient); err != nil {
+    r.log.WithError(err).Warn("Failed to cleanup rebuilding destination, continuing anyway")
+}
pkg/spdk/replica.go (2)

1961-2096: LGTM with suggestion for future improvement.

The method is well structured with thorough error handling and clear documentation. However, consider breaking it down into smaller functions in the future for better maintainability:

  • prepareRebuildingLvol
  • handleExistingSnapshot
  • reuseExpiredLvol

236-253: Implement goroutine pool for checksum registration.

The current implementation launches unlimited goroutines for checksum registration which could cause resource issues with many snapshots. Consider:

  1. Implementing a worker pool to limit concurrent checksum registrations
  2. Adding a mechanism to track and report failed registrations
  3. Implementing retry logic for failed registrations

Would you like me to help implement a worker pool solution for the checksum registration?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77a2637 and 94c6e34.

📒 Files selected for processing (2)
  • pkg/spdk/replica.go (30 hunks)
  • pkg/spdk/server.go (4 hunks)
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go

[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method


[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method

🔇 Additional comments (4)
pkg/spdk/server.go (2)

Line range hint 692-716: LGTM! Input validation and error handling look good.

The changes to include dstRebuildingLvolAddress parameter and modify the return type are well implemented with proper validation and error handling.


Line range hint 735-745: LGTM! Function simplification improves maintainability.

The function has been simplified to focus on snapshot name validation while maintaining proper error handling.

pkg/spdk/replica.go (2)

70-74: LGTM! New fields are well documented.

The addition of SnapshotChecksumEnabled and constructRequired fields with clear documentation improves the code's maintainability.


1467-1549: 🛠️ Refactor suggestion

Refactor complex method to improve maintainability.

The method has high cyclomatic complexity and could benefit from refactoring:

  1. Extract monitoring logic into a separate method
  2. Add proper error handling for goroutine panics
  3. Remove redundant break statement in select
+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+    defer func() {
+        if r := recover(); r != nil {
+            r.log.Errorf("Panic in shallow copy monitoring: %v", r)
+        }
+    }()
+    
+    timer := time.NewTimer(MaxShallowCopyWaitTime)
+    defer timer.Stop()
+    ticker := time.NewTicker(ShallowCopyCheckInterval)
+    defer ticker.Stop()
+    
+    for {
+        select {
+        case <-timer.C:
+            r.log.Error("Timeout waiting for shallow copy")
+            return
+        case <-ticker.C:
+            r.Lock()
+            if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+                r.Unlock()
+                return
+            }
+            status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+            r.Unlock()
+            if err != nil || status.State == types.ProgressStateComplete {
+                return
+            }
+        }
+    }
+}

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

pkg/spdk/replica.go Outdated Show resolved Hide resolved
pkg/spdk/engine.go Show resolved Hide resolved

timer.Reset(MaxShallowCopyWaitTime)
continuousRetryCount := 0
for finished := false; !finished; {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use

for {
    select {
    case <-timer.C:
        ...
    case <-ticker.C:
        ...
        if shallowCopyStatus.State == helpertypes.ShallowCopyStateComplete {
            ...
            break
        }
    }
}

and remove finished flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO. The break in your code means leaving from select rather than the whole for loop

pkg/spdk/replica.go Show resolved Hide resolved
pkg/spdk/replica.go Show resolved Hide resolved
pkg/spdk/replica.go Outdated Show resolved Hide resolved
pkg/spdk/replica.go Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
pkg/spdk/types.go (1)

Line range hint 469-477: Improve checksum comparison logic.

The current checksum comparison logic might lead to data inconsistency. If both checksums are empty, they're considered equal. Consider adding a warning log in this case.

 	if prev.SnapshotChecksum == "" {
+		logrus.Warnf("Empty checksum found for lvol %s during comparison", prev.Name)
 		prev.SnapshotChecksum = cur.SnapshotChecksum
 	}
 	if cur.SnapshotChecksum == "" {
+		logrus.Warnf("Empty checksum found for lvol %s during comparison", cur.Name)
 		prev.SnapshotChecksum = ""
 	}
🧹 Nitpick comments (6)
pkg/spdk/types.go (1)

38-39: Consider using time.Duration for time constants.

For better type safety and clarity, consider using time.Duration for time-related constants.

-	MaxShallowCopyWaitTime   = 72 * time.Hour
-	ShallowCopyCheckInterval = 3 * time.Second
+	MaxShallowCopyWaitTime   time.Duration = 72 * time.Hour
+	ShallowCopyCheckInterval time.Duration = 3 * time.Second
pkg/spdk/replica.go (5)

238-255: Consider using a goroutine pool for checksum registration.

The non-blocking checksum registration is good, but spawning unlimited goroutines could lead to resource exhaustion under heavy load.

Consider implementing a worker pool pattern:

+var checksumWorkerPool chan struct{}
+
+func init() {
+    checksumWorkerPool = make(chan struct{}, 10) // Adjust pool size as needed
+}

-// TODO: Use a goroutine pool
-go func() {
+go func(bdevLvol *spdktypes.BdevInfo) {
+    checksumWorkerPool <- struct{}{} // Acquire worker
+    defer func() { <-checksumWorkerPool }() // Release worker
     logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0])
     _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0])
     if err != nil {
         logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err)
     }
-}()
+}(bdevLvol)

469-478: Enhance error messages for checksum mismatches.

The checksum comparison logic is correct, but the error message could be more descriptive.

Consider this improvement:

-return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name)
+return fmt.Errorf("snapshot checksum mismatch for lvol %s: current='%v' previous='%v'", prev.Name, cur.SnapshotChecksum, prev.SnapshotChecksum)

1963-2098: Consider breaking down the complex method.

The method is quite long and handles multiple responsibilities. Consider extracting some logic into helper methods.

Suggested structure:

  1. Extract snapshot validation logic
  2. Extract rebuilding lvol cleanup logic
  3. Extract expired lvol reuse logic

Example:

+func (r *Replica) validateRebuildingSnapshot(snapshotName string) (*api.Lvol, error) {
+    srcSnapSvcLvol := r.rebuildingDstCache.rebuildingSnapshotMap[snapshotName]
+    if srcSnapSvcLvol == nil {
+        return nil, fmt.Errorf("cannot find snapshot %s in rebuilding list", snapshotName)
+    }
+    return srcSnapSvcLvol, nil
+}

+func (r *Replica) cleanupExistingRebuildingLvol(spdkClient *spdkclient.Client) error {
+    // Extract cleanup logic here
+}

1686-1689: Track implementation of delta shallow copy.

Multiple TODO comments indicate planned implementation of delta shallow copy functionality. This will optimize the rebuilding process.

Would you like me to create a GitHub issue to track the implementation of delta shallow copy? The issue would include:

  1. Implementation of lvol checksum API
  2. Implementation of delta shallow copy API
  3. Integration with the existing rebuilding logic

Also applies to: 1742-1753


Line range hint 1866-1961: Enhance error logging during cleanup.

While the error aggregation is good, the error logging could be more detailed to aid in debugging.

Consider adding more context to error logs:

-r.log.WithError(err).Errorf("Failed to disconnect the external src snapshot bdev %s for rebuilding dst cleanup, will continue", r.rebuildingDstCache.externalSnapshotBdevName)
+r.log.WithError(err).WithFields(logrus.Fields{
+    "replica": r.Name,
+    "snapshot": r.rebuildingDstCache.externalSnapshotName,
+    "bdev": r.rebuildingDstCache.externalSnapshotBdevName,
+}).Errorf("Failed to disconnect external src snapshot bdev during rebuilding dst cleanup, continuing with remaining cleanup steps")
🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82f564c and 933cfc6.

📒 Files selected for processing (4)
  • pkg/api/types.go (3 hunks)
  • pkg/spdk/replica.go (30 hunks)
  • pkg/spdk/types.go (7 hunks)
  • pkg/spdk_test.go (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/api/types.go
  • pkg/spdk_test.go
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go

[notice] 1470-1553: pkg/spdk/replica.go#L1470-L1553
Complex Method


[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)


[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method

🔇 Additional comments (3)
pkg/spdk/replica.go (3)

70-74: LGTM: New fields enhance replica functionality.

The new fields improve the replica's capabilities:

  • SnapshotChecksumEnabled adds data integrity verification through snapshot checksums
  • constructRequired improves error handling when stopping errored replicas

284-290: LGTM: Improved state validation logic.

The changes enhance error handling by properly validating the constructRequired flag and ensuring correct state transitions.


1469-1529: 🛠️ Refactor suggestion

Refactor monitoring logic into a separate method.

The method is complex and handles multiple responsibilities. Consider extracting the monitoring logic.

Suggested refactoring:

+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+    timer := time.NewTimer(MaxShallowCopyWaitTime)
+    defer timer.Stop()
+    ticker := time.NewTicker(ShallowCopyCheckInterval)
+    defer ticker.Stop()
+    
+    for {
+        select {
+        case <-timer.C:
+            r.log.Errorf("Timeout waiting for shallow copy %v complete", shallowCopyOpID)
+            return
+        case <-ticker.C:
+            r.Lock()
+            if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+                r.Unlock()
+                return
+            }
+            status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+            r.Unlock()
+            if err != nil || status.State == types.ProgressStateError || status.State == types.ProgressStateComplete {
+                return
+            }
+        }
+    }
+}

Also, remove the useless break statement at line 1501 and use return instead.

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)

Copy link

mergify bot commented Dec 27, 2024

This pull request is now in conflict. Could you fix it @shuo-wu? 🙏

@derekbit
Copy link
Member

@coderabbitai pause

Copy link

coderabbitai bot commented Dec 27, 2024

✅ Actions performed

Reviews paused.

Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, LGTM. Only some comments and conflicts.

pkg/spdk/engine.go Show resolved Hide resolved
pkg/spdk/engine.go Outdated Show resolved Hide resolved
pkg/spdk/replica.go Outdated Show resolved Hide resolved
@@ -182,6 +191,8 @@ func NewReplica(ctx context.Context, replicaName, lvsName, lvsUUID string, specS
SpecSize: roundedSpecSize,
State: types.InstanceStatePending,

SnapshotChecksumEnabled: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we allow users to disable snapshot checksum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, YES.
I haven't checked if the v1 checksum setting fits v2 or not. Hence we directly enabled it by default for now.

pkg/spdk/replica.go Outdated Show resolved Hide resolved
The 1st defer invocation e.initiator.Resume() should not overwrite
the existing error. Otherwise, the next defer call
e.replicaShallowCopy() will be excuted.

Longhorn 9488

Signed-off-by: Shuo Wu <[email protected]>
To reuse possible existing snapshots, the online rebuilding flow
is refactored. More specifically:
1. Dst replica will re-create a rebuilding lvol under the correct
snap before each shallow copy start
2. During dst replica rebuilding lvol preparation, the dst replica
will try to reuse the existing snapshots, non-empty previous head,
rebuilding lvols, and expired lvols.
3. During dst replica snapshot shallow copy, the shallow copy can
be skipped if the snapshot is there and intact.
4. During dst replica snapshot creation stage, it will blindly
clean up the rebuilding lvol and correct the current rebuilding
snapshot parent.
5. The dest replica rebuilding cleanup is responisble for cleaning
up redundant lvols only (after rebuilding finish)
6. Src replica will blindly attach the exposed rebuilding lvol at
the beginning of each shallow copy, then detach it in the end.
7. The replica deletion function is responsible for cleaning up
all related lvols when Longhorn no longer wants (to reuse) this
replica.

Longhorn 9488

Signed-off-by: Shuo Wu <[email protected]>
@shuo-wu shuo-wu force-pushed the reuse branch 2 times, most recently from 0aa2daa to c4579c9 Compare December 27, 2024 08:05
Longhorn 9488

Signed-off-by: Shuo Wu <[email protected]>
…after shallow copy start

Longhorn 9488

Signed-off-by: Shuo Wu <[email protected]>
Since the delta shallow copy API is not available yet.
Need to revert this commit when the API is ready.

Longhorn 9488

Signed-off-by: Shuo Wu <[email protected]>
Longhorn 8666, 9488

Signed-off-by: Shuo Wu <[email protected]>
Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekbit derekbit merged commit 884e1cb into longhorn:main Dec 27, 2024
7 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants