-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: v2 fast rebuilding #278
Conversation
Note Reviews pausedUse the following commands to manage reviews:
WalkthroughThis pull request introduces several modifications to the SPDK (Storage Performance Development Kit) related code, focusing on updates to the Changes
Sequence DiagramsequenceDiagram
participant Client
participant Server
participant Engine
participant Replica
Client->>Server: ReplicaRebuildingSrcShallowCopyStart
Server->>Engine: Initiate Shallow Copy
Engine->>Replica: Prepare Rebuilding Lvol
Replica-->>Engine: Lvol Prepared
Engine->>Replica: Start Shallow Copy
Replica-->>Engine: Shallow Copy Status
Engine-->>Server: Shallow Copy Progress
Server-->>Client: Operation Complete
This sequence diagram illustrates the high-level flow of the replica rebuilding shallow copy process, highlighting the interactions between the client, server, engine, and replica components. Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #278 +/- ##
========================================
- Coverage 9.17% 8.69% -0.48%
========================================
Files 23 23
Lines 8536 8713 +177
========================================
- Hits 783 758 -25
- Misses 7646 7856 +210
+ Partials 107 99 -8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
793fe70
to
8c4f187
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (14)
pkg/client/client.go (3)
266-266
: Add more context to error messages.This error is clear about the missing parameter. Consider augmenting the message with the function name or caller context for quicker troubleshooting.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests
284-285
: Add concurrency notes in docstring.The new docstring is good. Consider clarifying if the method is safe to call concurrently from multiple goroutines, or if it requires external synchronization.
298-300
: Pass-through request parameters.All request fields are properly assigned. When refactoring in future, consider centralizing these parameter assignments to reduce duplication.
pkg/spdk/server.go (2)
717-717
: Handle the error gracefully.If r.RebuildingSrcShallowCopyStart fails, consider adding logs with details about the replica's current state to diagnose potential concurrency or environment issues.
736-736
: Possible optimization.When reading the shallow copy status, consider caching results or throttling requests to avoid excessive gRPC calls.
pkg/spdk/replica.go (6)
262-264
: Consider more logging.When the state is "Stopped" but constructRequired is set, you skip here. A debug log might help diagnose scenarios in which this branch triggers unexpectedly.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests
266-266
: Add context to error message.The message does not mention whether the state is invalid or if the logic disallows transitions. More detail could help debugging if user logs are minimal.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests
561-562
: Document fallback logic.Lines 561-562 handle the case where the chain might require removal of the nil head. Document the reasoning to help future maintainers.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests
570-570
: Add error handling.If logging for “Replica cloned a new head lvol” fails or returns partial success, consider capturing it in debug logs for later.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests
575-575
: Provide meaningful logs.This line logs “Replica created a new head lvol.” Possibly add parent lvol alias in the log to clarify the relationship.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests
655-655
: Handle error more gracefully.A user might attempt unexpected revert chains. Log or handle partial rest states if getRootLvolName is missing.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by testspkg/spdk/engine.go (3)
1166-1166
: Ensure correct client cleanup.
Good practice to log any close failures. In some scenarios, a retry mechanism might be beneficial.🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests
1196-1198
: Initiator resume failure handling.
Properly wrapping errors is good. Consider adding a fallback or re-try logic for cases where resume fails, if possible.🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests
1344-1351
: Finalizing shallow copy.
The code gracefully logs partial progress if it's incomplete. Ensure any partial state is cleaned up if needed.🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (5)
go.mod
is excluded by!go.mod
go.sum
is excluded by!**/*.sum
,!go.sum
vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk.pb.go
is excluded by!**/*.pb.go
,!**/generated/**
,!vendor/**
vendor/github.com/longhorn/types/pkg/generated/spdkrpc/spdk_grpc.pb.go
is excluded by!**/*.pb.go
,!**/generated/**
,!vendor/**
vendor/modules.txt
is excluded by!vendor/**
📒 Files selected for processing (8)
Dockerfile.dapper
(1 hunks)pkg/client/client.go
(1 hunks)pkg/spdk/engine.go
(4 hunks)pkg/spdk/replica.go
(26 hunks)pkg/spdk/server.go
(4 hunks)pkg/spdk/types.go
(4 hunks)pkg/spdk_test.go
(5 hunks)scripts/test
(1 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
pkg/client/client.go
[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests
[warning] 266-266: pkg/client/client.go#L266
Added line #L266 was not covered by tests
[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests
pkg/spdk/engine.go
[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests
[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests
[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests
[warning] 1166-1166: pkg/spdk/engine.go#L1166
Added line #L1166 was not covered by tests
[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests
[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests
[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests
[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests
[warning] 1196-1198: pkg/spdk/engine.go#L1196-L1198
Added lines #L1196 - L1198 were not covered by tests
[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests
[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests
[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests
[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests
[warning] 1344-1351: pkg/spdk/engine.go#L1344-L1351
Added lines #L1344 - L1351 were not covered by tests
pkg/spdk/replica.go
[warning] 262-264: pkg/spdk/replica.go#L262-L264
Added lines #L262 - L264 were not covered by tests
[warning] 266-266: pkg/spdk/replica.go#L266
Added line #L266 was not covered by tests
[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests
[warning] 561-562: pkg/spdk/replica.go#L561-L562
Added lines #L561 - L562 were not covered by tests
[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests
[warning] 570-570: pkg/spdk/replica.go#L570
Added line #L570 was not covered by tests
[warning] 575-575: pkg/spdk/replica.go#L575
Added line #L575 was not covered by tests
[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests
[warning] 655-655: pkg/spdk/replica.go#L655
Added line #L655 was not covered by tests
[warning] 769-778: pkg/spdk/replica.go#L769-L778
Added lines #L769 - L778 were not covered by tests
[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests
[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests
[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go
[notice] 1435-1518: pkg/spdk/replica.go#L1435-L1518
Complex Method
[notice] 1466-1466: pkg/spdk/replica.go#L1466
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
[notice] 1929-2066: pkg/spdk/replica.go#L1929-L2066
Complex Method
🔇 Additional comments (31)
pkg/client/client.go (3)
261-264
: Ensure consistent parameter error handling.
These checks correctly enforce required parameters for the shallow copy start method. Make sure all error messages are consistently phrased to help with debugging.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 263-264: pkg/client/client.go#L263-L264
Added lines #L263 - L264 were not covered by tests
273-276
: Validate call arguments.
This logic constructs the gRPC request to the server. All parameters are consistently used. Make sure to handle potential client-side timeouts or partial failures.
279-281
: Error wrapping is good.
Using errors.Wrapf provides clear traceability. Ensure you’ve tested large snapshots or long snapshot names to confirm the error message does not become truncated in logs.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 279-279: pkg/client/client.go#L279
Added line #L279 was not covered by tests
pkg/spdk/server.go (3)
693-702
: Validate request parameters.
These lines correctly verify that the request has valid fields. Check any logs or monitoring to see if more context is needed when logging invalid arguments.
713-713
: Check for concurrency issues.
This method call might happen concurrently with other operations on the same replica. Ensure there's adequate locking or checks to avoid races.
1088-1090
: Validate arguments for EngineReplicaAdd.
It's good to ensure both replicaName and replicaAddress are present. If either is missing, an error is returned early. This is consistent with the rest of the codebase.
pkg/spdk_test.go (1)
1203-1583
: Comprehensive multi-thread test coverage.
This new test provides a thorough scenario exercising fast rebuilding with multiple replicas and snapshots. Consider the following for further improvement:
• Test random node failures or abrupt restarts to ensure resilience.
• Add short descriptive logging or sub-tests to pinpoint any failing scenario quickly.
pkg/spdk/replica.go (7)
72-74
: Track constructRequired carefully.
Marking a replica as requiring construction when it errors out can help with recovery logic. Ensure code paths that toggle constructRequired handle concurrency properly.
545-559
: Clean separation of chain logic.
Creating or cloning the head from the parent lvol is well structured. Confirm you have robust test coverage for nil parent or mismatch in SpecSize.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 545-559: pkg/spdk/replica.go#L545-L559
Added lines #L545 - L559 were not covered by tests
565-566
: Perform size validation.
Here, you resize if parent size and spec size differ. Make sure the parent is always a valid reference to avoid unexpected resizing.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 565-566: pkg/spdk/replica.go#L565-L566
Added lines #L565 - L566 were not covered by tests
651-653
: Check snapshot references.
Distinguishing between parent snapshots and the backing image is correct. Ensure references to root are consistent if a new snapshot is introduced at run time.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 651-653: pkg/spdk/replica.go#L651-L653
Added lines #L651 - L653 were not covered by tests
846-848
: Track state transitions.
When previous state is error and you set constructRequired, consider whether an immediate re-construct attempt is desired or if a backoff is beneficial.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 846-848: pkg/spdk/replica.go#L846-L848
Added lines #L846 - L848 were not covered by tests
878-878
: Graceful concurrency check.
Immediately stopping or cleaning up the source building cache can race other code calls. Evaluate an atomic operation to avoid incomplete states.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 878-878: pkg/spdk/replica.go#L878
Added line #L878 was not covered by tests
906-906
: Double-check logic for partial leftover lvols.
If leftover snapshots remain from partial merges, ensure they don’t hamper future new replicas. Additional log statements might help.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 906-906: pkg/spdk/replica.go#L906
Added line #L906 was not covered by tests
scripts/test (1)
61-61
: Extended test timeout is beneficial.
Increasing the timeout to 40 minutes allows more comprehensive tests to complete successfully, especially with complex slow I/O scenarios. Ensure CI infrastructure can handle this longer duration.
pkg/spdk/types.go (4)
26-26
: Looks good for new suffix usage.
The new "expired" suffix is clear and consistent with existing suffix conventions.
38-39
: Reasonable default intervals.
Setting a maximum wait time of 72 hours and a 3-second check interval seems balanced for long-running copy processes. Confirm these values are suitable for all operational scenarios.
200-202
: Appropriate name generator.
Clearly constructs a unique LVOL name for expired replicas using a short UUID. Ensure collisions are handled upstream if there's any possibility of reusing suffixes.
212-214
: Simple check for expired suffix.
Implementation is straightforward and aligns with previously mentioned naming conventions.
pkg/spdk/engine.go (11)
1127-1129
: Handle previous mode tracking carefully.
Maintaining the previous mode in a local variable is a good approach for logging rollback scenarios. Consider verifying if there's any edge case (e.g., uninitialized map or missing key).
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1127-1129: pkg/spdk/engine.go#L1127-L1129
Added lines #L1127 - L1129 were not covered by tests
1131-1136
: Assigning ERR mode with address.
Setting a replica to ERR mode while preserving its address helps for debugging. Consider if partial transitions might require special handling.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1131-1136: pkg/spdk/engine.go#L1131-L1136
Added lines #L1131 - L1136 were not covered by tests
1138-1139
: Comprehensive logging.
Attaching both the entire replica status map and the error ensures thorough observability.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1138-1139: pkg/spdk/engine.go#L1138-L1139
Added lines #L1138 - L1139 were not covered by tests
1169-1169
: Symmetry in client closure logging.
Matches the pattern used for source replica.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1169-1169: pkg/spdk/engine.go#L1169
Added line #L1169 was not covered by tests
1175-1175
: Warn vs Error logging.
Currently using Errorf for shallow copy failure. This is appropriate if it indicates a critical state that blocks normal operations.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1175-1175: pkg/spdk/engine.go#L1175
Added line #L1175 was not covered by tests
1177-1178
: Error logging reason.
Clear enough messages to differentiate between “replica error” vs “engine error.”
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1177-1178: pkg/spdk/engine.go#L1177-L1178
Added lines #L1177 - L1178 were not covered by tests
1181-1182
: Finish step cleanup.
Makes sense to attempt finishing even on errors, to avoid resource leakage.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1181-1182: pkg/spdk/engine.go#L1181-L1182
Added lines #L1181 - L1182 were not covered by tests
1303-1304
: Log message clarity.
This readies the logs with enough details to know source and destination.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1303-1304: pkg/spdk/engine.go#L1303-L1304
Added lines #L1303 - L1304 were not covered by tests
1311-1315
: Timer and ticker approach.
Combining a hard cap with periodic checks is a practical method to limit the shallow copy duration.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1311-1315: pkg/spdk/engine.go#L1311-L1315
Added lines #L1311 - L1315 were not covered by tests
1324-1338
: Robust retry logic.
Incrementing a local counter and checking it against maxNumRetries helps avoid indefinite retries. Ensure maxNumRetries is large enough for potential slow I/O conditions.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1324-1338: pkg/spdk/engine.go#L1324-L1338
Added lines #L1324 - L1338 were not covered by tests
1340-1342
: Error path exit.
Immediate exit on copying error is correct. Potentially store partial progress for diagnostics.
🧰 Tools
🪛 GitHub Check: codecov/patch
[warning] 1340-1342: pkg/spdk/engine.go#L1340-L1342
Added lines #L1340 - L1342 were not covered by tests
Dockerfile.dapper (1)
20-20
: Updating commit ID for go-spdk-helper.
This likely addresses compatibility with the new replica rebuilding flow. Consider verifying the updated commit for any backward-incompatible changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (7)
pkg/spdk/types.go (2)
26-26
: Additional constants
DefiningReplicaExpiredLvolSuffix
,MaxShallowCopyWaitTime
, andShallowCopyCheckInterval
is clear. Consider adding docstrings so users know how these values are derived.Also applies to: 37-39
203-205
: Short UUID usage
Truncating the UUID to 8 characters can cause potential collisions in longer-running systems. Consider using a larger substring or a distinct suffix strategy.pkg/spdk_test.go (1)
1205-1588
: Comprehensive multi-thread fast rebuilding test
This new test suite thoroughly exercises snapshot creation, reverts, and concurrent rebuild scenarios. Consider adding negative tests (e.g., verifying behavior under insufficient disk space) and capturing logs for deeper debugging.pkg/spdk/replica.go (4)
236-253
: Consider using a goroutine pool for checksum registration.The TODO comment suggests using a goroutine pool, which is a good optimization. Currently, each snapshot checksum registration spawns a new goroutine, which could lead to resource exhaustion with many snapshots.
Would you like me to propose an implementation using a worker pool pattern?
1467-1527
: Consider breaking down the complex shallow copy logic.The method
RebuildingSrcShallowCopyStart
is quite complex with nested defers and goroutine management. Consider extracting the goroutine management into a separate method for better maintainability.Example refactor:
func (r *Replica) RebuildingSrcShallowCopyStart(...) (err error) { // ... existing locking code ... if err = r.rebuildingSrcDetachNoLock(spdkClient); err != nil { return err } if err = r.rebuildingSrcAttachNoLock(spdkClient); err != nil { return err } + return r.startShallowCopyAndWait(spdkClient, snapshotName) } +func (r *Replica) startShallowCopyAndWait(spdkClient *spdkclient.Client, snapshotName string) error { + // Extract goroutine management logic here }🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
1496-1499
: Remove unnecessary break statement.The
break
statement after settingstopWaiting
is redundant as the loop will exit naturally whenstopWaiting
is checked.case <-timer.C: r.log.Errorf("Timeout waiting for the src replica %s shallow copy %v complete before detaching the rebuilding lvol of the dst replica %s, will give up", r.Name, shallowCopyOpID, r.rebuildingSrcCache.dstReplicaName) stopWaiting = true - break
🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
1961-2096
: Consider splitting the complex rebuilding preparation logic.The method
rebuildingDstShallowCopyPrepare
handles multiple scenarios and has deep nesting. Consider splitting it into smaller, focused methods:
- Handling existing snapshot lvols
- Managing expired lvols
- Creating new lvols
Would you like me to propose a refactored structure that improves readability while maintaining the same functionality?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
Dockerfile.dapper
(1 hunks)pkg/api/types.go
(3 hunks)pkg/client/client.go
(1 hunks)pkg/spdk/engine.go
(4 hunks)pkg/spdk/replica.go
(30 hunks)pkg/spdk/server.go
(4 hunks)pkg/spdk/types.go
(7 hunks)pkg/spdk_test.go
(7 hunks)scripts/test
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- scripts/test
- Dockerfile.dapper
- pkg/spdk/engine.go
🧰 Additional context used
🪛 golangci-lint (1.62.2)
pkg/spdk/types.go
121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol
(typecheck)
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go
[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method
🔇 Additional comments (15)
pkg/client/client.go (4)
261-266
: Add parameter validation for consistency
You've introduced a new required parameter, dstRebuildingLvolAddress
. The checks for empty strings look good, but consider documenting valid formats or expected address patterns to help users avoid passing malformed values.
273-281
: Good usage of error wrapping and final return
The flow for invoking the RPC and returning either a wrapped error or nil
is consistent. This is an excellent pattern for error tracing.
284-285
: Accurate inline documentation
The docstring now correctly matches the adjusted parameters and behavior for the shallow copy check.
298-300
: Confirm removal of shallow copy op ID usage
With the old shallowCopyOpId
removed, verify that no references remain in the codebase and all callers rely solely on the three parameters.
pkg/api/types.go (2)
43-43
: New field addition looks appropriate
Adding SnapshotChecksum
to the Lvol
struct helps track data integrity. Ensure it is assigned consistently in all code paths.
60-60
: Propagation of SnapshotChecksum
The lines correctly serialize and deserialize SnapshotChecksum
. This keeps the in-memory representation in sync with the Proto definitions.
Also applies to: 78-78
pkg/spdk/types.go (4)
79-79
: Maintaining consistency with the API
The SnapshotChecksum
field in Lvol
aligns well with the API structures. Good job on consistent naming.
121-121
: Snapshot checksum annotation
Assigning the field here is coherent with the rest of the code for consistent marshalling.
🧰 Tools
🪛 golangci-lint (1.62.2)
121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol
(typecheck)
155-155
: Validate existence of xattr
Ensure you handle the case where SnapshotChecksum
might not exist in Xattrs
. Otherwise, key lookups can result in empty strings or unexpected outcomes.
215-217
: Simple prefix check
The naming convention with "expired"
is straightforward. This function is simple and works well given the suffix-based approach.
pkg/spdk/server.go (3)
693-693
: Check for mandatory address parameter
The new logic verifying DstRebuildingLvolAddress
is essential. Returning gRPC InvalidArgument
is consistent with existing error handling patterns. The follow-up call and final return are appropriately structured.
Also applies to: 700-702, 713-713, 717-717
736-736
: Removed shallow copy operation ID
Invoking r.RebuildingSrcShallowCopyCheck
using only SnapshotName
aligns with the updated interface. This simplifies the check logic.
1088-1090
: Stronger validation in EngineReplicaAdd
Ensuring both ReplicaName
and ReplicaAddress
are provided helps prevent incomplete engine states and reduces potential runtime errors.
pkg/spdk/replica.go (2)
70-74
: LGTM: New fields for snapshot checksum and rebuilding.
The new fields enhance the functionality:
SnapshotChecksumEnabled
provides control over snapshot checksum registrationconstructRequired
helps manage errored replica state transitionsrebuildingLvolAddress
tracks rebuilding lvol location
Also applies to: 96-98
878-880
: LGTM: Error state handling improvement.
The addition of setting constructRequired
when transitioning from error state improves replica recovery handling.
Updated this PR for lvol checksum API & lvol detach parent API integration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
pkg/spdk/replica.go (1)
Line range hint
1864-1960
: Enhance error handling for critical cleanup operations.While aggregating errors is good for cleanup, some critical failures should stop the process to prevent inconsistent state.
func (r *Replica) doCleanupForRebuildingDst(spdkClient *spdkclient.Client) error { aggregatedErrors := []error{} + criticalError := false if r.rebuildingDstCache.externalSnapshotBdevName != "" { if err := disconnectNVMfBdev(spdkClient, r.rebuildingDstCache.externalSnapshotBdevName); err != nil { r.log.WithError(err).Error("Failed to disconnect external snapshot bdev") aggregatedErrors = append(aggregatedErrors, err) + criticalError = true + return util.CombineErrors(aggregatedErrors...) } // ... rest of cleanup } // Continue with non-critical cleanup only if no critical errors + if !criticalError { // ... rest of method + } return util.CombineErrors(aggregatedErrors...) }🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method
🧹 Nitpick comments (6)
pkg/spdk/replica.go (4)
236-253
: Consider adding goroutine pool for checksum registration.The current implementation launches unlimited goroutines for checksum registration. Consider:
- Using a worker pool to limit concurrent goroutine count
- Adding a mechanism to track/wait for completion
- Implementing proper error propagation
Example implementation:
+ var checksumWorkerPool = make(chan struct{}, 10) // Limit concurrent workers if r.SnapshotChecksumEnabled { for _, bdevLvol := range bdevLvolMap { if !bdevLvol.DriverSpecific.Lvol.Snapshot { continue } if bdevLvol.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum] != "" { continue } + checksumWorkerPool <- struct{}{} // Acquire worker go func(bdevLvol *spdktypes.BdevInfo) { + defer func() { <-checksumWorkerPool }() // Release worker logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0]) _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0]) if err != nil { logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err) } }(bdevLvol) } }
284-288
: Improve error message clarity.The error message could be more descriptive about why construction is not allowed in this state.
- return fmt.Errorf("invalid state %s with construct required flag %v for replica %s construct", r.State, r.constructRequired, r.Name) + return fmt.Errorf("cannot construct replica %s in state %s: construction is only allowed when constructRequired is true", r.Name, r.State)
1961-2096
: Address TODOs for delta shallow copy implementation.The method contains several TODOs related to delta shallow copy implementation. I can help implement:
- Reusing expired lvols for rebuilding
- Handling corrupted/outdated snapshots
- Implementing delta shallow copy logic
Would you like me to:
- Create a GitHub issue to track these improvements?
- Generate an implementation proposal for the delta shallow copy feature?
467-475
: Optimize snapshot checksum comparison logic.The current implementation can be made more efficient and clearer.
- if prev.SnapshotChecksum == "" { - prev.SnapshotChecksum = cur.SnapshotChecksum - } - if cur.SnapshotChecksum == "" { - prev.SnapshotChecksum = "" - } - if prev.SnapshotChecksum != cur.SnapshotChecksum { + // Only compare checksums if both exist + if prev.SnapshotChecksum != "" && cur.SnapshotChecksum != "" && + prev.SnapshotChecksum != cur.SnapshotChecksum { return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name) } + // Update prev checksum if it's empty + if prev.SnapshotChecksum == "" { + prev.SnapshotChecksum = cur.SnapshotChecksum + }pkg/spdk_test.go (2)
1205-1262
: Consider breaking down the test function into smaller sub-tests.The test function is quite large (>300 lines) and handles multiple test scenarios. Consider using Go's subtests to break it down into more manageable pieces:
func (s *TestSuite) TestSPDKMultipleThreadFastRebuilding(c *C) { testCases := []struct { name string test func(*C) }{ {"BasicFastRebuilding", testBasicFastRebuilding}, {"RebuildingWithInvalidSnapshots", testRebuildingWithInvalidSnapshots}, // ... more test cases } for _, tc := range testCases { c.Run(tc.name, tc.test) } }
1624-1661
: Enhance error handling in waitReplicaSnapshotChecksumTimeout.While the function works correctly, consider improving error handling:
- Add context to the assertion errors
- Consider returning errors instead of using assertions for better reusability
-func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) { +func waitReplicaSnapshotChecksumTimeout(c *C, spdkCli *client.SPDKClient, replicaName, targetSnapName string, timeoutInSecond int) error { // ... existing code ... - c.Assert(err, IsNil) + if err != nil { + return fmt.Errorf("failed to get replica %s: %v", replicaName, err) + } // ... rest of the function }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
Dockerfile.dapper
(1 hunks)pkg/api/types.go
(3 hunks)pkg/client/client.go
(1 hunks)pkg/spdk/engine.go
(4 hunks)pkg/spdk/replica.go
(30 hunks)pkg/spdk/server.go
(4 hunks)pkg/spdk/types.go
(7 hunks)pkg/spdk_test.go
(7 hunks)scripts/test
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- scripts/test
- pkg/api/types.go
- Dockerfile.dapper
- pkg/spdk/engine.go
👮 Files not reviewed due to content moderation or server errors (3)
- pkg/spdk/types.go
- pkg/client/client.go
- pkg/spdk/server.go
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go
[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method
🪛 golangci-lint (1.62.2)
pkg/spdk/types.go
121-121: unknown field SnapshotChecksum in struct literal of type spdkrpc.Lvol
(typecheck)
🔇 Additional comments (5)
pkg/spdk/replica.go (2)
70-74
: LGTM! Well-structured field additions.
The new fields enhance the replica's functionality:
SnapshotChecksumEnabled
provides control over checksum verificationconstructRequired
improves error handling when stopping errored replicas
1499-1499
: LGTM! Break statement is correct.
The break statement is necessary to exit the select statement after timeout. Static analysis flagged this as a false positive.
🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
pkg/spdk_test.go (3)
42-47
: LGTM! Constants are well-defined and appropriately sized.
The constants are logically organized and their values are appropriate for testing purposes:
- Disk and volume sizes are properly scaled using MiB units
- Wait intervals and counts are reasonable for test scenarios
Also applies to: 62-65
1370-1376
: Well-documented snapshot tree structure.
The ASCII diagrams clearly illustrate the snapshot tree structure, making it easier to understand the test scenarios and expected outcomes.
Also applies to: 1506-1512
1674-1674
: LGTM! Good improvements to existing functions.
The changes improve the functions by:
- Using actual volume size instead of hardcoded values in revertSnapshot
- Adding timeout functionality to WaitForReplicaRebuildingComplete
Also applies to: 1682-1682, 1697-1697, 1707-1710
Will be fixed by the longhorn/types update.
|
UserCreated: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.UserCreated] == strconv.FormatBool(true), | ||
SnapshotTimestamp: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotTimestamp], | ||
SnapshotChecksum: bdev.DriverSpecific.Lvol.Xattrs[spdkclient.SnapshotChecksum], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we need to check if the xattrs exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NO. The upper layer should guarantee that this bdev is a lvol, which means bdev.DriverSpecific.Lvol
not nil. Then if Xattrs[spdkclient.SnapshotChecksum]
exists or not does not matter.
This pull request is now in conflict. Could you fix it @shuo-wu? 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
pkg/spdk/server.go (1)
Line range hint
910-910
: Consider logging cleanup errors.Using
_
to ignore the return value fromdoCleanupForRebuildingDst
could mask potential cleanup issues. Consider logging any errors:-_ = r.doCleanupForRebuildingDst(spdkClient) +if err := r.doCleanupForRebuildingDst(spdkClient); err != nil { + r.log.WithError(err).Warn("Failed to cleanup rebuilding destination, continuing anyway") +}pkg/spdk/replica.go (2)
1961-2096
: LGTM with suggestion for future improvement.The method is well structured with thorough error handling and clear documentation. However, consider breaking it down into smaller functions in the future for better maintainability:
prepareRebuildingLvol
handleExistingSnapshot
reuseExpiredLvol
236-253
: Implement goroutine pool for checksum registration.The current implementation launches unlimited goroutines for checksum registration which could cause resource issues with many snapshots. Consider:
- Implementing a worker pool to limit concurrent checksum registrations
- Adding a mechanism to track and report failed registrations
- Implementing retry logic for failed registrations
Would you like me to help implement a worker pool solution for the checksum registration?
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pkg/spdk/replica.go
(30 hunks)pkg/spdk/server.go
(4 hunks)
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go
[notice] 1468-1551: pkg/spdk/replica.go#L1468-L1551
Complex Method
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
[notice] 1962-2099: pkg/spdk/replica.go#L1962-L2099
Complex Method
🔇 Additional comments (4)
pkg/spdk/server.go (2)
Line range hint 692-716
: LGTM! Input validation and error handling look good.
The changes to include dstRebuildingLvolAddress parameter and modify the return type are well implemented with proper validation and error handling.
Line range hint 735-745
: LGTM! Function simplification improves maintainability.
The function has been simplified to focus on snapshot name validation while maintaining proper error handling.
pkg/spdk/replica.go (2)
70-74
: LGTM! New fields are well documented.
The addition of SnapshotChecksumEnabled
and constructRequired
fields with clear documentation improves the code's maintainability.
1467-1549
: 🛠️ Refactor suggestion
Refactor complex method to improve maintainability.
The method has high cyclomatic complexity and could benefit from refactoring:
- Extract monitoring logic into a separate method
- Add proper error handling for goroutine panics
- Remove redundant break statement in select
+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+ defer func() {
+ if r := recover(); r != nil {
+ r.log.Errorf("Panic in shallow copy monitoring: %v", r)
+ }
+ }()
+
+ timer := time.NewTimer(MaxShallowCopyWaitTime)
+ defer timer.Stop()
+ ticker := time.NewTicker(ShallowCopyCheckInterval)
+ defer ticker.Stop()
+
+ for {
+ select {
+ case <-timer.C:
+ r.log.Error("Timeout waiting for shallow copy")
+ return
+ case <-ticker.C:
+ r.Lock()
+ if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+ r.Unlock()
+ return
+ }
+ status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+ r.Unlock()
+ if err != nil || status.State == types.ProgressStateComplete {
+ return
+ }
+ }
+ }
+}
Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1499-1499: pkg/spdk/replica.go#L1499
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
|
||
timer.Reset(MaxShallowCopyWaitTime) | ||
continuousRetryCount := 0 | ||
for finished := false; !finished; { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use
for {
select {
case <-timer.C:
...
case <-ticker.C:
...
if shallowCopyStatus.State == helpertypes.ShallowCopyStateComplete {
...
break
}
}
}
and remove finished
flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NO. The break
in your code means leaving from select
rather than the whole for
loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
pkg/spdk/types.go (1)
Line range hint
469-477
: Improve checksum comparison logic.The current checksum comparison logic might lead to data inconsistency. If both checksums are empty, they're considered equal. Consider adding a warning log in this case.
if prev.SnapshotChecksum == "" { + logrus.Warnf("Empty checksum found for lvol %s during comparison", prev.Name) prev.SnapshotChecksum = cur.SnapshotChecksum } if cur.SnapshotChecksum == "" { + logrus.Warnf("Empty checksum found for lvol %s during comparison", cur.Name) prev.SnapshotChecksum = "" }
🧹 Nitpick comments (6)
pkg/spdk/types.go (1)
38-39
: Consider using time.Duration for time constants.For better type safety and clarity, consider using
time.Duration
for time-related constants.- MaxShallowCopyWaitTime = 72 * time.Hour - ShallowCopyCheckInterval = 3 * time.Second + MaxShallowCopyWaitTime time.Duration = 72 * time.Hour + ShallowCopyCheckInterval time.Duration = 3 * time.Secondpkg/spdk/replica.go (5)
238-255
: Consider using a goroutine pool for checksum registration.The non-blocking checksum registration is good, but spawning unlimited goroutines could lead to resource exhaustion under heavy load.
Consider implementing a worker pool pattern:
+var checksumWorkerPool chan struct{} + +func init() { + checksumWorkerPool = make(chan struct{}, 10) // Adjust pool size as needed +} -// TODO: Use a goroutine pool -go func() { +go func(bdevLvol *spdktypes.BdevInfo) { + checksumWorkerPool <- struct{}{} // Acquire worker + defer func() { <-checksumWorkerPool }() // Release worker logrus.Debugf("Replica %v is registering checksum for snapshot %v", r.Name, bdevLvol.Aliases[0]) _, err := spdkClient.BdevLvolRegisterSnapshotChecksum(bdevLvol.Aliases[0]) if err != nil { logrus.Errorf("Replica %v failed to register checksum for snapshot %v: %v", r.Name, bdevLvol.Name, err) } -}() +}(bdevLvol)
469-478
: Enhance error messages for checksum mismatches.The checksum comparison logic is correct, but the error message could be more descriptive.
Consider this improvement:
-return fmt.Errorf("found mismatching lvol snapshot checksum %v with recorded prev lvol snapshot checksum %v when validating lvol %s", cur.SnapshotChecksum, prev.SnapshotChecksum, prev.Name) +return fmt.Errorf("snapshot checksum mismatch for lvol %s: current='%v' previous='%v'", prev.Name, cur.SnapshotChecksum, prev.SnapshotChecksum)
1963-2098
: Consider breaking down the complex method.The method is quite long and handles multiple responsibilities. Consider extracting some logic into helper methods.
Suggested structure:
- Extract snapshot validation logic
- Extract rebuilding lvol cleanup logic
- Extract expired lvol reuse logic
Example:
+func (r *Replica) validateRebuildingSnapshot(snapshotName string) (*api.Lvol, error) { + srcSnapSvcLvol := r.rebuildingDstCache.rebuildingSnapshotMap[snapshotName] + if srcSnapSvcLvol == nil { + return nil, fmt.Errorf("cannot find snapshot %s in rebuilding list", snapshotName) + } + return srcSnapSvcLvol, nil +} +func (r *Replica) cleanupExistingRebuildingLvol(spdkClient *spdkclient.Client) error { + // Extract cleanup logic here +}
1686-1689
: Track implementation of delta shallow copy.Multiple TODO comments indicate planned implementation of delta shallow copy functionality. This will optimize the rebuilding process.
Would you like me to create a GitHub issue to track the implementation of delta shallow copy? The issue would include:
- Implementation of lvol checksum API
- Implementation of delta shallow copy API
- Integration with the existing rebuilding logic
Also applies to: 1742-1753
Line range hint
1866-1961
: Enhance error logging during cleanup.While the error aggregation is good, the error logging could be more detailed to aid in debugging.
Consider adding more context to error logs:
-r.log.WithError(err).Errorf("Failed to disconnect the external src snapshot bdev %s for rebuilding dst cleanup, will continue", r.rebuildingDstCache.externalSnapshotBdevName) +r.log.WithError(err).WithFields(logrus.Fields{ + "replica": r.Name, + "snapshot": r.rebuildingDstCache.externalSnapshotName, + "bdev": r.rebuildingDstCache.externalSnapshotBdevName, +}).Errorf("Failed to disconnect external src snapshot bdev during rebuilding dst cleanup, continuing with remaining cleanup steps")🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
pkg/api/types.go
(3 hunks)pkg/spdk/replica.go
(30 hunks)pkg/spdk/types.go
(7 hunks)pkg/spdk_test.go
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/api/types.go
- pkg/spdk_test.go
🧰 Additional context used
🪛 GitHub Check: CodeFactor
pkg/spdk/replica.go
[notice] 1470-1553: pkg/spdk/replica.go#L1470-L1553
Complex Method
[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
[notice] 1964-2101: pkg/spdk/replica.go#L1964-L2101
Complex Method
🔇 Additional comments (3)
pkg/spdk/replica.go (3)
70-74
: LGTM: New fields enhance replica functionality.
The new fields improve the replica's capabilities:
SnapshotChecksumEnabled
adds data integrity verification through snapshot checksumsconstructRequired
improves error handling when stopping errored replicas
284-290
: LGTM: Improved state validation logic.
The changes enhance error handling by properly validating the constructRequired
flag and ensuring correct state transitions.
1469-1529
: 🛠️ Refactor suggestion
Refactor monitoring logic into a separate method.
The method is complex and handles multiple responsibilities. Consider extracting the monitoring logic.
Suggested refactoring:
+func (r *Replica) monitorShallowCopy(spdkClient *spdkclient.Client, shallowCopyOpID uint32, snapshotName string) {
+ timer := time.NewTimer(MaxShallowCopyWaitTime)
+ defer timer.Stop()
+ ticker := time.NewTicker(ShallowCopyCheckInterval)
+ defer ticker.Stop()
+
+ for {
+ select {
+ case <-timer.C:
+ r.log.Errorf("Timeout waiting for shallow copy %v complete", shallowCopyOpID)
+ return
+ case <-ticker.C:
+ r.Lock()
+ if r.rebuildingSrcCache.shallowCopyOpID != shallowCopyOpID {
+ r.Unlock()
+ return
+ }
+ status, err := r.rebuildingSrcShallowCopyStatusUpdateAndHandlingNoLock(spdkClient)
+ r.Unlock()
+ if err != nil || status.State == types.ProgressStateError || status.State == types.ProgressStateComplete {
+ return
+ }
+ }
+ }
+}
Also, remove the useless break
statement at line 1501 and use return
instead.
Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Check: CodeFactor
[notice] 1501-1501: pkg/spdk/replica.go#L1501
Useless break in case clause (WARN: this break statement affects this switch or select statement and not the loop enclosing it) (useless-break)
This pull request is now in conflict. Could you fix it @shuo-wu? 🙏 |
@coderabbitai pause |
✅ Actions performedReviews paused. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, LGTM. Only some comments and conflicts.
@@ -182,6 +191,8 @@ func NewReplica(ctx context.Context, replicaName, lvsName, lvsUUID string, specS | |||
SpecSize: roundedSpecSize, | |||
State: types.InstanceStatePending, | |||
|
|||
SnapshotChecksumEnabled: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we allow users to disable snapshot checksum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, YES.
I haven't checked if the v1 checksum setting fits v2 or not. Hence we directly enabled it by default for now.
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
…lding start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Signed-off-by: Shuo Wu <[email protected]>
The 1st defer invocation e.initiator.Resume() should not overwrite the existing error. Otherwise, the next defer call e.replicaShallowCopy() will be excuted. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
To reuse possible existing snapshots, the online rebuilding flow is refactored. More specifically: 1. Dst replica will re-create a rebuilding lvol under the correct snap before each shallow copy start 2. During dst replica rebuilding lvol preparation, the dst replica will try to reuse the existing snapshots, non-empty previous head, rebuilding lvols, and expired lvols. 3. During dst replica snapshot shallow copy, the shallow copy can be skipped if the snapshot is there and intact. 4. During dst replica snapshot creation stage, it will blindly clean up the rebuilding lvol and correct the current rebuilding snapshot parent. 5. The dest replica rebuilding cleanup is responisble for cleaning up redundant lvols only (after rebuilding finish) 6. Src replica will blindly attach the exposed rebuilding lvol at the beginning of each shallow copy, then detach it in the end. 7. The replica deletion function is responsible for cleaning up all related lvols when Longhorn no longer wants (to reuse) this replica. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
0aa2daa
to
c4579c9
Compare
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
…after shallow copy start Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
…timeout Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Since the delta shallow copy API is not available yet. Need to revert this commit when the API is ready. Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Signed-off-by: Shuo Wu <[email protected]>
Longhorn 9488 Signed-off-by: Shuo Wu <[email protected]>
Longhorn 8666, 9488 Signed-off-by: Shuo Wu <[email protected]>
Signed-off-by: Shuo Wu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue(s) this PR fixes:
longhorn/longhorn#9488
The related docs that can help understand this feature:
V2 Data Engine Fast Rebuilding diagram flow
V2 Data Engine Fast Rebuilding illustration and animation
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context