-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: reconstruct ActiveChain and SnapshotLvolMap while creating replica #284
Conversation
Longhorn 10049 Signed-off-by: Derek Su <[email protected]>
WalkthroughThe pull request introduces modifications to the replica management in the SPDK (Storage Performance Development Kit) package. The changes primarily focus on enhancing the Changes
Assessment against linked issues
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
be72266
to
402da16
Compare
The failure will be fixed in #278 |
…ing replica If a replica's ActiveChain is not constructed while creating it, the mismatch found by the validateAndUpdate() will result in the error state. ``` .... newSnapshotLvolMap, err := constructSnapshotLvolMap(r.Name, bdevLvolMap) if err != nil { return err } if len(r.SnapshotLvolMap) != len(newSnapshotLvolMap) { return fmt.Errorf("replica current active snapshot ....") } ... ```` Longhorn 10033 Signed-off-by: Derek Su <[email protected]>
402da16
to
387771b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
pkg/spdk/replica.go (1)
Line range hint
1-1
: Consider improving state management architectureThe current implementation handles complex state management through multiple methods and conditions. Consider the following architectural improvements:
Extract state management logic into a dedicated state machine component to:
- Centralize state transition logic
- Make state transitions more explicit and traceable
- Reduce the risk of invalid state transitions
Consider using the State pattern to encapsulate state-specific behavior and transitions.
This would improve maintainability and make the code more resilient to future changes.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pkg/spdk/replica.go
(2 hunks)
🔇 Additional comments (2)
pkg/spdk/replica.go (2)
252-252
: LGTM: Valid state addition for replica construction
Adding types.InstanceStateStopped
to the allowed states is appropriate as it enables proper reconstruction of stopped replicas.
738-747
: LGTM: Proper construction of replica state after head preparation
The addition of snapshot lvol map and active chain construction after head preparation ensures that the replica's state is properly initialized. This change directly addresses the issue where ActiveChain and SnapshotLvolMap were not properly constructed during replica creation.
After applying the fix, the e2e test can successfully ran 8 times without hitting the errors, but it hit
The error has nothing to do with the issue. It will be investigated in longhorn/longhorn#8207. |
Because the construction happens when creating a replica, it becomes a blocking operation. Do you think this will be a concern? if multiple replicas are created and re-constructed at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
We might need to consider whether the current implementation is scalable.
Do we know what server returned this error? Usually, it just means a server error. |
It is per-replica lock, so multiple replicas shouldn't be an issue.
After investigating the support bundle, volume controller is unable to salvage because of the disk pressure (codes) due to the orphaned replicas created while restarting nodes. |
This issue seems related to the current implementation, where failed replicas cannot be reused. The failed salvage due to unused orphaned replicas appears to be expected in this context. We should revisit this once we have complete snapshot reuse in version 1.9, including intact snapshot reuse (introduced in 1.8) and delta snapshot reuse (planned for 1.9). cc @longhorn/dev-data-plane |
Added a support bundle for the future improvement problematic volume: pvc-a9c26ee9-3918-4ef6-b243-c340b160b41b |
@derekbit This message simply indicates that the test is waiting for a pod that isn't currently running. It doesn't draw any conclusions and requires human intervention to investigate any potential issues. Possible causes could include:
Just for your information. |
Thanks @yangchiu |
Which issue(s) this PR fixes:
Issue longhorn/longhorn#10033
What this PR does / why we need it:
If the replica's ActiveChain is not constructed while creating the replica, the mismatch found by the
validateAndUpdate()
will result in the error state.Special notes for your reviewer:
Additional documentation or context