bug: 2nd validator node errors and does not recover #172
Comments
We need a testing/development framework around the scripts used for the glue code and the init containers, and a single consistent script for the full set of init-container steps.
This is very tricky and annoying. It happens more often than not, and the reasons are still unclear.
Faulty action: https://github.com/cosmology-tech/starship/actions/runs/6011439384/job/16306603533
Approaches
More thoughts. Even if the node does get into CrashLoopBackOff, it should be able to recover from it, based on the init containers. The validator node also seems to re-create the genesis file from init, which seems to be an incorrect one. Maybe the init-container
Maybe the scripts running in the init containers should exit early in case of any error. Then the logs would be more useful in pointing out the issue.
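A minimal sketch of the exit-early idea, written in Python for illustration only (in a shell script the usual equivalent is `set -e`). The step names and commands below are hypothetical placeholders, not the actual init-container steps; the point is that the first failing step is named in the logs and nothing after it runs.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first failure.

    Returns the name of the failed step, or None if every step succeeded.
    """
    for name, argv in steps:
        print(f"[init] running step: {name}", file=sys.stderr)
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Log exactly which step failed and why, then stop.
            print(
                f"[init] step '{name}' failed with exit code "
                f"{result.returncode}: {result.stderr.strip()}",
                file=sys.stderr,
            )
            return name
    return None


# Hypothetical steps; a real init container would call the chain binary here.
demo_steps = [
    ("init-config", [sys.executable, "-c", "print('ok')"]),
    ("fetch-genesis", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("never-reached", [sys.executable, "-c", "print('unreachable')"]),
]
failed_step = run_steps(demo_steps)
```

In a real entrypoint, a non-None result would be turned into a non-zero exit code so that Kubernetes reports the init container as failed.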
Note:
Try to update the kind version in the gh-action.
Overview
If there is a bug on a validator node (after the genesis node), it does not seem to recover and gets stuck in a state of CrashLoopBackOff, especially from the postStartHook, which performs the create-validator txn.
Proposal
In order to make a robust setup, we need to make the nodes self-healing, using the primitives of k8s itself.
We can use the liveness and readiness probes to check the state and to force validator nodes to restart properly.
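As a rough sketch of what such probes could check, the functions below decide liveness and readiness from a node status document. The shape of the status (`result.sync_info.catching_up`, and `latest_block_height` as a string) is an assumption modelled on the Tendermint/CometBFT RPC `/status` response; the function names and the `min_height` threshold are hypothetical.

```python
def is_live(status):
    """Liveness: the RPC answered with a parseable status document."""
    return isinstance(status, dict) and isinstance(status.get("result"), dict)


def is_ready(status, min_height=1):
    """Readiness: the node is done catching up and has seen some blocks."""
    if not is_live(status):
        return False
    sync = status["result"].get("sync_info") or {}
    try:
        height = int(sync.get("latest_block_height", "0"))
    except (TypeError, ValueError):
        return False
    return sync.get("catching_up") is False and height >= min_height


# Assumed example payloads, not captured from a real node.
syncing = {"result": {"sync_info": {"latest_block_height": "10", "catching_up": True}}}
synced = {"result": {"sync_info": {"latest_block_height": "42", "catching_up": False}}}
```

A probe command could fetch the status, call one of these functions, and exit non-zero when it returns False. A failing liveness probe makes the kubelet restart the container, while a failing readiness probe only takes the pod out of service endpoints.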
Option 1: Clean start on failure
Delete ~/.<chain> after it fails
Option 2: PostStartHook fallback
Since we use the postStartHook for registering the validator node, we can make the post-start hook more robust and aware of the failure.
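The two options could be combined roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the marker file names are hypothetical, `state_dir` is assumed to be a directory that survives container restarts and lives outside the chain home, and `register` stands in for whatever performs the create-validator txn.

```python
import shutil
from pathlib import Path

# Hypothetical marker file names, kept outside the chain home directory.
FAILED_MARKER = "post-start.failed"
REGISTERED_MARKER = "validator.registered"


def prepare_home(home: Path, state_dir: Path) -> bool:
    """Option 1: wipe the chain home if the previous start was marked failed.

    Returns True if a clean start was performed.
    """
    marker = state_dir / FAILED_MARKER
    if not marker.exists():
        return False
    shutil.rmtree(home, ignore_errors=True)
    home.mkdir(parents=True, exist_ok=True)
    marker.unlink()
    return True


def post_start(state_dir: Path, register, attempts: int = 3) -> str:
    """Option 2: try `register` up to `attempts` times and record the outcome.

    Returns "skipped", "registered" or "failed".
    """
    registered = state_dir / REGISTERED_MARKER
    if registered.exists():
        # Already registered on an earlier start; do not submit the txn again.
        return "skipped"
    for _ in range(attempts):
        try:
            register()
        except Exception:
            continue  # transient error: try again
        registered.touch()
        return "registered"
    # Leave a marker so the next start can decide to clean up.
    (state_dir / FAILED_MARKER).touch()
    return "failed"
```

Here `prepare_home` would run before the node starts (for example from an init container) and `post_start` from the hook itself. The "skipped" path also means a node that was already registered does not re-run registration after a restart.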
Problem
A validator node can fail for multiple reasons, and a recovery method suited to one kind of failure can cause issues for other types of transient errors. We need a more robust way of recovering failing nodes.
Nodes can also be manually shut down; in that case the postStartHook should not run.