bug: 2nd validator node errors and does not recover #172

Open · Tracked by #169
Anmol1696 opened this issue Aug 11, 2023 · 5 comments

@Anmol1696 (Collaborator)

Overview

If there is a bug on a validator node (any validator after the genesis node), it does not seem to recover and gets stuck in a CrashLoopBackOff state, especially when the failure comes from the postStartHook that performs the create-validator txn.

Proposal

In order to make the setup robust, we need to make the nodes self-healing, using the primitives of k8s itself.

We can utilize liveness and readiness probes to check node state, and also to force validator nodes to restart cleanly.
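For instance, a liveness probe could hit the node's RPC `/health` endpoint so k8s restarts the container when the node stops responding; a minimal sketch, assuming the default CometBFT/Tendermint RPC port 26657 (not necessarily the chart's actual values):

```yaml
# Sketch only: liveness restarts a hung container, readiness gates traffic.
# Port 26657 and the /health, /status endpoints are assumptions based on
# the stock CometBFT RPC setup.
livenessProbe:
  httpGet:
    path: /health
    port: 26657
  initialDelaySeconds: 60   # leave time for init containers and genesis sync
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /status
    port: 26657
  periodSeconds: 10
```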

Option 1: Clean start on failure

Delete ~/.<chain> after a failure, so the node re-initializes from scratch on the next restart.
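A minimal sketch of such a wrapper, assuming `CHAIN_BIN` and `CHAIN_DIR` environment variables (placeholder names, not the chart's actual ones):

```bash
#!/bin/bash
# Sketch: if the daemon exits non-zero, wipe the home dir so the next
# container restart re-runs the init containers against a clean slate.
CHAIN_DIR="${CHAIN_DIR:-$HOME/.osmosisd}"
if ! "$CHAIN_BIN" start --home "$CHAIN_DIR"; then
  echo "node crashed, removing $CHAIN_DIR for a clean re-init" >&2
  rm -rf "$CHAIN_DIR"
  exit 1   # non-zero exit lets k8s restart the container
fi
```

The obvious downside is that a transient crash throws away all synced state, so this trades recovery speed for simplicity.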

Option 2: PostStartHook fallback

Since we use the postStartHook for registering the validator node, we can make the hook more robust and aware of failures.
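A hypothetical sketch of a failure-aware hook: check whether the validator is already registered (which also keeps the hook idempotent across manual restarts, per the Problem below), and retry the txn a few times before giving up. The subcommands mirror the stock Cosmos SDK CLI; the env vars (`CHAIN_BIN`, `KEY_NAME`, `STAKE_AMOUNT`, `CHAIN_ID`) are placeholders and exact flags vary per chain:

```bash
#!/bin/bash
set -euo pipefail

# Skip registration entirely if this validator already exists on chain.
VALOPER=$("$CHAIN_BIN" keys show "$KEY_NAME" --bech val -a --keyring-backend test)
if "$CHAIN_BIN" query staking validator "$VALOPER" >/dev/null 2>&1; then
  echo "validator $VALOPER already registered, skipping create-validator"
  exit 0
fi

# Retry the create-validator txn instead of failing the pod on first error.
for attempt in 1 2 3; do
  if "$CHAIN_BIN" tx staking create-validator \
      --amount "$STAKE_AMOUNT" \
      --pubkey "$("$CHAIN_BIN" tendermint show-validator)" \
      --moniker "$KEY_NAME" \
      --min-self-delegation 1 \
      --commission-rate 0.1 --commission-max-rate 0.2 \
      --commission-max-change-rate 0.01 \
      --from "$KEY_NAME" --keyring-backend test \
      --chain-id "$CHAIN_ID" --yes; then
    exit 0
  fi
  echo "create-validator attempt $attempt failed, retrying in 10s" >&2
  sleep 10
done
exit 1
```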

Problem

A validator node can fail for multiple reasons, and a recovery path suited to one failure mode can make other transient errors worse. We need a more robust way of recovering failing nodes.

Nodes can also be shut down manually; in that case the postStartHook should not re-run on its own.

@Anmol1696 (Collaborator, Author)

Need a testing/development framework around the scripts used for the glue code and the init-containers.
It is becoming hard to change the validator scripts to handle multiple kinds of failures.

Need a consistent script for the full init-containers sequence.
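One lightweight option (an assumption, not something the repo uses today) would be bats-core, so the glue scripts can be exercised outside the cluster; `init-validator.sh` below is a hypothetical name for the script under test:

```bash
#!/usr/bin/env bats
# Sketch: unit tests for the init script's failure handling with bats-core.

@test "init script fails fast when the genesis host does not resolve" {
  run env GENESIS_HOST=does-not-resolve.invalid ./scripts/init-validator.sh
  [ "$status" -ne 0 ]
}

@test "init script is idempotent on an already-initialized home dir" {
  run ./scripts/init-validator.sh
  [ "$status" -eq 0 ]
  run ./scripts/init-validator.sh
  [ "$status" -eq 0 ]
}
```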

@Anmol1696 (Collaborator, Author)

This is very tricky and annoying. It happens more often than not, and the reasons are still unclear.
But there is usually an exit code of 137:

' exited with 137: , message: "Validator Index: 0, Key name: val1. Chain bin osmosisd\n"

Faulty action: https://github.com/cosmology-tech/starship/actions/runs/6011439384/job/16306603533
Exit code 137 means the process was killed with SIGKILL (128 + 9), which on k8s usually points to the OOM killer, although that should not be the case with the current resources.
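Since it is a SIGKILL, the OOM theory can be confirmed or ruled out from the container's last termination state (the pod name below is hypothetical):

```bash
# "OOMKilled" here confirms the memory theory; "Error" points elsewhere.
kubectl get pod osmosis-1-validator-0 \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Fuller context, including exit code and restart count:
kubectl describe pod osmosis-1-validator-0
```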

Approaches

  • Can try not letting the pods restart on failure, and re-create the issue in our feature branch so the failed container can be inspected

@Anmol1696 (Collaborator, Author)

More thoughts. Even if the node does get into CrashLoopBackOff, it should be able to recover from it via the init containers. The validator node also seems to re-create the genesis file during init, and that copy seems to be the incorrect one.

Maybe the init-container should:

  • check that the genesis file is actually the same as the genesis file on the genesis node (see the sketch after this list)
  • add more logs to the post startup hook.
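A sketch of the genesis check, assuming the exposer serves the file over HTTP (`GENESIS_HOST` and port 8081 are placeholders, not the chart's actual values):

```bash
#!/bin/bash
set -euo pipefail
# Compare the local genesis against the genesis node's copy; refuse to
# start the validator on a mismatched (re-created) genesis file.
curl -sf "http://${GENESIS_HOST}:8081/genesis" -o /tmp/genesis.json
local_sum=$(sha256sum "$CHAIN_DIR/config/genesis.json" | awk '{print $1}')
remote_sum=$(sha256sum /tmp/genesis.json | awk '{print $1}')
if [ "$local_sum" != "$remote_sum" ]; then
  echo "genesis mismatch: local=$local_sum remote=$remote_sum" >&2
  exit 1
fi
```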

@Anmol1696 (Collaborator, Author)

Maybe the scripts running in the init containers should fail fast: in case of any error, exit early. Then the logs would point more directly at the issue.
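Concretely, every init-container script could run under bash strict mode with an ERR trap, so the first failing command aborts the script and the log shows where; a minimal sketch:

```bash
#!/bin/bash
# Exit on any error, unset variable, or failed pipeline stage, and log
# the offending line number before the script dies.
set -euo pipefail
trap 'echo "init step failed at line $LINENO" >&2' ERR

# ...init steps follow; any failure now stops the script immediately...
```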

@Anmol1696 (Collaborator, Author)

Anmol1696 commented Aug 31, 2023

Note:
There seems to be a DNS lookup issue when trying to fetch the genesis file from the exposer of the genesis node:

curl: (6) Could not resolve host: osmosis-1-genesis.ci-cosmology-tech-starship-smoke-tests-refs-pull-195-merge.svc.cluster.local

Try updating the kind version in the gh-action.
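Independent of the kind version, the fetch itself could tolerate slow DNS propagation by retrying for a bounded period; a sketch (host and port are placeholders):

```bash
# Retry the genesis fetch for up to ~2.5 minutes; cluster DNS records for
# a freshly created Service can take a while to become resolvable.
ok=""
for i in $(seq 1 30); do
  if curl -sf "http://${GENESIS_HOST}:8081/genesis" -o /tmp/genesis.json; then
    ok=1
    break
  fi
  echo "attempt $i: genesis fetch failed (DNS not ready?), retrying in 5s" >&2
  sleep 5
done
[ -n "$ok" ] || { echo "genesis fetch never succeeded" >&2; exit 1; }
```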
