bug: 2nd validator node errors and does not recover #172

Open · Tracked by #169
Anmol1696 opened this issue Aug 11, 2023 · 5 comments

@Anmol1696 (Collaborator)

Overview

If there is a bug on a validator node (any validator after the genesis node), it does not seem to recover and gets stuck in a CrashLoopBackOff state, especially when the failure comes from the postStartHook that performs the create-validator txn.

Proposal

In order to make the setup robust, we need to make the nodes self-healing, using the primitives of k8s itself.

We can utilize liveness and readiness probes to check node state, and also to force validator nodes to restart cleanly.
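For instance, a liveness probe could hit the node's RPC `/health` endpoint so k8s restarts the container when the node stops responding; a minimal sketch, assuming the default CometBFT/Tendermint RPC port 26657 (not necessarily the chart's actual values):

```yaml
# Sketch only: liveness restarts a hung container, readiness gates traffic.
# Port 26657 and the /health, /status endpoints are assumptions based on
# the stock CometBFT RPC setup.
livenessProbe:
  httpGet:
    path: /health
    port: 26657
  initialDelaySeconds: 60   # leave time for init containers and genesis sync
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /status
    port: 26657
  periodSeconds: 10
```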

Option 1: Clean start on failure

Delete ~/.<chain> after a failure, so the node re-initializes from scratch on the next restart.
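A minimal sketch of such a wrapper, assuming `CHAIN_BIN` and `CHAIN_DIR` environment variables (placeholder names, not the chart's actual ones):

```bash
#!/bin/bash
# Sketch: if the daemon exits non-zero, wipe the home dir so the next
# container restart re-runs the init containers against a clean slate.
CHAIN_DIR="${CHAIN_DIR:-$HOME/.osmosisd}"
if ! "$CHAIN_BIN" start --home "$CHAIN_DIR"; then
  echo "node crashed, removing $CHAIN_DIR for a clean re-init" >&2
  rm -rf "$CHAIN_DIR"
  exit 1   # non-zero exit lets k8s restart the container
fi
```

The obvious downside is that a transient crash throws away all synced state, so this trades recovery speed for simplicity.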

Option 2: PostStartHook fallback

Since we use the postStartHook for registering the validator node, we can make the hook more robust and aware of failures.
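A hypothetical sketch of a failure-aware hook: check whether the validator is already registered (which also keeps the hook idempotent across manual restarts, per the Problem below), and retry the txn a few times before giving up. The subcommands mirror the stock Cosmos SDK CLI; the env vars (`CHAIN_BIN`, `KEY_NAME`, `STAKE_AMOUNT`, `CHAIN_ID`) are placeholders and exact flags vary per chain:

```bash
#!/bin/bash
set -euo pipefail

# Skip registration entirely if this validator already exists on chain.
VALOPER=$("$CHAIN_BIN" keys show "$KEY_NAME" --bech val -a --keyring-backend test)
if "$CHAIN_BIN" query staking validator "$VALOPER" >/dev/null 2>&1; then
  echo "validator $VALOPER already registered, skipping create-validator"
  exit 0
fi

# Retry the create-validator txn instead of failing the pod on first error.
for attempt in 1 2 3; do
  if "$CHAIN_BIN" tx staking create-validator \
      --amount "$STAKE_AMOUNT" \
      --pubkey "$("$CHAIN_BIN" tendermint show-validator)" \
      --moniker "$KEY_NAME" \
      --min-self-delegation 1 \
      --commission-rate 0.1 --commission-max-rate 0.2 \
      --commission-max-change-rate 0.01 \
      --from "$KEY_NAME" --keyring-backend test \
      --chain-id "$CHAIN_ID" --yes; then
    exit 0
  fi
  echo "create-validator attempt $attempt failed, retrying in 10s" >&2
  sleep 10
done
exit 1
```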

Problem

A validator node can fail for multiple reasons, and a recovery path suited to one failure mode can make other transient errors worse. We need a more robust way of recovering failing nodes.

Nodes can also be shut down manually; in that case the postStartHook should not re-run on its own.

@Anmol1696 (Collaborator, Author)

Need a testing/development framework around the scripts used for the glue code and the init-containers.
It is becoming hard to change the validator scripts to handle multiple kinds of failures.

Need a consistent script for the full init-containers sequence.
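One lightweight option (an assumption, not something the repo uses today) would be bats-core, so the glue scripts can be exercised outside the cluster; `init-validator.sh` below is a hypothetical name for the script under test:

```bash
#!/usr/bin/env bats
# Sketch: unit tests for the init script's failure handling with bats-core.

@test "init script fails fast when the genesis host does not resolve" {
  run env GENESIS_HOST=does-not-resolve.invalid ./scripts/init-validator.sh
  [ "$status" -ne 0 ]
}

@test "init script is idempotent on an already-initialized home dir" {
  run ./scripts/init-validator.sh
  [ "$status" -eq 0 ]
  run ./scripts/init-validator.sh
  [ "$status" -eq 0 ]
}
```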

@Anmol1696 (Collaborator, Author)

This is very tricky and annoying. It happens more often than not, and the reasons are still unclear.
But there is usually an exit code of 137:

' exited with 137: , message: "Validator Index: 0, Key name: val1. Chain bin osmosisd\n"

Faulty action: https://github.com/cosmology-tech/starship/actions/runs/6011439384/job/16306603533
Exit code 137 means the process was killed with SIGKILL (128 + 9), which on k8s usually points to the OOM killer, although that should not be the case with the current resources.
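Since it is a SIGKILL, the OOM theory can be confirmed or ruled out from the container's last termination state (the pod name below is hypothetical):

```bash
# "OOMKilled" here confirms the memory theory; "Error" points elsewhere.
kubectl get pod osmosis-1-validator-0 \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Fuller context, including exit code and restart count:
kubectl describe pod osmosis-1-validator-0
```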

Approaches

  • Can try not letting the pods restart on failure, and re-create the issue in our feature branch so the failed container can be inspected

@Anmol1696 (Collaborator, Author)

More thoughts. Even if the node does get into CrashLoopBackOff, it should be able to recover from it via the init containers. The validator node also seems to re-create the genesis file during init, and that copy seems to be the incorrect one.

Maybe the init-container should:

  • check that the genesis file is actually the same as the genesis file on the genesis node (see the sketch after this list)
  • add more logs to the post startup hook.
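A sketch of the genesis check, assuming the exposer serves the file over HTTP (`GENESIS_HOST` and port 8081 are placeholders, not the chart's actual values):

```bash
#!/bin/bash
set -euo pipefail
# Compare the local genesis against the genesis node's copy; refuse to
# start the validator on a mismatched (re-created) genesis file.
curl -sf "http://${GENESIS_HOST}:8081/genesis" -o /tmp/genesis.json
local_sum=$(sha256sum "$CHAIN_DIR/config/genesis.json" | awk '{print $1}')
remote_sum=$(sha256sum /tmp/genesis.json | awk '{print $1}')
if [ "$local_sum" != "$remote_sum" ]; then
  echo "genesis mismatch: local=$local_sum remote=$remote_sum" >&2
  exit 1
fi
```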

@Anmol1696 (Collaborator, Author)

Maybe the scripts running in the init containers should fail fast: in case of any error, exit early. Then the logs would point more directly at the issue.
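Concretely, every init-container script could run under bash strict mode with an ERR trap, so the first failing command aborts the script and the log shows where; a minimal sketch:

```bash
#!/bin/bash
# Exit on any error, unset variable, or failed pipeline stage, and log
# the offending line number before the script dies.
set -euo pipefail
trap 'echo "init step failed at line $LINENO" >&2' ERR

# ...init steps follow; any failure now stops the script immediately...
```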

@Anmol1696 (Collaborator, Author)

Anmol1696 commented Aug 31, 2023

Note:
There seems to be a DNS lookup issue when trying to fetch the genesis file from the exposer of the genesis node:

curl: (6) Could not resolve host: osmosis-1-genesis.ci-cosmology-tech-starship-smoke-tests-refs-pull-195-merge.svc.cluster.local

Try updating the kind version in the gh-action.
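Independent of the kind version, the fetch itself could tolerate slow DNS propagation by retrying for a bounded period; a sketch (host and port are placeholders):

```bash
# Retry the genesis fetch for up to ~2.5 minutes; cluster DNS records for
# a freshly created Service can take a while to become resolvable.
ok=""
for i in $(seq 1 30); do
  if curl -sf "http://${GENESIS_HOST}:8081/genesis" -o /tmp/genesis.json; then
    ok=1
    break
  fi
  echo "attempt $i: genesis fetch failed (DNS not ready?), retrying in 5s" >&2
  sleep 5
done
[ -n "$ok" ] || { echo "genesis fetch never succeeded" >&2; exit 1; }
```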
