Kubeone continues despite failed healthz check #3012

Open
judge-red opened this issue Jan 19, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management.

@judge-red

What happened?

I'm trying to set up a K1 cluster in a new environment. I'm still fine-tuning the infra, particularly the firewalls (OpenStack security groups). Access to the CP nodes through the LB was broken when I ran kubeone apply to install a Kubernetes cluster.

Luckily, K1 seems to run a healthz check before trying to do anything on the cluster. Unluckily, after the healthz check fails, it just continues anyway. And then it fails to create a resource but still keeps on going.

Also, fixing the firewall issue and running kubeone apply again didn't work; I had to replace the VMs and start fresh.

Expected behavior

K1 notices that the healthz check fails and doesn't continue. K1 notices that creating a resource failed and doesn't continue.

Also, K1 should probably be able to recover from this in a subsequent run.
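To make the expected behavior concrete, here is a minimal sketch of a fail-fast health gate. This is not KubeOne's actual code; the names `wait_for_healthz`, `run_steps`, and `HealthCheckFailed` are hypothetical, and the probe is passed in as a plain callable so the retry logic stays independent of how the API server is reached.

```python
import time
from typing import Callable, List


class HealthCheckFailed(RuntimeError):
    """Raised when the API server never reports healthy (hypothetical name)."""


def wait_for_healthz(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() up to `attempts` times; return the attempt that succeeded.

    Raises HealthCheckFailed instead of returning when every attempt fails,
    so the caller cannot accidentally continue.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # back off before the next attempt
    raise HealthCheckFailed(f"healthz still failing after {attempts} attempts")


def run_steps(probe: Callable[[], bool], steps: List[Callable[[], object]], **kwargs) -> list:
    """Run the steps only if the health gate passes; otherwise abort before step one."""
    wait_for_healthz(probe, **kwargs)
    return [step() for step in steps]
```

With a gate like this, a failing probe ends the run with an error before any resource is created, rather than logging the failure and moving on.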

How to reproduce the issue?

Yeah, that's not going to be easy, I guess. As I described above, I had a custom TF-based OpenStack setup. Everything worked as expected, except accessing port 6443 through the LB. I think the LB accepted the connection, but the connection between the LB and the VMs was blocked. Access to port 6443 on the VMs directly, without going through the LB, worked.
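For anyone trying to tell this situation apart from a plain "port closed" case, a rough pre-flight sketch follows. It compares a bare TCP connect with an actual HTTPS request to /healthz, once against the LB and once against a node. The function names are hypothetical, it assumes anonymous access to /healthz is allowed on the API server, and it skips certificate verification, which is acceptable only for a quick diagnostic.

```python
import http.client
import socket
import ssl
import urllib.request


def tcp_connects(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthz_ok(host: str, port: int = 6443, timeout: float = 5.0) -> bool:
    """True if GET /healthz answers 200 with body 'ok'.

    Assumption: anonymous access to /healthz is permitted.
    Certificate verification is disabled; diagnostic use only.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = f"https://{host}:{port}/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status == 200 and resp.read().strip() == b"ok"
    except (OSError, http.client.HTTPException):
        return False


def classify(tcp_ok: bool, healthy: bool) -> str:
    """Turn the two probe results into a short diagnosis."""
    if not tcp_ok:
        return "unreachable"
    if not healthy:
        return "accepts TCP but no healthy backend"
    return "healthy"


def diagnose(host: str) -> str:
    """Probe one endpoint (LB address or node address) and classify it."""
    return classify(tcp_connects(host), healthz_ok(host))
```

In the setup described above, I would expect `diagnose` to report "accepts TCP but no healthy backend" for the LB address and "healthy" for a node address, which points at the LB-to-VM path rather than the API server itself.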

What KubeOne version are you using?

1.7.2

Provide your KubeOneCluster manifest here (if applicable)

Don't think it matters, otherwise let me know. (I need to manually do some of the steps our pipeline does to get this manifest.)

What cloud provider are you running on?

OpenStack

What operating system are you running in your cluster?

Ubuntu 22.04

Additional information

I'll add the logs of the initial "install run" and the "subsequent run". I eventually cancelled both job runs, equivalent to ctrl+c.

k1-install-run.log
k1-subsequent-run.log

@judge-red judge-red added kind/bug Categorizes issue or PR as related to a bug. sig/cluster-management Denotes a PR or issue as being assigned to SIG Cluster Management. labels Jan 19, 2024
@kron4eg
Member

kron4eg commented Jan 23, 2024

The logs look fine to me: kubeone did what it was supposed to do, but not much more.

  • The underlying instances were very slow and did not allow the control plane to initialize normally
  • kubeone retried the init phase multiple times (most operations can be fixed by retrying)
  • a half-dead control plane was launched
  • which led to a false "kube-apiserver is up" signal

At this point you should run kubeone reset right after the failed attempt that resulted in the "undead" control plane.

@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot kubermatic-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2024
@xmudrii
Member

xmudrii commented Apr 22, 2024

/remove-lifecycle stale

@kubermatic-bot kubermatic-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2024
@xmudrii xmudrii added this to the KubeOne 1.9 milestone Jun 24, 2024
@xmudrii xmudrii added the priority/low Not that important. label Jun 24, 2024
@kron4eg kron4eg removed the priority/low Not that important. label Aug 14, 2024
@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot kubermatic-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2024
@xmudrii
Member

xmudrii commented Nov 12, 2024

/remove-lifecycle stale

@kubermatic-bot kubermatic-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2024