RKE2 keeps restarting with etcd error after cert rotate #7629

Closed
fmoral2 opened this issue Jan 23, 2025 · 3 comments
Comments


fmoral2 commented Jan 23, 2025

Environmental Info:
RKE2 Version:

RCs (release candidates)

Node(s) CPU architecture, OS, and Version:

rhel 9.4

Cluster Configuration:

split roles
etcd_only_nodes = 3
cp_only_nodes = 2
no_of_worker_nodes = 1
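
For reference, the split roles above translate to server configs roughly like the following (a sketch only; the disable flags match what shows up in the systemd logs later in this issue, and /etc/rancher/rke2/config.yaml is the standard RKE2 config path):

# /etc/rancher/rke2/config.yaml on an etcd-only server (sketch)
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true

# /etc/rancher/rke2/config.yaml on a control-plane-only server (sketch)
disable-etcd: true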

Describe the bug:

After rotating certificates:
TLS Directory name: tls-1737633868
Comparing Directories: /var/lib/rancher/rke2/server/tls and /var/lib/rancher/rke2/server/tls-1737633868
file client-ca.crt found
file client-ca.key found
file client-ca.nochain.crt found
file peer-ca.crt found
file peer-ca.key found
file server-ca.crt found
file server-ca.key found
file request-header-ca.crt found
file request-header-ca.key found
file server-ca.crt found
file server-ca.key found
file server-ca.nochain.crt found
file service.current.key found
file service.key found

All checks and validations pass; see the test at https://github.com/rancher/distros-test-framework/blob/main/pkg/testcase/certrotate.go
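
(For context, the check above is roughly equivalent to the following shell sketch; the real test is Go code, and the directory names and file list are taken from the output above.)

# Rough shell equivalent of the comparison (sketch only):
old=/var/lib/rancher/rke2/server/tls
new=/var/lib/rancher/rke2/server/tls-1737633868   # timestamped directory from the rotate output above
for f in client-ca.crt client-ca.key client-ca.nochain.crt peer-ca.crt peer-ca.key \
         request-header-ca.crt request-header-ca.key server-ca.crt server-ca.key \
         server-ca.nochain.crt service.current.key service.key; do
  if [ -f "$old/$f" ] && [ -f "$new/$f" ]; then
    echo "file $f found"
  else
    echo "file $f missing"
  fi
done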

But then RKE2 keeps restarting endlessly.

Steps To Reproduce:

  • Installed RKE2:
  • split roles
    etcd_only_nodes = 3
    cp_only_nodes = 2
    no_of_worker_nodes = 1

On both the etcd and control-plane nodes, run the following for a full certificate rotation:
$ sudo systemctl stop rke2

$ sudo rke2 --debug certificate rotate
$ sudo systemctl start rke2

Restart the agent node: "sudo systemctl restart rke2-agent"

Expected behavior:

RKE2 cluster should be up and running fine

Actual behavior:

RKE2 keeps restarting endlessly.

Additional context / logs:

Control plane logs

Jan 23 12:12:38 ip-172-31-17-225.us-east-2.compute.internal rke2[23930]: time="2025-01-23T12:12:38Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Jan 23 12:12:39 ip-172-31-17-225.us-east-2.compute.internal rke2[23930]: time="2025-01-23T12:12:39Z" level=fatal msg="starting kubernetes: preparing server: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""

Jan 23 16:20:08 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T16:20:08Z" level=error msg="Sending HTTP/1.1 500 response to 18.224.25.139:53118: failed to get etcd MemberList: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
Jan 23 16:20:08 ip-1.us-east-2.compute.internal rke2[18393]: time="2025-01-23T16:20:08Z" level=info msg="Waiting for control-plane node to register apiserver addresses in etcd"
Jan 23 16:20:09 ip-us-east-2.compute.internal rke2[18393]: time="2025-01-23T16:20:09Z" level=warning msg="Failed to get apiserver address from etcd: context deadline exceeded"
Jan 23 16:20:10 ip--.us-east-2.compute.internal rke2[18393]: time="2025-01-23T16:20:10Z" level=error msg="Sending HTTP/1.1 503 response to 3.136.84.50:55034: runtime core not ready"
Jan 23 16:20:11 ip-1.us-east-2.compute.internal rke2[18393]: time="2025-01-23T16:20:11Z" level=error msg="Sending HTTP/1.1 503 response to 52.14.121.184:44908: runtime core not ready"

Etcd logs


Jan 23 17:41:20 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:20Z" level=error msg="Sending HTTP/1.1 503 response to  : runtime core not ready"
Jan 23 17:41:21 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:21Z" level=info msg="Failed to test data store connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
Jan 23 17:41:21 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:21Z" level=info msg="Waiting for control-plane node to register apiserver addresses in etcd"
Jan 23 17:41:22 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:22Z" level=error msg="Sending HTTP/1.1 503 response to   runtime core not ready"
Jan 23 17:41:22 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:22Z" level=error msg="Sending HTTP/1.1 500 response to   failed to get etcd MemberList: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
Jan 23 17:41:23 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:23Z" level=warning msg="Failed to get apiserver address from etcd: context deadline exceeded"
Jan 23 17:41:25 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:25Z" level=error msg="Sending HTTP/1.1 503 response to   runtime core not ready"
Jan 23 17:41:26 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:26Z" level=info msg="Failed to test data store connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""
Jan 23 17:41:26 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:26Z" level=info msg="Waiting for control-plane node to register apiserver addresses in etcd"
Jan 23 17:41:26 ip-.us-east-2.compute.internal rke2[18393]: time="2025-01-23T17:41:26Z" level=error msg="Sending HTTP/1.1 500 response to   failed to get etcd MemberList: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\""


Jan 23 17:35:05 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:05Z" level=info msg="Waiting for API server to become available to start cloud-controller-manager"
Jan 23 17:35:05 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:05Z" level=info msg="Waiting for etcd server to become available"
Jan 23 17:35:06 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:06Z" level=warning msg="Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""
Jan 23 17:35:07 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:07Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: /var/lib/rancher/rke2/agent/serving-kubelet.crt: https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt: 503 Service Unavailable"
Jan 23 17:35:10 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:10Z" level=info msg="Failed to test data store connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""
Jan 23 17:35:11 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:11Z" level=warning msg="Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""
Jan 23 17:35:14 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:14Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: /var/lib/rancher/rke2/agent/serving-kubelet.crt: https://127.0.0.1:6444/v1-rke2/serving-kubelet.crt: 503 Service Unavailable"
Jan 23 17:35:15 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:15Z" level=info msg="Failed to test data store connection: failed to get etcd status: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""
Jan 23 17:35:16 ip.us-east-2.compute.internal rke2[25548]: time="2025-01-23T17:35:16Z" level=warning msg="Failed to get apiserver address from etcd: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\"


 g="Failed waiting for static pods to sync: timed out waiting for the condition"


EDIT:
Commands used:


On both the etcd and control-plane nodes, run the following for a full certificate rotation:
$ sudo systemctl stop rke2-server
$ sudo rke2 --debug certificate rotate
$ sudo systemctl start rke2-server

brandond commented Jan 23, 2025

On both the etcd and control-plane nodes, run the following for a full certificate rotation:

$ sudo systemctl stop rke2

$ sudo rke2 --debug certificate rotate
$ sudo systemctl start rke2

The service is named rke2-server on servers, and rke2-agent on agents, not rke2. Please ensure that you're stopping/starting the correct services before running the rotate command. You should see an error if you try to stop/start the wrong service...

root@systemd-node-1:/# systemctl stop rke2
Failed to stop rke2.service: Unit rke2.service not loaded.
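
On a split-role setup like this one, a rotation pass would therefore look roughly like this (sketch, using the service names above):

# On each server node (etcd-only and control-plane-only):
$ sudo systemctl stop rke2-server
$ sudo rke2 --debug certificate rotate
$ sudo systemctl start rke2-server

# On each agent node:
$ sudo systemctl restart rke2-agent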


fmoral2 commented Jan 23, 2025

As per the conversation and debugging with @brandond, this was a timing issue in the framework test.

fmoral2 closed this as completed Jan 23, 2025

brandond commented Jan 23, 2025

For the record, in case this helps someone else:

The order and timing are the problem. The nodes are all taken down within seconds of each other.

Jan 23 12:04:22 ip-172-31-29-176.us-east-2.compute.internal systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
Jan 23 12:04:23 ip-172-31-22-20.us-east-2.compute.internal systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
Jan 23 12:04:25 ip-172-31-25-135.us-east-2.compute.internal systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
Jan 23 12:04:26 ip-172-31-17-225.us-east-2.compute.internal systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
disable-etcd: true
Jan 23 12:04:27 ip-172-31-18-127.us-east-2.compute.internal systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
disable-etcd: true

You need to do them one at a time, and wait for the post-rotate systemctl start rke2-server command on one server to complete before moving on to the next. You can't just run through and do them all in parallel. You have ended up taking down all of the control-plane nodes just as the etcd nodes were starting, and now they are all very mad and confused. If this were a real cluster you were trying to fix, I would say to just stop all the services and then start them again.
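
A rough sketch of that sequence (the readiness check is optional and assumes you run it from a node where the apiserver is reachable; <node-name> is a placeholder):

# Repeat on ONE server at a time; do not move to the next server until this
# one's rke2-server service is back up:
$ sudo systemctl stop rke2-server
$ sudo rke2 --debug certificate rotate
$ sudo systemctl start rke2-server

# Optional sanity check before moving on (placeholder node name):
$ sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get node <node-name>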
