
Pods don't run after a node is rebooted #624

Open
seguidor777 opened this issue Dec 10, 2020 · 5 comments
seguidor777 commented Dec 10, 2020

Environment

  • OS: Debian buster
  • Erlang/OTP:
  • EMQ X: 4.2.3

Description

I have installed emqx on my machine using kind. The values I'm using are the following:

replicaCount: 3
service:
  type: LoadBalancer

After installation, the pods run correctly, and I can connect to the dashboard and log in without any problem.
But if I reboot my machine or the nodes where the pods are running, the pods are not initialized; instead they show this error:

cluster.k8s.address_type=hostname
cluster.k8s.address_type=hostname
node.max_ports=1048576
cluster.k8s.suffix=svc.cluster.local
cluster.k8s.suffix=svc.cluster.local
listener.tcp.external.acceptors=64
listener.ssl.external.acceptors=32
node.process_limit=2097152
node.max_ets_tables=2097152
cluster.k8s.service_name=emqx-headless
cluster.k8s.service_name=emqx-headless
cluster.discovery=k8s
cluster.discovery=k8s
listener.ws.external.acceptors=16
cluster.k8s.app_name=emqx
cluster.k8s.app_name=emqx
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.namespace=commbat-cloud
cluster.k8s.namespace=commbat-cloud
EMQ X Broker 4.2.3 is started successfully!
['2020-12-10T04:57:53Z']:emqx start

=====
===== LOGGING STARTED Thu Dec 10 04:57:30 UTC 2020
=====
Exec: /opt/emqx/erts-10.7.2.1/bin/erlexec -boot /opt/emqx/releases/4.2.3/emqx -mode embedded -boot_var ERTS_LIB_DIR /opt/emqx/erts-10.7.2.1/../lib -mnesia dir "/opt/emqx/data/mnesia/[email protected]" -config /opt/emqx/data/configs/app.2020.12.10.04.57.30.config -args_file /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -vm_args /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -start_epmd false -epmd_module ekka_epmd -proto_dist ekka -- console
Root: /opt/emqx
/opt/emqx
Starting emqx on node [email protected]
['2020-12-10T04:57:53Z']:emqx not running, waiting for recovery in 20 seconds
['2020-12-10T04:57:58Z']:emqx not running, waiting for recovery in 15 seconds
['2020-12-10T04:58:03Z']:emqx not running, waiting for recovery in 10 seconds
['2020-12-10T04:58:08Z']:emqx not running, waiting for recovery in 5 seconds
['2020-12-10T04:58:13Z']:emqx not running, waiting for recovery in 0 seconds

=====
===== LOGGING STARTED Thu Dec 10 04:57:30 UTC 2020
=====
Exec: /opt/emqx/erts-10.7.2.1/bin/erlexec -boot /opt/emqx/releases/4.2.3/emqx -mode embedded -boot_var ERTS_LIB_DIR /opt/emqx/erts-10.7.2.1/../lib -mnesia dir "/opt/emqx/data/mnesia/[email protected]" -config /opt/emqx/data/configs/app.2020.12.10.04.57.30.config -args_file /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -vm_args /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -start_epmd false -epmd_module ekka_epmd -proto_dist ekka -- console
Root: /opt/emqx
/opt/emqx
Starting emqx on node [email protected]
['2020-12-10T04:58:18Z']:emqx exit abnormally

Edited (I had used the wrong logs)

It just exits and I cannot see the cause. Could you please help me troubleshoot this issue?

@seguidor777 (Author)

If I run the StatefulSet with only 1 replica, it works fine after a node reboot, and I get the log below.
I think the error is related to some network/communication issue between 2 or more replicas:

cluster.k8s.address_type=hostname
cluster.k8s.address_type=hostname
node.max_ports=1048576
cluster.k8s.suffix=svc.cluster.local
cluster.k8s.suffix=svc.cluster.local
listener.tcp.external.acceptors=64
listener.ssl.external.acceptors=32
node.process_limit=2097152
node.max_ets_tables=2097152
cluster.k8s.service_name=emqx-headless
cluster.k8s.service_name=emqx-headless
cluster.discovery=k8s
cluster.discovery=k8s
listener.ws.external.acceptors=16
cluster.k8s.app_name=emqx
cluster.k8s.app_name=emqx
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.namespace=commbat-cloud
cluster.k8s.namespace=commbat-cloud
EMQ X Broker 4.2.3 is started successfully!
['2020-12-10T05:24:35Z']:emqx start
Start http:management listener on 8081 successfully.
Start http:dashboard listener on 18083 successfully.
Start mqtt:tcp listener on 127.0.0.1:11883 successfully.
Start mqtt:tcp listener on 0.0.0.0:1883 successfully.
Start mqtt:ws listener on 0.0.0.0:8083 successfully.
Start mqtt:ssl listener on 0.0.0.0:8883 successfully.
Start mqtt:wss listener on 0.0.0.0:8084 successfully.
EMQ X Broker 4.2.3 is running now!
Eshell V10.7.2.1  (abort with ^G)
([email protected])1> 2020-12-10 05:24:35.724 [error] Ekka(AutoCluster): Discovery error: {failed_connect,
                                     [{to_address,
                                       {"kubernetes.default.svc",443}},
                                      {inet,[inet],nxdomain}]}

Any help would be appreciated
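The nxdomain error above means the Kubernetes API server hostname failed to resolve from inside the pod, so discovery may simply have raced with cluster DNS coming back up after the reboot. A quick way to check (a sketch only: the pod name emqx-0 and the availability of nslookup in the image are assumptions) is:

```shell
# From inside an emqx pod, check that the API server name resolves.
# If this fails shortly after a reboot but succeeds a bit later, node
# discovery likely started before cluster DNS (CoreDNS) was ready.
kubectl -n commbat-cloud exec emqx-0 -- nslookup kubernetes.default.svc
```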

@Rory-Z (Member) commented Dec 11, 2020

Hi, @seguidor777
This seems to be a distributed-database problem. When multiple emqx nodes in a cluster restart at the same time, emqx cannot determine which node was the last one to stop, which leaves the cluster state confused. You can try starting one node first, clearing the data under /opt/emqx/data/mnesia/ on the other nodes (note that this causes data loss on those nodes, so operate with caution), and then observing whether they return to normal.
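That recovery can be sketched as a kubectl procedure like the one below. Treat it as an outline only: the StatefulSet name emqx, the PVC names, and the commbat-cloud namespace are assumptions taken from the logs above, so adjust them to your deployment before running anything.

```shell
# 1. Scale down to a single replica and wait for it to become Ready;
#    this node's Mnesia data becomes the surviving copy.
kubectl -n commbat-cloud scale statefulset emqx --replicas=1
kubectl -n commbat-cloud rollout status statefulset emqx

# 2. Discard the stale Mnesia state of the stopped replicas by deleting
#    their PersistentVolumeClaims (PVC names are assumed; this DELETES
#    the data stored on those nodes -- operate with caution).
kubectl -n commbat-cloud delete pvc emqx-data-emqx-1 emqx-data-emqx-2

# 3. Scale back up; the new pods start with an empty
#    /opt/emqx/data/mnesia and sync state when they rejoin the cluster.
kubectl -n commbat-cloud scale statefulset emqx --replicas=3
```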

@seguidor777 (Author)

Hi @zhanghongtong,
Sorry for the delay; I hadn't had a chance to try what you suggested, but it really works. However, I think the pods should be more resilient and prevent this failure from happening in a real-world scenario. Could you please let me know whether this will be tackled in an upcoming release? I'll be looking forward to it.

@HJianBo HJianBo added this to the 4.3-alpha.1 milestone Jan 4, 2021
@HJianBo HJianBo added the bug label Jan 4, 2021
@Rory-Z (Member) commented Jan 4, 2021

@seguidor777
We plan to resolve this issue in version 4.3.0.

@seguidor777 (Author)

Awesome, thanks for your support
