
Pods don't run after a node is rebooted #624

Open
seguidor777 opened this issue Dec 10, 2020 · 5 comments
seguidor777 commented Dec 10, 2020

Environment

  • OS: Debian buster
  • Erlang/OTP:
  • EMQ X: 4.2.3

Description

I have installed emqx on my machine using kind. The values I'm using are the following:

replicaCount: 3
service:
  type: LoadBalancer

After installation, the pods run correctly, and I can connect to the dashboard and log in without any problem.
But if I reboot my machine or the nodes where the pods are running, the pods are not initialized; instead they show this error:

cluster.k8s.address_type=hostname
cluster.k8s.address_type=hostname
node.max_ports=1048576
cluster.k8s.suffix=svc.cluster.local
cluster.k8s.suffix=svc.cluster.local
listener.tcp.external.acceptors=64
listener.ssl.external.acceptors=32
node.process_limit=2097152
node.max_ets_tables=2097152
cluster.k8s.service_name=emqx-headless
cluster.k8s.service_name=emqx-headless
cluster.discovery=k8s
cluster.discovery=k8s
listener.ws.external.acceptors=16
cluster.k8s.app_name=emqx
cluster.k8s.app_name=emqx
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.namespace=commbat-cloud
cluster.k8s.namespace=commbat-cloud
EMQ X Broker 4.2.3 is started successfully!
['2020-12-10T04:57:53Z']:emqx start

=====
===== LOGGING STARTED Thu Dec 10 04:57:30 UTC 2020
=====
Exec: /opt/emqx/erts-10.7.2.1/bin/erlexec -boot /opt/emqx/releases/4.2.3/emqx -mode embedded -boot_var ERTS_LIB_DIR /opt/emqx/erts-10.7.2.1/../lib -mnesia dir "/opt/emqx/data/mnesia/[email protected]" -config /opt/emqx/data/configs/app.2020.12.10.04.57.30.config -args_file /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -vm_args /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -start_epmd false -epmd_module ekka_epmd -proto_dist ekka -- console
Root: /opt/emqx
/opt/emqx
Starting emqx on node [email protected]
['2020-12-10T04:57:53Z']:emqx not running, waiting for recovery in 20 seconds
['2020-12-10T04:57:58Z']:emqx not running, waiting for recovery in 15 seconds
['2020-12-10T04:58:03Z']:emqx not running, waiting for recovery in 10 seconds
['2020-12-10T04:58:08Z']:emqx not running, waiting for recovery in 5 seconds
['2020-12-10T04:58:13Z']:emqx not running, waiting for recovery in 0 seconds

=====
===== LOGGING STARTED Thu Dec 10 04:57:30 UTC 2020
=====
Exec: /opt/emqx/erts-10.7.2.1/bin/erlexec -boot /opt/emqx/releases/4.2.3/emqx -mode embedded -boot_var ERTS_LIB_DIR /opt/emqx/erts-10.7.2.1/../lib -mnesia dir "/opt/emqx/data/mnesia/[email protected]" -config /opt/emqx/data/configs/app.2020.12.10.04.57.30.config -args_file /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -vm_args /opt/emqx/data/configs/vm.2020.12.10.04.57.30.args -start_epmd false -epmd_module ekka_epmd -proto_dist ekka -- console
Root: /opt/emqx
/opt/emqx
Starting emqx on node [email protected]
['2020-12-10T04:58:18Z']:emqx exit abnormally

Edited (I had used the wrong logs)

It just exits and I cannot see the cause. Could you please help me troubleshoot this issue?

@seguidor777 (Author)

If I run the StatefulSet with only 1 replica, it works fine after a node reboot, and I get the log below.
I think the error is related to some network/communication issue between 2 or more replicas:

cluster.k8s.address_type=hostname
cluster.k8s.address_type=hostname
node.max_ports=1048576
cluster.k8s.suffix=svc.cluster.local
cluster.k8s.suffix=svc.cluster.local
listener.tcp.external.acceptors=64
listener.ssl.external.acceptors=32
node.process_limit=2097152
node.max_ets_tables=2097152
cluster.k8s.service_name=emqx-headless
cluster.k8s.service_name=emqx-headless
cluster.discovery=k8s
cluster.discovery=k8s
listener.ws.external.acceptors=16
cluster.k8s.app_name=emqx
cluster.k8s.app_name=emqx
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.apiserver=https://kubernetes.default.svc:443
cluster.k8s.namespace=commbat-cloud
cluster.k8s.namespace=commbat-cloud
EMQ X Broker 4.2.3 is started successfully!
['2020-12-10T05:24:35Z']:emqx start
Start http:management listener on 8081 successfully.
Start http:dashboard listener on 18083 successfully.
Start mqtt:tcp listener on 127.0.0.1:11883 successfully.
Start mqtt:tcp listener on 0.0.0.0:1883 successfully.
Start mqtt:ws listener on 0.0.0.0:8083 successfully.
Start mqtt:ssl listener on 0.0.0.0:8883 successfully.
Start mqtt:wss listener on 0.0.0.0:8084 successfully.
EMQ X Broker 4.2.3 is running now!
Eshell V10.7.2.1  (abort with ^G)
([email protected])1> 2020-12-10 05:24:35.724 [error] Ekka(AutoCluster): Discovery error: {failed_connect,
                                     [{to_address,
                                       {"kubernetes.default.svc",443}},
                                      {inet,[inet],nxdomain}]}

Any help would be appreciated
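The nxdomain error above means the Kubernetes API server hostname failed to resolve from inside the pod, so discovery may simply have raced with cluster DNS coming back up after the reboot. A quick way to check (a sketch only: the pod name emqx-0 and the availability of nslookup in the image are assumptions) is:

```shell
# From inside an emqx pod, check that the API server name resolves.
# If this fails shortly after a reboot but succeeds a bit later, node
# discovery likely started before cluster DNS (CoreDNS) was ready.
kubectl -n commbat-cloud exec emqx-0 -- nslookup kubernetes.default.svc
```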

@Rory-Z (Member) commented Dec 11, 2020

Hi, @seguidor777
This seems to be a distributed-database problem. When multiple emqx nodes in a cluster restart at the same time, emqx cannot determine which node was the last one to stop, which leaves the cluster state confused. You can try starting one node first, clearing the data under /opt/emqx/data/mnesia/ on the other nodes (note that this causes data loss on those nodes, so operate with caution), and then observing whether they return to normal.
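That recovery can be sketched as a kubectl procedure like the one below. Treat it as an outline only: the StatefulSet name emqx, the PVC names, and the commbat-cloud namespace are assumptions taken from the logs above, so adjust them to your deployment before running anything.

```shell
# 1. Scale down to a single replica and wait for it to become Ready;
#    this node's Mnesia data becomes the surviving copy.
kubectl -n commbat-cloud scale statefulset emqx --replicas=1
kubectl -n commbat-cloud rollout status statefulset emqx

# 2. Discard the stale Mnesia state of the stopped replicas by deleting
#    their PersistentVolumeClaims (PVC names are assumed; this DELETES
#    the data stored on those nodes -- operate with caution).
kubectl -n commbat-cloud delete pvc emqx-data-emqx-1 emqx-data-emqx-2

# 3. Scale back up; the new pods start with an empty
#    /opt/emqx/data/mnesia and sync state when they rejoin the cluster.
kubectl -n commbat-cloud scale statefulset emqx --replicas=3
```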

@seguidor777 (Author)

Hi @zhanghongtong,
Sorry for the delay; I hadn't had a chance to try what you suggested, but it really works. However, I think the pods should be more resilient and prevent this failure from happening in a real-world scenario. Could you please let me know whether this will be tackled in an upcoming release? I'll be looking forward to it.

@HJianBo HJianBo added this to the 4.3-alpha.1 milestone Jan 4, 2021
@HJianBo HJianBo added the bug label Jan 4, 2021
@Rory-Z (Member) commented Jan 4, 2021

@seguidor777
We plan to resolve this issue in version 4.3.0.

@seguidor777 (Author)

Awesome, thanks for your support
