Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] coreMQTT doesn't reconnect and keep alive handling fails #91

Closed
NightSkySK opened this issue Jul 24, 2024 · 27 comments
Closed

[BUG] coreMQTT doesn't reconnect and keep alive handling fails #91

NightSkySK opened this issue Jul 24, 2024 · 27 comments

Comments

@NightSkySK
Copy link
Contributor

Describe the bug

At first glance, the subject can suggest that it is the same issue as described in #47 and #48 However, as the symptoms are the same the solution from those issues haven't solve my problem. Let me explain step by step what happened and my observations and logs (It will be quite a long thread, I'm sorry...)

The story begins when I found that my code is suffering from mbedtls memory leak:
In logs I found esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 which was clear indication mbedtls memory leak

2024-07-22 09:01:59	
I (38176357) coreMQTT: Publishing message to /gates/GA97DF0004.

2024-07-22 09:02:00	
E (38177577) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
E (38177577) coreMQTT: MQTT PUBLISH failed with status MQTTSendFailed.
E (38177577) coreMQTT: MQTT operation failed with status MQTTSendFailed

W (38177577) gate_control: Error or timed out waiting for ack for publish message 12420. Re-attempting publish.
I (38177587) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38177587) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1664

I (38177607) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) core_mqtt_agent_manager: TLS connection was disconnected.
I (38177627) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38177597) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 4135}" on topic "/gates/GA97DF0004" with ID 12420.
I (38177647) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38177667) gate_control: Task "GateReport" waiting for publish 12420 to complete.
I (38177677) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-22 09:02:00	
E (38177947) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38177947) esp-tls: create_ssl_handle failed
E (38177947) esp-tls: Failed to open new connection

2024-07-22 09:02:01	
I (38178267) core_mqtt_agent_manager: Retry attempt 1.

2024-07-22 09:02:01	
E (38178487) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38178487) esp-tls: create_ssl_handle failed
E (38178487) esp-tls: Failed to open new connection

2024-07-22 09:02:01	
W (38178577) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38178577) AWS_OTA: OTA Agent is suspended.
I (38178577) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-22 09:02:01	
I (38178947) core_mqtt_agent_manager: Retry attempt 2.

2024-07-22 09:02:01	
E (38179077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38179077) esp-tls: create_ssl_handle failed
E (38179077) esp-tls: Failed to open new connection

2024-07-22 09:02:03	I (38180557) core_mqtt_agent_manager: Retry attempt 3.
E (38180707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38180717) esp-tls: create_ssl_handle failed
E (38180717) esp-tls: Failed to open new connection

In the same software version I noticed that MQTTKeepAliveTimeout appear from time to time but the software can handle reconnection without issue:

2024-07-21 16:33:28	
I (18501007) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002

2024-07-21 16:33:36	
E (18508867) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (18508867) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (18508877) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508887) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508877) core_mqtt_agent_manager: TLS connection was disconnected.
I (18508907) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (18508927) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (18508927) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-21 16:33:38	
I (18511227) coreMQTT: MQTT connection established with the broker.
I (18511227) core_mqtt_agent_manager: Session present: 0

I (18511227) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (18511227) gate_control: Error or timed out waiting for ack to subscribe message 6002. Re-attempting subscribe.
I (18511237) report: coreMQTT-Agent connected.
I (18511257) gate_control: coreMQTT-Agent connected.
I (18511257) supervisor: coreMQTT-Agent connected.
I (18511257) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 6002
I (18511267) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (18511287) core_mqtt_agent_manager: coreMQTT-Agent connected.

So to fix mbedtls memory issue I've applied following commits:
a0fe220 referring to sdkconfig.default
and e2d407f referring to main/networking/mqtt/core_mqtt_agent_manager.c

And it worked well for mbedtls memory issue, I can't find any evidence esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 anymore, however the frequency of MQTTKeepAliveTimeout significantly increased and software is no longer capable to recover MQTT connection.

2024-07-24 01:30:43	
I (7749746) coreMQTT: Publishing message to /gates/GA97DF0004.


2024-07-24 01:30:46	
I (7753386) ota_over_mqtt: OTA agent resumed by timer.

2024-07-24 01:30:48	
I (7754386) AWS_OTA: otaPal_GetPlatformImageState
I (7754386) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (7754386) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (7754386) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]

2024-07-24 01:30:49	
I (7755416) gate_control: Task "GateControl" iteration 128 completed.

2024-07-24 01:30:49	
E (7755746) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (7755746) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (7755756) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755766) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755786) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (7755796) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (7755806) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 01:30:50	W (7756396) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
2024-07-24 01:32:11	E (7837816) coreMQTT: No command structure available.
2024-07-24 01:32:21	E (7847866) coreMQTT: No command structure available.
2024-07-24 01:32:31	E (7857916) coreMQTT: No command structure available.
2024-07-24 01:32:41	E (7867966) coreMQTT: No command structure available.

And on this stage we have two issues:

  1. Why MQTTKeepAliveTimeout even if few seconds earlier there was successfully published msg to AWS IoT MQTT server and from AWS IoT Documantation page I can read:

For MQTT (or MQTT over WebSockets) connections, a client can request a keep-alive interval between 30 - 1200 seconds as part of the MQTT CONNECT message. AWS IoT starts the keep-alive timer for a client when sending CONNACK in response to the CONNECT message. This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client. AWS IoT will disconnect a client whose keep-alive timer has reached 1.5x the specified keep-alive interval (i.e., by a factor of 1.5).The default keep-alive interval is 1200 seconds. If a client requests a keep-alive interval of zero, the default keep-alive interval will be used. If a client requests a keep-alive interval greater than 1200 seconds, the default keep-alive interval will be used. If a client requests a keep-alive interval shorter than 30 seconds but greater than zero, the server treats the client as though it requested a keep-alive interval of 30 seconds.

where key information is This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client.

I've also tried to increase in sdkconfig.default from CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=60 to CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=600 without any major difference

which I found confirmation at the server side as
Livecycle Connect/Disconnect events don't show MQTT_KEEP_ALIVE_TIMEOUT but DUPLICATE_CLIENTID caused by rebooting ESP32 without proper disconnection with MQTT server.

  1. The application lost the capability to recover from coreMQTT-Agent disconnected my short-term solution is to trigger the device reboot by my supervisor task, it isn't an elegant solution and I would prefer to fix it properly.

System information

  • Hardware board: ESP32
  • IDE used: 5.1.4
  • Operating System: Windows
@NightSkySK
Copy link
Contributor Author

NightSkySK commented Jul 24, 2024

I've just revert changes introduced by commit e2d407f

and as long as it doesn't solve MQTTKeepAliveTimeout software is again capable of recovering MQTT connection without necessary rebooting the device

2024-07-24 12:08:17	I (343095) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110

2024-07-24 12:08:24	E (350095) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (350105) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (350105) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1896

I (350115) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350105) core_mqtt_agent_manager: TLS connection was disconnected.
I (350135) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350145) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (350155) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 12:08:27	I (352415) coreMQTT: MQTT connection established with the broker.
I (352415) core_mqtt_agent_manager: Session present: 0

I (352415) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (352415) gate_control: Error or timed out waiting for ack to subscribe message 110. Re-attempting subscribe.
I (352435) report: coreMQTT-Agent connected.
I (352445) gate_control: coreMQTT-Agent connected.
I (352455) supervisor: coreMQTT-Agent connected.
I (352455) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110
I (352455) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (352475) core_mqtt_agent_manager: coreMQTT-Agent connected.

2024-07-24 12:08:29	I (355055) gate_control: Subscribe 110 for topic filter /gates/GA97DF0005 succeeded for task "GateReport".

Unfortunately without those changes also E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 returns to be an issue:

I (38625455) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502

2024-07-24 22:46:00	E (38627905) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
W (38627905) gate_control: Error or timed out waiting for ack to unsubscribe message 12502. Re-attempting subscribe.
E (38627905) coreMQTT: MQTT operation failed with status MQTTSendFailed

I (38627925) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502
I (38627925) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38627925) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1708

I (38627945) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627925) core_mqtt_agent_manager: TLS connection was disconnected.
I (38627965) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627985) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38627995) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 22:46:02	E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38628635) esp-tls: create_ssl_handle failed
E (38628635) esp-tls: Failed to open new connection
W (38628695) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38628695) AWS_OTA: OTA Agent is suspended.
I (38628695) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (38628945) core_mqtt_agent_manager: Retry attempt 1.

2024-07-24 22:46:02	E (38629155) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629155) esp-tls: create_ssl_handle failed
E (38629155) esp-tls: Failed to open new connection

2024-07-24 22:46:02	I (38629615) core_mqtt_agent_manager: Retry attempt 2.

2024-07-24 22:46:03	E (38629715) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629715) esp-tls: create_ssl_handle failed
E (38629715) esp-tls: Failed to open new connection

2024-07-24 22:46:04	I (38631185) core_mqtt_agent_manager: Retry attempt 3.
E (38631405) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38631405) esp-tls: create_ssl_handle failed
E (38631405) esp-tls: Failed to open new connection

2024-07-24 22:46:06	I (38633155) core_mqtt_agent_manager: Retry attempt 4.

@ActoryOu
Copy link
Member

Hi @NightSkySK,
Thanks for your feedback!

1. Why MQTTKeepAliveTimeout even if few seconds earlier there was successfully published msg to
   AWS IoT MQTT server and from AWS IoT Documantation page I can read:

My personal guess is that the ping request is sent from device side and it's waiting for ping response. In this case, even though we've sent the publish message successfully, this scenario will also happen (refer to this line). Let me trigger an internal discussion to see if we can have better solution!

2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

I'm looking into this, will back later once I found something.

Thank you!

@NightSkySK
Copy link
Contributor Author

Hi @ActoryOu thank you for looking into this issue.

My personal guess is that the ping request is sent from device side and it's waiting for ping response. In this case, even though we've sent the publish message successfully, this scenario will also happen (refer to this line). Let me trigger an internal discussion to see if we can have better solution!

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

@ActoryOu
Copy link
Member

2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

Looks like the CORE_MQTT_AGENT_DISCONNECTED_BIT in event group is cleared somehow, which is not expected.
Would you mind adding some logs to track that? Please help provide device log in error period. Here are some places to add logs.

@ActoryOu
Copy link
Member

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

This is definitely possible. You can try enlarge it as workaround.

@NightSkySK
Copy link
Contributor Author

NightSkySK commented Jul 25, 2024

@ActoryOu

I've implemented all logging requested:

2. The application lost the capability to recover from coreMQTT-Agent disconnected
   my short-term solution is to trigger the device reboot by my supervisor task, it isn't
   an elegant solution and I would prefer to fix it properly.

Looks like the CORE_MQTT_AGENT_DISCONNECTED_BIT in event group is cleared somehow, which is not expected. Would you mind adding some logs to track that? Please help provide device log in error period. Here are some places to add logs.

As in normal condition each while loop iteration on L777 it was printing:
W (167796) TAG: L777 xEventGroupWaitBits: 5

I've made such condition in code:

        if( eMqttRet == MQTTSuccess )
        {
            while( 1 )
            {
                xReturnedEventBits_2 = xEventGroupWaitBits(
                    xNetworkEventGroup,
                    CORE_MQTT_AGENT_DISCONNECTED_BIT,
                    pdFALSE,
                    pdFALSE,
                    0 );
                if( xReturnedEventBits_2 != 5 )
                {
                    ESP_LOGW( "TAG",
                              "L777 xEventGroupWaitBits: %lx",
                              xReturnedEventBits_2 );
                }
                if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
                {
                    break;
                }
                fd_set readSet;
                fd_set errorSet;

And here are the logs when issue appear:

2024-07-25 12:57:06	I (3959356) gate_control: Unsubscribe 1777 for topic filter /gates/GA97DF0006 succeeded for task "GateReport".
I (3959356) gate_control: Task "GateReport" iteration 591 completed.

2024-07-25 12:57:10	I (3962626) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0006 with id 1778

2024-07-25 12:57:15	E (3967626) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (3967636) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967636) core_mqtt_agent_manager: L533 Set disconnect bit.
I (3967646) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (3967666) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967656) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1740

I (3967676) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (3967646) TAG: L777 xEventGroupWaitBits: 9
I (3967686) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
2024-07-25 12:57:15	W (3967766) TAG: L777 xEventGroupWaitBits: 9
W (3967836) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15	W (3967906) TAG: L777 xEventGroupWaitBits: 9
W (3967976) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15	W (3968046) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15	W (3968116) TAG: L777 xEventGroupWaitBits: 9
W (3968146) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (3968146) AWS_OTA: OTA Agent is suspended.
I (3968146) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
W (3968186) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15	W (3968256) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15	W (3968326) TAG: L777 xEventGroupWaitBits: 9
W (3968396) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:20	I (3973536) gate_control: Task "GateControl" iteration 65 completed.
2024-07-25 12:57:25	W (3978446) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:35	W (3988496) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:45	W (3998546) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:56	W (4008596) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:06	W (4018646) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:16	W (4028696) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:17	E (4030026) supervisor: GateControl task i
2024-07-25 12:58:17	s not running

So even if xEventGroupWaitBits: 5 changed to xEventGroupWaitBits: 9 the condition

              if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
                {
                    break;
                }

didn't break the loop which seems to be reasonable as we have defined:

#define WIFI_CONNECTED_BIT               ( 1 << 0 )
#define WIFI_DISCONNECTED_BIT            ( 1 << 1 )
#define CORE_MQTT_AGENT_CONNECTED_BIT    ( 1 << 2 )
#define CORE_MQTT_AGENT_DISCONNECTED_BIT ( 1 << 3 )

So 5 mean WIFI_CONNECTED (1) and CORE_MQTT_AGENT_CONNECTED (4), isn't it?
Than flipping bits 2 & 3:

                        xEventGroupClearBits( xNetworkEventGroup,
                                              CORE_MQTT_AGENT_CONNECTED_BIT );
                        xEventGroupSetBits( xNetworkEventGroup,
                                            CORE_MQTT_AGENT_DISCONNECTED_BIT );

we get 9 as CORE_MQTT_AGENT_DISCONNECTED_BIT (8) and WIFI_CONNECTED_BIT(1) is still enabled.
This mean that the if condition to beak the loop should be something like that:

                if( ( xReturnedEventBits_2 &
                      CORE_MQTT_AGENT_DISCONNECTED_BIT ) == 8 )
                {
                    break;
                }

or

                if( ( xReturnedEventBits_2 &
                      CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0 )
                {
                    break;
                }

than L777 should look something more like this:

while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )

I'm not very experienced with bitwise operations, please correct me if I'm wrong.

@NightSkySK
Copy link
Contributor Author

Once I implemented this change:

               if( ( xReturnedEventBits_2 &
                    CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0 )
               {
                  break;
               }

Now it looks much better:

2024-07-25 17:54:04	I (13997266) gate_control: Task "GateControl" iteration 232 completed.

2024-07-25 17:54:04	I (13997786) coreMQTT: Publishing message to /gates/GA97DF0005.


2024-07-25 17:54:11	E (14004786) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (14004796) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004796) core_mqtt_agent_manager: L533 Set disconnect bit.
I (14004806) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1880

I (14004826) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: L777 xEventGroupWaitBits: 9
I (14004846) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (14004846) core_mqtt_agent_manager: L712 xEventGroupWaitBits: 9
I (14004856) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (14004866) core_mqtt_agent_manager: TLS connection was disconnected.

2024-07-25 17:54:11	W (14004976) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (14004976) AWS_OTA: OTA Agent is suspended.
I (14004976) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-25 17:54:14	I (14007516) coreMQTT: MQTT connection established with the broker.
I (14007526) core_mqtt_agent_manager: Session present: 0

I (14007526) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (14007526) gate_control: Error or timed out waiting for ack for publish message 6271. Re-attempting publish.
I (14007536) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005 will be attempted.
W (14007556) core_mqtt_agent_manager: L772 Clear disconnect bit 
I (14007556) report: coreMQTT-Agent connected.
W (14007556) core_mqtt_agent_manager: L509 xEventGroupWaitBits: 5
I (14007566) gate_control: coreMQTT-Agent connected.
I (14007576) supervisor: coreMQTT-Agent connected.
I (14007586) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (14007576) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 2088}" on topic "/gates/GA97DF0005" with ID 6271.
I (14007596) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (14007616) gate_control: Task "GateReport" waiting for publish 6271 to complete.

2024-07-25 17:54:14	I (14007976) AWS_OTA: otaPal_GetPlatformImageState
I (14007976) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (14007976) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (14007976) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]

2024-07-25 17:54:16	I (14009036) coreMQTT: Publishing message to /gates/GA97DF0005.

@JasonYan324
Copy link

Maybe the MQTT_PINGRESP_TIMEOUT_MS is too short for weaker network conditions? To investigate this issue I'm running simultaneously 6 devices, of which 3 are connected to IoT WiFI network where the router is supplied with 300Mbit/s fibre internet and there I don't see that MQTTKeepAliveTimeout was so frequently or even at all, much worse situation is with three other devices which are connected to TP-LINK Archer MR200 4G LTE where internet is taken from cellular LTE network. I haven't compared the pings in those networks, but I can imagine that LTE one has higher pings than the fibre one.

This is definitely possible. You can try enlarge it as workaround.

Can I ask MQTT cloud reconnect itself when wifi reconnect?In my test I foud that when wifi disconnect and reconnect itself,MQTT client can not reconnect itself. here are some log.

I (36611157) wifi:bcn_timeout,ap_probe_send_start
[0;32mI (36611317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
[0;31mE (36612037) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout[0m
[0;31mE (36612037) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout[0m
[0;31mE (36612037) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
[0m
[0;32mI (36612037) core_mqtt_agent_manager: coreMQTT-Agent disconnected.[0m
[0;32mI (36612037) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.[0m
[0;32mI (36612037) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.[0m
[0;32mI (36612037) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.[0m
[0;33mW (36612317) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.[0m
[0;33mW (36612317) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.[0m
[0;32mI (36612317) AWS_OTA: OTA Agent is suspended.[0m
[0;32mI (36612317) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended][0m
[0;32mI (36612317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
[0;32mI (36613317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
I (36613657) wifi:ap_probe_send over, resett wifi status to disassoc
I (36613657) wifi:state: run -> init (c800)
I (36613657) wifi:pm stop, total sleep time: 31072080977 us / 36610841828 us
I (36613657) wifi:idx:1, tid:0
I (36613657) wifi:idx:0, tid:4
I (36613657) wifi:new:<6,0>, old:<6,0>, ap:<255,255>, sta:<6,0>, prof:1
[0;32mI (36613667) core_mqtt_agent_manager: WiFi disconnected.[0m
[0;32mI (36613667) app_wifi: Disconnected. Connecting to the AP again...[0m
[0;32mI (36614317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
[0;32mI (36615317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
I (36616077) wifi:new:<1,0>, old:<6,0>, ap:<255,255>, sta:<1,0>, prof:1
I (36616077) wifi:state: init -> auth (b0)
I (36616077) wifi:state: auth -> assoc (0)
I (36616107) wifi:state: assoc -> run (10)
I (36616117) wifi:connected with test123, aid = 17, channel 1, BW20, bssid = f6:84:8d:1b:33:62
I (36616117) wifi:security: WPA2-PSK, phy: bgn, rssi: -67
I (36616137) wifi:pm start, type: 1
I (36616137) wifi:set rx beacon pti, rx_bcn_pti: 14, bcn_timeout: 25000, mt_pti: 14, mt_time: 10000
I (36616167) wifi:AP's beacon interval = 102400 us, DTIM period = 1
I (36616177) wifi:idx:0 (ifx:0, f6:84:8d:1b:33:62), tid:4, ssn:2, winSize:64
[0;32mI (36616317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
[0;32mI (36617137) core_mqtt_agent_manager: WiFi connected.[0m
[0;32mI (36617137) app_wifi: Connected with IP Address:192.168.2.158[0m
[0;32mI (36617137) esp_netif_handlers: sta ip: 192.168.2.158, mask: 255.255.255.0, gw: 192.168.2.1[0m
[0;32mI (36617317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m
[0;31mE (36713767) coreMQTT: No command structure available.[0m
[0;32mI (36714317) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0[0m

@NightSkySK
Copy link
Contributor Author

@JasonYan324 it all depends which version of main/networking/mqtt/core_mqtt_agent_manager.c file you are using, If your version is up to date with main branch than it's a case which we investigate here.
My proposal for fix is make change in L777 from xEventGroupWaitBits to following code:

while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )

There is quite big chance that it will help with your case as well. Please let us know if that works for you as I'm about to prepare the Pull Request with that solution

@JasonYan324
Copy link

it all depends which version of main/networking/mqtt/core_mqtt_agent_manager.c file you are using, If your version is up to date with main branch than it's a case which we investigate here.
My proposal for fix is make change in L777 from xEventGroupWaitBits to following code:

my SDK version is 5.2.2, not the new. Today, I saw the SDK is updtate to 5.3.

one more thing, even if wifi does not diconnect, MQTT keep alive failed and disconnect, but can not be reconnect itselt.

here are some log:

�[0;32mI (5357267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5357297) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /Meshed/Jason/404CCA89C184/Filter/SubPub0 with id 247�[0m
�[0;32mI (5358267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5358837) coreMQTT: Ack packet deserialized with result: MQTTSuccess.�[0m
�[0;32mI (5358847) coreMQTT: State record updated. New state=MQTTPublishDone.�[0m
�[0;32mI (5359267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5360267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5361267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5362267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5363267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5363847) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout�[0m
�[0;31mE (5363847) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout�[0m
�[0;31mE (5363847) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
�[0m
�[0;32mI (5363847) core_mqtt_agent_manager: coreMQTT-Agent disconnected.�[0m
�[0;32mI (5363847) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.�[0m
�[0;32mI (5363847) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.�[0m
�[0;32mI (5363847) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.�[0m
�[0;33mW (5364267) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.�[0m
�[0;33mW (5364267) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.�[0m
�[0;32mI (5364267) AWS_OTA: OTA Agent is suspended.�[0m
�[0;32mI (5364267) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]�[0m
�[0;32mI (5452267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5453267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5454267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5455267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5456267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5457267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5458267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5459267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5460097) coreMQTT: No command structure available.�[0m
�[0;32mI (5478267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5479267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5480117) coreMQTT: No command structure available.�[0m
�[0;32mI (5487267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5488267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5489267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5490127) coreMQTT: No command structure available.�[0m
�[0;32mI (5498267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5499267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5500137) coreMQTT: No command structure available.�[0m
�[0;32mI (5509267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;31mE (5510147) coreMQTT: No command structure available.�[0m
�[0;32mI (5510267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m
�[0;32mI (5511267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0�[0m

@NightSkySK
Copy link
Contributor Author

I see the solution is merged with this commit 36492fc

@JasonYan324
Copy link

I see the solution is merged with this commit 36492fc

It seems, the commit 36492fc have a lot of problems

@NightSkySK
Copy link
Contributor Author

NightSkySK commented Jul 29, 2024

as I mentioned in first e-mail I've applied changes from following commits:
a0fe220 referring to sdkconfig.default
and e2d407f referring to main/networking/mqtt/core_mqtt_agent_manager.c
mainly to fix esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 which was clear indication mbedtls memory leak.
In the mean time we found that coreMQTT can't reestablish a connection after error (e.g. due MQTTKeepAliveTimeout)
which was solved with 36492fc .
How ever after all those changes I find out that esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 still appear again after about 11 hours running the program.

2024-07-29 08:24:04	E (39396617) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (39396627) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1816

I (39396637) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: TLS connection was disconnected.
I (39396657) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396667) core_mqtt_agent_manager: Internal heap size: 57260bytes
I (39396677) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (39396677) core_mqtt_agent_manager: Minimum free heap size: 11924 bytes
I (39396687) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
W (39396737) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (39396737) AWS_OTA: OTA Agent is suspended.
I (39396737) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]

2024-07-29 08:24:04	E (39397197) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397197) esp-tls: create_ssl_handle failed
E (39397197) esp-tls: Failed to open new connection

2024-07-29 08:24:04	I (39397517) core_mqtt_agent_manager: Retry attempt 1.

2024-07-29 08:24:05	E (39397707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397717) esp-tls: create_ssl_handle failed
E (39397717) esp-tls: Failed to open new connection
I (39398167) core_mqtt_agent_manager: Retry attempt 2.

2024-07-29 08:24:05	E (39398427) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39398427) esp-tls: create_ssl_handle failed
E (39398427) esp-tls: Failed to open new connection

2024-07-29 08:24:07	I (39399907) core_mqtt_agent_manager: Retry attempt 3.

2024-07-29 08:24:07	E (39400077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39400077) esp-tls: create_ssl_handle failed
E (39400077) esp-tls: Failed to open new connection

2024-07-29 08:24:09	I (39401837) core_mqtt_agent_manager: Retry attempt 4.
E (39402017) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39402017) esp-tls: create_ssl_handle failed
E (39402017) esp-tls: Failed to open new connection

2024-07-29 08:24:11	I (39404217) core_mqtt_agent_manager: Retry attempt 5.

2024-07-29 08:24:11	E (39404387) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39404387) esp-tls: create_ssl_handle failed
E (39404387) esp-tls: Failed to open new connection

2024-07-29 08:24:16	I (39409227) core_mqtt_agent_manager: Retry attempt 6.

2024-07-29 08:24:16	E (39409417) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39409427) esp-tls: create_ssl_handle failed
E (39409427) esp-tls: Failed to open new connection

2024-07-29 08:24:17	I (39410127) core_mqtt_agent_manager: Retry attempt 7.

2024-07-29 08:24:17	E (39410307) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39410307) esp-tls: create_ssl_handle failed
E (39410307) esp-tls: Failed to open new connection

2024-07-29 08:24:17	E (39410527) supervisor: Connection can't be established

Could you please advice any more steps to debug reasons of memory leak causing finally to esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00?

@NightSkySK NightSkySK reopened this Jul 29, 2024
@ActoryOu
Copy link
Member

Hello @NightSkySK,
From your log, you've added extra info like core_mqtt_agent_manager: Internal heap size: 57260bytes and Minimum free heap size: 11924 bytes. Could you help attach the entire log? I'd like to narrow down the scenario.

Thank you.

@ActoryOu
Copy link
Member

ActoryOu commented Jul 29, 2024

In the meantime, I will try to perform a code review to see if there are any clues. BTW, what change did you do? Could you help share your testing branch?

@ActoryOu
Copy link
Member

Hi @NightSkySK,
I'm still having trouble identifying the potential cause of the heap memory leakage based on the information provided. Could you please share the demo configuration you're using? Have you made any modifications to the demo, or are you using the demo with only configuration changes? That additional context would be helpful in assisting you with this issue.

Thank you for your cooperation.

@NightSkySK
Copy link
Contributor Author

@ActoryOu This week I'm travelling and not able to provide more evidence. I need to include more logs like Minimum free heap size: into the code to trace it in the time of code running, so far I have it only in this one place. I can invite you to my repo however it's already quite heavily modified to adjust to my needs.

@WilliamFrasson
Copy link

Hello
I face the same issues with my ESP32-S3

System information
Hardware board: ESP32-S3 (ESP32 S3 DevKitC 1 N16R8 ESP32 S3 WROOM1 N16R8)
ESP-IDF: 5.2.2
ESP-IDE 3.0.0
Operating System: Windows

I (232435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (233435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (234435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (235435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (236115) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxx"}
I (236135) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "{"xxxx"}"  with ID 24.
I (236155) coreMQTT: Publishing message to ESP32S3_TEST1/out/memcheck.

I (236165) MQTT_UTILS: Task "APPTASK" waiting for publish 24 to complete.
I (236175) MQTT_UTILS: Publish 24 succeeded for task "APPTASK".
I (236435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0


E (237165) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout


I (237175) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (237185) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (237185) MQTT_UTILS: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (237435) ota_over_mqtt_demo: Call OTA_Suspend...
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (237435) AWS_OTA: OTA Agent is suspended.
I (237445) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (237445) ota_over_mqtt_demo: Call OTA_Suspend...OK
I (237455) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (238465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (239465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0

@JasonYan324
Copy link

Hello I face the same issues with my ESP32-S3

System information Hardware board: ESP32-S3 (ESP32 S3 DevKitC 1 N16R8 ESP32 S3 WROOM1 N16R8) ESP-IDF: 5.2.2 ESP-IDE 3.0.0 Operating System: Windows

I (232435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (233435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (234435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (235435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (236115) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxx"}
I (236135) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "{"xxxx"}"  with ID 24.
I (236155) coreMQTT: Publishing message to ESP32S3_TEST1/out/memcheck.

I (236165) MQTT_UTILS: Task "APPTASK" waiting for publish 24 to complete.
I (236175) MQTT_UTILS: Publish 24 succeeded for task "APPTASK".
I (236435) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0


E (237165) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (237165) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout


I (237175) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (237185) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (237185) MQTT_UTILS: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (237435) ota_over_mqtt_demo: Call OTA_Suspend...
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=0, can't stop.
W (237435) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (237435) AWS_OTA: OTA Agent is suspended.
I (237445) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (237445) ota_over_mqtt_demo: Call OTA_Suspend...OK
I (237455) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (238465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0
I (239465) ota_over_mqtt_demo:  Received: 0   Queued: 0   Processed: 0   Dropped: 0

you can update the git source code,them fix this issue, but “keep alive failed” will still happens,what is new is that coreMQTT-Agent can be reconnect itself.
But, even worse: in my test ota_over_mqtt_demo can not run, it will cause reboot, I have to close it for test up to now. I still waiting for them to feedback

@WilliamFrasson
Copy link

Hello, update source code and as you already said

"Handling of keep alive failed. Status=MQTTKeepAliveTimeout" still happen but now seem that ESP32-s3 reset as you can see in the firts lines of logs here below:

I (1099585) main: Send TOPIC: ESP32S3_TEST1/out/memcheck MESSAGE: {"xxxxxx"}
I (1099605) MQTT_UTILS: Sending publish request Topic: "ESP32S3_TEST1/out/memcheck": "xxxxxx"  with ID 139.
I (1099635) MQTT_UTILS: Task "APPTASK" waiting for publish 139 to complete.
I (1101275) MQTT_UTILS: Publish 139 succeeded for task "APPTASK".

E (1107325) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (1107325) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x8 (TG1WDT_SYS_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x42004d7b
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3810,len:0x11b4
load:0x403c9700,len:0x4
load:0x403c9704,len:0xab4
load:0x403cc700,len:0x2c14
entry 0x403c9890
I (261) cpu_start: Multicore app
I (271) cpu_start: Pro cpu start user code
I (271) cpu_start: cpu freq: 160000000 Hz
I (271) cpu_start: Application information:
I (274) cpu_start: Project name:     smartmeter
I (279) cpu_start: App version:      v202212.00-64-g06f5ce1-dirty
I (286) cpu_start: Compile time:     Aug  1 2024 08:04:32
I (292) cpu_start: ELF file SHA256:  57fe881d4...
I (297) cpu_start: ESP-IDF:          v5.2.2-dirty
I (303) cpu_start: Min chip rev:     v0.0
I (307) cpu_start: Max chip rev:     v0.99
I (312) cpu_start: Chip rev:         v0.2
I (317) heap_init: Initializing. RAM available for dynamic allocation:
I (324) heap_init: At 3FCABFE8 len 0003D728 (245 KiB): RAM
I (330) heap_init: At 3FCE9710 len 00005724 (21 KiB): RAM
I (336) heap_init: At 3FCF0000 len 00008000 (32 KiB): DRAM
I (343) heap_init: At 600FE014 len 00001FD4 (7 KiB): RTCRAM
I (350) spi_flash: detected chip: generic
I (354) spi_flash: flash io: dio
W (357) spi_flash: Detected size(16384k) larger than the size in the binary image header(4096k). Using the size in the binary image header.
W (371) mcpwm(legacy): legacy driver is deprecated, please migrate to `driver/mcpwm_prelude.h`
I (380) sleep: Configure to isolate all GPIO pins in sleep state
I (387) sleep: Enable automatic switching of GPIO sleep configuration
I (394) coexist: coex firmware version: d96c1e51f
I (399) coexist: coexist rom version e7ae62f
I (405) main_task: Started on CPU0
I (415) main_task: Calling app_main()
I (415) main: ------ APPLICATION START ----------
I (415) main:
EWC_S3 WFR ESP-IDF Version = v5.2.2-dirty

I (425) main: Get Device Credentil From FLASH Initial partition. OK!
I (435) main: xDeviceInitialSettings:
I (435) main:   cDeviceSerial[sizeDeviceSerial]=ESP32S3_TEST1
I (445) main:   cUdpPin[sizeUdpPin]=1234
I (445) main:   cXorKey[sizeXorKey]=DEMO_KEY
I (455) main:   cThingName[sizeThingName]=ESP32S3_TEST1
I (455) main:   cCertificate[sizeCertificate]=-----BEGIN CERTIFICATE-----
...

and secondly, already as you said: ota_over_mqtt_demo can not run, : with "with error code 3."


I (2195) ota_over_mqtt_demo: OTA over MQTT demo, Application version 0.0.0
I (2205) main: START APPLICATION EWC
I (2205) main: ------ APPLICATION START ----------
...
I (9485) core_mqtt_agent_manager: TLS connection established.
I (10095) coreMQTT: MQTT connection established with the broker.
I (10095) core_mqtt_agent_manager: Session present: 0

I (10095) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (10105) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (10105) MQTT_UTILS: coreMQTT-Agent connected.
I (10185) ota_over_mqtt_demo: Request Job Document event Received

E (10185) ota_over_mqtt_demo: Failed to write job start next topic to buffer with error code 3.

@WilliamFrasson
Copy link

Hello,
just un update, after I have excluded the ota_over_mqtt_demo from my code and all seem work without any error from the 16 hours, without any event of MQTTKeepAliveTimeout.

@ActoryOu
Copy link
Member

ActoryOu commented Aug 5, 2024

Hi @WilliamFrasson,
We recently fixed an issue on OTA demo via #103. Could you help pull latest version and test if same issue exists?

Thank you.

@WilliamFrasson
Copy link

Hi @ActoryOu
I enabled the ota demo test with the ota fix, and now running .. after 4 hours i don't get MQTTKeepAliveTimeout events.

@kar-rahul-aws
Copy link
Member

Hi @JasonYan324
Since you were also facing a similar issue , can you confirm if the #103 change works for you as well?

@JasonYan324
Copy link

Hi @JasonYan324 Since you were also facing a similar issue , can you confirm if the #103 change works for you as well?

Thanks,I have already update for test, and gonging well

@ActoryOu
Copy link
Member

I'm closing this because the patch works for multiple people.
@NightSkySK, please feel free to create a new issue to describe your current status if the patch doesn't work.

Thank you.

@ivobelitz
Copy link

I've just revert changes introduced by commit e2d407f

and as long as it doesn't solve MQTTKeepAliveTimeout software is again capable of recovering MQTT connection without necessary rebooting the device

2024-07-24 12:08:17	I (343095) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110

2024-07-24 12:08:24	E (350095) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout

I (350105) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (350105) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1896

I (350115) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350105) core_mqtt_agent_manager: TLS connection was disconnected.
I (350135) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350145) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (350155) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 12:08:27	I (352415) coreMQTT: MQTT connection established with the broker.
I (352415) core_mqtt_agent_manager: Session present: 0

I (352415) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (352415) gate_control: Error or timed out waiting for ack to subscribe message 110. Re-attempting subscribe.
I (352435) report: coreMQTT-Agent connected.
I (352445) gate_control: coreMQTT-Agent connected.
I (352455) supervisor: coreMQTT-Agent connected.
I (352455) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110
I (352455) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (352475) core_mqtt_agent_manager: coreMQTT-Agent connected.

2024-07-24 12:08:29	I (355055) gate_control: Subscribe 110 for topic filter /gates/GA97DF0005 succeeded for task "GateReport".

Unfortunately without those changes also E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00 returns to be an issue:

I (38625455) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502

2024-07-24 22:46:00	E (38627905) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
W (38627905) gate_control: Error or timed out waiting for ack to unsubscribe message 12502. Re-attempting subscribe.
E (38627905) coreMQTT: MQTT operation failed with status MQTTSendFailed

I (38627925) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502
I (38627925) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38627925) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1708

I (38627945) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627925) core_mqtt_agent_manager: TLS connection was disconnected.
I (38627965) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627985) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38627995) core_mqtt_agent_manager: coreMQTT-Agent disconnected.

2024-07-24 22:46:02	E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38628635) esp-tls: create_ssl_handle failed
E (38628635) esp-tls: Failed to open new connection
W (38628695) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38628695) AWS_OTA: OTA Agent is suspended.
I (38628695) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (38628945) core_mqtt_agent_manager: Retry attempt 1.

2024-07-24 22:46:02	E (38629155) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629155) esp-tls: create_ssl_handle failed
E (38629155) esp-tls: Failed to open new connection

2024-07-24 22:46:02	I (38629615) core_mqtt_agent_manager: Retry attempt 2.

2024-07-24 22:46:03	E (38629715) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629715) esp-tls: create_ssl_handle failed
E (38629715) esp-tls: Failed to open new connection

2024-07-24 22:46:04	I (38631185) core_mqtt_agent_manager: Retry attempt 3.
E (38631405) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38631405) esp-tls: create_ssl_handle failed
E (38631405) esp-tls: Failed to open new connection

2024-07-24 22:46:06	I (38633155) core_mqtt_agent_manager: Retry attempt 4.

This issue still persists. After running for ~3.3 hours, the device gets stuck in the following loop:

I (12075390) core_mqtt_agent_manager: Retry attempt 1.
E (12075430) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (12075430) esp-tls: create_ssl_handle failed
E (12075430) esp-tls: Failed to open new connection

I've also circumvented the issue by restarting the device after N attempts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants