[BUG] coreMQTT doesn't reconnect and keep alive handling fails #91
Comments
I've just reverted the changes introduced by commit e2d407f and as long as it doesn't solve
2024-07-24 12:08:17 I (343095) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110
2024-07-24 12:08:24 E (350095) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (350095) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (350105) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (350105) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1896
I (350115) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350105) core_mqtt_agent_manager: TLS connection was disconnected.
I (350135) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (350145) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (350155) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
2024-07-24 12:08:27 I (352415) coreMQTT: MQTT connection established with the broker.
I (352415) core_mqtt_agent_manager: Session present: 0
I (352415) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (352415) gate_control: Error or timed out waiting for ack to subscribe message 110. Re-attempting subscribe.
I (352435) report: coreMQTT-Agent connected.
I (352445) gate_control: coreMQTT-Agent connected.
I (352455) supervisor: coreMQTT-Agent connected.
I (352455) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0005 with id 110
I (352455) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (352475) core_mqtt_agent_manager: coreMQTT-Agent connected.
2024-07-24 12:08:29 I (355055) gate_control: Subscribe 110 for topic filter /gates/GA97DF0005 succeeded for task "GateReport".
Unfortunately without those changes also
I (38625455) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502
2024-07-24 22:46:00 E (38627905) coreMQTT: sendMessageVector: Unable to send packet: Network Error.
W (38627905) gate_control: Error or timed out waiting for ack to unsubscribe message 12502. Re-attempting subscribe.
E (38627905) coreMQTT: MQTT operation failed with status MQTTSendFailed
I (38627925) gate_control: Task "GateReport" sending unsubscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0004 with id 12502
I (38627925) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (38627925) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1708
I (38627945) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627925) core_mqtt_agent_manager: TLS connection was disconnected.
I (38627965) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (38627985) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
I (38627995) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
2024-07-24 22:46:02 E (38628625) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38628635) esp-tls: create_ssl_handle failed
E (38628635) esp-tls: Failed to open new connection
W (38628695) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (38628695) AWS_OTA: OTA Agent is suspended.
I (38628695) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
I (38628945) core_mqtt_agent_manager: Retry attempt 1.
2024-07-24 22:46:02 E (38629155) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629155) esp-tls: create_ssl_handle failed
E (38629155) esp-tls: Failed to open new connection
2024-07-24 22:46:02 I (38629615) core_mqtt_agent_manager: Retry attempt 2.
2024-07-24 22:46:03 E (38629715) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38629715) esp-tls: create_ssl_handle failed
E (38629715) esp-tls: Failed to open new connection
2024-07-24 22:46:04 I (38631185) core_mqtt_agent_manager: Retry attempt 3.
E (38631405) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
E (38631405) esp-tls: create_ssl_handle failed
E (38631405) esp-tls: Failed to open new connection
2024-07-24 22:46:06 I (38633155) core_mqtt_agent_manager: Retry attempt 4. |
Hi @NightSkySK,
My personal guess is that the ping request has been sent from the device side and the device is waiting for the ping response. In this case, the keep-alive timeout will still happen even though we've sent the publish message successfully (refer to this line). Let me trigger an internal discussion to see if we can find a better solution!
I'm looking into this and will be back once I've found something. Thank you! |
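The guess above can be illustrated with a minimal sketch. This is a simplified model, not the coreMQTT source: the struct, the function names and the 5000 ms timeout value are all assumptions made for illustration. It shows why a successful outgoing publish does not prevent the timeout once a ping request is outstanding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed timeout for illustration only; the real value is a library setting. */
#define MODEL_PINGRESP_TIMEOUT_MS    ( 5000U )

/* Hypothetical model of the client-side keep-alive state. */
typedef struct
{
    bool waitingForPingResp;    /* A ping request is outstanding. */
    uint32_t pingReqSendTimeMs; /* Time at which the ping request was sent. */
} KeepAliveModel_t;

/* Record that a ping request was sent at nowMs. */
static void modelOnPingReqSent( KeepAliveModel_t * pModel, uint32_t nowMs )
{
    pModel->waitingForPingResp = true;
    pModel->pingReqSendTimeMs = nowMs;
}

/* Only a ping response clears the outstanding wait. */
static void modelOnPingRespReceived( KeepAliveModel_t * pModel )
{
    pModel->waitingForPingResp = false;
}

/* An outgoing publish deliberately leaves the wait untouched in this model. */
static void modelOnPublishSent( KeepAliveModel_t * pModel, uint32_t nowMs )
{
    ( void ) pModel;
    ( void ) nowMs;
}

/* True when the ping response has been outstanding for longer than the timeout. */
static bool modelKeepAliveTimedOut( const KeepAliveModel_t * pModel, uint32_t nowMs )
{
    return pModel->waitingForPingResp &&
           ( ( nowMs - pModel->pingReqSendTimeMs ) > MODEL_PINGRESP_TIMEOUT_MS );
}
```

In this model, a ping request sent at t=1000 ms followed by a publish at t=3000 ms still reports a timeout at t=7000 ms unless a ping response arrives first.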
Hi @ActoryOu thank you for looking into this issue.
Maybe the |
Looks like the
|
This is definitely possible. You can try enlarging it as a workaround. |
I've implemented all logging requested:
As in normal condition each I've made such condition in code:
if( eMqttRet == MQTTSuccess )
{
while( 1 )
{
xReturnedEventBits_2 = xEventGroupWaitBits(
xNetworkEventGroup,
CORE_MQTT_AGENT_DISCONNECTED_BIT,
pdFALSE,
pdFALSE,
0 );
if( xReturnedEventBits_2 != 5 )
{
ESP_LOGW( "TAG",
"L777 xEventGroupWaitBits: %lx",
xReturnedEventBits_2 );
}
if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
{
break;
}
fd_set readSet;
fd_set errorSet;
And here are the logs when the issue appears:
2024-07-25 12:57:06 I (3959356) gate_control: Unsubscribe 1777 for topic filter /gates/GA97DF0006 succeeded for task "GateReport".
I (3959356) gate_control: Task "GateReport" iteration 591 completed.
2024-07-25 12:57:10 I (3962626) gate_control: Task "GateReport" sending subscribe request to coreMQTT-Agent for topic filter: /gates/GA97DF0006 with id 1778
2024-07-25 12:57:15 E (3967626) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (3967626) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (3967636) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967636) core_mqtt_agent_manager: L533 Set disconnect bit.
I (3967646) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (3967666) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (3967656) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1740
I (3967676) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (3967646) TAG: L777 xEventGroupWaitBits: 9
I (3967686) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
2024-07-25 12:57:15 W (3967766) TAG: L777 xEventGroupWaitBits: 9
W (3967836) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3967906) TAG: L777 xEventGroupWaitBits: 9
W (3967976) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968046) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968116) TAG: L777 xEventGroupWaitBits: 9
W (3968146) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (3968146) AWS_OTA: OTA Agent is suspended.
I (3968146) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
W (3968186) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968256) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:15 W (3968326) TAG: L777 xEventGroupWaitBits: 9
W (3968396) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:20 I (3973536) gate_control: Task "GateControl" iteration 65 completed.
2024-07-25 12:57:25 W (3978446) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:35 W (3988496) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:45 W (3998546) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:57:56 W (4008596) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:06 W (4018646) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:16 W (4028696) TAG: L777 xEventGroupWaitBits: 9
2024-07-25 12:58:17 E (4030026) supervisor: GateControl task is not running
So even though xEventGroupWaitBits: 5 changed to xEventGroupWaitBits: 9, the condition
if( xReturnedEventBits_2 == CORE_MQTT_AGENT_DISCONNECTED_BIT )
{
break;
}
didn't break the loop, which seems reasonable as we have defined:
#define WIFI_CONNECTED_BIT ( 1 << 0 )
#define WIFI_DISCONNECTED_BIT ( 1 << 1 )
#define CORE_MQTT_AGENT_CONNECTED_BIT ( 1 << 2 )
#define CORE_MQTT_AGENT_DISCONNECTED_BIT ( 1 << 3 )
So after
xEventGroupClearBits( xNetworkEventGroup,
CORE_MQTT_AGENT_CONNECTED_BIT );
xEventGroupSetBits( xNetworkEventGroup,
CORE_MQTT_AGENT_DISCONNECTED_BIT );
we get 9, because CORE_MQTT_AGENT_DISCONNECTED_BIT (8) is set and WIFI_CONNECTED_BIT (1) is still enabled. The condition should therefore mask the bit before comparing:
if( ( xReturnedEventBits_2 &
CORE_MQTT_AGENT_DISCONNECTED_BIT ) == 8 )
{
break;
}
or
if( ( xReturnedEventBits_2 &
CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0 )
{
break;
}
Then L777 should look something more like this:
while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )
I'm not very experienced with bitwise operations, please correct me if I'm wrong. |
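The difference between the exact comparison and the masked comparison can be checked off-target with a small sketch. The bit definitions are copied from the comment above; the two helper functions are hypothetical names introduced only for this illustration, and plain integers stand in for the value returned by xEventGroupWaitBits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout as quoted in the comment above. */
#define WIFI_CONNECTED_BIT                  ( 1 << 0 )
#define WIFI_DISCONNECTED_BIT               ( 1 << 1 )
#define CORE_MQTT_AGENT_CONNECTED_BIT       ( 1 << 2 )
#define CORE_MQTT_AGENT_DISCONNECTED_BIT    ( 1 << 3 )

/* Hypothetical helper: exact comparison, as in the loop that never broke.
 * It is true only when the disconnected bit is the ONLY bit set. */
static bool isDisconnectedExact( uint32_t eventBits )
{
    return eventBits == CORE_MQTT_AGENT_DISCONNECTED_BIT;
}

/* Hypothetical helper: masked comparison.
 * It is true whenever the disconnected bit is set, whatever else is set. */
static bool isDisconnectedMasked( uint32_t eventBits )
{
    return ( eventBits & CORE_MQTT_AGENT_DISCONNECTED_BIT ) != 0;
}
```

With the logged value 9 (Wi-Fi connected plus agent disconnected), the exact comparison is false while the masked one is true. With the value 5 (Wi-Fi connected plus agent connected), both are false.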
Once I implemented this change:
Now it looks much better:
2024-07-25 17:54:04 I (13997266) gate_control: Task "GateControl" iteration 232 completed.
2024-07-25 17:54:04 I (13997786) coreMQTT: Publishing message to /gates/GA97DF0005.
2024-07-25 17:54:11 E (14004786) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (14004786) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (14004796) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004796) core_mqtt_agent_manager: L533 Set disconnect bit.
I (14004806) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1880
I (14004826) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (14004816) core_mqtt_agent_manager: L777 xEventGroupWaitBits: 9
I (14004846) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (14004846) core_mqtt_agent_manager: L712 xEventGroupWaitBits: 9
I (14004856) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (14004866) core_mqtt_agent_manager: TLS connection was disconnected.
2024-07-25 17:54:11 W (14004976) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (14004976) AWS_OTA: OTA Agent is suspended.
I (14004976) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
2024-07-25 17:54:14 I (14007516) coreMQTT: MQTT connection established with the broker.
I (14007526) core_mqtt_agent_manager: Session present: 0
I (14007526) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005/update will be attempted.
W (14007526) gate_control: Error or timed out waiting for ack for publish message 6271. Re-attempting publish.
I (14007536) core_mqtt_agent_manager: Resubscribe to the topic /gates/GA97DF0005 will be attempted.
W (14007556) core_mqtt_agent_manager: L772 Clear disconnect bit
I (14007556) report: coreMQTT-Agent connected.
W (14007556) core_mqtt_agent_manager: L509 xEventGroupWaitBits: 5
I (14007566) gate_control: coreMQTT-Agent connected.
I (14007576) supervisor: coreMQTT-Agent connected.
I (14007586) ota_over_mqtt: coreMQTT-Agent connected. Resuming OTA agent.
I (14007576) gate_control: Task "GateReport" sending publish request to coreMQTT-Agent with message "{"openReceived": 0,"gateSensors":{ "IN1": 0, "IN2": 0},"keyboard":{ "new_key": 0, "key": ""}, "iteration": 2088}" on topic "/gates/GA97DF0005" with ID 6271.
I (14007596) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (14007616) gate_control: Task "GateReport" waiting for publish 6271 to complete.
2024-07-25 17:54:14 I (14007976) AWS_OTA: otaPal_GetPlatformImageState
I (14007976) esp_ota_ops: aws_esp_ota_get_boot_flags: 1
I (14007976) esp_ota_ops: [0] aflags/seq:0x2/0x1, pflags/seq:0xffffffff/0x0
I (14007976) AWS_OTA: Current State=[RequestingJob], Event=[Resume], New state=[RequestingJob]
2024-07-25 17:54:16 I (14009036) coreMQTT: Publishing message to /gates/GA97DF0005. |
Can I ask whether the MQTT cloud connection should re-establish itself when Wi-Fi reconnects? In my test I found that when Wi-Fi disconnects and reconnects by itself, the MQTT client cannot reconnect. Here are some logs:
I (36611157) wifi:bcn_timeout,ap_probe_send_start |
@JasonYan324 it all depends which version of
while( (xEventGroupWaitBits( xNetworkEventGroup, CORE_MQTT_AGENT_DISCONNECTED_BIT, pdFALSE, pdFALSE, 0 )& CORE_MQTT_AGENT_DISCONNECTED_BIT) != 0 )
There is quite a big chance that it will help in your case as well. Please let us know if that works for you, as I'm about to prepare a Pull Request with that solution |
My SDK version is 5.2.2, not the newest. Today I saw that the SDK was updated to 5.3. One more thing: even if Wi-Fi does not disconnect, MQTT keep alive fails and the client disconnects, but it cannot reconnect by itself. Here are some logs:
I (5357267) ota_over_mqtt_demo: Received: 0 Queued: 0 Processed: 0 Dropped: 0 |
I see the solution is merged with this commit 36492fc |
As I mentioned in my first e-mail, I've applied the changes from the following commits:
2024-07-29 08:24:04 E (39396617) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout
E (39396617) coreMQTT: MQTT operation failed with status MQTTKeepAliveTimeout
I (39396627) report: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: Stack size uxMqttAgentTask: 1816
I (39396637) gate_control: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396627) core_mqtt_agent_manager: TLS connection was disconnected.
I (39396657) supervisor: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
W (39396667) core_mqtt_agent_manager: Internal heap size: 57260bytes
I (39396677) ota_over_mqtt: coreMQTT-Agent disconnected. Suspending OTA agent.
W (39396677) core_mqtt_agent_manager: Minimum free heap size: 11924 bytes
I (39396687) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
W (39396737) AWS_OTA: OTA Timer handle NULL for Timerid=1, can't stop.
I (39396737) AWS_OTA: OTA Agent is suspended.
I (39396737) AWS_OTA: Current State=[Suspended], Event=[Suspend], New state=[Suspended]
2024-07-29 08:24:04 E (39397197) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397197) esp-tls: create_ssl_handle failed
E (39397197) esp-tls: Failed to open new connection
2024-07-29 08:24:04 I (39397517) core_mqtt_agent_manager: Retry attempt 1.
2024-07-29 08:24:05 E (39397707) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39397717) esp-tls: create_ssl_handle failed
E (39397717) esp-tls: Failed to open new connection
I (39398167) core_mqtt_agent_manager: Retry attempt 2.
2024-07-29 08:24:05 E (39398427) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39398427) esp-tls: create_ssl_handle failed
E (39398427) esp-tls: Failed to open new connection
2024-07-29 08:24:07 I (39399907) core_mqtt_agent_manager: Retry attempt 3.
2024-07-29 08:24:07 E (39400077) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39400077) esp-tls: create_ssl_handle failed
E (39400077) esp-tls: Failed to open new connection
2024-07-29 08:24:09 I (39401837) core_mqtt_agent_manager: Retry attempt 4.
E (39402017) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39402017) esp-tls: create_ssl_handle failed
E (39402017) esp-tls: Failed to open new connection
2024-07-29 08:24:11 I (39404217) core_mqtt_agent_manager: Retry attempt 5.
2024-07-29 08:24:11 E (39404387) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39404387) esp-tls: create_ssl_handle failed
E (39404387) esp-tls: Failed to open new connection
2024-07-29 08:24:16 I (39409227) core_mqtt_agent_manager: Retry attempt 6.
2024-07-29 08:24:16 E (39409417) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39409427) esp-tls: create_ssl_handle failed
E (39409427) esp-tls: Failed to open new connection
2024-07-29 08:24:17 I (39410127) core_mqtt_agent_manager: Retry attempt 7.
2024-07-29 08:24:17 E (39410307) esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7f00
E (39410307) esp-tls: create_ssl_handle failed
E (39410307) esp-tls: Failed to open new connection
2024-07-29 08:24:17 E (39410527) supervisor: Connection can't be established
Could you please advise on any further steps to debug the cause of the memory leak that finally leads to |
Hello @NightSkySK, Thank you. |
In the meantime, I will try to perform a code review to see if there are any clues. BTW, what change did you do? Could you help share your testing branch? |
Hi @NightSkySK, Thank you for your cooperation. |
@ActoryOu This week I'm travelling and not able to provide more evidence. I need to include more logs like |
Hello System information
|
You can update to the latest git source code; they fixed this issue. "keep alive failed" will still happen, but what is new is that coreMQTT-Agent can now reconnect by itself. |
Hello, I updated the source code and, as you already said, "Handling of keep alive failed. Status=MQTTKeepAliveTimeout" still happens, but now it seems that the ESP32-S3 resets, as you can see in the first lines of the logs below:
Secondly, also as you said, ota_over_mqtt_demo cannot run, failing "with error code 3."
|
Hello, |
Hi @WilliamFrasson, Thank you. |
Hi @ActoryOu |
Hi @JasonYan324 |
Thanks, I have already updated for testing, and it is going well |
I'm closing this because the patch works for multiple people. Thank you. |
This issue still persists. After running for ~3.3 hours, the device gets stuck in the following loop:
I've also circumvented the issue by restarting the device after N attempts. |
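The restart-after-N-attempts workaround could be sketched roughly as follows. This is not the commenter's code: the limit of 7, the struct and the function name are hypothetical, and the actual restart call is left to the caller so the sketch stays platform independent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit; the thread does not state the N that was used. */
#define MAX_RECONNECT_ATTEMPTS    ( 7U )

/* Hypothetical counter of consecutive failed connection attempts. */
typedef struct
{
    uint32_t failedAttempts;
} ReconnectWatchdog_t;

/* Record the outcome of one connection attempt.
 * Returns true when the caller should restart the device. */
static bool reconnectWatchdogRecord( ReconnectWatchdog_t * pWatchdog, bool connected )
{
    if( connected )
    {
        /* A successful connection resets the consecutive-failure count. */
        pWatchdog->failedAttempts = 0U;
        return false;
    }

    pWatchdog->failedAttempts++;
    return pWatchdog->failedAttempts >= MAX_RECONNECT_ATTEMPTS;
}
```

On ESP-IDF the caller would typically invoke esp_restart() when the function returns true. A restart only hides the underlying failure, so it is a stopgap rather than a fix.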
Describe the bug
At first glance, the subject may suggest that this is the same issue as described in #47 and #48. However, although the symptoms are the same, the solutions from those issues haven't solved my problem. Let me explain step by step what happened, together with my observations and logs (it will be quite a long thread, I'm sorry...)
The story begins when I found that my code is suffering from mbedtls memory leak:
In logs I found
esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
which was a clear indication of an mbedtls memory leak
In the same software version I noticed that
MQTTKeepAliveTimeout
appears from time to time, but the software can handle reconnection without issue:
So to fix the mbedtls memory issue I've applied the following commits:
a0fe220 referring to
sdkconfig.default
and e2d407f referring to
main/networking/mqtt/core_mqtt_agent_manager.c
And it worked well for the mbedtls memory issue; I can't find any evidence of
esp-tls-mbedtls: mbedtls_ssl_setup returned -0x7F00
anymore. However, the frequency of MQTTKeepAliveTimeout
significantly increased and the software is no longer capable of recovering the MQTT connection.
And at this stage we have two issues:
MQTTKeepAliveTimeout
even if a few seconds earlier a message was successfully published to the AWS IoT MQTT server, and on the AWS IoT documentation page I can read:
where the key information is
This timer is reset whenever AWS IoT receives a PUBLISH, SUBSCRIBE, PING, or PUBACK message from the client.
I've also tried increasing the keep-alive interval in
sdkconfig.default
from CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=60
to CONFIG_GRI_MQTT_AGENT_KEEP_ALIVE_INTERVAL_SECONDS=600
without any major difference, which I found confirmed on the server side, as
Lifecycle Connect/Disconnect events don't show
MQTT_KEEP_ALIVE_TIMEOUT
but DUPLICATE_CLIENTID
caused by rebooting the ESP32 without properly disconnecting from the MQTT server.
coreMQTT-Agent disconnected
My short-term solution is to trigger a device reboot from my supervisor task. It isn't an elegant solution and I would prefer to fix it properly.
System information