Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Can't be found from the inflight" log spam #81

Closed
louisburton opened this issue Jun 25, 2020 · 24 comments · Fixed by #132
Closed

"Can't be found from the inflight" log spam #81

louisburton opened this issue Jun 25, 2020 · 24 comments · Fixed by #132
Assignees
Labels
bug Something isn't working
Milestone

Comments

@louisburton
Copy link

louisburton commented Jun 25, 2020

I am unsure if this is a duplicate of : Bridge connection loops messages back forever

Due to this reported issue, yesterday I rolled out a change to about 20 devices disabling the file persisted replayq, and upgrading from 3.2.2 to 4.1.0.
I have not used a receive_mountpoint (it wasn't available in 3.2.2 which we're maintaining the contract for), and have seen this occur on two devices since rolling out yesterday. This issue was resolved by restarting - but I am fearful of it occurring again or if I roll this out further.

Bridge config:

bridge.mqtt.cloud.username=uniquedeviceid
bridge.mqtt.cloud.batch_size=32
bridge.mqtt.cloud.queue.max_total_size=1000MB
bridge.mqtt.cloud.clean_start=false
bridge.mqtt.cloud.subscription.1.topic=ctoe/uniquedeviceid/#
bridge.mqtt.cloud.forwards=reg/service/#,etoc/service/#,logs/#
bridge.mqtt.cloud.proto_ver=mqttv4
bridge.mqtt.cloud.tls_versions=tlsv1.2,tlsv1.1,tlsv1
bridge.mqtt.cloud.start_type=auto
bridge.mqtt.cloud.bridge_mode=true
bridge.mqtt.cloud.keepalive=60s
bridge.mqtt.cloud.clientid=uniquedeviceid
bridge.mqtt.cloud.subscription.1.qos=1
bridge.mqtt.cloud.max_inflight_size=32
bridge.mqtt.cloud.ssl=on
bridge.mqtt.cloud.address=edge.xxx.com:443
bridge.mqtt.cloud.queue.replayq_seg_bytes=10MB
bridge.mqtt.cloud.reconnect_interval=30s
bridge.mqtt.cloud.forward_mountpoint=etoc/uniquedeviceid
bridge.mqtt.cloud.retry_interval=20s

On two out of about 20 devices affected, at different times overnight it started churning out :

...
2020-06-25 09:00:41.996 [error] [Bridge] Can't be found from the inflight:54787
2020-06-25 09:00:41.997 [error] [Bridge] Can't be found from the inflight:54788
2020-06-25 09:00:41.998 [error] [Bridge] Can't be found from the inflight:54789
2020-06-25 09:00:41.999 [error] [Bridge] Can't be found from the inflight:54790
...

Our logs are sent over MQTT and these were sent successfully despite these errors from the bridge, but effectively started Denial of Servicing our EMQX broker on the cloud.

Why am I getting this error? I don't see a loopback configuration error?
Regardless of error - it should not constantly spam the logs at a rate of ~200 messages per second.

https://github.com/emqx/emqx-bridge-mqtt/blob/v4.1.0/src/emqx_bridge_worker.erl#L521-L523
Is it possible this refactor has created this regression @turtleDeng ? #39

emqx.log

@louisburton
Copy link
Author

@zhanghongtong Unfortunately, the nature of this issue has meant we've been unable to upgrade to 4.x.x, this is the most critical of the issues we've encountered yet, because when it fails it effectively attacks our cloud MQTT broker.

Not sure if this needs assignment, or if you need any further information. Let me know if so - thanks again!

@Rory-Z
Copy link
Member

Rory-Z commented Jun 25, 2020

CC @turtleDeng @HJianBo

@Rory-Z
Copy link
Member

Rory-Z commented Jun 25, 2020

Hi, @louisburton
We are on vacation now, so please forgive us for the delay

@louisburton
Copy link
Author

No problem, sorry I did not realise this - thanks for letting me know and for being so responsive, even on your vacation! 😄

@Rory-Z
Copy link
Member

Rory-Z commented Jun 25, 2020

Thank you for your understanding. We will finish our vacation on Monday. Have a nice day :)

@HJianBo
Copy link
Member

HJianBo commented Jun 29, 2020

Hi, @louisburton I have reproduced this issue. It happens after shutting down the remote Broker for a long time. Then, the emqx_bridge_mqtt will cache a mass of message to bridge to the remote Broker.

The problem occurs when emqx_bridge_mqtt reconnects to the remote server and delivers a large number of messages continuously.

It should be a BUG in emqx_bridge_mqtt for re-delivering messages. Messages can be correctly sent to the remote Broker, even if it is printed on so many logs. But it's a bad experience!

@HJianBo HJianBo added the bug Something isn't working label Jun 29, 2020
@HJianBo HJianBo added this to the 4.2.0 milestone Jun 29, 2020
@louisburton
Copy link
Author

Thanks very much for reproducing, hope you enjoyed your break 😎 .
You're right that messages can be correctly sent to the remote broker, even if it prints that many logs - but that degree of influx (potentially ~200 messages per second, per device) is too much for us and means we'll need to wait for this fix before we can upgrade to 4.x.x and take other fixes.

@louisburton
Copy link
Author

@zhanghongtong thanks for all the assistance with fixes! Do you think this ticket will still hit 4.2.0, and do you know roughly when this might be released? I'm going to look at rate limiting to mitigate against this a little, but would really like to avoid this issue altogether.

@Rory-Z
Copy link
Member

Rory-Z commented Jul 14, 2020

@louisburton We will fix this problem in the next version, in fact my code has been submitted, you can check this PR

@louisburton
Copy link
Author

I have seen this log spam still from v4.1.2 - I'm not sure this is fixed by the PR @zhanghongtong
Perhaps emqx/emqx#3629 acknowledges that?

@Rory-Z
Copy link
Member

Rory-Z commented Jul 29, 2020

CC @HJianBo
please verify

@HJianBo
Copy link
Member

HJianBo commented Jul 31, 2020

Yes, this PR is not to fix the problem. We plan to fix it at v4.2.0

@louisburton
Copy link
Author

louisburton commented Aug 10, 2020

@HJianBo @zhanghongtong - do you know if this is already fixed in the v4.2-alpha.3 release? If not, do we know when a v4.2.0 release is expected? Or is there a workaround, or something we could do to avoid this occurring?

We have actually suppressed/ignored the log spam, but we have realised that when this occurs, the same MQTT payloads get sent repeatedly. Thus even without the log spam, we spam repeated messages indefinitely. As a result, this is a critical issue for us that prevents us from rolling EMQX further to our device fleet.

We've managed to address most the issues I've raised thus far, and I'd really like to keep our integration with EMQX continuing, so any help is greatly appreciated 🙏 thx

@louisburton
Copy link
Author

Was using QoS 1 and tried CLEAN_START true.
Can confirm they don't help avoid this issue.

@JeremyTheocharis
Copy link

JeremyTheocharis commented Mar 3, 2021

@louisburton Did you find a solution?

We've been experiencing the same and I am quite wondered why there is no stable mqtt bridge with persistent storage solution yet (same with other brokers as mosquitto or vernemq).

@louisburton
Copy link
Author

@JeremyTheocharis sorry for the slow response.

In honesty, I did not find a solution. After a number of issues with the bridge, which might all be resolved in the latest emqx version (haven't tested), we decided to switch to Mosquitto's MQTT bridge (which I believe does now have persistence).

@JeremyTheocharis
Copy link

@louisburton thank you for the response!

Although mosquitto is advertising with persistent bridges we never got it to work properly and this resulted in a lot of missed messages. We decided now to program our own mqtt-bridge, which is still lacking a lot of features, but at least it does not DDoS the server or loses messages. It definitely works with emqx-edge. If anyone is interested, feel free to contact.

@gercorri
Copy link

gercorri commented Mar 15, 2021

Hi @JeremyTheocharis I would be interested in finding out more about your solution. We have a SaaS product relying EMQX Edge which is due to into beta phase for 1st customers in a few weeks and this issue is a major blocker.

@JeremyTheocharis
Copy link

I've added you to our (currently still) private repo and wrote you an email.

@gercorri
Copy link

Hi @JeremyTheocharis thanks for adding me to the repo. I was able to join and can access it. However I haven't recevied a sperate email from you, can you double check it sent ok?

@JeremyTheocharis
Copy link

@gercorri I tried a second email address from your git commits. If this does not work as well feel free to contact me at my email adress (just made it public).

@gercorri
Copy link

@JeremyTheocharis got it now, thanks.

@zmstone
Copy link
Member

zmstone commented Apr 9, 2021

Sorry I did not notice this thread until today.

I made a fix here: emqx/emqx#4513

@JeremyTheocharis if you don't mind, we'd love to know your fix.
or maybe help to review mine see if it makes sense, thanks.

@zmstone
Copy link
Member

zmstone commented Apr 9, 2021

will port the fix to this repo later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants