Skip to content
This repository has been archived by the owner on Mar 15, 2021. It is now read-only.

Add Reconnect delay #167

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

KohliDev
Copy link
Contributor

Adding reconnect random time so that every MQTT client connects randomly to the broker in case of disconnection.

log.info("Clean disconnection")
else:
wait_time = round(random.uniform(1.0, 10.0), 1)
log.info("Unexpected disconnection! Reconnecting in {0} seconds".format(str(wait_time)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to log here, we should log the reason for the disconnect. Are some reasons fatal or do they all warrant a retry? Also, you shouldn't need to use .format(str(wait_time)) and {0}, right? "... {0:d}".format(wait_time) should be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RC code is included above which will inform the user about the reason for disconnection as per MQTT. I`ll refactor the log statement.

if self._disconnect_result_code == 0:
log.info("Clean disconnection")
else:
wait_time = round(random.uniform(1.0, 10.0), 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use random.randint(1, 10) as you do elsewhere, no need for floating point... and please add a comment explaining why we delay and how the upper bound of 10 seconds was chosen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the delay so that every mqtt client have the different random delay(different gateway) before retrying to connect, thus reducing the load on the broker. I used float trying to make it more random. Will add the comment.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being a float and rounding to a whole number doesn't make this any more random.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't rounding to complete whole number it still had once decimal place but agreed not added much random effect.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python got me. :-) But I still disagree with using floats when they are not necessary.

@@ -271,6 +278,7 @@ def connect_soc(self):
self._paho_client.connect(host=self.url, port=self.port, keepalive=self.keep_alive)

# Start network loop to handle auto-reconnect
self._paho_client.reconnect_delay_set(random.randint(1, 15), random.randint(150, 250))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too random for us to understand the behaviour; isn't paho going to slowly move between min and max, anyway, so we should just say 1, 250? Otherwise we need a good comment explaining why we're making min & max random values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When connection is lost, wait initially min_delay seconds and double this time every attempt. The wait is capped at max_delay as per the documentation in Paho MQTT Library.

I have added random delay so that after the max_delay limit is reached every client retries at the different max_delay time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't think that the various starting times of trying to make the failed connection by each individual client will be random enough? They're not all operating in-sync...

Copy link
Contributor Author

@KohliDev KohliDev Oct 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per QE testing, they figured out that the MQTT broker rejects client connection as all of them try to connect together (due to a large number of DB calls for ACL authorization, it wasn't occurring before ACL was implemented). This issue should be resolved at database front but I'm trying to add some randomness for every client as they try to re-connect.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, don't you think there will be enough randomness across clients from when they first try to connect (and fail), and how quickly every subsequent attempt occurs? (I think there is enough randomness there, especially as you back off to almost 5 minutes between retries!) This sounds like you're fixing the QE environment, and not the customer environment. I bet this is a scale test where everything starts very quickly and closely together? I agree we need to support that, but does that apply to a customer environment and should they be penalized with these long back offs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that's the reason I have asked for completing testing report from QE pointing out if this is the main cause. I'll remove the reconnect_delay_set temporarily and just have a random sleep if the disconnection happens unexpectedly.

@@ -1,7 +1,5 @@
aenum==1.4.5
linux-metrics==0.1.4
mock==2.0.0
paho-mqtt==1.2
paho-mqtt==1.3.1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is something really wrong with the way liota-lite is designed. There should be a way to have a common file structure to describe the requirements and setup.py, with a flag being passed. This is a developer's nightmare to keep files in sync for liota and liota-lite.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can raise a separate bug to fix this later, as it is not a concern related to this CL.

Copy link
Contributor Author

@KohliDev KohliDev Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this is required to be fixed as this approach has already been validated by Open Source team.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add me tothe email chain where this is discussed, or point me to the person in Open Source team? I want to discuss this solution.

# time to reduce load
wait_time = random.randint(1, 5)
log.info("Unexpected disconnection! Reconnecting in {0:d} seconds".format(wait_time))
time.sleep(wait_time)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per comments of this function, it can be called by a client disconnection as well. Why should we wait here if it is a client initiated disconnection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is handled in the "if" loop above.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean if..else at line 82-87? Do you mean a client disconnection always leads to clean disconnection? It is a return code, not how the disconnect was initiated...

else:
# Added a random sleep so that every client tries out reconnection to broker at different
# time to reduce load
wait_time = random.randint(1, 5)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have two random functions, one with a wait_time here and another at line 283? Can't we be random enough just with one single function? Or at least have them in the same place?

@@ -271,6 +280,7 @@ def connect_soc(self):
self._paho_client.connect(host=self.url, port=self.port, keepalive=self.keep_alive)

# Start network loop to handle auto-reconnect
self._paho_client.reconnect_delay_set(random.randint(1, 15), random.randint(150, 250))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these numbers where chosen? How many clients did we try out with? Are we sure this will succeed for thousands of client?
We should make these parameters user configurable so that QE can test and set it to correct values, instead of some randomly chosen numbers.

Copy link

@pmasrani pmasrani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make various numbers which are used in the delay functions as configurable in liota.conf? I don't want QE to hit connectivity issues and come back for a code change. They can change the default configuration and retry.

# time to reduce load
wait_time = random.randint(1, 5)
log.info("Unexpected disconnection! Reconnecting in {0:d} seconds".format(wait_time))
time.sleep(wait_time)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean if..else at line 82-87? Do you mean a client disconnection always leads to clean disconnection? It is a return code, not how the disconnect was initiated...

@@ -1,7 +1,5 @@
aenum==1.4.5
linux-metrics==0.1.4
mock==2.0.0
paho-mqtt==1.2
paho-mqtt==1.3.1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add me tothe email chain where this is discussed, or point me to the person in Open Source team? I want to discuss this solution.

@KohliDev
Copy link
Contributor Author

KohliDev commented Oct 31, 2017

We are putting this PR on hold after discussion with @gkmccready . We need concrete reports and exact reason if the MQTT Broker is rejecting connection due to various clients connecting at the same time because liota mqtt client has enough randomness before retrying the connection and are on different OS. Also, we can`t penalize Liota MQTT client with these long back offs.

@Venkat2811
Copy link
Contributor

Please find Randomized exponential back implementation here: diwanc@a7f229c

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants