-
Notifications
You must be signed in to change notification settings - Fork 120
Add Reconnect delay #167
base: master
Are you sure you want to change the base?
Add Reconnect delay #167
Conversation
liota/lib/transports/mqtt.py
Outdated
log.info("Clean disconnection") | ||
else: | ||
wait_time = round(random.uniform(1.0, 10.0), 1) | ||
log.info("Unexpected disconnection! Reconnecting in {0} seconds".format(str(wait_time))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to log here, we should log the reason for the disconnect. Are some reasons fatal or do they all warrant a retry? Also, you shouldn't need to use .format(str(wait_time)) and {0}, right? "... {0:d}".format(wait_time) should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RC code is included above which will inform the user about the reason for disconnection as per MQTT. I`ll refactor the log statement.
liota/lib/transports/mqtt.py
Outdated
if self._disconnect_result_code == 0: | ||
log.info("Clean disconnection") | ||
else: | ||
wait_time = round(random.uniform(1.0, 10.0), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use random.randint(1, 10) as you do elsewhere, no need for floating point... and please add a comment explaining why we delay and how the upper bound of 10 seconds was chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the delay so that every mqtt client have the different random delay(different gateway) before retrying to connect, thus reducing the load on the broker. I used float trying to make it more random. Will add the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being a float and rounding to a whole number doesn't make this any more random.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't rounding to complete whole number it still had once decimal place but agreed not added much random effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python got me. :-) But I still disagree with using floats when they are not necessary.
liota/lib/transports/mqtt.py
Outdated
@@ -271,6 +278,7 @@ def connect_soc(self): | |||
self._paho_client.connect(host=self.url, port=self.port, keepalive=self.keep_alive) | |||
|
|||
# Start network loop to handle auto-reconnect | |||
self._paho_client.reconnect_delay_set(random.randint(1, 15), random.randint(150, 250)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too random for us to understand the behaviour; isn't paho going to slowly move between min and max, anyway, so we should just say 1, 250? Otherwise we need a good comment explaining why we're making min & max random values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When connection is lost, wait initially min_delay seconds and double this time every attempt. The wait is capped at max_delay as per the documentation in Paho MQTT Library.
I have added random delay so that after the max_delay limit is reached every client retries at the different max_delay time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't think that the various starting times of trying to make the failed connection by each individual client will be random enough? They're not all operating in-sync...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per QE testing, they figured out that the MQTT broker rejects client connection as all of them try to connect together (due to a large number of DB calls for ACL authorization, it wasn't occurring before ACL was implemented). This issue should be resolved at database front but I'm trying to add some randomness for every client as they try to re-connect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, don't you think there will be enough randomness across clients from when they first try to connect (and fail), and how quickly every subsequent attempt occurs? (I think there is enough randomness there, especially as you back off to almost 5 minutes between retries!) This sounds like you're fixing the QE environment, and not the customer environment. I bet this is a scale test where everything starts very quickly and closely together? I agree we need to support that, but does that apply to a customer environment and should they be penalized with these long back offs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that's the reason I have asked for completing testing report from QE pointing out if this is the main cause. I'll remove the reconnect_delay_set temporarily and just have a random sleep if the disconnection happens unexpectedly.
@@ -1,7 +1,5 @@ | |||
aenum==1.4.5 | |||
linux-metrics==0.1.4 | |||
mock==2.0.0 | |||
paho-mqtt==1.2 | |||
paho-mqtt==1.3.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is something really wrong with the way liota-lite is designed. There should be a way to have a common file structure to describe the requirements and setup.py, with a flag being passed. This is a developer's nightmare to keep files in sync for liota and liota-lite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can raise a separate bug to fix this later, as it is not a concern related to this CL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this is required to be fixed as this approach has already been validated by Open Source team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add me tothe email chain where this is discussed, or point me to the person in Open Source team? I want to discuss this solution.
# time to reduce load | ||
wait_time = random.randint(1, 5) | ||
log.info("Unexpected disconnection! Reconnecting in {0:d} seconds".format(wait_time)) | ||
time.sleep(wait_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per comments of this function, it can be called by a client disconnection as well. Why should we wait here if it is a client initiated disconnection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is handled in the "if" loop above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean if..else at line 82-87? Do you mean a client disconnection always leads to clean disconnection? It is a return code, not how the disconnect was initiated...
else: | ||
# Added a random sleep so that every client tries out reconnection to broker at different | ||
# time to reduce load | ||
wait_time = random.randint(1, 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have two random functions, one with a wait_time here and another at line 283? Can't we be random enough just with one single function? Or at least have them in the same place?
liota/lib/transports/mqtt.py
Outdated
@@ -271,6 +280,7 @@ def connect_soc(self): | |||
self._paho_client.connect(host=self.url, port=self.port, keepalive=self.keep_alive) | |||
|
|||
# Start network loop to handle auto-reconnect | |||
self._paho_client.reconnect_delay_set(random.randint(1, 15), random.randint(150, 250)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these numbers where chosen? How many clients did we try out with? Are we sure this will succeed for thousands of client?
We should make these parameters user configurable so that QE can test and set it to correct values, instead of some randomly chosen numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make various numbers which are used in the delay functions as configurable in liota.conf? I don't want QE to hit connectivity issues and come back for a code change. They can change the default configuration and retry.
# time to reduce load | ||
wait_time = random.randint(1, 5) | ||
log.info("Unexpected disconnection! Reconnecting in {0:d} seconds".format(wait_time)) | ||
time.sleep(wait_time) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean if..else at line 82-87? Do you mean a client disconnection always leads to clean disconnection? It is a return code, not how the disconnect was initiated...
@@ -1,7 +1,5 @@ | |||
aenum==1.4.5 | |||
linux-metrics==0.1.4 | |||
mock==2.0.0 | |||
paho-mqtt==1.2 | |||
paho-mqtt==1.3.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add me tothe email chain where this is discussed, or point me to the person in Open Source team? I want to discuss this solution.
We are putting this PR on hold after discussion with @gkmccready . We need concrete reports and exact reason if the MQTT Broker is rejecting connection due to various clients connecting at the same time because liota mqtt client has enough randomness before retrying the connection and are on different OS. Also, we can`t penalize Liota MQTT client with these long back offs. |
Please find Randomized exponential back implementation here: diwanc@a7f229c |
Adding reconnect random time so that every MQTT client connects randomly to the broker in case of disconnection.