Fix global mode #218

miparnisari · 2024-02-08T02:07:06Z

Description

@elbuo8 will add more details later :)

References

Closes #208

global.go

gubernator.go

Co-authored-by: Yamil Asusta <[email protected]>

elbuo8 · 2024-02-08T16:56:38Z

algorithms.go

 		if r.Hits > int64(b.Remaining) {
 			metricOverLimitCounter.Add(1)
+			b.Remaining = 0


This was changed because the owner receives collapsed/accumulated hits, so if the remainder is 50, drip/leak brings it to 75 and the peers reported a total of 200 hits, it's fair to say that the quota has been consumed AND that the tokens be taken away given that drip was applied and we don't want to give more tokens away than we should (even if you were dripping exclusively on the owner).

This is a great catch! However, this changes the semantics of asking for more than what is available. If we went forward with this change, any request that asks for more than is available will gobble up the remainder without actually using them.

For example: If my email limit is 1,000 and I've already sent 500 emails. Then I request to send 1,000 more emails, I would expect the 1,000 email request to be rejected as "over the limit". However, subsequent requests for 500 emails or less should be allowed, as I have 500 emails remaining within my rate limit window.

I think I have an idea of how to solve this. I'm adding some tests to cover this scenario and will have a PR soon.

I had just merged in this behavior as a new behavior type DRAIN_OVER_LIMIT. I faced the same usability issue, but I did not want to break legacy applications that may have depended on old logic.
#209

elbuo8 · 2024-02-08T17:00:30Z

gubernator.go

-		rl, ok := item.Value.(*RateLimitResp)
-		if ok {
-			return rl, nil
+	/*


However, for LEAKY_BUCKET the peer doesn't really know what tokens have leaked since we last checked.... I guess we could have the peer calculate the leak... but that sounds icky... Maybe someone can convince me otherwise?

@thrawn01 there is more to this story we'd like to discuss.

This change (left in comments so its easier to discuss, we can remove or uncomment later) basically applies drip on both peers and owner. Thus the owner has the true count while peers are constantly changing.

This has benefits and drawbacks.

Benefits:

If owner takes too long to broadcast (network failure, saturation, etc) the peers can eventually converge on OVER_LIMIT and the system can protect itself.

If owner takes too long to broadcast, peers can start letting traffic through instead of completely closing the circuit (stuck on OVER_LIMIT until TTL of the cached response).

If owner takes too long to broadcast, peers can start letting traffic through slowly given that the limit won't be immediately reset once the TTL of the cached response happens.

If peer(s) suffer from network issues where their hits can't be flushed, they can operate individually until the converge. If high rate, they will converge at rate limiting, otherwise they will constantly be flapping.

Basically we want to reduce the single point of failure the owner introduces by being the only one that can lower the remaining count. Specially if the Duration is 1s (or less) with a high leak/drip rate.
This comes with a particular drawback. It makes it so that you give away more tokens that you'd normally would. For example, if you set an RPS of 100 and you have 20 instances, that means that you will have 20*100 RPS in the case the owner is not fast enough. The more nodes you add, the more permissive it is because it's constantly dripping on each individual node. Thus you have to be mindful of your underlying capacity/fleet of gubernator. The smaller the fleet, the more strict it will be, the bigger, the more permissive it will be in high drip/leak scenarios.

thrawn01 · 2024-02-10T03:16:44Z

I found a few other issues in this PR, and embraced your feed back. I think you are on the right track. I've created a new PR with the commits from this PR. #219

Please provide feed back on these changes, and I hope to work with @Baliedge next week to get this and all your other PR's merged!

Baliedge · 2024-02-19T20:49:22Z

Closing in favor of: #219

philipgough and others added 5 commits December 13, 2023 09:49

test: Add test for global rate limiting with load balancing

a48a45d

fix: owners set correct UNDER_LIMIT status

18b14db

fix global update behavior

2e0648a

Added findNonOwningPeer() and getClientToNonOwningPeer()

f95acfd

Fix global mode

c8c40c6

miparnisari requested a review from Baliedge as a code owner February 8, 2024 02:07

elbuo8 reviewed Feb 8, 2024

View reviewed changes

global.go Outdated Show resolved Hide resolved

elbuo8 reviewed Feb 8, 2024

View reviewed changes

global.go Outdated Show resolved Hide resolved

elbuo8 reviewed Feb 8, 2024

View reviewed changes

gubernator.go Outdated Show resolved Hide resolved

miparnisari mentioned this pull request Feb 8, 2024

Fix global behavior #216

Closed

remove logs and add comment

ad17f66

Co-authored-by: Yamil Asusta <[email protected]>

elbuo8 reviewed Feb 8, 2024

View reviewed changes

thrawn01 mentioned this pull request Feb 10, 2024

Change global behavior #219

Merged

Baliedge closed this Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix global mode #218

Fix global mode #218

miparnisari commented Feb 8, 2024

elbuo8 Feb 8, 2024 •

edited

Loading

thrawn01 Feb 9, 2024

Baliedge Feb 9, 2024

elbuo8 Feb 8, 2024

thrawn01 commented Feb 10, 2024

Baliedge commented Feb 19, 2024

Fix global mode #218

Fix global mode #218

Conversation

miparnisari commented Feb 8, 2024

Description

References

elbuo8 Feb 8, 2024 • edited Loading

Choose a reason for hiding this comment

thrawn01 Feb 9, 2024

Choose a reason for hiding this comment

Baliedge Feb 9, 2024

Choose a reason for hiding this comment

elbuo8 Feb 8, 2024

Choose a reason for hiding this comment

thrawn01 commented Feb 10, 2024

Baliedge commented Feb 19, 2024

elbuo8 Feb 8, 2024 •

edited

Loading