Skip to content
This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Fix global mode #218

Closed
wants to merge 6 commits into from
Closed

Fix global mode #218

wants to merge 6 commits into from

Conversation

miparnisari
Copy link
Contributor

Description

@elbuo8 will add more details later :)

References

Closes #208

@miparnisari miparnisari requested a review from Baliedge as a code owner February 8, 2024 02:07
global.go Outdated Show resolved Hide resolved
global.go Outdated Show resolved Hide resolved
gubernator.go Outdated Show resolved Hide resolved
@miparnisari miparnisari mentioned this pull request Feb 8, 2024
Co-authored-by: Yamil Asusta <[email protected]>
if r.Hits > int64(b.Remaining) {
metricOverLimitCounter.Add(1)
b.Remaining = 0
Copy link
Contributor

@elbuo8 elbuo8 Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed because the owner receives collapsed/accumulated hits, so if the remainder is 50, drip/leak brings it to 75 and the peers reported a total of 200 hits, it's fair to say that the quota has been consumed AND that the tokens be taken away given that drip was applied and we don't want to give more tokens away than we should (even if you were dripping exclusively on the owner).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch! However, this changes the semantics of asking for more than what is available. If we went forward with this change, any request that asks for more than is available will gobble up the remainder without actually using them.

For example: If my email limit is 1,000 and I've already sent 500 emails. Then I request to send 1,000 more emails, I would expect the 1,000 email request to be rejected as "over the limit". However, subsequent requests for 500 emails or less should be allowed, as I have 500 emails remaining within my rate limit window.

I think I have an idea of how to solve this. I'm adding some tests to cover this scenario and will have a PR soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had just merged in this behavior as a new behavior type DRAIN_OVER_LIMIT. I faced the same usability issue, but I did not want to break legacy applications that may have depended on old logic.
#209

rl, ok := item.Value.(*RateLimitResp)
if ok {
return rl, nil
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, for LEAKY_BUCKET the peer doesn't really know what tokens have leaked since we last checked.... I guess we could have the peer calculate the leak... but that sounds icky... Maybe someone can convince me otherwise?

@thrawn01 there is more to this story we'd like to discuss.

This change (left in comments so its easier to discuss, we can remove or uncomment later) basically applies drip on both peers and owner. Thus the owner has the true count while peers are constantly changing.

This has benefits and drawbacks.

Benefits:

  1. If owner takes too long to broadcast (network failure, saturation, etc) the peers can eventually converge on OVER_LIMIT and the system can protect itself.
  2. If owner takes too long to broadcast, peers can start letting traffic through instead of completely closing the circuit (stuck on OVER_LIMIT until TTL of the cached response).
  3. If owner takes too long to broadcast, peers can start letting traffic through slowly given that the limit won't be immediately reset once the TTL of the cached response happens.
  4. If peer(s) suffer from network issues where their hits can't be flushed, they can operate individually until the converge. If high rate, they will converge at rate limiting, otherwise they will constantly be flapping.

Basically we want to reduce the single point of failure the owner introduces by being the only one that can lower the remaining count. Specially if the Duration is 1s (or less) with a high leak/drip rate.
This comes with a particular drawback. It makes it so that you give away more tokens that you'd normally would. For example, if you set an RPS of 100 and you have 20 instances, that means that you will have 20*100 RPS in the case the owner is not fast enough. The more nodes you add, the more permissive it is because it's constantly dripping on each individual node. Thus you have to be mindful of your underlying capacity/fleet of gubernator. The smaller the fleet, the more strict it will be, the bigger, the more permissive it will be in high drip/leak scenarios.

@thrawn01 thrawn01 mentioned this pull request Feb 10, 2024
@thrawn01
Copy link
Contributor

I found a few other issues in this PR, and embraced your feed back. I think you are on the right track. I've created a new PR with the commits from this PR. #219

Please provide feed back on these changes, and I hope to work with @Baliedge next week to get this and all your other PR's merged!

@Baliedge
Copy link
Contributor

Closing in favor of: #219

@Baliedge Baliedge closed this Feb 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected GetRateLimits result from non-owning peer when using Global behaviour
5 participants