Rework gorouter error classifiers and retry logic #321
Comments
ping: @geofffranks @domdom82 @ameowlia |
On first read through this seems pretty reasonable and thorough. I like the notion of splitting the classifiers out explicitly to reduce confusion and accidental side effects of changing any one of these individually. |
Good idea. I'd say we tackle both issues in one go:
|
I would clarify that this also includes dial and handshake timeouts. There could be cases where future requests would eventually be possible once the condition causing a timeout goes away, but we shouldn't keep the backend in the pool while it's known to time out. Otherwise a lot of requests will incur extra latency trying it before being retried. If the backend eventually recovers, route-emitter will re-add it. |
This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from the fail-able and prune-able classifier group. This prevents errors that should not affect the endpoint from being marked as failed or being pruned. To do so all classifiers groups are split into distinct groups and any cross references between them are removed. The main motivation for this change is to avoid confusion and bugs due to artificial dependencies between the groups. Resolves: cloudfoundry/routing-release#321 Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`
With this commit `isRetriable` no longer overwrites / wraps the error that is passed to it. This was done to accommodate context from the circumstances in which the error occurred into the error itself to be able to match on those later on. This mechanism has proven to cause bugs and increase overall complexity by abusing the error type. Instead `isRetriable` now only returns whether a certain combination of parameters is considered retry-able, either because the circumstances allow for it or because the error matches one of the retry-able error classifiers. Resolves: cloudfoundry/routing-release#321
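A rough sketch of the shape this commit message describes, not the actual gorouter code: `isRetriable` returns a plain `bool` and leaves the error untouched. The `attempt` struct, its fields, the sentinel errors, and the exact pre-conditions (request not fully sent, idempotent method plus EOF) are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
)

// Stand-in errors for the demo; real code would see concrete network errors.
var (
	errDial       = errors.New("dial failed")
	errBackendEOF = fmt.Errorf("backend closed connection: %w", io.EOF)
)

// attempt holds hypothetical facts gathered while proxying one request.
type attempt struct {
	method      string
	requestSent bool // true once the request was fully written to the backend
}

// matchesRetriableClassifier stands in for a retry-able classifier group.
func matchesRetriableClassifier(err error) bool {
	return errors.Is(err, errDial)
}

// isIdempotent reports whether the HTTP method is idempotent per RFC 9110.
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions,
		http.MethodTrace, http.MethodPut, http.MethodDelete:
		return true
	}
	return false
}

// isRetriable only answers yes or no. It never wraps or replaces err, so
// callers keep passing the original error around.
func isRetriable(a attempt, err error) bool {
	// Circumstance 1 (assumed): the backend never received the full request.
	if !a.requestSent {
		return true
	}
	// Circumstance 2 (assumed): an idempotent request ended with EOF.
	if isIdempotent(a.method) && errors.Is(err, io.EOF) {
		return true
	}
	// Otherwise the error itself has to match a retry-able classifier.
	return matchesRetriableClassifier(err)
}

func main() {
	cases := []attempt{
		{method: http.MethodGet, requestSent: true},
		{method: http.MethodPost, requestSent: true},
		{method: http.MethodPost, requestSent: false},
	}
	for _, a := range cases {
		fmt.Printf("%+v retry=%t err=%v\n", a, isRetriable(a, errBackendEOF), errBackendEOF)
	}
}
```

Because the decision is a return value rather than a wrapper type, the error that is logged or shown to the user is the one the transport produced.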
Sorry that this has been so quiet for the last months. I've split out the simple change into a dedicated PR that can probably be merged more easily and addresses the first issue: cloudfoundry/gorouter#355. The combined changes are still in cloudfoundry/gorouter#349 but since I need to figure out the testing and the change is more involved I think it makes sense to separate them. I will look into it in the coming days and (hopefully) provide a full PR soon. |
This commit removes `IdempotentRequestEOF` and `IncompleteRequest` from the fail-able and prune-able classifier group. This prevents errors that should not affect the endpoint from being marked as failed or being pruned. To do so all classifiers groups are split into distinct groups and any cross references between them are removed. The main motivation for this change is to avoid confusion and bugs due to artificial dependencies between the groups. Partially-Resolves: cloudfoundry/routing-release#321 Tested-By: `routing-release/scripts/run-unit-tests-in-docker gorouter`
ah, sorry. gh auto-closed this when i merged :D |
My fault, I tried to prevent it by changing the commit trailer to |
Hi @domdom82 & @geofffranks , |
There is still one PR from me open which I unfortunately never got around to finishing up :/ So it's about half-way done 😄 |
Is this a security vulnerability?
No.
Issue
The current classifiers are misleading because they seem to represent relations between different sets of errors which should be completely distinct and, to some extent, abuse the `error` type.
Affected Versions
All.
Context
These two changes are closely related which is why I opened this as one issue. If we feel like this is not a good idea I can split it into two separate issues.
Suggested Changes
There are two (relatively) independent changes that I would like to discuss.
Make All Classifier Groups Distinct
Just because the groups happen to share some errors / classifiers they shouldn't depend on each other. This is a possible source of confusion and has caused mistakes in the past.
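As a minimal sketch of what "distinct" could look like: each group spells out its own members and never embeds another group. The type names, sentinel errors, and group membership below are hypothetical and do not reflect gorouter's real classifier lists.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in sentinel errors; a real router would match concrete network errors.
var (
	errDial        = errors.New("dial failed")
	errConnReset   = errors.New("connection reset")
	errBadResponse = errors.New("malformed response")
)

// Classifier reports whether an error belongs to a class.
type Classifier func(err error) bool

// ClassifierGroup matches if any of its classifiers matches.
type ClassifierGroup []Classifier

// Classify returns true as soon as one member accepts err.
func (g ClassifierGroup) Classify(err error) bool {
	for _, c := range g {
		if c(err) {
			return true
		}
	}
	return false
}

// is builds a Classifier that matches target anywhere in the error chain.
func is(target error) Classifier {
	return func(err error) bool { return errors.Is(err, target) }
}

// Every group lists its classifiers explicitly. Sharing a classifier is fine,
// but no group is built from another, so editing one cannot change the others.
// Membership here is illustrative only.
var (
	retriable = ClassifierGroup{is(errDial), is(errConnReset)}
	failable  = ClassifierGroup{is(errDial), is(errBadResponse)}
	prunable  = ClassifierGroup{is(errDial)}
)

func main() {
	for _, err := range []error{errDial, errConnReset, errBadResponse} {
		fmt.Printf("%v: retriable=%t failable=%t prunable=%t\n",
			err, retriable.Classify(err), failable.Classify(err), prunable.Classify(err))
	}
}
```

The duplication between the lists is deliberate: it trades a few repeated lines for the guarantee that adding a retry-able error never silently makes it fail-able or prune-able.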
This change includes a minor tweak to the groups as well: `IdempotentRequestEOF` and `IncompleteRequest` are no longer part of the `FailableClassifiers` and `PruneableClassifiers` because those two errors are only "annotated" versions of an underlying error. Their sole purpose is to be able to match them using the `RetriableClassifiers` because we checked some pre-condition that allows us to retry the wrapped error even though we usually wouldn't be able to retry in that case.
Get Rid of "annotated" Errors
We initially introduced `IdempotentRequestEOF` and `IncompleteRequest` because we needed a way to tell the classifiers that those errors are retry-able without unconditionally retrying all of the errors that might get wrapped inside them. This included an additional check which is done in `isRetriable`.
The main issue is that we wrap errors without providing details that are particularly relevant in the sense that they enrich the error message. They purely exist because we need to pass a value of type `error` to the classifier groups. Instead I propose to split the logic for performing retries:
and completely remove `IdempotentRequestEOF` and `IncompleteRequest`. This way we get the benefits of our additional retries either from the checks that allow us to retry in special cases or from the classifiers while not tampering with the errors that are passed around and even displayed to the user (and I'm pretty sure that end-users would be confused if they see `incomplete request (EOF)` in response to their request).
Next Steps